
EFA Nitro V3 Query


1. Do the libfabric pingpong applications work with the Nitro V3 variant of EFA? (I am using a g4dn.8xlarge.)

2. fi_info -p efa shows the following:

fi_info -p efa -t FI_EP_RDM
provider: efa
    fabric: efa-direct
    domain: rdmap0s29-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap0s29-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA

The documentation, however, shows different output, e.g.:

provider: efa
    fabric: EFA-fe80::94:3dff:fe89:1b70
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA

Does this mean EFA didn't get a GID in my setup?

3. How do I run a libfabric application (configuration, etc.), and how do I provide the server address? EFA doesn't have an IP address, and the GID is all zeros:

c1uslt-nccl02:~> ibv_devinfo
hca_id: rdmap0s29
        transport:                      unspecified (4)
        fw_ver:                         0.0.0.0
        node_guid:                      0000:0000:0000:0000
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x1d0f
        vendor_part_id:                 61344
        hw_ver:                         0xEFA0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x01
                        link_layer:             Unspecified
asked 4 months ago, 27 views
1 Answer

Let me address your questions about EFA (Elastic Fabric Adapter) on your g4dn.8xlarge instance:

  1. Yes, libfabric pingpong applications should work with the Nitro V3 variant of EFA on g4dn.8xlarge instances. The g4dn instance family supports EFA, and the libfabric layer is designed to work with it.

  2. Regarding your fi_info -p efa output: The differences between your output and the documentation are normal. The fabric and domain names that fi_info prints can vary with the version of the EFA software stack (libfabric and the EFA driver) installed on the instance. Your output shows "efa-direct" and "efa" fabrics with "rdmap0s29-rdm" domains, while the documentation shows an "EFA-fe80::94:3dff:fe89:1b70" fabric with an "efa_0-rdm" domain.

This doesn't necessarily mean your EFA setup is missing the GID. The output format has evolved across different EFA versions. What's important is that you're seeing the EFA provider with FI_EP_RDM type and FI_PROTO_EFA protocol, which indicates that EFA is properly detected by libfabric.
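
As a quick sanity check, the fields that matter can be pulled out of the fi_info text programmatically. The sketch below is only an illustration: it parses output shaped like the listing in the question, assuming the "key: value" layout shown there. The helper names parse_fi_info and has_efa_rdm are hypothetical and not part of libfabric.

```python
from typing import Dict, List

# Sample copied from the question's `fi_info -p efa -t FI_EP_RDM` output.
FI_INFO_SAMPLE = """\
provider: efa
    fabric: efa-direct
    domain: rdmap0s29-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap0s29-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
"""


def parse_fi_info(text: str) -> List[Dict[str, str]]:
    """Split fi_info output into one dict per 'provider:' entry."""
    entries: List[Dict[str, str]] = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so IPv6-style fabric names survive.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "provider":
            entries.append({})
        if entries:
            entries[-1][key] = value
    return entries


def has_efa_rdm(entries: List[Dict[str, str]]) -> bool:
    """True if any entry is the efa provider with an RDM endpoint."""
    return any(
        e.get("provider") == "efa"
        and e.get("type") == "FI_EP_RDM"
        and e.get("protocol") == "FI_PROTO_EFA"
        for e in entries
    )


if __name__ == "__main__":
    found = parse_fi_info(FI_INFO_SAMPLE)
    for entry in found:
        print(entry["fabric"], entry["domain"])
    print("EFA RDM endpoint available:", has_efa_rdm(found))
```

The check deliberately ignores the fabric and domain strings, since those are the parts that differ between your output and the documentation.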

  3. For running libfabric applications with EFA:
  • For server address: You typically use the hostname or IP address of the instance, not the EFA device itself. EFA works at a lower level but applications still use the host's identity for initial connection setup.
  • The zero GIDs in your ibv_devinfo output are expected for EFA devices, as EFA uses a different addressing scheme than traditional InfiniBand.
  • To run libfabric applications:
      a. Make sure your security group allows all traffic between instances in that same security group.
      b. Use the hostname or IP address of the target instance.
      c. Set the environment variable FI_PROVIDER=efa to ensure libfabric uses the EFA provider.
      d. For MPI applications, use: FI_PROVIDER=efa mpirun -np <processes> --hostfile <hostfile> <application>
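
The steps above can be sketched as a small launcher for a pingpong test. This is a sketch under assumptions, not a verified recipe: it assumes the fi_pingpong utility that ships with libfabric is installed and on PATH (check your AMI), and that -p selects the provider and -e the endpoint type as described in its man page. The helper names and the example address 10.0.0.12 are hypothetical.

```python
import os
import shutil
import subprocess
from typing import Dict, List, Optional


def pingpong_cmd(server_ip: Optional[str] = None) -> List[str]:
    """Build the fi_pingpong argv.

    With no address the process acts as the server; with an address it
    acts as the client and connects to that server.
    """
    cmd = ["fi_pingpong", "-p", "efa", "-e", "rdm"]
    if server_ip is not None:
        cmd.append(server_ip)
    return cmd


def efa_env() -> Dict[str, str]:
    """Copy the current environment and pin libfabric to the EFA provider."""
    env = dict(os.environ)
    env["FI_PROVIDER"] = "efa"
    return env


def run_pingpong(server_ip: Optional[str] = None) -> int:
    """Run fi_pingpong if it is installed; return its exit code."""
    if shutil.which("fi_pingpong") is None:
        print("fi_pingpong not found on PATH")
        return 127
    return subprocess.run(pingpong_cmd(server_ip), env=efa_env()).returncode
```

On the server instance you would call run_pingpong(), and on the client run_pingpong("10.0.0.12"), substituting the server's private IP address or hostname. As far as I understand, that address is used only for the initial out-of-band exchange, which is why the security group rule in step a matters.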

The EFA setup appears to be working correctly based on your fi_info output showing the EFA provider. The zero GIDs in ibv_devinfo are normal for EFA devices and don't indicate a problem with your configuration.
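
Since the GUID fields carry no signal on EFA, a script that checks device health may want to look at other fields of ibv_devinfo instead. The sketch below is an assumption about what is worth checking, not an official procedure: it reads the port state and the vendor ID (0x1d0f is Amazon's PCI vendor ID) from text shaped like the listing in the question. The helper names are hypothetical.

```python
import re
from typing import List

# Abbreviated sample taken from the question's ibv_devinfo output.
IBV_DEVINFO_SAMPLE = """\
hca_id: rdmap0s29
        transport:                      unspecified (4)
        node_guid:                      0000:0000:0000:0000
        vendor_id:                      0x1d0f
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        active_mtu:             4096 (5)
"""

AMAZON_VENDOR_ID = 0x1D0F


def port_states(text: str) -> List[str]:
    """Return the state name of every port listed in the output."""
    return re.findall(r"^\s*state:\s*(\S+)", text, flags=re.MULTILINE)


def is_amazon_device(text: str) -> bool:
    """True if the vendor_id field matches Amazon's PCI vendor ID."""
    match = re.search(r"^\s*vendor_id:\s*(0x[0-9a-fA-F]+)", text, flags=re.MULTILINE)
    return match is not None and int(match.group(1), 16) == AMAZON_VENDOR_ID


if __name__ == "__main__":
    print("port states:", port_states(IBV_DEVINFO_SAMPLE))
    print("Amazon device:", is_amazon_device(IBV_DEVINFO_SAMPLE))
```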
Sources
Launching a AWS Deep Learning AMIs Instance With EFA - AWS Deep Learning AMIs

answered 4 months ago
