Available MPI versions (and comparison)

The cluster has OpenMPI installed.

On all nodes:

module load openmpi/4.1.1-gcc-10.3.0-r8-tcp

NB: Currently, use only the TCP transport! There are several OpenMPI modules for different versions; different modules for the same version differ only in the environment variables that select the transport.

Normally, OpenMPI chooses the fastest interface. It first tries RDMA over Converged Ethernet (RoCE), which produces “[qelr_create_qp:683]create qp: failed on ibv_cmd_create_qp” messages. These can be ignored; OpenMPI then fails over to IB (higher bandwidth anyway) or TCP.

NB: mpirun is not available; use srun instead.

For MPI jobs prefer the green-ib partition (#SBATCH -p green-ib) or stay within a single node (#SBATCH -N 1).

export OMPI_MCA_btl_openib_warn_no_device_params_found=0
srun ./hello-mpi
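The commands above can be combined into one Slurm job script. The following is a minimal sketch, not an official template: the file name hello-mpi.sbatch, the job name, the node and task counts, and the time limit are illustrative assumptions, while the module, the partition, and the srun call are taken from the text above.

```shell
# Write a minimal job script to hello-mpi.sbatch (file name is an assumption).
cat > hello-mpi.sbatch <<'EOF'
#!/bin/bash
# Job name (assumed).
#SBATCH -J hello-mpi
# Partition recommended above for MPI jobs.
#SBATCH -p green-ib
# Number of nodes and MPI ranks per node (assumed values).
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
# Wall-time limit (assumed value).
#SBATCH -t 00:10:00

module load openmpi/4.1.1-gcc-10.3.0-r8-tcp

# Silence the openib device-parameter warning.
export OMPI_MCA_btl_openib_warn_no_device_params_found=0

# mpirun is not available on the cluster; srun starts the MPI ranks.
srun ./hello-mpi
EOF
```

The script would then be submitted with "sbatch hello-mpi.sbatch". For a single-node run, "-N 1" can be used instead of the green-ib partition.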



Layers in OpenMPI


  • PML = Point-to-point Messaging Layer:

    • UCX

  • MTL = Matching Transport Layer:

    • PSM

    • PSM2

    • OFI

  • BTL = Byte Transfer Layer:

    • TCP

    • openib

    • self

    • sm (OpenMPI 1), vader (OpenMPI 4)


The layers can be confusing. For example, openib was originally developed for InfiniBand, but it is now used for RoCE and is deprecated for IB. However, on some IB cards and configurations it is the only working option. Also, the MVAPICH implementation still uses verbs (the interface behind openib) instead of UCX.
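To see which components of each layer are present in the loaded OpenMPI build, the output of ompi_info can be filtered. The sketch below assumes that ompi_info prints component lines of the form "MCA btl: tcp (MCA v2.1.0, ...)"; list_layers is a hypothetical helper name, not part of OpenMPI.

```shell
# list_layers: hypothetical helper that reads ompi_info output from stdin
# and keeps only the PML, MTL and BTL component lines.
list_layers() {
    # Assumed line format: "MCA <framework>: <component> (version info)".
    awk '$1 == "MCA" && ($2 == "pml:" || $2 == "mtl:" || $2 == "btl:") { print $2, $3 }'
}

# On the cluster, after loading an OpenMPI module:
#   ompi_info | list_layers
```

Each output line names a layer and one component, which makes it easy to check whether, for example, the ucx PML or the openib BTL was built at all.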

Layers can be selected using environment variables:

To select TCP transport:

export OMPI_MCA_btl=tcp,self,vader

To select RDMA transport (verbs):

export OMPI_MCA_btl=openib,self,vader

To select UCX transport:

export OMPI_MCA_pml=ucx

NB! UCX is not supported on QLogic FastLinQ QL41000 Ethernet controllers.
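The three selections above can be wrapped in a small shell function so that settings from an earlier choice do not linger. This is only a sketch: select_transport is a hypothetical helper, not an OpenMPI command, and it sets nothing beyond the environment variables listed above.

```shell
# select_transport: hypothetical helper that sets the OpenMPI transport
# through the environment variables described above.
select_transport() {
    case "$1" in
        tcp)
            unset OMPI_MCA_pml
            export OMPI_MCA_btl=tcp,self,vader
            ;;
        verbs)
            unset OMPI_MCA_pml
            export OMPI_MCA_btl=openib,self,vader
            ;;
        ucx)
            unset OMPI_MCA_btl
            export OMPI_MCA_pml=ucx
            ;;
        *)
            echo "usage: select_transport tcp|verbs|ucx" >&2
            return 1
            ;;
    esac
}

select_transport tcp
echo "btl=${OMPI_MCA_btl:-unset} pml=${OMPI_MCA_pml:-unset}"
# prints: btl=tcp,self,vader pml=unset
```

In a job script the function would be called before srun, for example "select_transport tcp" followed by "srun ./hello-mpi".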





Different MPI implementations exist:


  • OpenMPI

  • MPICH

  • MVAPICH

  • IBM Platform MPI (MPICH descendant)

  • IBM Spectrum MPI (OpenMPI descendant)

  • (at least one for each network and CPU manufacturer)


OpenMPI

  • available in any Linux or BSD distribution

  • combines technologies and resources from several other projects (incl. LAM/MPI)

  • can use TCP/IP, shared memory, Myrinet, InfiniBand and other low-latency interconnects

  • chooses the fastest interconnect automatically (it can also be chosen manually)

  • well integrated into many schedulers (e.g. SLURM)

  • highly optimized

  • FOSS (BSD license)


MPICH

  • highly optimized

  • supports TCP/IP and some low-latency interconnects

  • older versions do NOT support InfiniBand (however, Mellanox IB is supported)

  • available in many Linux distributions

  • not integrated into schedulers (?)

  • used to be a PITA to get working smoothly

  • FOSS


MVAPICH

  • highly optimized (maybe slightly faster than OpenMPI)

  • fork of MPICH to support IB

  • comes in many flavors to support TCP/IP, InfiniBand and many low-latency interconnects, as well as OpenSHMEM and PGAS

  • several flavors need to be installed, and users need to choose the right one for the interconnect they want to use

  • generally not available in Linux distributions

  • not integrated with schedulers (integrated with SLURM only after version 18)

  • FOSS (BSD license)


Recommendation

  • default: use OpenMPI on our clusters

  • if unsatisfied with performance and running on a single node or over TCP, try MPICH

  • if unsatisfied with performance and running on IB, try MVAPICH