
Intel MPI Library

Available versions:




  • 2017.3.196


Intel® MPI Library

Deliver Flexible, Efficient, and Scalable Cluster Messaging

The Intel® MPI Library focuses on making applications perform better on Intel® architecture-based clusters by implementing the high-performance Message Passing Interface (MPI) Version 2.2 specification on multiple fabrics. It lets you quickly deliver maximum end-user performance, even if you change or upgrade to new interconnects, without requiring changes to the software or operating environment.

Usage guide

Before using the library, load the corresponding module:

module load impi

A brief user guide can be obtained by running the following command:

module help impi

The compilation command varies depending on the compiler you want to use. To use the Intel compiler, its corresponding module must be loaded first.

  • mpicc, compiles MPI code in C, using the GNU gcc compiler
  • mpicxx, compiles MPI code in C++, using the GNU g++ compiler
  • mpif90, compiles MPI code in Fortran 90, using the GNU gfortran compiler
  • mpiicc, compiles MPI code in C, using the Intel icc compiler
  • mpiicpc, compiles MPI code in C++, using the Intel icpc compiler
  • mpiifort, compiles MPI code in Fortran 90, using the Intel ifort compiler
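For instance, compiling a C source file with the Intel wrapper could look like this (the module name intel and the file names hello.c and hello_impi are illustrative and may differ on your system):

```
module load intel   # Intel compiler (assumed module name)
module load impi
mpiicc -O2 -o hello_impi hello.c   # hello.c is a placeholder source file
```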

The following table lists options that can be used with the compiler wrappers, in addition to the usual switches for optimization, etc.:

Option       Meaning                                               Remarks
-mt_mpi      link against the thread-safe MPI library              thread safety up to MPI_THREAD_MULTIPLE is provided
-static_mpi  use static instead of dynamic MPI libraries           the default is dynamic
-t[=log]     compile with MPI tracing                              the itac module must be loaded after the impi module
-ilp64       link against the MPI interface with 8-byte integers   you may also need to specify -i8 when compiling your code
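As a sketch, these options are combined with an ordinary compile line (prog.c and the output names are placeholders):

```
mpiicc -mt_mpi -o prog_mt prog.c          # link against the thread-safe MPI library
mpiicc -static_mpi -o prog_static prog.c  # link the MPI libraries statically
```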

To execute the application, one of the following commands should be used:

  • mpirun
  • mpiexec

The use of mpirun is recommended because of its greater simplicity. The number of processes must be passed to this command with -np, followed by the executable and its parameters. If the execution needs more than one node, the node file is provided by the batch system and its location is $TMPDIR/machines.

Script example for the in-queue execution (ex.sh):

cd $PWD
module load impi
mpirun -np $NSLOTS executable_impi

This script must be submitted to the queue with a qsub command such as:

qsub -l num_proc=1,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi 8 ex.sh

Attention must be paid to the following facts:

  • num_proc=1
  • the number of slots you are applying for is specified in -pe mpi 8
  • $TMPDIR/machines will contain the names of the nodes where the slots reside.

Thus, you are applying for 8 processors (NSLOTS=8) and each slot will have 2 GB of RAM available (each slot consists of 1 processor / 2 GB, so you are applying for 8 processors and 16 GB of RAM altogether).

If you are going to launch a pure MPI calculation, you must specify num_proc=1 and set the number of processes in -pe mpi. It is configured this way to allow mixed MPI/OpenMP parallelizations, where the number of MPI processes is specified in -pe mpi and the number of threads per process in num_proc.
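As a hypothetical mixed MPI/OpenMP request, with the resource values purely illustrative, 4 MPI processes with 2 OpenMP threads each could be submitted as:

```
qsub -l num_proc=2,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi 4 hybrid.sh
```

where hybrid.sh (a placeholder name) would export OMP_NUM_THREADS=2 before calling mpirun -np $NSLOTS.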

Topology-aware parallel environments (FT and SVG:sandy):

-pe mpi <nslots>:

Maximum number of slots in a minimum number of nodes (fill_up policy). The queue system places the demanded slots on the first available nodes.

-pe mpi_rr <nslots>:

Minimum number of slots per available node (round_robin policy). The slots are placed across as many available nodes as possible.

-pe mpi_[1-16]p <nslots>:

Homogeneous distribution of MPI processes among the nodes (fixed number of processes per host). This is the recommended option for maximum performance. An exact number of slots [1-16] is placed on every node (note that the demanded number of slots must be a multiple of the specific parallel environment used).


Suppose 5 nodes with 16 processors each are available on the server:

qsub -l num_proc=1,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi 8 ex.sh: 8 one-processor slots are demanded; the queue system will place all of them on 1 node.

qsub -l num_proc=1,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi_rr 8 ex.sh: 8 one-processor slots are demanded; the queue system will place them using all available nodes: 3 nodes with 2 slots and 2 nodes with 1 slot.

qsub -l num_proc=1,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi_2p 8 ex.sh: 8 one-processor slots are demanded; the queue system will place them using a 2-slots-per-node policy, so 4 nodes with 2 slots each will be used.

qsub -l num_proc=1,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi_4p 8 ex.sh: 8 one-processor slots are demanded; the queue system will place them using a 4-slots-per-node policy, so 2 nodes with 4 slots each will be used.
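Under the same assumption of 16-processor nodes, whole nodes can be filled with mpi_16p (the slot count below is illustrative):

```
qsub -l num_proc=1,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi_16p 32 ex.sh
```

Here 32 one-processor slots are demanded and placed 16 per node, so 2 full nodes are used.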

By default the environment variable I_MPI_DEBUG is set to 5, which provides detailed information about the MPI execution (which nodes it ran on, which network was used, ...).

Hybrid program execution:

Newer versions of Intel MPI can use process pinning to achieve good performance for mixed MPI and OpenMP programs: For example, this script (run.sh):

export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp:compact
mpirun -np 12 ./myprog.exe

will start 12 MPI tasks with 4 threads each, keeping the threads close to their master tasks while spreading out the MPI tasks. This is probably the most efficient approach in the majority of cases.
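A queue submission matching this kind of script could look as follows, assuming 16-core nodes as in the example above (the resource values are illustrative): with num_proc=4 threads per task and the 12 tasks placed 4 per node via mpi_4p, every node runs 4 tasks × 4 threads = 16 cores:

```
qsub -l num_proc=4,s_rt=10:00:00,s_vmem=2G,h_fsize=10G -pe mpi_4p 12 run.sh
```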




FT default version 3.2.

Alternatively, the combination mpdboot.sge/mpiexec.sge/mpdallexit can be used instead of mpirun. This combination is only advisable when several chained MPI programs must be run within one job; that way the mpd process managers are started only once. In this case the script ex.sh becomes:

module load impi
mpdboot.sge -n $NSLOTS

mpiexec.sge -n $NSLOTS executable_impi
mpiexec.sge -n $NSLOTS executable_impi
mpiexec.sge -n $NSLOTS executable_impi

mpdallexit



SVG default version 4.1.

AMD nodes: only intra-node execution is permitted by default. Up to 24 processes can be run (24 cores per node).

SANDY nodes: low-latency InfiniBand (IB) interconnection is available on these nodes, so inter-node execution is permitted.


URL Manual




If you have any questions or problems using this software package, please contact aplicacions@cesga.es