SGE support is not enabled by default. For Open MPI v1.3 and later, you need to explicitly request the SGE support during compilation with the "--with-sge" command line switch to the Open MPI configure script.
For example:
./configure --with-sge
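One way to confirm that a build picked up SGE support is to look for the gridengine component in the output of ompi_info. The sketch below wraps that check in a small helper; the function name has_sge_support is made up for this example, and the real query only runs if ompi_info is installed.

```shell
#!/bin/sh
# has_sge_support: hypothetical helper. Reads `ompi_info` output on stdin
# and succeeds if a gridengine component is listed.
has_sge_support() {
    grep -q 'gridengine'
}

# Only query the real installation if ompi_info is available on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_sge_support; then
        echo "SGE support: yes"
    else
        echo "SGE support: no (rebuild with ./configure --with-sge)"
    fi
fi
```

If the component is missing, rerunning configure with --with-sge and rebuilding is the step described above.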
The OpenMPI programs may be executed only via the PBS Workload manager, by entering an appropriate queue. On Anselm, the bullxmpi-1.2.4.1 and OpenMPI 1.6.5 are OpenMPI based MPI implementations.
Basic usage
Use mpiexec to run OpenMPI code.
Example:
$ qsub -q qexp -l select=4:ncpus=16 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
$ module load openmpi
$ mpiexec -pernode ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn17
Hello world! from rank 1 of 4 on host cn108
Hello world! from rank 2 of 4 on host cn109
Hello world! from rank 3 of 4 on host cn110
Please be aware that in this example the -pernode directive is used to run only one task per node, which is normally unwanted behaviour (unless you want to run hybrid code with just one MPI process and 16 OpenMP threads per node). In normal MPI programs, omit the -pernode directive to run up to 16 MPI tasks per node.
In this example, we allocate 4 nodes via the express queue interactively. We set up the openmpi environment and interactively run the helloworld_mpi.x program.
Note that the executable helloworld_mpi.x must be available within the same path on all nodes. This is automatically fulfilled on the /home and /scratch filesystems.
You need to preload the executable if running on the local scratch /lscratch filesystem:
$ pwd
/lscratch/15210.srv11
$ mpiexec -pernode --preload-binary ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn17
Hello world! from rank 1 of 4 on host cn108
Hello world! from rank 2 of 4 on host cn109
Hello world! from rank 3 of 4 on host cn110
In this example, we assume the executable helloworld_mpi.x is present on compute node cn17 on local scratch. We call mpiexec with the --preload-binary argument (valid for openmpi). mpiexec will copy the executable from cn17 to the /lscratch/15210.srv11 directory on cn108, cn109 and cn110 and execute the program.
MPI process mapping may be controlled by PBS parameters.
The mpiprocs and ompthreads parameters allow for selection of number of running MPI processes per node as well as number of OpenMP threads per MPI process.
One MPI process per node
Follow this example to run one MPI process per node, 16 threads per process. Note that no options to mpiexec are needed.
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=1:ompthreads=16 -I
$ module load openmpi
$ mpiexec ./helloworld_mpi.x
In this example, we demonstrate the recommended way to run an MPI application using 1 MPI process per node and 16 threads per process, on 4 nodes.
Two MPI processes per node
Follow this example to run two MPI processes per node, 8 threads per process. Note the options to mpiexec.
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=2:ompthreads=8 -I
$ module load openmpi
$ mpiexec -bysocket -bind-to-socket ./helloworld_mpi.x
In this example, we demonstrate the recommended way to run an MPI application using 2 MPI processes per node and 8 threads per socket, with each process and its threads bound to a separate processor socket of the node, on 4 nodes.
16 MPI processes per node
Follow this example to run 16 MPI processes per node, 1 thread per process. Note the options to mpiexec.
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=16:ompthreads=1 -I
$ module load openmpi
$ mpiexec -bycore -bind-to-core ./helloworld_mpi.x
In this example, we demonstrate the recommended way to run an MPI application using 16 single-threaded MPI processes per node. Each process is bound to a separate processor core, on 4 nodes.
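The three layouts above all fill the 16 cores of a node: mpiprocs multiplied by ompthreads equals ncpus. The following sketch spells out that arithmetic. The helper name layout is made up for this example, and the node and core counts are the ones used in the qsub lines above.

```shell
#!/bin/sh
# layout NODES NCPUS MPIPROCS OMPTHREADS (hypothetical helper)
# Prints the total number of MPI ranks and warns on stderr if the per-node
# layout does not use exactly NCPUS cores.
layout() {
    nodes=$1; ncpus=$2; mpiprocs=$3; ompthreads=$4
    used=$((mpiprocs * ompthreads))
    if [ "$used" -ne "$ncpus" ]; then
        echo "warning: $used of $ncpus cores used per node" >&2
    fi
    echo "ranks=$((nodes * mpiprocs)) threads_per_rank=$ompthreads"
}

layout 4 16 1 16    # one MPI process per node
layout 4 16 2 8     # two MPI processes per node
layout 4 16 16 1    # 16 MPI processes per node
```

A layout such as mpiprocs=2:ompthreads=4 would trigger the warning, since it leaves half of each node idle.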
In all cases, binding and threading may be verified by executing, for example:
$ mpiexec -bysocket -bind-to-socket numactl --show
$ mpiexec -bysocket -bind-to-socket echo $OMP_NUM_THREADS
OpenMPI Process Mapping and Binding
The mpiexec allows for precise selection of how the MPI processes will be mapped to the computational nodes and how these processes will bind to particular processor sockets and cores.
MPI process mapping may be specified by a hostfile or rankfile input to the mpiexec program. Although all implementations of MPI provide means for process mapping and binding, the following examples are valid for openmpi only.
Hostfile
Example hostfile:
cn110.bullx
cn109.bullx
cn108.bullx
cn17.bullx
Use the hostfile to control process placement:
$ mpiexec -hostfile hostfile ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn110
Hello world! from rank 1 of 4 on host cn109
Hello world! from rank 2 of 4 on host cn108
Hello world! from rank 3 of 4 on host cn17
In this example, we see that ranks have been mapped on nodes according to the order in which the nodes appear in the hostfile.
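A hostfile like the one above can be derived from the node list that PBS provides to a job. The sketch below assumes the PBS_NODEFILE environment variable points at that list when run inside a job. Outside a job it falls back to a sample list, which is purely illustrative, so the script still runs.

```shell
#!/bin/sh
# Build a hostfile with each allocated node listed once, in allocation order.
# Inside a PBS job, $PBS_NODEFILE names the file with the allocated nodes;
# outside a job we fall back to a sample list so the sketch still runs.
nodefile=${PBS_NODEFILE:-}
if [ -z "$nodefile" ] || [ ! -r "$nodefile" ]; then
    nodefile=$(mktemp)
    printf '%s\n' cn110.bullx cn110.bullx cn109.bullx cn108.bullx cn17.bullx > "$nodefile"
fi

hostfile=$(mktemp)
# awk keeps the first occurrence of every line and drops later repeats.
awk '!seen[$0]++' "$nodefile" > "$hostfile"
cat "$hostfile"

# With Open MPI available, the file would then be used as:
#   mpiexec -hostfile "$hostfile" ./helloworld_mpi.x
```

Reordering the lines of the generated file changes which node receives rank 0, as the example output above shows.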
Rankfile
Exact control of MPI process placement and resource binding is provided by specifying a rankfile.
Appropriate binding may boost performance of your application.
Example rankfile:
rank 0=cn110.bullx slot=1:0,1
rank 1=cn109.bullx slot=0:*
rank 2=cn108.bullx slot=1:1-2
rank 3=cn17.bullx slot=0:1,1:0-2
rank 4=cn109.bullx slot=0:*,1:*
This rankfile assumes 5 ranks will be running on 4 nodes and provides exact mapping and binding of the processes to the processor sockets and cores.
Explanation:
rank 0 will be bound to cn110, socket1, core0 and core1
rank 1 will be bound to cn109, socket0, all cores
rank 2 will be bound to cn108, socket1, core1 and core2
rank 3 will be bound to cn17, socket0 core1, socket1 core0, core1, core2
rank 4 will be bound to cn109, all cores on both sockets
$ mpiexec -n 5 -rf rankfile --report-bindings ./helloworld_mpi.x
[cn17:11180] MCW rank 3 bound to socket 0[core 1] socket 1[core 0-2]: [. B . . . . . .][B B B . . . . .] (slot list 0:1,1:0-2)
[cn110:09928] MCW rank 0 bound to socket 1[core 0-1]: [. . . . . . . .][B B . . . . . .] (slot list 1:0,1)
[cn109:10395] MCW rank 1 bound to socket 0[core 0-7]: [B B B B B B B B][. . . . . . . .] (slot list 0:*)
[cn108:10406] MCW rank 2 bound to socket 1[core 1-2]: [. . . . . . . .][. B B . . . . .] (slot list 1:1-2)
[cn109:10406] MCW rank 4 bound to socket 0[core 0-7] socket 1[core 0-7]: [B B B B B B B B][B B B B B B B B] (slot list 0:*,1:*)
Hello world! from rank 3 of 5 on host cn17
Hello world! from rank 1 of 5 on host cn109
Hello world! from rank 0 of 5 on host cn110
Hello world! from rank 4 of 5 on host cn109
Hello world! from rank 2 of 5 on host cn108
In this example we run 5 MPI processes (5 ranks) on four nodes. The rankfile defines how the processes will be mapped on the nodes, sockets and cores. The --report-bindings option was used to print out the actual process location and bindings. Note that ranks 1 and 4 run on the same node and their core binding overlaps.
It is the user's responsibility to provide the correct number of ranks, sockets and cores.
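Writing a rankfile by hand gets tedious for larger jobs. As a minimal sketch, the helper below generates one from a list of host names, placing rank i on the i-th host and binding it to all cores of socket 0 with the slot=0:* syntax shown above. The name make_rankfile and the fixed slot choice are assumptions made for illustration only.

```shell
#!/bin/sh
# make_rankfile: hypothetical helper. Reads host names on stdin (one per line)
# and prints a rankfile that places rank i on the i-th host, bound to all
# cores of socket 0 (slot=0:*).
make_rankfile() {
    awk '{ printf "rank %d=%s slot=0:*\n", NR - 1, $1 }'
}

printf '%s\n' cn110.bullx cn109.bullx cn108.bullx cn17.bullx | make_rankfile
```

Redirect the output to a file and pass it to mpiexec with -rf, together with a matching -n, as in the example above. Adding --report-bindings lets you confirm the result.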
In general, Open MPI requires that its executables are in your PATH on every node that you will run on and, if Open MPI was compiled as dynamic libraries (which is the default), the directory where its libraries are located must be in your LD_LIBRARY_PATH on every node.
Specifically, if Open MPI was installed with a prefix of /opt/openmpi, then the following should be in your PATH and LD_LIBRARY_PATH:
PATH: /opt/openmpi/bin
LD_LIBRARY_PATH: /opt/openmpi/lib
Depending on your environment, you may need to set these values in your shell startup files (e.g., .profile, .cshrc, etc.).
NOTE: there are exceptions to this rule -- notably the --prefix option to mpirun.
See this FAQ entry for more details on how to add Open MPI to your PATH and LD_LIBRARY_PATH.
Additionally, Open MPI requires that jobs can be started on remote nodes without any input from the keyboard. For example, if using rsh or ssh as the remote agent, you must have your environment set up to allow execution on remote nodes without entering a password or passphrase.
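For Bourne-style shells, the two settings above could be added to a startup file roughly as follows. This is a sketch only. OMPI_PREFIX is a variable name chosen for this example, /opt/openmpi is the example prefix from the text, and the case guards simply avoid adding the same directory twice when the file is sourced repeatedly.

```shell
# Fragment for a Bourne-style startup file such as .profile.
# OMPI_PREFIX is a name chosen for this sketch; /opt/openmpi is the example
# prefix used above.
OMPI_PREFIX=/opt/openmpi

# Prepend the bin directory unless it is already present in PATH.
case ":$PATH:" in
    *":$OMPI_PREFIX/bin:"*) ;;
    *) PATH="$OMPI_PREFIX/bin:$PATH" ;;
esac

# Same for the library directory; LD_LIBRARY_PATH may be unset or empty.
case ":${LD_LIBRARY_PATH:-}:" in
    *":$OMPI_PREFIX/lib:"*) ;;
    *) LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac

export PATH LD_LIBRARY_PATH
```

csh and tcsh users would need the equivalent setenv statements in .cshrc instead.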
Open MPI's versioning and ABI scheme is described here, but is summarized here in this FAQ entry for convenience.
Open MPI provided forward application binary interface (ABI) compatibility for MPI applications starting with v1.3.2. Prior to that version, no ABI guarantees were provided.
NOTE: Prior to v1.3.2, subtle and strange failures are almost guaranteed to occur if applications were compiled and linked against shared libraries from one version of Open MPI and then run with another. The Open MPI team strongly discourages making any ABI assumptions before v1.3.2.
NOTE: ABI for the "use mpi" Fortran interface was inadvertently broken in the v1.6.3 release, and was restored in the v1.6.4 release. Any Fortran applications that utilize the "use mpi" MPI interface that were compiled and linked against the v1.6.3 release will not be link-time compatible with other releases in the 1.5.x / 1.6.x series. Such applications remain source compatible, however, and can be recompiled/re-linked with other Open MPI releases.
Starting with v1.3.2, Open MPI provides forward ABI compatibility -- with respect to the MPI API only -- in all versions of a given feature release series and its corresponding super stable series. For example, on a single platform, an MPI application linked against Open MPI v1.3.2 shared libraries can be updated to point to the shared libraries in any successive v1.3.x or v1.4 release and still work properly (e.g., via the LD_LIBRARY_PATH environment variable or other operating system mechanism).
For the v1.5 series, this means that all releases of v1.5.x and v1.6.x will be ABI compatible, per the above definition.
Open MPI reserves the right to break ABI compatibility at new feature release series. For example, the same MPI application from above (linked against Open MPI v1.3.2 shared libraries) will not work with Open MPI v1.5 shared libraries. Similarly, MPI applications compiled/linked against Open MPI 1.6.x will not be ABI compatible with Open MPI 1.7.x.
No, but it certainly makes life easier if you do.
A common environment to run Open MPI is in a "Beowulf"-class or similar cluster (e.g., a bunch of 1U servers in a bunch of racks). Simply stated, Open MPI can run on a group of servers or workstations connected by a network. As mentioned above, there are several prerequisites, however (for example, you typically must have an account on all the machines, you can rsh or ssh between the nodes without using a password, etc.).
Regardless of whether Open MPI is installed on a shared / networked filesystem or independently on each node, it is usually easiest if Open MPI is available in the same filesystem location on every node. For example, if you install Open MPI to /opt/openmpi-1.8 on one node, ensure that it is available in /opt/openmpi-1.8 on all nodes.
This FAQ entry has a bunch more information about installation locations for Open MPI.
Open MPI must be able to find its executables in your PATH on every node (if Open MPI was compiled as dynamic libraries, then its library path must appear in LD_LIBRARY_PATH as well). As such, your configuration/initialization files need to add Open MPI to your PATH / LD_LIBRARY_PATH properly.
How to do this may be highly dependent upon your local configuration, so you may need to consult with your local system administrator. Some system administrators take care of these details for you, some don't. YMMV. Some common examples are included below, however.
You must have at least a minimum understanding of how your shell works to get Open MPI in your PATH / LD_LIBRARY_PATH properly. Note that Open MPI must be added to your PATH and LD_LIBRARY_PATH in two situations: (1) when you log in to an interactive shell, and (2) when you log in to non-interactive shells on remote nodes.
- If (1) is not configured properly, executables like mpicc will not be found, and it is typically obvious what is wrong. The Open MPI executable directory can manually be added to the PATH, or the user's startup files can be modified such that the Open MPI executables are added to the PATH every login. This latter approach is preferred.
  All shells have some kind of script file that is executed at login time to set things like PATH and LD_LIBRARY_PATH and perform other environmental setup tasks. This startup file is the one that needs to be edited to add Open MPI to the PATH and LD_LIBRARY_PATH. Consult the manual page for your shell for specific details (some shells are picky about the permissions of the startup file, for example). The table below lists some common shells and the startup files that they read/execute upon login:
  Shell | Interactive login startup file
  sh (Bourne shell, or bash named "sh") | .profile
  csh | .cshrc followed by .login
  tcsh | .tcshrc if it exists, .cshrc if it does not, followed by .login
  bash | .bash_profile if it exists, or .bash_login if it exists, or .profile if it exists (in that order). Note that some Linux distributions automatically come with .bash_profile scripts for users that automatically execute .bashrc as well. Consult the bash man page for more information.
- If (2) is not configured properly, executables like mpirun will not function properly, and it can be somewhat confusing to figure out (particularly for bash users).
  The startup files in question here are the ones that are automatically executed for a non-interactive login on a remote node (e.g., "rsh othernode ps"). Note that not all shells support this, and that some shells use different files for this than listed in (1). Some shells will supersede (2) with (1). That is, fulfilling (2) may automatically fulfill (1). The following table lists some common shells and the startup file that is automatically executed, either by Open MPI or by the shell itself:
  Shell | Non-interactive login startup file
  sh (Bourne or bash named "sh") | This shell does not execute any file automatically, so Open MPI will execute the .profile script before invoking Open MPI executables on remote nodes
  csh | .cshrc
  tcsh | .tcshrc if it exists, or .cshrc if it does not
  bash | .bashrc if it exists
PATH and/or LD_LIBRARY_PATH?
There are some situations where you cannot modify the PATH or LD_LIBRARY_PATH -- e.g., some ISV applications prefer to hide all parallelism from the user, and therefore do not want to make the user modify their shell startup files. Another case is where you want a single user to be able to launch multiple MPI jobs simultaneously, each with a different MPI implementation. Hence, setting shell startup files to point to one MPI implementation would be problematic.
In such cases, you have two options:
- Use mpirun's --prefix command line option (described below).
- Modify the wrapper compilers to include directives to include run-time search locations for the Open MPI libraries (see this FAQ entry).
mpirun's --prefix command line option takes as an argument the top-level directory where Open MPI was installed. While relative directory names are possible, they can become ambiguous depending on the job launcher used; using absolute directory names is strongly recommended.
For example, say that Open MPI was installed into /opt/openmpi-1.8. You would use the --prefix option like this:
shell$ mpirun --prefix /opt/openmpi-1.8 -np 4 a.out
This will prefix the PATH and LD_LIBRARY_PATH on both the local and remote hosts with /opt/openmpi-1.8/bin and /opt/openmpi-1.8/lib, respectively. This is usually unnecessary when using resource managers to launch jobs (e.g., SLURM, Torque, etc.) because they tend to copy the entire local environment -- to include the PATH and LD_LIBRARY_PATH -- to remote nodes before execution. As such, if PATH and LD_LIBRARY_PATH are set properly on the local node, the resource manager will automatically propagate those values out to remote nodes. The --prefix option is therefore usually most useful in rsh or ssh-based environments (or similar).
Beginning with the 1.2 series, it is possible to make this the default behavior by passing to configure the flag --enable-mpirun-prefix-by-default. This will make mpirun behave exactly the same as "mpirun --prefix $prefix ...", where $prefix is the value given to --prefix in configure.
Finally, note that specifying the absolute pathname to mpirun is equivalent to using the --prefix argument. For example, the following is equivalent to the above command line that uses --prefix:
shell$ /opt/openmpi-1.8/bin/mpirun -np 4 a.out
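The equivalence between an absolute mpirun path and --prefix comes down to stripping the trailing bin/mpirun from the path. The short sketch below illustrates that relationship. The helper name prefix_of is made up for this example and is not part of Open MPI.

```shell
#!/bin/sh
# prefix_of: hypothetical helper that strips the trailing /bin/mpirun from an
# absolute path, yielding the directory that --prefix would be given.
prefix_of() {
    dirname "$(dirname "$1")"
}

prefix_of /opt/openmpi-1.8/bin/mpirun    # prints /opt/openmpi-1.8
```

This can be handy in job scripts that need both the launcher path and the matching library directory ("$(prefix_of ...)/lib") for one specific installation.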
Similar to many MPI implementations, Open MPI provides the commands mpirun and mpiexec to launch MPI jobs. Several of the questions in this FAQ category deal with using these commands.
Note, however, that these commands are exactly identical. Specifically, they are symbolic links to a common back-end launcher command named orterun (Open MPI's run-time environment interaction layer is named the Open Run-Time Environment, or ORTE -- hence orterun).
As such, the rest of this FAQ usually refers only to mpirun, even though the same discussions also apply to mpiexec and orterun (because they are all, in fact, the same command).
Open MPI provides both mpirun and mpiexec commands. A simple way to start a single program, multiple data (SPMD) application in parallel is:
shell$ mpirun -np 4 my_parallel_application
This starts a four-process parallel application, running four copies of the executable named my_parallel_application.
The rsh starter component accepts the --hostfile (also known as --machinefile) option to indicate which hosts to start the processes on:
shell$ cat my_hostfile
host01.example.com
host02.example.com
shell$ mpirun --hostfile my_hostfile -np 4 my_parallel_application
This command will launch one copy of my_parallel_application on each of host01.example.com and host02.example.com.
More information about the --hostfile option, and hostfiles in general, is available in this FAQ entry.
Note, however, that not all environments require a hostfile. For example, Open MPI will automatically detect when it is running in batch / scheduled environments (such as SGE, PBS/Torque, SLURM, and LoadLeveler), and will use host information provided by those systems.
Also note that if using a launcher that requires a hostfile and no hostfile is specified, all processes are launched on the local host.
Both the mpirun and mpiexec commands support multiple program, multiple data (MPMD) style launches, either from the command line or from a file. For example:
shell$ mpirun -np 2 a.out : -np 2 b.out
This will launch a single parallel application, but the first two processes will be instances of the a.out executable, and the second two processes will be instances of the b.out executable. In MPI terms, this will be a single MPI_COMM_WORLD, but the a.out processes will be ranks 0 and 1 in MPI_COMM_WORLD, while the b.out processes will be ranks 2 and 3 in MPI_COMM_WORLD.
mpirun (and mpiexec) can also accept a parallel application specified in a file instead of on the command line. For example:
shell$ mpirun --app my_appfile
where the file my_appfile contains the following:
# Comments are supported; comments begin with #
# Application context files specify each sub-application in the
# parallel job, one per line. The first sub-application is the 2
# a.out processes:
-np 2 a.out
# The second sub-application is the 2 b.out processes:
-np 2 b.out
This will result in the same behavior as running a.out and b.out from the command line.
Note that mpirun and mpiexec are identical in command-line options and behavior; using the above command lines with mpiexec instead of mpirun will result in the same behavior.
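As a small sketch of the appfile approach, the script below writes the two-context appfile to a temporary file and adds up the -np values to show the resulting MPI_COMM_WORLD size. The temporary file and the awk summation are illustration only, and the actual launch is left as a comment because it needs Open MPI plus built a.out and b.out binaries.

```shell
#!/bin/sh
# Write the two-context appfile described above to a temporary file.
appfile=$(mktemp)
cat > "$appfile" <<'EOF'
# First sub-application: 2 a.out processes
-np 2 a.out
# Second sub-application: 2 b.out processes
-np 2 b.out
EOF

# Sum the -np values of the non-comment lines to get the MPI_COMM_WORLD size.
total=$(awk '!/^#/ && $1 == "-np" { n += $2 } END { print n + 0 }' "$appfile")
echo "appfile $appfile describes $total processes"

# With Open MPI installed and a.out / b.out built, launch it with:
#   mpirun --app "$appfile"
```

Ranks are assigned in file order, so the a.out lines come first to give them ranks 0 and 1.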
There are three general mechanisms:
- The --hostfile option to mpirun. Use this option to specify a list of hosts on which to run. Note that for compatibility with other MPI implementations, --machinefile is a synonym for --hostfile. See this FAQ entry for more information about the --hostfile option.
- The --host option to mpirun can be used to specify a list of hosts on which to run on the command line. See this FAQ entry for more information about the --host option.
- If you are running in a scheduled environment (e.g., in a SLURM, Torque, or LSF job), Open MPI will automatically get the lists of hosts from the scheduler.
NOTE: The specification of hosts using any of the above methods has nothing to do with the network interfaces that are used for MPI traffic. The list of hosts is only used for specifying which hosts on which to launch MPI processes.
(you should probably also see this FAQ entry, too)
If you can run ompi_info and possibly even launch MPI processes locally, but fail to launch MPI processes on remote hosts, it is likely that you do not have your PATH and/or LD_LIBRARY_PATH set up properly on the remote nodes.
Specifically, the Open MPI commands usually run properly even if LD_LIBRARY_PATH is not set properly because they encode the Open MPI library location in their executables and search there by default. Hence, running ompi_info (and friends) usually works, even in some improperly set up environments.
However, Open MPI's wrapper compilers do not encode the Open MPI library locations in MPI executables by default (the wrappers only specify a bare minimum of flags necessary to create MPI executables; we consider any flags beyond this bare minimum set a local policy decision). Hence, attempting to launch MPI executables in environments where LD_LIBRARY_PATH is either not set or was set improperly may result in messages about libmpi.so not being found.
You can change Open MPI's wrapper compiler behavior to specify the run-time location of Open MPI's libraries, if you wish.
Depending on how Open MPI was configured and/or invoked, it may even be possible to run MPI applications in environments where PATH and/or LD_LIBRARY_PATH is not set, or is set improperly. This can be desirable for environments where multiple MPI implementations are installed, such as multiple versions of Open MPI.
In addition to what is mentioned in this FAQ entry, when you are able to run MPI jobs on a single host, but fail to run them across multiple hosts, try the following:
- Ensure that your launcher is able to launch across multiple hosts. For example, if you are using ssh, try to ssh to each remote host and ensure that you are not prompted for a password. For example:
  shell$ ssh remotehost hostname
  remotehost
  If you are unable to launch across multiple hosts, check that your SSH keys are set up properly. Or, if you are running in a managed environment, such as in a SLURM, Torque, or other job launcher, check that you have reserved enough hosts, are running in an allocated job, etc.
- Ensure that your PATH and LD_LIBRARY_PATH are set correctly on each remote host on which you are trying to run. For example, with ssh:
  shell$ ssh remotehost env | grep -i path
  PATH=...path on the remote host...
  LD_LIBRARY_PATH=...LD library path on the remote host...
  If your PATH or LD_LIBRARY_PATH are not set properly, see this FAQ entry for the correct values. Keep in mind that it is fine to have multiple Open MPI installations installed on a machine; the first Open MPI installation found by PATH and LD_LIBRARY_PATH is the one that matters.
- Run a simple, non-MPI job across multiple hosts. This verifies that the Open MPI run-time system is functioning properly across multiple hosts. For example, try running the hostname command:
  shell$ mpirun --host remotehost hostname
  remotehost
  shell$ mpirun --host remotehost,otherhost hostname
  remotehost
  otherhost
  If you are unable to run non-MPI jobs across multiple hosts, check for common problems such as:
  - Check your non-interactive shell setup on each remote host to ensure that it is setting up the PATH and LD_LIBRARY_PATH properly.
  - Check that Open MPI is finding and launching the correct version of Open MPI on the remote hosts.
  - Ensure that you have firewalling disabled between hosts (Open MPI opens random TCP and sometimes random UDP ports between hosts in a single MPI job).
  - Try running with the plm_base_verbose MCA parameter at level 10, which will enable extra debugging output to see how Open MPI launches on remote hosts. For example:
    mpirun --mca plm_base_verbose 10 --host remotehost hostname
- Now run a simple MPI job across multiple hosts that does not involve MPI communications. The "hello_c" program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to initialize and terminate properly. For example:
  shell$ mpirun --host remotehost,otherhost hello_c
  Hello, world, I am 0 of 1, (Open MPI v1.7.5, package: Open MPI [email protected] Distribution, ident: 1.7.5, Mar 20, 2014, 99)
  Hello, world, I am 1 of 1, (Open MPI v1.7.5, package: Open MPI [email protected] Distribution, ident: 1.7.5, Mar 20, 2014, 99)
  If you are unable to run simple, non-communication MPI jobs, this can indicate that your Open MPI installation is unable to initialize properly on remote hosts. Double check your non-interactive login setup on remote hosts.
- Now run a simple MPI job across multiple hosts that does some simple MPI communications. The "ring_c" program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to pass MPI traffic across your network. For example:
  shell$ mpirun --host remotehost,otherhost ring_c
  Process 0 sending 10 to 0, tag 201 (1 processes in ring)
  Process 0 sent to 0
  Process 0 decremented value: 9
  Process 0 decremented value: 8
  Process 0 decremented value: 7
  Process 0 decremented value: 6
  Process 0 decremented value: 5
  Process 0 decremented value: 4
  Process 0 decremented value: 3
  Process 0 decremented value: 2
  Process 0 decremented value: 1
  Process 0 decremented value: 0
  Process 0 exiting
  If you are unable to run simple MPI jobs across multiple hosts, this may indicate a problem with the network(s) that Open MPI is trying to use for MPI communications. Try limiting the networks that it uses, and/or exploring levels 1 through 3 MCA parameters for the communications module that you are using. For example, if you're using the TCP BTL, see the output of ompi_info --level 3 --param btl tcp.
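The checks above can be strung together into a small script so that they always run in the same order. The sketch below only prints the commands by default; setting RUN=1 executes them and stops at the first failure. The HOSTS and RUN variables and the step helper are names invented for this example, and hello_c and ring_c are assumed to be built and reachable on every host.

```shell
#!/bin/sh
# Walk through the checks above in order. By default the commands are only
# printed; set RUN=1 to execute them. HOSTS and RUN are names chosen for
# this sketch.
HOSTS=${HOSTS:-remotehost,otherhost}

step() {
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    fi
}

# Password-less login and remote environment, one host at a time.
for h in $(echo "$HOSTS" | tr ',' ' '); do
    step ssh "$h" hostname
    step ssh "$h" env
done
# Non-MPI job through the Open MPI run-time system.
step mpirun --host "$HOSTS" hostname
# MPI initialization and termination only, then real MPI traffic.
step mpirun --host "$HOSTS" hello_c
step mpirun --host "$HOSTS" ring_c
```

Because each step stops the script on failure, the last command printed tells you which layer (login, environment, run-time, MPI init, or network) to look at.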
The problem is usually because the Intel libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while attempting to launch so we are aborting.
[...more error messages...]
Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries was not able to be found. This particular library, libimf.so, is an Intel compiler library. As such, it is likely that the user did not set up the Intel compiler library in their environment properly on this node.
Double check that you have set up the Intel compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the Intel compiler environment is set up properly for interactive logins, but not for non-interactive logins. For example:
shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
The above example shows that running a trivial C program compiled by the Intel compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Intel compiler environment is set up properly for non-interactive logins.
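A quick way to see which shared libraries cannot be resolved in a given environment is to filter the output of ldd for "not found" entries. The helper below is a sketch of that idea. The name missing_libs is made up, and the sample input only imitates ldd output for the libimf.so case above.

```shell
#!/bin/sh
# missing_libs: hypothetical helper. Reads `ldd` output on stdin and prints
# the names of libraries the dynamic linker could not resolve.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Sample ldd output imitating the libimf.so failure shown above.
printf '\tlibimf.so => not found\n\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)\n' | missing_libs
```

To test the non-interactive environment of a node, the ldd call would be run through ssh in a single command (for example, ssh node1.example.com ldd $HOME/mpi_hello) and piped into missing_libs. The same check applies to compiler runtimes from other vendors.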
The problem is usually because the PGI libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while attempting to launch so we are aborting.
[...more error messages...]
Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries was not able to be found. This particular library, libpgc.so, is a PGI compiler library. As such, it is likely that the user did not set up the PGI compiler library in their environment properly on this node.
Double check that you have set up the PGI compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the PGI compiler environment is set up properly for interactive logins, but not for non-interactive logins. For example:
shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
The above example shows that running a trivial C program compiled by the PGI compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the PGI compiler environment is set up properly for non-interactive logins.
The problem is usually because the Pathscale libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while attempting to launch so we are aborting.
[...more error messages...]
Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries was not able to be found. This particular library, libmv.so, is a Pathscale compiler library. As such, it is likely that the user did not set up the Pathscale compiler library in their environment properly on this node.
Double check that you have set up the Pathscale compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the Pathscale compiler environment is set up properly for interactive logins, but not for non-interactive logins. For example:
shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
The above example shows that running a trivial C program compiled by the Pathscale compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Pathscale compiler environment is set up properly for non-interactive logins.
mpirun / mpiexec?
Yes.
Indeed, Open MPI's mpirun and mpiexec are actually synonyms for our underlying launcher named orterun (i.e., the Open Run-Time Environment layer in Open MPI, or ORTE). So you can use mpirun and mpiexec to launch any application. For example:
shell$ mpirun -np 2 --host a,b uptime
This will launch a copy of the unix command uptime on the hosts a and b.
Other questions in the FAQ section deal with the specifics of the mpirun command line interface; suffice it to say that it works equally well for MPI and non-MPI applications.
Yes, but it will depend on your local setup and may require additional setup.
In short: you will need to have X forwarding enabled from the remote processes to the display where you want output to appear. In a secure environment, you can simply allow all X requests to be shown on the target display and set the DISPLAY environment variable in all MPI processes' environments to the target display, perhaps something like this:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

However, this technique is not generally suitable for insecure environments (because it allows anyone to read and write to your display). A slightly more secure way is to only allow X connections from the nodes where your application will be running:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +compute1 +compute2 +compute3 +compute4
compute1 being added to access control list
compute2 being added to access control list
compute3 being added to access control list
compute4 being added to access control list
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

(assuming that the four nodes you are running on are compute1 through compute4).

Other methods are available, but they involve sophisticated X forwarding through mpirun and are generally more complicated than desirable.
Maybe. But probably not.
Open MPI provides fairly sophisticated stdin / stdout / stderr forwarding. However, it does not work well with curses, ncurses, readline, or other sophisticated I/O packages that generally require direct control of the terminal.
Every application and I/O library is different -- you should try to see if yours is supported. But chances are that it won't work.
Sorry. :-(
What other options are available to mpirun?
mpirun supports the "--help" option, which provides a usage message and a summary of the options that it supports. It should be considered the definitive list of what options are provided.

Several notable options are:
- --hostfile: Specify a hostfile for launchers (such as the rsh launcher) that need to be told on which hosts to start parallel applications. Note that for compatibility with other MPI implementations, --machinefile is a synonym for --hostfile.
- --host: Specify a host or list of hosts to run on (see this FAQ entry for more details)
- --np (or -np): Indicate the number of processes to start.
- --mca (or -mca): Set MCA parameters (see the Run-Time Tuning FAQ)
- --wdir <directory>: Set the working directory of the started applications. If not supplied, the current working directory is assumed (or $HOME, if the current working directory does not exist on all nodes).
- -x <env-variable-name>: The name of an environment variable to export to the parallel application. The -x option can be specified multiple times to export multiple environment variables to the parallel application.
How do I use the --hostfile option to mpirun?

The --hostfile option to mpirun takes a filename that lists hosts on which to launch MPI processes.

NOTE: The hosts listed in a hostfile have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.

Hostfiles are simple text files with hosts specified, one per line. Each host can also specify a default and a maximum number of slots to be used on that host (i.e., the number of available processors on that host). Comments are also supported, and blank lines are ignored. For example, a hostfile named my_hostfile might contain:

# This is an example hostfile. Comments begin with #
#
# The following node is a single processor machine:
foo.example.com

# The following node is a dual-processor machine:
bar.example.com slots=2

# The following node is a quad-processor machine, and we absolutely
# want to disallow over-subscribing it:
yow.example.com slots=4 max-slots=4

slots and max-slots are discussed more in this FAQ entry.
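As a rough illustration of the hostfile format described above, the following Python sketch reads hostfile text into host names and slot counts. It is not how Open MPI parses hostfiles: the function name parse_hostfile and the returned dictionary layout are hypothetical, and accepting both the max-slots and max_slots spellings is an assumption based on the examples in this FAQ.

```python
def parse_hostfile(text):
    """Parse hostfile text into {host: {"slots": int|None, "max_slots": int|None}}.

    None means the keyword was not given for that host (hypothetical convention).
    """
    hosts = {}
    for raw_line in text.splitlines():
        # Comments begin with '#'; blank lines are ignored.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        entry = {"slots": None, "max_slots": None}
        for field in fields[1:]:
            key, sep, value = field.partition("=")
            key = key.replace("-", "_")  # assumption: treat max-slots like max_slots
            if sep and key in entry:
                entry[key] = int(value)
        hosts[fields[0]] = entry
    return hosts


EXAMPLE_HOSTFILE = """\
# This is an example hostfile. Comments begin with #
foo.example.com
bar.example.com slots=2
yow.example.com slots=4 max-slots=4
"""

for host, entry in parse_hostfile(EXAMPLE_HOSTFILE).items():
    print(host, entry)
```

Keeping "not specified" as None, rather than substituting a default, leaves the choice of default slot count to the caller.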
Hostfiles work in two different ways:

- Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a hostfile or a batch scheduler such as SLURM, PBS/Torque, SGE, etc.), the hosts provided by the hostfile must be in the already-provided host list. If the hostfile-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

  In this case, hostfiles act like an exclusionary filter -- they limit the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.
  For example, say that a scheduler job contains hosts node01 through node04. If you run:

  shell$ cat my_hosts
  node03
  shell$ mpirun -np 1 --hostfile my_hosts hostname

  This will run a single copy of hostname on the host node03. However, if you run:

  shell$ cat my_hosts
  node17
  shell$ mpirun -np 1 --hostfile my_hosts hostname

  This is an error (because node17 is not one of the hosts in the scheduler job); mpirun will abort.

  Finally, note that in exclusionary mode, processes will only be executed on the hostfile-specified hosts, even if it causes oversubscription. For example:
  shell$ cat my_hosts
  node03
  shell$ mpirun -np 4 --hostfile my_hosts hostname

  This will launch 4 copies of hostname on host node03.

- Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --hostfile option will be used as the original and final host list.

  In this case, --hostfile acts as an inclusionary agent; all --hostfile-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

  shell$ cat my_hosts
  node01.example.com
  node02.example.com
  node03.example.com
  shell$ mpirun -np 3 --hostfile my_hosts hostname

  This will launch a single copy of hostname on the hosts node01.example.com, node02.example.com, and node03.example.com.

Note, too, that --hostfile is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --hostfile can be specified multiple times:

shell$ cat hostfile_1
node01.example.com
shell$ cat hostfile_2
node02.example.com
shell$ mpirun -np 1 --hostfile hostfile_1 hostname : -np 1 --hostfile hostfile_2 uptime
node01.example.com
 06:11:45 up 1 day, 2:32, 0 users, load average: 21.65, 20.85, 19.84

Notice that hostname was launched on node01.example.com and uptime was launched on node02.example.com.
How do I use the --host option to mpirun?

The --host option to mpirun takes a comma-delimited list of hosts on which to run. For example:

shell$ mpirun -np 3 --host a,b,c hostname

This will launch one copy of hostname on hosts a, b, and c.

NOTE: The hosts specified by the --host option have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.
--host works in two different ways:

- Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a hostfile or a batch scheduler such as SLURM, PBS/Torque, SGE, etc.), the hosts provided by the --host option must be in the already-provided host list. If the --host-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

  In this case, the --host option acts like an exclusionary filter -- it limits the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

  For example, say that the hostfile my_hosts contains the hosts node1 through node4. If you run:

  shell$ mpirun -np 1 --hostfile my_hosts --host node3 hostname

  This will run a single copy of hostname on the host node3. However, if you run:

  shell$ mpirun -np 1 --hostfile my_hosts --host node17 hostname

  This is an error (because node17 is not listed in my_hosts); mpirun will abort.

  Finally, note that in exclusionary mode, processes will only be executed on the --host-specified hosts, even if it causes oversubscription. For example:

  shell$ mpirun -np 4 --host a uptime

  This will launch 4 copies of uptime on host a.

- Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --host option will be used as the original and final host list.

  In this case, --host acts as an inclusionary agent; all --host-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

  shell$ mpirun -np 3 --host a,b,c hostname

  This will launch a single copy of hostname on the hosts a, b, and c.

Note, too, that --host is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --host can be specified multiple times:

shell$ mpirun -np 1 --host a hostname : -np 1 --host b uptime

This will launch hostname on host a and uptime on host b.
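The exclusionary and inclusionary rules described above for --hostfile and --host can be modelled in a few lines of Python. This is only a sketch of the rules as this FAQ states them, not Open MPI's implementation; the function name final_host_list and its arguments are hypothetical.

```python
def final_host_list(requested, already_provided=None):
    """Return the final host list for one application.

    requested:        hosts named via --host or --hostfile.
    already_provided: hosts supplied by another source (e.g. a scheduler or a
                      hostfile), or None if no other source supplied a list.
    """
    if already_provided is None:
        # Inclusionary: the requested hosts are the original and final list.
        return list(requested)

    # Exclusionary: every requested host must already be in the provided list.
    unknown = [host for host in requested if host not in already_provided]
    if unknown:
        # The FAQ describes mpirun as aborting without launching anything.
        raise ValueError("hosts not in the provided list: " + ", ".join(unknown))
    return list(requested)


allocation = ["node1", "node2", "node3", "node4"]
print(final_host_list(["node3"], allocation))
print(final_host_list(["a", "b", "c"]))
```

Raising an exception stands in for mpirun aborting; a real launcher would report the error to the user instead.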
How do I control how my processes are scheduled across nodes?

The short version is that if you are not oversubscribing your nodes (i.e., trying to run more processes than you have told Open MPI are available on that node), scheduling is pretty simple and occurs either on a by-slot or by-node round robin schedule. If you're oversubscribing, the issue gets much more complicated -- keep reading.
The more complete answer is: Open MPI schedules processes to nodes by asking two questions from each application on the mpirun command line:
- How many processes should be launched?
- Where should those processes be launched?
The "how many" question is directly answered with the -np switch to mpirun. The "where" question is a little more complicated, and depends on three factors:
- The final node list (e.g., after --host exclusionary or inclusionary processing)
- The scheduling policy (which applies to all applications in a single job)
- The default and maximum number of slots on each host
As briefly mentioned in this FAQ entry, slots are Open MPI's representation of how many processors are available on a given host.
The default number of slots on any machine, if not explicitly specified, is 1 (e.g., if a host is listed in a hostfile but has no corresponding "slots" keyword). Schedulers (such as SLURM, PBS/Torque, SGE, etc.) automatically provide an accurate default slot count.
Max slot counts, however, are rarely specified by schedulers. The max slot count for each node will default to "infinite" if it is not provided (meaning that Open MPI will oversubscribe the node if you ask it to -- see more on oversubscribing in this FAQ entry).
Open MPI currently supports two scheduling policies: by slot and by node:
- By slot: This is the default scheduling policy, but can also be explicitly requested either by using the --byslot option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "slot".

  In this mode, Open MPI will schedule processes on a node until all of its default slots are exhausted before proceeding to the next node. In MPI terms, this means that Open MPI tries to maximize the number of adjacent ranks in MPI_COMM_WORLD on the same host without oversubscribing that host.

  For example:
  shell$ cat my-hosts
  node0 slots=2 max_slots=20
  node1 slots=2 max_slots=20
  shell$ mpirun --hostfile my-hosts -np 8 --byslot hello | sort
  Hello World I am rank 0 of 8 running on node0
  Hello World I am rank 1 of 8 running on node0
  Hello World I am rank 2 of 8 running on node1
  Hello World I am rank 3 of 8 running on node1
  Hello World I am rank 4 of 8 running on node0
  Hello World I am rank 5 of 8 running on node0
  Hello World I am rank 6 of 8 running on node1
  Hello World I am rank 7 of 8 running on node1

- By node: This policy can be requested either by using the --bynode option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "node".

  In this mode, Open MPI will schedule a single process on each node in a round-robin fashion (looping back to the beginning of the node list as necessary) until all processes have been scheduled. Nodes are skipped once their default slot counts are exhausted.
For example:
  shell$ cat my-hosts
  node0 slots=2 max_slots=20
  node1 slots=2 max_slots=20
  shell$ mpirun --hostfile my-hosts -np 8 --bynode hello | sort
  Hello World I am rank 0 of 8 running on node0
  Hello World I am rank 1 of 8 running on node1
  Hello World I am rank 2 of 8 running on node0
  Hello World I am rank 3 of 8 running on node1
  Hello World I am rank 4 of 8 running on node0
  Hello World I am rank 5 of 8 running on node1
  Hello World I am rank 6 of 8 running on node0
  Hello World I am rank 7 of 8 running on node1

In both policies, if the default slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will loop through the list of nodes again and try to schedule one more process to each node until all processes are scheduled. Nodes are skipped in this process if their maximum slot count is exhausted. If the maximum slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will abort without launching any processes.
NOTE: This is the scheduling policy in Open MPI because of a long historical precedent in LAM/MPI. However, the scheduling of processes to processors is a component in the RMAPS framework in Open MPI; it can be changed. If you don't like how this scheduling occurs, please let us know.
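The difference between the two policies is easiest to see when no node is oversubscribed. The Python sketch below models only that case, following the by-slot and by-node descriptions above; it is not Open MPI's RMAPS code, the function name schedule and its arguments are hypothetical, and the oversubscription pass is deliberately left out.

```python
def schedule(hosts, nprocs, policy="slot"):
    """Map ranks 0..nprocs-1 to host names.

    hosts:  list of (name, default_slots) pairs in node-list order,
            with each host name appearing once.
    policy: "slot" (fill each node's default slots before moving on) or
            "node" (one process per node, round robin).
    Only the non-oversubscribed case is modelled.
    """
    if nprocs > sum(slots for _, slots in hosts):
        raise ValueError("oversubscription is not modelled in this sketch")

    placement = []
    if policy == "slot":
        for name, slots in hosts:
            for _ in range(slots):
                if len(placement) < nprocs:
                    placement.append(name)
    elif policy == "node":
        used = {name: 0 for name, _ in hosts}
        while len(placement) < nprocs:
            for name, slots in hosts:
                # Nodes are skipped once their default slots are exhausted.
                if used[name] < slots and len(placement) < nprocs:
                    used[name] += 1
                    placement.append(name)
    else:
        raise ValueError("unknown policy: " + policy)
    return placement


nodes = [("node0", 2), ("node1", 2)]
for policy in ("slot", "node"):
    for rank, host in enumerate(schedule(nodes, 4, policy)):
        print(policy, "rank", rank, "on", host)
```

With two slots per node and four processes, by-slot places adjacent ranks together on the same node, while by-node alternates between the nodes.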
I'm not using a hostfile. How are slots calculated?

If you are using a supported resource manager, Open MPI will get the slot information directly from that entity. If you are using the --host parameter to mpirun, be aware that each instance of a hostname bumps up the internal slot count by one. For example:

shell$ mpirun --host node0,node0,node0,node0 ....

This tells Open MPI that host "node0" has a slot count of 4. This is very different than, for example:

shell$ mpirun -np 4 --host node0 a.out

This tells Open MPI that host "node0" has a slot count of 1 but you are running 4 processes on it. Specifically, Open MPI assumes that you are oversubscribing the node.

Can I run multiple parallel processes on a uniprocessor machine?

Yes.

But be very careful to ensure that Open MPI knows that you are oversubscribing your node! If Open MPI is unaware that you are oversubscribing a node, severe performance degradation can result.

See this FAQ entry for more details on oversubscription.
Yes.
However, it is critical that Open MPI knows that you are oversubscribing the node, or severe performance degradation can result.
The short explanation is as follows: never specify a number of slots that is more than the available number of processors. For example, if you want to run 4 processes on a uniprocessor, then indicate that you only have 1 slot but want to run 4 processes. For example:
shell$ cat my-hostfile
localhost
shell$ mpirun -np 4 --hostfile my-hostfile a.out

Specifically: do NOT have a hostfile that contains "slots=4" (because there is only one available processor).

Here's the full explanation:
Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded.
- Degraded: When Open MPI thinks that it is in an oversubscribed mode (i.e., more processes are running than there are processors available), MPI processes will automatically run in degraded mode and frequently yield the processor to its peers, thereby allowing all processes to make progress (be sure to see this FAQ entry that describes how degraded mode affects processor and memory affinity).
- Aggressive: When Open MPI thinks that it is in an exactly- or under-subscribed mode (i.e., the number of running processes is equal to or less than the number of available processors), MPI processes will automatically run in aggressive mode, meaning that they will never voluntarily give up the processor to other processes. With some network transports, this means that Open MPI will spin in tight loops attempting to make message passing progress, effectively causing other processes to not get any CPU cycles (and therefore never make any progress).
For example, on a uniprocessor node:
shell$ cat my-hostfile
localhost slots=4
shell$ mpirun -np 4 --hostfile my-hostfile a.out

This would cause all 4 MPI processes to run in aggressive mode because Open MPI thinks that there are 4 available processors to use. This is actually a lie (there is only 1 processor -- not 4), and can cause extremely bad performance.
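The slot-counting rule for --host and the choice between the two performance modes can be restated as a small Python sketch. It only encodes the rules as described in this FAQ (one slot per occurrence of a hostname, degraded mode when processes outnumber slots); the function names are hypothetical and nothing here queries Open MPI itself.

```python
from collections import Counter


def slots_from_host_option(host_arg):
    """Count slots implied by a --host value: each hostname occurrence adds one."""
    return Counter(host for host in host_arg.split(",") if host)


def progress_mode(nprocs_on_node, slots_on_node):
    """Return "degraded" if the node is oversubscribed, else "aggressive"."""
    return "degraded" if nprocs_on_node > slots_on_node else "aggressive"


# --host node0,node0,node0,node0 declares 4 slots on node0.
print(slots_from_host_option("node0,node0,node0,node0"))
# 4 processes on 1 declared slot: Open MPI is told it is oversubscribed.
print(progress_mode(4, 1))
# 4 processes on 4 declared slots: aggressive, even on a real uniprocessor.
print(progress_mode(4, 4))
```

The last call shows why overstating the slot count is harmful: the mode follows what Open MPI was told, not the hardware that actually exists.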
Yes.
The MCA parameter mpi_yield_when_idle controls whether an MPI process runs in Aggressive or Degraded performance mode. Setting it to zero forces Aggressive mode; any other value forces Degraded mode (see this FAQ entry to see how to set MCA parameters).

Note that this value only affects the behavior of MPI processes when they are blocking in MPI library calls. It does not affect behavior of non-MPI processes, nor does it affect the behavior of a process that is not inside an MPI library call.

Open MPI normally sets this parameter automatically (see this FAQ entry for details). Users are cautioned against setting this parameter unless they are really, absolutely, positively sure of what they are doing.
Generally, you can run Open MPI processes with TotalView as follows:
shell$ mpirun --debug ...mpirun arguments...

Assuming that TotalView is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the TotalView debugger. Be sure to see this FAQ entry for details about what versions of Open MPI and TotalView are compatible.
For reference, this underlying command form is the following:
shell$ totalview mpirun -a ...mpirun arguments...

So if you wanted to run a 4-process MPI job of your a.out executable, it would look like this:

shell$ totalview mpirun -a -np 4 a.out

Alternatively, Open MPI's mpirun offers the "-tv" convenience option which does the same thing as TotalView's "-a" syntax. For example:

shell$ mpirun -tv -np 4 a.out

Note that by default, TotalView will stop deep in the machine code of mpirun itself, which is not what most users want. It is possible to get TotalView to recognize that mpirun is simply a "starter" program and should be (effectively) ignored. Specifically, TotalView can be configured to skip mpirun (and mpiexec and orterun) and jump right into your MPI application. This can be accomplished by placing some startup instructions in a TotalView-specific file named $HOME/.tvdrc.

Open MPI includes a sample TotalView startup file that performs this function (see etc/openmpi-totalview.tcl in Open MPI distribution tarballs; it is also installed, by default, to $prefix/etc/openmpi-totalview.tcl in the Open MPI installation). This file can be either copied to $HOME/.tvdrc or sourced from the $HOME/.tvdrc file. For example, placing the following line in your $HOME/.tvdrc (replacing /path/to/openmpi/installation with the proper directory name, of course) will use the Open MPI-provided startup file:

source /path/to/openmpi/installation/etc/openmpi-totalview.tcl
If you've used DDT at least once before (to use the configuration wizard to setup support for Open MPI), you can start it on the command line with:
shell$ mpirun --debug ...mpirun arguments...

Assuming that you are using Open MPI v1.2.4 or later, and assuming that DDT is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the DDT debugger. For reference (or if you are using an earlier version of Open MPI), this underlying command form is the following:

shell$ ddt -n {nprocs} -start {exe-name}

Note that passing arbitrary arguments to Open MPI's mpirun is not supported with the DDT debugger.

You can also attach to already-running processes with either of the following two syntaxes:

shell$ ddt -attach {hostname1:pid} [{hostname2:pid} ...] {exec-name}
# Or
shell$ ddt -attach-file {filename of newline separated hostname:pid pairs} {exec-name}

DDT can even be configured to operate with cluster/resource schedulers such that it can run on a local workstation, submit your MPI job via the scheduler, and then attach to the MPI job when it starts.
See the official DDT documentation for more details.
The documentation contained in the Open MPI tarball will have the most up-to-date information, but as of v1.0, Open MPI supports:
- BProc versions 3 and 4 (discontinued starting with OMPI v1.3)
- Sun Grid Engine (SGE), and the open source Grid Engine (support first introduced in Open MPI v1.2)
- PBS Pro, Torque, and Open PBS
- LoadLeveler scheduler (full support since 1.1.1)
- rsh / ssh
- SLURM
- LSF
- XGrid (discontinued starting with OMPI 1.4)
- Yod (Cray XT-3 and XT-4)
rsh launcher to use rsh or ssh?

Can I suspend and resume my job?

A new feature was added into Open MPI 1.3.1 that supports suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to mpirun. mpirun will catch this signal and forward it to the a.outs as a SIGSTOP signal. To resume the job, you send a SIGCONT signal to mpirun which will be caught and forwarded to the a.outs.

By default, this feature is not enabled. This means that both the SIGTSTP and SIGCONT signals will simply be consumed by the mpirun process. To have them forwarded, you have to run the job with --mca orte_forward_job_control 1. Here is an example on Solaris.

shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out

In another window, we suspend and continue the job.

shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
 15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
 15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
 15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

Note that all this does is stop the a.outs. It does not, for example, free any pinned memory when the job is in the suspended state.

To get this to work under the SGE environment, you have to change the suspend_method entry in the queue. It has to be set to SIGTSTP. Here is an example of what a queue should look like.

shell$ qconf -sq all.q
qname                 all.q
[...snip...]
starter_method        NONE
suspend_method        SIGTSTP
resume_method         NONE

Note that if you need to suspend other types of jobs with SIGSTOP (instead of SIGTSTP) in this queue then you need to provide a script that can implement the correct signals for each job type.
If you want to load the shared library libmpi explicitly at runtime, either by using dlopen() from C/C++ or something like the ctypes package from Python, some extra care is required. The default configuration of Open MPI uses dlopen() internally to load its support components. These components rely on symbols available in libmpi. In order to make the symbols in libmpi available to the components loaded by Open MPI at runtime, libmpi must be loaded with the RTLD_GLOBAL option.

In C/C++, this option is specified as the second parameter to dlopen(). When using ctypes with Python, this can be done with the second (optional) parameter to CDLL(). For example (shown below in Mac OS X, where Open MPI's shared library name ends in ".dylib"; other operating systems use other suffixes, such as ".so"):

from ctypes import *

mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)

f = pythonapi.Py_GetArgcArgv
argc = c_int()
argv = POINTER(c_char_p)()
f(byref(argc), byref(argv))
mpi.MPI_Init(byref(argc), byref(argv))
mpi.MPI_Finalize()

Other scripting languages should have similar options when dynamically loading shared libraries.
Beginning with the 1.3 release, Open MPI provides the following environmental variables that will be defined on every MPI process:
- OMPI_COMM_WORLD_SIZE - the number of processes in this process' MPI Comm_World
- OMPI_COMM_WORLD_RANK - the MPI rank of this process
- OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.
- OMPI_UNIVERSE_SIZE - the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.
- OMPI_COMM_WORLD_LOCAL_SIZE - the number of ranks from this job that are running on this node.
- OMPI_COMM_WORLD_NODE_RANK - the relative rank of this process on this node looking across ALL jobs.
Open MPI guarantees that these variables will remain stable throughout future releases.
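A process can read these variables like any other part of its environment. The Python sketch below collects them into a dictionary of integers; the helper name ompi_process_info is hypothetical, and the optional environ argument exists only so the function can be tried outside of an mpirun launch.

```python
import os

OMPI_VARIABLES = [
    "OMPI_COMM_WORLD_SIZE",
    "OMPI_COMM_WORLD_RANK",
    "OMPI_COMM_WORLD_LOCAL_RANK",
    "OMPI_UNIVERSE_SIZE",
    "OMPI_COMM_WORLD_LOCAL_SIZE",
    "OMPI_COMM_WORLD_NODE_RANK",
]


def ompi_process_info(environ=None):
    """Return {variable: int value, or None if the variable is not set}."""
    environ = os.environ if environ is None else environ
    return {
        name: int(environ[name]) if name in environ else None
        for name in OMPI_VARIABLES
    }


# Outside of mpirun, every value is expected to be None.
info = ompi_process_info()
print("rank", info["OMPI_COMM_WORLD_RANK"], "of", info["OMPI_COMM_WORLD_SIZE"])
```

Because the values come from the environment, they are available before MPI_Init is called, which can be convenient for wrapper scripts.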
e.g., the connections are only opened when the MPI process actually attempts to send a message to another process for the first time. This is done since (a) Open MPI has no idea what connections an application process will really use, and (b) creating the connections takes time. Once the connection is established, it remains "connected" until one of the two connected processes terminates, so the creation time cost is paid only once.

Applications that require a fully connected topology, however, can see improved startup time if they automatically "pre-connect" all their processes during MPI_Init. Accordingly, Open MPI provides the MCA parameter "mpi_preconnect_mpi" which directs Open MPI to establish a "mostly" connected topology during MPI_Init (note that this MCA parameter used to be named "mpi_preconnect_all" prior to Open MPI v1.5; in v1.5, it was deprecated and replaced with "mpi_preconnect_mpi"). This is accomplished in a somewhat scalable fashion to help minimize startup time.
Users can set this parameter in two ways:
- in the environment as OMPI_MCA_mpi_preconnect_mpi=1
- on the cmd line as mpirun -mca mpi_preconnect_mpi 1
See this FAQ entry for more details on how to set MCA parameters.
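The two ways of setting an MCA parameter listed above follow a simple naming pattern, which the Python sketch below spells out. The helper names mca_env and mca_args are hypothetical; the sketch only builds the environment entries and mpirun arguments as strings and does not launch anything.

```python
def mca_env(params):
    """Environment form: each parameter becomes OMPI_MCA_<name>=<value>."""
    return {"OMPI_MCA_" + name: str(value) for name, value in params.items()}


def mca_args(params):
    """Command line form: each parameter becomes '-mca <name> <value>'."""
    args = []
    for name, value in params.items():
        args += ["-mca", name, str(value)]
    return args


params = {"mpi_preconnect_mpi": 1}
print(mca_env(params))
print(" ".join(["mpirun"] + mca_args(params)))
```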
Open MPI 1.7.4 has added some support to take advantage of GPU Direct RDMA on Mellanox cards. However, the supporting driver has not been released yet, so these features cannot be used yet. Note that to get GPU Direct RDMA support, you also need to configure your Open MPI library with CUDA 6.0.
To see if you have GPU Direct RDMA compiled into your library, you can check like this:
> ompi_info --all | grep btl_openib_have_cuda_gdr
MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

To see if your OFED stack has GPU Direct RDMA support, you can check like this.

> ompi_info -all | grep btl_openib_have_driver_gdr
MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

To run with GPU Direct RDMA support, you have to enable it as it is off by default.
--mca btl_openib_want_cuda_gdr 1
NOTE: To build SGE support in v1.3, you will need to explicitly request the SGE support with the "--with-sge" command line switch to Open MPI's configure script.

See this FAQ entry for a description of how to correctly build Open MPI with SGE support.
To verify if support for SGE is configured into your Open MPI installation, run ompi_info as shown below and look for gridengine.
For Open MPI 1.3:
shell$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

Open MPI will automatically detect when it is running inside SGE and will just "do the Right Thing."

Specifically, if you execute an mpirun command in a SGE job, it will automatically use the SGE mechanisms to launch and kill processes. There is no need to specify what nodes to run on -- Open MPI will obtain this information directly from SGE and default to a number of processes equal to the slot count specified. For example, this will run 4 MPI processes on the nodes that were allocated by SGE:

# Get the environment variables for SGE
# (Assuming SGE is installed at /opt/sge and $SGE_CELL is 'default' in your environment)
# C shell settings
shell% source /opt/sge/default/common/settings.csh

# bourne shell settings
shell$ . /opt/sge/default/common/settings.sh

# Allocate an SGE interactive job with 4 slots from a parallel
# environment (PE) named 'orte' and run a 4-process Open MPI job
shell$ qrsh -pe orte 4 -b y mpirun -np 4 a.out

There are also other ways to submit jobs under SGE:
# Submit a batch job with the 'mpirun' command embedded in a script
shell$ qsub -pe orte 4 my_mpirun_job.csh

# Submit an SGE and OMPI job and mpirun in one line
shell$ qrsh -V -pe orte 4 mpirun hostname

# Use qstat(1) to show the status of SGE jobs and queues
shell$ qstat -f

In reference to the setup, be sure you have a Parallel Environment (PE) defined for submitting parallel jobs. You don't have to name your PE "orte". The following example shows what a PE named 'orte' would look like:
% qconf -sp orte
pe_name            orte
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

"qsort_args" is necessary with the Son of Grid Engine distribution, version 8.1.1 and later, and probably only applicable to it. For very old versions of SGE, omit "accounting_summary" too.

You may want to alter other parameters, but the important one is "control_slaves", specifying that the environment has "tight integration". Note also the lack of a start or stop procedure. The tight integration means that mpirun automatically picks up the slot count to use as a default in place of the '-np' argument, picks up a host file, spawns remote processes via 'qrsh' so that SGE can control and monitor them, and creates and destroys a per-job temporary directory ($TMPDIR), in which Open MPI's directory will be created (by default).

Be sure the queue will make use of the PE that you specified:

% qconf -sq all.q
...
pe_list               make cre orte
...

To determine whether the SGE parallel job is successfully launched to the remote nodes, you can pass in the MCA parameter "--mca plm_base_verbose 1" to mpirun.

This will add in a -verbose flag to the qrsh -inherit command that is used to send parallel tasks to the remote SGE execution hosts. It will show whether the connections to the remote hosts are established successfully or not.
Various SGE documentation with pointers to more is available at the Son of GridEngine site, and configuration instructions can be found at the Son of GridEngine configuration how-to site.
If you are running SGE6.2 Update 3 or later, then the -notify flag is supported. If you are running earlier versions, then the -notify flag will not work and using it will cause the job to be killed.
To use -notify, one has to be careful. First, let us review what -notify does. Here is an excerpt from the qsub man page for the -notify flag.

-notify
    This flag, when set causes Sun Grid Engine to send warning signals to a running job prior to sending the signals themselves. If a SIGSTOP is pending, the job will receive a SIGUSR1 several seconds before the SIGSTOP. If a SIGKILL is pending, the job will receive a SIGUSR2 several seconds before the SIGKILL. The amount of time delay is controlled by the notify parameter in each queue configuration.

Let us assume the reason you want to use the -notify flag is to get the SIGUSR1 signal prior to getting the SIGTSTP signal. As mentioned in this FAQ entry, one could run the job as shown in this batch script.
#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
mpirun -np 16 -mca orte_forward_job_control 1 a.out

However, one has to make one of two changes to this script for things to work properly. By default, a SIGUSR1 signal will kill a shell script. So we have to make sure that does not happen. Here is one way to handle it.
    #! /bin/bash
    #$ -S /bin/bash
    #$ -V
    #$ -cwd
    #$ -N Job1
    #$ -pe orte 16
    #$ -j y
    #$ -l h_rt=00:20:00
    exec mpirun -np 16 -mca orte_forward_job_control 1 a.out

Alternatively, one can catch the signals in the script instead of doing an exec on the mpirun.
    #! /bin/bash
    #$ -S /bin/bash
    #$ -V
    #$ -cwd
    #$ -N Job1
    #$ -pe orte 16
    #$ -j y
    #$ -l h_rt=00:20:00
    function sigusr1handler() {
        echo "SIGUSR1 caught by shell script" 1>&2
    }
    function sigusr2handler() {
        echo "SIGUSR2 caught by shell script" 1>&2
    }
    trap sigusr1handler SIGUSR1
    trap sigusr2handler SIGUSR2
    mpirun -np 16 -mca orte_forward_job_control 1 a.out

3. Can I suspend and resume my job?

A new feature was added into Open MPI 1.3.1 that supports suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to mpirun. mpirun will catch this signal and forward it to the a.outs as a SIGSTOP signal. To resume the job, you send a SIGCONT signal to mpirun, which will be caught and forwarded to the a.outs.

By default, this feature is not enabled. This means that both the SIGTSTP and SIGCONT signals will simply be consumed by the mpirun process. To have them forwarded, you have to run the job with --mca orte_forward_job_control 1. Here is an example on Solaris.

    shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out

In another window, we suspend and continue the job.

    shell$ prstat -p 15301,15303,15305
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
     15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
     15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
     15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

    shell$ kill -TSTP 15301
    shell$ prstat -p 15301,15303,15305
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
     15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
     15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
     15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

    shell$ kill -CONT 15301
    shell$ prstat -p 15301,15303,15305
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
     15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
     15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
     15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

Note that all this does is stop the a.outs. It does not, for example, free any pinned memory when the job is in the suspended state.

To get this to work under the SGE environment, you have to change the suspend_method entry in the queue. It has to be set to SIGTSTP. Here is an example of what a queue should look like.

    shell$ qconf -sq all.q
    qname                 all.q
    [...snip...]
    starter_method        NONE
    suspend_method        SIGTSTP
    resume_method         NONE

Note that if you need to suspend other types of jobs with SIGSTOP (instead of SIGTSTP) in this queue, then you need to provide a script that can implement the correct signals for each job type.
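Whether a queue already has the required setting can be checked from the output of qconf -sq. The following is a minimal sketch under stated assumptions: the function name check_suspend_method is hypothetical, and it assumes the output has one "name value" pair per line, as in the queue listing above.

```shell
#!/bin/sh
# Hypothetical helper: read "qconf -sq <queue>" style output on stdin,
# print the suspend_method value, and succeed only if it is SIGTSTP.
check_suspend_method() {
    awk '
        $1 == "suspend_method" { value = $2 }
        END {
            if (value == "") value = "(not found)"
            print "suspend_method: " value
            exit (value == "SIGTSTP" ? 0 : 1)
        }
    '
}
```

A possible use on a submit host would be:

    shell$ qconf -sq all.q | check_suspend_method || echo "all.q needs suspend_method SIGTSTP"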
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google, please disable Javascript for this site. This site is perfectly usable without Javascript.
Last modified: November, 14, 2014