1. What OpenFabrics-based components does Open MPI have? |
Open MPI supports the verbs and (starting in version 1.2) udapl APIs from the OpenFabrics Alliance (OFA) software stack. The OpenFabrics software stack supports both InfiniBand and iWARP networks. Note that the OpenFabrics Alliance used to be known as the OpenIB project -- so if you're thinking "OpenIB", you're thinking the right thing. Open MPI's support of the OpenFabrics stack is provided through multiple components; see the table below.
Open MPI also supported the Mellanox VAPI (mVAPI) software stack through the v1.2 series. See this FAQ entry for more details on VAPI support.
Open MPI series | OpenFabrics support |
---|---|
v1.0 series | openib and mvapi BTLs |
v1.1 series | openib and mvapi BTLs |
v1.2 series | openib and mvapi BTLs |
v1.3 / v1.4 series | openib BTL |
v1.5 / v1.6 series | openib BTL, mxm MTL, fca coll |
v1.7 / v1.8 series | openib BTL, mxm MTL, fca, ml, and hcoll coll |
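As a quick check, here is a hedged sketch of how one might see which of these components a given installation contains (the exact output format varies across Open MPI versions):

shell$ ompi_info | grep -E "openib|mxm|fca"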
2. Does Open MPI support iWARP? |
iWARP is fully supported as of the Open MPI v1.3 release.
3. Does Open MPI support RoCE (RDMA over Converged Ethernet)? |
RoCE is fully supported as of the Open MPI v1.4.4 release.
For more details on how to run Open MPI over RoCE, see this FAQ entry.
4. I have an OFED-based cluster; will Open MPI work with that? |
Yes.
OFED (OpenFabrics Enterprise Distribution) is basically the release mechanism for the OpenFabrics software packages. So OFED releases are officially tested and released versions of the OpenFabrics stacks.
5. Where do I get the OFED software from? |
The "Download" section of the OpenFabrics web site has links for the various OFED releases.
Additionally, several InfiniBand vendors have their own releases of the OFED stack, customized for their hardware and/or support requirements. Consult with your IB vendor for more details.
6. Isn't Open MPI included in the OFED software package? Can I install another copy of Open MPI besides the one that is included in OFED? |
Yes, Open MPI is included in the OFED software. And yes, you can easily
install a later version of Open MPI on OFED-based clusters, even if you're also
using the Open MPI that was included in OFED.
You can simply download the Open MPI version that you want and install it to an alternate directory from where the OFED-based Open MPI was installed. You therefore have multiple copies of Open MPI that do not conflict with each other. Make sure you set the PATH and LD_LIBRARY_PATH variables to point to exactly one of your Open MPI installations at a time, and never try to run an MPI executable compiled with one version of Open MPI with a different version of Open MPI.
Be sure to build Open MPI with OpenFabrics support (e.g., pass the "--with-openib=/usr/local/ofed" option to Open MPI's configure command, replacing /usr/local/ofed with the path to your OFED installation).
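For example, a minimal sketch of building a second copy into an alternate prefix (the paths are placeholders; substitute your own OFED location and install directory):

shell$ ./configure --prefix=$HOME/openmpi --with-openib=/usr/local/ofed
shell$ make all install
shell$ export PATH=$HOME/openmpi/bin:$PATH
shell$ export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH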
7. What versions of Open MPI are in OFED? |
The following versions of Open MPI shipped (or will soon ship) in OFED (note that OFED stopped including MPI implementations as of OFED 1.5):
NOTE: A prior version of this FAQ entry specified that "v1.2ofed" would be included in OFED v1.2, representing a temporary branch from the v1.2 series that included some OFED-specific functionality. All of this functionality was included in the v1.2.1 release, so OFED v1.2 simply included Open MPI v1.2.1 instead. Some public betas of "v1.2ofed" releases were made available, but this version was never officially released.
8. Why are you using the name "openib" for the BTL name? |
Before the iWARP vendors joined the OpenFabrics Alliance, the project was known
as OpenIB. Open MPI's support for this software stack was originally written during
this timeframe -- the name of the group was "OpenIB", so we named the BTL
openib
.
Since then, iWARP vendors joined the project and it changed names to "OpenFabrics".
Open MPI did not rename its BTL mainly for historical reasons -- we didn't want
to break compatibility for users who were already using the openib
BTL name in scripts, etc.
9. Is the mVAPI-based BTL still supported? |
Yes, but only through the Open MPI v1.2 series; mVAPI support was removed starting with v1.3.
The mVAPI support is an InfiniBand-specific BTL (i.e., it will not work in iWARP networks), and reflects the prior generation of InfiniBand software stacks.
However, the Open MPI team is doing no new work with mVAPI-based networks; all
effort is being put into the openib
BTL. We'll provide critical bug
fixes for the mvapi
BTL, but we consider the openib
BTL
to be the future of InfiniBand and iWARP support in Open MPI.
Generally, much of the information contained in this FAQ category applies to
both the OpenFabrics openib
BTL and the mVAPI mvapi
BTL
-- simply replace openib
with mvapi
to get similar results.
However, new features and options are continually being added to the openib
BTL (and are being listed in this FAQ) that will not be back-ported to the
mvapi
BTL. So not all items in this FAQ will apply to the mvapi
BTL.
We encourage users to switch to the OFED-based stack if possible.
10. How do I specify to use the OpenFabrics network for MPI messages? |
In general, you specify that the openib
BTL components should be
used. However, note that you should also specify that the self
BTL component should be used. self
is for loopback communication (i.e.,
when an MPI process sends to itself), and is technically a different communication
channel than the OpenFabrics networks. For example:
shell$ mpirun --mca btl openib,self ... |
Failure to specify the self
BTL may result in Open MPI being unable
to complete send-to-self scenarios (meaning that your program will run fine until
a process tries to send to itself).
Note that openib,self
is the minimum list of BTLs that
you might want to use. It is highly likely that you also want to
include the sm
(shared memory) BTL in the list as well, like this:
shell$ mpirun --mca btl openib,self,sm ... |
See this FAQ entry for more details on selecting which MCA plugins are used at run-time.
Finally, note that if the openib
component is available at run time,
Open MPI should automatically use it by default (ditto for self
). Hence,
it's usually unnecessary to specify these options on the mpirun
command line. They are typically only used when you want to be absolutely positively
definitely sure to use the specific BTL.
11. But wait -- I also have a TCP network. Do I need to explicitly disable the TCP BTL? |
No. See this FAQ entry for more details.
12. How do I know what MCA parameters are available for tuning MPI performance? |
The ompi_info
command can display all the parameters available for
the openib
BTL component:
# Show the openib BTL parameters shell$ ompi_info --param btl openib |
13. I'm experiencing a problem with Open MPI on my OpenFabrics-based network; how do I troubleshoot and get help? |
In order for us to help you, it is most helpful if you can run a few steps before sending an e-mail to both perform some basic troubleshooting and provide us with enough information about your environment to help you. Please include answers to the following questions in your e-mail:
- What is the output of the ibv_devinfo command on a known "good" node and a known "bad" node? (NOTE: there must be at least one port listed as "PORT_ACTIVE" for Open MPI to work. If there is not at least one PORT_ACTIVE port, something is wrong with your OpenFabrics environment and Open MPI will not be able to run.)
- What is the output of the ifconfig command on a known "good" node and a known "bad" node? (mainly relevant for IPoIB installations) Note that some Linux distributions do not put ifconfig in the default path for normal users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
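For example, a quick sketch of checking for an active port on a node (the exact ibv_devinfo output varies between OFED versions):

shell$ ibv_devinfo | grep -i state
# At least one port should report a state of PORT_ACTIVE.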
Gather up this information and see this page about how to submit a help request to the user's mailing list.
14. What is "registered" (or "pinned") memory? |
"Registered" memory means two things: the memory has been "pinned" by the operating system (so that the virtual memory subsystem will not page it out or relocate it), and the network adapter has been told the virtual-to-physical address mapping of that memory.
These two factors allow network adapters to move data between the network fabric and physical RAM without involvement of the main CPU or operating system.
Note that many people say "pinned" memory when they actually mean "registered" memory.
However, a host can only support so much registered memory, so it is treated as a precious resource. Additionally, the cost of registering (and unregistering) memory is fairly high. Open MPI takes aggressive steps to use as little registered memory as possible (balanced against performance implications, of course) and mitigate the cost of registering and unregistering memory.
15. I'm getting errors about "error registering openib memory"; what do I do? |
With OpenFabrics (and therefore the openib
BTL component), you need
to set the available locked memory to a large number (or better yet, unlimited)
-- the defaults with most Linux installations are usually too low for most HPC applications
that utilize OpenFabrics. Failure to do so will result in an error message similar
to one of the following (the messages have changed throughout the release versions
of Open MPI):
WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash. This may be caused by your OpenFabrics vendor limiting the amount of physical memory that can be registered. You should investigate the relevant Linux kernel module parameters that control how much physical memory can be registered, and increase them to allow registering all physical memory on your machine. See this Open MPI FAQ item for more information on these Linux kernel module parameters: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages Local host: node02 Registerable memory: 32768 MiB Total memory: 65476 MiB Your MPI job will continue, but may be behave poorly and/or hang. |
[node06:xxxx] mca_mpool_openib_register: \ error registering openib memory of size yyyy errno says Cannot allocate memory |
[x,y,z][btl_openib.c:812:mca_btl_openib_create_cq_srq] \ error creating low priority cq for mthca0 errno says Cannot allocate memory |
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations. |
The OpenIB BTL failed to initialize while trying to create an internal queue. This typically indicates a failed OpenFabrics installation or faulty hardware. The failure occurred here: Host: compute_node.example.com OMPI source: btl_openib.c:828 Function: ibv_create_cq() Error: Invalid argument (errno=22) Device: mthca0 You may need to consult with your system administrator to get this problem fixed. |
The OpenIB BTL failed to initialize while trying to allocate some locked memory. This typically can indicate that the memlock limits are set too low. For most HPC installations, the memlock limits should be set to "unlimited". The failure occurred here: Host: compute_node.example.com OMPI source: btl_opebib.c:114 Function: ibv_create_cq() Device: Out of memory Memlock limit: 32767 You may need to consult with your system administrator to get this problem fixed. This FAQ entry on the Open MPI web site may also be helpful: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages |
error creating qp errno says Cannot allocate memory |
There are two typical causes for Open MPI being unable to register memory, or warning that it might not be able to register enough memory: either the Linux kernel module parameters that control the amount of available registered memory are set too low, or the system default limit on locked memory is too low (see the FAQ entries below).
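A quick way to check the limit actually in effect for your shell (a sketch; the value is reported in kilobytes, or "unlimited"):

shell$ ulimit -l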
16. How can a system administrator (or user) change locked memory limits? |
There are two ways to control the amount of memory that a user process can lock:

1. System-wide, via files in /etc/security/limits.d/ (or by directly editing the /etc/security/limits.conf file on older systems). Two limits are configurable:

To set the default (soft) limit, add a file in /etc/security/limits.d/ (e.g., 95-openfabrics.conf) with the line below (or, if your system doesn't have a /etc/security/limits.d/ directory, add a line directly to /etc/security/limits.conf):

* soft memlock <number> |

where "<number>" is the number of bytes that you want user processes to be allowed to lock by default (presumably rounded down to an integral number of pages). "<number>" can also be "unlimited".

To set the maximum (hard) limit, add another line to the same file in /etc/security/limits.d/ (or edit /etc/security/limits.conf directly on older systems):

* hard memlock <number> |

where "<number>" is the maximum number of bytes that you want user processes to be allowed to lock (presumably rounded down to an integral number of pages). "<number>" can also be "unlimited".

2. Per shell, via the ulimit (or limit in csh) command. The default amount of memory allowed to be locked will correspond to the "soft" limit set in /etc/security/limits.d/ (or limits.conf -- see above); users cannot use ulimit (or limit in csh) to raise their limit beyond the hard limit in /etc/security/limits.d (or limits.conf).
Users can increase the default limit by adding the following to their shell startup files for Bourne style shells (sh, bash):
ulimit -l unlimited |
Or for C style shells (csh, tcsh):
limit memorylocked unlimited |
This effectively sets their limit to the hard limit in /etc/security/limits.d
(or limits.conf
). Alternatively, users can set a specific number
instead of "unlimited," but this has limited usefulness unless a user is aware
of exactly how much locked memory they will require (which is difficult to know, since
Open MPI manages locked memory behind the scenes).
It is important to realize that this must be set in all shells where
Open MPI processes using OpenFabrics will be run. For example, if you are using
rsh
or ssh
to start parallel jobs, it will be necessary
to set the ulimit
in your shell startup files so that it is effective
on the processes that are started on each node.
More specifically -- it may not be sufficient to simply execute the following,
because the ulimit
may not be in effect on all nodes where Open
MPI processes will be run:
shell$ ulimit -l unlimited shell$ mpirun -np 2 my_mpi_application |
17. I'm still getting errors about "error registering openib memory"; what do I do? |
Ensure that the limits you've set (see this FAQ entry) are actually being used. There are two general cases where this can happen: the limits are not being applied to your rsh/ssh-based logins, or your jobs are launched via a resource manager whose daemons do not have the limits applied.
That is, in some cases, it is possible to login to a node and not have the "limits" set properly. For example, consider the following post on the Open MPI User's list:
In this case, the user noted that the default configuration on his Linux system
did not automatically load the pam_limits.so
upon rsh
-based
logins, meaning that the hard and soft limits were not set.
There are also some default configurations where, even though the maximum limits are being set system-wide in limits.d (or limits.conf on older systems), something during the boot procedure sets the default limit back down to a low number (e.g., 32k). In this case, you may need to override this limit on a per-user basis (described in this FAQ entry), or
effectively system-wide by putting "ulimit -l unlimited" (for Bourne-like shells)
in a strategic location, such as:
- /etc/init.d/sshd (or wherever the script is that starts up your SSH daemon), and then restarting the SSH daemon
- /etc/profile.d, or wherever system-wide shell startup scripts are located (e.g., /etc/profile and /etc/csh.cshrc)
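For example, a hypothetical system-wide snippet (the file name is arbitrary; check with your system administrator before changing system-wide limits):

# Hypothetical /etc/profile.d/memlock.sh
ulimit -l unlimited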
Also, note that resource managers such as SLURM, Torque/PBS, LSF, etc. may affect OpenFabrics jobs in two ways:
The files in limits.d (or the limits.conf file) do not usually apply to resource daemons! The limits.d files usually only apply to rsh- or ssh-based logins. Hence, daemons usually inherit the system default of maximum 32k of locked memory (which then gets passed down to the MPI processes that they start). To increase this limit, you typically need to modify daemons' startup scripts to increase the limit before they drop root privileges.
Finally, note that some versions of SSH have problems with getting correct values
from /etc/security/limits.d/
(or limits.conf
) when using
privilege separation. You may notice this by ssh'ing into a node and seeing that
your memlock limits are far lower than what you have listed in /etc/security/limits.d/
(or limits.conf
) (e.g., 32k instead of unlimited). Several web sites
suggest disabling privilege separation in ssh to make PAM limits work properly,
but others imply that this may be fixed in recent versions of OpenSSH.
If you do disable privilege separation in ssh, be sure to check with your local system administrator and/or security officers to understand the full implications of this change. See this Google search link for more information.
18. Open MPI is warning me about limited registered memory; what does this mean? |
OpenFabrics network vendors provide Linux kernel module parameters controlling the size of the memory translation table (MTT) used to map virtual addresses to physical addresses. The size of this table controls the amount of physical memory that can be registered for use with OpenFabrics devices.
With Mellanox hardware, two parameters are provided to control the size of this table:

- log_num_mtt (on some older Mellanox hardware, the parameter may be num_mtt, not log_num_mtt): the (log base 2 of the) number of memory translation table entries
- log_mtts_per_seg: the (log base 2 of the) number of MTT entries per segment

The amount of memory that can be registered is calculated using this formula:
In newer hardware: max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE In older hardware: max_reg_mem = num_mtt * (2^log_mtts_per_seg) * PAGE_SIZE |
At least some versions of OFED (community OFED, Mellanox OFED, and upstream OFED in Linux distributions) set the default values of these variables FAR too low! For example, in some cases, the default values may only allow registering 2 GB -- even if the node has much more than 2GB of physical memory.
Mellanox has advised the Open MPI community to increase the log_num_mtt
value (or num_mtt
value), not the log_mtts_per_seg
value (even though
an IBM article suggests increasing the log_mtts_per_seg
value).
It is recommended that you adjust log_num_mtt
(or num_mtt
)
such that your max_reg_mem
value is at least twice the amount of physical
memory on your machine (setting it to a value higher than the amount of physical
memory present allows the internal Mellanox driver tables to handle fragmentation
and other overhead). For example, if a node has 64 GB of memory and a 4 KB page
size, log_num_mtt should be set to 24 (assuming log_mtts_per_seg is set to 1). This will allow processes on the node to register:
max_reg_mem = (2^24) * (2^1) * (4 kB) = 128 GB |
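As a sketch, assuming the mlx4_core driver is in use (a common case for ConnectX-generation hardware), these parameters can typically be set via a modprobe configuration file and take effect after the driver is reloaded or the node is rebooted; consult your OFED documentation for the exact procedure:

# Hypothetical /etc/modprobe.d/mlx4_core.conf entry
options mlx4_core log_num_mtt=24 log_mtts_per_seg=1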
NOTE: Starting with OFED 2.0, OFED's default kernel parameter values should allow registering twice the physical memory size.
19. I'm using Mellanox ConnectX HCA hardware and seeing terrible latency for short messages; how can I fix this? |
Open MPI prior to v1.2.4 did not include specific configuration information to enable RDMA for short messages on ConnectX hardware. As such, Open MPI will default to the safe setting of using send/receive semantics for short messages, which is slower than RDMA.
To enable RDMA for short messages, you can add this snippet to the bottom of
the $prefix/share/openmpi/mca-btl-openib-hca-params.ini
file:
[Mellanox Hermon] vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708 vendor_part_id = 25408,25418,25428 use_eager_rdma = 1 mtu = 2048 |
Enabling short message RDMA will significantly reduce short message latency, especially on ConnectX (and newer) Mellanox hardware.
20. How much registered memory is used by Open MPI? Is there a way to limit it? |
Open MPI uses registered memory in several places, and therefore the total amount used is calculated by a somewhat-complex formula that is directly influenced by MCA parameter values.
It can be desirable to enforce a hard limit on how much registered memory is consumed by MPI applications. For example, some platforms have limited amounts of registered memory available; setting limits on a per-process level can ensure fairness between MPI processes on the same host. Another reason is that registered memory is not swappable; as more memory is registered, less memory is available for (non-registered) process code and data. When little unregistered memory is available, swap thrashing of unregistered memory can occur.
Each instance of the openib
BTL module in an MPI process
(i.e., one per HCA port and LID) will use up to a maximum of the sum of the following
quantities:
Description | Amount | Explanation |
---|---|---|
User memory | mpool_rdma_rcache_size_limit | By default, Open MPI will register as much user memory as necessary (upon demand). However, if mpool_rdma_rcache_size_limit is greater than zero, it is the upper limit (in bytes) of user memory that will be registered. User memory is registered for ongoing MPI communications (e.g., long message sends and receives) and via the MPI_ALLOC_MEM function. Note that this MCA parameter was introduced in v1.2.1. |
Internal eager fragment buffers | 2 x btl_openib_free_list_max x (btl_openib_eager_limit + overhead) | A "free list" of buffers used in the openib BTL for "eager" fragments (e.g., the first fragment of a long message). Two free lists are created; one for sends and one for receives. By default, btl_openib_free_list_max is -1, meaning the list size is unbounded. |
Internal send/receive buffers | 2 x btl_openib_free_list_max x (btl_openib_max_send_size + overhead) | A "free list" of buffers used for send/receive communication in the openib BTL. Two free lists are created; one for sends and one for receives. By default, btl_openib_free_list_max is -1, meaning the list size is unbounded. |
Internal "eager" RDMA buffers | btl_openib_eager_rdma_num x btl_openib_max_eager_rdma x (btl_openib_eager_limit + overhead) | If btl_openib_use_eager_rdma is true, RDMA buffers are used for eager fragments (because RDMA semantics can be faster than send/receive semantics in some cases), and an additional set of registered buffers is created (as needed). Each MPI process will use such buffers for eager fragments with up to btl_openib_max_eager_rdma peers. |
In general, when any of the individual limits are reached, Open MPI will try to free up registered memory (in the case of registered user memory) and/or wait until message passing progresses and more registered memory becomes available.
Use the ompi_info
command to view the values of the MCA parameters
described above in your Open MPI installation:
shell$ ompi_info --param btl openib |
See this FAQ entry for information on how to set MCA parameters at run-time.
21. How do I get Open MPI working on Chelsio iWARP devices? |
iWARP support is only available in Open MPI v1.3 or later.
For the Chelsio T3 adapter, you must have at least OFED v1.3.1 and Chelsio firmware
v6.0. Download the firmware from service.chelsio.com
and put the uncompressed t3fw-6.0.0.bin
file in
/lib/firmware
. Then reload the iw_cxgb3
module and bring
up the ethernet interface to flash this new firmware. For example:
# Note that the URL for the firmware may change over time shell# cd /lib/firmware shell# wget http://service.chelsio.com/drivers/firmware/t3/t3fw-6.0.0.bin.gz [...wget output...] shell# gunzip t3fw-6.0.0.bin.gz shell# rmmod iw_cxgb3 cxgb3 shell# modprobe iw_cxgb3 # This last step may happen automatically, depending on your # Linux distro (assuming that the ethernet interface has previously # been properly configured and is ready to bring up). Substitute the # proper ethernet interface name for your T3 (vs. ethX). shell# ifup ethX |
If all goes well, you should see a message similar to the following in your syslog 15-30 seconds later:
kernel: cxgb3 0000:0c:00.0: successful upgrade to firmware 6.0.0 |
Open MPI will work without any specific configuration of the openib BTL. Users wishing to tune performance may wish to inspect the configurable receive queue values, which can be found in the "Chelsio T3" section of mca-btl-openib-hca-params.ini.
22. I'm getting "ibv_create_qp: returned 0 byte(s) for max inline data" errors; what is this, and how do I fix it? |
Prior to Open MPI v1.0.2, the OpenFabrics (then known as "OpenIB") verbs BTL component did not check for where the OpenIB API could return an erroneous value (0) and it would hang during startup. Starting with v1.0.2, error messages of the following form are reported:
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data |
This is caused by an error in older versions of the OpenIB user library. Upgrading your OpenIB stack to recent versions of the OpenFabrics software should resolve the problem. See this post on the Open MPI user's list for more details:
23. My bandwidth seems [far] smaller than it should be; why? Can this be fixed? |
Open MPI, by default, uses a pipelined RDMA protocol. Additionally, in the v1.0 series of Open MPI, small messages use send/receive semantics (instead of RDMA -- small message RDMA was added in the v1.1 series). For some applications, this may result in lower-than-expected bandwidth. However, Open MPI also supports caching of registrations in a most recently used (MRU) list -- this bypasses the pipelined RDMA and allows messages to be sent faster (in some cases).
For the v1.1 series, see this FAQ entry for more information about small message RDMA, its effect on latency, and how to tune it.
To enable the "leave pinned" behavior, set the MCA parameter mpi_leave_pinned
to 1. For example:
shell$ mpirun --mca mpi_leave_pinned 1 ... |
NOTE: The mpi_leave_pinned
parameter was broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned
functionality was fixed in
v1.3.2.
This will enable the MRU cache and will typically increase bandwidth performance for applications which reuse the same send/receive buffers.
NOTE: The v1.3 series enabled "leave pinned" behavior by default when applicable; it is usually unnecessary to specify this flag anymore.
24. How do I tune small messages in Open MPI v1.1 and later versions? |
Starting with Open MPI version 1.1, "short" MPI messages are sent, by default, via RDMA to a limited set of peers (for versions prior to v1.2, only when the shared receive queue is not used). This provides the lowest possible latency between MPI processes.
However, this behavior is not enabled between all process peer pairs because it can quickly consume large amounts of resources on nodes (specifically: memory must be individually pre-allocated for each process peer to perform small message RDMA; for large MPI jobs, this can quickly cause individual nodes to run out of memory). Outside the limited set of peers, send/receive semantics are used (meaning that they will generally incur a greater latency, but not consume as many system resources).
This behavior is tunable via several MCA parameters:

- btl_openib_use_eager_rdma (default value: 1): Defaults to 1, meaning that the small message behavior described above (RDMA to a limited set of peers, send/receive to everyone else) is enabled. Setting this parameter to 0 disables all small message RDMA in the openib BTL component (a sketch of doing so appears at the end of this entry).
- btl_openib_eager_rdma_threshold (default value: 16): This is the number of short messages that must be received from a peer before Open MPI will set up an RDMA connection to that peer. This mechanism tries to set up RDMA connections only to those peers who will frequently send a lot of short messages (e.g., to avoid consuming valuable RDMA resources for peers who only exchange a few "startup" control messages).
- btl_openib_max_eager_rdma (default value: 16): This parameter controls the maximum number of peers that can receive an RDMA connection for short messages. It is not advisable to change this value to a very large number because the polling time increases with the number of connections; as a direct result, short message latency will increase.
- btl_openib_eager_rdma_num (default value: 16): This parameter controls the maximum number of pre-allocated buffers allocated to each peer for small messages.
- btl_openib_eager_limit (default value: 12k): The maximum size of small messages (in bytes).
Note that long messages use a different protocol than short messages; messages
over a certain size always use RDMA. Long messages are not affected by
the btl_openib_use_eager_rdma
MCA parameter.
Also note that, as stated above, prior to v1.2, small message RDMA is not used when the shared receive queue is used.
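For example, a minimal sketch of disabling all small message RDMA in the openib BTL, per the btl_openib_use_eager_rdma description above:

shell$ mpirun --mca btl_openib_use_eager_rdma 0 ...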
25. How do I tune large message behavior in Open MPI the v1.2 series? |
Note that this answer generally pertains to the Open MPI v1.2 series. Later versions slightly changed how large messages are handled.
Open MPI uses a few different protocols for large messages. Much detail is provided in this paper.
The btl_openib_flags
MCA parameter is a set of bit flags that influences
which protocol is used; they generally indicate what kind of transfers are allowed
to send the bulk of long messages. Specifically, these flags do not regulate the
behavior of "match" headers or other intermediate fragments.
The following flags are available: SEND (value 1), PUT (value 2), and GET (value 4).
Open MPI defaults to setting both the PUT and GET flags (value 6).
Open MPI uses the following long message protocols:

RDMA Direct: When RDMA transfers are allowed by btl_openib_flags and the sender's message is already registered (either by use of the mpi_leave_pinned MCA parameter or if the buffer was allocated via MPI_ALLOC_MEM), a slightly simpler protocol is used:
NOTE: Per above, if striping across multiple network interfaces is available, only RDMA writes are used. The reason that RDMA reads are not used is solely because of an implementation artifact in Open MPI; we didn't implement it because using RDMA reads only saves the cost of a short message round trip, the extra code complexity didn't seem worth it for long messages (i.e., the performance difference will be negligible).
Note that the user buffer is not unregistered when the RDMA transfer(s) is(are) completed.
RDMA Pipeline: When RDMA transfers are allowed by btl_openib_flags and the sender's message is not already registered, a 3-phase pipelined protocol is used:
Note that phases 2 and 3 occur in parallel. Each phase 3 fragment is unregistered when its transfer completes (see the paper for more details).
Also note that one of the benefits of the pipelined protocol is that large messages will naturally be striped across all available network interfaces.
The sizes of the fragments in each of the three phases are tunable by MCA parameters (all sizes are in units of bytes).

Send/Receive: Used when the SEND flag is set on btl_openib_flags. This protocol behaves the same as the RDMA Pipeline protocol when the btl_openib_min_rdma_size value is infinite.
26. How do I tune large message behavior in the Open MPI v1.3 (and later) series? |
The Open MPI v1.3 (and later) series generally use the same protocols for sending long messages as described for the v1.2 series, but the MCA parameters for the RDMA Pipeline protocol were both moved and renamed (all sizes are in units of bytes):
The change to move the "intermediate" fragments to the end of the message was
made to better support applications that call fork()
. Specifically,
there is a problem in Linux when a process with registered memory calls fork()
:
the registered memory will physically not be available to the child process (touching
memory in the child that is registered in the parent will cause a segfault or other
error). Because memory is registered in units of pages, the end of a long message
is likely to share the same page as other heap memory in use by the application.
If this last page of the large message is registered, then all the memory in that
page -- to include other buffers that are not part of the long message -- will not
be available to the child. By moving the "intermediate" fragments to the end of
the message, the end of the message will be sent with copy in/copy out semantics
and, more importantly, will not have its page registered. This increases the chance
that child processes will be able to access other memory in the same page as the
end of the large message without problems.
Some notes about these parameters:
- btl_openib_rndv_eager_limit defaults to the same value as btl_openib_eager_limit (the size for "small" messages). It is a separate parameter in case you want/need different values.
- The old btl_openib_min_rdma_size parameter was an absolute offset into the message; it was replaced by btl_openib_rdma_pipeline_send_length, which is a length.
Note that messages must be larger than btl_openib_min_rdma_pipeline_size
(a new MCA parameter to the v1.3 series) to use the RDMA Direct
or RDMA Pipeline protocols. Messages shorter than this length will
use the Send/Receive protocol (even if the SEND flag is not set
on btl_openib_flags
).
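For example, a hedged sketch of raising that threshold so that mid-sized messages fall back to the Send/Receive protocol (the value shown is arbitrary, in bytes):

shell$ mpirun --mca btl_openib_min_rdma_pipeline_size 1048576 ...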
27. How does the mpi_leave_pinned
parameter affect large message transfers? |
NOTE: The mpi_leave_pinned
parameter was broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned
functionality was fixed in
v1.3.2.
When mpi_leave_pinned
is set to 1, Open MPI aggressively tries to
pre-register user message buffers so that the RDMA Direct protocol
can be used. Additionally, user buffers are left registered so that the
de-registration and re-registration costs are not incurred if the same buffer is
used in a future message passing operation.
NOTE: Starting with Open MPI v1.3,
mpi_leave_pinned
is automatically set to 1 by default when applicable.
It is therefore usually unnecessary to set this value manually.
NOTE: The mpi_leave_pinned
MCA parameter has some restrictions on how it can be set starting with Open MPI
v1.3.2. See this FAQ entry for details.
Leaving user memory registered when sends complete can be extremely beneficial for applications that repeatedly re-use the same send buffers (such as ping-pong benchmarks). Additionally, the fact that a single RDMA transfer is used and the entire process runs in hardware with very little software intervention results in utilizing the maximum possible bandwidth.
Leaving user memory registered has disadvantages, however. Bad Things happen
if registered memory is free()
ed, for example -- it can silently invalidate
Open MPI's cache of knowing which memory is registered and which is not. The MPI
layer usually has no visibility on when the MPI application calls free()
(or otherwise frees memory, such as through munmap()
or sbrk()
).
Open MPI has implemented complicated schemes that intercept calls to return memory
to the OS. Upon intercept, Open MPI examines whether the memory is registered, and
if so, unregisters it before returning the memory to the OS.
These schemes are best described as "icky" and can actually cause real problems in applications that provide their own internal memory allocators. Additionally, only some applications (most notably, ping-pong benchmark applications) benefit from "leave pinned" behavior -- those who consistently re-use the same buffers for sending and receiving long messages.
It is for these reasons that "leave pinned" behavior is not enabled by default. Note that other MPI implementations enable "leave pinned" behavior by default.
Also note that another pipeline-related MCA parameter also exists: mpi_leave_pinned_pipeline
.
Setting this parameter to 1 enables the use of the RDMA Pipeline
protocol, but simply leaves the user's memory registered when RDMA transfers complete
(eliminating the cost of registering / unregistering memory during the pipelined
sends / receives). This can be beneficial to a small class of user MPI applications.
28. How does the mpi_leave_pinned
parameter affect memory management? |
NOTE: The mpi_leave_pinned
parameter was broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned
functionality was fixed in
v1.3.2.
When mpi_leave_pinned
is set to 1, Open MPI aggressively leaves
user memory registered with the OpenFabrics network stack after the first time it
is used with a send or receive MPI function. This allows Open MPI to avoid expensive
registration / deregistration function invocations for each send or receive MPI
function.
NOTE: The mpi_leave_pinned
MCA parameter has some restrictions on how it can be set starting with Open MPI
v1.3.2. See this FAQ entry for details.
However, registered memory has two drawbacks: it cannot be swapped out (reducing the memory available for everything else on the host), and memory that is returned to the OS while still registered can cause problems for the OpenFabrics stack.
The second problem can lead to silent data corruption or process failure. As such, this behavior must be disallowed. Note that the real issue is not simply freeing memory, but rather returning registered memory to the OS (where it can potentially be used by a different process). Open MPI has two methods of solving the issue:
- Intercepting the application's calls that return memory to the OS (e.g., malloc(), free(), mmap(), munmap(), etc.) so that registered memory can be deregistered first.
- Telling the underlying allocator (via mallopt()) never to return heap memory to the OS.
etc. How these options are used differs between Open MPI v1.2 (and earlier) and Open MPI v1.3 (and later).
29. How does the mpi_leave_pinned
parameter affect memory management in Open MPI v1.2? |
Be sure to read this FAQ entry first.
Open MPI 1.2 and earlier on Linux used the ptmalloc2 memory allocator linked
into the Open MPI libraries to handle memory deregistration. On Mac OS X, it uses
an interface provided by Apple for hooking into the virtual memory system, and on
other platforms no safe memory registration was available. The ptmalloc2 code could
be disabled at Open MPI configure time with the option --without-memory-manager
,
however it could not be avoided once Open MPI was built.
ptmalloc2 can cause large memory utilization numbers for a small number of applications and has a variety of link-time issues. Therefore, by default Open MPI did not use the registration cache, resulting in lower peak bandwidth. The inability to disable ptmalloc2 after Open MPI was built also resulted in headaches for users.
Open MPI v1.3 handles leave pinned memory management differently.
30. How does the mpi_leave_pinned
parameter affect memory management in Open MPI v1.3? |
NOTE: The mpi_leave_pinned
parameter was broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned
functionality was fixed in
v1.3.2.
Be sure to read this FAQ entry first.
NOTE: The mpi_leave_pinned
MCA parameter has some restrictions on how it can be set starting with Open MPI
v1.3.2. See this FAQ entry for details.
With Open MPI 1.3, Mac OS X uses the same hooks
as the 1.2 series, and most operating
systems do not provide pinning support. However, the pinning support on Linux has
changed. ptmalloc2 is now by default built as a standalone library (with dependencies
on the internal Open MPI libopen-pal
library), so that users by default
do not have the problematic code linked in with their application. Further, if OpenFabrics
networks are being used, Open MPI will use the mallopt()
system call
to disable returning memory to the OS if no other hooks are provided, resulting
in higher peak bandwidth by default.
To utilize the independent ptmalloc2 library, users need to add -lopenmpi-malloc
to the link command for their application:
shell$ mpicc foo.o -o foo -lopenmpi-malloc |
Linking in libopenmpi-malloc will result in the OpenFabrics BTL not enabling mallopt() but using the hooks provided with the ptmalloc2 library instead.
To revert to the v1.2 (and prior) behavior, with ptmalloc2 folded into libopen-pal,
Open MPI can be built with the --enable-ptmalloc2-internal
configure
flag.
When not using ptmalloc2, mallopt() behavior can be disabled by disabling
mpi_leave_pinned
:
shell$ mpirun --mca mpi_leave_pinned 0 ... |
Because mpi_leave_pinned
behavior is usually only useful for synthetic
MPI benchmarks, the never-return-memory-to-the-OS behavior was resisted by the
Open MPI developers for a long time. Ultimately, it was adopted because a) it is
less harmful than imposing the ptmalloc2 memory manager on all applications, and
b) it was deemed important to enable mpi_leave_pinned
behavior by default
since Open MPI performance kept getting negatively compared to other MPI implementations
that enable similar behavior by default.
31. How can I set the mpi_leave_pinned
MCA parameter? |
NOTE: The mpi_leave_pinned
parameter was broken in Open MPI v1.3 and v1.3.1 (see
this announcement). mpi_leave_pinned
functionality was fixed in
v1.3.2.
As with all MCA parameters, the mpi_leave_pinned
parameter (and
mpi_leave_pinned_pipeline
parameter) can be set from the mpirun command
line:
shell$ mpirun --mca mpi_leave_pinned 1 ... |
Prior to the v1.3 series,
all the
usual methods to set MCA parameters could be used to set mpi_leave_pinned
.
However, starting with v1.3.2, not all of the
usual
methods to set MCA parameters apply to mpi_leave_pinned
. Due to
various operating system memory subsystem constraints, Open MPI must react to the
setting of the mpi_leave_pinned
parameter in each MPI process before
MPI_INIT is invoked. Specifically, some of Open MPI's MCA parameter propagation
mechanisms are not activated until during MPI_INIT -- which is too late for
mpi_leave_pinned
.
As such, only the following MCA parameter-setting mechanisms can be used for mpi_leave_pinned and mpi_leave_pinned_pipeline:

- the mpirun command line (as shown above), or
- setting the environment variable OMPI_MCA_mpi_leave_pinned to 1 before invoking mpirun.

To be clear: you cannot set the mpi_leave_pinned MCA parameter via Aggregate MCA parameter files or normal MCA parameter files. This is expected to be an acceptable restriction, however, since the default value of the mpi_leave_pinned parameter is "-1", meaning "determine at run-time if it is worthwhile to use leave-pinned behavior." Specifically, if mpi_leave_pinned is set to -1, and if any of the following are true when each MPI process starts, then Open MPI will use leave-pinned behavior:

- the environment variable OMPI_MCA_mpi_leave_pinned or OMPI_MCA_mpi_leave_pinned_pipeline is set to a positive value (note that the "mpirun --mca mpi_leave_pinned 1 ..." command-line syntax simply results in setting these environment variables in each MPI process), or
- any of the following exist in the filesystem of the node: /sys/class/infiniband, /dev/open-mx, or /dev/myri[0-9].
Note that if either the environment variable OMPI_MCA_mpi_leave_pinned
or OMPI_MCA_mpi_leave_pinned_pipeline
is set to "-1", then the above
indicators are ignored and Open MPI will not use leave-pinned behavior.
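For example, a minimal sketch of the environment-variable mechanism described above (Bourne-style shell):

shell$ export OMPI_MCA_mpi_leave_pinned=1
shell$ mpirun -np 4 my_mpi_application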
32. I got an error message from Open MPI about not using the default GID prefix. What does that mean, and how do I fix it? |
Users may see the following error message from Open MPI v1.2:
WARNING: There are more than one active ports on host '%s', but the default subnet GID prefix was detected on more than one of these ports. If these ports are connected to different physical OFA networks, this configuration will fail in Open MPI. This version of Open MPI requires that every physically separate OFA subnet that is used between connected MPI processes must have different subnet ID values. |
This is a complicated issue.
What it usually means is that you have a host connected to multiple, physically separate OFA-based networks, at least 2 of which are using the factory-default subnet ID value (FE:80:00:00:00:00:00:00). Open MPI can therefore not tell these networks apart during its reachability computations, and therefore will likely fail. You need to reconfigure your OFA networks to have different subnet ID values, and then Open MPI will function properly.
Please note that the same issue can occur when any two physically separate subnets share the same subnet ID value -- not just the factory-default subnet ID value. However, Open MPI only warns about the factory default subnet ID value because most users do not bother to change it unless they know that they have to.
All this being said, note that there are valid network configurations where multiple
ports on the same host can share the same subnet ID value. For example, two ports
from a single host can be connected to the same network as a bandwidth
multiplier or a high-availability configuration. For this reason, Open MPI only
warns about finding duplicate subnet ID values, and that warning
can be disabled. Setting the btl_openib_warn_default_gid_prefix
MCA
parameter to 0 will disable this warning.
See this FAQ entry for instructions on how to set the subnet ID.
Here's a more detailed explanation:
Since Open MPI can utilize multiple network links to send MPI traffic, it needs to be able to compute the "reachability" of all network endpoints that it can use. Specifically, for each network endpoint, Open MPI calculates which other network endpoints are reachable.
In OpenFabrics networks, Open MPI uses the subnet ID to differentiate between subnets -- assuming that if two ports share the same subnet ID, they are reachable from each other. If multiple, physically separate OFA networks use the same subnet ID (such as the default subnet ID), it is not possible for Open MPI to tell them apart and therefore reachability cannot be computed properly.
33. What subnet ID / prefix value should I use for my OpenFabrics networks? |
You can use any subnet ID / prefix value that you want. However, Open MPI v1.1 and v1.2 both require that every physically separate OFA subnet that is used between connected MPI processes must have different subnet ID values.
For example, if you have two hosts (A and B) and each of these hosts has two ports (A1, A2, B1, and B2). If A1 and B1 are connected to Switch1, and A2 and B2 are connected to Switch2, and Switch1 and Switch2 are not reachable from each other, then these two switches must be on subnets with different ID values.
See this FAQ entry for instructions on how to set the subnet ID.
34. How do I set my subnet ID? |
It depends on what Subnet Manager (SM) you are using. Note that changing the subnet ID will likely kill any jobs currently running on the fabric!
shell# /etc/init.d/opensm stop |
shell# opensm -c -o |
The -o option causes OpenSM to run for one loop and exit; the -c option tells OpenSM to create an "options" text file. The default options file is /var/cache/opensm/opensm.opts. Open the file, find the line with subnet_prefix, and replace the default prefix value with a new one.
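For example, a hypothetical edited line in opensm.opts (the prefix value is arbitrary for illustration; use distinct values for each physically separate subnet):

subnet_prefix 0xfe80000000000001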
shell# /etc/init.d/opensm start |
35. In a configuration with multiple host ports on the same fabric, what connection pattern does Open MPI use? |
When multiple active ports exist on the same physical fabric between multiple hosts in an MPI job, Open MPI will attempt to use them all by default. Open MPI makes several assumptions regarding active ports when establishing connections between two hosts. Active ports that have the same subnet ID are assumed to be connected to the same physical fabric -- that is to say that communication is possible between these ports. Active ports with different subnet IDs are assumed to be connected to different physical fabric -- no communication is possible between them. It is therefore very important that if active ports on the same host are on physically separate fabrics, they must have different subnet IDs. Otherwise Open MPI may attempt to establish communication between active ports on different physical fabrics. The subnet manager allows subnet prefixes to be assigned by the administrator, which should be done when multiple fabrics are in use.
The following is a brief description of how connections are established between multiple ports. During initialization, each process discovers all active ports (and their corresponding subnet ID) on the local host and shares this information with every other process in the job. Each process then examines all active ports (and the corresponding subnet ID) of every other process in the job and makes a one-to-one assignment of active ports within the same subnet. If the number of active ports within a subnet differ on the local process and the remote process, then the smaller number of active ports are assigned, leaving the rest of the active ports out of the assignment between these two processes. Connections are not established during MPI_INIT, but the active port assignment is cached and upon the first attempted use of an active port to send data to the remote process (e.g., via MPI_SEND), a queue pair (i.e., a connection) is established between these ports. Active ports are used for communication in a round robin fashion so that connections are established and used in a fair manner.
NOTE: This FAQ entry generally applies to v1.2 and beyond. Prior to v1.2, Open MPI would follow the same scheme outlined above, but would not correctly handle the case where processes within the same MPI job had differing numbers of active ports on the same physical fabric.
36. I'm getting lower performance than I expected. Why? |
Measuring performance accurately is an extremely difficult task, especially with fast machines and networks. Be sure to read this FAQ entry for many suggestions on benchmarking performance.
Pay particular attention to the discussion of processor affinity and NUMA systems -- running benchmarks without processor affinity and/or on CPU sockets that are not directly connected to the bus where the HCA is located can lead to confusing or misleading performance results.
37. I get bizarre linker warnings / errors / run-time faults when I try to compile my OpenFabrics MPI application statically. How do I fix this? |
Fully static linking is not for the weak, and is not recommended. But it is possible.
Read both this FAQ entry and this FAQ entry in their entirety.
38. Can I use system() , popen() ,
or fork() in an MPI application that uses the OpenFabrics support? |
The answer is, unfortunately, complicated. Be sure to also see this FAQ entry.

In general, using system() and/or fork() is supported as long as the parent does nothing until the child exits.

Two MCA parameters control fork() support in Open MPI:
btl_openib_have_fork_support
: This is a "read-only" MCA
value, meaning that users cannot change it in the normal ways that MCA parameter
values are set. It can be queried via the ompi_info
command;
it will have a value of 1 if this installation of Open MPI supports
fork()
; 0 otherwise.
btl_openib_want_fork_support
: This MCA parameter can be
used to request conditional, absolute, or no fork()
support.
The following values are supported:
Hence, you can reliably query OMPI to see if it has support for fork()
and force OMPI to abort if you request fork support and it doesn't have it.
This feature is helpful to users who switch around between multiple clusters and/or versions of Open MPI; they can script to know whether the OMPI that they're using (and therefore the underlying IB stack) has fork support. For example:
#!/bin/sh have_fork_support=`ompi_info --param btl openib --parsable | grep have_fork_support:value | cut -d: -f7` if test "$have_fork_support" = "1"; then # Happiness / world peace / birds are singing else # Despair / time for Haagen Daas fi |
Alternatively, you can skip querying and simply try to run your job:
shell$ mpirun --mca btl_openib_want_fork_support 1 --mca btl openib,self ... |
Which will abort if OMPI's openib BTL does not have fork support.
All this being said, even if Open MPI is able to enable the OpenFabrics
fork()
support, it does not mean that your fork()
-calling
application is safe.
system()
or popen()
,
it will likely be safe.
fork()
support is not
supported in the OpenFabrics software stack. If you use fork()
in your application, you must not touch any registered memory
before calling some form of exec()
to launch another process. Doing
so will cause an immediate seg fault / program crash.
It is important to note that memory is registered on a per-page basis; it
is therefore possible that your application may have memory co-located on the
same page as a buffer that was passed to an MPI communications routine (e.g.,
MPI_Send()
or MPI_Recv()
) or some other internally-registered
memory inside Open MPI. You may therefore accidentally "touch" a page that is
registered without even realizing it, thereby crashing your application.
There is unfortunately no way around this issue; it was intentionally designed into the OpenFabrics software stack. Please complain to the OpenFabrics Alliance that they should really fix this problem!
39. My MPI application sometimes hangs when using
the openib BTL; how can I fix this? |
Starting with v1.2.6, the MCA pml_ob1_use_early_completion
parameter
allows the user (or administrator) to turn off the "early completion" optimization.
Early completion may cause "hang" problems with some MPI applications running on
OpenFabrics networks, particularly loosely-synchronized applications that do not
call MPI functions often. The default is 1, meaning that early completion optimization
semantics are enabled (because it can reduce point-to-point latency).
See Open MPI ticket #1224 for further information.
NOTE: This FAQ entry only applies to the v1.2 series. This functionality is not required for v1.3 and beyond because of changes in how message passing progress occurs. Specifically, this MCA parameter will only exist in the v1.2 series.
40. Does InfiniBand support QoS (Quality of Service)? |
Yes.
InfiniBand QoS functionality is configured and enforced by the Subnet Manager/Administrator (e.g., OpenSM).
Open MPI (or any other ULP/application) sends traffic on a specific IB Service Level (SL). This SL is mapped to an IB Virtual Lane, and all the traffic arbitration and prioritization is done by the InfiniBand HCAs and switches in accordance with the priority of each Virtual Lane.
For details on how to tell Open MPI which IB Service Level to use, please see this FAQ entry.
41. Does Open MPI support InfiniBand clusters with torus/mesh topologies? |
Yes.
InfiniBand 2D/3D Torus/Mesh topologies are different from the more common fat-tree topologies in the way that routing works: different IB Service Levels are used for different routing paths to prevent the so-called "credit loops" (cyclic dependencies among routing path input buffers) that can lead to deadlock in the network.
Open MPI complies with these routing rules by querying the OpenSM for the Service Level that should be used when sending traffic to each endpoint.
Note that this Service Level will vary for different endpoint pairs.
For details on how to tell Open MPI to dynamically query OpenSM for IB Service Level, please refer to this FAQ entry.
NOTE: 3D-Torus and other torus/mesh IB topologies are supported as of version 1.5.4.
42. How do I tell Open MPI which IB Service Level to use? |
There are two ways to tell Open MPI which SL to use:
1. Providing the SL value as a command line parameter to the openib BTL
2. Telling the openib BTL to dynamically query OpenSM for the SL that should be used for each endpoint

1. Providing the SL value as a command line parameter for the openib BTL

Use the btl_openib_ib_service_level MCA parameter to tell the openib BTL which IB SL to use:
shell$ mpirun --mca btl openib,self,sm \ --mca btl_openib_ib_service_level N ... |
The value of IB SL N should be between 0 and 15, where 0 is the default value.
NOTE: Open MPI will use the same SL value for all the endpoints, which means that this option is not valid for 3D torus and other torus/mesh IB topologies.
2. Querying OpenSM for SL that should be used for each endpoint
Use the btl_openib_ib_path_record_service_level
MCA parameter to
tell the openib
BTL to query OpenSM for the IB SL that should be used
for each endpoint. Open MPI will send a PathRecord query to OpenSM in the process
of establishing connection between two endpoints, and will use the IB Service Level
from the PathRecord response:
shell$ mpirun --mca btl openib,self,sm \ --mca btl_openib_ib_path_record_service_level 1 ... |
NOTE: The btl_openib_ib_path_record_service_level
MCA parameter is supported as of version 1.5.4.
43. How do I run Open MPI over RoCE ? |
First of all, what exactly is RoCE?
RoCE (which stands for RDMA over Converged Ethernet) provides InfiniBand native RDMA transport (OFA Verbs) on top of lossless Ethernet data link.
Since we're talking about Ethernet, there's no Subnet Manager, no Subnet Administrator, no InfiniBand SL, nor any other InfiniBand Subnet Administration parameters.
Connection management in RoCE is based on the OFED RDMACM (RDMA Connection Manager) service:
So how does Open MPI run on top of RoCE?
OMPI uses the OFED Verbs-based openib
BTL for traffic and its internal
rdmacm
CPC (Connection Pseudo-Component) for establishing connections
for MPI traffic.
So if you just want the data to run over RoCE
and you're not interested
in VLANs, PCP, or other VLAN tagging parameters, you can just run OMPI with the
openib
BTL and rdmacm
CPC:
shell$ mpirun --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm ... |
(or set these MCA parameters in other ways)
How do I tell Open MPI to use a specific RoCE VLAN?
When a system administrator configures VLAN in RoCE, every VLAN is assigned with
its own GID. The QP that is created by the rdmacm
CPC uses this GID
as a Source GID. When OMPI (or any other application for that matter) posts a send
to this QP, the driver checks the source GID to determine which VLAN the traffic
is supposed to use, and marks the packet accordingly.
Note that InfiniBand SL (Service Level) is not involved in this process - marking is done in accordance with local kernel policy.
To control which VLAN will be selected, use the btl_openib_ipaddr_include/exclude
MCA parameters and provide it with the required IP/netmask values. For example,
if you want to use a VLAN with IP 13.x.x.x:
shell$ mpirun --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm \ --mca btl_openib_ipaddr_include "13.0.0.0/8" ... |
NOTE: VLAN selection in the OMPI v1.4 series works only with version 1.4.4 or later.
44. Does Open MPI support XRC? |
Yes.
XRC (eXtended Reliable Connection) decreases the memory consumption of Open MPI and improves its scalability by significantly decreasing the number of QPs per machine.
XRC is available on Mellanox ConnectX family HCAs with OFED 1.4 and later.
See this FAQ entry for instructions how to tell Open MPI to use XRC receive queues.
45. How do I specify the type of receive queues that I want OMPI to use? |
You can use the btl_openib_receive_queues
MCA parameter to specify
the exact type of receive queues for OMPI to use. This can be advantageous,
for example, when you know the exact sizes of messages that your MPI application
will use -- Open MPI can internally pre-post receive buffers of exactly the right
size. See this
paper for more details.
The btl_openib_receive_queues
parameter takes a colon-delimited
string listing one or more receive queues of specific sizes and characteristics.
For now, all processes in the job must use the same string. You
can specify three kinds of receive queues:
The default value of the btl_openib_receive_queues
MCA parameter
is sometimes equivalent to the following command line:
shell$ mpirun ... --mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 ... |
In particular, note that XRC is (currently) not used by default.
NOTE: Open MPI chooses a default value
of btl_openib_receive_queues
based on the type of OpenFabrics network
device that is found. The text file $openmpi_packagedata_dir/mca-btl-openib-device-params.ini
(which is typically $openmpi_installation_prefix_dir/share/openmpi/mca-btl-openib-device-params.ini
)
contains a list of default values for different OpenFabrics devices. See that file
for further explanation of how default values are chosen.
Per-Peer Receive Queues
Per-peer receive queues require between 1 and 5 parameters:
Example: P,128,256,128,16
Shared Receive Queues
Shared Receive Queues can take between 1 and 4 parameters:
Example: S,1024,256,128,32
XRC Queues
XRC queues take the same parameters as SRQs. Note that if you use any XRC queues, then all of your queues must be XRC. Therefore, to use XRC, specify the following:
shell$ mpirun --mca btl openib... --mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ... |
NOTE: the rdmacm
CPC is
not supported with XRC. Also, XRC cannot be used when btls_per_lid
> 1.
46. Does Open MPI support FCA ? |
Yes.
FCA (which stands for Fabric Collective Accelerator) is a Mellanox MPI-integrated software package that utilizes CORE-Direct technology to implement MPI collective communications.
You can find more information about FCA on the product web page. FCA is available for download here: http://www.mellanox.com/products/fca
Building Open MPI 1.5.x or later with FCA support
By default, FCA is installed in /opt/mellanox/fca
. Use the following
configure option to enable FCA integration in OMPI:
shell$ ./configure --with-fca=/opt/mellanox/fca ... |
Verifying the FCA Installation
To verify that Open MPI is built with FCA support, use the following command:
shell$ ompi_info --param coll fca | grep fca_enable |
A list of FCA parameters will be displayed if Open MPI has FCA support.
How do I tell Open MPI to use FCA
?
shell$ mpirun --mca coll_fca_enable 1 ... |
By default, FCA will be enabled only with 64 or more ranks. To turn on FCA for
an arbitrary number of ranks ( N
), please use the following MCA parameters:
shell$ mpirun --mca coll_fca_enable 1 --mca coll_fca_np <N> ... |
47. Does Open MPI support MXM ? |
Yes.
MXM is a MellanoX Messaging
library which provides enhancements to parallel communication libraries.
It is integrated into the Mellanox OFED stack, and also provided in a stand-alone
package.
The MXM stand-alone package is available for download here: http://www.mellanox.com/products/mxm
MXM requirements
Building Open MPI with MXM support
By default, MXM is installed in /opt/mellanox/mxm
. Use the following
configure option to enable MXM support in OMPI:
shell$ ./configure --with-mxm=/opt/mellanox/mxm ... |
Enabling MXM in Open MPI 1.6.1 or later
Running Open MPI job with MXM:
shell$ mpirun --mca mtl mxm ... |
MXM advantages kick in with large Open MPI jobs only. By default, MXM is automatically selected by Open MPI when the job has at least 128 ranks. To enable MXM for an arbitrary number of ranks ( N ), please use the following parameter:
shell$ mpirun --mca mtl_mxm_np <N> ... |
To activate MXM for any job, run the following command:
shell$ mpirun --mca mtl_mxm_np 0 ... |
NOTE: In Open MPI 1.6 MXM will always be selected, regardless of the number of ranks.