Setting up OpenMPI-1.10.2 to run jobs on multiple nodes
First of all, my setup so far:
I'm working on a freshly installed Ubuntu GNOME 15.10 on all PCs.
My network consists of 4 PCs with static IPs (192.168.0.1 - 192.168.0.4), with 192.168.0.4 as the master, where I have installed Open MPI 1.10.2 in /opt/openmpi-1.10.2/.
I share this and another folder (/home/cgv_wand/openmpi-1.10.2/) via NFS with the other nodes. In the second folder I keep my Open MPI application (just a sample app for testing).
My /etc/exports file for NFS looks like this:
/home/cgv_wand/openmpi-1.10.2 192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)
/opt/openmpi-1.10.2 192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)
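On each client node I mount these exports at the same paths as on the master, so that binaries and libraries resolve identically everywhere. Roughly like this (a sketch of the mount commands; 192.168.0.4 is the master, as above):
sudo mount -t nfs 192.168.0.4:/opt/openmpi-1.10.2 /opt/openmpi-1.10.2
sudo mount -t nfs 192.168.0.4:/home/cgv_wand/openmpi-1.10.2 /home/cgv_wand/openmpi-1.10.2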
I also defined the PATH and LD_LIBRARY_PATH variables in the .bashrc files of all 4 PCs:
export PATH=:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/openmpi-1.10.2/bin
export LD_LIBRARY_PATH=:/opt/openmpi-1.10.2/lib/
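As a sanity check (just a diagnostic sketch, not part of the setup), one can verify that these variables are also visible in a non-interactive shell on a remote node, which is how mpirun launches its daemons:
ssh mila-2 'echo $PATH'
ssh mila-2 'which orted'
If 'which orted' finds nothing here, the remote daemons cannot start.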
Additionally, I have set up an SSH server on each of the nodes (192.168.0.1 - 192.168.0.3) and shared the public key of my master node with them for password-less login.
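The key distribution was done along these lines (a sketch; cgv_wand is my user, and mila-1 to mila-3 are the node host names used below):
ssh-keygen -t rsa
ssh-copy-id cgv_wand@mila-1
ssh-copy-id cgv_wand@mila-2
ssh-copy-id cgv_wand@mila-3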
Now to my problem:
If I run an MPI job via
mpirun -np 1 hello_c
everything works fine. But if I try to run the job on, for example, 2 nodes, it doesn't work (mary is the master; mila-1, mila-2 and mila-3 are the other nodes):
mpirun -np 2 --host mary,mila-2 hello_c
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
If I try to run the job just on mila-2 (192.168.0.2), I get the following error:
mpirun -np 1 --host mila-2 hello_c
hello_c: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[54161,1],0]
Exit code: 127
--------------------------------------------------------------------------
I have already read the Open MPI FAQ and a lot of topics here, but I have no idea what might be causing these problems... So maybe someone here can help me.
Solution 1:[1]
The error is almost certainly coming from your ~/.bashrc file. Where are your environment variables? If they are at the end of the file, that part of the .bashrc is never executed when MPI launches over SSH on the remote nodes, because those shells run in non-interactive mode. Notice the early-return if test at the top of your ~/.bashrc: you need to put your environment variables at the top of the file, before that test.
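Concretely, on a stock Ubuntu system the top of ~/.bashrc contains a guard that returns immediately for non-interactive shells, so the exports need to sit above it, roughly like this (the paths are the ones from the question; appending to the existing values rather than replacing them):
# Open MPI variables: must come BEFORE the non-interactive guard below
export PATH=$PATH:/opt/openmpi-1.10.2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi-1.10.2/lib
# stock Ubuntu guard: non-interactive shells (e.g. those started by
# mpirun over ssh) return here and skip everything that follows
case $- in
    *i*) ;;
      *) return;;
esac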
Solution 2:[2]
I was seeing the same behavior on an Nvidia Jetson.
Maybe try it like this (move the executable name before specifying the arguments):
franklin@node901:/mnt/clusterfs/mpi-cluster $ cat hello_mpi.c
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv){
    int node;
    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &node); /* rank of this process */
    printf("Hello World from Node %d!\n", node);
    MPI_Finalize();                       /* shut MPI down cleanly */
    return 0;
}
franklin@node901:/mnt/clusterfs/mpi-cluster $ mpicc -o hello_mpi hello_mpi.c
franklin@node901:/mnt/clusterfs/mpi-cluster $ mpiexec hello_mpi --hostfile /home/franklin/clusterfs/mpi-cluster/cluster
Hello World from Node 1!
Hello World from Node 2!
Hello World from Node 0!
Hello World from Node 3!
franklin@node901:/mnt/clusterfs/mpi-cluster $
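The contents of the hostfile passed via --hostfile aren't shown in the answer, but a typical Open MPI hostfile simply lists one host per line, optionally with a slot count. Something like the following would yield the four ranks printed above (host names other than node901 are hypothetical):
node901 slots=1
node902 slots=1
node903 slots=1
node904 slots=1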
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Joachim |
| Solution 2 | Franklin D. |
