Setting up OpenMPI-1.10.2 to run jobs on multiple nodes

First of all, my setup so far:
I'm working on a freshly installed Ubuntu GNOME 15.10 on all PCs. My network consists of 4 PCs with static IPs (192.168.0.1 - 192.168.0.4), with 192.168.0.4 as the master, where I have installed Open MPI 1.10.2 in /opt/openmpi-1.10.2/.
I share this folder and another one (/home/cgv_wand/openmpi-1.10.2/) via NFS with the other nodes. In the second folder I keep my Open MPI application (just a sample app for testing).

My /etc/exports file for NFS looks like this:

/home/cgv_wand/openmpi-1.10.2   192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)  
/opt/openmpi-1.10.2             192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)
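
Each of the other nodes mounts these shares; the /etc/fstab entries on the clients look roughly like this (mount points assumed to mirror the master's paths):

192.168.0.4:/opt/openmpi-1.10.2            /opt/openmpi-1.10.2            nfs  defaults  0  0
192.168.0.4:/home/cgv_wand/openmpi-1.10.2  /home/cgv_wand/openmpi-1.10.2  nfs  defaults  0  0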

I also defined the PATH and LD_LIBRARY_PATH variables in the .bashrc of each of the 4 PCs:

export PATH=:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/openmpi-1.10.2/bin
export LD_LIBRARY_PATH=:/opt/openmpi-1.10.2/lib/
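
Since mpirun starts orted on the remote nodes through ssh in non-interactive mode, a quick way to check whether these variables are actually visible there is, for example:

ssh 192.168.0.2 which orted              # should print /opt/openmpi-1.10.2/bin/orted
ssh 192.168.0.2 'echo $LD_LIBRARY_PATH'  # should include /opt/openmpi-1.10.2/lib/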

Additionally, I have set up an SSH server on each of the other nodes (192.168.0.1 - 192.168.0.3) and copied the master node's public key to them for password-less login.
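
For reference, the usual commands for that key exchange (default key type and locations assumed, username taken from the home directory above) are roughly:

ssh-keygen -t rsa                     # on the master, accept the default ~/.ssh/id_rsa
ssh-copy-id cgv_wand@192.168.0.1      # repeat for 192.168.0.2 and 192.168.0.3
ssh 192.168.0.1 hostname              # should no longer ask for a password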

Now to my problem:
If I run an MPI job via

mpirun -np 1 hello_c

everything works fine. But if I try to run the job on, for example, 2 nodes, it doesn't work (mary is the master; mila-1, mila-2 and mila-3 are the other nodes):

mpirun -np 2 --host mary,mila-2 hello_c

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

If I try to run the job only on mila-2 (192.168.0.2), I get the following error:

mpirun -np 1 --host mila-2 hello_c
hello_c: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54161,1],0]
  Exit code:    127
--------------------------------------------------------------------------

I have already read the Open MPI FAQ and a lot of topics here, but I actually have no idea what might cause these problems... So maybe someone here can help me.



Solution 1:[1]

The error is almost certainly coming from your ~/.bashrc file. Where are your environment variables? If they are at the end, then that part of the .bashrc is never executed when MPI launches processes on external nodes over ssh, because those shells run in non-interactive mode. Notice the guard near the top of your ~/.bashrc that returns immediately for non-interactive shells; you need to put your environment variables before that guard.
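
A minimal sketch of how the top of an Ubuntu-style ~/.bashrc would then look (the guard below is the stock Ubuntu one; the paths are taken from the question):

    # Open MPI variables *before* the non-interactive guard, so they are
    # also set when orted is launched over ssh
    export PATH=$PATH:/opt/openmpi-1.10.2/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi-1.10.2/lib

    # stock Ubuntu guard: everything below is skipped for non-interactive shells
    case $- in
        *i*) ;;
          *) return;;
    esac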

Solution 2:[2]

I am seeing the same behavior on an Nvidia Jetson.

Maybe try it like this (move the executable name before the arguments):


    franklin@node901:/mnt/clusterfs/mpi-cluster $ cat hello_mpi.c
    #include <stdio.h>
    #include <mpi.h>
    int main(int argc, char** argv){
        int node;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &node);
        printf("Hello World from Node %d!\n", node);
        MPI_Finalize();
    }
    franklin@node901:/mnt/clusterfs/mpi-cluster $ mpicc -o hello_mpi hello_mpi.c
    franklin@node901:/mnt/clusterfs/mpi-cluster $ mpiexec hello_mpi --hostfile /home/franklin/clusterfs/mpi-cluster/cluster
    Hello World from Node 1!
    Hello World from Node 2!
    Hello World from Node 0!
    Hello World from Node 3!
    franklin@node901:/mnt/clusterfs/mpi-cluster $
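
For completeness, the hostfile passed via --hostfile is just a list of nodes, one per line, optionally with slot counts; something like the following (these hostnames are made up, only node901 appears in the prompt above):

    # hypothetical contents of the "cluster" hostfile
    node901 slots=1
    node902 slots=1
    node903 slots=1
    node904 slots=1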

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
Solution 1: Joachim
Solution 2: Franklin D.