CS 566, SP06 Parallel Processing Aleksandar Leposavic
I have been corresponding quite a bit with your students, and I have found a common pattern among them: they are not sure how mpi2 works. When they get to their assignments and try to execute on a certain number of machines, they request e.g. 16 nodes and get 16 nodes, yet their job executes on only 1 node.
Here are the basic steps that sum up how to run any mpi2 job without the torque batch system:
  1. Bring up the mpd ring
  2. Use mpiexec to start your binary (e.g. hello world)
  3. Bring down the mpd ring (I cannot stress enough how important this one is to me)
And that is it!

Now, with torque or any other batch system, you wrap steps 1 and 2 (or at least step 2) in a script. You can call it script.sh or whatever, and you submit that script via qsub (or put the qsub call in the script as well).
Note: Select the same nodes for torque as the ones your mpd daemons will use to form the ring.

OK, now more on the 3 steps stated above; I think that will answer most of the questions:

Before we start, create a file in your home directory called .mpd.conf (permissions -rw-------) whose content is one line:
secretword=some_secret_password
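
A minimal sketch of creating that file from a POSIX shell. The helper name write_mpd_conf is made up for illustration (it is not an MPICH2 command), and the password shown is only a placeholder:

```shell
# write_mpd_conf FILE SECRET
# Hypothetical helper: writes the one-line mpd config with owner-only permissions.
write_mpd_conf() {
    conf_file="$1"
    secret="$2"
    # Create the file under a restrictive umask so it is never group/world readable.
    ( umask 077 && printf 'secretword=%s\n' "$secret" > "$conf_file" )
    # Force -rw------- even if the file already existed with looser permissions.
    chmod 600 "$conf_file"
}

# Typical use (pick your own password):
# write_mpd_conf "$HOME/.mpd.conf" some_secret_password
```

Afterwards, ls -l on the file should show -rw-------.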

Steps:

  1. Bring up the mpd ring:
    rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f /path_to_home_dir/mpd.hosts -v"
    where mpd.hosts (permissions -rw-r--r--) looks like:
    argo1-1
    argo1-2
    argo1-3
    argo1-4

    -r rsh tells mpdboot to use rsh to bring the daemons up, rather than ssh, which is the default way mpdboot tries to communicate.
    -n 4 means mpdboot will bring up only four mpd daemons from the list in mpd.hosts. Say mpd.hosts had all 64 nodes and you passed -n 5: it would select the first five nodes from the list and start mpd daemons on them. In any case, best practice is to list only the nodes you'll be using.

    To double check that the ring is indeed up, do the following:
    rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
    argo1-1 argo1-4 argo1-2 argo1-3
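
If you would rather check the ring size from a script than by eye, one possible sketch is to count the hostnames that mpdtrace prints. ring_size is a made-up helper name, and the sketch assumes mpdtrace prints bare hostnames as in the output above:

```shell
# ring_size: reads mpdtrace output on stdin and prints the number of hosts.
# Hypothetical helper, not part of MPICH2.
ring_size() {
    # wc -w counts hostnames whether they arrive one per line or space-separated;
    # the arithmetic expansion strips any padding that wc adds.
    echo $(( $(wc -w) ))
}

# Typical use: the result should match the -n value given to mpdboot.
# rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace" | ring_size
```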

  2. Now create a script in which you use mpiexec to start your binary (e.g. hello_world), with -n specifying how many instances you wish to start. Submit this script via qsub (or place qsub in the script as well).
    NOTE NOTE NOTE: Make sure that the number given to -n is not greater than the number of machines you started your mpds on.
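
One way to guard against that mistake is to compare the -n value with the number of hosts in mpd.hosts before submitting. A minimal sketch, assuming mpd.hosts has one hostname per line and that you booted a daemon on every host in it; check_np is a made-up helper name:

```shell
# check_np NP HOSTS_FILE
# Hypothetical helper: fails if NP exceeds the number of hosts listed in HOSTS_FILE.
check_np() {
    np="$1"
    hosts_file="$2"
    # Count non-empty lines (one hostname per line); grep -c prints 0 on no match.
    host_count=$(grep -c . "$hosts_file" || true)
    if [ "$np" -gt "$host_count" ]; then
        echo "error: -n $np exceeds the $host_count hosts in $hosts_file" >&2
        return 1
    fi
    return 0
}

# Typical use before calling mpiexec -n 4:
# check_np 4 "$HOME/mpd.hosts" || exit 1
```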

    The submission script looks something like the scripts x and x1 on the argo homepage:
    ---------------------------------------------------
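    #!/bin/csh
    # Join the hostnames in mpd.hosts with "+" (argo1-1+argo1-2+...), the form qsub expects
    set nodes = `perl -e 'while(<>){chop;$a.="$_+"}chop($a);print $a;' <$HOME/mpd.hosts`
    # Ask torque for exactly those nodes and start 4 instances of the binary
    qsub -l nodes=$nodes mpiexec -n 4 /home/homes5X/path_to_your_binary/mpihello_pgcc_mpi2libs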
    ---------------------------------------------------

    The most important line above is the qsub -l nodes=$nodes ... line.
    It tells PBS (torque) which nodes to use.
    Execute the script: ./script
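
For those who do not read perl: the one-liner in the script just joins the lines of mpd.hosts with "+" signs. A rough equivalent in POSIX shell, assuming one hostname per line and no blank lines (nodes_spec is a made-up name):

```shell
# nodes_spec HOSTS_FILE
# Hypothetical helper: prints the hosts joined by "+", e.g. argo1-1+argo1-2.
nodes_spec() {
    # paste -s merges all lines of the file into one; -d + sets the separator.
    paste -s -d + "$1"
}

# Typical use: print the value that would follow "qsub -l nodes=".
# nodes_spec "$HOME/mpd.hosts"
```

With the four-line mpd.hosts shown in step 1 this prints argo1-1+argo1-2+argo1-3+argo1-4.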

  3. The last step is to bring the ring down once we are done with the execution. To do that, call:
    rsh argo1-1 /usr/common/mpich2-1.0.1/bin/mpdallexit
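
To make sure step 3 is never forgotten, the three steps for the case without torque can be tied together so that the ring comes down even when the run fails. This is only a sketch under assumptions: run_mpi_session, HEAD_NODE, MPD_BIN, and RSH are names made up here, and starting mpiexec through rsh on the head node is my assumption about how you would launch it by hand.

```shell
MPD_BIN=/usr/common/mpich2-1.0.1/bin
HEAD_NODE=argo1-1
RSH="${RSH:-rsh}"   # set RSH=echo for a dry run that only prints the commands

# run_mpi_session HOSTS_FILE NP BINARY
# Hypothetical wrapper: boot the ring, run the binary, always tear the ring down.
run_mpi_session() {
    hosts_file="$1"
    np="$2"
    binary="$3"
    # Step 1: bring up the mpd ring; give up if that fails.
    $RSH "$HEAD_NODE" "$MPD_BIN/mpdboot -r rsh -n $np -f $hosts_file -v" || return 1
    # Step 2: run the binary, remembering a failure instead of stopping.
    # (Classic rsh may not report the remote command's exit status.)
    status=0
    $RSH "$HEAD_NODE" "$MPD_BIN/mpiexec -n $np $binary" || status=$?
    # Step 3: bring the ring down no matter how step 2 went.
    $RSH "$HEAD_NODE" "$MPD_BIN/mpdallexit"
    return "$status"
}

# Typical use:
# run_mpi_session /path_to_home_dir/mpd.hosts 4 /home/homes5X/path_to_your_binary/mpihello_pgcc_mpi2libs
```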
--------------------------------------------------------------------------------
Aleksandar
argo sysadmin