
SGE Submit Scripts


Introduction

A submission script is the script executed by the scheduler's submission command (qsub in SGE), which places the job in the appropriate (or default) queue. It provides a set of directives that tell the scheduler how to treat the job (within the capabilities of the qsub command). It is essentially a wrapper for the software that you are trying to run on a cluster node (or a set of nodes, if MPI is used). Directives that specify where to put the output of the job (STDOUT and STDERR) and other job parameters are usually given as so-called pseudo-comments (lines starting with #$), but they can also be passed as options to the qsub command.

There is also a special file, .sge_request, located in your home directory, which is read automatically; the options specified in it become the default options for every submission script. They can be overridden with explicit parameters to qsub.
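
For illustration, a minimal ~/.sge_request file might look like the sketch below (the particular options are just an assumption; any valid qsub options can appear here, one or more per line):

# ~/.sge_request -- default qsub options applied to every submission
-cwd
-S /bin/bash
-m e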

The necessity of a submission script is dictated by the fact that programs cannot be submitted directly to the grid engine. Instead they require a small shell script, which is a wrapper for the program to be run. Note that the script must be executable (check with the command

ls -l name_of_the_submission_script

If there is no x among the permissions of the script, it is not executable. This can be fixed with the command chmod +x <script name>). If the program requires interactive input, the input has to be piped in either by the echo command or from an external file.
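
For example, either of the following lines inside a submission script feeds input to a program non-interactively (my_program and answers.txt are hypothetical names used only for illustration):

echo "yes" | ./my_program        # a single answer supplied via echo
./my_program < answers.txt       # several answers read from an external file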

Some applications with a GUI, such as Accelrys, Tmole, or Medea, generate submission scripts automatically based on the parameters you enter in the GUI.

Sometimes you need a test program in order to check whether your submission script works. A minimal wrapper script that prints the hostname and the list of parameters passed to it would be:

#!/bin/bash
#$ -S /bin/bash
hostname
echo "$@"

After a check that the script runs correctly (typing ./test.sh at the prompt should execute it without errors), the job is submitted with the qsub command:

qsub test.sh 

While the qsub command is run on the head node or a login node, the script is only analyzed on that node: all pseudo-comments with directives are extracted and interpreted there. Actual execution of the script is delayed until it runs on the target computation node. So the environment that you have during the execution of the submission script is the environment of the computation node, not of the login node or head node from which the command was submitted.

The command qsub has many options, some of which should be explicitly defined for each submitted job. There are three methods of doing so, with increasing priority (a higher priority overrides an already defined option of a lower priority): default options in the .sge_request file in your home directory, pseudo-comments (#$ lines) inside the submission script, and explicit command-line options passed to qsub itself.
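
To see how the priorities interact, suppose the script sets the job name in a pseudo-comment; an explicit option on the qsub command line still takes precedence (the names here are made up):

# inside test.sh:
#$ -N default_name

# at submission time the command-line option wins:
qsub -N better_name test.sh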

The most common options

In any case, an option always starts with a minus sign and a keyword, followed, if necessary, by additional arguments. A small set of commonly used options is worth defining for every job, preferably via the .sge_request file in the home directory.

Use man qsub to see the full list of options. All options can also be set interactively by using the job submission dialog of qmon.

Examples

Customize a submit script, for example openmpi.sh:
#!/bin/bash

# Define the parallel environment (ompi) and the number of CPUs (4)
#$ -pe ompi 4

# Specify the job name in the queue system
#$ -N MPI_Job

# Start the script in the current working directory
#$ -cwd

# Specify where standard output and error are stored
#$ -o MPI-stdo.output
#$ -e MPI-stderr.output

# Put the name of your compiled MPI file
myjob=ex3.x

export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

mpiexec -n $NSLOTS $myjob 

Serial

#!/bin/bash
#$ -cwd
#
#$ -N serial_test_job
#$ -m e
#$ -e sge.err
#$ -o sge.out
# requesting 12hrs wall clock time
#$ -l h_rt=12:00:00

/soft/linux/pkg/apbs-1.0.0/bin/apbs.serial inputfile >& outputfile

Parallel

#!/bin/bash
#$ -cwd
#
#$ -N apbs-PARALLEL
#$ -e apbs-PARALLEL.errout
#$ -o apbs-PARALLEL.errout
#
# requesting 8 processors
#$ -pe mpich 8

echo -n "Running on: "
hostname

APBSBIN_PARALLEL=/soft/linux/pkg/apbs-1.0.0/bin/apbs
MPIRUN=/opt/openmpi/bin/mpirun

echo "Starting apbs-PARALLEL calculation ..."

$MPIRUN -v -machinefile $TMPDIR/machines -np 8 -nolocal \
    $APBSBIN_PARALLEL apbs-PARALLEL.in >& apbs-PARALLEL.out

echo "Done."

Another parallel job

#!/bin/bash
#$ -cwd
#
#$ -N amber_test_job
#$ -m e
#$ -e sge.err
#$ -o sge.out
#$ -pe mpich 4
# requesting 6hrs wall clock time
#$ -l h_rt=6:00:00
#

# export all environment variables to SGE
#$ -V

echo Running on host `hostname`
echo "SGE job id: $JOB_ID" 
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
cat $TMPDIR/machines
echo This job has allocated $NSLOTS processors

in=./mdin
out=./mdout
crd=./inpcrd.equil

cat <<eof > $in
 short md, nve ensemble
 &cntrl
   ntx=7, irest=1,
   ntc=2, ntf=2, tol=0.0000001,
   nstlim=1000,
   ntpr=10, ntwr=10000,
   dt=0.001, vlimit=10.0,
   cut=9.,
   ntt=0, temp0=300.,
 &end
 &ewald
  a=62.23, b=62.23, c=62.23,
  nfft1=64,nfft2=64,nfft3=64,
  skinnb=2.,
 &end
eof

sander=/soft/linux/pkg/amber11/bin/sander.MPI
mpirun=/soft/linux/pkg/openmpi/bin
export LD_LIBRARY_PATH=/soft/linux/pkg/openmpi/lib:${LD_LIBRARY_PATH}

# needs prmtop and inpcrd.equil files

$mpirun -v -hostfile $TMPDIR/machines -np $NSLOTS \
   $sander -O -i $in -c $crd -o $out < /dev/null

/bin/rm -f $in restrt

Please note that if you are running parallel Amber you must include the following in your .bashrc:

# Set P4_GLOBMEMSIZE environment variable used to reserve memory in bytes
# for communication with shared memory on dual nodes
# (optimum/minimum size may need experimentation)
export P4_GLOBMEMSIZE=32000000
Another example

Running a CHARMM job in parallel on 4 processors (located on the same physical host to minimize internode communication):

#!/bin/bash
#$ -cwd
#$ -N p4_test
#$ -pe mpi-host 4
#$ -e job.log
#$ -o job.out
#$ -l h_rt=47:59:59

echo -n "Running on: "
hostname

MPIRUN=/soft/linux/pkg/mpich/intel/bin/mpirun
CHARMM=/soft/linux/pkg/c32b1/bin/charmm.xxlarge.mpich.32bit.070108

current_dir=`pwd`

# create a scratch directory and copy all runtime data there
export scratch_dir=`mktemp -d /scratch/${USER}.XXXXXX`
cp * $scratch_dir
cd $scratch_dir

# launch the job
$MPIRUN -v -machinefile $TMP/machines -np $NSLOTS $CHARMM < inp > out

# copy all data back from the scratch directory
cp * $current_dir
rm -rf $scratch_dir
Another example
#!/bin/bash
#$ -cwd
#
#$ -N gromacs-job
#$ -e gromacs-job.errout
#$ -o gromacs-job.out
#
# requesting 4 processors
#$ -pe mpich 4
# requesting 8hrs wall clock time
#$ -l h_rt=8:00:00
#

echo -n "Running on: "
cat $TMPDIR/machines

MDRUN=/soft/linux/pkg/gromacs/bin/mdrun-mpi
MPIRUN=/soft/linux/pkg/mpich/intel/bin/mpirun

$MPIRUN -v -machinefile $TMPDIR/machines -nolocal -np $NSLOTS \
 $MDRUN -v -nice 0 -np $NSLOTS -s topol.tpr -o traj.trr \
  -c confout.gro -e ener.edr -g md.log

echo "Done."
Another example
#!/bin/bash
#$ -cwd
#
#$ -N namd-job
#$ -e namd-job.errout
#$ -o namd-job.out
#
# requesting 8 processors
#$ -pe mpich 8
# requesting 12hrs wall clock time
#$ -l h_rt=12:00:00
#

echo "Running on:" `hostname`

/soft/linux/pkg/NAMD/namd2.sh namd_input_file > namd2.log

echo "Done."

Matlab

It is possible to run standalone Matlab jobs on the cluster using the Matlab Compiler Runtime (MCR). The Matlab application must first be compiled with the Matlab compiler and the resulting distrib folder transferred to ctbp1. The generated shell script (run_program.sh) is then started from an SGE script like the one below:
#! /bin/bash
#
#$ -cwd
#$ -j y
#
#$ -M [email protected]
#$ -m e
#$ -e sge.err
#$ -o sge.out
# requesting 30 min wall clock time
#$ -l h_rt=00:30:00

./run_program.sh /soft/linux/pkg/MCR/v714 var1 var2

Monitoring a Job and the Queue

Once the job is submitted, a job id is assigned and the job is placed in the queue. To see the status of the queue, the command qstat prints a list of all running and pending jobs with the most important information (job ID, job owner, job name, status, node). More information on a specific job can be obtained with qstat -j <job-id>. The status of the job is indicated by one or more characters:

r - running
t - transferring to a node
qw - waiting in the queue
d - marked for deletion
R - marked for restart

Normally the status d is rarely observed with qstat; if a job stays in the queue for a long time marked for deletion, it indicates that the grid engine is not running properly. Please inform the system administrator about it.

To remove a job from the queue, the command qdel only requires the job-id. A job can also be changed after it has been submitted with the qalter command. It works similarly to the qsub command, but takes the job-id instead of the shell script name.
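
For example, the wall-clock limit of a pending job could be raised like this (the job-id 12345 is made up):

qalter -l h_rt=24:00:00 12345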

The command qhost gives the status of all nodes. If the load is close to unity it indicates that the machine is busy and most likely running a job (use the qstat command to check; if no job shows up, a user might have logged directly onto the node to run a job interactively).

Submitting an MPI-Job

To run a parallel job, the script requires some additional information. First, the option -pe has to be used to indicate the parallel environment. Right now only mpich is supported on the Beowulf cluster. The second mandatory argument of the -pe option is the number of requested nodes, which can also be defined as a range; the Sun Grid Engine tries to maximize this number. It is recommended to add this line to the shell script

    #$ -pe mpich N

where N is the number of desired nodes. Right now it is limited to 14, corresponding loosely to one job per node/CPU. If multiple instances per node are required, please contact the system administrator to increase the maximum number of slots.
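
As mentioned above, a range can be requested instead of a fixed number; for example, the following line asks the grid engine for anywhere between 4 and 14 slots, preferring the upper end:

    #$ -pe mpich 4-14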

The invocation of mpirun also requires some non-standard placeholders (environment variables), which are filled in by the grid engine when the script executes. The format is (one line!)

/usr/local/mpich/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines <path to mpi program + optional command line arguments>

Everything up to the path of the MPI program should be used as it is. $NSLOTS and $TMPDIR will be defined by the Sun Grid Engine. Note also that this script does not run correctly if it is executed directly. Further information can be found in the MPICH documentation.

Interactive Sessions

If a user has to run an interactive session (e.g. Oopics), they can log onto a node with the qsh command. The Sun Grid Engine will then mark that node as busy and will not submit any further jobs to it until the user has logged out. The command qstat will show INTERACTIVE as the job name, indicating that an interactive session is running on that node.
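
For example, an interactive session with a two-hour wall-clock limit could be requested like this (the resource request is optional; plain qsh also works):

qsh -l h_rt=2:00:00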



Old News ;-)

[May 08, 2017] Sample SGE scripts

May 08, 2017 | ctbp.ucsd.edu
  1. An example of a simple APBS serial job.
    #!/bin/csh -f
    #$ -cwd
    #
    #$ -N serial_test_job
    #$ -m e
    #$ -e sge.err
    #$ -o sge.out
    # requesting 12hrs wall clock time
    #$ -l h_rt=12:00:00
    
    /soft/linux/pkg/apbs/bin/apbs inputfile >& outputfile
    
    
  2. An example script for running the executable a.out in parallel on 8 CPUs. (Note: for your executable to run in parallel it must be compiled with a parallel library such as MPICH, LAM/MPI, or PVM.) This script shows file staging, i.e., using the fast local filesystem /scratch on the compute node in order to eliminate speed bottlenecks.
    #!/bin/csh -f
    #$ -cwd
    #
    #$ -N parallel_test_job
    #$ -m e
    #$ -e sge.err
    #$ -o sge.out
    #$ -pe mpi 8
    # requesting 10hrs wall clock time
    #$ -l h_rt=10:00:00
    #
    echo Running on host `hostname`
    echo Time is `date`
    echo Directory is `pwd`
    set orig_dir=`pwd`
    echo This job runs on the following processors:
    cat $TMPDIR/machines
    echo This job has allocated $NSLOTS processors
    
    # copy input and support files to a temporary directory on compute node
    set temp_dir=/scratch/`whoami`.$$
    mkdir $temp_dir
    cp input_file support_file $temp_dir
    cd $temp_dir
    
    /opt/mpich/intel/bin/mpirun -v -machinefile $TMPDIR/machines \
               -np $NSLOTS $HOME/a.out ./input_file >& output_file
    
    # copy files back and clean up
    cp * $orig_dir
    rm -rf $temp_dir
    
    
  3. An example of an SGE script for Amber users (parallel run, 4 CPUs, with the input file generated on the fly):
    #!/bin/csh -f
    #$ -cwd
    #
    #$ -N amber_test_job
    #$ -m e
    #$ -e sge.err
    #$ -o sge.out
    #$ -pe mpi 4
    # requesting 6hrs wall clock time
    #$ -l h_rt=6:00:00
    #
    setenv MPI_MAX_CLUSTER_SIZE 2
    
    # export all environment variables to SGE 
    #$ -V
    
    echo Running on host `hostname`
    echo Time is `date`
    echo Directory is `pwd`
    echo This job runs on the following processors:
    cat $TMPDIR/machines
    echo This job has allocated $NSLOTS processors
    
    set in=./mdin
    set out=./mdout
    set crd=./inpcrd.equil
    
    cat <<eof > $in
     short md, nve ensemble
     &cntrl
       ntx=7, irest=1,
       ntc=2, ntf=2, tol=0.0000001,
       nstlim=1000,
       ntpr=10, ntwr=10000,
       dt=0.001, vlimit=10.0,
       cut=9.,
       ntt=0, temp0=300.,
     &end
     &ewald
      a=62.23, b=62.23, c=62.23,
      nfft1=64,nfft2=64,nfft3=64,
      skinnb=2.,
     &end
    eof
    
    set sander=/soft/linux/pkg/amber8/exe.parallel/sander
    set mpirun=/opt/mpich/intel/bin/mpirun
    
    # needs prmtop and inpcrd.equil files
    
    $mpirun -v -machinefile $TMPDIR/machines -np $NSLOTS \
       $sander -O -i $in -c $crd -o $out < /dev/null
    
    /bin/rm -f $in restrt
    
    

    Please note that if you are running parallel amber8 you must include the following in your .cshrc :
    # Set P4_GLOBMEMSIZE environment variable used to reserve memory in bytes
    # for communication with shared memory on dual nodes
    # (optimum/minimum size may need experimentation)
    setenv P4_GLOBMEMSIZE 32000000
    
  4. An example of an SGE script for an APBS job (parallel run, 8 CPUs, running an example input file included in the APBS distribution, /soft/linux/src/apbs-0.3.1/examples/actin-dimer):
    #!/bin/csh -f
    #$ -cwd
    #
    #$ -N apbs-PARALLEL
    #$ -e apbs-PARALLEL.errout
    #$ -o apbs-PARALLEL.errout
    #
    # requesting 8 processors
    #$ -pe mpi 8
    
    echo -n "Running on: "
    hostname
    
    setenv APBSBIN_PARALLEL /soft/linux/pkg/apbs/bin/apbs-icc-parallel
    setenv MPIRUN /opt/mpich/intel/bin/mpirun
    
    echo "Starting apbs-PARALLEL calculation ..."  
    
    $MPIRUN -v -machinefile $TMPDIR/machines -np 8 \
        $APBSBIN_PARALLEL apbs-PARALLEL.in >& apbs-PARALLEL.out
    
    echo "Done."
    
    
  5. An example of an SGE script for a parallel CHARMM job (4 processors):
    #!/bin/csh -f
    #$ -cwd
    #
    #$ -N charmm-test
    #$ -e charmm-test.errout
    #$ -o charmm-test.errout
    #
    # requesting 4 processors
    #$ -pe mpi 4
    # requesting 2hrs wall clock time
    #$ -l h_rt=2:00:00
    #
    
    echo -n "Running on: "
    hostname
    
    setenv CHARMM /soft/linux/pkg/c31a1/bin/charmm.parallel.092204
    setenv MPIRUN /soft/linux/pkg/mpich-1.2.6/intel/bin/mpirun
    
    echo "Starting CHARMM calculation (using $NSLOTS processors)"
    
    $MPIRUN -v -machinefile $TMPDIR/machines -np $NSLOTS \
        $CHARMM < mbcodyn.inp > mbcodyn.out
    
    echo "Done."
    
    
  6. An example of an SGE script for a parallel NAMD job (8 processors):
    #!/bin/csh -f
    #$ -cwd
    #
    #$ -N namd-job
    #$ -e namd-job.errout
    #$ -o namd-job.out
    #
    # requesting 8 processors
    #$ -pe mpi 8
    # requesting 12hrs wall clock time
    #$ -l h_rt=12:00:00
    #
    
    echo -n "Running on: "
    hostname
    
    /soft/linux/pkg/NAMD/namd2.sh namd_input_file > namd2.log
    
    echo "Done."
    
    
  7. An example of an SGE script for a parallel Gromacs job (4 processors):
    #!/bin/csh -f
    #$ -cwd
    #
    #$ -N gromacs-job
    #$ -e gromacs-job.errout
    #$ -o gromacs-job.out
    #
    # requesting 4 processors
    #$ -pe mpich 4
    # requesting 8hrs wall clock time
    #$ -l h_rt=8:00:00
    #
    
    echo -n "Running on: "
    cat $TMPDIR/machines
    
    setenv MDRUN /soft/linux/pkg/gromacs/bin/mdrun-mpi
    setenv MPIRUN /soft/linux/pkg/mpich/intel/bin/mpirun
    
    $MPIRUN -v -machinefile $TMPDIR/machines -np $NSLOTS \
     $MDRUN -v -nice 0 -np $NSLOTS -s topol.tpr -o traj.trr \
      -c confout.gro -e ener.edr -g md.log
    
    echo "Done."
    

[Sep 23, 2014] Reserving resources (RAM, disc, GPU) by MerlinWiki

SGE - MerlinWiki

We have found that for some tasks it is advantageous to specify the required resources to SGE. This makes sense when excessive use of RAM or network storage is expected. The limits can be soft or hard (parameters -soft, -hard); the limits themselves have the form:

 -l resource=value

For example, if a job needs at least 400MB of RAM: qsub -l ram_free=400M my_script.sh. Another often requested resource is space in /tmp: qsub -l tmp_free=10G my_script.sh. Or both:

qsub -l ram_free=400M,tmp_free=10G my_script.sh

Of course, it is possible (and preferable if the number does not change) to use the construction #$ -l ram_free=400M directly in the script. The actual status of a given resource on all nodes can be obtained with qstat -F ram_free, or for several resources at once with qstat -F ram_free,tmp_free.

Details on the other standard available resources are in /usr/local/share/SGE/doc/load_parameters.asc. If you do not specify a value for a given resource, an implicit value will be used (1GB for space in /tmp, 100MB for RAM).

WARNING: You need to distinguish whether you are requesting resources that must be available at the time of submission (so-called non-consumable resources), or whether you need to allocate a given resource for the whole runtime of your computation. For example, your program may need 400MB of memory, but in the first 10 minutes of computation it allocates only 100MB. If you use the standard resource mem_free and other jobs are submitted to the given node during those first 10 minutes, SGE will interpret the situation as follows: you wanted 400MB but you are currently using only 100MB, so the remaining 300MB can be given to someone else (i.e. it will schedule onto the node another task requesting this memory).

For these purposes it is better to use consumable resources, which are accounted independently of the current status of the task: for memory it is ram_free, for disk tmp_free. For example, the resource ram_free does not look at the RAM that is actually free; it computes the occupation of RAM based only on the requests of the individual scripts. It starts with the size of RAM of the given machine and subtracts the amount requested by each job scheduled to run on it. If a job does not specify ram_free, the implicit value ram_free=100M is used.

For disk space in /tmp (tmp_free) the situation is trickier: if a job does not properly clean up its mess after it finishes, the disk can actually have less free space than the resource accounting suggests. Unfortunately, nothing can be done about this on the SGE side.
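
A well-behaved job can at least remove its own scratch space even when it exits prematurely; a minimal sketch (the directory naming scheme is arbitrary):

#!/bin/bash
scratch=/tmp/${USER}.${JOB_ID}
mkdir -p $scratch
trap "rm -rf $scratch" EXIT    # clean up when the script exits
cd $scratch
# ... run the actual computation here ...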

Known problems with SGE

To restrict a job to particular groups of machines, the queue name can be qualified with a host group (host group names themselves start with @, hence the double @):

#$ -q all.q@@blade,all.q@@PCNxxx,all.q@@servers

The main groups of computers are @blade, @servers, @speech, @PCNxxx, and @PCN2xxx; the full and current list can be obtained with qconf -shgrpl.

@stable (@blade, @servers) - servers that run all the time without restarting
@PCOxxx, @PCNxxx - computer labs, where any node might be restarted at any time:
      a student or someone else can shut a machine down by error or "by error". It is more or less certain that these
      machines will run smoothly over night and during weekends. There is also a group for each independent lab, e.g. @PCN103.
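
The current membership of any group can be checked before submitting, for example:

qconf -shgrpl          # list all host groups
qconf -shgrp @blade    # show the hosts in the @blade group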

Parallel jobs - OpenMP

For parallel tasks with threads, it is enough to use the parallel environment smp and to set the number of threads:

#!/bin/sh 
#
#$ -N OpenMPjob
#$ -o $JOB_NAME.$JOB_ID.out
#$ -e $JOB_NAME.$JOB_ID.err
#
# PE_name    CPU_Numbers_requested
#$ -pe smp  4
#
cd SOME_DIR_WITH_YOUR_PROGRAM
export OMP_NUM_THREADS=$NSLOTS
 
./your_openmp_program [options]

Parallel jobs - OpenMPI

Listing follows:

#!/bin/bash
# ---------------------------
# our name 
#$ -N MPI_Job
#
# use reservation to stop starvation
#$ -R y
#
# pe request
#$ -pe openmpi 2-4
#
# ---------------------------
# 
#   $NSLOTS          
#       the number of tasks to be used

echo "Got $NSLOTS slots."

mpirun -n $NSLOTS /full/path/to/your/executable

[Aug 26, 2014] Configuring a New Parallel Environment

Mar 23, 2007 | DanT's Grid Blog, by templedf

Since this seems to be a regular topic on the user mailing list, here's a quick guide to setting up a parallel environment on Grid Engine:
  1. First, create/borrow/steal the startup and shutdown scripts for the parallel environment you're using. You can find MPI and PVM scripts in the $SGE_ROOT/mpi and $SGE_ROOT/pvm directories, respectively. If you cannot find scripts for your parallel environment, you'll have to create them. The startup script must prepare the parallel environment for being used. With most MPI implementations, that's just a matter of creating a "machines" file that lists the machines which are to run the parallel job. The shutdown script must clean up after the parallel job's execution. The MPI shutdown script just deletes the "machines" file.
  2. Next, you have to tell Grid Engine about your parallel environment. You can do that interactively with qmon or qconf -ap <pe_name> or you can write the data to a file and use qconf -Ap <file_name>. For an example of what such a file would look like, see $SGE_ROOT/mpi/mpi.template or $SGE_ROOT/pvm/pvm.template.

    Let's look at what the parallel environment configuration contains.

    pe_name           template
    slots             0
    user_lists        NONE
    xuser_lists       NONE
    start_proc_args   /bin/true
    stop_proc_args    /bin/true
    allocation_rule   $pe_slots
    control_slaves    FALSE
    job_is_first_task FALSE
    urgency_slots     min
    • pe_name - the name by which the parallel environment will be known to Grid Engine
    • slots - the maximum number of job slots that the parallel environment is allowed to occupy at once
    • user_lists - an ACL specifying the users who are allowed to use the parallel environment. If set to NONE, any user can use it
    • xuser_lists - an ACL specifying the users who are not allowed to use the parallel environment. Users in both user_lists and xuser_lists are not allowed to use the parallel environment
    • start_proc_args - the path to the startup script for the parallel environment followed by any needed arguments. Grid Engine provides some inline variables that you can use as arguments:
      • $pe_hostfile - the path to a file written by Grid Engine which contains information about how and where the parallel job should be run
      • $host - the host on which the parallel environment is being started
      • $job_owner - the name of the user who owns the parallel job
      • $job_id - the id of the parallel job
      • $job_name - the name of the parallel job
      • $pe - the name of the parallel environment
      • $pe_slots - the number of job slots assigned to the job
      • $queue - the name of the queue in which the parallel job is running

      The value of this setting is the command that will be run to start the parallel environment for every parallel job.

    • stop_proc_args - the path to the shutdown script for the parallel environment followed by any needed arguments. The same inline variables are available as with start_proc_args.
    • allocation_rule - this setting controls how job slots are assigned to hosts. It can have four possible values:
      • a number - if set to a number, Grid Engine will assign that many slots to the parallel job on each host until the assigned number of job slots is met. Setting this attribute to 1, for example, would mean that the job gets a single job slot on each host where it is assigned. Grid Engine will not assign the job more job slots than the number of assigned hosts multiplied by this attribute's value.
      • $fill_up - use all of the job slots on a given host before moving to the next host
      • $round_robin - select one slot from each host in a round-robin fashion until all job slots are assigned. This setting can result in more than one job slot per host.
      • $pe_slots - place all the job slots on a single machine. Grid Engine will only schedule such a job to a machine that can host the maximum number of slots requested by the job. (See below.)
    • control_slaves - this setting tells Grid Engine whether the parallel environment integration is "tight" or "loose". See your parallel environment's documentation for more details.
    • job_is_first_task - this setting tells Grid Engine whether the first task of the parallel job is actually a job task or whether it's just there to kick off the rest of the jobs. This setting is also determined by your parallel environment integration.
    • urgency_slots - this setting affects how resource requests influence job priority for parallel jobs. The values can be "min," "max," "avg," or a number. For more information about resource-based job priorities, see the sge_priority man page

    For more information about these settings, see the sge_pe man page.

  3. The next step is to enable your parallel environment for the queues where it should be available. You can add the parallel environment to a queue interactively with qmon or qconf -mq <queue> or in a single action with qconf -aattr queue pe_list <pe_name> <queue>.
  4. Now you're ready to test your parallel environment. Submit a test job with qsub -pe <pe_name> <slots> <job_script>. Aside from the usual output and error files (<job_name>.o<job_id> and <job_name>.e<job_id>, respectively), you should also look for the parallel environment startup output and error files, <job_name>.po<job_id> and <job_name>.pe<job_id>.

That's all there is to it! Just to make sure we're clear on everything, let's do an example. Let's create a parallel environment that starts up an RMI registry and stores the port number in a file so that the job can find it.

First thing we have to do is write the startup and shutdown scripts for the RMI parallel environment. Here's what they look like:

rmi_startup.sh

#!/bin/sh
# $TMPDIR and $JOB_ID are set by Grid Engine automatically

# Borrowed from $SGE_ROOT/mpi/startmpi.sh
PeHostfile2MachineFile()
{
   cat $1 | while read line; do
      host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
      nslots=`echo $line|cut -f2 -d" "`
      i=1

      while [ $i -le $nslots ]; do
         echo $host
         i=`expr $i + 1`
      done
   done
}

# get arguments
pe_hostfile=$1

# ensure pe_hostfile is readable
if [ ! -r $pe_hostfile ]; then
   echo "$me: can't read $pe_hostfile" >&2
   exit 1
fi

# create machines file
machines="$TMPDIR/machines"
PeHostfile2MachineFile $pe_hostfile >> $machines

# We use ports 40000-40999
port=`expr \( $JOB_ID % 1000 \) + 40000`

# Start the registry
/usr/java/bin/rmiregistry $port &

# Save the registry's PID so that we can stop it later
echo $! > $TMPDIR/pid

# Save the port number so the job can find it
echo $port > $TMPDIR/port

rmi_shutdown.sh

#!/bin/sh
# $TMPDIR is set by Grid Engine automatically

# Get the registry's PID
pid=`cat $TMPDIR/pid`

# Kill the registry
kill $pid

# Clean up the files the startup script created
rm $TMPDIR/pid
rm $TMPDIR/port
rm $TMPDIR/machines

Next thing we have to do is add our parallel environment to Grid Engine. First we create a file, say /tmp/rmi_pe, with the following contents:

pe_name           rmi
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/dant/rmi_startup.sh $pe_hostfile
stop_proc_args    /home/dant/rmi_shutdown.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

Note that control_slaves is true and job_is_first_task is false. Because we're writing the integration scripts, the choice is somewhat arbitrary, but it affects how the job scripts must be written, as we'll see below. It also affects whether the qmaster is able to keep accounting records on the slave tasks. If control_slaves is false, the qmaster has no record of how many resources the slave tasks consumed.

Now we add the parallel environment with qconf -Ap /tmp/rmi_pe. We could have skipped a step by running qconf -ap rmi and entering the data in the editor that comes up, but the way we've done it here is scriptable.

The next step is to add our parallel environment to our queue with qconf -aattr queue pe_list rmi all.q. Again, we could have run qconf -mq all.q and edited the pe_list attribute in the editor, but the way we've done it is scriptable.

Last thing to do is test out our parallel environment. First we need a job script:

#!/bin/sh
#$ -S /bin/sh

port=`cat $TMPDIR/port`
qrsh=$SGE_ROOT/bin/$ARC/qrsh

cat $TMPDIR/machines | while read host; do
   $qrsh -inherit $host /usr/bin/java -cp ~/rmi.jar RMIApp $port &
done

Let's look at this job script for a moment. The first thing to notice is the use of qrsh -inherit. The -inherit switch is specifically for kicking off slave tasks. It requires that the target host name be supplied. In order to get the target host name, we read the machines file that the startup script generated from the one Grid Engine supplied.

The second thing to notice is how ugly the use of qrsh -inherit is. RMI is not really a parallel environment; it's a communications framework. It doesn't do the work of kicking off remote processes for you, so we have to do it ourselves in the job script. With a true parallel environment, like any of the MPI flavors, the framework also takes care of starting the remote processes, often through rsh. The MPI scripts shipped with Grid Engine include an rsh wrapper script which transparently replaces calls to rsh with calls to qrsh -inherit. By using that wrapper script, the parallel environment's calls to rsh can be rerouted through the grid via qrsh without having to modify the parallel environment itself to work with Grid Engine.

The last thing to notice is how this script correlates with the control_slaves and job_is_first_task attributes of the parallel environment configuration. Let's start with job_is_first_task. In our configuration, we set it to false. That means that the master job script is not counted as a task and does no real work. That is why our script doesn't do anything but kick off sub-tasks. If job_is_first_task had been true, our job script would be expected to run one of the RMIApp instances itself.

Now let's talk about the control_slaves attribute. If control_slaves is true, we are allowed to use qrsh -inherit to kick off our sub-tasks. The qmaster will not, however, allow us to kick off more subtasks than the number of slots we've been assigned (minus 1 if job_is_first_task is true). The advantage of using qrsh -inherit is that the sub-tasks are tracked by Grid Engine like regular jobs. If control_slaves is false, we have to use some mechanism external to Grid Engine, such as rsh or ssh, to kick off our sub-tasks, meaning that Grid Engine cannot track them and is actually fully unaware of them. That's why job_is_first_task is meaningless when control_slaves is false.

In order to test our job we need a Java application called RMIApp. As that's outside the scope of the example, let's just pretend we have a parallel Java application that uses the RMI registry for inter-process communication. To submit our job we use qsub -pe rmi 2-4 rmi_job.sh. The -pe rmi 2-4 argument tells the qmaster that we're using the rmi parallel environment and we want 4 job slots assigned to our job, but we will accept as few as 2. Because our job script starts a sub-task for every entry in the host file, it will start the right number of sub-tasks, no matter how many slots we are assigned. Had we written the job script to start exactly two sub-tasks, we would have to use -pe rmi 2 so that we could be sure we got exactly two job slots.

While the job is running, run qstat -f. You'll see output something like this:

% qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@ultra20                  BIP   4/10      0.08     sol-amd64
    253 0.55500 rmi_job.sh dant         r     03/23/2007 11:46:51     4

From this output we can see that the job has been scheduled and has been assigned four job slots. Those four job slots only account for the four sub-tasks. The master job itself is not counted because the job_is_first_task attribute is false.

After our job completes, if we look in our home directory (which is where Grid Engine will put the output files since we didn't tell it otherwise), we will find four new files: rmi_job.sh.e253, rmi_job.sh.o253, rmi_job.sh.pe253, and rmi_job.sh.po253, assuming, of course, that our job was number 253. The *.o253 and *.e253 files should be familiar: they're the output and error streams from the job script. The *.po253 and *.pe253 files are new: they're the output and error streams from the parallel environment startup and shutdown scripts.

So, there you have it. A complete, top-to-bottom example of creating, configuring, and running a parallel environment.


Recommended Links


Sites

https://www.nbcr.net/pub/wiki/index.php?title=Sample_SGE_Script

http://www.it.uu.se/datordrift/maskinpark/albireo/gridengine.html

http://www.rbvi.ucsf.edu/Resources/sge/user_guide.html

