Using the Sun Grid Engine on lv2.nw-grid.ac.uk


Cliff Addison 18/09/07

Users can only run jobs on the compute nodes of the cluster by going through the Sun Grid Engine (SGE) job submission system.

The current version of SGE used on lv2 is SGE 6.1.

The compute nodes on lv2 all have dual Xeon processors: 62 nodes have 2.4 GHz processors and 32 have 3 GHz processors. Allocations are made first from the 3 GHz nodes and then from the other nodes. At present, users can request resources for parallel (i.e. MPI) jobs in one of two ways: by specifying the total number of processors required or the total number of nodes. (This arrangement can be changed early on if there are strong user objections to it.) If an odd number of processors is specified, the user risks having another user share one of the allocated nodes.
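Because every compute node has two processors, the link between a processor request and the nodes it occupies is simple arithmetic. The following is a minimal sketch only: the function name nodes_for_procs is hypothetical, and it assumes the scheduler fills each node before starting on the next, which the text implies but does not state.

```shell
#!/bin/sh
# Hypothetical helper: how many dual-processor nodes does a request span?
PROCS_PER_NODE=2   # from the text: all lv2 compute nodes are dual-processor

nodes_for_procs() {
    procs=$1
    # Round up: 7 processors still touch 4 nodes.
    nodes=$(( (procs + PROCS_PER_NODE - 1) / PROCS_PER_NODE ))
    echo "$nodes"
    if [ $(( procs % PROCS_PER_NODE )) -ne 0 ]; then
        echo "warning: $procs is odd; one node may be shared with another user" >&2
    fi
}

nodes_for_procs 8
nodes_for_procs 7
```

Both calls print 4; the second also writes the sharing warning to standard error.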

The remainder of this note is similar to the companion note that describes using the Sun Grid Engine on lv1.nw-grid.ac.uk.

If a user wanted to run the scalar code "bench-s" on a compute node, then an SGE job script called benchrun might look something like:

#!/bin/sh
#$ -cwd
# Scalar benchmark
date
echo "This code is running on " ; hostname
/users/nrcb/sge/bench-s

[nrcb@thermo]$ qsub ./benchrun

The output file would be, for example, benchrun.o148 which looks like this:

Sun Feb 11 07:12:37 GMT 2001
This code is running on
vm0.streamline.com
Memory required = 0 Mbytes
Working out sensible value of Mflops for this cpu
Bench with 327.680 Mflops
Min time per test = 1.01000
Starting benchmark #1
===================================
RAW CPU RATE = 321.25 Mflops

Other simple examples of submitting serial or parallel jobs can be found in /usr/local/examples. Included are some (largely Fortran 90) codes to compile and some job scripts to use to submit jobs. The README file steers you through the illustrative examples.

When a job is submitted to the SGE, it is queued and a basic check is made that the resources required (e.g. number of nodes) are available. Once the job starts execution, the job script is executed on the first (possibly only) compute node. Information about special environment variables, where the executable and input files are located and where the output files should be located all needs to be passed to this compute node.

Without any additional information, the default environment is that specified via the user's .bashrc file, and any relative path for executables, files etc. is rooted at the user's home directory. There are also default names for the standard output and standard error files: typically <name of job script>.oNNNN for standard output and <name of job script>.eNNNN for standard error, where NNNN is the job number assigned to the job when it is submitted.

A slightly more sophisticated example of a serial job script is:

#!/bin/sh

# Specify where standard error and output should be sent

#$ -e sequential_errors
#$ -o sequential_output

# Use the current working directory as the "root" directory

#$ -cwd

# Export the current environment to the compute node
# executing this job script.

#$ -V

# Actually do something!

sequential_pi < infile

A range of other options can either be specified in the job script file or in the qsub command line itself. The most useful of these is specifying a maximum time limit on your job that is shorter than the queue default time (currently 120 hours).

For instance, to specify a maximum time of 12 and a half hours, in the job script file include the line:

#$ -l h_rt=12:30:00

To specify this limit from the qsub command line, you could use:

qsub -l h_rt=12:30:00 myjobscript
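The h_rt value is written as hours:minutes:seconds. As a small illustration, the sketch below converts a run time given in minutes into that form; h_rt_from_minutes is a hypothetical helper name, not part of SGE.

```shell
#!/bin/sh
# Hypothetical helper: turn a run time in minutes into the hh:mm:ss
# form used by the h_rt resource request shown above.
h_rt_from_minutes() {
    minutes=$1
    printf '%d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

limit=$(h_rt_from_minutes 750)      # 750 minutes = 12.5 hours
echo "#\$ -l h_rt=$limit"           # line to place in a job script
echo "qsub -l h_rt=$limit myjobscript"
```

For 750 minutes this reproduces the two forms given above, with h_rt=12:30:00.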

As mentioned above, when a job script is submitted, it will be issued a job number. This number provides the user with a way to trace the job through the system and to terminate the job if something has gone wrong.
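If the job number is needed in a script, it can be picked out of the message printed at submission time. The sketch below works on a sample message rather than a live submission. The message text is an assumption about what qsub prints on success and should be checked against your own system; job_number is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed form of the message qsub prints on success; check it against
# what your own qsub prints before relying on this pattern.
sample='Your job 148 ("benchrun") has been submitted'

job_number() {
    # Print the run of digits that follows "Your job ", or nothing
    # at all if the message does not have that form.
    echo "$1" | sed -n 's/^Your job \([0-9][0-9]*\).*/\1/p'
}

script=benchrun
jobid=$(job_number "$sample")
echo "job number:      $jobid"
echo "standard output: $script.o$jobid"
echo "standard error:  $script.e$jobid"
```

In real use the sample string would be replaced by the captured output of the qsub command. The last two lines show how the default output file names described earlier follow from the script name and the job number.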

The simplest way to check the status of your jobs in the system is to use the qstat command without any arguments. NOTE - this is a change from earlier versions of SGE.

Therefore, the following are equivalent for user globus:

qstat
qstat -u globus
or
qstat -u $USER

For example, a serial job run as user globus shows:

[globus@ulgbc4 examples]$ qstat

job-ID prior name user state submit/start at queue slots ja-task-ID


Unfortunately, this information has lines that are something like 114 characters long, so they are wrapped and can be rather difficult to interpret on a normal window. However, you will always be able to identify things such as the job-id, the first part of the job script name, the user name and the state of the job.

The state field indicates the progress of the job. Initially, the status will be 'qw' indicating that the job is queued and waiting for sufficient resources to become available. The status will then change to 't' indicating that the job is being transferred to the compute node and finally the status will change to 'r' when the program is running. When the job has completed, it will disappear from the queue list. An 'E' in the status field signals that an error has occurred. A few likely causes are:

Accidentally submitting a job from another user's directory (that contains the executable of interest) is a common reason for jobs failing.
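The state codes described above can be summarised in a small lookup. This is only a sketch covering the codes mentioned in this note; describe_state is a hypothetical name, and any other code is reported as not covered rather than guessed at.

```shell
#!/bin/sh
# Map the state codes described above to a short explanation.
# Only the codes mentioned in this note are handled.
describe_state() {
    case "$1" in
        qw)  echo "queued, waiting for resources" ;;
        t)   echo "being transferred to the compute node" ;;
        r)   echo "running" ;;
        *E*) echo "an error has occurred" ;;
        *)   echo "state not covered in this note" ;;
    esac
}

for s in qw t r E; do
    echo "$s: $(describe_state "$s")"
done
```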

The command:

qstat -u "*"

lists all of the running and queued jobs. This list can be long and your job information might get overlooked because of all of the other jobs present.

The command:

qstat -j NNNN

where NNNN is a job number, provides far more information about the job than a user would probably be interested in.

If something is wrong with a job, it can be removed (deleted) from the SGE system with the command:

qdel NNNN

There are other options for qstat and qdel. qstat -help gives a complete set of options for qstat, qdel -help gives a complete set of options for qdel. Man pages are also available for either (man qstat, man qdel respectively).

One other helpful qstat option is:

qstat -g c

This provides a snapshot of the current state of all of the job queues on the system and merits some explanation.

[caddison@ulgbc4 Test2]$ qstat -g c

CLUSTER QUEUE   CQLOAD   USED  AVAIL  TOTAL  aoACDS  cdsuE
nodes             0.00     32     52     93       0      9
processors        0.00      0    104    186      64     18

The three columns that a user would think they immediately understand are USED, AVAIL and TOTAL.

Thirty-two nodes are in use with a job running on the nodes queue, which leaves 52 nodes available for other jobs, for a total of 84. However, the total number of nodes available is listed as 93. The other 9 nodes were down at the time the job was run, which is indicated by the last column.

Similarly, there are 104 processors available on the processors queue, out of a total of 186 processors. The last column indicates that there are 18 processors unavailable because their nodes are down. These numbers make sense if you remember that there are two processors per node on this system.

The second to last column lists those slots that are not available to a queue for "automatic" reasons. For instance, a node running a job on the nodes queue will appear as (S)ubordinated for jobs on the processors queue and will not be available. Similarly, if the load on a node is too high, a temporary (a)larm state will be noted and further jobs cannot use that node until the load has dropped below some predefined threshold. A slot might also be listed under this column if it is under calendar control (only active during certain times of the day or week).
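The column descriptions can be cross-checked against the sample output. The sketch below assumes awk is available and uses the two data rows from the example above. For each queue it adds USED, AVAIL and the last two columns, then compares the sum with TOTAL. That the four figures add up to TOTAL is an observation about this example, not a documented guarantee of SGE; check_totals is a hypothetical name.

```shell
#!/bin/sh
# Sample rows copied from the qstat -g c output shown above.
# Columns: queue CQLOAD USED AVAIL TOTAL aoACDS cdsuE
summary='nodes 0.00 32 52 93 0 9
processors 0.00 0 104 186 64 18'

check_totals() {
    # For each row, compare USED + AVAIL + aoACDS + cdsuE against TOTAL.
    awk '{
        sum = $3 + $4 + $6 + $7
        status = (sum == $5) ? "ok" : "mismatch"
        printf "%s: %d + %d + %d + %d = %d (TOTAL %d) %s\n",
               $1, $3, $4, $6, $7, sum, $5, status
    }'
}

echo "$summary" | check_totals
```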

In our example, on the processors queue, 64 processors are listed as "not available for automatic reasons": these are the two processors on each of the 32 nodes in use on the nodes queue. In general, if a node is part of a job running on the nodes queue, it cannot be allocated to a job running under the processors queue and vice versa.

Parallel jobs and the SGE

The easiest way to submit a parallel job is with the mpisub command, which has the basic command syntax:

More detailed syntax information can be obtained by typing mpisub at the command prompt.

If your batch executable reads input from a file as if it was standard input, then the corresponding mpisub command line would look like:

mpisub Nx2 myexec [myexec args] " < mystdin_file"

In other words, the left angle bracket and the name of the input file must be enclosed by double quotation marks and this string must come after any arguments for the executable.
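The reason for the quotation marks can be seen with a short experiment. In the sketch below, show_args is a hypothetical stand-in for mpisub that only reports the arguments it receives, and 4x2 is just an illustrative instance of Nx2. It shows what the submitting shell does with the angle bracket and makes no claim about how mpisub itself is implemented.

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for mpisub: it only reports the
# arguments it receives, so the effect of the quotation marks is visible.
show_args() {
    echo "received $# argument(s)"
    for a in "$@"; do
        echo "  [$a]"
    done
}

printf 'some input\n' > mystdin_file   # scratch input file for the demo

# Quoted: the redirection text reaches the command as its last argument,
# so the command can pass it on to the job it generates.
show_args 4x2 myexec arg1 " < mystdin_file"

# Unquoted: the submitting shell performs the redirection itself and the
# command never sees the file name.
show_args 4x2 myexec arg1 < mystdin_file

rm -f mystdin_file
```

The quoted call reports four arguments, the last being the redirection string; the unquoted call reports only three.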

mpisub automatically generates a qsub script and then submits it to the SGE. The job number for this job will be displayed, and you can use the qstat commands mentioned earlier to track its progress or use qdel to delete the job should the need arise.

There are some sample jobs in /usr/local/examples that work with mpisub. Consult /usr/local/examples/README for details.

There are also examples in that directory that show how users can create their own job script files. This can be useful if a script is to be used repeatedly or there are particular reasons why mpisub cannot be used. The important thing to realize is that there are two parallel environments that can be used: nodes-pe, which targets the nodes queue and expects the number specified to be the number of nodes required, and processors-pe, which expects the number specified to be the number of processors required.

The script /usr/local/examples/qsub_pi_redirect_inout.sh submits an 8 processor job to the system. The error and output files are redirected to files specified by the user. Notice that the executable pi_redirect (source file pi_redirect.f90) accepts an argument that specifies the input file to use.

This submit script is reproduced here:

#!/bin/sh

# Specify where standard error and output should be sent

#$ -e errors_pi2a
#$ -o output_pi2a

#$ -cwd
#$ -V
#$ -j y

# We will use 8 processors on 4 nodes

#$ -pe processors-pe 8

date_start=`date +%s`

# Clear out any old data in the output files

> errors_pi2a
> output_pi2a

# Specify executable (relative or absolute path) and any arguments.
# (Quotation marks not needed if the executable has no arguments.)

EXEC="pi_redirect infile"

# NSLOTS is a built-in environment variable defined from the
# number provided in the -pe specification. With SCore, we want
# the number of nodes to be one less than this value.

# The critical command to run the parallel job

mpirun -np $NSLOTS $EXEC

Note:

  1. The modules necessary to specify a compiler (GNU, Intel or PGI) and the corresponding MPI installation must be loaded before a qsub command is typed (the defaults from your .bashrc file may not be the ones needed).
  2. A parallel environment must be specified via the -pe line, either in the script or in the command line, and must target one of nodes-pe or processors-pe.
  3. The line #$ -cwd specifies that the current working directory should be used as the location for executables and files.
  4. The line #$ -V exports your environment variables to the qsub environment.
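The script as reproduced records date_start but does not show it being used. One plausible use, given here only as a hypothetical continuation and not as part of the original script, is to report the elapsed wall-clock time after the parallel run. In this sketch a sleep stands in for the mpirun line, and date +%s is assumed to be supported on the system.

```shell
#!/bin/sh
# Hypothetical continuation of the timing started by date_start.
date_start=`date +%s`

sleep 1        # stands in for the mpirun line of the real script

date_end=`date +%s`
elapsed=$(( date_end - date_start ))
echo "Elapsed wall-clock time: $elapsed seconds"
```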

Further information

There are man pages for the SGE commands. In addition, the SGE User's Guide can be found in PDF form at /usr/local/examples/SGE61-UserGuide.pdf.

If you have access to X-windows, then this file can be viewed directly on your desktop using:

acroread /usr/local/examples/SGE61-UserGuide.pdf

Otherwise, you will need to copy this file to your local computer and view it from there.

The SGE documentation suffers from the all-too-common problem of detailing the trees without describing the forest. It can therefore be somewhat intimidating to the beginner. If you do have questions or comments about SGE, please do not hesitate to contact Computing Services (usually via the helpdesk). Please specify Sun Grid Engine and ulgbc3 in your request (as the SGE configuration differs between systems).

LivSge2 (last edited 2009-02-12 15:55:30 by RobAllan)
