Some basic information on using the Lancaster NW-GRID cluster
Description of the System
The Lancaster NW-Grid service consists of two clusters:
- lancs1, with 48 Sun Fire X4100 execution nodes, each with 2 dual-core AMD Opteron 2.6 GHz processors and 8G of memory.
- lancs2, with 68 Sun Fire X2200 execution nodes, each with 2 quad-core Barcelona 2.3 GHz processors and 16G or 32G of memory.
Both clusters support parallel jobs via MPI. Job submission is handled locally via the Sun Grid Engine, or through a Globus-accessible sge jobmanager.
Connecting to the systems
The machine names are lancs1.nw-grid.ac.uk and lancs2.nw-grid.ac.uk. Access to the clusters is only possible using standard grid tools such as those provided by the Globus Toolkit.
Unix/Linux users
For Unix/Linux systems, the Globus middleware can be installed on a local machine via the Virtual Data Toolkit (only the VDT-Client package is required). Instructions on how to install and use VDT are available on their website. Using the Globus Toolkit's gsissh, the lancs1 cluster frontend can be accessed with the following command:
gsissh -p 2222 -X lancs1.nw-grid.ac.uk
File transfer can be accomplished via the command gsiscp. To transfer a file onto the lancs1 cluster head node:
gsiscp -P 2222 myfile lancs1.nw-grid.ac.uk:myfile
To transfer a file from the same cluster head node to your local desktop:
gsiscp -P 2222 lancs1.nw-grid.ac.uk:myfile myfile
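The two transfer directions differ only in the order of the arguments, and gsiscp takes the port with an upper-case -P where gsissh uses a lower-case -p. As a minimal sketch, the helper functions below print (rather than run) the two gsiscp commands for a given file name, so the command line can be checked before use. The function and variable names are hypothetical and not part of the Globus Toolkit.

```shell
#!/bin/bash
# Hypothetical helpers: names are illustrative, not part of the Globus Toolkit.
NWGRID_PORT=2222                    # port used in the commands above
NWGRID_HOST=lancs1.nw-grid.ac.uk    # lancs1 head node

# Print the gsiscp command that copies a local file to the head node.
nwgrid_put_cmd() {
    local file="$1"
    printf 'gsiscp -P %s %s %s:%s\n' "$NWGRID_PORT" "$file" "$NWGRID_HOST" "$file"
}

# Print the gsiscp command that copies the file back to the local machine.
nwgrid_get_cmd() {
    local file="$1"
    printf 'gsiscp -P %s %s:%s %s\n' "$NWGRID_PORT" "$NWGRID_HOST" "$file" "$file"
}

nwgrid_put_cmd myfile
nwgrid_get_cmd myfile
```

Nothing is transferred by this sketch; copy the printed line into your terminal once it looks right.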
Windows users
Windows users can use the GSI-SSHterm application, available from the National Grid Service, both to access the cluster head node and to transfer files.
Modules
Each head node and execution node supports access to various software packages using the module command. To see the available modules, type module avail. To see a brief description of a module, type module whatis modulename. The command module add modulename will add the relevant module to your current environment, and allow access to the software.
Modules can also be added from within local batch jobs.
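As a sketch of adding a module inside a batch job, the snippet below writes a small job script that loads the pgi module before running a program. The file name my_module_job.com and the program name my_program are placeholders; the script layout follows the one explained in the batch job section of this page. The script is written to a scratch directory and only syntax-checked here, not submitted.

```shell
#!/bin/bash
# Placeholder names: my_module_job.com and my_program are illustrative only.
workdir=$(mktemp -d)
job="$workdir/my_module_job.com"

# The quoted 'EOF' keeps the job script's contents literal.
cat > "$job" <<'EOF'
#!/bin/bash
#$ -S /bin/bash
# /etc/profile makes the module command available inside the job.
. /etc/profile
# Load the PGI environment before running the program.
module add pgi
my_program
EOF

# bash -n parses the script without executing any of its commands.
bash -n "$job" && echo "job script written to $job"
```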
Compilers
The Lancaster nodes offer three compiler suites:
- The PGI compilers, available under the pgi module:
  - Available compilers are pgcc, pgCC, pgf77, pgf90 and pgf95
  - Recommended optimised compiler flags are "-fastsse -Mcache_align"
- The Intel compilers, available under the intel module:
  - Available compilers are icc and ifort
  - Recommended optimised compiler flags are "-xW -ipo -O3"
- The GNU compilers, available by default from the command line:
  - Available compilers are gcc, g++ and g77
  - Recommended optimisation flags are "-O3"
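The recommended flags above can be collected in one place. The sketch below is a hypothetical lookup function (compiler_flags is not a cluster-provided command) that returns the flags for a suite and prints an example compile line for each C compiler. The source file my_program.c is a placeholder, and the commands are printed, not run.

```shell
#!/bin/bash
# Hypothetical helper: returns the recommended flags listed above for a suite.
compiler_flags() {
    case "$1" in
        pgi)   echo "-fastsse -Mcache_align" ;;
        intel) echo "-xW -ipo -O3" ;;
        gnu)   echo "-O3" ;;
        *)     echo "unknown compiler suite: $1" >&2; return 1 ;;
    esac
}

# Print an example compile line for the C compiler of each suite.
for pair in pgi:pgcc intel:icc gnu:gcc; do
    suite="${pair%%:*}"    # text before the colon
    cc="${pair##*:}"       # text after the colon
    echo "$cc $(compiler_flags "$suite") -o my_program my_program.c"
done
```

Remember that the pgi and intel compilers are only on your path after the corresponding module add.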
Local batch job submission from the head node
Alongside the standard Globus job submission mechanisms, batch jobs can be submitted from a head node using the standard Sun Grid Engine job submission mechanisms. Batch jobs are run on the cluster by creating a batch job control script (or command file) and "submitting" it to the system using the command qsub, e.g.:
qsub my_program.com
Assuming that there is at least one job-slot free, the system will select an execution node on which to run your job. This ensures that the combined load of all users' jobs is spread evenly over the entire cluster. If no suitable slot is available at the time then the job will wait in a "pending" queue until one becomes free.
At present, the system uses a Fair Share scheduling strategy: users may submit any number of jobs, but priority will be given to those who are currently running fewer jobs. Please check the head nodes' message of the day for changes to scheduling.
Example of a batch job control script
#!/bin/bash
#$ -o $HOME/my_program_directory/my_program.stdout
#$ -e $HOME/my_program_directory/my_program.stderr
#$ -S /bin/bash
. /etc/profile
cd my_program_directory
time my_program < my_program.input
Explanation
Batch job scripts are simply standard shell scripts with extra lines (beginning with "#$") containing instructions for the scheduler.
The first line:
#!/bin/bash
specifies the shell which is to interpret this script. Leave this line exactly as shown unless you need a different shell to interpret your job.
The next two lines:
#$ -o $HOME/my_program_directory/my_program.stdout
#$ -e $HOME/my_program_directory/my_program.stderr
are SGE directives used to specify the destination for your job's standard output and standard error respectively. (You don't have to specify these files. If you don't, default standard output and standard error files will be created in your HPC home directory, with names based upon the job id and job name.)
This line:
#$ -S /bin/bash
instructs SGE to execute commands in this script according to the bash shell. This is the recommended shell for all batch jobs.
The next line:
. /etc/profile
sets up your environment to ensure that programs can find the relevant libraries, and that the modules system is available. This line should always be included in your job scripts.
The next line:
cd my_program_directory
specifies the current working directory for your job. Note that when your batch job starts, its current working directory will be your home directory and not the current directory of the interactive session from which you submit the batch job.
The last line:
time my_program < my_program.input
is the command to run your program. This is normally the same as the command you would type if you were running it interactively. In this example the command to run the program (my_program < my_program.input) is prefixed by the system command time. This causes a timing summary to be printed to the standard error file when the job finishes. The time command is not necessary for job scripts; it simply provides a useful summary of the length of time your program took to run.
Note that any standard input to the program (what you would type at the keyboard if you were running it interactively) must be put into a file, my_program.input in this case. The redirection operator, <, then makes the program read this file for its input.
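The effect of the redirection can be tried locally before writing a job script. In the sketch below, my_program is a hypothetical stand-in (a shell function that adds up the numbers it reads), and the input file is created in a scratch directory. Neither is part of the cluster software.

```shell
#!/bin/bash
# Stand-in for my_program (hypothetical): adds up the numbers it reads
# from standard input, one per line.
my_program() {
    awk '{ sum += $1 } END { print sum }'
}

workdir=$(mktemp -d)

# What you would otherwise type at the keyboard goes into the input file.
printf '3\n4\n' > "$workdir/my_program.input"

# The < operator makes the program read the file as its standard input.
my_program < "$workdir/my_program.input"
```

The function never sees the file name; it simply reads its standard input, which is how a real program behaves inside the batch job.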
Submitting large memory jobs
As each node can run multiple jobs, there is a risk that jobs with large memory requirements may oversubscribe memory on a node, leading to poor performance for all jobs. To prevent this, batch jobs which require more than 500M of memory must be submitted with a memory resource request, by adding the following options to the qsub command:
qsub -l mem_free=xG -l mem_token=xG myjob.com
Where x is a real value indicating the amount of memory required in gigabytes.
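Because the same value must appear in both resource requests, it can help to build the command from a single number. The function below is a hypothetical sketch (mem_qsub_cmd is not a cluster command) that prints the qsub line for a given memory size in gigabytes; it does not submit anything.

```shell
#!/bin/bash
# Hypothetical helper: prints a qsub command requesting the given amount
# of memory (in gigabytes) through both mem_free and mem_token.
mem_qsub_cmd() {
    local gigabytes="$1" script="$2"
    echo "qsub -l mem_free=${gigabytes}G -l mem_token=${gigabytes}G $script"
}

# Example: a job needing 2.5 gigabytes.
mem_qsub_cmd 2.5 myjob.com
```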
Local parallel job submission from the head node
Compiling for MPI
MPI codes are handled via different parallel environments on the two clusters: SCore on lancs1 and OpenMPI on lancs2. First, load the relevant module in order to access the MPI compilers and other tools:
module add score
or
module add openmpi/1.3.1-1/gcc
You can now compile your parallel application using the relevant compiler(s): mpicc, mpiCC, mpif77 or mpif90. By default, these MPI compilers will invoke the standard GNU compilers; compiler flags in this mode should therefore be those you would normally use for the GNU Compiler Collection.
For improved performance, the MPI compiler wrappers can be directed to use one of the other compiler suites. With SCore this can be achieved by adding the arguments -compiler pgi or -compiler intel to any of the compiler calls. For OpenMPI, separate modules are provided for the Intel and PGI compilers.
Please note: You must ensure that the MPI compilers have access to the relevant compiler suite by ensuring that its module is loaded, e.g.:
module add pgi
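Putting the steps for lancs1 together, the sketch below prints the sequence of commands for building an MPI C program with SCore using a chosen compiler suite. The function name score_build_steps and the file names myapp.c and myapp are hypothetical. The commands are printed for inspection, not executed, and optimisation flags are left out for brevity.

```shell
#!/bin/bash
# Hypothetical helper: prints the build steps for an MPI C program on lancs1.
score_build_steps() {
    local suite="$1" src="$2" out="$3"
    case "$suite" in
        gnu)
            # Default: the wrapper invokes the GNU compiler directly.
            echo "module add score"
            echo "mpicc -o $out $src"
            ;;
        pgi|intel)
            # Other suites: load the suite's module, then pass -compiler.
            echo "module add score"
            echo "module add $suite"
            echo "mpicc -compiler $suite -o $out $src"
            ;;
        *)
            echo "unknown compiler suite: $suite" >&2
            return 1
            ;;
    esac
}

score_build_steps pgi myapp.c myapp
```

On lancs2 the -compiler argument does not apply; there the choice of compiler is made by which openmpi module is loaded.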
Submitting MPI jobs
To launch a parallel MPI job, DO NOT write your own job submission script. Instead, run the mpisub (lancs1) or ompisub (lancs2) command:
mpisub nxm myexecutable
ompisub nxm myexecutable
Where n is the number of execution nodes you want the job to run on, m is the number of CPUs on each node to use (this number should normally be four), and myexecutable is the name of the SCore-compiled (lancs1) or OpenMPI-compiled (lancs2) application you wish to run.
E.g., to run the application myapp on 4 nodes, each using 4 CPUs (i.e., a total of 16 CPUs), enter the following:
mpisub 4x4 myapp
mpisub will automatically generate a script for your executable and submit it to the queue. The output files will appear in your current working directory.
Please Note: The MPI environments work on a node-booking system; a parallel job cannot share a node with other serial or parallel jobs. So, if you run mpisub with an m value of less than the total number of processors on each node, the remaining CPUs on each node will be effectively blocked from use by others. Use smaller m values only when necessary!
Please Note 2: On lancs1, the Inter Process Communication fabric for SCore jobs runs only between nodes on the same physical rack; the job scheduler will choose an appropriate rack for you.
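To see how an nxm request translates into CPUs used and CPUs blocked, the sketch below does the arithmetic for a given number of cores per node. The function name is hypothetical; the figure of 4 cores per node used in the examples follows from the lancs1 description above (2 dual-core processors per node).

```shell
#!/bin/bash
# Hypothetical helper: splits an "nxm" request and reports how many CPUs
# the job uses and how many are left idle (blocked) on the booked nodes.
mpi_request_summary() {
    local spec="$1" cores_per_node="$2"
    local n="${spec%%x*}"    # number of nodes (text before the x)
    local m="${spec##*x}"    # CPUs used per node (text after the x)
    local used=$(( n * m ))
    local idle=$(( n * (cores_per_node - m) ))
    echo "nodes=$n cpus_used=$used cpus_idle=$idle"
}

mpi_request_summary 4x4 4    # nodes=4 cpus_used=16 cpus_idle=0
mpi_request_summary 4x2 4    # nodes=4 cpus_used=8 cpus_idle=8
```

The second request uses half the CPUs of the first but books the same four nodes, which is why smaller m values should be used only when necessary.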
File redirection with MPI jobs
If you need to redirect standard input or output for a job, what may seem to be the obvious approach will not work:
mpisub nxm myprogram < myinput
With this command the shell connects myinput to the standard input of mpisub itself, not of myprogram. Instead, you need to protect the file redirection from the shell, so that it is passed as an argument to mpisub to be appended to its call to myprogram:
mpisub nxm myprogram "< myinput"
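The difference between the two forms can be seen with a stand-in for mpisub that simply reports the arguments it receives. The function fake_mpisub below is hypothetical and does not submit anything; it only shows what the shell passes on in each case.

```shell
#!/bin/bash
# Hypothetical stand-in for mpisub: reports the arguments it receives.
fake_mpisub() {
    echo "received $# arguments"
    for arg in "$@"; do
        echo "  [$arg]"
    done
}

input=$(mktemp)    # empty scratch file standing in for myinput

# Unquoted: the shell consumes the redirection, so only two arguments arrive.
fake_mpisub 2x4 myprogram < "$input"

# Quoted: the redirection text arrives intact as a third argument.
fake_mpisub 2x4 myprogram "< myinput"

rm -f "$input"
```

In the quoted form the string "< myinput" reaches the command as ordinary text, which is what allows mpisub to append it to its own call to myprogram.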