Center for High Performance Computing (CHPC)

 

Background
In order to launch jobs on the compute nodes, you need to use the queuing system.  The easiest way to do this is to write a 'batch script', which serves as a template for running your applications.  The batch script contains two sections: the top section, where you specify the resources needed to run your job (e.g. the number of processors, the amount of time), and the bottom section, where you specify the commands the nodes will execute to run your job.

The top section contains lines that begin with #PBS:
 
#PBS -l nodes=2:ppn=8,walltime=8:00:00

In the above example I'm asking for 2 nodes with 8 processors per node (ppn), for a total of 16 processors, and for 8 hours of time (as measured by a clock on the wall).  Besides the resources, you can also assign a name to your job to help distinguish it from other jobs that you may be running:
 
#PBS -N gromacs_np16

To see a full list of options, see the man page for the qsub command ('man qsub').  The #PBS options are only used by the queuing system to determine how and when to run your job; the compute nodes ignore them as comments when the script is executed.  It is important that the #PBS options come before any commands in your shell script: once the queuing system encounters a command it stops looking for #PBS options, so any #PBS lines that appear after the first command will be ignored.
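Putting these pieces together, a minimal batch script might look like the sketch below.  The job name, resource request, program, and file names are placeholders you would replace with your own; $PBS_O_WORKDIR is set by the queuing system to the directory the job was submitted from.

#!/bin/bash
#PBS -N my_job
#PBS -l nodes=1:ppn=8,walltime=4:00:00

# Everything below runs on the compute node once the job starts.
cd $PBS_O_WORKDIR                      # start in the directory the job was submitted from
./my_program input.dat > output.log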
 
It usually takes a few iterations to arrive at a working batch script.  The most common problems are forgetting to copy all of the data the job needs into the run directory, requesting either too many resources (more CPUs than you actually use) or too few (not enough time), and not setting up a proper environment for your executables (e.g. not having LD_LIBRARY_PATH point to the dynamic libraries they need at run time).
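For the last case, the environment can be set up in the bottom section of the batch script before launching the program, for example (the library path and program name are placeholders):

export LD_LIBRARY_PATH=/path/to/needed/libs:$LD_LIBRARY_PATH
./my_program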
 
When things go wrong, the first thing to check is the output from the batch job.  This includes both the output (STDOUT) and the error messages (STDERR).  The queuing system captures this data and copies it back to the login nodes, into the directory you submitted the job from.  The files have the form JOBNAME.oJOBID for STDOUT and JOBNAME.eJOBID for STDERR.  If you haven't specified a JOBNAME, the name of the batch script is used instead.
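For example, for a job named gromacs_np16 that was assigned the numeric JOBID 12345 (an illustrative number), you would look in the submission directory for:

cat gromacs_np16.o12345    # standard output from the job
cat gromacs_np16.e12345    # error messages from the job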

Commands
Once you have a batch script, there are 3 essential commands that you need to know: qsub, qstat and qdel.  The first command, qsub, is for submitting the batch script.  As an argument, it only needs the name of the batch script:
 
qsub batch_script
 
as all the other options are embedded in the batch script via the #PBS flags.  Once you've submitted a job, qsub reports back the JOBID, which will be some number followed by .mgt.
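For example, submitting one of the example scripts below might look like this (the JOBID is illustrative):

qsub gromacs_batch.txt
12345.mgt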
 
Once the job is submitted, you'll want to view its status via 'qstat'.  'qstat' shows the status of jobs for all users, so you may want to filter the output for your username with 'qstat | grep username'.  The columns of output are:
 
Job id                    Name             User            Time Use S Queue
 
'S' stands for the job status:
Q - job is queued
R - job is running
E - job is exiting
C - job has completed
H - job is held (usually this means the job depends on another job)
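As an illustration, filtering for a hypothetical user jsmith might produce a line like this (the job, user, time, and queue are made up):

qstat | grep jsmith
12345.mgt                 gromacs_np16     jsmith          02:13:04 R pe16_iD_24hr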

If you feel that there is something wrong with your job, you may wish to delete it.  To do that, use the qdel command, which needs the JOBID of the job you wish to kill, i.e. 'qdel JOBID'.
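For example, to delete the illustrative job from above:

qdel 12345.mgt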
 
For more details about any of these commands, see their respective man pages (e.g. 'man qdel').

Examples
 
simple_batch.txt - About as simple as it gets.  It requests a single processor for 1 hour.  The commands just report the hostname of the node the job is running on and sleep for 60 seconds.  If you're running your own serial application, you might use this batch script as a template.
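The attached file isn't reproduced here, but based on the description above it would look roughly like this (the job name is a placeholder):

#!/bin/bash
#PBS -N simple
#PBS -l nodes=1:ppn=1,walltime=1:00:00

hostname      # report which node the job landed on
sleep 60      # stand in for 60 seconds of work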
 
freesurfer_batch.txt - This is for running the FreeSurfer application.  It assumes the job can finish in less than 24 hours and will use less than 2.5 GB of memory.  The batch script may be adjusted if more resources are needed.
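The attached script isn't shown here; a rough sketch based on the description, assuming a single processor and with the job name as a placeholder, might be:

#!/bin/bash
#PBS -N freesurfer
#PBS -l nodes=1:ppn=1,walltime=24:00:00

cd $PBS_O_WORKDIR
# The FreeSurfer commands for the analysis would go here.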
 
gromacs_batch.txt - This example runs Gromacs on 32 processors on the iDataPlex nodes.  To do this, I request 4 nodes with 8 processors per node (ppn).  If you're running your own parallel application, you may want to use this as a template.
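The attached file isn't reproduced here, but a sketch might look like the following.  Only the nodes=4:ppn=8 request comes from the description above; the walltime, the mpirun invocation, and the mdrun binary name are assumptions that depend on how MPI and Gromacs are installed.

#!/bin/bash
#PBS -N gromacs_np32
#PBS -l nodes=4:ppn=8,walltime=24:00:00

cd $PBS_O_WORKDIR
# Launch 32 MPI ranks across the 4 nodes; exact launcher flags depend on the local MPI.
mpirun -np 32 mdrun_mpi -s topol.tpr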

Queues
Our queues are designed to balance the types of jobs run on the system (big vs. small, short vs. long) and to ensure that all users have a reasonable opportunity to complete their jobs.
 
To make the resource limits explicit, the queues are named according to the following scheme:
CPULIMIT_ARCHITECTURE_TIMELIMIT
where CPULIMIT is the maximum number of cores and TIMELIMIT is a wallclock time limit as opposed to total CPU time.
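For example, pe8_iD_24hr is the queue for jobs that use at most 8 cores on the iDataPlex (iD) nodes, with a wallclock limit of 24 hours.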
 
Note that there is no need for users to submit to any specific production queue.  Users can continue to submit to the old queues, dque or dque_smp, depending on the architecture they want to run on.  Based on the resources requested (e.g. number of cores, wallclock time), jobs will be funnelled into the appropriate production queue.
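For example, a job submitted to dque that requests nodes=2:ppn=8 and 24 hours of walltime would presumably be routed to the pe16_iD_24hr queue listed below.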
 

iDataPlex Queues (max. queued jobs per-user: 4,096)

Queue Name       Max. Cores   Max. Walltime (hours)   Max. User Jobs   Default Memory (GB)
pe1_iD_4hr       1            4                       768              2.5
pe1_iD_24hr      1            24                      512              2.5
pe1_iD_48hr      1            48                      256              2.5
pe1_iD_1wk       1            168                     128              2.5
pe4_iD_24hr      4            24                      128              -
pe4_iD_48hr      4            48                      64               -
pe8_iD_4hr       8            4                       128              -
pe8_iD_24hr      8            24                      64               -
pe8_iD_48hr      8            48                      32               -
pe8_iD_1wk       8            168                     16               -
pe16_iD_24hr     16           24                      32               -
pe16_iD_1wk      16           168                     8                -
pe32_iD_24hr     32           24                      16               -
pe32_iD_1wk      32           168                     4                -
pe128_iD_24hr    128          24                      4                -
pe256_iD_48hr    256          48                      2                -

SMP Queues (max. queued jobs per-user: 1,024)

Queue Name       Max. Cores   Max. Walltime (hours)   Max. User Jobs   Default Memory (GB)
pe8_SMP_24hr     8            24                      32               -
pe8_SMP_1wk      8            168                     8                -
pe16_SMP_24hr    16           24                      16               -
pe64_SMP_1hr     64           1                       7                -
pe64_SMP_24hr    64           24                      6                -
pe64_SMP_1wk     64           168                     4                -
pe256_SMP_24hr   256          24                      1                -

MATLAB Queues (routing queue=matlab, max. queued jobs per-user: 512)

Queue Name       Max. Cores   Max. Walltime (hours)   Max. User Jobs   Default Memory (GB)
pe1_MAT_4hr      1            4                       128              2.5
pe1_MAT_24hr     1            24                      96               2.5
pe1_MAT_1wk      1            168                     64               2.5
pe8_MAT_24hr     8            24                      8                -
pe8_MAT_1wk      8            168                     2                -
pe32_MAT_24hr    32           24                      2                -

MATLAB SMP Queues (routing queue=matlab_smp, max. queued jobs per-user: 512)

Queue Name       Max. Cores   Max. Walltime (hours)   Max. User Jobs   Default Memory (GB)
pe1_MATLM_24hr   1            24                      64               -
pe8_MATLM_24hr   8            24                      8                -
pe32_MATLM_24hr  32           24                      2                -

GPU Queues (routing queue=dque_gpu)
NOTE: For the GPU queues, I use mem, not vmem.

Queue Name       Max. Cores   Max. Walltime (hours)   Max. User Jobs   Default Memory (GB)
pe2_GPU_4hr      2            4                       22               8
pe2_GPU_24hr     2            24                      20               8
pe2_GPU_1wk      2            168                     12               8
pe8_GPU_24hr     8            24                      3                -