Queue System & Running Jobs

Posted on: August 3, 2018

Submitting Jobs with the Queuing System

User access to the compute nodes for running jobs is available only through batch jobs. Virtual ROGER uses the SLURM (Simple Linux Utility for Resource Management) batch system to run batch jobs. A Virtual ROGER user provides SLURM, through batch directives, with the number of compute tasks, the amount of memory, and the amount of time the code will require; SLURM then selects a compute node (or set of nodes) that can accommodate the job and launches it on the selected node(s) when they become available. To submit a job to the Virtual ROGER batch system, assemble the batch directives and the executable statements for your code into a text file (a batch script), then use that batch script as the argument to the SLURM command sbatch:

sbatch myBatchScript.batch
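If the job is accepted, sbatch prints the job ID that SLURM assigned to it, for example (the job ID shown here is only illustrative):

Submitted batch job 123456

You will use this job ID with the squeue and scancel commands described later on this page.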

The batch script format and batch directives are explained below.

Important

To ensure the health of the batch system and scheduler, users should refrain from having more than 500 of their batch jobs in the queues at any one time.


Sample batch scripts for Virtual ROGER

MPI parallel

#!/bin/tcsh
 
#SBATCH --job-name=mpitest
#SBATCH -n 4
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=2048
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --mail-user=seseuser@illinois.edu
 
set InputDir=/data/sesegroup/a/seseuser/inputs
 
mpirun -np 4 ./mpiModel test_setup.nml

single-CPU

#!/bin/tcsh
 
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=2048
#SBATCH -n 1
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --mail-user=seseuser@illinois.edu
 
set RunDir=/data/sesegroup/a/seseuser/Model
ln -s /data/sesegroup/a/commonData/inputs .
 
./myanalyst parameters.nml

Explanation

Your batch script has much in common with a shell (sh, csh, bash, or tcsh) script, and it should start with a shell line, such as:

#! /bin/tcsh

appropriate to run the program you are submitting to batch. Unlike a shell script, the next lines should contain batch directives, which start with:

#SBATCH

All batch directives must precede any executable line (a shell command or executable; effectively, any line whose first non-whitespace character is anything other than #) in order for SLURM to read them.

Wall clock time

The batch directive for wall clock time, the maximum amount of time you expect your code to run, is usually given as:

#SBATCH --time=h:mm:ss

where mm and ss are each two digits and h can be one, two, or three digits, for example:

#SBATCH --time=127:59:00

Initially, you should set the wall clock limit about 10% longer than the time required for your model's longest run on other platforms such as manabe, to ensure that your code finishes successfully. If you are running your code for the first time anywhere, you may leave this directive off, which produces a wall clock time limit equal to the maximum available on the default partition (currently 7 days, or 168 hours). Once you have successfully run several jobs with your model, we strongly recommend that you reduce your wall clock limit request to perhaps one hour above the time that the longest of your successful jobs needed to complete, in order to help SLURM prioritize batch jobs correctly.
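If SLURM job accounting is enabled on the cluster (an assumption; the command below will not return useful output if it is not), you can look up how long a completed job actually ran, and how much memory it used, with the sacct command, for example:

sacct -j 123456 --format=JobID,Elapsed,MaxRSS

Here 123456 stands for your own job ID; the Elapsed column is a good starting point for choosing a tighter --time request.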

Number of tasks

The batch directive for the number of tasks to run is:

#SBATCH -n NCPU

where NCPU is a positive integer.

MPI-parallel model users should set NCPU to the number of MPI tasks your model will run. Single-CPU model users should set NCPU to 1. When setting this batch directive for a parallel model, keep in mind that keeling is a small cluster compared to those at major supercomputer facilities, and we do not recommend NCPU values larger than 36 (15% of the current cluster capacity) for any job longer than one hour. We reserve the right to remove any job that uses an excessive fraction of keeling's compute capability, especially if we are notified that users cannot get jobs through the queues.

Memory

The batch directive for the amount of memory per CPU in megabytes is:

#SBATCH --mem-per-cpu=MB

To determine the integer MB for your model if you do not already know how much memory the model will need, first set an upper limit appropriate for the node type you are using. On the nodes keeling-d04 through keeling-d17, which have 64 GB RAM and 12 CPU cores each, 6000 (a little over the RAM size divided by the number of cores) is a reasonable initial MB value. Then, while the batch job is running, log into one of its compute nodes and monitor your model processes with the command top. The MB value should be set somewhat larger than the size shown in the RES column (converted into MB if needed) for the largest thread of your parallel model, or for the single process of your single-CPU model. Users who have run their models on manabe can also obtain information on memory requirements from the SGE command qacct -j jobID: its output includes one or more lines starting with ru_maxrss (maximum memory use per CPU core in KB), so divide the largest entry on those lines by 1024 to convert to megabytes, then round up.
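As a sketch of the top procedure described above (keeling-d05 is just an example node from the range mentioned above, and yourNetID is a placeholder):

ssh keeling-d05
top -u yourNetID

If the largest of your processes shows roughly 3.5g (about 3584 MB) in the RES column, setting --mem-per-cpu=4000 would leave a reasonable margin.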

Additional directives

Other batch directives that you may want to use include:

#SBATCH --job-name=String

Gives the job a name, String, for you to monitor with the squeue command (the default is the file name of your batch script).

#SBATCH --mail-type=Event

With this directive, SLURM notifies you by e-mail when Event happens to your batch job. Event can be one of BEGIN, END, FAIL, REQUEUE, or ALL. You may use multiple lines if you want notices for only some of the possible Event codes, such as:

#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

If you use this batch directive, we recommend that you also include the next batch directive in your script.

#SBATCH --mail-user=yourNetID@illinois.edu

Send e-mail from the batch system to the account yourNetID@illinois.edu.

#SBATCH --output=Filename
or
#SBATCH -o Filename

Redirect the batch standard output to file Filename. The SLURM default is to send both batch standard output and standard error to a file named slurm-{jobID}.out, with {jobID} being the integer SLURM job ID returned by the sbatch command when you submitted your job. While you may use %j to represent {jobID} and %N to represent {NodeName} (the name of the first node to be allocated to the job), this SLURM directive cannot provide separate standard output for each MPI rank of a parallel program.

#SBATCH --error=Filename
or
#SBATCH -e Filename

Redirect the batch standard error to file Filename. The SLURM default is the same slurm-{jobID}.out file described above for --output. Use %j to represent {jobID} and %N to represent {NodeName} in your file name.
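For example, to keep standard output and standard error in separate files named after the job (the mpitest prefix is just an example):

#SBATCH --output=mpitest-%j.out
#SBATCH --error=mpitest-%j.err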

#SBATCH --export=ExportedVariables

Use this to export the environment variables ExportedVariables from your current shell environment to your batch job. Separate multiple environment variable names with commas, such as

#SBATCH --export=PATH,LD_LIBRARY_PATH

The string you supply for ExportedVariables can also be ALL (the default in SLURM) or NONE. Unlike SGE or PBS (and compatible batch systems such as Torque), SLURM exports all variables from your environment as of submit time by default, so this directive is not needed as often as the other directives.

Executable statements

Any line starting with any non-whitespace character other than # is an executable statement, and we emphasize that no SLURM batch directives may appear after the first executable statement in a batch script. This part of your batch script should contain the shell and environment variable settings, commands such as ln or cp that retrieve input files into your work directory, and the one or more commands to run your model. For MPI users, this last line usually takes the form of an mpirun or mpiexec command such as

mpirun -np 4 ./myModel

The number of MPI tasks requested with the -np or -n option of mpirun or mpiexec should be the same as the number of tasks that you request from SLURM in its 

#SBATCH -n

batch directive near the beginning of your batch script.


Interactive Batch

Most programs for which interactive batch would have been needed on manabe can now be run effectively on the keeling login node. The primary exception is when you need to run an interactive program in parallel. David has written a qlogin script that functions in the SLURM batch system the same way that qlogin on manabe did. qlogin accepts most of the same command-line options that you would use in a batch script given to sbatch, with one exception: a qlogin session implies the option -N 1, which requires all of your tasks to run on the same compute node, so it cannot use the -N or --nodes= options to allocate more than one node.

Example:

To set up a 12-hour qlogin session on a d-partition compute node with 8 CPU cores and 6000 MB per core:

qlogin -p d -n 8 --time=12:00:00 --mem-per-cpu=6000

 


How to Check Your Job Status

The squeue command displays the status of batch jobs. Some squeue options are given below.

squeue -a

  Displays information about jobs and job steps in all partitions.

squeue -u <user_list>

  Requests jobs or job steps from a comma-separated list of users.

squeue -n <name_list>

  Requests jobs or job steps having one of the specified names.

squeue -j <job_id>

  Gives detailed information on a particular job.

squeue --start

  Reports the expected start time and resources to be allocated for pending jobs, in order of increasing start time.
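These options can be combined; for instance, to see only your own currently running jobs (yourNetID is a placeholder):

squeue -u yourNetID -t RUNNING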

To get information on the SLURM partitions and the nodes within them, such as how heavily loaded they are with batch jobs, use the command

sinfo

For more information, please see the squeue documentation.


How to Cancel a Running/Queued Job

The scancel command is used to signal jobs or job steps that are under the control of SLURM. Some examples of the scancel command are given below.

scancel jobID

  Cancels a job along with all of its steps.

scancel jobID_arrayID

  Cancels only task arrayID of the job array.

You only need to use the numeric part of the job ID here.
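scancel can also act on several jobs at once; for example, to cancel all of your own jobs (yourNetID is a placeholder):

scancel -u yourNetID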

For more information, please see the scancel documentation.