Posted on: August 3, 2018
User access to the compute nodes for running jobs is available only through a batch job. Virtual ROGER uses the SLURM (Simple Linux Utility for Resource Management) batch system for running batch jobs. The user supplies SLURM, through batch directives, with the number of compute tasks, the amount of memory, and the amount of time the code will require; SLURM then selects a compute node (or set of nodes) that can accommodate the job and launches it on the selected node(s) when they become available. To submit a job to the Virtual ROGER batch system, assemble the batch directives and the executable statements for your code into a text file (a batch script), then use that batch script as the argument to the SLURM command sbatch:
sbatch myBatchScript.batch
The batch script format and batch directives are explained below.
Two example batch scripts are shown below. The first runs an MPI-parallel model on 4 tasks:
#!/bin/tcsh
#SBATCH --job-name=mpitest
#SBATCH -n 4
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=2048
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --mail-user=seseuser@illinois.edu
set InputDir=/data/sesegroup/a/seseuser/inputs
mpirun -np 4 ./mpiModel test_setup.nml
The second runs a single-CPU program:
#!/bin/tcsh
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=2048
#SBATCH -n 1
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --mail-user=seseuser@illinois.edu
set RunDir=/data/sesegroup/a/seseuser/Model
ln -s /data/sesegroup/a/commonData/inputs .
./myanalyst parameters.nml
Your batch script has much in common with a shell (sh, csh, bash, or tcsh) script, and it should start with a shell line, such as:
#! /bin/tcsh
appropriate to run the program you are submitting to batch. Unlike a shell script, the next lines should contain batch directives, which start with:
#SBATCH
All batch directives must precede any executable line (a shell command or executable; in effect, any line whose first non-whitespace character is anything other than #) in order for SLURM to read them.
Wall clock time
The batch directive for wall clock time, the maximum amount of time you expect your code to run, is usually given as:
#SBATCH --time=h:mm:ss
where mm and ss each represent two digits, and h can be one, two, or three digits, such as
#SBATCH --time=127:59:00
Initially, you should set the wall clock limit about 10% longer than that required for your model's longest run on other platforms such as manabe to ensure that your code finishes successfully. If you are running your code for the first time anywhere, you may leave this directive off, which sets the wall clock limit to the maximum available on the default partition (currently 7 days, or 168 hours). Once you have successfully run several jobs with your model, we strongly recommend that you reduce your wall clock request to about one hour more than the longest of your successful jobs needed to complete, in order to help SLURM prioritize batch jobs correctly.
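For example (with a hypothetical run time), if the longest of your successful runs took about 10 hours 30 minutes, a request such as
#SBATCH --time=11:30:00
leaves roughly an hour of headroom while still helping SLURM schedule your job efficiently.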
Number of tasks
The batch directive for the number of tasks to run is:
#SBATCH -n NCPU
where NCPU is a positive integer.
MPI-parallel model users should set NCPU to the number of MPI tasks to be run for your model. Single-CPU model users should set NCPU to 1. When setting this batch directive for your parallel model, it is important to note that keeling is a small cluster compared to those at major supercomputer facilities, and we do not recommend using NCPU values larger than 36 (15% of the current cluster capacity) for any job longer than one hour. We reserve the right to remove any job that uses an excessive fraction of keeling's compute capability, especially if we are notified that users cannot get jobs through the queues.
Memory
The batch directive for the amount of memory per CPU in megabytes is:
#SBATCH --mem-per-cpu=MB
If you do not already know how much memory your model needs, first choose an upper limit appropriate for the node type you are using as the integer MB. On the nodes keeling-d04 through keeling-d17, which have 64 GB of RAM and 12 CPU cores each, 6000 (a little over the RAM size divided by the number of cores) is a reasonable initial MB value. Then, while the batch job is running, log into one of its compute nodes and monitor your model processes with the command top. Set MB somewhat larger than the size shown in the RES column (converted into MB if needed) for the largest task of your parallel model, or for the single process of your single-CPU model. Users who have run their models on manabe can also obtain memory requirements from the SGE command qacct -j jobID: its output includes one or more lines starting with ru_maxrss (maximum memory use per CPU core in KB), so divide the largest entry on these lines by 1024 to convert to megabytes, then round up.
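As a worked example with hypothetical numbers: if the largest ru_maxrss value reported by qacct -j jobID were 3145728 KB, then 3145728 / 1024 = 3072 MB, so a request rounded up to, say,
#SBATCH --mem-per-cpu=3200
would leave a small margin above the observed maximum.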
Additional directives
Other batch directives that you may want to use include:
#SBATCH --job-name=String
Gives the job a name, String, for you to monitor with the squeue command (the default is the file name of your batch script).
#SBATCH --mail-type=Event
With this directive, SLURM notifies you by e-mail when Event happens to your batch job. Event can be one of BEGIN, END, FAIL, REQUEUE, or ALL. You may use multiple lines if you want notices for only some of the possible Event codes, such as:
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
If you use this batch directive, we recommend that you also include the next batch directive in your script.
#SBATCH --mail-user=yourNetID@illinois.edu
Send e-mail from the batch system to the account yourNetID@illinois.edu.
#SBATCH --output=Filename or #SBATCH -o Filename
Redirect the batch standard output to file Filename. The SLURM default is to send both batch standard output and standard error to a file named slurm-{jobID}.out, with {jobID} being the integer SLURM job ID returned by the sbatch command when you submitted your job. While you may use %j to represent {jobID} and %N to represent {NodeName} (the name of the first node to be allocated to the job), this SLURM directive cannot provide separate standard output for each MPI rank of a parallel program.
#SBATCH --error=Filename or #SBATCH -e Filename
Redirect the batch standard error to file Filename. The SLURM default is to send both batch standard output and standard error to a file named slurm-{jobID}.out, with {jobID} being the integer SLURM job ID returned by the sbatch command when you submitted the job. Use %j to represent {jobID} and %N to represent {NodeName} in your file name.
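For example, the following pair of directives (with a hypothetical base name mpitest) writes separate standard output and standard error files tagged with the job ID:
#SBATCH --output=mpitest-%j.out
#SBATCH --error=mpitest-%j.err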
#SBATCH --export=ExportedVariables
Use this to export the environment variables ExportedVariables from your current shell environment to your batch job. Separate multiple environment variable names with commas, such as
#SBATCH --export=PATH,LD_LIBRARY_PATH
The string you supply for ExportedVariables can also be ALL (the SLURM default) or NONE. Unlike SGE or PBS (and compatible batch systems such as Torque), SLURM exports all variables from your environment as of submit time by default, so this directive is not needed as often as the others.
Executable statements
Any line starting with any non-whitespace character other than # is an executable statement, and we emphasize that no SLURM batch directives may appear after the first executable statement in a batch script. This part of your batch script should contain the shell and environment variable settings, commands such as ln or cp that retrieve input files into your work directory, and the one or more commands to run your model. For MPI users, this last line usually takes the form of an mpirun or mpiexec command such as
mpirun -np 4 ./myModel
The number of MPI tasks requested with the -np or -n option of mpirun or mpiexec should be the same as the number of tasks that you request from SLURM in its
#SBATCH -n
batch directive near the beginning of your batch script.
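As a minimal sketch (assuming a hypothetical 8-task MPI executable named myModel), the directive near the top of the script and the launch command in the executable section should use the same count:
#SBATCH -n 8
mpirun -np 8 ./myModel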
Most programs for which interactive batch would have been needed on manabe can now be run effectively on the keeling login node. The primary exception is an interactive program that must run in parallel. David has written a qlogin script that functions in the SLURM batch system the same way that qlogin on manabe did. qlogin accepts most of the same options on the command line that you would supply to sbatch in a batch script, with one exception: a qlogin session implies the option -N 1, which requires all of your tasks to run on the same compute node, so you cannot use the -N or --nodes= options to allocate more than one node.
Example:
To set up a 12-hour qlogin session on a d-partition compute node with 8 CPU cores and 6000 MB per core:
qlogin -p d -n 8 --time=12:00:00 --mem-per-cpu=6000
The squeue command displays the status of batch jobs. Some squeue options are given below.
squeue -a
Displays information about jobs and job steps in all partitions.
squeue -u <user_list>
Request jobs or job steps from a comma-separated list of users.
squeue -n <name_list>
Request jobs or job steps having one of the specified names.
squeue -j <job_id>
Gives detailed information on a particular job.
squeue --start
Report the expected start time and resources to be allocated for pending jobs in order of increasing start time.
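For example, to see the expected start times of only your own pending jobs (substitute your NetID for yourNetID):
squeue -u yourNetID --start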
To get information on the SLURM partitions and the nodes within them, such as how heavily loaded they are with batch jobs, use the command
sinfo
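For example, to restrict the listing to a single partition (here assuming the d partition used in the qlogin example above):
sinfo -p d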
For more information, please see the squeue documentation.
The scancel command is used to signal or cancel jobs or job steps that are under the control of SLURM. Some examples of the scancel command are given below.
scancel JobID
Cancel a job along with all of its steps.
scancel jobID_arrayID
Cancel only element arrayID of the job array.
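As an additional example, scancel can also cancel all of your own jobs at once (substitute your NetID for yourNetID):
scancel -u yourNetID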
For more information, please see scancel documentation.