1. Find the job ID of your running job. The job number is the number in the first column (JOBID column).
Note that the time reported here is the wall clock time your job has been running. What is reported with just the regular squeue command is the used CPU time.
The second line of the output includes the node that the job is running on.
This means the job is running on 4 compute nodes, as listed in the NODELIST column.
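If you would rather not read the table by eye, the following is a minimal sketch of pulling the node list for one job out of squeue output. It relies on squeue's -u and -o options; the function names list_my_jobs and nodes_for_job are made up for this illustration and are not part of Slurm.

```shell
# List your own jobs with a fixed set of columns.
# Format codes: %i = job ID, %M = time used, %D = node count, %N = node list.
list_my_jobs() {
    squeue -u "$USER" -o "%i %M %D %N"
}

# Print the node list for one job ID, given output in the format above.
# Reads from stdin, so it also works on squeue output saved to a file.
# The first line (the header) is skipped.
nodes_for_job() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $4 }'
}
```

On the login node, this could be used as list_my_jobs | nodes_for_job 123456, where 123456 is a placeholder for your own job ID. A job that has not started yet has no nodes assigned, so nothing is printed for it.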
2. While your job is running, you can ssh from the login node to the node it is running on. So, if your job is running on keeling-a05, as shown in the image above, you can enter
ssh keeling-a05
Investigate the node status with top
Now that you’re on the node your job is running on, take a look at the current CPU and RAM usage with top
Verify that the expected number of processes is running.
Don’t be too alarmed by a lower-than-expected total % CPU usage: Linux sees 40 logical cores because of Hyper-Threading, so fully loading the 20 physical cores shows up as about 50% CPU while still using most of the CPU’s performance. Note that the batch nodes will have Hyper-Threading turned off in the future, and so 20 cores fully used would then show 100% CPU usage.
Note the share of CPU usage between the different types. See this StackOverflow post for a list of what they mean. In particular, wa means time spent waiting on input/output, which you may be able to reduce by changing the analysis script.
Check to make sure there is sufficient free memory
Example of top result:
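To capture the same information without the interactive display, top can be run once in batch mode. The sketch below assumes the procps-ng version of top, whose summary line starts with %Cpu(s) and lists values as "3.5 wa,"; older versions format this line differently. The function names node_snapshot and io_wait_percent are invented for this example.

```shell
# Print a one-off snapshot of the node: the top summary, the busiest
# processes, and memory usage in human-readable units.
node_snapshot() {
    top -b -n 1 | head -n 15   # -b: batch mode, -n 1: a single iteration
    free -h
}

# Extract the I/O wait percentage (the "wa" value) from top's summary line.
# Reads top output from stdin and assumes the "%Cpu(s): ... 3.5 wa, ..." layout.
io_wait_percent() {
    awk 'index($0, "%Cpu(s)") == 1 {
        for (i = 2; i <= NF; i++)
            if ($i == "wa,") print $(i - 1)
    }'
}
```

On the compute node, node_snapshot prints the snapshot, and top -b -n 1 | io_wait_percent prints only the wa value, which is convenient for appending to a log file.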
Investigate the temporary file storage
Now that you’re on the node your job is running on, check available hard disk storage with df -h
Check that /dev/shm still has free space
If it is getting full, check the contents of /tmp
If you are using R, look for a directory starting with Rtmp and check its size
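The checks above can be sketched with the standard df, find, and du utilities. This is one possible way to do it rather than a prescribed procedure; the function names show_scratch_space and rtmp_sizes are hypothetical.

```shell
# Show free space on the shared-memory and temporary filesystems.
show_scratch_space() {
    df -h /dev/shm /tmp
}

# Print the size of every directory whose name starts with "Rtmp"
# directly under the given directory (default: /tmp).
# Errors from directories you cannot read are discarded.
rtmp_sizes() {
    dir="${1:-/tmp}"
    find "$dir" -maxdepth 1 -type d -name 'Rtmp*' -exec du -sh {} + 2>/dev/null
}
```

Running show_scratch_space followed by rtmp_sizes on the compute node gives the free space and the size of each R temporary directory in one go. If no Rtmp directory exists, rtmp_sizes prints nothing.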
Next steps: logging more status information
You can add a timestamp in a bash script by including a line with date, which will output the current date and time including the seconds. Adding these periodically will include the passage of time in your log.
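As a small illustration, a helper function can prefix each log message with the output of date. The function name log_step and the messages are placeholders, and sleep stands in for a real processing step.

```shell
#!/bin/bash
# Print a message prefixed with the current date and time (to the second).
log_step() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

log_step "Starting analysis"
sleep 1   # stand-in for a real processing step
log_step "Finished analysis"
```

Comparing the timestamps of consecutive lines in the job's output file then shows how long each step took.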
Alternatively, you can wrap a particular operation with the time command to output the total time the command took, including the split between time spent in user operations and in system operations.
time Rscript analysis.r
…will run the specified R script and tell you the time it took once it completes.
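In R, you can do the same by recording the time before an operation starts and subtracting it from the time when the operation finishes. For example, store start <- Sys.time() before the operation and evaluate Sys.time() - start after it.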