Logging on to a node your job is running on
- Run squeue to see the currently running jobs
- Find the job ID of your running job. The job number is the number in the first column (JOBID column).
- Note that the time reported here is the wall clock time your job has been running. What is reported with just the regular squeue command is the used CPU time.
- The second line of the output includes the node that the job is running on.
2. While your job is running, you can ssh from the login node to the node it is running on. So, if your job is running on keeling-a05, as shown in the image above, you can enter
ssh keeling-a05
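The lookup above can also be scripted. The following is a minimal sketch, assuming a Slurm cluster where squeue is on the PATH; job_node is a hypothetical helper name, and 123456 is a placeholder job ID.

```shell
#!/usr/bin/env bash
# Sketch: print the node list of a Slurm job.
# job_node is a hypothetical helper, not part of Slurm.
job_node() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found; run this on a cluster login node" >&2
        return 1
    fi
    # -h drops the header line, -j selects the job by ID,
    # -o %N prints only the node list column.
    squeue -h -j "$1" -o %N
}

# List only your own jobs:
#   squeue -u "$USER"
# Log on to the node of job 123456 (placeholder ID, single-node job):
#   ssh "$(job_node 123456)"
```

For a job spread over several nodes, squeue prints a compressed list such as keeling-a[05-06], which cannot be passed to ssh directly; in that case pick one node name by hand.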
Investigate the node status with top
- Now that you’re on the node your job is running on, take a look at the current CPU and RAM usage with top
- Verify that the number of processes you expect is actually running.
- Don’t be too alarmed by a lower-than-expected total % CPU usage: Linux sees 40 cores due to Hyper-Threading, so fully loading 20 cores shows as 50% CPU yet still uses most of the performance of the CPU. Note that the batch nodes will have Hyper-Threading turned off in the future, and 20 fully used cores would then show 100% CPU usage.
- Note the share of CPU usage between the different types. See this StackOverflow post for a list of what they mean. In particular, wa means waiting on input/output, which you might be able to reduce by changing the analysis script.
- Check to make sure there is sufficient free memory
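If you would rather save a snapshot than watch top interactively, the checks above can be approximated with a short script. This is a sketch only, assuming a Linux node with the usual procps and coreutils tools; node_snapshot is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Sketch: one-off, non-interactive snapshot of CPU and memory use.
node_snapshot() {
    echo "Snapshot taken at: $(date)"
    if command -v nproc >/dev/null 2>&1; then
        # --all counts every installed logical core, including Hyper-Threads.
        echo "Logical cores seen by Linux: $(nproc --all)"
    fi
    if command -v top >/dev/null 2>&1; then
        # -b = batch (plain text) mode, -n 1 = a single iteration;
        # head keeps the summary lines and the first few processes.
        top -b -n 1 | head -n 15
    fi
    if command -v free >/dev/null 2>&1; then
        # -h prints memory sizes in human-readable units.
        free -h
    fi
    return 0
}

node_snapshot
```

The %Cpu(s) line in the top summary shows the split between us, sy, wa and the other types, and the free output shows how much memory is still available.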
Example of top result:
Investigate the temporary file storage
- Now that you’re on the node your job is running on, check the available hard disk storage, for example with df -h
- Check that /dev/shm still has free space
- If it is getting full, check the contents of /tmp
- If you are using R, look for a directory starting with Rtmp and check its size
du -h --max-depth 1 RtmpQ2053K/
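The storage checks in this section can be combined into one helper. This is a sketch under the assumption that R's temporary directories live directly under /tmp, as described above; check_tmp_space is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Sketch: report free space on /dev/shm and /tmp, then size up R temp dirs.
check_tmp_space() {
    for dir in /dev/shm /tmp; do
        if [ -d "$dir" ]; then
            # -h prints sizes in human-readable units.
            df -h "$dir"
        fi
    done
    # R session temp directories are named Rtmp plus a random suffix.
    # -s prints one total per directory; permission errors are hidden.
    for rdir in /tmp/Rtmp*/; do
        [ -d "$rdir" ] && du -sh "$rdir" 2>/dev/null
    done
    return 0
}

check_tmp_space
```

If no Rtmp directory exists, the second loop prints nothing.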
Next steps: logging more status information
- You can add a timestamp in a bash script by including a line with date, which outputs the current date and time, including the seconds. Adding such lines at intervals records the passage of time in your log.
- Alternatively, you can prefix a particular operation with the time command to report how long it took, split into elapsed (real) time, time spent in user operations, and time spent in system operations.
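Both approaches can be combined in a job script. The following is a minimal sketch; log is a hypothetical helper name, and sleep 1 stands in for the real analysis command.

```shell
#!/usr/bin/env bash
# Sketch: timestamped log lines around a timed step.
log() {
    # The +FORMAT argument makes date print e.g. 2024-01-31 14:05:09
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

log "starting step 1"
# sleep 1 is a placeholder for the real command being timed.
time sleep 1
log "finished step 1"
```

Note that bash writes the report from time to standard error, so make sure your job log captures standard error as well as standard output.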
In R, you can do the same by recording the time before an operation starts and subtracting it from the time when the operation finishes. For example:
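time1 <- proc.time()        # start timer
output1 <- process(input)   # process() and input stand for your own code
cat("output 1:", "\n")
print(proc.time() - time1)  # explicit print, so it also works inside source() or a function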
This would put something like the following in the output log:
output 1: 
   user  system elapsed 
 15.292   2.184  17.482 