Investigating a Running Job

Posted on: August 3, 2018

Logging on to a node your job is running on

  1. Run squeue to see the currently running jobs
    1. Find the job ID of your running job. The job number is the number in the first column (JOBID column).
    2. Run 
      squeue -j<JOBID> 
      1. Note that the time reported here is the wall clock time your job has been running. What is reported with just the regular squeue command is the used CPU time.
      2. The second line of the output includes the node that the job is running on.

 Example:

This means the job is running on 4 compute nodes, as listed in the NODELIST column.

  2. While your job is running, you can ssh from the login node to any node it is running on. For example, if your job is running on keeling-a05, enter

ssh keeling-a05
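The lookup and login steps above can also be scripted. The following is a minimal sketch, assuming Slurm's squeue and scontrol are available on the login node; job_nodes is a hypothetical helper name, not part of Slurm.

```shell
# job_nodes is a hypothetical helper, not a Slurm command.
# Prints the compute nodes of a running job, one hostname per line.
job_nodes() {
    # -h drops the header row; -o "%N" prints only the NODELIST column.
    job_nodelist=$(squeue -h -j "$1" -o "%N") || return 1
    # An empty result means the job is unknown or has no nodes assigned.
    [ -n "$job_nodelist" ] || return 1
    # Expand a compressed list such as node[01-04] into single hostnames.
    scontrol show hostnames "$job_nodelist"
}

# Usage (replace <JOBID> with your job ID):
#   job_nodes <JOBID>
#   ssh "$(job_nodes <JOBID> | head -n 1)"
```

The second usage line logs on to the first node of the job; for a multi-node job you may want to inspect each node in turn.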


Investigate the node status with top

  1. Now that you’re on the node your job is running on, take a look at the current CPU and RAM usage with top.
  2. Verify that the number of processes you expect is actually running.
    1. Don’t be alarmed by a lower-than-expected total % CPU usage. Linux sees 40 cores because of Hyper-Threading, so fully loading 20 cores shows as 50% CPU but still uses most of the performance of the CPU. Note that Hyper-Threading will be turned off on the batch nodes in the future, and 20 fully used cores will then show as 100% CPU usage.
    2. Note how the CPU usage is split between the different types. See this StackOverflow post for a list of what they mean. In particular, wa means waiting on input/output, which you may be able to reduce by changing the analysis script.
    3. Check that there is sufficient free memory.
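If you want numbers you can paste into a log rather than read off the interactive top screen, the two checks above can be sketched in shell. This is only an illustration: count_by_name and mem_available_gib are hypothetical helper names, the sketch assumes a Linux node whose /proc/meminfo has a MemAvailable line, and it assumes your R processes appear under the command name R.

```shell
# count_by_name and mem_available_gib are hypothetical helper names.

# Count processes with an exact command name.
# Reads one command name per line on stdin (the format of `ps -o comm=`).
count_by_name() {
    # -c counts matching lines, -x matches whole lines, -F treats the name literally.
    grep -c -x -F -- "$1"
}

# Print available memory in GiB from a meminfo-format file (default: /proc/meminfo).
mem_available_gib() {
    # The MemAvailable value is in kB; 1048576 kB = 1 GiB.
    awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Usage on the compute node:
#   ps -u "$(id -un)" -o comm= | count_by_name R
#   mem_available_gib
```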

Example of top result:


Investigate the temporary file storage

  1. Now that you’re on the node your job is running on, check available hard disk storage with 
    df -h
    1. Check that /dev/shm still has free space.
    2. If it is getting full, check the contents of /tmp.
    3. If you are using R, look for a directory whose name starts with Rtmp and check its size, e.g.

      du -h --max-depth 1 RtmpQ2053K/
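The storage checks above can be condensed into two small shell helpers, for instance to call them repeatedly while the job runs. This is a sketch under assumptions: fs_use_percent and rtmp_sizes are hypothetical names, and the Rtmp directories are assumed to sit directly under the directory you pass in (the default here is /tmp).

```shell
# fs_use_percent and rtmp_sizes are hypothetical helper names.

# Print how full the filesystem holding a path is, as a bare number.
fs_use_percent() {
    # -P forces one output line per filesystem; column 5 is the use percentage.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# List R temporary directories under a directory (default: /tmp), largest first.
rtmp_sizes() {
    # Sizes are in KiB. If nothing matches Rtmp*, du's error is hidden
    # and nothing is printed.
    du -sk "${1:-/tmp}"/Rtmp* 2>/dev/null | sort -rn
}

# Usage on the compute node:
#   fs_use_percent /dev/shm
#   rtmp_sizes /tmp
```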

Next steps: logging more status information

 Example:

time Rscript analysis.r

…will run the specified R script and tell you the time it took once it completes.
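If you would rather have the timing land in the job's log in a fixed format, the same idea can be wrapped in a small shell function. This is a minimal sketch, not a site-provided tool: timed is a hypothetical helper name, and it assumes a date command that supports +%s, as GNU date does.

```shell
# timed is a hypothetical helper: run a command, then log how long it took.
timed() {
    timed_start=$(date +%s)     # seconds since the epoch, before the command
    "$@"                        # run the command with its arguments
    timed_status=$?             # remember the command's exit status
    timed_end=$(date +%s)
    # Log to stderr so the command's own stdout stays clean.
    echo "$(date '+%Y-%m-%d %H:%M:%S') elapsed=$((timed_end - timed_start))s status=$timed_status cmd=$*" >&2
    return "$timed_status"
}

# Usage in a job script:
#   timed Rscript analysis.r
```

Unlike the plain time command, this reports only whole seconds of wall-clock time, but it also records the exit status and a timestamp, which helps when several steps write to the same log.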

In R, you can do the same by recording the time before an operation starts and subtracting it from the time when the operation finishes. For example:

time1 <- proc.time() # start timer

output1 <- process(input) # the operation being timed

cat("output 1:", "\n"); print(proc.time() - time1) # time used since time1

That would write something like the following to the output log:

output 1:

   user  system elapsed

 15.292   2.184  17.482