Investigating a Running Job

Logging on to a node your job is running on

1. Run squeue to see the currently running jobs.
  1. Find the job ID of your running job. The job ID is the number in the first column (the JOBID column).
  2. Run
```
squeue -j <JOBID>
```
    1. Note that the TIME column reports the wall clock time your job has been running, not the CPU time it has used.
    2. The second line of the output includes the node that the job is running on.

Example:

This means the job is running on 4 compute nodes, as listed in the NODELIST column.
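
When you only need the node names, for instance to pass them to ssh from a script, the lookup above can be sketched as a small shell function. This is a minimal sketch, assuming a Slurm cluster where squeue is on your PATH; the function name job_nodes and the job ID 123456 are made up for illustration.

```shell
# Print only the node list of one Slurm job.
#   -h        suppress the header line
#   -j ID     restrict the output to this job
#   -o "%N"   print only the NODELIST field
job_nodes() {
    squeue -h -j "$1" -o "%N"
}

# Usage (123456 is a hypothetical job ID):
#   job_nodes 123456
```

For a multi-node job, Slurm prints the list in its compact bracketed form; `scontrol show hostnames` can expand such a list into one host name per line.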

2. While your job is running, you can ssh from the login node to any node it is running on. For example, if your job is running on keeling-a05, you can enter

ssh keeling-a05

Investigate the node status with top

Now that you’re on the node your job is running on, take a look at the current CPU and RAM usage with top.
Verify that the number of processes you expect to be running are actually running.
1. Don’t be too alarmed by a lower than expected total % CPU usage: Linux sees 40 logical cores due to Hyper-Threading, so fully loading 20 cores shows as about 50% CPU but still uses most of the performance of the CPU. Note that the batch nodes will have Hyper-Threading turned off in the future, and so 20 cores fully used would then show 100% CPU usage.
2. Note how the CPU usage is split between the different types. See this StackOverflow post for a list of what they mean. In particular, wa is time spent waiting on input/output, which you may be able to reduce by changing the analysis script.
3. Check that there is sufficient free memory.
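
If you would rather capture these numbers in a log than watch them interactively, the checks above can be sketched in shell. This is a minimal sketch, assuming a Linux node with the procps version of top and a kernel that reports MemAvailable in /proc/meminfo; the function names are made up for illustration.

```shell
# One non-interactive snapshot of top: -b is batch mode, -n 1 is a single
# iteration. head keeps the summary lines and the first few processes.
top_snapshot() {
    top -b -n 1 | head -n 20
}

# Percentage of memory still available, as a whole number.
# Reads a meminfo-format file (default: /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Usage:
#   top_snapshot
#   mem_available_pct
```

Calling these periodically from the job script gives a rough record of how load and memory changed over the run.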

Example of top result:

Investigate the temporary file storage

Now that you’re on the node your job is running on, check available hard disk storage with
```
df -h
```
1. Check that /dev/shm still has free space.
2. If it is getting full, check the contents of /tmp.
3. If you are using R, look for a directory starting with Rtmp and check its size.
  E.g., from within /tmp:
```
du -h --max-depth 1 RtmpQ2053K/
```
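
To check these from a script rather than by eye, the same df and du commands can be wrapped as below. This is a minimal sketch, assuming the R temporary directories live under /tmp as described above; the function names and the 90% threshold are arbitrary choices for illustration.

```shell
# Percentage of a filesystem in use, as a bare number.
# df -P prints one POSIX-format line per filesystem; column 5 is e.g. "42%".
fs_use_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when a filesystem is nearly full (the 90% threshold is arbitrary).
warn_if_full() {
    pct="$(fs_use_pct "$1")"
    if [ "$pct" -ge 90 ]; then
        echo "WARNING: $1 is ${pct}% full"
    fi
}

# Usage:
#   warn_if_full /dev/shm
#   du -sh /tmp/Rtmp*    # total size of each R temporary directory
```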

Next steps: logging more status information

You can add a timestamp to your log from a bash script by including a line with date, which prints the current date and time to the second. Adding these lines periodically records the passage of time in your log.
Alternatively, you can wrap a particular operation in the time command to report how long it took, split into real (wall clock), user, and system time.

Example:

time Rscript analysis.r

…will run the specified R script and tell you the time it took once it completes.
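
The two approaches can be combined in a job script so that each log line carries its own timestamp. This is a minimal sketch, assuming a POSIX shell; log_ts is a made-up helper name, and analysis.r is the script from the example above.

```shell
# Print a message prefixed with the current date and time.
log_ts() {
    printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}

log_ts "job started"
# ... run one stage of the job here, for example:
#   time Rscript analysis.r
log_ts "job finished"
```

Putting the message on the same line as the timestamp makes it easier to search the log for a particular stage later.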

In R, you can do the same by recording the time before an operation starts and subtracting it from the time when the operation finishes. For example:

time1 <- proc.time()  # start timer
output1 <- process(input)
cat("output 1:", "\n")
print(proc.time() - time1)  # time used since the timer started

This would put something like the following in the output log:

output 1:
   user  system elapsed
 15.292   2.184  17.482
