Troubleshooting a Running Job

Connect to a Compute Node Allocated to Your Job

Use squeue to find the node(s) your job is running on, then use ssh to connect. You can then monitor your job with something like top:

[be59@farnam1 ~]$ squeue -u be59
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            270455   general spacemix     be59  R 2-09:32:06      1 c13n08
[be59@farnam1 ~]$ ssh c13n08
[be59@c13n08 ~]$ top

top - 20:10:58 up 17 days,  8:06,  1 user,  load average: 6.74, 8.59, 8.89
Tasks: 265 total,   2 running, 263 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us,  3.9 sy,  0.0 ni, 95.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13115290+total, 11064297+free, 18606716 used,  1903208 buff/cache
KiB Swap: 16777212 total, 16777212 free,        0 used. 11158580+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13827 be59      20   0 14.722g 0.014t   5692 R 100.0 11.5   3453:41 R      
    1 root      20   0   42756   5244   2380 S   0.0  0.0   0:22.51 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.17 kthreadd
...

In this example, I can see that the R job I submitted is using 100% of one core and 0.014t (roughly 14GiB) of memory.
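If the node is shared with other jobs, you can limit top to just your own processes. The username below is the one from the example; substitute your own:

[be59@c13n08 ~]$ top -u be59

Pressing u inside top and typing a username applies the same filter interactively.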

Examine Output from Your Running Job

Normally, the scheduler buffers all of your job's output in an internal file and only delivers it to you when the job finishes. However, you can redirect the output yourself in your submission script using the > operator:

./mycommand myargs > output.txt
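For context, here is a minimal sketch of how that line might sit in a full Slurm submission script. The job name and partition are taken from the squeue example above; the core, memory, and time requests are placeholders you should adjust for your own job:

#!/bin/bash
#SBATCH --job-name=spacemix
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=3-00:00:00

# Redirect the program's output to a file you can inspect while the job runs
./mycommand myargs > output.txt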

Now, when your program runs, it will write all of its output to output.txt. You can look at that file during the run, either by opening it in an editor or by following it with tail:

$ tail -f output.txt

tail -f shows you the last few lines of the file, then keeps it open and prints each new line as it is written. Press Ctrl+c to stop.
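If your program also writes progress or error messages to standard error, those will not appear in output.txt with the redirection shown above. One common approach, sketched with the same placeholder command, is to merge standard error into the same file and then follow it as before:

./mycommand myargs > output.txt 2>&1

$ tail -f output.txt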