Resource Utilization and Monitoring
Now that we can check our job history, let's learn how we can check that we are utilizing our resources (CPUs, GPUs, memory) efficiently. There is no special way to infer usage other than to test directly.
Checking Usage Directly
You can use the jobinfo program with the ID to see which node your job is on and ssh directly there. Let's run our job again
$ sbatch --exclusive myjob.sh
Submitted batch job 19823415
$ squeue --me
JOBID USER ACCOUNT NAME NODES CPUS TIME_LIMIT ST TIME
19823415 username lab_queue example.sh 1 8 10:00 R 0:05
Note
The use of --exclusive (or similar) is important here to not get polluted by others' jobs on the node. --exclusive is just a restraint that ensures that we are the only person running a job on the node that gets allocated.
Once you've done jobinfo to determine which node your job has landed on, you can ssh directly to the node. This is something you can do only if you have a Slurm job on the node. Once on the node, use a tool like htop to inspect CPU and memory activity (press q to quit).
username@login03.negishi:[~] $ ssh a200
username@a200.negishi:[~] $ top -u username # Or htop
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2145269 username 21 1 3956692 2.3g 19384 R 62.5 0.9 0:17.66 python
2145581 username 20 0 276372 7204 4128 R 25.0 0.0 0:00.29 top
2145062 username 20 0 89956 9836 8060 S 0.0 0.0 0:00.09 systemd
<press "q" to exit>
Note
When using ssh within a cluster's subnet, there's no need to specify the subnet, or the username. Further, we don't allow users to ssh onto a node unless they have job running there.
Monitor
However, just sshing onto the node isn't real telemetry. We want to collect and store the data. To do this, we can use the monitor utility (which is RCAC specific) to gather data on CPU and GPU metrics.
$ module load monitor
$ monitor cpu memory
DATE TIME HOSTNAME monitor.cpu.memory 15.7
DATE TIME HOSTNAME monitor.cpu.memory 15.7
DATE TIME HOSTNAME monitor.cpu.memory 15.7
DATE TIME HOSTNAME monitor.cpu.memory 15.7
...
This will run indefinitely, so we need to interrupt it with a keyboard interrupt. Windows and Linux users can press Ctrl+c to stop it. Mac users will use Cmd + ..
Note
Use --help or man monitor to check for usage details. You can also check our user guides for more recommendations.
Now let's do this for an actual Slurm job. Lets edit our example.sh submission script to look like this:
Be sure to ask for all the (with --exclusive) on the node so you don't collect data on your neighbor's job! Start each monitoring task before starting your application.
Why do we need to use the & on each of these commands in the script?
The & puts the process into the background! If we didn't, the node would be stuck on the monitor command until the walltime ran out. Check out Managing Processes from Week 3 if you need a refresher.
Now, let's run the new monitored submission file:
Once it's done, let's look at the output of the files:
$ cd $SCRATCH/example
$ cat cpu_mem.csv
DATE, TIME, HOSTNAME, monitor.cpu.memory, 15.7
DATE, TIME, HOSTNAME, monitor.cpu.memory, 15.7
DATE, TIME, HOSTNAME, monitor.cpu.memory, 15.7
DATE, TIME, HOSTNAME, monitor.cpu.memory, 15.7
...
$ cat cpu_per.csv
DATE, TIME, HOSTNAME, monitor.cpu.percent, 0.5
DATE, TIME, HOSTNAME, monitor.cpu.percent, 0.5
DATE, TIME, HOSTNAME, monitor.cpu.percent, 0.5
DATE, TIME, HOSTNAME, monitor.cpu.percent, 0.5
...
$ cd ~
Why is the CPU utilization so low around 0.5%?
Because currently, our workflow can only use a single CPU at a time, so the rest of the CPUs that we have allocated to our job are idle.
Next Section: Managing Workloads and Jobs