How do I find the processes that are related to an sbatch job?

When I start a job with sbatch on a multi-node system, processes are started on each of the nodes involved.

How can I find the IDs of the processes running on these nodes that were started by the sbatch run?

I checked the Slurm documentation but did not find any command (e.g. scontrol or sstat) that shows the processes involved.

The idea is to find the process ID and then use Linux tools to debug processes that are stuck (i.e. produce no output), and perhaps to find out what a particular process is doing.



Solution 1:[1]

What you are looking for is scontrol listpids. From the scontrol manpage:

listpids [job_id[.step_id]] [NodeName]

Print a listing of the process IDs in a job step (if JOBID.STEPID is provided), or all of the job steps in a job (if job_id is provided), or all of the job steps in all of the jobs on the local node (if job_id is not provided or job_id is "*"). This will work only with processes on the node on which scontrol is run, and only for those processes spawned by Slurm and their descendants. Note that some Slurm configurations (ProctrackType value of pgid) are unable to identify all processes associated with a job or job step. Note that the NodeName option is only really useful when you have multiple slurmd daemons running on the same host machine. Multiple slurmd daemons on one host are, in general, only used by Slurm developers.
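Because the listing only covers the node on which scontrol is run, a job spread over several nodes has to be queried node by node. The following is a minimal sketch, not part of the manpage: listpids_on_nodes and the REMOTE_SH variable are hypothetical names introduced here, and it assumes you are allowed to SSH to the compute nodes of your job.

```shell
# Hypothetical helper: for each hostname read from stdin, run
# `scontrol listpids JOBID` on that host and prefix every output line
# with the hostname.
# REMOTE_SH is an assumption of this sketch: the command used to reach
# a node (defaults to ssh).
listpids_on_nodes() {
    job="$1"
    while read -r host; do
        # </dev/null keeps the remote command from swallowing the host list
        ${REMOTE_SH:-ssh} "$host" scontrol listpids "$job" </dev/null |
            sed "s/^/$host: /"
    done
}

# Example, assuming job 68706234 is still running:
#   scontrol show hostnames "$(squeue -h -j 68706234 -o %N)" |
#       listpids_on_nodes 68706234
```

Here squeue prints the job's compact node list and scontrol show hostnames expands it to one hostname per line.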

Just SSH to a compute node and run scontrol listpids. It outputs a table mapping PIDs to job IDs.

[root@node003 ~]# scontrol listpids | column -t
PID     JOBID     STEPID      LOCALID  GLOBALID
269852  68706234  batch       0        0
269998  68706234  batch       -        -
[etc.]

Here I pipe the output through the column command to align the columns and make them easier to read.
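Once the table is available, the PIDs of one job can be extracted and handed to ordinary Linux tools, which is what the question is after. This is only a sketch under assumptions: pids_for_job is a hypothetical helper name, it relies on the column layout shown above (PID first, JOBID second), and the job ID is the one from the example listing.

```shell
# Hypothetical helper: read `scontrol listpids` output on stdin and print
# the PIDs that belong to the given job ID, one per line.
# Assumes the layout shown above: header line, PID in column 1, JOBID in column 2.
pids_for_job() {
    awk -v job="$1" 'NR > 1 && $2 == job { print $1 }'
}

# Example, run on the compute node itself:
#   scontrol listpids 68706234 | pids_for_job 68706234 |
#   while read -r pid; do
#       # state, elapsed time, kernel wait channel and command line
#       ps -o pid,stat,etime,wchan,args -p "$pid" | tail -n 1
#   done
```

For a process that looks stuck, strace -p PID shows the system call it is blocked in and gdb -p PID attaches a debugger. Both usually require being the owner of the process or root.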

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 damienfrancois