Why are my slurm job steps not launching in parallel?
I am trying to figure out what the concept of "tasks" means in SLURM. I found an answer on SO that suggests the following job script:
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 sleep 10 &
srun --ntasks=1 sleep 12 &
wait
The author says that this job runs for him in 12 seconds in total, because the two steps sleep 10 and sleep 12 run in parallel, but I cannot reproduce that.
If I save the above file as slurm-test and run
sbatch -o slurm.out slurm-test,
I see that my job runs for 23 seconds.
This is the output of sacct --format=JobID,Start,End,Elapsed,NCPUS -S now-2minutes:
       JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
645514       2021-06-30T11:05:38 2021-06-30T11:06:00   00:00:22          2
645514.batch 2021-06-30T11:05:38 2021-06-30T11:06:00   00:00:22          2
645514.exte+ 2021-06-30T11:05:38 2021-06-30T11:06:00   00:00:22          2
645514.0     2021-06-30T11:05:38 2021-06-30T11:05:48   00:00:10          2
645514.1     2021-06-30T11:05:48 2021-06-30T11:06:00   00:00:12          2
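The two step rows show the serialization directly: step 645514.1 starts in the same second that step 645514.0 ends. A minimal sketch of that check in plain bash and awk, with the two rows copied from the table above (it does not call Slurm, and it assumes the rows are listed in start order):

```shell
#!/bin/bash
# JobID, Start and End of the two job steps, copied from the sacct output above.
steps='645514.0 2021-06-30T11:05:38 2021-06-30T11:05:48
645514.1 2021-06-30T11:05:48 2021-06-30T11:06:00'

# ISO 8601 timestamps sort correctly as plain strings, so awk can compare
# them without any date parsing. Appending "" forces a string comparison.
verdict=$(printf '%s\n' "$steps" | awk '
  NR == 1 { first_end = $3 }
  NR == 2 { print ((($2 "") < (first_end "")) ? "overlapping" : "sequential") }
')

echo "$verdict"
```

If the steps had run in parallel, the second step's Start would be earlier than the first step's End and the check would report "overlapping" instead.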
My slurm.out output is
srun: Job 645514 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 645514
Explicitly including -n 2 in the sbatch call does not change the result. What am I doing wrong? How can I get the two calls in my job file to run simultaneously?
Solution 1:[1]
For me, the message "step creation temporarily disabled, retrying (Requested nodes are busy)" appeared because the srun command that ran first claimed all of the job's memory, leaving none for the second step. To solve this, first specify the total memory allocation in sbatch (this part may be optional, I am not sure):
#SBATCH --ntasks=2
#SBATCH --mem=[XXXX]MB
And then specify the memory use per srun task:
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/2]MB sleep 10 &
srun --exclusive --ntasks=1 --mem-per-cpu [XXXX/2]MB sleep 12 &
wait
I didn't specify a CPU count for srun because my sbatch script already has #SBATCH --cpus-per-task=1. For the same reason, I suspect you should use --mem instead of --mem-per-cpu in the srun command when each step uses more than one CPU, but I haven't tested this configuration.
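To make the placeholders concrete, here is a minimal sketch that writes out the complete job script with the memory split filled in. The 4000 MB total and the variable names total_mb, nsteps and per_step_mb are assumptions for illustration only; pick a total that fits your cluster's limits. The snippet only generates the file slurm-test, it does not submit anything.

```shell
#!/bin/bash
# Hypothetical values: total memory for the job and number of concurrent steps.
total_mb=4000
nsteps=2
per_step_mb=$(( total_mb / nsteps ))   # each step gets an equal share

# Write the job script to slurm-test. The heredoc expands the variables above,
# so the generated file contains plain numbers.
cat > slurm-test <<EOF
#!/bin/bash
#SBATCH --ntasks=${nsteps}
#SBATCH --cpus-per-task=1
#SBATCH --mem=${total_mb}M
# Each step requests one task and an equal share of the job's memory,
# so the first step no longer claims all of it.
srun --exclusive --ntasks=1 --mem-per-cpu=${per_step_mb}M sleep 10 &
srun --exclusive --ntasks=1 --mem-per-cpu=${per_step_mb}M sleep 12 &
wait
EOF
```

The generated file can then be submitted as in the question, with sbatch -o slurm.out slurm-test. If the steps run in parallel, sacct should show both steps with the same Start time.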
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Isabella |
