Slurm parallel "steps": 25 independent runs, using 1 CPU each, at most 5 simultaneously

I was previously using HTCondor as a cluster scheduler. Now, even after reading the Slurm documentation, I have no idea how to parallelize my runs...

What I want to achieve is, I think, called "embarrassingly parallel": running multiple independent instances of a program, each with a different input.

What I want: request 5 CPUs, possibly on distinct nodes; each CPU runs the single-threaded program on its own input. As soon as one CPU is freed, it starts on the next input in the queue.

Using a batch script, I tried two approaches (please help me understand their difference):

  1. job array
  2. packed jobs

If life were simple, I would assume it sufficient to combine the following sbatch options:

--ntasks=5: to have at most 5 runs simultaneously?

--cpus-per-task=1: each run uses one CPU (it should be the default value)

1. Job array option

I tried --array=0-24%5, even though %5 appears redundant with --ntasks=5 (or is it different?).

#!/usr/bin/env bash

#SBATCH --job-name=myprogram
#SBATCH --mem-per-cpu=3000 # MB
#SBATCH --output=slurmed/myprogram_%a.out
#SBATCH --error=slurmed/myprogram_%a.err
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1
#SBATCH --array=0-24%5

input_files=(myinput*.txt)

srun ./myprogram "${input_files[$SLURM_ARRAY_TASK_ID]}"

However, it persists in allocating several CPUs to each SLURM_ARRAY_TASK_ID!

I also tried without specifying --ntasks at all, same problem.
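For reference, the job-array form that is usually recommended for this kind of workload drops --ntasks=5 entirely, because each array task is its own single-task job, and lets %5 do the throttling: %5 limits how many array tasks run at the same time, while --ntasks applies to each array task separately, so the two are not redundant. A minimal sketch under the same assumptions (25 files matching myinput*.txt, an existing slurmed/ directory):

#!/usr/bin/env bash

#SBATCH --job-name=myprogram
#SBATCH --mem-per-cpu=3000 # MB
#SBATCH --output=slurmed/myprogram_%a.out
#SBATCH --error=slurmed/myprogram_%a.err
#SBATCH --ntasks=1          # each array task is a single-task job
#SBATCH --cpus-per-task=1   # one CPU per run
#SBATCH --array=0-24%5      # 25 array tasks, at most 5 running at once

# Every array task runs its own copy of this script; the array index selects the input.
input_files=(myinput*.txt)

srun ./myprogram "${input_files[$SLURM_ARRAY_TASK_ID]}"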

2. Packed jobs (using ampersand spawning)

(Sorry, but why would a cluster scheduler even let you manually parallelize using shell syntax?)

#!/usr/bin/env bash

#SBATCH --job-name=myprogram
#SBATCH --mem-per-cpu=3000 # MB
#SBATCH --output=slurmed/myprogram_%J_%t.out
#SBATCH --error=slurmed/myprogram_%J_%t.err
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1

for input_file in myinput*.txt; do
    srun --exclusive ./myprogram "$input_file" &
done
wait

However, if I watch htop on the node where it is running, I see that the run for the first input file is executing 5 times, and that the sbatch command itself is also using one extra CPU!

Should I remove the --exclusive?
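For comparison, here is a commonly used packed-job sketch in which every srun explicitly claims one task slot of the allocation (same assumptions as above; depending on the Slurm version, srun --exact may be needed instead of, or in addition to, --exclusive):

#!/usr/bin/env bash

#SBATCH --job-name=myprogram
#SBATCH --mem-per-cpu=3000 # MB
#SBATCH --output=slurmed/myprogram_%j.out
#SBATCH --error=slurmed/myprogram_%j.err
#SBATCH --ntasks=5          # 5 task slots, so at most 5 steps run at once
#SBATCH --cpus-per-task=1

for input_file in myinput*.txt; do
    # --ntasks=1 launches exactly one copy of the program per step;
    # --exclusive makes the step wait for a free slot instead of
    # piling onto CPUs already used by running steps.
    srun --ntasks=1 --exclusive ./myprogram "$input_file" &
done
wait   # do not let the batch script exit before all steps have finished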


P.S.: There is a useful answer here, but as I said, my array command uses multiple CPUs per array task instead of one.

P.P.S.: Additionally, the Slurm terminology is extremely confusing:

  • a job: something submitted using sbatch and/or srun?
  • a job step: each time an executable is called inside the batch script? Despite being called a "step", steps can run in parallel
  • a task: I don't see the difference from a job step, but the option descriptions imply that it is different (someone also asked); see the small illustration after this list.
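For what it's worth, my current working understanding of the three terms, illustrated with a toy batch script (the distinctions come from the sbatch/srun man pages; the numbers are arbitrary):

#!/usr/bin/env bash
#SBATCH --ntasks=4                    # the job: one allocation with 4 task slots

srun hostname                         # job step 0: by default runs 4 tasks, one per slot
srun --ntasks=1 ./myprogram a.txt &   # job step 1: a step made of a single task
srun --ntasks=1 ./myprogram b.txt &   # job step 2: another single-task step, running in parallel
wait

So a job is the allocation created by sbatch (or salloc, or a standalone srun); a job step is each srun invocation inside it; and a task is one launched copy of the executable within a step, so a 4-task step runs the command 4 times.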


Solution 1:[1]

So actually, my problem was answered here: NumCPUs shows 2 when --cpus-per-task=1. Due to hyperthreading, a physical core with 2 hardware threads is allocated for each job, so requesting 1 CPU per task will still show up as 2 CPUs in Slurm's reports. However, these 2 threads share the same physical core, so running a parallelized command across them will not provide real acceleration. If I want true parallelization, I have to request 4, 6, 8 or more CPUs.
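If someone hits the same thing, two things that may help (both are standard Slurm options/commands, but the effect depends on how the cluster defines ThreadsPerCore): inspect what was actually allocated, and ask Slurm not to hand out hardware threads.

# Inside the batch script (or with an explicit job ID), inspect the allocation:
scontrol show job "$SLURM_JOB_ID" | grep -E 'NumCPUs|CPUs/Task'

# Ask for one thread per physical core, if the site configuration allows it:
#SBATCH --hint=nomultithread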

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution source
Solution 1: PlasmaBinturong