Determine the amount of time allocated to a batch job in SLURM
The allocation time for a batch job can be specified with the sbatch command. For example, the following requests 1 day, 3 minutes, and 10 seconds:
$ sbatch -t 1-0:3:10 test.sh
My script needs to know how long it is allowed to run so that it can save all its data before being terminated. The environment variables available to the job, as listed on the sbatch man page, do not include the allocation time limit.
How can I determine this from within the script?
For now, I am asking the queue manager for the time limit on the current job:
#!/bin/sh
squeue -j $SLURM_JOB_ID -o "%l"
which gives
TIME_LIMIT
1-00:04:00
I parse the output using the following:
#!/bin/bash
TIMELIMIT=$(squeue -j "$SLURM_JOB_ID" -o "%l" | tail -1)
echo "Time limit $TIMELIMIT"
if [[ $TIMELIMIT == *-* ]]; then
    IFS='-' read -ra DAYS_HOURS <<< "$TIMELIMIT"
    DAYS=${DAYS_HOURS[0]}
    PART_DAYS=${DAYS_HOURS[1]}
else
    DAYS=0
    PART_DAYS=$TIMELIMIT
fi
if [[ $PART_DAYS == *:*:* ]]; then
    IFS=':' read -ra HMS <<< "$PART_DAYS"
    H=${HMS[0]}
    M=${HMS[1]}
    S=${HMS[2]}
else
    IFS=':' read -ra HMS <<< "$PART_DAYS"
    H=0
    M=${HMS[0]}
    S=${HMS[1]}
fi
# Note: SECONDS is a special bash variable (seconds since shell start),
# so use a different name for the total.
TOTAL_SECONDS=$(echo "((($DAYS*24+$H)*60+$M)*60+$S)" | bc)
echo "Time limit: $TOTAL_SECONDS seconds"
HOURS=$(echo "scale=3; ((($DAYS*24+$H)*60+$M)*60+$S)/3600" | bc)
echo "Time limit: $HOURS hours"
which gives
Time limit 1-00:04:00
Time limit: 86640 seconds
Time limit: 24.066 hours
Is there a cleaner way to do this?
[Modified with correction given by Amit Ruhela 2022-05-17]
Solution 1:
A few things.
If you use proctrack/cgroup, you can trap the SIGTERM signal that is sent when the time limit is up. That gives you a configurable amount of time to save state; SIGKILL is sent after KillWait seconds, configured in slurm.conf. However, it is difficult to make this work if you are using proctrack/linuxproc, because it sends SIGTERM to all processes, not just the bash script. Something like this:
#!/bin/bash
function sigterm {
echo "SIGTERM"
#save state
}
trap sigterm TERM
srun work.sh &
# This loop only breaks when all subprocesses exit
until wait; do :; done
This can be finicky to get right if you've never trapped signals in bash before. With proctrack/cgroup, SIGTERM is sent to the main process of each job step and the batch script. So above, work.sh would also have to trap SIGTERM. Also above, bash does not trap the signal until after subprocesses end unless you background them; hence the '&' and wait loop.
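The timing caveat is easy to reproduce outside SLURM. The following self-contained sketch (no scheduler needed; the one-second delay stands in for SLURM sending SIGTERM) delivers the signal to itself; because the long-running command is backgrounded and the script sits in `wait`, the trap fires immediately instead of after the command finishes:

```shell
#!/bin/bash
# Demo: traps only fire promptly when bash is interruptible (e.g. in 'wait').
handler() {
    echo "got SIGTERM"    # a real job would save state here
    exit 0
}
trap handler TERM

( sleep 1; kill -TERM $$ ) &   # stand-in for SLURM sending SIGTERM

sleep 30 &                     # stand-in for 'srun work.sh &'
until wait; do :; done         # interruptible; the trap runs during 'wait'
```

If `sleep 30` ran in the foreground instead, bash would not run the handler until the full 30 seconds had elapsed.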
If you really want to pass the timelimit into the job, you could use an environment variable.
sbatch --export=ALL,TIMELIMIT=1-0:3:10 -t1-0:3:10 test.sh
Annoyingly, you have to specify the time limit twice.
Querying the controller with squeue isn't a terrible solution. At scale, however, thousands of jobs querying the controller at once could impact performance. Note that you can use the --noheader flag so that TIME_LIMIT isn't printed each time, instead of using tail.
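Combined with --noheader, the parsing in the question can be reduced to one pure-bash helper with no tail or bc. A sketch (slurm_time_to_seconds is a hypothetical name, and it assumes squeue's [days-]HH:MM:SS or MM:SS output forms):

```shell
#!/bin/bash
# Convert a squeue time limit such as "1-00:04:00", "00:04:00" or "4:00"
# to seconds using only bash arithmetic.
slurm_time_to_seconds() {
    local t=$1 days=0 h=0 m=0 s=0
    if [[ $t == *-* ]]; then
        days=${t%%-*}       # part before the dash is whole days
        t=${t#*-}
    fi
    local IFS=':'
    read -ra f <<< "$t"
    case ${#f[@]} in
        3) h=${f[0]}; m=${f[1]}; s=${f[2]} ;;
        2) m=${f[0]}; s=${f[1]} ;;
        1) m=${f[0]} ;;
    esac
    # 10# forces base 10 so leading zeros are not parsed as octal
    echo $(( ((10#$days*24 + 10#$h)*60 + 10#$m)*60 + 10#$s ))
}

slurm_time_to_seconds "1-00:04:00"   # prints 86640
```

In a job script the argument would come from $(squeue --noheader -j "$SLURM_JOB_ID" -o "%l").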
Basically, this is what KillWait was designed for, so you should consider using it unless you can't for some reason. https://slurm.schedmd.com/slurm.conf.html
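For reference, the grace period lives in slurm.conf as a cluster-wide setting, so individual jobs cannot change it (the value shown is the documented default; a config fragment, not runnable code):

```
# slurm.conf -- seconds between SIGTERM and SIGKILL when the time limit is hit
KillWait=30
```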
The best answer might be to use the --signal option for sbatch. This lets you send a configurable signal to your job a set amount of time before the time limit expires.
sbatch --signal=B:USR1@120 myscript.sh
The example above sends USR1 to the batch script about 2 minutes before the end of the job. As noted in the man page, the resolution on this is 60 seconds, so the signal could be sent up to 60 seconds early.
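A hypothetical batch script wired up this way (the handler body and the 120-second lead are illustrative; the #SBATCH directive form is equivalent to passing the flag on the command line):

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@120   # USR1 to the batch shell ~2 min before the limit

save_and_exit() {
    echo "USR1 received: saving state before the time limit"
    # ... write checkpoint/output files here ...
    exit 0
}
trap save_and_exit USR1

srun work.sh &    # background the step so the trap can fire during 'wait'
wait
```

The B: prefix makes SLURM signal only the batch shell, which is what the trap needs; without it, the signal goes to the job steps instead.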
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow