Slurm - job runs, gets data but gives TIMEOUT error
So I'm running some code which takes about 2 hours to run on the cluster. I configured the batch file with:
# Set maximum wallclock time limit for this job
#Time Format = days-hours:minutes:seconds
#SBATCH --time=0-02:15:00
This is just to give some headroom in case the job slows down for whatever reason. I checked the directory where the generated files are stored, and the simulation completes successfully every time. Despite this, Slurm keeps the job running until it hits the maximum time. The .out file keeps saying:
slurmstepd: *** JOB CANCELLED AT 2022-03-05T10:38:26 DUE TO TIME LIMIT ***
Any ideas why it doesn't show as complete instead?
Solution 1:[1]
In my opinion, this error is not caused by Slurm but by your application. Slurm considers the job finished only once the batch script exits, so most likely your application (or something it started) keeps running after it has written its output and never returns control to Slurm.
You can use sstat -j jobid while the job is running, for example after about 2 hours, to see how CPU consumption and other metrics look and to work out what is happening in your application (for instance, whether it hangs after completion).
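One way to narrow down where it hangs is to log a timestamp before and after the application inside the batch script. The following is only a minimal sketch: run_simulation is a hypothetical placeholder (here just a short sleep) for whatever command your script actually launches, and the message wording is made up.

```shell
#!/bin/bash
# Hypothetical stand-in for the real application; replace the body
# with the actual command from your batch script.
run_simulation() {
    sleep 1
}

# Timestamp before the application starts; this ends up in the .out file.
echo "simulation started:  $(date '+%Y-%m-%dT%H:%M:%S')"

run_simulation
status=$?   # exit status of the application

# Timestamp after the application returns. If this line never shows up
# in the .out file, the application itself has not exited.
echo "simulation finished: $(date '+%Y-%m-%dT%H:%M:%S') (exit status $status)"
```

If the "finished" line appears in the .out file but the job still runs until the time limit, then something after the application in the script (or a process it left behind) is probably what keeps the job alive. If the line never appears, the application is not exiting after it writes its files.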
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | j23 |
