Prevent direct use of GPUs in a single-node Slurm setup

Scenario: I have installed Slurm on a single Ubuntu machine. Users may log in to this machine to run GPU and non-GPU tasks.

Goal: I want to prevent those users from directly using the GPUs. Only jobs started through Slurm should be able to use the GPUs.

What I've done: To prevent direct use of the GPUs, I changed the owner of /dev/nvidiaN (chown root:gpu /dev/nvidia*), so only users in group gpu can use them. That works. The problem is that Slurm jobs, of course, run as the user who submitted them, so those jobs cannot use the GPUs either.
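For reference, the lock-down steps were the following (the udev rule is an extra step I would suggest, since the device nodes may be recreated with default permissions on reboot or driver reload; the rule file name is arbitrary):

```shell
# Run as root. Create the gpu group and restrict the device nodes
# so that only members of the group can open them.
groupadd --system gpu
chown root:gpu /dev/nvidia*
chmod 0660 /dev/nvidia*

# Persist the ownership across reboots with a udev rule
# (file name is arbitrary):
cat > /etc/udev/rules.d/70-nvidia-gpu.rules <<'EOF'
KERNEL=="nvidia*", OWNER="root", GROUP="gpu", MODE="0660"
EOF
```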

Is there any way to accomplish this goal? I thought about mirror users (user -> user-slurm, in group gpu but without login) or sudo with some whitelisted sbatch --uid=UID --gid=GID ... commands, but this seems awfully convoluted.



Solution 1:[1]

I had the same issue. In my lab, we wanted to set up a local Slurm instance on a single GPU node, with a simple guarantee that only Slurm can use the GPU. This turns out to be harder than it first seems.

The solution I went with is to periodically kill any process using the GPU that was not launched through Slurm. This can be done with a relatively simple Python script.

#!/usr/bin/env python3
# Kills all processes using /dev/nvidia* that were not
# started by Slurm (i.e. have no slurmstepd ancestor).

import os
import subprocess
import syslog

import psutil

# enumerate the nvidia device nodes
devs = []
i = 0
while os.path.exists(f'/dev/nvidia{i}'):
    devs.append(f'/dev/nvidia{i}')
    i += 1

# list the PIDs of all processes using any of those devices;
# lsof exits non-zero when nothing matches, so don't treat that as an error
result = subprocess.run(['lsof', '-t'] + devs, capture_output=True, text=True) if devs else None
pids = result.stdout.split() if result else []

# kill processes that have no slurmstepd ancestor
for pid in pids:
    try:
        process = psutil.Process(int(pid))
        has_slurm_parent = False
        p = process
        while p := p.parent():
            if p.name() == 'slurmstepd':
                has_slurm_parent = True
                break
        if not has_slurm_parent:
            syslog.syslog(f'Killing process {pid} {process.name()} '
                          f'({process.username()}) - cannot use GPU outside of Slurm')
            process.kill()
    except psutil.NoSuchProcess:
        pass  # the process exited on its own in the meantime

Then just add this to root's crontab (sudo crontab -e) to run the script every 5 minutes or so:

*/5 * * * * python3 /path/to/script.py

Furthermore, since this script is only a last-resort defense, it is a good idea to make it rarely necessary by hiding the GPUs from ordinary sessions via /etc/environment:

CUDA_VISIBLE_DEVICES=""

We can also lock this variable for bash users (in a script under /etc/profile.d/, e.g. /etc/profile.d/xxxx.sh), together with a friendly message:

export CUDA_VISIBLE_DEVICES=""
readonly CUDA_VISIBLE_DEVICES
echo "Use srun or sbatch to access the GPU."
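To check that the guard actually holds, one can try to reassign the variable in a shell that has sourced the profile script; the readonly attribute makes the assignment fail. This is a minimal sketch that simulates such a session; note that it only deters well-behaved interactive shells, since a user can always start a shell that skips /etc/profile.d.

```shell
#!/bin/bash
# Simulate what a login shell sees after sourcing the profile snippet.
export CUDA_VISIBLE_DEVICES=""
readonly CUDA_VISIBLE_DEVICES

# Any later attempt to point the variable at a GPU fails
# (the subshell inherits the readonly attribute):
if ! (CUDA_VISIBLE_DEVICES=0) 2>/dev/null; then
    echo "CUDA_VISIBLE_DEVICES is locked"
fi
```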

Solution 2:[2]

Here is a Bash version of the Python script proposed by Ricardo Magalhães Cruz:

#!/bin/bash
for pid in $(/usr/bin/lsof -t /dev/nvidia*); do
  # processes launched by Slurm live in the slurmd.service cgroup
  grep -q "slurmd.service" "/proc/$pid/cgroup" 2>/dev/null && continue
  echo "Killing process $pid from user $(stat -c '%U' "/proc/$pid") - cannot use GPU outside of Slurm"
  kill -9 "$pid"
done

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Ricardo Magalhães Cruz
Solution 2: Rockcat