gres/gpu - Fix for gres.conf file with multiple files on a single line
We're in the process of setting up a few GPU nodes in our cluster, and want to use GRES to control access to them. Currently, we have activated one node with 2 GPUs. The gres.conf file on that node reads

----------------
Name=gpu Count=2 File=/dev/nvidia[0-1]
Name=localtmp Count=1800
----------------

(localtmp just counts access to the local tmp disk.)

Nodes without GPUs have gres.conf files like this:

----------------
Name=gpu Count=0
Name=localtmp Count=90
----------------

slurm.conf contains the following:

----------------
GresTypes=gpu,localtmp
Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
[...]
Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib
----------------

Submitting a job with sbatch --gres=gpu:1 ... sets CUDA_VISIBLE_DEVICES for the job. However, the values seem a bit strange:

- If we submit one job with --gres=gpu:1, CUDA_VISIBLE_DEVICES gets the value 0.
- If we submit two jobs with --gres=gpu:1 at the same time, CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for the other.
- If we submit one job with --gres=gpu:2, CUDA_VISIBLE_DEVICES gets the value 0,1633906540.
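The expected behaviour can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Slurm's own parser: `expand_file_spec` and `invalid_indices` are hypothetical helper names, and only a single `lo-hi` bracket range is handled. The sketch expands `File=/dev/nvidia[0-1]` into the two device files, flags any index in CUDA_VISIBLE_DEVICES that does not correspond to one of them, and prints the raw bytes of the odd value from the report.

```python
import re
import struct


def expand_file_spec(spec):
    """Expand a bracketed range such as '/dev/nvidia[0-1]' into device paths.

    Assumption: only one 'lo-hi' range is supported; this is a simplification
    for illustration, not the parser Slurm uses.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]  # no bracket expression: treat as a single file
    prefix, lo, hi, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(lo), int(hi) + 1)]


def invalid_indices(cuda_visible_devices, device_count):
    """Return the indices in a CUDA_VISIBLE_DEVICES string that are out of range."""
    indices = [int(tok) for tok in cuda_visible_devices.split(",") if tok]
    return [i for i in indices if not 0 <= i < device_count]


devices = expand_file_spec("/dev/nvidia[0-1]")
print(devices)                                        # ['/dev/nvidia0', '/dev/nvidia1']
print(invalid_indices("0,1633906540", len(devices)))  # [1633906540]

# The four little-endian bytes of the unexpected value.
print(struct.pack("<I", 1633906540))                  # b'loca'
```

With two device files, the only valid indices are 0 and 1, so 1633906540 cannot be a real device index. Its bytes spell `loca`, the first four characters of `localtmp`. That may hint that the second device slot was filled from unrelated memory rather than from the second file on the line, but it is an observation, not a confirmed diagnosis.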