Commit ee6a7066 authored by Morris Jette

gres/gpu - Fix for gres.conf file with multiple files on a single line

We're in the process of setting up a few GPU nodes in our cluster, and
want to use Gres to control access to them.

Currently, we have activated one node with 2 GPUs.  The gres.conf file
on that node reads

----------------

Name=gpu Count=2 File=/dev/nvidia[0-1]
Name=localtmp Count=1800
----------------

(The localtmp resource just counts access to local tmp disk.)  Nodes
without GPUs have gres.conf files like this:

----------------

Name=gpu Count=0
Name=localtmp Count=90
----------------
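
For illustration, the File=/dev/nvidia[0-1] entry in the GPU node's
gres.conf names two device files on a single line using a bracketed
range.  The sketch below shows what that range is expected to expand to.
It is not Slurm's parser: expand_device_range is a hypothetical helper,
and it assumes a single numeric low-high range with no comma-separated
lists.

```python
import re


def expand_device_range(spec):
    """Expand '/dev/nvidia[0-1]' into one path per device file.

    Hypothetical helper for illustration only; handles a single
    '[low-high]' range and returns other specs unchanged.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]  # no bracketed range: a single device file
    prefix, first, last, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(first), int(last) + 1)]


devices = expand_device_range("/dev/nvidia[0-1]")
print(devices)  # ['/dev/nvidia0', '/dev/nvidia1']
```

With two device files, the expected device indices are 0 and 1, which
matches Count=2 on the same line.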

slurm.conf contains the following:

GresTypes=gpu,localtmp
Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
[...]
Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib

Submitting a job with sbatch --gres=gpu:1 ... sets CUDA_VISIBLE_DEVICES
for the job.  However, the values seem a bit strange:

- If we submit one job with --gres=gpu:1, CUDA_VISIBLE_DEVICES gets the
  value 0.

- If we submit two jobs with --gres=gpu:1 at the same time,
  CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for
  the other.

- If we submit one job with --gres=gpu:2, CUDA_VISIBLE_DEVICES gets the
  value 0,1633906540
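
As a side observation that is not part of the original report, the odd
value 1633906540 can be inspected as raw bytes.  The sketch below
assumes it is a 32-bit integer stored little-endian (an assumption about
the node's architecture).  Under that assumption the bytes spell "loca",
the first four characters of "localtmp".  This is only suggestive: it
hints that the second device index may have been read from memory
holding unrelated text rather than a real index, but the report does not
confirm the cause.

```python
value = 1633906540

# Interpret the value as 4 bytes, least significant byte first
# (little-endian is an assumption about the host, not a stated fact).
raw = value.to_bytes(4, byteorder="little")

print(hex(value))  # 0x61636f6c
print(raw)         # b'loca'
```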
parent e6501902