- 17 Apr, 2013 2 commits
  - Danny Auble authored
  - Morris Jette authored
- 16 Apr, 2013 20 commits
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
    jobs if ntasks-per-node was used but no node count given.
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Martin Perry authored
  - Morris Jette authored
  - Morris Jette authored
- 15 Apr, 2013 2 commits
  - Danny Auble authored
    file doesn't exist.
  - Thomas Cadeau authored
- 13 Apr, 2013 4 commits
  - jette authored
  - Morris Jette authored
  - Morris Jette authored
  - Morris Jette authored
    Accidentally removed in commit 3b4d338d
- 12 Apr, 2013 12 commits
  - Morris Jette authored
  - Morris Jette authored
    Execute autogen.sh to rebuild src/plugins/mpi/pmi2/Makefile.in.
    Cast int to uint32 for un/pack32 function calls.
    Split a long line.
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Thomas Cadeau authored
    Sometimes, generally when several jobs run on the same node or when sstat is called many times for a job, the pipe is not ready to be read. In that case the function reading the pipe returns an error and the consumed-energy values are set to NO_VAL. From that point on the values are never read again, because the process "knows" there is no value to read, so a single error means NO_VAL is saved in the database and no consumed-energy information is stored.
    To avoid this, the attached patch changes the read logic. On the first read of the pipe, if the pipe does not exist, the function retries NBFIRSTREAD (3) times with a one-second wait between attempts. During the job run and for the final read, if the pipe does not exist, the values are simply not updated. The first read is only attempted if the writer thread is running. If sstat fails to read the pipe, the value is not updated and the last value is printed. If there is a problem during the final read: if sstat had been called, the value exists but all changes between the last sstat and the end of the step are missed; if not, the value is just "0" (never updated since the beginning).
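    A minimal sketch of the first-read retry described above, assuming a hypothetical helper name (_first_read_energy) and return convention; only NBFIRSTREAD (3 attempts) and the one-second wait come from the commit message, the rest is illustrative and not the actual acct_gather_energy_ipmi code.

    ```c
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define NBFIRSTREAD 3   /* attempts allowed on the very first read */

    /* Hypothetical helper: try to open and read the energy pipe, retrying
     * on the first read instead of immediately falling back to NO_VAL. */
    static int _first_read_energy(const char *pipe_path, uint32_t *joules)
    {
        int fd = -1;

        for (int i = 0; i < NBFIRSTREAD; i++) {
            fd = open(pipe_path, O_RDONLY);
            if (fd >= 0)
                break;
            sleep(1);   /* pipe not created yet: wait and retry */
        }
        if (fd < 0)
            return -1;  /* give up, but keep the previous value rather than NO_VAL */

        ssize_t n = read(fd, joules, sizeof(*joules));
        close(fd);
        return (n == (ssize_t) sizeof(*joules)) ? 0 : -1;
    }
    ```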
  - Danny Auble authored
  - Danny Auble authored
    plugins. Those doing development who want to use this should follow the model set forth in the acct_gather_energy_ipmi plugin.
  - Hongjia Cao authored
    Retransmit KVS messages to avoid failures due to asynchronous execution of slurmstepds. Simple implementation of name publish/unpublish/lookup.
  - Hongjia Cao authored
  - Morris Jette authored
    We're in the process of setting up a few GPU nodes in our cluster, and want to use Gres to control access to them. Currently, we have activated one node with 2 GPUs. The gres.conf file on that node reads:
        Name=gpu Count=2 File=/dev/nvidia[0-1]
        Name=localtmp Count=1800
    (the localtmp is just counting access to local tmp disk.) Nodes without GPUs have gres.conf files like this:
        Name=gpu Count=0
        Name=localtmp Count=90
    slurm.conf contains the following:
        GresTypes=gpu,localtmp
        Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
        [...]
        Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib
    Submitting a job with sbatch --gres:1 ... sets the CUDA_VISIBLE_DEVICES for the job. However, the values seem a bit strange:
    - If we submit one job with --gres:1, CUDA_VISIBLE_DEVICES gets the value 0.
    - If we submit two jobs with --gres:1 at the same time, CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for the other.
    - If we submit one job with --gres:2, CUDA_VISIBLE_DEVICES gets the value 0,1633906540