- 12 Apr, 2013 5 commits
-
-
Danny Auble authored
-
Thomas Cadeau authored
Sometimes, typically with several jobs on the same node or with many sstat calls for a job, the pipe is not ready to be read. In that case, the function reading the pipe returns an error and the consumed-energy values are set to NO_VAL. From that point on, the values are never read again because the process "knows" there is no value to read. Thus a single error means NO_VAL is saved in the database and no consumed-energy information is stored. The attached patch avoids this: on the first read of the pipe, if the pipe doesn't exist, the function retries "NBFIRSTREAD = 3" times with a one-second wait between attempts. Then, during the job run and for the final read, if the pipe doesn't exist, the values are simply not updated. On the first read, the pipe is read only if the writer thread is running. If sstat fails to read the pipe, the value is not updated and the last value is printed. If there is a problem during the last read: when there were earlier sstat calls, a value exists but any change between the last sstat and the end of the step is missed; otherwise, the value is just "0" (never updated from the beginning).
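A minimal sketch, in plain C, of the first-read retry scheme described above; NBFIRSTREAD comes from the patch description, while the helper names, the pipe handling, and the exact value format are illustrative assumptions rather than the plugin's actual code:
----------------
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define NBFIRSTREAD 3   /* attempts allowed for the very first pipe read */

/* Try to read one energy value from the named pipe; hypothetical helper. */
static int _read_energy_pipe(const char *path, uint32_t *energy)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;                        /* pipe missing or not ready */
    ssize_t n = read(fd, energy, sizeof(*energy));
    close(fd);
    return (n == (ssize_t) sizeof(*energy)) ? 0 : -1;
}

/* First read: retry a few times instead of permanently storing NO_VAL. */
static int _first_energy_read(const char *path, uint32_t *energy)
{
    for (int i = 0; i < NBFIRSTREAD; i++) {
        if (_read_energy_pipe(path, energy) == 0)
            return 0;                     /* value obtained */
        sleep(1);                         /* wait one second and retry */
    }
    return -1;    /* give up, but leave the previous value untouched */
}
----------------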
-
Danny Auble authored
-
Danny Auble authored
plugins. For those doing development who want to use this, follow the model set forth in the acct_gather_energy_ipmi plugin.
-
Morris Jette authored
We're in the process of setting up a few GPU nodes in our cluster, and want to use Gres to control access to them. Currently, we have activated one node with 2 GPUs. The gres.conf file on that node reads
----------------
Name=gpu Count=2 File=/dev/nvidia[0-1]
Name=localtmp Count=1800
----------------
(the localtmp is just counting access to local tmp disk.) Nodes without GPUs have gres.conf files like this:
----------------
Name=gpu Count=0
Name=localtmp Count=90
----------------
slurm.conf contains the following:
----------------
GresTypes=gpu,localtmp
Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
[...]
Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib
----------------
Submitting a job with sbatch --gres:1 ... sets the CUDA_VISIBLE_DEVICES for the job. However, the values seem a bit strange:
- If we submit one job with --gres:1, CUDA_VISIBLE_DEVICES gets the value 0.
- If we submit two jobs with --gres:1 at the same time, CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for the other.
- If we submit one job with --gres:2, CUDA_VISIBLE_DEVICES gets the value 0,1633906540
-
- 11 Apr, 2013 6 commits
-
-
Danny Auble authored
Conflicts: NEWS
-
Danny Auble authored
APRUN_DEFAULT_MEMORY env var for aprun. In this scenario the option will not be displayed when used with --launch-cmd.
-
Morris Jette authored
-
Danny Auble authored
per cpu.
-
Morris Jette authored
-
Danny Auble authored
per cpu.
-
- 10 Apr, 2013 10 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
If task count specified, but no tasks-per-node, then set the tasks per node in the BASIL reservation request.
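A minimal sketch, with hypothetical names, of deriving a tasks-per-node value for the BASIL reservation when only the total task count is known:
----------------
#include <stdint.h>

/* Ceiling division of the task count over the node count; illustrative only. */
static uint32_t _default_tasks_per_node(uint32_t ntasks, uint32_t nnodes)
{
    if (nnodes == 0)
        return ntasks;                    /* nothing to divide by */
    return (ntasks + nnodes - 1) / nnodes;
}
----------------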
-
Danny Auble authored
Conflicts: src/plugins/select/cray/cray_config.c
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
as the hosts given.
-
Danny Auble authored
-
- 09 Apr, 2013 6 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
isn't a sub-block job.
-
Danny Auble authored
the XML.
-
Danny Auble authored
-
Morris Jette authored
Fix for bug 258
-
- 08 Apr, 2013 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Nathan Yee authored
FrontEndNode: allow/deny groups/users
Node: Power consumption
-
Morris Jette authored
-
- 06 Apr, 2013 2 commits
-
-
jette authored
(at initiating pending job steps), interrupt driven rather than retry based.
-
Morris Jette authored
Fix sched/backfill logic to initiate jobs whose maximum time limit is over the partition limit but whose minimum time limit permits them to start. Related to bug 251
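A minimal sketch of the time-limit test implied above, using hypothetical field names: a job whose maximum time limit exceeds the partition limit may still be backfilled if its minimum time limit fits:
----------------
#include <stdbool.h>
#include <stdint.h>

static bool _time_limit_ok(uint32_t time_limit, uint32_t time_min,
                           uint32_t part_max_time)
{
    if (time_limit <= part_max_time)
        return true;            /* fits as requested */
    if (time_min && time_min <= part_max_time)
        return true;            /* can start with a reduced time limit */
    return false;
}
----------------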
-
- 05 Apr, 2013 5 commits
-
-
Morris Jette authored
Conflicts: src/slurmctld/job_mgr.c
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
This is the only remaining use of the function. Also added a boards parameter to the function and removed some unused parameters.
-
Morris Jette authored
Mostly to split long lines
-
- 04 Apr, 2013 2 commits
-
-
Stephen Trofinoff authored
I am sending the latest update of my NPPCU-support patch for Slurm 2.5.0. As before, this patch is applied over my basic BASIL 1.3 support patch. The reason for this latest version is that it came to my attention that certain jobs that should have been rejected by Slurm were allowed through. I then further noticed that this would cause the backfill algorithm to slow down dramatically (often not being able to process any other jobs).

The cause of the problem was that when I introduced the functionality into Slurm to properly set the "nppcu" (number of processors per compute unit) attribute in the XML reservation request to ALPS, I didn't also adjust the tests earlier in the code that eliminate nodes from consideration that do not have sufficient resources. In other words, jobs that would exceed the absolute total number of processors on the node would be rejected as always (this is good). Jobs that required no more than the reduced number of "visible" processors on the node were allocated and worked fine (this is good). Unfortunately, jobs that needed a number of processors somewhere in between these limits (let's call them the soft and hard limits) were allowed through by Slurm. Making matters worse, when Slurm would subsequently try to request the ALPS reservation, ALPS would correctly reject it but Slurm would keep trying--this would then kill the backfilling.

In my opinion, these jobs should have been rejected from the outset by Slurm as they are asking for more processors per node than can be supplied. If the user wants this number of processors they should specify "--ntasks-per-core=..." (in our case "2", as that is the full number of hardware threads per core). Obviously, this problem only appeared when I used CR_ONE_TASK_PER_CORE in slurm.conf, as I had modified the code to set nppcu to 1 when Slurm was configured with that option and the user didn't explicitly specify a different value. The patch appears to be working well for us now and so I am submitting it to you for your review.
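A minimal sketch, with hypothetical names, of the node-elimination test described above: with nppcu = 1 (the CR_ONE_TASK_PER_CORE case) only one processor per compute unit is "visible", so a request that falls between the visible count (soft limit) and the hardware total (hard limit) should be rejected up front rather than passed on to ALPS:
----------------
#include <stdbool.h>
#include <stdint.h>

static bool _node_can_satisfy(uint32_t cpus_requested, uint32_t total_cpus,
                              uint32_t threads_per_core, uint32_t nppcu)
{
    uint32_t visible_cpus = total_cpus;

    if (nppcu && threads_per_core)
        visible_cpus = total_cpus / threads_per_core * nppcu;

    /* Reject the "in between" case instead of letting ALPS refuse it later. */
    return cpus_requested <= visible_cpus;
}
----------------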
-
Morris Jette authored
-