- 01 Apr, 2016 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
Rename partition configuration from "Shared" to "OverSubscribe". Rename salloc, sbatch, srun option from "--shared" to "--oversubscribe". The old options will continue to function. Output field names also changed in scontrol, sinfo, squeue, and sview.
-
- 31 Mar, 2016 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
Power/cray: Don't specify NID list to Cray APIs. If any of those nodes are not in a ready state, the API returned an error for ALL nodes rather than valid data for nodes in ready state. bug 2332
-
Matthieu Hautreux authored
and retries are done making the error message a little misleading.
-
- 30 Mar, 2016 10 commits
-
-
Morris Jette authored
Update a node's socket and cores per socket counts as needed after a node boot to reflect configuration changes which can occur on KNL processors. Note that the node's total core count must not change, only the distribution of cores across varying socket counts (KNL NUMA nodes treated as sockets by Slurm).
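The constraint described above (the socket layout may change after a boot, but the total core count may not) can be sketched as follows. This is a minimal illustration, not Slurm's code: the function name `update_socket_layout`, the dictionary keys, and the 68-core example are assumptions made for the sketch.

```python
def update_socket_layout(node, new_sockets, new_cores_per_socket):
    """Return a copy of `node` with a new socket layout.

    Hypothetical helper: rejects any change that alters the node's
    total core count, since only the distribution of cores across
    sockets is allowed to change.
    """
    old_total = node["sockets"] * node["cores_per_socket"]
    new_total = new_sockets * new_cores_per_socket
    if new_total != old_total:
        raise ValueError(
            f"total core count changed from {old_total} to {new_total}"
        )
    updated = dict(node)
    updated["sockets"] = new_sockets
    updated["cores_per_socket"] = new_cores_per_socket
    return updated


# Assumed example: 68 cores seen as 1 socket, then as 4 NUMA "sockets".
node = {"name": "knl01", "sockets": 1, "cores_per_socket": 68}
rebooted = update_socket_layout(node, 4, 17)
```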
-
Morris Jette authored
Log if the number of cores is not evenly divisible by the socket count (which will be the case on some KNL) or the number of threads is not evenly divisible by the core count.
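The two divisibility checks named above can be sketched as below. This is an illustration under assumptions: the function name `layout_warnings` is hypothetical, and the core and thread counts are taken to be node-wide totals, which the commit message does not state explicitly.

```python
def layout_warnings(sockets, cores, threads):
    """Return log messages for uneven layouts.

    Assumes `cores` and `threads` are totals for the whole node
    (an interpretation, not confirmed by the commit message).
    """
    warnings = []
    if cores % sockets != 0:
        warnings.append(
            f"{cores} cores not evenly divisible by {sockets} sockets"
        )
    if threads % cores != 0:
        warnings.append(
            f"{threads} threads not evenly divisible by {cores} cores"
        )
    return warnings


even = layout_warnings(sockets=2, cores=12, threads=24)
uneven = layout_warnings(sockets=4, cores=10, threads=40)
```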
-
Danny Auble authored
rollup would effectively never run again. bug 2575 and sort of bug 2596
-
Morris Jette authored
Remove the SchedulerParameters option of "assoc_limit_continue", making it the default value. Add option of "assoc_limit_stop". If "assoc_limit_stop" is set and a job cannot start due to association limits, then do not attempt to initiate any lower priority jobs in that partition. Setting this can decrease system throughput and utilization, but it avoids potentially starving larger jobs, which could otherwise be prevented from launching indefinitely.
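The difference between the default behavior and "assoc_limit_stop" can be sketched with a toy scheduling pass. This is not Slurm's scheduler: the function `select_startable`, the job dictionaries, and the `hits_assoc_limit` flag are hypothetical names used only to illustrate the rule described above.

```python
def select_startable(jobs, assoc_limit_stop=False):
    """Return IDs of jobs that a single pass would try to start.

    `jobs` must already be sorted from highest to lowest priority.
    Each job is a dict with hypothetical keys 'id', 'partition',
    and 'hits_assoc_limit'.
    """
    stopped_partitions = set()
    startable = []
    for job in jobs:
        if job["partition"] in stopped_partitions:
            continue  # a higher priority job blocked this partition
        if job["hits_assoc_limit"]:
            if assoc_limit_stop:
                # Hold back all lower priority jobs in this partition.
                stopped_partitions.add(job["partition"])
            continue
        startable.append(job["id"])
    return startable


queue = [
    {"id": "big", "partition": "p1", "hits_assoc_limit": True},
    {"id": "small", "partition": "p1", "hits_assoc_limit": False},
    {"id": "other", "partition": "p2", "hits_assoc_limit": False},
]
default_pass = select_startable(queue)
stop_pass = select_startable(queue, assoc_limit_stop=True)
```

With the default behavior the smaller job in p1 runs ahead of the blocked one; with the stop behavior only the p2 job is considered, which is the throughput trade-off the commit message mentions.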
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Conflicts: META NEWS
-
Morris Jette authored
-
Morris Jette authored
-
- 29 Mar, 2016 6 commits
-
-
Danny Auble authored
-
Morris Jette authored
This adds a FAQ to go with commit 8ee976b4
-
Morris Jette authored
-
Danny Auble authored
launching a job, instead the job will fail and drain the node if the env isn't loaded normally. bug 2546
-
Morris Jette authored
Some portions of the reservation test assume the socket/core/thread counts are homogeneous and fail otherwise. Bug 2597 was added to address the root cause of the failure at a later time.
-
Brian Christiansen authored
Argument with 'nonnull' attribute passed null.
-
- 28 Mar, 2016 5 commits
-
-
Danny Auble authored
with the slurmdbd.
-
Danny Auble authored
make the wait to return data only hit after 500 nodes and configurable based on the TcpTimeout value.
-
Morris Jette authored
-
Morris Jette authored
There was a subtle bug in how tasks were bound to CPUs which could result in an "infinite loop" error. The problem was that various socket/core/thread calculations were based upon the resources allocated to a step rather than all resources on the node, so rounding errors could occur. Consider for example a node with 2 sockets, 6 cores per socket and 2 threads per core. On the idle node, a job requesting 14 CPUs is submitted. That job would be allocated 4 cores on the first socket and 3 cores on the second socket. The old logic would get the number of sockets for the job at 2 and the number of cores at 7, then calculate the number of cores per socket at 7/2 or 3 (rounding down to an integer). The logic laying out tasks would bind the first 3 cores on each socket to the job, then not find any remaining cores, report the "infinite loop" error to the user, and run the job without one of the expected cores. The problem gets even worse when there are some allocated cores on a node. In a more extreme case, a job might be allocated 6 cores on one socket and 1 core on a second socket. In that case, 3 of that job's cores would be unused. bug 2502
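The rounding arithmetic in the message above can be reproduced with a short sketch. It models only the calculation as described, not the actual task-binding code; the function name `old_logic_bound_cores` and its input format (allocated cores per socket) are assumptions.

```python
def old_logic_bound_cores(allocated_per_socket):
    """Model the old per-step calculation described in the message.

    `allocated_per_socket` lists the cores allocated to the job on
    each socket. Returns (cores actually bound, cores left unused).
    """
    used_sockets = [c for c in allocated_per_socket if c > 0]
    total_cores = sum(used_sockets)
    # Integer division is the source of the rounding error.
    cores_per_socket = total_cores // len(used_sockets)
    # At most `cores_per_socket` cores get bound on each socket.
    bound = sum(min(cores_per_socket, c) for c in used_sockets)
    return bound, total_cores - bound


# 14 CPUs at 2 threads per core: 7 cores, split 4 + 3.
first_case = old_logic_bound_cores([4, 3])
# More extreme split: 6 cores on one socket, 1 on the other.
extreme_case = old_logic_bound_cores([6, 1])
```

Both results agree with the message: one core is lost in the 4 + 3 case and three cores in the 6 + 1 case.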
-
Morris Jette authored
This is a revision to commit 1ed38f26. The root problem is that a pthread is passed an argument which is a pointer to a variable on the stack. If that variable is overwritten, the signal number received will be garbage, and srun may interpret that bad signal number in a way that aborts the request.
-
- 26 Mar, 2016 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
This fixes tests when a cluster's node name format includes both nodes with numeric suffixes and nodes without.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
The previous commit obviously fixed a problem, but introduced a different set of problems. This will be pursued later, perhaps in version 16.05.
-
- 25 Mar, 2016 6 commits
-
-
Morris Jette authored
With some configurations and systems, errors of the following sort were occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'
-
Nathan Yee authored
Bug 1706
-
Morris Jette authored
-
Nathan Yee authored
bug 2070
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray - If the pre-run operation fails then don't issue duplicate job cancel/requeue unless the job is still in run state. Prevents jobs hung in COMPLETING state. bug 2587
-
- 24 Mar, 2016 3 commits
-
-
Morris Jette authored
Running "scontrol reconfig" releases resources for jobs waiting for the completion of Node Health Check so that other jobs can run. Cray says to always wait for NHC to complete, but in extreme cases that can be 2 hours, during which the entire resource allocation for a job may be unusable. Per advice from NERSC, the logic to release resources is unchanged, but logging is added here.
-
Danny Auble authored
isn't kept up to date in the cache.
-
Danny Auble authored
-