Commits · f9da1f4c27d4c7e10b37a85de64d685d5783cdc6 · Manuel G. Marciani / ces_slurm_simulator

01 Apr, 2016 2 commits

Tweak test for oversubscribe change · f9da1f4c
Morris Jette authored Apr 01, 2016

f9da1f4c

Rename "Shared" to "OverSubscribe" · 5fe0915e

Morris Jette authored Apr 01, 2016

Rename partition configuration from "Shared" to "OverSubscribe". Rename
salloc, sbatch, srun option from "--shared" to "--oversubscribe". The old
options will continue to function. Output field names also changed in
scontrol, sinfo, squeue, and sview.

5fe0915e

31 Mar, 2016 3 commits
- Merge branch 'slurm-15.08' · 30274f4d
  Morris Jette authored Mar 31, 2016
  
  30274f4d
- power/cray fix for nodes not ready · 5b0800e4
  Morris Jette authored Mar 31, 2016
```
Power/cray: Don't specify NID list to Cray APIs. If any of those nodes are
    not in a ready state, the API returned an error for ALL nodes rather than
    valid data for nodes in ready state.
bug 2332
```
  5b0800e4
- Make error message in the pmi2 code to debug as the issue can be expected · bcccd20c
  Matthieu Hautreux authored Mar 30, 2016
```
and retries are done making the error message a little misleading.
```
  bcccd20c
30 Mar, 2016 10 commits

Update node socket/core counts on the fly · 606948a8

Morris Jette authored Mar 30, 2016

Update a node's socket and cores per socket counts as needed after a node
boot to reflect configuration changes which can occur on KNL processors.
Note that the node's total core count must not change, only the distribution
of cores across varying socket counts (KNL NUMA nodes treated as sockets by
Slurm).

606948a8
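
The constraint in 606948a8 (the socket/core split may change, the total core count may not) can be illustrated with a minimal sketch. The function name and the example core counts are hypothetical and are not taken from the Slurm source; the actual check in slurmctld may differ.

```python
def layout_change_allowed(old_sockets, old_cores_per_socket,
                          new_sockets, new_cores_per_socket):
    """Return True if only the distribution of cores across sockets changed.

    Hypothetical helper: a new layout is accepted only when the node's
    total core count is identical before and after the reboot.
    """
    old_total = old_sockets * old_cores_per_socket
    new_total = new_sockets * new_cores_per_socket
    return old_total == new_total


# Illustrative numbers only: a 64-core node that reboots from one
# socket into four NUMA nodes, each treated as a socket.
same_total = layout_change_allowed(1, 64, 4, 16)
changed_total = layout_change_allowed(1, 64, 4, 17)
```

Here `same_total` is `True` because 1 x 64 and 4 x 16 both give 64 cores, while `changed_total` is `False` because the second layout would change the total.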

Log heterogeneous socket/core/thread configuration · 3c83b269

Morris Jette authored Mar 30, 2016

Log if the number of cores is not evenly divisible by the socket
  count (which will be the case on some KNL) or the number of
  threads is not evenly divisible by the core count.

3c83b269
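
The two divisibility conditions described in 3c83b269 can be sketched as follows. This is an illustration under assumptions, not the Slurm implementation: the function name, the message text, and the sample counts are all hypothetical, and the arguments are taken to be per-node totals.

```python
def heterogeneity_warnings(sockets, cores, threads):
    """Collect log messages for an uneven socket/core/thread layout.

    Hypothetical helper. `cores` and `threads` are assumed to be
    totals for the whole node.
    """
    warnings = []
    if cores % sockets != 0:
        warnings.append(
            f"{cores} cores not evenly divisible by {sockets} sockets")
    if threads % cores != 0:
        warnings.append(
            f"{threads} threads not evenly divisible by {cores} cores")
    return warnings


# Homogeneous node: 2 sockets, 12 cores, 24 threads -> nothing to log.
even = heterogeneity_warnings(2, 12, 24)
# Illustrative uneven node: 68 cores spread over 5 sockets.
uneven = heterogeneity_warnings(5, 68, 272)
```

With these inputs `even` is an empty list and `uneven` holds a single message about the core count, since 68 is not divisible by 5 but 272 is divisible by 68.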

Fix issue where if a slurmdbd rollup lasted longer than 1 hour the · 2bec1975
Danny Auble authored Mar 29, 2016
```
rollup would effectively never run again.

bug 2575

and sort of bug 2596
```
2bec1975

Replace SchedulerParameters option assoc_limit_continue · a685e0e9

Morris Jette authored Mar 29, 2016

Remove the SchedulerParameters option of "assoc_limit_continue", making it
the default value. Add option of "assoc_limit_stop". If "assoc_limit_stop"
is set and a job cannot start due to association limits, then do not attempt
to initiate any lower priority jobs in that partition. Setting this can
decrease system throughput and utilization, but avoid potentially starving
larger jobs by preventing them from launching indefinitely.

a685e0e9
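
The difference between the new default behaviour and "assoc_limit_stop" described in a685e0e9 can be sketched with a toy scheduling pass. This is a simplified model, not Slurm's scheduler: the `Job` class, the `select_startable` function, and the `over_assoc_limit` flag are hypothetical, and resource availability is ignored entirely.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    partition: str
    priority: int
    over_assoc_limit: bool  # hypothetical: True if association limits block the job


def select_startable(jobs, assoc_limit_stop=False):
    """Return names of jobs this toy pass would try to start, highest priority first."""
    stopped_partitions = set()
    startable = []
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if job.partition in stopped_partitions:
            # A higher priority job in this partition hit its association limit.
            continue
        if job.over_assoc_limit:
            if assoc_limit_stop:
                stopped_partitions.add(job.partition)
            continue  # default behaviour: skip this job, keep going
        startable.append(job.name)
    return startable


queue = [
    Job("big", "batch", 100, True),
    Job("small", "batch", 50, False),
    Job("other", "debug", 10, False),
]
default_pass = select_startable(queue)
stop_pass = select_startable(queue, assoc_limit_stop=True)
```

In the default pass `small` and `other` are both candidates to start. With `assoc_limit_stop=True` only `other` is, because `small` sits behind the blocked job `big` in the same partition. That is the throughput cost the commit message mentions, paid so that `big` is not starved.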

Start NEWS for v16.05.0-pre3 · f0f1286e
Morris Jette authored Mar 29, 2016

f0f1286e
Update RELEASE_NOTES through pre2 NEWS · bb50f3ed
Morris Jette authored Mar 29, 2016

bb50f3ed
Update META for v16.05.0-pre2 tag · f3a80339
Morris Jette authored Mar 29, 2016

f3a80339
Merge branch 'slurm-15.08' · 7a5dea3a
Morris Jette authored Mar 29, 2016
```
Conflicts:
	META
	NEWS
```
7a5dea3a
Update NEWS for start of v15.08.10 · 173eb1e6
Morris Jette authored Mar 29, 2016

173eb1e6
Update META for v15.08.9 tag · 7444b066
Morris Jette authored Mar 29, 2016

7444b066

29 Mar, 2016 6 commits
- Add spank_init and spank_fini to the extern step. · 68d7448e
  Danny Auble authored Mar 29, 2016
  
  68d7448e
- Add FAQ for slurmdbd -R option · 97c91cc5
  Morris Jette authored Mar 29, 2016
```
This adds a FAQ to go with commit 8ee976b4
```
  97c91cc5
- Harden test for "stale file handle" error · 572ba574
  Morris Jette authored Mar 29, 2016
  
  572ba574
- Add SchedulerParameter no_env_cache, if set no env cache will be used when · 1f72fbe9
  Danny Auble authored Mar 29, 2016
```
launching a job, instead the job will fail and drain the node if the env
isn't loaded normally.

bug 2546
```
  1f72fbe9
- In reservation test, log heterogeneous cluster may cause test failure · d8f55ff1
  Morris Jette authored Mar 29, 2016
```
Some portions of the reservation test assume the socket/core/thread
counts are homogeneous and fail otherwise. bug 2597 added to address
root cause of failure at a later time.
```
  d8f55ff1
- Fix clang issue. · 810971e2
  Brian Christiansen authored Mar 28, 2016
```
Argument with 'nonnull' attribute passed null.
```
  810971e2
28 Mar, 2016 5 commits

Add functionality to reset the lft and rgt values of the association table · 8ee976b4
Danny Auble authored Mar 28, 2016
```
with the slurmdbd.
```
8ee976b4
When a stepd is about to shut down and send its response to srun · ea470f71
Danny Auble authored Mar 28, 2016
```
make the wait to return data only hit after 500 nodes and configurable
based on the TcpTimeout value.
```
ea470f71
Merge branch 'slurm-15.08' · 2d70778b
Morris Jette authored Mar 28, 2016

2d70778b

task/cgroup - Fix task binding to CPUs bug · ddf6d9a4

Morris Jette authored Mar 28, 2016

There was a subtle bug in how tasks were bound to CPUs which could result
in an "infinite loop" error. The problem was various socket/core/thread
calculations were based upon the resources allocated to a step rather than
all resources on the node and rounding errors could occur. Consider for
example a node with 2 sockets, 6 cores per socket and 2 threads per core.
On the idle node, a job requesting 14 CPUs is submitted. That job would
be allocated 4 cores on the first socket and 3 cores on the second socket.
The old logic would get the number of sockets for the job at 2 and the
number of cores at 7, then calculate the number of cores per socket at
7/2 or 3 (rounding down to an integer). The logic laying out tasks
would bind the first 3 cores on each socket to the job then not find any
remaining cores, report the "infinite loop" error to the user, and run
the job without one of the expected cores. The problem gets even worse
when there are some allocated cores on a node. In a more extreme case,
a job might be allocated 6 cores on one socket and 1 core on a second
socket. In that case, 3 of that job's cores would be unused.
bug 2502

ddf6d9a4
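
The rounding arithmetic described in ddf6d9a4 can be reproduced with a short sketch. It models only the old cores-per-socket calculation as the commit message describes it; the function names are hypothetical and the code is not taken from the task/cgroup plugin.

```python
def cores_bound_old(alloc_per_socket):
    """Cores the old logic would bind, given cores allocated on each socket.

    Hypothetical model: cores per socket is derived from the job's own
    allocation (total cores // socket count, rounded down), and at most
    that many cores are bound on each socket.
    """
    sockets = len(alloc_per_socket)
    total_cores = sum(alloc_per_socket)
    cores_per_socket = total_cores // sockets  # e.g. 7 // 2 == 3
    return sum(min(cores_per_socket, n) for n in alloc_per_socket)


def cores_unused_old(alloc_per_socket):
    """Allocated cores that the old logic would leave unbound."""
    return sum(alloc_per_socket) - cores_bound_old(alloc_per_socket)


# 14 CPUs at 2 threads per core = 7 cores, split 4 + 3 across two sockets.
first_case = cores_unused_old([4, 3])
# The more extreme split from the commit message: 6 + 1.
extreme_case = cores_unused_old([6, 1])
```

`first_case` evaluates to 1 and `extreme_case` to 3, matching the one missing core and the three unused cores described in the commit message.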

Fix for srun signal handling threading problem · c8d36dba

Morris Jette authored Mar 28, 2016

This is a revision to commit 1ed38f26.
The root problem is that a pthread is passed an argument which is
a pointer to a variable on the stack. If that variable is overwritten,
the signal number received will be garbage, and that bad signal
number will be interpreted by srun to possibly abort the request.

c8d36dba

26 Mar, 2016 5 commits
- Fix test if "." not in search path · c86b143c
  Morris Jette authored Mar 25, 2016
  
  c86b143c
- Fix some tests for odd host names · 21b25847
  Morris Jette authored Mar 25, 2016
```
This fixes tests when a cluster's node name format includes nodes
  with numeric suffixes and those without.
```
  21b25847
- Harden test for node names without numeric suffix · 0cda6daf
  Morris Jette authored Mar 25, 2016
  
  0cda6daf
- Merge branch 'slurm-15.08' · af7a6fd3
  Morris Jette authored Mar 25, 2016
  
  af7a6fd3
- Revert commit efa83a02 · c1dde86c
  Morris Jette authored Mar 25, 2016
```
The previous commit obviously fixed a problem, but introduced a different
set of problems. This will be pursued later, perhaps in version 16.05.
```
  c1dde86c
25 Mar, 2016 6 commits
- Revert commit 6c14b969 · f5920b77
  Morris Jette authored Mar 25, 2016
```
With some configurations and systems, errors of the following sort were
occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'
```
  f5920b77
- Add "sacctmgr lost jobs" to report orphaned jobs on cluster. · 2dd920b9
  Nathan Yee authored Mar 25, 2016
```
Bug 1706
```
  2dd920b9
- Add OR constraint option test · b3dee298
  Morris Jette authored Mar 25, 2016
  
  b3dee298
- Add test of exclusive OR constraints · 59899e2d
  Nathan Yee authored Mar 25, 2016
```
bug 2070
```
  59899e2d
- Merge branch 'slurm-15.08' · bd39fef3
  Morris Jette authored Mar 25, 2016
  
  bd39fef3
- burst_buffer/cray - pre-run fail fix · 5a48207e
  Morris Jette authored Mar 25, 2016
```
burst_buffer/cray - If the pre-run operation fails then don't issue
    duplicate job cancel/requeue unless the job is still in run state. Prevents
    jobs hung in COMPLETING state.
bug 2587
```
  5a48207e
24 Mar, 2016 3 commits

Select/cray - Log NHC run time on "scontrol reconfig" · 58627d02

Morris Jette authored Mar 24, 2016

Running "scontrol reconfig" releases resources for jobs waiting for
  the completion of Node Health Check so that other jobs can run.
  Cray says to always wait for NHC to complete, but in extreme
  cases that can be 2 hours, during which the entire resource
  allocation for a job may be unusable. Per advice from NERSC,
  the logic to release resources is unchanged, but logging is
  added here.

58627d02

Remove Rgt from the association output of scontrol show assoc_mgr. Rgt · ddfd2781
Danny Auble authored Mar 24, 2016
```
isn't kept up to date in the cache.
```
ddfd2781
Fix incorrect info from commit ce6d3717 · 60979f52
Danny Auble authored Mar 24, 2016

60979f52