Commits · 3c83b2693385b2c71a11d6ae8e2f16d8a66261bd · Manuel G. Marciani / ces_slurm_simulator

30 Mar, 2016 8 commits
- Log heterogeneous socket/core/thread configuration · 3c83b269
  Morris Jette authored Mar 30, 2016
```
Log if the number of cores is not evenly divisible by the socket
  count (which will be the case on some KNL) or the number of
  threads is not evenly divisible by the core count.
```
  3c83b269
- Replace SchedulerParameters option assoc_limit_continue · a685e0e9
  Morris Jette authored Mar 29, 2016
```
Remove the SchedulerParameters option of "assoc_limit_continue", making it
    the default value. Add option of "assoc_limit_stop". If "assoc_limit_stop"
    is set and a job cannot start due to association limits, then do not attempt
    to initiate any lower priority jobs in that partition. Setting this can
    decrease system throughput and utilization, but avoids potentially starving
    larger jobs by preventing them from launching indefinitely.
```
  a685e0e9
- Start NEWS for v16.05.0-pre3 · f0f1286e
  Morris Jette authored Mar 29, 2016
  
  f0f1286e
- Update RELEASE_NOTES through pre2 NEWS · bb50f3ed
  Morris Jette authored Mar 29, 2016
  
  bb50f3ed
- Update META for v16.05.0-pre2 tag · f3a80339
  Morris Jette authored Mar 29, 2016
  
  f3a80339
- Merge branch 'slurm-15.08' · 7a5dea3a
  Morris Jette authored Mar 29, 2016
```
Conflicts:
	META
	NEWS
```
  7a5dea3a
- Update NEWS for start of v15.08.10 · 173eb1e6
  Morris Jette authored Mar 29, 2016
  
  173eb1e6
- Update META for v15.08.9 tag · 7444b066
  Morris Jette authored Mar 29, 2016
  
  7444b066
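
The `assoc_limit_stop` behaviour described in commit a685e0e9 above can be illustrated with a toy scheduling pass. This is only a sketch under stated assumptions, not Slurm's scheduler code: the `Job` class, the `within_assoc_limits` flag and `select_jobs_to_start` are hypothetical names, and resource availability is ignored so that only the effect of the option is visible.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int
    partition: str
    within_assoc_limits: bool  # hypothetical flag: association limits allow a start


def select_jobs_to_start(jobs, assoc_limit_stop=False):
    """Return the names of jobs that would start, highest priority first."""
    blocked_partitions = set()
    started = []
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if job.partition in blocked_partitions:
            # A higher-priority job in this partition hit its association limit.
            continue
        if job.within_assoc_limits:
            started.append(job.name)
        elif assoc_limit_stop:
            # Do not try any lower-priority jobs in the same partition.
            blocked_partitions.add(job.partition)
    return started


jobs = [
    Job("big", 100, "batch", False),
    Job("small", 50, "batch", True),
    Job("other", 10, "debug", True),
]
print(select_jobs_to_start(jobs))                         # ['small', 'other']
print(select_jobs_to_start(jobs, assoc_limit_stop=True))  # ['other']
```

With the default (formerly `assoc_limit_continue`) the lower-priority job "small" runs past the blocked job "big"; with `assoc_limit_stop` the "batch" partition is held back, which is the throughput-versus-starvation trade-off the commit message describes.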
29 Mar, 2016 6 commits
- Add spank_init and spank_fini to the extern step. · 68d7448e
  Danny Auble authored Mar 29, 2016
  
  68d7448e
- Add FAQ for slurmdbd -R option · 97c91cc5
  Morris Jette authored Mar 29, 2016
```
This adds a FAQ to go with commit 8ee976b4
```
  97c91cc5
- Harden test for "stale file handle" error · 572ba574
  Morris Jette authored Mar 29, 2016
  
  572ba574
- Add SchedulerParameter no_env_cache, if set no env cache will be used when · 1f72fbe9
  Danny Auble authored Mar 29, 2016
```
launching a job, instead the job will fail and drain the node if the env
isn't loaded normally.

bug 2546
```
  1f72fbe9
- In reservation test, log heterogeneous cluster may cause test failure · d8f55ff1
  Morris Jette authored Mar 29, 2016
```
Some portions of the reservation test assume the socket/core/thread
counts are homogeneous and fail otherwise. bug 2597 added to address
root cause of failure at a later time.
```
  d8f55ff1
- Fix clang issue. · 810971e2
  Brian Christiansen authored Mar 28, 2016
```
Argument with 'nonnull' attribute passed null.
```
  810971e2
28 Mar, 2016 5 commits

Add functionality to reset the lft and rgt values of the association table · 8ee976b4
Danny Auble authored Mar 28, 2016
```
with the slurmdbd.
```
8ee976b4
When a stepd is about to shutdown and send its response to srun · ea470f71
Danny Auble authored Mar 28, 2016
```
make the wait to return data only hit after 500 nodes and configurable
based on the TcpTimeout value.
```
ea470f71
Merge branch 'slurm-15.08' · 2d70778b
Morris Jette authored Mar 28, 2016

2d70778b

task/cgroup - Fix task binding to CPUs bug · ddf6d9a4

Morris Jette authored Mar 28, 2016

There was a subtle bug in how tasks were bound to CPUs which could result
in an "infinite loop" error. The problem was various socket/core/thread
calculations were based upon the resources allocated to a step rather than
all resources on the node and rounding errors could occur. Consider for
example a node with 2 sockets, 6 cores per socket and 2 threads per core.
On the idle node, a job requesting 14 CPUs is submitted. That job would
be allocated 4 cores on the first socket and 3 cores on the second socket.
The old logic would get the number of sockets for the job at 2 and the
number of cores at 7, then calculate the number of cores per socket at
7/2 or 3 (rounding down to an integer). The logic laying out tasks
would bind the first 3 cores on each socket to the job then not find any
remaining cores, report the "infinite loop" error to the user, and run
the job without one of the expected cores. The problem gets even worse
when there are some allocated cores on a node. In a more extreme case,
a job might be allocated 6 cores on one socket and 1 core on a second
socket. In that case, 3 of that job's cores would be unused.
bug 2502

ddf6d9a4
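
The rounding problem described in commit ddf6d9a4 above can be reproduced with a few lines of arithmetic. The sketch below is a simplified model, not the task/cgroup plugin code: `cores_bound_old_logic` is a hypothetical helper that assumes the old logic bound at most the rounded-down cores-per-socket value on each socket.

```python
def cores_bound_old_logic(allocated_per_socket):
    """Model the old per-step calculation.

    allocated_per_socket: cores allocated to the job on each socket.
    Returns (cores actually bound, cores left unused).
    """
    sockets = sum(1 for cores in allocated_per_socket if cores > 0)
    total_cores = sum(allocated_per_socket)
    # Integer division mirrors the rounding down described in the commit.
    cores_per_socket = total_cores // sockets
    bound = sum(min(cores_per_socket, cores) for cores in allocated_per_socket)
    return bound, total_cores - bound


# 14 CPUs at 2 threads per core = 7 cores, allocated as 4 + 3.
print(cores_bound_old_logic([4, 3]))  # (6, 1): one expected core is missing
# More extreme case from the commit message: 6 + 1.
print(cores_bound_old_logic([6, 1]))  # (4, 3): three cores go unused
```

Both results match the numbers in the commit message, which is why the fix bases the calculation on all resources of the node instead of the step's allocation.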

Fix for srun signal handling threading problem · c8d36dba

Morris Jette authored Mar 28, 2016

This is a revision to commit 1ed38f26
The root problem is that a pthread is passed an argument which is
a pointer to a variable on the stack. If that variable is over-written,
the signal number received will be garbage, and that bad signal
number will be interpreted by srun to possibly abort the request.

c8d36dba

26 Mar, 2016 5 commits
- Fix test if "." not in search path · c86b143c
  Morris Jette authored Mar 25, 2016
  
  c86b143c
- Fix some tests for odd host names · 21b25847
  Morris Jette authored Mar 25, 2016
```
This fixes tests when a cluster's node name format includes nodes
  with numeric suffixes and those without.
```
  21b25847
- Harden test for node names without numeric suffix · 0cda6daf
  Morris Jette authored Mar 25, 2016
  
  0cda6daf
- Merge branch 'slurm-15.08' · af7a6fd3
  Morris Jette authored Mar 25, 2016
  
  af7a6fd3
- Revert commit efa83a02 · c1dde86c
  Morris Jette authored Mar 25, 2016
```
The previous commit obviously fixed a problem, but introduced a different
set of problems. This will be pursued later, perhaps in version 16.05.
```
  c1dde86c
25 Mar, 2016 6 commits
- Revert commit 6c14b969 · f5920b77
  Morris Jette authored Mar 25, 2016
```
With some configurations and systems, errors of the following sort were
occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'
```
  f5920b77
- Add "sacctmgr lost jobs" to report orphaned jobs on cluster. · 2dd920b9
  Nathan Yee authored Mar 25, 2016
```
Bug 1706
```
  2dd920b9
- Add OR constraint option test · b3dee298
  Morris Jette authored Mar 25, 2016
  
  b3dee298
- Add test of exclusive OR constraints · 59899e2d
  Nathan Yee authored Mar 25, 2016
```
bug 2070
```
  59899e2d
- Merge branch 'slurm-15.08' · bd39fef3
  Morris Jette authored Mar 25, 2016
  
  bd39fef3
- burst_buffer/cray - pre-run fail fix · 5a48207e
  Morris Jette authored Mar 25, 2016
```
burst_buffer/cray - If the pre-run operation fails then don't issue
    duplicate job cancel/requeue unless the job is still in run state. Prevents
    jobs hung in COMPLETING state.
bug 2587
```
  5a48207e
24 Mar, 2016 4 commits

Select/cray - Log NHC run time on "scontrol reconfig" · 58627d02

Morris Jette authored Mar 24, 2016

Running "scontrol reconfig" releases resources for jobs waiting for
  the completion of Node Health Check so that other jobs can run.
  Cray says to always wait for NHC to complete, but in extreme
  cases that can be 2 hours, during which the entire resource
  allocation for a job may be unusable. Per advice from NERSC,
  the logic to release resources is unchanged, but logging is
  added here.

58627d02

Remove Rgt from the association output of scontrol show assoc_mgr. Rgt · ddfd2781
Danny Auble authored Mar 24, 2016
```
isn't kept up to date in the cache.
```
ddfd2781
Fix incorrect info from commit ce6d3717 · 60979f52
Danny Auble authored Mar 24, 2016

60979f52
Add new expect get_curr_line_num so you can print out the script line · ce6d3717
Danny Auble authored Mar 24, 2016
```
as well.
```
ce6d3717

23 Mar, 2016 6 commits

Merge branch 'slurm-15.08' · 3028cfea
Morris Jette authored Mar 23, 2016
```
Conflicts:
	src/plugins/select/cons_res/job_test.c
```
3028cfea

gang scheduling bug fix · 5f1e78f6

Morris Jette authored Mar 23, 2016

Fix gang scheduling resource selection bug which could prevent multiple jobs
    from being allocated the same resources. Bug was introduced in 15.08.6,
    commit 44f491b8

5f1e78f6

Fix leak of out_buf when file length is zero. · 498624df
Tim Wickberg authored Mar 23, 2016

498624df

Cleanup Coverity errors from file_bcast work. · 54c9ac31

Tim Wickberg authored Mar 23, 2016

Also ensure empty (0-length) files are handled properly.
Remove a stray exit(1) call from _rpc_file_bcast() to avoid
slurmd exiting on malformed data.

54c9ac31

Propagate use of functions added in commits 2ed5c7fb and e7f12058 · 10824898
Danny Auble authored Mar 23, 2016

10824898

task/cgroup: Fix for task binding anomaly · efa83a02

Morris Jette authored Mar 23, 2016

Here's how to reproduce on smd-server with 2 sockets, 6 cores per
socket and 2 threads per core, just run the following command line
3 times in quick succession (all active at the same time):
srun --cpus-per-task=4 -m block sleep 30
What was happening is the first job would be allocated cores 0+1
The second job would be allocated cores 2+3
The third job would test use of cores 0-3 then exit because the
 job only needs 4 CPUs. The resulting core binding would include
 NO CPUs. The new logic tests that the core being considered for
 use actually has some resources available to the job before
 updating the counter which is being tested against the needed
 CPU counter.

efa83a02
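
The counter change described in commit efa83a02 above can be sketched as a toy loop. This is an illustration under assumptions, not the plugin's code: `bind_cores` is a hypothetical function, and the counter advances once per core tested so that the example mirrors the commit's description of cores 0-3 being tested. Note that this commit was later reverted (see c1dde86c).

```python
def bind_cores(needed, total_cores, usable, check_usable):
    """Walk the cores in order and return those bound to the job.

    needed:       value the counter must reach before the loop stops
    usable:       set of core ids with resources available to this job
    check_usable: True models the new logic, False the old logic
    """
    bound = []
    counted = 0
    for core in range(total_cores):
        if counted >= needed:
            break
        if core in usable:
            bound.append(core)
            counted += 1
        elif not check_usable:
            # Old behaviour: the counter advances for a core the job cannot use.
            counted += 1
    return bound


# Third job on a 12-core node: cores 0-3 belong to the first two jobs.
print(bind_cores(4, 12, {4, 5}, check_usable=False))  # []
print(bind_cores(4, 12, {4, 5}, check_usable=True))   # [4, 5]
```

In the old variant the counter reaches the target on cores the job cannot use, so the binding ends up empty; counting only usable cores avoids that.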