Commits · 718009373294d16b53c652abf09afdef7d1375f9 · Manuel G. Marciani / ces_slurm_simulator

16 Jul, 2016 3 commits

Remove vestigial comment · 71800937
Morris Jette authored Jul 15, 2016

71800937

Move startup of power save thread · fb8e3558

Morris Jette authored Jul 15, 2016

Start power save thread only after the partition information is read
  in order to avoid trying to interpret the SuspendExcParts configuration
  information before the partition information is available, which would
  result in a slurmctld abort.

fb8e3558

Prevent slurmctld race condition · c7cae55b

Morris Jette authored Jul 15, 2016

Do not try to access the part_list variable (partition list pointer)
  if it is not yet initialized. Return a NULL pointer rather than
  aborting on a NULL pointer.

c7cae55b
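
The two fixes above share an ordering concern: a lookup against the partition list has to tolerate the list not being initialized yet, and the consumer of that lookup should only start once the partition information has been read. A minimal sketch of that guard pattern follows, written in Python for brevity rather than Slurm's C. Every name in it (`PartitionTable`, `load`, `find`, `start_power_save`) is hypothetical and not taken from the Slurm source.

```python
from typing import Optional


class PartitionTable:
    """Hypothetical stand-in for a controller's partition list."""

    def __init__(self) -> None:
        # None models a list pointer that has not been initialized yet.
        self._parts: Optional[dict] = None

    def load(self, names: list) -> None:
        # Models reading the partition configuration.
        self._parts = {name: {"name": name} for name in names}

    def find(self, name: str) -> Optional[dict]:
        # Guard: report "not found" instead of failing when the
        # list has not been initialized.
        if self._parts is None:
            return None
        return self._parts.get(name)


def start_power_save(table: PartitionTable, excluded: list) -> list:
    # Resolve excluded partition names; only meaningful after load().
    return [name for name in excluded if table.find(name) is not None]


table = PartitionTable()
early = table.find("debug")        # lookup before initialization
table.load(["debug", "batch"])     # partition information is read first
resolved = start_power_save(table, ["debug", "missing"])
```

In this sketch the early lookup yields `None` instead of raising, and the exclusion list is resolved only after `load()` has run.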

15 Jul, 2016 13 commits
- Fix spelling of hierarchy in comments · 4f3a0a02
  Tim Wickberg authored Jul 15, 2016
  
  4f3a0a02
- Do not schedule powered down nodes in FAILED state · 310de98d
  Jacek Budzowski authored Jul 15, 2016
```
bug 2900
```
  310de98d
- Remove unnecessary test for super user in regression test · 2a7d01a5
  Nicolas Joly authored Jul 15, 2016
  
  2a7d01a5
- Clean up generated files if test cannot run due to inappropriate conditions. · b9abe288
  Nicolas Joly authored Jul 15, 2016
  
  b9abe288
- Fix user message in test1.32 to report correct signal USR2. · 7f98f056
  Nicolas Joly authored Jul 15, 2016
  
  7f98f056
- Update LRZ site report in SLUG16 agenda · 48dc2bec
  Morris Jette authored Jul 15, 2016
  
  48dc2bec
- Move commit 30f4f81c to be above code that could call · f2b1c35f
  Danny Auble authored Jul 14, 2016
```
delete_step_records which would delete the steps without the killing flag
set.
```
  f2b1c35f
- More on the others dealing with the extern cleanup. · 7c831dd9
  Danny Auble authored Jul 14, 2016
```
What this does is treat the extern step like a normal step on exit. It
doesn't appear the original code is needed anymore, and this simplifies
the code.

The select_cray change is relevant since the add is needed only when
killing the step as that is the only place _internal_step_complete isn't
used.
```
  7c831dd9
- Continuation of commit 667f1105. Remove unneeded job_ptr variable from · d74bcf74
  Danny Auble authored Jul 14, 2016
```
functions.
```
  d74bcf74
- Continuation of commit 667f1105 to simplify the code more · 28745901
  Danny Auble authored Jul 14, 2016
  
  28745901
- Slightly better debug when dealing with stepd_completions. · af4d7aa4
  Danny Auble authored Jul 14, 2016
  
  af4d7aa4
- Make scontrol show steps show the extern step correctly. · a4c2649d
  Danny Auble authored Jul 14, 2016
```
Previously it was shown as TBD, since pending steps and the extern step
have the same stepid.
```
  a4c2649d
- Various cleanup needed for extern step. Continuation of commit 2fc0c860 · c79063b0
  Danny Auble authored Jul 14, 2016
```
What this does is set the state earlier to match a normal step.

Remove the unneeded _send_pending_exit_msgs.  There is only one task and
we have the message for it, so don't worry about that one.

Most importantly, wait for the other slurmstepds to send their messages,
otherwise they could be lost on the other end.
```
  c79063b0
14 Jul, 2016 8 commits
- Move talks at SLUG16 · f27341f7
  Morris Jette authored Jul 14, 2016
  
  f27341f7
- Fix gang scheduling and license release logic · 111e3b48
  Morris Jette authored Jul 14, 2016
```
Fix gang scheduling and license release logic if a single-node job is killed
    on a bad node. Notifying the gang scheduler and releasing licenses is
    normally done when the epilog completion happens, but if the node(s)
    assigned to a job are all down, that does not happen. This results in the
    licenses being reserved indefinitely and the gang scheduler being left
    with a bad (old) job pointer that can result in various failure modes.
bug 2867
```
  111e3b48
- Add SLUG16 agenda/hotel links · 5041174f
  Morris Jette authored Jul 14, 2016
  
  5041174f
- SLUG16 agenda update · c3a7d302
  Morris Jette authored Jul 14, 2016
```
Add hotels. Other minor changes.
```
  c3a7d302
- Fix missing variable from commit 30f4f81c · da462dbf
  Danny Auble authored Jul 13, 2016
  
  da462dbf
- CRAY - If trying to kill a step and you have NHC_NO_STEPS set, run NHC · e956f297
  Danny Auble authored Jul 13, 2016
```
anyway to attempt to log the backtraces of the potential
unkillable processes.
```
  e956f297
- Fix uninitialized variable which could cause a core dump from commit · 50f77062
  Danny Auble authored Jul 13, 2016
```
667f1105.
```
  50f77062
- Fix potential deadlock from commit b4dc9eea. · 30f4f81c
  Danny Auble authored Jul 13, 2016
  
  30f4f81c
13 Jul, 2016 3 commits

Continuation of last commit. · b4dc9eea

Danny Auble authored Jul 13, 2016

We have decided to go back to the way 15.08 called NHC instead of calling
it before sending a SIGKILL to the step's tasks. With this patch we
only start the NHC early when we have to resend the SIGKILL for unkillable
processes. This will hopefully get us the backtrace of the unkillable
processes, which was the reason we did it this way in the first place :).

b4dc9eea

- CRAY - Simplify when a NHC is called on a step that has unkillable · 603ae198
  Danny Auble authored Jul 13, 2016
```
processes.
```
  603ae198
- Update SLUG agenda for 2016 · a9c3ea71
  Morris Jette authored Jul 12, 2016
  
  a9c3ea71

12 Jul, 2016 7 commits
- Fix test1.29 / 17.15 for limits above 32-bits. · 0b8bbc00
  Nicolas Joly authored Jul 12, 2016
```
Bug 2892.
```
  0b8bbc00
- CRAY - Fix for reporting steps lingering after they are already finished. · cd06d0f9
  Danny Auble authored Jul 12, 2016
```
Bug 2874

We will most likely redo this logic (as it appears to be duplicated) in
a following patch.
```
  cd06d0f9
- Fix for burst_buffer/cray batch submit error · 7cdcc25c
  Morris Jette authored Jul 12, 2016
```
Don't generate an error when a batch job is submitted that must wait
  for stage-in before starting.
```
  7cdcc25c
- CRAY - Fix add of extern step to AELD. · 667f1105
  Danny Auble authored Jul 12, 2016
  
  667f1105
- CRAY - Fix issue if pid has already been added to another job container. · 067f2ee3
  Danny Auble authored Jul 12, 2016
```
Bug 2886
```
  067f2ee3
- Merge branch 'slurm-15.08' into slurm-16.05 · baaf6fc0
  Tim Wickberg authored Jul 12, 2016
```
Conflicts:
	src/sstat/options.c
```
  baaf6fc0
- Fix sstat to print job.batch step when requesting multiple jobs. · 59ae8600
  Jacek Budzowski authored Jul 12, 2016
```
Was incorrectly translating request to job.extern if part of a
comma-separated list.

Bug 2890.
```
  59ae8600
11 Jul, 2016 1 commit
- Fix proctrack plugin to only add the pid of a process once · a280c7f7
  Danny Auble authored Jul 11, 2016
```
(regression in 16.05.2).

related commit 5d3e5e1e

Bug 2612 and 2886
```
  a280c7f7
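
Adding a process only once amounts to making the add operation idempotent per pid. The following is a minimal Python sketch of that idea, not the proctrack plugin's actual code (which is C). `JobContainer` and `add_pid` are hypothetical names chosen for illustration.

```python
class JobContainer:
    """Hypothetical container tracking the pids that belong to one job."""

    def __init__(self, job_id: int) -> None:
        self.job_id = job_id
        self._pids: set = set()

    def add_pid(self, pid: int) -> bool:
        # Return False (and change nothing) if the pid is already tracked,
        # so repeated adds of the same process are harmless.
        if pid in self._pids:
            return False
        self._pids.add(pid)
        return True

    def pids(self) -> list:
        return sorted(self._pids)


container = JobContainer(job_id=1234)
first = container.add_pid(4242)
second = container.add_pid(4242)   # duplicate add is ignored
```

The boolean return value lets a caller tell a fresh add from a repeated one without treating the repeat as an error.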
08 Jul, 2016 5 commits
- Slightly better documentation in sacctmgr for the entity "job", and removed · a04eb600
  Danny Auble authored Jul 08, 2016
```
'-' without a '\' in front of it.
```
  a04eb600
- Rename LostJobs to RunawayJobs in sacctmgr. · d9640972
  Danny Auble authored Jul 08, 2016
  
  d9640972
- document salloc burst buffer limitations · 12837d1c
  Morris Jette authored Jul 08, 2016
```
Document limitations in burst buffer use by the salloc command (possible
    access problems from a login node).
bug 2883
```
  12837d1c
- task/cgroup set of soft memory limit added · 0eac300d
  Janne Blomqvist authored Jul 08, 2016
```
If the task/cgroup plugin is configured with ConstrainRAMSpace=yes, then set
    the soft memory limit to the allocated memory limit (previously no soft
    limit was set).
bug 2679
```
  0eac300d
- Add web link to collectd from EDF · bf9a621d
  Morris Jette authored Jul 08, 2016
  
  bf9a621d