Commits · b8705d7ffd9f0b9e48d8837a07c0a80bed30b5b8 · Manuel G. Marciani / ces_slurm_simulator

16 Jul, 2016 4 commits

Merge branch 'slurm-16.05' · b8705d7f

Morris Jette authored Jul 15, 2016

b8705d7f

Remove vestigial comment · 71800937

Morris Jette authored Jul 15, 2016

71800937

Move startup of power save thread · fb8e3558

Morris Jette authored Jul 15, 2016

Start power save thread only after the partition information is read
  in order to avoid trying to interpret the SuspendExcParts configuration
  information before the partition information is available, which would
  result in a slurmctld abort.

fb8e3558

Prevent slurmctld race condition · c7cae55b

Morris Jette authored Jul 15, 2016

Do not try to access the part_list variable (partition list pointer)
  if it is not yet initialized. Return a NULL pointer rather than
  aborting on a NULL pointer.

c7cae55b
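
The guard described in commit c7cae55b can be sketched roughly as follows. This is a minimal illustration, not the actual slurmctld code: `struct part_record` here is a simplified stand-in, `find_part` is a hypothetical helper name, and only the `part_list` name comes from the commit message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a partition record. */
struct part_record {
    const char *name;
    struct part_record *next;
};

/* Stays NULL until the partition configuration has been read. */
static struct part_record *part_list = NULL;

/*
 * Hypothetical lookup helper: return the partition with the given name,
 * or NULL if the list is not yet initialized or nothing matches.
 */
static struct part_record *find_part(const char *name)
{
    /* Guard: report "not found" instead of touching an unset list. */
    if (part_list == NULL || name == NULL)
        return NULL;

    for (struct part_record *p = part_list; p != NULL; p = p->next) {
        if (strcmp(p->name, name) == 0)
            return p;
    }
    return NULL;
}
```

A caller that runs before the partition information is loaded then sees an ordinary "not found" result and can retry later, rather than bringing the daemon down.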

15 Jul, 2016 18 commits
- Fix spelling of hierarchy in comments · 4f3a0a02
  Tim Wickberg authored Jul 15, 2016
  
  4f3a0a02
- Do not schedule powered down nodes in FAILED state · 310de98d
  Jacek Budzowski authored Jul 15, 2016
```
bug 2900
```
  310de98d
- Remove unnecessary test for super user in regression test · 2a7d01a5
  Nicolas Joly authored Jul 15, 2016
  
  2a7d01a5
- Cleanup generated files if test cannot run due to inappropriate conditions. · b9abe288
  Nicolas Joly authored Jul 15, 2016
  
  b9abe288
- Fix user message in test1.32 to report correct signal USR2. · 7f98f056
  Nicolas Joly authored Jul 15, 2016
  
  7f98f056
- Update LRZ site report in SLUG16 agenda · 48dc2bec
  Morris Jette authored Jul 15, 2016
  
  48dc2bec
- burst_buffer.conf document - Remove info about old release · d371bed5
  Morris Jette authored Jul 15, 2016
  
  d371bed5
- burst_buffer/cray newly found buffer timeout · 6de710be
  Morris Jette authored Jul 15, 2016
```
Don't register newly found buffers that are less than OtherTimeout
  old to avoid possible duplicates.
```
  6de710be
- burst_buffer/cray race condition · 91bc07b8
  Morris Jette authored Jul 15, 2016
```
This hardens the code with respect to a race condition if the
  slurmctld restarts and a burst buffer creation for a job is
  in progress. Eliminate the possibility of a duplicate job
  allocation record.
```
  91bc07b8
- burst_buffer/cray: Move some logic around for better clarity · 6669f1f1
  Morris Jette authored Jul 15, 2016
```
No change in functionality, just moved function call and added
  comment
```
  6669f1f1
- Move commit 30f4f81c to be above code that could call · f2b1c35f
  Danny Auble authored Jul 14, 2016
```
delete_step_records which would delete the steps without the killing flag
set.
```
  f2b1c35f
- More on the others dealing with the extern cleanup. · 7c831dd9
  Danny Auble authored Jul 14, 2016
```
What this does is treat the extern step like a normal step on exit.  It
doesn't appear the original code is needed anymore and this simplifies
the code.

The select_cray change is relevant since the add is needed only when
killing the step as that is the only place _internal_step_complete isn't
used.
```
  7c831dd9
- Continuation of commit 667f1105. Remove unneeded job_ptr variable from · d74bcf74
  Danny Auble authored Jul 14, 2016
```
functions.
```
  d74bcf74
- Continuation of commit 667f1105 to simplify the code more · 28745901
  Danny Auble authored Jul 14, 2016
  
  28745901
- Slightly better debug when dealing with stepd_completions. · af4d7aa4
  Danny Auble authored Jul 14, 2016
  
  af4d7aa4
- Make scontrol show steps show the extern step correctly. · a4c2649d
  Danny Auble authored Jul 14, 2016
```
Before it was showing it as TBD since pending steps and the extern step
have the same stepid.
```
  a4c2649d
- Various cleanup needed for extern step. Continuation of commit 2fc0c860 · c79063b0
  Danny Auble authored Jul 14, 2016
```
What this does is set the state earlier to match a normal step.

Remove the unneeded _send_pending_exit_msgs.  There is only one task and
we have the message for it, so don't worry about that one.

Most important, wait for the other slurmstepd's to send their message,
otherwise they could be lost on the other end.
```
  c79063b0
- burst_buffer/cray real_size change · 5b62fdc3
  Morris Jette authored Jul 14, 2016
```
Only execute the DataWarp real_size function if there is a job burst buffer.
Calling the function if the job only references persistent buffers generates
an error that is not useful.
```
  5b62fdc3
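
The age check described in commit 6de710be (skip newly found buffers younger than OtherTimeout) might look roughly like the sketch below. The helper name `buffer_old_enough` and its parameters are hypothetical and are not taken from the burst_buffer/cray plugin. The sketch assumes the timeout is expressed in seconds.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: decide whether a newly discovered buffer should be
 * registered. Buffers younger than other_timeout_secs are skipped, since
 * registering them could produce duplicate records.
 */
static bool buffer_old_enough(time_t create_time, time_t now,
                              unsigned int other_timeout_secs)
{
    /* difftime() returns the elapsed time (now - create_time) in seconds. */
    return difftime(now, create_time) >= (double) other_timeout_secs;
}
```

A buffer that fails the check is simply left for a later scan, by which time it will either have aged past the timeout or already be known.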
14 Jul, 2016 15 commits
- Fix for bad cut/paste · 06a10230
  Morris Jette authored Jul 14, 2016
```
Wrong argument type
```
  06a10230
- Change variable name to preserve original · a45301ae
  Morris Jette authored Jul 14, 2016
```
Preserve variable resp_msg for use in error message and use a different
  variable for temporary storage.
```
  a45301ae
- Merge branch 'slurm-16.05' · 04fa2512
  Morris Jette authored Jul 14, 2016
  
  04fa2512
- Move talks at SLUG16 · f27341f7
  Morris Jette authored Jul 14, 2016
  
  f27341f7
- Fix gang scheduling and license release logic · 111e3b48
  Morris Jette authored Jul 14, 2016
```
Fix gang scheduling and license release logic if single node job killed on
    bad node. Notifying gang and releasing licenses is normally done when
    the epilog completion happens, but if the node(s) assigned to a job are
    all down, that does not happen. This results in the licenses being
    reserved indefinitely and the gang scheduler being left with a bad
    (old) job pointer that can result in various failure modes.
bug 2867
```
  111e3b48
- Merge branch 'slurm-16.05' · 7f291e6f
  Morris Jette authored Jul 14, 2016
  
  7f291e6f
- Add SLUG16 agenda/hotel links · 5041174f
  Morris Jette authored Jul 14, 2016
  
  5041174f
- SLUG16 agenda update · c3a7d302
  Morris Jette authored Jul 14, 2016
```
Add hotels. Other minor changes.
```
  c3a7d302
- Fix missing variable from commit 30f4f81c · da462dbf
  Danny Auble authored Jul 13, 2016
  
  da462dbf
- CRAY - If trying to kill a step and you have NHC_NO_STEPS set run NHC · e956f297
  Danny Auble authored Jul 13, 2016
```
anyway to attempt to log the backtraces of the potential
unkillable processes.
```
  e956f297
- Fix uninitialized variable which could cause a core dump from commit · 50f77062
  Danny Auble authored Jul 13, 2016
```
667f1105.
```
  50f77062
- Fix potential deadlock from commit b4dc9eea. · 30f4f81c
  Danny Auble authored Jul 13, 2016
  
  30f4f81c
- Fix bad cut/paste · 5f16621d
  Morris Jette authored Jul 13, 2016
```
Used wrong symbol name in commit c4e34cb9
a few hours ago
```
  5f16621d
- burst_buffer/cray: Change logic to synchronize · afc97d40
  Morris Jette authored Jul 13, 2016
```
Match sessions and instances using new DataWarp data format
```
  afc97d40
- Make sure cbuf_mutex_is_locked() is defined · c4e34cb9
  Morris Jette authored Jul 13, 2016
  
  c4e34cb9
13 Jul, 2016 3 commits

Don't log success from pthread_cond_timedwait · 327d2d27
Morris Jette authored Jul 13, 2016
```
correction to logic in commit c0919263
```
327d2d27

burst_buffer/cray: move real_size function · 1dd9dc86
Morris Jette authored Jul 13, 2016
```
Move the real_size function to after the buffer has been set up,
  per Cray documentation
```
1dd9dc86

Continuation of last commit. · b4dc9eea

Danny Auble authored Jul 13, 2016

We have decided to go back to the way 15.08 called NHC instead of calling
it first before sending a SIGKILL to the step's tasks. With this patch we
only start the NHC early when we have to resend the SIGKILL for unkillable
processes. This will hopefully get us the backtrace of the unkillable
processes which was the reason we did it this way in the first place :).

b4dc9eea
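
For commit 327d2d27 above (don't log success from pthread_cond_timedwait), a minimal sketch of the idea follows. POSIX specifies that `pthread_cond_timedwait` returns 0 on success and `ETIMEDOUT` when the deadline passes. The helper names `timedwait_rc_is_error` and `wait_until` are hypothetical, and treating `ETIMEDOUT` as a normal, unlogged outcome is an assumption of this sketch rather than something the commit message states.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical predicate: is this return code worth logging? */
static int timedwait_rc_is_error(int rc)
{
    /* 0 is success; ETIMEDOUT is assumed here to be an expected outcome. */
    return rc != 0 && rc != ETIMEDOUT;
}

/*
 * Hypothetical wrapper: wait on cond until abstime. The caller must hold
 * mutex. Only genuine failures are logged; the return code is passed back.
 */
static int wait_until(pthread_cond_t *cond, pthread_mutex_t *mutex,
                      const struct timespec *abstime)
{
    int rc = pthread_cond_timedwait(cond, mutex, abstime);

    if (timedwait_rc_is_error(rc))
        fprintf(stderr, "pthread_cond_timedwait: %s\n", strerror(rc));
    return rc;
}
```

Note that `pthread_cond_timedwait` returns the error number directly instead of setting `errno`, which is why the sketch passes `rc` to `strerror`.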