Commits · d1fbb57b4c3fc9cb2b15a1af8c27326f5cafdf26 · Manuel G. Marciani / ces_slurm_simulator

22 Jul, 2016 3 commits
- Always report a 0 exit code for the extern step instead of being canceled · d1fbb57b
  Danny Auble authored Jul 21, 2016
```
or failed based on the signal that would always be killing it.
```
  d1fbb57b
- Create the extern step while creating the job instead of waiting until the · a1953657
  Danny Auble authored Jul 21, 2016
```
end of the job to do it.
```
  a1953657
- qsub - When doing the default output files for an array in qsub style · 8e008533
  Danny Auble authored Jul 21, 2016
```
make them using the master job ID instead of the normal job ID.
```
  8e008533
21 Jul, 2016 2 commits
- Add README info about some contribs files · a307e1ae
  Morris Jette authored Jul 20, 2016
  
  a307e1ae
- Treat invalid user ID in AllowUserBoot as error · 59e66700
  Morris Jette authored Jul 20, 2016
```
Treat invalid user ID in AllowUserBoot option of knl.conf file as error
    rather than fatal (log and do not exit).
```
  59e66700
20 Jul, 2016 4 commits
- Prevent slurmctld abort on kill of job waiting node reboot · 1aa7af7d
  Morris Jette authored Jul 20, 2016
```
Prevent slurmctld abort if job is killed or requeued while waiting for
    reboot of its allocated compute nodes. The _wait_boot() would
    reference job_ptr->node_bitmap, which would be NULL.
```
  1aa7af7d
- Fixed race condition in PMIx Fence logic · cf6733be
  Boris Karasev authored Jul 20, 2016
```
Bug 2908
```
  cf6733be
- Continuation of commit 65b4f283 · 71ddc0a5
  Danny Auble authored Jul 20, 2016
  
  71ddc0a5
- Prevent segfault when attempting to cleanup a SLURM_PENDING_STEP. · 3b914e5b
  Tim Wickberg authored Jul 20, 2016
```
Step hasn't been assigned resources, so the select_jobinfo struct
hasn't yet been populated. Calling select_g_step_finish will dereference it,
causing a segfault.

Bug 2922.
```
  3b914e5b
19 Jul, 2016 6 commits

Add routing queue info to Slurm FAQ web page · f88119ff
Morris Jette authored Jul 19, 2016

f88119ff
Fix some typos in comments and logs · 5a45503c
Gennaro Oliva authored Jul 19, 2016

5a45503c

Improve partition AllowGroups caching · 7e381982

Morris Jette authored Jul 19, 2016

If the user is not allowed to use the partition,
    then do not check that user's group access again for 5 seconds.
bug 2913

7e381982

Improve partition AllowGroups caching · 98dc38b2

Morris Jette authored Jul 19, 2016

Improve partition AllowGroups caching. Update the table of UIDs permitted to
    use a partition based upon its AllowGroups configuration parameter as new
    valid UIDs are found rather than looking up that user's group information
    for every job they submit, which can involve considerable overhead for
    some systems.
bug 2913

98dc38b2
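
The caching idea described in the two AllowGroups commits above can be illustrated with a minimal sketch. This is not the slurmctld code: the class `PartitionAccessCache`, the `lookup_groups` callback, and the injectable `clock` are hypothetical names introduced here, and the sketch assumes the 5-second window applies to users whose last check failed.

```python
import time

# The 5-second figure comes from the commit message; everything else
# in this sketch is an illustrative assumption.
NEGATIVE_TTL = 5.0


class PartitionAccessCache:
    """Hypothetical per-partition cache of UIDs permitted by AllowGroups."""

    def __init__(self, lookup_groups, allow_groups, clock=time.monotonic):
        self._lookup = lookup_groups      # uid -> set of group names (expensive)
        self._allow = set(allow_groups)   # groups permitted on this partition
        self._clock = clock
        self._allowed_uids = set()        # UIDs already found to be valid
        self._denied_at = {}              # uid -> time of last failed check

    def is_allowed(self, uid):
        # Fast path: UID was validated by an earlier job submission.
        if uid in self._allowed_uids:
            return True
        now = self._clock()
        denied = self._denied_at.get(uid)
        # Recently denied: skip the expensive group lookup.
        if denied is not None and now - denied < NEGATIVE_TTL:
            return False
        if self._allow & self._lookup(uid):
            self._allowed_uids.add(uid)
            self._denied_at.pop(uid, None)
            return True
        self._denied_at[uid] = now
        return False
```

With this shape, a user who submits many jobs triggers one group lookup rather than one per job, and a user who is denied triggers at most one lookup per 5-second window.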

Minimize preempted jobs · b9f17b18

Morris Jette authored Jul 18, 2016

Minimize preempted jobs for configurations with multiple jobs per node.
  Previous logic would preempt every job on node allocated to pending
  job.
bug 2906

b9f17b18

gres-flags=enforce-binding fix · 5df8509f

Morris Jette authored Jul 18, 2016

Fix for core selection with job --gres-flags=enforce-binding option.
    Previous logic would in some cases allocate a job zero cores, resulting in
    slurmctld abort.
bug 2808

5df8509f

18 Jul, 2016 3 commits

Improve GRES log format · b5e54e11

Morris Jette authored Jul 18, 2016

Add some indentation so that GRES topology-specific information
  logged is more readable.

b5e54e11

Select/cons_res memory corruption fix · c06db0de

Morris Jette authored Jul 18, 2016

A job allocation selecting nodes and no cores/CPUs could write
  off the end of arrays and corrupt memory. Now to figure out how
  the logic reached this point in the first place.
bug 2808

c06db0de

Add SLUGM16 dinner info · 6dc074c8
Morris Jette authored Jul 18, 2016

6dc074c8

16 Jul, 2016 4 commits

Add SLURM_PENDING_STEP id so it won't be confused with SLURM_EXTERN_CONT. · 0c7bd6d0

Danny Auble authored Jul 15, 2016

In commit b8190e5d many places that were meant to be pending step ids
were changed to be extern_step id.  The main problem was when we came up
with the idea of the extern step we reused -1 (INFINITE) for the id.  So
pending steps also appeared to be extern steps as well.  Hopefully this
fixes the situation.

Bug 2907

0c7bd6d0
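
The collision described in this commit message can be shown with a small sketch. The numeric values below are illustrative assumptions, since the commit message only states that -1 (INFINITE) was reused; `describe_step` is a hypothetical helper, not a Slurm function.

```python
# Illustrative values only: the commit message does not give the real constants.
SLURM_EXTERN_CONT = 0xFFFFFFFF   # -1 (INFINITE) viewed as an unsigned 32-bit id
SLURM_PENDING_STEP = 0xFFFFFFFD  # assumed distinct sentinel for pending steps


def describe_step(step_id):
    """Return a label for a step id, keeping the two sentinels apart."""
    if step_id == SLURM_EXTERN_CONT:
        return "extern"
    if step_id == SLURM_PENDING_STEP:
        return "pending"
    return str(step_id)


# Before the fix both kinds of step shared one sentinel, so a check like
# the first branch above would also match pending steps.
print(describe_step(SLURM_PENDING_STEP))
print(describe_step(SLURM_EXTERN_CONT))
print(describe_step(3))
```

Giving each special step its own reserved value means code that compares ids no longer needs extra context to tell a pending step from the extern step.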

Remove vestigial comment · 71800937
Morris Jette authored Jul 15, 2016

71800937

Move startup of power save thread · fb8e3558

Morris Jette authored Jul 15, 2016

Start power save thread only after the partition information is read
  in order to avoid trying to interpret the SuspendExcParts configuration
  information before the partition information is available, which would
  result in a slurmctld abort.

fb8e3558

Prevent slurmctld race condition · c7cae55b

Morris Jette authored Jul 15, 2016

Do not try to access part_list variable (partition list pointer)
  if not yet initialized. Return NULL pointer rather than aborting
  with NULL pointer.

c7cae55b

15 Jul, 2016 13 commits
- Fix spelling of hierarchy in comments · 4f3a0a02
  Tim Wickberg authored Jul 15, 2016
  
  4f3a0a02
- Do not schedule powered down nodes in FAILED state · 310de98d
  Jacek Budzowski authored Jul 15, 2016
```
bug 2900
```
  310de98d
- Remove unnecessary test for super user in regression test · 2a7d01a5
  Nicolas Joly authored Jul 15, 2016
  
  2a7d01a5
- Cleanup generated files if test cannot run due to inappropriate conditions. · b9abe288
  Nicolas Joly authored Jul 15, 2016
  
  b9abe288
- Fix user message in test1.32 to report correct signal USR2. · 7f98f056
  Nicolas Joly authored Jul 15, 2016
  
  7f98f056
- Update LRZ site report in SLUG16 agenda · 48dc2bec
  Morris Jette authored Jul 15, 2016
  
  48dc2bec
- Move commit 30f4f81c to be above code that could call · f2b1c35f
  Danny Auble authored Jul 14, 2016
```
delete_step_records which would delete the steps without the killing flag
set.
```
  f2b1c35f
- More on the others dealing with the extern cleanup. · 7c831dd9
  Danny Auble authored Jul 14, 2016
```
What this does is treat the extern step like a normal step on exit.  It
doesn't appear the original code is needed anymore and this simplifies
the code.

The select_cray change is relevant since the add is needed only when
killing the step as that is the only place _internal_step_complete isn't
used.
```
  7c831dd9
- Continuation of commit 667f1105. Remove unneeded job_ptr variable from · d74bcf74
  Danny Auble authored Jul 14, 2016
```
functions.
```
  d74bcf74
- Continuation of commit 667f1105 to simplify the code more · 28745901
  Danny Auble authored Jul 14, 2016
  
  28745901
- Slightly better debug when dealing with stepd_completions. · af4d7aa4
  Danny Auble authored Jul 14, 2016
  
  af4d7aa4
- Make scontrol show steps show the extern step correctly. · a4c2649d
  Danny Auble authored Jul 14, 2016
```
Before it was showing it as TBD since pending steps and the extern step
have the same stepid.
```
  a4c2649d
- Various cleanup needed for extern step. Continuation of commit 2fc0c860 · c79063b0
  Danny Auble authored Jul 14, 2016
```
What this does is set the state earlier to match a normal step.

Remove the unneeded _send_pending_exit_msgs.  There is only one task and
we have the message for it, so don't worry about that one.

Most important, wait for the other slurmstepd's to send their message,
otherwise they could be lost on the other end.
```
  c79063b0
14 Jul, 2016 5 commits

Move talks at SLUG16 · f27341f7
Morris Jette authored Jul 14, 2016

f27341f7

Fix gang scheduling and license release logic · 111e3b48

Morris Jette authored Jul 14, 2016

Fix gang scheduling and license release logic if single node job killed on
    bad node. Notifying gang and releasing licenses is normally done when
    the epilog completion happens, but if the node(s) assigned to a job are
    all down, that does not happen. This results in the licenses being
    reserved indefinitely and the gang scheduler being left with a bad
    (old) job pointer that can result in various failure modes.
bug 2867

111e3b48

Add SLUG16 agenda/hotel links · 5041174f
Morris Jette authored Jul 14, 2016

5041174f
SLUG16 agenda update · c3a7d302
Morris Jette authored Jul 14, 2016
```
Add hotels. Other minor changes.
```
c3a7d302
Fix missing variable from commit 30f4f81c · da462dbf
Danny Auble authored Jul 13, 2016

da462dbf