Commits · c0b87d5ea09c1864470aa30bf055b462f3a0771d · Manuel G. Marciani / ces_slurm_simulator

27 Jul, 2016 4 commits
- burst buffer documentation · c0b87d5e
  Morris Jette authored Jul 27, 2016
```
Document that persistent burst buffers cannot be created or destroyed using
    the salloc or srun --bb options.
bug 2404
```
  c0b87d5e
- Fix incorrect casting · 31160391
  Brian Christiansen authored Jul 27, 2016
```
Missed in b5bba34c
```
  31160391
- Streamline when schedule() is called when running with message aggregation · 1b889109
  Danny Auble authored Jul 26, 2016
```
on batch script completes.
```
  1b889109
- Fix potential deadlock if running with message aggregation. · 661f0c36
  Danny Auble authored Jul 26, 2016
  
  661f0c36
26 Jul, 2016 2 commits
- Start NEWS for v16.05.4 · 882718c8
  Morris Jette authored Jul 26, 2016
  
  882718c8
- Fix eligible_time for elasticsearch as well as add queue_wait · 0a4d5770
  Danny Auble authored Jul 25, 2016
```
(difference between start of job and when it was eligible).
```
  0a4d5770
25 Jul, 2016 2 commits
- CRAY - Change slurmconfgen_smw.py to skip over disabled nodes. · 32ab75f8
  David Gloe authored Jul 25, 2016
```
Bug 2939.
```
  32ab75f8
- CRAY - Fix minor memory leak in switch plugin. · 348637e9
  Danny Auble authored Jul 25, 2016
  
  348637e9
23 Jul, 2016 1 commit
- Print correct cluster name in "slurmd -C" output · c39f9ac9
  Morris Jette authored Jul 22, 2016
  
  c39f9ac9
22 Jul, 2016 4 commits
- Fix to allow users to update QOS on pending jobs. · 22171a3a
  Dominik Bartkiewicz authored Jul 22, 2016
```
Inadvertently broken in commit 05eac196.

Bug 2912.
```
  22171a3a
- Always report a 0 exit code for the extern step instead of being canceled · d1fbb57b
  Danny Auble authored Jul 21, 2016
```
or failed based on the signal that would always be killing it.
```
  d1fbb57b
- Create the extern step while creating the job instead of waiting until the · a1953657
  Danny Auble authored Jul 21, 2016
```
end of the job to do it.
```
  a1953657
- qsub - When doing the default output files for an array in qsub style · 8e008533
  Danny Auble authored Jul 21, 2016
```
make them using the master job ID instead of the normal job ID.
```
  8e008533
21 Jul, 2016 1 commit

Treat invalid user ID in AllowUserBoot as error · 59e66700

Morris Jette authored Jul 20, 2016

Treat invalid user ID in AllowUserBoot option of knl.conf file as error
    rather than fatal (log and do not exit).

59e66700

20 Jul, 2016 3 commits

Prevent slurmctld abort on kill of job waiting node reboot · 1aa7af7d

Morris Jette authored Jul 20, 2016

Prevent slurmctld abort if job is killed or requeued while waiting for
reboot of its allocated compute nodes. The _wait_boot() would
reference job_ptr->node_bitmap, which would be NULL.

1aa7af7d

Fixed race condition in PMIx Fence logic · cf6733be
Boris Karasev authored Jul 20, 2016
```
Bug 2908
```
cf6733be

Prevent segfault when attempting to cleanup a SLURM_PENDING_STEP. · 3b914e5b

Tim Wickberg authored Jul 20, 2016

Step hasn't been assigned resources, so the select_jobinfo struct
hasn't yet been populated. Calling select_g_step_finish will dereference
it, causing a segfault.

Bug 2922.

3b914e5b

19 Jul, 2016 5 commits

Add routing queue info to Slurm FAQ web page · f88119ff
Morris Jette authored Jul 19, 2016

f88119ff

Improve partition AllowGroups caching · 7e381982

Morris Jette authored Jul 19, 2016

If the user is not allowed to use the partition,
    then do not check that user's group access again for 5 seconds.
bug 2913

7e381982

Improve partition AllowGroups caching · 98dc38b2

Morris Jette authored Jul 19, 2016

Improve partition AllowGroups caching. Update the table of UIDs permitted to
    use a partition based upon its AllowGroups configuration parameter as new
    valid UIDs are found rather than looking up that user's group information
    for every job they submit, which can involve considerable overhead for
    some systems.
bug 2913

98dc38b2
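
The two AllowGroups caching commits above can be illustrated with a minimal sketch. It is not Slurm's C implementation: the class name, the `lookup_groups` callback, and the injectable `clock` are hypothetical, and it assumes the 5-second rule applies to users who failed the group check.

```python
import time

NEGATIVE_CACHE_SECONDS = 5  # interval taken from the commit message


class PartitionAccessCache:
    """Hypothetical per-partition cache of AllowGroups decisions."""

    def __init__(self, lookup_groups, allow_groups, clock=time.monotonic):
        self._lookup_groups = lookup_groups  # expensive: uid -> group names
        self._allow_groups = set(allow_groups)
        self._clock = clock
        self._allowed_uids = set()  # grows as valid UIDs are found
        self._denied_at = {}        # uid -> time of the last failed check

    def is_allowed(self, uid):
        if uid in self._allowed_uids:
            return True  # known-good UID, no group lookup needed
        now = self._clock()
        denied = self._denied_at.get(uid)
        if denied is not None and now - denied < NEGATIVE_CACHE_SECONDS:
            return False  # denied recently, skip the expensive lookup
        if self._allow_groups & set(self._lookup_groups(uid)):
            self._allowed_uids.add(uid)
            self._denied_at.pop(uid, None)
            return True
        self._denied_at[uid] = now
        return False
```

With this shape, a user who submits many jobs triggers one group lookup when allowed, and at most one lookup per 5 seconds when denied.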

Minimize preempted jobs · b9f17b18

Morris Jette authored Jul 18, 2016

Minimize preempted jobs for configurations with multiple jobs per node.
  Previous logic would preempt every job on node allocated to pending
  job.
bug 2906

b9f17b18
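
The idea of preempting only as many jobs as needed, rather than every job on the node, can be sketched as follows. The commit message does not describe the selection rule, so the greedy largest-first choice, the function name, and the CPU-only resource model are all assumptions made for illustration.

```python
def pick_jobs_to_preempt(running_jobs, cpus_needed, cpus_idle=0):
    """Return names of jobs to preempt on one node, or None if impossible.

    running_jobs: list of (job_name, cpus) for preemptable jobs on the node.
    """
    chosen = []
    available = cpus_idle
    # Largest jobs first so that few jobs are chosen (greedy, not optimal).
    for name, cpus in sorted(running_jobs, key=lambda j: j[1], reverse=True):
        if available >= cpus_needed:
            break  # enough CPUs freed; leave the remaining jobs running
        chosen.append(name)
        available += cpus
    return chosen if available >= cpus_needed else None
```

For a node running jobs of 2, 8, and 4 CPUs, a pending job that needs 8 CPUs would preempt only the 8-CPU job under this sketch.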

gres-flags=enforce-binding fix · 5df8509f

Morris Jette authored Jul 18, 2016

Fix for core selection with job --gres-flags=enforce-binding option.
    Previous logic would in some cases allocate a job zero cores, resulting in
    slurmctld abort.
bug 2808

5df8509f

16 Jul, 2016 2 commits

Add SLURM_PENDING_STEP id so it won't be confused with SLURM_EXTERN_CONT. · 0c7bd6d0

Danny Auble authored Jul 15, 2016

In commit b8190e5d many places that were meant to be pending step ids
were changed to be extern_step id.  The main problem was when we came up
with the idea of the extern step we reused -1 (INFINITE) for the id.  So
pending steps also appeared to be extern steps as well.  Hopefully this
fixes the situation.

Bug 2907

0c7bd6d0
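
The collision described above can be sketched in a few lines. The numeric value chosen for `SLURM_PENDING_STEP` here is an assumption for illustration; only the reuse of -1 (INFINITE) for the extern step comes from the commit message, and `describe_step` is a hypothetical helper.

```python
INFINITE = 0xFFFFFFFF  # 32-bit representation of -1

# Illustrative values only; the real constants live in the Slurm headers.
SLURM_EXTERN_CONT = INFINITE
SLURM_PENDING_STEP = 0xFFFFFFFD  # assumed: any value distinct from INFINITE


def describe_step(step_id):
    """Map a step id to a label, keeping the two sentinels apart."""
    if step_id == SLURM_EXTERN_CONT:
        return "extern"
    if step_id == SLURM_PENDING_STEP:
        return "pending"
    return str(step_id)
```

When both sentinels shared the value -1, the second comparison could never be reached, which is why pending steps were reported as extern steps.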

Move startup of power save thread · fb8e3558

Morris Jette authored Jul 15, 2016

Start power save thread only after the partition information is read
  in order to avoid trying to interpret the SuspendExcParts configuration
  information before the partition information is available, which would
  result in a slurmctld abort.

fb8e3558

15 Jul, 2016 2 commits
- Do not schedule powered down nodes in FAILED state · 310de98d
  Jacek Budzowski authored Jul 15, 2016
```
bug 2900
```
  310de98d
- Make scontrol show steps show the extern step correctly. · a4c2649d
  Danny Auble authored Jul 14, 2016
```
Before it was showing it as TBD since pending steps and the extern step
have the same stepid.
```
  a4c2649d
14 Jul, 2016 2 commits

Fix gang scheduling and license release logic · 111e3b48

Morris Jette authored Jul 14, 2016

Fix gang scheduling and license release logic if single node job killed on
    bad node. Notifying gang and releasing licenses is normally done when
    the epilog completion happens, but if the node(s) assigned to a job are
    all down, that does not happen. This results in the licenses being
    reserved indefinitely and the gang scheduler being left with a bad
    (old) job pointer that can result in various failure modes.
bug 2867

111e3b48

CRAY - If trying to kill a step and you have NHC_NO_STEPS set run NHC · e956f297
Danny Auble authored Jul 13, 2016
```
anyway to attempt to log the backtraces of the potential
unkillable processes.
```
e956f297

13 Jul, 2016 1 commit
- CRAY - Simplify when a NHC is called on a step that has unkillable · 603ae198
  Danny Auble authored Jul 13, 2016
```
processes.
```
  603ae198
12 Jul, 2016 6 commits
- Fix test1.29 / 17.15 for limits above 32-bits. · 0b8bbc00
  Nicolas Joly authored Jul 12, 2016
```
Bug 2892.
```
  0b8bbc00
- CRAY - Fix for reporting steps lingering after they are already finished. · cd06d0f9
  Danny Auble authored Jul 12, 2016
```
Bug 2874

We will most likely redo this logic (as it appears to be duplicated) in
a following patch.
```
  cd06d0f9
- Fix for burst_buffer/cray batch submit error · 7cdcc25c
  Morris Jette authored Jul 12, 2016
```
Don't generate an error when a batch job is submitted that must wait
  for stage-in before starting.
```
  7cdcc25c
- CRAY - Fix add of extern step to AELD. · 667f1105
  Danny Auble authored Jul 12, 2016
  
  667f1105
- CRAY - Fix issue if pid has already been added to another job container. · 067f2ee3
  Danny Auble authored Jul 12, 2016
```
Bug 2886
```
  067f2ee3
- Fix sstat to print job.batch step when requesting multiple jobs. · 59ae8600
  Jacek Budzowski authored Jul 12, 2016
```
Was incorrectly translating request to job.extern if part of a
comma-separated list.

Bug 2890.
```
  59ae8600
11 Jul, 2016 1 commit
- Fix proctrack plugin to only add the pid of a process once · a280c7f7
  Danny Auble authored Jul 11, 2016
```
(regression in 16.05.2).

related commit 5d3e5e1e

Bug 2612 and 2886
```
  a280c7f7
08 Jul, 2016 4 commits

document salloc burst buffer limitations · 12837d1c

Morris Jette authored Jul 08, 2016

Document limitations in burst buffer use by the salloc command (possible
    access problems from a login node).
bug 2883

12837d1c

task/cgroup set of soft memory limit added · 0eac300d

Janne Blomqvist authored Jul 08, 2016

If the task/cgroup plugin is configured with ConstrainRAMSpace=yes, then set soft
    memory limit to allocated memory limit (previously no soft limit was set).
bug 2679

0eac300d
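
A minimal sketch of the behavior this commit describes, assuming the cgroup v1 memory controller file names `memory.limit_in_bytes` and `memory.soft_limit_in_bytes`. The function name is hypothetical, and the sketch ignores any scaling the plugin may apply to the hard limit.

```python
def memory_cgroup_settings(allocated_bytes, constrain_ram_space):
    """Return cgroup v1 memory controller files to write, as name -> value."""
    if not constrain_ram_space:
        return {}  # ConstrainRAMSpace=no: leave memory limits untouched
    return {
        "memory.limit_in_bytes": allocated_bytes,       # hard limit
        "memory.soft_limit_in_bytes": allocated_bytes,  # soft limit, newly set
    }
```

For example, a 1 GiB allocation with `ConstrainRAMSpace=yes` would yield the same value, 1073741824, for both files.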

Add web link to collectd from EDF · bf9a621d
Morris Jette authored Jul 08, 2016

bf9a621d

MYSQL - Slightly better logic if a job completion comes in with an end time · c1eb2872

Danny Auble authored Jul 07, 2016

of 0.

This might be the cause of runaway jobs.  I couldn't see how an end_time
could be 0, but if it was it would just exit and never set time_end to
anything.  At least if it happens now we can have an idea that it is
possible and we will have an idea this is the place it happens.

c1eb2872