Commits · b72096ac0b0110d0c9ae08de501d68e365673e36 · Manuel G. Marciani / ces_slurm_simulator

04 Aug, 2017 2 commits
- Sort TRES id's on limits when getting them from the database. · 7e55acf7
  Danny Auble authored Aug 04, 2017
  
  7e55acf7
- Fix inherited association 'max' TRES limits combining multiple limits in · ab24f8b4
  Danny Auble authored Aug 04, 2017
```
the tree.

Bug 4050
```
  ab24f8b4
02 Aug, 2017 2 commits

Fix starting ctld w/out existing StateSaveLocation · ec78d45a

Marshall Garey authored Aug 02, 2017

Would fail when trying to create the clustername file because the
StateSaveLocation path didn't exist yet.

Bug 3988

ec78d45a

Fix srun jobs to run in high prio partition · 948de46b

Marshall Garey authored Aug 02, 2017

srun jobs that could start immediately and requested multiple partitions
didn't run in the highest priority partition if the highest priority
partition wasn't listed first.

It's possible that the scontrol show jobs will show the partition list
in priority order now that the job's partition list gets sorted by
priority.

Bug 4015

948de46b

01 Aug, 2017 2 commits
- Increase buffer to handle long /proc//stat output · 9f3b04c0
  Tim Shaw authored Aug 01, 2017
```
Bug 3999
```
  9f3b04c0
- Fix GRES selection with CPU binding · e94fdf2e
  Dominik Bartkiewicz authored Aug 01, 2017
```
Fix bug in selection of GRES bound to specific CPUs where the GRES count
    is 2 or more. Previous logic could allocate CPUs not available to the job.

bug 4029
```
  e94fdf2e
31 Jul, 2017 1 commit
- Docment inconsistent behavior of GroupUpdateForce option. · 333932bc
  Tim Shaw authored Jul 31, 2017
```
This will be fixed before 17.11, but is being left as-is on 17.02.

Bug 3956.
```
  333932bc
28 Jul, 2017 2 commits

Fix issue when an alternate munge key when communicating on a persistent · 591dc036
Danny Auble authored Jul 28, 2017
```
connection.

Bug 4009
```
591dc036

jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a

Alejandro Sanchez authored Jul 28, 2017

jobcomp/elasticsearch saves/load the state to/from elasticsearch_state.  Since
the jobcomp API isn't designed with save/load state operations, the plugin
_save_state() isn't extern and not available from outside the plugin itself,
thus it is highly coupled to fini() function. This state doesn't follow the
same execution path as the rest of Slurm states, where in save_all_sate()
they are all independently scheduled. So we save it manually here on a RPC
of type REQUEST_CONTROL.

This enables that when the Primary ctld issues a REQUEST_CONTROL to the Backup
which is currently in controller mode, the Backup will save the state and when
the Primary assumes control again it can process the saved pending jobs.  The
other way around was already controlled, because when the Primary is running
in controller mode and the Backup issues a REQUEST_CONTROL, the Primary is
shutdown and when breaking the ctld main() function while(1) loop, there was
already a g_slurm_jobcomp_fini() call in place.

Bug 3908

8944b77a

27 Jul, 2017 1 commit

Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5

Alejandro Sanchez authored Jul 27, 2017

When more than 1 ping cycle is spawned simultaneously (for instance
REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the selected nodes),
we do not track a separate ping_start time for each cycle. When ping_begin()
is called, the information about the previous ping cycle is lost. Then when
ping_end() is called for the first of the two cycles, we set ping_start=0,
which is incorrectly used to see if the last cycle ran for more than
PING_TIMEOUT seconds (100s), thus incorrectly triggering the:

error("Node ping apparently hung, many nodes may be DOWN or configured "
"SlurmdTimeout should be increased");

Bug 3914

f7463ef5

26 Jul, 2017 3 commits
- Fix issue where UnkillableStepProgram if step was in an ending state. · 9f48e07c
  Danny Auble authored Jul 26, 2017
  
  9f48e07c
- Fix minor memory leak if launch fails in the slurmstepd. · 558d7c1a
  Danny Auble authored Jul 24, 2017
  
  558d7c1a
- If failing after switch_g_job_init happened make sure switch_g_job_fini is called. · 488c7c36
  Danny Auble authored Jul 24, 2017
```
Bug 3865
```
  488c7c36
24 Jul, 2017 2 commits
- CRAY - Throttle step creation if trying to create too many steps at once. · f9f13a86
  Morris Jette authored Jul 24, 2017
  
  f9f13a86
- Fix memory leak in slurmctld when agent queue to the DBD has filled up. · 6c7b9ba1
  Danny Auble authored Jul 24, 2017
```
Pretty much fix the entire purpose of this max_agent_queue.
```
  6c7b9ba1
21 Jul, 2017 3 commits
- Serialize updates from from the dbd to the slurmctld. · 24375cb8
  Danny Auble authored Jul 21, 2017
```
Bug 3159
```
  24375cb8
- Fixed truncation on scontrol show config output. · 5b1983d5
  Tim Shaw authored Jul 21, 2017
```
Bug 3956
```
  5b1983d5
- Better debug when slurmdbd queue is filling up in the slurmctld. · 4a46c5a6
  Danny Auble authored Jul 21, 2017
```
Bug 3967
```
  4a46c5a6
19 Jul, 2017 3 commits
- Fix race condition when using jobacct_gather/cgroup where the memory of the · e5c05549
  Danny Auble authored Jul 19, 2017
```
step wasn't always gathered correctly.

Bug 3531
```
  e5c05549
- Prevent slurmctld abort with gres socket binding · c850ccf4
  Morris Jette authored Jul 19, 2017
```
Fix for possible slurmctld abort with use of salloc/sbatch/srun
    --gres-flags=enforce-binding option.
bug 4008
```
  c850ccf4
- Clarify item in NEWS · ac896302
  Morris Jette authored Jul 19, 2017
```
Update from commit b40bd8d3
```
  ac896302
18 Jul, 2017 1 commit

Fix issue with multiple jobs from an array to start. · b40bd8d3

Dominik Bartkiewicz authored Jul 18, 2017

By removing the real locks we can get into a race condition where the prolog
starts and finishes before we get here and then we end up waiting forever.

Making the mutex a static seemed to help in many cases, but didn't
completely close the window.  Changing slurm_cond_wait to
slurm_cond_timedwait fixed the scenario where we would hit the window, but
not degrade performance the original commit provides.

There were also spots where if the job or step didn't exist it wouldn't
signal the conditional also providing a spot this could get stuck not
starting the job.

Fix regression from commit 52ce3ff0

Bug 3977

b40bd8d3

14 Jul, 2017 1 commit

Fix issue with whole gres not being printed out with Slurm tools. · 028bf3e1

Danny Auble authored Jul 13, 2017

This is a regression from commit fec995e0.

It turns out using tok here was erroneous for situations where the gres had
a type and name and potentially a count (i.e. network:gigabit:1)

_get_gres_req_cnt() would alter the incoming char *config which is what tok
was.  So when we print it back to the requested string it would only have
what was there to the first ':'.  As we didn't need to \0 out the first char
as we skip over it anyway I just kept track of what the replaced \0 was for
the number portion and put it back when we are done copying it.

Related to bug 3521

028bf3e1

13 Jul, 2017 6 commits
- Update typo in NEWS · 311dfd67
  Morris Jette authored Jul 13, 2017
  
  311dfd67
- Modify srun --pty option to use configured SrunPortRange range · 9ed3a300
  Tim Shaw authored Jul 13, 2017
```
bug 3979
```
  9ed3a300
- When resuming node only send one message to the slurmdbd. · 2f2680d7
  Danny Auble authored Jul 13, 2017
```
Bug 3967
```
  2f2680d7
- Make srun --pty option ignore EINTR allowing windows to resize. · 88c60014
  Danny Auble authored Jul 13, 2017
```
Bug 3979 and 3989
```
  88c60014
- Revert "Make srun --pty option ignore EINTR allowing windows to resize." · 8dd12372
  Danny Auble authored Jul 13, 2017
```
This reverts commit d49081df.
```
  8dd12372
- Make srun --pty option ignore EINTR allowing windows to resize. · d49081df
  Danny Auble authored Jul 13, 2017
```
Bug 3979 and 3989
```
  d49081df
07 Jul, 2017 4 commits

Set job/step start and end times to 0 when using --truncate and start > end. · 3e11d04c

Alejandro Sanchez authored Jul 07, 2017

Otherwise we can end up printing Start times greater than End times,
leading to confusion when reading sacct output. 0 is displayed as Unknown.
Cosmetic change.

Bug 3940.

3e11d04c

Do not defer slurmd node registration if HealthCheckProgram fails · b31fa177

Alejandro Sanchez authored Jul 07, 2017

This behavior was introduced in bug 2504, commit 7fb0c981 and bug 2643
commit 988edf12 respectively.

The reasoning is that sysadmins who see nodes with Reason "Not Responding"
but they can manually ping/access the node end up confused. That reason
should only be set if the node is trully not responding, but not if the
HealthCheckProgram execution failed or returned non-zero exit code. For
that case, the program itself would take the appropiate actions, such
as draining the node and setting an appropiate Reason.

Bug 3931

b31fa177

Fix potential memory leak when creating partition name. · 3c161d32
Dominik Bartkiewicz authored Jul 07, 2017

3c161d32
Fix deadlock if requesting to create more than 10000 reservations. · b7bb1e05
Dominik Bartkiewicz authored Jul 07, 2017

b7bb1e05

05 Jul, 2017 1 commit
- Start NEWS for v17.02.7 · 856ba827
  Morris Jette authored Jul 05, 2017
  
  856ba827
30 Jun, 2017 2 commits

Burst buffer size unit changes · 7e161809

Alejandro Sanchez authored Jun 30, 2017

burst_buffer logic modified to support sizes in both SI and EIC size units
    (e.g. M/MiB for powers of 1024, MB for powers of 1000).
bug 3922

7e161809

Fix potential to corrupt DBD message. · 3e00ede5

Dominik Bartkiewicz authored Jun 30, 2017

This patch removes a window in which a message bound for the DBD
could be packed with the non-dbd packing.  This would result in a
packed msg_type, but nothing else.  When that message was given to
the DBD it would complain forever about an unpacking error.

Bug 3891 and 3939

3e00ede5

29 Jun, 2017 1 commit
- CRAY - Add configuration for ATP to the ansible play script. · c16581f6
  David Gloe authored Jun 29, 2017
  
  c16581f6
28 Jun, 2017 3 commits
- Just don't free the pthread_cond, we are shutting down anyway, so no harm · 0b30c37c
  Danny Auble authored Jun 28, 2017
```
done.

Pretty much remove 10cc6f93

Bug 3919
```
  0b30c37c
- Fix potential degradation when running HTC (> 100 jobs a sec) like · 52ce3ff0
  Danny Auble authored Jun 28, 2017
```
workflows through the slurmd.

Bug 3833
```
  52ce3ff0
- Fix missing initialization in slurmd. · a3abbd7f
  Tim Wickberg authored Jun 28, 2017
  
  a3abbd7f