Commits · 21ca33f56f67c9abfc9f1ef7767eb878ada8a3ce · Manuel G. Marciani / ces_slurm_simulator

30 May, 2018 1 commit

Fix deadlock in slurmstepd during shutdown. · 21ca33f5

Tim Wickberg authored May 16, 2018

The race condition was created in a7c8964e in 17.11.6 when removing
the (unsafe) pthread_cancel code handling thread termination.

Bug 5164

21ca33f5

24 May, 2018 1 commit

Notify srun and ctld when unkillable stepd exits · 956a808d

Brian Christiansen authored May 16, 2018

Commits f18390e8 and eed76f85 modified the stepd so that it simply exits
when it encounters an unkillable step timeout. If the stepd is a batch
step, it replies to the controller with a non-zero exit code, which
drains the node. But if an srun allocation/step were to get into the
unkillable step code, the stepd wouldn't let the waiting srun or
controller know about the step going away, leaving a hanging srun and
job.

This patch enables the stepd to notify the waiting sruns and the ctld of
the stepd being done and drains the node for srun'ed allocations and/or
steps.

Bug 5164

956a808d

21 May, 2018 1 commit
- _post_qos_list() modifies global variables · da1eb7c7
  Dominik Bartkiewicz authored May 21, 2018
```
g_qos_count and g_qos_max_priority, so it must be called under the qos write lock.

Bug 5159.
```
  da1eb7c7
19 May, 2018 2 commits
- Fix warning message in test17.7 · c7ddc591
  Brian Christiansen authored May 18, 2018
```
Display correct path.
```
  c7ddc591
- Fix test17.7 to work with symlinked home dirs · c3b41366
  Bjørn-Helge Mevik authored May 18, 2018
```
Bug 5151
```
  c3b41366
18 May, 2018 2 commits

Update test3.4 for regular user · 9d82faa1

Brian Christiansen authored May 18, 2018

Commits 4454316e and 76706b51 adjusted the priority update logic so
that when a non-authorized user modifies the priority, the change is only
temporary; in most cases the user will never see it.

Bug 5151

9d82faa1

Update limits on GRES docs · 1e1cd45e
Marshall Garey authored May 18, 2018
```
Clarification of c2c06468.

Bug 5150
```
1e1cd45e

17 May, 2018 1 commit
- Have sprio display jobs before eligible time when · 8782db29
  Danny Auble authored May 17, 2018
```
PriorityFlags=ACCRUE_ALWAYS is set.

Bug 5186
```
  8782db29
16 May, 2018 2 commits
- docs - add a note to slurm.conf man page clarifying slurmdbd log rotation. · 74d8344b
  Alejandro Sanchez authored May 16, 2018
```
Bug 5174.
```
  74d8344b
- docs - remove 'nocreate' option from slurm.conf man logrotate example. · 35b67364
  Dan Barke authored May 16, 2018
```
Since having 'nocreate' would override the following option:
create 640 slurm root

Bug 5174.
```
  35b67364
15 May, 2018 3 commits

Make a test more robust · b1c2a6fb

Morris Jette authored May 15, 2018

If ReturnToService=2 is configured, the test could generate an error
changing node state to resume after setting it to down. The reason
is that if the node communicates with slurmctld, its state will
automatically be changed from down to idle, and resuming an idle
node triggers an error.

b1c2a6fb

Run autogen.sh after previous commit. · ac24b431
Alejandro Sanchez authored May 15, 2018
```
Bug 5168.
```
ac24b431

PMIx - override default paths at configure time if --with-pmix is used. · 635c0232

Alejandro Sanchez authored May 15, 2018

Previously the default paths continued to be tested even when new ones
were requested. As a consequence, if any of the new paths was the same
as any of the default ones (i.e. /usr or /usr/local), the configure
script incorrectly errored out, reporting that a version of PMIx had
already been found in a previous path.

Bug 5168.

635c0232

11 May, 2018 2 commits
- Improve test error handling · 1dd0d6f7
  Morris Jette authored May 11, 2018
```
Gracefully fail if salloc does not get job allocation
```
  1dd0d6f7
- Fix Coverity CID 185566: Incorrect expression (COPY_PASTE_ERROR). · b3f2ebf1
  Alejandro Sanchez authored May 11, 2018
```
Introduced in bf4cb0b1.
```
  b3f2ebf1
10 May, 2018 2 commits

Update and modernize dw_wlm_cli cray emulation script · dc7ca7be
Morris Jette authored May 10, 2018

dc7ca7be

Fix different issues when requesting memory per cpu/node. · bf4cb0b1

Alejandro Sanchez authored May 10, 2018

First issue was identified on multi-partition requests. job_limits_check()
was overriding the original memory requests, so the next partition that
Slurm validated limits against was not using the original values. The
solution adds three members to the job_details struct to preserve the
original requests. This issue is reported in bug 4895.

Second issue was that memory enforcement behaved differently depending on
whether the job request was issued against a reservation or not.

Third issue had to do with the automatic adjustments Slurm did underneath
when the memory request exceeded the limit. These adjustments included
increasing pn_min_cpus (even incorrectly beyond the number of cpus
available on the nodes) or different tricks increasing cpus_per_task and
decreasing mem_per_cpu.

Fourth issue was identified when requesting the special case of 0 memory,
which was handled inside the select plugin after the partition validations
and thus that could be used to incorrectly bypass the limits.

Issues 2-4 were identified in bug 4976.

Patch also includes an entire refactor of how and when job memory is
set to default values (if not requested initially) and how and when
limits are validated.

Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>

bf4cb0b1
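
The first issue described in bf4cb0b1 (a limit check overwriting the request that the next partition is validated against) can be sketched in C as follows. This is a simplified illustration, not the actual patch: the commit mentions three new members in the job_details struct without naming them here, so `orig_mem_per_cpu`, `struct job_mem_req`, `struct part_limit` and both functions are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified job record: a preserved copy plus a working value. */
struct job_mem_req {
    uint64_t orig_mem_per_cpu; /* what the user asked for, never modified */
    uint64_t mem_per_cpu;      /* working value used during validation */
};

/* Hypothetical partition with a per-CPU memory ceiling (MB). */
struct part_limit {
    uint64_t max_mem_per_cpu;
};

/* Validate against one partition, always starting from the original request. */
static bool check_one_partition(struct job_mem_req *job,
                                const struct part_limit *part)
{
    job->mem_per_cpu = job->orig_mem_per_cpu; /* reset before each check */
    if (job->mem_per_cpu > part->max_mem_per_cpu) {
        /* Simulates a check that adjusts the working value. */
        job->mem_per_cpu = part->max_mem_per_cpu;
        return false;
    }
    return true;
}

/* Return how many partitions accept the job's original request. */
size_t count_acceptable_partitions(struct job_mem_req *job,
                                   const struct part_limit *parts, size_t n)
{
    size_t accepted = 0;

    for (size_t i = 0; i < n; i++)
        if (check_one_partition(job, &parts[i]))
            accepted++;
    return accepted;
}
```

Without the reset at the top of `check_one_partition()`, the value adjusted for the first partition would leak into the check for the second one, which is the behavior the commit message describes for multi-partition requests.
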

09 May, 2018 18 commits
- Fix for possible slurmctld daemon abort with NULL pointer. · b67d7350
  Morris Jette authored May 09, 2018
```
If running without AccountingStorageEnforce but with the DBD, and the
DBD isn't up when starting the slurmctld, you could get into a
corner case where you don't have a QOS list in the assoc_mgr, and thus
no usage to transfer.

Bug 5156
```
  b67d7350
- Start NEWS for v17.11.7 · 8d8bc02b
  Tim Wickberg authored May 09, 2018
  
  8d8bc02b
- Update META for v17.11.6. · 1d1a0a48
  Tim Wickberg authored May 09, 2018
```
Update slurm.spec and slurm.spec-legacy as well
```
  1d1a0a48
- Fix clang error from de7eac9a. · a3267c00
  Tim Wickberg authored May 09, 2018
```
Clang warns about a possible null dereference of job_part_ptr
if the !job_ptr->priority_array part of the conditional is taken.

Remove that part of the conditional, as it doesn't matter if that is
set or not here. The job's eligibility on one vs. multiple partitions is
not determined by that, but by the status of part_ptr_list and part_ptr.

Bug 5136.
```
  a3267c00
- Document which PMI plugin to use with different OMPI versions · 4db67922
  Morris Jette authored May 09, 2018
  
  4db67922
- Fix spelling · 3c5a7302
  Brian Christiansen authored May 09, 2018
  
  3c5a7302
- Fix typo in NEWS · e8892420
  Felip Moll authored May 09, 2018
  
  e8892420
- select/cons_res - Improve handling of --cores-per-socket. · 6de8c831
  Morris Jette authored May 09, 2018
```
Try to fill up each socket completely before moving into additional
sockets. This will minimize the number of sockets needed, improving
packing especially alongside MaxCPUsPerNode.

Bug 4995.
```
  6de8c831
- Fix misplaced NEWS entry for "select/backfill - fix issue with job resizing". · 2507e149
  Tim Wickberg authored May 09, 2018
```
My mistake on commit 602817c8.

Bug 4922.
```
  2507e149
- select/backfill - fix issue with job resizing · 602817c8
  Felip Moll authored May 09, 2018
```
Without this, gang scheduling would incorrectly kick in for
these jobs since active_resmap has not been updated appropriately.

Bug 4922.
```
  602817c8
- Doc - remove stray reference to non-existent utility. · 5d0cd8cf
  Tim Wickberg authored May 09, 2018
```
Code for this was removed in 2012.

Bug 5126.
```
  5d0cd8cf
- Docs - clarify plugin order importance in pam_slurm_adopt.html. · becf0a73
  Marshall Garey authored May 08, 2018
```
Bug 5026.
```
  becf0a73
- job_submit/lua - return an error if the script uses log.user() within job_modify. · 3f4cde9c
  Tim Wickberg authored May 08, 2018
```
Otherwise this will return the error message back to the next job submitter.

Bug 5106.
```
  3f4cde9c
- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code · f4f42d0f
  Tim Wickberg authored May 08, 2018
```
Bug 5106.
```
  f4f42d0f
- Docs - update checkpoint page with further elaboration on deprecated status. · 356329ef
  Tim Wickberg authored May 08, 2018
```
Link to CRIU as well.

Bug 4293.
```
  356329ef
- sacctmgr - fix padding in help message. · ed56974c
  Tim Wickberg authored May 08, 2018
```
Related to fix from bug 4155.
```
  ed56974c
- Update documentation on sacctmgr WithRawQOSLevel option · cf90a75d
  Josh Samuelson authored Sep 14, 2017
```
Bug 4155.
```
  cf90a75d
- Prevent segfault on 'sprio' if a partition has recently been deleted. · de7eac9a
  Alejandro Sanchez authored May 08, 2018
```
job_ptr->part_ptr is NULL if the partition has been deleted.

Crash only happens with PriorityFlags=CALCULATE_RUNNING enabled.

Bug 5136.
```
  de7eac9a
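
The condition behind de7eac9a (a job whose `part_ptr` is NULL because its partition was deleted) calls for a guard before any dereference. The following C sketch shows the general pattern only; the structures, the `priority_factor` field and `part_priority_factor()` are hypothetical simplifications and not Slurm's real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified records; not Slurm's real structures. */
struct part_record {
    uint32_t priority_factor;
};

struct job_record {
    struct part_record *part_ptr; /* NULL once the partition is deleted */
};

/* Return the partition's priority factor, or 0 when the partition is gone. */
uint32_t part_priority_factor(const struct job_record *job)
{
    if (job == NULL || job->part_ptr == NULL)
        return 0; /* nothing to dereference */
    return job->part_ptr->priority_factor;
}
```

Returning a neutral value for an orphaned job is one possible policy, assumed here for illustration; skipping the job entirely would be another.
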
08 May, 2018 3 commits

Prevent slurmd from launching steps if prolog fails · 3b029021
Brian Christiansen authored May 08, 2018
```
Bug 5146
```
3b029021

Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f

Tim Wickberg authored May 08, 2018

Caused by a corrupted protocol_version field value being received
by the slurmstepd, as we cannot safely write/read a uint16_t across
the pipe as if it was an int.

Regression caused by commit 90b116c2.

Bug 5133.

77d65f4f
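
The width mismatch described in 77d65f4f can be avoided by having both ends of the pipe transfer exactly `sizeof(uint16_t)` bytes. The C sketch below is an assumption-laden illustration, not the actual fix: `send_version()` and `recv_version()` are hypothetical helper names.

```c
#include <stdint.h>
#include <unistd.h>

/* Write exactly sizeof(uint16_t) bytes; returns 0 on success, -1 on error. */
int send_version(int fd, uint16_t version)
{
    ssize_t n = write(fd, &version, sizeof(version));

    return (n == (ssize_t) sizeof(version)) ? 0 : -1;
}

/* Read into a uint16_t of the same width the writer used. */
int recv_version(int fd, uint16_t *version)
{
    ssize_t n = read(fd, version, sizeof(*version));

    return (n == (ssize_t) sizeof(*version)) ? 0 : -1;
}
```

If one side instead treated the value as a 4-byte int and the other as a 2-byte uint16_t, a little-endian host would still see the low-order bytes first and appear to work, while a big-endian host such as ppc64 would see the high-order bytes first and read a wrong value.
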

Fix checkpointing requeued jobs in a bad state · f9f395af

Brian Christiansen authored May 08, 2018

Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
in. The issue is that if job_set_alloc_tres gets called while in the
PENDING|COMPLETING state, the job's alloc_tres_str will be free'd. If
this job then gets checkpointed in this state (PENDING|COMPLETING + no
tres_alloc_str), the controller would crash on startup because it
expected the job to have a tres_alloc_str/cnt when in the COMPLETING
state. This could be triggered by starting the controller without the
dbd up. When the dbd comes up, the assoc_cache_mgr calls
_update_job_tres(), which calls job_set_alloc_tres. It could also be
triggered by adding new tres.

This most likely started happening in 17.11.5 because of commit
865b672f which introduced calling _update_job_tres() on each job
after the dbd comes up.

Bugs 5137, 4522

f9f395af
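
The PENDING|COMPLETING combination in f9f395af is a base state plus a flag bit. The C sketch below shows one way a state-restore check could tolerate a requeued job that has no TRES string. It is not necessarily how the commit fixes the crash, and the constants, the `job_state_rec` struct and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: a base state in the low byte plus flag bits. */
#define ST_BASE_MASK  0x00ffu
#define ST_PENDING    0x0000u
#define ST_RUNNING    0x0001u
#define ST_COMPLETING 0x8000u /* flag, combinable with a base state */

struct job_state_rec {
    uint32_t state;
    const char *tres_alloc_str; /* NULL when no TRES are allocated */
};

static bool is_pending(const struct job_state_rec *j)
{
    return (j->state & ST_BASE_MASK) == ST_PENDING;
}

static bool is_completing(const struct job_state_rec *j)
{
    return (j->state & ST_COMPLETING) != 0;
}

/*
 * Decide whether a restored job record is consistent. A COMPLETING job is
 * expected to carry a TRES string unless it is also PENDING (requeued).
 */
bool job_state_is_consistent(const struct job_state_rec *j)
{
    if (is_completing(j) && !is_pending(j))
        return j->tres_alloc_str != NULL;
    return true;
}
```
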

04 May, 2018 2 commits
- Increase # of tries when sending responses to srun · c1f3eb01
  Brian Christiansen authored May 04, 2018
```
Only when the connection has timed out. If the connection is timing out,
consider increasing TCPTimeout in slurm.conf.

Bug 4574
```
  c1f3eb01
- Fix test to work with new ncurses/expect combo. · 350de5c4
  Danny Auble authored May 04, 2018
  
  350de5c4
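
The retry policy described in c1f3eb01 above (more attempts, but only when the connection has timed out) can be sketched as a small C loop. This is a minimal illustration under stated assumptions: `send_fn`, `send_with_retry()` and the `flaky_send()` example callback are hypothetical and do not reflect Slurm's actual messaging functions.

```c
#include <errno.h>

/* Hypothetical send callback: returns 0 on success, -1 with errno set. */
typedef int (*send_fn)(void *ctx);

/*
 * Call fn up to max_tries times, retrying only on ETIMEDOUT.
 * Returns the number of attempts used on success, or -1 on failure.
 */
int send_with_retry(send_fn fn, void *ctx, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (fn(ctx) == 0)
            return attempt;
        if (errno != ETIMEDOUT)
            return -1; /* other errors are not retried */
    }
    return -1; /* every attempt timed out */
}

/* Example callback: times out until the counter behind ctx reaches zero. */
int flaky_send(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        errno = ETIMEDOUT;
        return -1;
    }
    return 0;
}
```

Restricting retries to timeouts keeps hard failures, such as a refused connection, from being repeated pointlessly.
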