Commits · af27757dd73afb47303a36e6d9031c0452d3bc8a · Manuel G. Marciani / ces_slurm_simulator

30 May, 2018 9 commits
- SLUG18 - conference dinner is Tuesday, not Monday · af27757d
  Tim Wickberg authored May 30, 2018
  
- Merge branch 'slurm-17.02' into slurm-17.11 · b5837d3f
  Tim Wickberg authored May 30, 2018
  
- Add ESLURM_GROUP_ID_MISSING error. · 26996fa8
  Tim Wickberg authored May 29, 2018
```
Value of 2113 is where it fits in with 17.11, so pin it here.
```
- SLUG info update · f227eec4
  Morris Jette authored May 30, 2018
  
- Update SLUG18 agenda · 36a518b4
  Morris Jette authored May 30, 2018
  
- Update SLUG18 info · 952fd271
  Morris Jette authored May 30, 2018
  
- Change pthread_cond_signal to slurm_cond_signal. · 657aba97
  Michael Hinton authored May 29, 2018
  
- Fix race in pmixp_agent_start(). · 7c1dad6e
  Tim Wickberg authored May 29, 2018
```
Caused by the pthread_cancel cleanup in commit e5f03971 in 17.11.6.

Bug 5181.
```
- Fix deadlock in slurmstepd during shutdown. · 21ca33f5
  Tim Wickberg authored May 16, 2018
```
The race condition was created in a7c8964e in 17.11.6 when removing
the (unsafe) pthread_cancel code handling thread termination.

Bug 5164
```
24 May, 2018 1 commit

- Notify srun and ctld when unkillable stepd exits · 956a808d
  Brian Christiansen authored May 16, 2018

Commits f18390e8 and eed76f85 modified the stepd so that it simply exits
when it encounters an unkillable step timeout. If the stepd is a batch
step, it replies to the controller with a non-zero exit code, which
drains the node. But if an srun allocation/step got into the unkillable
step code, the stepd did not let the waiting srun or controller know that
the step was going away, leaving a hanging srun and job.

This patch enables the stepd to notify the waiting sruns and the ctld that
the stepd is done, and drains the node for srun'ed allocations and/or
steps.

Bug 5164

21 May, 2018 1 commit
- _post_qos_list() modifies global variables · da1eb7c7
  Dominik Bartkiewicz authored May 21, 2018
```
It modifies g_qos_count and g_qos_max_priority, so it must be called under the QOS write lock.

Bug 5159.
```
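
The locking rule stated in da1eb7c7 can be illustrated with a minimal sketch. This is not Slurm's code: the Python globals only mirror the names in the commit message, `qos_write_lock`, `update_qos_list()` and `qos_summary()` are hypothetical, and a plain mutex stands in for the QOS write lock.

```python
import threading

# Hypothetical stand-ins for the globals named in the commit message.
g_qos_count = 0
g_qos_max_priority = 0

# Assumption: a plain mutex stands in for the QOS write lock.
qos_write_lock = threading.Lock()


def post_qos_list(qos_priorities):
    """Recompute the QOS globals. The caller must hold qos_write_lock."""
    global g_qos_count, g_qos_max_priority
    # locked() only tells us that some thread holds the lock, which is
    # enough to catch a caller that forgot to take it in this sketch.
    if not qos_write_lock.locked():
        raise RuntimeError("caller must hold the QOS write lock")
    g_qos_count = len(qos_priorities)
    g_qos_max_priority = max(qos_priorities, default=0)


def update_qos_list(qos_priorities):
    """Take the write lock, then refresh the globals."""
    with qos_write_lock:
        post_qos_list(qos_priorities)


def qos_summary():
    """Read both globals under the lock so they are seen consistently."""
    with qos_write_lock:
        return g_qos_count, g_qos_max_priority
```

Calling `post_qos_list()` directly, without the lock held, raises in this sketch, which is the kind of misuse the commit guards against.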
19 May, 2018 2 commits
- Fix warning message in test17.7 · c7ddc591
  Brian Christiansen authored May 18, 2018
```
Display correct path.
```
- Fix test17.7 to work with symlinked home dirs · c3b41366
  Bjørn-Helge Mevik authored May 18, 2018
```
Bug 5151
```
18 May, 2018 2 commits

- Update test3.4 for regular user · 9d82faa1
  Brian Christiansen authored May 18, 2018

Commits 4454316e and 76706b51 adjusted the priority update logic so
that when a non-authorized user modifies the priority, the change is only
temporary; in most cases the user will never see it.

Bug 5151

- Update limits on GRES docs · 1e1cd45e
  Marshall Garey authored May 18, 2018
```
Clarification of c2c06468.

Bug 5150
```

17 May, 2018 1 commit
- Have sprio display jobs before eligible time when · 8782db29
  Danny Auble authored May 17, 2018
```
PriorityFlags=ACCRUE_ALWAYS is set.

Bug 5186
```
16 May, 2018 2 commits
- docs - add a note to slurm.conf man page clarifying slurmdbd log rotation. · 74d8344b
  Alejandro Sanchez authored May 16, 2018
```
Bug 5174.
```
- docs - remove 'nocreate' option from slurm.conf man logrotate example. · 35b67364
  Dan Barke authored May 16, 2018
```
Since having 'nocreate' would override the following option:
create 640 slurm root

Bug 5174.
```
15 May, 2018 3 commits

- Make a test more robust · b1c2a6fb
  Morris Jette authored May 15, 2018

If ReturnToService=2 is configured, the test could generate an error
when changing the node state to resume after setting it to down. The
reason is that if the node communicates with slurmctld, its state is
automatically changed from down to idle, and resuming an idle
node triggers an error.

- Run autogen.sh after previous commit. · ac24b431
  Alejandro Sanchez authored May 15, 2018
```
Bug 5168.
```

- PMIx - override default paths at configure time if --with-pmix is used. · 635c0232
  Alejandro Sanchez authored May 15, 2018

Previously the default paths continued to be tested even when new ones
were requested. As a consequence, if any of the new paths was the same
as one of the default ones (i.e. /usr or /usr/local), the configure
script incorrectly errored out, reporting that a version of PMIx had
already been found in a previous path.

Bug 5168.
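
The path-selection change in 635c0232 can be sketched as follows. This is a toy model, not the configure macro itself: the function names are hypothetical, the two default paths come from the commit message, and the colon-separated form of the --with-pmix value and the "defaults first" ordering of the old behaviour are assumptions.

```python
# Default probe locations named in the commit message.
DEFAULT_PMIX_PATHS = ["/usr", "/usr/local"]


def _split(with_pmix):
    """Split a --with-pmix style value (assumed colon-separated)."""
    return [p for p in (with_pmix or "").split(":") if p]


def legacy_search_paths(with_pmix=None):
    """Old behaviour as described: defaults are always probed too."""
    return DEFAULT_PMIX_PATHS + _split(with_pmix)


def pmix_search_paths(with_pmix=None):
    """New behaviour as described: requested paths replace the defaults."""
    requested = _split(with_pmix)
    return requested if requested else list(DEFAULT_PMIX_PATHS)


def has_duplicates(paths):
    """A repeated path is what made the old probe report a clash."""
    return len(set(paths)) != len(paths)
```

With `--with-pmix=/usr`, the legacy list contains `/usr` twice, while the overriding list contains it once.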

11 May, 2018 2 commits
- Improve test error handling · 1dd0d6f7
  Morris Jette authored May 11, 2018
```
Gracefully fail if salloc does not get job allocation
```
- Fix Coverity CID 185566: Incorrect expression (COPY_PASTE_ERROR). · b3f2ebf1
  Alejandro Sanchez authored May 11, 2018
```
Introduced in bf4cb0b1.
```
10 May, 2018 2 commits

- Update and modernize dw_wlm_cli cray emulation script · dc7ca7be
  Morris Jette authored May 10, 2018

- Fix different issues when requesting memory per cpu/node. · bf4cb0b1
  Alejandro Sanchez authored May 10, 2018

The first issue was identified on multi-partition requests. job_limits_check()
was overriding the original memory requests, so the next partition that
Slurm validated limits against was not using the original values. The
solution consists of adding three members to the job_details struct to
preserve the original requests. This issue is reported in bug 4895.

The second issue was that memory enforcement behavior differed depending on
whether or not the job request was issued against a reservation.

The third issue had to do with the automatic adjustments Slurm made
underneath when the memory request exceeded the limit. These adjustments
included increasing pn_min_cpus (even incorrectly beyond the number of CPUs
available on the nodes) or different tricks increasing cpus_per_task and
decreasing mem_per_cpu.

The fourth issue was identified when requesting the special case of 0 memory,
which was handled inside the select plugin after the partition validations
and could thus be used to incorrectly bypass the limits.

Issues 2-4 were identified in bug 4976.

The patch also includes a full refactor of how and when job memory is
set to default values (if not requested initially) and how and when
limits are validated.

Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
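
The first issue in bf4cb0b1 (one partition's validation overwriting the request seen by the next) can be sketched in a few lines. This is an illustration under stated assumptions, not the actual patch: `orig_pn_min_memory` is a hypothetical name for one of the preserved members, and clamping to a partition maximum is a stand-in for whatever adjustment job_limits_check() really made.

```python
from dataclasses import dataclass


@dataclass
class JobDetails:
    pn_min_memory: int       # working value, may be rewritten during validation
    orig_pn_min_memory: int  # hypothetical name for the preserved original request


def adjust_for_partition(details, part_max_mem):
    """Toy adjustment: clamp the working value to the partition maximum."""
    details.pn_min_memory = min(details.pn_min_memory, part_max_mem)
    return details.pn_min_memory


def validate_legacy(details, part_limits):
    """Each partition sees whatever the previous one left behind."""
    return [adjust_for_partition(details, limit) for limit in part_limits]


def validate_preserving(details, part_limits):
    """Reset to the original request before checking each partition."""
    seen = []
    for limit in part_limits:
        details.pn_min_memory = details.orig_pn_min_memory
        seen.append(adjust_for_partition(details, limit))
    return seen
```

For a 3000 MB request checked against partitions with limits of 1000 MB and 4000 MB, the legacy path evaluates the second partition with 1000 MB, whereas the preserving path evaluates it with the original 3000 MB.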

09 May, 2018 15 commits
- Fix for possible slurmctld daemon abort with NULL pointer. · b67d7350
  Morris Jette authored May 09, 2018
```
If running without AccountingStorageEnforce but with the DBD, and the
DBD isn't up when slurmctld starts, you could get into a corner case
where there is no QOS list in the assoc_mgr, and thus no usage to
transfer.

Bug 5156
```
- Start NEWS for v17.11.7 · 8d8bc02b
  Tim Wickberg authored May 09, 2018
  
- Update META for v17.11.6. · 1d1a0a48
  Tim Wickberg authored May 09, 2018
```
Update slurm.spec and slurm.spec-legacy as well
```
- Fix clang error from de7eac9a. · a3267c00
  Tim Wickberg authored May 09, 2018
```
Clang warns about a possible null dereference of job_part_ptr
if the !job_ptr->priority_array part of the conditional is taken.

Remove that part of the conditional, as it doesn't matter if that is
set or not here. The job's eligibility on one vs. multiple partitions is
not determined by that, but by the status of part_ptr_list and part_ptr.

Bug 5136.
```
- Document which PMI plugin to use with different OMPI versions · 4db67922
  Morris Jette authored May 09, 2018
  
- Fix spelling · 3c5a7302
  Brian Christiansen authored May 09, 2018
  
- Fix typo in NEWS · e8892420
  Felip Moll authored May 09, 2018
  
- select/cons_res - Improve handling of --cores-per-socket. · 6de8c831
  Morris Jette authored May 09, 2018
```
Try to fill up each socket completely before moving into additional
sockets. This will minimize the number of sockets needed, improving
packing especially alongside MaxCPUsPerNode.

Bug 4995.
```
- Fix misplaced NEWS entry for "select/backfill - fix issue with job resizing". · 2507e149
  Tim Wickberg authored May 09, 2018
```
My mistake on commit 602817c8.

Bug 4922.
```
- select/backfill - fix issue with job resizing · 602817c8
  Felip Moll authored May 09, 2018
```
Without this, gang scheduling would incorrectly kick in for
these jobs since active_resmap has not been updated appropriately.

Bug 4922.
```
- Doc - remove stray reference to non-existent utility. · 5d0cd8cf
  Tim Wickberg authored May 09, 2018
```
Code for this was removed in 2012.

Bug 5126.
```
- Docs - clarify plugin order importance in pam_slurm_adopt.html. · becf0a73
  Marshall Garey authored May 08, 2018
```
Bug 5026.
```
- job_submit/lua - return an error if the script uses log.user() within job_modify. · 3f4cde9c
  Tim Wickberg authored May 08, 2018
```
Otherwise this will return the error message back to the next job submitter.

Bug 5106.
```
- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code · f4f42d0f
  Tim Wickberg authored May 08, 2018
```
Bug 5106.
```
- Docs - update checkpoint page with further elaboration on deprecated status. · 356329ef
  Tim Wickberg authored May 08, 2018
```
Link to CRIU as well.

Bug 4293.
```
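
The failure mode described in 3f4cde9c (a log.user() message from job_modify reaching the next job submitter) can be modelled with a shared message buffer. This is a sketch of the mechanism as the commit message describes it, not the plugin's code: the buffer `_user_msg`, the Python function names, and the -1 return value are all hypothetical.

```python
_user_msg = None  # hypothetical shared buffer filled by log_user()


def log_user(msg):
    """Queue a message intended for the user who made the request."""
    global _user_msg
    _user_msg = msg


def job_submit(script):
    """Run the submit hook, then hand any pending message to the submitter."""
    global _user_msg
    script()
    msg, _user_msg = _user_msg, None
    return msg


def job_modify_legacy(script):
    """Old behaviour as described: nothing consumes the buffer here."""
    script()
    return 0


def job_modify_checked(script):
    """Behaviour after the fix as described: reject and drop the message."""
    global _user_msg
    script()
    if _user_msg is not None:
        _user_msg = None
        return -1  # hypothetical error code
    return 0
```

In the legacy path the queued text survives until the next `job_submit()` call and is delivered to an unrelated submitter; the checked path returns an error and clears the buffer instead.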