Commits · a4a140922a792b817cdcdd3e1333c23682aa3cfa · Manuel G. Marciani / ces_slurm_simulator

09 May, 2018 27 commits
- Merge branch 'slurm-17.11' · a4a14092
  Tim Wickberg authored May 09, 2018
  
  a4a14092
- Fix regression from e346aa8a. · a97b1b8f
  Danny Auble authored May 09, 2018
```
It turns out if we are running cache we need to call set_cluster_tres()
again since the pointers in the list have changed and don't have any
counts anymore.

I opted to just modify the existing locks and call it inside the lock
instead of having set_cluster_tres grab the locks.
```
  a97b1b8f
- Start NEWS for v17.11.7 · 8d8bc02b
  Tim Wickberg authored May 09, 2018
  
  8d8bc02b
- Update META for v17.11.6. · 1d1a0a48
  Tim Wickberg authored May 09, 2018
```
Update slurm.spec and slurm.spec-legacy as well
```
  1d1a0a48
- Fix clang error from de7eac9a. · a3267c00
  Tim Wickberg authored May 09, 2018
```
Clang warns about a possible null dereference of job_part_ptr
if the !job_ptr->priority_array part of the conditional is taken.

Remove that part of the conditional, as it doesn't matter if that is
set or not here. The job's eligibility on one vs. multiple partitions is
not determined by that, but by the status of part_ptr_list and part_ptr.

Bug 5136.
```
  a3267c00
- Merge branch 'slurm-17.11' · 752f259b
  Morris Jette authored May 09, 2018
  
  752f259b
- Document which PMI plugin to use with different OMPI versions · 4db67922
  Morris Jette authored May 09, 2018
  
  4db67922
- Remove slurmdbd_auth_info since it hasn't been needed since f1311b56. · 1a249044
  Danny Auble authored May 09, 2018
  
  1a249044
- Revert commit 7dcc831b which is no longer needed with the · 62e3834c
  Danny Auble authored May 09, 2018
```
advent of persistent connections.
```
  62e3834c
- Fix spelling · 3c5a7302
  Brian Christiansen authored May 09, 2018
  
  3c5a7302
- Fix typo in NEWS · e8892420
  Felip Moll authored May 09, 2018
  
  e8892420
- select/cons_res - Improve handling of --cores-per-socket. · 6de8c831
  Morris Jette authored May 09, 2018
```
Try to fill up each socket completely before moving into additional
sockets. This will minimize the number of sockets needed, improving
packing especially alongside MaxCPUsPerNode.

Bug 4995.
```
  6de8c831
- Fix misplaced NEWS entry for "select/backfill - fix issue with job resizing". · 2507e149
  Tim Wickberg authored May 09, 2018
```
My mistake on commit 602817c8.

Bug 4922.
```
  2507e149
- select/backfill - fix issue with job resizing · 602817c8
  Felip Moll authored May 09, 2018
```
Without this, gang scheduling would incorrectly kick in for
these jobs since active_resmap has not been updated appropriately.

Bug 4922.
```
  602817c8
- Merge branch 'slurm-17.11' · 6a1591a8
  Tim Wickberg authored May 09, 2018
  
  6a1591a8
- Doc - remove stray reference to non-existent utility. · 5d0cd8cf
  Tim Wickberg authored May 09, 2018
```
Code for this was removed in 2012.

Bug 5126.
```
  5d0cd8cf
- Add support for "scontrol write batch_script -" to send script to stdout. · 158551a7
  Jessica Nettelblad authored Feb 18, 2018
```
Bug 4563
```
  158551a7
- Docs - clarify plugin order importance in pam_slurm_adopt.html. · becf0a73
  Marshall Garey authored May 08, 2018
```
Bug 5026.
```
  becf0a73
- job_submit/lua - return an error if the script uses log.user() within job_modify. · 3f4cde9c
  Tim Wickberg authored May 08, 2018
```
Otherwise this will return the error message back to the next job submitter.

Bug 5106.
```
  3f4cde9c
- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code · f4f42d0f
  Tim Wickberg authored May 08, 2018
```
Bug 5106.
```
  f4f42d0f
- Docs - update checkpoint page with further elaboration on deprecated status. · 356329ef
  Tim Wickberg authored May 08, 2018
```
Link to CRIU as well.

Bug 4293.
```
  356329ef
- Merge branch 'slurm-17.11' · 632a8523
  Tim Wickberg authored May 08, 2018
  
  632a8523
- sacctmgr - fix padding in help message. · ed56974c
  Tim Wickberg authored May 08, 2018
```
Related to fix from bug 4155.
```
  ed56974c
- Update documentation on sacctmgr WithRawQOSLevel option · cf90a75d
  Josh Samuelson authored Sep 14, 2017
```
Bug 4155.
```
  cf90a75d
- Remove unused slurm_valid_uid_gid function. · 7605f1ac
  Tim Wickberg authored May 08, 2018
```
Made obsolete by structural changes in 17.11.

Bug 4953.
```
  7605f1ac
- Prevent segfault on 'sprio' if a partition has recently been deleted. · de7eac9a
  Alejandro Sanchez authored May 08, 2018
```
job_ptr->part_ptr is NULL if the partition has been deleted.

Crash only happens with PriorityFlags=CALCULATE_RUNNING enabled.

Bug 5136.
```
  de7eac9a
- Fix comment to reflect current behavior. · d8d0e40c
  Tim Wickberg authored May 08, 2018
```
Partition is deleted immediately, not flagged.

Tangentially related to bug 5136.
```
  d8d0e40c
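
The sprio fix above (de7eac9a) comes down to checking `part_ptr` before dereferencing it, because the pointer is NULL once the partition has been deleted. The following is a minimal sketch of that guard, not the actual patch: `demo_part`, `demo_job`, and `demo_part_factor` are hypothetical stand-ins invented for illustration and do not match Slurm's real structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration; NOT Slurm's real definitions. */
struct demo_part {
	uint16_t priority_job_factor;
};

struct demo_job {
	uint32_t job_id;
	struct demo_part *part_ptr;	/* NULL once the partition is deleted */
};

/*
 * Return the partition priority factor for a job, or 0 when the job's
 * partition no longer exists. The NULL check runs before any dereference
 * of part_ptr, which is the only point this sketch is making.
 */
static uint16_t demo_part_factor(const struct demo_job *job)
{
	if (!job || !job->part_ptr)
		return 0;
	return job->part_ptr->priority_job_factor;
}
```

With this shape, a job whose partition was removed contributes a factor of 0 instead of crashing the caller. Whether 0 is the right fallback is a design choice of the sketch, not something the commit message states.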
08 May, 2018 13 commits

- Merge branch 'slurm-17.11' · 1f2639b1
  Tim Wickberg authored May 08, 2018
  
  1f2639b1
- Send protocol_version as the uint16_t it is. · 73397b83
  Tim Wickberg authored May 08, 2018
```
Bug 5133.
```
  73397b83
- Prevent slurmd from launching steps if prolog fails · 3b029021
  Brian Christiansen authored May 08, 2018
```
Bug 5146
```
  3b029021
- Merge branch 'slurm-17.11' · 42eb7720
  Tim Wickberg authored May 08, 2018
  
  42eb7720
- Fix issue with invalid protocol_version when using srun on ppc64. · 77d65f4f
  Tim Wickberg authored May 08, 2018
```
Caused by a corrupted protocol_version field value being received
by the slurmstepd, as we cannot safely write/read a uint16_t across
the pipe as if it was an int.

Regression caused by commit 90b116c2.

Bug 5133.
```
  77d65f4f
- Fix globals to work again. Looks like a bad merge was had in a5fbf83e. · 88e82996
  Danny Auble authored May 04, 2018
  
  88e82996
- Remove job's alloc_tres after being requeued · fa285d9e
  Brian Christiansen authored May 08, 2018
```
since it's not allocated anymore. Found while investigating Bugs
5137,4522.
```
  fa285d9e
- Fix sending a NULL pointer when stating a job from a bad uid. · b935690d
  Danny Auble authored May 08, 2018
```
Regression caused in commit fa3a8ff1.

Coverity issue 182984
```
  b935690d
- Fix checkpointing requeued jobs in a bad state · f9f395af
  Brian Christiansen authored May 08, 2018
```
Requeued jobs are marked as PENDING|COMPLETING until the epilog checks
in. The issue is that if job_set_alloc_tres gets called while in the
PENDING|COMPLETING state, the job's alloc_tres_str will be free'd. If
this job then gets checkpointed in this state (PENDING|COMPLETING + no
tres_alloc_str) on startup the controller would crash because it
expected the job to have a tres_alloc_str/cnt when in the COMPLETING
state. This could be triggered if starting the controller without the
dbd up. When the dbd comes up, the assoc_cache_mgr calls
_update_job_tres() which calls job_set_alloc_tres. It could also be
triggered by adding new tres.

This most likely started happening in 17.11.5 because of commit
865b672f which introduced calling _update_job_tres() on each job
after the dbd comes up.

Bugs 5137,4522
```
  f9f395af
- Add check for NULL pointer · 0daf370e
  Morris Jette authored May 08, 2018
```
Coverity CID 185507
```
  0daf370e
- Fix memory leak · 211d4efe
  Morris Jette authored May 08, 2018
```
Coverity CID 185506
```
  211d4efe
- Fix for xmalloc size · 3e4f81b6
  Morris Jette authored May 08, 2018
```
Coverity CID 185503
```
  3e4f81b6
- Fix memory leak · fa7650ed
  Morris Jette authored May 08, 2018
```
Coverity CID 185505
```
  fa7650ed
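
The protocol_version commits above (77d65f4f and 73397b83) describe a value corrupted because a `uint16_t` was moved across a pipe as if it were an `int`. The sketch below illustrates the general rule those messages point at: both ends of the pipe should transfer exactly `sizeof(uint16_t)` bytes. It does not reproduce Slurm's code. `demo_roundtrip_u16` is a hypothetical helper, and the value passed through it is arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Send a uint16_t through a pipe and read it back, using the size of the
 * 16-bit type on both ends. Returns 0 on success, -1 on any I/O error.
 * Hypothetical helper for illustration only.
 */
static int demo_roundtrip_u16(uint16_t sent, uint16_t *received)
{
	int fds[2];
	int rc = -1;

	if (pipe(fds) != 0)
		return -1;

	/* Writer and reader agree on the width: two bytes each way. */
	if (write(fds[1], &sent, sizeof(sent)) == (ssize_t) sizeof(sent) &&
	    read(fds[0], received, sizeof(*received)) ==
	    (ssize_t) sizeof(*received))
		rc = 0;

	close(fds[0]);
	close(fds[1]);
	return rc;
}
```

If one side used `sizeof(int)` instead, the transfer would touch bytes outside the 16-bit variable, and which bytes carry the value would depend on the machine's byte order. That is consistent with the failure being reported on ppc64, though the commit message does not spell out the exact mechanism.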