- 09 May, 2018 14 commits
-
-
Morris Jette authored
-
Brian Christiansen authored
-
Felip Moll authored
-
Morris Jette authored
Try to fill up each socket completely before moving into additional sockets. This will minimize the number of sockets needed, improving packing especially alongside MaxCPUsPerNode. Bug 4995.
-
Tim Wickberg authored
My mistake on commit 602817c8. Bug 4922.
-
Felip Moll authored
Without this, gang scheduling would incorrectly kick in for these jobs since active_resmap has not been updated appropriately. Bug 4922.
-
Tim Wickberg authored
Code for this was removed in 2012. Bug 5126.
-
Marshall Garey authored
Bug 5026.
-
Tim Wickberg authored
Otherwise this will return the error message back to the next job submitter. Bug 5106.
-
Tim Wickberg authored
Bug 5106.
-
Tim Wickberg authored
Link to CRIU as well. Bug 4293.
-
Tim Wickberg authored
Related to fix from bug 4155.
-
Josh Samuelson authored
Bug 4155.
-
Alejandro Sanchez authored
job_ptr->part_ptr is NULL if the partition has been deleted. Crash only happens with PriorityFlags=CALCULATE_RUNNING enabled. Bug 5136.
-
- 08 May, 2018 3 commits
-
-
Brian Christiansen authored
Bug 5146.
-
Tim Wickberg authored
Caused by a corrupted protocol_version field value being received by the slurmstepd, as we cannot safely write/read a uint16_t across the pipe as if it were an int. Regression caused by commit 90b116c2. Bug 5133.
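To illustrate the pipe issue (a minimal standalone sketch, not the actual slurmstepd code): a uint16_t must be written and read with exactly sizeof(uint16_t) bytes; treating it as an int on either side of the pipe moves extra bytes and corrupts the field.

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        uint16_t version_out = 8448;  /* stand-in for a protocol_version value */
        uint16_t version_in = 0;

        if (pipe(fds) < 0)
            return 1;

        /* Safe: move exactly sizeof(uint16_t) bytes in each direction.
         * Writing/reading sizeof(int) bytes through a uint16_t variable
         * instead would touch adjacent memory and corrupt the value. */
        if (write(fds[1], &version_out, sizeof(version_out)) != sizeof(version_out))
            return 1;
        if (read(fds[0], &version_in, sizeof(version_in)) != sizeof(version_in))
            return 1;

        printf("received protocol_version = %u\n", (unsigned) version_in);
        return 0;
    }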
-
Brian Christiansen authored
Requeued jobs are marked as PENDING|COMPLETING until the epilog checks in. The issue is that if job_set_alloc_tres gets called while in the PENDING|COMPLETING state, the job's tres_alloc_str will be freed. If the job then gets checkpointed in this state (PENDING|COMPLETING with no tres_alloc_str), on startup the controller would crash because it expected the job to have a tres_alloc_str/cnt when in the COMPLETING state. This could be triggered by starting the controller without the dbd up: when the dbd comes up, the assoc_cache_mgr calls _update_job_tres(), which calls job_set_alloc_tres. It could also be triggered by adding new tres. This most likely started happening in 17.11.5 because of commit 865b672f, which introduced calling _update_job_tres() on each job after the dbd comes up. Bugs 5137, 4522.
-
- 04 May, 2018 2 commits
-
-
Brian Christiansen authored
Only when the connection has timed out. If the connection is timing out, consider increasing TCPTimeout in slurm.conf. Bug 4574.
-
Danny Auble authored
-
- 03 May, 2018 6 commits
-
-
Boris Karasev authored
Bug 5129.
-
Alejandro Sanchez authored
Bug 5110.
-
Alejandro Sanchez authored
Bug 5110.
-
Tim Wickberg authored
Continuation of d0deea4f. Bug 4841.
-
Alejandro Sanchez authored
Use setenv() instead of setenvfs(), since setenvfs() memory allocation is implemented with xmalloc() and fini_setproctitle() (which is called on reconfigure) frees the memory with free(), leading to a "free(): invalid size" malloc_printerr error. Continuation of dce83a23. Bug 5095.
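As a hedged illustration of the allocator mismatch behind that error (the names below are hypothetical, not the Slurm internals): memory produced by a custom allocator must be released by its matching free routine, while plain setenv() keeps the environment entry under the C library's own management.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stand-in for a custom allocator in the spirit of xmalloc(): it stores a
     * header before the returned pointer, so handing that pointer straight to
     * free() produces errors such as "free(): invalid size". */
    static void *custom_alloc(size_t len)
    {
        size_t *p = malloc(sizeof(size_t) + len);
        if (!p)
            abort();
        *p = len;
        return p + 1;
    }

    static void custom_free(void *ptr)
    {
        free((size_t *) ptr - 1);
    }

    int main(void)
    {
        /* Memory from custom_alloc() must go back through custom_free();
         * passing it to free() is the cross-allocator mismatch described above. */
        char *entry = custom_alloc(32);
        strcpy(entry, "EXAMPLE_VAR=custom");
        custom_free(entry);

        /* setenv() lets the C library own the allocation, avoiding the
         * cross-allocator free entirely. */
        setenv("EXAMPLE_VAR", "libc", 1);
        printf("%s\n", getenv("EXAMPLE_VAR"));
        return 0;
    }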
-
Felip Moll authored
Due to the current design, job limits are checked before the allocation is made when one specifies a generic gres and a specific gres type is configured. The workaround for now is to define a job submit plugin to control the user request and successfully apply limits. Bug 4767.
-
- 02 May, 2018 6 commits
-
-
Dominik Bartkiewicz authored
Bug 4960.
-
Dominik Bartkiewicz authored
Bug 4887.
-
Tim Wickberg authored
Can lead to deadlock within malloc depending on the exact timing. Rework thread startup and shutdown code so pthread_cancel is not needed. Bugs 5119, 5103.
-
Tim Wickberg authored
happens. Bug 5108
-
Danny Auble authored
This reverts commit de5a4da2.
-
Danny Auble authored
happens. Bug 5108
-
- 01 May, 2018 2 commits
-
-
Danny Auble authored
Turns out the partition's billing tres was working off the sum of the node_ptrs, which contain the max of all partitions they are in. This isn't correct since each partition's billing can be different. Set it correctly here.
-
Tim Wickberg authored
No functional change.
-
- 30 Apr, 2018 7 commits
-
-
Tim Wickberg authored
The use in _watch_tasks needs to be removed: the switch to pthread_cond_signal from pthread_cancel means this will not get interrupted and would keep the step alive for at least a second, potentially harming throughput. Since the call to _poll_data() happens after the first timer expires, this delay turns out to be unnecessary, so we won't be replacing it with a pthread_cond_timedwait() construct. The use in jobacct_gather_stat_task() is unnecessary since the two locations where this can happen take place after _fork_all_tasks() has set up the tasks, so the delay should not be necessary. Bug 5103.
-
Tim Wickberg authored
These functions are not async-cancel-safe and cannot safely be cancelled. This leads to potential deadlock, either between our own locks or deep inside glibc when the thread held a malloc arena lock when cancelled. Instead, use pthread_cond_signal on the appropriate cond to wake threads up at the appropriate time. Bug 5103.
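A minimal sketch of that shutdown pattern (generic pthreads, not the Slurm code itself): the worker waits on a condition variable guarded by a shutdown flag and is woken with pthread_cond_signal, so it always exits at a point where it holds no unexpected locks.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static bool shutting_down = false;

    static void *worker(void *arg)
    {
        (void) arg;
        pthread_mutex_lock(&lock);
        while (!shutting_down) {
            struct timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            ts.tv_sec += 1;                 /* periodic work interval */
            pthread_cond_timedwait(&cond, &lock, &ts);
            if (shutting_down)
                break;
            /* ... periodic work would happen here ... */
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);

        /* Shutdown: set the flag and wake the thread, rather than cancelling
         * it while it might hold a malloc arena lock or one of our mutexes. */
        pthread_mutex_lock(&lock);
        shutting_down = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);

        pthread_join(tid, NULL);
        puts("worker exited cleanly");
        return 0;
    }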
-
Danny Auble authored
This will make it easier in a future commit to avoid the async pthread_cancel. Bug 5103.
-
Alejandro Sanchez authored
Bug 5110.
-
Marshall Garey authored
Remove partition MaxTime limit at the beginning of the test, run the rest of the test, then restore the partition configuration with scontrol reconfigure. Bug 4994.
-
Marshall Garey authored
Otherwise the extern step will disappear after 11.5 days. Bug 5000.
-
Dominik Bartkiewicz authored
to be sure it is created under the job write lock. Bug 4901.
-