Commits · e5f03971b97ed2d0d0bd441da31e16647e6ac579 · Manuel G. Marciani / ces_slurm_simulator

02 May, 2018 4 commits
- Remove unsafe use of pthread_cancel in pmix plugin. · e5f03971
  Tim Wickberg authored May 02, 2018
```
Can lead to deadlock within malloc depending on the exact timing.

Rework thread startup and shutdown code so pthread_cancel is not
needed.

Bug 5119, 5103.
```
  e5f03971
- Don't tear down a BB if a node fails and --no-kill or resize of a job · f8b11ddc
  Tim Wickberg authored May 02, 2018
```
happens.

Bug 5108
```
  f8b11ddc
- Revert "Don't tear down a BB if a node fails and --no-kill or resize of a job" · 02bdc9a9
  Danny Auble authored May 02, 2018
```
This reverts commit de5a4da2.
```
  02bdc9a9
- Don't tear down a BB if a node fails and --no-kill or resize of a job · de5a4da2
  Danny Auble authored May 02, 2018
```
happens.

Bug 5108
```
  de5a4da2
01 May, 2018 2 commits

Fix total TRES Billing on partitions. · 3686dd9c

Danny Auble authored May 01, 2018

Turns out the partition's billing TRES was working off the sum of
the node_ptrs, which contain the max of all partitions they are in.

This isn't correct since each partition's billing can be different.

Set it correctly here.

3686dd9c

Fix style in _direct_init_sent_buf_cb(). · 9c5b3baa
Tim Wickberg authored Apr 30, 2018
```
No functional change.
```
9c5b3baa

30 Apr, 2018 7 commits

Remove _task_sleep() from slurm_jobacct_gather.c. · 3be9e1ee

Tim Wickberg authored Apr 30, 2018

The use in _watch_tasks() needs to be removed, as the switch from
pthread_cancel() to pthread_cond_signal() means the sleep will not get
interrupted and would keep the step alive for at least a second,
potentially harming throughput. Since the call to _poll_data() happens
after the first timer expires, this delay turns out to be unnecessary,
so we won't be replacing it with a pthread_cond_timedwait() construct.

The use in jobacct_gather_stat_task() is also unnecessary: the two
places where it can be called run after _fork_all_tasks() has set up
the tasks, so the delay should not be needed.

Bug 5103.

3be9e1ee

Remove unsafe use of pthread_cancel() in slurmstepd. · a7c8964e

Tim Wickberg authored Apr 30, 2018

These functions are not async-cancel-safe and cannot safely be cancelled.
This leads to potential deadlock, either between our own locks or deep
inside glibc when the thread held a malloc arena lock when cancelled.

Replace with pthread_cond_signal() on the appropriate condition variable
to wake threads up at the appropriate time instead.

Bug 5103.

a7c8964e

Make a global for each of the accounting gather profile timers. · 1675ada0
Danny Auble authored Apr 30, 2018
```
This will make it easier in a future commit to avoid the
async pthread_cancel.

Bug 5103
```
1675ada0
doc/html/faq - add missing braces to example. · fd9b143a
Alejandro Sanchez authored Apr 30, 2018
```
Bug 5110.
```
fd9b143a

Testsuite - fix issues with test1.103 when MaxTime is set on the partition. · 3138a98e

Marshall Garey authored Apr 30, 2018

Remove partition MaxTime limit at the beginning of the test,
run the rest of the test, then restore the partition configuration
with scontrol reconfigure.

Bug 4994.

3138a98e

Increase duration of extern step sleep command. · 140758ca
Marshall Garey authored Apr 30, 2018
```
Otherwise the extern step will disappear after 11.5 days.

Bug 5000.
```
140758ca
Moved creating gres_detail_str from _pack_job_gres · e2312e03
Dominik Bartkiewicz authored Apr 30, 2018
```
to be sure it is created under the job write lock.

Bug 4901
```
e2312e03

28 Apr, 2018 4 commits

Merge branch 'bug5053' into slurm-17.11 · 5ab8ab6f
Brian Christiansen authored Apr 27, 2018

5ab8ab6f

Set node->last_idle to 0 when in power_save state · 242c7406

Brian Christiansen authored Apr 23, 2018

In conjunction with the previous commit (recognizing nodes being powered
up out of band), set the node's last_idle to 0 when the node is in a
power_save state. A last_idle of 0 additionally means that the node
isn't booted.

Partially reverts da722a89. Checking for (last_idle > 0) when in
power_save state isn't necessary, because a node already in
power_save state won't be resumed unless
(node_ptr->last_idle > (now - SuspendTime)). And with the previous
change, the node's last_idle time will be set when the node registers.

242c7406
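
A small sketch of why the extra (last_idle > 0) test is redundant, assuming only the comparison quoted in the commit message. The helper name `idle_check_passes()` and its parameters are hypothetical and are not taken from the Slurm source.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: encodes only the comparison quoted in the
 * commit message, last_idle > (now - SuspendTime).
 */
static bool idle_check_passes(time_t last_idle, time_t now,
                              time_t suspend_time)
{
    return last_idle > (now - suspend_time);
}
```

With last_idle set to 0 and any realistic wall-clock value for `now`, the right-hand side is a large positive number, so the comparison is already false without a separate test for zero.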

Recognize cloud nodes that are booted out of band · ece205ed
Brian Christiansen authored Apr 23, 2018
```
Bug 5053
```
ece205ed

Force power_down of cloud node · 8ebf36eb

Brian Christiansen authored Apr 23, 2018

This allows the suspend script to be triggered even if Slurm has the
node(s) in a power_save state.

Bug 5053

8ebf36eb

27 Apr, 2018 1 commit
- run autogen.sh on new automake 1.15.1 · 894c65d1
  Danny Auble authored Apr 27, 2018
  
  894c65d1
26 Apr, 2018 2 commits
- Fix test for possible race condition · edac5cb3
  Morris Jette authored Apr 26, 2018
```
The test was failing solidly on a Cray with NHC configured
```
  edac5cb3
- Modify 2 tests that can not run with some configurations · 47a108a7
  Morris Jette authored Apr 26, 2018
```
Disable the tests as needed
```
  47a108a7
25 Apr, 2018 2 commits
- Turns out hwloc_obj_snprintf is deprecated and needed to be replaced · 6054a903
  Danny Auble authored Apr 25, 2018
```
by hwloc_obj_type_snprintf.

You will only see this if you have _DEBUG set to 1.
```
  6054a903
- Move function header to be over the correct function, no code change. · 720418cb
  Danny Auble authored Apr 25, 2018
  
  720418cb
24 Apr, 2018 3 commits
- Remove vestigial FAQ entry · 73555a11
  Morris Jette authored Apr 24, 2018
```
The included lightweight corefile description is no longer valid
and is misleading at best.
```
  73555a11
- Minor typo fixes · 037c8b40
  Christopher Bottoms authored Apr 23, 2018
  
  037c8b40
- Document more fully MPI plugin specification · 29a90698
  Isaac Hartung authored Apr 23, 2018
  
  29a90698
23 Apr, 2018 2 commits

Fix build when lz4 is in a non-standard location. · 2f81322e
Morris Jette authored Apr 23, 2018

2f81322e

Fix error code and scheduling problem for --exclusive=[user|mcs]. · 9cf41301

Morris Jette authored Apr 23, 2018

When any of these --exclusive modes couldn't be satisfied, Slurm was
returning an incorrect ESLURM_NODE_NOT_AVAIL, which caused the
scheduling problems described in the bug. The fix sets the error code
to ESLURM_NODES_BUSY and works over the correct share_node_bitmap,
which also resolves the scheduling problems.

Continuation of commits from bug 4932:
e2a14b8d
fc4e5ac9

Bug 5047.

9cf41301

19 Apr, 2018 4 commits
- Fix incorrect error thrown when cancelling part of a job array. · 8432f9f6
  Marshall Garey authored Apr 19, 2018
```
Fix an issue in the bit manipulation logic introduced in commit 892ffa89.

Bug 4997.
```
  8432f9f6
- Docs - fix descriptions of srun --kill-on-bad-exit flag. · cc2ec15f
  Isaac Hartung authored Apr 19, 2018
```
And related KillOnBadExit setting in slurm.conf.

These only affect an individual job step, not the entire job.

Bug 5023.
```
  cc2ec15f
- Fix 'squeue -o %s' on Cray systems. · d3398004
  Tim Wickberg authored Apr 19, 2018
```
Replace select_p_select_jobinfo_sprint() with the same NO-OP
that the other plugins (except alps and bluegene) have implemented.

Bug 5077.
```
  d3398004
- Add FAQ for sview coloring · 093ef9fe
  Isaac Hartung authored Apr 19, 2018
```
Bug 5049
```
  093ef9fe
17 Apr, 2018 3 commits

Clarify reduce_completing_frag configuration option · 831aaceb
Morris Jette authored Apr 17, 2018

831aaceb

Make UnavailableNodes value in job reason be correct for each job · fc4e5ac9

Morris Jette authored Apr 17, 2018

1. Identifies nodes which are unavailable to a specific job by
adding a call to filter_by_node_owner() in select_nodes(),
where the node list is generated.
2. Removes the "unavail_node_str" argument to select_nodes(),
as it is no longer useful. This string was originally generated
once at the start of the job scheduling logic for all jobs,
but since each job can have a different set of unavailable nodes
(dedicated to user, group, etc.), the same string for all jobs
can be misleading.

Bug 4932.

fc4e5ac9

Prevent wrongly returning ESLURM_NODE_NOT_AVAIL from _pick_best_nodes... · e2a14b8d
Dominik Bartkiewicz authored Apr 17, 2018
```
Prevent wrongly returning ESLURM_NODE_NOT_AVAIL from _pick_best_nodes when some jobs are using "--exclusive=user"

Bug 4932.
```
e2a14b8d

16 Apr, 2018 4 commits
- Add missed NEWS entry for e4b531c2. · ae350f36
  Tim Wickberg authored Apr 16, 2018
  
  ae350f36
- slurmctld: check UID in pack_job before hiding · e4b531c2
  Thomas HAMEL authored Apr 10, 2018
```
Improve performance of 'squeue -u' when PrivateData=jobs is
enabled by moving the UID filter code ahead of the more expensive
PrivateData=jobs checks.

Bug 5056.
```
  e4b531c2
- Prevent bit_ffs() from returning value out of bitmap range. · 202f432c
  Dominik Bartkiewicz authored Apr 16, 2018
```
See commit 0dabf4e7.
Bug 4932.
```
  202f432c
- Fix problem with job state_reason wrongly set as Reservation. · 217a0698
  Dominik Bartkiewicz authored Apr 16, 2018
```
Regression from ef1f3e73.
Bug 4885.
```
  217a0698
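
The reordering described in commit e4b531c2 above (a cheap UID comparison before the more expensive PrivateData=jobs check) can be illustrated with the sketch below. It is not the slurmctld implementation: `struct job`, `job_visible_to()`, `count_packed()` and the call counter are hypothetical stand-ins used only to show the effect of the ordering.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, minimal job record. */
struct job {
    uid_t user_id;
};

/* Counts how often the expensive check runs (for illustration only). */
static int expensive_checks = 0;

/* Hypothetical stand-in for the costly PrivateData=jobs visibility check. */
static bool job_visible_to(const struct job *job, uid_t requester)
{
    (void) job;
    (void) requester;
    expensive_checks++;
    return true;
}

/* Return how many jobs would be packed for the requester. */
static size_t count_packed(const struct job *jobs, size_t n,
                           bool filter_by_uid, uid_t filter_uid,
                           uid_t requester)
{
    size_t packed = 0;

    for (size_t i = 0; i < n; i++) {
        /* Cheap test first: skip jobs the UID filter excludes anyway. */
        if (filter_by_uid && jobs[i].user_id != filter_uid)
            continue;
        /* Expensive test only for jobs that survived the filter. */
        if (!job_visible_to(&jobs[i], requester))
            continue;
        packed++;
    }
    return packed;
}
```

With four jobs of which two belong to the filtered UID, the expensive check runs twice instead of four times; the saving grows with the share of jobs owned by other users.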
14 Apr, 2018 2 commits
- Docs - add a table of contents to accounting and programmer_guide · 5c14c118
  Michael Hinton authored Apr 13, 2018
  
  5c14c118
- Docs - assorted grammar and spelling fixes · 63f347df
  Michael Hinton authored Apr 13, 2018
  
  63f347df