- 30 Apr, 2018 9 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
The use in _watch_tasks needs to be removed, as the switch from pthread_cancel to pthread_cond_signal means it will not get interrupted and would keep the step alive for at least a second, potentially harming throughput. Since the call to _poll_data() happens after the first timer expires, this delay turns out to be unnecessary, so we won't be replacing it with a pthread_cond_timedwait() construct. The use in jobacct_gather_stat_task() is also unnecessary: the two locations where this can happen take place after _fork_all_tasks() has set up the tasks, so the delay should not be needed. Bug 5103.
-
Tim Wickberg authored
These functions are not async-cancel-safe, and cannot safely be cancelled. This leads to potential deadlock, either between our own locks or deep inside glibc if the thread held a malloc arena lock when it was cancelled. Replace with pthread_cond_signal on the appropriate cond to wake threads up at the appropriate time instead. Bug 5103.
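The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names `worker_ctl_t`, `worker` and `run_and_stop_worker` are hypothetical. The worker is woken through a condition variable and leaves on its own after releasing its lock, rather than being cancelled at an arbitrary point.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical control block; none of these names come from the Slurm source. */
typedef struct {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool shutdown;	/* set by the controlling thread to request exit */
} worker_ctl_t;

/* Worker thread: sleeps on the cond until asked to shut down. */
void *worker(void *arg)
{
	worker_ctl_t *ctl = arg;

	pthread_mutex_lock(&ctl->lock);
	/* Loop on the predicate to tolerate spurious wakeups. */
	while (!ctl->shutdown)
		pthread_cond_wait(&ctl->cond, &ctl->lock);
	pthread_mutex_unlock(&ctl->lock);

	/* The thread exits here holding no locks, unlike a cancelled thread. */
	return NULL;
}

/* Start the worker, request shutdown via the cond, and join it.
 * Returns 0 on success, -1 on failure. */
int run_and_stop_worker(worker_ctl_t *ctl)
{
	pthread_t tid;
	int rc = 0;

	pthread_mutex_init(&ctl->lock, NULL);
	pthread_cond_init(&ctl->cond, NULL);
	ctl->shutdown = false;

	if (pthread_create(&tid, NULL, worker, ctl) != 0)
		return -1;

	/* Instead of pthread_cancel(tid): set the flag and signal the cond. */
	pthread_mutex_lock(&ctl->lock);
	ctl->shutdown = true;
	pthread_cond_signal(&ctl->cond);
	pthread_mutex_unlock(&ctl->lock);

	if (pthread_join(tid, NULL) != 0)
		rc = -1;

	pthread_cond_destroy(&ctl->cond);
	pthread_mutex_destroy(&ctl->lock);
	return rc;
}
```

Because the flag is set under the mutex, the worker exits cleanly whether it was already waiting or had not yet reached `pthread_cond_wait()` when the shutdown was requested.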
-
Danny Auble authored
This will make it easier in a future commit to avoid the async pthread_cancel. Bug 5103.
-
Alejandro Sanchez authored
Bug 5110.
-
Danny Auble authored
# Conflicts: # src/slurmctld/job_mgr.c
-
Marshall Garey authored
Remove partition MaxTime limit at the beginning of the test, run the rest of the test, then restore the partition configuration with scontrol reconfigure. Bug 4994.
-
Marshall Garey authored
Otherwise the extern step will disappear after 11.5 days. Bug 5000.
-
Dominik Bartkiewicz authored
to be sure if it is created under job write lock. Bug 4901
-
- 28 Apr, 2018 5 commits
-
-
Tim Wickberg authored
-
Brian Christiansen authored
-
Brian Christiansen authored
In conjunction with the previous commit (recognizing nodes being powered up out of band), set the node's last_idle to 0 when the node is in a power_save state. A last_idle of 0 additionally means that the node isn't booted. Partially reverts da722a89. Checking for (last_idle > 0) when in power_save state isn't necessary, because if the node is already in power_save state it won't be resumed unless (node_ptr->last_idle > (now - SuspendTime)). And with the previous change, the node's last_idle time will be set when the node registers.
-
Brian Christiansen authored
Bug 5053
-
Brian Christiansen authored
This allows the suspend script to be triggered even if Slurm has the node(s) in a power_save state. Bug 5053
-
- 27 Apr, 2018 2 commits
-
-
Danny Auble authored
-
Tim Wickberg authored
-
- 26 Apr, 2018 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
The test was failing solidly on a Cray with NHC configured
-
Morris Jette authored
Disable the tests as needed
-
Tim Wickberg authored
-
Marshall Garey authored
Just in case reboot_program doesn't actually turn this node off for some reason, at least stopping slurmd explicitly will keep the node offline until someone intervenes. Bug 5019.
-
- 25 Apr, 2018 6 commits
-
-
Morris Jette authored
Add configuration parameter SlurmctldAddr for use with a virtual IP to manage backup slurmctld daemons. Bug 4768.
-
Danny Auble authored
by hwloc_obj_type_snprintf. You will only see this if you have _DEBUG set to 1.
-
Danny Auble authored
-
Morris Jette authored
Add configuration parameters SlurmctldPrimaryOnProg and SlurmctldPrimaryOffProg, which define programs to execute when a slurmctld daemon becomes the primary server or goes from primary to backup mode. Bug 4768.
-
Tim Wickberg authored
-
Isaac Hartung authored
Large results from this will cause scheduler performance problems, usually due to running inside a VM without the Linux vDSO module. Bug 4961.
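As a rough illustration of this kind of measurement, the sketch below times a batch of clock reads and reports the average cost per call. It assumes the call being measured is `gettimeofday()`, which the message above does not state explicitly. The function name `avg_gettimeofday_usec` is hypothetical and is not taken from the Slurm source.

```c
#include <stddef.h>
#include <sys/time.h>

/*
 * Return the average cost, in microseconds, of one gettimeofday() call,
 * measured over `iterations` calls. Returns -1.0 if iterations <= 0.
 * Hypothetical helper for illustration only.
 */
double avg_gettimeofday_usec(long iterations)
{
	struct timeval start, end, tmp;
	double elapsed_usec;
	long i;

	if (iterations <= 0)
		return -1.0;

	gettimeofday(&start, NULL);
	for (i = 0; i < iterations; i++)
		gettimeofday(&tmp, NULL);
	gettimeofday(&end, NULL);

	/* Convert the elapsed wall-clock interval to microseconds. */
	elapsed_usec = (double)(end.tv_sec - start.tv_sec) * 1e6 +
		       (double)(end.tv_usec - start.tv_usec);

	return elapsed_usec / (double)iterations;
}
```

When the clock read is served through the vDSO it stays in user space, so the average is normally a small fraction of a microsecond. A much larger value suggests each read is falling back to a real system call, which is the situation the message warns about.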
-
- 24 Apr, 2018 7 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
-
Morris Jette authored
The included lightweight corefile description is no longer valid, and misleading at best.
-
Morris Jette authored
-
Christopher Bottoms authored
-
Morris Jette authored
-
Isaac Hartung authored
-
- 23 Apr, 2018 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
Bug introduced in commit 11a75ff4
-
Morris Jette authored
When any of these --exclusive modes couldn't be satisfied, Slurm was returning an incorrect ESLURM_NODE_NOT_AVAIL, which caused the scheduling problems described in the bug. The fix sets the error code properly to ESLURM_NODES_BUSY, which also resolves the scheduling problems, and works over the correct share_node_bitmap. Continuation of commits from bug 4932: e2a14b8d fc4e5ac9. Bug 5047.
-
- 19 Apr, 2018 3 commits
-
-
Marshall Garey authored
Fix an issue in the bit manipulation logic introduced in commit 892ffa89. Bug 4997.
-
Isaac Hartung authored
And related KillOnBadExit setting in slurm.conf. These only affect an individual job step, not the entire job. Bug 5023.
-
Tim Wickberg authored
Replace select_p_select_jobinfo_sprint() with the same NO-OP that the other plugins (except alps and bluegene) have implemented. Bug 5077.
-