- 16 Jul, 2016 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
Start power save thread only after the partition information is read in order to avoid trying to interpret the SuspendExcParts configuration information before the partition information is available, which would result in a slurmctld abort.
-
Morris Jette authored
Do not try to access the part_list variable (partition list pointer) if it is not yet initialized. Return a NULL pointer rather than aborting on a NULL pointer.
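The guard described above can be illustrated with a minimal sketch. This is not the actual slurmctld code: the struct, the `part_list_sketch` variable, and `find_part_sketch()` are hypothetical stand-ins, and the real partition list is assumed here to behave like a simple linked list only for the sake of the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a partition record; not Slurm's real struct. */
struct part_sketch {
    const char *name;
    struct part_sketch *next;
};

/* Stand-in for the partition list pointer: NULL until the
 * partition configuration has been read (assumption). */
struct part_sketch *part_list_sketch = NULL;

/* Look up a partition by name. Returns NULL when the list has not
 * been initialized yet or when no entry matches, instead of
 * dereferencing a NULL list pointer. */
struct part_sketch *find_part_sketch(const char *name)
{
    struct part_sketch *p;

    if (part_list_sketch == NULL || name == NULL)
        return NULL;

    for (p = part_list_sketch; p != NULL; p = p->next) {
        if (strcmp(p->name, name) == 0)
            return p;
    }
    return NULL;
}
```

A caller that runs before the partition information is loaded simply sees "not found" and can retry later, which is the same effect the startup reordering in the previous commit aims for.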
-
- 15 Jul, 2016 13 commits
-
-
Tim Wickberg authored
-
Jacek Budzowski authored
bug 2900
-
Nicolas Joly authored
-
Nicolas Joly authored
-
Nicolas Joly authored
-
Morris Jette authored
-
Danny Auble authored
delete_step_records which would delete the steps without the killing flag set.
-
Danny Auble authored
This treats the extern step like a normal step on exit. The original code does not appear to be needed anymore, and this simplifies the code. The select_cray change is relevant since the add is needed only when killing the step, as that is the only place _internal_step_complete isn't used.
-
Danny Auble authored
functions.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Before it was showing it as TBD since pending steps and the extern step have the same stepid.
-
Danny Auble authored
This sets the state earlier to match a normal set. Remove the unneeded _send_pending_exit_msgs. There is only one task and we have the message for it, so don't worry about that one. Most important, wait for the other slurmstepds to send their messages, otherwise they could be lost on the other end.
-
- 14 Jul, 2016 8 commits
-
-
Morris Jette authored
-
Morris Jette authored
Fix gang scheduling and license release logic if a single-node job is killed on a bad node. Notifying gang and releasing licenses is normally done when the epilog completion happens, but if the node(s) assigned to a job are all down, that does not happen. This results in the licenses being reserved indefinitely and the gang scheduler being left with a bad (old) job pointer that can result in various failure modes. Bug 2867.
-
Morris Jette authored
-
Morris Jette authored
Add hotels. Other minor changes.
-
Danny Auble authored
-
Danny Auble authored
anyway to attempt to log the backtraces of the potential unkillable processes.
-
Danny Auble authored
667f1105.
-
Danny Auble authored
-
- 13 Jul, 2016 3 commits
-
-
Danny Auble authored
We have decided to go back to the way 15.08 called NHC instead of calling it first before sending a SIGKILL to the step's tasks. With this patch we only start the NHC early when we have to resend the SIGKILL for unkillable processes. This will hopefully get us the backtrace of the unkillable processes, which was the reason we did it this way in the first place :).
-
Danny Auble authored
processes.
-
Morris Jette authored
-
- 12 Jul, 2016 7 commits
-
-
Nicolas Joly authored
Bug 2892.
-
Danny Auble authored
Bug 2874. We will most likely redo this logic (as it appears to be duplicated) in a following patch.
-
Morris Jette authored
Don't generate an error when a batch job is submitted that must wait for stage-in before starting.
-
Danny Auble authored
-
Danny Auble authored
Bug 2886
-
Tim Wickberg authored
Conflicts: src/sstat/options.c
-
Jacek Budzowski authored
Was incorrectly translating request to job.extern if part of a comma-separated list. Bug 2890.
-
- 11 Jul, 2016 1 commit
-
-
Danny Auble authored
(regression in 16.05.2). Related commit 5d3e5e1e. Bugs 2612 and 2886.
-
- 08 Jul, 2016 5 commits
-
-
Danny Auble authored
'-' without a '\' in front of it.
-
Danny Auble authored
-
Morris Jette authored
Document limitations in burst buffer use by the salloc command (possible access problems from a login node). bug 2883
-
Janne Blomqvist authored
task/cgroup plugin is configured with ConstrainRAMSpace=yes, then set soft memory limit to allocated memory limit (previously no soft limit was set). bug 2679
-
Morris Jette authored
-