- 27 Jul, 2016 4 commits
-
-
Morris Jette authored
Document that persistent burst buffers cannot be created or destroyed using the salloc or srun --bb options. Bug 2404
-
Brian Christiansen authored
Missed in b5bba34c
-
Danny Auble authored
on batch script completes.
-
Danny Auble authored
-
- 26 Jul, 2016 2 commits
-
-
Morris Jette authored
-
Danny Auble authored
(difference between start of job and when it was eligible).
-
- 25 Jul, 2016 2 commits
-
-
David Gloe authored
Bug 2939.
-
Danny Auble authored
-
- 23 Jul, 2016 1 commit
-
-
Morris Jette authored
-
- 22 Jul, 2016 4 commits
-
-
Dominik Bartkiewicz authored
Inadvertently broken in commit 05eac196. Bug 2912.
-
Danny Auble authored
or failed based on the signal that would always be killing it.
-
Danny Auble authored
end of the job to do it.
-
Danny Auble authored
make them using the master job ID instead of the normal job ID.
-
- 21 Jul, 2016 1 commit
-
-
Morris Jette authored
Treat invalid user ID in AllowUserBoot option of knl.conf file as error rather than fatal (log and do not exit).
-
- 20 Jul, 2016 3 commits
-
-
Morris Jette authored
Prevent slurmctld abort if a job is killed or requeued while waiting for reboot of its allocated compute nodes. The _wait_boot() function would reference job_ptr->node_bitmap, which would be NULL.
-
Boris Karasev authored
Bug 2908
-
Tim Wickberg authored
The step hasn't been assigned resources, so the select_jobinfo struct hasn't yet been populated. Calling select_g_step_finish will dereference it, causing a segfault. Bug 2922.
-
- 19 Jul, 2016 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
If the user is not allowed to use the partition, then do not check that user's group access again for 5 seconds. Bug 2913
-
Morris Jette authored
Improve partition AllowGroups caching. Update the table of UIDs permitted to use a partition based upon its AllowGroups configuration parameter as new valid UIDs are found, rather than looking up that user's group information for every job they submit, which can involve considerable overhead on some systems. Bug 2913
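
The caching idea described in the two AllowGroups commits above can be sketched roughly as follows. This is an illustrative Python model, not the slurmctld code: the class name `PartitionAccessCache`, the `lookup_groups` callback, and the injectable `clock` are all hypothetical, and the sketch assumes the 5-second hold-off applies to users who failed the group check.

```python
import time


class PartitionAccessCache:
    """Hypothetical model of per-partition AllowGroups caching."""

    def __init__(self, lookup_groups, allow_groups, retry_secs=5.0,
                 clock=time.monotonic):
        self._lookup_groups = lookup_groups    # expensive uid -> groups lookup
        self._allow_groups = set(allow_groups)
        self._retry_secs = retry_secs
        self._clock = clock
        self._allowed = set()                  # UIDs already known to be valid
        self._denied_at = {}                   # uid -> time of last failed check

    def is_allowed(self, uid):
        # Fast path: UID was validated earlier, no group lookup needed.
        if uid in self._allowed:
            return True
        now = self._clock()
        last_denied = self._denied_at.get(uid)
        # Recently rejected: skip the expensive lookup for a few seconds.
        if last_denied is not None and now - last_denied < self._retry_secs:
            return False
        # Slow path: look up the user's groups and update the tables.
        if self._allow_groups & set(self._lookup_groups(uid)):
            self._allowed.add(uid)
            self._denied_at.pop(uid, None)
            return True
        self._denied_at[uid] = now
        return False
```

In this sketch a permitted user costs one group lookup in total, while a rejected user costs at most one lookup per `retry_secs` window no matter how many jobs they submit.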
-
Morris Jette authored
Minimize preempted jobs for configurations with multiple jobs per node. Previous logic would preempt every job on a node allocated to the pending job. Bug 2906
-
Morris Jette authored
Fix for core selection with job --gres-flags=enforce-binding option. Previous logic would in some cases allocate a job zero cores, resulting in slurmctld abort. bug 2808
-
- 16 Jul, 2016 2 commits
-
-
Danny Auble authored
In commit b8190e5d many places that were meant to be pending step IDs were changed to be the extern step ID. The main problem was that when we came up with the idea of the extern step we reused -1 (INFINITE) for the ID, so pending steps also appeared to be extern steps. Hopefully this fixes the situation. Bug 2907
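
The ambiguity described above can be illustrated with a small model. This is only a sketch of the problem, not the change made in the commit: the `Step` type, the `is_extern` flag, and the constant name `SHARED_SENTINEL` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-in for -1 stored as an unsigned 32-bit value.
SHARED_SENTINEL = 0xFFFFFFFF


@dataclass
class Step:
    step_id: int
    is_extern: bool = False  # hypothetical extra field used to disambiguate


def label_by_id_only(step):
    # With one sentinel shared by two meanings, the ID alone cannot tell
    # a pending step from the extern step.
    if step.step_id == SHARED_SENTINEL:
        return "extern"
    return str(step.step_id)


def label_with_flag(step):
    # Consulting a second piece of state removes the ambiguity.
    if step.step_id == SHARED_SENTINEL:
        return "extern" if step.is_extern else "pending"
    return str(step.step_id)
```

Here `label_by_id_only` reports a pending step as "extern", which is the kind of mislabeling the commit message describes.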
-
Morris Jette authored
Start power save thread only after the partition information is read in order to avoid trying to interpret the SuspendExcParts configuration information before the partition information is available, which would result in a slurmctld abort.
-
- 15 Jul, 2016 2 commits
-
-
Jacek Budzowski authored
bug 2900
-
Danny Auble authored
Before it was showing it as TBD since pending steps and the extern step have the same stepid.
-
- 14 Jul, 2016 2 commits
-
-
Morris Jette authored
Fix gang scheduling and license release logic if a single-node job is killed on a bad node. Notifying the gang scheduler and releasing licenses is normally done when the epilog completion happens, but if the node(s) assigned to a job are all down, that does not happen. This results in the licenses being reserved indefinitely and the gang scheduler being left with a bad (old) job pointer that can result in various failure modes. Bug 2867
-
Danny Auble authored
anyway to attempt to log the backtraces of the potential unkillable processes.
-
- 13 Jul, 2016 1 commit
-
-
Danny Auble authored
processes.
-
- 12 Jul, 2016 6 commits
-
-
Nicolas Joly authored
Bug 2892.
-
Danny Auble authored
Bug 2874. We will most likely redo this logic (as it appears to be duplicated) in a following patch.
-
Morris Jette authored
Don't generate an error when a batch job is submitted that must wait for stage-in before starting.
-
Danny Auble authored
-
Danny Auble authored
Bug 2886
-
Jacek Budzowski authored
Was incorrectly translating the request to job.extern if it was part of a comma-separated list. Bug 2890.
-
- 11 Jul, 2016 1 commit
-
-
Danny Auble authored
(regression in 16.05.2). Related commit 5d3e5e1e. Bugs 2612 and 2886
-
- 08 Jul, 2016 4 commits
-
-
Morris Jette authored
Document limitations in burst buffer use by the salloc command (possible access problems from a login node). bug 2883
-
Janne Blomqvist authored
task/cgroup plugin is configured with ConstrainRAMSpace=yes, then set soft memory limit to allocated memory limit (previously no soft limit was set). bug 2679
-
Morris Jette authored
-
Danny Auble authored
of 0. This might be the cause of runaway jobs. I couldn't see how an end_time could be 0, but if it was, it would just exit and never set time_end to anything. At least if it happens now, we will know that it is possible and that this is the place it happens.
-