- 07 Oct, 2015 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
database but the start record hadn't made it yet.
-
- 06 Oct, 2015 3 commits
-
-
Thomas Cadeau authored
bug 2011
-
Danny Auble authored
','.
-
Morris Jette authored
bug 1999
-
- 03 Oct, 2015 1 commit
-
-
Morris Jette authored
Don't requeue RPCs going out from slurmctld to DOWN nodes (doing so can generate repeating communication errors). bug 2002
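A minimal sketch of the guard, using stand-in types (the real slurmctld code works on its node_record structures and the IS_NODE_DOWN macro):

    /* Sketch: drop, rather than requeue, RPCs aimed at DOWN nodes so the
     * agent does not loop forever on repeating communication errors. */
    #include <stdbool.h>

    enum node_state { NODE_UP, NODE_DOWN, NODE_DRAINED };

    static bool should_requeue_rpc(enum node_state target_state)
    {
        return target_state != NODE_DOWN;
    }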
-
- 02 Oct, 2015 1 commit
-
-
Morris Jette authored
This will only happen if a PING RPC for the node is already queued when the decision is made to power it down, and the ping then fails to get a response (since the node is already down). bug 1995
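A hedged sketch of the race and its guard (illustrative names, not the actual patch): the PING was queued before the power-down decision, so a missing reply must be checked against the node's power state before being treated as a failure.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        bool powering_down;   /* set when the power-down decision is made */
        const char *name;
    } node_t;

    static void handle_ping_timeout(node_t *node)
    {
        if (node->powering_down)
            return;   /* expected: the node was told to power down */
        fprintf(stderr, "error: node %s not responding\n", node->name);
    }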
-
- 30 Sep, 2015 2 commits
-
-
Morris Jette authored
If a job's CPUs/task ratio is increased due to the configured MaxMemPerCPU, then increase its allocated CPU count in order to enforce CPU limits. Previous logic would increase/set cpus_per_task as needed if a job's --mem-per-cpu was above the configured MaxMemPerCPU, but did NOT increase the min_cpus or max_cpus variables. This resulted in allocating the wrong CPU count.
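The arithmetic behind the fix, as a simplified standalone sketch (function and variable names are illustrative, not the slurmctld internals):

    #include <stdint.h>

    /* If a job requests more memory per CPU than MaxMemPerCPU allows,
     * scale up cpus_per_task AND the job's min/max CPU counts together. */
    void enforce_max_mem_per_cpu(uint64_t mem_per_cpu, uint64_t max_mem_per_cpu,
                                 uint32_t *cpus_per_task,
                                 uint32_t *min_cpus, uint32_t *max_cpus)
    {
        if (max_mem_per_cpu == 0 || mem_per_cpu <= max_mem_per_cpu)
            return;
        /* e.g. --mem-per-cpu=4096 with MaxMemPerCPU=1024 -> factor of 4 */
        uint32_t factor = (uint32_t)((mem_per_cpu + max_mem_per_cpu - 1) /
                                     max_mem_per_cpu);
        *cpus_per_task *= factor;
        /* the bug: these two were previously left unscaled */
        *min_cpus *= factor;
        *max_cpus *= factor;
    }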
-
Morris Jette authored
Requeue/hold a batch job launch request if the job is already running. This is possible if a node went to the DOWN state but its jobs remained active. In addition, if a prolog/epilog fails, DRAIN the node rather than setting it DOWN, which could kill jobs that might otherwise continue to run. bug 1985
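The launch-request decision, as a hedged standalone sketch (illustrative names; the real logic lives in slurmctld's batch launch path):

    #include <stdbool.h>

    typedef enum { LAUNCH, REQUEUE, HOLD } launch_decision_t;

    /* A second launch request for an already-running batch job must not
     * start a second copy: requeue it if the job allows, else hold it. */
    static launch_decision_t check_batch_launch(bool job_already_running,
                                                bool job_requeueable)
    {
        if (!job_already_running)
            return LAUNCH;
        return job_requeueable ? REQUEUE : HOLD;
    }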
-
- 29 Sep, 2015 2 commits
-
-
Brian Christiansen authored
Bug 1938
-
Brian Christiansen authored
Bug 1984
-
- 28 Sep, 2015 1 commit
-
-
Morris Jette authored
When nodes have been allocated to a job and then released by the job while resizing, this patch prevents the nodes from continuing to appear allocated and unavailable to other jobs. Exclusive node allocation is required to trigger the problem. This prevents the previously reported failure, but a proper fix will be quite complex and is delayed to the next major release of Slurm (v16.05). bug 1851
-
- 23 Sep, 2015 1 commit
-
-
Danny Auble authored
The 2 came from the nodelist being "None assigned", which would be treated as 2 hosts when sent into hostlist.
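Illustrative use of Slurm's hostlist API showing where the 2 comes from (a sketch; assumes building against Slurm's source tree for src/common/hostlist.h):

    #include <assert.h>
    #include "src/common/hostlist.h"   /* Slurm's hostlist API */

    int main(void)
    {
        /* "None assigned" is a placeholder, not a host expression, but
         * the embedded space makes hostlist parse it as two host names */
        hostlist_t hl = hostlist_create("None assigned");
        assert(hostlist_count(hl) == 2);   /* "None" and "assigned" */
        hostlist_destroy(hl);
        return 0;
    }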
-
- 22 Sep, 2015 1 commit
-
-
Danny Auble authored
Correct counting for job array limits; a job count limit underflow was possible upon cancellation of the master job record. bug 1952
-
- 21 Sep, 2015 1 commit
-
-
Nathan Yee authored
Only 1 job was accounted for (against MaxSubmitJobs) when an array was submitted.
-
- 17 Sep, 2015 1 commit
-
-
David Bigagli authored
-
- 11 Sep, 2015 2 commits
-
-
Morris Jette authored
This prevents a step from being launched if the job is killed while the prolog is running. Reproducing the original failure requires using srun to trigger the prolog and running scancel while that prolog is running. bug 1755
-
Brian Christiansen authored
And add missing documentation. Bug 1921
-
- 10 Sep, 2015 4 commits
-
-
Morris Jette authored
GRES were not being properly tracked for multiple simultaneous steps. A step which could have run later could be rejected as never being able to run. Replacement for commit dd842d79, which was reverted in commit 6f73812875c. bug 1925
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
and you use up all the GRES, instead of reporting that the configuration isn't available, hold the requesting step until the GRES becomes available.
-
- 09 Sep, 2015 1 commit
-
-
Morris Jette authored
Don't truncate task ID information in "squeue --array/-r" output. Task ID info in sview is also expanded to 64 characters (from ~16 characters).
-
- 08 Sep, 2015 1 commit
-
-
David Bigagli authored
-
- 02 Sep, 2015 2 commits
-
-
Morris Jette authored
Previous logic would set the avail_node_bitmap when a node was powered down, even if the initial state was DOWN or DRAINED. This made the node available for allocation to a job, which we don't want until the DOWN or DRAIN state is cleared. bug 1893
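The corrected condition, as a standalone sketch with stand-in booleans (Slurm itself tests the IS_NODE_DOWN/IS_NODE_DRAIN macros and manipulates avail_node_bitmap):

    #include <stdbool.h>

    /* A node finishing power-down becomes allocatable only if it was not
     * already DOWN or DRAINED; the old logic returned true unconditionally. */
    static bool node_available_after_power_down(bool was_down, bool was_drained)
    {
        return !was_down && !was_drained;
    }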
-
Morris Jette authored
This reverts commits 7660da9e, 5c386455, and f6c5302b.
-
- 01 Sep, 2015 4 commits
-
-
Brian Christiansen authored
Bug 1741
-
David Bigagli authored
-
Danny Auble authored
-
David Bigagli authored
-
- 28 Aug, 2015 1 commit
-
-
Morris Jette authored
This problem is reproducible by launching a job, then killing the slurmstepd process. Under those conditions, requeue the job if possible (i.e., a batch job with the requeue option/configuration). This patch also improves slurmctld logging when this happens. bug 1889
-
- 27 Aug, 2015 2 commits
-
-
Morris Jette authored
Correct RebootProgram logic when executed outside of a maintenance reservation. Previous logic would mark the node up upon response to the reboot RPC (from slurmctld to slurmd) and, when the node actually rebooted, flag that as an unexpected reboot. The new logic checks the node's up time so the compute node is not marked usable until the reboot actually takes place. bug 1866
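The up-time check reduces to a timestamp comparison; a hedged sketch with illustrative parameter names:

    #include <stdbool.h>
    #include <time.h>

    /* The node counts as rebooted only when its reported boot time
     * postdates the reboot request, not when it merely ACKs the RPC. */
    static bool reboot_complete(time_t node_boot_time, time_t reboot_request_time)
    {
        return node_boot_time > reboot_request_time;
    }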
-
Danny Auble authored
association manager.
-
- 26 Aug, 2015 3 commits
-
-
Morris Jette authored
Prevent a job array task ID from being reported as NO_VAL if the last task in the array gets requeued. The problem is that when that task starts, its task bitmap entry stays set, but the task counter gets decremented. If that job then gets requeued, under some conditions a failure to schedule it results in the array_task_id in the job record getting set to NO_VAL. Then, when building the job info reported by squeue/scontrol, the string showing the pending task IDs is not rebuilt because that counter is zero. All indications are that the job runs fine; only the information reported to squeue/scontrol is wrong. bug 1790
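The consistency rule behind the fix, sketched with illustrative names (the real code tracks a task bitmap and a counter in the job's array records):

    #include <stdbool.h>

    /* Decide from the bitmap itself whether the pending-task string needs
     * rebuilding; the cached counter can sit at zero while bits remain set. */
    static bool must_rebuild_task_id_str(int bitmap_set_count)
    {
        return bitmap_set_count > 0;
    }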
-
Danny Auble authored
-
Danny Auble authored
-
- 25 Aug, 2015 2 commits
-
-
Brian Christiansen authored
Bug 1873
-
Danny Auble authored
or binary.
-
- 21 Aug, 2015 2 commits
-
-
Brian Christiansen authored
Bug 1831
-
Brian Christiansen authored
Bug 1869
-