- 21 Oct, 2015 2 commits
-
-
Morris Jette authored
sbatch --ntasks option to take precedence over --ntasks-per-node plus node count, as documented. Set SLURM_NTASKS/SLURM_NPROCS environment variables accordingly. bug 2015
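The precedence rule described in this commit can be sketched as follows. This is a minimal illustration, not Slurm's code: the function names `resolve_ntasks` and `task_env` are hypothetical, and the sketch assumes a task count is only derived from the per-node value when both `--ntasks-per-node` and a node count were supplied.

```python
def resolve_ntasks(ntasks=None, ntasks_per_node=None, nodes=None):
    """Return the task count to export, or None if none can be determined.

    Hypothetical helper: an explicit --ntasks wins over the product of
    --ntasks-per-node and the node count.
    """
    if ntasks is not None:
        return ntasks
    if ntasks_per_node is not None and nodes is not None:
        return ntasks_per_node * nodes
    return None


def task_env(ntasks):
    """Build the two environment variables named in the commit message."""
    return {"SLURM_NTASKS": str(ntasks), "SLURM_NPROCS": str(ntasks)}


if __name__ == "__main__":
    count = resolve_ntasks(ntasks=3, ntasks_per_node=4, nodes=2)
    print(task_env(count))
```

With `--ntasks=3`, `--ntasks-per-node=4` and two nodes, the explicit value 3 is used rather than 8.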
-
David Bigagli authored
-
- 20 Oct, 2015 3 commits
-
-
Morris Jette authored
Avoid reporting more allocated CPUs than exist on a node. This can be triggered by resuming a previously suspended job, resulting in oversubscription of CPUs. bug 2021
-
Danny Auble authored
-
Morris Jette authored
Add scancel -f/--full option to signal all steps including batch script and all of its child processes. bug 2031
-
- 19 Oct, 2015 5 commits
-
-
Brian Christiansen authored
Bug 1888
-
Danny Auble authored
out. Remove unneeded code that commit 8274ea54 fixed. This code would zero out all GRES/TRES on a reconfig, which isn't what we want. 8274ea54 does the right thing by itself.
-
Hongjia Cao authored
bug 2032
-
Morris Jette authored
Needed to change a couple of variables from 32- to 64-bit.
-
Morris Jette authored
Add new burst_buffer.conf parameters: ValidateTimeout and OtherTimeout. See man page for details.
-
- 16 Oct, 2015 1 commit
-
-
David Bigagli authored
-
- 15 Oct, 2015 1 commit
-
-
Danny Auble authored
previously take 2 restarts of the slurmdbd to make it stick correctly.
-
- 14 Oct, 2015 1 commit
-
-
Danny Auble authored
single-threaded cores. A regression caused only 1 socket to be used on this kind of node instead of all that were available.
-
- 08 Oct, 2015 2 commits
-
-
Brian Christiansen authored
Fix case where the backup slurmdbd would be killed if it had existing connections when it gave up control. When giving up control, the backup would try to signal the existing threads by using pthread_kill to send SIGKILL to them. The problem is that SIGKILL doesn't go to the thread but to the main process, so the backup dbd would be killed.
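The behavior behind this bug can be reproduced outside Slurm. The sketch below is an illustration under assumptions, not the slurmdbd code: it assumes a POSIX system and uses Python's `signal.pthread_kill` in a throwaway child interpreter, with the hypothetical names `CHILD` and `run_child`. The child directs SIGKILL at a single worker thread, yet the whole child process is terminated.

```python
import signal
import subprocess
import sys
import textwrap

# Program run in a separate interpreter so the demonstration cannot
# kill the caller. It starts one worker thread and targets it with SIGKILL.
CHILD = textwrap.dedent("""
    import signal, threading, time
    worker = threading.Thread(target=time.sleep, args=(30,), daemon=True)
    worker.start()
    # Aimed at one thread, but SIGKILL always acts on the whole process.
    signal.pthread_kill(worker.ident, signal.SIGKILL)
    time.sleep(30)  # not reached if the process is killed
""")


def run_child():
    """Run the child program and return its exit status."""
    proc = subprocess.run([sys.executable, "-c", CHILD], timeout=20)
    return proc.returncode


if __name__ == "__main__":
    # subprocess reports death by signal N as a return code of -N.
    print(run_child())
```

Signals that cannot be caught, such as SIGKILL, are therefore unsuitable for stopping individual threads; a catchable signal with a handler or a cooperative shutdown flag is the usual alternative.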
-
Danny Auble authored
when a cold-start (-c) happens to the slurmctld.
-
- 07 Oct, 2015 6 commits
-
-
Danny Auble authored
-
Danny Auble authored
from a user. This would cause the slurmctld to cache the old default which wasn't valid and cause the user to have to request the association always.
-
Morris Jette authored
bug 2013
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
database but the start record hadn't made it yet.
-
- 06 Oct, 2015 6 commits
-
-
Danny Auble authored
-
Danny Auble authored
requirements.
-
Axel Auweter authored
Add acct_gather_energy/ibmaem plugin for systems with IBM Systems Director Active Energy Manager.
-
Thomas Cadeau authored
bug 2011
-
Danny Auble authored
','.
-
Morris Jette authored
bug 1999
-
- 03 Oct, 2015 1 commit
-
-
Morris Jette authored
Don't requeue RPC going out from slurmctld to DOWN nodes (can generate repeating communication errors). bug 2002
-
- 02 Oct, 2015 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
This will only happen if a PING RPC for the node is already queued when the decision is made to power it down, then fails to get a response for the ping (since the node is already down). bug 1995
-
Morris Jette authored
If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU, then increase its allocated CPU count in order to enforce CPU limits. Previous logic would increase/set the cpus_per_task as needed if a job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT increase the min_cpus or max_cpus variable. This resulted in allocating the wrong CPU count.
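The arithmetic this commit describes can be sketched as follows. This is a simplified illustration and not the slurmctld implementation: the function `adjust_for_max_mem_per_cpu` is hypothetical, memory is assumed to be in MB, and the job is assumed to have a fixed task count so that the minimum CPU count is simply tasks times CPUs per task.

```python
def ceil_div(a, b):
    """Integer ceiling division without going through floats."""
    return -(-a // b)


def adjust_for_max_mem_per_cpu(ntasks, cpus_per_task, mem_per_cpu, max_mem_per_cpu):
    """Return (cpus_per_task, mem_per_cpu, min_cpus) after applying the limit.

    Hypothetical helper: when the requested memory per CPU exceeds the
    configured maximum, spread each task's memory over more CPUs and
    scale the job's CPU count to match.
    """
    if mem_per_cpu > max_mem_per_cpu:
        mem_per_task = mem_per_cpu * cpus_per_task
        cpus_per_task = ceil_div(mem_per_task, max_mem_per_cpu)
        mem_per_cpu = ceil_div(mem_per_task, cpus_per_task)
    # The step the earlier logic skipped: the CPU count must follow
    # the new cpus_per_task, otherwise too few CPUs are allocated.
    min_cpus = ntasks * cpus_per_task
    return cpus_per_task, mem_per_cpu, min_cpus


if __name__ == "__main__":
    print(adjust_for_max_mem_per_cpu(ntasks=4, cpus_per_task=1,
                                     mem_per_cpu=4096, max_mem_per_cpu=2048))
```

For four tasks requesting 4096 MB per CPU under a 2048 MB limit, each task needs two CPUs, so the job must be accounted as 8 CPUs rather than 4.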
-
Morris Jette authored
This will only happen if a PING RPC for the node is already queued when the decision is made to power it down, then fails to get a response for the ping (since the node is already down). bug 1995
-
- 01 Oct, 2015 2 commits
-
-
Danny Auble authored
values.
-
Morris Jette authored
This required a fairly major rewrite of the select plugin logic. bug 1975
-
- 30 Sep, 2015 3 commits
-
-
Morris Jette authored
Correct some cgroup paths ("step_batch" vs. "step_4294967294", "step_exter" vs. "step_extern", and "step_extern" vs. "step_4294967295").
-
Morris Jette authored
If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU, then increase its allocated CPU count in order to enforce CPU limits. Previous logic would increase/set the cpus_per_task as needed if a job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT increase the min_cpus or max_cpus variable. This resulted in allocating the wrong CPU count.
-
Morris Jette authored
Requeue/hold batch job launch request if the job is already running. This is possible if a node went to DOWN state, but jobs remained active. In addition, if a prolog/epilog failed, DRAIN the node rather than setting it DOWN, which could kill jobs that could continue to run. bug 1985
-
- 29 Sep, 2015 2 commits
-
-
Brian Christiansen authored
Bug 1938
-
Brian Christiansen authored
Bug 1984
-
- 28 Sep, 2015 1 commit
-
-
Morris Jette authored
When nodes have been allocated to a job and then released by the job while resizing, this patch prevents the nodes from continuing to appear allocated and unavailable to other jobs. Requires exclusive node allocation to trigger. This prevents the previously reported failure, but a proper fix will be quite complex and delayed to the next major release of Slurm (v 16.05). bug 1851
-