- 30 Jan, 2015 1 commit
-
-
David Bigagli authored
-
- 21 Jan, 2015 1 commit
-
-
Morris Jette authored
Squeue modified to not merge tasks of a job array if their wait reasons differ. bug 1388
-
- 07 Jan, 2015 3 commits
-
-
Aaron Knister authored
-
Rémi Palancher authored
During PMI initialisation of MPI jobs, Intel MPI calls PMI_KVS_Put() many times from the task at rank 0, and each of these calls is followed by PMI_KVS_Commit(). Slurm's implementation of PMI_KVS_Commit() imposes a delay to avoid a DDoS on the original srun. This delay is proportional to the total number of tasks. It can be up to 3 seconds for large jobs, e.g. with 7168 tasks. Therefore, when Intel MPI calls PMI_KVS_Commit() 475 times (measured on a test case) from the task at rank 0, 28 minutes are spent in the delay function. All other tasks in the job are waiting at a PMI_Barrier, so there is no risk of a DDoS from this single task 0. The patch alters the delay calculation so that the task at rank 0 is not delayed. All other tasks are spread across the same time range as before.
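The idea behind the patch can be sketched roughly as follows. This is only an illustration, not Slurm's actual code: the function name `commit_delay_usec` and the `usec_per_task` constant are hypothetical, chosen so that a 7168-task job has a maximum delay of just under 3 seconds, in line with the figure quoted above.

```python
def commit_delay_usec(rank, total_tasks, usec_per_task=400):
    """Hypothetical per-task delay before contacting srun.

    Assumption: the maximum delay grows linearly with the job size.
    Rank 0 gets no delay; the other ranks are spread evenly over the
    same overall time range.
    """
    if total_tasks <= 1:
        return 0
    max_delay = total_tasks * usec_per_task
    # rank 0 -> 0, highest rank -> max_delay, linear in between
    return rank * max_delay // (total_tasks - 1)


if __name__ == "__main__":
    tasks = 7168
    print("rank 0 delay (usec):", commit_delay_usec(0, tasks))
    print("last rank delay (usec):", commit_delay_usec(tasks - 1, tasks))
```

With a scheme like this, repeated commits from rank 0 cost no added delay, while the load from the remaining ranks on srun stays spread out.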
-
Aaron Knister authored
-
- 19 Dec, 2014 1 commit
-
-
Danny Auble authored
of Slurm daemons.
-
- 12 Dec, 2014 1 commit
-
-
Danny Auble authored
-
- 11 Dec, 2014 2 commits
-
-
Danny Auble authored
If a QOS was added for the job and then removed, and it happened to have the largest QOS ID, then restarting the slurmctld before the job was flushed out could cause problems.
-
Danny Auble authored
accounting_storage/filetxt.
-
- 08 Dec, 2014 1 commit
-
-
Artem Polyakov authored
Logic introduced in version 14.03.10 to support requeueing of jobs with GRES allocated to currently running steps broke select/linear due to differences in the plugin logic. The commit with the bad logic is 1209a664
-
- 05 Dec, 2014 1 commit
-
-
Brian Christiansen authored
Bug 1301
-
- 04 Dec, 2014 3 commits
-
-
Brian Christiansen authored
Prevent jobs that won't finish before a "maint" reservation begins from starting in overlapping reservations. Bug 1290
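The condition described here can be illustrated with a small check. This is a minimal sketch under assumptions, not the scheduler's real logic: the function name `can_start_before_maint` and its parameters are hypothetical, and the job's time limit is assumed to be the upper bound on its run time.

```python
from datetime import datetime, timedelta


def can_start_before_maint(now, time_limit, maint_start):
    """Return True if a job started at `now` would end by `maint_start`.

    Assumption: the job runs for at most `time_limit`.
    """
    return now + time_limit <= maint_start


if __name__ == "__main__":
    now = datetime(2014, 12, 4, 12, 0)
    maint = datetime(2014, 12, 4, 18, 0)
    print(can_start_before_maint(now, timedelta(hours=4), maint))
    print(can_start_before_maint(now, timedelta(hours=8), maint))
```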
-
Danny Auble authored
when the DBD is down.
-
Danny Auble authored
-
- 03 Dec, 2014 1 commit
-
-
Morris Jette authored
Log Cray MPI job calling exit() without mpi_fini(), but do not treat it as a fatal error. This partially reverts logic added in version 14.03.9. bug 1171
-
- 02 Dec, 2014 3 commits
-
-
Danny Auble authored
better.
-
Danny Auble authored
in BASIL was changed.
-
Brian Christiansen authored
-
- 24 Nov, 2014 1 commit
-
-
Artem Polyakov authored
Double max string that Slurm can pack from 16MB to 32MB to support larger MPI2 configurations.
-
- 21 Nov, 2014 2 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
This can happen if the specified job ID is not found.
-
- 13 Nov, 2014 2 commits
-
-
Brian Christiansen authored
Bug 1253
-
Brian Christiansen authored
Bug 1255
-
- 12 Nov, 2014 2 commits
-
-
Danny Auble authored
-
Morris Jette authored
Do not requeue a batch job from slurmd daemon if it is killed while in the process of being launched (a race condition introduced in v14.03.9). This partially reverts commit 2bc9bc29
-
- 10 Nov, 2014 1 commit
-
-
Danny Auble authored
with CR_PACK_NODES. Really do commit d388dd67 a different way to get the same info and be able to lay out tasks correctly when --hint=nomultithread.

Tests on a 4 core 8 thread system are

    srun -n6 --hint=nomultithread --exclusive whereami | sort -h
    srun: cpu count 6
    0 snowflake0 - MASK:0x1
    1 snowflake0 - MASK:0x2
    2 snowflake0 - MASK:0x4
    3 snowflake0 - MASK:0x8
    4 snowflake1 - MASK:0x1
    5 snowflake1 - MASK:0x2

and

    srun -n10 -N5 --hint=nomultithread --exclusive whereami | sort -h
    srun: cpu count 10
    0 snowflake0 - MASK:0x1
    1 snowflake0 - MASK:0x2
    2 snowflake0 - MASK:0x4
    3 snowflake0 - MASK:0x8
    4 snowflake1 - MASK:0x1
    5 snowflake1 - MASK:0x2
    6 snowflake1 - MASK:0x4
    7 snowflake2 - MASK:0x1
    8 snowflake3 - MASK:0x1
    9 snowflake4 - MASK:0x1
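The layouts in the two test runs are consistent with a simple packing rule: give every node one task, then fill the nodes in order up to one task per core. The sketch below reproduces the listed output under that assumption. It is not Slurm's implementation; `pack_layout` and its parameters are hypothetical names, and the node count for the first run (2) is inferred from the output rather than stated in the command.

```python
def pack_layout(num_tasks, num_nodes, cores_per_node):
    """Sketch of a packed task layout with one task per core.

    Assumption: each node first receives one task, then the remaining
    tasks are packed onto the nodes in order, at most one per core.
    """
    if num_tasks < num_nodes or num_tasks > num_nodes * cores_per_node:
        raise ValueError("task count does not fit the allocation")
    counts = [1] * num_nodes
    remaining = num_tasks - num_nodes
    for node in range(num_nodes):
        extra = min(remaining, cores_per_node - 1)
        counts[node] += extra
        remaining -= extra
    lines = []
    task_id = 0
    for node, count in enumerate(counts):
        for core in range(count):
            # One bit per core: the first core is 0x1, the second 0x2, ...
            lines.append(f"{task_id} snowflake{node} - MASK:{hex(1 << core)}")
            task_id += 1
    return lines


if __name__ == "__main__":
    print("\n".join(pack_layout(6, 2, 4)))
    print()
    print("\n".join(pack_layout(10, 5, 4)))
```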
-
- 07 Nov, 2014 2 commits
-
-
David Bigagli authored
a maintenance reservation that is not active yet.
-
Danny Auble authored
word "partition". Reference bug 1246
-
- 06 Nov, 2014 4 commits
-
-
Danny Auble authored
is requested. This is a re-factor of commit e5635a76 related to bug 1148 to handle the cases where a job could run, but an error was given when selecting the nodes.
-
Danny Auble authored
-
Danny Auble authored
lock was locked outside of the function or not. This also fixes a race condition when adding a QOS and planning on using it right away when the controller is busy with previous requests.
-
Danny Auble authored
PerCPU. Previously it did not take into account whether the user was requesting per-node memory or whether the job was told it needed to use less than the node allowed.
-
- 05 Nov, 2014 1 commit
-
-
Danny Auble authored
-
- 04 Nov, 2014 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
This was an unrealized regression from commit 0da01963. The problem is we were clearing the job_ptr->job_resrcs too early. This patch fixes it to wait until the job is actually being requeued so it does the right thing.
-
- 31 Oct, 2014 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
This isn't a big issue for 14.03, but 14.11 added more to this string, which could overflow the buffer since sprintf is used instead of snprintf. Using xstrfmtcat fixes the issue and makes the code easier to read.
-
Danny Auble authored
-
Danny Auble authored
number of tasks / number of nodes.
-
- 30 Oct, 2014 1 commit
-
-
David Bigagli authored
-