- 22 Dec, 2014 1 commit
-
-
Rémi Palancher authored
During MPI job initialisation through PMI, Intel MPI calls PMI_KVS_Put() many times from the task at rank 0, and each of these calls is followed by PMI_KVS_Commit(). Slurm's implementation of PMI_KVS_Commit() imposes a delay to avoid a DDOS on the originating srun. This delay is proportional to the total number of tasks; it can be up to about 3 seconds for large jobs, for example with 7168 tasks. Therefore, when Intel MPI calls PMI_KVS_Commit() 475 times (measured on a test case) from the task at rank 0, 28 minutes are spent in the delay function. Meanwhile, all other tasks in the job are waiting in PMI_Barrier, so there is no risk of a DDOS from this single task at rank 0. The patch alters the delay calculation so that the task at rank 0 is not delayed. All other tasks are still spread over the same time range as before.
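The effect described above can be illustrated with a minimal sketch. It is not Slurm's code: the function names are hypothetical, and the 500 microsecond per-task spacing is an assumption chosen only because it roughly reproduces the figures quoted in the message.

```python
# Hypothetical sketch of a rank-based delay; not Slurm's actual implementation.
PER_TASK_USEC = 500  # assumed per-task spacing, not stated in the commit message


def commit_delay_usec(rank: int, ntasks: int,
                      per_task_usec: int = PER_TASK_USEC) -> int:
    """Return how long a task waits before contacting srun.

    Each rank gets its own slot inside a window of ntasks * per_task_usec,
    so the window grows with job size while rank 0 lands at offset zero.
    """
    if ntasks <= 0:
        raise ValueError("ntasks must be positive")
    if not 0 <= rank < ntasks:
        raise ValueError("rank must be in [0, ntasks)")
    return rank * per_task_usec


def worst_case_total_minutes(ncommits: int, ntasks: int,
                             per_task_usec: int = PER_TASK_USEC) -> float:
    """Total time lost if every commit waits for the full window."""
    window_sec = ntasks * per_task_usec / 1_000_000
    return ncommits * window_sec / 60
```

Under these assumptions the window for 7168 tasks is about 3.6 seconds, and 475 commits that each wait for the full window add up to roughly 28 minutes, while `commit_delay_usec(0, 7168)` is zero.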
-
- 20 Dec, 2014 1 commit
-
-
Danny Auble authored
of Slurm daemons. The slurmstepd still needs to be fixed, which most likely can't be fixed until 15.08.
-
- 19 Dec, 2014 3 commits
-
-
Danny Auble authored
of Slurm daemons.
-
Danny Auble authored
but then sets CPUs to only represent the number of cores on the node.
-
Danny Auble authored
-
- 17 Dec, 2014 2 commits
-
-
Brian Christiansen authored
Bug 1327
-
Danny Auble authored
doesn't request a number of tasks.
-
- 16 Dec, 2014 2 commits
-
-
Morris Jette authored
Fix job array hash table bug that could result in a slurmctld infinite loop or invalid memory reference. Bug 1309
-
David Bigagli authored
-
- 12 Dec, 2014 3 commits
-
-
Morris Jette authored
If a master job array record is complete, then consider all pending tasks as also complete. This problem happens when a master job array record is pending (has pending tasks) and is cancelled. The result previously was a job record not visible to squeue/scontrol, but occupying memory. The same type of problem happened with respect to a dependency on a job array which was cancelled.
-
Danny Auble authored
-
Danny Auble authored
-
- 11 Dec, 2014 6 commits
-
-
Danny Auble authored
If a QOS was added for the job and then removed, and it happened to have the largest QOS ID, then restarting the slurmctld before the job was flushed out could cause problems.
-
David Bigagli authored
-
Morris Jette authored
Log how many nodes are removed from consideration for jobs due to advanced reservations. Change the user error message to indicate that required nodes might be down, drained or (added this bit) reserved.
-
Morris Jette authored
In proctrack/linuxproc and proctrack/pgid, check the result of strtol() for an error condition rather than checking errno, which might hold a vestigial error code.
-
Danny Auble authored
correctly.
-
Danny Auble authored
accounting_storage/filetxt.
-
- 09 Dec, 2014 2 commits
-
-
Morris Jette authored
-
Danny Auble authored
when running from cache.
-
- 08 Dec, 2014 4 commits
-
-
Brian Christiansen authored
Bug 1305
-
Morris Jette authored
Fix bug with GRES having multiple types that can cause a slurmctld abort. This can be reproduced with select/cons_res and one GRES like this: "Name=gpu Type=kepler File=/dev/tty0". A bad index was being used, which triggered an assert.
-
Morris Jette authored
-
Artem Polyakov authored
Logic introduced in version 14.03.10 to support requeueing of jobs with GRES allocated to currently running steps broke select/linear due to differences in the plugin logic. The commit with the bad logic is 1209a664
-
- 05 Dec, 2014 4 commits
-
-
Brian Christiansen authored
Bug 1298
-
Brian Christiansen authored
-
Brian Christiansen authored
Bug 1301
-
Danny Auble authored
have no weight. This allows for association and QOS decay limits to work.
-
- 04 Dec, 2014 6 commits
-
-
David Bigagli authored
draining in sinfo output.
-
Brian Christiansen authored
-
Brian Christiansen authored
Prevent jobs from starting in overlapping reservations when they won't finish before a "maint" reservation begins. Bug 1290
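The condition this fix enforces can be sketched as a simple time comparison. This is a hedged illustration only: the function and parameter names are hypothetical, times are assumed to be epoch seconds, and the real scheduler logic is certainly more involved.

```python
# Hypothetical sketch; names and units are assumptions, not Slurm's API.
def can_start_before_maint(start: int, time_limit: int, maint_start: int) -> bool:
    """Return True if a job starting at `start` with the given time limit
    (both in seconds) is expected to end no later than the start of the
    maintenance reservation."""
    if time_limit < 0:
        raise ValueError("time_limit must be non-negative")
    expected_end = start + time_limit
    return expected_end <= maint_start
```

For example, a one-hour job starting at time 0 fits before a maintenance reservation that begins at 3600 seconds, but a job with a 3601 second limit does not.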
-
Morris Jette authored
Avoid huge malloc if GRES configured with "Type" and huge "Count".
-
Danny Auble authored
when the DBD is down.
-
Danny Auble authored
-
- 03 Dec, 2014 3 commits
-
-
Morris Jette authored
Log Cray MPI job calling exit() without mpi_fini(), but do not treat it as a fatal error. This partially reverts logic added in version 14.03.9. bug 1171
-
Brian Christiansen authored
Bug 1289
-
Danny Auble authored
could result in seg fault.
-
- 02 Dec, 2014 3 commits
-
-
Danny Auble authored
better.
-
Danny Auble authored
in BASIL was changed.
-
Brian Christiansen authored
-