- 29 Dec, 2014 2 commits
-
Danny Auble authored
-
David Bigagli authored
-
- 26 Dec, 2014 1 commit
-
Jason Bacon authored
-
- 24 Dec, 2014 3 commits
-
Morris Jette authored
Enable per-partition gang scheduling resource resolution (e.g. the partition can have SelectTypeParameters=CR_CORE, while the global value is CR_SOCKET). bug 1299
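A hedged slurm.conf sketch of the per-partition override described above (node and partition names are hypothetical):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Socket
    # This partition resolves resources at core granularity and gang schedules
    PartitionName=gang Nodes=n[1-8] SelectTypeParameters=CR_Core PreemptMode=GANG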
-
Morris Jette authored
Properly enforce partition Shared=YES option. Previously oversubscribing resources required gang scheduling to also be configured.
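A minimal hedged example of the option in question (names hypothetical); Shared=YES:4 allows up to four jobs to oversubscribe each resource, now without gang scheduling having to be configured:

    PartitionName=shared Nodes=n[1-8] Shared=YES:4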
-
Morris Jette authored
Prevent invalid job array task ID value if a task is started using gang scheduling (i.e. the task starts in a SUSPENDED state). Previously the task ID was set to NO_VAL and the task string was also cleared.
-
- 23 Dec, 2014 3 commits
-
Morris Jette authored
Prevent invalid job array task ID value if a task is started using gang scheduling (i.e. the task starts in a SUSPENDED state). Previously the task ID was set to NO_VAL and the task string was also cleared.
-
Morris Jette authored
Prevent a manually suspended job from being resumed by the gang scheduler once free resources become available. bug 1335
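For context, manual suspension is driven through scontrol (job id hypothetical); the fix keeps the gang scheduler from undoing the first command:

    scontrol suspend 1234
    scontrol resume 1234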
-
Dorian Krause authored
We have hit the following problem, which seems to be present in Slurm slurm-14-11-2-1 and previous versions. When a node is reserved and an overlapping maint reservation is created and later deleted, the scontrol output will report the node as IDLE rather than RESERVED:

+ scontrol show node node1
+ grep State
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ grep State
+ scontrol show nodes node1
State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Note that after the deletion of reservation "Y" the node is State=IDLE instead of State=RESERVED. I think that the delete_resv() function in slurmctld/reservation.c should call set_node_maint_mode(true) like update_resv() does. With the patch pasted at the end of this e-mail I get the following output, which matches my expectation:

+ scontrol show node node1
+ grep State
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Thanks, Dorian
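A minimal self-contained sketch of the proposed fix, with names taken from the message; the helper and the exact signatures are assumptions, since the actual patch is not reproduced here:

    /* Hypothetical sketch of the fix described above. The real functions
     * live in slurmctld/reservation.c; everything here is illustrative. */
    #include <stdbool.h>

    /* Stand-in for the slurmctld internal named in the message: recompute
     * each node's MAINT/RESERVED flags from the reservations that remain. */
    static void set_node_maint_mode(bool reset_all)
    {
    	(void) reset_all;
    }

    /* Hypothetical helper: remove the reservation's record. */
    static int remove_reservation_record(const char *resv_name)
    {
    	(void) resv_name;
    	return 0;
    }

    int delete_resv(const char *resv_name)
    {
    	int rc = remove_reservation_record(resv_name);

    	/* The proposed fix: refresh node state flags after deletion,
    	 * exactly as update_resv() already does, so a node still covered
    	 * by another reservation goes back to RESERVED, not IDLE. */
    	set_node_maint_mode(true);
    	return rc;
    }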
-
- 22 Dec, 2014 2 commits
-
Daniel Ahlin authored
Correct parsing of AccountingStoragePass when specified in the old format (just a path name)
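A hedged illustration of the old format being fixed here (the path, naming a MUNGE socket, is hypothetical):

    AccountingStoragePass=/var/run/munge/global.socket.2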
-
Rémi Palancher authored
On MPI job initialization through PMI, Intel MPI calls PMI_KVS_Put() many, many times from the task at rank 0, and each of these calls is followed by PMI_KVS_Commit(). Slurm's implementation of PMI_KVS_Commit() imposes a delay to avoid a DDoS against the originating srun. This delay is proportional to the total number of tasks and can reach about 3 seconds for large jobs, e.g. with 7168 tasks. Therefore, when Intel MPI calls PMI_KVS_Commit() 475 times from the task at rank 0 (measured on a test case), 28 minutes are spent in the delay function while all other tasks in the job wait on a PMI_Barrier. Since only this single task 0 is committing, there is no DDoS risk from it. The patch alters the delay calculation to make sure the task at rank 0 is never delayed; all other tasks are spread globally across the same time range as before.
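A self-contained sketch of the adjusted delay policy; the names and the exact formula are assumptions for illustration, not the actual Slurm patch:

    #include <stdio.h>

    #define MAX_DELAY_USEC 3000000	/* ~3 s window for very large jobs */

    /* Per-task commit delay in microseconds: rank 0 is never delayed,
     * since its serial commits cannot flood srun; the remaining tasks
     * are staggered across the same window as before. */
    static unsigned int commit_delay_usec(int rank, int ntasks)
    {
    	if (rank == 0)
    		return 0;
    	return (unsigned int)(((long long)rank * MAX_DELAY_USEC) / ntasks);
    }

    int main(void)
    {
    	printf("rank 0: %u us, rank 3584: %u us\n",
    	       commit_delay_usec(0, 7168), commit_delay_usec(3584, 7168));
    	return 0;
    }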
-
- 20 Dec, 2014 3 commits
-
Nathan Yee authored
-
Danny Auble authored
of Slurm daemons. The slurmstepd still needs to be fixed, which most likely can't happen until 15.08.
-
David Bigagli authored
-
- 19 Dec, 2014 4 commits
-
Danny Auble authored
of Slurm daemons.
-
David Bigagli authored
and SLURM_JOB_RESERVATION in the batch job.
-
Danny Auble authored
but then sets CPUs to only represent the number of cores on the node.
-
Danny Auble authored
-
- 17 Dec, 2014 2 commits
-
Brian Christiansen authored
Bug 1327
-
Danny Auble authored
doesn't request a number of tasks.
-
- 16 Dec, 2014 2 commits
-
Morris Jette authored
Fix job array hash table bug that could result in a slurmctld infinite loop or an invalid memory reference. bug 1309
-
David Bigagli authored
-
- 12 Dec, 2014 4 commits
-
Morris Jette authored
If a master job array record is complete, then consider all of its pending tasks complete as well. This problem happened when a master job array record was pending (i.e. had pending tasks) and was cancelled. The result previously was a job record that was not visible to squeue/scontrol but still occupied memory. The same type of problem occurred with a dependency on a job array that was cancelled.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 11 Dec, 2014 8 commits
-
Danny Auble authored
If a QOS was added for a job and then removed, and that QOS happened to have the largest QOS id, then restarting the slurmctld before the job was flushed out could mess things up.
-
David Bigagli authored
-
Morris Jette authored
Log how many nodes are removed from consideration for jobs due to advanced reservations. Change the user error message to indicate that required nodes might be down, drained or (this bit is new) reserved.
-
Morris Jette authored
In proctrack/linuxproc and proctrack/pgid, check the result of strtol() for an error condition rather than checking errno, which might hold a vestigial error code.
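A short sketch of the checking pattern implied here, assuming nothing about the actual plugin code: validate strtol() through its end pointer and a freshly cleared errno, rather than trusting whatever errno held before the call.

    #include <errno.h>
    #include <stdlib.h>

    /* Parse a PID from a string, returning -1 on error. */
    long parse_pid(const char *str)
    {
    	char *end = NULL;
    	long val;

    	errno = 0;			/* clear any vestigial error code */
    	val = strtol(str, &end, 10);
    	if ((end == str) || (*end != '\0'))
    		return -1;		/* no digits, or trailing junk */
    	if (errno == ERANGE)
    		return -1;		/* value out of range */
    	return val;
    }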
-
Danny Auble authored
correctly.
-
Danny Auble authored
accounting_storage/filetxt.
-
Morris Jette authored
The task_dist_states variable has been split into "flags" and "base" components. Added SLURM_DIST_PACK_NODES and SLURM_DIST_NO_PACK_NODES values to give users greater control over task distribution. The srun --dist option has been modified to accept "Pack" and "NoPack" options. These options can be used to override the CR_PACK_NODE configuration option.
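A hedged usage sketch of the new options (option spelling as given in this message; the program name is hypothetical):

    srun --ntasks=16 --dist=block,Pack ./a.out     # pack tasks onto as few nodes as possible
    srun --ntasks=16 --dist=block,NoPack ./a.out   # spread tasks across all allocated nodes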
-
Danny Auble authored
to have all the limits a QOS has. If a limit is set in both QOSes, the partition QOS will override the job's QOS unless the job's QOS has the "PartitionQOS" flag set.
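A hedged sketch of attaching a QOS to a partition in slurm.conf (names hypothetical; the QOS itself and its limits would be defined with sacctmgr):

    PartitionName=debug Nodes=n[1-4] QOS=part_debug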
-
- 09 Dec, 2014 2 commits
-
Morris Jette authored
-
Danny Auble authored
when running from cache.
-
- 08 Dec, 2014 4 commits
-
David Bigagli authored
-
Brian Christiansen authored
Bug 1305
-
Morris Jette authored
Fix bug with GRES having multiple types that could cause a slurmctld abort. This can be reproduced with select/cons_res and one GRES configured like this: Name=gpu Type=kepler File=/dev/tty0. A bad index was being used, which triggered an assert.
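A hedged gres.conf sketch of a multiple-type GRES setup like the one that triggers the bug (device paths hypothetical):

    Name=gpu Type=kepler File=/dev/nvidia0
    Name=gpu Type=tesla File=/dev/nvidia1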
-
David Bigagli authored
-