1. 29 Dec, 2014 2 commits
  2. 26 Dec, 2014 1 commit
  3. 24 Dec, 2014 3 commits
    • Morris Jette's avatar
      Enable per-partition gang sched resolution · 5e02af31
      Morris Jette authored
      Enable per-partition gang scheduling resource resolution (e.g. the partition
      can have SelectTypeParameters=CR_CORE, while the global value is CR_SOCKET).
      bug 1299
      5e02af31
    • Morris Jette's avatar
      Enforce partition shared option · f8fb79d5
      Morris Jette authored
      Properly enforce partition Shared=YES option. Previously oversubscribing
      resources required gang scheduling to also be configured.
      f8fb79d5
    • Morris Jette's avatar
      Fix bad job array task ID value · 46a2e9a1
      Morris Jette authored
      Prevent invalid job array task ID value if a task is started using gang
      scheduling (i.e. the task starts in a SUSPENDED state). The task ID gets
      set to NO_VAL and the task string is also cleared.
      46a2e9a1
  4. 23 Dec, 2014 3 commits
    • Morris Jette's avatar
      Fix bad job array task ID value · 48016f86
      Morris Jette authored
      Prevent invalid job array task ID value if a task is started using gang
      scheduling (i.e. the task starts in a SUSPENDED state). The task ID gets
      set to NO_VAL and the task string is also cleared.
      48016f86
    • Morris Jette's avatar
      Prevent gang resume of suspended job · 161d0336
      Morris Jette authored
      Prevent a job manually suspended from being resumed by gang scheduler once
      free resources are available.
      bug 1335
      161d0336
    • Dorian Krause's avatar
      set node state RESERVED on maint reservation delete · cf846644
      Dorian Krause authored
      we have hit the following problem that seems to be present in Slurm
      slurm-14-11-2-1 and previous versions. When a node is reserved and an
      overlapping maint reservation is created and later deleted the scontrol
      output will report the node as IDLE rather than RESERVED:
      
      + scontrol show node node1
      + grep State
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
      + scontrol create reservation starttime=now duration=120 user=usr01000
      nodes=node1 ReservationName=X
      Reservation created: X
      + sleep 10
      + scontrol show nodes node1
      + grep State
         State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
      + scontrol create reservation starttime=now duration=120 user=usr01000
      nodes=ALL flags=maint,ignore_jobs ReservationName=Y
      Reservation created: Y
      + sleep 10
      + grep State
      + scontrol show nodes node1
         State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
      + scontrol delete ReservationName=Y
      + sleep 10
      + scontrol show nodes node1
      + grep State
      *   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1*
      + scontrol delete ReservationName=X
      + sleep 10
      + scontrol show nodes node1
      + grep State
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
      
      Note that the after the deletion of reservation "X" the State=IDLE
      instead of State=RESERVED. I think that the delete_resv() function in
      slurmctld/reservation.c should call set_node_maint_mode(true) like
      update_resv() does. With the patch pasted at the end of this e-mail I
      get the following output which matches my expectation:
      
      + scontrol show node node1
      + grep State
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
      + scontrol create reservation starttime=now duration=120 user=usr01000
      nodes=node1 ReservationName=X
      Reservation created: X
      + sleep 10
      + scontrol show nodes node1
      + grep State
         State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
      + scontrol create reservation starttime=now duration=120 user=usr01000
      nodes=ALL flags=maint,ignore_jobs ReservationName=Y
      Reservation created: Y
      + sleep 10
      + scontrol show nodes node1
      + grep State
         State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
      + scontrol delete ReservationName=Y
      + sleep 10
      + scontrol show nodes node1
      + grep State
      *   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1*
      + scontrol delete ReservationName=X
      + sleep 10
      + scontrol show nodes node1
      + grep State
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
      
      Thanks,
      Dorian
      cf846644
  5. 22 Dec, 2014 2 commits
    • Daniel Ahlin's avatar
      Auth/munge - Correct AccountingStoragePass parsing · 2edef50d
      Daniel Ahlin authored
      Correct parsing of AccountingStoragePass when specified in old format
      (just a path name)
      2edef50d
    • Rémi Palancher's avatar
      avoid delay on commit for PMI task at rank 0 · fcc11e22
      Rémi Palancher authored
      Intel MPI, on MPI jobs initialisation through PMI, uses to call PMI_KVS_Put()
      many many times from task at rank 0, and each on these call is followed by
      PMI_KVS_Commit(). Slurm implementation of PMI_KVS_Commit() imposes a delay
      to avoid DDOS on original srun. This delay is proportional to the total number.
      It could be up to 3 secs for large jobs for ex. with 7168 tasks. Therefore,
      when Intel MPI calls PMI_KVS_Commit() 475 times (mesured on a test case) from
      task at rank 0, 28 minutes are spent in delay function.
      All other tasks in the job are waiting for a PMI_Barrier. Therefore, there is
      no risk for a DDOS from this single task 0. The patch alters the delaying time
      calculation to make sure task at rank 0 will does not be delayed. All other
      tasks are globally spreaded in the same time range as before.
      fcc11e22
  6. 20 Dec, 2014 3 commits
  7. 19 Dec, 2014 4 commits
  8. 17 Dec, 2014 2 commits
  9. 16 Dec, 2014 2 commits
  10. 12 Dec, 2014 4 commits
  11. 11 Dec, 2014 8 commits
  12. 09 Dec, 2014 2 commits
  13. 08 Dec, 2014 4 commits