1. 04 Nov, 2014 4 commits
  2. 31 Oct, 2014 5 commits
  3. 30 Oct, 2014 2 commits
  4. 27 Oct, 2014 2 commits
  5. 24 Oct, 2014 3 commits
  6. 23 Oct, 2014 4 commits
  7. 22 Oct, 2014 1 commit
  8. 21 Oct, 2014 1 commit
    • Fix job gres info clear on slurmctld restart · 1209a664
      Morris Jette authored
      Fix a bug that prevented preservation of a job's GRES bitmap on slurmctld
      restart or reconfigure. The bug was introduced in 14.03.5 ("Clear record
      of a job's gres when requeued") and only applies when GRES are mapped to
      specific device files.
      bug 1192
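The fix above is about which per-job GRES state survives a restart. As a minimal sketch (plain Python with hypothetical names, not Slurm's actual C code), the point is that a requeue may clear in-memory GRES usage, but the saved bitmap of bound device files must still round-trip through the state file:

```python
# Toy model of the state flow the fix restores: a job's GRES bitmap must
# survive a slurmctld restart/reconfigure rather than being rebuilt empty.

class Job:
    def __init__(self, gres_bitmap):
        # e.g. indices of the specific device files bound to this job
        self.gres_bitmap = gres_bitmap

def save_state(job):
    # slurmctld periodically writes job state to disk
    return {"gres_bitmap": list(job.gres_bitmap)}

def restore_state(saved):
    # On restart the bitmap must be restored from the saved record; the
    # 14.03.5 regression dropped it when GRES mapped to specific files.
    return Job(gres_bitmap=saved["gres_bitmap"])

job = Job(gres_bitmap=[0, 2])          # job holds GRES devices 0 and 2
restored = restore_state(save_state(job))
assert restored.gres_bitmap == [0, 2]  # bitmap preserved across restart
```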
  9. 20 Oct, 2014 4 commits
  10. 18 Oct, 2014 1 commit
  11. 17 Oct, 2014 6 commits
  12. 16 Oct, 2014 5 commits
    • Brian Christiansen · e1c42895
    • Danny Auble
    • Remove vestigial variable · 463df8fd
      Morris Jette authored
    • Cray PMI refinements · eeb97050
      Morris Jette authored
      Refine commit 5f89223f based upon
      feedback from David Gloe:
      * It's not only MPI jobs, but anything that uses PMI. That includes MPI,
      shmem, etc, so you may want to reword the error message.
      * I added the terminated flag because if multiple tasks on a node exit,
      you would get an error message from each of them. That reduces it to one
      error message per node. Cray bug 810310 prompted that change.
      * Since we're now relying on --kill-on-bad-exit, I think we should update
      the Cray slurm.conf template to default to 1 (set KillOnBadExit=1 in
      contribs/cray/slurm.conf.template).
      bug 1171
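The third point above amounts to a one-line template change. A minimal fragment of what contribs/cray/slurm.conf.template would carry (the parameter name is standard slurm.conf; the comment is ours):

```
# Default Cray systems to killing the step on a bad task exit, since the
# PMI error handling now relies on srun's --kill-on-bad-exit behavior.
KillOnBadExit=1
```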
    • Change Cray mpi_fini failure logic · 5f89223f
      Morris Jette authored
      Treat a Cray MPI job calling exit() without mpi_fini() as a fatal error
      for that specific task and let srun handle all timeout logic.
      The previous logic would cancel the entire job step, and the srun
      options for wait time and kill on exit were ignored. The new logic
      gives users the following type of response:
      
      $ srun -n3 -K0 -N3 --wait=60 ./tmp
      Task:0 Cycle:1
      Task:2 Cycle:1
      Task:1 Cycle:1
      Task:0 Cycle:2
      Task:2 Cycle:2
      slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
      srun: error: tux2: task 1: Killed
      Task:0 Cycle:3
      Task:2 Cycle:3
      Task:0 Cycle:4
      ...
      
      bug 1171
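The behavioral change can be sketched in a few lines (plain Python with made-up function names, not the slurmstepd C code): the old path cancelled every task in the step, while the new path fails only the offending task and leaves the rest to srun's --wait/--kill-on-bad-exit handling:

```python
# Contrast of old vs. new handling when a Cray task exits without mpi_fini().

def handle_missing_mpi_fini_old(step_tasks, bad_task):
    # Old logic: cancel the entire job step; srun --wait / -K were ignored.
    return {t: "CANCELLED" for t in step_tasks}

def handle_missing_mpi_fini_new(step_tasks, bad_task):
    # New logic: only the offending task is killed; srun decides what to do
    # with the survivors based on --wait and --kill-on-bad-exit.
    return {t: ("KILLED" if t == bad_task else "RUNNING") for t in step_tasks}

old = handle_missing_mpi_fini_old({0, 1, 2}, bad_task=1)
new = handle_missing_mpi_fini_new({0, 1, 2}, bad_task=1)
assert all(state == "CANCELLED" for state in old.values())
assert new == {0: "RUNNING", 1: "KILLED", 2: "RUNNING"}
```

This matches the transcript above: task 1 is killed while tasks 0 and 2 keep cycling.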
  13. 15 Oct, 2014 2 commits
    • Avoid duplicate PowerUp of node on slurmctld start · d99cf552
      Morris Jette authored
      This fixes a race condition when slurmctld needed to power up a node
      shortly after startup. Previously it would execute the ResumeProgram
      twice for affected nodes.
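The shape of such a fix is a standard idempotency guard. A hedged sketch (hypothetical names, not the Slurm source) of launching ResumeProgram at most once per node:

```python
# Guard against launching the resume program twice for the same node when
# two code paths request power-up around slurmctld startup.

resume_calls = []

def resume_program(node):
    # Stand-in for fork/exec of the configured ResumeProgram.
    resume_calls.append(node)

powering_up = set()  # nodes for which ResumeProgram has already been launched

def power_up_node(node):
    if node in powering_up:   # power-up already in progress: skip duplicate
        return False
    powering_up.add(node)
    resume_program(node)
    return True

assert power_up_node("tux1") is True
assert power_up_node("tux1") is False   # second request is a no-op
assert resume_calls == ["tux1"]         # ResumeProgram ran exactly once
```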
    • if DOWN node set to PowerDown, clear NoResp flag · 13023913
      Morris Jette authored
      Without this change, a node in the cloud that failed to power up would
      not have its NoResponding flag cleared, which would prevent its later
      use. The NoResponding flag is now cleared when the node is manually
      modified to PowerDown.
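Node state in Slurm is a set of bit flags, so the fix boils down to clearing one bit when another is set. A sketch with illustrative constants (not Slurm's actual node_record values):

```python
# Clearing the not-responding bit when a DOWN cloud node is manually set
# to PowerDown, so the node can be used again after a failed power-up.

NODE_STATE_DOWN     = 0x1
NODE_NOT_RESPONDING = 0x2
NODE_POWER_DOWN     = 0x4

def set_power_down(state):
    state |= NODE_POWER_DOWN        # manual PowerDown requested
    state &= ~NODE_NOT_RESPONDING   # the fix: drop the stale NoResp flag
    return state

state = NODE_STATE_DOWN | NODE_NOT_RESPONDING  # node failed to power up
state = set_power_down(state)
assert state & NODE_POWER_DOWN
assert not (state & NODE_NOT_RESPONDING)       # node is usable again later
```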