1. 07 Oct, 2015 2 commits
  2. 06 Oct, 2015 3 commits
  3. 03 Oct, 2015 1 commit
  4. 02 Oct, 2015 1 commit
    • Don't mark powered down node as not responding · 8c03a8bc
      Morris Jette authored
      This can only happen if a PING RPC for the node is already queued
        when the decision is made to power it down, and the ping then
        gets no response (since the node is already down).
      bug 1995
  5. 30 Sep, 2015 2 commits
    • Reset job CPU count if CPUs/task ratio increased for mem limit · 836912bf
      Morris Jette authored
      If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU,
      then increase its allocated CPU count in order to enforce CPU limits.
      Previous logic would increase/set the cpus_per_task as needed if a
      job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT
      increase the min_cpus or max_cpus variables. This resulted in allocating
      the wrong CPU count.
    • Don't start duplicate batch job · c1513956
      Morris Jette authored
      Requeue/hold batch job launch request if the job is already running.
        This is possible if the node went to DOWN state, but jobs remained
        active.
      In addition, if a prolog/epilog failed, DRAIN the node rather than
        setting it DOWN, which could kill jobs that could otherwise
        continue to run.
      bug 1985
  6. 29 Sep, 2015 2 commits
  7. 28 Sep, 2015 1 commit
    • Fix for node state when shrinking jobs · 6c9d4540
      Morris Jette authored
      When nodes have been allocated to a job and then released by the
        job while resizing, this patch prevents the nodes from continuing
        to appear allocated and unavailable to other jobs. Requires
        exclusive node allocation to trigger. This prevents the previously
        reported failure, but a proper fix will be quite complex and is
        deferred to the next major release of Slurm (v16.05).
      bug 1851
  8. 23 Sep, 2015 1 commit
  9. 22 Sep, 2015 1 commit
  10. 21 Sep, 2015 1 commit
  11. 17 Sep, 2015 1 commit
  12. 11 Sep, 2015 2 commits
  13. 10 Sep, 2015 4 commits
  14. 09 Sep, 2015 1 commit
  15. 08 Sep, 2015 1 commit
  16. 02 Sep, 2015 2 commits
  17. 01 Sep, 2015 4 commits
  18. 28 Aug, 2015 1 commit
    • Requeue job if possible when slurmstepd aborts · d8e6f55d
      Morris Jette authored
      This problem is reproducible by launching a job then killing the
        slurmstepd process. Under those conditions, requeue the job if
        possible (i.e. batch job with requeue option/configuration).
        This patch also improves the slurmctld logging when this happens.
      bug 1889
  19. 27 Aug, 2015 2 commits
    • Correct RebootProgram usage · 82068b6b
      Morris Jette authored
      Correct RebootProgram logic when executed outside of a maintenance
        reservation. Previous logic would mark the node up upon response
        to the reboot RPC (from slurmctld to slurmd) and, when the node
        actually rebooted, flag that as an unexpected reboot. The new
        logic checks the node's uptime and does not mark the compute node
        as usable until the reboot has actually taken place.
      bug 1866
    • Fix some potential deadlock issues when state files don't exist in the association manager · f6bc60cc
      Danny Auble authored
  20. 26 Aug, 2015 3 commits
  21. 25 Aug, 2015 2 commits
  22. 21 Aug, 2015 2 commits
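
The arithmetic behind commit 836912bf (30 Sep, 2015) can be illustrated with a small sketch. This is not Slurm's code: the function name, its parameters, and the way memory per CPU is rescaled are assumptions made for illustration. The commit message itself only says that cpus_per_task is raised when --mem-per-cpu exceeds MaxMemPerCPU and that the job's CPU count must be raised to match.

```python
def adjust_for_max_mem_per_cpu(mem_per_cpu, max_mem_per_cpu,
                               cpus_per_task, num_tasks):
    """Hypothetical helper, not part of Slurm.

    If the requested memory per CPU exceeds the configured limit,
    scale up CPUs per task and recompute the job's CPU count from
    the new value, which is the step the commit says was missing.
    """
    if mem_per_cpu > max_mem_per_cpu:
        # Integer ceiling of mem_per_cpu / max_mem_per_cpu.
        ratio = (mem_per_cpu + max_mem_per_cpu - 1) // max_mem_per_cpu
        cpus_per_task *= ratio
        # Assumption: spread the requested memory over the extra CPUs.
        mem_per_cpu = (mem_per_cpu + ratio - 1) // ratio
    # Derive the CPU count from the (possibly increased) cpus_per_task.
    min_cpus = num_tasks * cpus_per_task
    return {
        "cpus_per_task": cpus_per_task,
        "mem_per_cpu": mem_per_cpu,
        "min_cpus": min_cpus,
    }


if __name__ == "__main__":
    # 4 tasks asking for 4096 MB per CPU with a 2048 MB per-CPU limit.
    print(adjust_for_max_mem_per_cpu(4096, 2048, 1, 4))
```

In this example the job ends up with 2 CPUs per task and a CPU count of 8 rather than 4. Leaving the CPU count at 4 is the kind of mismatch the commit describes.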