1. 15 Oct, 2015 1 commit
  2. 12 Oct, 2015 1 commit
  3. 09 Oct, 2015 1 commit
  4. 08 Oct, 2015 4 commits
    • Fix case where the primary and backup dbds would both be performing rollup. · b2eb504b
      Brian Christiansen authored
      If the backup dbd happened to be doing a rollup at the time the primary
      resumed, both the primary and the backup would be doing rollups, causing
      contention on the database tables. The backup would wait for the rollup
      handler to finish before giving up control.
      
      The fix is to cancel the rollup_handler and let the backup begin to shut
      down so that it closes any existing connections and then re-execs itself.
      The re-exec helps because the rollup handler spawns a thread for each
      cluster to roll up, and cancelling the rollup handler alone does not
      cancel those spawned threads. The re-exec cleans up the dbd and its locks,
      and only happens in the backup if the primary resumed while a rollup was
      in progress.
      
      Bug 1988
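
      A minimal sketch of the re-exec pattern described above, assuming a
      generic pthreads daemon rather than Slurm's actual dbd code; the
      saved_argv pointer and the Linux-specific /proc/self/exe path are
      illustrative assumptions:

        #include <pthread.h>
        #include <unistd.h>
        #include <stdio.h>

        static char **saved_argv;   /* hypothetical: captured in main() at startup */

        static void reexec_self(void)
        {
            /* Close any existing connections here, then replace the process
             * image; lingering rollup worker threads disappear with it. */
            execv("/proc/self/exe", saved_argv);   /* Linux-specific */
            perror("execv");                       /* only reached on failure */
        }

        /* Called in the backup when the primary resumes. */
        void give_up_control(pthread_t rollup_handler, int rollup_was_running)
        {
            pthread_cancel(rollup_handler);     /* stops the handler thread ...       */
            pthread_join(rollup_handler, NULL); /* ... but not the threads it spawned */
            if (rollup_was_running)
                reexec_self();                  /* so start over with a clean process */
        }

        int main(int argc, char **argv)
        {
            (void) argc;
            saved_argv = argv;
            /* ... daemon work; on primary resume during a rollup the backup
             * would call give_up_control() with the handler's thread id ... */
            return 0;
        }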
    • Fix case where if the backup slurmdbd has existing connections when it gives... · 44bb06bc
      Brian Christiansen authored
      Fix case where the backup slurmdbd would be killed if it had existing
      connections when giving up control.
      
      If the backup had existing connections when giving up control, it would try to
      signal the existing threads by using pthread_kill to send SIGKILL to the
      threads. The problem is that SIGKILL is not delivered to just that thread
      but to the whole process, so the backup dbd itself would be killed.
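
      A short illustration of the pitfall described above, using a generic
      pthreads program (not Slurm code): signal dispositions are per-process
      and SIGKILL cannot be caught, so pthread_kill(tid, SIGKILL) terminates
      the entire process rather than the one thread; cancelling the thread is
      one safer alternative.

        #include <pthread.h>
        #include <signal.h>
        #include <stdio.h>
        #include <unistd.h>

        static void *worker(void *arg)
        {
            (void) arg;
            for (;;)
                pause();        /* stand-in for a per-connection thread */
            return NULL;
        }

        int main(void)
        {
            pthread_t tid;

            pthread_create(&tid, NULL, worker, NULL);
            sleep(1);

            /* BUG: this would terminate the whole process, main() included: */
            /* pthread_kill(tid, SIGKILL); */

            /* Safer: cancel the thread (pause() is a cancellation point). */
            pthread_cancel(tid);
            pthread_join(tid, NULL);

            printf("worker stopped, daemon still alive\n");
            return 0;
        }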
    • Fixed slurmctld not sending cold-start messages correctly to the database · 4ed2f8c6
      Danny Auble authored
      when slurmctld is cold-started (-c).
    • Remove SICP job option · 0f6bf406
      Morris Jette authored
      This was intended as a step toward managing jobs across multiple
        clusters, but we will be pursuing a very different design.
  5. 07 Oct, 2015 7 commits
  6. 06 Oct, 2015 7 commits
  7. 03 Oct, 2015 1 commit
  8. 02 Oct, 2015 4 commits
  9. 01 Oct, 2015 2 commits
  10. 30 Sep, 2015 3 commits
    • Make cgroup paths consistent · c5c566ff
      Morris Jette authored
      Correct some cgroup paths ("step_batch" vs. "step_4294967294", "step_exter"
          vs. "step_extern", and "step_extern" vs. "step_4294967295").
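
      A sketch of the naming made consistent above, assuming a hypothetical
      helper; the mapping of the reserved ids 4294967294 and 4294967295 to the
      batch and extern steps is taken from the pairs listed in the message:

        #include <stdio.h>
        #include <stddef.h>
        #include <stdint.h>

        #define BATCH_STEP_ID  4294967294u   /* the id behind "step_4294967294" */
        #define EXTERN_STEP_ID 4294967295u   /* the id behind "step_4294967295" */

        static void step_cgroup_name(uint32_t step_id, char *buf, size_t len)
        {
            if (step_id == BATCH_STEP_ID)
                snprintf(buf, len, "step_batch");
            else if (step_id == EXTERN_STEP_ID)
                snprintf(buf, len, "step_extern");
            else
                snprintf(buf, len, "step_%u", step_id);
        }

        int main(void)
        {
            char name[32];
            uint32_t ids[] = { 0, BATCH_STEP_ID, EXTERN_STEP_ID };

            for (size_t i = 0; i < sizeof(ids) / sizeof(ids[0]); i++) {
                step_cgroup_name(ids[i], name, sizeof(name));
                printf("%s\n", name);   /* step_0, step_batch, step_extern */
            }
            return 0;
        }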
    • Reset job CPU count if CPUs/task ratio increased for mem limit · 836912bf
      Morris Jette authored
      If a job's CPUs/task ratio is increased due to the configured MaxMemPerCPU,
      then increase its allocated CPU count in order to enforce CPU limits.
      Previous logic would increase/set the cpus_per_task as needed if a
      job's --mem-per-cpu was above the configured MaxMemPerCPU, but did NOT
      increase the min_cpus or max_cpus variables. This resulted in allocating
      the wrong CPU count.
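
      A sketch of the arithmetic described above; the structure and field
      names are illustrative, not Slurm's internals, and the exact scaling
      Slurm applies may differ:

        #include <stdio.h>
        #include <stdint.h>

        struct job_req {
            uint64_t mem_per_cpu;   /* --mem-per-cpu request, MB */
            uint32_t cpus_per_task;
            uint32_t min_cpus;
            uint32_t max_cpus;
        };

        static void enforce_max_mem_per_cpu(struct job_req *job, uint64_t max_mem_per_cpu)
        {
            if (job->mem_per_cpu <= max_mem_per_cpu)
                return;

            /* CPUs needed per task to cover the requested memory. */
            uint32_t factor = (uint32_t)
                ((job->mem_per_cpu + max_mem_per_cpu - 1) / max_mem_per_cpu);

            job->cpus_per_task *= factor;
            /* The fix: grow min/max CPUs too, or the allocation stays too small. */
            job->min_cpus *= factor;
            job->max_cpus *= factor;
            job->mem_per_cpu = max_mem_per_cpu;
        }

        int main(void)
        {
            struct job_req job = { .mem_per_cpu = 8192, .cpus_per_task = 1,
                                   .min_cpus = 4, .max_cpus = 4 };

            enforce_max_mem_per_cpu(&job, 2048);   /* MaxMemPerCPU = 2048 MB */
            printf("cpus_per_task=%u min_cpus=%u max_cpus=%u\n",
                   job.cpus_per_task, job.min_cpus, job.max_cpus);
            /* prints: cpus_per_task=4 min_cpus=16 max_cpus=16 */
            return 0;
        }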
    • Don't start duplicate batch job · c1513956
      Morris Jette authored
      Requeue/hold batch job launch request if job already running. This is
        possible if node went to DOWN state, but jobs remained active.
      In addition, if a prolog/epilog fails, DRAIN the node rather than
        setting it DOWN, which could kill jobs that could otherwise continue
        to run.
      bug 1985
  11. 29 Sep, 2015 2 commits
  12. 28 Sep, 2015 2 commits
    • Fix for node state when shrinking jobs · 16f4b6a9
      Morris Jette authored
      When nodes have been allocated to a job and then released by the
        job while resizing, this patch prevents the nodes from continuing
        to appear allocated and unavailable to other jobs. Requires
        exclusive node allocation to trigger. This prevents the previously
        reported failure, but a proper fix will be quite complex and
        delayed to the next major release of Slurm (v 16.05).
      bug 1851
    • Fix for node state when shrinking jobs · 6c9d4540
      Morris Jette authored
      When nodes have been allocated to a job and then released by the
        job while resizing, this patch prevents the nodes from continuing
        to appear allocated and unavailable to other jobs. Requires
        exclusive node allocation to trigger. This prevents the previously
        reported failure, but a proper fix will be quite complex and
        delayed to the next major release of Slurm (v 16.05).
      bug 1851
  13. 26 Sep, 2015 1 commit
  14. 25 Sep, 2015 2 commits
  15. 24 Sep, 2015 2 commits