1. 06 Jun, 2018 1 commit
    • Brian Christiansen's avatar
      Don't allocate downed cloud nodes · be449407
      Brian Christiansen authored
      which were marked down due to ResumeTimeout.
      
      If a cloud node was marked down due to not responding by ResumeTimeout,
      the code inadvertently added the node back to the avail_node_bitmap --
      after being cleared by set_node_down_ptr(). The scheduler would then
      attempt to allocate the node again, which would cause a loop of hitting
      ResumeTimeout and allocating the downed node again.
      
      Bug 5264
      be449407
  2. 05 Jun, 2018 2 commits
  3. 04 Jun, 2018 1 commit
  4. 02 Jun, 2018 1 commit
  5. 01 Jun, 2018 1 commit
  6. 31 May, 2018 4 commits
  7. 30 May, 2018 24 commits
  8. 24 May, 2018 1 commit
    • Brian Christiansen's avatar
      Notify srun and ctld when unkillable stepd exits · 956a808d
      Brian Christiansen authored
      Commits f18390e8 and eed76f85 modified the stepd so that if the
      stepd encountered an unkillable step timeout that the stepd would just
      exit the stepd. If the stepd is a batch step then it would reply back
      to the controller with a non-zero exit code which will drain the node.
      But if an srun allocation/step were to get into the unkillable step
      code, the steps wouldn't let the waiting srun or controller know about
      the step going away -- leaving a hanging srun and job.
      
      This patch enables the stepd to notify the waiting sruns and the ctld of
      the stepd being done and drains the node for srun'ed alloction and/or
      steps.
      
      Bug 5164
      956a808d
  9. 21 May, 2018 1 commit
  10. 19 May, 2018 2 commits
  11. 18 May, 2018 2 commits