1. 20 Feb, 2018 1 commit
    • Correctly check return codes when creating a step to check if needing to wait to retry or not. · 10af7fbe
      Morris Jette authored
      
      I discovered this bug during regression testing. In some situations
      srun continuously issues step create requests while the
      launch_common_create_job_step() function does not sleep between RPCs.
      launch_common_create_job_step() sleeps for some error codes, and srun
      retries the step create for some error codes. The problem is that the
      two sets of error codes do not match, resulting in constant retries
      without sleeps. This situation is very likely with job preemption
      combined with salloc, but other conditions can trigger the same
      behavior. The following errno values all trigger this situation:
      EAGAIN, ESLURM_DISABLED, ESLURM_POWER_NOT_AVAIL, ESLURM_POWER_RESERVED,
      ESLURM_PROLOG_RUNNING, ESLURM_INTERCONNECT_BUSY.
      
      Bug 4786
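
      The mismatch described above can be illustrated with a minimal sketch.
      This is not the actual Slurm patch: every `demo_`/`DEMO_` name, the
      numeric enum values, and the callback-based loop are assumptions made
      for illustration only. The idea shown is that the "should we sleep?"
      and "should we retry?" decisions share one predicate, so the two sets
      of error codes cannot drift apart.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Slurm error codes named in the commit
 * message. The numeric values are placeholders, not Slurm's real ones. */
enum demo_errno {
    DEMO_ESLURM_DISABLED = 10001,
    DEMO_ESLURM_POWER_NOT_AVAIL,
    DEMO_ESLURM_POWER_RESERVED,
    DEMO_ESLURM_PROLOG_RUNNING,
    DEMO_ESLURM_INTERCONNECT_BUSY,
    DEMO_ESLURM_OTHER /* stands for any non-retryable error */
};

/* Single source of truth: is this step create failure worth retrying? */
static bool demo_step_retryable(int rc)
{
    switch (rc) {
    case EAGAIN:
    case DEMO_ESLURM_DISABLED:
    case DEMO_ESLURM_POWER_NOT_AVAIL:
    case DEMO_ESLURM_POWER_RESERVED:
    case DEMO_ESLURM_PROLOG_RUNNING:
    case DEMO_ESLURM_INTERCONNECT_BUSY:
        return true;
    default:
        return false;
    }
}

/* Hypothetical callbacks so the loop can be exercised without a cluster. */
typedef int (*demo_create_fn)(void *arg);
typedef void (*demo_sleep_fn)(unsigned int secs);

/* Scripted "step create": returns the next code in the list, then 0. */
struct demo_script {
    const int *codes;
    int len;
    int pos;
};

static int demo_scripted_create(void *arg)
{
    struct demo_script *s = arg;
    return (s->pos < s->len) ? s->codes[s->pos++] : 0;
}

/* Counting replacement for sleep(), used to observe the back-off. */
static int demo_sleep_calls;

static void demo_count_sleep(unsigned int secs)
{
    (void) secs;
    demo_sleep_calls++;
}

/* Retry loop: every retryable failure is followed by a sleep, and every
 * non-retryable failure is returned to the caller immediately. */
static int demo_create_with_retry(demo_create_fn create, void *arg,
                                  demo_sleep_fn nap, int max_tries)
{
    int rc = 0;

    for (int i = 0; i < max_tries; i++) {
        rc = create(arg);
        if (rc == 0 || !demo_step_retryable(rc))
            return rc;
        nap(1); /* same predicate decides both retry and sleep */
    }
    return rc;
}
```

      With a scripted sequence of EAGAIN, DEMO_ESLURM_PROLOG_RUNNING, then
      success, the loop should sleep once per failed attempt before
      succeeding, while a non-retryable code should be returned without any
      sleep.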
  2. 16 Feb, 2018 3 commits
  3. 15 Feb, 2018 3 commits
  4. 14 Feb, 2018 2 commits
  5. 13 Feb, 2018 5 commits
  6. 12 Feb, 2018 3 commits
  7. 09 Feb, 2018 2 commits
  8. 08 Feb, 2018 3 commits
  9. 07 Feb, 2018 16 commits
  10. 06 Feb, 2018 2 commits