1. 03 Feb, 2017 1 commit
  2. 31 Jan, 2017 3 commits
  3. 30 Jan, 2017 3 commits
    • Fix regression from commits · a4c51165
      Danny Auble authored
      e3a7bdcc
      f9804256
      d72b13f2
      
      Reference bug 3366
      
      On a Bluegene system we rely on the prolog to take the job out of the configuring
      state. These commits work well for systems that reboot the nodes where the prolog is running,
      but on Bluegene the desired behavior is the opposite :). With these commits, a batch job on
      a Bluegene essentially never gets launched.
      a4c51165
    • Clear job BeginTime reason · 0abbf727
      Morris Jette authored
      Clear a job's "BeginTime" reason in a more timely fashion and/or prevent
          jobs from being stuck in a PENDING state. There are multiple ways of
          clearing the reason, especially on a lightly loaded system, but the
          state can persist indefinitely on a heavily loaded system.
      bug 3368
      0abbf727
    • will_run fix for job with begin time in past · f75abc9c
      Morris Jette authored
      Fix the logic for getting the expected start time of an existing job ID
          with an explicit begin time that is in the past. Previous logic would
          compare that (past) begin time, rather than the current time, with
          the advanced reservations that would compete with the job.
      f75abc9c
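A minimal sketch of the idea behind the will_run fix, under stated assumptions: the function and struct names below (`earliest_start`, `expected_start`, `resv_window`) are hypothetical and are not taken from the Slurm source. It only illustrates clamping a past begin time to the current time before comparing against a competing reservation.

```c
#include <time.h>

/* Hypothetical reservation window [start, end); not a Slurm type. */
struct resv_window {
    time_t start;
    time_t end;
};

/* A begin time in the past cannot be honored literally: the job can
 * start no earlier than the current time. */
time_t earliest_start(time_t begin_time, time_t now)
{
    return (begin_time > now) ? begin_time : now;
}

/* Expected start time against a single competing reservation.
 * The clamped time, not the raw begin time, is what gets compared. */
time_t expected_start(time_t begin_time, time_t now,
                      const struct resv_window *resv)
{
    time_t start = earliest_start(begin_time, now);

    /* If the candidate start falls inside the reservation, the job
     * has to wait until the reservation ends. */
    if (start >= resv->start && start < resv->end)
        start = resv->end;
    return start;
}
```

With a begin time that precedes a reservation which has already ended, comparing the raw begin time would wrongly place the job inside that reservation; comparing the clamped time does not.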
  4. 29 Jan, 2017 4 commits
  5. 28 Jan, 2017 4 commits
  6. 27 Jan, 2017 2 commits
  7. 26 Jan, 2017 3 commits
  8. 25 Jan, 2017 10 commits
  9. 24 Jan, 2017 3 commits
  10. 23 Jan, 2017 7 commits
    • For batch step, reset job memory after node boot · 0277629b
      Morris Jette authored
      Reset a job's memory limit based upon what's available after node
        reboot, which can change on a KNL if the MCDRAM mode is changed
        on reboot.
      0277629b
    • Fix for backfill launch job with reboot · d72b13f2
      Morris Jette authored
      This bug was likely the root cause of bug 3366. If the backfill scheduler
        allocates resources for a batch job and a node reboot is required, the
        batch launch RPC would be sent to the agent. At that point, there is a
        race condition between the agent and the job_time_limit() function
        testing for boot completion. If the job_time_limit() function ran
        first, it would trigger a second launch RPC request getting sent to
        the agent.
      bug 3366
      d72b13f2
    • Cleaner job configuring logic · f9804256
      Morris Jette authored
      Clean up logic to test if job is configuring
      bug 3366
      f9804256
    • Avoid launching batch step while configuring · e3a7bdcc
      Morris Jette authored
      Do not launch a batch step while the job is configuring. Previous
        logic checked for the PrologSlurmctld running, but not nodes
        booting. Checking the job's CONFIGURING state flag will validate
        both.
      bug 3366
      e3a7bdcc
    • Avoid duplicate configuration complete logic · db6acb8f
      Morris Jette authored
      Add check to prevent step allocation logic from executing job
        configuration completion logic multiple times (check if job
        is configuring before clearing flag and resetting time limit).
      bug 3366
      db6acb8f
    • fix slurmctld/agent race condition · 53784477
      Morris Jette authored
      slurmctld/agent race condition fix: Prevent job launch while the PrologSlurmctld
          daemon is running or a node boot is in progress.
      bug 3366
      53784477
    • job write lock added to agent_retry() · 379007b8
      Morris Jette authored
      This is required to manage the configuration completion.
      bug 3366
      379007b8
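The bug 3366 commits above share one pattern: gate the batch launch on a single "configuring" flag, and run the configuration-complete logic only while that flag is still set. The following is a rough sketch of that pattern only. The flag value, the `struct job` fields, and both function names are hypothetical stand-ins, not Slurm's actual definitions, and the locking that the agent_retry() commit adds is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value; assumed to cover both a running
 * PrologSlurmctld and nodes that are still booting. */
#define JOB_FLAG_CONFIGURING 0x1u

/* Hypothetical, heavily reduced job record. */
struct job {
    uint32_t state_flags;
    int config_complete_count; /* times the completion logic ran */
    bool launched;             /* a batch launch request was sent */
};

/* Run the configuration-complete logic at most once: check that the
 * job is still configuring before clearing the flag. */
bool job_config_complete(struct job *job)
{
    if (!(job->state_flags & JOB_FLAG_CONFIGURING))
        return false;             /* already completed, do not repeat */
    job->state_flags &= ~JOB_FLAG_CONFIGURING;
    job->config_complete_count++; /* stand-in for resetting the time limit */
    return true;
}

/* Send the batch launch only when the job is not configuring, and
 * never send it a second time. */
bool try_launch_batch(struct job *job)
{
    if (job->state_flags & JOB_FLAG_CONFIGURING)
        return false;  /* prolog running or nodes booting */
    if (job->launched)
        return false;  /* avoid a duplicate launch request */
    job->launched = true;
    return true;
}
```

In a real controller both functions would have to run under the job write lock, since the race described above is between two code paths that inspect and modify the same job record.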