1. 27 Nov, 2012 14 commits
  2. 26 Nov, 2012 10 commits
  3. 25 Nov, 2012 3 commits
  4. 22 Nov, 2012 7 commits
  5. 21 Nov, 2012 6 commits
    • Remove some currently unused variables · 3e8b10b3
      Morris Jette authored
    • slurmstepd : correct a bug in the IO thread termination monitoring · f297242e
      Matthieu Hautreux authored
      A dedicated thread (_kill_thr) is launched by slurmstepd at the end of a
      step in order to destroy the IO thread if it does not manage to correctly
      terminate by itself after 300 seconds.
      
      Two bugs are corrected in this logic by this patch.
      
      First, the sleep(300) call is not protected against interruptions. If
      slurmstepd receives a signal, the sleep can return after only a few
      seconds, so the IO thread is forced to terminate before the grace time
      expires. The logic is modified to loop around sleep() so that the full
      delay is respected.
      
      Second, to terminate the IO thread, a SIGKILL is delivered to it using
      pthread_kill(). However, sending SIGKILL with pthread_kill() is a
      process-wide operation (see man pthread_kill), so all the slurmstepd
      threads are killed and slurmstepd is terminated. The logic is modified
      to use pthread_cancel() instead of pthread_kill(), giving the
      pthread_join() in _wait_for_io() a chance to act as expected.
      
      Without this patch, when _kill_thr is interrupted, slurmstepd is
      terminated, leaving the step in an incomplete state, as the node may not
      have been able to send the REQUEST_STEP_COMPLETE to the controller.
      Consecutive steps can then no longer be executed and stay permanently in
      the "Job step creation temporarily disabled, retrying" state.
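
      The two corrections described above can be sketched as follows. This is
      a minimal illustration and not the actual slurmstepd code: sleep_full(),
      io_thread_stub() and stop_io_thread() are hypothetical names, and the
      stub thread only stands in for the real IO thread. It assumes the target
      thread reaches a cancellation point (here pause()).

```c
#include <pthread.h>
#include <unistd.h>

/* Sleep for the whole duration even when signals interrupt sleep().
 * sleep() returns the number of unslept seconds when interrupted,
 * so looping on its return value preserves the full delay. */
static void sleep_full(unsigned int seconds)
{
    unsigned int remaining = seconds;

    while (remaining > 0)
        remaining = sleep(remaining);
}

/* Hypothetical stand-in for an IO thread that never ends by itself. */
static void *io_thread_stub(void *arg)
{
    (void)arg;
    for (;;)
        pause();            /* pause() is a cancellation point */
    return NULL;
}

/* Wait for the grace time, then cancel only the target thread.
 * Unlike pthread_kill(tid, SIGKILL), pthread_cancel() does not
 * terminate the whole process, so pthread_join() can still run.
 * Returns 0 when the thread was cancelled and joined. */
static int stop_io_thread(pthread_t tid, unsigned int grace_seconds)
{
    void *result = NULL;

    sleep_full(grace_seconds);
    if (pthread_cancel(tid) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)
        return -1;
    return (result == PTHREAD_CANCELED) ? 0 : -1;
}
```

      A caller would create the thread with pthread_create() and then invoke
      stop_io_thread(tid, 300) from the monitoring thread; the process stays
      alive after the call and can still report the step completion.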
    • Correct a bug with -w in step management resulting in inadequate memory errors returned to srun · ac86cc37
      Matthieu Hautreux authored
      When requesting a particular nodelist for a step, if at least one of the nodes is
      still used by a former step (no REQUEST_STEP_COMPLETE received from that node),
      the current behavior is to return ESLURM_INVALID_TASK_MEMORY, and srun aborts
      with "Memory required by task is not available".
      
      This can be reproduced by launching consecutive steps with the -w parameter set
      to $SLURM_NODELIST and introducing delays in the spank epilog on the execution
      nodes.
      
      The behavior is changed to only defer the execution of the step by returning
      ESLURM_NODES_BUSY when it is detected that some nodes are blocked because of
      already used memory.
    • Correct a bug in consecutive steps management due to asynchronous step completions · 4c97337d
      Matthieu Hautreux authored
      When using consecutive steps, it appears that in some cases the slurmstepd on
      the execution nodes takes longer to inform the controller of the completion of
      a step than it takes to request the following step.
      In that scenario, the controller can reject the step by returning the error code
      ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE even if the step could be executed once
      all the former steps had correctly finished.
      
      This can be reproduced by launching consecutive steps and introducing delays in
      the spank epilog on the execution nodes.
      
      The behavior is changed to only defer the execution of the step by returning
      ESLURM_NODES_BUSY when not all of the available nodes are idle because of
      former steps.
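
      The distinction between deferring and rejecting a step can be sketched
      as below. This is only an illustration of the decision described above,
      not the controller's code: the enum values, the function name and the
      node counters are hypothetical and do not match Slurm's real error codes
      or data structures.

```c
/* Hypothetical result codes standing in for Slurm's real error codes. */
enum step_rc {
    STEP_OK = 0,              /* enough idle nodes, start now            */
    STEP_NODES_BUSY,          /* would fit once former steps complete    */
    STEP_CONFIG_UNAVAILABLE   /* can never fit in this allocation        */
};

/* Decide how to answer a step request.
 * nodes_needed:  nodes requested by the new step
 * nodes_idle:    nodes of the job not used by any former step right now
 * nodes_in_job:  all nodes of the job allocation */
static enum step_rc classify_step_request(int nodes_needed,
                                          int nodes_idle,
                                          int nodes_in_job)
{
    if (nodes_needed <= nodes_idle)
        return STEP_OK;
    if (nodes_needed <= nodes_in_job)
        return STEP_NODES_BUSY;       /* defer: the caller may retry later */
    return STEP_CONFIG_UNAVAILABLE;   /* reject: retrying cannot help */
}
```

      With a 4-node allocation where 3 nodes have not yet reported completion
      of the former step, a 2-node request is deferred rather than rejected.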