1. 17 Jan, 2013 1 commit
    • Morris Jette's avatar
      Add support for configurable keep alive time for srun/slurmstep communications · 5752c6ce
      Morris Jette authored
      Added "KeepAliveTime" configuration parameter
      
      From Matthieu Hautreux:
      TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
      exchanges, which can leave the system deadlocked if a remote host
      disappears while the local host is waiting on a read (a write would result in
      an EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive with a
      relatively large timeout value (5 minutes) could enhance the resilience of
      SLURM to unexpected packet/connection loss without much impact on the
      scalability of the solution. The timeout could be configurable in case it is
      found too aggressive for particular configurations.
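The keepalive idea described in this commit message can be sketched in a few lines. This is only an illustration of the socket options involved, not the code added to SLURM: the helper name `enable_keepalive` and its `idle_seconds` parameter are hypothetical, and the 300-second default simply mirrors the "5 minutes" mentioned above. `TCP_KEEPIDLE` is a Linux option, so the sketch sets it only where the platform exposes it.

```python
import socket


def enable_keepalive(sock, idle_seconds=300):
    """Enable TCP keepalive probes on an existing socket (illustrative helper)."""
    # Ask the kernel to probe an idle connection so that a blocked read
    # eventually fails if the remote host has silently disappeared.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle time before the first probe; Linux-specific, so guard the lookup.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock
```

Without keepalive, a read on a connection whose peer has vanished can block indefinitely, because no packet ever arrives to signal the failure. With it, the kernel reports an error on the socket once its probes go unanswered.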
  2. 16 Jan, 2013 18 commits
  3. 15 Jan, 2013 8 commits
  4. 14 Jan, 2013 12 commits
    • Hongjia Cao's avatar
      Prevent srun abort on task launch failure · 163d9547
      Hongjia Cao authored
      On job step launch failure, the function
      "slurm_step_launch_wait_finish()" will be called twice in launch/slurm,
      which causes srun to be aborted:
      
      srun: error: Task launch for 22495.0 failed on node cn6: Job credential
      expired
      srun: error: Application launch failed: Job credential expired
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      cn5
      cn4
      cn7
      srun: error: Timed out waiting for job step to complete
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: error: Timed out waiting for job step to complete
      srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
      Aborted (core dumped)
      
      The attached patch (for version 2.5.1) fixes the abort, but the messages
      "
      Job step aborted: Waiting up to 2 seconds for job step to finish.
      Timed out waiting for job step to complete
      "
      will still be printed twice.
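The failure mode above, a wait/cleanup routine that runs twice and trips over state it already released, is commonly avoided with a "finished" guard. The sketch below shows that general pattern only. It is not the attached patch, and `StepLaunchWaiter` is a hypothetical stand-in rather than a SLURM data structure.

```python
class StepLaunchWaiter:
    """Hypothetical stand-in for a step-launch context (not a SLURM type)."""

    def __init__(self):
        self.finished = False
        self.cleanup_calls = 0

    def wait_finish(self):
        """Run the wait/cleanup once; later calls return False and do nothing."""
        if self.finished:
            # Second call: the state was already released, so do not touch it.
            return False
        self.finished = True
        self.cleanup_calls += 1
        return True
```

Calling `wait_finish()` twice on the same object performs the cleanup once, so a repeated call becomes harmless instead of reaching freed or null state.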
    • Morris Jette's avatar
      select/cons_res plugin: CPU allocation logic fix · 1ef41ac9
      Morris Jette authored
      Correct the CPU allocation count logic for cores without hyperthreading.
    • Hongjia Cao's avatar
      Add SLURM_SRUN_REDUCE_TASK_EXIT_MSG environment variable · 96986199
      Hongjia Cao authored
      When jobs launched directly with srun end abnormally, there will
      be a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
      SIGNAL 9 ***) from each node, and/or a task-exit message
      (srun: error: task[0-1]: Terminated) for each node. For
      large-scale jobs, these messages become tedious and bury the other
      error messages. The two attached patches (for slurm-2.5.1)
      introduce two environment variables to control the output of such
      messages:
      
      SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
      print the step-killed-message;
      
      SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
      exit messages with the same exit code will be printed only once.
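The behaviour described for SLURM_SRUN_REDUCE_TASK_EXIT_MSG, printing successive task exit messages with the same exit code only once, can be sketched as follows. This is an illustrative reading of the commit message, not srun's implementation: the function name and the `(task_ids, exit_code)` event shape are hypothetical, and treating a non-numeric value as "off" is an assumption.

```python
import os


def reduce_task_exit_messages(events, env=None):
    """Return the (task_ids, exit_code) events that would still be printed."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_SRUN_REDUCE_TASK_EXIT_MSG", "")
    try:
        enabled = int(raw) != 0  # "set and non-zero" turns the reduction on
    except ValueError:
        enabled = False  # assumption: unset or non-numeric values mean off
    if not enabled:
        return list(events)

    printed = []
    last_code = None
    for task_ids, code in events:
        # Skip a message whose exit code repeats the one just printed.
        if printed and code == last_code:
            continue
        printed.append((task_ids, code))
        last_code = code
    return printed
```

Only consecutive repeats are dropped. If the same exit code appears again after a different one, the sketch prints it again, matching the word "successive" in the description.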
    • Hongjia Cao's avatar
      Add SLURM_STEP_KILLED_MSG_NODE_ID environment variable · 232ab305
      Hongjia Cao authored
      When jobs launched directly with srun end abnormally, there will
      be a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
      SIGNAL 9 ***) from each node, and/or a task-exit message
      (srun: error: task[0-1]: Terminated) for each node. For
      large-scale jobs, these messages become tedious and bury the other
      error messages. The two attached patches (for slurm-2.5.1)
      introduce two environment variables to control the output of such
      messages:
      
      SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
      print the step-killed-message;
      
      SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
      exit messages with the same exit code will be printed only once.
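The rule for SLURM_STEP_KILLED_MSG_NODE_ID, where only the specified node prints the step-killed message, amounts to a per-node check. The sketch below is a minimal reading of that rule, not slurmd's code: the function name is hypothetical, node IDs are assumed to be integers, and staying silent on an unparsable value is an assumption.

```python
import os


def should_print_step_killed(node_id, env=None):
    """Decide whether the node with this ID prints the step-killed message."""
    env = os.environ if env is None else env
    wanted = env.get("SLURM_STEP_KILLED_MSG_NODE_ID")
    if wanted is None:
        return True  # variable unset: every node reports, as before
    try:
        return int(wanted) == node_id
    except ValueError:
        return False  # assumption: an unparsable value matches no node
```

For a step spanning nodes 0 to 3 with the variable set to `0`, only node 0 would report, so a large job yields one step-killed line instead of one per node.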
    • Morris Jette's avatar
      Merge branch 'slurm-2.5' · fef33d8d
      Morris Jette authored
    • Morris Jette's avatar
      Add debugging hint to MPI guide for MPICH2 · dd8c22c7
      Morris Jette authored
    • Yair Yarom's avatar
      Fix bug in accounting_storage/pgsql · 667cbf15
      Yair Yarom authored
    • Morris Jette's avatar
      08cfbf0a
    • Morris Jette's avatar
      Revision of gres topology bug fix · e9c216c4
      Morris Jette authored
  5. 11 Jan, 2013 1 commit