1. 16 Jan, 2013 18 commits
  2. 15 Jan, 2013 8 commits
  3. 14 Jan, 2013 12 commits
    • Prevent srun abort on task launch failure · 163d9547
      Hongjia Cao authored
      On job step launch failure, the function
      "slurm_step_launch_wait_finish()" is called twice in launch/slurm,
      which causes srun to abort:
      
      srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired
      srun: error: Application launch failed: Job credential expired
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      cn5
      cn4
      cn7
      srun: error: Timed out waiting for job step to complete
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: error: Timed out waiting for job step to complete
      srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
      Aborted (core dumped)
      
      The attached patch (for version 2.5.1) fixes the abort, but the messages
      "
      Job step aborted: Waiting up to 2 seconds for job step to finish.
      Timed out waiting for job step to complete
      "
      will still be printed twice.
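
The double-call failure described above can be illustrated with a minimal Python sketch. It is not the actual patch: `StepContext` and `wait_finish` are hypothetical names, and the sketch only assumes that making the wait/cleanup step idempotent is one way to tolerate a second call.

```python
class StepContext:
    """Hypothetical stand-in for the launch state of one job step."""

    def __init__(self, task_count):
        # State that the first wait releases (illustrative only).
        self.tasks_pending = set(range(task_count))
        self.finished = False
        self.wait_calls = 0

    def wait_finish(self):
        """Release the step state once; later calls are harmless no-ops."""
        self.wait_calls += 1
        if self.finished:
            # Second call: the state is already gone, so do not touch it.
            return False
        self.tasks_pending = None  # simulate freeing the structure
        self.finished = True
        return True


ctx = StepContext(task_count=4)
first = ctx.wait_finish()   # performs the cleanup
second = ctx.wait_finish()  # guarded, does nothing
```

With a guard of this kind, the second call returns early instead of operating on state that the first call already released.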
    • select/cons_res plugin: CPU allocation logic fix · 1ef41ac9
      Morris Jette authored
      Correction to CPU allocation count logic for cores without hyperthreading.
    • Add SLURM_SRUN_REDUCE_TASK_EXIT_MSG environment variable · 96986199
      Hongjia Cao authored
      When a job launched directly with srun ends abnormally, there is a
      step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
      SIGNAL 9 ***) from each node and/or a task-exit message (srun: error:
      task[0-1]: Terminated) for each node. For large-scale jobs, these
      messages become tedious and bury the other error messages. The two
      attached patches (for slurm-2.5.1) introduce two environment variables
      to control the output of such messages:
      
      SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
      print the step-killed-message;
      
      SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
      exit messages with the same exit code will be printed only once.
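
The behaviour described for SLURM_SRUN_REDUCE_TASK_EXIT_MSG can be sketched in Python as follows. This is an illustration under stated assumptions, not the srun implementation: the function names, the event tuples, the output format, and the handling of non-numeric values are all hypothetical.

```python
import os

VAR = "SLURM_SRUN_REDUCE_TASK_EXIT_MSG"


def reduce_enabled(environ=None):
    """Return True when the variable is set and non-zero."""
    env = os.environ if environ is None else environ
    value = env.get(VAR)
    if value is None:
        return False
    try:
        return int(value) != 0
    except ValueError:
        # Assumption: a non-numeric value is treated like "not set".
        return False


def task_exit_lines(events, reduce_msgs):
    """Format task-exit messages, skipping successive repeats of one code.

    events is a list of (task_ids, exit_code) pairs in arrival order.
    The message format here is illustrative, not srun's real output.
    """
    lines = []
    previous = object()  # sentinel that equals no real exit code
    for task_ids, exit_code in events:
        if reduce_msgs and exit_code == previous:
            continue  # same code as the message just printed
        lines.append(f"task[{task_ids}]: exited with code {exit_code}")
        previous = exit_code
    return lines


events = [("0-1", 15), ("2-3", 15), ("4-5", 9), ("6-7", 15)]
reduced = task_exit_lines(events, reduce_msgs=True)
full = task_exit_lines(events, reduce_msgs=False)
```

Only successive repeats are collapsed in this sketch, so the final code 15 event is printed again because a different exit code came in between.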
    • Add SLURM_STEP_KILLED_MSG_NODE_ID environment variable · 232ab305
      Hongjia Cao authored
      When a job launched directly with srun ends abnormally, there is a
      step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
      SIGNAL 9 ***) from each node and/or a task-exit message (srun: error:
      task[0-1]: Terminated) for each node. For large-scale jobs, these
      messages become tedious and bury the other error messages. The two
      attached patches (for slurm-2.5.1) introduce two environment variables
      to control the output of such messages:
      
      SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
      print the step-killed-message;
      
      SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
      exit messages with the same exit code will be printed only once.
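
The filtering described for SLURM_STEP_KILLED_MSG_NODE_ID can be sketched in Python as follows. This is a hedged illustration, not the slurmd code: `should_print_step_killed` is a hypothetical helper, and the sketch assumes the node ID is an integer index compared against the variable's value as text.

```python
import os

VAR = "SLURM_STEP_KILLED_MSG_NODE_ID"


def should_print_step_killed(node_id, environ=None):
    """Decide whether this node reports the step-killed message.

    Unset variable: every node prints, as before the patch.
    Set variable: only the node whose ID matches prints.
    """
    env = os.environ if environ is None else environ
    wanted = env.get(VAR)
    if wanted is None:
        return True
    # Assumption: IDs are compared as trimmed decimal strings.
    return wanted.strip() == str(node_id)


# Four nodes, with only node 2 selected to report.
env = {VAR: "2"}
reporting = [n for n in range(4) if should_print_step_killed(n, env)]
```

In this sketch a value that matches no node silences the message everywhere, which is one possible reading of "only the specified node will print".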
    • Merge branch 'slurm-2.5' · fef33d8d
      Morris Jette authored
    • Add debugging hint to MPI guide for MPICH2 · dd8c22c7
      Morris Jette authored
    • Fix bug in accounting_storage/pgsql · 667cbf15
      Yair Yarom authored
    • Morris Jette's avatar
      08cfbf0a
    • Revision of gres topology bug fix · e9c216c4
      Morris Jette authored
  4. 11 Jan, 2013 2 commits