1. 18 Jan, 2013 1 commit
  2. 17 Jan, 2013 6 commits
    • Merge branch 'slurm-2.5' · d44a7cbd
      Morris Jette authored
      Conflicts:
      	src/sacctmgr/sacctmgr.c
      	src/sreport/sreport.c
    • Terminate sreport on EOF · 892b14aa
      David Bigagli authored
    • Fix typo in comment · 56a821a7
      Morris Jette authored
    • Use shutdown() rather than close() for slurmstepd/srun sockets · 30f31198
      Morris Jette authored
      From Matthieu Hautreux:
      However, after discussing the point with onsite Bull support team and looking
      at the slurmstepd code concerning stdout/err/in redirection we would like to
      recommend two things for future versions of SLURM:
      
      - shutdown(...,SHUT_WR) should be performed when managing the TCP sockets: no
      shutdown(...,SHUT_WR) is performed on the TCP socket in slurmstepd eio
      management. Thus, the close() cannot reliably inform the other end of the
      socket that the transmission is done (no TCP FIN transmitted). As the close is
      followed by an exit(), the kernel is the only entity that knows that the
      close may not have been taken into account by the other side (which
      might be our initial problem) and thus no retry can be performed, leaving the
      server side of the socket (srun) in a position where it can wait on a read
      until the end of time.
      
      - TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
      exchanges, thus leaving the system potentially deadlocked if a remote host
      disappears and the local host is waiting on a read (a write would result in
      EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a
      relatively large timeout value (5 minutes) could enhance the resilience of
      SLURM to unexpected packet/connection loss without too much impact on the
      scalability of the solution. The timeout could be configurable in case it is
      found too aggressive for particular configurations.
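
The first recommendation above (half-closing with shutdown(..., SHUT_WR) before close()) can be sketched as follows. This is a minimal illustration, not the code from the SLURM sources: the helper name send_and_half_close is hypothetical, and error handling is reduced to a single return code (EINTR and SIGPIPE are not handled).

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from SLURM): write the whole buffer, announce
 * end-of-stream to the peer with shutdown(SHUT_WR), then release the fd.
 * Returns 0 on success and -1 if any step failed. */
static int send_and_half_close(int fd, const char *buf, size_t len)
{
    int rc = 0;

    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            rc = -1;
            break;
        }
        buf += n;
        len -= (size_t)n;
    }

    /* Half-close: no more data will be sent, so the peer's read()
     * returns 0 once it has drained everything written above. */
    if (rc == 0 && shutdown(fd, SHUT_WR) < 0)
        rc = -1;

    /* The descriptor is closed on every path to avoid leaking it. */
    if (close(fd) < 0)
        rc = -1;

    return rc;
}
```

On the receiving side, a loop that calls read() until it returns 0 then terminates instead of waiting indefinitely.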
    • Add support for configurable keep alive time for srun/slurmstep communications · 5752c6ce
      Morris Jette authored
      Added "KeepAliveTime" configuration parameter
      
      From Matthieu Hautreux:
      TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
      exchanges, thus leaving the system potentially deadlocked if a remote host
      disappears and the local host is waiting on a read (a write would result in
      EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a
      relatively large timeout value (5 minutes) could enhance the resilience of
      SLURM to unexpected packet/connection loss without too much impact on the
      scalability of the solution. The timeout could be configurable in case it is
      found too aggressive for particular configurations.
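
A keepalive setting of this kind might look like the sketch below. It assumes the POSIX SO_KEEPALIVE socket option plus the TCP_KEEPIDLE option where the platform provides it (as Linux does). The helper name enable_keepalive is hypothetical, and how SLURM maps "KeepAliveTime" onto socket options is not shown in this log.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not from SLURM): turn on TCP keepalive for fd and,
 * where TCP_KEEPIDLE exists, start probing after idle_secs of silence.
 * Returns 0 on success, -1 on failure. */
static int enable_keepalive(int fd, int idle_secs)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    /* Platform-specific: idle time before the first keepalive probe. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
#else
    (void)idle_secs;  /* system default idle time applies */
#endif

    return 0;
}
```

With the 5 minutes suggested above, a caller would pass 300 as idle_secs. A blocked read() on a connection whose peer has vanished then fails once the probes go unanswered.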
  3. 16 Jan, 2013 18 commits
  4. 15 Jan, 2013 8 commits
  5. 14 Jan, 2013 7 commits
    • Prevent srun abort on task launch failure · 163d9547
      Hongjia Cao authored
      On job step launch failure, the function
      "slurm_step_launch_wait_finish()" will be called twice in launch/slurm,
      which causes srun to be aborted:
      
      srun: error: Task launch for 22495.0 failed on node cn6: Job credential
      expired
      srun: error: Application launch failed: Job credential expired
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      cn5
      cn4
      cn7
      srun: error: Timed out waiting for job step to complete
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: error: Timed out waiting for job step to complete
      srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
      Aborted (core dumped)
      
      The attached patch (version 2.5.1) fixes it. But the message of
      "
      Job step aborted: Waiting up to 2 seconds for job step to finish.
      Timed out waiting for job step to complete
      "
      will still be printed twice.
    • select/cons_res plugin: CPU allocation logic fix · 1ef41ac9
      Morris Jette authored
      Correction to CPU allocation count logic for cores without hyperthreading.
    • Add SLURM_SRUN_REDUCE_TASK_EXIT_MSG environment variable · 96986199
      Hongjia Cao authored
      When jobs launched directly with srun end abnormally, there will be a
      step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
      SIGNAL 9 ***) from each node and/or a task-exit message
      (srun: error: task[0-1]: Terminated) for each node. For large-scale
      jobs, these messages become tedious and bury the other error
      messages. The attached two patches (for slurm-2.5.1)
      introduce two environment variables to control the output of such
      messages:
      
      SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
      print the step-killed message;
      
      SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
      exit messages with the same exit code will be printed only once.
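
The behaviour described for SLURM_SRUN_REDUCE_TASK_EXIT_MSG could be sketched as below. This is an illustration under stated assumptions, not the patch itself: the function names are hypothetical, and "non-zero" is interpreted here with atoi().

```c
#include <stdlib.h>

/* Hypothetical check (not from SLURM): the variable counts as enabled
 * when it is set and parses to a non-zero integer. */
static int reduce_task_exit_msg_enabled(void)
{
    const char *v = getenv("SLURM_SRUN_REDUCE_TASK_EXIT_MSG");
    return v != NULL && atoi(v) != 0;
}

/* Hypothetical filter: returns 1 if a task-exit message for exit_code
 * should be printed, 0 if it repeats the previous message's exit code
 * while reduction is enabled. */
static int should_print_task_exit_msg(int exit_code)
{
    static int have_last = 0;
    static int last_code = 0;

    if (reduce_task_exit_msg_enabled() && have_last && exit_code == last_code)
        return 0;

    have_last = 1;
    last_code = exit_code;
    return 1;
}
```

Only successive duplicates are suppressed: the sequence 15, 15, 9, 15 with the variable set to 1 would print three messages.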
    • Add SLURM_STEP_KILLED_MSG_NODE_ID environment variable · 232ab305
      Hongjia Cao authored
      When jobs launched directly with srun end abnormally, there will be a
      step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
      SIGNAL 9 ***) from each node and/or a task-exit message
      (srun: error: task[0-1]: Terminated) for each node. For large-scale
      jobs, these messages become tedious and bury the other error
      messages. The attached two patches (for slurm-2.5.1)
      introduce two environment variables to control the output of such
      messages:
      
      SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
      print the step-killed message;
      
      SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
      exit messages with the same exit code will be printed only once.
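
The node filter described for SLURM_STEP_KILLED_MSG_NODE_ID might be sketched as follows. The function name is hypothetical, and treating the node ID as an integer index compared against the local node's ID is an assumption; the log does not say how the patch identifies nodes.

```c
#include <stdlib.h>

/* Hypothetical filter (not from SLURM): returns 1 if the node with the
 * given integer ID should print the step-killed message. With the
 * variable unset, every node prints; with it set, only the matching
 * node does. */
static int should_print_step_killed_msg(int node_id)
{
    const char *v = getenv("SLURM_STEP_KILLED_MSG_NODE_ID");

    if (v == NULL)
        return 1;
    return atoi(v) == node_id;
}
```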