1. 10 Aug, 2018 3 commits
  2. 08 Aug, 2018 3 commits
  3. 07 Aug, 2018 16 commits
  4. 06 Aug, 2018 10 commits
    • Brian Christiansen's avatar
      7e107975
    • Use correct units on slurmd side · 539cbae4
      Brian Christiansen authored
      when validating memory limits
      
      Continuation of b502d179
    • Change debug messages() in _launch_handler(). · c3c78acd
      Tim Wickberg authored
      After changes to slurm_send_only_node_msg(), this message is much
      more likely to appear on systems with overloaded interconnects since
      that connection handling code may end up retransmitting messages
      that were actually received (but that the transmit side could not
      verify were delivered successfully).
      
      As the error() message stated, this isn't actually an error, and
      the code will proceed happily past this point. So drop the debug
      level, and remove the surrealist "this is not an error" part.
      
      Bug 5164.
    • Modify slurm_send_only_node_msg() to catch issues with socket. · 06582da8
      Tim Wickberg authored
      There are subtle issues involved in treating a TCP transmission
      as a unidirectional message delivery layer.
      
      The original code path looks like: connect(), write(), close().
      But Linux handles the write() and close() asynchronously behind the
      scenes, and does not block until that write() has been ACK'd by the
      remote end. So the write() and close() may succeed, even with data
      still in flight. A communication error - and message loss - would
      have been silently ignored, leading to unreliable message transmission.
      
      Worse yet, one side of the connection would believe it sent the message,
      while the receive side swears it never saw the packets. This leads to
      infrequent and yet seemingly impossible data loss, and a very tough
      bug to chase down.
      
      This teardown code tries to force the connection to shut down in an
      orderly manner, giving Slurm a chance to catch a connection problem
      and the upstream calling path an opportunity to retransmit.
      
      This teardown code is based on an approach described in Section 7.5
      of "UNIX Network Programming" Volume 1 (Third Edition), specifically
      the subsection regarding SO_LINGER. (That section also covers why SO_LINGER
      alone is not sufficient to prevent this issue.)
      
      Bug 5164.
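The orderly teardown described in this commit message can be sketched roughly as follows. This is a minimal illustration of the shutdown-then-wait-for-EOF pattern from the cited book section, not Slurm's actual implementation; the function name `send_and_confirm_close()` and its `timeout_ms` parameter are hypothetical. It assumes `fd` is an already connected stream socket and that the caller ignores SIGPIPE.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm code): write the whole buffer, half-close
 * the socket, then wait for the peer to close its side before closing ours.
 * Always closes fd. Returns 0 if the peer's EOF was seen, -1 on any error
 * or timeout, so the caller gets a chance to retransmit.
 */
int send_and_confirm_close(int fd, const void *buf, size_t len, int timeout_ms)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}

	/* Send FIN: we are done writing, but the read side stays open. */
	if (shutdown(fd, SHUT_WR) < 0) {
		close(fd);
		return -1;
	}

	/* Wait until the peer closes its side (read() returns 0). */
	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		char scratch[64];
		ssize_t n;
		int rc = poll(&pfd, 1, timeout_ms);

		if (rc < 0 && errno == EINTR)
			continue;
		if (rc <= 0) {		/* timeout or poll failure */
			close(fd);
			return -1;
		}
		n = read(fd, scratch, sizeof(scratch));
		if (n == 0)		/* EOF: peer has closed */
			break;
		if (n < 0 && errno != EINTR) {
			close(fd);
			return -1;
		}
		/* n > 0: unexpected data from the peer; discard and keep waiting */
	}

	return close(fd) == 0 ? 0 : -1;
}
```

A return of -1 here only means delivery could not be confirmed; as the commit message notes, the message may still have arrived, so the receiving side has to tolerate duplicates when the caller retransmits.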
    • Retransmit on all errors. · a572d5d6
      Tim Wickberg authored
      Bug 5164.
    • Use reliable communication with srun in _send_srun_resp_msg(). · ed638974
      Tim Wickberg authored
      Do not use slurm_send_only_node_msg().
      
      There is no way to tell whether the srun has received the message before
      the socket is shut down unless we wait to receive data. Use
      slurm_send_recv_rc_msg_only_one() instead, and send back a response
      from the other side.
      
      We still need the older (and unreliable) behavior when talking to older
      srun client commands, so make this change dependent on the protocol_version
      field in the message.
      
      Bug 5164.
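The version-dependent behavior described in this commit message might look something like the sketch below. Everything here is illustrative: `send_resp_msg()`, `ACK_MIN_PROTOCOL_VERSION`, and the 4-byte return-code format are assumptions made for the example, and do not reflect Slurm's wire protocol or the internals of slurm_send_recv_rc_msg_only_one().

```c
#include <arpa/inet.h>
#include <poll.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Made-up threshold: peers at or above this version send back a return code. */
#define ACK_MIN_PROTOCOL_VERSION 2

/*
 * Hypothetical helper: send msg on a connected socket. For new enough peers,
 * wait for a 4-byte return code in network byte order and treat 0 as success.
 * For older peers, fall back to send-only behavior, where delivery cannot be
 * verified. Returns 0 on success, -1 on failure or timeout.
 */
int send_resp_msg(int fd, uint16_t peer_version, const void *msg, size_t len,
		  int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	uint32_t rc_net;

	if (write(fd, msg, len) != (ssize_t) len)
		return -1;

	/* Older peer: it will not reply, so we cannot confirm delivery. */
	if (peer_version < ACK_MIN_PROTOCOL_VERSION)
		return 0;

	/* Newer peer: block until its return code arrives or we time out. */
	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -1;
	if (read(fd, &rc_net, sizeof(rc_net)) != (ssize_t) sizeof(rc_net))
		return -1;

	return ntohl(rc_net) == 0 ? 0 : -1;
}
```

The point of the design is that only an explicit reply from the receiver proves the message was processed; a successful write() by itself does not.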
    • Retransmit on all errors. · 60bc0e14
      Tim Wickberg authored
      Bug 5164.
    • Fix job array preemption in backfill scheduling. · 5efab599
      Marshall Garey authored
      Previously only a single task of a job array could preempt during
      backfill scheduling. This allows multiple tasks to preempt and have
      resources reserved in backfill.
      
      Bug 5405.
    • Fix stepd segfault when using proctrack/linux · dab09800
      Brian Christiansen authored
      This appears to be an oversight of 865338c7 where the cont_id check
      was changed from NO_VAL64 to INFINITE64. The cont_id is initialized to
      NO_VAL64 in src/common/slurm_jobacct_gather.c.
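To illustrate the kind of sentinel mismatch this commit message describes: the two macro values below are what I understand slurm.h to define, but treat them as assumptions, and both check functions are hypothetical stand-ins rather than the actual Slurm code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match slurm.h; verify against the real header. */
#define NO_VAL64   0xfffffffffffffffeULL
#define INFINITE64 0xffffffffffffffffULL

/* Hypothetical mismatched check: compares against the wrong sentinel. */
bool cont_id_is_set_mismatched(uint64_t cont_id)
{
	return cont_id != INFINITE64;
}

/* Hypothetical matching check: compares against the initial value. */
bool cont_id_is_set(uint64_t cont_id)
{
	return cont_id != NO_VAL64;
}
```

With a cont_id still holding its initial NO_VAL64, the mismatched check reports the ID as set, so later code would go on to use a container ID that was never assigned.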
  5. 04 Aug, 2018 2 commits
  6. 03 Aug, 2018 5 commits
  7. 02 Aug, 2018 1 commit