    Modify slurm_send_only_node_msg() to catch issues with socket. · 06582da8
    Tim Wickberg authored
    There are subtle issues involved in treating a TCP transmission
    as a unidirectional message delivery layer.
    
    The original code path looks like: connect(), write(), close().
    But Linux handles the write() and close() asynchronously behind the
    scenes, and does not block until that write() has been ACK'd by the
    remote end. So the write() and close() may succeed, even with data
    still in flight. A communication error - and message loss - would
    have been silently ignored, leading to unreliable message transmission.
    
    Worse yet, one side of the connection would believe it sent the message,
    while the receive side swears it never saw the packets. This leads to
    infrequent and yet seemingly impossible data loss, and a very tough
    bug to chase down.
    
    This teardown code tries to force the connection to shut down in an
    orderly manner, giving Slurm a chance to catch a connection problem
    and the upstream calling path an opportunity to retransmit.
    
    This teardown code is based on an approach described in Section 7.5
    of "UNIX Network Programming" Volume 1 (Third Edition), specifically
    the subsection regarding SO_LINGER, which also covers why SO_LINGER
    alone is not sufficient to prevent this issue.
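
The orderly teardown described above can be illustrated with a minimal sketch. This is not the code from this commit: send_and_confirm() is a hypothetical helper, and the sketch assumes the receiver reads the message and then closes its end of the connection. It follows the general shutdown()-then-read pattern: send the data, half-close the write side, then wait for the peer's EOF so that a reset or a stalled peer is reported as an error instead of being silently dropped.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not Slurm's implementation):
 * send buf, then wait for the peer to close before reporting success.
 * Returns 0 if the peer closed its side in an orderly way, -1 otherwise.
 */
static int send_and_confirm(int fd, const void *buf, size_t len,
			    int timeout_ms)
{
	const char *p = buf;
	char scratch[64];
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	/* Write everything; MSG_NOSIGNAL reports EPIPE instead of SIGPIPE. */
	while (len > 0) {
		ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}

	/* Half-close: tell the peer that no more data is coming. */
	if (shutdown(fd, SHUT_WR) < 0)
		return -1;

	/* Wait for the peer's EOF; the timeout applies to each wait. */
	for (;;) {
		int rc = poll(&pfd, 1, timeout_ms);
		if (rc < 0 && errno == EINTR)
			continue;
		if (rc <= 0)
			return -1;	/* poll() error or timeout */

		ssize_t n = recv(fd, scratch, sizeof(scratch), 0);
		if (n == 0)
			return 0;	/* orderly close by the peer */
		if (n < 0 && errno != EINTR)
			return -1;	/* e.g. ECONNRESET */
		/* n > 0: unexpected reply bytes, discard and keep waiting */
	}
}
```

A caller would still close(fd) afterwards and treat a -1 return as a cue to retransmit. A failure here does not prove the message was lost, so the receiver may occasionally see a duplicate.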
    
    Bug 5164.