Commit 06582da8 authored by Tim Wickberg

Modify slurm_send_only_node_msg() to catch issues with socket.

There are subtle issues involved in treating a TCP transmission
as a unidirectional message delivery layer.

The original code path looks like: connect(), write(), close().
But Linux handles the write() and close() asynchronously behind the
scenes, and does not block until that write() has been ACK'd by the
remote end. So the write() and close() may succeed, even with data
still in flight. A communication error - and message loss - would
have been silently ignored, leading to unreliable message transmission.

Worse yet, one side of the connection would believe it sent the message,
while the receive side swears it never saw the packets. This leads to
infrequent and yet seemingly impossible data loss, and a very tough
bug to chase down.

This teardown code tries to force the connection to shut down in an
orderly manner, giving Slurm a chance to catch a connection problem
and the upstream calling path an opportunity to retransmit.

This teardown code is based on an approach described in Section 7.5
of "UNIX Network Programming" Volume 1 (Third Edition), specifically
the subsection regarding SO_LINGER. (That subsection also covers why
SO_LINGER is not sufficient to prevent this issue.)
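
As a rough sketch of the general technique (shutdown() followed by a read that waits for the peer's end-of-file), and not the actual change made to slurm_send_only_node_msg(), an orderly teardown might look like the code below. The helper name send_and_confirm() and its timeout_ms parameter are assumptions made for this example, and it assumes the peer closes its side once it has consumed the message.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's code): send a buffer, half-close the
 * socket, then wait for the peer's EOF before closing. Always closes fd.
 * Returns 0 on success, or -1 with errno set so the caller can retransmit.
 */
static int send_and_confirm(int fd, const void *buf, size_t len, int timeout_ms)
{
	const char *p = buf;
	char scratch[64];
	struct pollfd pfd;
	int saved;

	while (len > 0) {
		/* MSG_NOSIGNAL: report a dead peer as EPIPE, not SIGPIPE. */
		ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += n;
		len -= (size_t) n;
	}

	/* Half-close: signal that we have nothing more to send. */
	if (shutdown(fd, SHUT_WR) < 0)
		goto fail;

	/* Wait for the peer to close its side. The timeout applies per wait. */
	pfd.fd = fd;
	pfd.events = POLLIN;
	for (;;) {
		ssize_t n;
		int rc = poll(&pfd, 1, timeout_ms);

		if (rc < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		if (rc == 0) {
			errno = ETIMEDOUT;
			goto fail;
		}
		n = recv(fd, scratch, sizeof(scratch), 0);
		if (n == 0)
			break;		/* EOF: the peer closed its side */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;	/* e.g. ECONNRESET */
		}
		/* Unexpected payload on a send-only link: discard it. */
	}
	return close(fd);

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

With this shape, a reset connection or a peer that never responds surfaces as an error return instead of a silent success, which is what gives the calling path a chance to retry.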

Bug 5164.
parent a572d5d6