Modify slurm_send_only_node_msg() to catch issues with socket. (06582da8) · Commits · Manuel G. Marciani / ces_slurm_simulator

Commit 06582da8 authored Aug 06, 2018 by Tim Wickberg

Modify slurm_send_only_node_msg() to catch issues with socket.

There are subtle issues involved in treating a TCP transmission
as a unidirectional message delivery layer.

The original code path was: connect(), write(), close().
But Linux completes the write() and close() asynchronously behind the
scenes, and neither call blocks until the data has been ACK'd by the
remote end. So the write() and close() may both succeed, even with data
still in flight. A later communication error, and the resulting message
loss, would be silently ignored, making message transmission unreliable.

Worse yet, one side of the connection would believe it sent the message,
while the receive side swears it never saw the packets. This leads to
infrequent and yet seemingly impossible data loss, and a very tough
bug to chase down.
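As an illustration only, the fire-and-forget pattern described above can be sketched as follows. The name naive_send() is hypothetical and this is not the code from slurm_send_only_node_msg(); it just shows that every call can report success without the peer ever having consumed the message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's code): write the message, then close.
 * A return value of 0 only means the kernel accepted the bytes into its
 * send buffer; it says nothing about whether the peer received them.
 */
int naive_send(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL);

	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}

	/* close() returns immediately; queued data may still be in flight. */
	return close(fd);
}
```

With a connected stream socket whose peer never calls read(), naive_send() still returns 0, which is the silent loss the commit message describes.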

This teardown code tries to force the connection to shut down in an
orderly manner, giving Slurm a chance to catch a connection problem
and the upstream calling path an opportunity to retransmit.

This teardown code is based on an approach described in Section 7.5
of "UNIX Network Programming" Volume 1 (Third Edition), specifically
the subsection regarding SO_LINGER, which also covers why SO_LINGER
alone is not sufficient to prevent this issue.
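A minimal sketch of an orderly teardown in that spirit is shown below. It is an assumption about the general technique (half-close with shutdown(), then wait for the peer's end-of-file), not the code added by this commit; the name send_and_confirm() and the timeout_ms parameter are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's code): send a message, half-close the
 * socket, then wait until the peer closes its side. Returns 0 once EOF is
 * seen, -1 on error or timeout (errno is set to ETIMEDOUT on timeout).
 * The caller still owns fd and must close() it afterwards.
 */
int send_and_confirm(int fd, const void *buf, size_t len, int timeout_ms)
{
	const char *p = buf;
	char scratch[64];

	assert(fd >= 0 && buf != NULL);

	/* Write the whole message, retrying on short writes and EINTR. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}

	/* Half-close: signal end of data but keep the read side open. */
	if (shutdown(fd, SHUT_WR) < 0)
		return -1;

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		int rc = poll(&pfd, 1, timeout_ms);

		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (rc == 0) {
			errno = ETIMEDOUT;	/* peer never closed its side */
			return -1;
		}

		ssize_t n = read(fd, scratch, sizeof(scratch));
		if (n == 0)
			return 0;	/* EOF: the peer closed in an orderly way */
		if (n < 0 && errno != EINTR)
			return -1;	/* e.g. ECONNRESET: report the failure */
		/* n > 0: unexpected reply data, discard and keep waiting */
	}
}
```

A caller can treat a -1 return as a signal to retransmit, which is the opportunity the commit message says the upstream calling path should get. Note that write() on a broken connection can raise SIGPIPE, so a real caller would also need to ignore or handle that signal.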

Bug 5164.

parent a572d5d6
