Commit 30f31198 authored by Morris Jette

Use shutdown() rather than close() for slurmstepd/srun sockets

From Matthieu Hautreux:
However, after discussing the point with the onsite Bull support team and
looking at the slurmstepd code concerning stdout/err/in redirection, we would
like to recommend two things for future versions of SLURM:

- shutdown(...,SHUT_WR) should be performed when managing the TCP sockets: no
shutdown(...,SHUT_WR) is performed on the TCP socket in slurmstepd eio
management. Thus, the close() cannot reliably inform the other end of the
socket that the transmission is done (no TCP FIN transmitted). As the close is
followed by an exit(), the kernel is the only entity that knows that the close
may not have been taken into account by the other side (which might be our
initial problem), and thus no retry can be performed, leaving the server side
of the socket (srun) in a position where it can wait on a read until the end
of time.

- TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
exchanges, thus leaving the system potentially deadlocked if a remote host
disappears while the local host is waiting on a read (a write would result in
EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a
relatively large timeout value (5 minutes) could enhance the resilience of
SLURM to unexpected packet/connection loss without too much impact on the
scalability of the solution. The timeout could be configurable in case it is
found too aggressive for particular configurations.
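
The two recommendations above can be sketched roughly as follows. This is an
illustration only, not the code from this commit: the helper names
finish_stream() and enable_keepalive() are hypothetical, and the sketch assumes
a POSIX system. The portable socket option is SO_KEEPALIVE; the idle timeout is
set through TCP_KEEPIDLE, which is assumed to be available (as on Linux) and is
skipped otherwise.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not SLURM code): tell the peer that no more data
 * will be written, then release the descriptor. shutdown(SHUT_WR) marks
 * the write side as finished, so the peer's read() returns 0 once it has
 * consumed the data already sent.
 */
int finish_stream(int fd)
{
    int rc = 0;

    if (shutdown(fd, SHUT_WR) < 0)
        rc = -1;            /* still close the descriptor below */
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical helper (not SLURM code): enable keepalive probes on a
 * socket so that a blocked read() eventually fails if the remote host
 * vanishes. idle_secs is the idle time before the first probe, for
 * example 300 for the 5 minutes suggested above.
 */
int enable_keepalive(int fd, int idle_secs)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
#ifdef TCP_KEEPIDLE
    /* Platform-specific: only applied where TCP_KEEPIDLE is defined. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
#else
    (void)idle_secs;        /* system default idle time is used instead */
#endif
    return 0;
}
```

A caller would typically invoke enable_keepalive(fd, 300) right after the
connection is established, and finish_stream(fd) once the last output has been
written and before the process exits.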
parent 5752c6ce