Use shutdown() rather than close() for slurmstepd/srun sockets
From Matthieu Hautreux:

However, after discussing the point with the onsite Bull support team and looking at the slurmstepd code concerning stdout/err/in redirection, we would like to recommend two things for future versions of SLURM:

- shutdown(...,SHUT_WR) should be performed when managing the TCP sockets: no shutdown(...,SHUT_WR) is performed on the TCP socket in slurmstepd eio management. Thus, the close() cannot reliably inform the other end of the socket that the transmission is done (no TCP_FIN transmitted). As the close is followed by an exit(), the kernel is the only entity that knows the close may not have been taken into account by the other side (which might be our initial problem), and thus no retry can be performed, leaving the server side of the socket (srun) in a position where it can wait on a read until the end of time.

- TCP_KEEPALIVE addition: no TCP_KEEPALIVE seems to be configured in SLURM TCP exchanges, thus leaving the system potentially deadlocked if a remote host disappears and the local host is waiting on a read (a write would result in an EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes) could enhance the resilience of SLURM to unexpected packet/connection loss without too much impact on the scalability of the solution. The timeout could be configurable in case it is found too aggressive for particular configurations.
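
The two recommendations above can be sketched roughly as follows. This is only an illustration using standard POSIX socket calls, not the actual slurmstepd eio code: the helper names finish_and_close() and enable_keepalive() are hypothetical, and the per-socket idle timeout relies on TCP_KEEPIDLE, which is assumed to be available (it is on Linux) and is skipped otherwise.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: tell the peer we are done writing before closing.
 * shutdown(fd, SHUT_WR) asks the kernel to send a FIN once queued data
 * has been transmitted, so the reader on the other end sees EOF.
 */
static int finish_and_close(int fd)
{
	int rc = 0;

	/* ENOTCONN means the peer is already gone; nothing left to signal. */
	if (shutdown(fd, SHUT_WR) < 0 && errno != ENOTCONN)
		rc = -1;
	if (close(fd) < 0)
		rc = -1;
	return rc;
}

/*
 * Hypothetical helper: enable keepalive probes on a TCP socket so that a
 * blocked read eventually fails if the remote host disappears.
 * idle_secs is the idle time before the first probe (e.g. 300 for the
 * 5 minutes suggested above).
 */
static int enable_keepalive(int fd, int idle_secs)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
#ifdef TCP_KEEPIDLE
	/* Assumption: TCP_KEEPIDLE exists on this platform (Linux does). */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
		       &idle_secs, sizeof(idle_secs)) < 0)
		return -1;
#endif
	return 0;
}
```

A caller would invoke enable_keepalive(fd, 300) right after the connection is established and finish_and_close(fd) in place of a bare close(fd) when the transmission is complete. Where TCP_KEEPIDLE is not defined, the system-wide keepalive idle time applies instead.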