Commit 5752c6ce authored by Morris Jette

Add support for configurable keep-alive time for srun/slurmstepd communications

Added "KeepAliveTime" configuration parameter

From Matthieu Hautreux:
TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
exchanges, which can leave the system deadlocked if a remote host
disappears while the local host is waiting on a read (a write would instead
fail with EPIPE or raise SIGPIPE, depending on the masked signals). Adding
keepalive with a relatively large timeout value (5 minutes) could improve the
resilience of SLURM to unexpected packet or connection loss without much
impact on the scalability of the solution. The timeout could be made
configurable in case it is found too aggressive for particular configurations.
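
As a rough illustration of the idea above, the sketch below enables keepalive
probes on a connected TCP socket and sets the idle time before the first
probe. The helper name set_keep_alive() and its keep_alive_secs argument are
hypothetical and are not taken from the SLURM source tree; TCP_KEEPIDLE is a
Linux-specific option, so it is guarded with #ifdef. How this commit actually
applies KeepAliveTime is not shown in the message, so this is only one
plausible way to do it.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from the SLURM sources).
 * Enables TCP keepalive on fd. keep_alive_secs is the idle time before the
 * first probe is sent; a value <= 0 leaves the system defaults untouched.
 * Returns 0 on success, -1 if a setsockopt() call fails.
 */
static int set_keep_alive(int fd, int keep_alive_secs)
{
	int on = 1;

	if (keep_alive_secs <= 0)
		return 0;	/* nothing requested, keep system defaults */

	/* Turn on keepalive probes for this socket. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;

#ifdef TCP_KEEPIDLE
	/* Linux-specific: seconds of idle time before probing starts. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
		       &keep_alive_secs, sizeof(keep_alive_secs)) < 0)
		return -1;
#endif

	return 0;
}
```

With the 5 minute value suggested above, a caller would use something like
set_keep_alive(fd, 300) right after the connection is established, so that a
read blocked on a vanished peer eventually fails instead of waiting forever.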
parent 0367b663