    Add support for configurable keep alive time for srun/slurmstep communications · 5752c6ce
    Morris Jette authored
    Added "KeepAliveTime" configuration parameter
    
    From Matthieu Hautreux:
    TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
    exchanges, which can leave the system potentially deadlocked if a remote host
    disappears while the local host is waiting on a read (a write would instead
    fail with EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive
    with a relatively large timeout value (5 minutes) could improve the resilience
    of SLURM to unexpected packet/connection loss without much impact on the
    scalability of the solution. The timeout could be made configurable in case it
    is found too aggressive for particular configurations.
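
As a rough illustration of the idea described above, the sketch below enables keepalive probing on a TCP socket with a configurable idle time. It is not the code from this commit: the helper name `enable_keepalive` and its `keep_alive_time` argument (in seconds) are hypothetical, and it assumes a platform that provides `TCP_KEEPIDLE`, such as Linux.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not SLURM's actual implementation).
 * Enables TCP keepalive on fd and, where the platform supports it,
 * sets the idle time before the first probe to keep_alive_time seconds.
 * A value <= 0 leaves the socket untouched (system defaults apply).
 * Returns 0 on success, -1 on failure (errno set by setsockopt).
 */
int enable_keepalive(int fd, int keep_alive_time)
{
    int on = 1;

    if (keep_alive_time <= 0)
        return 0;

    /* Ask the kernel to probe an idle connection. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    /* Idle seconds before the first keepalive probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &keep_alive_time, sizeof(keep_alive_time)) < 0)
        return -1;
#endif

    return 0;
}
```

With probing enabled, a read blocked on a peer that has vanished eventually fails with an error instead of waiting forever. Calling `enable_keepalive(fd, 300)` would correspond to the 5 minute value suggested in the message; how the "KeepAliveTime" parameter maps onto socket options in SLURM itself is not shown here.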