Fix a potential hang in srun when a node fails.
The hang occurs when a node fails, srun gets no notification through TCP that the connection was lost, and slurmctld sees the down node and ends the job step.