Harden slurmctld HA takeover against both controllers running at once.
If the network path to the shared storage used for the StateSaveLocation is separate from the network used to communicate with the cluster, both the primary and backup controllers can end up acting as master when the cluster network is lost. Alter the HA takeover code path to verify that the job state save file is no longer being updated by the primary slurmctld. If it is still being updated, refuse to take over and retry later. Bug 3592.
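The check described above can be illustrated with a minimal sketch. This is not the code from this commit: the function names, the 30-second quiet window, and the use of the file's modification time as the "still being updated" signal are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical quiet window: how long the job state file must go
 * unmodified before the backup assumes the primary has stopped writing. */
#define STATE_QUIET_SECS 30

/* True if mtime falls within quiet_secs of now. An mtime in the future
 * (clock skew between hosts) also counts as recent, so the caller
 * errs on the side of refusing. */
static bool mtime_is_recent(time_t mtime, time_t now, int quiet_secs)
{
	return difftime(now, mtime) < (double) quiet_secs;
}

/* True if the backup may take over: the job state file exists and has
 * not been modified recently. If the file cannot be examined, refuse,
 * since the state of the primary cannot be verified. */
static bool takeover_looks_safe(const char *job_state_path, int quiet_secs)
{
	struct stat st;

	if (stat(job_state_path, &st) != 0)
		return false;
	return !mtime_is_recent(st.st_mtime, time(NULL), quiet_secs);
}
```

A backup controller following this sketch would call takeover_looks_safe() with the path of the job state file under StateSaveLocation and STATE_QUIET_SECS before assuming control, and would wait and try again whenever it returns false. The window would have to be longer than the interval at which the primary saves state, otherwise a healthy primary could look idle.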