Harden slurmctld HA to mitigate certain HA issues. (e69449b6) · Commits · Manuel G. Marciani / ces_slurm_simulator

Commit e69449b6 authored Sep 07, 2017 by Tim Wickberg

Harden slurmctld HA to mitigate certain HA issues.

If the network path to shared storage used for the StateSaveLocation
is separate from that used to communicate with the cluster, both the
primary and backup controllers can end up acting as master on loss
of the cluster network.

Alter the HA takeover code path to make sure that the job state
save file is not still being updated by the primary slurmctld.
If it is, refuse to take over and retry later.
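
The check described above can be illustrated with a minimal sketch. This is
not the code from the commit (the diff is not shown here); it only assumes
that "still being updated" can be approximated by comparing the state file's
modification time against a quiet period. The function names, the
`quiet_period_s` parameter, and the return strings are hypothetical.

```python
import os
import time


def state_file_is_active(path, quiet_period_s, now=None):
    """Return True if `path` was modified within the last `quiet_period_s` seconds.

    Assumption: a recent mtime means another controller is still writing state.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No state file exists, so nothing can be updating it.
        # (A real controller may prefer to treat this as an error instead.)
        return False
    return (now - mtime) < quiet_period_s


def attempt_takeover(path, quiet_period_s, now=None):
    """Decide whether a backup controller may assume control (illustrative only)."""
    if state_file_is_active(path, quiet_period_s, now):
        # The primary appears to be alive and saving state: back off.
        return "retry_later"
    return "take_over"
```

A backup controller would call `attempt_takeover()` each time it decides the
primary is unreachable, and only proceed when the result is `"take_over"`.
The quiet period has to be longer than the interval at which the primary
saves state, otherwise a healthy primary could look idle between two writes.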

Bug 3592.

parent 5f1df178
