Commit e69449b6 authored by Tim Wickberg

Harden slurmctld HA against both controllers acting as master.

If the network path to shared storage used for the StateSaveLocation
is separate from that used to communicate with the cluster, both the
primary and backup controllers can end up acting as master on loss
of the cluster network.

Alter the HA takeover code path to make sure that the job state
save file is not still being updated by the primary slurmctld.
If it is, refuse to take over and retry later.
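
The check described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the slurmctld implementation (which is C). The function names, the `quiet_period` parameter, and the use of the file's modification time as the "still being updated" signal are all assumptions made for illustration. The sketch also assumes the backup's clock is roughly in sync with whatever stamps the file on shared storage.

```python
import os
import time


def state_file_recently_updated(path, quiet_period, now=None):
    """Return True if `path` was modified less than `quiet_period` seconds ago.

    Hypothetical helper: a fresh modification time is taken as evidence
    that a primary controller is still alive and writing job state.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # Assumption: with no state file, no primary is writing one.
        return False
    return (now - mtime) < quiet_period


def try_takeover(path, quiet_period, now=None):
    """Hypothetical backup-side guard: True means takeover may proceed."""
    if state_file_recently_updated(path, quiet_period, now):
        # The primary still appears active: refuse, caller retries later.
        return False
    return True
```

A caller would invoke `try_takeover()` each time the backup decides the primary is unreachable, and simply wait and call it again when it returns `False`. The `now` argument exists only so the behaviour can be exercised without sleeping.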

Bug 3592.
parent 5f1df178