    Harden slurmctld HA to mitigate certain HA issues. · e69449b6
    Tim Wickberg authored
    If the network path to shared storage used for the StateSaveLocation
    is separate from that used to communicate with the cluster, both the
    primary and backup controllers can end up acting as master on loss
    of the cluster network.
    
    Alter the HA takeover code path to make sure that the job state
    save file is not still being updated by the primary slurmctld.
    If it is, refuse to take over and retry later.
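
The check described above can be illustrated with a minimal sketch. This is not the slurmctld implementation (which is written in C); it only assumes that "still being updated" can be detected by sampling the file's metadata twice and comparing. The function names `state_file_is_active` and `try_takeover`, the `quiet_period` parameter, and the choice of fields to compare are all hypothetical.

```python
import os
import time


def _signature(path):
    """Return (mtime, size, inode) for path, or None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size, st.st_ino)


def state_file_is_active(path, quiet_period=1.0, sleep=time.sleep):
    """Return True if the file at `path` changed while we waited.

    Assumption: a live primary rewrites the state file periodically, so
    any change in mtime, size, or inode during `quiet_period` seconds
    means another controller is still writing it.
    """
    before = _signature(path)
    sleep(quiet_period)
    after = _signature(path)
    return before != after


def try_takeover(path, take_over, quiet_period=1.0, sleep=time.sleep):
    """Call `take_over()` only if the state file looks idle.

    Returns True if takeover ran, False if it was refused; the caller
    is expected to retry later on a refusal.
    """
    if state_file_is_active(path, quiet_period, sleep):
        return False
    take_over()
    return True
```

In this sketch the inode is compared as well as the modification time because a writer that saves to a temporary file and renames it over the old one replaces the inode even when the timestamps are coarse. A real deployment would choose `quiet_period` to be longer than the primary's state save interval, and would need to account for attribute caching on the shared filesystem.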
    
    Bug 3592.