    Defer slurmd registration until NodeHealthCheck · 7fb0c981
    Thomas Hamel authored
    We want to introduce a new behavior in the way slurmd uses the
    HealthCheckProgram. The idea is to avoid a race condition between the
    first HealthCheckProgram run and the node accepting jobs. The slurmd
    daemon will initialize and then loop on HealthCheckProgram execution
    before registering with slurmctld. It stays in this loop until the
    HealthCheckProgram returns successfully; the node remains DOWN in the
    meantime.
    
    On our clusters we use NHC as the HealthCheckProgram. NHC drains the
    node if its checks fail and removes the drain if they succeed, which
    fits our purpose well. This lets us start slurmd at boot without
    setting up a complex boot sequence in the init system: slurmd simply
    waits for the node to be ready before registering.
    
    The HealthCheckProgram is not run during slurmd startup if
    HealthCheckInterval is 0.
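
The startup behavior described in this message can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the function names `run_health_check` and `wait_for_health_check`, the `retry_delay_s` parameter, and the use of `system()` to launch the program are all assumptions made for the example. slurmd's real implementation may launch and time out the script differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the health check program once.
 * Returns 0 if it exited with status 0, -1 otherwise.
 */
static int run_health_check(const char *program)
{
    assert(program != NULL);

    int status = system(program);
    if (status == -1)
        return -1;                      /* could not launch the program */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;                       /* health check passed */
    return -1;                          /* non-zero exit or killed by a signal */
}

/*
 * Hypothetical startup gate: loop on the health check until it passes,
 * so that registration with the controller only happens afterwards.
 * The check is skipped when the interval is 0 or no program is set.
 * Returns the number of health check runs performed (0 if skipped).
 */
static int wait_for_health_check(const char *program,
                                 unsigned int interval,
                                 unsigned int retry_delay_s)
{
    int runs = 0;

    if (interval == 0 || program == NULL)
        return 0;                       /* startup check disabled */

    while (1) {
        runs++;
        if (run_health_check(program) == 0)
            break;                      /* node is healthy, stop looping */
        sleep(retry_delay_s);           /* assumed delay between attempts */
    }
    return runs;
}
```

In this sketch the daemon would call `wait_for_health_check()` after its own initialization and only then send its registration message. Until that call returns, the controller has heard nothing from the node, so it cannot schedule jobs there.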