Thomas Hamel authored
We want to change how slurmd uses the HealthCheckProgram. The goal is to avoid a race condition between the first HealthCheckProgram run and the node accepting jobs. The slurmd daemon will initialize and then loop on HealthCheckProgram execution before registering with slurmctld. It stays in this loop until the HealthCheckProgram returns successfully; the node remains DOWN in the meantime. On our clusters we use NHC as the HealthCheckProgram. NHC drains the node if a check fails and removes the drain if the checks succeed, which fits our purpose well. This lets us start slurmd at boot without setting up a complex boot sequence in the init system: slurmd simply waits for the node to be ready before registering. The HealthCheckProgram is not run during slurmd startup if HealthCheckInterval is 0.
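The startup sequence described above can be sketched roughly as follows. This is an illustrative Python model, not the slurmd code itself (which is written in C). The names `wait_until_healthy` and `start_node` are hypothetical, as is the `retry_delay` pause between attempts, since the commit message does not say how long slurmd waits between runs.

```python
import subprocess
import time


def wait_until_healthy(program, interval, retry_delay=1.0,
                       run=subprocess.run, sleep=time.sleep):
    """Block until `program` exits with status 0; return the attempt count.

    Returns 0 without running anything when no program is configured or
    when `interval` is 0, modelling the HealthCheckInterval=0 case.
    `retry_delay` is an assumption made for this sketch.
    """
    if not program or interval == 0:
        return 0
    attempts = 0
    while True:
        attempts += 1
        result = run([program])
        if result.returncode == 0:
            return attempts
        # The check failed: the node is not ready, so wait and try again.
        sleep(retry_delay)


def start_node(program, interval, register):
    """Run the health check loop first, then register (hypothetical)."""
    attempts = wait_until_healthy(program, interval)
    # Registration happens only after a successful check (or a skipped one).
    register()
    return attempts
```

The `run` and `sleep` parameters exist only so the loop can be exercised without launching a real program; a caller would normally leave them at their defaults.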
7fb0c981