Commit 7fb0c981 authored by Thomas Hamel, committed by Morris Jette

Defer slurmd registration until NodeHealthCheck

This commit changes how slurmd uses the HealthCheckProgram. The goal is
to avoid a race condition between the first HealthCheckProgram run and
the node accepting jobs. The slurmd daemon now initializes and then
loops on HealthCheckProgram execution before registering with slurmctld.
It stays in this loop until the HealthCheckProgram returns successfully;
the node remains DOWN in the meantime.

On our clusters we use NHC as the HealthCheckProgram. NHC drains the
node if it fails and removes the drain if it succeeds, which fits this
purpose well. This behavior lets us start slurmd at boot without
setting up a complex boot sequence in the init system: slurmd just
waits for the node to be ready before registering.

The HealthCheckProgram is not run during slurmd startup if
HealthCheckInterval is 0.
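
The startup behavior described above can be illustrated with a minimal
sketch. This is not the slurmd implementation (slurmd is written in C);
the function name wait_for_healthy_node, the retry_delay parameter, and
the injectable run/sleep callables are all hypothetical and exist only
for illustration. The delay between retries is an assumption, since the
commit message does not state one.

```python
import subprocess
import time


def wait_for_healthy_node(program, interval, retry_delay=1.0,
                          run=subprocess.call, sleep=time.sleep):
    """Block until `program` exits with status 0.

    Hypothetical helper mirroring the described startup loop.
    Returns the number of times the health check was run.
    """
    # HealthCheckInterval == 0: the check is skipped at startup,
    # so the caller may register right away.
    if interval == 0:
        return 0

    attempts = 0
    while True:
        attempts += 1
        # Exit status 0 means the health check passed.
        if run([program]) == 0:
            return attempts
        # Assumed pause between retries; the node has not registered
        # yet, so it stays DOWN while this loop runs.
        sleep(retry_delay)
```

A caller would invoke this helper after initialization and register with
the controller only once it returns.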
parent 50286191