Do not defer slurmd node registration if HealthCheckProgram fails
This behavior was introduced in bug 2504 (commit 7fb0c981) and bug 2643 (commit 988edf12). The reasoning is that sysadmins end up confused when they see nodes with Reason "Not Responding" that they can still ping or access manually. That reason should only be set if the node is truly not responding, not if the HealthCheckProgram execution failed or returned a non-zero exit code. In that case, the program itself would take the appropriate actions, such as draining the node and setting an appropriate Reason.

Bug 3931