Commit 14596246 authored by Alejandro Sanchez

Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP].

After commit b31fa177, we do not defer slurmd node registration if
HealthCheckProgram fails. So at slurmd startup, slurmd executes:

run_script_health_check();
_spawn_registration_engine();

And does not keep spinning if NHC fails. Now consider nodes managed by
the Power Save logic: when such a node is requested to POWER_UP because
a job has been allocated resources on it, NHC is executed at slurmd
startup, before the node registers.

The problem arises when this NHC execution fails and the NHC program
decides to update the node to DRAIN. Since the job was already
allocated before this update, the job will still attempt to start
RUNNING, but might fail because NHC detected that something is wrong
with the node.

This change detects DRAIN/FAIL node update requests and checks whether
the node is ALLOC/MIXED and POWER_[SAVE|UP]. If so, it forces a requeue
so that the job doesn't start on a failed node.
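
The check described above can be sketched roughly as follows. This is
only an illustration of the decision, not the code in this commit: the
struct, the enum, the EX_* flag names and ex_should_requeue() are all
hypothetical and do not correspond to Slurm's real node state
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base states, for illustration only (not Slurm's values). */
enum ex_base_state { EX_IDLE, EX_ALLOCATED, EX_MIXED, EX_DOWN };

/* Hypothetical state flags, for illustration only (not Slurm's values). */
#define EX_FLAG_DRAIN      (1u << 0)
#define EX_FLAG_FAIL       (1u << 1)
#define EX_FLAG_POWER_SAVE (1u << 2)
#define EX_FLAG_POWER_UP   (1u << 3)

/* Hypothetical, simplified view of a node record. */
struct ex_node {
    enum ex_base_state base;  /* current base state of the node */
    uint32_t flags;           /* current EX_FLAG_* bits set on the node */
};

/*
 * Return true when a node update request should force a requeue of the
 * job allocated to the node: the request asks for DRAIN or FAIL, the
 * node currently holds an allocation (ALLOC/MIXED), and the node is
 * flagged POWER_SAVE or POWER_UP.
 */
bool ex_should_requeue(const struct ex_node *node, uint32_t requested_flags)
{
    assert(node != NULL);

    bool drain_or_fail =
        (requested_flags & (EX_FLAG_DRAIN | EX_FLAG_FAIL)) != 0;
    bool has_allocation =
        node->base == EX_ALLOCATED || node->base == EX_MIXED;
    bool power_managed =
        (node->flags & (EX_FLAG_POWER_SAVE | EX_FLAG_POWER_UP)) != 0;

    return drain_or_fail && has_allocation && power_managed;
}
```

In this sketch, an allocated node that is still powering up and receives
a DRAIN request yields true, while the same request against a node
without the POWER_[SAVE|UP] flags yields false and leaves the job alone.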

Bug 4689.
parent 10c90b25