Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP]. (14596246) · Commits · Manuel G. Marciani / ces_slurm_simulator

Commit 14596246 authored Feb 22, 2018 by

Alejandro Sanchez

Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP].

After commit b31fa177, we do not defer slurmd node registration if
HealthCheckProgram fails. So at slurmd startup, slurmd executes:

run_script_health_check();
_spawn_registration_engine();

And it does not keep spinning if NHC fails. Now, if there are nodes
managed by the Power Save logic and they are requested to
POWER_UP because a job is allocated resources, then at slurmd startup
NHC is executed before the node registers.

The problem arises when this NHC execution fails and the NHC program
decides to update the node to DRAIN. Since the job was already
allocated before this update, the job will attempt to start
RUNNING but might fail, since NHC detected that something is wrong.

This change detects DRAIN/FAIL node update requests, checks whether
the node is ALLOC/MIXED and POWER_[SAVE|UP], and if so forces a
requeue so that the job doesn't start on a failed node.
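
The check described above can be sketched roughly as follows. This is
only an illustration of the decision, not the code from this commit:
the sketch_* types, flag values and function name are hypothetical
stand-ins, and Slurm's real node state encoding and requeue call are
not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the node base states named in the commit
 * message. The real definitions live in Slurm's headers and differ. */
enum sketch_base_state {
    SKETCH_IDLE,
    SKETCH_ALLOCATED,
    SKETCH_MIXED,
    SKETCH_DOWN
};

/* Hypothetical flag bits; the values are arbitrary for this sketch. */
#define SKETCH_FLAG_DRAIN      0x01u
#define SKETCH_FLAG_FAIL       0x02u
#define SKETCH_FLAG_POWER_SAVE 0x04u
#define SKETCH_FLAG_POWER_UP   0x08u

struct sketch_node {
    enum sketch_base_state base; /* current base state of the node */
    uint32_t flags;              /* current state flags of the node */
};

/*
 * Return true when a node update request should force a requeue of the
 * job allocated to the node:
 *   1. the request asks for DRAIN or FAIL,
 *   2. the node currently has resources allocated (ALLOC or MIXED),
 *   3. the node is managed by Power Save (POWER_SAVE or POWER_UP).
 */
static bool sketch_should_requeue(const struct sketch_node *node,
                                  uint32_t requested_flags)
{
    assert(node != NULL);

    bool drain_or_fail =
        (requested_flags & (SKETCH_FLAG_DRAIN | SKETCH_FLAG_FAIL)) != 0;
    bool has_allocation =
        node->base == SKETCH_ALLOCATED || node->base == SKETCH_MIXED;
    bool power_managed =
        (node->flags & (SKETCH_FLAG_POWER_SAVE | SKETCH_FLAG_POWER_UP)) != 0;

    return drain_or_fail && has_allocation && power_managed;
}
```

In this sketch, a node that is ALLOC with POWER_UP set and receives a
DRAIN request yields true, while the same request against a node that
is not power managed yields false and the job is left alone.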

Bug 4689.

parent 10c90b25
