NEWS · 145962467424752e43e40d8e2d183847348b99d4 · Manuel G. Marciani / ces_slurm_simulator · GitLab

Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP]. · 14596246

Alejandro Sanchez authored Feb 22, 2018

After commit b31fa177, we do not defer slurmd node registration if
HealthCheckProgram fails. So at slurmd startup, slurmd executes:

run_script_health_check();
_spawn_registration_engine();

And does not keep spinning if NHC fails. Now, when nodes managed by
the Power Save logic are requested to POWER_UP because a job has been
allocated resources on them, NHC is executed at slurmd startup before
the node registers.

The problem arises when this NHC execution fails and the NHC program
decides to update the node to DRAIN. Since the job was already
allocated before this update, it will attempt to start RUNNING but
might fail, because NHC detected that something is wrong with the node.

This change detects DRAIN/FAIL node update requests, checks whether
the node is ALLOC/MIXED and POWER_[SAVE|UP], and if so forces a
requeue, so that the job doesn't start on a failed node.
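
The check described above can be sketched roughly as follows. This is
only an illustration of the decision logic, not the code from this
commit: the `SK_*` constants, the `sk_update` enum and
`sk_should_requeue()` are hypothetical names and values that do not
correspond to Slurm's real node state definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node state encoding, for illustration only.
 * The low bits hold a base state; higher bits hold flags. */
enum {
    SK_BASE_IDLE       = 0x0,
    SK_BASE_ALLOC      = 0x1,
    SK_BASE_MIXED      = 0x2,
    SK_BASE_MASK       = 0xf,
    SK_FLAG_POWER_SAVE = 0x10,
    SK_FLAG_POWER_UP   = 0x20
};

/* Hypothetical kinds of node update request. */
enum sk_update { SK_UPDATE_RESUME, SK_UPDATE_DRAIN, SK_UPDATE_FAIL };

/* Return true when a node update request should force a requeue of the
 * jobs allocated to the node: the request is DRAIN or FAIL, the node
 * already has allocated resources (ALLOC or MIXED), and the node is
 * under Power Save control (POWER_SAVE or POWER_UP flag set). */
bool sk_should_requeue(uint32_t node_state, enum sk_update req)
{
    bool drain_or_fail = (req == SK_UPDATE_DRAIN) || (req == SK_UPDATE_FAIL);
    uint32_t base = node_state & SK_BASE_MASK;
    bool has_job = (base == SK_BASE_ALLOC) || (base == SK_BASE_MIXED);
    bool power_managed =
        (node_state & (SK_FLAG_POWER_SAVE | SK_FLAG_POWER_UP)) != 0;

    return drain_or_fail && has_job && power_managed;
}
```

In this sketch, an ALLOC node with the POWER_UP flag that receives a
DRAIN request yields a requeue, while the same request on a node
without either power flag does not.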

Bug 4689.
