Alejandro Sanchez authored
Since commit b31fa177, slurmd no longer defers node registration when HealthCheckProgram fails. At startup, slurmd executes:

    run_script_health_check();
    _spawn_registration_engine();

and does not keep spinning if NHC fails.

For nodes managed by the Power Save logic, a node is requested to POWER_UP because a job has been allocated resources on it, and at slurmd startup NHC is executed before the node registers. The problem arises when this NHC execution fails and the NHC program decides to update the node to DRAIN. Because the job was allocated before this update, it will still attempt to start RUNNING, but it might fail since NHC detected that something is wrong with the node.

This change detects DRAIN/FAIL node update requests, checks whether the node is ALLOC/MIXED and POWER_[SAVE|UP], and if so forces a requeue so that the job does not start on a failed node.

Bug 4689.
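The condition described above can be sketched as a small predicate. This is only an illustration of the decision logic, not the code from this commit: the `sk_` names, the state enum, and the flag values are hypothetical and do not match Slurm's real node state encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state encoding for this sketch only; not Slurm's real values. */
enum sk_base_state { SK_IDLE, SK_ALLOCATED, SK_MIXED, SK_DOWN };

#define SK_FLAG_DRAIN      0x01u
#define SK_FLAG_FAIL       0x02u
#define SK_FLAG_POWER_SAVE 0x04u
#define SK_FLAG_POWER_UP   0x08u

struct sk_node {
    enum sk_base_state base; /* base state, e.g. ALLOC or MIXED */
    uint32_t flags;          /* state flags currently set on the node */
};

/*
 * Return true when a node update request should force a requeue of the
 * jobs allocated to the node:
 *   1. the request sets DRAIN or FAIL,
 *   2. the node already has a job allocated (ALLOC or MIXED), and
 *   3. the node is under Power Save control (POWER_SAVE or POWER_UP).
 */
bool sk_should_requeue(const struct sk_node *node, uint32_t requested_flags)
{
    bool drain_or_fail =
        (requested_flags & (SK_FLAG_DRAIN | SK_FLAG_FAIL)) != 0;
    bool has_job =
        node->base == SK_ALLOCATED || node->base == SK_MIXED;
    bool power_managed =
        (node->flags & (SK_FLAG_POWER_SAVE | SK_FLAG_POWER_UP)) != 0;

    return drain_or_fail && has_job && power_managed;
}
```

In this sketch all three conditions must hold, so a DRAIN request against an allocated node that is not being powered up by Power Save is handled as an ordinary drain and does not trigger a requeue.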
14596246