    Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP]. · 14596246
    Alejandro Sanchez authored
    After commit b31fa177, we do not defer slurmd node registration if
    HealthCheckProgram fails. So at slurmd startup, slurmd executes:
    
    run_script_health_check();
    _spawn_registration_engine();
    
    and no longer keeps spinning if NHC fails. Now, if nodes are managed
    by the Power Save logic and are requested to POWER_UP because a job
    has been allocated resources, NHC is executed at slurmd startup
    before the node registers.
    
    The problem arises when this NHC execution fails. If the NHC program
    decides to update the node to DRAIN, the job was already allocated
    before that update, so it will still attempt to start RUNNING but
    might fail because NHC detected that something is wrong.
    
    This change detects DRAIN/FAIL node update requests, checks whether
    the node is ALLOC/MIXED and POWER_[SAVE|UP], and if so forces a
    requeue, so that the job doesn't start on a failed node.
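
The decision described above can be sketched in C as follows. This is a minimal illustration only, not the code in this commit: the `struct node`, the `enum base_state` values, the `FLAG_*` bit masks, and the `needs_requeue()` helper are all hypothetical stand-ins and do not match Slurm's real node state definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified node model. These are NOT Slurm's definitions. */
enum base_state { BASE_IDLE, BASE_ALLOC, BASE_MIXED, BASE_DOWN };

#define FLAG_DRAIN      (1u << 0)
#define FLAG_FAIL       (1u << 1)
#define FLAG_POWER_SAVE (1u << 2)
#define FLAG_POWER_UP   (1u << 3)

struct node {
    enum base_state base;   /* base state, e.g. ALLOC or MIXED */
    uint32_t flags;         /* state flags currently set on the node */
};

/*
 * Return true when an incoming node update should force a requeue:
 * the update requests DRAIN or FAIL, the node already has a job
 * allocated (ALLOC/MIXED), and the node is power-managed
 * (POWER_SAVE or POWER_UP).
 */
bool needs_requeue(const struct node *n, uint32_t requested_flags)
{
    bool drain_or_fail = (requested_flags & (FLAG_DRAIN | FLAG_FAIL)) != 0;
    bool has_job = (n->base == BASE_ALLOC || n->base == BASE_MIXED);
    bool power_managed = (n->flags & (FLAG_POWER_SAVE | FLAG_POWER_UP)) != 0;

    return drain_or_fail && has_job && power_managed;
}
```

In this sketch, a node that is ALLOC with POWER_UP set and receives a DRAIN request yields `true`, while the same request on an ALLOC node that is not power-managed yields `false`, leaving the existing behavior untouched.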
    
    Bug 4689.