    cray job requeue bug · fec5e03b
    Morris Jette authored
    Fix Cray node health check (NHC) spawning on job requeue. Previously,
    nodes allocated to a requeued job were left unusable after the job
    terminated.
    
    Specifically, each job has a "cleaning/cleaned" flag. Once a job
    terminates, the cleaning flag is set, then after the job node health
    check completes, the value gets set to cleaned. If the job is requeued,
    on its second (or subsequent) termination, the select/cray plugin
    is called to launch the NHC. The plugin sees the "cleaned" flag
    already set, so it logs:
    error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
    and returns, never launching the NHC. Since the termination of the
    job NHC triggers releasing job resources (CPUs, memory, and GRES),
    those resources are never released for use by other jobs.
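
The failure sequence described above can be reproduced in a small model. This is only a sketch: the types, function names, and the `reset_flag_on_start` switch are hypothetical stand-ins and are not taken from Slurm or the select/cray plugin. Clearing the flag when a requeued job starts again is one plausible remedy, assumed here for illustration; the commit message does not say how the actual patch resolves the problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the per-job "cleaning/cleaned" flag. */
enum clean_state { CLEAN_NONE, CLEAN_CLEANING, CLEAN_CLEANED };

struct job {
    unsigned id;
    enum clean_state clean;
    bool resources_held;   /* CPUs, memory, GRES still allocated */
    int nhc_runs;          /* how many times the NHC was launched */
};

/* Job starts (first run or after a requeue) and takes its resources.
 * reset_flag_on_start is an assumed remedy, not the documented fix. */
static void job_start(struct job *j, bool reset_flag_on_start)
{
    assert(j != NULL);
    j->resources_held = true;
    if (reset_flag_on_start)
        j->clean = CLEAN_NONE;
}

/* NHC finished: mark the job cleaned and release its resources. */
static void nhc_complete(struct job *j)
{
    assert(j != NULL);
    j->clean = CLEAN_CLEANED;
    j->resources_held = false;
}

/* Job termination hook. Returns true if the NHC was launched. */
static bool job_fini(struct job *j)
{
    assert(j != NULL);
    if (j->clean == CLEAN_CLEANED) {
        /* Stale flag from an earlier run: the NHC is skipped, so
         * nothing ever releases the resources. */
        fprintf(stderr, "error: cleaned flag already set for job %u\n",
                j->id);
        return false;
    }
    j->clean = CLEAN_CLEANING;
    j->nhc_runs++;
    nhc_complete(j);   /* the model runs the NHC synchronously */
    return true;
}

/* Run a job, requeue it, run it again. Returns true if the resources
 * are still held after the second termination (the leak). */
static bool resources_leak_after_requeue(bool reset_flag_on_start)
{
    struct job j = { .id = 1, .clean = CLEAN_NONE,
                     .resources_held = false, .nhc_runs = 0 };

    job_start(&j, reset_flag_on_start);
    job_fini(&j);                        /* first termination */
    job_start(&j, reset_flag_on_start);  /* requeue and restart */
    job_fini(&j);                        /* second termination */
    return j.resources_held;
}
```

In this model, `resources_leak_after_requeue(false)` reports the leak because the second termination finds the stale "cleaned" value and skips the NHC, while `resources_leak_after_requeue(true)` launches the NHC both times and releases the resources.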
    
    Bug 2384