Commit 536c8451 authored by Morris Jette

cray job requeue bug

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. When a job
terminates, the flag is set to "cleaning"; once the job's node health
check (NHC) completes, it is set to "cleaned". If the job is requeued,
then on its second (or subsequent) termination the select/cray plugin
is called to launch the NHC again. The plugin sees the "cleaned" flag
already set, logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.
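
The failure mode described above can be sketched as a small state
machine. This is an illustrative model only, not the select/cray plugin
code: the type and function names (clean_state, job_fini, nhc_complete,
job_requeue, run_requeued_job) are hypothetical, and resetting the flag
on requeue is just one plausible remedy, assumed here for illustration
rather than taken from this commit's diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-job "cleaning/cleaned" flag. */
enum clean_state { CLEAN_NONE, CLEAN_CLEANING, CLEAN_CLEANED };

struct job {
    unsigned id;
    enum clean_state clean;
    bool resources_held;   /* CPUs, memory and GRES still allocated */
};

/* Called on job termination. Returns true if the NHC was launched. */
static bool job_fini(struct job *job)
{
    assert(job != NULL);
    if (job->clean == CLEAN_CLEANED) {
        /* Stale flag from an earlier run: the NHC is never launched. */
        fprintf(stderr, "error: cleaned flag already set for job %u\n",
                job->id);
        return false;
    }
    job->clean = CLEAN_CLEANING;
    return true;
}

/* NHC completion is what releases the job's resources. */
static void nhc_complete(struct job *job)
{
    assert(job != NULL);
    job->clean = CLEAN_CLEANED;
    job->resources_held = false;
}

/* Requeue re-allocates resources; reset_flag models the assumed remedy. */
static void job_requeue(struct job *job, bool reset_flag)
{
    assert(job != NULL);
    job->resources_held = true;
    if (reset_flag)
        job->clean = CLEAN_NONE;
}

/* Run a job, requeue it once, run it again.
 * Returns true if its resources are released after the second run. */
static bool run_requeued_job(bool reset_flag)
{
    struct job job = { 1283858, CLEAN_NONE, true };

    if (job_fini(&job))            /* first termination */
        nhc_complete(&job);
    job_requeue(&job, reset_flag);
    if (job_fini(&job))            /* second termination */
        nhc_complete(&job);
    return !job.resources_held;
}
```

In this model, run_requeued_job(false) leaves the resources held after
the second termination, which mirrors the reported symptom, while
run_requeued_job(true) releases them.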

Bug 2384
parent e050806e