Commit 10af7fbe authored by Morris Jette, committed by Danny Auble

Correctly check return codes when creating a step to decide whether to wait before retrying.

I discovered this bug during regression testing. Some similar situations will
result in srun continuously issuing step create requests and the
launch_common_create_job_step() function not sleeping between RPCs.
Basically launch_common_create_job_step() sleeps for some error codes
and srun retries the step create on some error codes. The problem is
that those error codes do not match in both places, resulting in
constant retries without sleeps. This situation is very likely with
job preemption combined with salloc, but other conditions can trigger
the same event. The following errno values all trigger this situation:
EAGAIN, ESLURM_DISABLED, ESLURM_POWER_NOT_AVAIL, ESLURM_POWER_RESERVED,
ESLURM_PROLOG_RUNNING, ESLURM_INTERCONNECT_BUSY.
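
As an illustration of the mismatch described above, here is a minimal sketch
of one way to keep the two lists from drifting apart: a single predicate that
both the retry decision and the sleep decision consult. This is not the code
from this commit. The names demo_step_retryable() and demo_retry_delay(), the
DEMO_* constants and their numeric values, and the backoff schedule are all
assumptions made for the example; only EAGAIN comes from <errno.h>.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Slurm error codes listed above.
 * The numeric values are illustrative only, not the real ones. */
enum demo_slurm_errno {
	DEMO_ESLURM_DISABLED = 10001,
	DEMO_ESLURM_POWER_NOT_AVAIL,
	DEMO_ESLURM_POWER_RESERVED,
	DEMO_ESLURM_PROLOG_RUNNING,
	DEMO_ESLURM_INTERCONNECT_BUSY,
	DEMO_ESLURM_FATAL	/* example of a code that is not retried */
};

/* Single source of truth: should a failed step create be retried? */
static bool demo_step_retryable(int rc)
{
	switch (rc) {
	case EAGAIN:
	case DEMO_ESLURM_DISABLED:
	case DEMO_ESLURM_POWER_NOT_AVAIL:
	case DEMO_ESLURM_POWER_RESERVED:
	case DEMO_ESLURM_PROLOG_RUNNING:
	case DEMO_ESLURM_INTERCONNECT_BUSY:
		return true;
	default:
		return false;
	}
}

/* Seconds to sleep before the next attempt. Every retryable code gets
 * a non-zero delay, so a retry can never happen without a sleep.
 * The doubling schedule (capped at 32 seconds) is an assumption. */
static unsigned demo_retry_delay(int rc, unsigned attempt)
{
	if (!demo_step_retryable(rc))
		return 0;	/* caller should give up, not retry */
	return 1u << (attempt < 5 ? attempt : 5);
}
```

With this shape, a caller that retries only when demo_retry_delay() returns
a non-zero value cannot loop on an error code that the sleeping side does not
know about.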

Bug 4786
parent 9dc6cbe1