    Correctly check return codes when creating a step to determine whether to
    wait before retrying · 10af7fbe
    Morris Jette authored
    
    I discovered this bug during regression testing. Some similar situations will
    result in srun continuously issuing step create requests and the
    launch_common_create_job_step() function not sleeping between RPCs.
    Basically launch_common_create_job_step() sleeps for some error codes
    and srun retries the step create on some error codes. The problem is
    that those error codes do not match in both places, resulting in
    constant retries without sleeps. This situation is very likely with
    job preemption combined with salloc, but other conditions can trigger
    the same event. The following errno values all trigger this situation:
    EAGAIN, ESLURM_DISABLED, ESLURM_POWER_NOT_AVAIL, ESLURM_POWER_RESERVED,
    ESLURM_PROLOG_RUNNING, ESLURM_INTERCONNECT_BUSY.
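
One way to keep the two checks from drifting apart is to have both the retry decision and the sleep decision consult a single predicate. The sketch below is illustrative only and is not the code changed by this commit: every `demo_`/`DEMO_` name is hypothetical, the enum values are placeholders rather than Slurm's real error numbers, and the sleep is simulated with a counter.

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder stand-ins for the Slurm error codes listed above.
 * These numeric values are invented for the sketch. */
enum {
    DEMO_ESLURM_DISABLED = 10001,
    DEMO_ESLURM_POWER_NOT_AVAIL,
    DEMO_ESLURM_POWER_RESERVED,
    DEMO_ESLURM_PROLOG_RUNNING,
    DEMO_ESLURM_INTERCONNECT_BUSY,
    DEMO_FATAL = 10099 /* any non-retryable error */
};

/* Single source of truth: should a failed step create be retried
 * after a wait? */
static bool demo_step_retryable(int rc)
{
    switch (rc) {
    case EAGAIN:
    case DEMO_ESLURM_DISABLED:
    case DEMO_ESLURM_POWER_NOT_AVAIL:
    case DEMO_ESLURM_POWER_RESERVED:
    case DEMO_ESLURM_PROLOG_RUNNING:
    case DEMO_ESLURM_INTERCONNECT_BUSY:
        return true;
    default:
        return false;
    }
}

/* Hypothetical retry loop. rcs[] simulates the return codes of
 * successive step create attempts. Returns 0 on success, otherwise the
 * last error seen. *sleeps counts the waits that would have been taken. */
static int demo_create_step(const int *rcs, int n, int *sleeps)
{
    int rc = 0;

    *sleeps = 0;
    for (int i = 0; i < n; i++) {
        rc = rcs[i];
        if (rc == 0)
            return 0;               /* step created */
        if (!demo_step_retryable(rc))
            return rc;              /* fatal: do not retry */
        (*sleeps)++;                /* real code would sleep here */
    }
    return rc;                      /* out of attempts */
}
```

Because the loop retries only when `demo_step_retryable()` is true and counts a wait on that same branch, every retry is preceded by a wait, which is the invariant the mismatch described above violated.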
    
    Bug 4786