Fix a bug with -w in step management that returned inappropriate memory errors to srun
When a particular nodelist is requested for a step and at least one of the nodes is still in use by a former step (no REQUEST_STEP_COMPLETE received from that node), the current behavior is to return ESLURM_INVALID_TASK_MEMORY, and srun aborts with "Memory required by task is not available". This can be reproduced by launching consecutive steps with the -w parameter set to $SLURM_NODELIST and introducing delays in the spank epilog on the execution nodes. The behavior is changed to only defer the execution of the step, returning ESLURM_NODES_BUSY when some nodes are detected as blocked because their memory is already in use.
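The intended decision can be illustrated with a minimal sketch. This is not the actual slurmctld code: the enum values, struct node_mem, and check_step_memory() are hypothetical stand-ins chosen for illustration, and the real step allocation logic tracks considerably more state. The sketch only shows the distinction between a request that can never fit (hard error) and one that is blocked by memory a former step still holds (defer).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Slurm error codes named above;
 * the numeric values are placeholders, not the real ones. */
enum step_rc {
    STEP_OK = 0,
    STEP_INVALID_TASK_MEMORY, /* stands in for ESLURM_INVALID_TASK_MEMORY */
    STEP_NODES_BUSY           /* stands in for ESLURM_NODES_BUSY */
};

/* Hypothetical per-node view of the job's memory on a requested node. */
struct node_mem {
    uint64_t allocated_mb; /* memory granted to the job on this node */
    uint64_t in_use_mb;    /* memory still held by steps not yet completed */
};

/* Decide how to answer a step request that needs need_mb on every node. */
enum step_rc check_step_memory(const struct node_mem *nodes, size_t n,
                               uint64_t need_mb)
{
    bool busy = false;

    for (size_t i = 0; i < n; i++) {
        /* The request exceeds the job's allocation: it can never run. */
        if (need_mb > nodes[i].allocated_mb)
            return STEP_INVALID_TASK_MEMORY;

        /* Guard against underflow if accounting is inconsistent. */
        uint64_t free_mb = nodes[i].in_use_mb > nodes[i].allocated_mb
                               ? 0
                               : nodes[i].allocated_mb - nodes[i].in_use_mb;

        /* The request would fit once the former step releases its
         * memory, so the node is only temporarily blocked. */
        if (need_mb > free_mb)
            busy = true;
    }

    return busy ? STEP_NODES_BUSY : STEP_OK;
}
```

With this shape, a step that lands on a node whose former step has not yet reported completion is deferred rather than rejected, and the hard error is reserved for requests that exceed the allocation itself.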