Prevent srun abort on task launch failure
On job step launch failure, the function "slurm_step_launch_wait_finish()" will be called twice in launch/slurm, which causes srun to be aborted: srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired srun: error: Application launch failed: Job credential expired srun: Job step aborted: Waiting up to 2 seconds for job step to finish. cn5 cn4 cn7 srun: error: Timed out waiting for job step to complete srun: Job step aborted: Waiting up to 2 seconds for job step to finish. srun: error: Timed out waiting for job step to complete srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed. Aborted (core dumped) The attached patch(version 2.5.1) fixes it. But the message of " Job step aborted: Waiting up to 2 seconds for job step to finish. Timed out waiting for job step to complete " will still be printed twice.
Please register or sign in to comment