Commit 163d9547 authored by Hongjia Cao, committed by jette

Prevent srun abort on task launch failure

On job step launch failure, the function
"slurm_step_launch_wait_finish()" is called twice in launch/slurm,
which causes srun to abort:

srun: error: Task launch for 22495.0 failed on node cn6: Job credential
expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
cn5
cn4
cn7
srun: error: Timed out waiting for job step to complete
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
Aborted (core dumped)

The attached patch (version 2.5.1) fixes the abort. However, the messages
"
Job step aborted: Waiting up to 2 seconds for job step to finish.
Timed out waiting for job step to complete
"
will still be printed twice.
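
For illustration only, here is a minimal sketch of one way to make a
"wait finish" routine safe to call twice. It is not the patch itself and
does not use Slurm's real data structures: struct step_ctx,
step_ctx_create(), step_wait_finish(), step_ctx_destroy() and
task_is_done() are all hypothetical names. The sketch assumes the abort
happens because the first call releases a per-task bitmap that the second
call then tests, which matches the bit_test assertion in the log above
but is not confirmed by this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a step launch context; not Slurm's struct. */
struct step_ctx {
    unsigned char *tasks_done; /* per-task flags, released by the first wait */
    bool wait_finished;        /* set once the wait has completed */
    int wait_calls;            /* how many times the real cleanup ran */
};

/* Mirrors the assertion in the log: the bitmap must not be NULL. */
static bool task_is_done(const struct step_ctx *ctx, size_t task)
{
    assert(ctx->tasks_done != NULL);
    return ctx->tasks_done[task] != 0;
}

static struct step_ctx *step_ctx_create(size_t ntasks)
{
    struct step_ctx *ctx = calloc(1, sizeof(*ctx));
    if (ctx == NULL)
        return NULL;
    ctx->tasks_done = calloc(ntasks, 1);
    if (ctx->tasks_done == NULL) {
        free(ctx);
        return NULL;
    }
    return ctx;
}

/* Idempotent wait: the second and later calls return immediately
 * instead of touching the already-released bitmap. */
static void step_wait_finish(struct step_ctx *ctx)
{
    if (ctx == NULL || ctx->wait_finished)
        return;

    (void) task_is_done(ctx, 0); /* safe: bitmap still allocated here */

    free(ctx->tasks_done);
    ctx->tasks_done = NULL;
    ctx->wait_finished = true;
    ctx->wait_calls++;
}

static void step_ctx_destroy(struct step_ctx *ctx)
{
    if (ctx == NULL)
        return;
    free(ctx->tasks_done); /* free(NULL) is a no-op */
    free(ctx);
}
```

With a guard like this, a caller that reaches the wait routine from both
the failure path and the normal teardown path performs the cleanup once.
Without the wait_finished check, the second call would hit the assertion
in task_is_done().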
parent dd8c22c7