Commit 232ab305, authored by Hongjia Cao, committed by Morris Jette

Add SLURM_STEP_KILLED_MSG_NODE_ID environment variable

When a job launched directly with srun ends abnormally, each node
produces a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ...
WITH SIGNAL 9 ***), a task-exit message (srun: error: task[0-1]:
Terminated), or both. For large-scale jobs these messages become
tedious and bury the other error messages. The two attached patches
(for slurm-2.5.1) introduce two environment variables to control the
output of such messages:

SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
print the step-killed message.

SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
exit messages with the same exit code will be printed only once.
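
The intended effect of the two variables can be illustrated with a
minimal Python sketch. This is not the patch itself (the patches are
against the Slurm C sources); the function names, the (message,
exit_code) event shape, and the fallback for unparsable values are all
assumptions made for illustration, and "successive" is read here as
"immediately following one another".

```python
import os


def should_print_step_killed(node_id, env=None):
    """Decide whether this node prints the step-killed message.

    Hypothetical helper: node_id is this node's integer ID within the job.
    """
    env = os.environ if env is None else env
    wanted = env.get("SLURM_STEP_KILLED_MSG_NODE_ID")
    if wanted is None:
        return True  # variable unset: every node prints
    try:
        return int(wanted) == node_id
    except ValueError:
        return True  # assumption: an unparsable value disables filtering


def reduce_task_exit_msgs(events, env=None):
    """Return the task-exit messages that would be printed.

    events is a list of (message, exit_code) pairs in arrival order
    (a hypothetical shape). When SLURM_SRUN_REDUCE_TASK_EXIT_MSG is set
    and non-zero, a message is dropped if its exit code equals that of
    the message immediately before it.
    """
    env = os.environ if env is None else env
    try:
        reduce = int(env.get("SLURM_SRUN_REDUCE_TASK_EXIT_MSG", "0")) != 0
    except ValueError:
        reduce = False  # assumption: an unparsable value means "off"

    printed = []
    last_code = None
    for message, code in events:
        if reduce and code == last_code:
            continue  # same exit code as the previous message: skip it
        printed.append(message)
        last_code = code
    return printed
```

Passing an explicit env dictionary, as in
reduce_task_exit_msgs(events, env={"SLURM_SRUN_REDUCE_TASK_EXIT_MSG": "1"}),
lets the behaviour be tried without changing the real environment.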

Parent: fef33d8d