Commit 12772a3a authored by Moe Jette

select/cray: release ALPS reservation on termination signals

On rosa we experienced severe problems when jobs got killed via scancel or
as a result of job timeout. Job cleanup took several minutes and created stray
processes that consumed resources on the slurmd node, leaving the system
unable to schedule for long periods.

This problem did not show up on the smaller 2-cabinet XE system (which also
runs a more recent ALPS version). The fix for the problem is to keep new
script lines from starting by sending apkill only after formally releasing
the reservation.

For all signals whose default disposition is to terminate or to dump core,
the reservation is released before signalling the aprun job steps. This
prevents a race condition where further aprun lines get executed while the
apkill of the current aprun line in the job script is in progress.
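
The ordering described above can be sketched roughly as follows. This is only
an illustration, not the code from this commit: the function names and the two
callbacks (release_resv, signal_steps) are hypothetical stand-ins for the ALPS
reservation release and the apkill step, and the signal list covers the common
POSIX signals whose default action is to terminate or dump core, without
claiming to be exhaustive.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Return true if the POSIX default action for signo is to terminate the
 * process, with or without a core dump. Not exhaustive (e.g. real-time
 * signals are not listed). */
bool default_action_is_fatal(int signo)
{
	switch (signo) {
	/* default action: terminate */
	case SIGHUP:  case SIGINT:  case SIGKILL: case SIGPIPE:
	case SIGALRM: case SIGTERM: case SIGUSR1: case SIGUSR2:
	case SIGVTALRM: case SIGPROF:
	/* default action: terminate and dump core */
	case SIGQUIT: case SIGILL:  case SIGABRT: case SIGFPE:
	case SIGSEGV: case SIGBUS:  case SIGSYS:  case SIGTRAP:
	case SIGXCPU: case SIGXFSZ:
		return true;
	default:
		/* e.g. SIGCHLD, SIGCONT, SIGSTOP, SIGTSTP, SIGURG */
		return false;
	}
}

/* Hypothetical wrapper: for fatal signals, release the reservation first so
 * that no further aprun line can start, then signal the running job steps.
 * Both callbacks return 0 on success. The steps are signalled even if the
 * release fails (an assumption of this sketch); the first non-zero return
 * code is reported. */
int signal_job(int signo, int (*release_resv)(void), int (*signal_steps)(int))
{
	int rc = 0;

	assert(release_resv && signal_steps);

	if (default_action_is_fatal(signo))
		rc = release_resv();

	int rc2 = signal_steps(signo);
	return rc ? rc : rc2;
}
```

Signals such as SIGCONT or SIGSTOP fall through to the default branch, so in
this sketch they reach the job steps without touching the reservation.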

We did a before/after test on rosa under full load and the problem disappeared.
parent 44bec602