Commit f297242e authored by Matthieu Hautreux, committed by Morris Jette

slurmstepd : correct a bug in the IO thread termination monitoring

At the end of a step, slurmstepd launches a dedicated thread (_kill_thr)
that destroys the IO thread if it has not terminated by itself within
300 seconds.

Two bugs are corrected in this logic by this patch.

First, the sleep(300) call is not protected against interruptions. If
slurmstepd receives a signal, sleep() returns early, so the delay can
shrink to a few seconds and the IO thread is forced to terminate before
the grace time has expired. The logic is modified to loop around sleep()
until the full delay has elapsed.

Second, to terminate the IO thread, SIGKILL is sent to it using
pthread_kill(). However, signal dispositions are process-wide (see man
pthread_kill), so SIGKILL kills every slurmstepd thread and slurmstepd
itself is terminated. The patch uses pthread_cancel() instead of
pthread_kill(), giving the pthread_join() in _wait_for_io() a chance to
act as expected.
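
To illustrate the difference, here is a small standalone sketch (not Slurm
code) in which one thread is cancelled and then joined while the rest of
the process keeps running. stuck_io_thread() and cancel_and_join_demo() are
hypothetical names; the stand-in thread simply blocks in sleep(), which is
a cancellation point.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Stand-in for an IO thread that never finishes on its own. */
static void *stuck_io_thread(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);   /* sleep() is a cancellation point */
}

/*
 * Cancel the stuck thread and join it. Unlike pthread_kill(tid, SIGKILL),
 * which would terminate the whole process, only the target thread ends and
 * pthread_join() reports PTHREAD_CANCELED as its exit status.
 * Returns 1 when the thread was cancelled and joined successfully.
 */
static int cancel_and_join_demo(void)
{
    pthread_t tid;
    void *result = NULL;
    int rc;

    rc = pthread_create(&tid, NULL, stuck_io_thread, NULL);
    assert(rc == 0);

    if (pthread_cancel(tid) != 0)
        return 0;
    if (pthread_join(tid, &result) != 0)
        return 0;

    return result == PTHREAD_CANCELED;
}
```

The caller is still alive after cancel_and_join_demo() returns, which is the
property the commit message depends on: the joining code gets to run and
finish its cleanup. Depending on the toolchain, building this may require
the -pthread flag.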

Without this patch, when _kill_thr is interrupted, slurmstepd is
terminated, leaving the step in an incomplete state, as the node may not
have been able to send the REQUEST_STEP_COMPLETE to the controller.
As a result, subsequent steps can no longer be executed and stay
permanently in the "Job step creation temporarily disabled, retrying"
state.
parent ac86cc37