slurmstepd: correct a bug in the IO thread termination monitoring
At the end of a step, slurmstepd launches a dedicated thread (_kill_thr) to destroy the IO thread if it has not terminated by itself after 300 seconds. This patch corrects two bugs in that logic.

First, the sleep(300) call is not protected against interruptions. If slurmstepd receives a signal, sleep() can return early, which can cut the delay to a few seconds and force the IO thread to terminate before the grace time expires. The logic now loops around sleep() to ensure that the full delay is respected.

Second, the IO thread is terminated by delivering SIGKILL to it with pthread_kill(). However, sending SIGKILL with pthread_kill() is a process-wide operation (see man pthread_kill), so all the slurmstepd threads are killed and slurmstepd itself is terminated. The logic now uses pthread_cancel() instead of pthread_kill(), thus letting ...
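The two fixes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual slurmstepd code: the names sleep_full, io_thread_stub, cancel_after_grace and demo_cancel_io_thread are hypothetical, and the stub thread merely stands in for an IO thread that never exits on its own.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Sleep for the whole duration even if signals interrupt sleep().
 * sleep() returns the number of seconds left when a signal handler
 * interrupts it, and 0 once the requested time has elapsed. */
static void sleep_full(unsigned int secs)
{
	while (secs > 0)
		secs = sleep(secs);
}

/* Hypothetical stand-in for the IO thread: it never exits on its own.
 * sleep() is a cancellation point, so pthread_cancel() can stop it. */
static void *io_thread_stub(void *arg)
{
	(void) arg;
	for (;;)
		sleep(1);
	return NULL;
}

/* Wait out the grace period, then cancel only the target thread.
 * Returns true if the thread ended because of the cancellation. */
static bool cancel_after_grace(pthread_t tid, unsigned int grace_secs)
{
	void *res = NULL;

	sleep_full(grace_secs);
	if (pthread_cancel(tid) != 0)
		return false;
	if (pthread_join(tid, &res) != 0)
		return false;
	return res == PTHREAD_CANCELED;
}

/* Demo driver (hypothetical): start the stub, cancel it after grace_secs. */
static bool demo_cancel_io_thread(unsigned int grace_secs)
{
	pthread_t tid;
	int rc = pthread_create(&tid, NULL, io_thread_stub, NULL);

	assert(rc == 0);
	return cancel_after_grace(tid, grace_secs);
}
```

Calling demo_cancel_io_thread(1) from a program built with -pthread should return true while the calling process keeps running, since pthread_cancel() acts on a single thread. Replacing the cancellation with pthread_kill(tid, SIGKILL) would instead terminate the whole process.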