Fix for bug reported by Jim Garlick:
"srun output overflow ("Need to rewind" in srun/_do_output_line)"

When srun's stdout consumes data slowly, srun can be notified that the job has terminated before the output stream has been fully written. The IO thread then receives a SIGHUP to kick it out of its blocking poll. In the slow-stdout case, however, the SIGHUP can interrupt the fflush. When the fflush is interrupted, it appears to clear the stream buffer even though the data was never written to the file descriptor, so data is lost on stdout.

To avoid this, the IO thread is now woken by a write to a pipe rather than by a signal. Some extra return code checking is also done in io.c:_do_output_line().
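
The approach described above can be sketched roughly as follows. This is a minimal illustration of a wake-up pipe combined with a return-code-checked write loop, not the code from this commit: the names `struct io_wake`, `io_wake_init`, `io_wake_notify`, `io_wait`, and `write_all` are hypothetical, and the real srun IO thread and `_do_output_line()` may be structured quite differently.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical wake-up channel; names are illustrative, not from srun. */
struct io_wake {
    int rfd; /* read end, polled by the IO thread */
    int wfd; /* write end, used by other threads to wake it */
};

static int io_wake_init(struct io_wake *w)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;
    /* Non-blocking, so neither side can stall on the wake-up pipe. */
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0 ||
        fcntl(fds[1], F_SETFL, O_NONBLOCK) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    w->rfd = fds[0];
    w->wfd = fds[1];
    return 0;
}

/* Called by another thread in place of sending a signal. */
static int io_wake_notify(const struct io_wake *w)
{
    char c = 0;
    ssize_t n;

    do {
        n = write(w->wfd, &c, 1);
    } while (n < 0 && errno == EINTR);
    /* EAGAIN means the pipe is full, so a wake-up is already pending. */
    if (n < 0 && errno != EAGAIN)
        return -1;
    return 0;
}

/* Block until data_fd is readable, a wake-up arrives, or the timeout
 * expires. Returns 1 if woken, 0 otherwise, -1 on error. */
static int io_wait(const struct io_wake *w, int data_fd, int timeout_ms)
{
    struct pollfd pfd[2];
    char buf[64];
    int rc;

    pfd[0].fd = w->rfd;
    pfd[0].events = POLLIN;
    pfd[0].revents = 0;
    pfd[1].fd = data_fd; /* a negative fd is ignored by poll() */
    pfd[1].events = POLLIN;
    pfd[1].revents = 0;

    do {
        rc = poll(pfd, 2, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0)
        return -1;
    if (pfd[0].revents & POLLIN) {
        /* Drain all pending wake-up bytes so the next poll blocks. */
        while (read(w->rfd, buf, sizeof(buf)) > 0)
            ;
        return 1;
    }
    return 0;
}

/* Write the whole buffer, checking every return code. An interrupted
 * write is retried from the current offset, so no data is dropped. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1; /* caller decides how to handle other errors */
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Because the wake-up arrives as readable data on a descriptor rather than as a signal, it can only be observed while the thread is in poll(), so it cannot interrupt a flush that is in progress.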