Fix MPIR_partial_attach_ok issues for parallel debuggers.
As specified in the MPIR debug interface (https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf), the presence of the MPIR_partial_attach_ok symbol informs the debugger that the initial startup synchronization is implemented in such a way that the tool need neither attach to nor continue MPI processes that the user is not interested in controlling. To implement this, SLURM chose to send SIGCONT to those processes that are not attached by the debugger.

However, the old code does not reliably detect whether a process is being traced by the debugger, and this has led to various side effects. On some systems (e.g., TOSS2), the old code sends SIGCONT to all of the target processes, including those attached by the debugger. On newer systems (e.g., TOSS3), it does not send SIGCONT to the target processes at all.

It seems that one of the reasons for this inconsistent behavior is the use of CLONE_PTRACE. @grondo found no documentation indicating that CLONE_PTRACE marks a process as being attached by a debugger. More importantly, the old code compares clone(2) flags against the process flags reported in proc(5). These are not the same thing: the reported value is task->flags, which holds the PF_* flags defined in the kernel source file include/linux/sched.h.

This patch fixes these problems by replacing the old detection logic with a check based on the TracerPid field in /proc/<pid>/status. From proc(5), TracerPid: PID of process tracing this process (0 if not being traced).
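The TracerPid check described above can be sketched roughly as follows. This is an illustration only, not the code in this patch: the helper names parse_tracer_pid_line and tracer_pid_of are hypothetical, and the sketch assumes a Linux /proc filesystem whose status file contains a line of the form "TracerPid:<tab><pid>".

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: extract the tracer PID from one line of
 * /proc/<pid>/status. Returns the PID (0 means "not traced"), or -1
 * if the line is not a TracerPid line.
 */
static long parse_tracer_pid_line(const char *line)
{
	long tracer;

	/* The space in the format matches the tab that follows the colon. */
	if (sscanf(line, "TracerPid: %ld", &tracer) == 1)
		return tracer;
	return -1;
}

/*
 * Hypothetical helper: return the PID of the process tracing `pid`,
 * 0 if it is not being traced, or -1 if the status file could not be
 * read or has no TracerPid field.
 */
static long tracer_pid_of(pid_t pid)
{
	char path[64];
	char line[256];
	long tracer = -1;
	FILE *fp;

	snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;

	/* Scan line by line until the TracerPid field is found. */
	while (fgets(line, sizeof(line), fp) != NULL) {
		tracer = parse_tracer_pid_line(line);
		if (tracer >= 0)
			break;
	}
	fclose(fp);
	return tracer;
}
```

Under these assumptions, a launcher would send SIGCONT to a task only when tracer_pid_of() returns 0, and would leave the task alone when the result is positive (a debugger is attached). How to treat -1 (status unreadable) is a policy choice that this sketch does not settle.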