Commit 18e3d6fb authored by Dong Ahn, committed by Danny Auble

Fix MPIR_partial_attach_ok issues for parallel debuggers.

As specified in the MPIR debug interface
(https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf),
the presence of the MPIR_partial_attach_ok symbol
should inform the debugger that the initial startup synchronization
is implemented in such a way that the tool need not attach
nor continue MPI processes that the user is not interested in controlling.

To implement this, SLURM chose to send SIGCONT to those processes that are
not attached by the debugger.

However, the old code does not reliably detect whether
a process is being traced by the debugger, and this
has led to various side effects.

On some systems (e.g., TOSS2), the old code sends SIGCONT to
all of the target processes including those attached by the debugger.
On newer systems (e.g., TOSS3), it does not send SIGCONT
to the target processes at all.

It seems that one of the reasons for such inconsistent behavior
is the use of CLONE_PTRACE.
@grondo found no documentation indicating that
CLONE_PTRACE denotes a process that is attached
by a debugger.
More importantly, the old code compares clone(2) flags
against proc(5) process flags, which are not the same:
the proc(5) flags come from task->flags, which holds the PF_* flags
defined in the kernel source include/linux/sched.h.

This patch fixes these problems by replacing
the old detection logic with a check based on the TracerPid field
in /proc/<pid>/status.

From proc(5), TracerPid: PID of process tracing this process (0 if not
being traced).
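
The TracerPid check described above can be sketched roughly as follows.
This is an illustrative sketch only, not the code from this patch: the
function names parse_tracer_pid, is_traced and continue_if_untraced are
hypothetical, and the sketch assumes a Linux /proc filesystem where
/proc/<pid>/status carries a "TracerPid:" line as documented in proc(5).

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: extract TracerPid from the text of a
 * /proc/<pid>/status file. Returns the tracer's PID, 0 if the
 * process is not traced, or -1 if the field is not found. */
static long parse_tracer_pid(const char *status_text)
{
    const char *key = "TracerPid:";
    const char *p = status_text;

    while (p && *p) {
        if (strncmp(p, key, strlen(key)) == 0) {
            long pid;
            /* %ld skips the tab that separates key and value */
            if (sscanf(p + strlen(key), "%ld", &pid) == 1)
                return pid;
            return -1;
        }
        p = strchr(p, '\n');   /* advance to the next line */
        if (p)
            p++;
    }
    return -1;
}

/* Hypothetical helper: 1 if pid is traced, 0 if not, -1 on error. */
static int is_traced(pid_t pid)
{
    char path[64];
    char buf[4096];
    size_t n;
    long tracer;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    tracer = parse_tracer_pid(buf);
    if (tracer < 0)
        return -1;
    return tracer > 0;
}

/* Hypothetical helper: send SIGCONT only to a process that no
 * debugger has attached to. Returns 0 if the signal was sent or
 * the process is traced (left alone), -1 on error. */
static int continue_if_untraced(pid_t pid)
{
    int traced = is_traced(pid);

    if (traced < 0)
        return -1;
    if (traced)
        return 0;               /* the debugger controls this one */
    return kill(pid, SIGCONT);
}
```

Keeping the parsing separate from the file access makes the field
extraction easy to check against sample status text without needing a
traced process.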
parent 030d9d4b