Commit 98b203d4 authored by Morris Jette's avatar Morris Jette

Problem using salloc/mpirun with task affinity socket binding

salloc/mpirun does not play well with task affinity socket binding. The following example illustrates the problem.

[sulu] (slurm) mnp> salloc -p bones-only -N1-1 -n3 --cpu_bind=socket mpirun cat /proc/self/status | grep Cpus_allowed_list
salloc: Granted job allocation 387
--------------------------------------------------------------------------
An invalid physical processor id was returned ...

The problem is that with mpirun jobs, Slurm launches only a single task, regardless of the value of -n. This confuses the socket binding logic in task/affinity: the task is bound to only a single CPU instead of all the allocated CPUs on the socket. When MPI then attempts to bind to any of the other allocated CPUs on the socket, it gets the "invalid physical processor id" error.

Note that the problem may occur even if socket binding is not explicitly requested by the user. If task/affinity is configured and the allocated CPUs are a whole number of sockets, Slurm will use "implicit auto binding" to sockets, triggering the problem.
Patch from Martin Perry (Bull).
parent 7e181113