Commit 06863788 authored by Tim Wickberg

Rework slurmstepd authentication.

slurmstepd exclusively accepts API connections through a unix socket.
Before this patch, the client end (usually slurmd, but pam_slurm_adopt and
scontrol both can use this) retrieved an auth cred via MUNGE and serialized
it over the socket, after which the slurmstepd had to send that credential
back to MUNGE for verification.

However, the only info used from that cred is the uid from the client side
of the socket. That info can be retrieved via SO_PEERCRED (on Linux) - this
is what MUNGE uses to authenticate its own credentials. And the client uid
is only checked in half of the API calls since the info exposed is not
considered sensitive.
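
As a rough illustration of the SO_PEERCRED lookup described above, here is a
minimal Linux-only sketch. The helper name peer_uid() is hypothetical and is
not taken from the Slurm source; it only shows the getsockopt() call involved.

```c
#define _GNU_SOURCE   /* exposes struct ucred in <sys/socket.h> on glibc */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a Slurm function): look up the uid of the
 * process on the other end of a connected unix socket. Linux-only.
 * Returns 0 on success, -1 on failure.
 */
int peer_uid(int fd, uid_t *uid)
{
	struct ucred cred;
	socklen_t len = sizeof(cred);

	assert(uid != NULL);
	/* The kernel fills in pid/uid/gid of the peer; no daemon round trip. */
	if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
		return -1;
	if (len != sizeof(cred))
		return -1;
	*uid = cred.uid;
	return 0;
}
```

Because the kernel records these values itself, the client cannot forge them
by writing different data to the socket. A server would typically call such a
helper on the fd returned by accept().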

So, rather than having every slurmd -> slurmstepd call involve the sequence:

    slurmd -> MUNGE for cred (authenticated using SO_PEERCRED internally)
    slurmd -> slurmstepd over socket
    slurmstepd -> MUNGE to validate credential

the exchange can be simplified to:

    slurmd -> slurmstepd over socket (auth using SO_PEERCRED directly)

This simplified call path removes two socket connections, plus the overhead
from MUNGE's cryptographic operations, from the exchange. While performance
is not critical for slurmd -> slurmstepd communication, this also improves
performance for other system utilities such as pam_slurm_adopt (which needs
to connect to half of the extern stepds on the node on average), or a future
nss_slurm module which is expected to place an even higher load on this API.

The one caveat here is that the API was not built in a way that makes this
restructuring easy. The slurmstepd protocol version, which may be one or two
releases behind that of the slurmd, was only sent back to the slurmd _after_
the auth cred had been received and validated. So, to handle backwards
compatibility, we change over to sending the SLURM_PROTOCOL_VERSION instead
of SOCKET_CONNECT as the first int over the socket. If the slurmstepd
returns an error - since this value is not equal to SOCKET_CONNECT (zero)
as was required in older versions - we allow that connection to close, and
try to reconnect using the older RPC format instead. That fallback code
should be removed two versions after 19.05 is released.
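
The fallback described above might be modelled roughly as follows. This is
only a sketch of the decision logic, not the actual patch: the stepd is
reduced to a callback that accepts or rejects the first int, and
DEMO_PROTOCOL_VERSION, negotiate(), and the two model stepds are hypothetical
names with placeholder values. Only SOCKET_CONNECT being zero comes from the
text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SOCKET_CONNECT 0              /* zero, as stated in the message */
#define DEMO_PROTOCOL_VERSION 0x2200  /* placeholder, not a real value */

enum handshake_mode {
	HANDSHAKE_PEERCRED,  /* new format: version first, SO_PEERCRED auth */
	HANDSHAKE_LEGACY,    /* old format: SOCKET_CONNECT, then MUNGE cred */
	HANDSHAKE_FAILED
};

/* Stand-in for one connection attempt: nonzero if the stepd accepts. */
typedef int (*stepd_accept_fn)(int32_t first_int);

/* Model of an older stepd: the first int must equal SOCKET_CONNECT. */
int old_stepd_accepts(int32_t first_int)
{
	return first_int == SOCKET_CONNECT;
}

/* Model of a patched stepd (assumption: it expects a nonzero version). */
int new_stepd_accepts(int32_t first_int)
{
	return first_int != SOCKET_CONNECT;
}

enum handshake_mode negotiate(stepd_accept_fn stepd)
{
	assert(stepd != NULL);
	/* First attempt: send the protocol version as the first int. */
	if (stepd(DEMO_PROTOCOL_VERSION))
		return HANDSHAKE_PEERCRED;
	/* Rejected: let that connection close, reconnect in the old format. */
	if (stepd(SOCKET_CONNECT))
		return HANDSHAKE_LEGACY;
	return HANDSHAKE_FAILED;
}
```

In this model an older stepd costs one extra connection attempt per call,
which is the price of not knowing the stepd's version up front.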
parent 78ea3e01