- 03 Dec, 2018 1 commit
-
-
Tim Wickberg authored
slurmstepd exclusively accepts API connections through a unix socket. Before this patch, the client end (usually slurmd, but pam_slurm_adopt and scontrol both can use this) retrieves an auth cred via MUNGE and serializes it over the socket, after which the slurmstepd must send that credential back to MUNGE for verification.

However, the only info used from that cred is the uid from the client side of the socket. That info can be retrieved via SO_PEERCRED (on Linux) - this is what MUNGE uses to authenticate its own credentials. And the client uid is only checked in half of the API calls since the info exposed is not considered sensitive.

So, rather than have every slurmd -> slurmstepd call involve a sequence of:

slurmd -> MUNGE for cred (authenticated using SO_PEERCRED internally)
slurmd -> slurmstepd over socket
slurmstepd -> MUNGE to validate credential

This can be simplified to:

slurmd -> slurmstepd over socket (auth using SO_PEERCRED directly)

This simplified call path removes two socket connections, plus the overhead from MUNGE's cryptographic operations, from the exchange. While performance is not critical for slurmd -> slurmstepd communication, this also improves performance for other system utilities such as pam_slurm_adopt (which needs to connect to half of the extern stepds on the node on average), or a future nss_slurm module which is expected to place an even higher load on this API.

The one caveat here is that the API was not built in a way that makes this restructuring easy. The slurmstepd protocol version, which may be one or two releases behind that of the slurmd, was only sent back to the slurmd _after_ the auth cred had been received and validated.

So, to handle backwards compatibility, we change over to sending the SLURM_PROTOCOL_VERSION instead of SOCKET_CONNECT as the first int over the socket. If the slurmstepd returns an error - since this value is not equal to SOCKET_CONNECT (zero) as was required in older versions - we allow that connection to close, and try to reconnect using the older RPC format instead. That fallback code should be removed two versions after 19.05 is released.
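The SO_PEERCRED lookup described above can be sketched as follows. This is a minimal, Linux-only illustration and not the code from this patch: the function names peer_uid() and peer_uid_matches_self() are hypothetical, and the self-check uses socketpair() rather than a real slurmd -> slurmstepd connection.

```c
#define _GNU_SOURCE /* glibc only exposes struct ucred with this defined */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel which uid owns the process at the
 * other end of a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. SO_PEERCRED is Linux-specific.
 */
int peer_uid(int fd, uid_t *uid)
{
	struct ucred cred;
	socklen_t len = sizeof(cred);

	assert(uid != NULL);

	if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
		return -1;
	if (len != sizeof(cred))
		return -1;

	*uid = cred.uid;
	return 0;
}

/*
 * Hypothetical self-check: both ends of a socketpair belong to this
 * process, so the peer uid should equal our own effective uid.
 * Returns 1 on match, 0 on mismatch, -1 on error.
 */
int peer_uid_matches_self(void)
{
	int sv[2];
	uid_t uid = (uid_t) -1;
	int rc;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	rc = peer_uid(sv[0], &uid);

	close(sv[0]);
	close(sv[1]);

	if (rc < 0)
		return -1;
	return (uid == geteuid()) ? 1 : 0;
}
```

Because the kernel fills in the credentials itself, the client cannot forge them, which is why no separate credential needs to travel over the socket.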
-
- 02 Dec, 2018 2 commits
-
-
Tim Wickberg authored
Use __func__, and list the function name first in the message. Drop one redundant message printing the request number - all paths through the switch statement will print this out in some form. Remove a ternary used to print SLURM_SUCCESS/SLURM_FAILURE and print the raw return value. If you're staring at debug3 logs, you should hopefully know how to interpret these values. :)
-
Tim Wickberg authored
Not used, so don't bother retrieving it from the cred in _handle_accept. Also, switch a printf format to %u instead of %d for uid_t.
-
- 29 Nov, 2018 5 commits
-
-
Morris Jette authored
-
Dominik Bartkiewicz authored
Bug 6121
-
Morris Jette authored
-
Nate Rini authored
Bug 6008
-
Morris Jette authored
Bug 6078
-
- 28 Nov, 2018 14 commits
-
-
Morris Jette authored
This change permits setting the type_ arrays in the job gres data structure without setting up the topo_ arrays. Bug 6078
-
Morris Jette authored
-
Alejandro Sanchez authored
Bug 6077
-
Danny Auble authored
node system. Bug 6037
-
Marshall Garey authored
Bug 6037
-
Morris Jette authored
The test was failing if the GPU count exceeded the available CPU count
-
Morris Jette authored
The test was failing due to bad logic in the case where the craynetwork GRES count was > 1 and the node count was 1
-
Artem Y. Polyakov authored
Bug 5983
-
Morris Jette authored
Rename "has_file" to "config_flags" and independently set the flags GRES_CONF_HAS_FILE and GRES_CONF_HAS_TYPE as appropriate. Previously, defining a GRES "Type" would set "File" to a default value of "/dev/null". Bug 6078
-
Artem Y. Polyakov authored
On error code paths (such as a collective timeout), a callback provided by PMIx may be called twice, leading to a segmentation fault. This commit fixes that by properly accounting for callback invocations. Bug 5983
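One generic way to guard against a double callback invocation is an "at most once" wrapper, sketched below. This is an illustrative pattern only and not the accounting used in this commit: done_cb_t, struct once_cb, once_cb_fire() and count_cb() are all hypothetical names, the callback signature is not the PMIx one, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion callback type (not the PMIx signature). */
typedef void (*done_cb_t)(int status, void *arg);

/* Hypothetical wrapper that remembers whether the callback already ran. */
struct once_cb {
	done_cb_t fn;
	void *arg;
	bool fired;
};

/*
 * Invoke the wrapped callback at most once.
 * Returns true if the callback ran on this call, false otherwise.
 * Not thread-safe: concurrent callers would need a lock or an atomic flag.
 */
bool once_cb_fire(struct once_cb *cb, int status)
{
	assert(cb != NULL);

	if (cb->fired || !cb->fn)
		return false;

	cb->fired = true; /* set first so a re-entrant call is also a no-op */
	cb->fn(status, cb->arg);
	return true;
}

/* Example callback: counts how many times it was invoked. */
void count_cb(int status, void *arg)
{
	(void) status;
	(*(int *) arg)++;
}
```

With this guard, a normal completion followed by a late timeout path would run the callback only for whichever path fires first.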
-
Morris Jette authored
Permit independent copy of topo and type data structures. Bug 6078
-
Morris Jette authored
-
Morris Jette authored
A temporary GRES counter was being stored in a variable of type "int". This changes the variable to type "uint64_t" without changing logic.
-
Morris Jette authored
No change in logic, just add initialization to eliminate some Coverity warnings.
-
- 27 Nov, 2018 18 commits
-
-
Tim Wickberg authored
-
Danny Auble authored
This reverts commit b0515009.
-
Danny Auble authored
Bug 5935
-
Boris Karasev authored
This could have caused core dumps if communication failed for one reason or another. Signed-off-by: Boris Karasev <karasev.b@gmail.com> Bug 5935
-
Artem Y. Polyakov authored
On error code paths (such as a collective timeout), a callback provided by PMIx may be called twice, leading to a segmentation fault. This commit fixes that by properly accounting for callback invocations.
-
Morris Jette authored
This patch does 2 things: 1. When a step fails on some nodes, mark it as complete on those nodes. This is needed so that when the step ends on the other nodes, slurmctld recognizes the step as completely done. 2. If the step does not have the --no-kill option set, then when some node fails, send a request to terminate the step on ALL of its nodes. Bug 5805
-
Danny Auble authored
Bug 5977
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
Removed in 19.05 release; remove the documentation in 18.08 to discourage further use. Bug 6075.
-
Tim Wickberg authored
-
Tim Wickberg authored
Bug 6075.
-
Tim Wickberg authored
Bug 6075.
-
Nate Rini authored
when env is overwritten by the command line. Bug 5977
-
Danny Auble authored
-
Danny Auble authored
Bug 6016
-
Broderick Gardner authored
Bug 6092
-