- 07 Aug, 2018 14 commits
-
-
Brian Christiansen authored
from_tres_rec was being added to dest_tres_list and being free'd twice when free'ing from_tres_list and dest_tres_list. Continuation of a4a91719
-
Danny Auble authored
Bug 5260
-
Tim Wickberg authored
-
Morris Jette authored
Only split nodes here if a node_features plugin is in use. Otherwise node fragmentation will occur if the node config has CPUs specified but not CoresPerSocket and Sockets. This could be avoided by filling out the node definition, but this workaround is added for backwards compatibility. Bug 5039.
-
Tim Wickberg authored
-
Marshall Garey authored
Task prologs could set or modify this, so wait to create the directory until after they've finished. Bug 5367.
-
Alejandro Sanchez authored
Bug 4373.
-
Alejandro Sanchez authored
Bug 4373.
-
Alejandro Sanchez authored
Which can be used to force retrieval of the information about all the components in a heterogeneous job or just the ones selected, leaving untouched the way regular jobs and/or arrays are retrieved. Bug 4373.
-
Alejandro Sanchez authored
cluster_name is now passed as an argument, since it'll be needed for a subsequent commit subquery. No functional change. Bug 4373.
-
Alejandro Sanchez authored
No functional change. Bug 4373.
-
Alejandro Sanchez authored
No functional change. Bug 4373.
-
Marshall Garey authored
Bug 5481.
-
Dominik Bartkiewicz authored
for an allocation. Bug 5293
-
- 06 Aug, 2018 10 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
when validating memory limits Continuation of b502d179
-
Tim Wickberg authored
After changes to slurm_send_only_node_msg(), this message is much more likely to appear on systems with overloaded interconnects since that connection handling code may end up retransmitting messages that were actually received (but that the transmit side could not verify were delivered successfully). As the error() message stated, this isn't actually an error, and the code will proceed happily past this point. So drop the debug level, and remove the surrealist "this is not an error" part. Bug 5164.
-
Tim Wickberg authored
There are subtle issues involved in treating a TCP transmission as a unidirectional message delivery layer. The original code path looks like: connect(), write(), close(). But Linux handles the write() and close() asynchronously behind the scenes, and does not block until that write() has been ACK'd by the remote end. So the write() and close() may succeed, even with data still in flight. A communication error - and message loss - would have been silently ignored, leading to unreliable message transmission. Worse yet, one side of the connection would believe it sent the message, while the receive side swears it never saw the packets. This leads to infrequent and yet seemingly impossible data loss, and a very tough bug to chase down. This teardown code tries to force the connection to shut down in an orderly manner, giving Slurm a chance to catch a connection problem and the upstream calling path an opportunity to retransmit. This teardown code is based on an approach described in Section 7.5 of "UNIX Network Programming" Volume 1 (Third Edition), specifically the subsection regarding SO_LINGER. (That subsection also covers why SO_LINGER is not sufficient to prevent this issue.) Bug 5164.
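The general idea of an orderly teardown can be sketched as follows. This is a minimal illustration of the textbook shutdown()-then-read() pattern, not the code from this commit: the helper name send_then_drain() and its timeout parameter are hypothetical. After writing, the sender half-closes the socket and waits for the peer's end-of-file, so a connection failure surfaces as an error the caller can act on instead of being lost in a bare close().

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's implementation): write a buffer,
 * half-close the socket, then wait for the peer to close its side.
 * Returns 0 once the peer's EOF is seen, -1 on error or timeout
 * (errno is set to ETIMEDOUT in the timeout case).
 * The caller still owns fd and must close() it afterwards.
 */
int send_then_drain(int fd, const void *buf, size_t len, int timeout_ms)
{
	const char *p = buf;

	/* Write the whole buffer, retrying on short writes and EINTR. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}

	/* Send FIN: we are done writing, but can still read. */
	if (shutdown(fd, SHUT_WR) < 0)
		return -1;

	/* Wait for the peer's EOF, discarding any unexpected data. */
	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		char scratch[64];
		ssize_t n;
		int rc = poll(&pfd, 1, timeout_ms);

		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (rc == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		n = read(fd, scratch, sizeof(scratch));
		if (n == 0)
			return 0;	/* peer closed its side */
		if (n < 0 && errno != EINTR)
			return -1;
	}
}
```

A caller that gets -1 back knows delivery could not be confirmed and can choose to retransmit. Note that the peer's EOF only shows that the remote end closed its side; it confirms application-level receipt only if the peer reads to EOF before closing.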
-
Tim Wickberg authored
Bug 5164.
-
Tim Wickberg authored
Do not use slurm_send_only_node_msg(). There is no way to tell if the srun has received the message before the socket is shut down if we do not wait to receive data. Use slurm_send_recv_rc_msg_only_one() instead, and send back a response from the other side. We still need the older (and unreliable) behavior when talking to older srun client commands, so make this change dependent on the protocol_version field in the message. Bug 5164.
-
Tim Wickberg authored
Bug 5164.
-
Tim Wickberg authored
-
Marshall Garey authored
Previously only a single task of a job array could preempt during backfill scheduling. This allows multiple tasks to preempt and have resources reserved in backfill. Bug 5405.
-
Brian Christiansen authored
This appears to be an oversight of 865338c7 where the cont_id check was changed from NO_VAL64 to INFINITE64. The cont_id is initialized to NO_VAL64 in src/common/slurm_jobacct_gather.c.
-
- 04 Aug, 2018 2 commits
-
-
Tim Wickberg authored
-
Jason Booth authored
The getopt format string needs to handle an option here, and the --help output had not been corrected after 99b2c4e8. Bug 5522.
-
- 03 Aug, 2018 5 commits
-
-
Marshall Garey authored
Bug 5527.
-
Tim Wickberg authored
-
Jason Booth authored
for displaying unformatted timelimit numbers. Bug 5407
-
Brian Christiansen authored
Bug 5523
-
Brian Christiansen authored
Bug 5523
-
- 02 Aug, 2018 9 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec and slurm.spec-legacy as well.
-
Brian Christiansen authored
CID 187301
-
Brian Christiansen authored
CID 187302
-
Danny Auble authored
Bug 5308
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-