- 11 Aug, 2018 1 commit
-
-
Brian Christiansen authored
and other drain + node state flags. Bug 5505
-
- 10 Aug, 2018 2 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec and slurm.spec-legacy as well.
-
- 09 Aug, 2018 3 commits
-
-
Brian Christiansen authored
Bug 5505
-
Tim Wickberg authored
Bug 5164. Without this patch the slurmctld would send a finishing message to srun which may already be torn down. This prevents printing a benign error message.
-
Alejandro Sanchez authored
Bug 3844.
-
- 08 Aug, 2018 7 commits
-
-
Marshall Garey authored
Bug 5526
-
Marshall Garey authored
Bug 5526
-
Marshall Garey authored
Bug 5171
-
Marshall Garey authored
so that the tests won't fail due to trying to set time limits greater than MaxTime. Bug 5171
-
Marshall Garey authored
Bug 5171
-
Marshall Garey authored
in order to be able to set a partition's MaxTime. Bug 5171
-
Marshall Garey authored
To set a partition's MaxTime. Bug 5171
-
- 07 Aug, 2018 3 commits
-
-
Alejandro Sanchez authored
Bug 5528.
-
Morris Jette authored
Only split nodes here if a node_features plugin is in use. Otherwise node fragmentation will occur if the node config has CPUs specified but not CoresPerSocket and Sockets. This could be avoided by filling out the node definition, but this workaround is added for backwards compatibility. Bug 5039.
-
Marshall Garey authored
Task prologs could set or modify this, so wait to create the directory until after they've finished. Bug 5367.
-
- 06 Aug, 2018 3 commits
-
-
Tim Wickberg authored
After changes to slurm_send_only_node_msg(), this message is much more likely to appear on systems with overloaded interconnects since that connection handling code may end up retransmitting messages that were actually received (but that the transmit side could not verify were delivered successfully). As the error() message stated, this isn't actually an error, and the code will proceed happily past this point. So drop the debug level, and remove the surrealist "this is not an error" part. Bug 5164.
-
Tim Wickberg authored
There are subtle issues involved in treating a TCP transmission as a unidirectional message delivery layer. The original code path looks like: connect(), write(), close(). But Linux handles the write() and close() asynchronously behind the scenes, and does not block until that write() has been ACK'd by the remote end. So the write() and close() may succeed, even with data still in flight. A communication error - and message loss - would have been silently ignored, leading to unreliable message transmission. Worse yet, one side of the connection would believe it sent the message, while the receive side swears it never saw the packets. This leads to infrequent and yet seemingly impossible data loss, and a very tough bug to chase down. This teardown code tries to force the connection to shut down in an orderly manner, giving Slurm a chance to catch a connection problem and the upstream calling path an opportunity to retransmit. This teardown code is based on an approach described in Section 7.5 of "UNIX Network Programming" Volume 1 (Third Edition), specifically the subsection regarding SO_LINGER. (That section also covers why SO_LINGER is not sufficient to prevent this issue.) Bug 5164.
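The orderly teardown described in this commit message can be sketched roughly as follows. This is a minimal illustration of the general shutdown-then-drain pattern, not Slurm's actual code: the function name `send_and_confirm` and its `timeout_ms` parameter are hypothetical, and it assumes the peer reads until EOF and then closes its side of the connection.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from Slurm): send a buffer on a connected
 * stream socket, then wait for the peer to close its side.
 * Returns 0 once the peer's EOF is seen, -1 on error or timeout.
 * The caller still owns fd and must close() it afterwards.
 */
int send_and_confirm(int fd, const void *buf, size_t len, int timeout_ms)
{
	const char *p = buf;
	char scratch[256];

	/* Write the whole buffer, retrying on short writes and EINTR. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}

	/* Half-close: signal end of data, but keep the read side open. */
	if (shutdown(fd, SHUT_WR) < 0)
		return -1;

	/* Drain until read() returns 0, meaning the peer closed its side. */
	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		int rc = poll(&pfd, 1, timeout_ms);
		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (rc == 0)
			return -1;	/* timed out: delivery not confirmed */

		ssize_t n = read(fd, scratch, sizeof(scratch));
		if (n == 0)
			return 0;	/* EOF from the peer */
		if (n < 0 && errno != EINTR)
			return -1;
		/* n > 0: discard unexpected data and keep waiting for EOF */
	}
}
```

A return value of -1 gives the caller the chance to retransmit, which a plain write() followed by close() cannot offer because a failure after close() is never reported.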
-
Tim Wickberg authored
Bug 5164.
-
- 04 Aug, 2018 1 commit
-
-
Jason Booth authored
The getopt format string needs to handle an option here, and the --help output had not been corrected after 99b2c4e8. Bug 5522.
-
- 31 Jul, 2018 1 commit
-
-
Morris Jette authored
Bug 5070
-
- 27 Jul, 2018 2 commits
-
-
Danny Auble authored
Bug 5468. This is a backport of commit cefc9ec1.
-
Dominik Bartkiewicz authored
scheduling cycle. Primarily caused by EnforcePartLimits=ALL. Bug 5452
-
- 24 Jul, 2018 1 commit
-
-
Brian Christiansen authored
-
- 19 Jul, 2018 7 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec and slurm.spec-legacy as well.
-
Tim Wickberg authored
-
Tim Wickberg authored
The lower limit of 1024 may be too small for srun with large-scale jobs, and lead to problems processing task completion messages in a timely fashion. Rather than adjust that, unify the two separate macros into SLURM_DEFAULT_LISTEN_BACKLOG with the higher 4096 value. Bug 5164.
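As a rough illustration of where such a backlog value takes effect, the sketch below passes a single shared macro to listen(). Only the macro name and the 4096 value come from the commit message; the helper `open_listener` and its loopback binding are hypothetical and not Slurm's code. Note that on Linux the kernel may silently cap the requested backlog at the net.core.somaxconn sysctl.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Name and value as given in the commit message. */
#define SLURM_DEFAULT_LISTEN_BACKLOG 4096

/*
 * Hypothetical helper: open a TCP listening socket on the loopback
 * address. A port of 0 lets the kernel choose a free port.
 * Returns the listening fd, or -1 on failure.
 */
int open_listener(unsigned short port)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(port);

	/* The backlog bounds the queue of not-yet-accepted connections. */
	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
	    listen(fd, SLURM_DEFAULT_LISTEN_BACKLOG) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```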
-
Tim Wickberg authored
Without Delegate=yes, systemd will "fix" the cgroup hierarchies whenever 'systemctl daemon-reload' is called, which will then remove any restrictions placed on memory or device access for a given job. This is a problem especially since 'systemctl daemon-reload' may be called automatically by rpm/yum or a variety of config file managers, leading to jobs escaping from slurmd/slurmstepd's control. This setting should work for systemd versions >= 205. https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ Bug 5292.
-
Tim Wickberg authored
-
Felip Moll authored
When a job with time_end=0 and TRES null exists from an association that is currently inside a reservation, the hourly rollup segfaults. Bug 5143
-
- 18 Jul, 2018 5 commits
-
-
Dominik Bartkiewicz authored
As reported by Avalon Johnson on slurm-users https://groups.google.com/forum/#!topic/slurm-users/BsMQ7Uk1PLw Bug 5287.
-
Brian Christiansen authored
-
Brian Christiansen authored
srun was already fixed in b7053bda (Bug 3294). Bug 5126
-
Brian Christiansen authored
-
Broderick Gardner authored
'have_innodb' is deprecated. Bug 5317.
-
- 17 Jul, 2018 4 commits
-
-
Felip Moll authored
When printing arrays in squeue with the SLURM_BITSTR_LEN variable set to 0 or to NULL, the length of the output defaulted to 64, although the documentation says it will default to "unlimited". This patch makes the behavior match the documentation. Bug 5440
-
Marshall Garey authored
Logic was switched around in 17.11; enable_user_top is now the correct option. Bug 5165.
-
Alejandro Sanchez authored
This is not working reliably even when setting SchedulerParameters=enable_hetero_steps and/or using OpenMPI with Slurm's mpi/pmi2, as it was previously documented. Bug 5309.
-
Marshall Garey authored
Both the documentation and the code indicate that the node lock is needed, but these were incorrectly set as the job locks. Bug 5394.
-