- 07 Aug, 2018 5 commits
-
-
Danny Auble authored
Bug 5260
-
Morris Jette authored
Only split nodes here if a node_features plugin is in use. Otherwise node fragmentation will occur if the node config has CPUs specified but not CoresPerSocket and Sockets. This could be avoided by filling out the node definition, but this workaround is added for backwards compatibility. Bug 5039.
-
Marshall Garey authored
Task prologs could set or modify this, so wait to create the directory until after they've finished. Bug 5367.
-
Alejandro Sanchez authored
Bug 4373.
-
Marshall Garey authored
Bug 5481.
-
- 06 Aug, 2018 2 commits
-
-
Tim Wickberg authored
There are subtle issues involved in treating a TCP transmission as a unidirectional message delivery layer. The original code path looks like: connect(), write(), close(). But Linux handles the write() and close() asynchronously behind the scenes, and does not block until that write() has been ACK'd by the remote end. So the write() and close() may succeed, even with data still in flight. A communication error, and the resulting message loss, would have been silently ignored, leading to unreliable message transmission. Worse yet, one side of the connection would believe it sent the message, while the receive side swears it never saw the packets. This leads to infrequent and yet seemingly impossible data loss, and a very tough bug to chase down. This teardown code tries to force the connection to shut down in an orderly manner, giving Slurm a chance to catch a connection problem and the upstream calling path an opportunity to retransmit. This teardown code is based on an approach described in Section 7.5 of "UNIX Network Programming" Volume 1 (Third Edition), specifically the subsection regarding SO_LINGER. (That section also covers why SO_LINGER is not sufficient to prevent this issue.) Bug 5164.
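The orderly teardown described above can be sketched roughly as follows. This is a minimal illustration of the general shutdown-then-read pattern, not the code from this commit: the helper name `orderly_close` and its timeout parameter are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm's actual function): shut down the write
 * side, then wait for the peer to close its end. timeout_ms bounds each
 * individual wait, not the total time spent.
 * Returns 0 if the peer's EOF was observed, -1 on error or timeout.
 * The descriptor is always closed before returning.
 */
static int orderly_close(int fd, int timeout_ms)
{
	char buf[256];
	int rc = -1;

	/* A negative timeout would make poll() block forever. */
	assert(timeout_ms >= 0);

	/* Send FIN: tells the peer that no more data is coming. */
	if (shutdown(fd, SHUT_WR) < 0)
		goto done;

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		int n = poll(&pfd, 1, timeout_ms);
		ssize_t got;

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			break;		/* poll error or timeout */

		got = read(fd, buf, sizeof(buf));
		if (got == 0) {		/* EOF: the peer closed its side */
			rc = 0;
			break;
		}
		if (got < 0 && errno != EINTR)
			break;		/* e.g. ECONNRESET */
		/* got > 0: discard unexpected data and keep waiting */
	}
done:
	close(fd);
	return rc;
}
```

A caller would treat a return of -1 as "delivery not confirmed" and leave the decision to retransmit to the upstream calling path.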
-
Marshall Garey authored
Previously only a single task of a job array could preempt during backfill scheduling. This allows multiple tasks to preempt and have resources reserved in backfill. Bug 5405.
-
- 04 Aug, 2018 1 commit
-
-
Jason Booth authored
The getopt format string needs to handle an option here, and the --help output had not been corrected after 99b2c4e8. Bug 5522.
-
- 03 Aug, 2018 1 commit
-
-
Jason Booth authored
for displaying unformatted timelimit numbers. Bug 5407
-
- 02 Aug, 2018 5 commits
-
-
Tim Wickberg authored
-
Thomas Cadeau authored
Bug 5094
-
Brian Christiansen authored
-
Brian Christiansen authored
to dictate what state the node is in after reboot.
-
Brian Christiansen authored
-
- 31 Jul, 2018 1 commit
-
-
Morris Jette authored
Bug 5070
-
- 27 Jul, 2018 3 commits
-
-
Danny Auble authored
Bug 5468. This is a backport of commit cefc9ec1.
-
Felip Moll authored
Bug 4918
-
Dominik Bartkiewicz authored
scheduling cycle. Primarily caused by EnforcePartLimits=ALL. Bug 5452
-
- 24 Jul, 2018 1 commit
-
-
Broderick Gardner authored
Bug 5248.
-
- 19 Jul, 2018 4 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
Without Delegate=yes, systemd will "fix" the cgroup hierarchies whenever 'systemctl daemon-reload' is called, which will then remove any restrictions placed on memory or device access for a given job. This is a problem especially since 'systemctl daemon-reload' may be called automatically by rpm/yum or a variety of config file managers, leading to jobs escaping from slurmd/slurmstepd's control. This setting should work for systemd versions >= 205. https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ Bug 5292.
-
Felip Moll authored
When a job with time_end=0 and TRES null exists from an association that is currently inside a reservation, the hourly rollup segfaults. Bug 5143
-
- 18 Jul, 2018 4 commits
-
-
Dominik Bartkiewicz authored
As reported by Avalon Johnson on slurm-users https://groups.google.com/forum/#!topic/slurm-users/BsMQ7Uk1PLw Bug 5287.
-
Brian Christiansen authored
srun was already fixed in b7053bda (Bug 3294). Bug 5126
-
Brian Christiansen authored
-
Morris Jette authored
Add salloc/sbatch/srun option of --gres-flags=disable-binding to disable filtering of CPUs with respect to generic resource locality. This option is currently required to use more CPUs than are bound to a GRES (e.g. if a GPU is bound to the CPUs on one socket, but resources on more than one socket are required to run the job). This option may permit a job to be allocated resources sooner than otherwise possible, but may result in lower job performance. Bug 5189
-
- 17 Jul, 2018 5 commits
-
-
Felip Moll authored
When printing arrays in squeue and setting the SLURM_BITSTR_LEN variable to 0 or to NULL, the length of the output defaulted to 64, when the documentation says it will default to "unlimited". This patch fixes this situation. Bug 5440
-
Marshall Garey authored
Documented, and code reads as needing, the node lock. But these were incorrectly set as the job locks. Bug 5394.
-
Dominik Bartkiewicz authored
Needs the job write lock, as it may change job status not just node status. Especially after commit 33e352a6. Bug 5406.
-
Felip Moll authored
Previously, backslashes '\' in job->cwd were always expanded regardless of whether they were part of a directory name or not. Bug 4859
-
Felip Moll authored
When special characters like %A, %u, %s and so on are escaped on the command line, problems arise for directories with multiple backslashes in their names. This patch fixes this by removing only one backslash from each pair of backslashes, just as normal escaping works, e.g. in bash. Bug 4859
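The pair-collapsing rule described above can be illustrated with a short sketch. The function name `collapse_backslash_pairs` is hypothetical and this is not the code from the commit. The sketch also assumes that a lone backslash is left untouched, which the commit message does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: rewrite `s` in place so that every pair of
 * consecutive backslashes becomes a single backslash. A lone backslash
 * is copied through unchanged (an assumption made for this sketch).
 */
static void collapse_backslash_pairs(char *s)
{
	char *out = s;

	assert(s != NULL);

	while (*s) {
		if (s[0] == '\\' && s[1] == '\\')
			s++;	/* skip the first backslash of the pair */
		*out++ = *s++;
	}
	*out = '\0';
}
```

Under these assumptions, a buffer holding four consecutive backslashes ends up holding two, and a buffer with a single backslash between two letters is unchanged.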
-
- 13 Jul, 2018 1 commit
-
-
Isaac Hartung authored
Add errno to info message in the SlurmDBD log, and pass the actual errno back to the sacctmgr process so the user can see it. Bug 5152.
-
- 12 Jul, 2018 3 commits
-
-
Boris Karasev authored
Avoid `abort()` when a collective fails, and add logging of collective details for failure cases. Bug 5067
-
Danny Auble authored
Note, this is setting it up so we can use defunct functions. It will probably need to be properly fixed in a future version so we don't do this.
-
Dominik Bartkiewicz authored
with preemption or when job requests a specific list of hosts. Bug 5293.
-
- 09 Jul, 2018 1 commit
-
-
Danny Auble authored
-
- 06 Jul, 2018 1 commit
-
-
Marshall Garey authored
Continuation of 923c9b37. There is a delay in the cgroup system when moving a PID from one cgroup to another. It is usually short, but if we don't wait for the PID to move before removing cgroup directories the PID previously belonged to, we could leak cgroups. This was previously fixed in the cpuset and devices subsystems. This uses the same logic to fix the freezer subsystem. Bug 5082.
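The wait-before-remove idea above can be sketched as a polling loop over the old cgroup's process list. This is only an illustration under stated assumptions: the helper names `pid_listed` and `wait_pid_left`, the retry count, and the 10 ms delay are all hypothetical and are not taken from Slurm's source.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: return 1 if `pid` is listed in the procs file at
 * `path`, 0 if it is not, and -1 if the file cannot be opened. */
static int pid_listed(const char *path, pid_t pid)
{
	FILE *fp;
	long cur;
	int found = 0;

	assert(path != NULL);

	fp = fopen(path, "r");
	if (!fp)
		return -1;
	/* The file is expected to hold one PID per line. */
	while (fscanf(fp, "%ld", &cur) == 1) {
		if (cur == (long) pid) {
			found = 1;
			break;
		}
	}
	fclose(fp);
	return found;
}

/* Hypothetical helper: poll until `pid` no longer appears in `path`.
 * Returns 0 once the PID is gone, -1 if it is still listed after
 * max_tries checks or if the file cannot be read. */
static int wait_pid_left(const char *path, pid_t pid, int max_tries)
{
	/* Assumed delay of 10 ms between checks. */
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

	for (int i = 0; i < max_tries; i++) {
		int rc = pid_listed(path, pid);

		if (rc == 0)
			return 0;
		if (rc < 0)
			return -1;
		nanosleep(&delay, NULL);
	}
	return -1;
}
```

In this sketch, the caller would remove the old cgroup directory only after `wait_pid_left` returns 0 for that cgroup's process list file.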
-
- 05 Jul, 2018 1 commit
-
-
Danny Auble authored
the database. Bug 5247
-
- 04 Jul, 2018 1 commit
-
-
Morris Jette authored
So that multiple node changes will be reported on one line rather than one line per node. Otherwise this could lead to performance issues when reloading slurmctld on big systems. Bug 4980
-