- 12 Jul, 2018 10 commits
-
-
Danny Auble authored
-
Boris Karasev authored
- avoid `abort()` when a collective fails - added logging of collective details for failure cases. Bug 5067
-
Danny Auble authored
Note: this sets things up so we can use defunct functions. It will probably need to be fixed properly in a future version so that we no longer do this.
-
Morris Jette authored
This change is associated with commit 6be109d9
-
Morris Jette authored
gres_per_socket requires a sockets-per-node count specification, and gres_per_task requires a task count specification. These restrictions are required in order for cons_res to support these options in a finite amount of time/code.
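A minimal standalone sketch of the kind of validation this implies; the structure and function names are illustrative, not the actual cons_res code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical subset of a job request; not Slurm's real job_desc_msg_t. */
struct gres_job_request {
	uint64_t gres_per_socket;	/* 0 means "not requested" */
	uint64_t gres_per_task;
	uint32_t sockets_per_node;	/* 0 means "unspecified" */
	uint32_t num_tasks;
};

/* Reject requests that cons_res could only satisfy by searching an
 * unbounded space of socket/task layouts. */
static bool gres_request_is_valid(const struct gres_job_request *req)
{
	if (req->gres_per_socket && !req->sockets_per_node) {
		fprintf(stderr, "gres_per_socket requires a sockets-per-node count\n");
		return false;
	}
	if (req->gres_per_task && !req->num_tasks) {
		fprintf(stderr, "gres_per_task requires a task count\n");
		return false;
	}
	return true;
}

int main(void)
{
	struct gres_job_request req = { .gres_per_socket = 2 };
	printf("valid: %d\n", gres_request_is_valid(&req));	/* prints 0 */
	return 0;
}
```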
-
Dominik Bartkiewicz authored
-
Dominik Bartkiewicz authored
Bug 5098.
-
Morris Jette authored
-
Dominik Bartkiewicz authored
with preemption or when a job requests a specific list of hosts. Bug 5293.
-
Morris Jette authored
-
- 11 Jul, 2018 2 commits
-
-
Morris Jette authored
Coverity CID 186992
-
Morris Jette authored
Coverity CID 186991
-
- 10 Jul, 2018 3 commits
-
-
Morris Jette authored
Pass "first_pass" and "avail_cores to _eval_nodes() so that the usable cores can be better identified by the GRES selection logic. Add new function, _select_cores(), to select specific cores for use Create new data structure with job multi-core spec Permit off-socket cores to be used with enforce_bind Needed so that cores on and off socket can be used. Details will need to be handled in _select_cores()
-
Morris Jette authored
The munge regression test7.16 would fail roughly 0.1% of the time when modifying a bit that munge did not use. This change modifies the test to retry once in that case.
-
Broderick Gardner authored
Bug 5337
-
- 09 Jul, 2018 4 commits
-
-
Danny Auble authored
Coverity 186930
-
Boris Karasev authored
-
Danny Auble authored
-
Morris Jette authored
-
- 07 Jul, 2018 1 commit
-
-
Morris Jette authored
When we need to drop nodes in the selection algorithm, change from dropping nodes with low CPU counts to dropping those with low CPU+GPU counts (for jobs requesting GPUs). Not an ideal algorithm, but much better when using GPUs.
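A minimal sketch of the idea, assuming a simplified node record (not Slurm's node_record_t):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified node record for illustration. */
struct node_rec {
	const char *name;
	uint16_t cpus;
	uint16_t gpus;
};

/* Sort candidates so the least useful nodes come first and get dropped.
 * Ranking by CPU count alone would drop n2 (8 CPUs, 8 GPUs) first and
 * keep the GPU-less n1; ranking by CPU+GPU count drops n1 instead. */
static int cmp_cpu_plus_gpu(const void *a, const void *b)
{
	const struct node_rec *na = a, *nb = b;
	return (na->cpus + na->gpus) - (nb->cpus + nb->gpus);
}

int main(void)
{
	struct node_rec nodes[] = {
		{ "n1", 12, 0 }, { "n2", 8, 8 }, { "n3", 36, 4 },
	};
	qsort(nodes, 3, sizeof(nodes[0]), cmp_cpu_plus_gpu);
	printf("dropped first: %s\n", nodes[0].name);	/* n1 */
	return 0;
}
```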
-
- 06 Jul, 2018 14 commits
-
-
Danny Auble authored
thread Bug 5390
-
Brian Christiansen authored
-
Thea Flowers authored
Bug 5395
-
Morris Jette authored
This logs the GPU configuration from the slurmd perspective. While we don't have tools to load the information directly from the NVIDIA system configuration, I have confirmed where that logic needs to go and the data structure contents.
-
Danny Auble authored
# Conflicts:
#	doc/html/faq.shtml
#	src/slurmctld/job_mgr.c
-
Danny Auble authored
Bug 5390
-
Marshall Garey authored
Continuation of 923c9b37. There is a delay in the cgroup system when moving a PID from one cgroup to another. It is usually short, but if we don't wait for the PID to move before removing cgroup directories the PID previously belonged to, we could leak cgroups. This was previously fixed in the cpuset and devices subsystems. This uses the same logic to fix the freezer subsystem. Bug 5082.
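A rough sketch of the wait-before-remove logic described here, with illustrative paths and polling interval (not the actual proctrack/cgroup code):

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Return true if `pid` is still listed in <cgroup_path>/cgroup.procs. */
static bool pid_in_cgroup(const char *cgroup_path, pid_t pid)
{
	char procs[256];
	snprintf(procs, sizeof(procs), "%s/cgroup.procs", cgroup_path);

	FILE *fp = fopen(procs, "r");
	if (!fp)
		return false;	/* cgroup already gone */

	long p;
	bool found = false;
	while (fscanf(fp, "%ld", &p) == 1) {
		if ((pid_t) p == pid) {
			found = true;
			break;
		}
	}
	fclose(fp);
	return found;
}

/* After moving the PID to another cgroup, wait (bounded) for the kernel
 * to finish the move before rmdir()'ing the old directory; removing the
 * directory too early is what leaked freezer cgroups. */
static int wait_pid_moved(const char *old_cgroup, pid_t pid, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (!pid_in_cgroup(old_cgroup, pid))
			return 0;
		usleep(1000);	/* the move is usually quick */
	}
	return -1;		/* still listed: do not remove the cgroup */
}

int main(void)
{
	/* illustrative path only */
	const char *old = "/sys/fs/cgroup/freezer/slurm/uid_1000/job_42";
	int rc = wait_pid_moved(old, getpid(), 100);
	printf("move %s\n", rc ? "timed out" : "complete");
	return 0;
}
```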
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Marshall Garey authored
The cpuset and devices subsystems have duplicate code to clean up the cgroup and prevent leaking cgroups by moving the process to the root cgroup and waiting for it to be moved. Move this duplicate code to a common function so it can be used later by the freezer subsystem. Bug 5082.
-
Marshall Garey authored
Bug 5227
-
Broderick Gardner authored
Bug 5337
-
Danny Auble authored
-
- 05 Jul, 2018 3 commits
-
-
Danny Auble authored
the database. Bug 5247
-
Morris Jette authored
Previous logic could trigger a KNL node reboot when a job did not request any KNL MCDRAM or NUMA modes as features. For example, `srun -N3 -C "[foo*1&bar*2]" hostname` would trigger a reboot of all KNL nodes even though no KNL-specific features were requested. This bug only exists in v18.08 and was introduced when expanding the KNL node feature specification capabilities.
-
Morris Jette authored
Invoke select_g_job_test() one time with all valid nodes rather than multiple times when adding higher weight nodes. This results in the job allocation always accumulating nodes from lower to higher weights rather than possibly using mostly higher weight nodes. It also streamlines the resource allocation process for most configurations by eliminating some repeated logic as groups of nodes are added for consideration by the select plugin.
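A simplified standalone model of the new behavior: one selection call sees every valid candidate at once and still prefers low-weight nodes, instead of being called repeatedly as each weight tier is added. The structures are illustrative, not Slurm's:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NODE_CNT 6

/* Simplified node table; weights decide preference order. */
struct node { const char *name; uint32_t weight; };

static int cmp_weight(const void *a, const void *b)
{
	const struct node *na = a, *nb = b;
	return (int) na->weight - (int) nb->weight;
}

/* Stand-in for one select_g_job_test() call: given every valid
 * candidate at once, take the `want` lowest-weight nodes. */
static void job_test_once(struct node *cand, int cand_cnt, int want)
{
	qsort(cand, cand_cnt, sizeof(cand[0]), cmp_weight);
	for (int i = 0; i < want && i < cand_cnt; i++)
		printf("allocated %s (weight %u)\n", cand[i].name, cand[i].weight);
}

int main(void)
{
	struct node cand[NODE_CNT] = {
		{ "n1", 10 }, { "n2", 1 }, { "n3", 5 },
		{ "n4", 1 },  { "n5", 5 }, { "n6", 10 },
	};
	/* One call with all valid nodes, rather than repeated calls that
	 * add one weight tier at a time. */
	job_test_once(cand, NODE_CNT, 3);
	return 0;
}
```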
-
- 04 Jul, 2018 3 commits
-
-
Felip Moll authored
Bug 4451
-
Morris Jette authored
So that multiple node changes will be reported on one line rather than one line per node. Otherwise this could lead to performance issues when reloading slurmctld on big systems. Bug 4980
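A minimal sketch of the aggregation idea; Slurm itself would build a ranged hostlist (e.g. node[01-03]) rather than a plain comma-separated string:

```c
#include <stdio.h>
#include <string.h>

/* Collect the names of changed nodes and log them once, instead of
 * emitting one log line per node during a config reload. */
int main(void)
{
	const char *changed[] = { "node01", "node02", "node03" };
	char line[256] = "";

	for (size_t i = 0; i < sizeof(changed) / sizeof(changed[0]); i++) {
		if (i)
			strncat(line, ",", sizeof(line) - strlen(line) - 1);
		strncat(line, changed[i], sizeof(line) - strlen(line) - 1);
	}
	/* one line total, instead of one message per node */
	printf("node configuration changed: %s\n", line);
	return 0;
}
```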
-
Felip Moll authored
Cleaned up code that could have caused performance issues when reading the config and there were nodes with features defined. Bug 4980
-