Commits · c2988cef4488e4543770fe303d58edd15d6ccbbf · Manuel G. Marciani / ces_slurm_simulator

11 Jul, 2018 2 commits
- Fix memory leak · c2988cef
  Morris Jette authored Jul 11, 2018
```
Coverity CID 186992
```
  c2988cef
- Remove redundant NULL pointer check · 11f74f4d
  Morris Jette authored Jul 11, 2018
```
Coverity CID 186991
```
  11f74f4d
10 Jul, 2018 3 commits

Morris Jette authored Jul 10, 2018

Pass "first_pass" and "avail_cores" to _eval_nodes() so that
  the usable cores can be better identified by the GRES selection
  logic.
Add new function, _select_cores(), to select specific cores for use
Create new data structure with job multi-core spec
Permit off-socket cores to be used with enforce_bind
  Needed so that cores on and off socket can be used. Details will need
  to be handled in _select_cores()

6f78e048

harden a regression test · 44aba2a7

Morris Jette authored Jul 10, 2018

The munge regression test7.16 would fail roughly 0.1% of the time
when modifying a bit that munge did not use. This change modifies
the test to retry once in that case.

44aba2a7
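
The retry-once behaviour described in this commit can be sketched as follows. This is only an illustration in Python, not the test's actual code. The `run_check` callable and the three outcome labels are hypothetical names introduced here.

```python
from typing import Callable

# Hypothetical outcome labels; the commit does not show the test's real outcomes.
PASS, FAIL, INCONCLUSIVE = "pass", "fail", "inconclusive"


def run_with_one_retry(run_check: Callable[[], str]) -> str:
    """Run a check, retrying exactly once if the first attempt is inconclusive."""
    result = run_check()
    if result == INCONCLUSIVE:
        # For example, the modified bit was one that munge did not use.
        result = run_check()
    return result
```

A definite pass or failure is returned immediately; only the inconclusive case triggers the second attempt.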

Document recent sdiag enhancements · 4103ccfd
Broderick Gardner authored Jul 10, 2018
```
bug 5337
```
4103ccfd

09 Jul, 2018 3 commits
- Whoops, I used pthread_cond_signal instead of slurm_cond_signal. · ff47bcbb
  Danny Auble authored Jul 09, 2018
```
Coverity 186930
```
  ff47bcbb
- mpi/pmix: use `SLURM` prefix of UCX config variables · 8012228f
  Boris Karasev authored Jun 20, 2018
  
  8012228f
- cons_tres: favor using nodes with co-located GPUs+CPUs · 3ca43f11
  Morris Jette authored Jul 09, 2018
  
  3ca43f11
07 Jul, 2018 1 commit

cons_tres: change algorithm to drop low resource nodes · cbed3921

Morris Jette authored Jul 06, 2018

When we need to drop nodes in the selection algorithm, change
from dropping low CPU count nodes to CPU+GPU count (for jobs
requesting GPUs). Not an ideal algorithm, but much better
when using GPUs.

cbed3921
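
A minimal sketch of the ranking change described above, assuming the score is a plain sum of CPU and GPU counts (the commit does not state the exact weighting). The `Node` class and `drop_low_resource_nodes` are hypothetical names, not part of cons_tres.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    cpus: int
    gpus: int


def drop_low_resource_nodes(nodes: List[Node], keep: int,
                            job_wants_gpus: bool) -> List[Node]:
    """Keep the `keep` highest-scoring nodes and drop the rest."""
    def score(node: Node) -> int:
        # Assumption: CPU count alone for CPU-only jobs, CPU+GPU otherwise.
        return node.cpus + node.gpus if job_wants_gpus else node.cpus

    # sorted() is stable, so nodes with equal scores keep their input order.
    ranked = sorted(nodes, key=score, reverse=True)
    return ranked[:keep]
```

With this scoring, a GPU job keeps a node with fewer CPUs but several GPUs in preference to a CPU-only node with slightly more CPUs.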

06 Jul, 2018 14 commits
- Continuation of e5f03971 to get rid of the potentially dangerous detached · 6a702158
  Danny Auble authored Jul 06, 2018
```
thread

Bug 5390
```
  6a702158
- Merge remote-tracking branch 'origin/slurm-17.11' · aa5db52a
  Brian Christiansen authored Jul 06, 2018
  
  aa5db52a
- Add workaround for importing newly installed namespace packages · da2ecda8
  Thea Flowers authored Jun 22, 2018
```
Bug 5395
```
  da2ecda8
- gres/gpu: add logging for config info · a8c63832
  Morris Jette authored Jul 06, 2018
```
This logs the GPU configuration from the slurmd perspective.
While we don't have tools to load the information directly
from the NVIDIA system configuration, I have confirmed where that
logic needs to go and the data structure contents.
```
  a8c63832
- Merge remote-tracking branch 'origin/slurm-17.11' · bffbaf11
  Danny Auble authored Jul 06, 2018
```
# Conflicts:
#	doc/html/faq.shtml
#	src/slurmctld/job_mgr.c
```
  bffbaf11
- Fix potential segfault when closing the mpi/pmi2 plugin. · 4daeedd8
  Danny Auble authored Jul 06, 2018
```
Bug 5390
```
  4daeedd8
- Fix leaking freezer cgroups. · 7f9c4f73
  Marshall Garey authored Jul 06, 2018
```
Continuation of 923c9b37.

There is a delay in the cgroup system when moving a PID from one cgroup
to another. It is usually short, but if we don't wait for the PID to
move before removing cgroup directories the PID previously belonged to,
we could leak cgroups. This was previously fixed in the cpuset and
devices subsystems. This uses the same logic to fix the freezer
subsystem.

Bug 5082.
```
  7f9c4f73
- cons_tres fix max_cpus_per_node logic for overcommit option · cac887a9
  Morris Jette authored Jul 06, 2018
  
  cac887a9
- Fix for possible divide by zero · 514ed12f
  Morris Jette authored Jul 06, 2018
  
  514ed12f
- cons_tres: add partition max_cpus_per_node support · bc6f851b
  Morris Jette authored Jul 06, 2018
  
  bc6f851b
- Combine duplicate code in cgroup fini functions. · 923c9b37
  Marshall Garey authored Jul 06, 2018
```
cpuset and devices subsystems have duplicate code to cleanup the cgroup
and prevent leaking cgroups by moving the process to the root cgroup and
waiting for it to be moved.

Move this duplicate code to a common function so it can be used later by
the freezer subsystem.

Bug 5082.
```
  923c9b37
- Clarify Depth Mean Try Sched in sdiag man page · dd6ca4b0
  Marshall Garey authored Jul 06, 2018
```
Bug 5227
```
  dd6ca4b0
- in sdiag, dump specific agent RPCs and hostlist info · d6bee97d
  Broderick Gardner authored Jul 05, 2018
```
bug 5337
```
  d6bee97d
- Fix test to make sure something happens to deem success. · 2f9a326e
  Danny Auble authored Jul 05, 2018
  
  2f9a326e
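
The wait-before-remove logic described in 7f9c4f73 and 923c9b37 can be sketched roughly as below. This is an illustration only: the function names, the timeout and the polling interval are assumptions, and the real code works on cgroup files rather than callables.

```python
import time
from typing import Callable


def wait_for_pid_to_leave(pid_in_cgroup: Callable[[], bool],
                          timeout_s: float = 1.0,
                          poll_s: float = 0.01) -> bool:
    """Poll until the PID is no longer in the old cgroup, or time out."""
    deadline = time.monotonic() + timeout_s
    while pid_in_cgroup():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


def remove_cgroup(pid_in_cgroup: Callable[[], bool],
                  remove_dir: Callable[[], None]) -> bool:
    """Remove the cgroup directory only after the PID has moved out."""
    if not wait_for_pid_to_leave(pid_in_cgroup):
        # Removing now could fail and leave the cgroup behind.
        return False
    remove_dir()
    return True
```

Keeping the wait in one shared helper mirrors the intent of 923c9b37: each subsystem (cpuset, devices, freezer) can reuse the same logic.
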
05 Jul, 2018 3 commits

Make it so the slurmdbd's pid file gets created before initing · 7e47579f
Danny Auble authored Jul 05, 2018
```
the database.

Bug 5247
```
7e47579f

Fix KNL feature reboot logic · c368ff89

Morris Jette authored Jul 05, 2018

Previous logic could trigger KNL node reboot when job did not
request any KNL MCDRAM or NUMA modes as features. For example:
srun -N3 -C "[foo*1&bar*2]" hostname
would trigger reboot of all KNL nodes even though no KNL-specific
features were requested. This bug only exists in v18.08 and was
introduced when expanding KNL node feature specification capabilities.

c368ff89
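
The intended check can be illustrated with a short sketch: a reboot should only be considered when the constraint names at least one KNL MCDRAM or NUMA mode. The helper names and the simplified constraint parsing are assumptions made for this example; the mode names are those used for Slurm KNL node features.

```python
import re
from typing import Set

# KNL MCDRAM and NUMA mode names used as Slurm node features.
KNL_MCDRAM_MODES = {"cache", "equal", "flat", "hybrid", "split"}
KNL_NUMA_MODES = {"a2a", "hemi", "quad", "snc2", "snc4"}


def requested_features(constraint: str) -> Set[str]:
    """Extract bare feature names from a constraint such as "[foo*1&bar*2]"."""
    # Drop the node counts that follow "*", then collect the remaining names.
    without_counts = re.sub(r"\*\d+", "", constraint)
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_counts))


def requests_knl_mode(constraint: str) -> bool:
    """True only if the job asked for at least one KNL MCDRAM or NUMA mode."""
    knl_modes = KNL_MCDRAM_MODES | KNL_NUMA_MODES
    return bool(requested_features(constraint) & knl_modes)
```

For the constraint in the commit message, `"[foo*1&bar*2]"`, the requested features are `foo` and `bar`, so no KNL mode is requested and no reboot should be triggered.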

cons_tres: refactor node weight management · 5e1a3d59

Morris Jette authored Jul 05, 2018

Invoke select_g_job_test() one time with all valid nodes rather than
multiple times when adding higher weight nodes. This results in
the job allocation always accumulating nodes from lower to higher
weights rather than possibly using mostly higher weight nodes.
It also streamlines the resource allocation process for most
configurations by eliminating some repeated logic as groups
of nodes are added for consideration by the select plugin.

5e1a3d59
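
The effect of passing all valid nodes at once can be sketched as a single pass over the nodes in weight order. This is a simplified illustration, not the select plugin's logic: the tuple layout, the function name and the CPU-only accounting are assumptions.

```python
from typing import List, Tuple


def pick_nodes_by_weight(nodes: List[Tuple[str, int, int]],
                         cpus_needed: int) -> List[str]:
    """Accumulate nodes from lowest to highest weight until the request is met.

    Each node is a (name, weight, cpus) tuple. Returns an empty list if the
    request cannot be satisfied.
    """
    chosen: List[str] = []
    remaining = cpus_needed
    for name, _weight, cpus in sorted(nodes, key=lambda n: n[1]):
        if remaining <= 0:
            break
        chosen.append(name)
        remaining -= cpus
    return chosen if remaining <= 0 else []
```

Because every candidate is visible in one call, higher-weight nodes are added only after the lower-weight ones are exhausted.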

04 Jul, 2018 6 commits
- Add some corrections to FAQ and remove Slurm 1.3 string · 0985c8b1
  Felip Moll authored Jul 04, 2018
```
bug 4451
```
  0985c8b1
- Combine the active and available node feature change logs · 3818159e
  Morris Jette authored Jul 04, 2018
```
So that multiple node changes will be reported on one line rather than one
line per node. Otherwise this could lead to performance issues when reloading
slurmctld in big systems.

Bug 4980
```
  3818159e
- Fix read slurm.conf performance issues · 23e815c6
  Felip Moll authored Jul 04, 2018
```
Cleaned up code that could've caused performance issues when reading the config
and there were nodes with features defined.

bug 4980
```
  23e815c6
- Harden tests, running daemons under valgrind caused problems · 8783dfb7
  Morris Jette authored Jul 03, 2018
  
  8783dfb7
- Fix memory leak · c1651e74
  Morris Jette authored Jul 03, 2018
  
  c1651e74
- Remove redundant xmalloc, fix memory leak · 9c088906
  Morris Jette authored Jul 03, 2018
  
  9c088906
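
The log consolidation described in 3818159e (one line per set of changed nodes rather than one line per node) can be sketched as a grouping step. The message wording and the function name are hypothetical, and a real implementation would likely print a compact host list expression instead of a comma-separated list.

```python
from collections import defaultdict
from typing import Dict, List


def feature_change_lines(changes: Dict[str, str]) -> List[str]:
    """Group nodes by their new feature string: one log line per group."""
    by_features = defaultdict(list)
    for node, features in changes.items():
        by_features[features].append(node)
    return [
        f"nodes {','.join(sorted(nodes))} features set to {features}"
        for features, nodes in sorted(by_features.items())
    ]
```

With thousands of nodes sharing a handful of feature sets, this reduces the output from one line per node to one line per feature set.
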
03 Jul, 2018 8 commits

Fix how node sched_weight is calculated · 5df15339
Morris Jette authored Jul 03, 2018
```
fix for commit 4a0a6a94
bug 4821
```
5df15339

Remove vestigial debug logging from commit 4a0a6a94 · 786b47bf
Morris Jette authored Jul 03, 2018
```
bug 4821
```
786b47bf

Add "sched_weight" logic · 4a0a6a94

Morris Jette authored Jul 03, 2018

Add 64-bit sched_weight (scheduling weight) to node_set struct
Populate it with the node's weight (possibly reboot weight)
plus high-order bits for FLEX-reservation and rebooting.
No longer are node weights of INFINITE or (INFINITE-1) used
to flag FLEX or reboot requirements so we don't need to
worry about overlapping node weight values. It also will
cleanly allow the cons_tres plugin to be passed ALL usable
nodes at one time to accumulate resources from the lowest
weight first and only use individual higher weight nodes
as needed (rather than possibly using mostly higher
weight nodes).
bug 4821

4a0a6a94
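
The idea of keeping the node weight in the low bits and the FLEX and reboot flags in high-order bits can be sketched as below. The bit positions, their relative priority and the 32-bit weight range are assumptions for illustration; the commit only says "high-order bits".

```python
# Hypothetical bit positions; the commit does not specify them.
REBOOT_BIT = 1 << 62  # assumed: node must be rebooted before use
FLEX_BIT = 1 << 63    # assumed: FLEX-reservation flag
MASK_64 = (1 << 64) - 1


def sched_weight(node_weight: int, needs_reboot: bool, flex: bool) -> int:
    """Pack a node weight plus flag bits into one 64-bit sort key."""
    if not 0 <= node_weight < (1 << 32):
        raise ValueError("node weight is assumed to fit in 32 bits")
    value = node_weight
    if needs_reboot:
        value |= REBOOT_BIT
    if flex:
        value |= FLEX_BIT
    return value & MASK_64
```

Since the flags sit above every possible weight value, any flagged node sorts after every unflagged node, and no weight value has to be reserved as a marker.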

add infrastructure to better manage node weights · 1a77652e
Felip Moll authored Jul 03, 2018
```
breaks out node sets by in/out flex reservation and need to reboot
bug 4821
```
1a77652e

Clarify gres.conf Cores documentation · 3ee3795f

Felip Moll authored Jul 03, 2018

Slurm numbers the cores using an abstract index, starting from CPU 0
on the first socket, core, thread, and continuing until N on the last socket,
last core, last thread. Explain that in the documentation.

bug 5189

3ee3795f
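
One way to picture the abstract numbering is the socket-major mapping below. It matches the endpoints the commit describes (0 on the first socket, core and thread; N on the last of each), but the function name and the exact ordering in between are assumptions for this sketch.

```python
def abstract_cpu_index(socket: int, core: int, thread: int,
                       cores_per_socket: int, threads_per_core: int) -> int:
    """Map (socket, core, thread) to a 0-based abstract index."""
    return (socket * cores_per_socket + core) * threads_per_core + thread
```

For a node with 2 sockets, 4 cores per socket and 2 threads per core, the first thread of the first core on the first socket maps to 0 and the last thread of the last core on the last socket maps to 15.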

fix commit d8c537 (v17.11) to work with commit b26d78 (v18.08) · 58badb1f
Morris Jette authored Jul 03, 2018
```
The node weight of a node requiring reboot is not a fixed value
in v18.08, but configurable
bug 4821
```
58badb1f

Add pending RPC data cache · 4910d69c
Morris Jette authored Jul 02, 2018
```
bug 5337
```
4910d69c

Added pending RPC stats to sdiag output · 6033f246
Broderick Gardner authored Jul 02, 2018
```
bug 5337
```
6033f246