- 12 Jul, 2018 10 commits
-
-
Danny Auble authored
-
Boris Karasev authored
- avoid `abort()` when a collective fails - added logging of collective details for failure cases. Bug 5067
-
Danny Auble authored
Note: this sets things up so we can use defunct functions. It will probably need to be fixed properly in a future version so that we no longer do this.
-
Morris Jette authored
This change is associated with commit 6be109d9
-
Morris Jette authored
gres_per_socket requires a sockets-per-node count specification, and gres_per_task requires a task count specification. These restrictions are required in order for cons_res to support these options in a finite amount of time/code.
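A minimal standalone sketch of the kind of validation this implies; the structure and function names are illustrative, not the actual cons_res code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical subset of a job request; not Slurm's real job_desc_msg_t. */
struct gres_job_request {
	uint64_t gres_per_socket;	/* 0 means "not requested" */
	uint64_t gres_per_task;
	uint32_t sockets_per_node;	/* 0 means "unspecified" */
	uint32_t num_tasks;
};

/* Reject requests that cons_res could only satisfy by searching an
 * unbounded space of socket/task layouts. */
static bool gres_request_is_valid(const struct gres_job_request *req)
{
	if (req->gres_per_socket && !req->sockets_per_node) {
		fprintf(stderr, "gres_per_socket requires a sockets-per-node count\n");
		return false;
	}
	if (req->gres_per_task && !req->num_tasks) {
		fprintf(stderr, "gres_per_task requires a task count\n");
		return false;
	}
	return true;
}

int main(void)
{
	struct gres_job_request req = { .gres_per_socket = 2 };
	printf("valid: %d\n", gres_request_is_valid(&req));	/* prints 0 */
	return 0;
}
```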
-
Dominik Bartkiewicz authored
-
Dominik Bartkiewicz authored
Bug 5098.
-
Morris Jette authored
-
Dominik Bartkiewicz authored
with preemption or when a job requests a specific list of hosts. Bug 5293.
-
Morris Jette authored
-
- 11 Jul, 2018 2 commits
-
-
Morris Jette authored
Coverity CID 186992
-
Morris Jette authored
Coverity CID 186991
-
- 10 Jul, 2018 3 commits
-
-
Morris Jette authored
Pass "first_pass" and "avail_cores to _eval_nodes() so that the usable cores can be better identified by the GRES selection logic. Add new function, _select_cores(), to select specific cores for use Create new data structure with job multi-core spec Permit off-socket cores to be used with enforce_bind Needed so that cores on and off socket can be used. Details will need to be handled in _select_cores()
-
Morris Jette authored
The munge regression test7.16 would fail roughly 0.1% of the time when modifying a bit that munge did not use. This change modifies the test to retry once in that case.
-
Broderick Gardner authored
Bug 5337
-
- 09 Jul, 2018 4 commits
-
-
Danny Auble authored
Coverity 186930
-
Boris Karasev authored
-
Danny Auble authored
-
Morris Jette authored
-
- 07 Jul, 2018 1 commit
-
-
Morris Jette authored
When we need to drop nodes in the selection algorithm, change from dropping nodes with low CPU counts to dropping those with low CPU+GPU counts (for jobs requesting GPUs). Not an ideal algorithm, but much better when using GPUs.
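A minimal sketch of the idea, assuming a simplified node record (not Slurm's node_record_t):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified node record for illustration. */
struct node_rec {
	const char *name;
	uint16_t cpus;
	uint16_t gpus;
};

/* Sort candidates so the least useful nodes come first and get dropped.
 * Ranking by CPU count alone would drop n2 (8 CPUs, 8 GPUs) first and
 * keep the GPU-less n1; ranking by CPU+GPU count drops n1 instead. */
static int cmp_cpu_plus_gpu(const void *a, const void *b)
{
	const struct node_rec *na = a, *nb = b;
	return (na->cpus + na->gpus) - (nb->cpus + nb->gpus);
}

int main(void)
{
	struct node_rec nodes[] = {
		{ "n1", 12, 0 }, { "n2", 8, 8 }, { "n3", 36, 4 },
	};
	qsort(nodes, 3, sizeof(nodes[0]), cmp_cpu_plus_gpu);
	printf("dropped first: %s\n", nodes[0].name);	/* n1 */
	return 0;
}
```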
-
- 06 Jul, 2018 14 commits
-
-
Danny Auble authored
thread Bug 5390
-
Brian Christiansen authored
-
Thea Flowers authored
Bug 5395
-
Morris Jette authored
This logs the GPU configuration from the slurmd perspective. While we don't have tools to load the information directly from the NVIDIA system configuration, I have confirmed where that logic needs to go and the data structure contents.
-
Danny Auble authored
# Conflicts:
#	doc/html/faq.shtml
#	src/slurmctld/job_mgr.c
-
Danny Auble authored
Bug 5390
-
Marshall Garey authored
Continuation of 923c9b37. There is a delay in the cgroup system when moving a PID from one cgroup to another. It is usually short, but if we don't wait for the PID to move before removing cgroup directories the PID previously belonged to, we could leak cgroups. This was previously fixed in the cpuset and devices subsystems. This uses the same logic to fix the freezer subsystem. Bug 5082.
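A rough sketch of the wait-before-remove logic described here, with illustrative paths and polling interval (not the actual proctrack/cgroup code):

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Return true if `pid` is still listed in <cgroup_path>/cgroup.procs. */
static bool pid_in_cgroup(const char *cgroup_path, pid_t pid)
{
	char procs[256];
	snprintf(procs, sizeof(procs), "%s/cgroup.procs", cgroup_path);

	FILE *fp = fopen(procs, "r");
	if (!fp)
		return false;	/* cgroup already gone */

	long p;
	bool found = false;
	while (fscanf(fp, "%ld", &p) == 1) {
		if ((pid_t) p == pid) {
			found = true;
			break;
		}
	}
	fclose(fp);
	return found;
}

/* After moving the PID to another cgroup, wait (bounded) for the kernel
 * to finish the move before rmdir()'ing the old directory; removing the
 * directory too early is what leaked freezer cgroups. */
static int wait_pid_moved(const char *old_cgroup, pid_t pid, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (!pid_in_cgroup(old_cgroup, pid))
			return 0;
		usleep(1000);	/* the move is usually quick */
	}
	return -1;		/* still listed: do not remove the cgroup */
}

int main(void)
{
	/* illustrative path only */
	const char *old = "/sys/fs/cgroup/freezer/slurm/uid_1000/job_42";
	int rc = wait_pid_moved(old, getpid(), 100);
	printf("move %s\n", rc ? "timed out" : "complete");
	return 0;
}
```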
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Marshall Garey authored
The cpuset and devices subsystems have duplicate code to clean up the cgroup and prevent leaking cgroups by moving the process to the root cgroup and waiting for it to be moved. Move this duplicate code to a common function so it can be used later by the freezer subsystem. Bug 5082.
-
Marshall Garey authored
Bug 5227
-
Broderick Gardner authored
Bug 5337
-
Danny Auble authored
-
- 05 Jul, 2018 3 commits
-
-
Danny Auble authored
the database. Bug 5247
-
Morris Jette authored
Previous logic could trigger a KNL node reboot when a job did not request any KNL MCDRAM or NUMA modes as features. For example, `srun -N3 -C "[foo*1&bar*2]" hostname` would trigger a reboot of all KNL nodes even though no KNL-specific features were requested. This bug only exists in v18.08 and was introduced when expanding the KNL node feature specification capabilities.
-
Morris Jette authored
Invoke select_g_job_test() one time with all valid nodes rather than multiple times when adding higher weight nodes. This results in the job allocation always accumulating nodes from lower to higher weights rather than possibly using mostly higher weight nodes. It also streamlines the resource allocation process for most configurations by eliminating some repeated logic as groups of nodes are added for consideration by the select plugin.
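A simplified standalone model of the new behavior: one selection call sees every valid candidate at once and still prefers low-weight nodes, instead of being called repeatedly as each weight tier is added. The structures are illustrative, not Slurm's:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NODE_CNT 6

/* Simplified node table; weights decide preference order. */
struct node { const char *name; uint32_t weight; };

static int cmp_weight(const void *a, const void *b)
{
	const struct node *na = a, *nb = b;
	return (int) na->weight - (int) nb->weight;
}

/* Stand-in for one select_g_job_test() call: given every valid
 * candidate at once, take the `want` lowest-weight nodes. */
static void job_test_once(struct node *cand, int cand_cnt, int want)
{
	qsort(cand, cand_cnt, sizeof(cand[0]), cmp_weight);
	for (int i = 0; i < want && i < cand_cnt; i++)
		printf("allocated %s (weight %u)\n", cand[i].name, cand[i].weight);
}

int main(void)
{
	struct node cand[NODE_CNT] = {
		{ "n1", 10 }, { "n2", 1 }, { "n3", 5 },
		{ "n4", 1 },  { "n5", 5 }, { "n6", 10 },
	};
	/* One call with all valid nodes, rather than repeated calls that
	 * add one weight tier at a time. */
	job_test_once(cand, NODE_CNT, 3);
	return 0;
}
```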
-
- 04 Jul, 2018 3 commits
-
-
Felip Moll authored
Bug 4451
-
Morris Jette authored
So that multiple node changes will be reported on one line rather than one line per node. Otherwise this could lead to performance issues when reloading slurmctld on big systems. Bug 4980
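A minimal sketch of the aggregation idea; Slurm itself would build a ranged hostlist (e.g. node[01-03]) rather than a plain comma-separated string:

```c
#include <stdio.h>
#include <string.h>

/* Collect the names of changed nodes and log them once, instead of
 * emitting one log line per node during a config reload. */
int main(void)
{
	const char *changed[] = { "node01", "node02", "node03" };
	char line[256] = "";

	for (size_t i = 0; i < sizeof(changed) / sizeof(changed[0]); i++) {
		if (i)
			strncat(line, ",", sizeof(line) - strlen(line) - 1);
		strncat(line, changed[i], sizeof(line) - strlen(line) - 1);
	}
	/* one line total, instead of one message per node */
	printf("node configuration changed: %s\n", line);
	return 0;
}
```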
-
Felip Moll authored
Cleaned up code that could have caused performance issues when reading the config and there were nodes with features defined. Bug 4980
-