Commits · 38eb6d6dd408213fab797d0f0992f30f18e81205 · Manuel G. Marciani / ces_slurm_simulator

13 Jul, 2018 8 commits
- mpi/pmix: change context switch on local contrib, check nbr sequence · 38eb6d6d
  Boris Karasev authored Jun 07, 2018
  
- mpi/pmix: added error msg when Ring coll cannot be used · 73671d03
  Boris Karasev authored Jun 09, 2018
  
- mpi/pmix: addressed remarks of the previous commit · 6e11934d
  Boris Karasev authored Jun 05, 2018
  
- mpi/pmix: Set of fixes and TODOs in new collective framework · b48085c9
  Artem Polyakov authored Jun 04, 2018
  
- mpi/pmix: collectives refactoring · 53c0fd12
  Boris Karasev authored May 31, 2018
  
- mpi/pmix: fixed early direct connection · 8160a427
  Boris Karasev authored May 30, 2018
  
- mpi/pmix: added the ring collective implementation · 17963d4b
  Boris Karasev authored May 26, 2018
  
- mpi/pmix: update the tree collective structure · 83dbe46c
  Boris Karasev authored May 26, 2018
  
12 Jul, 2018 12 commits
- Run autogen.sh · 38c5ed73
  Danny Auble authored Jul 09, 2018
  
- mpi/pmix: rename pmixp_coll.c to pmixp_coll_tree.c · 0f65584d
  Boris Karasev authored May 26, 2018
  
- Merge remote-tracking branch 'origin/slurm-17.11' · a48f5d3a
  Danny Auble authored Jul 12, 2018
  
- mpi/pmix: fixed the collectives canceling · f15c8183
  Boris Karasev authored Jun 16, 2018
```
- avoid `abort()` when a collective fails
- added logging of coll details for failure cases

Bug 5067
```
- Make code compile with hdf5 1.10.2+ · 90c4e7e7
  Danny Auble authored Jul 12, 2018
```
Note, this is setting it up so we can use defunct functions.  It will
probably need to be properly fixed in a future version so we don't
do this.
```
- Fix test suite for hardened GPU logic · f8175b3e
  Morris Jette authored Jul 12, 2018
```
This change is associated with commit 6be109d9
```
- cons_tres: better constrain gpu options · 6be109d9
  Morris Jette authored Jul 12, 2018
```
gres_per_socket requires sockets-per-node count specification
gres_per_task requires task count specification
these restrictions are required in order for cons_res to
  support these options in a finite amount of time/code
```
- Merge branch 'slurm-17.11' · 8fb1d1a6
  Dominik Bartkiewicz authored Jul 12, 2018
  
- Fix for potential deadlock in the assoc_mgr_get_user_assocs() · 80d38355
  Dominik Bartkiewicz authored Jul 12, 2018
```
Bug 5098.
```
- Expand gpu option testing · 77fc2c23
  Morris Jette authored Jul 12, 2018
  
- Fix issues with --exclusive=[user|mcs] to work correctly · 72736af2
  Dominik Bartkiewicz authored Jul 12, 2018
```
with preemption or when job requests a specific list of hosts.

Bug 5293.
```
- cons_tres: add some core filtering/selection logic · 08982723
  Morris Jette authored Jul 11, 2018
  
11 Jul, 2018 2 commits
- Fix memory leak · c2988cef
  Morris Jette authored Jul 11, 2018
```
Coverity CID 186992
```
- Remove redundant NULL pointer check · 11f74f4d
  Morris Jette authored Jul 11, 2018
```
Coverity CID 186991
```
10 Jul, 2018 3 commits

Morris Jette authored Jul 10, 2018

Pass "first_pass" and "avail_cores" to _eval_nodes() so that
  the usable cores can be better identified by the GRES selection
  logic.
Add new function, _select_cores(), to select specific cores for use
Create new data structure with job multi-core spec
Permit off-socket cores to be used with enforce_bind
  Needed so that cores on and off socket can be used. Details will need
  to be handled in _select_cores()

6f78e048

harden a regression test · 44aba2a7

Morris Jette authored Jul 10, 2018

the munge regression test7.16 would fail roughly 0.1% of the time
when modifying a bit that munge did not use. This change modifies
the test to retry once in that case.

44aba2a7
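
The retry-once approach described in 44aba2a7 can be sketched in C as below. This is only an illustration of the pattern, not the actual test7.16 code, which is not reproduced on this page. The names `check_fn`, `run_with_one_retry`, `fails_first_time`, and `never_passes` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: returns true when the check passes. */
typedef bool (*check_fn)(void *ctx);

/*
 * Run a check that can fail spuriously, retrying exactly once.
 * Returns the attempt number that passed (1 or 2), or 0 if both failed.
 */
int run_with_one_retry(check_fn check, void *ctx)
{
	assert(check != NULL);
	for (int attempt = 1; attempt <= 2; attempt++) {
		if (check(ctx))
			return attempt;
	}
	return 0;
}

/* Demo check: fails on its first call, passes on later calls. */
bool fails_first_time(void *ctx)
{
	int *calls = (int *) ctx;
	return ++(*calls) > 1;
}

/* Demo check: counts its calls and never passes. */
bool never_passes(void *ctx)
{
	int *calls = (int *) ctx;
	(*calls)++;
	return false;
}
```

A single retry turns a rare spurious failure into a much rarer double failure, while a check that is genuinely broken still fails after the second attempt.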

Document recent sdiag enhancements · 4103ccfd
Broderick Gardner authored Jul 10, 2018
```
bug 5337
```

09 Jul, 2018 4 commits
- Whoops, I used pthread_cond_signal instead of slurm_cond_signal. · ff47bcbb
  Danny Auble authored Jul 09, 2018
```
Coverity 186930
```
- mpi/pmix: use `SLURM` prefix of UCX config variables · 8012228f
  Boris Karasev authored Jun 20, 2018
  
- Add news for 4daeedd8 · d10854d9
  Danny Auble authored Jul 09, 2018
  
- cons_tres: favor using nodes with co-located GPUs+CPUs · 3ca43f11
  Morris Jette authored Jul 09, 2018
  
07 Jul, 2018 1 commit

cons_tres: change algorithm to drop low resource nodes · cbed3921

Morris Jette authored Jul 06, 2018

When we need to drop nodes in the selection algorithm, change
from dropping low CPU count nodes to CPU+GPU count (for jobs
requesting GPUs). Not an ideal algorithm, but much better
when using GPUs.

cbed3921
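
The change in cbed3921 ranks candidate nodes by CPU+GPU count instead of CPU count alone when the job requests GPUs. A minimal sketch of that ranking follows. It assumes a plain sum of the two counts, which the commit message does not confirm, and `struct node_avail`, `node_score`, and `pick_node_to_drop` are hypothetical names rather than Slurm's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node summary; not Slurm's actual data structure. */
struct node_avail {
	int cpus;	/* available CPUs on the node */
	int gpus;	/* available GPUs on the node */
};

/* Rank a node; assumes a plain CPU+GPU sum for jobs requesting GPUs. */
static int node_score(const struct node_avail *n, bool job_wants_gpus)
{
	return job_wants_gpus ? n->cpus + n->gpus : n->cpus;
}

/*
 * Return the index of the lowest-scoring node (the next one to drop),
 * or -1 if the list is empty. Ties go to the earliest node.
 */
int pick_node_to_drop(const struct node_avail *nodes, size_t cnt,
		      bool job_wants_gpus)
{
	int best = -1;

	assert(nodes != NULL || cnt == 0);
	for (size_t i = 0; i < cnt; i++) {
		if (best < 0 ||
		    node_score(&nodes[i], job_wants_gpus) <
		    node_score(&nodes[best], job_wants_gpus))
			best = (int) i;
	}
	return best;
}
```

With nodes of (16 CPUs, 0 GPUs), (8 CPUs, 4 GPUs), and (10 CPUs, 0 GPUs), the CPU-only ranking would drop the 8-CPU node even though it holds all the GPUs. The combined ranking drops the 10-CPU node instead.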

06 Jul, 2018 10 commits
- Continuation of e5f03971 to get rid of the potentially dangerous detached · 6a702158
  Danny Auble authored Jul 06, 2018
```
thread

Bug 5390
```
- Merge remote-tracking branch 'origin/slurm-17.11' · aa5db52a
  Brian Christiansen authored Jul 06, 2018
  
- Add workaround for importing newly installed namespace packages · da2ecda8
  Thea Flowers authored Jun 22, 2018
```
Bug 5395
```
- gres/gpu: add logging for config info · a8c63832
  Morris Jette authored Jul 06, 2018
```
This logs the GPU configuration from the slurmd perspective.
While we don't have tools to load the information directly
from the NVIDIA system configuration, I have confirmed where that
logic needs to go and the data structure contents.
```
- Merge remote-tracking branch 'origin/slurm-17.11' · bffbaf11
  Danny Auble authored Jul 06, 2018
```
# Conflicts:
#	doc/html/faq.shtml
#	src/slurmctld/job_mgr.c
```
- Fix potential segfault when closing the mpi/pmi2 plugin. · 4daeedd8
  Danny Auble authored Jul 06, 2018
```
Bug 5390
```
- Fix leaking freezer cgroups. · 7f9c4f73
  Marshall Garey authored Jul 06, 2018
```
Continuation of 923c9b37.

There is a delay in the cgroup system when moving a PID from one cgroup
to another. It is usually short, but if we don't wait for the PID to
move before removing cgroup directories the PID previously belonged to,
we could leak cgroups. This was previously fixed in the cpuset and
devices subsystems. This uses the same logic to fix the freezer
subsystem.

Bug 5082.
```
- cons_tres fix max_cpus_per_node logic for overcommit option · cac887a9
  Morris Jette authored Jul 06, 2018
  
- Fix for possible divide by zero · 514ed12f
  Morris Jette authored Jul 06, 2018
  
- cons_tres: add partition max_cpus_per_node support · bc6f851b
  Morris Jette authored Jul 06, 2018
  
  bc6f851b
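
The freezer cgroup fix in 7f9c4f73 above waits for a PID to finish moving out of a cgroup before removing that cgroup's directory. A rough sketch of such a wait is shown here. It assumes the cgroup exposes its member PIDs in a `cgroup.procs`-style file with one PID per line. The helper names `pid_listed` and `wait_for_pid_to_leave`, the bounded retry count, and the fixed delay are illustrative assumptions, not Slurm's actual implementation.

```c
#define _POSIX_C_SOURCE 200809L	/* for nanosleep() */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>

/* Return true if pid appears in a cgroup.procs-style file. */
static bool pid_listed(const char *procs_path, pid_t pid)
{
	FILE *fp = fopen(procs_path, "r");
	long cur;
	bool found = false;

	if (!fp)
		return false;	/* file missing: treat the pid as gone */
	while (fscanf(fp, "%ld", &cur) == 1) {
		if ((pid_t) cur == pid) {
			found = true;
			break;
		}
	}
	fclose(fp);
	return found;
}

/*
 * Poll until pid is no longer listed in procs_path, sleeping delay_ms
 * after each check that still finds it. Returns true once the pid has
 * left, false if it is still listed after max_tries checks.
 */
bool wait_for_pid_to_leave(const char *procs_path, pid_t pid,
			   int max_tries, long delay_ms)
{
	struct timespec ts = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};

	assert(procs_path != NULL && max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		if (!pid_listed(procs_path, pid))
			return true;
		nanosleep(&ts, NULL);
	}
	return false;
}
```

A caller would remove the cgroup directory only after `wait_for_pid_to_leave()` returns true. Bounding the number of checks keeps a stuck PID from blocking cleanup forever.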