Commits · b72b0c33fc501e0cb4270c80be57af3d614bb05a · Manuel G. Marciani / ces_slurm_simulator

05 Dec, 2018 16 commits
- Fix break lines for html version · b72b0c33
  Albert Gil authored Dec 05, 2018
```
Bug 6163
```
  b72b0c33
- Run SlurmctldPrimaryOffProg when the primary shuts down. · ee29bba8
  Felip Moll authored Dec 05, 2018
```
Backups already run it when dropping to backup.

Bug 6098.
```
  ee29bba8
- Docs - update pam_slurm_adopt info. · 91db45c9
  Marshall Garey authored Dec 05, 2018
```
Remove the README and point to the web page.

Add details on the disable_x11 option.

Bug 5936.
```
  91db45c9
- pam_slurm_adopt - send an error message to the user if no Jobs found. · 9fb15b4a
  Marshall Garey authored Dec 05, 2018
```
Also throw an error message within stepd_available() if the nodename
is not set or cannot be inferred correctly.

Bug 5399.
```
  9fb15b4a
- Add links for Shifter, set udocker as standalone, fix 80-col spacing · a100c7e2
  Nate Rini authored Nov 21, 2018
  
  a100c7e2
- Sort containers · 057eabf0
  Brian Christiansen authored Nov 20, 2018
  
  057eabf0
- Fixes · f7a6fbe4
  Brian Christiansen authored Nov 20, 2018
```
Spelling, suggestions, trailing whitespace.
```
  f7a6fbe4
- Create containers guide · 48922480
  Nate Rini authored Nov 21, 2018
```
Bug 6044
```
  48922480
- Fix missing suffixes in squeue. · 9b0399b8
  Trey Dockendorf authored Dec 05, 2018
```
Bug 6120
```
  9b0399b8
- Fix typo, no code change. · 84aa2db5
  Albert Gil authored Dec 05, 2018
  
  84aa2db5
- Decrease an error message to be debug. · 639b3e87
  Tim Wickberg authored Dec 05, 2018
```
Bug 6155
```
  639b3e87
- Decrement message_connections in stepd code on error path correctly. · 57daec20
  Tim Wickberg authored Dec 05, 2018
```
Bug 6155
```
  57daec20
- Add bf_ignore_newly_avail_nodes option to SchedulerParameters. · 5ad1447e
  Felip Moll authored Oct 19, 2018
```
When bf_continue is set, and locks are released during a backfill cycle,
other operations can make new resources available while part way through
the queue. When backfill continues the cycle and evaluates new jobs, it
may allocate some of these newly available resources to lower priority jobs,
rather than to higher priority jobs that were already considered in this
backfill cycle.

This patch introduces bf_ignore_newly_avail_nodes to SchedulerParameters
to solve this issue. This option will ignore nodes made available when
the backfill scheduler yields when resuming the backfill cycle.

Bug 5279.
```
  5ad1447e
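
The effect described in 5ad1447e can be pictured as a set intersection. The sketch below is only an illustration under stated assumptions: `node_mask_t` and `usable_after_yield()` are hypothetical names, a 64-bit mask stands in for Slurm's node bitmaps, and the actual backfill code may be organized quite differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a node bitmap: bit i set means node i is available. */
typedef uint64_t node_mask_t;

/*
 * Nodes the backfill cycle may consider after resuming from a yield.
 *   cycle_start_avail  - nodes available when the cycle began
 *   now_avail          - nodes available once locks are reacquired
 *   ignore_newly_avail - models bf_ignore_newly_avail_nodes being set
 */
static node_mask_t usable_after_yield(node_mask_t cycle_start_avail,
                                      node_mask_t now_avail,
                                      int ignore_newly_avail)
{
    node_mask_t usable = now_avail;

    /* Drop nodes that only became available while the cycle was paused. */
    if (ignore_newly_avail)
        usable &= cycle_start_avail;

    /* Never offer a node that is not available right now. */
    assert((usable & ~now_avail) == 0);
    return usable;
}
```

With the option modeled as set, a node freed during the yield stays out of reach for the rest of the cycle, so lower priority jobs evaluated later cannot take it ahead of higher priority jobs already considered.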
- Add Albert to the team! · 6d88a442
  Danny Auble authored Dec 05, 2018
  
  6d88a442
- Fix stepd segfault race if slurmctld hasn't registered with the launching · 4b14c2d4
  Danny Auble authored Dec 05, 2018
```
slurmd yet delivering its TRES list.

Bug 6122

Co-authored-by: Marshall Garey <marshall@schedmd.com>
```
  4b14c2d4
- Clarify documentation of --depend=singleton · 9ef8dfd8
  Morris Jette authored Dec 05, 2018
  
  9ef8dfd8
04 Dec, 2018 9 commits
- Revert 8c910226, holding off till 19.05 · 18c6dd16
  Nate Rini authored Nov 28, 2018
```
Bug 6008
```
  18c6dd16
- If there is a constraint construct of the form "[...&...]" · 6738448a
  Morris Jette authored Dec 03, 2018
```
then an error is generated if more than one of those specifications
contains KNL NUMA or MCDRAM modes.

Bug 5846
```
  6738448a
- Improve debug in #if _DEBUG statements, no real code change. · b48cdd75
  Morris Jette authored Dec 03, 2018
```
Bug 5846
```
  b48cdd75
- Fix a scheduling logic bug with respect to XOR operation support when there
  · 6b9f894f
  Morris Jette authored Dec 03, 2018
```
are down nodes.

Bug 5846
```
  6b9f894f
- Fix scheduling logic bug. There should have been a test for _not_ · 285545d9
  Morris Jette authored Dec 03, 2018
```
NODE_SET_REBOOT to continue.

Bug 5846
```
  285545d9
- Fix scheduling logic to avoid using nodes that require a reboot for KNL · d9b9eb23
  Morris Jette authored Dec 03, 2018
```
node change when possible.

Bug 5846
```
  d9b9eb23
- Docs - rewrite platforms.html page with current info. · 372b51e8
  Tim Wickberg authored Dec 04, 2018
```
Break out a list of Linux distributions as well.
```
  372b51e8
- Fix handling of 'slurmd -f' by setting SLURM_CONF earlier. · 401d1b47
  Marshall Garey authored Dec 04, 2018
```
Plugins reading in their own config files rely on the SLURM_CONF
environment variable pointing to the appropriate directory,
otherwise they will fall back to the built-in sysconfdir path.

Set the environment variable early enough so that the -f flag
operates correctly, but not before conf->conffile has definitely
been set. Remove the setenv call that happens before the first
slurmstepd is fork()'d as it is now redundant.

Bug 4774.
```
  401d1b47
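
The ordering constraint in 401d1b47 (export the variable early, but only once the config path is known) might look roughly like the following. This is a sketch, not the code from the commit: `export_conf_path()` is a hypothetical helper, and only the POSIX `setenv()`/`getenv()` calls are real.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: export the config path chosen with -f so that
 * code run later (for example plugins that read their own config
 * files) can locate it through SLURM_CONF.
 * Returns 0 on success, -1 if the path is not set yet or setenv fails.
 */
static int export_conf_path(const char *conffile)
{
    /* Too early: the path has not been resolved, so export nothing. */
    if (conffile == NULL || conffile[0] == '\0')
        return -1;

    /* Overwrite any inherited value so the -f argument takes priority. */
    if (setenv("SLURM_CONF", conffile, 1) != 0)
        return -1;

    assert(strcmp(getenv("SLURM_CONF"), conffile) == 0);
    return 0;
}
```

Because child processes inherit the environment, exporting the variable once at this point also covers anything forked afterwards, which is consistent with the commit dropping the later setenv call as redundant.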
- salloc - set SLURM_NTASKS_PER_CORE and SLURM_NTASKS_PER_SOCKET when appropriate. · a36b8a4d
  Alejandro Sanchez authored Dec 04, 2018
```
sbatch sets these, but salloc did not. This should make srun behavior
between the two consistent.

Bug 3861.
```
  a36b8a4d
03 Dec, 2018 2 commits
- When handling runaway jobs remove all usage before rollup to remove any · bf705c80
  Marshall Garey authored Dec 03, 2018
```
time that wasn't existent instead of just updating lines that have time
with a lesser time.
```
  bf705c80
- Fix issue when job's environment is minimal and only contains variables · f1116c67
  Dominik Bartkiewicz authored Dec 03, 2018
```
Slurm is going to replace internally.

Bug 5800
```
  f1116c67
29 Nov, 2018 2 commits
- Validate job_ptr in backfill before restoring preempt state. · 4dec76c9
  Dominik Bartkiewicz authored Nov 29, 2018
```
Bug 6121
```
  4dec76c9
- Fix salloc and missing SLURM_NTASKS. · 8c910226
  Nate Rini authored Nov 28, 2018
```
Bug 6008
```
  8c910226
28 Nov, 2018 6 commits
- Extend test to test patch from bug 6077 · 7cbda917
  Morris Jette authored Nov 28, 2018
  
  7cbda917
- Fix issue when requesting invalid gres. · 80e2cc41
  Alejandro Sanchez authored Nov 28, 2018
```
Bug 6077
```
  80e2cc41
- In route/topology validate the slurmctld doesn't try to initialize the · cae90ff4
  Danny Auble authored Nov 28, 2018
```
node system.

Bug 6037
```
  cae90ff4
- Fix race condition in route/topology when the slurmctld is reconfigured. · f35bb686
  Marshall Garey authored Nov 28, 2018
```
Bug 6037
```
  f35bb686
- mpi/pmix: Remove unneeded libpmix callback drop in tree-based coll · bd283fd3
  Artem Y. Polyakov authored Nov 27, 2018
```
Bug 5983
```
  bd283fd3
- mpi/pmix: Fix double invocation of the PMIx lib fence callback · 674e78b6
  Artem Y. Polyakov authored Nov 05, 2018
```
In case of the error code paths (like collective timeout) it is possible
that a callback provided by PMIx will be called twice leading to a
segmentation fault.
This commit fixes it by properly accounting callback invocations.

Bug 5983
```
  674e78b6
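
The message for 674e78b6 does not say how callback invocations are accounted for. One common guard against a double call, shown here purely as an assumption, is to consume the stored callback pointer on first use. `coll_t`, `fence_cb_t`, `coll_complete()` and `count_cb()` are hypothetical names and do not come from the mpi/pmix plugin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type; the real PMIx signatures differ. */
typedef void (*fence_cb_t)(int status, void *cbdata);

typedef struct {
    fence_cb_t cb;   /* callback supplied by the library, NULL once consumed */
    void *cbdata;    /* opaque pointer handed back to the callback */
} coll_t;

/*
 * Run the stored callback at most once.
 * Returns 1 if this call ran it, 0 if it had already been consumed.
 */
static int coll_complete(coll_t *coll, int status)
{
    fence_cb_t cb;

    assert(coll != NULL);
    cb = coll->cb;
    if (cb == NULL)
        return 0;

    coll->cb = NULL;   /* clear first so a nested call is also a no-op */
    cb(status, coll->cbdata);
    return 1;
}

/* Example callback that counts how often it was invoked. */
static void count_cb(int status, void *cbdata)
{
    (void)status;
    (*(int *)cbdata)++;
}
```

With this guard, a normal completion followed by an error path such as a collective timeout results in a single invocation instead of two.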
27 Nov, 2018 5 commits
- Revert "mpi/pmix: Fix double invocation of the PMIx lib fence callback" · 2dfadf65
  Danny Auble authored Nov 27, 2018
```
This reverts commit b0515009.
```
  2dfadf65
- mpi/pmix: Make multi-slurmd work correctly when using ring communication. · 17ddbd5e
  Danny Auble authored Nov 27, 2018
```
Bug 5935
```
  17ddbd5e
- mpi/pmix: fixed the logging of collective state · e7803212
  Boris Karasev authored Nov 11, 2018

  This could have caused core dumps if communication failed for one
  reason or another.

  Signed-off-by: Boris Karasev <karasev.b@gmail.com>

  Bug 5935

  e7803212
- mpi/pmix: Fix double invocation of the PMIx lib fence callback · b0515009
  Artem Y. Polyakov authored Nov 05, 2018

  In case of the error code paths (like collective timeout) it is possible
  that a callback provided by PMIx will be called twice leading to a
  segmentation fault.
  This commit fixes it by properly accounting callback invocations.

  b0515009
- Clean up step on a failed node correctly. · e11f4af9
  Morris Jette authored Nov 27, 2018

  This patch does 2 things:
  1. When a step fails on some node, then mark it as complete on those
     nodes. This is needed so that when the step ends on the other
     nodes, slurmctld recognizes the step as completely done.
  2. If the step does not have the --no-kill option set, then when some
     node fails, send a request to terminate the step on ALL of its nodes.

  Bug 5805

  e11f4af9
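
The two behaviors listed for e11f4af9 can be sketched as bookkeeping on a per-step record. Everything below is an assumption made for illustration: `step_t` and the `step_*` functions are hypothetical, a 64-bit mask replaces Slurm's bitmaps, and sending the terminate request is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical step record; bit i of done_mask refers to node i of the step. */
typedef struct {
    uint32_t node_cnt;     /* nodes in the step (at most 64 in this sketch) */
    uint64_t done_mask;    /* nodes on which the step is complete */
    bool no_kill;          /* models the --no-kill option */
    bool kill_requested;   /* terminate request sent to all nodes */
} step_t;

/* True once every node of the step has been marked complete. */
static bool step_fully_done(const step_t *step)
{
    uint64_t all;

    assert(step->node_cnt <= 64);
    all = (step->node_cnt == 64) ? UINT64_MAX
                                 : ((UINT64_C(1) << step->node_cnt) - 1);
    return (step->done_mask & all) == all;
}

/* Normal completion of the step on one node. */
static void step_node_done(step_t *step, uint32_t node_inx)
{
    assert(node_inx < step->node_cnt);
    step->done_mask |= UINT64_C(1) << node_inx;
}

/* A node running the step has failed. */
static void step_node_failed(step_t *step, uint32_t node_inx)
{
    /* 1. Mark the step complete on the failed node. */
    step_node_done(step, node_inx);

    /* 2. Without --no-kill, ask every node of the step to terminate it. */
    if (!step->no_kill)
        step->kill_requested = true;
}
```

Marking the failed node as complete is what lets `step_fully_done()` become true once the surviving nodes report in; without it the record would wait indefinitely on a node that can no longer answer.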