Commits · 84a306ac4684925075cc1bedca79026eb55a309d · Manuel G. Marciani / ces_slurm_simulator

13 Feb, 2018 1 commit
- Fix small memory leaks in node_features plugins on reconfigure. · 84a306ac
  Felip Moll authored Feb 12, 2018
```
Bug 4747.
```
  84a306ac
12 Feb, 2018 1 commit
- slurm.spec - change --with lua check to use pkgconfig · d21c1a08
  Felip Moll authored Jan 19, 2018
```
Fixes some issues around differences in lua package naming.

Bug 4568.
```
  d21c1a08
08 Feb, 2018 1 commit
- Add slurm_load_single_node() function to the Perl API. · 068ed34a
  Dominik Bartkiewicz authored Feb 08, 2018
```
Bug 4709.
```
  068ed34a
07 Feb, 2018 9 commits

Fix limit enforcement order when limits are set at partition and other levels. · 2ef56d4b

Alejandro Sanchez authored Feb 07, 2018

Previously the MIN of the limits was taken, without respecting the order.

Also add a note to the resource_limits.html page to clarify the
exception for Max[Wall|Time] and/or [Max|Min]Nodes limits, where
the default is that the Partition is king with regard to
precedence, unless the respective job's QOS flags
Partition[Min|Max|Time]Limit are set.

Bug 4681.

2ef56d4b
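
The precedence rule described in this commit message can be sketched as follows. This is a simplified illustration, not the slurmctld code: the function names, the `DEMO_NO_LIMIT` sentinel, and the reduction to a single time limit with one boolean flag are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "no limit configured at this level". */
#define DEMO_NO_LIMIT UINT32_MAX

/* Old behavior as described in the message: simply take the MIN. */
uint32_t demo_old_time_limit(uint32_t part_limit, uint32_t qos_limit)
{
	return (part_limit < qos_limit) ? part_limit : qos_limit;
}

/*
 * New behavior as described in the message: the partition limit takes
 * precedence, unless the QOS carries the matching Partition*Limit flag
 * (modeled here as one boolean) and actually defines a limit.
 */
uint32_t demo_effective_time_limit(uint32_t part_limit, uint32_t qos_limit,
				   bool qos_overrides_partition)
{
	/* Both inputs are either a real limit or the sentinel. */
	assert(part_limit <= DEMO_NO_LIMIT && qos_limit <= DEMO_NO_LIMIT);

	if (qos_overrides_partition && (qos_limit != DEMO_NO_LIMIT))
		return qos_limit;
	return part_limit;
}
```

With a partition limit of 120 and a QOS limit of 60, the MIN approach yields 60, while the ordered approach yields 120 unless the override flag is set.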

Link slurmd against all libraries that slurmstepd links to. · 9bfc98e4

Danny Auble authored Feb 06, 2018

This prevents a hard-to-diagnose issue where slurmstepd may fail
to start due to a missing library. This now ensures slurmd will
fail, and keep the node down until the library issue can be fixed.

Bug 4645, 4644.

9bfc98e4

Add fatal_abort() function to log a last fatal message then abort. · d2ddc71a

Danny Auble authored Feb 06, 2018

fatal() calls exit(1), which precludes getting a backtrace.
That's fine for configuration issues and other types of problems,
but when hitting "impossible" edge cases a core dump may
be the only way to isolate the issue.

Adding to 17.11 so we can easily provide diagnostic patches without
needing users to back-port this implementation. Further use will
come in 18.08.

Bug 4599.

d2ddc71a
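
The difference between the two exit paths can be sketched in plain C. The `demo_*` functions below are hypothetical stand-ins, not Slurm's logging API; they only show that exit(1) ends the process normally, while abort() raises SIGABRT, which can leave a core dump when the system allows one.

```c
#include <assert.h>
#include <signal.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Shared helper: print the last message to stderr and flush it. */
static void demo_log_last(const char *fmt, va_list ap)
{
	assert(fmt != NULL);
	fputs("fatal: ", stderr);
	vfprintf(stderr, fmt, ap);
	fputc('\n', stderr);
	fflush(stderr);
}

/* Stand-in for fatal(): log, then exit(1). No core dump is produced. */
static void demo_fatal(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	demo_log_last(fmt, ap);
	va_end(ap);
	exit(1);
}

/* Stand-in for fatal_abort(): log, then abort(), which raises SIGABRT. */
static void demo_fatal_abort(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	demo_log_last(fmt, ap);
	va_end(ap);
	abort();
}

/*
 * Run one of the helpers in a forked child. Returns the child's exit
 * code if it exited normally, the negated signal number if a signal
 * killed it, or -1000 on an unexpected error.
 */
int demo_run_in_child(int use_abort)
{
	int status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1000;
	if (pid == 0) {
		if (use_abort)
			demo_fatal_abort("impossible state %d", 42);
		else
			demo_fatal("bad configuration value %d", 42);
		_exit(0); /* not reached */
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1000;
	if (WIFEXITED(status))
		return WEXITSTATUS(status);
	if (WIFSIGNALED(status))
		return -WTERMSIG(status);
	return -1000;
}
```

Calling `demo_run_in_child(0)` reports a normal exit with code 1, whereas `demo_run_in_child(1)` reports termination by SIGABRT, the case a debugger can inspect afterwards.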

Start NEWS for v17.11.4. · 42ade47b
Tim Wickberg authored Feb 06, 2018

42ade47b
Add NEWS entry for 17.11.3-2. · 2c459fec
Tim Wickberg authored Feb 06, 2018

2c459fec
Revert "Preserve node features on reconfig" · df966ee4
Tim Wickberg authored Feb 06, 2018
```
This reverts commit 18b35709.
```
df966ee4
Revert "Fix segfault related to un-reset save_ptr value pointing into invalid memory." · 64cb091a
Tim Wickberg authored Feb 06, 2018
```
This reverts commit 6a91845b.
```
64cb091a
Fix segfault related to un-reset save_ptr value pointing into invalid memory. · 6a91845b
Tim Wickberg authored Feb 06, 2018

6a91845b
Start NEWS for v17.11.4 · 08664ed7
Tim Wickberg authored Feb 06, 2018

08664ed7

06 Feb, 2018 3 commits
- Prevent orphaned step_extern processes from job cancellation while running the prolog. · 108502e9
  Danny Auble authored Feb 06, 2018
```
Wait for the prolog to complete, and then clean up after it.
Otherwise orphaned "sleep" executables will be created by the
still-launching step_extern.

Bug 4718.
```
  108502e9
- Preserve node features on reconfig · 18b35709
  Felip Moll authored Feb 06, 2018
```
Preserve node features when slurmctld daemons are reconfigured, including
active and available KNL features.
Bug 4734.
```
  18b35709
- Docs - update squeue man page to describe all possible job states. · 0d327bd6
  Isaac Hartung authored Feb 06, 2018
```
Bug 4630.
```
  0d327bd6
05 Feb, 2018 1 commit
- Better debug messages when MaxSubmitJobs is hit. · 785771b2
  Brian Christiansen authored Feb 05, 2018
```
Bug 4722
```
  785771b2
01 Feb, 2018 3 commits

Prevent multiple calls to pthread_atfork() after re-init of config regex. · 5c9f1dd5

Regine Gaudin authored Feb 01, 2018

keyvalue_initialized is reset on 'scontrol reconfigure' and other
cases, which can lead to additional atfork handlers being registered.

These can eventually lead to a segfault if an excessive number of
handlers have been re-registered. Set a separate boolean to protect
against this. Clear that boolean as part of the atfork handler.

Bug 4628.

5c9f1dd5
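
A minimal sketch of the guard described above, assuming a simplified init/reset cycle. Only `keyvalue_initialized` comes from the commit message; `atfork_installed`, the `demo_*` functions, and the registration counter are hypothetical names added for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static bool keyvalue_initialized = false; /* reset on reconfigure */
static bool atfork_installed = false;     /* hypothetical separate guard */
static int atfork_registrations = 0;      /* demo-only counter */

/* Child-side handler: clear both flags, as the commit message describes. */
static void demo_atfork_child(void)
{
	keyvalue_initialized = false;
	atfork_installed = false;
}

/* Initialize the (omitted) regex state, registering the handler once. */
void demo_config_init(void)
{
	if (keyvalue_initialized)
		return;

	/* Regex compilation would happen here. */

	if (!atfork_installed) {
		int rc = pthread_atfork(NULL, NULL, demo_atfork_child);

		assert(rc == 0);
		(void) rc;
		atfork_installed = true;
		atfork_registrations++;
	}
	keyvalue_initialized = true;
}

/* Models 'scontrol reconfigure': only the init flag is reset. */
void demo_config_reset(void)
{
	keyvalue_initialized = false;
}

int demo_registration_count(void)
{
	return atfork_registrations;
}
```

Repeated reset and init cycles in the same process leave the registration count at 1, because the second flag survives the reset that clears `keyvalue_initialized`.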

Treat parameters in JobAcctGather plugin as case-insensitive. · 41343076

Felip Moll authored Feb 01, 2018

UsePss was correct, but UsePSS and usepss would be silently ignored,
leading to confusion as to whether the option was working or not.

Treat all JobAcctGatherParams as case-insensitive to avoid confusion.

Bug 4637.

41343076
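
Case-insensitive matching of a comma-separated parameter string can be sketched with POSIX `strncasecmp()`. The function `demo_has_param()` is a hypothetical helper written for this example, not the plugin's actual parsing code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/*
 * Return true if the comma-separated list 'params' contains 'name',
 * ignoring case. Whole tokens are compared, so "UsePssX" does not
 * match "UsePss".
 */
bool demo_has_param(const char *params, const char *name)
{
	assert(params != NULL && name != NULL);

	size_t name_len = strlen(name);
	const char *p = params;

	while (*p) {
		/* Length of the current token, up to the next comma. */
		size_t tok_len = strcspn(p, ",");

		if ((tok_len == name_len) &&
		    (strncasecmp(p, name, name_len) == 0))
			return true;

		p += tok_len;
		if (*p == ',')
			p++;
	}
	return false;
}
```

With this approach `UsePss`, `UsePSS`, and `usepss` are all accepted, so a differently cased spelling is no longer silently ignored.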

Revert "When submitting a --test-only job respect the -M option." · f1776471

Brian Christiansen authored Jan 31, 2018

This reverts commit 516b0d59.

This also fixes the NEWS file.
We want to keep the idea of only checking one federation.

f1776471

30 Jan, 2018 9 commits
- When submitting a --test-only job respect the -M option. · 516b0d59
  Brian Christiansen authored Jan 30, 2018
```
Bug 4548
```
  516b0d59
- Revert "When submitting a --test-only job respect the -M option." · 3936ca15
  Brian Christiansen authored Jan 30, 2018
```
This reverts commit fb73b8a4.

# Conflicts:
#	NEWS
```
  3936ca15
- When revoking a sibling job in the federation we want to send a start · 61f69978
  Brian Christiansen authored Jan 30, 2018
```
message before purging the job record to get the uid of the revoked job.

Bug 4502
```
  61f69978
- Revert "When revoking a sibling job in the federation we want to send a start" · e2573d0f
  Danny Auble authored Jan 30, 2018
```
This reverts commit fc0c3e6c.
```
  e2573d0f
- When revoking a sibling job in the federation we want to send a start · fc0c3e6c
  Danny Auble authored Jan 30, 2018
```
message before purging the job record to get the uid of the revoked job.

Bug 4502
```
  fc0c3e6c
- SPANK - When slurm_spank_init_post_opt() fails return error correctly. · c3d9135f
  Morris Jette authored Jan 30, 2018
```
Bug 4651
```
  c3d9135f
- When submitting a --test-only job respect the -M option. · fb73b8a4
  Brian Christiansen authored Jan 30, 2018
```
Bug 4548
```
  fb73b8a4
- Make sure the slurmstepd blocks signals like SIGTERM correctly. · d2c83807
  Danny Auble authored Jan 30, 2018
```
Bug 4634
```
  d2c83807
- CRAY - Remove race in the core_spec where we add the slurmstepd to the · eff8690e
  David Gloe authored Jan 30, 2018
```
job container, where canceling the step would also erroneously cancel
the stepd.

Bug 4634
```
  eff8690e
29 Jan, 2018 3 commits
- burst_buffer/cray: Attempts by job to create persistent burst buffer when · eaddd999
  Morris Jette authored Jan 29, 2018
```
one already exists owned by a different user will be logged and the job
held.

Bug 4614
```
  eaddd999
- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld. · 5b7520b7
  Tim Wickberg authored Jan 29, 2018
  
  5b7520b7
- Fix for association MaxWall enforcement when none is given at submission. · 9143c7c9
  Alejandro Sanchez authored Jan 29, 2018
```
Bug 4681
```
  9143c7c9
25 Jan, 2018 3 commits
- Fix potential memory leak if clean starting and the TRES didn't change · a08b3709
  Danny Auble authored Jan 25, 2018
```
from when last started.

Signed-off-by: Danny Auble <da@schedmd.com>
```
  a08b3709
- Validate command existence on the srun *[pro|epi]log options · 1b027f42
  Felip Moll authored Jan 25, 2018
```
if LaunchParameter test_exec is set.

Bug 4439
```
  1b027f42
- Allow execution of task prolog/epilog when uid has access · 66b40e30
  Felip Moll authored Jan 25, 2018
```
rights by a secondary group id.

Bug 4439
```
  66b40e30
24 Jan, 2018 2 commits
- Copy v17.02 NEWS item to v17.11 · 4b53d53a
  Dominik Bartkiewicz authored Jan 24, 2018
  
  4b53d53a
- Fix whole node allocation cpu counts when --hint=nomultithread, · 252d0573
  Danny Auble authored Jan 24, 2018
```
introduced in commit ea85d123

Bug 4613
```
  252d0573
23 Jan, 2018 1 commit

task/cgroup - add support to detect OOM_KILL cgroup events. · 943c4a13

Alejandro Sanchez authored Jan 23, 2018

Commit 818a09e8 introduced a new state JOB_OOM and a new state reason
FAIL_OOM (OutOfMemory). The problem was that it based the decision upon
the value of the different memory.[*].failcnt being > 0.

That led to "false positive" situations when the usage hit the limit
but the Kernel was able to reclaim pages and the process managed to finish
successfully. When this happens there might not necessarily be OOM_KILL
events happening.

This patch makes it so the JOB_OOM state is set based upon OOM_KILL events
detected instead of usage hitting the limit. The usage hit will still
be logged as an info() message, and further work will be needed in the
master branch to better discern both types of events, maybe changing
the API and getting rid of the current SIG_OOM and a potential new
SIG_OOM_KILL.

OOM_KILL event is detected using the eventfd notification mechanism
on the cgroup v1 control/event files:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

If we plan to support cgroup v2, we should monitor file-modified events
on the 'memory.events' file. Such a notification would mean that one of
the available entries changed its value.
Entries include: low, high, max, oom, oom_kill:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://patchwork.kernel.org/patch/9737381
but since this is a fairly recent change many sites might be running
kernels still not supporting this feature.

Bug 3820.

943c4a13
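
The distinction between "usage hit the limit" and "a process was actually OOM-killed" can be illustrated with the cgroup v2 `memory.events` format mentioned above, which lists one `key value` pair per line. This is a sketch only: the commit itself uses the cgroup v1 eventfd mechanism, and the `demo_*` functions below are hypothetical helpers that operate on the file's text rather than on a live cgroup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up 'key' in text made of "key value" lines.
 * Returns the value, or -1 if the key is not present.
 */
long long demo_events_value(const char *text, const char *key)
{
	assert(text != NULL && key != NULL);

	size_t key_len = strlen(key);
	const char *line = text;

	while (*line) {
		size_t line_len = strcspn(line, "\n");

		/* Require an exact key followed by a single space. */
		if ((line_len > key_len) &&
		    (strncmp(line, key, key_len) == 0) &&
		    (line[key_len] == ' ')) {
			long long value;

			if (sscanf(line + key_len + 1, "%lld", &value) == 1)
				return value;
		}
		line += line_len;
		if (*line == '\n')
			line++;
	}
	return -1;
}

/* Only a positive oom_kill count marks the job as OOM-killed. */
bool demo_was_oom_killed(const char *text)
{
	return demo_events_value(text, "oom_kill") > 0;
}
```

A cgroup whose `max` counter is non-zero but whose `oom_kill` counter is 0 corresponds to the "false positive" case: the limit was hit, yet nothing was killed.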

22 Jan, 2018 3 commits

Revert "Fix uid check when requesting a jobid from a pid." · 5b1d77fb

Danny Auble authored Jan 22, 2018

This reverts commit d3141dc9.

Bug 4655

Turns out there are many ways to get this information directly from
the slurmstepd. As you can already get this information from ps, we
decided to just revert to the old non-authenticated way of doing
things.

If we do need this in the future, we need to patch the stepd as well as
the slurmd here in all the RPCs that try to grab this.

A user could easily run scontrol (or their own home-baked tool)
on the node, which will give them direct contact with the slurmstepd.

5b1d77fb

Revert "Revert "Fix uid check when requesting a jobid from a pid."" · 4a0f4796
Danny Auble authored Jan 22, 2018
```
This reverts commit c4fb9bc3.
```
4a0f4796

Revert "Fix uid check when requesting a jobid from a pid." · c4fb9bc3

Danny Auble authored Jan 22, 2018

This reverts commit d3141dc9.

Bug 4655

Turns out there are many ways to get this information directly from
the slurmstepd. As you can already get this information from ps, we
decided to just revert to the old non-authenticated way of doing
things.

If we do need this in the future, we need to patch the stepd as well as
the slurmd here in all the RPCs that try to grab this.

A user could easily run scontrol (or their own home-baked tool)
on the node, which will give them direct contact with the slurmstepd.

c4fb9bc3