Commits · 943c4a130f39dbb1fba1069b9981fe319c1dae6e · Manuel G. Marciani / ces_slurm_simulator

23 Jan, 2018 2 commits

task/cgroup - add support to detect OOM_KILL cgroup events. · 943c4a13

Alejandro Sanchez authored Jan 23, 2018

Commit 818a09e8 introduced a new state JOB_OOM and a new state reason
FAIL_OOM (OutOfMemory). The problem was that it based the decision upon
the value of the different memory.[*].failcnt being > 0.

That led to false positives when the usage hit the limit but the kernel
was able to reclaim pages and the process managed to finish successfully.
When this happens there might not necessarily be any OOM_KILL events.

This patch makes it so the JOB_OOM state is set based upon OOM_KILL events
detected instead of usage hitting the limit. The usage hit will still
be logged as an info() message, and further work will be needed in the
master branch to better discern both types of events, maybe changing
the API and getting rid of the current SIG_OOM and a potential new
SIG_OOM_KILL.
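
The distinction this patch draws can be sketched as a small decision function. This is only an illustration, assuming the two counters have already been collected; the enum, its labels, and `classify_memory_outcome` are hypothetical names and are not taken from the Slurm source.

```c
#include <stdint.h>

/* Hypothetical outcome labels for illustration; not Slurm's actual enums. */
typedef enum {
	MEM_OK = 0,      /* the limit was never hit */
	MEM_LIMIT_HIT,   /* failcnt > 0 but nothing was killed: log only */
	MEM_OOM_KILLED   /* at least one OOM_KILL event: treat the job as OOM */
} mem_outcome_t;

/*
 * Classify a step from its usage-limit hit count and the number of
 * OOM_KILL events seen. Only a kill event marks the job as OOM; a
 * non-zero failcnt alone is reported but not treated as a failure.
 */
mem_outcome_t classify_memory_outcome(uint64_t failcnt, uint64_t oom_kill_events)
{
	if (oom_kill_events > 0)
		return MEM_OOM_KILLED;
	if (failcnt > 0)
		return MEM_LIMIT_HIT;
	return MEM_OK;
}
```

Under this sketch, a step that hit the limit three times but was never killed yields `MEM_LIMIT_HIT`, which corresponds to the info() message case described above.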

OOM_KILL event is detected using the eventfd notification mechanism
on the cgroup v1 control/event files:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

If we plan to support cgroup v2, we should monitor file-modified events
on the 'memory.events' file. A notification would mean that at least one
of the available entries changed its value.
Entries include: low, high, max, oom, oom_kill:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://patchwork.kernel.org/patch/9737381
Since this is a fairly recent change, many sites might still be running
kernels that do not support this feature.
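
Because a cgroup v2 notification only says that the file changed, a listener would have to re-read it and compare counters. A minimal sketch of looking up one entry, assuming the file content has already been read into a string with one "<name> <value>" pair per line; `memory_events_lookup` is a hypothetical helper, not part of Slurm:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in the text of a cgroup v2 "memory.events" file.
 * Returns 0 and stores the value on success, -1 if the key is absent
 * or its value cannot be parsed.
 */
int memory_events_lookup(const char *text, const char *key, uint64_t *value)
{
	size_t key_len = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Require a space after the key so "oom" does not match "oom_kill". */
		if (!strncmp(line, key, key_len) && line[key_len] == ' ') {
			unsigned long long v;

			if (sscanf(line + key_len + 1, "%llu", &v) != 1)
				return -1;
			*value = (uint64_t) v;
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Comparing the `oom_kill` value before and after a notification would tell the two kinds of event apart: a change in `max` alone corresponds to the usage limit being hit, while a change in `oom_kill` corresponds to a process being killed.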

Bug 3820.

943c4a13

Update What's New html page · 59e5087e
Brian Christiansen authored Jan 22, 2018

59e5087e

22 Jan, 2018 6 commits

Missed some cpu references from commit 7c071ec9 . · 6b57158e
Danny Auble authored Jan 22, 2018

6b57158e

Add missing assoc_mgr_lock_t qos READ_LOCK since g_qos_count is read. · bb12df30
Alejandro Sanchez authored Jan 22, 2018
```
Bug 4656.
```
bb12df30

Revert "Fix uid check when requesting a jobid from a pid." · 5b1d77fb

Danny Auble authored Jan 22, 2018

This reverts commit d3141dc9.

Bug 4655

Turns out there are many ways to get this information directly from
the slurmstepd.  As you can already get this information from ps we
decided to just revert back to the old non-authenticated way of doing
things.

If we do need this in the future we need to patch the stepd as well as
the slurmd here in all the RPC's that try to grab this.

A user could easily run scontrol (or their own home baked thing)
on the node which will give them a direct contact with the slurmstepd.

5b1d77fb

Revert "Revert "Fix uid check when requesting a jobid from a pid."" · 4a0f4796
Danny Auble authored Jan 22, 2018
```
This reverts commit c4fb9bc3.
```
4a0f4796

Revert "Fix uid check when requesting a jobid from a pid." · c4fb9bc3

Danny Auble authored Jan 22, 2018

This reverts commit d3141dc9.

Bug 4655

Turns out there are many ways to get this information directly from
the slurmstepd.  As you can already get this information from ps we
decided to just revert back to the old non-authenticated way of doing
things.

If we do need this in the future we need to patch the stepd as well as
the slurmd here in all the RPC's that try to grab this.

A user could easily run scontrol (or their own home baked thing)
on the node which will give them a direct contact with the slurmstepd.

c4fb9bc3

Fix issues when starting the backup slurmdbd. · 8bb58a31
Danny Auble authored Jan 22, 2018
```
Bug 4656
```
8bb58a31

19 Jan, 2018 1 commit
- Free memory on _file_bcast_register_file failure. · 79be2636
  Morris Jette authored Jan 18, 2018
```
Bug 4619.
```
  79be2636
18 Jan, 2018 7 commits
- Correct dragonfly topology support when job allocation specifies desired · b042789d
  Morris Jette authored Jan 18, 2018
```
switch count.

Bug 4381
```
  b042789d
- Added info string on sh5util when deleting an empty file. · 011b2f23
  Felip Moll authored Jan 16, 2018
```
Bug 4620
```
  011b2f23
- Reject --acctg-freq at submit if invalid. · dda86297
  Danny Auble authored Jan 18, 2018
```
Bug 4620
```
  dda86297
- MYSQL - Fix issue for multi-dimensional machines when using sacct to · a2065801
  Danny Auble authored Jan 18, 2018
```
find jobs that ran on specific nodes.

Bug 4602
```
  a2065801
- Add missing hostlist_find_dims · 7e260158
  Danny Auble authored Jan 18, 2018
```
needed if you are ever working with multi-dimensional systems.

Bug 4602
```
  7e260158
- Fix potentially uninitialized variable in slurmctld. · 3c7da0e7
  Danny Auble authored Jan 18, 2018
```
This isn't a real problem, but older compilers will complain about it.
Newer compilers know that, to reach the place where it could be used,
'have_count' would have to be set.  If that is set, then feature_list
would also be set.
```
  3c7da0e7
- MYSQL - Fix potential abort when attempting to make an account a parent of · 71eb245e
  Danny Auble authored Jan 18, 2018
```
itself.

Bug 4638
```
  71eb245e
17 Jan, 2018 3 commits
- Improve node features / job constraint tests · 0101156d
  Morris Jette authored Jan 16, 2018
  
  0101156d
- Modify test for leading zeros in node names · e1125317
  Morris Jette authored Jan 16, 2018
```
Test 3.11, file inc3.11.9 only runs for some configurations, but
  assumes no leading zeros in node name suffix. When run with
  nodes named "nid[00001-00005]", the test converted the last
  number to its numeric form and was making requests for
  "nid[00001-5]", which would fail.
```
  e1125317
- Tweak test to avoid error on 5 node cluster · 5ab0f647
  Morris Jette authored Jan 16, 2018
  
  5ab0f647
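
The leading-zero problem described for test 3.11 (commit e1125317) comes from formatting one bound of a node range without its original padding. The test itself is not written in C; the following is only a sketch of the idea, assuming the suffix width is known, and `format_node_range` is a hypothetical helper.

```c
#include <stdio.h>

/*
 * Format a node range such as "nid[00001-00005]", zero-padding both
 * bounds to `width` digits so the two ends keep the same form.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int format_node_range(char *buf, size_t len, const char *prefix,
		      int first, int last, int width)
{
	int n = snprintf(buf, len, "%s[%0*d-%0*d]",
			 prefix, width, first, width, last);

	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

Passing the same `width` for both bounds avoids producing a mixed form such as "nid[00001-5]".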
16 Jan, 2018 5 commits
- Fix output file with %t for pack job · 2358aae3
  Morris Jette authored Jan 16, 2018
```
Fix output file containing "%t" (task ID) for heterogeneous job step to
    be based upon global task ID rather than task ID for that component
    of the heterogeneous job step.
```
  2358aae3
- Clarify some srun pack step option processing logic · 217138e0
  Morris Jette authored Jan 16, 2018
```
This expands some comments,
explicitly sets some pointers to NULL after memcpy (these are
   redundant, but add clarity), and
moves a memcpy to avoid modifying the wrong values.
```
  217138e0
- No change in logic, just alphabetize · 54fc30c7
  Morris Jette authored Jan 16, 2018
  
  54fc30c7
- Fix NEWS update from 17.02 missing in commit da2e1160 · fd956448
  Danny Auble authored Jan 16, 2018
  
  fd956448
- Fix heterogeneous step memory corruption · 99cf6884
  Morris Jette authored Jan 16, 2018
```
Fix for possible memory corruption in srun when running heterogeneous
job steps.
bug 4626
```
  99cf6884
12 Jan, 2018 9 commits

Fix potential deadlock in the slurmctld when using list_for_each. · ceb55594

Dominik Bartkiewicz authored Jan 12, 2018

This partially reverts 89bcd975 and aac6bd39

Turns out you can't use a list_for_each and lock something inside the
list_for_each function that does a lock without the write lock.

Bug 4611

ceb55594

Revert "Fix potential deadlock in the slurmctld when using list_for_each." · 529a69e6
Danny Auble authored Jan 12, 2018
```
This reverts commit ff3e77f4.
```
529a69e6

Fix potential deadlock in the slurmctld when using list_for_each. · ff3e77f4

Danny Auble authored Jan 12, 2018

This partially reverts 89bcd975 and aac6bd39

Turns out you can't use a list_for_each and lock something inside the
list_for_each function that does a lock without the write lock.

Bug 4611

ff3e77f4

Docs - remove mention of --max-exit-timeout from heterogeneous_jobs.shtml. · 88852fe4
Tim Wickberg authored Jan 12, 2018
```
Otherwise undocumented, and does not do anything.
Will be removed in 18.08.
```
88852fe4

Copy v17.02 NEWS item to v17.11 · 5ae7505b
Morris Jette authored Jan 12, 2018

5ae7505b

Merge branch 'slurm-17.02' into slurm-17.11 · da2e1160
Morris Jette authored Jan 12, 2018

da2e1160

Fix to aftercorr job array dependency logic · 7b5a3674

Dominik Bartkiewicz authored Jan 12, 2018

Fix job array dependency with the "aftercorr" option when some array tasks
    in the first job fail. This fix lets all task array elements that can run
    proceed rather than stopping all subsequent task array elements.
Bug 4590
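
The per-task behaviour described above can be sketched as follows. This is a simplification that assumes every task of the first array has already finished and that only success or failure matters; `aftercorr_runnable` and its arguments are hypothetical and do not reflect the slurmctld data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * For an "aftercorr" dependency, decide per task whether task i of the
 * dependent array may run: it may if task i of the first array
 * succeeded. A failed task only blocks its own counterpart, not the
 * tasks that follow it. Returns the number of runnable tasks.
 */
size_t aftercorr_runnable(const bool *first_succeeded, bool *may_run,
			  size_t ntasks)
{
	size_t runnable = 0;

	for (size_t i = 0; i < ntasks; i++) {
		may_run[i] = first_succeeded[i];
		if (may_run[i])
			runnable++;
	}
	return runnable;
}
```

With four tasks where only the second one failed, tasks one, three, and four of the dependent array remain runnable.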

7b5a3674

Global environment was not set correctly in srun · aff20b90

Felip Moll authored Jan 12, 2018

Creating a copy of the actual environment in env->env defines a new pointer,
so subsequent calls to setup_env and setenvf define variables in this new
copy rather than in the global environment.
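
The effect can be reproduced outside Slurm with plain libc calls. The sketch below assumes a POSIX system; `env_array_copy` is a hypothetical helper, not Slurm's env handling. It shows that an entry added to a copied array is invisible to getenv(), which only consults the global `environ`.

```c
#include <stdlib.h>

extern char **environ;

/*
 * Duplicate the process environment into a new NULL-terminated array
 * with room for `extra` more entries. The strings are shared with the
 * original; the array itself is a separate object. Stores the number
 * of copied entries in *count_out. Returns NULL on allocation failure.
 */
char **env_array_copy(size_t extra, size_t *count_out)
{
	size_t n = 0;
	char **copy;

	while (environ && environ[n])
		n++;

	/* calloc zero-fills, so the unused tail is already NULL-terminated. */
	copy = calloc(n + extra + 1, sizeof(char *));
	if (!copy)
		return NULL;

	for (size_t i = 0; i < n; i++)
		copy[i] = environ[i];

	*count_out = n;
	return copy;
}
```

Appending "NAME=value" at index `*count_out` of the returned array changes only the copy. A later `getenv("NAME")` still returns NULL until `setenv()` is called on the global environment.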

Bug 4615

aff20b90

srun env var logic fix · bc7838d9

Morris Jette authored Jan 12, 2018

This fixes the problem introduced by commit
777a45f9 and maintains proper PMIx
operation.
bugs 4132 and 4615

bc7838d9

11 Jan, 2018 6 commits
- Update NEWS for v17.02 fix moved to v17.11 · 42d06c9e
  Morris Jette authored Jan 11, 2018
  
  42d06c9e
- Merge branch 'slurm-17.02' into slurm-17.11 · 264fb2e6
  Morris Jette authored Jan 11, 2018
  
  264fb2e6
- node_feature/knl_cray - Fix memory leak · deaacad2
  Morris Jette authored Jan 11, 2018
```
node_feature/knl_cray - Fix memory leak that can occur during normal
    operation. This will happen when an update request for a specific
    node happens.
```
  deaacad2
- Note update in v17.02 copied to v17.11 · 53ad8671
  Morris Jette authored Jan 11, 2018
  
  53ad8671
- Merge branch 'slurm-17.02' into slurm-17.11 · f0ccbb03
  Morris Jette authored Jan 11, 2018
  
  f0ccbb03
- node_feature/knl_cray - Fix memory leaks · 32c93fce
  Morris Jette authored Jan 11, 2018
```
If CnselectPath and/or SyscfgPath are defined in the knl_cray.conf file
  and slurmctld is reconfigured, the original values of those parameters
  would be overwritten and their memory leaked.
```
  32c93fce
10 Jan, 2018 1 commit

Prevent possible double-xfree on buffer · 5c8973dd

Felip Moll authored Dec 22, 2017

Use FREE_NULL_BUFFER instead, otherwise we could attempt to
free_buffer a second time if we jump to the rwfail label.
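
A free-and-reset macro makes the second cleanup harmless because the pointer is NULL by then. The sketch below uses hypothetical `demo_` names and a stand-in buffer type; it is modeled on the idea behind FREE_NULL_BUFFER but is not the Slurm definition.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a buffer type; not the real Slurm buffer. */
typedef struct {
	char *data;
	size_t size;
} demo_buf_t;

/* Allocate a buffer with `size` bytes of payload; returns NULL on failure. */
demo_buf_t *demo_create_buffer(size_t size)
{
	demo_buf_t *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = malloc(size ? size : 1);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->size = size;
	return b;
}

/* Release the payload and the buffer itself. */
void demo_free_buffer(demo_buf_t *b)
{
	if (b) {
		free(b->data);
		free(b);
	}
}

/* Free the buffer and reset the pointer so a later cleanup path sees NULL. */
#define DEMO_FREE_NULL_BUFFER(b)		\
	do {					\
		if (b)				\
			demo_free_buffer(b);	\
		(b) = NULL;			\
	} while (0)
```

If a function frees the buffer with the macro and later jumps to an error label that frees it again, the second invocation finds NULL and does nothing.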

Bug 4491

5c8973dd