Commits · 1ca68c7582542602f877898e03a4e5a0d8cb14da · Manuel G. Marciani / ces_slurm_simulator

30 Jan, 2018 5 commits
- Remove "job_id" field from task_p_slurmd_reserve_resources() · 1ca68c75
  Morris Jette authored Jan 30, 2018
```
The argument is redundant and possibly confusing for pack jobs
```
  1ca68c75
- Remove "job_id" from task_p_slurmd_batch_request · 5546d46c
  Morris Jette authored Jan 30, 2018
```
The value is redundant and possibly misleading for pack jobs
```
  5546d46c
- Remove "job_id" field from task_p_slurmd_launch_request() · 886bc01b
  Morris Jette authored Jan 30, 2018
```
The information is redundant, and possibly inaccurate for
  heterogeneous jobs
```
  886bc01b
- Make sure the slurmstepd blocks signals like SIGTERM correctly. · d2c83807
  Danny Auble authored Jan 30, 2018
```
Bug 4634
```
  d2c83807
- CRAY - Remove race in the core_spec where we add the slurmstepd to the · eff8690e
  David Gloe authored Jan 30, 2018
```
job container; if the step was canceled, the stepd would erroneously be
canceled as well.

Bug 4634
```
  eff8690e
29 Jan, 2018 4 commits
- burst_buffer/cray: Attempts by job to create persistent burst buffer when · eaddd999
  Morris Jette authored Jan 29, 2018
```
one already exists owned by a different user will be logged and the job
held.

Bug 4614
```
  eaddd999
- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld. · 5b7520b7
  Tim Wickberg authored Jan 29, 2018
  
  5b7520b7
- Fix for association MaxWall enforcement when none is given at submission. · 9143c7c9
  Alejandro Sanchez authored Jan 29, 2018
```
Bug 4681
```
  9143c7c9
- Update NEWS and RELEASE_NOTES · 1a9071e1
  Morris Jette authored Jan 29, 2018
```
In preparation for release of v18.08.0-pre1 later this week.
```
  1a9071e1
25 Jan, 2018 5 commits
- Fix potential memory leak if clean starting and the TRES didn't change · a08b3709
  Danny Auble authored Jan 25, 2018
```
from when last started.

Signed-off-by: Danny Auble <da@schedmd.com>
```
  a08b3709
- Fix issue where unpacking job state after TRES count changed could lead to · e45d0662
  Brian Christiansen authored Jan 25, 2018
```
invalid reads.

Bug 4664
```
  e45d0662
- Validate command existence on the srun *[pro|epi]log options · 1b027f42
  Felip Moll authored Jan 25, 2018
```
if LaunchParameter test_exec is set.

Bug 4439
```
  1b027f42
- Allow execution of task prolog/epilog when uid has access · 66b40e30
  Felip Moll authored Jan 25, 2018
```
rights by a secondary group id.

Bug 4439
```
  66b40e30
- Docs - fix typos. · d0c984e6
  Dominik Bartkiewicz authored Jan 25, 2018
  
  d0c984e6
24 Jan, 2018 5 commits

Fix long time FIXME (2002): wait until backup slurmctld is for real done · 312a52ba

Danny Auble authored Jan 24, 2018

before the primary takes control.

The primary now waits until the backup has replied that it is completely
done shutting down before it takes over. Previously it just waited 2
seconds, which was probably OK in most situations. Now we don't return
until the backup is completely out.

This also raises CONTROL_TIMEOUT from 10 to 30 seconds.

312a52ba

- Defer job signaling until prolog is completed. · df1adea9
  Dominik Bartkiewicz authored Jan 24, 2018
```
Bug 4446
```
  df1adea9
- Copy v17.02 NEWS item to v17.11 · 4b53d53a
  Dominik Bartkiewicz authored Jan 24, 2018
  
  4b53d53a
- Fix whole node allocation cpu counts when --hint=nomultithread, · 252d0573
  Danny Auble authored Jan 24, 2018
```
introduced in commit ea85d123

Bug 4613
```
  252d0573

Expand reservation feature specification · 411470b8

Morris Jette authored Jan 23, 2018

Expand advanced reservation feature specification to support parentheses and
counts of nodes with specified features. Nodes with the feature currently
active will be preferred.

bug 4604

411470b8

23 Jan, 2018 1 commit

task/cgroup - add support to detect OOM_KILL cgroup events. · 943c4a13

Alejandro Sanchez authored Jan 23, 2018

Commit 818a09e8 introduced a new state JOB_OOM and a new state reason
FAIL_OOM (OutOfMemory). The problem was that it based the decision upon
the value of the different memory.[*].failcnt being > 0.

That led to "false positive" situations where the usage hit the limit
but the kernel was able to reclaim pages and the process managed to finish
successfully. When this happens, there are not necessarily any OOM_KILL
events.

This patch sets the JOB_OOM state based upon detected OOM_KILL events
instead of usage hitting the limit. The usage hit will still
be logged as an info() message, and further work will be needed in the
master branch to better discern both types of events, maybe changing
the API and getting rid of the current SIG_OOM and a potential new
SIG_OOM_KILL.

OOM_KILL event is detected using the eventfd notification mechanism
on the cgroup v1 control/event files:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

If we plan to support cgroup v2, we should monitor the 'memory.events' file
for modification events. A notification would mean that at least one of the
available entries changed its value.
Entries include: low, high, max, oom, oom_kill:
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
https://patchwork.kernel.org/patch/9737381
but since this is a fairly recent change many sites might be running
kernels still not supporting this feature.
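
For the cgroup v2 case, telling an actual kill apart from a limit hit comes down to reading one counter out of 'memory.events'. The sketch below assumes the file content has already been read into a string and that it uses the flat "name value" per line layout from the cgroup v2 documentation; `memory_events_get()` is a hypothetical helper, not part of Slurm.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter (for example "oom_kill") in the text of a cgroup v2
 * memory.events file, which holds one "<name> <value>" pair per line.
 * Returns 0 and stores the value on success, -1 if the key is absent.
 */
int memory_events_get(const char *text, const char *key, uint64_t *value)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Match the whole key so "oom" does not match "oom_kill". */
		if (!strncmp(line, key, klen) && line[klen] == ' ' &&
		    sscanf(line + klen + 1, "%" SCNu64, value) == 1)
			return 0;
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Comparing the `oom_kill` value before and after a notification would show whether a kill really happened, as opposed to `max` increasing because usage merely touched the limit.
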

Bug 3820.

943c4a13

22 Jan, 2018 5 commits

Correct SLURM_NTASKS for pack step · f431ca59

Morris Jette authored Jan 22, 2018

Correct the SLURM_NTASKS and SLURM_NPROCS environment variables for a
heterogeneous job step. Report values representing the full allocation.

f431ca59

Revert "Fix uid check when requesting a jobid from a pid." · 5b1d77fb

Danny Auble authored Jan 22, 2018

This reverts commit d3141dc9.

Bug 4655

It turns out there are many ways to get this information directly from
the slurmstepd. As you can already get this information from ps, we
decided to just revert to the old non-authenticated way of doing
things.

If we do need this in the future, we need to patch the stepd as well as
the slurmd here in all the RPCs that try to grab this.

A user could easily run scontrol (or their own home-baked tool)
on the node, which gives them direct contact with the slurmstepd.

5b1d77fb

- Revert "Revert "Fix uid check when requesting a jobid from a pid."" · 4a0f4796
  Danny Auble authored Jan 22, 2018
```
This reverts commit c4fb9bc3.
```
  4a0f4796

Revert "Fix uid check when requesting a jobid from a pid." · c4fb9bc3

Danny Auble authored Jan 22, 2018

This reverts commit d3141dc9.

Bug 4655

It turns out there are many ways to get this information directly from
the slurmstepd. As you can already get this information from ps, we
decided to just revert to the old non-authenticated way of doing
things.

If we do need this in the future, we need to patch the stepd as well as
the slurmd here in all the RPCs that try to grab this.

A user could easily run scontrol (or their own home-baked tool)
on the node, which gives them direct contact with the slurmstepd.

c4fb9bc3

- Fix issues when starting the backup slurmdbd. · 8bb58a31
  Danny Auble authored Jan 22, 2018
```
Bug 4656
```
  8bb58a31

19 Jan, 2018 4 commits
- Add job state of SO/STAGE_OUT · 6a67af86
  Morris Jette authored Jan 19, 2018
```
bug 4607
```
  6a67af86
- Add acct_gather_profile/influxdb plugin. · ab6155c0
  Alejandro Sanchez authored Dec 11, 2017
```
Original source:
https://github.com/cfenoy/influxdb-slurm-monitoring
```
  ab6155c0
- Free memory on _file_bcast_register_file failure. · 79be2636
  Morris Jette authored Jan 18, 2018
```
Bug 4619.
```
  79be2636
- Report NodeFeatures plugin config · f2efbb60
  Felip Moll authored Jan 18, 2018
```
Report NodeFeatures plugin configuration with scontrol and sview commands.
bug 4036
```
  f2efbb60
18 Jan, 2018 6 commits
- Correct dragonfly topology support when job allocation specifies desired · b042789d
  Morris Jette authored Jan 18, 2018
```
switch count.

Bug 4381
```
  b042789d
- Added info string on sh5util when deleting an empty file. · 011b2f23
  Felip Moll authored Jan 16, 2018
```
Bug 4620
```
  011b2f23
- Reject --acctg-freq at submit if invalid. · dda86297
  Danny Auble authored Jan 18, 2018
```
Bug 4620
```
  dda86297
- MYSQL - Fix issue for multi-dimensional machines when using sacct to · a2065801
  Danny Auble authored Jan 18, 2018
```
find jobs that ran on specific nodes.

Bug 4602
```
  a2065801
- Fix potentially uninitialized variable in slurmctld. · 3c7da0e7
  Danny Auble authored Jan 18, 2018
```
This isn't a real problem, but older compilers will complain about it.
Newer compilers know that, to reach the place where it could be used,
'have_count' would have to be set. If that is set, then feature_list
would also be set.
```
  3c7da0e7
- MYSQL - Fix potential abort when attempting to make an account a parent of · 71eb245e
  Danny Auble authored Jan 18, 2018
```
itself.

Bug 4638
```
  71eb245e
17 Jan, 2018 1 commit
- Now programs can be checked before execution if test_exec is set · cf4d2145
  Felip Moll authored Jan 17, 2018
```
when using multi-prog option.

Bug 4621
```
  cf4d2145
16 Jan, 2018 3 commits
- Fix output file with %t for pack job · 2358aae3
  Morris Jette authored Jan 16, 2018
```
Fix output file containing "%t" (task ID) for heterogeneous job step to
    be based upon global task ID rather than task ID for that component
    of the heterogeneous job step.
```
  2358aae3
- Fix NEWS update from 17.02 missing in commit da2e1160 · fd956448
  Danny Auble authored Jan 16, 2018
  
  fd956448
- Fix heterogeneous step memory corruption · 99cf6884
  Morris Jette authored Jan 16, 2018
```
Fix for possible memory corruption in srun when running heterogeneous
job steps.
bug 4626
```
  99cf6884
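
The "%t" fix in 2358aae3 can be pictured as follows. This is a minimal sketch, assuming each component of a heterogeneous job step knows the offset of its first task in the global numbering. `expand_task_id`, `task_offset` and `local_id` are hypothetical names rather than Slurm's own code, and real filename patterns support more specifiers than the single one handled here.

```c
#include <stdio.h>
#include <string.h>

/*
 * Expand every "%t" in pattern with the global task ID, computed as the
 * component's task offset plus the task's rank within that component.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int expand_task_id(const char *pattern, unsigned task_offset,
		   unsigned local_id, char *out, size_t out_len)
{
	size_t used = 0;

	if (out_len == 0)
		return -1;

	for (const char *p = pattern; *p; p++) {
		int n;

		if (p[0] == '%' && p[1] == 't') {
			/* Global ID, not the ID within the component. */
			n = snprintf(out + used, out_len - used, "%u",
				     task_offset + local_id);
			p++;	/* skip the 't' */
		} else {
			n = snprintf(out + used, out_len - used, "%c", *p);
		}
		/* snprintf reports the length it needed; reject truncation. */
		if (n < 0 || (size_t)n >= out_len - used)
			return -1;
		used += (size_t)n;
	}
	out[used] = '\0';
	return 0;
}
```

With a first component of four tasks, the second component's task 1 would call `expand_task_id("job-%t.out", 4, 1, buf, sizeof(buf))` and write to a file distinct from the first component's task 1.
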
12 Jan, 2018 1 commit

Fix potential deadlock in the slurmctld when using list_for_each. · ceb55594

Dominik Bartkiewicz authored Jan 12, 2018

This partially reverts 89bcd975 and aac6bd39.

It turns out you can't take a lock inside the function passed to
list_for_each unless the write lock is already held.

Bug 4611

ceb55594