Commits · 32c87e5d42d7e2d06a8ef9200e7de02942591172 · Manuel G. Marciani / ces_slurm_simulator

07 Feb, 2018 9 commits

Fix limits enforce order when they're set at partition and other levels. · 2ef56d4b

Alejandro Sanchez authored Feb 07, 2018

Previously the MIN of all configured limits was taken, without
respecting the precedence order.

Also add a note to the resource_limits.html page to clarify the
exception for Max[Wall|Time] and/or [Max|Min]Nodes limits, where
by default the Partition limit takes precedence, unless the
respective flags Partition[Min|Max|Time]Limit are set on the
job's QOS.

Bug 4681.

2ef56d4b
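
The difference between "take the MIN" and "respect the order" can be illustrated with a minimal sketch. It assumes the limits are held in an array sorted from highest to lowest precedence and that an unset limit is marked by a sentinel. The names `DEMO_LIMIT_UNSET`, `demo_limit_by_min()` and `demo_limit_by_order()` are hypothetical and are not taken from the Slurm source. The Partition exception described in the note is deliberately not modeled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "no limit configured at this level". */
#define DEMO_LIMIT_UNSET UINT32_MAX

/* Old behavior as described in the commit: smallest value wins,
 * no matter which level defined it. */
uint32_t demo_limit_by_min(const uint32_t *limits, size_t n)
{
	uint32_t min = DEMO_LIMIT_UNSET;

	for (size_t i = 0; i < n; i++) {
		if (limits[i] < min)
			min = limits[i];
	}
	return min;
}

/* Order-respecting behavior: limits[] is sorted from highest to lowest
 * precedence, and the first level that defines a limit wins. */
uint32_t demo_limit_by_order(const uint32_t *limits, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (limits[i] != DEMO_LIMIT_UNSET)
			return limits[i];
	}
	return DEMO_LIMIT_UNSET;
}
```

With the levels `{unset, 120, 60}`, the MIN approach yields 60 while the ordered approach yields 120, because the second level is the first one that defines a limit.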

Link slurmd against all libraries that slurmstepd links to. · 9bfc98e4

Danny Auble authored Feb 06, 2018

This prevents a hard-to-diagnose issue where slurmstepd may fail
to start due to a missing library. Now slurmd itself will fail to
start, keeping the node down until the library issue can be fixed.

Bug 4645, 4644.

9bfc98e4

Add fatal_abort() function to log a last fatal message then abort. · d2ddc71a

Danny Auble authored Feb 06, 2018

fatal() calls exit(1), which precludes getting a backtrace.
That's fine for configuration issues and other kinds of problems,
but when an "impossible" edge case is hit, a core dump may
be the only way to isolate the issue.

Adding to 17.11 so we can easily provide diagnostic patches without
needing users to back-port this implementation. Further use will
come in 18.08.

Bug 4599.

d2ddc71a

Start NEWS for v17.11.4. · 42ade47b

Tim Wickberg authored Feb 06, 2018

42ade47b

Add NEWS entry for 17.11.3-2. · 2c459fec

Tim Wickberg authored Feb 06, 2018

2c459fec

Revert "Preserve node features on reconfig" · df966ee4

Tim Wickberg authored Feb 06, 2018

This reverts commit 18b35709.

df966ee4

Revert "Fix segfault related to un-reset save_ptr value pointing into invalid memory." · 64cb091a

Tim Wickberg authored Feb 06, 2018

This reverts commit 6a91845b.

64cb091a

Fix segfault related to un-reset save_ptr value pointing into invalid memory. · 6a91845b

Tim Wickberg authored Feb 06, 2018

6a91845b

Start NEWS for v17.11.4 · 08664ed7

Tim Wickberg authored Feb 06, 2018

08664ed7

06 Feb, 2018 3 commits
- Prevent orphaned step_extern processes from job cancellation while running the prolog. · 108502e9
  Danny Auble authored Feb 06, 2018
```
Wait for the prolog to complete, and then clean up after it.
Otherwise orphaned "sleep" executables will be created by the
still-launching step_extern.

Bug 4718.
```
  108502e9
- Preserve node features on reconfig · 18b35709
  Felip Moll authored Feb 06, 2018
```
Preserve node features when slurmctld daemons are reconfigured,
including active and available KNL features.

Bug 4734.
```
  18b35709
- Docs - update squeue man page to describe all possible job states. · 0d327bd6
  Isaac Hartung authored Feb 06, 2018
```
Bug 4630.
```
  0d327bd6
05 Feb, 2018 1 commit
- Better debug messages when MaxSubmitJobs is hit. · 785771b2
  Brian Christiansen authored Feb 05, 2018
```
Bug 4722
```
  785771b2
01 Feb, 2018 3 commits

Prevent multiple calls to pthread_atfork() after re-init of config regex. · 5c9f1dd5

Regine Gaudin authored Feb 01, 2018

keyvalue_initialized is reset on 'scontrol reconfigure' and in other
cases, which can lead to additional atfork handlers being registered.

These can eventually lead to a segfault if an excessive number of
handlers have been re-registered. Set a separate boolean to protect
against this. Clear that boolean as part of the atfork handler.

Bug 4628.

5c9f1dd5
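
The guard described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: only `keyvalue_initialized` and `pthread_atfork()` come from the message, while `atfork_installed`, `demo_atfork_registrations`, `demo_keyvalue_init()` and `demo_reconfigure()` are hypothetical names. Locking is omitted for brevity.

```c
#include <pthread.h>
#include <stdbool.h>

static bool keyvalue_initialized = false; /* reset on reconfigure */
static bool atfork_installed = false;     /* hypothetical separate guard */
int demo_atfork_registrations = 0;        /* counts registrations, for illustration only */

/* Child-side fork handler: the child must redo its initialization,
 * so both flags are cleared here, as the commit message describes. */
static void demo_atfork_child(void)
{
	keyvalue_initialized = false;
	atfork_installed = false;
}

/* (Re)initialize the key/value parsing state. Not thread-safe as written. */
void demo_keyvalue_init(void)
{
	if (keyvalue_initialized)
		return;

	/* ... compile regexes and build other state here ... */
	keyvalue_initialized = true;

	/* Register the handler only once, even if the state above is
	 * rebuilt many times after reconfigure requests. */
	if (!atfork_installed &&
	    pthread_atfork(NULL, NULL, demo_atfork_child) == 0) {
		atfork_installed = true;
		demo_atfork_registrations++;
	}
}

/* Stand-in for the effect of 'scontrol reconfigure' on this state. */
void demo_reconfigure(void)
{
	keyvalue_initialized = false;
}
```

Without the second boolean, every call to `demo_keyvalue_init()` after a reconfigure would register another handler, since `pthread_atfork()` offers no way to remove one.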

Treat parameters in JobAcctGather plugin as case-insensitive. · 41343076

Felip Moll authored Feb 01, 2018

UsePss was correct, but UsePSS and usepss would be silently ignored,
leading to confusion as to whether the option was working or not.

Treat all JobAcctGatherParams as case-insensitive to avoid confusion.

Bug 4637.

41343076
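
A case-insensitive lookup of this kind can be sketched as below. It assumes the parameters arrive as a comma-separated string and uses the POSIX `strncasecmp()` function. `demo_param_is_set()` is a hypothetical helper, not the function used by the plugin.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Return true if 'name' appears as a whole token in the comma-separated
 * list 'params', ignoring case. 'params' may be NULL; 'name' must not be. */
bool demo_param_is_set(const char *params, const char *name)
{
	size_t name_len = strlen(name);
	const char *p = params;

	while (p && *p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* Compare whole tokens only, so "UsePssX" does not match "UsePss". */
		if (len == name_len && strncasecmp(p, name, len) == 0)
			return true;

		p = end ? end + 1 : NULL;
	}
	return false;
}
```

With this approach `UsePss`, `UsePSS` and `usepss` are all accepted, while a longer token that merely starts with the same letters is not.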

Revert "When submitting a --test-only job respect the -M option." · f1776471

Brian Christiansen authored Jan 31, 2018

This reverts commit 516b0d59.

This also fixes the NEWS file.
We want to keep the idea of checking only one federation.

f1776471

30 Jan, 2018 12 commits
- Remove "job_id" field from task_p_slurmd_reserve_resources() · 1ca68c75
  Morris Jette authored Jan 30, 2018
```
The argument is redundant and possibly confusing for pack jobs
```
  1ca68c75
- When submitting a --test-only job respect the -M option. · 516b0d59
  Brian Christiansen authored Jan 30, 2018
```
Bug 4548
```
  516b0d59
- Revert "When submitting a --test-only job respect the -M option." · 3936ca15
  Brian Christiansen authored Jan 30, 2018
```
This reverts commit fb73b8a4.

# Conflicts:
#	NEWS
```
  3936ca15
- When revoking a sibling job in the federation we want to send a start · 61f69978
  Brian Christiansen authored Jan 30, 2018
```
message before purging the job record to get the uid of the revoked job.

Bug 4502
```
  61f69978
- Revert "When revoking a sibling job in the federation we want to send a start" · e2573d0f
  Danny Auble authored Jan 30, 2018
```
This reverts commit fc0c3e6c.
```
  e2573d0f
- When revoking a sibling job in the federation we want to send a start · fc0c3e6c
  Danny Auble authored Jan 30, 2018
```
message before purging the job record to get the uid of the revoked job.

Bug 4502
```
  fc0c3e6c
- Remove "job_id" from task_p_slurmd_batch_request · 5546d46c
  Morris Jette authored Jan 30, 2018
```
The value is redundant and possibly misleading for pack jobs
```
  5546d46c
- Remove "job_id" field from task_p_slurmd_launch_request() · 886bc01b
  Morris Jette authored Jan 30, 2018
```
The information is redundant, and possibly inaccurate for
  heterogeneous jobs
```
  886bc01b
- SPANK - When slurm_spank_init_post_opt() fails return error correctly. · c3d9135f
  Morris Jette authored Jan 30, 2018
```
Bug 4651
```
  c3d9135f
- When submitting a --test-only job respect the -M option. · fb73b8a4
  Brian Christiansen authored Jan 30, 2018
```
Bug 4548
```
  fb73b8a4
- Make sure the slurmstepd blocks signals like SIGTERM correctly. · d2c83807
  Danny Auble authored Jan 30, 2018
```
Bug 4634
```
  d2c83807
- CRAY - Remove race in the core_spec where we add the slurmstepd to the · eff8690e
  David Gloe authored Jan 30, 2018
```
job container, where canceling the step would also erroneously
cancel the stepd.

Bug 4634
```
  eff8690e
29 Jan, 2018 4 commits
- burst_buffer/cray: Attempts by job to create persistent burst buffer when · eaddd999
  Morris Jette authored Jan 29, 2018
```
one already exists owned by a different user will be logged and the job
held.

Bug 4614
```
  eaddd999
- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld. · 5b7520b7
  Tim Wickberg authored Jan 29, 2018
  
  5b7520b7
- Fix for association MaxWall enforcement when none is given at submission. · 9143c7c9
  Alejandro Sanchez authored Jan 29, 2018
```
Bug 4681
```
  9143c7c9
- Update NEWS and RELEASE_NOTES · 1a9071e1
  Morris Jette authored Jan 29, 2018
```
In preparation for release of v18.08.0-pre1 later this week.
```
  1a9071e1
25 Jan, 2018 5 commits
- Fix potential memory leak if clean starting and the TRES didn't change · a08b3709
  Danny Auble authored Jan 25, 2018
```
from when last started.

Signed-off-by: Danny Auble <da@schedmd.com>
```
  a08b3709
- Fix issue where unpacking job state after TRES count changed could lead to · e45d0662
  Brian Christiansen authored Jan 25, 2018
```
invalid reads.

Bug 4664
```
  e45d0662
- Validate command existence on the srun *[pro|epi]log options · 1b027f42
  Felip Moll authored Jan 25, 2018
```
if the LaunchParameters option test_exec is set.

Bug 4439
```
  1b027f42
- Allow execution of task prolog/epilog when uid has access · 66b40e30
  Felip Moll authored Jan 25, 2018
```
rights through a secondary group ID.

Bug 4439
```
  66b40e30
- Docs - fix typos. · d0c984e6
  Dominik Bartkiewicz authored Jan 25, 2018
  
  d0c984e6
24 Jan, 2018 3 commits

Fix long-standing FIXME (2002): wait until backup slurmctld is really done · 312a52ba

Danny Auble authored Jan 24, 2018

before the primary takes control.

The primary now waits until the backup has replied that it has
completely finished shutting down before it takes over. Previously it
would just wait 2 seconds, which was probably OK in most situations.
Now we don't return until the backup is completely out.

This also raises CONTROL_TIMEOUT from 10 to 30 seconds.

312a52ba

Defer job signaling until prolog is completed. · df1adea9

Dominik Bartkiewicz authored Jan 24, 2018

Bug 4446

df1adea9

Copy v17.02 NEWS item to v17.11 · 4b53d53a

Dominik Bartkiewicz authored Jan 24, 2018

4b53d53a