- 09 Dec, 2018 1 commit
-
-
Tim Wickberg authored
New X11 forwarding code will only support forwarding back to salloc or an allocating srun command. Using this option within sbatch was always hit-or-miss: if the submitting user was disconnected from the alloc host for any reason, their xauth credentials would likely fail, even if they managed to get assigned the same local TCP port for forwarding. Bug 3647.
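For context, a minimal sketch of the supported pattern, assuming the --x11 option of this Slurm series; forwarding is set up by the allocating command itself:

    # Interactive allocation with X11 forwarding back to the launching host
    salloc --x11
    # Or directly through an allocating srun
    srun --x11 --pty xterm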
-
- 07 Dec, 2018 4 commits
-
-
Matthias Gerstner authored
On some systems there can be multiple user accounts for uid 0, so the check for the literal user name "root" might be insufficient. Bug 6184
-
Matthias Gerstner authored
This pam module is tailored towards running in the context of remote ssh logins. When running in a different context, such as a local sudo call, the module could be influenced by, e.g., environment variables like SLURM_CONF being passed in. Limiting the module to perform its actions only in the sshd context by default avoids this situation. An additional pam module argument, service=<service>, allows an administrator to control this behavior if different behavior is explicitly desired. Bug 6184
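A hedged sketch of how the argument might appear in a PAM stack (module name and alternate service value are illustrative):

    # /etc/pam.d/sshd -- default: the module only acts for the sshd service
    account    required    pam_slurm_adopt.so
    # Explicitly permit a different PAM service if that is really wanted:
    # account    required    pam_slurm_adopt.so service=login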
-
Nate Rini authored
Only print a warning for 18.08. If a user has the SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables set for some reason, this situation could be happening by accident, and we don't want to prevent the srun command from launching steps at this point. Bug 6058.
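One plausible (purely illustrative) way to hit this by accident: a job inherits a per-node memory value from the environment while the step requests memory per CPU:

    # SLURM_MEM_PER_NODE was exported earlier (e.g. by a wrapper script)
    export SLURM_MEM_PER_NODE=2048
    # The step asks for per-CPU memory; 18.08 now warns here instead of
    # refusing to launch the step
    srun --mem-per-cpu=1024 ./my_step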
-
Broderick Gardner authored
Bug 5648.
-
- 06 Dec, 2018 5 commits
-
-
Janne Blomqvist authored
The Linux kernel default hard limit of 4096 for the number of file descriptors is quite small. Debian/Ubuntu have long overridden this, increasing it to 1M. Recently systemd also bumped the default to 512k.
https://github.com/systemd/systemd/blob/master/NEWS
https://github.com/systemd/systemd/pull/10244
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ZN5TK3D6L7SE46KGXICUKLKPX2LQISVX/
https://github.com/systemd/systemd/commit/09dad04c49cae3ad2b319c9b4e7773fedd34309a
Here the limits are increased as follows:
- slurmd: 128k; some workloads like Hadoop/Spark need a lot of fds, and recommend that the limit is increased to at least 64k.
- slurmctld: 64k; per the Slurm high throughput and big system guides, which recommend a file-max of at least 32k.
- slurmdbd: 64k, matching slurmctld; though slurmdbd shouldn't need that many fds, bumping the limit shouldn't hurt either.
Bug 6171
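The commit adjusts the shipped systemd unit files; an equivalent local override would look like this (drop-in path illustrative, value from the slurmd figure above):

    # /etc/systemd/system/slurmd.service.d/limits.conf
    [Service]
    LimitNOFILE=131072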
-
Tim Wickberg authored
Bug 5248
-
Mike Nolta authored
Bug 6055
-
Mike Nolta authored
Add the following slurmctld return codes to the lua plugin: ESLURM_ACCESS_DENIED ESLURM_ACCOUNTING_POLICY ESLURM_INVALID_NODE_COUNT ESLURM_JOB_MISSING_SIZE_SPECIFICATION ESLURM_MISSING_TIME_LIMIT Bug 6055
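A hypothetical job_submit.lua fragment using one of the newly exposed codes (the policy itself and the field check are illustrative, not part of this commit):

    -- Reject jobs submitted without an explicit time limit
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.time_limit == slurm.NO_VAL then
            return slurm.ESLURM_MISSING_TIME_LIMIT
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end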
-
Tim Wickberg authored
Rework one timer error message while here. Bug 5861.
-
- 05 Dec, 2018 10 commits
-
-
Felip Moll authored
Backups already run it when dropping to backup. Bug 6098.
-
Felip Moll authored
Backups already run it when dropping to backup. Bug 6098.
-
Marshall Garey authored
Also throw an error message within stepd_available() if the nodename is not set or cannot be inferred correctly. Bug 5399.
-
Tim Wickberg authored
-
Trey Dockendorf authored
Bug 6120
-
Tim Wickberg authored
Bug 6155
-
Tim Wickberg authored
Bug 6155
-
Felip Moll authored
When bf_continue is set and locks are released during a backfill cycle, other operations can make new resources available while the scheduler is partway through the queue. When backfill continues the cycle and evaluates new jobs, it may allocate some of these newly available resources to lower priority jobs, rather than to higher priority jobs that were already considered in this backfill cycle. This patch introduces bf_ignore_newly_avail_nodes to SchedulerParameters to solve this issue. With this option set, the backfill scheduler will ignore nodes made available while it yielded its locks when resuming the backfill cycle. Bug 5279.
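An illustrative slurm.conf line combining the two parameters named above:

    # Let backfill yield locks mid-cycle, but ignore nodes that became
    # available after the cycle started
    SchedulerParameters=bf_continue,bf_ignore_newly_avail_nodes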
-
Danny Auble authored
slurmd yet delivering its TRES list. Bug 6122 Co-authored-by: Marshall Garey <marshall@schedmd.com>
-
Nate Rini authored
Bug 6008
-
- 04 Dec, 2018 7 commits
-
-
Nate Rini authored
Bug 6008
-
Morris Jette authored
then an error is generated if more than one of those specifications contains KNL NUMA or MCDRAM modes. Bug 5846
-
Morris Jette authored
are down nodes. Bug 5846
-
Morris Jette authored
NODE_SET_REBOOT to continue. Bug 5846
-
Morris Jette authored
node change when possible. Bug 5846
-
Marshall Garey authored
Plugins reading in their own config files rely on the SLURM_CONF environment variable pointing to the appropriate directory; otherwise they fall back to the built-in sysconfdir path. Set the environment variable early enough so that the -f flag operates correctly, but not before conf->conffile has definitely been set. Remove the setenv call that happens before the first slurmstepd is fork()'d, as it is now redundant. Bug 4774.
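A sketch of the scenario this addresses (path illustrative): starting slurmd against an alternate configuration file, whose location plugins then discover through SLURM_CONF rather than the compiled-in sysconfdir:

    # -f names an alternate config; with this fix, SLURM_CONF is set early
    # enough that plugin config files are resolved relative to it
    slurmd -f /etc/slurm-test/slurm.conf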
-
Alejandro Sanchez authored
sbatch sets these, but salloc did not. This should make srun behavior consistent between the two. Bug 3861.
-
- 03 Dec, 2018 2 commits
-
-
Marshall Garey authored
time that didn't exist, instead of just updating lines that have a time with a lesser time.
-
Dominik Bartkiewicz authored
Slurm is going to replace internally. Bug 5800
-
- 29 Nov, 2018 2 commits
-
-
Dominik Bartkiewicz authored
Bug 6121
-
Nate Rini authored
Bug 6008
-
- 28 Nov, 2018 5 commits
-
-
Alejandro Sanchez authored
Bug 6077
-
Danny Auble authored
node system. Bug 6037
-
Marshall Garey authored
Bug 6037
-
Artem Y. Polyakov authored
Bug 5983
-
Artem Y. Polyakov authored
In error code paths (like a collective timeout), a callback provided by PMIx could be called twice, leading to a segmentation fault. This commit fixes that by properly accounting for callback invocations. Bug 5983
-
- 27 Nov, 2018 4 commits
-
-
Danny Auble authored
Bug 5935
-
Boris Karasev authored
This could have caused core dumps if communication failed for one reason or another. Signed-off-by: Boris Karasev <karasev.b@gmail.com> Bug 5935
-
Morris Jette authored
This patch does two things: 1. When a step fails on some node, mark it as complete on those nodes. This is needed so that when the step ends on the other nodes, slurmctld recognizes the step as completely done. 2. If the step does not have the --no-kill option set, then when some node fails, send a request to terminate the step on ALL of its nodes. Bug 5805
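For reference, a minimal illustration of the flag governing the second behavior (application name assumed):

    # Default: a node failure terminates the step on all of its nodes
    srun -N4 ./my_app
    # With --no-kill, surviving nodes keep running the step after a failure
    srun -N4 --no-kill ./my_app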
-
Nate Rini authored
when env is overwritten by the command line. Bug 5977
-