Commits · a4c6bdd511b2b0e3284f209e3b793b51e19914dd · Manuel G. Marciani / ces_slurm_simulator

20 Dec, 2019 3 commits

Start NEWS for v18.08.10. · cbbc3993
Tim Wickberg authored Dec 20, 2019

cbbc3993

Do not continue with launch if _become_user() fails. · 5ac031b2

Harald Barth authored Dec 18, 2019

If this failed, the step would launch as root instead of the desired
user, which could be exploited.

Bug 8084.

CVE-2019-19728.

5ac031b2
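
The failure mode described in 5ac031b2 can be sketched in a few lines of C. This is only an illustration under stated assumptions, not Slurm's code: `become_user_sketch()`, `launch_as_user()`, and the stub functions are hypothetical names, and a real privilege drop would also handle supplementary groups. The point is simply that the launch must not proceed when the identity switch fails.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef int (*become_fn)(uid_t uid, gid_t gid);
typedef int (*launch_fn)(void);

/* Hypothetical privilege drop: group first, then user. Any failure is reported. */
int become_user_sketch(uid_t uid, gid_t gid)
{
	if (setgid(gid) != 0)
		return -1;
	if (setuid(uid) != 0)
		return -1;
	return 0;
}

/* Run launch() only if become() succeeded; otherwise return -1 without launching. */
int launch_as_user(become_fn become, uid_t uid, gid_t gid, launch_fn launch)
{
	assert(become && launch);
	if (become(uid, gid) != 0) {
		fprintf(stderr, "error: unable to become uid %lu; not launching\n",
			(unsigned long) uid);
		return -1;
	}
	return launch();
}

/* Stubs used only to demonstrate the control flow. */
static int launches;
int count_launch(void) { launches++; return 0; }
int always_fail(uid_t uid, gid_t gid) { (void) uid; (void) gid; return -1; }
int always_ok(uid_t uid, gid_t gid) { (void) uid; (void) gid; return 0; }
```

Calling `launch_as_user(always_fail, 1000, 1000, count_launch)` returns -1 and leaves `launches` at zero, which is the behavior the commit title asks for. In practice `become_user_sketch` would be passed as the `become` argument.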

Install slurmdbd.conf.example with 0600 permissions. · a913b209

Johannes Segitz authored Dec 20, 2019

Encourage the use of restricted permissions for slurmdbd.conf since
this file will contain your MySQL username and password on most systems.

CVE-2019-19727.

a913b209
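
As a rough illustration of what 0600 means for a file holding credentials, the following C sketch checks that a file carries no group or other permission bits and forces owner-only access when it does. The helper names `is_owner_only()` and `make_owner_only()` are hypothetical, and this is not how the Slurm build installs the example file; it only shows the permission check itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if no group/other permission bits are set, 0 if some are,
 * and -1 if the file cannot be stat()'d. */
int is_owner_only(const char *path)
{
	struct stat st;

	assert(path);
	if (stat(path, &st) != 0)
		return -1;
	return (st.st_mode & (S_IRWXG | S_IRWXO)) == 0;
}

/* Creates the file if it is missing and forces its mode to 0600.
 * fchmod() is used so the result does not depend on the umask. */
int make_owner_only(const char *path)
{
	int fd, rc;

	assert(path);
	fd = open(path, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	rc = fchmod(fd, 0600);
	close(fd);
	return rc;
}
```

A deployment script could call `is_owner_only()` on the live slurmdbd.conf and refuse to continue when it returns 0.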

19 Dec, 2019 1 commit
- Perl API - Fix undefined symbol for slurmdbd_pack_fini_msg · e2e6eec8
  Danny Auble authored Dec 19, 2019
```
Bug 7861
```
  e2e6eec8
18 Dec, 2019 3 commits

Fix incorrect SLURM_CLUSTER_NAME env var in batch step · 910c38e2

Douglas Wightman authored Dec 10, 2019

In a multi-cluster environment a job may submit more jobs as part of
its workflow. This fixes situations where the variable is inherited
incorrectly on sub-jobs.

Bug 7998

910c38e2

Ensure x11 is set up before launching a job step · 71df4fae

Marshall Garey authored Oct 24, 2019

srun waits for the prolog to finish before launching a job step.
In _is_prolog_finished(), slurmctld checks the state reason:

	if (job_ptr) {
		is_running = (job_ptr->state_reason != WAIT_PROLOG);
	}

But if the job is updated during the job prolog, then _update_job() will
change the state_reason, and then slurmctld will tell srun that the
prolog is completed even if it isn't. If srun launches a job step before
the extern sets up x11, then the job step won't have x11 information. To
fix this, don't change state_reason in _update_job() if it equals
WAIT_PROLOG.

Bug 7525

71df4fae
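
The guard described in 71df4fae can be sketched as follows. The struct, the enum values, and the function names here are simplified stand-ins chosen for illustration, not Slurm's definitions; only the idea (leave state_reason alone while it equals WAIT_PROLOG) comes from the commit message.

```c
#include <assert.h>

/* Illustrative values only; Slurm defines its own state reasons. */
enum state_reason { WAIT_NO_REASON = 0, WAIT_PROLOG, WAIT_HELD_USER };

/* Hypothetical, heavily reduced job record. */
struct job_record {
	enum state_reason state_reason;
};

/* Mirrors the check quoted in the commit message: the prolog counts as
 * finished once the reason is no longer WAIT_PROLOG. */
int prolog_finished(const struct job_record *job_ptr)
{
	return job_ptr && job_ptr->state_reason != WAIT_PROLOG;
}

/* Sketch of the fix: a job update must not overwrite WAIT_PROLOG,
 * otherwise srun would be told the prolog is done too early. */
void update_state_reason(struct job_record *job_ptr, enum state_reason new_reason)
{
	assert(job_ptr);
	if (job_ptr->state_reason == WAIT_PROLOG)
		return;
	job_ptr->state_reason = new_reason;
}
```

With this guard, an update that arrives during the prolog leaves `prolog_finished()` returning 0, so a waiting srun keeps waiting until the prolog itself clears the reason.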

Fix for requesting specific nodes when using cons_tres topology. · d0bf6d68

Douglas Wightman authored Dec 18, 2019

This in turn fixes allocation requests that should have been rejected
because the requested nodes didn't have a shared network, but weren't.

Bug 8210

d0bf6d68

17 Dec, 2019 1 commit
- Fix alloc_node validation when updating a job. · 6ec72513
  Dominik Bartkiewicz authored Nov 05, 2019
```
Bug 8047
```
  6ec72513
16 Dec, 2019 1 commit
- ignore DOWN/DRAIN partitions in reduce_completing_frag logic · 13bf677d
  Dominik Bartkiewicz authored Oct 08, 2019
```
Bug 7766
```
  13bf677d
10 Dec, 2019 1 commit

Fix pending array tasks not always matching 1st task's reason · ee3d4715

Michael Hinton authored Apr 18, 2019

Have the main scheduler and backfill scheduler make the reasons of
subsequent array tasks match the first array task, since they
sometimes didn't do this completely when the array was pending.

Bug 6814

ee3d4715

09 Dec, 2019 1 commit

Honor ntasks_per_node in _compute_c_b_task_dist() · 16eb8b14

Nate Rini authored Aug 27, 2019

Add _at_tpn_limit() as a helper to determine when a given node is over the
tasks_per_node limit and to log when that happens.

Bug 7629.

16eb8b14
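
A helper of the kind 16eb8b14 describes might look like the sketch below. The signature, the treatment of 0 as "no limit", the choice to treat reaching the limit as being at it, and the log text are all assumptions made for illustration; the real _at_tpn_limit() may differ in every one of these respects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: true when the node cannot take another task
 * without exceeding ntasks_per_node, logging when that is the case. */
bool at_tpn_limit(int node_inx, uint16_t tasks_on_node, uint16_t ntasks_per_node)
{
	if (ntasks_per_node == 0)	/* assumption: 0 means no limit requested */
		return false;
	if (tasks_on_node < ntasks_per_node)
		return false;
	fprintf(stderr, "debug: node index %d at tasks-per-node limit (%u of %u)\n",
		node_inx, (unsigned) tasks_on_node, (unsigned) ntasks_per_node);
	return true;
}
```

A task distribution loop would call this before placing each additional task on a node and skip the node when it returns true.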

02 Dec, 2019 1 commit
- Fix parsing of delay_boot in controller when additional args follow · 9daa0563
  Brian Christiansen authored Nov 13, 2019
```
Signed-off-by: Jason Booth <jbooth@schedmd.com>

Bug 7189
```
  9daa0563
26 Nov, 2019 2 commits
- Fix format build error on FreeBSD · 683415cc
  Broderick Gardner authored Nov 26, 2019
```
Bug 8153
```
  683415cc
- Make Slurm compile on Linux after sys/sysctl.h was deprecated. · 0f3ec361
  Danny Auble authored Nov 26, 2019
```
Bug 7987

Co-authored-by: Broderick Gardner <broderick@schedmd.com>
Signed-off-by: Broderick Gardner <broderick@schedmd.com>
```
  0f3ec361
21 Nov, 2019 2 commits

Fix misleading error for immediate alloc requests and defer combination. · 1b13f532

Alejandro Sanchez authored Nov 20, 2019

When an allocation request was made with the immediate=1 argument and
SchedulerParameters included defer, Slurm returned a misleading
ESLURM_FRAGMENTATION error. The logic now returns a more appropriate
ESLURM_CAN_NOT_START_IMMEDIATELY error for this scenario by decoupling
defer from the too-fragmented logic in job_allocate().

Note that this doesn't change behavior: with the immediate + defer
combination, defer still takes precedence, meaning individual
submit-time allocation attempts are avoided regardless of immediate.

Bug 5175.

1b13f532
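
One way to picture the decoupling described in 1b13f532 is the small decision function below. It is a sketch under assumptions: the enum is a stand-in with made-up values for the real Slurm error codes, `immediate_alloc_rc()` is a hypothetical name, and the actual job_allocate() logic involves far more state than three booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; the real codes are defined by Slurm. */
enum alloc_rc {
	ALLOC_OK = 0,
	ALLOC_FRAGMENTATION,		/* stands in for ESLURM_FRAGMENTATION */
	ALLOC_CAN_NOT_START_IMMEDIATELY	/* stands in for ESLURM_CAN_NOT_START_IMMEDIATELY */
};

/* Hypothetical decision: defer is checked on its own, before and
 * separately from the fragmentation test. */
enum alloc_rc immediate_alloc_rc(bool immediate, bool defer, bool too_fragmented)
{
	if (!immediate)
		return ALLOC_OK;	/* a non-immediate request may simply pend */
	if (defer)
		return ALLOC_CAN_NOT_START_IMMEDIATELY;
	if (too_fragmented)
		return ALLOC_FRAGMENTATION;
	return ALLOC_OK;
}
```

The immediate request is still refused when defer is set, so behavior is unchanged; only the reported reason differs.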

Reject unrunnable jobs submitted to reservations. · ab52c868

Marshall Garey authored Oct 03, 2019

This effectively reverts commit 73351553. That commit's message is,

     "Improve support for overlapping advanced reservations.
      Patch from Bill Brophy, Bull."

Jobs submitted to reservations that request more resources than are on a
node will pend forever because of that commit. Reverting that commit
causes those jobs to be immediately rejected. Also, that commit doesn't
appear to "improve support for overlapping advanced reservations" in any
way.

The job is already immediately rejected if it asks for more resources
than are on a node without being submitted to a reservation, or if the
job asks for more nodes than are currently in the reservation. So, this
commit just makes behavior consistent.

Bug 5175.

ab52c868

15 Nov, 2019 1 commit

Fix both socket-[un]constrained GRES allocation issues. · efcd853a

Michael Hinton authored Oct 23, 2019

Do not assume that these sock_gres_t pointers always exist:
bits_by_sock
bits_by_sock[s]

If they don't, there are no GRES constrained to the current iteration's
socket `s`, and so the logic shouldn't allocate the current iteration's
GRES `g`.

Analogously, do not assume that bits_any_sock sock_gres_t member pointer
is always valid. If it isn't, it means there are no socket-unconstrained
GRES available to satisfy the job request, so the logic should not
allocate the current iteration GRES `g`.

Otherwise, job/node struct members holding GRES allocation information
would end up being incorrect, leading to improper allocations and also
leading to errors logged in slurmctld log at deallocation time like:

error: gres/gpu: job <X> dealloc node <Y> GRES count underflow (0 < 1)

Bug 7827

efcd853a

14 Nov, 2019 1 commit
- Start NEWS for v19.05.5. · b65d9ed2
  Tim Wickberg authored Nov 14, 2019
  
  b65d9ed2
12 Nov, 2019 2 commits

Initialize db_flags correctly in slurmdb_unpack_job_cond(). · 6158e479

Marcin Stolarek authored Oct 31, 2019

For older RPCs we should initialize db_flags with SLURMDB_JOB_FLAG_NOTSET.
(Which is treated differently than SLURMDB_JOB_FLAG_NONE, which is 0.)

Bug 8029.

6158e479
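
The distinction drawn in 6158e479 between "not set" and "none" can be illustrated with a short unpack sketch. The macro names, the sentinel value 0x1, the protocol version threshold, and the function name are all assumptions made for this example; only the fact that the NONE value is 0 and is treated differently from NOTSET comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define JOB_FLAG_NONE	0x0u	/* the commit message states NONE is 0 */
#define JOB_FLAG_NOTSET	0x1u	/* assumed nonzero sentinel */

/* Hypothetical first protocol version that carries db_flags on the wire. */
#define PROTO_WITH_DB_FLAGS 2

/* Peers speaking an older protocol never sent db_flags, so report
 * "not set" rather than letting the field default to 0 ("none"). */
uint32_t unpack_db_flags(uint16_t protocol_version, uint32_t wire_value)
{
	if (protocol_version >= PROTO_WITH_DB_FLAGS)
		return wire_value;
	return JOB_FLAG_NOTSET;
}
```

Without the explicit sentinel, a zero-initialized field would be indistinguishable from a caller that deliberately asked for no flags.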

Fix regression caused by . · 4c1ed636

Dominik Bartkiewicz authored Nov 12, 2019

Remove the TIME_FLOAT flag from the reservation to ensure _job_overlap()
does not add the current time on top of the start_time. The prior
approach was incorrect for non-TIME_FLOAT reservations and would
lead to valid reservations being rejected.

Bug 7458, 7908.

4c1ed636

11 Nov, 2019 2 commits

Fix not handling nextstate on reboot of node · 3361eeef
Brian Christiansen authored Oct 03, 2019
```
Signed-off-by: Michael Hinton <hinton@schedmd.com>

Bug 7169
```
3361eeef

Suspend nodes after being idle or down for SuspendTime · 7d34c867

Brian Christiansen authored Oct 18, 2019



Previously, nodes were suspended only after being idle. The problem was
that if a node was downed after a job had run on it for more than
SuspendTime, the node would be suspended almost immediately. Now Slurm
waits SuspendTime after the node becomes idle or down (i.e. since there
were no jobs on the node).

Bug 6774

Signed-off-by: Danny Auble <da@schedmd.com>

7d34c867

08 Nov, 2019 2 commits

Fix issues with --gpu-bind while using cgroups · 5b13fbb3

Michael Hinton authored Aug 09, 2019

CUDA_VISIBLE_DEVICES was not being set to the correct GPU indexes when
cgroups were being used. These issues were exhibited with at least the
map_gpu and mask_gpu binding options.

The issue was that usable_gres is a bitmask of GRESs in the step's
cgroup, but bit_test() was looking at bit i, which is the index of the
global gres_list (not constrained by cgroups).

Bug 7509

5b13fbb3
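
The index mismatch described in 5b13fbb3 can be illustrated with plain 64-bit masks. This is one reading of the commit message, not Slurm's fix: `step_alloc`, `global_to_local()`, and `device_usable()` are hypothetical, and Slurm uses its own bitstring type rather than `uint64_t`. The sketch translates a global device index into its position within the step's cgroup before testing the usable mask.

```c
#include <assert.h>
#include <stdint.h>

/* step_alloc: bit g is set if global device g is in the step's cgroup.
 * Returns the device's position inside the cgroup, or -1 if absent. */
int global_to_local(uint64_t step_alloc, int global_idx)
{
	int i, local = 0;

	assert(global_idx >= 0 && global_idx < 64);
	if (!(step_alloc & (UINT64_C(1) << global_idx)))
		return -1;
	for (i = 0; i < global_idx; i++) {
		if (step_alloc & (UINT64_C(1) << i))
			local++;
	}
	return local;
}

/* usable: bitmask indexed by position inside the cgroup.
 * Testing bit global_idx directly would be the bug; translate first. */
int device_usable(uint64_t step_alloc, uint64_t usable, int global_idx)
{
	int local = global_to_local(step_alloc, global_idx);

	if (local < 0)
		return 0;
	return (usable & (UINT64_C(1) << local)) != 0;
}
```

For example, if the cgroup holds global devices 2 and 3 (`step_alloc = 0xC`) and only the second of them is usable (`usable = 0x2`), then global device 3 maps to local index 1 and is reported usable, whereas testing bit 3 of `usable` directly would wrongly report it as unusable.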

Fix regression on update from older versions with DefMemPerCPU · 6abe1e75

Felip Moll authored Nov 04, 2019

In 19.05 the JOB_MEM_SET flag was added, along with a conditional check on
this flag that changed pn_min_memory when validating job limits.
As a result, after an upgrade, jobs left pending (PD) by an earlier version
lacked this flag, and their memory was set incorrectly when their limits
were checked before starting. This patch addresses the issue by setting
the flag on jobs from an older protocol version when loading the state
files.

Bug 8011

6abe1e75

07 Nov, 2019 1 commit

Allow coordinators to delete users. · 0d579734

Marshall Garey authored Oct 25, 2019

Previously, coordinators could delete specific associations, but could
not delete users. Allow coordinators to delete users if the users are
only part of accounts that the coordinator is over.

Bug 7413.

0d579734

31 Oct, 2019 5 commits
- scontrol - permit changes to WorkDir for pending jobs. · f445818b
  Chad Vizino authored Oct 31, 2019
```
Bug 7103.
```
  f445818b
- Fix job "--switches=count@time" option handling in cons_tres topology. · c14142a1
  Douglas Wightman authored Oct 30, 2019
```
Bug 7875
```
  c14142a1
- Fix cons_tres topology incorrectly evaluating insufficient resources. · dcdfd690
  Douglas Wightman authored Oct 01, 2019
```
Bug 7830
```
  dcdfd690
- slurm.spec - fix pmix_version global context macro. · e8cb3466
  Josh Schwartz authored Oct 31, 2019
```
Bug 7584
```
  e8cb3466
- sched/backfill - fix the estimated sched_nodes for multi-part jobs. · ea54dd2f
  Alejandro Sanchez authored Sep 03, 2019
```
Previously sched_nodes was set to the estimated nodes on the last
evaluated partition that was adding a reservation, instead of the
one offering the earliest estimated start time.

Natural continuation of fdae6a05.

Bug 7344.

Signed-off-by: Dominik Bartkiewicz <bart@schedmd.com>
```
  ea54dd2f
29 Oct, 2019 1 commit
- Remove duplicate log entry on update job · 85916713
  Felip Moll authored Oct 29, 2019
```
Bug 8014
```
  85916713
28 Oct, 2019 2 commits
- Fix build on 32-bit systems. · 8d8a5955
  Tim Wickberg authored Oct 28, 2019
```
Bug 7749
```
  8d8a5955
- Fix slurmd -G functionality · 2cdbb6f7
  Marcin Stolarek authored Oct 24, 2019
```
gres_node_config_load() requires gres_list to work properly now that the
logic to fully merge slurm.conf with gres.conf was added in 4d7df8b0.

Bug 7986
```
  2cdbb6f7
25 Oct, 2019 2 commits

Enforce PART_NODES if only Partition is specified · c8ce5a53
Albert Gil authored Jul 31, 2019
```
Bug 7490
```
c8ce5a53

Avoid abort in dev-build · fe945037

Marshall Garey authored Jun 03, 2019

If not enforcing QOS, it's possible to submit a job without a qos. If
submitting such a job to multiple partitions where at least one has a
qos, slurmctld would abort in a development build. A non-development
build didn't segfault only because _find_qos_part doesn't dereference
the NULL pointer. Prevent the abort.

Bug 7171

fe945037

24 Oct, 2019 1 commit
- Make sview work with glib2 v2.62. · e2ff6b01
  Chad Vizino authored Oct 24, 2019
```
Bug 7712
```
  e2ff6b01
23 Oct, 2019 1 commit
- Add NEWS entry for the last three commits. · 38875f73
  Michael Hinton authored Oct 22, 2019
```
Bug 7884.
```
  38875f73
22 Oct, 2019 2 commits

Fix abort initializing a configuration without acct_gather.conf. · a301635f

Gavin Howard authored Oct 22, 2019

The previous logic only called s_p_hashtbl_create() to create the hashtable
when the file acct_gather.conf could be successfully stat()'d. This led to
a subsequent attempt to pack the never-created hashtable into a buffer, which
triggered the abort.

The hashtable is now created unconditionally, whether or not the file
exists.

Bug 7893.

a301635f
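
The shape of the fix in a301635f can be sketched as below. The `conf_tbl` type and both functions are hypothetical placeholders, not Slurm's s_p_hashtbl API; the sketch only shows creating the table before, and independently of, the stat() check so that later packing always has a valid object.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical stand-in for a parsed configuration table. */
struct conf_tbl {
	int file_present;
};

/* Always allocate the table; the file only affects what goes into it. */
struct conf_tbl *conf_tbl_init(const char *path)
{
	struct stat st;
	struct conf_tbl *tbl = calloc(1, sizeof(*tbl));

	if (!tbl)
		return NULL;
	if (path && stat(path, &st) == 0)
		tbl->file_present = 1;	/* real code would parse the file here */
	return tbl;
}

/* Packing asserts on a valid table, which is the kind of check that
 * would abort if creation had been skipped. */
int conf_tbl_pack(const struct conf_tbl *tbl)
{
	assert(tbl);
	return tbl->file_present;
}
```

With a missing file, `conf_tbl_init()` still returns a table (marked as having no file), so `conf_tbl_pack()` succeeds instead of aborting. The caller frees the table with `free()`.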

auth/munge - truncate FQDN to shortname for AllocNodes. · 50eaa012

Michael Hinton authored Sep 06, 2019

gethostbyaddr() can potentially return a fully-qualified domain name,
which breaks backwards compatibility with the shortname AllocNodes
expected pre 19.05.

Bug 7653.

50eaa012
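
Truncating a fully-qualified name to its short form, as 50eaa012 describes, amounts to cutting the string at the first dot. The function name below is hypothetical and the sketch is not the auth/munge code; note also that it would mangle a dotted IPv4 address, so a caller should only apply it to hostnames.

```c
#include <assert.h>
#include <string.h>

/* Cut the hostname at the first '.', in place.
 * "node01.cluster.example.com" becomes "node01"; a name without a
 * dot is left unchanged. */
void truncate_to_shortname(char *host)
{
	char *dot;

	assert(host);
	dot = strchr(host, '.');
	if (dot)
		*dot = '\0';
}
```

The buffer must be writable, so pass a `char` array or heap copy rather than a string literal.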

21 Oct, 2019 1 commit
- Use correct function signature for step_set_env() interface in gres plugins. · 0ec1ba42
  Michael Hinton authored Oct 18, 2019
```
Fortunately the extra arguments were provided at the end, and thus ignored on
most common platforms.

Bug 7555.
```
  0ec1ba42