Commits · 683415ccbb8b8d66510632ddb4656a605e7f43fc · Manuel G. Marciani / ces_slurm_simulator

26 Nov, 2019 6 commits
- Fix format build error on FreeBSD · 683415cc
  Broderick Gardner authored Nov 26, 2019
```
Bug 8153
```
  683415cc
- Grammatical fix for 'relinquished'. · d3974825
  Michael Hinton authored Nov 26, 2019
  
  d3974825
- Merge remote-tracking branch 'origin/slurm-18.08' into slurm-19.05 · 79bdeb98
  Danny Auble authored Nov 26, 2019
  
  79bdeb98
- Make Slurm compile on linux after sys/sysctl.h was deprecated. · 0f3ec361
  Danny Auble authored Nov 26, 2019
```
Bug 7987

Co-authored-by: Broderick Gardner <broderick@schedmd.com>
Signed-off-by: Broderick Gardner <broderick@schedmd.com>
```
  0f3ec361
- Testsuite - Use test name as job name in test9.9 · ba83c8f0
  Nate Rini authored Aug 29, 2019
```
This avoids possible overlapping with other jobs.

Bug 7661.
```
  ba83c8f0
- Fix typo for 'component'. · e3af99f3
  Michael Hinton authored Nov 25, 2019
  
  e3af99f3
21 Nov, 2019 3 commits

Docs - clarify immediate allocation requests conflict with defer. · 5f233be4
Alejandro Sanchez authored Nov 21, 2019
```
Bug 5175.

Signed-off-by: Marshall Garey <marshall@schedmd.com>
```
5f233be4

Fix misleading error for immediate alloc requests and defer combination. · 1b13f532

Alejandro Sanchez authored Nov 20, 2019

When an allocation request was made with the immediate=1 argument and
SchedulerParameters included defer, Slurm returned a misleading
ESLURM_FRAGMENTATION error. The logic now returns the more appropriate
ESLURM_CAN_NOT_START_IMMEDIATELY error for this scenario by decoupling
defer from the too-fragmented logic in job_allocate().

Note that this doesn't change behavior: in the immediate + defer
combination, defer continues to take precedence, meaning individual
submit-time allocation attempts are skipped regardless of immediate.

Bug 5175.

1b13f532
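
The precedence described in this commit can be modelled with a minimal Python sketch. This is not Slurm's C code: the function `submit_time_error` and its boolean parameters are hypothetical stand-ins, and only the two error names come from the commit message.

```python
from typing import Optional


def submit_time_error(immediate: bool, defer: bool,
                      too_fragmented: bool) -> Optional[str]:
    """Return the error name for a submit-time request, or None.

    Hypothetical model: `too_fragmented` stands in for whatever the
    fragmentation check in job_allocate() decides.
    """
    if defer:
        # defer takes precedence: no allocation attempt at submit time.
        if immediate:
            # An immediate request cannot be honoured while deferring.
            return "ESLURM_CAN_NOT_START_IMMEDIATELY"
        return None  # the job is simply queued
    if too_fragmented:
        # The fragmentation error is now decided independently of defer.
        return "ESLURM_FRAGMENTATION"
    return None
```

With both `immediate` and `defer` set, the sketch reports the start-immediately error even when the fragmentation stand-in is true, which is the decoupling the commit describes.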

Reject unrunnable jobs submitted to reservations. · ab52c868

Marshall Garey authored Oct 03, 2019

This effectively reverts commit 73351553. That commit's message is,

     "Improve support for overlapping advanced reservations.
      Patch from Bill Brophy, Bull."

Jobs submitted to reservations that request more resources than are on a
node will pend forever because of that commit. Reverting that commit
causes those jobs to be immediately rejected. Also, that commit doesn't
appear to "improve support for overlapping advanced reservations" in any
way.

The job is already immediately rejected if it asks for more resources
than are on a node without being submitted to a reservation, or if the
job asks for more nodes than are currently in the reservation. So, this
commit just makes behavior consistent.

Bug 5175.

ab52c868

19 Nov, 2019 1 commit
- Fix typo in quickstart.shtml · 761616a3
  Elliot Waite authored Nov 19, 2019
  
  761616a3
18 Nov, 2019 1 commit
- Remove stray bluegene.conf.example file. · f9479db3
  Tim Wickberg authored Nov 18, 2019
  
  f9479db3
15 Nov, 2019 1 commit

Fix both socket-[un]constrained GRES allocation issues. · efcd853a

Michael Hinton authored Oct 23, 2019

Do not assume that these sock_gres_t pointers always exist:
bits_by_sock
bits_by_sock[s]

If they don't, there are no GRES constrained to socket `s` in the
current iteration, so the logic shouldn't allocate the current
iteration's GRES `g`.

Analogously, do not assume that the bits_any_sock member of sock_gres_t
is always valid. If it isn't, there are no socket-unconstrained GRES
available to satisfy the job request, so the logic should not allocate
the current iteration's GRES `g`.

Otherwise, job/node struct members holding GRES allocation information
would end up being incorrect, leading to improper allocations and also
leading to errors logged in slurmctld log at deallocation time like:

error: gres/gpu: job <X> dealloc node <Y> GRES count underflow (0 < 1)

Bug 7827

efcd853a
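
The guards described above might look like the following Python sketch. It is an illustration only: the real code is C operating on `sock_gres_t`, and the helper names `has_socket_gres` and `has_unconstrained_gres`, as well as the use of plain integers as bitmaps, are assumptions made for this example.

```python
from typing import List, Optional


def has_socket_gres(bits_by_sock: Optional[List[Optional[int]]],
                    s: int) -> bool:
    """True only if a bitmap of GRES constrained to socket `s` exists."""
    if bits_by_sock is None:
        return False  # no per-socket bitmaps at all
    if s >= len(bits_by_sock) or bits_by_sock[s] is None:
        return False  # nothing recorded for this socket
    return True


def has_unconstrained_gres(bits_any_sock: Optional[int]) -> bool:
    """True only if a bitmap of socket-unconstrained GRES exists."""
    return bits_any_sock is not None
```

A caller would skip allocating the current GRES `g` whenever the relevant check returns `False`, rather than dereferencing a missing bitmap.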

14 Nov, 2019 5 commits
- Start NEWS for v19.05.5. · b65d9ed2
  Tim Wickberg authored Nov 14, 2019
  
  b65d9ed2
- Update META for v19.05.4 release. · e3f7d35a
  Tim Wickberg authored Nov 14, 2019
```
Update slurm.spec and slurm.spec-legacy as well.
```
  e3f7d35a
- Docs - update platforms.shtml. · f31ec999
  Tim Wickberg authored Nov 14, 2019
  
  f31ec999
- Docs - remove slurm_ug_cfp.html. · 593aa863
  Tim Wickberg authored Nov 14, 2019
```
Managed to survive SLUG 2019 without updating this; I suspect
we wouldn't use it for SLUG 2020 either.
```
  593aa863
- Docs - add SLUG 2020 meeting info. · 621dac7e
  Tim Wickberg authored Nov 14, 2019
  
  621dac7e
13 Nov, 2019 1 commit
- Reference that Bull is also Atos. · a3d84e3f
  Danny Auble authored Nov 13, 2019
  
  a3d84e3f
12 Nov, 2019 3 commits

Initialize db_flags correctly in slurmdb_unpack_job_cond(). · 6158e479

Marcin Stolarek authored Oct 31, 2019

For older RPCs we should initialize db_flags with SLURMDB_JOB_FLAG_NOTSET,
which is treated differently than SLURMDB_JOB_FLAG_NONE (which is 0).

Bug 8029.

6158e479
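
A small Python sketch of the distinction between "not set" and "none". The value given to `SLURMDB_JOB_FLAG_NOTSET`, the function `unpack_db_flags`, and the version threshold parameter are assumptions for illustration; the commit message only states that `SLURMDB_JOB_FLAG_NONE` is 0.

```python
SLURMDB_JOB_FLAG_NONE = 0x0    # from the commit message: NONE is 0
SLURMDB_JOB_FLAG_NOTSET = 0x1  # assumed value, for illustration only


def unpack_db_flags(protocol_version: int, first_version_with_flags: int,
                    packed_value: int = SLURMDB_JOB_FLAG_NONE) -> int:
    """Return db_flags for a job condition received over an RPC.

    Hypothetical model: senders older than `first_version_with_flags`
    never packed db_flags, so there is nothing to read from the buffer.
    """
    if protocol_version >= first_version_with_flags:
        return packed_value
    # Older RPC: mark the field as "not set" rather than leaving it
    # at 0, which would read as an explicit "no flags".
    return SLURMDB_JOB_FLAG_NOTSET
```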

Fix regression caused by . · 4c1ed636

Dominik Bartkiewicz authored Nov 12, 2019

Remove the TIME_FLOAT flag from the reservation to ensure _job_overlap()
does not add the current time on top of the start_time. The prior
approach was incorrect for non-TIME_FLOAT reservations and would
lead to valid reservations being rejected.

Bug 7458, 7908.

4c1ed636

Revert "Fix create_resv()" · cef9d4c6
Dominik Bartkiewicz authored Nov 12, 2019
```
This reverts commit c55f6d65.

Bug 7458.
```
cef9d4c6

11 Nov, 2019 2 commits

Fix not handling nextstate on reboot of node · 3361eeef
Brian Christiansen authored Oct 03, 2019
```
Signed-off-by: Michael Hinton <hinton@schedmd.com>

Bug 7169
```
3361eeef

Suspend nodes after being idle or down for SuspendTime · 7d34c867

Brian Christiansen authored Oct 18, 2019

Previously, nodes were suspended only after being idle. The problem was
that if a node was downed after a job had run on it for longer than
SuspendTime, the node would be suspended almost immediately. Now Slurm
waits SuspendTime after the node becomes idle or down (i.e. since there
have been no jobs on the node).

Bug 6774

Signed-off-by: Danny Auble <da@schedmd.com>

7d34c867
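
The new rule can be sketched in Python as follows. The function `should_suspend`, the state strings, and the `inactive_since` timestamp are hypothetical names for this illustration and do not come from the Slurm source.

```python
def should_suspend(state: str, inactive_since: float, now: float,
                   suspend_time: float) -> bool:
    """Decide whether a node is eligible for suspension.

    `inactive_since` is assumed to be the time the node last became
    idle or down, i.e. the time since which it has had no jobs.
    """
    if state not in ("IDLE", "DOWN"):
        return False  # nodes running jobs are never suspended here
    # Wait the full SuspendTime, measured from when the node went
    # idle or down, not from some older timestamp.
    return now - inactive_since >= suspend_time
```

In this model a node downed at time 1000 with a SuspendTime of 300 only becomes eligible at time 1300, however long its last job ran.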

08 Nov, 2019 2 commits

Fix issues with --gpu-bind while using cgroups · 5b13fbb3

Michael Hinton authored Aug 09, 2019

CUDA_VISIBLE_DEVICES was not being set to the correct GPU indexes when
cgroups were being used. These issues were exhibited with at least the
map_gpu and mask_gpu binding options.

The issue was that usable_gres is a bitmask of GRESs in the step's
cgroup, but bit_test() was looking at bit i, which is the index of the
global gres_list (not constrained by cgroups).

Bug 7509

5b13fbb3
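
The index mismatch can be illustrated with a short Python sketch. The function `is_usable` and the list `cgroup_devices` are hypothetical; the sketch assumes that `usable_gres` is indexed by position within the step's cgroup while the loop index is a global device index, as the commit message describes.

```python
from typing import List


def is_usable(usable_gres: int, cgroup_devices: List[int],
              global_index: int) -> bool:
    """Test a device against a bitmask indexed by cgroup-local position.

    `cgroup_devices` lists the global indexes of the devices that are
    visible inside the step's cgroup, in local order (an assumption of
    this sketch).
    """
    if global_index not in cgroup_devices:
        return False  # device is not in the cgroup at all
    local_index = cgroup_devices.index(global_index)
    return bool((usable_gres >> local_index) & 1)


# A step confined to global GPUs 2 and 3, both usable.
mask = 0b11
naive = bool((mask >> 2) & 1)          # tests the global index: wrong bit
fixed = is_usable(mask, [2, 3], 2)     # translates to local index 0 first
```

Here the naive test on global index 2 misses the device, while the translated test finds it.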

Fix regression on update from older versions with DefMemPerCPU · 6abe1e75

Felip Moll authored Nov 04, 2019

In 19.05 the JOB_MEM_SET flag was added, along with a conditional check
on this flag that changes pn_min_memory when validating job limits. As
a result, after an upgrade, jobs that were pending (PD) under an earlier
version lacked this flag, and their memory was set incorrectly when
their limits were checked before starting. This patch addresses the
issue by adding the flag to jobs from an older protocol version when
loading the state files.

Bug 8011

6abe1e75
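
A minimal Python sketch of the state-load fix, under stated assumptions: the bit value given to `JOB_MEM_SET`, the function `load_job_flags`, and the version threshold parameter are all hypothetical and serve only to show the shape of the change.

```python
JOB_MEM_SET = 0x1  # hypothetical bit value, for illustration only


def load_job_flags(bit_flags: int, protocol_version: int,
                   first_version_with_flag: int) -> int:
    """Return a job's flags after loading it from a state file."""
    if protocol_version < first_version_with_flag:
        # The job was saved before the flag existed, so add it on
        # load; later limit checks then leave its memory untouched.
        return bit_flags | JOB_MEM_SET
    return bit_flags
```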

07 Nov, 2019 1 commit

Allow coordinators to delete users. · 0d579734

Marshall Garey authored Oct 25, 2019

Previously, coordinators could delete specific associations, but could
not delete users. Allow coordinators to delete users if the users are
only part of accounts that the coordinator is over.

Bug 7413.

0d579734
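
The permission rule reads naturally as a subset check. The Python sketch below is illustrative only; `coordinator_may_delete` and the account names are made up for the example.

```python
from typing import Set


def coordinator_may_delete(user_accounts: Set[str],
                           coordinated_accounts: Set[str]) -> bool:
    """True if every account the user belongs to is one the
    coordinator coordinates."""
    return user_accounts <= coordinated_accounts
```

For instance, a coordinator of `{"physics", "chem"}` could delete a user who belongs only to `{"physics"}`, but not one who also belongs to `"bio"`.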

01 Nov, 2019 2 commits
- RELEASE_NOTES - mention cli argument cleanup work done for 19.05. · 12a27b1a
  Tim Wickberg authored Nov 01, 2019
```
Bug 8035.
```
  12a27b1a
- Docs - 'MpiDefault' is the correct configuration key. · 3895bb0e
  Will Furnass authored Oct 31, 2019
```
Bug 8031.
```
  3895bb0e
31 Oct, 2019 8 commits
- Testsuite - make regression.py compatible with Python3. · 5062533e
  Broderick Gardner authored Oct 31, 2019
```
Bug 6633.
```
  5062533e
- scontrol - permit changes to WorkDir for pending jobs. · f445818b
  Chad Vizino authored Oct 31, 2019
```
Bug 7103.
```
  f445818b
- Fix job "--switches=count@time" option handling in cons_tres topology. · c14142a1
  Douglas Wightman authored Oct 30, 2019
```
Bug 7875
```
  c14142a1
- Fix cons_tres topology incorrectly evaluating insufficient resources. · dcdfd690
  Douglas Wightman authored Oct 01, 2019
```
Bug 7830
```
  dcdfd690
- Docs - update obsolete sentence about commands recognizing job arrays. · 3777cdb7
  Alejandro Sanchez authored Oct 31, 2019
```
Bug 7936
```
  3777cdb7
- Docs - Add Josh Schwartz (Cray) to the contributors list. · 65c950aa
  Alejandro Sanchez authored Oct 31, 2019
```
Bug 7584
```
  65c950aa
- slurm.spec - fix pmix_version global context macro. · e8cb3466
  Josh Schwartz authored Oct 31, 2019
```
Bug 7584
```
  e8cb3466
- sched/backfill - fix the estimated sched_nodes for multi-part jobs. · ea54dd2f
  Alejandro Sanchez authored Sep 03, 2019
```
Previously sched_nodes was set to the estimated nodes on the last
evaluated partition that was adding a reservation, instead of the
one offering the earliest estimated start time.

Natural continuation of fdae6a05.

Bug 7344.

Signed-off-by: Dominik Bartkiewicz <bart@schedmd.com>
```
  ea54dd2f
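
The sched/backfill fix for multi-partition jobs (ea54dd2f) amounts to choosing by earliest start time rather than by evaluation order. A minimal Python sketch, in which `pick_sched_nodes` and the `(start_time, nodes)` tuples are hypothetical:

```python
from typing import List, Tuple


def pick_sched_nodes(estimates: List[Tuple[int, str]]) -> str:
    """Return the node list of the partition with the earliest
    estimated start time.

    `estimates` holds one (estimated_start_time, nodes) pair per
    partition, in the order the partitions were evaluated.
    """
    earliest = min(estimates, key=lambda estimate: estimate[0])
    return earliest[1]


estimates = [(100, "n[1-2]"), (50, "n3"), (80, "n4")]
last_evaluated = estimates[-1][1]        # old behavior in this model
earliest = pick_sched_nodes(estimates)   # behavior after the fix
```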
29 Oct, 2019 1 commit
- Remove duplicate log entry on update job · 85916713
  Felip Moll authored Oct 29, 2019
```
Bug 8014
```
  85916713
28 Oct, 2019 3 commits

Fix build on 32-bit systems. · 8d8a5955
Tim Wickberg authored Oct 28, 2019
```
Bug 7749
```
8d8a5955
Testsuite: make function get_mps_count a wrapper for get_gres_count · c99deb1b
Michael Hinton authored Oct 28, 2019
```
Bug 7995
```
c99deb1b

Testsuite: Fix issue with get_gpu_count with multiple GPU types · 4b363a13

Michael Hinton authored Oct 28, 2019

Create generic function get_gres_count to get the node counts of any
GRES, not just GPUs.
Make get_gpu_count able to parse any combination of GRES names and
types.
Create get_gpu_count wrapper of get_gres_count for backwards
compatibility.
Expand the regex to not include newlines.
Rename variable gpu_count to gres_count.

Bug 7995

4b363a13
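
The wrapper arrangement can be sketched in Python. The actual testsuite helpers are not Python, and the `name[:type]:count` layout of the GRES string is an assumption made for this example, so treat the regex as illustrative rather than as the testsuite's own.

```python
import re


def get_gres_count(gres_field: str, name: str) -> int:
    """Sum the counts of all entries of GRES `name`, whatever their type.

    Assumes a comma-separated field of name[:type]:count entries.
    """
    pattern = re.compile(rf"{re.escape(name)}(?::[^:\n]+)?:(\d+)")
    total = 0
    for entry in gres_field.split(","):
        match = pattern.fullmatch(entry.strip())
        if match:
            total += int(match.group(1))
    return total


def get_gpu_count(gres_field: str) -> int:
    """Backwards-compatible wrapper around get_gres_count."""
    return get_gres_count(gres_field, "gpu")


def get_mps_count(gres_field: str) -> int:
    """Wrapper for the mps GRES, mirroring commit c99deb1b."""
    return get_gres_count(gres_field, "mps")
```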