Commits · 23721c4c9e975dc9e8f276d3e70308596e67764e · Manuel G. Marciani / ces_slurm_simulator

13 Jun, 2017 1 commit

Improve slurmd startup on large systems (> 10000 nodes) · 9b7210ef

Danny Auble authored Jun 13, 2017

What this does is populate the node_hash_table as nodes are being read in
instead of after the node_record_table_ptr has been fully populated.

This speeds up slurmd startup on a system with 10000 nodes from
more than 1 minute to less than a second.

In 17.11 we will remove the linear xstrcmp check as it should no longer be
needed.

Bug 3885

9b7210ef
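The idea described in this commit can be illustrated with a minimal sketch. This is not Slurm's actual code: the types `node_rec_t` and `node_table_t`, the helpers `node_table_add` and `node_table_find`, and the table sizes are all hypothetical. The point shown is only that each record is indexed in the hash table at the moment it is added, so name lookups work while the table is still being filled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES  16384   /* illustrative capacity, not a Slurm limit */
#define HASH_SLOTS 4096    /* illustrative bucket count */

/* Hypothetical, simplified stand-in for a node record. */
typedef struct node_rec {
    char name[64];
    struct node_rec *hash_next;   /* chain of records sharing a bucket */
} node_rec_t;

typedef struct {
    node_rec_t  nodes[MAX_NODES];
    size_t      count;
    node_rec_t *buckets[HASH_SLOTS];
} node_table_t;

/* djb2-style string hash; any reasonable string hash would do here. */
static unsigned long hash_name(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % HASH_SLOTS;
}

/* Add a node and index it immediately, so it can be found before the
 * rest of the table has been read in. Returns NULL if the table is
 * full or the name does not fit. */
static node_rec_t *node_table_add(node_table_t *t, const char *name)
{
    assert(t != NULL && name != NULL);
    if (t->count >= MAX_NODES || strlen(name) >= sizeof(t->nodes[0].name))
        return NULL;

    node_rec_t *n = &t->nodes[t->count++];
    strcpy(n->name, name);

    unsigned long slot = hash_name(name);
    n->hash_next = t->buckets[slot];
    t->buckets[slot] = n;
    return n;
}

/* Look a node up by walking one bucket chain instead of the whole table. */
static node_rec_t *node_table_find(const node_table_t *t, const char *name)
{
    assert(t != NULL && name != NULL);
    for (node_rec_t *n = t->buckets[hash_name(name)]; n; n = n->hash_next)
        if (strcmp(n->name, name) == 0)
            return n;
    return NULL;
}
```

In this sketch the table is large, so it should be a zero-initialized static or heap object rather than a stack variable. A lookup issued between two `node_table_add` calls already sees every node added so far, which is the property the commit message relies on.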

12 Jun, 2017 4 commits
- Note when a job finishes in the slurmd to avoid a race when launching a · 55bcec87
  Danny Auble authored Jun 12, 2017
```
batch job takes longer than it takes to finish.

Bug 3833
```
  55bcec87
- Increase number of jobs that are tracked in the slurmd as finishing at one · ccfe2552
  Danny Auble authored Jun 12, 2017
```
time.

Bug 3833
```
  ccfe2552
- Fix bug in task/affinity that could result in slurmd fatal error · 6dd2be3b
  Morris Jette authored Jun 12, 2017
```
An array was only being partially cleared due to bad logic
bug 3876
```
  6dd2be3b
- Only set kmem cgroup limit if ConstrainKmemSpace=yes · ba32ac48
  Tim Wickberg authored Jun 09, 2017
```
Bug 3874.
```
  ba32ac48

08 Jun, 2017 2 commits

Improve preempted job selection logic · 47b5fe60

Dominik Bartkiewicz authored Jun 08, 2017

Improve selection of jobs to preempt when there are multiple partitions
    with jobs subject to preemption.
bug 3824

47b5fe60

Handle update of blocking QOS pointers correctly. · 5e92a3f5
Dominik Bartkiewicz authored Jun 08, 2017
```
Prevent segfault from pointer dereference to the QOS that is
being deleted.

Fix to commit 3e8aa451.
```
5e92a3f5

07 Jun, 2017 1 commit
- Update NEWS for 17.02.5. · 02106184
  Tim Wickberg authored Jun 07, 2017
  
  02106184

03 Jun, 2017 1 commit

Fix issue with sacctmgr show where user='' · bef69448

Danny Auble authored Jun 02, 2017

Fix regression from commit c05dcb8a (bug 1923) that did not treat
a blank char * as a valid option.

This fixes scenarios such as

sacctmgr list associations user=''

which would only print account associations.
Bug 3862

bef69448

02 Jun, 2017 1 commit

If trying to cancel a step that hasn't started yet for some reason return · eed76f85

Danny Auble authored Jun 02, 2017

a good return code.

This also fixes the situation where the step was ending but not yet ended
so it sends the KILL_TASK_FAILED error instead of JOB_NOTRUNNING.

It also replaces the abort with exit, which it should have been anyway.

Bug 3758

eed76f85

01 Jun, 2017 3 commits

Fix --ntasks-per-core parsing in sbatch command. · 3c852da1
Mark Klein authored Jun 01, 2017
```
Inadvertently set to one when requested.

Bug 3855.
```
3c852da1

Always generate core in slurmd, even on non-developer builds. · 7d488e2b
Tim Wickberg authored Jun 01, 2017
```
Bug 3857.
```
7d488e2b

Handle file deletion for purge_old_job() in a separate thread. · b9719be2

Tim Wickberg authored May 24, 2017

File deletion can be slow, especially when StateSaveLocation is on
NFS or other network filesystems. Since purge_old_job() holds all
the slurmctld write locks, this is especially performance sensitive.

Moving this to an independent thread lets the slower filesystem
cleanup happen without owning these locks. purge_old_job() then
results in the purged job ids being queued in the purge_list.

A race with the job id potentially wrapping around again is already
prevented by _dup_job_file_test() in get_next_job_id().

Bug 3763.

b9719be2
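The design described above, queueing ids under the expensive locks and doing the slow cleanup elsewhere, can be sketched roughly as follows. This is an illustration only and not the slurmctld implementation: `purge_enqueue`, `purge_worker`, `purge_stop`, and the `files_removed` counter are hypothetical names, and the actual file removal is replaced by a comment.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue of job ids whose state files still need deleting. */
typedef struct purge_item {
    uint32_t job_id;
    struct purge_item *next;
} purge_item_t;

static pthread_mutex_t purge_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  purge_cond = PTHREAD_COND_INITIALIZER;
static purge_item_t   *purge_head;
static int             purge_shutdown;
static int             files_removed;   /* stands in for real deletions */

/* Called while the caller still holds its own locks: only queues the id,
 * so no filesystem I/O happens on this path. Returns 0 on success. */
static int purge_enqueue(uint32_t job_id)
{
    purge_item_t *item = malloc(sizeof(*item));
    if (!item)
        return -1;
    item->job_id = job_id;

    pthread_mutex_lock(&purge_lock);
    item->next = purge_head;
    purge_head = item;
    pthread_cond_signal(&purge_cond);
    pthread_mutex_unlock(&purge_lock);
    return 0;
}

/* Worker thread: drains the queue without holding the caller's locks. */
static void *purge_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&purge_lock);
    for (;;) {
        while (!purge_head && !purge_shutdown)
            pthread_cond_wait(&purge_cond, &purge_lock);
        if (!purge_head)
            break;   /* shutdown requested and queue fully drained */

        purge_item_t *item = purge_head;
        purge_head = item->next;
        pthread_mutex_unlock(&purge_lock);

        /* Slow filesystem cleanup for item->job_id would happen here. */
        free(item);

        pthread_mutex_lock(&purge_lock);
        files_removed++;
    }
    pthread_mutex_unlock(&purge_lock);
    return NULL;
}

/* Ask the worker to exit once the queue is empty. */
static void purge_stop(void)
{
    pthread_mutex_lock(&purge_lock);
    purge_shutdown = 1;
    pthread_cond_broadcast(&purge_cond);
    pthread_mutex_unlock(&purge_lock);
}
```

The producer side only touches a small mutex-protected list, so the time spent under the larger locks no longer depends on filesystem latency. The sketch does not model the job id wraparound check mentioned in the commit message.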

31 May, 2017 3 commits
- Add warning about libcurl-devel not being installed during configure. · 0e582365
  Danny Auble authored May 31, 2017
  
  0e582365
- Prevent segfault in sacctmgr due to bad handling of return code. · 15276c01
  Tim Shaw authored May 30, 2017
```
Bug 3840.
```
  15276c01
- Fix NEWS line from commit 56ea068c. · 1503bdcc
  Tim Shaw authored May 30, 2017
  
  1503bdcc

30 May, 2017 2 commits
- don't clear GRES from non-KNL node · 56ea068c
  Tim Shaw authored May 30, 2017
```
node_features/knl_cray plugin: Don't clear configured GRES from non-KNL node.
bug 3768
```
  56ea068c
- NEWS entry about better backfill (commit 3e8aa451). · 88df7a81
  Danny Auble authored May 30, 2017
  
  88df7a81

26 May, 2017 2 commits

Follow up to commit . This makes this logic optional with · 8c2f4508

Danny Auble authored May 26, 2017

the SchedulerParameters=reduce_completing_frag option.

NOTE: reduce_completing_frag on or off only works with CompletingWait set to
something.

Bug 3756

8c2f4508

Preserve earliest start time for jobs · 86884fb6

Gary authored May 26, 2017

For jobs submitted to multiple partitions, report the job's earliest start
    time for any partition.
bug 3754

86884fb6

25 May, 2017 8 commits

When scheduling take the nodes in completing jobs out of the mix to reduce · 54223710
Doug Jacobsen authored May 25, 2017
```
fragmentation.

Bug 3756
```
54223710

Prevent a race between completing jobs on a user-exclusive node from leaving the node owned. · cd9ff91b

Dominik Bartkiewicz authored May 25, 2017

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving it, add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

cd9ff91b
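The single-exit pattern described in this commit can be sketched as below. This is a simplified illustration, not the real make_node_idle(): the `node_t` fields other than `owner_job_cnt`, the `completing` early-exit condition, and the function name `make_node_idle_sketch` are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down node record; only owner_job_cnt comes from the
 * commit message, the other fields are illustrative. */
typedef struct {
    uint16_t owner_job_cnt;   /* jobs of the owning user still on the node */
    bool     owned;           /* node currently reserved for one user */
    bool     completing;      /* another job on the node is still completing */
} node_t;

/* Release one job's claim on the node. Every exit goes through `fini`,
 * so the owner count is also decremented on the early-exit path. */
static void make_node_idle_sketch(node_t *node_ptr, bool user_exclusive_job)
{
    assert(node_ptr != NULL);

    if (node_ptr->completing) {
        /* A bare `return` here would skip the decrement below. */
        goto fini;
    }

    /* ... the remaining node state updates would happen here ... */

fini:
    /* Guard: only decrement a non-zero count so it cannot wrap around. */
    if (user_exclusive_job && node_ptr->owner_job_cnt > 0) {
        if (--node_ptr->owner_job_cnt == 0)
            node_ptr->owned = false;
    }
}
```

With two jobs finishing close together, the first call may take the early exit while the second does not. Both still pass through `fini`, so the count reaches zero and the node stops being marked as owned.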

Prevent a job tested on multiple partitions from being marked WHOLE_NODE_USER. · 162f6a05

Dominik Bartkiewicz authored May 25, 2017

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

162f6a05
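The side effect noted above, checking the partition flags as well as the job's own setting, might look like the following sketch. The structures are heavily reduced stand-ins, and the constants `SKETCH_WHOLE_NODE_USER` and `SKETCH_PART_EXCLUSIVE_USER` are placeholder values chosen for the example, not the values from Slurm's headers. The helper name `job_is_user_exclusive` is likewise hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder constants for this sketch only. */
#define SKETCH_WHOLE_NODE_USER     2
#define SKETCH_PART_EXCLUSIVE_USER 0x0010

/* Reduced stand-ins for the job, job details, and partition records. */
typedef struct { uint8_t whole_node; } job_details_t;
typedef struct { uint16_t flags; } part_t;
typedef struct {
    job_details_t *details;
    part_t        *part_ptr;   /* partition currently being considered */
} job_t;

/* True if the job must be treated as user-exclusive on the partition it
 * is currently being tested against. The job record itself is never
 * modified, so testing one ExclusiveUser=YES partition does not change
 * how the job is treated on other partitions. */
static bool job_is_user_exclusive(const job_t *job_ptr)
{
    if (job_ptr->details &&
        job_ptr->details->whole_node == SKETCH_WHOLE_NODE_USER)
        return true;
    if (job_ptr->part_ptr &&
        (job_ptr->part_ptr->flags & SKETCH_PART_EXCLUSIVE_USER))
        return true;
    return false;
}
```

Deriving the answer from both sources at each decision point keeps the partition's policy out of the job's persistent state.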

Revert "Prevent a job tested on multiple partitions from being marked" · f1a45962
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit 9128476a.
```
f1a45962

Revert "Prevent a race between completing jobs on a user-exclusive node from" · 82b0f802
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit a02d04f1.
```
82b0f802

Prevent a race between completing jobs on a user-exclusive node from · a02d04f1

Tim Wickberg authored May 25, 2017

leaving the node owned.

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving it, add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

a02d04f1

Prevent a job tested on multiple partitions from being marked · 9128476a

Tim Wickberg authored May 25, 2017

WHOLE_NODE_USER.

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

9128476a

Fix WithSubAccounts option to not include WithDeleted unless requested. · 29ebc4b2

Alejandro Sanchez authored May 25, 2017

_setup_assoc_cond_limits was using the table 'prefix' passed as an argument
to build the prefix.deleted=something condition of the where clause.

It turns out that _setup_assoc_cond_limits is called by these functions:
as_mysql_modify_assocs
as_mysql_remove_assocs
as_mysql_get_assocs
as_mysql_acct_no_users

which set the prefix to 't2' before the call if a QOS is provided or if
WithSubAccounts is provided. The 't2' prefix is fine for the other where
conditions in that case, but the deleted condition needs 't1',
which is the table the records are selected from.

Bug 3835

29ebc4b2

24 May, 2017 4 commits
- Check if variable given to scontrol show job is a valid jobid. · ea906a24
  Tim Shaw authored May 24, 2017
```
Bug 3821
```
  ea906a24
- Handle a reservation update to UNLIMITED correctly. · 6180ff64
  Tim Wickberg authored May 23, 2017
```
'scontrol update reservationname=foo duration=unlimited' sets INFINITE
as the duration, which needs to be translated to a year as is done
elsewhere. Otherwise it'll convert to 49710 days, which is definitely
wrong.

Bug 3836.
```
  6180ff64
- Fix unsafe MAX() macro use that can lead to repeated cancellation attempts in scancel. · 5bc278f7
  Alejandro Sanchez authored May 23, 2017
```
Bug 3443.
```
  5bc278f7
- Fix unsafe use of MAX macro that could lead to problems with acct_gather plugins. · 03a374d3
  Alejandro Sanchez authored May 23, 2017
```
MAX() will re-evaluate the higher value argument; if this is a function
call, it may be executed twice, leading to unintended side effects or a
crash.

Bug 3443.
```
  03a374d3
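The double evaluation mentioned in the last two commits is easy to reproduce in isolation. The sketch below uses the conventional definition of a `MAX` macro, which is assumed here rather than copied from Slurm, and a hypothetical `next_sample()` function whose side effect is a call counter.

```c
#include <assert.h>

/* Conventional function-like macro: the larger argument is expanded,
 * and therefore evaluated, twice. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int calls;   /* counts how often next_sample() runs */

/* Hypothetical function with a side effect: each call yields a new value. */
static int next_sample(void)
{
    return 10 * ++calls;
}

/* Unsafe: next_sample() runs once for the comparison and again to
 * produce the result whenever it is the larger argument. */
static int unsafe_max(int floor)
{
    return MAX(next_sample(), floor);
}

/* Safe: evaluate the call once, then compare the stored value. */
static int safe_max(int floor)
{
    int sample = next_sample();
    return MAX(sample, floor);
}

/* Self-check: the unsafe form consumes two samples, the safe form one. */
static void max_demo(void)
{
    calls = 0;
    int unsafe = unsafe_max(5);   /* compares 10, then returns a fresh 20 */
    assert(unsafe == 20 && calls == 2);

    calls = 0;
    int safe = safe_max(5);       /* evaluates once and returns that 10 */
    assert(safe == 10 && calls == 1);
}
```

Storing the result in a local variable before invoking the macro is the smallest change that removes the repeated call, and it leaves the macro itself untouched for callers that pass plain values.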

23 May, 2017 2 commits

Fix it so the backup slurmdbd will take control correctly. · 4f87dc53

Danny Auble authored May 23, 2017

This also fixes the fed_mgr on the backup slurmctld to start backup
correctly if the backup takes control more than once.

Bug 3827

4f87dc53

Fix Partition line in 'scontrol show node'. · e089a84f

Tim Shaw authored May 22, 2017

Previously, incorrect partitions and duplicated partition names
could be shown.

The array needs to be incremented by two, not one, as each
element is a start-end pair.

Bug 3793.

e089a84f
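The stride fix described above can be illustrated with a small sketch. It assumes, purely for the example, an int array of inclusive start-end index pairs terminated by -1. The helper names `count_covered` and `index_in_pairs` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count the indices covered by an array of inclusive [start, end] pairs.
 * The array is assumed (for this sketch) to end with a -1 sentinel. */
static int count_covered(const int *pairs)
{
    assert(pairs != NULL);
    int total = 0;
    /* Step by two: pairs[i] is a start, pairs[i + 1] is its end. */
    for (size_t i = 0; pairs[i] != -1; i += 2)
        total += pairs[i + 1] - pairs[i] + 1;
    return total;
}

/* Return 1 if idx falls inside any [start, end] pair, else 0. */
static int index_in_pairs(const int *pairs, int idx)
{
    assert(pairs != NULL);
    for (size_t i = 0; pairs[i] != -1; i += 2)
        if (idx >= pairs[i] && idx <= pairs[i + 1])
            return 1;
    return 0;
}
```

For the array `{0, 3, 8, 9, -1}` the pairs are 0-3 and 8-9. A loop that advanced by one would also read 3-8 as a range and wrongly report index 5 as a member, which matches the kind of incorrect and duplicated output the commit message describes.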

22 May, 2017 1 commit
- Fix null dereference in sreport cluster utilization · c30629bc
  Brian Christiansen authored May 22, 2017
```
when configured with memory-leak-debug
```
  c30629bc

19 May, 2017 5 commits
- When doing a dlopen on liblua only attempt the version compiled against. · e75f6118
  Danny Auble authored May 19, 2017
```
Bug 2131
```
  e75f6118
- Add missing QOS read lock to backfill scheduler. · 5d948801
  Danny Auble authored May 19, 2017
```
Bug 3776
```
  5d948801
- node_features/knl_generic: Do not repeatedly log errors when trying to read · ea2a0d25
  Morris Jette authored May 19, 2017
```
KNL modes if not KNL system.

Bug 3825
```
  ea2a0d25
- Revert "node_features/knl_generic: Do not repeatedly log errors when trying to read" · 4e7794e7
  Danny Auble authored May 19, 2017
```
This reverts commit c2380520.
```
  4e7794e7
- node_features/knl_generic: Do not repeatedly log errors when trying to read · c2380520
  Danny Auble authored May 19, 2017
```
KNL modes if not KNL system.

Bug 3825
```
  c2380520