Commits · a1b6a7fd1ed88dfdfc300349e24846c8be33c713 · Manuel G. Marciani / ces_slurm_simulator

19 Jun, 2017 1 commit

Fix --ntasks-per-core option/environment variable parsing to set · a1b6a7fd

Danny Auble authored Jun 15, 2017

the requested value, instead of always setting one.

This would make --hint=multithread not work at all.

See Bug 3855 (commit 3c852da1)

Issue originated from commit 82a959a8.

a1b6a7fd

15 Jun, 2017 1 commit
- Fix for job step task layout with --cpus-per-task option · 81372cc0
  Dominik Bartkiewicz authored Jun 15, 2017
```
bug 3447
```
  81372cc0
14 Jun, 2017 2 commits
- Only make the extern step at job creation. · 9d32c100
  Danny Auble authored Jun 14, 2017
```
Turns out if the extern step is created here and the job was killed long
beforehand, the step is made erroneously and can cause an assert just lines
later.

Bug 3898
```
  9d32c100
- Make sure srun inside an allocation gets --ntasks-per-[core|socket] · a3a3d368
  Tim Shaw authored Jun 13, 2017
```
set correctly.

Bug 3858
```
  a3a3d368
13 Jun, 2017 2 commits

Add missing NEWS entry for 23721c4c. · 7add853c
Tim Wickberg authored Jun 13, 2017

7add853c

Improve slurmd startup on large systems (> 10000 nodes) · 9b7210ef

Danny Auble authored Jun 13, 2017

What this does is populate the node_hash_table as nodes are being read in
instead of after the node_record_table_ptr has been fully populated.

This speeds up slurmd startup on a system of 10000 nodes from
more than 1 minute to less than a second.

In 17.11 we will remove the linear xstrcmp check as it should no longer be
needed.

Bug 3885

9b7210ef
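
The idea in this commit can be illustrated with a minimal Python sketch. It is not Slurm's C code: `NodeTable`, `add_node`, and `find_node` are hypothetical names, and a plain dict stands in for node_hash_table. It only shows the general technique of filling the hash as each node is read, so that lookups issued during loading are already served by the hash.

```python
class NodeTable:
    """Toy stand-in for node_record_table_ptr plus node_hash_table."""

    def __init__(self):
        self.records = []   # plays the role of the node record table
        self.by_name = {}   # plays the role of the node hash table

    def add_node(self, name):
        # Populate the hash as each node is read in, not after the full load.
        self.by_name[name] = len(self.records)
        self.records.append({"name": name})

    def find_node(self, name):
        # Hash lookup; no linear string comparison over all records.
        index = self.by_name.get(name)
        return None if index is None else self.records[index]


table = NodeTable()
for i in range(10000):
    table.add_node(f"node{i}")
    # A lookup issued while loading is already answered by the hash.
    assert table.find_node(f"node{i}") is not None
```

With the hash filled incrementally, each lookup during the load stays cheap regardless of how many nodes have been read so far.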

12 Jun, 2017 4 commits
- Note when a job finishes in the slurmd to avoid a race when launching a · 55bcec87
  Danny Auble authored Jun 12, 2017
```
batch job takes longer than it takes to finish.

Bug 3833
```
  55bcec87
- Increase number of jobs that are tracked in the slurmd as finishing at one · ccfe2552
  Danny Auble authored Jun 12, 2017
```
time.

Bug 3833
```
  ccfe2552
- Fix bug in task/affinity that could result in slurmd fatal error · 6dd2be3b
  Morris Jette authored Jun 12, 2017
```
An array was only being partially cleared due to bad logic
bug 3876
```
  6dd2be3b
- Only set kmem cgroup limit if ConstrainKmemSpace=yes · ba32ac48
  Tim Wickberg authored Jun 09, 2017
```
Bug 3874.
```
  ba32ac48
08 Jun, 2017 2 commits

Improve preempted job selection logic · 47b5fe60

Dominik Bartkiewicz authored Jun 08, 2017

Improve selection of jobs to preempt when there are multiple partitions
    with jobs subject to preemption.
bug 3824

47b5fe60

Handle update of blocking QOS pointers correctly. · 5e92a3f5
Dominik Bartkiewicz authored Jun 08, 2017
```
Prevent segfault from pointer dereference to the QOS that is
being deleted.

Fix to commit 3e8aa451.
```
5e92a3f5

07 Jun, 2017 1 commit
- Update NEWS for 17.02.5. · 02106184
  Tim Wickberg authored Jun 07, 2017
  
  02106184
03 Jun, 2017 1 commit

Fix issue with sacctmgr show where user='' · bef69448

Danny Auble authored Jun 02, 2017

Fix regression from commit c05dcb8a (bug 1923) that doesn't take
into consideration a blank char * as a valid option.

This fixes the scenario like

sacctmgr list associations user=''

which would only print account associations.
Bug 3862

bef69448
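
The distinction this fix relies on, an absent value versus a present but blank one, can be sketched in Python. This is an assumption-laden illustration, not sacctmgr's code: `build_user_filter` is a hypothetical helper, with `None` standing in for a NULL `char *` and `""` for a blank one.

```python
def build_user_filter(user):
    """Hypothetical helper: None means 'no filter', '' means 'user is blank'."""
    if user is None:
        # Nothing was requested, so no condition is generated.
        return None
    # A blank string is still a valid request and becomes a real condition.
    return f"user='{user}'"


blank_filter = build_user_filter("")
no_filter = build_user_filter(None)
named_filter = build_user_filter("alice")
```

Treating the blank string as "not set" would silently drop the condition, which is the kind of regression the commit message describes.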

02 Jun, 2017 2 commits

If trying to cancel a step that hasn't started yet for some reason return · eed76f85

Danny Auble authored Jun 02, 2017

a good return code.

This also fixes the situation where the step was ending but not yet ended
so it sends the KILL_TASK_FAILED error instead of JOB_NOTRUNNING.

Also it removes the abort in favor of exit, which it should have been anyway.

Bug 3758

eed76f85

Add new SchedulerParameters option bf_window_linear to control the rate at · 3f7e10f8

Gary B Skouson authored Jun 01, 2017

which the backfill test window expands. This can be used on a system with
a modest number of running jobs (hundreds of jobs) to help prevent expected
start times of pending jobs from being pushed forward in time. On systems with
large numbers of running jobs, performance of the backfill scheduler will
suffer and fewer jobs will be evaluated.

Bug 3790

3f7e10f8

01 Jun, 2017 8 commits

Revert "Add new SchedulerParameters option bf_window_linear to control the rate at" · 48f81146
Danny Auble authored Jun 01, 2017
```
This reverts commit da414931.
```
48f81146

Add new SchedulerParameters option bf_window_linear to control the rate at · da414931

Danny Auble authored Jun 01, 2017

which the backfill test window expands. This can be used on a system with
a modest number of running jobs (hundreds of jobs) to help prevent expected
start times of pending jobs from being pushed forward in time. On systems with
large numbers of running jobs, performance of the backfill scheduler will
suffer and fewer jobs will be evaluated.

Bug 3790

da414931

Add bf_max_job_assoc to SchedulerParameters. · 9f36d682
Mark Klein authored Jun 01, 2017
```
Bug 3671
```
9f36d682
Fix --ntasks-per-core parsing in sbatch command. · 3c852da1
Mark Klein authored Jun 01, 2017
```
Inadvertently set to one when requested.

Bug 3855.
```
3c852da1
Always generate core in slurmd, even on non-developer builds. · 7d488e2b
Tim Wickberg authored Jun 01, 2017
```
Bug 3857.
```
7d488e2b
Add bf_max_time to SchedulerParameters. · c8c9694f
Doug Jacobsen authored Jun 01, 2017
```
Bug 3808
```
c8c9694f
Add wall-time to seff output · eeebf5c8
Pablo Escobar authored Jun 01, 2017
```
bug 3846
```
eeebf5c8

Handle file deletion for purge_old_job() in a separate thread. · b9719be2

Tim Wickberg authored May 24, 2017

File deletion can be slow, especially when StateSaveLocation is on
NFS or other network filesystems. Since purge_old_job() holds all
the slurmctld write locks, this is especially performance sensitive.

Moving this to an independent thread lets the slower filesystem
cleanup happen without owning these locks. purge_old_job() then
results in the purged job ids being queued in the purge_list.

A race with the job id potentially wrapping around again is already
prevented by _dup_job_file_test() in get_next_job_id().

Bug 3763.

b9719be2
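
The pattern described in this commit, queueing work under the lock and doing the slow part in another thread, can be sketched in Python. This is a simplified illustration rather than slurmctld's implementation: the single `lock`, the `purge_worker` function, the `None` sentinel, and the `deleted` list (which replaces real file removal) are all assumptions made for the example.

```python
import queue
import threading

purge_list = queue.Queue()   # job ids whose files still need deleting
deleted = []                 # records what the worker "deleted" (demo only)


def purge_worker():
    # Runs without holding the lock, so slow deletes do not block the caller.
    while True:
        job_id = purge_list.get()
        if job_id is None:       # sentinel: shut the worker down
            break
        deleted.append(job_id)   # placeholder for the slow file removal


def purge_old_job(job_ids, lock):
    # Under the lock, only queue the ids; the expensive work happens elsewhere.
    with lock:
        for job_id in job_ids:
            purge_list.put(job_id)


lock = threading.Lock()
worker = threading.Thread(target=purge_worker)
worker.start()
purge_old_job([101, 102, 103], lock)
purge_list.put(None)
worker.join()
```

The lock is held only for the cheap enqueue step, so a slow filesystem delays the worker thread rather than everything waiting on the lock.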

31 May, 2017 3 commits
- Add warning about libcurl-devel not being installed during configure. · 0e582365
  Danny Auble authored May 31, 2017
  
  0e582365
- Prevent segfault in sacctmgr due to bad handling of return code. · 15276c01
  Tim Shaw authored May 30, 2017
```
Bug 3840.
```
  15276c01
- Fix NEWS line from commit 56ea068c. · 1503bdcc
  Tim Shaw authored May 30, 2017
  
  1503bdcc
30 May, 2017 3 commits

don't clear GRES from non-KNL node · 56ea068c

Tim Shaw authored May 30, 2017

node_features/knl_cray plugin: Don't clear configured GRES from non-KNL node.
bug 3768

56ea068c

Improve node configuration parsing · 40806eb5

Morris Jette authored May 30, 2017

Report that "CPUs" plus "Boards" in node configuration is invalid only if the
CPUs value is not equal to the total thread count. In any case, the
CPUs value is ignored, but it is also output by "slurmd -C".

40806eb5
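
The check described here can be written out as a small Python sketch. It assumes the total thread count is the product of boards, sockets per board, cores per socket, and threads per core; the function name and parameter names are hypothetical and do not come from the Slurm source.

```python
def cpus_conflicts_with_boards(cpus, boards, sockets_per_board,
                               cores_per_socket, threads_per_core):
    """Return True only when CPUs disagrees with the total thread count."""
    total_threads = (boards * sockets_per_board
                     * cores_per_socket * threads_per_core)
    return cpus != total_threads


# 2 boards x 2 sockets x 4 cores x 2 threads = 32 threads
consistent = cpus_conflicts_with_boards(32, 2, 2, 4, 2)
inconsistent = cpus_conflicts_with_boards(16, 2, 2, 4, 2)
```

Under this reading, a configuration that lists both values is reported as invalid only in the second case.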

NEWS entry about better backfill (commit 3e8aa451). · 88df7a81
Danny Auble authored May 30, 2017

88df7a81

26 May, 2017 2 commits

Follow up to commit . This makes this logic optional with · 8c2f4508

Danny Auble authored May 26, 2017

the SchedulerParameters=reduce_completing_frag option.

NOTE: reduce_completing_frag on or off only works with CompletingWait set to
something.

Bug 3756

8c2f4508

Preserve earliest start time for jobs · 86884fb6

Gary authored May 26, 2017

For jobs submitted to multiple partitions, report the job's earliest start
    time for any partition.
bug 3754

86884fb6

25 May, 2017 8 commits

When scheduling take the nodes in completing jobs out of the mix to reduce · 54223710
Doug Jacobsen authored May 25, 2017
```
fragmentation.

Bug 3756
```
54223710

Prevent a race between completing jobs on a user-exclusive node from leaving the node owned. · cd9ff91b

Dominik Bartkiewicz authored May 25, 2017

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving it, add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

cd9ff91b
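
The "single exit path" idea from this commit can be approximated in Python with `try`/`finally`, which plays the role of the C `fini` label. This is a hedged analogue, not the actual make_node_idle(): the `job_already_completed` flag and the dict-based node are invented for the example.

```python
def make_node_idle(node, job_already_completed):
    """Hypothetical analogue: every exit path runs the owner-count decrement."""
    try:
        if job_already_completed:
            # An early return like this is the kind of path that could
            # otherwise skip the decrement.
            return "early-return"
        node["state"] = "IDLE"
        return "idle"
    finally:
        # Equivalent of the shared "fini" block, with an underflow guard.
        if node["owner_job_cnt"] > 0:
            node["owner_job_cnt"] -= 1


node = {"state": "COMPLETING", "owner_job_cnt": 2}
make_node_idle(node, job_already_completed=True)
make_node_idle(node, job_already_completed=False)
make_node_idle(node, job_already_completed=False)  # extra call: guard holds at 0
```

Because the decrement sits in one place that every return passes through, the counter cannot be left stale by an early exit, and the guard keeps it from going below zero.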

Prevent a job tested on multiple partitions from being marked WHOLE_NODE_USER. · 162f6a05

Dominik Bartkiewicz authored May 25, 2017

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

162f6a05

Revert "Prevent a job tested on multiple partitions from being marked" · f1a45962
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit 9128476a.
```
f1a45962
Revert "Prevent a race between completing jobs on a user-exclusive node from" · 82b0f802
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit a02d04f1.
```
82b0f802

Prevent a race between completing jobs on a user-exclusive node from · a02d04f1

Tim Wickberg authored May 25, 2017

leaving the node owned.

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving it, add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

a02d04f1

Prevent a job tested on multiple partitions from being marked · 9128476a

Tim Wickberg authored May 25, 2017

WHOLE_NODE_USER.

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

9128476a

Fix WithSubAccounts option to not include WithDeleted unless requested. · 29ebc4b2

Alejandro Sanchez authored May 25, 2017

_setup_assoc_cond_limits was using the table 'prefix' passed as an argument
to build the prefix.deleted=something condition in the where clause.

It turns out that _setup_assoc_cond_limits is called by these functions:
as_mysql_modify_assocs
as_mysql_remove_assocs
as_mysql_get_assocs
as_mysql_acct_no_users

which set the prefix to 't2' before the call if a QOS is provided or if
WithSubAccounts is provided. The 't2' prefix is fine for other where
conditions in that case, but the deleted condition needs t1,
which is the table the records are selected from.

Bug 3835

29ebc4b2
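
The effect of this fix can be sketched as follows. This is a simplified Python illustration, not the accounting storage plugin's code: `assoc_where_clause`, the `acct` column, and the `'physics'` value are hypothetical and exist only to show that the deleted filter is pinned to t1 while other conditions keep the caller's prefix.

```python
def assoc_where_clause(prefix, with_deleted):
    """Hypothetical sketch of where-clause assembly."""
    conditions = []
    if not with_deleted:
        # Always filter on the table being selected from (t1), never `prefix`.
        conditions.append("t1.deleted=0")
    # Other conditions may still use the caller-supplied prefix (t1 or t2).
    conditions.append(f"{prefix}.acct='physics'")
    return " AND ".join(conditions)


sub_accounts_clause = assoc_where_clause("t2", with_deleted=False)
with_deleted_clause = assoc_where_clause("t2", with_deleted=True)
```

In the first call the deleted filter applies to t1 even though the caller passed t2, which matches the behavior the commit message asks for.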