Commits · 7d488e2bcb8637e732f5c90c50b141c151562041 · Manuel G. Marciani / ces_slurm_simulator

01 Jun, 2017 2 commits

Always generate core in slurmd, even on non-developer builds. · 7d488e2b
Tim Wickberg authored Jun 01, 2017
```
Bug 3857.
```
7d488e2b

Handle file deletion for purge_old_job() in a separate thread. · b9719be2

Tim Wickberg authored May 24, 2017

File deletion can be slow, especially when StateSaveLocation is on
NFS or other network filesystems. Since purge_old_job() holds all
the slurmctld write locks, this is especially performance sensitive.

Moving this to an independent thread lets the slower filesystem
cleanup happen without owning these locks. purge_old_job() then
results in the purged job ids being queued in the purge_list.

A race with the job id potentially wrapping around again is already
prevented by _dup_job_file_test() in get_next_job_id().

Bug 3763.

b9719be2
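The hand-off described in this commit can be sketched as a small lock-protected queue. This is only an illustration under stated assumptions: `purge_enqueue`, `purge_drain`, `purge_lock`, and `delete_files_stub` are hypothetical names, not Slurm's, and the cleanup thread itself is omitted (it would call `purge_drain` in a loop).

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue of job ids whose state files still need deleting. */
struct purge_item {
	uint32_t job_id;
	struct purge_item *next;
};

static struct purge_item *purge_list = NULL;
static pthread_mutex_t purge_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the slow filesystem cleanup; it only records the id. */
static uint32_t last_deleted = 0;
static void delete_files_stub(uint32_t job_id)
{
	last_deleted = job_id;
}

/* Called while the big controller locks are held: only queues the id. */
static int purge_enqueue(uint32_t job_id)
{
	struct purge_item *item = malloc(sizeof(*item));
	if (!item)
		return -1;
	item->job_id = job_id;
	pthread_mutex_lock(&purge_lock);
	item->next = purge_list;
	purge_list = item;
	pthread_mutex_unlock(&purge_lock);
	return 0;
}

/* One pass of the cleanup thread: detach the whole list under the small
 * lock, then do the slow work with no lock held. Returns ids processed. */
static int purge_drain(void (*delete_files)(uint32_t job_id))
{
	pthread_mutex_lock(&purge_lock);
	struct purge_item *item = purge_list;
	purge_list = NULL;
	pthread_mutex_unlock(&purge_lock);

	int purged = 0;
	while (item) {
		struct purge_item *next = item->next;
		delete_files(item->job_id);
		free(item);
		item = next;
		purged++;
	}
	return purged;
}
```

The point of the split is that the caller holding the write locks pays only for a list insertion, while the file deletion happens later without those locks.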

31 May, 2017 3 commits
- Add warning about libcurl-devel not being installed during configure. · 0e582365
  Danny Auble authored May 31, 2017
  
  0e582365
- Prevent segfault in sacctmgr due to bad handling of return code. · 15276c01
  Tim Shaw authored May 30, 2017
```
Bug 3840.
```
  15276c01
- Fix NEWS line from commit 56ea068c. · 1503bdcc
  Tim Shaw authored May 30, 2017
  
  1503bdcc
30 May, 2017 2 commits
- don't clear GRES from non-KNL node · 56ea068c
  Tim Shaw authored May 30, 2017
```
node_features/knl_cray plugin: Don't clear configured GRES from non-KNL node.
bug 3768
```
  56ea068c
- NEWS entry about better backfill (commit 3e8aa451). · 88df7a81
  Danny Auble authored May 30, 2017
  
  88df7a81
26 May, 2017 2 commits

Follow up to commit . This makes this logic optional with · 8c2f4508

Danny Auble authored May 26, 2017

the SchedulerParameters=reduce_completing_frag option.

NOTE: reduce_completing_frag on or off only works with CompletingWait set to
something.

Bug 3756

8c2f4508

Preserve earliest start time for jobs · 86884fb6

Gary authored May 26, 2017

For jobs submitted to multiple partitions, report the job's earliest start
time for any partition.
bug 3754

86884fb6

25 May, 2017 8 commits

When scheduling take the nodes in completing jobs out of the mix to reduce · 54223710
Doug Jacobsen authored May 25, 2017
```
fragmentation.

Bug 3756
```
54223710

Prevent a race between completing jobs on a user-exclusive node from leaving the node owned. · cd9ff91b

Dominik Bartkiewicz authored May 25, 2017

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving that add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

cd9ff91b
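A minimal sketch of the "single exit through a fini label" pattern this commit describes, assuming a trimmed-down node record. The struct, the `already_idle` field, and `make_node_idle_sketch` are hypothetical stand-ins; only `owner_job_cnt` comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down node record; only the counter matters here. */
struct node {
	uint16_t owner_job_cnt;
	bool already_idle;
};

/* Early exits jump to "fini" instead of returning, so the owner count is
 * decremented on every path, and never below zero. */
static void make_node_idle_sketch(struct node *node_ptr, bool job_owns_node)
{
	if (node_ptr->already_idle)
		goto fini;	/* a bare return here would skip the decrement */

	node_ptr->already_idle = true;

fini:
	if (job_owns_node) {
		if (node_ptr->owner_job_cnt > 0)
			node_ptr->owner_job_cnt--;
		/* else: decrementing would wrap the unsigned counter */
	}
}
```

With two jobs owning the node, two calls bring the counter to zero, and a stray third call leaves it at zero instead of wrapping to 65535.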

Prevent a job tested on multiple partitions from being marked WHOLE_NODE_USER. · 162f6a05

Dominik Bartkiewicz authored May 25, 2017

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

162f6a05

Revert "Prevent a job tested on multiple partitions from being marked" · f1a45962
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit 9128476a.
```
f1a45962
Revert "Prevent a race between completing jobs on a user-exclusive node from" · 82b0f802
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit a02d04f1.
```
82b0f802

Prevent a race between completing jobs on a user-exclusive node from · a02d04f1

Tim Wickberg authored May 25, 2017

leaving the node owned.

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving that add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

a02d04f1

Prevent a job tested on multiple partitions from being marked · 9128476a

Tim Wickberg authored May 25, 2017

WHOLE_NODE_USER.

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

9128476a

Fix WithSubAccounts option to not include WithDeleted unless requested. · 29ebc4b2

Alejandro Sanchez authored May 25, 2017

_setup_assoc_cond_limits was using the table 'prefix' passed by argument
in the where clause to select the where clause prefix.deleted=something.

It turns out that _setup_assoc_cond_limits is called by these functions:
as_mysql_modify_assocs
as_mysql_remove_assocs
as_mysql_get_assocs
as_mysql_acct_no_users

which set the prefix to 't2' before the call if a QOS is provided or if
WithSubAccounts is provided. The 't2' prefix is fine for other where
conditions in that case, but for choosing the deleted we need the t1
which is the table we're selecting the records off.

Bug 3835

29ebc4b2
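The idea can be illustrated with a small string-building sketch. It is not the actual `_setup_assoc_cond_limits` code: `build_where`, its parameters, and the exact SQL text are assumptions made for illustration. The only point taken from the commit is that the `deleted` filter is pinned to `t1` regardless of which prefix the other conditions use.

```c
#include <stdio.h>

/* Hypothetical sketch: "prefix" may be "t1" or "t2" depending on the query,
 * but the deleted filter always targets t1, the table being selected from. */
static int build_where(char *buf, size_t len, const char *prefix,
		       const char *acct, int with_deleted)
{
	if (with_deleted)
		return snprintf(buf, len,
				"where (t1.deleted=0 || t1.deleted=1) && %s.acct='%s'",
				prefix, acct);
	return snprintf(buf, len, "where t1.deleted=0 && %s.acct='%s'",
			prefix, acct);
}
```

Calling it with prefix `"t2"` and `with_deleted` set to 0 still yields a `t1.deleted=0` condition, so deleted rows are excluded unless explicitly requested.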

24 May, 2017 4 commits
- Check if variable given to scontrol show job is a valid jobid. · ea906a24
  Tim Shaw authored May 24, 2017
```
Bug 3821
```
  ea906a24
- Handle a reservation update to UNLIMITED correctly. · 6180ff64
  Tim Wickberg authored May 23, 2017
```
'scontrol update reservationname=foo duration=unlimited' sets INFINITE
as the duration, which needs to be translated to a year as is done
elsewhere. Otherwise it'll convert to 49710 days, which is definitely
wrong.

Bug 3836.
```
  6180ff64
- Fix unsafe MAX() macro use that can lead to repeated cancellation attempts in scancel. · 5bc278f7
  Alejandro Sanchez authored May 23, 2017
```
Bug 3443.
```
  5bc278f7
- Fix unsafe use of MAX macro that could lead to problems with acct_gather plugins. · 03a374d3
  Alejandro Sanchez authored May 23, 2017
```
MAX() will re-evaluate the higher value argument; if this is a function
it may be called twice over, leading to unintended side effects or a
crash.

Bug 3443.
```
  03a374d3
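The double evaluation mentioned in the last commit above can be reproduced with a short sketch. `UNSAFE_MAX`, `next_value`, and the two wrapper functions are hypothetical names used for illustration; the macro body is the conventional ternary form and is assumed, not copied from Slurm.

```c
/* Classic function-like macro: the larger argument is expanded, and so
 * evaluated, twice: once in the comparison and once as the result. */
#define UNSAFE_MAX(a, b) ((a) > (b) ? (a) : (b))

static int call_count = 0;

/* Hypothetical function with a side effect: it counts its own calls. */
static int next_value(void)
{
	call_count++;
	return 10;
}

/* Unsafe: next_value() appears twice in the macro expansion. */
static int unsafe_max_example(void)
{
	call_count = 0;
	(void) UNSAFE_MAX(next_value(), 5);
	return call_count;
}

/* Safe: evaluate once into a local, then compare plain values. */
static int safe_max_example(void)
{
	call_count = 0;
	int v = next_value();
	(void) UNSAFE_MAX(v, 5);
	return call_count;
}
```

Here `unsafe_max_example()` reports two calls to `next_value()` and `safe_max_example()` reports one, which is why storing the function result in a temporary before using the macro avoids the repeated side effect.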
23 May, 2017 2 commits

Fix it so the backup slurmdbd will take control correctly. · 4f87dc53

Danny Auble authored May 23, 2017

This also fixes the fed_mgr on the backup slurmctld to start backup
correctly if the backup takes control more than once.

Bug 3827

4f87dc53

Fix Partition line in 'scontrol show node'. · e089a84f

Tim Shaw authored May 22, 2017

Previously, incorrect partitions and duplicated partition names
could be shown.

The array needs to be incremented by two, not one, as each
element is a start-end pair.

Bug 3793.

e089a84f
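The off-by-one-stride bug described here can be sketched as follows. The layout is an assumption for illustration: a flat `int` array of inclusive start/end pairs terminated by `-1`. Both function names are hypothetical.

```c
#include <stdbool.h>

/* Correct walk: step by two, since pairs[i] is a start and pairs[i + 1]
 * is its matching end. */
static bool index_in_ranges(const int *pairs, int idx)
{
	for (int i = 0; pairs[i] != -1; i += 2) {
		if (idx >= pairs[i] && idx <= pairs[i + 1])
			return true;
	}
	return false;
}

/* Buggy walk: stepping by one also treats (end, next start) as a range,
 * which reports indexes that belong to no range at all. */
static bool index_in_ranges_buggy(const int *pairs, int idx)
{
	for (int i = 0; pairs[i] != -1 && pairs[i + 1] != -1; i++) {
		if (idx >= pairs[i] && idx <= pairs[i + 1])
			return true;
	}
	return false;
}
```

For the ranges 0-1 and 4-5, index 3 lies in the gap: the stride-two walk rejects it, while the stride-one walk accepts it because it reads 1-4 as a range.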

22 May, 2017 1 commit
- Fix null dereference in sreport cluster utilization · c30629bc
  Brian Christiansen authored May 22, 2017
```
when configured with memory-leak-debug
```
  c30629bc
19 May, 2017 6 commits
- When doing a dlopen on liblua only attempt the version compiled against. · e75f6118
  Danny Auble authored May 19, 2017
```
Bug 2131
```
  e75f6118
- Add missing QOS read lock to backfill scheduler. · 5d948801
  Danny Auble authored May 19, 2017
```
Bug 3776
```
  5d948801
- node_features/knl_generic: Do not repeatedly log errors when trying to read · ea2a0d25
  Morris Jette authored May 19, 2017
```
KNL modes if not KNL system.

Bug 3825
```
  ea2a0d25
- Revert "node_features/knl_generic: Do not repeatedly log errors when trying to read" · 4e7794e7
  Danny Auble authored May 19, 2017
```
This reverts commit c2380520.
```
  4e7794e7
- node_features/knl_generic: Do not repeatedly log errors when trying to read · c2380520
  Danny Auble authored May 19, 2017
```
KNL modes if not KNL system.

Bug 3825
```
  c2380520
- node_features/knl_cray: Preserve non-KNL active features if slurmctld · bc484054
  Morris Jette authored May 19, 2017
```
reconfigured while node boot in progress.

Bug 3679
```
  bc484054
18 May, 2017 1 commit
- Fix minor typos in the documentation · 0bc04046
  Damien François authored May 18, 2017
```
bug 3822
```
  0bc04046
17 May, 2017 3 commits
- Calculate priority correctly when 'nice' is given. · a1168840
  Dominik Bartkiewicz authored May 17, 2017
```
Bug 3708
```
  a1168840
- NEWS for commit 79ff60f4 · 3618e592
  Danny Auble authored May 17, 2017
  
  3618e592
- Add support for lua5.3. · 7cc4d0d8
  Danny Auble authored May 17, 2017
```
In 17.11 (or another future version) we should move a lot of this common
code into a new lib.  The reason I didn't put these common changes
into common/xlua.c was because then I would have to link common to
liblua which I really didn't want to do.
```
  7cc4d0d8
16 May, 2017 4 commits
- Add missing locks to job_submit/pbs plugin when updating a job's · 5674dd74
  Dominik Bartkiewicz authored May 16, 2017
```
dependencies.

Bug 3708
```
  5674dd74
- Fix incorrect lock levels when testing when job will run or updating a job. · 1120d85a
  Tim Wickberg authored May 16, 2017
```
Bug 3772
```
  1120d85a
- Test if the node_bitmap on a job is NULL when testing if the job's nodes · e9ab5517
  Morris Jette authored May 15, 2017
```
are ready.  This will be NULL if a job was revoked while beginning.
```
  e9ab5517
- Add new burst_buffer function bb_g_job_revoke_alloc() to be executed · e6fa25fa
  Morris Jette authored May 15, 2017
```
if there was a failure after the initial resource allocation. Does not
release previously allocated resources.

Bug 3783

This is the initial patch that adds the stubs for the logic.  Outside of
that this patch really does nothing.
```
  e6fa25fa
15 May, 2017 2 commits
- node_features/knl_generic disable mode change unless RebootProgram · 60a2bd6f
  Morris Jette authored May 15, 2017
```
configured.
```
  60a2bd6f
- node_features/knl_generic - If a node is rebooted for a pending job, but · 8befe639
  Morris Jette authored May 15, 2017
```
fails to enter the desired NUMA and/or MCDRAM mode then drain the node and
requeue the job.

Bug 3785
```
  8befe639