Commits · 6c45c68055bca940aa7cdd67736872060aadc9c5 · Manuel G. Marciani / ces_slurm_simulator

01 Jun, 2017 6 commits

Put in the word 'extern' for consistency's sake. · 6c45c680
Danny Auble authored Jun 01, 2017

6c45c680

Handle file deletion for purge_old_job() in a separate thread. · b9719be2

Tim Wickberg authored May 24, 2017

File deletion can be slow, especially when StateSaveLocation is on
NFS or other network filesystems. Since purge_old_job() holds all
the slurmctld write locks, this is especially performance sensitive.

Moving this to an independent thread lets the slower filesystem
cleanup happen without owning these locks. purge_old_job() then
results in the purged job ids being queued in the purge_list.

A race with the job id potentially wrapping around again is already
prevented by _dup_job_file_test() in get_next_job_id().

Bug 3763.

b9719be2
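The hand-off described in b9719be2 can be sketched roughly as follows. This is a minimal illustration, not the actual slurmctld code: `struct purge_queue`, `purge_enqueue()`, `purge_worker()`, `purge_queue_shutdown()` and the `delete_job_files()` stub are hypothetical names. The caller that holds the controller locks only appends a job id to the queue, and a separate thread performs the slow file deletion with no lock held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue of job ids whose files still need to be removed. */
struct purge_node {
	uint32_t job_id;
	struct purge_node *next;
};

struct purge_queue {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	struct purge_node *head, *tail;
	int shutdown;
	unsigned long deleted;	/* number of ids the worker has processed */
};

/* Hypothetical stand-in for the slow filesystem cleanup (e.g. on NFS). */
static void delete_job_files(uint32_t job_id)
{
	(void) job_id;
}

static void purge_queue_init(struct purge_queue *q)
{
	assert(q != NULL);
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->cond, NULL);
	q->head = q->tail = NULL;
	q->shutdown = 0;
	q->deleted = 0;
}

/* Cheap operation: safe to call while the caller holds its own write locks. */
static int purge_enqueue(struct purge_queue *q, uint32_t job_id)
{
	struct purge_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->job_id = job_id;
	n->next = NULL;
	pthread_mutex_lock(&q->lock);
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
	pthread_cond_signal(&q->cond);
	pthread_mutex_unlock(&q->lock);
	return 0;
}

/* Worker thread: drains the queue, deleting files without holding q->lock. */
static void *purge_worker(void *arg)
{
	struct purge_queue *q = arg;

	pthread_mutex_lock(&q->lock);
	for (;;) {
		while (!q->head && !q->shutdown)
			pthread_cond_wait(&q->cond, &q->lock);
		if (!q->head)	/* shutdown requested and queue is empty */
			break;
		struct purge_node *n = q->head;
		q->head = n->next;
		if (!q->head)
			q->tail = NULL;
		pthread_mutex_unlock(&q->lock);
		delete_job_files(n->job_id);	/* slow part, no lock held */
		free(n);
		pthread_mutex_lock(&q->lock);
		q->deleted++;
	}
	pthread_mutex_unlock(&q->lock);
	return NULL;
}

/* Ask the worker to exit once everything already queued has been handled. */
static void purge_queue_shutdown(struct purge_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->shutdown = 1;
	pthread_cond_broadcast(&q->cond);
	pthread_mutex_unlock(&q->lock);
}
```

A caller would start `purge_worker` once with `pthread_create()`, call `purge_enqueue()` from the purge path, and call `purge_queue_shutdown()` followed by `pthread_join()` at exit. The queue has its own small mutex so that the worker never needs the larger controller locks.
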

Make _delete_job_details a static function. · ce2cd1b2
Tim Wickberg authored May 24, 2017
```
Only called from _list_delete_job once the MinJobAge has
passed.
```
ce2cd1b2

Remove timeout code from job_purge_old. · 843e5d38

Tim Wickberg authored May 24, 2017

This will need to be handled differently. The timeout can
lead to the purge process falling further and further behind
on high throughput systems if the number of job scripts that
can be deleted within a second is lower than the job submission
and completion rate of the cluster, eventually leading to
the MaxJobCount limit being reached.

Bug 3763.

843e5d38

Better commit from last · cff4e661
Danny Auble authored Jun 01, 2017

cff4e661
Update test to print warning about rxvt/aterm ignoring SIGFPE. · 3b20bc1d
Danny Auble authored Jun 01, 2017

3b20bc1d

31 May, 2017 6 commits
- Fix test to print out the slurmd node name instead of gethostname so · 9f2b3ffe
  Danny Auble authored May 31, 2017
```
it works better on multi-slurmd installs.
```
  9f2b3ffe
- Docs - clarify performance issues from ConstrainRAMSpace. · 2e833147
  Tim Wickberg authored May 31, 2017
```
Revert some of my b50f4661. Elaborate on tradeoffs, and point
to HTC page as well which is a better location for this info.
```
  2e833147
- Add warning about libcurl-devel not being installed during configure. · 0e582365
  Danny Auble authored May 31, 2017
  
  0e582365
- Docs - remove reference to ConstrainRAMSpace in HTC. · b50f4661
  Tim Wickberg authored May 31, 2017
```
This is better discussed in the high_throughput.shtml doc.

Also, "Contrain" is misspelled adding to the confusion.
```
  b50f4661
- Prevent segfault in sacctmgr due to bad handling of return code. · 15276c01
  Tim Shaw authored May 30, 2017
```
Bug 3840.
```
  15276c01
- Fix NEWS line from commit 56ea068c. · 1503bdcc
  Tim Shaw authored May 30, 2017
  
  1503bdcc
30 May, 2017 6 commits
- don't clear GRES from non-KNL node · 56ea068c
  Tim Shaw authored May 30, 2017
```
node_features/knl_cray plugin: Don't clear configured GRES from non-KNL node.
bug 3768
```
  56ea068c
- Reset variables if no qos exist. Continuation of 8129acfe and others · 757a3169
  Danny Auble authored May 30, 2017
  
  757a3169
- Fix correct checking for commit 8129acfe · 50fffb31
  Danny Auble authored May 30, 2017
  
  50fffb31
- NEWS entry about better backfill (commit 3e8aa451). · 88df7a81
  Danny Auble authored May 30, 2017
  
  88df7a81
- Avoid need for lock from previous commit. · 8129acfe
  Danny Auble authored May 30, 2017
  
  8129acfe
- Use job_state_qos_grp_limit for more full result from previous commits · f8ca7493
  Danny Auble authored May 30, 2017
  
  f8ca7493
26 May, 2017 6 commits
- Add function to determine if a job is held by a QOS GRP limit. · 44423dad
  Danny Auble authored May 26, 2017
  
  44423dad
- Backfill partitions that use QOS Grp limits to "float" better. · 3e8aa451
  Dominik Bartkiewicz authored May 26, 2017
```
Initial fix for handling floating partitions that use qos grp limits.

Bug 3776
```
  3e8aa451
- Follow up to commit 54223710. This makes this logic optional with · 8c2f4508
  Danny Auble authored May 26, 2017
```
the SchedulerParameters=reduce_completing_frag option.

NOTE: reduce_completing_frag on or off only works with CompletingWait set to
something.

Bug 3756
```
  8c2f4508
- replace list_iterator with list_for_each · 7cf6b54e
  Dominik Bartkiewicz authored May 26, 2017
```
This will improve performance and simplify the code.
bug 3757
```
  7cf6b54e
- Remove vestigial logic · 85978ea4
  Gary authored May 26, 2017
```
bug 3754
```
  85978ea4
- Preserve earliest start time for jobs · 86884fb6
  Gary authored May 26, 2017
```
For jobs submitted to multiple partitions, report the job's earliest start
    time for any partition.
bug 3754
```
  86884fb6
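For the list_for_each change above (7cf6b54e), the callback style of traversal can be illustrated with a small stand-alone sketch. It does not use Slurm's list implementation: `struct item`, `sketch_list_for_each()` and `sum_values()` are hypothetical names chosen for this example. The point is only that the caller passes a function and an argument instead of creating, stepping and destroying an iterator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked list element. */
struct item {
	int value;
	struct item *next;
};

typedef int (*for_each_fn)(void *item, void *arg);

/*
 * Call fn on each element in order. A negative return value from fn
 * stops the traversal early. Returns the number of elements visited.
 */
static int sketch_list_for_each(struct item *head, for_each_fn fn, void *arg)
{
	int visited = 0;

	assert(fn != NULL);
	for (struct item *it = head; it; it = it->next) {
		visited++;
		if (fn(it, arg) < 0)
			break;
	}
	return visited;
}

/* Example callback: accumulate the values into the long pointed to by arg. */
static int sum_values(void *item, void *arg)
{
	*(long *) arg += ((struct item *) item)->value;
	return 0;
}
```
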
25 May, 2017 13 commits

Testsuite - do not run test35.3 as root. · 91cf82e3
Isaac Hartung authored May 25, 2017
```
Burst buffer jobs cannot be run as root currently, change
test to prevent that.

Bug 3723.
```
91cf82e3
Code cleanup from previous 2 commits. · c3ddec0e
Danny Auble authored May 25, 2017
```
Bug 3756
```
c3ddec0e
Simpler bit operations from previous commit · f0023bac
Dominik Bartkiewicz authored May 25, 2017
```
Bug 3756
```
f0023bac
When scheduling take the nodes in completing jobs out of the mix to reduce · 54223710
Doug Jacobsen authored May 25, 2017
```
fragmentation.

Bug 3756
```
54223710

Prevent a race between completing jobs on a user-exclusive node from leaving the node owned. · cd9ff91b

Dominik Bartkiewicz authored May 25, 2017

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving it, add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

cd9ff91b

Prevent a job tested on multiple partitions from being marked WHOLE_NODE_USER. · 162f6a05

Dominik Bartkiewicz authored May 25, 2017

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

162f6a05

Revert "Prevent a job tested on multiple partitions from being marked" · f1a45962
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit 9128476a.
```
f1a45962
Revert "Prevent a race between completing jobs on a user-exclusive node from" · 82b0f802
Tim Wickberg authored May 25, 2017
```
Wrong author attributed by mistake.

This reverts commit a02d04f1.
```
82b0f802

Prevent a race between completing jobs on a user-exclusive node from · a02d04f1

Tim Wickberg authored May 25, 2017

leaving the node owned.

Two jobs completing simultaneously leads to make_node_idle()
returning before it has a chance to decrement node_ptr->owner_job_cnt,
which can result in the node being "owned" by that user even
though no jobs are running on it.

Move the decrement block to the end at a fini label, and make sure
all return paths pass through it. While moving it, add a guard
against node_ptr->owner_job_cnt underflowing.

Bug 3771.

a02d04f1

Prevent a job tested on multiple partitions from being marked · 9128476a

Tim Wickberg authored May 25, 2017

WHOLE_NODE_USER.

If a job is considered on a partition with ExclusiveUser=YES
then it would be marked as if it was submitted with the
--exclusive flag, which would lead to delays launching it
on ExclusiveUser=NO partitions, and cause lower-than-expected
cluster usage.

As a side effect, the job_ptr->part_ptr->flags need to be
tested wherever WHOLE_NODE_USER is considered, instead of
just job_ptr->details->whole_node.

Bug 3771.

9128476a

Fix WithSubAccounts option to not include WithDeleted unless requested. · 29ebc4b2

Alejandro Sanchez authored May 25, 2017

_setup_assoc_cond_limits was using the table 'prefix' passed as an argument
to build the deleted filter of the where clause (prefix.deleted=something).

It turns out that _setup_assoc_cond_limits is called by these functions:
as_mysql_modify_assocs
as_mysql_remove_assocs
as_mysql_get_assocs
as_mysql_acct_no_users

These set the prefix to 't2' before the call if a QOS or WithSubAccounts
is provided. The 't2' prefix is fine for the other where conditions in
that case, but the deleted filter needs 't1', which is the table the
records are selected from.

Bug 3835

29ebc4b2
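The distinction described in 29ebc4b2 can be shown with a small string-building sketch. This is not the accounting storage plugin's code: `build_assoc_where()`, its parameters, the `acct` column and the exact SQL text are hypothetical. Only the idea is taken from the commit message, namely that the deleted filter always refers to t1 while the other conditions use whatever prefix the caller passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a where clause into buf. The deleted filter is pinned to t1, the
 * table the records are selected from; the account condition uses the
 * caller's prefix (for example "t2"). Returns 0 on success, -1 if the
 * clause did not fit. A real implementation must escape acct; this sketch
 * assumes trusted input.
 */
static int build_assoc_where(char *buf, size_t len, const char *prefix,
			     bool with_deleted, const char *acct)
{
	int n;

	assert(buf && prefix && acct);

	if (with_deleted)
		n = snprintf(buf, len,
			     "where (t1.deleted=0 OR t1.deleted=1) AND %s.acct='%s'",
			     prefix, acct);
	else
		n = snprintf(buf, len,
			     "where t1.deleted=0 AND %s.acct='%s'",
			     prefix, acct);

	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```
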

OK, hopefully the last time on this one. Follow on to commit d9f5ac02 · 98425be8
Alejandro Sanchez authored May 25, 2017

98425be8
Correction to capmc JSON formatting · c4a64107
Tim Shaw authored May 25, 2017

c4a64107

24 May, 2017 3 commits
- remove unneeded return after fatal in commit d9f5ac02 · 7e2688a0
  Danny Auble authored May 24, 2017
  
  7e2688a0
- Remove unneeded clause from commit 6d2d54fc · fa59a3a1
  Danny Auble authored May 24, 2017
  
  fa59a3a1
- Revert change from in case to incase on automatically generated file. · 6d2d54fc
  Danny Auble authored May 24, 2017
```
There isn't much we can do about this, it will always be misspelled until
they fix it upstream.  We could correct it, but then every time we run
autogen.sh we would have to ignore the change which seems like more work
than I would want to keep doing.
```
  6d2d54fc