Commits · 17f6229b828c9839feb1aa808f5fb60332bfa3c9 · Manuel G. Marciani / ces_slurm_simulator

16 May, 2018 9 commits

Fix sacct error for jobs not started · 17f6229b

Morris Jette authored May 16, 2018

This bug was introduced in commit 4a9bffe1.
Test12.7 was logging error messages of this sort:
sacct: error: slurmdb_ave_tres_usage: couldn't make tres_list from ''
This was due to the tres_usage_in_ave and tres_usage_out_ave fields
being empty ('\0') if the job's cpu count is zero, which makes
calculation of averages impossible.
Bug 2782.
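
The failure mode above can be illustrated with a minimal Python sketch. It assumes TRES usage strings use the comma-separated `id=count` form and that averages are taken over the CPU count, as the message implies. The names `parse_tres` and `average_tres_usage` are hypothetical and are not Slurm's C API.

```python
def parse_tres(tres_str):
    """Parse a TRES string such as '1=400,2=8192' into {id: count}."""
    usage = {}
    for item in tres_str.split(","):
        tres_id, _, count = item.partition("=")
        usage[int(tres_id)] = int(count)
    return usage


def average_tres_usage(tres_str, cpu_count):
    """Return per-CPU averages, or None when no average can be computed."""
    # Assumption: a job that never started has no CPUs and an empty usage
    # string, so the calculation is skipped instead of logging an error.
    if not tres_str or cpu_count <= 0:
        return None
    return {tres_id: count / cpu_count
            for tres_id, count in parse_tres(tres_str).items()}
```

The point of the sketch is only the guard: checking for the empty string before parsing avoids treating a job that has not started as an error.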

docs - add a note to slurm.conf man page clarifying slurmdbd log rotation. · ebed79e5
Alejandro Sanchez authored May 16, 2018
```
Bug 5174.
```
docs - remove 'nocreate' option from slurm.conf man logrotate example. · 7d03a8c8
Dan Barke authored May 16, 2018
```
Since having 'nocreate' would override the following option:
create 640 slurm root

Bug 5174.
```

Add reboot node weight parameter · 5ca16284

Morris Jette authored May 15, 2018

Add node_features plugin function "node_features_p_reboot_weight()" to
return the node weight to be used for a compute node that requires reboot
for use (e.g. to change the NUMA mode of a KNL node).
Add NodeRebootWeight parameter to knl.conf configuration file.
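
How such a weight might influence node selection can be sketched in Python. This is an illustration under assumptions, not the plugin's implementation: it assumes the reboot weight replaces the node's normal weight when a reboot is required and that lower-weight nodes are considered first. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int
    needs_reboot: bool = False  # e.g. a KNL node in the wrong NUMA mode


def effective_weight(node, reboot_weight):
    """Weight used for scheduling (assumed to replace the normal weight)."""
    return reboot_weight if node.needs_reboot else node.weight


def order_nodes(nodes, reboot_weight):
    """Order candidate nodes so that lower effective weights come first."""
    return sorted(nodes, key=lambda n: effective_weight(n, reboot_weight))
```

With a large reboot weight, nodes that can be used without a reboot sort ahead of those that cannot, which is the kind of preference the parameter makes configurable.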

Make a test more robust · b3c73b97

Morris Jette authored May 15, 2018

If ReturnToService=2 is configured, the test could generate an error
when changing the node state to resume after setting it to down. The
reason is that if the node communicates with slurmctld, its state is
automatically changed from down to idle, and resuming an idle
node triggers an error.
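
One way a test could tolerate this race is sketched below in Python. The commit message does not show the actual change, so this is only an assumed approach. `get_state` and `resume` are hypothetical callables standing in for whatever the test uses to query and update the node.

```python
def resume_node(get_state, resume):
    """Resume a node unless it has already returned to service."""
    state = get_state().lower()
    # With ReturnToService=2 the node may already be idle again, in which
    # case a resume request would be rejected, so it is skipped.
    if state.startswith("idle"):
        return "already-idle"
    resume()
    return "resumed"
```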

Run autogen.sh after previous commit. · 525125b1
Alejandro Sanchez authored May 15, 2018
```
Bug 5168.
```

PMIx - override default paths at configure time if --with-pmix is used. · 0b7bbc73

Alejandro Sanchez authored May 15, 2018

Previously the default paths continued to be tested even when new ones
were requested. As a result, if any of the new paths was the same as
one of the default ones (i.e. /usr or /usr/local), the configure script
incorrectly errored out, reporting that a version of PMIx had already
been found in a previous path.

Bug 5168.
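
The intended path selection can be sketched in Python. The real logic lives in the configure script, so this is only an illustration. The function name and the list-based `requested` argument are hypothetical. The two default paths are the ones named in the message.

```python
DEFAULT_PMIX_PATHS = ["/usr", "/usr/local"]


def pmix_search_paths(requested=None):
    """Return the paths to probe for PMIx.

    Requested paths replace the defaults instead of being appended to
    them, so a requested path equal to a default is probed only once.
    """
    if requested:
        # dict.fromkeys drops duplicates while preserving order.
        return list(dict.fromkeys(requested))
    return list(DEFAULT_PMIX_PATHS)
```

For example, `pmix_search_paths(["/usr"])` yields only `/usr`, so nothing is reported as already found in a previous path.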

Fixes for test7.17 · 59be3d83
Morris Jette authored May 16, 2018
```
Variable initialization plus cosmetic work
```

Add some tres-step related logic · 22bf408d

Morris Jette authored May 14, 2018

Rename gres_per_job for step to gres_per_step
Remove job gres gres_name_type_id field
Build step gres data structure

14 May, 2018 2 commits
- Add time limits to some tests · 42d3f291
  Morris Jette authored May 14, 2018
```
Prevent run-away jobs
```
- Remove "gres" from job submit/update RPC · ae8a5ade
  Morris Jette authored May 14, 2018
  
11 May, 2018 7 commits
- Merge branch 'slurm-17.11' · c7f97d2f
  Morris Jette authored May 11, 2018
  
- Disable pack-step tests on Cray · e3c355b1
  Morris Jette authored May 11, 2018
```
This is not currently supported and no date for support has been set.
```
- Make dwstat test more robust · 73d0d0f8
  Morris Jette authored May 11, 2018
```
If burst_buffer.conf has GetSysState configured to a non-standard
location, but GetSysStatus is not configured, that is likely indicative
of a bad configuration rather than a Slurm failure.
```
- Improve test error handling · 1dd0d6f7
  Morris Jette authored May 11, 2018
```
Gracefully fail if salloc does not get job allocation
```
- Merge branch 'slurm-17.11' · 484540de
  Alejandro Sanchez authored May 11, 2018
  
- Fix Coverity CID 185566: Incorrect expression (COPY_PASTE_ERROR). · b3f2ebf1
  Alejandro Sanchez authored May 11, 2018
```
Introduced in bf4cb0b1.
```
- Fix tests to work correctly with commit b1ff4342 · 6b7e1302
  Danny Auble authored May 10, 2018
  
10 May, 2018 8 commits

Remove AIX pieces from testsuite. · 5cfbd15d
Tim Wickberg authored May 10, 2018
```
Support for AIX was removed before 17.02.
```
Merge branch 'slurm-17.11' · fa40dbd6
Morris Jette authored May 10, 2018

Update and modernize dw_wlm_cli cray emulation script · dc7ca7be
Morris Jette authored May 10, 2018

Merge branch 'slurm-17.11' · 1ab63842
Alejandro Sanchez authored May 10, 2018

Fix different issues when requesting memory per cpu/node. · bf4cb0b1

Alejandro Sanchez authored May 10, 2018

The first issue was identified on multi-partition requests. job_limits_check()
was overriding the original memory requests, so the next partition that
Slurm validated limits against did not use the original values. The
solution consists of adding three members to the job_details struct to
preserve the original requests. This issue is reported in bug 4895.

The second issue was that memory enforcement behaved differently depending
on whether the job request was issued against a reservation.

The third issue had to do with the automatic adjustments Slurm made underneath
when the memory request exceeded the limit. These adjustments included
increasing pn_min_cpus (even incorrectly beyond the number of cpus
available on the nodes) or different tricks increasing cpus_per_task and
decreasing mem_per_cpu.

The fourth issue was identified when requesting the special case of 0 memory,
which was handled inside the select plugin after the partition validations
and thus could be used to incorrectly bypass the limits.

Issues 2-4 were identified in bug 4976.

The patch also includes a complete refactor of how and when job memory is
set to default values (if not requested initially) and how and when limits
are validated.

Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
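
The idea behind the first fix, keeping the original request separate from the working value, can be sketched in Python. This is a simplified illustration and not the Slurm code. The commit does not name the three new struct members, so `orig_mem`, `mem`, and `check_partitions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobDetails:
    orig_mem: int  # memory as requested by the user (MB), never modified
    mem: int = 0   # working value, re-derived for every partition


def check_partitions(job, partition_limits):
    """Return the partitions whose memory limit the original request fits."""
    accepted = []
    for name, max_mem in partition_limits.items():
        # Start from the preserved request rather than from whatever
        # value was left behind while validating the previous partition.
        job.mem = job.orig_mem
        if job.mem <= max_mem:
            accepted.append(name)
    return accepted
```

Because every partition is validated against `orig_mem`, the order in which partitions are checked no longer affects the outcome.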

Fix erroneous error on slurmctld shutdown. · 924fa548

Danny Auble authored May 09, 2018

The slurmctld doesn't need to send the FINI message; if it does, things
get messed up because the slurmdbd closes the database connection
prematurely. Until now we would print an error on the slurmctld saying
we couldn't send the FINI.

Fix segfault if running assoc mgr cache and a job finishes and the · 10169937

Danny Auble authored May 09, 2018

partition is removed then the slurmdbd comes up and we go refresh
the tres pointers and try to dereference the part_ptr.

Related to commit de7eac9a.

Bug 5136

Split up src/common/slurmdbd_defs into slurmdbd_[defs|pack] · b1ff4342

Danny Auble authored May 09, 2018

and move the agent into the accounting_storage/slurmdbd plugin.

This should be cleaner going forward and will be easier to maintain.

09 May, 2018 14 commits
- Fix for possible slurmctld daemon abort with NULL pointer. · b67d7350
  Morris Jette authored May 09, 2018
```
If running without AccountingStorageEnforce but with the DBD, and
it isn't up when starting the slurmctld, you could get into a
corner case where you don't have a QOS list in the assoc_mgr. Thus there
is no usage to transfer.

Bug 5156
```
- Merge branch 'slurm-17.11' · a4a14092
  Tim Wickberg authored May 09, 2018
  
- Fix regression from e346aa8a. · a97b1b8f
  Danny Auble authored May 09, 2018
```
It turns out if we are running cache we need to call set_cluster_tres()
again since the pointers in the list have changed and don't have any
counts anymore.

I opted to just modify the existing locks and call it inside the lock
instead of having set_cluster_tres grab the locks.
```
- Start NEWS for v17.11.7 · 8d8bc02b
  Tim Wickberg authored May 09, 2018
  
- Update META for v17.11.6. · 1d1a0a48
  Tim Wickberg authored May 09, 2018
```
Update slurm.spec and slurm.spec-legacy as well
```
- Fix clang error from de7eac9a. · a3267c00
  Tim Wickberg authored May 09, 2018
```
Clang warns about a possible null dereference of job_part_ptr
if the !job_ptr->priority_array part of the conditional is taken.

Remove that part of the conditional, as it doesn't matter if that is
set or not here. The job's eligibility on one vs. multiple partitions is
not determined by that, but by the status of part_ptr_list and part_ptr.

Bug 5136.
```
- Merge branch 'slurm-17.11' · 752f259b
  Morris Jette authored May 09, 2018
  
- Document which PMI plugin to use with different OMPI versions · 4db67922
  Morris Jette authored May 09, 2018
  
- Remove slurmdbd_auth_info since it hasn't been needed since f1311b56. · 1a249044
  Danny Auble authored May 09, 2018
  
- Revert commit 7dcc831b which is no longer needed with the · 62e3834c
  Danny Auble authored May 09, 2018
```
advent of persistent connections.
```
- Fix spelling · 3c5a7302
  Brian Christiansen authored May 09, 2018
  
- Fix typo in NEWS · e8892420
  Felip Moll authored May 09, 2018
  
- select/cons_res - Improve handling of --cores-per-socket. · 6de8c831
  Morris Jette authored May 09, 2018
```
Try to fill up each socket completely before moving into additional
sockets. This will minimize the number of sockets needed, improving
packing especially alongside MaxCPUsPerNode.

Bug 4995.
```
- Fix misplaced NEWS entry for "select/backfill - fix issue with job resizing". · 2507e149
  Tim Wickberg authored May 09, 2018
```
My mistake on commit 602817c8.

Bug 4922.
```
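
The socket-filling strategy described in commit 6de8c831 can be illustrated with a short Python sketch. It is a simplified greedy model under assumptions, not the select/cons_res code. The function name `pick_cores` and the list-of-free-cores input are hypothetical.

```python
def pick_cores(free_cores_per_socket, cores_needed):
    """Return {socket_index: cores_taken}, filling each socket before the next.

    Returns None if the node does not have enough free cores.
    """
    allocation = {}
    remaining = cores_needed
    for socket, free in enumerate(free_cores_per_socket):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            allocation[socket] = take
            remaining -= take
    if remaining:
        return None
    return allocation
```

Filling sockets in order means a request that fits in one socket never touches a second one, which keeps the number of sockets in use as small as this greedy order allows.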