Commits · ce0ad4be2e7701aa9295936b9406d3ad947dd646 · Manuel G. Marciani / ces_slurm_simulator

06 Mar, 2018 12 commits
- Add noinst_HEADERS as well as CLEANFILES to sstat · ce0ad4be
  Bill Brophy authored Feb 07, 2018
  
  ce0ad4be
- Fix noinst_HEADERS in sacct · 3cf5d174
  Danny Auble authored Feb 07, 2018
  
  3cf5d174
- Add new function slurmdb_ave_tres_usage · 51e433ca
  Bill Brophy authored Feb 07, 2018
```
Bug 2782
```
  51e433ca
- Alter slurmdb_make_tres_string_from_simple signature to accept a · 46b14e26
  Bill Brophy authored Feb 07, 2018
```
char * nodelist.

Bug 2782
```
  46b14e26
- Add tres_usage to the slurmdb_stats_t and to the database. · a29dac47
  Bill Brophy authored Feb 07, 2018
```
Bug 2782
```
  a29dac47
- Add tres records to the jobacctinfo struct · fdb08b76
  Bill Brophy authored Feb 06, 2018
```
Bug 2782
```
  fdb08b76
- Add make_tres_usage_str_from_array, hopefully this will go away in · aa3d61b7
  Bill Brophy authored Feb 06, 2018
```
the future.

Bug 2782
```
  aa3d61b7
- Alter pack/unpack of jobacctinfo_t and slurmdb_stats_t · c59a23d5
  Danny Auble authored Feb 06, 2018
```
to get ready for additions.

Hopefully no code change

Bug 2782
```
  c59a23d5
- Add mb_[read|write] to the jag_prec structure. · 4105cfb6
  Bill Brophy authored Feb 06, 2018
```
Bug 2782
```
  4105cfb6
- Add new static TRES USAGE_DISK · 3bb78faf
  Bill Brophy authored Feb 06, 2018
```
Bug 2782
```
  3bb78faf
- Combine multiple instances of the same code into one place. · 239aaae2
  Bill Brophy authored Feb 06, 2018
```
Bug 2782
```
  239aaae2
- Remove --immediate option from sbatch command · 85df676e
  Morris Jette authored Mar 06, 2018
```
Log that the option is not supported, using an info() message:
$ sbatch -I tmp
sbatch: --immediate option is not supported for the sbatch command, ignored
Submitted batch job 1234
```
  85df676e
02 Mar, 2018 1 commit

Felip Moll authored Mar 02, 2018

'sacctmgr show accounts withassoc cluster=x' now behaves like 'show users',
always displaying all the accounts in the enterprise. A non-matching filtered
account is now just displayed without associations.

Bug 4804

da49b8d0

01 Mar, 2018 3 commits

Do not release job resources twice when Cray NHC is stuck · c1a8a651

Felip Moll authored Mar 01, 2018

When Cray NHC is stuck for a job for more than 300 seconds and a
reconfigure is issued, job resources will be released in order to
not keep all the job nodes allocated when there's no reason to.
If NHC then finishes, the resources are released a second time, causing
memory under-allocated errors.

This patch avoids releasing resources more than one time.

Bug 4801

c1a8a651
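The double release described in commit c1a8a651 can be illustrated with a guard flag that makes the release idempotent. This is a minimal sketch, not the actual patch: `struct demo_job`, `demo_mem_in_use_mb`, `demo_allocate()` and `demo_release_once()` are hypothetical names invented for illustration and do not exist in Slurm.

```c
#include <stdbool.h>

/* Hypothetical, simplified job record; NOT Slurm's job record. */
struct demo_job {
    long allocated_mem_mb;    /* memory currently charged to this job */
    bool resources_released;  /* set once the release has happened */
};

/* Hypothetical cluster-wide counter of memory in use. */
static long demo_mem_in_use_mb = 0;

/* Charge the job's memory to the cluster counter. */
static void demo_allocate(struct demo_job *job, long mem_mb)
{
    job->allocated_mem_mb = mem_mb;
    job->resources_released = false;
    demo_mem_in_use_mb += mem_mb;
}

/*
 * Release the job's memory at most once.
 * Returns true if this call performed the release, false if the
 * resources had already been released by an earlier call.
 */
static bool demo_release_once(struct demo_job *job)
{
    if (job->resources_released)
        return false;  /* second caller: nothing left to give back */

    demo_mem_in_use_mb -= job->allocated_mem_mb;
    job->resources_released = true;
    return true;
}
```

With this guard, a release triggered by a reconfigure and a later release triggered by NHC completion leave the counter at zero instead of driving it negative.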

Remove unused min_offset,max_offset from part_record definition · 40dd6ebe
Dominik Bartkiewicz authored Feb 21, 2018
```
reference commit 08425d9c
```
40dd6ebe
Merge branch 'slurm-17.11' · 4b6abf70
Tim Wickberg authored Feb 28, 2018

4b6abf70

28 Feb, 2018 6 commits

Start NEWS for v17.11.5. · 00cd43ab
Tim Wickberg authored Feb 28, 2018

00cd43ab
Update META for v17.11.4. · 7c934df1
Tim Wickberg authored Feb 28, 2018
```
Update slurm.spec and slurm.spec-legacy as well
```
7c934df1

Fix issue with log rotation for slurmstepd processes on slurmd reconfig. · 346d1634

Isaac Hartung authored Feb 28, 2018

Also throws spurious errors of:
"slurmd: error: Domain socket directory /var/spool/slurmd: No such file or directory"
if your SlurmdSpoolDir is located elsewhere.

Bug 4289.

346d1634

slurmdbd - prevent infinite loop / crash if a QOS is set to preempt itself. · 6e2903f7
Dominik Bartkiewicz authored Feb 28, 2018
```
Add additional protection in slurmctld as well.

Bug 4826.
```
6e2903f7
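One way to guard against the self-preemption loop named in commit 6e2903f7 is to check, before accepting a QOS change, whether the QOS can reach itself through its preempt list. The sketch below is an illustration under assumptions: the bitmask representation, `DEMO_MAX_QOS` and `demo_qos_preempts_itself()` are hypothetical, and the check for indirect cycles goes beyond what the commit message states.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_QOS 64  /* hypothetical limit: one bit per QOS id */

/*
 * preempt[i] is a bitmask of the QOS ids that QOS i may preempt
 * (hypothetical representation, chosen for brevity).
 * Returns true if QOS 'start' preempts itself, directly or through
 * a chain of other QOS entries.
 */
static bool demo_qos_preempts_itself(const uint64_t preempt[], int n_qos,
                                     int start)
{
    if (n_qos < 1 || n_qos > DEMO_MAX_QOS || start < 0 || start >= n_qos)
        return false;

    const uint64_t self_bit = UINT64_C(1) << start;
    uint64_t visited = 0;
    uint64_t frontier = preempt[start];

    /* Breadth-first walk; 'visited' guarantees termination on cycles. */
    while (frontier) {
        if (frontier & self_bit)
            return true;

        uint64_t next = 0;
        for (int i = 0; i < n_qos; i++) {
            if (frontier & (UINT64_C(1) << i))
                next |= preempt[i];
        }
        visited |= frontier;
        frontier = next & ~visited;
    }
    return false;
}
```

A caller would reject an update when the function returns true for the QOS being modified.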

Fix for job pn_min_cpus where logic adjusted job desc but not job record. · 6041b810

Alejandro Sanchez authored Feb 23, 2018

job_limits_check() uses the job desc to call _valid_pn_min_mem(). This
second function might adjust the following values (up to date):

cpus_per_task
pn_min_memory
min_cpus
max_cpus
pn_min_cpus

If the function returns success, these adjusted members need to be copied
back to the job_record. It turns out pn_min_cpus wasn't copied back,
thus the logs claimed to automatically increase pn_min_cpus but actually
the job record wasn't modified and the select plugin tried to allocate
the wrong amount of resources.

Bug 4823.

6041b810
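The copy-back step described in commit 6041b810 can be sketched as follows. The two structs are hypothetical, simplified stand-ins that reuse the member names listed in the commit message; the field types and `demo_copy_back()` are assumptions, not Slurm's real definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the job description (request). */
struct demo_job_desc {
    uint16_t cpus_per_task;
    uint64_t pn_min_memory;
    uint32_t min_cpus;
    uint32_t max_cpus;
    uint16_t pn_min_cpus;
};

/* Hypothetical stand-in for the stored job record. */
struct demo_job_record {
    uint16_t cpus_per_task;
    uint64_t pn_min_memory;
    uint32_t min_cpus;
    uint32_t max_cpus;
    uint16_t pn_min_cpus;
};

/*
 * After validation has adjusted the description, copy every adjusted
 * member back to the record so both views agree.
 */
static void demo_copy_back(const struct demo_job_desc *desc,
                           struct demo_job_record *rec)
{
    rec->cpus_per_task = desc->cpus_per_task;
    rec->pn_min_memory = desc->pn_min_memory;
    rec->min_cpus      = desc->min_cpus;
    rec->max_cpus      = desc->max_cpus;
    rec->pn_min_cpus   = desc->pn_min_cpus;  /* the member the commit says was missed */
}
```

Keeping all five assignments in one helper makes it harder to add an adjusted member later and forget to propagate it.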

Remove unused LOG_LEVEL_SCHED and schedlog() function. · 0ac29a4c
Tim Wickberg authored Feb 25, 2018
```
Bug 4741.
```
0ac29a4c

27 Feb, 2018 2 commits
- Merge branch 'slurm-17.11' · ce58cd2c
  Tim Wickberg authored Feb 27, 2018
  
  ce58cd2c
- Remove <roken.h> header include from FreeBSD builds. · c6330552
  Tim Wickberg authored Feb 27, 2018
```
No longer needed, and will cause errors on FreeBSD systems
built with WITHOUT_KERBEROS.

Bug 4805.
```
  c6330552
26 Feb, 2018 1 commit

Burst buffer logic refactoring · 074ef9a7

Morris Jette authored Feb 26, 2018

Relatively minor changes to burst_buffer/cray in preparation for
support of #DW create_persistent and #DW destroy_persistent operations

074ef9a7

23 Feb, 2018 2 commits
- Merge branch 'slurm-17.11' · 436be081
  Morris Jette authored Feb 23, 2018
  
  436be081
- Fix task/cgroup affinity to behave correctly. · bf1ef638
  Morris Jette authored Feb 23, 2018
```
Bug 4783
```
  bf1ef638
22 Feb, 2018 12 commits

burst_buffer/cray - Prevent use of "#DW create_persistent" and · 7f537cca

Morris Jette authored Feb 21, 2018

"#DW destroy_persistent" directives available in Cray CLE6.0UP06. This
will be supported in Slurm version 18.08. Use "#BB" directives until then.

Bug 4302

7f537cca

Remove vestigial comment · 72db12a0
Morris Jette authored Feb 22, 2018

72db12a0
Missing rename node_features_p_changible_feature() · d564cf19
Felip Moll authored Feb 22, 2018
```
Fix missing rename of functions.
```
d564cf19
Merge remote-tracking branch 'origin/slurm-17.11' · cd1543c0
Felip Moll authored Feb 22, 2018

cd1543c0

Only launch a single io_timeout_thread · 738890aa

Felip Moll authored Feb 22, 2018

Only a single io_timeout_thread should be created for each sls struct.

Creating multiple, while seemingly harmless in operation, can lead to
fatal() messages when srun shuts down by destroying mutex locks that
are in use by threads that srun doesn't expect to still have running.

Regression caused by a1185f04.

Bug 4596

738890aa

Continuation of b564ef0a for newly created reservations. · 2adde3cb
Morris Jette authored Feb 22, 2018
```
Bug 4806.
```
2adde3cb

Preserve and fix node features on reconfig or restart · e58f5123

Felip Moll authored Feb 07, 2018

This patch fixes a situation in which features are unrecognized when a node
features plugin is active and features are defined for nodes in slurm.conf.

It also preserves KNL node features, including active and available modes,
when slurmctld daemons are reconfigured.

Features not belonging to a node features plugin are reset to what is in
slurm.conf when restarting or reconfiguring.

Bug 4734

e58f5123

Merge branch 'slurm-17.11' · fe4193ab
Alejandro Sanchez authored Feb 22, 2018

fe4193ab

Make MAINT and OVERLAP flags order agnostic on overlap test. · b564ef0a

Alejandro Sanchez authored Feb 22, 2018

The _resv_overlap function was only checking the flags of the updated
reservation, but not those of the other existing reservations. This meant
that the overlap allowed by these flags only applied depending on the
update order.

Bug 4806.

b564ef0a
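The order-agnostic check from commit b564ef0a can be pictured as a symmetric predicate that consults the flags of both reservations. This is a sketch under assumptions: the flag values, `struct demo_resv` and the helper names are hypothetical, and the rule that either reservation's MAINT or OVERLAP flag permits the overlap is an interpretation of the commit message, not confirmed behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; NOT Slurm's real reservation flag values. */
#define DEMO_FLAG_MAINT   0x1u
#define DEMO_FLAG_OVERLAP 0x2u

/* Hypothetical, simplified reservation: half-open window [start, end). */
struct demo_resv {
    long     start;
    long     end;
    uint32_t flags;
};

/* True if this reservation's flags permit it to overlap others. */
static bool demo_allows_overlap(const struct demo_resv *r)
{
    return (r->flags & (DEMO_FLAG_MAINT | DEMO_FLAG_OVERLAP)) != 0;
}

/* True if the two time windows intersect. */
static bool demo_time_overlap(const struct demo_resv *a,
                              const struct demo_resv *b)
{
    return a->start < b->end && b->start < a->end;
}

/*
 * True if the reservations conflict. The flags of BOTH reservations
 * are checked, so the result does not depend on argument order.
 */
static bool demo_resv_conflict(const struct demo_resv *a,
                               const struct demo_resv *b)
{
    if (!demo_time_overlap(a, b))
        return false;
    if (demo_allows_overlap(a) || demo_allows_overlap(b))
        return false;
    return true;
}
```

Because the predicate is symmetric, updating either reservation first yields the same answer.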

Merge branch 'slurm-17.11' · a85422d6
Alejandro Sanchez authored Feb 22, 2018

a85422d6

Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP]. · 14596246

Alejandro Sanchez authored Feb 22, 2018

After commit b31fa177, we do not defer slurmd node registration if
HealthCheckProgram fails. So at slurmd startup, slurmd executes:

run_script_health_check();
_spawn_registration_engine();

And does not keep spinning if NHC fails. Now if there are nodes
managed by the Power Save logic, when they are requested to be
POWER_UP because a job is allocated resources, then at slurmd startup
NHC is executed before node registers.

The problem comes when this NHC execution fails. If the NHC program
decides to update the node to DRAIN, the job was already allocated
before this update, so the job will attempt to start RUNNING but
might fail since NHC detected there's something wrong.

So this change detects DRAIN/FAIL node update requests, then checks
whether the node is ALLOC/MIXED and POWER_[SAVE|UP], and if so forces
a requeue, so that the job doesn't start on a failed node.

Bug 4689.

14596246
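The requeue condition described in commit 14596246 reduces to a three-part test on the node state and the requested update. The sketch below uses hypothetical bit values and a hypothetical `demo_should_requeue()` helper; Slurm's real node state encoding differs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state bits; NOT Slurm's real node state encoding. */
#define DEMO_NODE_ALLOC      0x01u
#define DEMO_NODE_MIXED      0x02u
#define DEMO_NODE_POWER_SAVE 0x04u
#define DEMO_NODE_POWER_UP   0x08u

/* Hypothetical bits for the requested update. */
#define DEMO_REQ_DRAIN       0x10u
#define DEMO_REQ_FAIL        0x20u

/*
 * Decide whether jobs on the node should be requeued:
 * the update asks for DRAIN or FAIL, the node already has jobs
 * (ALLOC or MIXED), and it is managed by power save logic
 * (POWER_SAVE or POWER_UP).
 */
static bool demo_should_requeue(uint32_t node_state, uint32_t requested)
{
    bool drain_or_fail = (requested & (DEMO_REQ_DRAIN | DEMO_REQ_FAIL)) != 0;
    bool has_jobs = (node_state & (DEMO_NODE_ALLOC | DEMO_NODE_MIXED)) != 0;
    bool powering = (node_state &
                     (DEMO_NODE_POWER_SAVE | DEMO_NODE_POWER_UP)) != 0;

    return drain_or_fail && has_jobs && powering;
}
```

All three conditions must hold; an allocated node that is not under power save control, for example, is left alone.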

Move a warning to debug() from error() on PSS stat collection error. · 10c90b25

Felip Moll authored Feb 22, 2018

Can frequently throw scary-sounding messages on short-lived processes
that disappear while the stats are collected.

Bug 4759.

10c90b25

21 Feb, 2018 1 commit
- Add space in Intel KNL html doc · 03d97854
  Brian Christiansen authored Feb 21, 2018
  
  03d97854