Commits · 738890aa1324707085aa127aeb134bff7bdbbe2e · Manuel G. Marciani / ces_slurm_simulator

22 Feb, 2018 6 commits

Only launch a single io_timeout_thread · 738890aa

Felip Moll authored Feb 22, 2018

Only a single io_timeout_thread should be created for each sls struct.

Creating multiple, while seemingly harmless in operation, can lead to
fatal() messages when srun shuts down by destroying mutex locks that
are in use by threads that srun doesn't expect to still have running.

Regression caused by a1185f04.

Bug 4596

738890aa

Continuation of b564ef0a for newly created reservations. · 2adde3cb
Morris Jette authored Feb 22, 2018
```
Bug 4806.
```
2adde3cb

Preserve and fix node features on reconfig or restart · e58f5123

Felip Moll authored Feb 07, 2018

This patch fixes features going unrecognized when a node features
plugin is active and features are defined for nodes in slurm.conf.

It also preserves KNL node features, including active and available modes,
when slurmctld daemons are reconfigured.

Features not belonging to the node features plugin are reset to what is in
slurm.conf when restarting or reconfiguring.

Bug 4734

e58f5123

Make MAINT and OVERLAP flags order agnostic on overlap test. · b564ef0a

Alejandro Sanchez authored Feb 22, 2018

_resv_overlap function was only checking the flags for the updated
reservation, but not for the rest of present ones. This implied
that the allowed overlap derived from these flags only applied
depending on the update order.

Bug 4806.

b564ef0a
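
The order dependence described in b564ef0a can be illustrated with a minimal sketch. The struct, flag values, and function names below are hypothetical stand-ins, not Slurm's real definitions, and the exact rule applied by _resv_overlap may differ. The sketch only shows the idea of testing the flags on both reservations so the result does not depend on which one was updated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits and struct; not Slurm's real definitions. */
#define DEMO_RESV_FLAG_MAINT   0x1u
#define DEMO_RESV_FLAG_OVERLAP 0x2u

typedef struct {
	uint32_t flags;
} demo_resv_t;

/* True if this reservation carries a flag that permits overlap. */
static bool demo_allows_overlap(const demo_resv_t *r)
{
	return (r->flags & (DEMO_RESV_FLAG_MAINT | DEMO_RESV_FLAG_OVERLAP)) != 0;
}

/*
 * Check both sides of the pair. Checking only the updated reservation
 * would make the answer depend on the update order.
 */
bool demo_overlap_permitted(const demo_resv_t *updated,
			    const demo_resv_t *existing)
{
	assert(updated && existing);
	return demo_allows_overlap(updated) || demo_allows_overlap(existing);
}
```

With this shape, swapping the two arguments never changes the result.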

Requeue allocated jobs on nodes requested to DRAIN if POWER_[SAVE|UP]. · 14596246

Alejandro Sanchez authored Feb 22, 2018

After commit b31fa177, we do not defer slurmd node registration if
HealthCheckProgram fails. So at slurmd startup, slurmd executes:

run_script_health_check();
_spawn_registration_engine();

And does not keep spinning if NHC fails. Now if there are nodes
managed by the Power Save logic, when they are requested to be
POWER_UP because a job is allocated resources, then at slurmd startup
NHC is executed before node registers.

The problem comes when this NHC execution fails. If the NHC program
decides to update the node to DRAIN, the job was already allocated
before this update, so the job will attempt to start RUNNING but
might fail since NHC detected there's something wrong.

This change detects DRAIN/FAIL node update requests, checks whether
the node is ALLOC/MIXED and POWER_[SAVE|UP], and if so forces a
requeue, so that the job doesn't start on a failed node.

Bug 4689.

14596246
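
The condition described in 14596246 can be sketched as a single predicate. All type names, enum values, and flag bits here are assumptions made for illustration and do not match Slurm's internal node state encoding:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encodings; Slurm's real node state layout differs. */
typedef enum { DEMO_NODE_IDLE, DEMO_NODE_ALLOC, DEMO_NODE_MIXED } demo_base_t;
typedef enum { DEMO_REQ_DRAIN, DEMO_REQ_FAIL, DEMO_REQ_RESUME } demo_req_t;

#define DEMO_FLAG_POWER_SAVE 0x1u
#define DEMO_FLAG_POWER_UP   0x2u

typedef struct {
	demo_base_t base;   /* base state of the node */
	uint32_t flags;     /* power-related flag bits */
} demo_node_t;

/*
 * Requeue only when all three conditions hold: the request is DRAIN or
 * FAIL, the node has allocated work, and the node is under power
 * management.
 */
bool demo_should_requeue(demo_req_t req, const demo_node_t *node)
{
	assert(node);
	bool drain_or_fail = (req == DEMO_REQ_DRAIN) || (req == DEMO_REQ_FAIL);
	bool has_jobs = (node->base == DEMO_NODE_ALLOC) ||
			(node->base == DEMO_NODE_MIXED);
	bool power_managed = (node->flags &
			      (DEMO_FLAG_POWER_SAVE | DEMO_FLAG_POWER_UP)) != 0;

	return drain_or_fail && has_jobs && power_managed;
}
```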

Move a warning to debug() from error() on PSS stat collection error. · 10c90b25

Felip Moll authored Feb 22, 2018

Can frequently throw scary-sounding messages on short-lived processes
that disappear while the stats are collected.

Bug 4759.

10c90b25

21 Feb, 2018 15 commits
- Add space in Intel KNL html doc · 03d97854
  Brian Christiansen authored Feb 21, 2018
  
  03d97854
- Update KNL docs with suggestions for Dell KNLs · 356f1e40
  Chris Samuel authored Feb 21, 2018
```
Bug 4793
```
  356f1e40
- In the federation, make it so you can cancel stranded sibling jobs. · 500e23b5
  Brian Christiansen authored Feb 21, 2018
```
Bug 4504
```
  500e23b5
- Only check requested clusters in federation when using --test-only · 4fb12161
  Brian Christiansen authored Feb 21, 2018
```
submission option.

Bug 4548
```
  4fb12161
- Handle --clusters=all as case insensitive. · 984be512
  Brian Christiansen authored Feb 21, 2018
```
Bug 4548
```
  984be512
- Continuation of commit 10af7fbe · edba9cb6
  Morris Jette authored Feb 21, 2018
```
Bug 4786
```
  edba9cb6
- Print user messages when waiting for allocation · 564abd94
  Brian Christiansen authored Feb 21, 2018
```
Continuation of 873871ed
Bug 4758
```
  564abd94
- Refactor printing out of multi-line user msgs · 873871ed
  Danny Auble authored Feb 21, 2018
```
Bug 4758
```
  873871ed
- Fix removing array jobs from hash in slurmctld. · f381e4e6
  Brian Christiansen authored Feb 21, 2018
```
Bug 4800
```
  f381e4e6
- Add comment about Dell KNL SyscfgTimeout settings · c543cec3
  Isaac Hartung authored Feb 21, 2018
```
Bug 4793
```
  c543cec3
- Don't add TRES whose value is NO_VAL64 when building string line. · 58641ba8
  Danny Auble authored Feb 21, 2018
```
Bug 4698.
```
  58641ba8
- Change comment for better clarity · 342d6355
  Morris Jette authored Feb 21, 2018
  
  342d6355
- Fix to properly restore accounts_list element on a Partition. · 9d05dcfa
  Dominik Bartkiewicz authored Feb 20, 2018
```
Bug 4809.
```
  9d05dcfa
- Fix for restoring overlooked partition parameters. · f4c54d9e
  Dominik Bartkiewicz authored Feb 20, 2018
```
NEWS entry also applies to prior 31667d5d.

Bug 4809.
```
  f4c54d9e
- Restore all partition settings when ReconfigFlags=KeepPartState is set. · 31667d5d
  Dominik Bartkiewicz authored Feb 20, 2018
```
Fixes problems restoring:
- Alternate
- DefMemPerCPU
- MaxCPUsPerNode
- MaxMemPerNode
- OverTimeLimit

Also move AllowNodes, DefaultTime, and GraceTime into alphabetical order.

Bug 4809.
```
  31667d5d
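
One of the 21 Feb commits above (984be512) makes `--clusters=all` case insensitive. A minimal sketch of such a comparison, using the POSIX `strcasecmp()` call, might look like the following. The function name `demo_clusters_is_all` is hypothetical and the actual change in Slurm may be implemented differently:

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/*
 * Return true when the option value names every cluster, ignoring case.
 * A NULL value is treated as "not all".
 */
bool demo_clusters_is_all(const char *value)
{
	return (value != NULL) && (strcasecmp(value, "all") == 0);
}
```

Under this sketch, "all", "ALL", and "All" are all accepted, while a longer string such as "alll" is not.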
20 Feb, 2018 7 commits

Make sreport job reports also report duplicate jobs correctly. · 9703c989
Marshall Garey authored Feb 20, 2018
```
Bug 4636
```
9703c989
Minor memory leak fixes in the fed_mgr on slurmctld shutdown. · fa8538cb
Brian Christiansen authored Feb 20, 2018
```
Bug 4716
```
fa8538cb
Add some extra locks in fed_mgr to be extra safe. · 0014ebfb
Brian Christiansen authored Feb 20, 2018
```
Bug 4716
```
0014ebfb

Fix for grep regexp · 0ca539e0

Felip Moll authored Feb 12, 2018

In the Perl tools, fix a regexp that caused extra results to be shown incorrectly.

Bug 4766

0ca539e0

Remove incorrect function name from header. · 454aed7f
Danny Auble authored Feb 20, 2018

454aed7f
Fix issue where a job could be denied by Reason=MaxMemPerLimit when not · d413c8b7
Danny Auble authored Feb 20, 2018
```
requesting any tasks.

Bug 4730
```
d413c8b7

Correctly check return codes when creating a step to check if needing to · 10af7fbe

Morris Jette authored Feb 20, 2018

wait to retry or not.

I discovered this bug during regression testing. Some similar situations will
result in srun continuously issuing step create requests and the
launch_common_create_job_step() function not sleeping between RPCs.
Basically launch_common_create_job_step() sleeps for some error codes
and srun retries the step create on some error codes. The problem is
that those error codes do not match in both places, resulting in
constant retries without sleeps. This situation is very likely with
job preemption combined with salloc, but other conditions can trigger
the same event. The following errno values all trigger this situation:
EAGAIN, ESLURM_DISABLED, ESLURM_POWER_NOT_AVAIL, ESLURM_POWER_RESERVED,
ESLURM_PROLOG_RUNNING, ESLURM_INTERCONNECT_BUSY.

Bug 4786

10af7fbe
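
The mismatch described in 10af7fbe arises when the "should retry" and "should sleep" decisions use different lists of error codes. One way to avoid that, sketched here under stated assumptions, is to derive both decisions from a single predicate. The numeric values of the DEMO_ESLURM_* codes are placeholders (the real codes come from Slurm's headers), the function names are hypothetical, and the capped linear delay is an illustrative choice rather than what srun does:

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder values for illustration only. */
enum {
	DEMO_ESLURM_DISABLED = 10001,
	DEMO_ESLURM_POWER_NOT_AVAIL,
	DEMO_ESLURM_POWER_RESERVED,
	DEMO_ESLURM_PROLOG_RUNNING,
	DEMO_ESLURM_INTERCONNECT_BUSY
};

/* Single source of truth for which step-create errors are retryable. */
bool demo_step_create_retryable(int rc)
{
	switch (rc) {
	case EAGAIN:
	case DEMO_ESLURM_DISABLED:
	case DEMO_ESLURM_POWER_NOT_AVAIL:
	case DEMO_ESLURM_POWER_RESERVED:
	case DEMO_ESLURM_PROLOG_RUNNING:
	case DEMO_ESLURM_INTERCONNECT_BUSY:
		return true;
	default:
		return false;
	}
}

/*
 * Seconds to sleep before the next attempt; 0 means "do not retry".
 * Because this uses the same predicate as the retry decision, every
 * retry is preceded by a non-zero sleep.
 */
unsigned int demo_step_retry_delay(int rc, unsigned int attempt)
{
	if (!demo_step_create_retryable(rc))
		return 0;
	return (attempt >= 29) ? 30 : attempt + 1;
}
```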

16 Feb, 2018 3 commits
- Print MaxQueryTimeRange in sacctmgr show config · 9dc6cbe1
  Brian Christiansen authored Feb 16, 2018
```
Bug 4772
```
  9dc6cbe1
- Fix MaxQueryTimeRange checking · c7de5f0e
  Brian Christiansen authored Feb 16, 2018
```
Was checking time ranges against a badly cast variable.
And MaxQueryTimeRange was being stored as minutes and was being compared
against the difference of two time_t's -- which produces seconds.

Bug 4772
```
  c7de5f0e
- Update test2.2 regexes · 56dcfaf3
  Dominik Bartkiewicz authored Feb 16, 2018
```
was matching more than expected.

Bug 4789
```
  56dcfaf3
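
The unit mismatch described in c7de5f0e (a limit kept in minutes compared against a difference of two time_t values, which is in seconds) can be illustrated with a short sketch. The function name and the choice to convert at comparison time are assumptions; the actual fix may instead store the limit in seconds:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/*
 * Return true when the queried span fits within the configured limit.
 * max_range_minutes is assumed to be stored in minutes, so it is
 * converted to seconds before comparing with the time_t difference.
 */
bool demo_query_range_allowed(time_t start, time_t end,
			      uint32_t max_range_minutes)
{
	if (end < start)
		return false;

	double span_sec = difftime(end, start);
	double limit_sec = (double) max_range_minutes * 60.0;

	return span_sec <= limit_sec;
}
```

With a 60 minute limit, a span of 3600 seconds is accepted and a span of 3601 seconds is rejected.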
15 Feb, 2018 3 commits
- Don't go to TRY_LATER if job_no_reserve is set. · 014f8dab
  Doug Jacobsen authored Feb 15, 2018
```
Fixes issue with bf_min_prio_reserve not being respected, leading
to significantly impacted backfill performance.

Bug 4760.
```
  014f8dab
- Add note for MaxQueryTimeRange on accepted formats. · 6fa7b386
  Tim Wickberg authored Feb 14, 2018
  
  6fa7b386
- Note that ChosLoc option is going away · d93cbfb9
  Morris Jette authored Feb 14, 2018
```
Log that support for the ChosLoc configuration parameter will end in
Slurm version 18.08.

Bug 4791
```
  d93cbfb9
14 Feb, 2018 2 commits
- Fix minor memory leak from 6e9b60d1 . · ca36ef44
  Tim Wickberg authored Feb 14, 2018
  
  ca36ef44
- sbcast - fix a race condition leading to "Unspecified error." · efcc246e
  Tim Wickberg authored Feb 14, 2018
```
And would prevent transmission of a file.

Bug 4787.
```
  efcc246e
13 Feb, 2018 4 commits
- Add sleep for BB stage-out time · 3c6a2298
  Morris Jette authored Feb 13, 2018
  
  3c6a2298
- MYSQL - Fix to handle quotes in a given work_dir of a job. · 6e9b60d1
  Danny Auble authored Feb 13, 2018
```
Bug 4736
Bug 4784
```
  6e9b60d1
- Fix handling of partial writes in io_init_msg_write_to_fd(). · 5297f392
  Morris Jette authored Feb 13, 2018
```
Partial write can happen under high system load, leading to step
termination when finishing the write would let the step launch
properly instead.

Fix suggested by Matthieu Hautreux.

Bug 4778.
```
  5297f392
- slurmdbd - only permit changes to resources from operators or admins · 621266d9
  Felip Moll authored Feb 12, 2018
```
Add a privilege check for when an unprivileged user tries to modify
a resource. Min level set to Operator.

Bug 4735.
```
  621266d9
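
The partial-write handling mentioned in 5297f392 follows a common pattern: keep calling `write()` until every byte has been sent, advancing past whatever was accepted each time. The sketch below is a generic version of that loop, not the code in io_init_msg_write_to_fd(). The name `demo_write_all` is hypothetical, and it assumes a blocking descriptor (a non-blocking one would also need to wait on EAGAIN, for example with poll()):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write exactly len bytes to fd, resuming after short writes.
 * Returns 0 on success and -1 on error (errno is left set by write()).
 */
int demo_write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL || len == 0);
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;   /* interrupted, try again */
			return -1;          /* real error */
		}
		p += n;                     /* skip the bytes already written */
		len -= (size_t) n;
	}
	return 0;
}
```

A short write is treated as progress rather than failure, which is the behavior the commit message describes as letting the step launch properly.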