Commits · 469e2de8f6b0bf05e79f5b1229dbaf033819dae5 · Manuel G. Marciani / ces_slurm_simulator

26 Jan, 2017 3 commits
- Fix case where vestigial reservations were not purged. · 469e2de8
  Alejandro Sanchez authored Jan 26, 2017
```
Bug 3431
```
  469e2de8
- Merge branch 'slurm-15.08' into slurm-16.05 · bb535589
  Morris Jette authored Jan 26, 2017
  
  bb535589
- Testsuite - fix test1.52 to check nodes availability · 0e9a432b
  Alejandro Sanchez authored Jan 26, 2017
```
bug 3433
```
  0e9a432b
25 Jan, 2017 10 commits

burst_buffer/cray race condition fix · 60d682ff

Morris Jette authored Jan 25, 2017

burst_buffer/cray - Fix race condition that could cause multiple batch job
    launch requests resulting in downed nodes.
bug 3366

60d682ff

Fix a few other minor memory leaks when uncommon failures occur. · 0085483a
Dominik Bartkiewicz authored Jan 25, 2017

0085483a
Revert "MYSQL - Fix a few other minor memory leaks when uncommon failures occur." · b95e0323
Danny Auble authored Jan 25, 2017
```
This reverts commit b9bff82f.
```
b95e0323
MYSQL - Fix a few other minor memory leaks when uncommon failures occur. · b9bff82f
Danny Auble authored Jan 25, 2017

b9bff82f
Fix agent retry thread to be detached · e7c58578
Morris Jette authored Jan 24, 2017
```
It was leaking memory otherwise
```
e7c58578

testsuite - refactor test17.39 and test28.7 to avoid memory enforcement. · 46fa6b0f

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the job if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory. Use --wrap to avoid
creating a temporary job script as well.

46fa6b0f

testsuite - refactor test3.15 to avoid memory enforcement. · 0339efe8

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the step if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory. Use --wrap to avoid
creating a temporary job script.

0339efe8

testsuite - refactor test2.8 to avoid memory enforcement. · 25918419

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the step if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory; and since only one
step is checked for, only run a single step in the job. Use --wrap
to avoid creating a temporary job script.

25918419

Merge branch 'slurm-15.08' into slurm-16.05 · 01c18d79
Tim Wickberg authored Jan 24, 2017

01c18d79
Disable test10.5 and test10.13 on non-BlueGene systems. · bb2a5aae
Tim Wickberg authored Jan 24, 2017

bb2a5aae

24 Jan, 2017 3 commits

Merge branch 'slurm-15.08' into slurm-16.05 · 877aa679
Tim Wickberg authored Jan 24, 2017

877aa679

Fix tests for some configurations · e8bb2944

Morris Jette authored Apr 21, 2016

Some portions of tests 21.30 and 21.34 failed with accounting and
priority basic. These changes disable portions of those tests as
needed based upon configuration.

e8bb2944

Fix race condition in a test · ad455b7d

Morris Jette authored Jan 23, 2017

test1.63 was failing periodically due to a race condition. A signal
  was being sent to srun before the signal handler thread was spawned.

ad455b7d
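
The commit message does not say how test1.63 was changed. As a rough illustration of the race it describes, the sketch below uses a toy model in which a signal delivered before a handler is registered is lost, and the sender waits on a readiness event before delivering. `ToyProcess` and `send_after_ready` are hypothetical names and are not part of Slurm or its testsuite.

```python
import threading


class ToyProcess:
    """Toy stand-in for srun: a signal with no registered handler is lost."""

    def __init__(self):
        self.handler = None
        self.ready = threading.Event()
        self.log = []

    def start_handler_thread(self):
        """Spawn the thread that installs the handler, then announces readiness."""
        def run():
            self.handler = lambda sig: self.log.append(sig)
            self.ready.set()

        thread = threading.Thread(target=run)
        thread.start()
        return thread

    def deliver(self, sig):
        """Deliver a signal; record it as lost if no handler is installed yet."""
        if self.handler is None:
            self.log.append("lost:" + sig)
        else:
            self.handler(sig)


def send_after_ready(proc, sig, timeout=5.0):
    """Send the signal only once the handler thread has reported it is ready."""
    if not proc.ready.wait(timeout):
        raise TimeoutError("handler thread never became ready")
    proc.deliver(sig)
```

Calling `deliver()` directly on a fresh `ToyProcess` records the signal as lost, while `send_after_ready()` removes the dependence on thread start-up timing.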

23 Jan, 2017 9 commits

For batch step, reset job memory after node boot · 0277629b

Morris Jette authored Jan 23, 2017

Reset a job's memory limit based upon what's available after node
  reboot, which can change on a KNL if the MCDRAM mode is changed
  on reboot.

0277629b

Fix for backfill launch job with reboot · d72b13f2

Morris Jette authored Jan 23, 2017

This bug was likely the root cause of bug 3366. If the backfill scheduler
  allocates resources for a batch job and a node reboot is required, the
  batch launch RPC would be sent to the agent. At that point, there is a
  race condition between the agent and the job_time_limit() function
  testing for boot completion. If the job_time_limit() function ran
  first, it would trigger a second launch RPC request getting sent to
  the agent.
bug 3366

d72b13f2

Cleaner job configuring logic · f9804256
Morris Jette authored Jan 23, 2017
```
Clean up logic to test if job is configuring
bug 3366
```
f9804256

Avoid launching batch step while configuring · e3a7bdcc

Morris Jette authored Jan 23, 2017

Do not launch a batch step while the job is configuring. Previous
  logic checked for the PrologSlurmctld running, but not nodes
  booting. Checking the job's CONFIGURING state flag will validate
  both.
bug 3366

e3a7bdcc

Avoid duplicate configuration complete logic · db6acb8f

Morris Jette authored Jan 23, 2017

Add check to prevent step allocation logic from executing job
  configuration completion logic multiple times (check if job
  is configuring before clearing flag and resetting time limit).
bug 3366

db6acb8f
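
The check-before-clear pattern described in this commit can be sketched as follows. This is a simplified model and not the slurmctld code: the `Job` class, its fields, and `finish_configuring` are hypothetical names, and a single lock stands in for the job write lock.

```python
import threading


class Job:
    """Hypothetical job record with a configuring flag (not a Slurm structure)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.configuring = True
        self.time_limit_resets = 0


def finish_configuring(job):
    """Run configuration-completion logic at most once per job.

    Returns True only for the caller that actually cleared the flag.
    """
    with job.lock:
        if not job.configuring:
            # Another code path already completed configuration; do nothing.
            return False
        job.configuring = False
        job.time_limit_resets += 1  # stands in for resetting the time limit
        return True
```

However many code paths call `finish_configuring()` for the same job, only the first one resets the time limit, because the flag is tested and cleared under the same lock.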

fix slurmctld/agent race condition · 53784477

Morris Jette authored Jan 23, 2017

slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld
    daemon is running or node boot in progress.
bug 3366

53784477

job write lock added to agent_retry() · 379007b8
Morris Jette authored Jan 23, 2017
```
This is required to manage the configuration completion.
bug 3366
```
379007b8
Move agent_retry to separate pthread · ce9a2d79
Morris Jette authored Jan 23, 2017
```
This will be required to lock the job structure
bug 3366
```
ce9a2d79

Remove return value from agent_retry() · bb94c6ce

Morris Jette authored Jan 23, 2017

Remove the return value from the agent_retry() function. It is not
  used anywhere and needs to be removed to run as a pthread.
bug 3366

bb94c6ce

21 Jan, 2017 2 commits
- Merge branch 'slurm-15.08' into slurm-16.05 · b16e03f0
  Tim Wickberg authored Jan 20, 2017
  
  b16e03f0
- Testsuite - speed up by a minute. · dca5cb3f
  Tim Wickberg authored Jan 20, 2017
```
Reasonable NFS systems do not need a minute to propagate changes.
```
  dca5cb3f
20 Jan, 2017 1 commit

Fix multicluster options to work with newer ctlds · 8b430b6a

Brian Christiansen authored Jan 20, 2017

When a lower-version client tried to communicate with a higher-version
controller, the dbd returned the controller's version and the client
used that version to talk to the controller. When the controller
responded, the client could not unpack the higher-version msg.

8b430b6a
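
The commit message does not spell out the fix. One common way to avoid this kind of mismatch is for the client to speak the lower of its own protocol version and the version reported for the controller. The sketch below shows that idea only; the function name and the integer version numbers are assumptions for illustration and are not taken from Slurm.

```python
def choose_rpc_version(client_version, controller_version):
    """Pick a protocol version that both sides can pack and unpack.

    Assumes versions are plain integers where a larger value is newer,
    and that a newer peer can still speak every older version it is offered.
    """
    return min(client_version, controller_version)


# Hypothetical version numbers: an older client talking to a newer controller.
OLD_CLIENT = 30
NEW_CONTROLLER = 31
agreed = choose_rpc_version(OLD_CLIENT, NEW_CONTROLLER)
```

With this rule the older client keeps using its own version, so the controller's reply stays in a format the client knows how to unpack.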

19 Jan, 2017 4 commits
- News for last commit · b6c1e4e4
  Danny Auble authored Jan 19, 2017
  
  b6c1e4e4
- Make backfill scheduler behave like regular scheduler in respect to · 36cb2bbb
  Dominik Bartkiewicz authored Jan 19, 2017
```
'assoc_limit_stop'.
```
  36cb2bbb
- Missed NEWS on commit de3ee50a81d1d · 0a7d222f
  Danny Auble authored Jan 19, 2017
  
  0a7d222f
- Only look at SLURM_STEP_KILLED_MSG_NODE_ID on startup, to avoid race · d8d7ebdb
  Danny Auble authored Jan 19, 2017
```
condition later when looking at a steps env.

Bug 3394
```
  d8d7ebdb
18 Jan, 2017 4 commits
- Make it so sacctmgr accepts column headers like MaxTRESPU and not MaxTRESP. · c675e0bc
  Danny Auble authored Jan 18, 2017
```
Bug 3398
```
  c675e0bc
- MYSQL - Fix minor memory leak when querying steps and the sql fails. · 18dec618
  Danny Auble authored Jan 18, 2017
  
  18dec618
- Reset last_job_update when clearing CONFIGURING flag · 2d16ad91
  Morris Jette authored Jan 17, 2017
```
bug 3399
```
  2d16ad91
- Prevent job timeout on node power up · 4114e6ce
  Morris Jette authored Jan 17, 2017
```
bug 3099
```
  4114e6ce
17 Jan, 2017 4 commits
- Revert "Require the normal scheduler to set/clear an Assoc/QOS limit on a job" · ef9546d0
  Danny Auble authored Jan 17, 2017
```
This reverts commit e92b49d3.
```
  ef9546d0
- Remove commented out if block and move indent back a level. · d5565ba0
  Tim Shaw authored Jan 17, 2017
```
No functional change.
```
  d5565ba0
- Require the normal scheduler to set/clear an Assoc/QOS limit on a job · e92b49d3
  Dominik Bartkiewicz authored Jan 17, 2017
```
instead of also in the backfill scheduler.
```
  e92b49d3
- Fix debug2 message using wrong array index in _qos_job_runnable_post_select(). · 369bfd69
  Josh Samuelson authored Jan 17, 2017
```
Bug 3405.
```
  369bfd69