Commits · 0abbf72732a94d9bceb11cd4435ba440e58c91a1 · Manuel G. Marciani / ces_slurm_simulator

30 Jan, 2017 2 commits

Morris Jette authored Jan 30, 2017

Clear a job's reason of "BeginTime" in a more timely fashion and/or prevent
    jobs from being stuck in a PENDING state. There are multiple ways of
    clearing the reason, especially on a lightly loaded system, but the
    state can persist indefinitely on a heavily loaded system.
bug 3368

0abbf727

will_run fix for job with begin time in past · f75abc9c

Morris Jette authored Jan 30, 2017

Fix the logic for getting the expected start time of an existing job ID with
an explicit begin time that is in the past. Previous logic would
compare that (past) begin time, rather than the current time, with
advanced reservations that would compete with it.

f75abc9c
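
A minimal sketch of the idea behind this fix, not the Slurm implementation: a begin time in the past is clamped to the current time before it is compared with competing reservations. The function name, the `(start, end)` reservation tuples, and the decision to ignore the job's own duration are all assumptions made for illustration.

```python
def earliest_start(begin_time, now, reservations):
    """Return the earliest time a job could start (illustrative model).

    begin_time, now: epoch seconds.
    reservations: list of (start, end) windows that block the job's
    resources. This shape is hypothetical, not Slurm's data structure.
    """
    # A job cannot start before the current time, so a begin time in
    # the past is replaced by `now` before checking reservations.
    start = max(begin_time, now)
    for res_start, res_end in sorted(reservations):
        if res_start <= start < res_end:
            start = res_end  # blocked: wait until this reservation ends
    return start
```

With a begin time of 100, a current time of 1000, and a reservation covering 900 to 1200, the sketch reports 1200; comparing the reservation against the stale begin time of 100 would have missed the conflict.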

29 Jan, 2017 4 commits
- Merge branch 'slurm-15.08' into slurm-16.05 · 76ca2ce7
  Morris Jette authored Jan 29, 2017
  
  76ca2ce7
- Avoid test failure on Cray · c75c6d71
  Morris Jette authored Jan 29, 2017
```
On Cray systems with step NHC, the step launches are delayed and
  produce a pair of messages (below) that caused the test to fail:
  srun: Job step creation temporarily disabled, retrying
  srun: Job step created
```
  c75c6d71
- Add delay for job burst buffer purge · 9834b29b
  Morris Jette authored Jan 29, 2017
  
  9834b29b
- Fix some tests for Cray · a6eb2d43
  Morris Jette authored Jan 29, 2017
  
  a6eb2d43
28 Jan, 2017 4 commits
- Merge branch 'slurm-15.08' into slurm-16.05 · 24bca89c
  Morris Jette authored Jan 28, 2017
  
  24bca89c
- Fix test for down nodes · 3d050969
  Morris Jette authored Jan 28, 2017
```
Avoid a test failing if all nodes in a partition are not usable
  (down, drained, reserved, or otherwise unusable).
```
  3d050969
- Merge branch 'slurm-15.08' into slurm-16.05 · 5453fb87
  Morris Jette authored Jan 28, 2017
  
  5453fb87
- Fix tests to work properly on native Cray · f0814910
  Morris Jette authored Jan 28, 2017
```
Disable test if underlying select/linear is in use
```
  f0814910
27 Jan, 2017 2 commits

Fix DBD cache restore from previous versions. · f31751fe

Danny Auble authored Jan 27, 2017

Turns out this never worked, ever. What used to happen is that if the protocol_version that was
read in didn't match, the rpc_version given to unpack things was just 0. What this does
now is set the rpc_version to what was stored, making it all good.

f31751fe
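
The behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the slurmdbd code: `CURRENT_VERSION`, `pick_unpack_version`, and the `min_supported` bound are hypothetical names and values. The point is that the version stored with the cache drives the unpack, instead of 0 being passed whenever the stored version differs from the running one.

```python
CURRENT_VERSION = 3  # hypothetical protocol version of the running code


def pick_unpack_version(stored_version, min_supported=1):
    """Choose the version used to unpack a cached record.

    The supported-range check is an assumption added for the sketch.
    """
    if min_supported <= stored_version <= CURRENT_VERSION:
        # Unpack with the version the cache was written with, even
        # when it differs from the version of the running code.
        return stored_version
    raise ValueError(f"unsupported cache version {stored_version}")
```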

Make sure the slurmctld is running the same version of the code as local so spank compiles correctly. · 6a8d9377
Danny Auble authored Jan 27, 2017
6a8d9377

26 Jan, 2017 3 commits
- Fix case where vestigial reservations were not purged. · 469e2de8
  Alejandro Sanchez authored Jan 26, 2017
```
Bug 3431
```
  469e2de8
- Merge branch 'slurm-15.08' into slurm-16.05 · bb535589
  Morris Jette authored Jan 26, 2017
  
  bb535589
- Testsuite - fix test1.52 to check nodes availability · 0e9a432b
  Alejandro Sanchez authored Jan 26, 2017
```
bug 3433
```
  0e9a432b
25 Jan, 2017 10 commits

burst_buffer/cray race condition fix · 60d682ff

Morris Jette authored Jan 25, 2017

burst_buffer/cray - Fix race condition that could cause multiple batch job
    launch requests resulting in downed nodes.
bug 3366

60d682ff

Fix a few other minor memory leaks when uncommon failures occur. · 0085483a
Dominik Bartkiewicz authored Jan 25, 2017

0085483a
Revert "MYSQL - Fix a few other minor memory leaks when uncommon failures occur." · b95e0323
Danny Auble authored Jan 25, 2017
```
This reverts commit b9bff82f.
```
b95e0323
MYSQL - Fix a few other minor memory leaks when uncommon failures occur. · b9bff82f
Danny Auble authored Jan 25, 2017

b9bff82f
Fix agent retry thread to be detached · e7c58578
Morris Jette authored Jan 24, 2017
```
It was leaking memory otherwise
```
e7c58578

testsuite - refactor test17.39 and test28.7 to avoid memory enforcement. · 46fa6b0f

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the job if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory. Use --wrap to avoid
creating a temporary job script as well.

46fa6b0f

testsuite - refactor test3.15 to avoid memory enforcement. · 0339efe8

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the step if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory. Use --wrap to avoid
creating a temporary job script.

0339efe8

testsuite - refactor test2.8 to avoid memory enforcement. · 25918419

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the step if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory; and since only one
step is checked for, only run a single step in the job. Use --wrap
to avoid creating a temporary job script.

25918419

Merge branch 'slurm-15.08' into slurm-16.05 · 01c18d79
Tim Wickberg authored Jan 24, 2017

01c18d79
Disable test10.5 and test10.13 on non-BlueGene systems. · bb2a5aae
Tim Wickberg authored Jan 24, 2017

bb2a5aae

24 Jan, 2017 3 commits

Merge branch 'slurm-15.08' into slurm-16.05 · 877aa679
Tim Wickberg authored Jan 24, 2017

877aa679

Fix tests for some configurations · e8bb2944

Morris Jette authored Apr 21, 2016

Some portions of tests 21.30 and 21.34 failed with accounting and
priority basic. These changes disable portions of those tests as
needed based upon configuration.

e8bb2944

Fix race condition in a test · ad455b7d

Morris Jette authored Jan 23, 2017

test1.63 was failing periodically due to a race condition. A signal
  was being sent to srun before the signal handler thread was spawned.

ad455b7d

23 Jan, 2017 9 commits

For batch step, reset job memory after node boot · 0277629b

Morris Jette authored Jan 23, 2017

Reset a job's memory limit based upon what's available after node
  reboot, which can change on a KNL if the MCDRAM mode is changed
  on reboot.

0277629b

Fix for backfill launch job with reboot · d72b13f2

Morris Jette authored Jan 23, 2017

This bug was likely the root cause of bug 3366. If the backfill scheduler
  allocates resources for a batch job and a node reboot is required, the
  batch launch RPC would be sent to the agent. At that point, there is a
  race condition between the agent and the job_time_limit() function
  testing for boot completion. If the job_time_limit() function ran
  first, it would trigger a second launch RPC request getting sent to
  the agent.
bug 3366

d72b13f2

Cleaner job configuring logic · f9804256
Morris Jette authored Jan 23, 2017
```
Clean up logic to test if job is configuring
bug 3366
```
f9804256

Avoid launching batch step while configuring · e3a7bdcc

Morris Jette authored Jan 23, 2017

Do not launch a batch step while the job is configuring. Previous
  logic checked for the PrologSlurmctld running, but not nodes
  booting. Checking the job's CONFIGURING state flag will validate
  both.
bug 3366

e3a7bdcc

Avoid duplicate configuration complete logic · db6acb8f

Morris Jette authored Jan 23, 2017

Add check to prevent step allocation logic from executing job
  configuration completion logic multiple times (check if job
  is configuring before clearing flag and resetting time limit).
bug 3366

db6acb8f
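
The check described in this commit can be sketched as a guard on the job's configuring flag. This is a hedged illustration, not the slurmctld code: the dict-based job record, the `finish_configuring` name, and the flag value are assumptions.

```python
JOB_CONFIGURING = 0x1  # hypothetical flag bit, not Slurm's actual value


def finish_configuring(job, now):
    """Run configuration-complete logic at most once for a job.

    `job` is a plain dict standing in for a job record (an assumption).
    Returns True only on the call that actually clears the flag.
    """
    if not job["state_flags"] & JOB_CONFIGURING:
        return False  # already completed: do not reset the time limit again
    job["state_flags"] &= ~JOB_CONFIGURING
    job["end_time"] = now + job["time_limit"]
    return True
```

A second call for the same job returns False and leaves `end_time` untouched, so a later caller cannot push the job's end time out again.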

fix slurmctld/agent race condition · 53784477

Morris Jette authored Jan 23, 2017

slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld
    daemon is running or node boot in progress.
bug 3366

53784477

job write lock added to agent_retry() · 379007b8
Morris Jette authored Jan 23, 2017
```
This is required to manage the configuration completion.
bug 3366
```
379007b8
Move agent_retry to separate pthread · ce9a2d79
Morris Jette authored Jan 23, 2017
```
This will be required to lock the job structure
bug 3366
```
ce9a2d79

Remove return value from agent_retry() · bb94c6ce

Morris Jette authored Jan 23, 2017

Remove the return value from the agent_retry() function. It is not
  used anywhere and needs to be removed to run as a pthread.
bug 3366

bb94c6ce

21 Jan, 2017 2 commits
- Merge branch 'slurm-15.08' into slurm-16.05 · b16e03f0
  Tim Wickberg authored Jan 20, 2017
  
  b16e03f0
- Testsuite - speed up by a minute. · dca5cb3f
  Tim Wickberg authored Jan 20, 2017
```
Reasonable NFS systems do not need a minute to propagate changes.
```
  dca5cb3f
20 Jan, 2017 1 commit

Fix multicluster options to work with newer ctlds · 8b430b6a

Brian Christiansen authored Jan 20, 2017

If a lower-version client tried to communicate with a higher-version
controller, the dbd would return the controller's version and the client
would use that version to talk to the controller. When the controller
responded, the client wouldn't know how to unpack the higher-version
msg.

8b430b6a
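
One common way to avoid this kind of mismatch is for the client to cap the protocol version at what it can itself unpack. The sketch below shows that general technique only; the commit message does not say this is how the fix works, and `negotiate_version` is a hypothetical name.

```python
def negotiate_version(client_version, controller_version):
    """Pick a protocol version both sides can pack and unpack.

    Assumes a newer peer can still speak every older version, so the
    lower of the two versions is always safe to use.
    """
    return min(client_version, controller_version)
```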