Commits · e7c5857811b4ebc45a6c05db1ac11777ae533f22 · Manuel G. Marciani / ces_slurm_simulator

25 Jan, 2017 6 commits

Fix agent retry thread to be detached · e7c58578
Morris Jette authored Jan 24, 2017
```
It was leaking memory otherwise
```
e7c58578

testsuite - refactor test17.39 and test28.7 to avoid memory enforcement. · 46fa6b0f

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the job if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory. Use --wrap to avoid
creating a temporary job script as well.

46fa6b0f

testsuite - refactor test3.15 to avoid memory enforcement. · 0339efe8

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the step if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory. Use --wrap to avoid
creating a temporary job script.

0339efe8

testsuite - refactor test2.8 to avoid memory enforcement. · 25918419

Tim Wickberg authored Jan 24, 2017

Commit 63b7e3a8 changed the --mem limit to 1MB for the step if
not using a memory SelectType, but this can cause the job to fail
if the JobAcctGatherFrequency is frequent enough to notice that the
"sleep" command is using more than 1MB of resources.

Refactor test to avoid specifying job memory; and since only one
step is checked for, only run a single step in the job. Use --wrap
to avoid creating a temporary job script.

25918419

Merge branch 'slurm-15.08' into slurm-16.05 · 01c18d79
Tim Wickberg authored Jan 24, 2017

01c18d79

Disable test10.5 and test10.13 on non-BlueGene systems. · bb2a5aae
Tim Wickberg authored Jan 24, 2017

bb2a5aae

24 Jan, 2017 3 commits

Merge branch 'slurm-15.08' into slurm-16.05 · 877aa679
Tim Wickberg authored Jan 24, 2017

877aa679

Fix tests for some configurations · e8bb2944

Morris Jette authored Apr 21, 2016

Some portions of tests 21.30 and 21.34 failed with accounting and
priority basic. These changes disable portions of those tests as
needed based upon configuration.

e8bb2944

Fix race condition in a test · ad455b7d

Morris Jette authored Jan 23, 2017

test1.63 was failing periodically due to a race condition. A signal
  was being sent to srun before the signal handler thread was spawned.

ad455b7d
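
The race described for test1.63 can be illustrated with a minimal sketch. This is not the test's actual code: `FakeSrun`, `deliver`, and `start_handler_thread` are hypothetical names, and a `threading.Event` stands in for whatever readiness check the test really uses. A signal delivered before the handler exists is lost; waiting for the handler first makes delivery reliable.

```python
import threading


class FakeSrun:
    """Hypothetical stand-in for a process that handles signals in a thread."""

    def __init__(self):
        self.handler = None
        self.handler_ready = threading.Event()
        self.received = []

    def start_handler_thread(self):
        def run():
            # Install the handler, then announce that it is ready.
            self.handler = self.received.append
            self.handler_ready.set()

        thread = threading.Thread(target=run)
        thread.start()
        return thread

    def deliver(self, signame):
        # A signal that arrives before the handler is installed is lost.
        if self.handler is None:
            return False
        self.handler(signame)
        return True


srun = FakeSrun()
lost = srun.deliver("SIGINT")          # sent too early: nothing handles it
thread = srun.start_handler_thread()
srun.handler_ready.wait(timeout=5)     # wait until the handler exists
handled = srun.deliver("SIGINT")       # now the signal is handled
thread.join()
```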

23 Jan, 2017 9 commits

For batch step, reset job memory after node boot · 0277629b

Morris Jette authored Jan 23, 2017

Reset a job's memory limit based upon what's available after node
  reboot, which can change on a KNL if the MCDRAM mode is changed
  on reboot

0277629b

Fix for backfill launch job with reboot · d72b13f2

Morris Jette authored Jan 23, 2017

This bug was likely the root cause of bug 3366. If the backfill scheduler
  allocates resources for a batch job and a node reboot is required, the
  batch launch RPC would be sent to the agent. At that point, there is a
  race condition between the agent and the job_time_limit() function
  testing for boot completion. If the job_time_limit() function ran
  first, it would trigger a second launch RPC request getting sent to
  the agent.
bug 3366

d72b13f2

Cleaner job configuring logic · f9804256
Morris Jette authored Jan 23, 2017
```
Clean up logic to test if job is configuring
bug 3366
```
f9804256

Avoid launching batch step while configuring · e3a7bdcc

Morris Jette authored Jan 23, 2017

Do not launch a batch step while the job is configuring. Previous
  logic checked for the PrologSlurmctld running, but not nodes
  booting. Checking the job's CONFIGURING state flag will validate
  both.
bug 3366

e3a7bdcc

Avoid duplicate configuration complete logic · db6acb8f

Morris Jette authored Jan 23, 2017

Add a check to prevent step allocation logic from executing job
  configuration completion logic multiple times (check if job
  is configuring before clearing flag and resetting time limit).
bug 3366

db6acb8f
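
The guard described above might look roughly like the following sketch. It assumes a single flag bit and a single job lock; `Job`, `JOB_CONFIGURING`, `job_lock`, and `complete_configuration` are hypothetical names, not slurmctld's actual structures.

```python
import threading
from dataclasses import dataclass

JOB_CONFIGURING = 0x1  # hypothetical flag bit, for illustration only


@dataclass
class Job:
    state_flags: int = JOB_CONFIGURING
    time_limit_resets: int = 0


job_lock = threading.Lock()


def complete_configuration(job):
    """Run the completion logic at most once per configuring phase."""
    with job_lock:
        if not job.state_flags & JOB_CONFIGURING:
            return False                    # already completed: do nothing
        job.state_flags &= ~JOB_CONFIGURING  # clear the flag
        job.time_limit_resets += 1           # stands in for resetting the time limit
        return True


job = Job()
first = complete_configuration(job)
second = complete_configuration(job)
```

Checking the flag and clearing it under the same lock is what keeps a second caller from repeating the work.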

fix slurmctld/agent race condition · 53784477

Morris Jette authored Jan 23, 2017

slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld
    daemon is running or node boot in progress.
bug 3366

53784477

job write lock added to agent_retry() · 379007b8
Morris Jette authored Jan 23, 2017
```
This is required to manage the configuration completion.
bug 3366
```
379007b8

Move agent_retry to separate pthread · ce9a2d79
Morris Jette authored Jan 23, 2017
```
This will be required to lock the job structure
bug 3366
```
ce9a2d79

Remove return value from agent_retry() · bb94c6ce

Morris Jette authored Jan 23, 2017

Remove the return value from the agent_retry() function. It is not
  used anywhere and needs to be removed to run as a pthread.
bug 3366

bb94c6ce

21 Jan, 2017 2 commits
- Merge branch 'slurm-15.08' into slurm-16.05 · b16e03f0
  Tim Wickberg authored Jan 20, 2017
  
  b16e03f0
- Testsuite - speed up by a minute. · dca5cb3f
  Tim Wickberg authored Jan 20, 2017
```
Reasonable NFS systems do not need a minute to propagate changes.
```
  dca5cb3f
20 Jan, 2017 1 commit

Fix multicluster options to work with newer ctlds · 8b430b6a

Brian Christiansen authored Jan 20, 2017

When a lower-version client tried to communicate with a higher-version
controller, the dbd returned the controller's version and the client
used that version to talk to the controller. When the controller
responded, the client did not know how to unpack the higher-version
message.

8b430b6a
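
The failure mode can be sketched as follows. The commit message does not say how the fix was implemented; capping the version at what the client supports is an assumption here, and the function names and version numbers are hypothetical.

```python
def pick_protocol_version(client_max, controller_version):
    """Assumed fix: never talk at a version newer than the client supports."""
    return min(client_max, controller_version)


def can_unpack(client_max, msg_version):
    """A client can only unpack messages at or below its own version."""
    return msg_version <= client_max


CLIENT_MAX = 30   # hypothetical older client
CONTROLLER = 31   # hypothetical newer controller

# Old behaviour: the client adopts the controller's version and cannot
# unpack the reply.
broken = can_unpack(CLIENT_MAX, CONTROLLER)

# Capped behaviour: both sides use the lower version.
agreed = pick_protocol_version(CLIENT_MAX, CONTROLLER)
working = can_unpack(CLIENT_MAX, agreed)
```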

19 Jan, 2017 4 commits
- News for last commit · b6c1e4e4
  Danny Auble authored Jan 19, 2017
  
  b6c1e4e4
- Make backfill scheduler behave like regular scheduler in respect to · 36cb2bbb
  Dominik Bartkiewicz authored Jan 19, 2017
```
'assoc_limit_stop'.
```
  36cb2bbb
- Missed NEWS on commit de3ee50a81d1d · 0a7d222f
  Danny Auble authored Jan 19, 2017
  
  0a7d222f
- Only look at SLURM_STEP_KILLED_MSG_NODE_ID on startup, to avoid race · d8d7ebdb
  Danny Auble authored Jan 19, 2017
```
condition later when looking at a steps env.

Bug 3394
```
  d8d7ebdb
18 Jan, 2017 4 commits
- Make it so sacctmgr accepts column headers like MaxTRESPU and not MaxTRESP. · c675e0bc
  Danny Auble authored Jan 18, 2017
```
Bug 3398
```
  c675e0bc
- MYSQL - Fix minor memory leak when querying steps and the sql fails. · 18dec618
  Danny Auble authored Jan 18, 2017
  
  18dec618
- Reset last_job_update when clearing CONFIGURING flag · 2d16ad91
  Morris Jette authored Jan 17, 2017
```
bug 3399
```
  2d16ad91
- Prevent job timeout on node power up · 4114e6ce
  Morris Jette authored Jan 17, 2017
```
bug 3099
```
  4114e6ce
17 Jan, 2017 5 commits
- Revert "Require the normal scheduler to set/clear an Assoc/QOS limit on a job" · ef9546d0
  Danny Auble authored Jan 17, 2017
```
This reverts commit e92b49d3.
```
  ef9546d0
- Remove commented out if block and move indent back a level. · d5565ba0
  Tim Shaw authored Jan 17, 2017
```
No functional change.
```
  d5565ba0
- Require the normal scheduler to set/clear an Assoc/QOS limit on a job · e92b49d3
  Dominik Bartkiewicz authored Jan 17, 2017
```
instead of also in the backfill scheduler.
```
  e92b49d3
- Fix debug2 message using wrong array index in _qos_job_runnable_post_select(). · 369bfd69
  Josh Samuelson authored Jan 17, 2017
```
Bug 3405.
```
  369bfd69
- Fix missing TRES read lock in acct_policy_job_runnable_pre_select() code. · 726c7cea
  Josh Samuelson authored Jan 17, 2017
```
acct_policy_job_runnable_pre_select() calls assoc_mgr_set_qos_tres_cnt()
without tres READ_LOCK.

Note that existing code does not modify the tres structures, so this
cannot currently lead to a race condition.

Bug 3406.
```
  726c7cea
15 Jan, 2017 1 commit
- Fix slurm.spec file for BlueGene builds. · 1d6addaf
  Michael Robbert authored Jan 14, 2017
```
job_submit/cnode was previously removed by commit 63bc71ed.

Bug 3403.
```
  1d6addaf
12 Jan, 2017 4 commits
- Change test35.2 to print a warning when run as root. · b201faed
  Isaac Hartung authored Jan 12, 2017
```
Bug 3395
```
  b201faed
- Correct publications listing · 42232fd2
  Morris Jette authored Jan 12, 2017
  
  42232fd2
- burst_buffer/cray - Avoid pre_run operation if not required · 33fed094
  Morris Jette authored Jan 11, 2017
```
burst_buffer/cray - Avoid "pre_run" operation if not using buffer (i.e.
    just creating or deleting a persistent burst buffer).
bug 3391
```
  33fed094
- Correct accounting info for jobs requeued due to burst buffer errors · 68b594fc
  Morris Jette authored Jan 11, 2017
```
Previous job state information was "PENDING" rather than "REQUEUED"
  for each job requeued due to a burst buffer error.
bug 3388
```
  68b594fc
11 Jan, 2017 1 commit

CRAY - Fix deadlock issue when updating accounting in the slurmctld and · 69567910

Danny Auble authored Jan 11, 2017

scheduling a Datawarp job.

The assoc_mgr lock must be taken before the bb_state.bb_mutex.  One place
this could cause deadlock is from src/slurmctld/controller.c
_accounting_cluster_ready() which calls clusteracct_storage_g_cluster_tres
which in turn calls bb_g_job_set_tres_cnt which calls bb_p_job_set_tres_cnt
which will lock the bb_mutex after the assoc_mgr is already locked.

Bug 3389

69567910
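
The lock-ordering rule in this commit can be illustrated with a small sketch. Two Python locks stand in for the assoc_mgr lock and `bb_mutex`; the function names are hypothetical and do not correspond to slurmctld's code. Because both paths take the locks in the same order, neither can hold one lock while waiting on a thread that holds the other.

```python
import threading

assoc_mgr_lock = threading.Lock()   # stand-in for the assoc_mgr lock
bb_mutex = threading.Lock()         # stand-in for bb_state.bb_mutex
tres_updates = []


def update_accounting():
    # Path 1: assoc_mgr first, then bb_mutex.
    with assoc_mgr_lock:
        with bb_mutex:
            tres_updates.append("accounting")


def schedule_datawarp_job():
    # Path 2 uses the same order. Taking bb_mutex first here is the
    # kind of inversion that could deadlock against path 1.
    with assoc_mgr_lock:
        with bb_mutex:
            tres_updates.append("scheduler")


threads = [
    threading.Thread(target=update_accounting),
    threading.Thread(target=schedule_datawarp_job),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
```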