- 08 Oct, 2015 2 commits
-
-
Brian Christiansen authored
If the backup dbd happened to be doing rollup at the time the primary resumed, both the primary and the backup would be doing rollups, causing contention on the database tables. The backup would wait for the rollup handler to finish before giving up control. The fix is to cancel the rollup_handler and let the backup begin to shut down so that it will close any existing connections and then re-exec itself. The re-exec helps because the rollup handler spawns a thread for each cluster to roll up, and just cancelling the rollup handler doesn't cancel those spawned threads. This cleans up the dbd and locks. The re-exec only happens in the backup if the primary resumed and a rollup was happening. Bug 1988
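A rough sketch of the control flow described above, with hypothetical names (backup_on_primary_resume, rollup_in_progress, dbd_path); this is not the actual slurmdbd code, just an illustration of the cancel-then-re-exec idea:

    #include <pthread.h>
    #include <unistd.h>

    /* Hypothetical sketch: what the backup does when the primary resumes
     * while a rollup is still running. */
    static void backup_on_primary_resume(pthread_t rollup_thread,
                                         int rollup_in_progress,
                                         char *dbd_path, char **argv)
    {
        if (!rollup_in_progress)
            return;
        pthread_cancel(rollup_thread);  /* stop the top-level rollup handler */
        /* ... normal shutdown path closes any existing DB connections ... */
        execv(dbd_path, argv);          /* re-exec so the per-cluster rollup
                                         * threads it spawned are cleaned up */
    }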
-
Morris Jette authored
This was intended as a step toward managing jobs across multiple clusters, but we will be pursuing a very different design.
-
- 07 Oct, 2015 6 commits
-
-
Danny Auble authored
from a user. This would cause the slurmctld to cache the old default, which wasn't valid, and force the user to always request the association explicitly.
-
Morris Jette authored
bug 2013
-
David Bigagli authored
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
database but the start record hadn't made it yet.
-
- 06 Oct, 2015 7 commits
-
-
Axel Auweter authored
Add acct_gather_energy/ibmaem plugin for systems with IBM Systems Director Active Energy Manager.
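A minimal slurm.conf fragment showing how an energy-gathering plugin of this name would typically be selected; the sampling interval is an illustrative value, not taken from this commit:

    # slurm.conf
    AcctGatherEnergyType=acct_gather_energy/ibmaem
    AcctGatherNodeFreq=30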
-
Danny Auble authored
-
Danny Auble authored
requirements.
-
Axel Auweter authored
Add acct_gather_energy/ibmaem plugin for systems with IBM Systems Director Active Energy Manager.
-
Thomas Cadeau authored
bug 2011
-
Danny Auble authored
','.
-
Morris Jette authored
bug 1999
-
- 03 Oct, 2015 1 commit
-
-
Morris Jette authored
Don't requeue RPC going out from slurmctld to DOWN nodes (can generate repeating communication errors). bug 2002
-
- 02 Oct, 2015 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
This will only happen if a PING RPC for the node is already queued when the decision is made to power it down, and the ping then fails to get a response (since the node is already down). bug 1995
-
Morris Jette authored
If a job's CPUs/task ratio is increased due to the configured MaxMemPerCPU, then increase its allocated CPU count in order to enforce CPU limits. Previous logic would increase/set cpus_per_task as needed if a job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT increase the min_cpus or max_cpus variables. This resulted in allocating the wrong CPU count.
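A worked example of the adjustment, using made-up numbers rather than anything from this commit:

    # slurm.conf:   MaxMemPerCPU=4096        (MB per CPU)
    # job request:  sbatch --ntasks=4 --cpus-per-task=1 --mem-per-cpu=8192 ...
    # adjustment:   8192 / 4096 = 2  ->  cpus_per_task raised from 1 to 2
    # the fix:      min_cpus/max_cpus raised from 4 to 8 to match,
    #               so the allocation no longer under-counts CPUs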
-
Morris Jette authored
This will only happen if a PING RPC for the node is already queued when the decision is made to power it down, and the ping then fails to get a response (since the node is already down). bug 1995
-
- 01 Oct, 2015 2 commits
-
-
Danny Auble authored
values.
-
Morris Jette authored
This required a fairly major re-write of the select plugin logic. bug 1975
-
- 30 Sep, 2015 3 commits
-
-
Morris Jette authored
Correct some cgroup paths ("step_batch" vs. "step_4294967294", "step_exter" vs. "step_extern", and "step_extern" vs. "step_4294967295").
-
Morris Jette authored
If a job's CPUs/task ratio is increased due to the configured MaxMemPerCPU, then increase its allocated CPU count in order to enforce CPU limits. Previous logic would increase/set cpus_per_task as needed if a job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT increase the min_cpus or max_cpus variables. This resulted in allocating the wrong CPU count.
-
Morris Jette authored
Requeue/hold a batch job launch request if the job is already running. This is possible if a node went to the DOWN state but jobs remained active. In addition, if a prolog/epilog fails, DRAIN the node rather than setting it DOWN, which could kill jobs that could otherwise continue to run. bug 1985
-
- 29 Sep, 2015 2 commits
-
-
Brian Christiansen authored
Bug 1938
-
Brian Christiansen authored
Bug 1984
-
- 28 Sep, 2015 2 commits
-
-
Morris Jette authored
When nodes have been allocated to a job and then released by the job while resizing, this patch prevents the nodes from continuing to appear allocated and unavailable to other jobs. Requires exclusive node allocation to trigger. This prevents the previously reported failure, but a proper fix will be quite complex and delayed to the next major release of Slurm (v 16.05). bug 1851
-
Morris Jette authored
When nodes have been allocated to a job and then released by the job while resizing, this patch prevents the nodes from continuing to appear allocated and unavailable to other jobs. Requires exclusive node allocation to trigger. This prevents the previously reported failure, but a proper fix will be quite complex and delayed to the next major release of Slurm (v 16.05). bug 1851
-
- 26 Sep, 2015 1 commit
-
-
Dennis McRitchie authored
Add mail wrapper script "smail" that will include job statistics in email notification messages. bug 1611
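Assuming this hooks in through the existing slurm.conf mail setting, enabling the wrapper would look like the line below; the install path is an assumption:

    # slurm.conf
    MailProg=/usr/local/sbin/smail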
-
- 25 Sep, 2015 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
Add ability to change a job array's maximum running task count: "scontrol update jobid=# arraytaskthrottle=#" bug 1863
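For example, to let at most 4 tasks of an array run at once (the job ID and limit here are made up):

    scontrol update jobid=1234 arraytaskthrottle=4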
-
- 24 Sep, 2015 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
option.
-
- 23 Sep, 2015 6 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
The 2 came from the nodelist being "None assigned", which would be treated as 2 hosts when sent into hostlist.
-
Danny Auble authored
the default qos for the association.
-
Danny Auble authored
jobs. Bug 1969
-