Commits · f390c6455c44b6c38ecb2e87e020bada5a171d33 · Manuel G. Marciani / ces_slurm_simulator

21 Oct, 2015 2 commits

sbatch --ntasks precedence fix · 05e0dabe

Morris Jette authored Oct 21, 2015

sbatch --ntasks option to take precedence over --ntasks-per-node plus node
    count, as documented. Set SLURM_NTASKS/SLURM_NPROCS environment variables
    accordingly.
bug 2015

05e0dabe
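
The precedence rule described in this commit can be sketched as follows. This is a minimal illustration, not sbatch's actual code: the function names `resolve_ntasks` and `task_env` are hypothetical, and the sketch assumes the fallback task count is simply `--ntasks-per-node` multiplied by the node count.

```python
def resolve_ntasks(ntasks=None, ntasks_per_node=None, nodes=None):
    """Return the effective task count, or None if it cannot be derived."""
    # An explicit --ntasks always wins.
    if ntasks is not None:
        return ntasks
    # Otherwise fall back to --ntasks-per-node times the node count (assumed model).
    if ntasks_per_node is not None and nodes is not None:
        return ntasks_per_node * nodes
    return None


def task_env(ntasks=None, ntasks_per_node=None, nodes=None):
    """Build the SLURM_NTASKS/SLURM_NPROCS variables from the resolved count."""
    count = resolve_ntasks(ntasks, ntasks_per_node, nodes)
    if count is None:
        return {}
    return {"SLURM_NTASKS": str(count), "SLURM_NPROCS": str(count)}
```

Under these assumptions, a request with `--ntasks=6 --ntasks-per-node=4 --nodes=2` exports a task count of 6 rather than 8.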

Fix the pty window manager in slurmstepd. · 9350c830
David Bigagli authored Oct 21, 2015

9350c830

20 Oct, 2015 3 commits

Don't report CPU oversubscription · c82abd9c

Morris Jette authored Oct 20, 2015

Avoid reporting more allocated CPUs than exist on a node. This can be
    triggered by resuming a previously suspended job, resulting in
    oversubscription of CPUs.
bug 2021

c82abd9c

Fix salloc -I to accept an argument · d133f16a
Danny Auble authored Oct 19, 2015

d133f16a

Add scancel -f/--full option · 26944ca0

Morris Jette authored Oct 19, 2015

Add scancel -f/--full option to signal all steps including batch script and
    all of its child processes.
bug 2031

26944ca0

19 Oct, 2015 5 commits
- Set SLURM_HINT environment variable when --hint is used with sbatch or salloc. · 62d1e1aa
  Brian Christiansen authored Oct 19, 2015
```
Bug 1888
```
  62d1e1aa
- Fix issue on a scontrol reconfig all available GRES/TRES would be zeroed · 63e59dcd
  Danny Auble authored Oct 19, 2015
```
out.

Remove unneeded code that commit 8274ea54 fixed.

This code would zero out all GRES/TRES on a reconfig, which isn't what we want.

8274ea54 does the right thing by itself.
```
  63e59dcd
- Correct backfill logic for job with INFINITE time limit · 1886ac8b
  Hongjia Cao authored Oct 19, 2015
```
bug 2032
```
  1886ac8b
- Fix burst_buffer/cray for interactive allocs >4GB · 3c066cbc
  Morris Jette authored Oct 19, 2015
```
Needed to change a couple of variables from 32- to 64-bit.
```
  3c066cbc
- Add new burst_buffer.conf parameters · 25fcc9db
  Morris Jette authored Oct 19, 2015
```
Add new burst_buffer.conf parameters: ValidateTimeout and OtherTimeout.
See man page for details.
```
  25fcc9db
16 Oct, 2015 1 commit
- Update NEWS. · 55c7cd17
  David Bigagli authored Oct 16, 2015
  
  55c7cd17
15 Oct, 2015 1 commit
- MYSQL - Fix minor issue after an index was added to the database it would · 90e2e552
  Danny Auble authored Oct 14, 2015
```
previously take 2 restarts of the slurmdbd to make it stick correctly.
```
  90e2e552
14 Oct, 2015 1 commit

Fix task/cgroup affinity to work correctly with multi-socket · 31f91bd9

Danny Auble authored Oct 14, 2015

single-threaded cores.  A regression caused only 1 socket to be used on
this kind of node instead of all that were available.

31f91bd9

08 Oct, 2015 2 commits

Fix case where if the backup slurmdbd has existing connections when it gives... · 44bb06bc

Brian Christiansen authored Oct 07, 2015

Fix case where the backup slurmdbd would be killed if it had existing connections when giving up control.

If the backup had existing connections when giving up control, it would try to
signal the existing threads by using pthread_kill to send SIGKILL to the
threads. The problem is that SIGKILL is not delivered to the thread alone but to
the whole process, so the backup dbd would be killed.

44bb06bc
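
The behavior behind this bug can be reproduced outside Slurm. The sketch below is not slurmdbd code: it uses Python's `signal.pthread_kill` in a throwaway child interpreter (the names `CHILD_SOURCE` and `run_child` are illustrative) to show that SIGKILL aimed at one thread terminates the whole process on a POSIX system.

```python
import signal
import subprocess
import sys

# Program run in a separate child interpreter, so the experiment cannot
# take down the caller.
CHILD_SOURCE = """
import signal, threading, time
worker = threading.Thread(target=time.sleep, args=(30,), daemon=True)
worker.start()
signal.pthread_kill(worker.ident, signal.SIGKILL)  # aimed at one thread
time.sleep(30)  # the main thread would keep running if only the worker died
"""


def run_child():
    """Return the child's exit status (negative means killed by a signal)."""
    result = subprocess.run([sys.executable, "-c", CHILD_SOURCE], timeout=60)
    return result.returncode


if __name__ == "__main__":
    status = run_child()
    print("child killed by SIGKILL:", status == -signal.SIGKILL)
```

SIGKILL cannot be caught or limited to a single thread, so it is not a usable way to stop worker threads.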

Fixed slurmctld not sending cold-start messages correctly to the database · 4ed2f8c6
Danny Auble authored Oct 07, 2015
```
when a cold-start (-c) happens to the slurmctld.
```
4ed2f8c6

07 Oct, 2015 6 commits
- Fix sacct -j, (nothing but a comma) to not return all jobs. · d5979ef6
  Danny Auble authored Oct 07, 2015
  
  d5979ef6
- sacctmgr - Don't allow default account associations to be removed · 9f602cba
  Danny Auble authored Oct 07, 2015
```
from a user.

This would cause the slurmctld to cache the old default, which was no longer
valid, and force the user to always request the association explicitly.
```
  9f602cba
- Do not send burst buffer stage out email unless the job uses burst buffers · 3a63b4e0
  Morris Jette authored Oct 07, 2015
```
bug 2013
```
  3a63b4e0
- Update NEWS. · 075668ae
  David Bigagli authored Oct 07, 2015
  
  075668ae
- Update NEWS · 170d17d7
  David Bigagli authored Oct 07, 2015
  
  170d17d7
- Fix issue with sacct, printing 0_0 for arrays that had finished in the · 75ea13a3
  Danny Auble authored Oct 06, 2015
```
database but the start record hadn't made it yet.
```
  75ea13a3
06 Oct, 2015 6 commits
- MySQL - Improve the code with asking for jobs in a suspended state. · f0f3dfdb
  Danny Auble authored Oct 06, 2015
  
  f0f3dfdb
- Fix spec file to look for mariadb or mysql devel packages for build · 42e22f03
  Danny Auble authored Oct 06, 2015
```
requirements.
```
  42e22f03
- Add acct_gather_energy/ibmaem plugin · 8937f58a
  Axel Auweter authored Oct 06, 2015
```
Add acct_gather_energy/ibmaem plugin for systems with IBM Systems Director
    Active Energy Manager.
```
  8937f58a
- Permit job_submit plugin to set a job's priority · 3b5f13fa
  Thomas Cadeau authored Oct 06, 2015
```
bug 2011
```
  3b5f13fa
- Fix sacct to not return all jobs if the -j option is given with a trailing · 2646e761
  Danny Auble authored Oct 05, 2015
```
','.
```
  2646e761
- Propagate sbatch "--dist=plane=#" option to srun. · 6868906b
  Morris Jette authored Oct 05, 2015
```
bug 1999
```
  6868906b
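
The two `sacct -j` fixes in this log (a lone comma and a trailing comma) concern how an empty entry in the job list is treated. A minimal sketch of the intended behavior follows; it is an assumed model, not sacct's parser, and `parse_job_ids` and `select_jobs` are hypothetical names.

```python
def parse_job_ids(arg):
    """Split a comma-separated -j argument, dropping empty entries."""
    return [token.strip() for token in arg.split(",") if token.strip()]


def select_jobs(all_jobs, job_arg=None):
    """Return the jobs matching -j, or every job only when -j was not given."""
    if job_arg is None:
        return list(all_jobs)
    wanted = set(parse_job_ids(job_arg))
    # An empty filter (for example "-j ,") matches nothing instead of everything.
    return [job for job in all_jobs if job in wanted]
```

The key design point is keeping "no `-j` option" distinct from "a `-j` option that parsed to an empty list".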
03 Oct, 2015 1 commit

Don't requeue RPCs from slurmctld to DOWN nodes · f4ea9dec

Morris Jette authored Oct 02, 2015

Don't requeue RPC going out from slurmctld to DOWN nodes (can generate
    repeating communication errors).
bug 2002

f4ea9dec

02 Oct, 2015 4 commits

Update v15.08.2 NEWS with v14.11.10 work · ff24578a
Morris Jette authored Oct 01, 2015

ff24578a

Don't mark powered down node as not responding · c0bb562a

Morris Jette authored Oct 01, 2015

This will only happen if a PING RPC for the node is already queued
  when the decision is made to power it down, then fails to get
  a response for the ping (since the node is already down).
bug 1995

c0bb562a

Reset job CPU count if CPUs/task ratio increased for mem limit · 29fe3eae

Morris Jette authored Sep 30, 2015

If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU,
then increase its allocated CPU count in order to enforce CPU limits.
Previous logic would increase/set the cpus_per_task as needed if a
job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT
increase the min_cpus or max_cpus variables. This resulted in allocating
the wrong CPU count.

29fe3eae
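
The arithmetic this commit describes can be sketched as follows. This is a simplified model under stated assumptions, not the slurmctld implementation: the function name `adjust_for_max_mem_per_cpu` is hypothetical, the scale factor is assumed to be the ceiling of `--mem-per-cpu` divided by MaxMemPerCPU, and the minimum CPU count is assumed to be tasks times CPUs per task.

```python
import math


def adjust_for_max_mem_per_cpu(ntasks, cpus_per_task, mem_per_cpu, max_mem_per_cpu):
    """Return (cpus_per_task, min_cpus) after applying the MaxMemPerCPU limit."""
    if max_mem_per_cpu and mem_per_cpu > max_mem_per_cpu:
        # Each task needs enough CPUs for its memory to fit under the per-CPU cap.
        scale = math.ceil(mem_per_cpu / max_mem_per_cpu)
        cpus_per_task *= scale
    # The fix: recompute the job's CPU count from the (possibly raised) ratio.
    min_cpus = ntasks * cpus_per_task
    return cpus_per_task, min_cpus
```

For example, 4 tasks requesting 4096 MB per CPU with a 2048 MB cap would need 2 CPUs per task and 8 CPUs in total under this model, instead of the 4 CPUs the earlier logic would have counted.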

Don't mark powered down node as not responding · 8c03a8bc

Morris Jette authored Oct 01, 2015

This will only happen if a PING RPC for the node is already queued
  when the decision is made to power it down, then fails to get
  a response for the ping (since the node is already down).
bug 1995

8c03a8bc

01 Oct, 2015 2 commits
- MYSQL - Remove restriction to have to be at least an operator to query TRES · 2bfbcbd8
  Danny Auble authored Oct 01, 2015
```
values.
```
  2bfbcbd8
- Fix advanced reservation core selection logic with network topology · 9e4a695d
  Morris Jette authored Oct 01, 2015
```
This required a fairly major re-write of the select plugin logic.
bug 1975
```
  9e4a695d
30 Sep, 2015 3 commits

Make cgroup paths consistent · c5c566ff

Morris Jette authored Sep 30, 2015

Correct some cgroup paths ("step_batch" vs. "step_4294967294", "step_exter"
    vs. "step_extern", and "step_extern" vs. "step_4294967295").

c5c566ff
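
The path pairs named in this commit correspond to two reserved step IDs printed either by name or as raw unsigned 32-bit values. The sketch below shows one consistent mapping; it assumes the named forms are the intended ones, and the constant and function names are illustrative rather than taken from the commit.

```python
# Reserved step IDs, matching the numeric values quoted in the commit message.
SLURM_BATCH_SCRIPT = 0xFFFFFFFE  # 4294967294
SLURM_EXTERN_CONT = 0xFFFFFFFF   # 4294967295


def step_cgroup_dir(step_id):
    """Return one consistent cgroup directory name for a job step."""
    if step_id == SLURM_BATCH_SCRIPT:
        return "step_batch"
    if step_id == SLURM_EXTERN_CONT:
        return "step_extern"
    return f"step_{step_id}"
```

Routing every path through a single helper like this avoids the mismatches the commit lists, where one piece of code builds the named form and another the numeric form.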

Reset job CPU count if CPUs/task ratio increased for mem limit · 836912bf

Morris Jette authored Sep 30, 2015

If a job's CPUs/task ratio is increased due to configured MaxMemPerCPU,
then increase its allocated CPU count in order to enforce CPU limits.
Previous logic would increase/set the cpus_per_task as needed if a
job's --mem-per-cpu was above the configured MaxMemPerCPU, but NOT
increase the min_cpus or max_cpus variables. This resulted in allocating
the wrong CPU count.

836912bf

Don't start duplicate batch job · c1513956

Morris Jette authored Sep 29, 2015

Requeue/hold batch job launch request if job already running. This is
  possible if node went to DOWN state, but jobs remained active.
In addition, if a prolog/epilog failed, DRAIN the node rather than
  setting it DOWN, which could kill jobs that could continue to
  run.
bug 1985

c1513956

29 Sep, 2015 2 commits
- Fix srun -I<timeout> from flooding the controller with step create requests. · 1252d1a1
  Brian Christiansen authored Sep 29, 2015
```
Bug 1938
```
  1252d1a1
- Fix updating job in db after extending job's timelimit past partition's timelimit. · 7a0836fc
  Brian Christiansen authored Sep 29, 2015
```
Bug 1984
```
  7a0836fc
28 Sep, 2015 1 commit

Fix for node state when shrinking jobs · 16f4b6a9

Morris Jette authored Sep 28, 2015

When nodes have been allocated to a job and then released by the
  job while resizing, this patch prevents the nodes from continuing
  to appear allocated and unavailable to other jobs. Requires
  exclusive node allocation to trigger. This prevents the previously
  reported failure, but a proper fix will be quite complex and
  delayed to the next major release of Slurm (v 16.05).
bug 1851

16f4b6a9