Commits · 6150f565702385dd6c3a899866fb5dbd6eb5943f · Manuel G. Marciani / ces_slurm_simulator

06 Apr, 2016 12 commits

Fix situation on a heterogeneous memory cluster where the order of · 6150f565

Danny Auble authored Apr 06, 2016

constraints mattered in a job.

    Details include:
    A job doesn't request memory but the system is running
    with CR_*MEMORY with no default memory limit and the job requests nodes
    with features of different sizes.  Previously the order of constraints
    mattered where the smaller memory node would need to be requested first
    or the job would fail.

    Bug 2608

6150f565

Merge remote-tracking branch 'origin/slurm-15.08' · 42fbc9fa
Danny Auble authored Apr 06, 2016

42fbc9fa
Revert "Fix situation on a heterogeneous memory cluster where the order of" · 3ae45a51
Danny Auble authored Apr 06, 2016
```
This reverts commit f559a55c.
```
3ae45a51

Fix situation on a heterogeneous memory cluster where the order of · f559a55c

Danny Auble authored Apr 06, 2016

constraints mattered in a job.

Details include:
A job doesn't request memory but the system is running
with CR_*MEMORY with no default memory limit and the job requests nodes
with features of different sizes.  Previously the order of constraints
mattered where the smaller memory node would need to be requested first
or the job would fail.

Bug 2608

f559a55c

Merge branch 'slurm-15.08' · e96da8bb
Morris Jette authored Apr 06, 2016

e96da8bb
Merge remote-tracking branch 'origin/slurm-15.08' · 699f9f43
Danny Auble authored Apr 06, 2016

699f9f43

Don't change job time limit when updating unrelated field in a job · 594c7997

Morris Jette authored Apr 06, 2016

Previous logic would get an account and/or QOS time limit and use
  that value to overwrite the incoming RPC's NO_VAL value, which
  would change a job's time limit when changing an unrelated
  field (e.g. priority, QOS, etc.).
bug 2610

594c7997
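
The fix above amounts to treating `NO_VAL` as "field not supplied" and leaving the job's stored value alone. A minimal sketch of that idea, not the slurmctld code: the dictionary layout and the `apply_job_update` helper are hypothetical, and the sentinel value is an assumption used only for illustration.

```python
# Sentinel meaning "this field was not set in the update request".
# The numeric value is an assumption made for this sketch.
NO_VAL = 0xFFFFFFFE


def apply_job_update(job, request):
    """Return a copy of job with only the explicitly set request fields applied."""
    updated = dict(job)
    for field, value in request.items():
        if value == NO_VAL:
            continue  # field not supplied: keep the job's current value
        updated[field] = value
    return updated


job = {"time_limit": 120, "priority": 100}
request = {"time_limit": NO_VAL, "priority": 500}  # only priority is changing

print(apply_job_update(job, request))  # {'time_limit': 120, 'priority': 500}
```

The described bug corresponds to replacing the `NO_VAL` in `request["time_limit"]` with an account or QOS limit before this loop runs, at which point the sentinel check can no longer protect the job's existing time limit.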

Avoid double calculation on partition QOS if the job is using the same QOS. · e17a7eaf
Danny Auble authored Apr 06, 2016

e17a7eaf

Fix for SEGV · 55d31288

Morris Jette authored Apr 06, 2016

Prevent use of NULL pointer and SEGV when changing a job's QOS when
  the slurmdbd is not configured.

55d31288

Add SLURM_UMASK env var to user job · 58dea246
Morris Jette authored Apr 06, 2016
```
bug 2609
```
58dea246

Add more logging to test · cd19e75f

Morris Jette authored Apr 06, 2016

These tests failed with MinJobAge=3 because, by the time the tests
  looked for completed jobs, the job records had already been purged.
  Log this configuration as a possible reason for failure.

cd19e75f

Fix spelling of 'daemon'. · b714beb6
Tim Wickberg authored Apr 06, 2016

b714beb6

05 Apr, 2016 8 commits

Add locking around slurmd gid cache · 2803eeda
Janne Blomqvist authored Apr 05, 2016

2803eeda
Merge branch 'slurm-15.08' · 9c6cecf4
Morris Jette authored Apr 05, 2016
```
Conflicts:
	src/plugins/sched/backfill/backfill.c
```
9c6cecf4

Fix backfill scheduler race condition · d8b18ff8

Morris Jette authored Apr 05, 2016

Fix backfill scheduler race condition that could cause invalid pointer in
    select/cons_res plugin. Bug introduced in 15.08.9, commit:
    efd9d35e

The scenario is as follows:
1. Backfill scheduler is running, then releases locks
2. Main scheduling loop starts a job "A"
3. Backfill scheduler resumes, finds job "A" in its queue and
   resets its partition pointer.
4. Job "A" completes and tries to remove resource allocation record
   from select/cons_res data structure, but fails to find it because
   it is looking in the table for the wrong partition.
5. Job "A" record gets purged from slurmctld
6. Select/cons_res plugin attempts to operate on resource allocation
   data structure, finds pointer into the now purged data structure
   of job "A" and aborts or gets SEGV
Bug 2603

d8b18ff8
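
Steps 2 through 4 of the scenario above can be reproduced in a toy model. This is only a sketch under stated assumptions: the `Partition` and `Job` classes and both helper functions are hypothetical stand-ins, not select/cons_res data structures.

```python
class Partition:
    def __init__(self, name):
        self.name = name
        self.allocations = {}  # job_id -> list of allocated nodes


class Job:
    def __init__(self, job_id, partition):
        self.job_id = job_id
        self.partition = partition  # partition the job is currently tied to


def start_job(job, nodes):
    """Record the allocation in the table of the job's current partition."""
    job.partition.allocations[job.job_id] = nodes


def finish_job(job):
    """Remove the allocation record; return True only if it was found."""
    return job.partition.allocations.pop(job.job_id, None) is not None


debug, batch = Partition("debug"), Partition("batch")
job_a = Job("A", debug)

start_job(job_a, ["node1"])   # step 2: main scheduler starts job A in "debug"
job_a.partition = batch       # step 3: backfill resets the partition pointer
removed = finish_job(job_a)   # step 4: removal looks in the wrong table

print(removed)                # False
print(debug.allocations)      # {'A': ['node1']}
```

The record left behind in `debug.allocations` plays the role of the stale entry in steps 5 and 6: once the job itself is purged, anything still holding that entry refers to data that no longer exists.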

Rename function, no real code change. The old function name was completely · 6f0c2d3f
Danny Auble authored Apr 05, 2016
```
misleading.
```
6f0c2d3f
Merge remote-tracking branch 'origin/slurm-15.08' · 0878501e
Danny Auble authored Apr 04, 2016

0878501e
Change method of TRES sort to sort all non-static tres alphabetically · 9db01723
Danny Auble authored Apr 04, 2016
```
instead of ID to make things easier to read.
```
9db01723
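
One way to picture the new ordering is a two-part sort key. The sketch below assumes that static TRES keep their ID order and come first, and that everything else is ordered by name; the set of static IDs, the sample IDs, and the `tres_sort_key` helper are all hypothetical.

```python
# Assumed IDs of the built-in (static) TRES, for illustration only.
STATIC_TRES_IDS = {1, 2, 3, 4}


def tres_sort_key(tres):
    """Static TRES sort first by ID; all others sort alphabetically by name."""
    tres_id, name = tres
    if tres_id in STATIC_TRES_IDS:
        return (0, tres_id, "")
    return (1, 0, name)


tres = [
    (1001, "license/matlab"),
    (2, "mem"),
    (1003, "gres/gpu"),
    (1, "cpu"),
    (1002, "gres/mic"),
]

for tres_id, name in sorted(tres, key=tres_sort_key):
    print(tres_id, name)
```

With this key the non-static entries come out as `gres/gpu`, `gres/mic`, `license/matlab`, regardless of the order in which their IDs were assigned.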
Remove debug from commit 921c59e4 · 24566dd7
Danny Auble authored Apr 04, 2016

24566dd7
Merge remote-tracking branch 'origin/slurm-15.08' · 95399ecb
Danny Auble authored Apr 04, 2016
```
# Conflicts:
#	src/common/gres.c
```
95399ecb

04 Apr, 2016 4 commits
- Remove duplicates from AccountingStorageTRES · 921c59e4
  Danny Auble authored Apr 04, 2016
  
  921c59e4
- Add slurm_set_accounting_storage_tres · 5751b9d6
  Danny Auble authored Apr 04, 2016
  
  5751b9d6
- If using PrologFlags=contain: Don't launch the extern step if a job is · 91a83e41
  Danny Auble authored Apr 04, 2016
```
canceled while launching.
```
  91a83e41
- Change in comment for greater clarity · 3f51a788
  Morris Jette authored Apr 04, 2016
  
  3f51a788
02 Apr, 2016 3 commits
- checkpoint/blcr plugin: Fix memory leak. · 08d520db
  Morris Jette authored Apr 02, 2016
  
  08d520db
- Fix potential divide by zero when tree_width=1 · ef8c5e1b
  Danny Auble authored Apr 01, 2016
  
  ef8c5e1b
- Add spank_task_post_fork to the extern step. · c280838a
  Danny Auble authored Apr 01, 2016
  
  c280838a
01 Apr, 2016 4 commits

Cosmetic change, no change to logic · fabc772e
Morris Jette authored Apr 01, 2016

fabc772e

Morris Jette authored Apr 01, 2016

Rather than making sure that a running job's socket count on a
  node remains constant, just make sure the total core count
  remains constant.

b8a2b13e

Tweak test for oversubscribe change · f9da1f4c
Morris Jette authored Apr 01, 2016

f9da1f4c

Rename "Shared" to "OverSubscribe" · 5fe0915e

Morris Jette authored Apr 01, 2016

Rename partition configuration from "Shared" to "OverSubscribe". Rename
salloc, sbatch, srun option from "--shared" to "--oversubscribe". The old
options will continue to function. Output field names also changed in
scontrol, sinfo, squeue, and sview.

5fe0915e
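
Keeping an old option name working after a rename usually means registering both spellings against the same destination. The following is an illustrative sketch using Python's `argparse`, not Slurm's own option parser; the `build_parser` helper and the program name are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="submit-sketch")
    # Both spellings set the same attribute, so callers of the old
    # name see no change in behavior.
    parser.add_argument(
        "--oversubscribe",
        "--shared",
        dest="oversubscribe",
        action="store_true",
        help="allow the allocation to share resources with other jobs",
    )
    return parser


new_style = build_parser().parse_args(["--oversubscribe"])
old_style = build_parser().parse_args(["--shared"])
print(new_style.oversubscribe, old_style.oversubscribe)  # True True
```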

31 Mar, 2016 3 commits
- Merge branch 'slurm-15.08' · 30274f4d
  Morris Jette authored Mar 31, 2016
  
  30274f4d
- power/cray fix for nodes not ready · 5b0800e4
  Morris Jette authored Mar 31, 2016
```
Power/cray: Don't specify NID list to Cray APIs. If any of those nodes are
    not in a ready state, the API returned an error for ALL nodes rather than
    valid data for nodes in ready state.
bug 2332
```
  5b0800e4
- Change error message in the pmi2 code to debug level, as the issue can be expected · bcccd20c
  Matthieu Hautreux authored Mar 30, 2016
```
and retries are done, making the error message a little misleading.
```
  bcccd20c
30 Mar, 2016 6 commits

Update node socket/core counts on the fly · 606948a8

Morris Jette authored Mar 30, 2016

Update a node's socket and cores per socket counts as needed after a node
boot to reflect configuration changes which can occur on KNL processors.
Note that the node's total core count must not change, only the distribution
of cores across varying socket counts (KNL NUMA nodes treated as sockets by
Slurm).

606948a8
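
The constraint described above (the layout may change, the total core count may not) can be written as a small check. This is a sketch only: the dictionary fields, the node name, and the `update_node_layout` helper are hypothetical, and the core counts are example numbers.

```python
def update_node_layout(node, sockets, cores_per_socket):
    """Return the node with a new socket layout, keeping total cores fixed."""
    old_total = node["sockets"] * node["cores_per_socket"]
    new_total = sockets * cores_per_socket
    if new_total != old_total:
        raise ValueError(
            f"total core count changed from {old_total} to {new_total}"
        )
    return {**node, "sockets": sockets, "cores_per_socket": cores_per_socket}


node = {"name": "knl01", "sockets": 1, "cores_per_socket": 64}

# Same 64 cores, now reported as 4 sockets of 16 cores each.
print(update_node_layout(node, 4, 16))
```

A request such as `update_node_layout(node, 4, 17)` raises `ValueError`, because it would change the total from 64 to 68 cores.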

Log heterogeneous socket/core/thread configuration · 3c83b269

Morris Jette authored Mar 30, 2016

Log if the number of cores is not evenly divisible by the socket
  count (which will be the case on some KNL) or the number of
  threads is not evenly divisible by the core count.

3c83b269
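
The two conditions being logged are plain divisibility checks. A minimal sketch, assuming node-wide totals for cores and threads; the `check_layout` helper and the message wording are hypothetical.

```python
def check_layout(sockets, cores, threads):
    """Return a message for each uneven split; counts are node-wide totals."""
    messages = []
    if cores % sockets != 0:
        messages.append(
            f"{cores} cores are not evenly divisible by {sockets} sockets"
        )
    if threads % cores != 0:
        messages.append(
            f"{threads} threads are not evenly divisible by {cores} cores"
        )
    return messages


for message in check_layout(sockets=3, cores=64, threads=128):
    print(message)
```

A uniform layout such as `check_layout(4, 64, 128)` returns an empty list, so nothing would be logged.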

Fix issue where if a slurmdbd rollup lasted longer than 1 hour the · 2bec1975
Danny Auble authored Mar 29, 2016
```
rollup would effectively never run again.

bug 2575

and sort of bug 2596
```
2bec1975

Replace SchedulerParameters option assoc_limit_continue · a685e0e9

Morris Jette authored Mar 29, 2016

Remove the SchedulerParameters option of "assoc_limit_continue", making it
the default value. Add option of "assoc_limit_stop". If "assoc_limit_stop"
is set and a job cannot start due to association limits, then do not attempt
to initiate any lower priority jobs in that partition. Setting this can
decrease system throughput and utilization, but avoids potentially starving
larger jobs, which could otherwise be prevented from launching indefinitely.

a685e0e9
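
The difference between the default behavior and "assoc_limit_stop" can be sketched as a loop over one partition's queue. This is a simplified model, not the scheduler's code: the `schedule_partition` helper, the job names, and the `within_assoc_limits` callback are hypothetical.

```python
def schedule_partition(jobs, within_assoc_limits, assoc_limit_stop=False):
    """Return the jobs started from a queue sorted by descending priority."""
    started = []
    for job in jobs:
        if within_assoc_limits(job):
            started.append(job)
        elif assoc_limit_stop:
            break  # do not try lower priority jobs in this partition
        # default: skip the blocked job and keep going
    return started


queue = ["big", "small1", "small2"]  # highest priority first
blocked = {"big"}                    # held back by association limits


def within_assoc_limits(job):
    return job not in blocked


print(schedule_partition(queue, within_assoc_limits))
# ['small1', 'small2']
print(schedule_partition(queue, within_assoc_limits, assoc_limit_stop=True))
# []
```

In the default case the smaller jobs start ahead of the blocked one, which is where the starvation risk comes from. With `assoc_limit_stop` set nothing starts, which shows the throughput cost.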

Start NEWS for v16.05.0-pre3 · f0f1286e
Morris Jette authored Mar 29, 2016

f0f1286e
Update RELEASE_NOTES through pre2 NEWS · bb50f3ed
Morris Jette authored Mar 29, 2016

bb50f3ed