Commits · 27f201f60379f4a8004948cd2ab39e103f910ddd · Manuel G. Marciani / ces_slurm_simulator

28 Apr, 2016 3 commits
- Fix for srun hang with OpenMPI and PMIx · af47b4b2
  Artem Polyakov authored Apr 28, 2016
```
See bug 2672 for details
```
  af47b4b2
- Move NEWS entry from e8e5d408 into the correct location. · 4291ac95
  Tim Wickberg authored Apr 27, 2016
  
  4291ac95
- Fix version issue when packing GRES information between 2 different versions · e8e5d408
  Danny Auble authored Apr 27, 2016
```
of Slurm.
```
  e8e5d408
27 Apr, 2016 2 commits

Fix test cases to have proper signature for main() · 7925aa42
Tim Wickberg authored Apr 27, 2016
```
The compiler errors out, preventing these 13 tests from running, unless the
implied int return type for main() is made explicit.
```
7925aa42

Fix invalid error with two task plugins · 410e8c6c

Morris Jette authored Apr 26, 2016

Avoid error message of "Requested cpu_bind option requires entire node to
be allocated; disabling affinity" being generated in some cases where
task/affinity and task/cgroup plugins used together.

410e8c6c

26 Apr, 2016 2 commits
- Bluegene - Fix issue with reservations resizing under the covers on a · c18264a6
  Danny Auble authored Apr 26, 2016
```
restart of the slurmctld.
```
  c18264a6
- Prevent integer overflow in memory limit calculation by adding cast. · fe85cc35
  Sam Gallop authored Apr 26, 2016
```
Otherwise miscalculated limit will lead to job cancellation even when well
inside the allocated amount.

Bug 2660.
```
  fe85cc35
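The overflow class addressed by fe85cc35 can be illustrated with a minimal sketch. The function names and the MB-to-KB conversion below are hypothetical and are not taken from the Slurm source; they only show how a 32-bit multiplication wraps before it is widened, and how a cast on one operand avoids that (assuming a platform where `unsigned int` is 32 bits wide).

```c
#include <stdint.h>

/* Hypothetical helper, not Slurm code: convert a memory limit from MB to KB.
 * Both operands are 32 bits wide, so the product wraps modulo 2^32
 * before it is widened to the 64-bit return type. */
uint64_t limit_kb_without_cast(uint32_t limit_mb)
{
    return limit_mb * 1024u;
}

/* Same conversion with the cast applied to one operand first,
 * which forces the multiplication to be carried out in 64 bits. */
uint64_t limit_kb_with_cast(uint32_t limit_mb)
{
    return (uint64_t)limit_mb * 1024u;
}
```

With a limit of 5242880 MB, the version without the cast yields a value far below the real limit, so a job well inside its allocation could appear to exceed it.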
23 Apr, 2016 1 commit
- Fix potential issue when adding and removing TRES which could result · fb22bcc2
  Tim Wickberg authored Apr 22, 2016
```
in the slurmdbd segfaulting.

Bug 2656
```
  fb22bcc2
20 Apr, 2016 1 commit

burst_buffer/cray - fix create/destroy buffer only · 1391d29a

Morris Jette authored Apr 20, 2016

burst_buffer/cray - Don't call Datawarp "paths" function if script includes
    only create or destroy of persistent burst buffer. Some versions of Datawarp
    software return an error for such scripts, causing the job to be held.
bug 2624

1391d29a

13 Apr, 2016 2 commits
- Fix typo in NEWS · 0f34e2ad
  Morris Jette authored Apr 13, 2016
  
  0f34e2ad
- Prevent deadlock for flow of data to the slurmdbd when sending reservation · cd7aa558
  Danny Auble authored Apr 12, 2016
```
that wasn't set up correctly.
```
  cd7aa558
12 Apr, 2016 2 commits
- power/cray - returned to operation · a7e03592
  Morris Jette authored Apr 12, 2016
```
power/cray - Fix bug introduced in 15.08.10 preventing operation in many
    cases.
bug 2628
```
  a7e03592
- power/cray - Prevent possible divide by zero · b6a8373c
  Morris Jette authored Apr 12, 2016
  
  b6a8373c
11 Apr, 2016 4 commits
- burst_buffer/cray persistent create/delete fail · 87e1cc67
  Morris Jette authored Apr 11, 2016
```
burst_buffer/cray - Fix for script creating or deleting persistent buffer
    would fail "paths" operation and hold the job.
bug 2624
```
  87e1cc67
- MYSQL - Make the error message more specific when removing a reservation · bd8042e8
  Danny Auble authored Apr 11, 2016
```
and it doesn't meet basic requirements.
```
  bd8042e8
- burst_buffer/cray fix for pre_run fail · 8f667db4
  Morris Jette authored Apr 11, 2016
```
burst_buffer/cray - Decrement job's prolog_running counter if pre_run fails.
bug 2621
```
  8f667db4
- Reset job's prolog_running counter · f3f41e10
  Morris Jette authored Apr 11, 2016
```
If a job is no longer in configuring state, then clear the prolog_running
  counter on slurmctld restart or reconfigure.
bug 2621
```
  f3f41e10
09 Apr, 2016 1 commit

backfill scheduling enhancement · e62a9270

Morris Jette authored Apr 08, 2016

When determining when a pending job will be able to start, rather
than testing after removing each running job and trying to schedule
the pending jobs, remove multiple jobs that all end about the
same time before testing. This reduces the number of calls to
the job placement logic, which is time consuming.

e62a9270
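The batching idea described in e62a9270 can be sketched as follows. This is an illustrative model only, not the backfill scheduler's code: the function name, the `window` parameter, and the assumption that end times arrive sorted in ascending order are all hypothetical. It counts how many times the expensive placement test would run when running jobs that end close together are removed as one batch.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical model, not Slurm code. end_times must be sorted ascending.
 * Jobs ending within `window` seconds of the first job in a batch are
 * removed together, and the placement test runs once per batch. */
size_t count_placement_tests(const time_t *end_times, size_t n, time_t window)
{
    size_t tests = 0;
    size_t i = 0;

    while (i < n) {
        time_t batch_start = end_times[i];

        /* Consume every job whose end time falls inside this batch. */
        while (i < n && end_times[i] - batch_start <= window)
            i++;

        tests++; /* one call to the placement logic per batch */
    }
    return tests;
}
```

With a window of 0 and distinct end times, this degenerates to one test per running job, which corresponds to the behavior the commit message describes as the previous approach.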

07 Apr, 2016 2 commits
- Fix handling for single-character prognames · 11320ebc
  Sami Ilvonen authored Apr 07, 2016
  
  11320ebc
- fix for job "--contiguous" option · 47a07b54
  Morris Jette authored Apr 06, 2016
```
Fix for job "--contiguous" option that could cause job allocation/launch
    failure or slurmctld crash.
bug 2573
```
  47a07b54
06 Apr, 2016 5 commits

Start NEWS for v15.08.11 · 3a8ecf32
Morris Jette authored Apr 06, 2016

3a8ecf32
Revert "Fix situation on a heterogeneous memory cluster where the order of" · 3ae45a51
Danny Auble authored Apr 06, 2016
```
This reverts commit f559a55c.
```
3ae45a51

Fix situation on a heterogeneous memory cluster where the order of · f559a55c

Danny Auble authored Apr 06, 2016

constraints mattered in a job.

Details include:
A job doesn't request memory but the system is running
with CR_*MEMORY with no default memory limit and the job requests nodes
with features of different sizes.  Previously the order of constraints
mattered where the smaller memory node would need to be requested first
or the job would fail.

Bug 2608

f559a55c

Don't change job time limit when updating unrelated field in a job · 594c7997

Morris Jette authored Apr 06, 2016

Previous logic would get an account and/or QOS time limit and use
  that value to overwrite the incoming RPC's NO_VAL value, which
  would change a job's time limit when changing an unrelated
  field (e.g. priority, QOS, etc.).
bug 2610

594c7997
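A minimal sketch of the sentinel handling that 594c7997 describes, under stated assumptions: the struct layouts, the function name, and the `SKETCH_NO_VAL` constant are hypothetical stand-ins, not the slurmctld data structures. The point is only that a field carrying the "not set" sentinel in an update request should leave the job's existing value untouched.

```c
#include <stdint.h>

/* Stand-in sentinel meaning "this field was not set in the request". */
#define SKETCH_NO_VAL 0xfffffffeu

/* Hypothetical, simplified job record and update request. */
struct sketch_job {
    uint32_t time_limit;
    uint32_t priority;
};

struct sketch_update {
    uint32_t time_limit;
    uint32_t priority;
};

/* Apply only the fields the request actually set. A field left at
 * SKETCH_NO_VAL is skipped rather than being replaced by some other
 * value (such as an account or QOS limit) and then written to the job. */
void sketch_apply_update(struct sketch_job *job,
                         const struct sketch_update *req)
{
    if (req->time_limit != SKETCH_NO_VAL)
        job->time_limit = req->time_limit;
    if (req->priority != SKETCH_NO_VAL)
        job->priority = req->priority;
}
```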

Avoid double calculation on partition QOS if the job is using the same QOS. · e17a7eaf
Danny Auble authored Apr 06, 2016

e17a7eaf

05 Apr, 2016 1 commit

Fix backfill scheduler race condition · d8b18ff8

Morris Jette authored Apr 05, 2016

Fix backfill scheduler race condition that could cause invalid pointer in
    select/cons_res plugin. Bug introduced in 15.08.9, commit:
    efd9d35e

The scenario is as follows
1. Backfill scheduler is running, then releases locks
2. Main scheduling loop starts a job "A"
3. Backfill scheduler resumes, finds job "A" in its queue and
   resets its partition pointer.
4. Job "A" completes and tries to remove resource allocation record
   from select/cons_res data structure, but fails to find it because
   it is looking in the table for the wrong partition.
5. Job "A" record gets purged from slurmctld
6. Select/cons_res plugin attempts to operate on resource allocation
   data structure, finds pointer into the now purged data structure
   of job "A" and aborts or gets SEGV
Bug 2603

d8b18ff8

04 Apr, 2016 2 commits
- Remove duplicates from AccountingStorageTRES · 921c59e4
  Danny Auble authored Apr 04, 2016
  
  921c59e4
- If using PrologFlags=contain: Don't launch the extern step if a job is · 91a83e41
  Danny Auble authored Apr 04, 2016
```
canceled while launching.
```
  91a83e41
02 Apr, 2016 2 commits
- checkpoint/blcr plugin: Fix memory leak. · 08d520db
  Morris Jette authored Apr 02, 2016
  
  08d520db
- Fix potential divide by zero when tree_width=1 · ef8c5e1b
  Danny Auble authored Apr 01, 2016
  
  ef8c5e1b
31 Mar, 2016 2 commits

power/cray fix for nodes not ready · 5b0800e4

Morris Jette authored Mar 31, 2016

Power/cray: Don't specify NID list to Cray APIs. If any of those nodes are
    not in a ready state, the API returned an error for ALL nodes rather than
    valid data for nodes in ready state.
bug 2332

5b0800e4

Downgrade the error message in the pmi2 code to debug, as the issue can be expected · bcccd20c
Matthieu Hautreux authored Mar 30, 2016
```
and retries are done making the error message a little misleading.
```
bcccd20c

30 Mar, 2016 2 commits
- Fix issue where if a slurmdbd rollup lasted longer than 1 hour the · 2bec1975
  Danny Auble authored Mar 29, 2016
```
rollup would effectively never run again.

bug 2575

and sort of bug 2596
```
  2bec1975
- Update NEWS for start of v15.08.10 · 173eb1e6
  Morris Jette authored Mar 29, 2016
  
  173eb1e6
28 Mar, 2016 2 commits

task/cgroup - Fix task binding to CPUs bug · ddf6d9a4

Morris Jette authored Mar 28, 2016

There was a subtle bug in how tasks were bound to CPUs which could result
in an "infinite loop" error. The problem was that various socket/core/thread
calculations were based upon the resources allocated to a step rather than
all resources on the node, so rounding errors could occur. Consider for
example a node with 2 sockets, 6 cores per socket and 2 threads per core.
On the idle node, a job requesting 14 CPUs is submitted. That job would
be allocated 4 cores on the first socket and 3 cores on the second socket.
The old logic would get the number of sockets for the job at 2 and the
number of cores at 7, then calculate the number of cores per socket at
7/2 or 3 (rounding down to an integer). The logic laying out tasks
would bind the first 3 cores on each socket to the job then not find any
remaining cores, report the "infinite loop" error to the user, and run
the job without one of the expected cores. The problem gets even worse
when there are some allocated cores on a node. In a more extreme case,
a job might be allocated 6 cores on one socket and 1 core on a second
socket. In that case, 3 of that job's cores would be unused.
bug 2502

ddf6d9a4
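The arithmetic in the ddf6d9a4 message can be reproduced with a small model. This is a simplified sketch, not the task/cgroup implementation: the function name and the assumption that each socket contributes at most the rounded-down per-socket count are illustrative.

```c
#include <stddef.h>

/* Hypothetical model, not Slurm code. Given the number of cores allocated
 * to a job on each socket, return how many cores get bound when the
 * per-socket count is derived as total_cores / sockets (rounded down). */
int cores_bound_old_logic(const int *alloc_per_socket, size_t sockets)
{
    int total = 0;
    int bound = 0;
    int per_socket;
    size_t i;

    if (sockets == 0)
        return 0;

    for (i = 0; i < sockets; i++)
        total += alloc_per_socket[i];

    per_socket = total / (int)sockets; /* integer division rounds down */

    for (i = 0; i < sockets; i++) {
        /* A socket can contribute no more than it was allocated,
         * and no more than the rounded-down per-socket count. */
        bound += alloc_per_socket[i] < per_socket ? alloc_per_socket[i]
                                                  : per_socket;
    }
    return bound;
}
```

For the 4+3 allocation in the message this binds 6 of 7 cores, and for the 6+1 allocation it binds 4 of 7, leaving 3 cores unused, which matches the figures in the commit text.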

Fix for srun signal handling threading problem · c8d36dba

Morris Jette authored Mar 28, 2016

This is a revision to commit 1ed38f26.
The root problem is that a pthread is passed an argument which is
a pointer to a variable on the stack. If that variable is overwritten,
the signal number received will be garbage, and that bad signal
number will be interpreted by srun, possibly aborting the request.

c8d36dba
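One common way to avoid the hazard described in c8d36dba is to hand the thread its own heap-allocated copy of the value. The sketch below shows that general technique; it is not the change made in the commit, and the function names are hypothetical. The thread is joined here only so the caller can observe which value the thread actually saw.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Thread body: take ownership of the heap copy, read it, then free it. */
static void *sketch_sig_thread(void *arg)
{
    int *boxed = arg;
    int signo = *boxed;

    free(boxed);
    return (void *)(intptr_t)signo; /* report the value the thread saw */
}

/* Hypothetical helper, not srun code. Passes signo to a new thread through
 * a heap copy, so later changes to the caller's stack cannot corrupt it.
 * Returns the value seen by the thread, or -1 on failure. */
int sketch_forward_signal(int signo)
{
    pthread_t tid;
    void *seen = NULL;
    int *boxed = malloc(sizeof(*boxed));

    if (boxed == NULL)
        return -1;
    *boxed = signo;

    if (pthread_create(&tid, NULL, sketch_sig_thread, boxed) != 0) {
        free(boxed);
        return -1;
    }
    if (pthread_join(tid, &seen) != 0)
        return -1;

    return (int)(intptr_t)seen;
}
```

Passing `&signo` directly would be safe only if the caller's stack frame were guaranteed to outlive the thread's read of the value, which a detached or long-running thread does not guarantee.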

26 Mar, 2016 1 commit

Revert commit · c1dde86c

Morris Jette authored Mar 25, 2016

The previous commit obviously fixed a problem, but introduced a different
set of problems. This will be pursued later, perhaps in version 16.05.

c1dde86c

25 Mar, 2016 2 commits

Revert commit · f5920b77

Morris Jette authored Mar 25, 2016

With some configurations and systems, errors of the following sort were
occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'

f5920b77

burst_buffer/cray - pre-run fail fix · 5a48207e

Morris Jette authored Mar 25, 2016

burst_buffer/cray - If the pre-run operation fails then don't issue
    duplicate job cancel/requeue unless the job is still in run state. Prevents
    jobs hung in COMPLETING state.
bug 2587

5a48207e

24 Mar, 2016 1 commit
- Remove Rgt from the association output of scontrol show assoc_mgr. Rgt · ddfd2781
  Danny Auble authored Mar 24, 2016
```
isn't kept up to date in the cache.
```
  ddfd2781