Commits · 0aadbcc4437c547bd341175918d1500ec3b2f2cf · Manuel G. Marciani / ces_slurm_simulator

05 Dec, 2016 2 commits

On state restore in the slurmctld don't overwrite the mem_spec_limit given · 1eeb9e45
Danny Auble authored Dec 05, 2016
```
from the slurm.conf when using FastSchedule=0.
```
1eeb9e45

cray/burst_buffer - slurmctld restart fix · a88a961c

Morris Jette authored Dec 05, 2016

cray/burst_buffer - If slurmctld daemon restarts with pending job and burst
    buffer having unknown file stage-in status, teardown the buffer, defer the
    job, and start stage-in over again.
bug 3295

a88a961c

02 Dec, 2016 3 commits
- NRT - Make it so you can have more than 1 protocol listed in MP_MSG_API · b3b7cf2e
  Danny Auble authored Dec 02, 2016
```
bug 3314
```
  b3b7cf2e
- NRT - Make it so protocols pgas and test are allowed to be used. · adaab822
  Danny Auble authored Dec 02, 2016
  
  adaab822
- Make it so a system running against IBM's PE will work with PE version 1.3 · a037af18
  Danny Auble authored Dec 02, 2016
  
  a037af18
01 Dec, 2016 2 commits

Make sure if a job can't run because of resources we also check accounting · 031c467f

Dominik Bartkiewicz authored Dec 01, 2016

limits after the node selection to make sure it doesn't violate those limits
and if it does change the reason for waiting so we don't reserve resources
on jobs violating accounting limits.

Bug 3029

031c467f

knl_cray: Fix KNL mode/feature race condition · 46128f2b

Morris Jette authored Nov 30, 2016

node_features/knl_cray - Fix possible race condition when changing node
    state that could result in an old KNL mode as an active feature.
bug 3235

46128f2b

30 Nov, 2016 2 commits

cray/burst_buffer - Increase timer · b4763c75

Morris Jette authored Nov 30, 2016

cray/burst_buffer - Increase time to synchronize operations between threads
    from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
    This should fix a race condition between a thread performing a buffer
    creation (setup) and a thread looking for unexpected buffers. If a
    buffer is found during the time window allowed for creation, its
    space will be counted twice. First by the status checking thread
    and second by the thread doing the creation. The deallocation only
    happens once, so the used space information can be left with an
    invalid value.
bug 3295

b4763c75

sbcast - prevent segfault in slurmd from multiple zlib compressed transfers · 8c5765c9

Tim Wickberg authored Nov 30, 2016

static variable means multiple active decompression streams will corrupt
zlib's internal state, which can lead to a segfault.

Bug 3299.

8c5765c9
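
The failure mode described above, one shared piece of decompression state used by several active streams, can be illustrated outside of slurmd. The Python sketch below is an analogy only, not the sbcast code: it keeps one `zlib` decompressor object per transfer so that interleaved chunks from different transfers never touch each other's state. The function names and the stream IDs are hypothetical.

```python
import zlib


def split_chunks(data, size):
    """Split a byte string into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def decompress_interleaved(chunks_by_stream):
    """Decompress several zlib streams whose chunks arrive interleaved.

    chunks_by_stream maps a hypothetical stream ID to its list of
    compressed chunks. Each stream gets its own decompressor object,
    which is the per-stream state that a single shared (static)
    variable cannot provide.
    """
    decompressors = {sid: zlib.decompressobj() for sid in chunks_by_stream}
    outputs = {sid: b"" for sid in chunks_by_stream}
    longest = max(len(chunks) for chunks in chunks_by_stream.values())
    for i in range(longest):
        # Round-robin over the streams to mimic concurrent transfers.
        for sid, chunks in chunks_by_stream.items():
            if i < len(chunks):
                outputs[sid] += decompressors[sid].decompress(chunks[i])
    for sid in outputs:
        outputs[sid] += decompressors[sid].flush()
    return outputs
```

Feeding the same interleaved chunks into a single shared decompressor would mix the two streams' data, which is the kind of state corruption the commit message describes.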

29 Nov, 2016 1 commit

Fix SuspendExcNodes and SuspendExcParts on slurmctld SIGHUP. · bb06dd65

Alejandro Sanchez authored Nov 29, 2016

On a reconfig, the exc_node_bitmap is cleared but then it was
not built again since last_work_scan was declared as a local static
variable in _do_power_work(). The fix is to make it global within the
plugin and reinitialize it to 0 on _init_power_config().

Bug 3078.

bb06dd65
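
The reconfigure problem above can be sketched in a few lines. This is a simplified Python model under stated assumptions, not the plugin's C code: it assumes the exclusion set is rebuilt only when `last_work_scan` is 0, and it models the old behavior with a hypothetical `reset_scan_time=False` flag. The class and method names are illustrative.

```python
class PowerSaveState:
    """Toy model of the power-save state described in the commit."""

    def __init__(self, excluded_nodes):
        self.configured_exclusions = set(excluded_nodes)
        self.exc_node_bitmap = None   # built lazily by do_power_work()
        self.last_work_scan = 0       # 0 means "no scan since (re)config"

    def init_power_config(self, excluded_nodes, reset_scan_time=True):
        """Model a reconfigure: the exclusion set is always cleared.

        reset_scan_time=False models the old behavior, where the scan
        time lived in a function-local static and could not be reset.
        """
        self.configured_exclusions = set(excluded_nodes)
        self.exc_node_bitmap = None
        if reset_scan_time:
            self.last_work_scan = 0

    def do_power_work(self, now):
        """Rebuild the exclusion set only on the first scan (assumption)."""
        if self.last_work_scan == 0:
            self.exc_node_bitmap = set(self.configured_exclusions)
        self.last_work_scan = now
        return self.exc_node_bitmap
```

With `reset_scan_time=False`, the scan after a reconfigure returns `None` because nothing triggers the rebuild; with the reset in place, the new exclusion list is picked up on the next scan.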

28 Nov, 2016 3 commits
- Make the openssl crypto plugin compile with openssl >= 1.1. · fd747355
  Alejandro Sanchez authored Nov 28, 2016
  
  fd747355
- sacctmgr - prevent segfault when trying to reset usage for an invalid account · 9e028071
  Dominik Bartkiewicz authored Nov 28, 2016
```
Bug 3267.
```
  9e028071
- srun - prevent segfault in launch plugin when terminating not-yet-created step. · d4aa1998
  Dominik Bartkiewicz authored Nov 28, 2016
```
Termination can race against step creation if, e.g., ill-behaved SPANK plugins
are in use.

Bug 3248.
```
  d4aa1998
22 Nov, 2016 5 commits

Correct malloc data type · a12e1a1c

Morris Jette authored Nov 22, 2016

sched/backfill plugin: Make malloc match data type (defined as uint32_t and
allocated as int). No failures have been observed, but if type "int" is
smaller than "uint32_t", it could result in an invalid memory reference.

a12e1a1c

Fix slurm_job_cpus_allocated_str_on_node_id() API call. · 0ed6488e

Sergey Meirovich authored Nov 22, 2016

Fix API call: slurm_job_cpus_allocated_str_on_node_id() and
in turn slurm_job_cpus_allocated_str_on_node() to return correct
results for any node other than the first. The logic to calculate the
first bit belonging to a particular node was missing, so the lookup
always started from bit 0.

Bug 3266.

0ed6488e
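
The offset calculation that was missing can be sketched as follows. This is a minimal Python illustration, assuming a flat per-job core bitmap in which each node's cores are stored back to back; the function and parameter names are hypothetical and are not the Slurm API.

```python
def allocated_cores_on_node(core_bitmap, cores_per_node, node_id):
    """Return the node-local indices of allocated cores on one node.

    core_bitmap    -- flat list of 0/1 flags covering all nodes of the job
    cores_per_node -- number of cores on each node, in node order
    node_id        -- index of the node within the job's allocation
    """
    # The first bit of this node is the sum of the core counts of all
    # earlier nodes. Always starting at bit 0 is only right for node 0.
    first_bit = sum(cores_per_node[:node_id])
    node_bits = core_bitmap[first_bit:first_bit + cores_per_node[node_id]]
    return [i for i, used in enumerate(node_bits) if used]
```

For a two-node job with four cores per node and the bitmap `[1, 1, 0, 0, 0, 1, 1, 0]`, node 0 reports cores 0 and 1, while node 1 reports cores 1 and 2 once the offset is applied.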

backfill algorithm logic · e089b63a

Morris Jette authored Nov 22, 2016

After one second of wall time, simulate the termination of all remaining
   running jobs in order to respond in a reasonable time frame.
bug 3275

e089b63a

Modify backfill algorithm · 6008b021

Morris Jette authored Nov 22, 2016

Modify backfill algorithm to improve performance with large numbers of
    running jobs. Group running jobs that end in a "similar" time frame using a
    time window that grows exponentially rather than linearly. The original
    window sizes were (in units of minutes):
    0, 1, 2, 3, 4, 5, 6, 7, ... minutes
    The new window sizes are:
    0.5, 1, 2, 4, 8, 16, 32, ... minutes
    This can dramatically reduce the number of instances where the very time
    consuming "can the pending job run now" operation is executed, especially
    if there are 1000+ running jobs.
bug 3275

6008b021
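
The effect of the two window sequences can be compared with a small Python sketch. The window sizes come from the commit message; the grouping rule itself (each group starts at the earliest ungrouped end time and absorbs every job ending within the current window) is an assumption made for illustration and may differ from what sched/backfill actually does. All function names are hypothetical.

```python
def linear_windows(count):
    """Original window sizes in minutes: 0, 1, 2, 3, ..."""
    return [float(i) for i in range(count)]


def exponential_windows(count):
    """New window sizes in minutes: 0.5, 1, 2, 4, 8, ..."""
    return [0.5 * 2 ** i for i in range(count)]


def group_end_times(end_times, window_sizes):
    """Group job end times (in minutes) using successive window sizes.

    window_sizes must be non-empty; the last size is reused if more
    groups are needed than sizes were supplied.
    """
    remaining = sorted(end_times)
    groups = []
    i = 0
    while i < len(remaining):
        window = window_sizes[min(len(groups), len(window_sizes) - 1)]
        start = remaining[i]
        group = [start]
        i += 1
        # Absorb every job that ends within the window of the group start.
        while i < len(remaining) and remaining[i] - start <= window:
            group.append(remaining[i])
            i += 1
        groups.append(group)
    return groups
```

Under this assumed rule, 60 jobs ending one minute apart fall into 11 groups with the linear windows and 7 groups with the exponential ones, so fewer "can the pending job run now" evaluations would be needed.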

testsuite - fix job id output in test17.39 · 44241006
Nicolas Joly authored Nov 22, 2016

44241006

14 Nov, 2016 1 commit

avoid additional job allocations on booting nodes · b927fb08

Morris Jette authored Nov 14, 2016

If a node is booting for some job, don't allocate additional jobs to the
    node until the boot completes.
bug 3256

b927fb08

13 Nov, 2016 1 commit
- cgroup plugins - fix two minor memory leaks · 85ab952a
  Alejandro Sanchez authored Nov 13, 2016
```
Found with valgrind. Bug 2846.
```
  85ab952a
11 Nov, 2016 3 commits
- knl_cray plugin - Avoid abort from backup slurmctld at start time · 9702ec22
  Morris Jette authored Nov 11, 2016
```
Move where we set the configuration table bitmaps in order to support
  the backup slurmctld starting and recovering previously saved
  KNL mode information (which can necessitate rebuilding the node
  configuration table).
bug 3241
```
  9702ec22
- Docs - elaborate on how to clear TRES limits in sacctmgr. · cdc737e6
  Tim Wickberg authored Nov 11, 2016
```
Bug 3255.
```
  cdc737e6
- switch/cray plugin - fix use after free in debug message. · e391622e
  David Gloe authored Nov 10, 2016
```
Bug 3253.
```
  e391622e
10 Nov, 2016 2 commits

Fix output routines to prevent unintended rounding. · 377a6259

Tim Wickberg authored Nov 10, 2016

If the input value mod 512 == 0, the value would be subject to
unintended rounding. Rework the function to check against this
on each unit promotion.

Bug 3252.

377a6259
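
The difference between checking once and checking on each unit promotion can be shown with a short Python sketch. It is not the Slurm output routine: the unit table, the two-decimal rendering, and the function names are assumptions chosen only to make the rounding visible.

```python
UNITS = ["", "K", "M", "G", "T", "P"]


def _render(value, unit):
    """Print whole numbers without decimals, others with two decimals."""
    if value == int(value):
        return f"{int(value)}{UNITS[unit]}"
    return f"{value:.2f}{UNITS[unit]}"


def format_checked_once(value):
    """Check 'mod 512 == 0' only on the input, then promote freely."""
    unit = 0
    if value % 512 == 0:
        while value >= 1024 and unit < len(UNITS) - 1:
            value /= 1024
            unit += 1
    return _render(value, unit)


def format_checked_each_step(value):
    """Re-check 'mod 512 == 0' before every promotion to the next unit."""
    unit = 0
    while value >= 1024 and value % 512 == 0 and unit < len(UNITS) - 1:
        value /= 1024
        unit += 1
    return _render(value, unit)
```

For the input `1024 * 1024 + 512`, the check-once variant promotes twice and prints `1.00M`, hiding the extra 512; the per-step variant stops after one promotion and prints `1024.50K`.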

Revert commit · da66b63f

Morris Jette authored Nov 10, 2016

It was causing the loss of node available_features on startup with
  node_features/knl_cray
bug 3241

da66b63f

09 Nov, 2016 2 commits
- Integrate node_feature/knl_generic plugin with HBM GRES · 3712bf37
  Morris Jette authored Nov 09, 2016
```
Set per-node HBM availability as a GRES based upon the KNL node's
  MCDRAM state
bug 3171
```
  3712bf37
- acct_gather_energy/rapl - prevent segfault in slurmd on startup. · c00b3da7
  Alejandro Sanchez authored Nov 09, 2016
```
Caused by race for local_energy which is dynamically allocated. Bail out
of the update if that hasn't been allocated yet.

Bug 3237.
```
  c00b3da7
08 Nov, 2016 4 commits

Permit cancellation of jobs in configuring state. · 6957bd9f
Morris Jette authored Nov 08, 2016
```
bug 3213
```
6957bd9f

select/linear plugin modified to better support heterogeneous clusters · 243fbb0d

Morris Jette authored Nov 08, 2016

select/linear plugin modified to better support heterogeneous clusters when
    topology/none is also configured. Note that use of the select/cons_res
    plugin is strongly recommended for heterogeneous clusters.
    OverSubscribe=exclusive can be used if whole node allocations are
    desired.
bug 3212

243fbb0d

job_submit/lua - Add features field for existing job records. · 9e7e12dc
Alejandro Sanchez authored Nov 08, 2016
```
Bug 3224.
```
9e7e12dc

sched/backfill - avoid starting requeued job · 69af50af

Morris Jette authored Nov 07, 2016

If a job is started by the main scheduling logic and requeued while
  the backfill scheduler has locks released, that can result in an
  invalid data structure in select/cons_res. Namely, the backfill
  scheduler's attempt to start the job would clear the job resources
  node_bitmap. That leaves a NULL pointer in the select/cons_res
  plugin generating an abort. (That pointer is needed to clean up
  the job allocation records when the Epilog or Cray Node Health
  Check, NHC, are complete and the resources become available for
  another job.)
bug 3230

69af50af

07 Nov, 2016 1 commit

knl_cray plugin - Avoid abort from backup slurmctld at start time. · 6798b468

Morris Jette authored Nov 07, 2016

Backup slurmctld will now
1. Not abort due to NULL pointer (needed to move code around on restart)
2. Recover KNL MCDRAM and NUMA modes from state save files if capmc and
   cnselect not available
bug 3241

6798b468

05 Nov, 2016 1 commit
- cray/burst_buffer - Update dw_wlm_cli parsing · 84e39fd8
  Morris Jette authored Nov 04, 2016
```
cray/burst_buffer - Update "instance" parsing to match updated dw_wlm_cli
    output.
bug 3222
```
  84e39fd8
04 Nov, 2016 2 commits

cray/burst_buffer - Preserve job ID · 42a90020

Morris Jette authored Nov 03, 2016

cray/burst_buffer - Preserve job ID and don't translate to job array ID
  after slurmctld restart. Prior logic would not set array_task_id to
  NO_VAL, so all job-buffer IDs would be reported in the form
  "JobID=0_0(123)" rather than "JobID=123"

42a90020
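
The two output forms mentioned above can be reproduced with a small formatting sketch. This Python function is illustrative only: the function and parameter names are hypothetical, and the value used for `NO_VAL` is an assumption about Slurm's "not set" sentinel for 32-bit fields.

```python
NO_VAL = 0xFFFFFFFE  # assumed sentinel meaning "not a job array task"


def format_job_id(job_id, array_job_id, array_task_id):
    """Format a job ID, using the array form only for real array tasks."""
    if array_task_id == NO_VAL:
        return f"JobID={job_id}"
    return f"JobID={array_job_id}_{array_task_id}({job_id})"
```

If `array_task_id` is left at 0 instead of `NO_VAL` after a restart, job 123 is printed as `JobID=0_0(123)`; with the sentinel set, it is printed as `JobID=123`.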

Burst_buffer/cray space tracking fix · 1548086f

Morris Jette authored Nov 03, 2016

cray/burst_buffer - Internally track both allocated and unusable space.
    The reported UsedSpace in a pool is now the allocated space (previously was
    unusable space). Base available space on whichever value leaves least free
    space.
bug 3222

1548086f
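
The rule "base available space on whichever value leaves least free space" amounts to subtracting the larger of the two tracked values from the pool total. A minimal Python sketch, with hypothetical names and with clamping at zero added as an assumption:

```python
def pool_free_space(total_space, allocated_space, unusable_space):
    """Return the free space in a pool, using the more pessimistic value.

    The larger of allocated_space and unusable_space leaves the least
    free space, so that is the one subtracted from the total. Clamping
    at zero is an assumption, not something the commit states.
    """
    return max(0, total_space - max(allocated_space, unusable_space))
```

For example, a pool of 100 units with 30 allocated and 40 unusable would have 60 units available under this rule.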

01 Nov, 2016 3 commits

Fix regression in 16.05.6 where if you request multiple cpus per task (-c2) · d9fd52b8

Danny Auble authored Nov 01, 2016

and request --ntasks-per-core=1 and only 1 task on the node
the slurmd would abort on an infinite loop fatal.

Regression is from commit 5265420d.

Without this fix you can get into an infinite loop in the task/affinity
plugin.  The loop is handled by producing a fatal.

Bug 3118

d9fd52b8

cray/burst_buffer - Fix for double count of used_space · cd7f4d86
Morris Jette authored Nov 01, 2016
```
cray/burst_buffer - Fix for double counting of used_space at slurmctld
    startup.
bug 3222
```
cd7f4d86

cray/burst_buffer correct used pool space calculation · d91417c8

Morris Jette authored Nov 01, 2016

cray/burst_buffer - If total_space in a pool decreases, reset used_space
    rather than trying to account for buffer allocations in progress.
bug 3222

d91417c8

28 Oct, 2016 1 commit

Fix issue in the priority/multifactor plugin where on a slurmctld restart · be924b88

Danny Auble authored Oct 28, 2016

more time than should be allowed would be accounted for.

This only happened on jobs in the completing state when the slurmctld
was shut down.

This will also be enhanced in 17.02 as the job's end_time_exp is not
stored which is needed to determine if the job has already been through
the decay_thread at end of job.

Bug 3162

be924b88

27 Oct, 2016 1 commit
- Start NEWS for v16.05.7 · 67d4ee43
  Morris Jette authored Oct 27, 2016
  
  67d4ee43