Commits · 57e47d010d7979cb0ee3fd007a8e9fbf81180502 · Manuel G. Marciani / ces_slurm_simulator

22 Nov, 2016 5 commits

Morris Jette authored Nov 22, 2016

sched/backfill plugin: Make malloc match data type (defined as uint32_t and
allocated as int). No failures were observed, but if type "int" were smaller
than "uint32_t", it could result in an invalid memory reference.

a12e1a1c

Fix slurm_job_cpus_allocated_str_on_node_id() API call. · 0ed6488e

Sergey Meirovich authored Nov 22, 2016

Fix API call: slurm_job_cpus_allocated_str_on_node_id() and
in turn slurm_job_cpus_allocated_str_on_node() to return correct
results for any node other than the first. This was caused by missing logic
to calculate the first bit belonging to a particular node. The lookup always
started from bit 0.

Bug 3266.

0ed6488e
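
The offset problem described above can be illustrated with a minimal sketch. It assumes a simplified layout in which the core bits of all allocated nodes are concatenated into one flat bitmap; the function names and the layout are hypothetical and are not taken from the Slurm source.

```python
def first_bit_for_node(cores_per_node, node_id):
    """Return the index of the first bit that belongs to node_id.

    Assumed layout (hypothetical): the bits for node 0 come first,
    then node 1, and so on, so the offset is the sum of the core
    counts of all preceding nodes.
    """
    return sum(cores_per_node[:node_id])


def allocated_cores_on_node(core_bitmap, cores_per_node, node_id):
    """List the node-local core indexes that are set for node_id."""
    start = first_bit_for_node(cores_per_node, node_id)
    return [
        local
        for local in range(cores_per_node[node_id])
        if core_bitmap[start + local]
    ]


# Two nodes with four cores each; a lookup that always starts at
# bit 0 would report node 0's cores for node 1 as well.
bitmap = [1, 1, 0, 0, 0, 1, 1, 0]
cores = [4, 4]
node0 = allocated_cores_on_node(bitmap, cores, 0)
node1 = allocated_cores_on_node(bitmap, cores, 1)
```

With this layout, node 1 is read starting at bit 4 instead of bit 0, which is the kind of per-node offset the fix introduces.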

backfill algorithm logic · e089b63a

Morris Jette authored Nov 22, 2016

After one second of wall time, simulate the termination of all remaining
   running jobs in order to respond in a reasonable time frame.
bug 3275

e089b63a

Modify backfill algorithm · 6008b021

Morris Jette authored Nov 22, 2016

Modify backfill algorithm to improve performance with large numbers of
    running jobs. Group running jobs that end in a "similar" time frame using a
    time window that grows exponentially rather than linearly. The original
    window sizes were (in units of minutes):
    0, 1, 2, 3, 4, 5, 6, 7, ... minutes
    The new window sizes are:
    0.5, 1, 2, 4, 8, 16, 32, ... minutes
    This can dramatically reduce the number of instances where the very time
    consuming "can the pending job run now" operation is executed, especially
    if there are 1000+ running jobs.
bug 3275

6008b021
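
The two window-size sequences quoted above can be reproduced with a short sketch. It only generates the sequences and counts how many windows it takes for a window size to reach a given number of minutes; how the backfill scheduler actually groups jobs into these windows is not shown in the commit message, and the function names here are hypothetical.

```python
from itertools import count


def linear_window(i):
    """Original scheme: window sizes 0, 1, 2, 3, ... minutes."""
    return float(i)


def exponential_window(i):
    """New scheme: window sizes 0.5, 1, 2, 4, 8, ... minutes."""
    return 0.5 * (2 ** i)


def windows_to_reach(window_fn, target_minutes):
    """Count the windows generated until one is at least target_minutes."""
    for i in count():
        if window_fn(i) >= target_minutes:
            return i + 1


old_sizes = [linear_window(i) for i in range(8)]
new_sizes = [exponential_window(i) for i in range(7)]
old_count = windows_to_reach(linear_window, 1024)
new_count = windows_to_reach(exponential_window, 1024)
```

Under these assumptions, far fewer windows are needed in the exponential scheme to span the same range, which is consistent with the reduction in "can the pending job run now" evaluations described in the commit message.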

testsuite - fix job id output in test17.39 · 44241006
Nicolas Joly authored Nov 22, 2016

44241006

21 Nov, 2016 2 commits
- Fix a few typos in NEWS and RELEASE_NOTES · 6391002e
  Morris Jette authored Nov 21, 2016
  
  6391002e
- Major update to RELEASE_NOTES file · acba3880
  Morris Jette authored Nov 21, 2016
```
Modify several NEWS items for greater clarity.
```
  acba3880
18 Nov, 2016 1 commit

Add new PriorityFlags option of INCR_ONLY · 2d282997

Morris Jette authored Nov 18, 2016

Add new PriorityFlags option of INCR_ONLY, which prevents a job's priority
    from being decremented.
Added for NERSC

2d282997
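
The effect of INCR_ONLY can be sketched as a clamp on the recomputed priority. This is an illustration of the behavior described in the commit message only; the function name and arguments are hypothetical, and the actual priority plugin code is not shown here.

```python
def apply_priority(old_priority, computed_priority, incr_only):
    """Return the priority to store for a job.

    With incr_only set (modelling PriorityFlags=INCR_ONLY), a newly
    computed value lower than the current one is ignored, so the
    priority never decreases.
    """
    if incr_only:
        return max(old_priority, computed_priority)
    return computed_priority


kept = apply_priority(1000, 900, incr_only=True)
lowered = apply_priority(1000, 900, incr_only=False)
raised = apply_priority(1000, 1100, incr_only=True)
```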

14 Nov, 2016 1 commit

avoid additional job allocations on booting nodes · b927fb08

Morris Jette authored Nov 14, 2016

If a node is booting for some job, don't allocate additional jobs to the
    node until the boot completes.
bug 3256

b927fb08

13 Nov, 2016 1 commit
- cgroup plugins - fix two minor memory leaks · 85ab952a
  Alejandro Sanchez authored Nov 13, 2016
```
Found with valgrind. Bug 2846.
```
  85ab952a
11 Nov, 2016 5 commits
- knl_cray plugin - Avoid abort from backup slurmctld at start time · 9702ec22
  Morris Jette authored Nov 11, 2016
```
Move where we set the configuration table bitmaps in order to support
  the backup slurmctld starting and recovering previously saved
  KNL mode information (which can necessitate rebuilding the node
  configuration table).
bug 3241
```
  9702ec22
- Docs - elaborate on how to clear TRES limits in sacctmgr. · cdc737e6
  Tim Wickberg authored Nov 11, 2016
```
Bug 3255.
```
  cdc737e6
- switch/cray plugin - change legacy spool directory path. · 05cbbf93
  David Gloe authored Nov 10, 2016
```
Since CLE 5.1 I believe, /var/spool/alps is a symlink to
/var/opt/cray/alps/spool, so it should be safe to use
/var/opt/cray/alps/spool.

Bug 3253.
```
  05cbbf93
- switch/cray plugin - fix use after free in debug message. · e391622e
  David Gloe authored Nov 10, 2016
```
Bug 3253.
```
  e391622e
- Allow unit conversion routine to convert 1024M to 1G. · c209ff2f
  Tim Wickberg authored Nov 10, 2016
  
  c209ff2f
10 Nov, 2016 3 commits

Fix output routines to prevent unintended rounding. · 377a6259

Tim Wickberg authored Nov 10, 2016

If the input value mod 512 == 0, the value would be subject to
unintended rounding. Rework the function to check against this
on each unit promotion.

Bug 3252.

377a6259
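
One way to avoid unintended rounding during unit promotion is to promote only while the value divides evenly by 1024. The sketch below illustrates that idea; it is not the reworked Slurm routine, and the unit list and function name are assumptions made for the example.

```python
UNITS = ["M", "G", "T", "P"]  # assumed unit suffixes for this sketch


def promote_units(value, unit="M"):
    """Promote value to the largest unit that represents it exactly.

    The divisibility check is repeated on every promotion, so a value
    such as 1536M is left alone instead of being rounded to a number
    of G.
    """
    idx = UNITS.index(unit)
    while idx < len(UNITS) - 1 and value != 0 and value % 1024 == 0:
        value //= 1024
        idx += 1
    return f"{value}{UNITS[idx]}"


exact = promote_units(1024)       # evenly divisible, promoted
half = promote_units(1536)        # 1.5G, kept in M
double = promote_units(1024 * 1024)
```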

Revert commit · da66b63f

Morris Jette authored Nov 10, 2016

It was causing the loss of node available_features on startup with
  node_features/knl_cray
bug 3241

da66b63f

Only try to load zonesort module if not already loaded · 3de17a26

Morris Jette authored Nov 10, 2016

Check for zonesort file first, to save time over attempting to load
  a module that is already loaded. It may be loaded by default per
  administrator configuration.

3de17a26

09 Nov, 2016 2 commits
- Integrate node_feature/knl_generic plugin with HBM GRES · 3712bf37
  Morris Jette authored Nov 09, 2016
```
Set per-node HBM availability as a GRES based upon the KNL node's
  MCDRAM state
bug 3171
```
  3712bf37
- acct_gather_energy/rapl - prevent segfault in slurmd on startup. · c00b3da7
  Alejandro Sanchez authored Nov 09, 2016
```
Caused by race for local_energy which is dynamically allocated. Bail out
of the update if that hasn't been allocated yet.

Bug 3237.
```
  c00b3da7
08 Nov, 2016 5 commits

Upgrade "scontrol reboot" logic · 861bab6c

Morris Jette authored Nov 08, 2016

Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by
    "scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag,
    which could lead to inconsistencies. Add "ASAP" option to "scontrol reboot"
    command that will drain a node in order to reboot it as soon as possible,
    then return it to service.
bug 3210

861bab6c

Permit cancellation of jobs in configuring state. · 6957bd9f
Morris Jette authored Nov 08, 2016
```
bug 3213
```
6957bd9f

select/linear plugin modified to better support heterogeneous clusters · 243fbb0d

Morris Jette authored Nov 08, 2016

select/linear plugin modified to better support heterogeneous clusters when
    topology/none is also configured. Note that use of the select/cons_res
    plugin is strongly recommended for heterogeneous clusters.
    OverSubscribe=exclusive can be used if whole node allocations are
    desired.
bug 3212

243fbb0d

job_submit/lua - Add features field for existing job records. · 9e7e12dc
Alejandro Sanchez authored Nov 08, 2016
```
Bug 3224.
```
9e7e12dc

sched/backfill - avoid starting requeued job · 69af50af

Morris Jette authored Nov 07, 2016

If a job is started by the main scheduling logic and requeued while
  the backfill scheduler has locks released, that can result in an
  invalid data structure in select/cons_res. Namely, the backfill
  scheduler's attempt to start the job would clear the job resources
  node_bitmap. That leaves a NULL pointer in the select/cons_res
  plugin generating an abort. (That pointer is needed to clean up
  the job allocation records when the Epilog or Cray Node Health
  Check, NHC, are complete and the resources become available for
  another job.)
bug 3230

69af50af

07 Nov, 2016 1 commit

knl_cray plugin - Avoid abort from backup slurmctld at start time. · 6798b468

Morris Jette authored Nov 07, 2016

Backup slurmctld will now
1. Not abort due to NULL pointer (needed to move code around on restart)
2. Recover KNL MCDRAM and NUMA modes from state save files if capmc and
   cnselect not available
bug 3241

6798b468

05 Nov, 2016 1 commit
- cray/burst_buffer - Update dw_wlm_cli parsing · 84e39fd8
  Morris Jette authored Nov 04, 2016
```
cray/burst_buffer - Update "instance" parsing to match updated dw_wlm_cli
    output.
bug 3222
```
  84e39fd8
04 Nov, 2016 3 commits

Add FreeSpace to burst buffer output · 49ee211e

Morris Jette authored Nov 04, 2016

Add "FreeSpace" information for each pool to the "scontrol show burstbuffer"
    output. Required changes to the burst_buffer_info_t data structure.
bug 3222

49ee211e

cray/burst_buffer - Preserve job ID · 42a90020

Morris Jette authored Nov 03, 2016

cray/burst_buffer - Preserve job ID and don't translate to job array ID
  after slurmctld restart. Prior logic would not set array_task_id to
  NO_VAL, so all job-buffer IDs would be reported in the form
  "JobID=0_0(123)" rather than "JobID=123"

42a90020
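
The two output forms quoted above can be reproduced with a small sketch. The sentinel value used for NO_VAL and the function name are assumptions made for illustration; only the two output strings come from the commit message.

```python
NO_VAL = 0xFFFFFFFE  # assumed "not set" sentinel for this sketch


def format_job_id(job_id, array_job_id, array_task_id):
    """Format a job identifier for burst buffer output.

    A record whose array_task_id is NO_VAL is treated as a regular
    job; anything else is rendered in the job array form.
    """
    if array_task_id == NO_VAL:
        return f"JobID={job_id}"
    return f"JobID={array_job_id}_{array_task_id}({job_id})"


# Before the fix: array_task_id left at 0 after a restart.
before = format_job_id(123, 0, 0)
# After the fix: array_task_id set to NO_VAL.
after = format_job_id(123, 0, NO_VAL)
```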

Burst_buffer/cray space tracking fix · 1548086f

Morris Jette authored Nov 03, 2016

cray/burst_buffer - Internally track both allocated and unusable space.
    The reported UsedSpace in a pool is now the allocated space (previously was
    unusable space). Base available space on whichever value leaves least free
    space.
bug 3222

1548086f
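
The rule "base available space on whichever value leaves least free space" can be sketched as follows. The function and parameter names are hypothetical, and clamping at zero is an assumption of this sketch rather than something stated in the commit message.

```python
def pool_free_space(total_space, allocated_space, unusable_space):
    """Return the free space of a burst buffer pool.

    Both allocated and unusable space are tracked; the larger of the
    two is subtracted from the total because it leaves the least free
    space. The result is clamped at zero (an assumption).
    """
    consumed = max(allocated_space, unusable_space)
    return max(total_space - consumed, 0)


free_a = pool_free_space(1000, 300, 200)  # allocated dominates
free_b = pool_free_space(1000, 300, 450)  # unusable dominates
```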

02 Nov, 2016 1 commit

Add LaunchParameters=mem_sort · cec8638b

Morris Jette authored Nov 02, 2016

Add LaunchParameters=mem_sort option to configure running of zonesort
    by default at step startup.
Also add documentation about zonesort on KNL web page
bug 3188

cec8638b

01 Nov, 2016 3 commits

Fix regression in 16.05.6 where if you request multiple cpus per task (-c2) · d9fd52b8

Danny Auble authored Nov 01, 2016

and request --ntasks-per-core=1 and only 1 task on the node
the slurmd would abort on an infinite loop fatal.

Regression is from commit 5265420d.

Without this fix you can get into an infinite loop in the task/affinity
plugin.  The loop is handled by producing a fatal.

Bug 3118

d9fd52b8

cray/burst_buffer - Fix for double count of used_space · cd7f4d86
Morris Jette authored Nov 01, 2016
```
cray/burst_buffer - Fix for double counting of used_space at slurmctld
    startup.
bug 3222
```
cd7f4d86

cray/burst_buffer correct used pool space calculation · d91417c8

Morris Jette authored Nov 01, 2016

cray/burst_buffer - If total_space in a pool decreases, reset used_space
    rather than trying to account for buffer allocations in progress.
bug 3222

d91417c8

31 Oct, 2016 1 commit
- Add --mem_bind option of "sort" to run zonesort on KNL nodes at step start. · 69684648
  Morris Jette authored Oct 31, 2016
```
bug 3188
```
  69684648
28 Oct, 2016 1 commit

Fix issue in the priority/multifactor plugin where on a slurmctld restart · be924b88

Danny Auble authored Oct 28, 2016

more time than should be allowed would be accounted for.

This only happened on jobs in the completing state when the slurmctld
was shut down.

This will also be enhanced in 17.02 as the job's end_time_exp is not
stored which is needed to determine if the job has already been through
the decay_thread at end of job.

Bug 3162

be924b88

27 Oct, 2016 4 commits
- Start NEWS for v16.05.7 · 67d4ee43
  Morris Jette authored Oct 27, 2016
  
  67d4ee43
- Add support for per-partition OverTimeLimit configuration · f52b7a60
  Morris Jette authored Oct 27, 2016
```
bug 3139
```
  f52b7a60
- Make sure a job cleans up completely if it has a node fail. Mostly an · 9c0a2f2b
  Danny Auble authored Oct 27, 2016
```
issue with gang scheduling.

Bug 3211
```
  9c0a2f2b
- Docs - remove recommendation for ReleaseAgent setting in slurm.conf. · 1d99a44b
  Tim Wickberg authored Oct 27, 2016
  
  1d99a44b