Commits · ddb8e608486d32429711f970672764b568b3310f · Manuel G. Marciani / ces_slurm_simulator

02 Dec, 2016 1 commit

Add job constraint input environment variables · ddb8e608

Morris Jette authored Dec 02, 2016

Add support for SALLOC_CONSTRAINT, SBATCH_CONSTRAINT and SLURM_CONSTRAINT
    environment variables to set default constraints for salloc, sbatch and
    srun commands respectively.
Bug 3317

ddb8e608
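
The "default" behaviour described in this commit can be sketched as follows. This is a minimal Python illustration, not Slurm's C implementation: `resolve_constraint` and its `cli_value` parameter are hypothetical names, and the sketch assumes an explicit command-line constraint takes precedence over the environment variable.

```python
import os

# Environment variable consulted by each command, per the commit message.
CONSTRAINT_ENV = {
    "salloc": "SALLOC_CONSTRAINT",
    "sbatch": "SBATCH_CONSTRAINT",
    "srun": "SLURM_CONSTRAINT",
}


def resolve_constraint(command, cli_value=None, environ=None):
    """Return the constraint a command would use (hypothetical helper).

    Assumption: an explicit command-line value wins; otherwise the
    command's environment variable supplies the default.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return environ.get(CONSTRAINT_ENV[command])


env = {"SBATCH_CONSTRAINT": "knl"}
print(resolve_constraint("sbatch", environ=env))             # knl
print(resolve_constraint("sbatch", "haswell", environ=env))  # haswell
print(resolve_constraint("srun", environ=env))               # None
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.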

01 Dec, 2016 1 commit

knl_cray: Fix KNL mode/feature race condition · 46128f2b

Morris Jette authored Nov 30, 2016

node_features/knl_cray - Fix possible race condition when changing node
    state that could result in an old KNL mode remaining as an active feature.
bug 3235

46128f2b

30 Nov, 2016 2 commits

cray/burst_buffer - Increase timer · b4763c75

Morris Jette authored Nov 30, 2016

cray/burst_buffer - Increase time to synchronize operations between threads
    from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
    This should fix a race condition between a thread performing a buffer
    creation (setup) and a thread looking for unexpected buffers. If a
    buffer is found during the time window allowed for creation, its
    space will be counted twice: first by the status checking thread
    and second by the thread doing the creation. The deallocation only
    happens once, so the used space information can be left with an
    invalid value.
bug 3295

b4763c75

sbcast - prevent segfault in slurmd from multiple zlib compressed transfers · 8c5765c9

Tim Wickberg authored Nov 30, 2016

A static variable means multiple active decompression streams will corrupt
zlib's internal state, which can lead to a segfault.

Bug 3299.

8c5765c9
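
The underlying issue can be illustrated with Python's standard `zlib` module. This is an analogy rather than the slurmd code: it assumes two transfers whose compressed blocks arrive interleaved, and shows that keeping one decompression state per transfer decodes both correctly. A single shared state, like the static variable described above, would mix the two streams.

```python
import zlib


def chunks(data, size):
    """Split bytes into fixed-size pieces to mimic network blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


payload_a = b"A" * 5000
payload_b = b"B" * 5000
blocks_a = chunks(zlib.compress(payload_a), 8)
blocks_b = chunks(zlib.compress(payload_b), 8)

# One decompression state per active transfer (the safe arrangement).
state = {"a": zlib.decompressobj(), "b": zlib.decompressobj()}
out = {"a": b"", "b": b""}

# Interleave blocks from the two transfers, as concurrent transfers might.
for i in range(max(len(blocks_a), len(blocks_b))):
    if i < len(blocks_a):
        out["a"] += state["a"].decompress(blocks_a[i])
    if i < len(blocks_b):
        out["b"] += state["b"].decompress(blocks_b[i])

out["a"] += state["a"].flush()
out["b"] += state["b"].flush()
print(out["a"] == payload_a, out["b"] == payload_b)  # True True
```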

29 Nov, 2016 3 commits

Fix SuspendExcNodes and SuspendExcParts on slurmctld SIGHUP. · bb06dd65

Alejandro Sanchez authored Nov 29, 2016

On a reconfig, the exc_node_bitmap is cleared but then it was
not built again since last_work_scan was declared as a local static
variable in _do_power_work(). The fix is to make it global within the
plugin and reinitialize it to 0 on _init_power_config().

Bug 3078.

bb06dd65

Add "GresEnforceBind=Yes" to "scontrol show job" output · d6652d51
Morris Jette authored Nov 28, 2016

d6652d51

Show job GRES index info with "scontrol -d show job" · 45153689

Morris Jette authored Nov 28, 2016

For example:
     Nodes=nid00001 CPU_IDs=2-3 Mem=1000 GRES_IDX=gpu:alpha(IDX:2)
     Nodes=nid00002 CPU_IDs=0-1 Mem=1000 GRES_IDX=gpu:alpha(IDX:0)

45153689
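
For scripts that consume this output, the two example lines above could be parsed roughly as follows. This is a sketch based only on the sample shown in the commit message: the regular expression and the `parse_gres_idx` helper are hypothetical, and it assumes one GRES entry per line.

```python
import re

# Sample lines copied from the commit message above.
SAMPLE = """\
Nodes=nid00001 CPU_IDs=2-3 Mem=1000 GRES_IDX=gpu:alpha(IDX:2)
Nodes=nid00002 CPU_IDs=0-1 Mem=1000 GRES_IDX=gpu:alpha(IDX:0)
"""

# Matches e.g. "GRES_IDX=gpu:alpha(IDX:2)"; assumes one GRES entry per line.
GRES_RE = re.compile(r"GRES_IDX=(?P<name>[^()\s]+)\(IDX:(?P<idx>[^)]*)\)")


def parse_gres_idx(text):
    """Map node name -> (GRES name, raw index string) for each matching line."""
    result = {}
    for line in text.splitlines():
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        match = GRES_RE.search(line)
        if match and "Nodes" in fields:
            result[fields["Nodes"]] = (match["name"], match["idx"])
    return result


print(parse_gres_idx(SAMPLE))
# {'nid00001': ('gpu:alpha', '2'), 'nid00002': ('gpu:alpha', '0')}
```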

28 Nov, 2016 5 commits
- Make the openssl crypto plugin compile with openssl >= 1.1. · fd747355
  Alejandro Sanchez authored Nov 28, 2016
  
  fd747355
- Add new mcs/account plugin. · 00a39f9b
  Aline Roy authored Nov 28, 2016
```
Bug 3291.
```
  00a39f9b
- Display specific GRES indices allocated · a024c355
  Morris Jette authored Nov 28, 2016
```
If GRES are configured with file IDs, then "scontrol -d show node" will
    not only identify the count of currently allocated GRES, but their specific
    index numbers (e.g. "GresUsed=gpu:alpha:2(IDX:0,2),gpu:beta:0(IDX:N/A)").
```
  a024c355
- sacctmgr - prevent segfault when trying to reset usage for an invalid account · 9e028071
  Dominik Bartkiewicz authored Nov 28, 2016
```
Bug 3267.
```
  9e028071
- srun - prevent segfault in launch plugin when terminating not-yet-created step. · d4aa1998
  Dominik Bartkiewicz authored Nov 28, 2016
```
Termination can race against step creation if, e.g., ill-behaved SPANK plugins
are in use.

Bug 3248.
```
  d4aa1998
22 Nov, 2016 7 commits

Added SchedulerParameters option of "bf_job_part_count_reserve" · 209822a8

Morris Jette authored Nov 22, 2016

Added SchedulerParameters option of "bf_job_part_count_reserve". Jobs below
    the specified threshold will not have resources reserved for them.
bug 3275

209822a8

Make it so we don't purge job start messages until after we purge step · 178a929b
Danny Auble authored Nov 22, 2016
```
messages.  Hopefully this will reduce the number of messages lost when
filling up memory when the database/DBD is down.
```
178a929b

Correct malloc data type · a12e1a1c

Morris Jette authored Nov 22, 2016

sched/backfill plugin: Make malloc match data type (defined as uint32_t and
allocated as int). No failures were observed, but if type "int" is smaller
than "uint32_t", it could result in an invalid memory reference.

a12e1a1c

Fix slurm_job_cpus_allocated_str_on_node_id() API call. · 0ed6488e

Sergey Meirovich authored Nov 22, 2016

Fix API call: slurm_job_cpus_allocated_str_on_node_id() and
in turn slurm_job_cpus_allocated_str_on_node() to return correct
results for anything but the first node. This was caused by missing logic
to calculate the first bit that belongs to a particular node. The lookup
was always starting from bit 0.

Bug 3266.

0ed6488e
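
The class of bug described here can be sketched in Python. This is not the Slurm API code: it assumes a job-wide CPU bitmap that concatenates each node's CPUs in node order, and the helper names are hypothetical.

```python
def first_bit_for_node(cpus_per_node, node_id):
    """Offset of the first CPU bit belonging to node_id in a job-wide bitmap.

    Assumption: the bitmap concatenates each node's CPUs in node order.
    """
    return sum(cpus_per_node[:node_id])


def cpus_allocated_on_node(bitmap, cpus_per_node, node_id):
    """Return the node-local CPU indices that are set for node_id."""
    start = first_bit_for_node(cpus_per_node, node_id)
    width = cpus_per_node[node_id]
    return [i for i in range(width) if bitmap[start + i]]


# Two nodes with 4 CPUs each; node 0 uses CPUs 2-3, node 1 uses CPUs 0-1.
bitmap = [0, 0, 1, 1, 1, 1, 0, 0]
print(cpus_allocated_on_node(bitmap, [4, 4], 0))  # [2, 3]
print(cpus_allocated_on_node(bitmap, [4, 4], 1))  # [0, 1]
```

If the lookup for node 1 started from bit 0 instead of bit 4, it would report `[2, 3]`, which is node 0's allocation.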

backfill algorithm logic · e089b63a

Morris Jette authored Nov 22, 2016

After one second of wall time, simulate the termination of all remaining
   running jobs in order to respond in a reasonable time frame.
bug 3275

e089b63a

Modify backfill algorithm · 6008b021

Morris Jette authored Nov 22, 2016

Modify backfill algorithm to improve performance with large numbers of
    running jobs. Group running jobs that end in a "similar" time frame using a
    time window that grows exponentially rather than linearly. The original
    window sizes were (in units of minutes):
    0, 1, 2, 3, 4, 5, 6, 7, ... minutes
    The new window sizes are:
    0.5, 1, 2, 4, 8, 16, 32, ... minutes
    This can dramatically reduce the number of instances where the very
    time-consuming "can the pending job run now" operation is executed,
    especially if there are 1000+ running jobs.
bug 3275

6008b021
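
To give a feel for the difference between the two window sequences, the following Python sketch counts how many windows are needed to span one day. It is illustrative only: it assumes the windows are laid end to end starting from the current time, which the commit message does not state explicitly, and the function names are hypothetical.

```python
from itertools import count


def linear_sizes():
    """Original window sizes in minutes: 0, 1, 2, 3, ..."""
    return (float(n) for n in count(0))


def exponential_sizes():
    """New window sizes in minutes: 0.5, 1, 2, 4, ..."""
    return (0.5 * 2 ** n for n in count(0))


def windows_to_cover(sizes, horizon_minutes):
    """Count consecutive windows needed for their total span to reach the horizon.

    Assumption: windows are laid end to end starting from now.
    """
    covered = 0.0
    used = 0
    for size in sizes:
        if covered >= horizon_minutes:
            break
        covered += size
        used += 1
    return used


day = 24 * 60
print(windows_to_cover(linear_sizes(), day))       # 55
print(windows_to_cover(exponential_sizes(), day))  # 12
```

Under that assumption, a one-day horizon needs 55 linearly growing windows but only 12 exponentially growing ones, so far fewer "can the pending job run now" evaluations would be triggered.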

testsuite - fix job id output in test17.39 · 44241006
Nicolas Joly authored Nov 22, 2016

44241006

21 Nov, 2016 2 commits
- Fix a few typos in NEWS and RELEASE_NOTES · 6391002e
  Morris Jette authored Nov 21, 2016
  
  6391002e
- Major update to RELEASE_NOTES file · acba3880
  Morris Jette authored Nov 21, 2016
```
Modify several NEWS items for greater clarity.
```
  acba3880
18 Nov, 2016 1 commit

Add new PriorityFlags option of INCR_ONLY · 2d282997

Morris Jette authored Nov 18, 2016

Add new PriorityFlags option of INCR_ONLY, which prevents a job's priority
    from being decremented.
Added for NERSC

2d282997
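
The effect of the flag can be summarised in a few lines of Python. This is a sketch of the stated behaviour only; `apply_priority` and its parameters are hypothetical and do not reflect the priority plugin's actual code.

```python
def apply_priority(current, recalculated, incr_only=False):
    """Return the job priority after a recalculation (hypothetical helper).

    With incr_only set, a lower recalculated value is ignored so the
    priority never decreases.
    """
    if incr_only:
        return max(current, recalculated)
    return recalculated


print(apply_priority(1000, 900))                  # 900
print(apply_priority(1000, 900, incr_only=True))  # 1000
print(apply_priority(1000, 1200, incr_only=True)) # 1200
```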

14 Nov, 2016 1 commit

avoid additional job allocations on booting nodes · b927fb08

Morris Jette authored Nov 14, 2016

If a node is booting for some job, don't allocate additional jobs to the
    node until the boot completes.
bug 3256

b927fb08

13 Nov, 2016 1 commit
- cgroup plugins - fix two minor memory leaks · 85ab952a
  Alejandro Sanchez authored Nov 13, 2016
```
Found with valgrind. Bug 2846.
```
  85ab952a
11 Nov, 2016 5 commits
- knl_cray plugin - Avoid abort from backup slurmctld at start time · 9702ec22
  Morris Jette authored Nov 11, 2016
```
Move where we set the configuration table bitmaps in order to support
  the backup slurmctld starting and recovering previously saved
  KNL mode information (which can necessitate rebuilding the node
  configuration table).
bug 3241
```
  9702ec22
- Docs - elaborate on how to clear TRES limits in sacctmgr. · cdc737e6
  Tim Wickberg authored Nov 11, 2016
```
Bug 3255.
```
  cdc737e6
- switch/cray plugin - change legacy spool directory path. · 05cbbf93
  David Gloe authored Nov 10, 2016
```
Since CLE 5.1 I believe, /var/spool/alps is a symlink to
/var/opt/cray/alps/spool, so it should be safe to use
/var/opt/cray/alps/spool.

Bug 3253.
```
  05cbbf93
- switch/cray plugin - fix use after free in debug message. · e391622e
  David Gloe authored Nov 10, 2016
```
Bug 3253.
```
  e391622e
- Allow unit conversion routine to convert 1024M to 1G. · c209ff2f
  Tim Wickberg authored Nov 10, 2016
  
  c209ff2f
10 Nov, 2016 3 commits

Fix output routines to prevent unintended rounding. · 377a6259

Tim Wickberg authored Nov 10, 2016

If the input value mod 512 == 0, the value would be subject to
unintended rounding. Rework the function to check against this
on each unit promotion.

Bug 3252.

377a6259
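
One way to avoid rounding during unit promotion is to promote only when the division is exact, checking again at every step. The Python sketch below illustrates that idea. It is an assumption about the general approach, not a transcription of Slurm's conversion routine, and `convert_units` is a hypothetical name.

```python
UNITS = ["K", "M", "G", "T", "P"]


def convert_units(value, unit="M"):
    """Promote value to a larger unit only while no precision is lost.

    The divisibility check is repeated on each promotion, so a value
    such as 1536M is left alone rather than rounded to a whole number of G.
    """
    idx = UNITS.index(unit)
    while value and value % 1024 == 0 and idx < len(UNITS) - 1:
        value //= 1024
        idx += 1
    return f"{value}{UNITS[idx]}"


print(convert_units(1024))         # 1G
print(convert_units(1536))         # 1536M
print(convert_units(512))          # 512M
print(convert_units(1024 * 1024))  # 1T
```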

Revert commit · da66b63f

Morris Jette authored Nov 10, 2016

It was causing the loss of node available_features on startup with
  node_features/knl_cray
bug 3241

da66b63f

Only try to load zonesort module if not already loaded · 3de17a26

Morris Jette authored Nov 10, 2016

Check for zonesort file first, to save time over attempting to load
  a module that is already loaded. It may be loaded by default per
  administrator configuration.

3de17a26

09 Nov, 2016 2 commits
- Integrate node_features/knl_generic plugin with HBM GRES · 3712bf37
  Morris Jette authored Nov 09, 2016
```
Set per-node HBM availability as a GRES based upon the KNL node's
  MCDRAM state
bug 3171
```
  3712bf37
- acct_gather_energy/rapl - prevent segfault in slurmd on startup. · c00b3da7
  Alejandro Sanchez authored Nov 09, 2016
```
Caused by race for local_energy which is dynamically allocated. Bail out
of the update if that hasn't been allocated yet.

Bug 3237.
```
  c00b3da7
08 Nov, 2016 5 commits

Upgrade "scontrol reboot" logic · 861bab6c

Morris Jette authored Nov 08, 2016

Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by
    "scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag,
    which could lead to inconsistencies. Add "ASAP" option to "scontrol reboot"
    command that will drain a node in order to reboot it as soon as possible,
    then return it to service.
bug 3210

861bab6c

Permit cancellation of jobs in configuring state. · 6957bd9f
Morris Jette authored Nov 08, 2016
```
bug 3213
```
6957bd9f

select/linear plugin modified to better support heterogeneous clusters · 243fbb0d

Morris Jette authored Nov 08, 2016

select/linear plugin modified to better support heterogeneous clusters when
    topology/none is also configured. Note that use of the select/cons_res
    plugin is strongly recommended for heterogeneous clusters. The use of
    OverSubscribe=exclusive can be used if whole node allocations is
    desired.
bug 3212

243fbb0d

job_submit/lua - Add features field for existing job records. · 9e7e12dc
Alejandro Sanchez authored Nov 08, 2016
```
Bug 3224.
```
9e7e12dc

sched/backfill - avoid starting requeued job · 69af50af

Morris Jette authored Nov 07, 2016

If a job is started by the main scheduling logic and requeued while
  the backfill scheduler has locks released, that can result in an
  invalid data structure in select/cons_res. Namely, the backfill
  scheduler's attempt to start the job would clear the job resources
  node_bitmap. That leaves a NULL pointer in the select/cons_res
  plugin generating an abort. (That pointer is needed to clean up
  the job allocation records when the Epilog or Cray Node Health
  Check, NHC, are complete and the resources become available for
  another job.)
bug 3230

69af50af

07 Nov, 2016 1 commit

knl_cray plugin - Avoid abort from backup slurmctld at start time. · 6798b468

Morris Jette authored Nov 07, 2016

Backup slurmctld will now:
1. Not abort due to NULL pointer (needed to move code around on restart)
2. Recover KNL MCDRAM and NUMA modes from state save files if capmc and
   cnselect are not available
bug 3241

6798b468