- 05 Dec, 2016 2 commits
-
-
Danny Auble authored
from the slurm.conf when using FastSchedule=0.
-
Morris Jette authored
cray/burst_buffer - If the slurmctld daemon restarts with a pending job and a burst buffer having unknown file stage-in status, tear down the buffer, defer the job, and start stage-in over again. bug 3295
-
- 02 Dec, 2016 3 commits
-
-
Danny Auble authored
bug 3314
-
Danny Auble authored
-
Danny Auble authored
-
- 01 Dec, 2016 2 commits
-
-
Dominik Bartkiewicz authored
limits after the node selection to make sure it doesn't violate those limits and, if it does, change the reason for waiting so we don't reserve resources for jobs violating accounting limits. Bug 3029
-
Morris Jette authored
node_features/knl_cray - Fix possible race condition when changing node state that could result in an old KNL mode remaining as an active feature. bug 3235
-
- 30 Nov, 2016 2 commits
-
-
Morris Jette authored
cray/burst_buffer - Increase time to synchronize operations between threads from 5 to 60 seconds ("setup" operation time observed over 17 seconds). This should fix a race condition between a thread performing a buffer creation (setup) and a thread looking for unexpected buffers. If a buffer is found during the time window allowed for creation, its space will be counted twice: first by the status checking thread and second by the thread doing the creation. The deallocation only happens once, so the used space information can be left with an invalid value. bug 3295
-
Tim Wickberg authored
static variable means multiple active decompression streams will corrupt zlib's internal state, which can lead to a segfault. Bug 3299.
-
- 29 Nov, 2016 1 commit
-
-
Alejandro Sanchez authored
On a reconfig, the exc_node_bitmap is cleared but then it was not built again since last_work_scan was declared as a local static variable in _do_power_work(). The fix is to make it global within the plugin and reinitialize it to 0 on _init_power_config(). Bug 3078.
-
- 28 Nov, 2016 3 commits
-
-
Alejandro Sanchez authored
-
Dominik Bartkiewicz authored
Bug 3267.
-
Dominik Bartkiewicz authored
Termination can race against step creation if, e.g., ill-behaved SPANK plugins are in use. Bug 3248.
-
- 22 Nov, 2016 5 commits
-
-
Morris Jette authored
sched/backfill plugin: Make malloc match data type (defined as uint32_t and allocated as int). No failures observed, but if type "int" is smaller than "uint32_t", it could result in an invalid memory reference.
-
Sergey Meirovich authored
Fix API call: slurm_job_cpus_allocated_str_on_node_id() and in turn slurm_job_cpus_allocated_str_on_node() to return correct results for any node other than the first. This was caused by missing logic to calculate the first bit belonging to a particular node. Lookup was always starting from bit 0. Bug 3266.
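The offset calculation described above can be illustrated with a minimal sketch. It is not the Slurm implementation: the function names, the flat list used as a bitmap, and the per-node CPU counts are all hypothetical, and the real API returns a formatted string rather than a list.

```python
def first_bit_for_node(cpus_per_node, node_id):
    """Index of the first bit that belongs to node_id (hypothetical helper).

    Assumes each node's CPUs occupy a contiguous run of bits, in node order.
    """
    return sum(cpus_per_node[:node_id])


def cpus_allocated_on_node(bitmap, cpus_per_node, node_id):
    """Return the node-local CPU indices whose bits are set for node_id."""
    start = first_bit_for_node(cpus_per_node, node_id)
    return [i for i in range(cpus_per_node[node_id]) if bitmap[start + i]]


# Two nodes with 4 CPUs each; the job holds CPUs 0-1 on node 0 and 2-3 on node 1.
bitmap = [1, 1, 0, 0, 0, 0, 1, 1]
print(cpus_allocated_on_node(bitmap, [4, 4], 0))
print(cpus_allocated_on_node(bitmap, [4, 4], 1))
```

A lookup that always started from bit 0 would report the first node's CPUs for every node, which matches the symptom described in the commit message.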
-
Morris Jette authored
After one second of wall time, simulate the termination of all remaining running jobs in order to respond in a reasonable time frame. bug 3275
-
Morris Jette authored
Modify backfill algorithm to improve performance with large numbers of running jobs. Group running jobs that end in a "similar" time frame using a time window that grows exponentially rather than linearly. The original window sizes were: 0, 1, 2, 3, 4, 5, 6, 7, ... minutes. The new window sizes are: 0.5, 1, 2, 4, 8, 16, 32, ... minutes. This can dramatically reduce the number of instances where the very time-consuming "can the pending job run now" operation is executed, especially if there are 1000+ running jobs. bug 3275
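To see why the exponential windows reduce the number of groups, here is a rough model. It is a sketch under assumptions, not the sched/backfill code: the grouping rule (each group absorbs every job ending within the current window of the group's first job) and all function names are invented for illustration.

```python
from itertools import count


def linear_windows():
    """Original window sizes in minutes: 0, 1, 2, 3, ..."""
    for i in count(0):
        yield float(i)


def exponential_windows():
    """New window sizes in minutes: 0.5, 1, 2, 4, 8, ..."""
    width = 0.5
    while True:
        yield width
        width *= 2


def group_end_times(end_times, windows):
    """Group sorted job end times (minutes), one window size per group."""
    pending = sorted(end_times)
    groups = []
    i = 0
    for width in windows:
        if i >= len(pending):
            break
        start = pending[i]
        group = []
        # Absorb every job that ends within `width` minutes of the first one.
        while i < len(pending) and pending[i] - start <= width:
            group.append(pending[i])
            i += 1
        groups.append(group)
    return groups


# 120 running jobs, one ending each minute.
jobs = range(120)
print(len(group_end_times(jobs, linear_windows())))       # 15 groups
print(len(group_end_times(jobs, exponential_windows())))  # 8 groups
```

In this model each group corresponds to one run of the expensive "can the pending job run now" test, so fewer groups means fewer tests; the gap widens as the spread of end times grows.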
-
Nicolas Joly authored
-
- 14 Nov, 2016 1 commit
-
-
Morris Jette authored
If a node is booting for some job, don't allocate additional jobs to the node until the boot completes. bug 3256
-
- 13 Nov, 2016 1 commit
-
-
Alejandro Sanchez authored
Found with valgrind. Bug 2846.
-
- 11 Nov, 2016 3 commits
-
-
Morris Jette authored
Move where we set the configuration table bitmaps in order to support the backup slurmctld starting and recovering previously saved KNL mode information (which can necessitate rebuilding the node configuration table). bug 3241
-
Tim Wickberg authored
Bug 3255.
-
David Gloe authored
Bug 3253.
-
- 10 Nov, 2016 2 commits
-
-
Tim Wickberg authored
If the input value mod 512 == 0, the value would be subject to unintended rounding. Rework the function to check against this on each unit promotion. Bug 3252.
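One way to picture "check on each unit promotion" is the sketch below. It is an assumption-laden illustration rather than the reworked Slurm function: the name `convert_exact`, the unit table, and the rule of promoting only on exact division by 1024 are all hypothetical.

```python
UNITS = ["", "K", "M", "G", "T", "P"]  # hypothetical unit suffixes


def convert_exact(value, unit_index=0):
    """Promote value to larger units only while no precision is lost.

    The divisibility test is repeated before every promotion, so a value
    that is exact in K but not in M stops at K instead of being rounded.
    """
    while (value >= 1024 and value % 1024 == 0
           and unit_index < len(UNITS) - 1):
        value //= 1024
        unit_index += 1
    return f"{value}{UNITS[unit_index]}"


print(convert_exact(2048))                # 2K
print(convert_exact(1024 * 1024 + 1024))  # 1025K, not rounded to 1M
print(convert_exact(1536))                # 1536, left unpromoted
```

Checking only once, before the loop, would let a value pass the test at the first unit and then be truncated by a later division.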
-
Morris Jette authored
It was causing the loss of node available_features on startup with node_features/knl_cray bug 3241
-
- 09 Nov, 2016 2 commits
-
-
Morris Jette authored
Set per-node HBM availability as a GRES based upon the KNL node's MCDRAM state. bug 3171
-
Alejandro Sanchez authored
Caused by race for local_energy which is dynamically allocated. Bail out of the update if that hasn't been allocated yet. Bug 3237.
-
- 08 Nov, 2016 4 commits
-
-
Morris Jette authored
bug 3213
-
Morris Jette authored
select/linear plugin modified to better support heterogeneous clusters when topology/none is also configured. Note that use of the select/cons_res plugin is strongly recommended for heterogeneous clusters. OverSubscribe=exclusive can be used if whole node allocations are desired. bug 3212
-
Alejandro Sanchez authored
Bug 3224.
-
Morris Jette authored
If a job is started by the main scheduling logic and requeued while the backfill scheduler has locks released, that can result in an invalid data structure in select/cons_res. Namely, the backfill scheduler's attempt to start the job would clear the job resources node_bitmap. That leaves a NULL pointer in the select/cons_res plugin, generating an abort. (That pointer is needed to clean up the job allocation records when the Epilog or Cray Node Health Check, NHC, are complete and the resources become available for another job.) bug 3230
-
- 07 Nov, 2016 1 commit
-
-
Morris Jette authored
Backup slurmctld will now: 1. Not abort due to a NULL pointer (needed to move code around on restart). 2. Recover KNL MCDRAM and NUMA modes from state save files if capmc and cnselect are not available. bug 3241
-
- 05 Nov, 2016 1 commit
-
-
Morris Jette authored
cray/burst_buffer - Update "instance" parsing to match updated dw_wlm_cli output. bug 3222
-
- 04 Nov, 2016 2 commits
-
-
Morris Jette authored
cray/burst_buffer - Preserve job ID and don't translate to job array ID after slurmctld restart. Prior logic would not set array_task_id to NO_VAL, so all job-buffer IDs would be reported in the form "JobID=0_0(123)" rather than "JobID=123"
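The two output forms mentioned above can be reproduced with a small sketch. The function name and argument layout are hypothetical, and the sentinel value for NO_VAL is assumed here to be 0xfffffffe; the sketch only shows how an unset array_task_id (0 instead of NO_VAL) leads to the array-style form.

```python
NO_VAL = 0xFFFFFFFE  # assumed sentinel meaning "not a job array task"


def format_job_id(job_id, array_job_id, array_task_id):
    """Format a job identifier, using the array form only for array tasks."""
    if array_task_id == NO_VAL:
        return f"JobID={job_id}"
    return f"JobID={array_job_id}_{array_task_id}({job_id})"


# array_task_id left at 0 after restart: reported as an array task.
print(format_job_id(123, 0, 0))       # JobID=0_0(123)
# array_task_id set to NO_VAL: reported as a plain job ID.
print(format_job_id(123, 0, NO_VAL))  # JobID=123
```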
-
Morris Jette authored
cray/burst_buffer - Internally track both allocated and unusable space. The reported UsedSpace in a pool is now the allocated space (previously was unusable space). Base available space on whichever value leaves the least free space. bug 3222
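The "whichever value leaves the least free space" rule can be sketched as follows. The `Pool` class and its field names are hypothetical stand-ins, not the plugin's data structures.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical view of a burst buffer pool; all sizes in the same unit."""
    total_space: int
    allocated_space: int  # space handed out to buffers; reported as UsedSpace
    unusable_space: int   # space reported as not available for new buffers

    def used_space(self):
        """Value reported as UsedSpace: the allocated space."""
        return self.allocated_space

    def free_space(self):
        """Be conservative: subtract the larger of the two tracked values."""
        in_use = max(self.allocated_space, self.unusable_space)
        return max(self.total_space - in_use, 0)


pool = Pool(total_space=1000, allocated_space=300, unusable_space=450)
print(pool.used_space())  # 300
print(pool.free_space())  # 550
```

Taking the maximum means a new buffer is never sized against space that either counter already considers taken.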
-
- 01 Nov, 2016 3 commits
-
-
Danny Auble authored
and request --ntasks-per-core=1 and only 1 task on the node the slurmd would abort on an infinite loop fatal. Regression is from commit 5265420d. Without this fix you can get into an infinite loop in the task/affinity plugin. The loop is handled by producing a fatal. Bug 3118
-
Morris Jette authored
cray/burst_buffer - Fix for double counting of used_space at slurmctld startup. bug 3222
-
Morris Jette authored
cray/burst_buffer - If total_space in a pool decreases, reset used_space rather than trying to account for buffer allocations in progress. bug 3222
-
- 28 Oct, 2016 1 commit
-
-
Danny Auble authored
more time than should be allowed would be accounted for. This only happened on jobs in the completing state when the slurmctld was shut down. This will also be enhanced in 17.02, as the job's end_time_exp is not stored, which is needed to determine if the job has already been through the decay_thread at end of job. Bug 3162
-
- 27 Oct, 2016 1 commit
-
-
Morris Jette authored
-