- 10 Nov, 2016 1 commit
-
-
Morris Jette authored
Check for zonesort file first, to save time over attempting to load a module that is already loaded. It may be loaded by default per administrator configuration.
-
- 09 Nov, 2016 2 commits
-
-
Morris Jette authored
Set per-node HBM availability as a GRES based upon the KNL node's MCDRAM state. bug 3171
-
Alejandro Sanchez authored
Caused by race for local_energy which is dynamically allocated. Bail out of the update if that hasn't been allocated yet. Bug 3237.
-
- 08 Nov, 2016 5 commits
-
-
Morris Jette authored
Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by "scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag, which could lead to inconsistencies. Add "ASAP" option to "scontrol reboot" command that will drain a node in order to reboot it as soon as possible, then return it to service. bug 3210
-
Morris Jette authored
bug 3213
-
Morris Jette authored
select/linear plugin modified to better support heterogeneous clusters when topology/none is also configured. Note that use of the select/cons_res plugin is strongly recommended for heterogeneous clusters. OverSubscribe=exclusive can be used if whole-node allocations are desired. bug 3212
-
Alejandro Sanchez authored
Bug 3224.
-
Morris Jette authored
If a job is started by the main scheduling logic and requeued while the backfill scheduler has locks released, that can result in an invalid data structure in select/cons_res. Namely, the backfill scheduler's attempt to start the job would clear the job resources node_bitmap. That leaves a NULL pointer in the select/cons_res plugin, generating an abort. (That pointer is needed to clean up the job allocation records when the Epilog or Cray Node Health Check, NHC, are complete and the resources become available for another job.) bug 3230
-
- 07 Nov, 2016 1 commit
-
-
Morris Jette authored
Backup slurmctld will now: 1. Not abort due to NULL pointer (needed to move code around on restart). 2. Recover KNL MCDRAM and NUMA modes from state save files if capmc and cnselect are not available. bug 3241
-
- 05 Nov, 2016 1 commit
-
-
Morris Jette authored
cray/burst_buffer - Update "instance" parsing to match updated dw_wlm_cli output. bug 3222
-
- 04 Nov, 2016 3 commits
-
-
Morris Jette authored
Add "FreeSpace" information for each pool to the "scontrol show burstbuffer" output. Required changes to the burst_buffer_info_t data structure. bug 3222
-
Morris Jette authored
cray/burst_buffer - Preserve job ID and don't translate to job array ID after slurmctld restart. Prior logic would not set array_task_id to NO_VAL, so all job-buffer IDs would be reported in the form "JobID=0_0(123)" rather than "JobID=123"
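
A minimal sketch of the ID formatting described above, assuming the display rule is "print the array form unless array_task_id is NO_VAL". The helper name `format_job_id` and its parameters are hypothetical and are not Slurm's code; `NO_VAL` mirrors Slurm's 0xfffffffe sentinel.

```python
NO_VAL = 0xFFFFFFFE  # sentinel value meaning "not set"


def format_job_id(job_id, array_job_id=0, array_task_id=NO_VAL):
    """Hypothetical helper: render a job-buffer ID for display."""
    if array_task_id == NO_VAL:
        # Not a job array task: report the plain job ID.
        return f"JobID={job_id}"
    # Job array task: report array_job_id_array_task_id(job_id).
    return f"JobID={array_job_id}_{array_task_id}({job_id})"


# Zero-initialized array fields reproduce the misleading form.
print(format_job_id(123, array_job_id=0, array_task_id=0))
# With array_task_id set to NO_VAL, the plain ID is reported.
print(format_job_id(123))
```

The first call shows why leaving array_task_id at zero after a restart is enough to make every regular job look like an array task.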
-
Morris Jette authored
cray/burst_buffer - Internally track both allocated and unusable space. The reported UsedSpace in a pool is now the allocated space (previously it was the unusable space). Base available space on whichever value leaves the least free space. bug 3222
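
The "whichever value leaves least free space" rule can be sketched as follows. This is an illustration only: the function name `pool_free_space` and its arguments are hypothetical, and clamping the result at zero is an assumption rather than something the commit states.

```python
def pool_free_space(total_space, allocated_space, unusable_space):
    """Hypothetical helper: free space remaining in a burst buffer pool."""
    # Take the larger of the two tracked figures, since that is the one
    # that leaves the least free space.
    used = max(allocated_space, unusable_space)
    # Assumption: never report a negative amount of free space.
    return max(total_space - used, 0)


# Unusable space exceeds allocated space, so it determines the result.
print(pool_free_space(100, 30, 40))
# Allocated space exceeds unusable space, so it determines the result.
print(pool_free_space(100, 50, 10))
```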
-
- 02 Nov, 2016 1 commit
-
-
Morris Jette authored
Add LaunchParameters=mem_sort option to configure running of zonesort by default at step startup. Also add documentation about zonesort to the KNL web page. bug 3188
-
- 01 Nov, 2016 3 commits
-
-
Danny Auble authored
and request --ntasks-per-core=1 and only 1 task on the node, the slurmd would abort on an infinite loop fatal. Regression is from commit 5265420d. Without this fix you can get into an infinite loop in the task/affinity plugin. The loop is handled by producing a fatal. Bug 3118
-
Morris Jette authored
cray/burst_buffer - Fix for double counting of used_space at slurmctld startup. bug 3222
-
Morris Jette authored
cray/burst_buffer - If total_space in a pool decreases, reset used_space rather than trying to account for buffer allocations in progress. bug 3222
-
- 31 Oct, 2016 1 commit
-
-
Morris Jette authored
bug 3188
-
- 28 Oct, 2016 1 commit
-
-
Danny Auble authored
more time than should be allowed would be accounted for. This only happened on jobs in the completing state when the slurmctld was shut down. This will also be enhanced in 17.02, as the job's end_time_exp is not stored, which is needed to determine if the job has already been through the decay_thread at end of job. Bug 3162
-
- 27 Oct, 2016 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
bug 3139
-
Danny Auble authored
issue with gang scheduling. Bug 3211
-
Tim Wickberg authored
-
Brian Christiansen authored
Federated submissions
-
Morris Jette authored
-
- 26 Oct, 2016 6 commits
-
-
Morris Jette authored
Fix bug that was clearing MAINT mode on nodes scheduled for reboot (bug introduced in version 16.05.5 to address bug in overlapping reservations, commit 5eee1d28). Note that a node's MAINT flag is used for both a requested reboot and a maintenance reservation. What I'd like to do is add a new node state flag to differentiate between these two cases, but that involves some significant changes that could introduce instability, so it will be deferred to version 17.02. bug 3210
-
Alejandro Sanchez authored
salloc are requested with -n tasks < hosts from -w hostlist or from -N.
-
Danny Auble authored
-
Danny Auble authored
requested with -n tasks < hosts from -w hostlist.
-
Morris Jette authored
bug 2149
-
Morris Jette authored
Add new SchedulerParameter (max_array_tasks) to limit the maximum number of tasks in a job array independently from the maximum task ID (MaxArraySize). bug 2676
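
A rough sketch of how the two limits differ, assuming an array is rejected when it exceeds either one. The function `check_array` is hypothetical and is not Slurm's validation code; it relies on the documented rule that the largest task ID is one less than MaxArraySize.

```python
def check_array(task_ids, max_array_size, max_array_tasks):
    """Hypothetical check of a job array against both limits."""
    if not task_ids:
        return False
    # MaxArraySize bounds the task ID: the largest allowed ID is one less.
    if max(task_ids) >= max_array_size:
        return False
    # max_array_tasks bounds how many tasks the array may contain.
    if len(task_ids) > max_array_tasks:
        return False
    return True


# Ten tasks with IDs 0, 100, ..., 900: high IDs, but a small task count.
print(check_array(range(0, 1000, 100), 1001, 10))
# One thousand tasks with the same maximum ID exceed the task-count limit.
print(check_array(range(0, 1000), 1001, 10))
```

Keeping the two checks separate is what lets a site permit sparse arrays with large task IDs while still capping the total number of tasks.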
-
- 25 Oct, 2016 9 commits
-
-
Dominik Bartkiewicz authored
Bug 3194
-
Morris Jette authored
Add SbcastParameters configuration option to control default file destination directory and compression algorithm. bug 2977
-
Morris Jette authored
Replace sjstat, seff and sjobexit RPM packages with a single "contribs" package.
-
Danny Auble authored
-
Morris Jette authored
Remove separate slurm_blcr package. If Slurm is built with BLCR support, the files will now be part of the main Slurm packages. bug 2061
-
Morris Jette authored
Document that node Weight takes precedence over load with LLN scheduling. bug 3204
-
Tim Wickberg authored
Follow on to commit c3266fca for 17.02+.
-
Tim Wickberg authored
task/cray's _get_numa_nodes() function needs to run before task/cgroup cleans up the cgroup hierarchies, otherwise ALPS memory compaction will never run. Also move task_p_add_pid() outside the #ifdef HAVE_NATIVE_CRAY block so that the plugin will load (albeit without any functionality) on non-Cray systems for testing purposes. Revise documentation and provided slurm.conf templates as well. Bug 3154.
-
Morris Jette authored
Do not include SLURM_JOB_DERIVED_EC, SLURM_JOB_EXIT_CODE, or SLURM_JOB_EXIT_CODE2 in PrologSlurmctld environment (not available yet). bug 1431
-