- 08 Nov, 2016 2 commits
-
-
Alejandro Sanchez authored
Bug 3224.
-
Morris Jette authored
If a job is started by the main scheduling logic and requeued while the backfill scheduler has locks released, that can result in an invalid data structure in select/cons_res. Namely, the backfill scheduler's attempt to start the job would clear the job resources node_bitmap. That leaves a NULL pointer in the select/cons_res plugin, generating an abort. (That pointer is needed to clean up the job allocation records when the Epilog or Cray Node Health Check, NHC, is complete and the resources become available for another job.) bug 3230
-
- 07 Nov, 2016 1 commit
-
-
Morris Jette authored
Backup slurmctld will now: 1. Not abort due to a NULL pointer (needed to move code around on restart). 2. Recover KNL MCDRAM and NUMA modes from state save files if capmc and cnselect are not available. bug 3241
-
- 05 Nov, 2016 1 commit
-
-
Morris Jette authored
cray/burst_buffer - Update "instance" parsing to match updated dw_wlm_cli output. bug 3222
-
- 04 Nov, 2016 2 commits
-
-
Morris Jette authored
cray/burst_buffer - Preserve job ID and don't translate to job array ID after slurmctld restart. Prior logic would not set array_task_id to NO_VAL, so all job-buffer IDs would be reported in the form "JobID=0_0(123)" rather than "JobID=123"
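The two output forms quoted in this message can be illustrated with a minimal sketch. It is not the plugin's code: `format_job_id()` and `SKETCH_NO_VAL` are hypothetical names, and the sentinel value is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: sentinel meaning "not a job array task" (illustrative value). */
#define SKETCH_NO_VAL 0xfffffffeU

/*
 * Hypothetical helper: write the job identifier into buf.
 * A plain job is printed as "JobID=123"; a job array task is printed as
 * "JobID=<array_job_id>_<array_task_id>(<job_id>)".
 * Returns the value reported by snprintf().
 */
int format_job_id(char *buf, size_t len, uint32_t job_id,
                  uint32_t array_job_id, uint32_t array_task_id)
{
    assert(buf != NULL);

    if (array_task_id == SKETCH_NO_VAL)
        return snprintf(buf, len, "JobID=%u", (unsigned) job_id);

    return snprintf(buf, len, "JobID=%u_%u(%u)", (unsigned) array_job_id,
                    (unsigned) array_task_id, (unsigned) job_id);
}
```

If array_task_id is left at zero instead of the sentinel after a restart, a plain job 123 takes the second branch and is printed as "JobID=0_0(123)", which is the symptom the commit describes.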
-
Morris Jette authored
cray/burst_buffer - Internally track both allocated and unusable space. The reported UsedSpace in a pool is now the allocated space (previously it was the unusable space). Base available space on whichever value leaves the least free space. bug 3222
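The "whichever value leaves the least free space" rule can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the struct and its field names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool record; field names are illustrative only. */
struct bb_pool_sketch {
    uint64_t total_space;    /* total size of the pool */
    uint64_t alloc_space;    /* space allocated to jobs (reported as UsedSpace) */
    uint64_t unusable_space; /* space reported as unusable */
};

/*
 * Free space is computed against the larger of the two usage figures,
 * so the result is the smaller (more conservative) free value.
 */
uint64_t bb_pool_free_space(const struct bb_pool_sketch *pool)
{
    uint64_t used;

    assert(pool != NULL);

    used = pool->alloc_space > pool->unusable_space ?
           pool->alloc_space : pool->unusable_space;

    /* Clamp at zero so the unsigned subtraction cannot wrap around. */
    if (used >= pool->total_space)
        return 0;
    return pool->total_space - used;
}
```

For example, a pool with total 100, allocated 30 and unusable 45 would report 55 free under this sketch.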
-
- 01 Nov, 2016 3 commits
-
-
Danny Auble authored
and request --ntasks-per-core=1 and only 1 task on the node the slurmd would abort on an infinite loop fatal. Regression is from commit 5265420d. Without this fix you can get into an infinite loop in the task/affinity plugin. The loop is handled by producing a fatal. Bug 3118
-
Morris Jette authored
cray/burst_buffer - Fix for double counting of used_space at slurmctld startup. bug 3222
-
Morris Jette authored
cray/burst_buffer - If total_space in a pool decreases, reset used_space rather than trying to account for buffer allocations in progress. bug 3222
-
- 28 Oct, 2016 1 commit
-
-
Danny Auble authored
more time than should be allowed would be accounted for. This only happened on jobs in the completing state when the slurmctld was shutdown. This will also be enhanced in 17.02 as the job's end_time_exp is not stored which is needed to determine if the job has already been through the decay_thread at end of job. Bug 3162
-
- 27 Oct, 2016 4 commits
-
-
Morris Jette authored
-
Danny Auble authored
issue with gang scheduling. Bug 3211
-
Tim Wickberg authored
-
Morris Jette authored
-
- 26 Oct, 2016 4 commits
-
-
Morris Jette authored
Fix bug that was clearing MAINT mode on nodes scheduled for reboot (bug introduced in version 16.05.5 to address bug in overlapping reservations, commit 5eee1d28). Note that a node's MAINT flag is used for both a requested reboot and a maintenance reservation. What I'd like to do is add a new node state flag to differentiate between these two cases, but that involves some significant changes that could introduce instability, so it will be deferred to version 17.02. bug 3210
-
Danny Auble authored
-
Danny Auble authored
requested with -n tasks < hosts from -w hostlist.
-
Morris Jette authored
bug 2149
-
- 25 Oct, 2016 3 commits
-
-
Dominik Bartkiewicz authored
Bug 3194
-
Morris Jette authored
Document that node Weight takes precedence over load with LLN scheduling. bug 3204
-
Tim Wickberg authored
task/cray's _get_numa_nodes() function needs to run before task/cgroup cleans up the cgroup hierarchies, otherwise ALPS memory compaction will never run. Also move task_p_add_pid() outside the #ifdef HAVE_NATIVE_CRAY block so that the plugin will load (albeit without any functionality) on non-Cray systems for testing purposes. Revise documentation and provided slurm.conf templates as well. Bug 3154.
-
- 20 Oct, 2016 2 commits
-
-
Tim Wickberg authored
_select_nodes_parts() was resetting state_reason to an admin hold without regard to admin vs user hold state. state_reason is the only place that user vs. admin is distinguished, so this prevented users from releasing these jobs. Bug introduced by commit fb46c84b in 16.05.5. Bug 3197.
-
Danny Auble authored
This is an addition to commit cb7ed937
-
- 19 Oct, 2016 1 commit
-
-
Ole H Nielsen authored
bug 3191
-
- 18 Oct, 2016 2 commits
-
-
Dominik Bartkiewicz authored
Improve reported estimates of start and end times for pending jobs. bug 3184
-
Morris Jette authored
Cray: Prevent abort in backfill scheduling logic for requeued job that has been cancelled while NHC is running. bug 3185
-
- 17 Oct, 2016 1 commit
-
-
Danny Auble authored
new glibc 2.24+ that deprecates readdir_r.
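For context, the usual replacement for readdir_r() is plain readdir() with errno cleared before each call, so that end-of-directory can be told apart from an error. The sketch below shows that general pattern only; `count_dir_entries()` is a hypothetical helper, not a function from this commit.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the entries in a directory using readdir().
 * Returns the entry count (including "." and ".."), or -1 on error.
 */
int count_dir_entries(const char *path)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;
    int saved_errno;

    assert(path != NULL);

    dir = opendir(path);
    if (!dir)
        return -1;

    for (;;) {
        /* readdir() returns NULL both at end-of-directory and on error;
         * only errno distinguishes the two, so reset it first. */
        errno = 0;
        ent = readdir(dir);
        if (!ent)
            break;
        count++;
    }

    saved_errno = errno;
    closedir(dir);
    if (saved_errno) {
        errno = saved_errno;
        return -1;
    }
    return count;
}
```

Each DIR stream here is opened and used by a single caller, which is the case where readdir() is safe without the caller-supplied buffer that readdir_r() required.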
-
- 13 Oct, 2016 3 commits
-
-
Morris Jette authored
Added node_features/knl_generic plugin for KNL support on non-Cray systems. NOTE: This plugin is still under development.
-
Morris Jette authored
Do not propagate SLURM_UMASK environment variable to batch script. bug 2609
-
Bjørn-Helge Mevik authored
Correct a bitmap test function (used only by the select/bluegene plugin). The effect of this bug is probably very limited as it will in almost all cases revert prematurely to a bit-by-bit test rather than using a full-word test. bug 3145
-
- 12 Oct, 2016 6 commits
-
-
Tim Wickberg authored
Cannot use ClusterName without reading a config file that may not exist. Bug 3026.
-
Tim Wickberg authored
This introduced an inadvertent dependency on the config file, which does not exist when setting up a new cluster. Bug 3026. This reverts commit c39f9ac9.
-
Morris Jette authored
task/affinity plugin: Honor a job's --ntasks-per-socket and --ntasks-per-core options in task binding. bug 3118
-
Morris Jette authored
Preserve non-KNL node features when updating the KNL node features for a multi-node job in which the non-KNL node features vary by node.
-
Morris Jette authored
node_features/knl_cray plugin: If the reconfiguration of nodes for an interactive job fails, kill the job (it can't be requeued like a batch job).
-
Morris Jette authored
node_features/knl_cray plugin: Add separate thread to interact with capmc in response to unexpected node reboots. bug 3153
-
- 11 Oct, 2016 4 commits
-
-
Alejandro Sanchez authored
bug 3091
-
Morris Jette authored
Prevent possible divide by zero in select/cons_res if a node's board count is higher than its socket count. bug 3155
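The failure mode is easy to reproduce with integer arithmetic: with 1 socket and 4 boards, sockets / boards truncates to 0, and any later division by that quotient faults. A minimal guard might look like the sketch below; `sockets_per_board()` is a hypothetical helper and the clamping policy is an assumption, not the change made in select/cons_res.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: sockets per board, guaranteed to be non-zero so
 * callers can safely divide by the result.
 * Assumed policy: a board count of 0 is treated as 1, and a board count
 * larger than the socket count is clamped to the socket count.
 */
uint16_t sockets_per_board(uint16_t sockets, uint16_t boards)
{
    uint16_t per_board;

    if (sockets == 0)
        sockets = 1;
    if (boards == 0)
        boards = 1;
    if (boards > sockets)
        boards = sockets; /* unclamped, the division would yield 0 */

    per_board = sockets / boards;
    assert(per_board > 0);
    return per_board;
}
```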
-
Morris Jette authored
If a node's socket or core count is changed at registration time (e.g. a KNL node's NUMA mode is changed), change its board count to match. bug 3155
-
Morris Jette authored
Cray: The slurmd can manipulate the socket/core/thread values reported based upon the configuration. The logic failed to consider select/cray with SelectTypeParameters=other_cons_res as equivalent to select/cons_res. bug 3155
-