- 15 Jun, 2018 2 commits
-
-
Marshall Garey authored
Bug 5270.
-
Tim Wickberg authored
Instead of unintentionally rejecting the update from a non-Administrator if the job_submit plugin modified that field. Bug 5306.
-
- 14 Jun, 2018 1 commit
-
-
Felip Moll authored
sched/backfill: Reset job time limit if needed for deadline scheduling. Bug 5183.
-
- 13 Jun, 2018 1 commit
-
-
Tim Wickberg authored
I do not see a use for this syntax, especially given that it appends an extra comma in between the two halves. Only allow the full string to change to put this in line with the Comment handling. Remove special handling of an identical AdminComment as well, since the end result is unchanged, and this avoids a potentially expensive xstrcmp call. Bug 5306.
-
- 12 Jun, 2018 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
Bug 5286
-
Tim Wickberg authored
RHEL 6 (and related distributions) use lua as the package name; test whether that package exists with a version >= 5.1 if the other tests have already failed. Bug 5263.
-
- 10 Jun, 2018 1 commit
-
-
Dominik Bartkiewicz authored
bug 4987
-
- 08 Jun, 2018 2 commits
-
-
Tim Wickberg authored
And do not list each individual sensor reading but just the computed sum of each one grouped by key. Bug 5274.
-
Morris Jette authored
This is in anticipation of an upcoming change to the cgroup hierarchy on a future CLE release. Bug 5145.
-
- 07 Jun, 2018 2 commits
-
-
Brian Christiansen authored
If defined, is called when a node fails to resume by ResumeTimeout.
-
Isaac Hartung authored
if any task in the array was requeued. This is a hint to use "sacct --duplicates" to see the whole picture of the array job. Bug 5105
-
- 06 Jun, 2018 3 commits
-
-
Morris Jette authored
burst_buffer.conf - Add SetExecHost flag to enable burst buffer access from the login node for interactive jobs.
-
Alejandro Sanchez authored
And remove the initialization before all the calls to the function. This is a non-functional change; the motivation is preventive, so that whenever we use slurm_mktime() we know tm_isdst is consistently set to -1. Bug 5230.
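The idea of setting tm_isdst inside the wrapper, rather than at every call site, can be sketched as follows. This is a minimal illustration, not Slurm's slurm_mktime() source; the name mktime_auto_dst is hypothetical.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical wrapper (not Slurm's slurm_mktime()): force tm_isdst to -1
 * so that mktime() itself determines whether daylight saving time applies,
 * whatever value the caller left in the field.
 */
time_t mktime_auto_dst(struct tm *tp)
{
	assert(tp != NULL);
	tp->tm_isdst = -1;	/* negative value: let mktime() decide */
	return mktime(tp);
}
```

With the assignment inside the wrapper, two callers passing the same broken-down time but different stale tm_isdst values get the same result, so the per-call initialization becomes redundant.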
-
Brian Christiansen authored
which were marked down due to ResumeTimeout. If a cloud node was marked down due to not responding by ResumeTimeout, the code inadvertently added the node back to the avail_node_bitmap -- after being cleared by set_node_down_ptr(). The scheduler would then attempt to allocate the node again, which would cause a loop of hitting ResumeTimeout and allocating the downed node again. Bug 5264
-
- 05 Jun, 2018 1 commit
-
-
Killian authored
Bug 5206.
-
- 04 Jun, 2018 1 commit
-
-
Morris Jette authored
-
- 02 Jun, 2018 1 commit
-
-
Brian Christiansen authored
srun would not return an exit code if a previous task exited before a later task exited with a signal. If multiple tasks exit with a signal, srun returns the highest signal. Partially reverts commit 04b449e1 -- the setting of local_global_rc to NO_VAL, as srun no longer needs to know whether it has been set or not. srun always sets the signal if a task exited with a signal. Bug 5083
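The "highest signal wins" rule described above can be sketched as a simple reduction over per-task results. This is an illustrative sketch only; struct task_result and highest_term_signal are hypothetical names, not srun's actual data structures or code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task result; not a Slurm data structure. */
struct task_result {
	int exit_code;		/* exit status if the task exited normally */
	int term_signal;	/* terminating signal number, or 0 if none */
};

/*
 * Return the highest terminating signal among all tasks, or 0 if no task
 * was killed by a signal. The order in which tasks finished does not
 * matter, so an earlier normal exit cannot mask a later signal.
 */
int highest_term_signal(const struct task_result *tasks, size_t ntasks)
{
	int highest = 0;

	assert(tasks != NULL || ntasks == 0);
	for (size_t i = 0; i < ntasks; i++) {
		if (tasks[i].term_signal > highest)
			highest = tasks[i].term_signal;
	}
	return highest;
}
```

Because the result is recomputed from every task, no sentinel value is needed to record whether a return code has already been set.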
-
- 31 May, 2018 1 commit
-
-
Alejandro Sanchez authored
There were two code paths building an allocation response, each calling its own static _build_alloc_msg() function: (1) src/slurmctld/proc_req.c and (2) src/slurmctld/srun_comm.c. The two functions had diverged, and each left unfilled some members that the other filled in. This patch changes the signature of the one in proc_req.c to make it extern, and srun_comm.c now calls this newly common function. It also adds the cpu_freq_[min|max|gov] members to the common function, since these were the only members missing from the proc_req.c version (the one in srun_comm.c was missing more members, such as all the ntasks_per*, account, qos and resv_name). Bug 4999.
-
- 30 May, 2018 7 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Marshall Garey authored
Only trust MUNGE signed values, unless the RPC was signed by SlurmUser or root. CVE-2018-10995.
-
Tim Wickberg authored
Do not defer until later, and do not potentially miss out on proper validation of the user_name field which can lead to improper authentication handling. CVE-2018-10995.
-
Dominik Bartkiewicz authored
Bug 5038.
-
Tim Wickberg authored
Caused by pthread_cancel cleanup by commit e5f03971 in 17.11.6. Bug 5181.
-
Tim Wickberg authored
The race condition was created in a7c8964e in 17.11.6 when removing the (unsafe) pthread_cancel code handling thread termination. Bug 5164
-
- 24 May, 2018 1 commit
-
-
Brian Christiansen authored
Commits f18390e8 and eed76f85 modified the stepd so that, on encountering an unkillable step timeout, the stepd would simply exit. If the stepd is a batch step, it replies to the controller with a non-zero exit code, which drains the node. But if an srun allocation/step were to get into the unkillable step code, the steps wouldn't let the waiting srun or controller know about the step going away -- leaving a hanging srun and job. This patch enables the stepd to notify the waiting sruns and the ctld that the stepd is done, and drains the node for srun'ed allocations and/or steps. Bug 5164
-
- 21 May, 2018 1 commit
-
-
Dominik Bartkiewicz authored
g_qos_count and g_qos_max_priority must be accessed under the QOS write lock. Bug 5159.
-
- 17 May, 2018 2 commits
-
-
Danny Auble authored
PriorityFlags=ACCRUE_ALWAYS is set. Bug 5186
-
Morris Jette authored
Completely remove "gres" field from step record in slurmctld and step info message. Use "tres_per_node", "tres_per_socket", etc.
-
- 16 May, 2018 2 commits
-
-
Morris Jette authored
Add node_features plugin function "node_features_p_reboot_weight()" to return the node weight to be used for a compute node that requires reboot for use (e.g. to change the NUMA mode of a KNL node). Add NodeRebootWeight parameter to knl.conf configuration file.
-
Alejandro Sanchez authored
Previously the default paths continued to be tested even when new ones were requested. As a result, if any of the new paths was the same as one of the default ones (i.e. /usr or /usr/local), the configure script incorrectly errored out, reporting that a version of PMIx had already been found in a previous path. Bug 5168.
-
- 15 May, 2018 2 commits
-
-
Morris Jette authored
Add node_features plugin function "node_features_p_reboot_weight()" to return the node weight to be used for a compute node that requires reboot for use (e.g. to change the NUMA mode of a KNL node). Add NodeRebootWeight parameter to knl.conf configuration file.
-
Alejandro Sanchez authored
Previously the default paths continued to be tested even when new ones were requested. As a result, if any of the new paths was the same as one of the default ones (i.e. /usr or /usr/local), the configure script incorrectly errored out, reporting that a version of PMIx had already been found in a previous path. Bug 5168.
-
- 10 May, 2018 1 commit
-
-
Alejandro Sanchez authored
The first issue was identified on multi-partition requests. job_limits_check() was overriding the original memory requests, so the next partition Slurm validated limits against was not using the original values. The solution consists of adding three members to the job_details struct to preserve the original requests. This issue is reported in bug 4895. The second issue was that memory enforcement behaved differently depending on whether or not the job request was issued against a reservation. The third issue had to do with the automatic adjustments Slurm made underneath when the memory request exceeded the limit. These adjustments included increasing pn_min_cpus (even incorrectly beyond the number of cpus available on the nodes) or different tricks increasing cpus_per_task and decreasing mem_per_cpu. The fourth issue was identified when requesting the special case of 0 memory, which was handled inside the select plugin after the partition validations and thus could be used to incorrectly bypass the limits. Issues 2-4 were identified in bug 4976. The patch also includes a complete refactor of how and when job memory is set to default values (if not requested initially) and how and when limits are validated. Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
-
- 09 May, 2018 5 commits
-
-
Morris Jette authored
If running without AccountingStorageEnforce but with the DBD, and the DBD isn't up when the slurmctld starts, you could get into a corner case where there is no QOS list in the assoc_mgr, and thus no usage to transfer. Bug 5156
-
Tim Wickberg authored
-
Felip Moll authored
-
Morris Jette authored
Try to fill up each socket completely before moving into additional sockets. This will minimize the number of sockets needed, improving packing especially alongside MaxCPUsPerNode. Bug 4995.
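The packing strategy described above can be illustrated with a small greedy sketch. It is an assumption-laden illustration of the idea only, not the select plugin's code: pack_sockets, free_cores and picked are hypothetical names, and sockets are simply visited in index order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: take cores from each socket until it is exhausted
 * before moving on to the next one. picked[i] receives the number of cores
 * taken from socket i. Returns the number of sockets used, or -1 if the
 * free cores cannot satisfy the request. Assumes free_cores[] entries are
 * non-negative.
 */
int pack_sockets(const int *free_cores, size_t nsockets, int needed,
		 int *picked)
{
	int total = 0, used = 0;

	assert(free_cores != NULL && picked != NULL && needed >= 0);
	for (size_t i = 0; i < nsockets; i++) {
		picked[i] = 0;
		total += free_cores[i];
	}
	if (total < needed)
		return -1;

	for (size_t i = 0; i < nsockets && needed > 0; i++) {
		int take = (free_cores[i] < needed) ? free_cores[i] : needed;

		if (take > 0) {
			picked[i] = take;
			needed -= take;
			used++;
		}
	}
	return used;
}
```

For two sockets with four free cores each, a request for three cores stays on the first socket, and a request for six fills the first socket before taking two cores from the second.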
-
Tim Wickberg authored
My mistake on commit 602817c8. Bug 4922.
-