- 21 May, 2018 1 commit
Dominik Bartkiewicz authored
g_qos_count and g_qos_max_priority must be updated under the QOS write lock. Bug 5159.
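For context, a minimal self-contained sketch of the pattern this fix enforces: globals derived from the QOS list must only be modified under the writer side of a reader/writer lock. Slurm uses its own assoc_mgr lock API internally; the plain pthread rwlock and the recount logic below are stand-ins for illustration.

```c
/* Illustration only: g_qos_count / g_qos_max_priority style globals
 * guarded by a writer lock. Build with: cc demo.c -lpthread */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_rwlock_t qos_lock = PTHREAD_RWLOCK_INITIALIZER;
static uint32_t g_qos_count;        /* number of QOS records */
static uint32_t g_qos_max_priority; /* highest QOS priority seen */

/* Recompute both globals; callers must not already hold qos_lock. */
static void recount_qos(const uint32_t *prios, size_t n)
{
	pthread_rwlock_wrlock(&qos_lock); /* write lock: both globals mutate */
	g_qos_count = (uint32_t) n;
	g_qos_max_priority = 0;
	for (size_t i = 0; i < n; i++) {
		if (prios[i] > g_qos_max_priority)
			g_qos_max_priority = prios[i];
	}
	pthread_rwlock_unlock(&qos_lock);
}

int main(void)
{
	uint32_t prios[] = { 10, 500, 100 };

	recount_qos(prios, 3);
	pthread_rwlock_rdlock(&qos_lock); /* readers take the read side */
	printf("count=%u max_priority=%u\n", g_qos_count, g_qos_max_priority);
	pthread_rwlock_unlock(&qos_lock);
	return 0;
}
```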
-
- 17 May, 2018 1 commit
Danny Auble authored
PriorityFlags=ACCRUE_ALWAYS is set. Bug 5186
-
- 15 May, 2018 1 commit
Alejandro Sanchez authored
Previously the default paths continued to be tested even when new ones were requested. As a consequence, if any of the new paths matched one of the defaults (e.g. /usr or /usr/local), the configure script incorrectly errored out, reporting that a version of PMIx had already been found in a previous path. Bug 5168.
-
- 10 May, 2018 1 commit
Alejandro Sanchez authored
The first issue was identified on multi-partition requests: job_limits_check() was overriding the original memory requests, so the next partition Slurm validated limits against was no longer using the original values. The solution consists in adding three members to the job_details struct to preserve the original requests. This issue is reported in bug 4895. The second issue was that memory enforcement behaved differently depending on whether the request was issued against a reservation or not. The third issue had to do with the automatic adjustments Slurm made underneath when the memory request exceeded the limit. These adjustments included increasing pn_min_cpus (even, incorrectly, beyond the number of CPUs available on the nodes) or other tricks such as increasing cpus_per_task while decreasing mem_per_cpu. The fourth issue was identified with the special case of requesting 0 memory, which was handled inside the select plugin after the partition validations and thus could be used to incorrectly bypass the limits. Issues 2-4 were identified in bug 4976. The patch also includes a full refactor of how and when job memory is set to default values (if not requested initially) and how and when limits are validated. Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
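To illustrate the first fix, a hedged sketch of the approach: keep the submitted values in separate fields so per-partition validation never starts from values mutated by a previous partition's check. The struct and member names below (other than pn_min_memory) are hypothetical, not necessarily those added by the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified job details: the working field may be
 * adjusted per partition, the orig_* field preserves the request. */
struct job_details {
	uint64_t pn_min_memory;      /* working value, may be clamped */
	uint64_t orig_pn_min_memory; /* as submitted, never modified */
};

/* Validate against one partition's limit, always restarting from the
 * original request rather than the previous partition's leftovers. */
static bool limits_check(struct job_details *d, uint64_t part_max_mem_mb)
{
	d->pn_min_memory = d->orig_pn_min_memory;
	return d->pn_min_memory <= part_max_mem_mb;
}

int main(void)
{
	struct job_details d = { .orig_pn_min_memory = 4096 };
	uint64_t part_limits_mb[] = { 2048, 8192 }; /* two partitions */

	for (int i = 0; i < 2; i++)
		printf("partition %d: %s\n", i,
		       limits_check(&d, part_limits_mb[i]) ?
		       "ok" : "exceeds limit");
	return 0;
}
```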
-
- 09 May, 2018 9 commits
Morris Jette authored
If running without AccountingStorageEnforce but with the DBD, and the DBD isn't up when the slurmctld starts, you could hit a corner case where there is no QOS list in the assoc_mgr, and thus no usage to transfer. Bug 5156
-
Tim Wickberg authored
-
Felip Moll authored
-
Morris Jette authored
Try to fill up each socket completely before moving into additional sockets. This will minimize the number of sockets needed, improving packing especially alongside MaxCPUsPerNode. Bug 4995.
-
Tim Wickberg authored
My mistake on commit 602817c8. Bug 4922.
-
Felip Moll authored
Without this, gang scheduling would incorrectly kick in for these jobs since active_resmap has not been updated appropriately. Bug 4922.
-
Tim Wickberg authored
Otherwise this error message would be returned to the next job submitter. Bug 5106.
-
Tim Wickberg authored
Bug 5106.
-
Alejandro Sanchez authored
job_ptr->part_ptr is NULL if the partition has been deleted. Crash only happens with PriorityFlags=CALCULATE_RUNNING enabled. Bug 5136.
-
- 08 May, 2018 3 commits
Brian Christiansen authored
Bug 5146
-
Tim Wickberg authored
Caused by a corrupted protocol_version field value being received by the slurmstepd, as we cannot safely write/read a uint16_t across the pipe as if it were an int. Regression caused by commit 90b116c2. Bug 5133.
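For background, a minimal sketch of the safe pattern: transfer the value across the pipe with its exact width instead of through an int-sized variable. The variable names and value are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Write/read a uint16_t across a pipe using its exact size. Writing
 * 4 bytes via an int while the reader pulls 2 (or vice versa) shifts
 * the stream and corrupts every later field -- the class of bug
 * described above. */
int main(void)
{
	int fds[2];
	uint16_t protocol_version = 0x2042, received = 0;

	if (pipe(fds))
		return 1;
	if (write(fds[1], &protocol_version, sizeof(protocol_version)) !=
	    (ssize_t) sizeof(protocol_version))
		return 1;
	if (read(fds[0], &received, sizeof(received)) !=
	    (ssize_t) sizeof(received))
		return 1;
	printf("received 0x%04x\n", received);
	return 0;
}
```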
-
Brian Christiansen authored
Requeued jobs are marked as PENDING|COMPLETING until the epilog checks in. The issue is that if job_set_alloc_tres gets called while in the PENDING|COMPLETING state, the job's tres_alloc_str will be freed. If the job then gets checkpointed in this state (PENDING|COMPLETING with no tres_alloc_str), the controller would crash on startup because it expects the job to have a tres_alloc_str/cnt when in the COMPLETING state. This could be triggered by starting the controller without the dbd up: when the dbd comes up, the assoc_cache_mgr calls _update_job_tres(), which calls job_set_alloc_tres. It could also be triggered by adding new TRES. This most likely started happening in 17.11.5 because of commit 865b672f, which introduced calling _update_job_tres() on each job after the dbd comes up. Bugs 5137 and 4522.
-
- 04 May, 2018 1 commit
Brian Christiansen authored
Only when the connection has timed out. If connections are timing out, consider increasing TCPTimeout in slurm.conf. Bug 4574
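For reference, a hedged slurm.conf fragment showing the parameter in question (the value is a placeholder; pick one appropriate for the network):

```
# slurm.conf: time permitted for a TCP connection to be established (seconds)
TCPTimeout=10
```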
-
- 03 May, 2018 3 commits
Boris Karasev authored
Bug 5129.
-
Alejandro Sanchez authored
Use setenv() instead of setenvfs(), since setenvfs() allocates memory with xmalloc() while fini_setproctitle() (which is called on reconfigure) releases it with free(), leading to a "free(): invalid size" malloc_printerr error. Continuation of dce83a23. Bug 5095.
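The underlying rule is that memory must be released by the same allocator family that created it. A self-contained illustration of the mismatch, using a stand-in header-based xmalloc wrapper (Slurm's real xmalloc does more bookkeeping than this):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for an xmalloc-style wrapper that hides bookkeeping in a
 * header placed before the pointer it hands out. */
static void *my_xmalloc(size_t size)
{
	size_t *p = malloc(sizeof(size_t) + size);

	if (!p)
		abort();
	*p = size;    /* hidden header */
	return p + 1; /* caller sees the memory after the header */
}

static void my_xfree(void *ptr)
{
	if (ptr)
		free((size_t *) ptr - 1); /* unwind the header */
}

int main(void)
{
	char *s = my_xmalloc(16);

	strcpy(s, "hello");
	puts(s);
	/* free(s) here would hand glibc a pointer it never returned,
	 * producing errors such as "free(): invalid size". */
	my_xfree(s); /* correct: matching deallocator */
	return 0;
}
```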
-
Felip Moll authored
Due to the current design, job limits are checked before the allocation is made when a generic GRES is requested but a specific GRES type is configured. The workaround for now is to define a job_submit plugin to control the user request and successfully apply limits. Bug 4767
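As a hedged sketch of such a workaround, a job_submit plugin could reject generic GRES requests that omit a type. This is based on the 17.11-era C plugin API, where the request is visible as job_desc->gres; the parsing is deliberately naive and the whole policy is illustrative, not the official fix.

```c
/* Sketch of a job_submit plugin rejecting type-less GPU requests,
 * e.g. "gres=gpu:2" where only typed ones ("gres=gpu:tesla:2")
 * should pass. */
#include <string.h>

#include <slurm/slurm.h>
#include <slurm/slurm_errno.h>

#include "src/common/xstring.h"
#include "src/slurmctld/slurmctld.h"

const char plugin_name[] = "Require typed GRES job_submit plugin";
const char plugin_type[] = "job_submit/require_typed_gres";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;

extern int job_submit(struct job_descriptor *job_desc, uint32_t submit_uid,
		      char **err_msg)
{
	/* Typed requests carry two ':' ("gpu:tesla:2"); generic ones
	 * at most one ("gpu" or "gpu:2"). */
	if (job_desc->gres && !strncmp(job_desc->gres, "gpu", 3)) {
		const char *colon = strchr(job_desc->gres, ':');

		if (!colon || !strchr(colon + 1, ':')) {
			*err_msg = xstrdup("please request a specific GPU type");
			return ESLURM_INVALID_GRES;
		}
	}
	return SLURM_SUCCESS;
}

extern int job_modify(struct job_descriptor *job_desc,
		      struct job_record *job_ptr, uint32_t submit_uid)
{
	return SLURM_SUCCESS;
}
```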
-
- 02 May, 2018 6 commits
Dominik Bartkiewicz authored
Bug 4960.
-
Dominik Bartkiewicz authored
Bug 4887.
-
Tim Wickberg authored
Can lead to deadlock within malloc depending on the exact timing. Rework thread startup and shutdown code so pthread_cancel is not needed. Bugs 5119 and 5103.
-
Tim Wickberg authored
happens. Bug 5108
-
Danny Auble authored
This reverts commit de5a4da2.
-
Danny Auble authored
happens. Bug 5108
-
- 01 May, 2018 1 commit
Danny Auble authored
Turns out the partition's billing TRES was working off the sum of the node_ptrs, which contain the max of all partitions they are in. This isn't correct since each partition's billing can be different. Set it correctly here.
-
- 30 Apr, 2018 3 commits
Tim Wickberg authored
These functions are not async-cancel-safe and cannot safely be cancelled. This leads to potential deadlock, either between our own locks or deep inside glibc when the thread held a malloc arena lock while cancelled. Replace with a pthread_cond_signal() to the appropriate condition variable to wake threads up at the appropriate time instead. Bug 5103.
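A minimal self-contained sketch of the replacement pattern: rather than pthread_cancel() (which can land while the target holds a malloc arena lock), set a shutdown flag under the mutex and signal the condition variable the worker waits on.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool shutting_down = false;

static void *worker(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!shutting_down) {
		/* Periodic work would go here; the wait releases the
		 * mutex and wakes promptly on shutdown. */
		pthread_cond_wait(&cond, &lock);
	}
	pthread_mutex_unlock(&lock);
	/* The thread exits at a well-defined point; it can never be
	 * killed mid-malloc as with asynchronous cancellation. */
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);

	pthread_mutex_lock(&lock);
	shutting_down = true;
	pthread_cond_signal(&cond); /* instead of pthread_cancel(tid) */
	pthread_mutex_unlock(&lock);

	pthread_join(tid, NULL);
	puts("worker exited cleanly");
	return 0;
}
```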
-
Marshall Garey authored
Otherwise the extern step will disappear after 11.5 days. Bug 5000.
-
Dominik Bartkiewicz authored
to be sure it is created under the job write lock. Bug 4901
-
- 28 Apr, 2018 2 commits
Brian Christiansen authored
Bug 5053
-
Brian Christiansen authored
This allows the suspend script to be triggered even if Slurm has the node(s) in a power_save state. Bug 5053
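For context, a hedged slurm.conf fragment showing the power saving hooks involved (paths and times are placeholders):

```
# slurm.conf power saving hooks (illustrative values)
SuspendProgram=/usr/local/sbin/node_suspend.sh
ResumeProgram=/usr/local/sbin/node_resume.sh
SuspendTime=600   # seconds of idle before a node is suspended
```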
-
- 23 Apr, 2018 2 commits
Morris Jette authored
-
Morris Jette authored
When any of these --exclusive modes could not be satisfied, Slurm was returning an incorrect ESLURM_NODE_NOT_AVAIL, with the scheduling problems described in the bug as a consequence. The fix properly sets the error code to ESLURM_NODES_BUSY, also fixing the scheduling problems and working over the correct share_node_bitmap. Continuation of commits from bug 4932: e2a14b8d, fc4e5ac9. Bug 5047.
-
- 19 Apr, 2018 2 commits
Marshall Garey authored
Fix an issue in the bit manipulation logic introduced in commit 892ffa89. Bug 4997.
-
Tim Wickberg authored
Replace select_p_select_jobinfo_sprint() with the same NO-OP that the other plugins (except alps and bluegene) have implemented. Bug 5077.
-
- 17 Apr, 2018 1 commit
Morris Jette authored
1. Identifies nodes which are unavailable to a specific job, adding a call to filter_by_node_owner() in select_nodes() where the node list is generated.
2. Removes the "unavail_node_str" argument to select_nodes() as it is no longer useful. This string was originally generated once at the start of the job scheduling logic for all jobs, but since each job can have a different set of unavailable nodes (dedicated to a user, group, etc.), the same string for all jobs can be misleading. Bug 4932.
-
- 16 Apr, 2018 3 commits
Tim Wickberg authored
-
Dominik Bartkiewicz authored
See commit 0dabf4e7. Bug 4932.
-
Dominik Bartkiewicz authored
Regression from ef1f3e73. Bug 4885.
-