- 14 May, 2018 2 commits
-
Morris Jette authored
Prevent run-away jobs
-
Morris Jette authored
-
- 11 May, 2018 7 commits
-
Morris Jette authored
-
Morris Jette authored
This is not currently supported and no date for support has been set.
-
Morris Jette authored
If burst_buffer.conf has GetSysState configured to a non-standard location but GetSysStatus is not configured, that is likely indicative of a bad configuration rather than a Slurm failure.
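For illustration, a minimal burst_buffer.conf sketch of the situation this check targets; the path below is hypothetical, not a recommended value:

    # burst_buffer.conf (illustrative sketch; the path is hypothetical)
    # GetSysState points at a non-standard location ...
    GetSysState=/usr/local/dws/bin/dw_wlm_cli
    # ... but GetSysStatus is left unset. With this change Slurm treats
    # the combination as a likely misconfiguration rather than a failure.
    #GetSysStatus=/usr/local/dws/bin/dwstat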
-
Morris Jette authored
Gracefully fail if salloc does not get job allocation
-
Alejandro Sanchez authored
-
Alejandro Sanchez authored
Introduced in bf4cb0b1.
-
Danny Auble authored
-
- 10 May, 2018 8 commits
-
Tim Wickberg authored
Support for AIX was removed before 17.02.
-
Morris Jette authored
-
Morris Jette authored
-
Alejandro Sanchez authored
-
Alejandro Sanchez authored
The first issue was identified with multi-partition requests. job_limits_check() was overriding the original memory requests, so the next partition Slurm validated limits against was not using the original values. The solution consists of adding three members to the job_details struct to preserve the original requests. This issue is reported in bug 4895.

The second issue was that memory enforcement behavior differed depending on whether the job request was issued against a reservation or not.

The third issue had to do with the automatic adjustments Slurm made underneath when the memory request exceeded the limit. These adjustments included increasing pn_min_cpus (even, incorrectly, beyond the number of CPUs available on the nodes) or different tricks increasing cpus_per_task and decreasing mem_per_cpu.

The fourth issue was identified when requesting the special case of 0 memory, which was handled inside the select plugin after the partition validations and thus could be used to incorrectly bypass the limits.

Issues 2-4 were identified in bug 4976. The patch also includes an entire refactor of how and when job memory is set to default values (if not requested initially) and how and when limits are validated.

Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
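As an illustration of the preservation idea described in the first issue, here is a minimal C sketch; the member and function names are assumptions for the example, not the exact identifiers added by the patch.

    /*
     * Sketch only: names are illustrative, not the exact fields added to
     * Slurm's job_details. The point is to keep the user's original
     * request so every partition in a multi-partition submission is
     * validated against the same values.
     */
    #include <stdint.h>

    struct job_details_sketch {
        uint64_t pn_min_memory;      /* working value, may be adjusted */
        uint16_t cpus_per_task;      /* working value, may be adjusted */
        uint16_t pn_min_cpus;        /* working value, may be adjusted */
        uint64_t orig_pn_min_memory; /* as originally requested */
        uint16_t orig_cpus_per_task; /* as originally requested */
        uint16_t orig_pn_min_cpus;   /* as originally requested */
    };

    /* Restore the working values before validating the next partition,
     * instead of carrying over adjustments made for a previous one. */
    static void reset_from_original(struct job_details_sketch *d)
    {
        d->pn_min_memory = d->orig_pn_min_memory;
        d->cpus_per_task = d->orig_cpus_per_task;
        d->pn_min_cpus   = d->orig_pn_min_cpus;
    }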
-
Danny Auble authored
The slurmctld doesn't need to send the fini message, and in fact if it does, things get messed up because the slurmdbd will close the database connection prematurely. Until now we would print an error on the slurmctld saying we couldn't send the FINI.
-
Danny Auble authored
partition is removed, then the slurmdbd comes up and we go to refresh the TRES pointers and try to dereference the part_ptr. Related to commit de7eac9a. Bug 5136
-
Danny Auble authored
and move the agent into the accounting_storage/slurmdbd plugin. This should be cleaner going forward and will be easier to maintain.
-
- 09 May, 2018 23 commits
-
Morris Jette authored
If running without AccountingStorageEnforce but with the DBD, and it isn't up when starting the slurmctld, you could get into a corner case where you don't have a QOS list in the assoc_mgr and thus no usage to transfer. Bug 5156
-
Tim Wickberg authored
-
Danny Auble authored
It turns out that if we are running from cache we need to call set_cluster_tres() again, since the pointers in the list have changed and don't have any counts anymore. I opted to just modify the existing locks and call it inside the lock instead of having set_cluster_tres() grab the locks.
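The locking choice reads roughly like the generic pattern below; this is a hypothetical C sketch using pthread mutexes, not Slurm's actual assoc_mgr lock API.

    #include <pthread.h>

    static pthread_mutex_t tres_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Must be called with tres_lock already held. */
    static void set_cluster_tres_locked(void)
    {
        /* recompute TRES counts from the (possibly refreshed) list */
    }

    static void refresh_from_cache(void)
    {
        pthread_mutex_lock(&tres_lock);
        /* reload cached lists here; old pointers become stale */
        set_cluster_tres_locked(); /* called inside the existing lock */
        pthread_mutex_unlock(&tres_lock);
    }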
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec and slurm.spec-legacy as well
-
Tim Wickberg authored
Clang warns about a possible null dereference of job_part_ptr if the !job_ptr->priority_array part of the conditional is taken. Remove that part of the conditional, as it doesn't matter here whether that is set or not. A job's eligibility for one vs. multiple partitions is not determined by that, but by the status of part_ptr_list and part_ptr. Bug 5136.
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
advent of persistent connections.
-
Brian Christiansen authored
-
Felip Moll authored
-
Morris Jette authored
Try to fill up each socket completely before moving into additional sockets. This will minimize the number of sockets needed, improving packing especially alongside MaxCPUsPerNode. Bug 4995.
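The packing order can be pictured with the short C sketch below; it is illustrative pseudo-logic under assumed names, not the actual cons_res/cons_tres selection code.

    #include <stdbool.h>

    struct socket_sketch {
        int total_cores;
        int used_cores;
    };

    /* Fill each socket completely before touching the next one, so a job
     * lands on as few sockets as possible. Returns true if the request
     * could be satisfied. */
    static bool pack_by_socket(struct socket_sketch *socks, int nsock,
                               int cores_needed)
    {
        for (int s = 0; s < nsock && cores_needed > 0; s++) {
            int free_cores = socks[s].total_cores - socks[s].used_cores;
            int take = (free_cores < cores_needed) ? free_cores : cores_needed;
            socks[s].used_cores += take;
            cores_needed -= take;
        }
        return (cores_needed == 0);
    }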
-
Tim Wickberg authored
My mistake on commit 602817c8. Bug 4922.
-
Felip Moll authored
Without this, gang scheduling would incorrectly kick in for these jobs since active_resmap has not been updated appropriately. Bug 4922.
-
Tim Wickberg authored
-
Tim Wickberg authored
Code for this was removed in 2012. Bug 5126.
-
Jessica Nettelblad authored
Bug 4563
-
Marshall Garey authored
Bug 5026.
-
Tim Wickberg authored
Otherwise this error message will be returned to the next job submitter. Bug 5106.
-
Tim Wickberg authored
Bug 5106.
-
Tim Wickberg authored
Link to CRIU as well. Bug 4293.
-
Tim Wickberg authored
-