- 09 May, 2018 10 commits
-
-
Tim Wickberg authored
My mistake on commit 602817c8. Bug 4922.
-
Felip Moll authored
Without this, gang scheduling would incorrectly kick in for these jobs since active_resmap has not been updated appropriately. Bug 4922.
-
Tim Wickberg authored
Code for this was removed in 2012. Bug 5126.
-
Marshall Garey authored
Bug 5026.
-
Tim Wickberg authored
Otherwise this will return the error message back to the next job submitter. Bug 5106.
-
Tim Wickberg authored
Bug 5106.
-
Tim Wickberg authored
Link to CRIU as well. Bug 4293.
-
Tim Wickberg authored
Related to fix from bug 4155.
-
Josh Samuelson authored
Bug 4155.
-
Alejandro Sanchez authored
job_ptr->part_ptr is NULL if the partition has been deleted. Crash only happens with PriorityFlags=CALCULATE_RUNNING enabled. Bug 5136.
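A minimal sketch of the kind of guard this fix implies, assuming simplified stand-in structures. The struct layouts, the `priority_factor` field, and `job_part_factor()` are hypothetical illustrations, not Slurm's actual definitions or the code from this commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for Slurm's job and partition records. */
struct part_record {
	uint32_t priority_factor;	/* illustrative field name */
};

struct job_record {
	struct part_record *part_ptr;	/* NULL once the partition is deleted */
};

/*
 * Return the partition's contribution to a job's priority, or 0 when the
 * job no longer has a partition to read from.
 */
static uint32_t job_part_factor(const struct job_record *job_ptr)
{
	/* Check the pointer before dereferencing it. */
	if (!job_ptr || !job_ptr->part_ptr)
		return 0;

	return job_ptr->part_ptr->priority_factor;
}
```

Returning 0 here is only one possible policy for a job whose partition is gone; the real fix may handle the case differently.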
-
- 08 May, 2018 3 commits
-
-
Brian Christiansen authored
Bug 5146
-
Tim Wickberg authored
Caused by a corrupted protocol_version field value being received by the slurmstepd, as we cannot safely write/read a uint16_t across the pipe as if it was an int. Regression caused by commit 90b116c2. Bug 5133.
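The width mismatch described above can be avoided by using the same fixed-width type, and the same byte count, on both ends of the pipe. The sketch below illustrates that idea only; `send_version()`, `recv_version()`, and `roundtrip_version()` are hypothetical helper names, not functions from Slurm.

```c
#include <stdint.h>
#include <unistd.h>

/* Write exactly sizeof(uint16_t) bytes, never sizeof(int). */
static int send_version(int fd, uint16_t version)
{
	ssize_t n = write(fd, &version, sizeof(version));

	return (n == (ssize_t) sizeof(version)) ? 0 : -1;
}

/* Read into a uint16_t so no stale upper bytes can leak into the value. */
static int recv_version(int fd, uint16_t *version)
{
	ssize_t n = read(fd, version, sizeof(*version));

	return (n == (ssize_t) sizeof(*version)) ? 0 : -1;
}

/* Round-trip a value through a pipe; returns 0 if any step fails. */
static uint16_t roundtrip_version(uint16_t version)
{
	int fds[2];
	uint16_t out = 0;

	if (pipe(fds) != 0)
		return 0;

	if (send_version(fds[1], version) || recv_version(fds[0], &out))
		out = 0;

	close(fds[0]);
	close(fds[1]);
	return out;
}
```

If one side instead wrote two bytes and the other read `sizeof(int)` bytes into an `int`, the read would come up short and leave the remaining bytes of the `int` unset, which is one way a field like `protocol_version` could end up corrupted.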
-
Brian Christiansen authored
Requeued jobs are marked as PENDING|COMPLETING until the epilog checks in. The issue is that if job_set_alloc_tres gets called while in the PENDING|COMPLETING state, the job's alloc_tres_str will be freed. If this job then gets checkpointed in this state (PENDING|COMPLETING + no tres_alloc_str), on startup the controller would crash because it expected the job to have a tres_alloc_str/cnt when in the COMPLETING state. This could be triggered by starting the controller without the dbd up. When the dbd comes up, the assoc_cache_mgr calls _update_job_tres(), which calls job_set_alloc_tres. It could also be triggered by adding new tres. This most likely started happening in 17.11.5 because of commit 865b672f, which introduced calling _update_job_tres() on each job after the dbd comes up. Bugs 5137, 4522
-
- 04 May, 2018 2 commits
-
-
Brian Christiansen authored
Only when the connection has timed out. If the connection is timing out, consider increasing TCPTimeout in slurm.conf. Bug 4574
-
Danny Auble authored
-
- 03 May, 2018 6 commits
-
-
Boris Karasev authored
Bug 5129.
-
Alejandro Sanchez authored
Bug 5110.
-
Alejandro Sanchez authored
Bug 5110.
-
Tim Wickberg authored
Continuation of d0deea4f. Bug 4841.
-
Alejandro Sanchez authored
Use setenv() instead of setenvfs(), since setenvfs() allocates memory with xmalloc() while fini_setproctitle() (which is called on reconfigure) frees it with free(), leading to a "free(): invalid size" malloc_printerr error. Continuation of dce83a23. Bug 5095.
-
Felip Moll authored
Due to the current design, the job limits are checked before the allocation is made when one specifies a generic gres and a specific gres type is configured. The workaround for now is to define a job submit plugin to control the user request and successfully apply limits. Bug 4767
-
- 02 May, 2018 6 commits
-
-
Dominik Bartkiewicz authored
Bug 4960.
-
Dominik Bartkiewicz authored
Bug 4887.
-
Tim Wickberg authored
Can lead to deadlock within malloc depending on the exact timing. Rework thread startup and shutdown code so pthread_cancel is not needed. Bug 5119, 5103.
-
Tim Wickberg authored
happens. Bug 5108
-
Danny Auble authored
This reverts commit de5a4da2.
-
Danny Auble authored
happens. Bug 5108
-
- 01 May, 2018 2 commits
-
-
Danny Auble authored
Turns out the partition's billing tres was working off the sum of the node_ptrs, which contain the max of all partitions they are in. This isn't correct since each partition's billing can be different. Set it correctly here.
-
Tim Wickberg authored
No functional change.
-
- 30 Apr, 2018 7 commits
-
-
Tim Wickberg authored
The use in _watch_tasks needs to be removed, as the switch to pthread_signal from pthread_cancel means this will not get interrupted and would keep the step alive for at least a second, potentially harming throughput. Since the call to _poll_data() happens after the first timer expires, this delay turns out to be unnecessary, so we won't be replacing it with a pthread_cond_timedwait() construct. The use in jobacct_gather_stat_task() is unnecessary since the two locations where this can happen take place after _fork_all_tasks() has set up the tasks, thus the delay should not be necessary. Bug 5103.
-
Tim Wickberg authored
These functions are not async-cancel-safe, and cannot safely be cancelled. This leads to potential deadlock, either between our own locks, or deep inside glibc when the thread held a malloc arena lock when canceled. Replace with pthread_signal to the appropriate cond to wake threads up at the appropriate time instead. Bug 5103.
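A minimal sketch of the general pattern described above: a thread that sleeps on a condition variable and is asked to exit through a flag and a signal, so it always stops at a known point with no locks held by surprise. It assumes the "pthread_signal" in the message refers to the POSIX call `pthread_cond_signal()`; the `struct worker` type and the `worker_*` functions are hypothetical names, not Slurm code.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical worker state; names are illustrative only. */
struct worker {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool shutdown;		/* protected by lock */
	pthread_t tid;
};

static void *worker_main(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	/* Re-check the flag in a loop to tolerate spurious wakeups. */
	while (!w->shutdown)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);

	return NULL;
}

static int worker_start(struct worker *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->shutdown = false;

	return pthread_create(&w->tid, NULL, worker_main, w);
}

/* Ask the thread to exit and wait for it; no pthread_cancel() involved. */
static int worker_stop(struct worker *w)
{
	int rc;

	pthread_mutex_lock(&w->lock);
	w->shutdown = true;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);

	rc = pthread_join(w->tid, NULL);
	pthread_cond_destroy(&w->cond);
	pthread_mutex_destroy(&w->lock);
	return rc;
}
```

Because the flag is set and checked under the same mutex, the shutdown request cannot be missed even if `worker_stop()` runs before the thread first reaches `pthread_cond_wait()`.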
-
Danny Auble authored
This will make it easier in a future commit to avoid the async pthread_cancel. Bug 5103
-
Alejandro Sanchez authored
Bug 5110.
-
Marshall Garey authored
Remove partition MaxTime limit at the beginning of the test, run the rest of the test, then restore the partition configuration with scontrol reconfigure. Bug 4994.
-
Marshall Garey authored
Otherwise the extern step will disappear after 11.5 days. Bug 5000.
-
Dominik Bartkiewicz authored
to be sure if it is created under job write lock. Bug 4901
-
- 28 Apr, 2018 4 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
In conjunction with the previous commit (recognizing nodes being powered up out of band), set the node's last_idle to 0 when the node is in a power_save state, which additionally signifies that the node isn't booted. Partially reverts da722a89. Checking for (last_idle > 0) when in power_save state isn't necessary because if the node is already in power_save state the node won't be resumed unless (node_ptr->last_idle > (now - SuspendTime)). And with the previous change, the node's last_idle time will be set when the node registers.
-
Brian Christiansen authored
Bug 5053
-
Brian Christiansen authored
This allows the suspend script to be triggered even if Slurm has the node(s) in a power_save state. Bug 5053
-