- 01 May, 2018 3 commits
-
-
Danny Auble authored
thread. Otherwise you could get into a race where we don't have it running when the registration response is sent back, which now leaves us without any TRES to send to the slurmstepd.
-
Danny Auble authored
This gives us debug output earlier than we had it before. I see no reason to delay it until later.
-
Danny Auble authored
We found that on most systems we would only need to wait < 20000 usecs for this to happen. This is much shorter than the previous 1 sec. We found we almost always (100% in my testing) did not need to wait for the step to finish in the first place, though.
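The idea of waiting in short increments rather than one fixed 1 sec sleep can be sketched as follows. This is only an illustration, not the code from this commit: wait_in_steps(), done_fn, and polls_exhausted() are hypothetical names, and the step and limit values are chosen by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate type: returns true once the awaited event happened. */
typedef bool (*done_fn)(void *arg);

/*
 * Poll `done` every `step_usec` microseconds for at most `max_usec`.
 * Returns true if the event was observed, false on timeout.
 * If the event has already happened, no sleep is performed at all.
 */
static bool wait_in_steps(done_fn done, void *arg,
                          long step_usec, long max_usec)
{
    long waited = 0;

    while (!done(arg)) {
        if (waited >= max_usec)
            return false;
        struct timespec ts = {
            .tv_sec = step_usec / 1000000,
            .tv_nsec = (step_usec % 1000000) * 1000,
        };
        nanosleep(&ts, NULL);
        waited += step_usec;
    }
    return true;
}

/* Example predicate (hypothetical): reports completion after a fixed
 * number of unsuccessful polls. */
static bool polls_exhausted(void *arg)
{
    int *remaining = arg;

    if (*remaining <= 0)
        return true;
    (*remaining)--;
    return false;
}
```

With a 1000 usec step and a 20000 usec limit, a caller pays only for the increments actually needed and returns immediately when the event has already occurred.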
-
- 30 Apr, 2018 10 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
The use in _watch_tasks needs to be removed as the switch to pthread_cond_signal from pthread_cancel means this will not get interrupted and would keep the step alive for at least a second, potentially harming throughput. Since the call to _poll_data() happens after the first timer expires, this delay turns out to be unnecessary, so we won't be replacing it with a pthread_cond_timedwait() construct. The use in jobacct_gather_stat_task() is unnecessary since the two locations where this can happen take place after _fork_all_tasks() has set up the tasks, thus the delay should not be necessary. Bug 5103.
-
Tim Wickberg authored
These functions are not async-cancel-safe, and cannot safely be cancelled. This leads to potential deadlock, either between our own locks, or deep inside glibc when the thread held a malloc arena lock when cancelled. Replace with pthread_cond_signal on the appropriate cond to wake threads up at the appropriate time instead. Bug 5103.
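A minimal sketch of the general pattern, assuming a polling thread that sleeps on a condition variable and is woken with pthread_cond_signal() (the POSIX call for signalling a cond) rather than being cancelled. The struct poller type and the poller_*() functions are hypothetical and are not taken from the Slurm source.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical state shared between a polling thread and its owner. */
struct poller {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool shutdown;   /* set by the owner to ask the thread to exit */
    int polls;       /* number of polling passes completed */
};

/* Thread body: poll once per second until asked to shut down. */
static void *poller_main(void *arg)
{
    struct poller *p = arg;

    pthread_mutex_lock(&p->lock);
    while (!p->shutdown) {
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1;   /* poll interval */

        /* Sleeps until the deadline or until poller_stop() signals. */
        int rc = pthread_cond_timedwait(&p->cond, &p->lock, &deadline);
        if (rc == ETIMEDOUT && !p->shutdown)
            p->polls++;         /* real polling work would go here */
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Initialize the shared state and start the thread. */
static int poller_start(struct poller *p, pthread_t *tid)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->cond, NULL);
    p->shutdown = false;
    p->polls = 0;
    return pthread_create(tid, NULL, poller_main, p);
}

/* Wake the thread and wait for it to exit; pthread_cancel() is not used. */
static void poller_stop(struct poller *p, pthread_t tid)
{
    pthread_mutex_lock(&p->lock);
    p->shutdown = true;
    pthread_cond_signal(&p->cond);
    pthread_mutex_unlock(&p->lock);
    pthread_join(tid, NULL);
}
```

Because the thread only ever exits at a point of its own choosing, with the mutex released, it cannot be torn down while holding one of our locks or a lock inside glibc. The shutdown flag is checked under the mutex, so a signal sent before the thread starts waiting is not lost.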
-
Danny Auble authored
This will make it easier in a future commit to avoid the async pthread_cancel. Bug 5103
-
Alejandro Sanchez authored
Bug 5110.
-
Danny Auble authored
# Conflicts: # src/slurmctld/job_mgr.c
-
Marshall Garey authored
Remove partition MaxTime limit at the beginning of the test, run the rest of the test, then restore the partition configuration with scontrol reconfigure. Bug 4994.
-
Marshall Garey authored
Otherwise the extern step will disappear after 11.5 days. Bug 5000.
-
Dominik Bartkiewicz authored
to be sure if it is created under job write lock. Bug 4901
-
- 28 Apr, 2018 5 commits
-
-
Tim Wickberg authored
-
Brian Christiansen authored
-
Brian Christiansen authored
In conjunction with the previous commit (recognizing nodes being powered up out of band), set the node's last_idle to 0 when the node is in a power_save state. This carries the additional meaning that the node isn't booted. Partially reverts da722a89. Checking for (last_idle > 0) when in power_save state isn't necessary because if the node is already in power_save state the node won't be resumed unless (node_ptr->last_idle > (now - SuspendTime)). And with the previous change, the node's last_idle time will be set when the node registers.
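The reasoning above can be illustrated with a small sketch. It only mirrors the comparison quoted in the message; struct node_sketch, enter_power_save(), and passes_last_idle_check() are hypothetical names, and the real node record and power save logic in slurmctld are more involved.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, reduced stand-in for a node record. */
struct node_sketch {
    time_t last_idle;   /* 0 is taken to mean "not booted" */
    bool power_save;
};

/* Entering power_save clears last_idle, as described in the message. */
static void enter_power_save(struct node_sketch *n)
{
    n->power_save = true;
    n->last_idle = 0;
}

/*
 * The comparison quoted in the message:
 *   node_ptr->last_idle > (now - SuspendTime)
 * With last_idle == 0 this is false whenever now > suspend_time,
 * so a separate (last_idle > 0) test adds nothing.
 */
static bool passes_last_idle_check(const struct node_sketch *n,
                                   time_t now, time_t suspend_time)
{
    return n->last_idle > (now - suspend_time);
}
```

Once the node registers and last_idle is set to a recent time, the same comparison starts to succeed without any extra state.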
-
Brian Christiansen authored
Bug 5053
-
Brian Christiansen authored
This allows the suspend script to be triggered even if Slurm has the node(s) in a power_save state. Bug 5053
-
- 27 Apr, 2018 2 commits
-
-
Danny Auble authored
-
Tim Wickberg authored
-
- 26 Apr, 2018 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
The test was failing consistently on a Cray with NHC configured.
-
Morris Jette authored
Disable the tests as needed
-
Tim Wickberg authored
-
Marshall Garey authored
Just in case reboot_program doesn't actually turn this node off for some reason, at least stopping slurmd explicitly will keep the node offline until someone intervenes. Bug 5019.
-
- 25 Apr, 2018 6 commits
-
-
Morris Jette authored
Add configuration parameter SlurmctldAddr for use with a virtual IP to manage backup slurmctld daemons. Bug 4768.
-
Danny Auble authored
by hwloc_obj_type_snprintf. You will only see this if you have _DEBUG set to 1.
-
Danny Auble authored
-
Morris Jette authored
Add configuration parameters SlurmctldPrimaryOnProg and SlurmctldPrimaryOffProg, which define programs to execute when a slurmctld daemon becomes the primary server or goes from primary to backup mode. Bug 4768.
-
Tim Wickberg authored
-
Isaac Hartung authored
Large results from this will cause scheduler performance problems, usually due to running inside a VM without the Linux vDSO module. Bug 4961.
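The vDSO is what lets calls such as gettimeofday() return without a full system call on Linux. A rough way to see the cost on a given machine, which is not necessarily the measurement this commit adds, is to time a batch of calls. The function name time_gettimeofday_calls() and the batch size are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Returns the elapsed time, in microseconds, taken to make `calls`
 * back-to-back gettimeofday() calls. The elapsed time is measured
 * with the monotonic clock so wall clock adjustments do not skew it.
 */
static long time_gettimeofday_calls(int calls)
{
    struct timespec start, end;
    struct timeval scratch;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < calls; i++)
        gettimeofday(&scratch, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (long)(end.tv_sec - start.tv_sec) * 1000000L +
           (end.tv_nsec - start.tv_nsec) / 1000;
}
```

On a host where the vDSO path is available, a batch of a few thousand calls typically completes far faster than on one where every call traps into the kernel, so a large result hints at the second case.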
-
- 24 Apr, 2018 7 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
-
Morris Jette authored
The included lightweight corefile description is no longer valid, and is misleading at best.
-
Morris Jette authored
-
Christopher Bottoms authored
-
Morris Jette authored
-
Isaac Hartung authored
-
- 23 Apr, 2018 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
Bug introduced in commit 11a75ff4
-