- 30 Jan, 2017 2 commits
-
-
Morris Jette authored
Clear a job's "BeginTime" reason in a more timely fashion and/or prevent the job from being stuck in a PENDING state. There are multiple ways of clearing the reason, especially on a lightly loaded system, but the state can persist indefinitely on a heavily loaded system. bug 3368
-
Morris Jette authored
Fix the logic for getting the expected start time of an existing job ID with an explicit begin time that is in the past. The previous logic compared that (past) begin time, rather than the current time, with the advanced reservations that would compete with the job.
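The idea behind this fix can be sketched as follows. This is a minimal Python illustration under simplifying assumptions, not Slurm's C code: `earliest_start` and the `(start, end)` reservation tuples are hypothetical, and job duration is ignored.

```python
def earliest_start(begin_time, now, reservations):
    """Return the earliest time a job could start (epoch seconds).

    begin_time:   the job's explicit begin time, possibly in the past
    now:          the current time
    reservations: (start, end) windows that block the job (hypothetical model)
    """
    # Clamp a past begin time to the current time before comparing it
    # with competing reservations.
    start = max(begin_time, now)
    for res_start, res_end in sorted(reservations):
        if res_start <= start < res_end:
            # The candidate start falls inside a reservation: move past it.
            start = res_end
    return start
```

Without the `max()` clamp, a begin time of 100 checked at time 1000 would be compared with reservations that ended long ago, and the estimate could itself land in the past.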
-
- 29 Jan, 2017 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
On Cray systems with step NHC, the step launches are delayed and produce the following pair of messages, which caused the test to fail: "srun: Job step creation temporarily disabled, retrying" and "srun: Job step created".
-
Morris Jette authored
-
Morris Jette authored
-
- 28 Jan, 2017 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
Avoid a test failure if all nodes in a partition are not usable (down, drained, reserved, or otherwise unusable).
-
Morris Jette authored
-
Morris Jette authored
Disable test if underlying select/linear is in use.
-
- 27 Jan, 2017 2 commits
-
-
Danny Auble authored
Turns out this never worked, ever. Previously, if the protocol_version that was read in didn't match, the rpc_version given to unpack things was just 0. Now the rpc_version is set to the version that was stored, so unpacking works.
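A rough sketch of the pattern described here, unpacking with the version stored alongside the data, might look like this. The record layout and the names `pack_state` and `unpack_state` are hypothetical and only illustrate the idea; they are not Slurm's actual pack/unpack routines.

```python
import struct

def pack_state(version, value):
    """Serialize a value behind a 16-bit protocol version header."""
    if version >= 2:
        payload = struct.pack("!I", value)  # hypothetical newer layout: 32-bit field
    else:
        payload = struct.pack("!H", value)  # hypothetical older layout: 16-bit field
    return struct.pack("!H", version) + payload

def unpack_state(buf):
    """Unpack using the version stored in the buffer, never a default of 0."""
    (stored_version,) = struct.unpack_from("!H", buf, 0)
    rpc_version = stored_version  # use what was stored with the data
    if rpc_version >= 2:
        (value,) = struct.unpack_from("!I", buf, 2)
    else:
        (value,) = struct.unpack_from("!H", buf, 2)
    return rpc_version, value
```

The point of the sketch is that the reader takes its layout decision from the header it just read, so old and new records both decode.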
-
Danny Auble authored
correctly.
-
- 26 Jan, 2017 3 commits
-
-
Alejandro Sanchez authored
Bug 3431
-
Morris Jette authored
-
Alejandro Sanchez authored
bug 3433
-
- 25 Jan, 2017 10 commits
-
-
Morris Jette authored
burst_buffer/cray - Fix race condition that could cause multiple batch job launch requests resulting in downed nodes. bug 3366
-
Dominik Bartkiewicz authored
-
Danny Auble authored
This reverts commit b9bff82f.
-
Danny Auble authored
-
Morris Jette authored
It was leaking memory otherwise
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the job if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory. Use --wrap to avoid creating a temporary job script as well.
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the step if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory. Use --wrap to avoid creating a temporary job script.
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the step if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory; and since only one step is checked for, only run a single step in the job. Use --wrap to avoid creating a temporary job script.
-
Tim Wickberg authored
-
Tim Wickberg authored
-
- 24 Jan, 2017 3 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
Some portions of tests 21.30 and 21.34 failed with accounting and priority/basic. These changes disable portions of those tests as needed based upon configuration.
-
Morris Jette authored
test1.63 was failing periodically due to a race condition. A signal was being sent to srun before the signal handler thread was spawned.
-
- 23 Jan, 2017 9 commits
-
-
Morris Jette authored
Reset a job's memory limit based upon what's available after node reboot, which can change on a KNL if the MCDRAM mode is changed on reboot.
-
Morris Jette authored
This bug was likely the root cause of bug 3366. If the backfill scheduler allocates resources for a batch job and a node reboot is required, the batch launch RPC would be sent to the agent. At that point, there is a race condition between the agent and the job_time_limit() function testing for boot completion. If the job_time_limit() function ran first, it would trigger a second launch RPC request getting sent to the agent. bug 3366
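The double-launch race described above can be illustrated generically. The sketch below is not the Slurm fix itself; it only shows how a lock plus a "launch already sent" flag keeps two code paths from both issuing a launch request. The `Job` class and `request_launch` function are hypothetical.

```python
import threading

class Job:
    """Hypothetical stand-in for a job record shared by several threads."""
    def __init__(self):
        self.lock = threading.Lock()
        self.launch_sent = False
        self.launch_count = 0

def request_launch(job):
    """Queue a launch request at most once, even if callers race."""
    with job.lock:
        if job.launch_sent:
            return False  # another path already requested the launch
        job.launch_sent = True
        job.launch_count += 1
        return True
```

Calling `request_launch()` from both an agent-like thread and a periodic time-limit check results in exactly one launch, whichever runs first.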
-
Morris Jette authored
Clean up logic to test if job is configuring. bug 3366
-
Morris Jette authored
Do not launch a batch step while the job is configuring. Previous logic checked for the PrologSlurmctld running, but not nodes booting. Checking the job's CONFIGURING state flag will validate both. bug 3366
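As a minimal sketch of the state-flag check, assuming the job state is a bit field with a "configuring" flag: the constant values and the function name below are hypothetical and chosen only for illustration.

```python
JOB_RUNNING = 0x0001        # hypothetical base state value
JOB_CONFIGURING = 0x4000    # hypothetical flag bit OR'ed into the state

def can_launch_batch_step(job_state):
    """Allow the batch step only once the configuring flag is clear.

    The single flag is assumed to cover both a running PrologSlurmctld
    and nodes that are still booting.
    """
    return (job_state & JOB_CONFIGURING) == 0
```

Testing one flag instead of each underlying condition means a newly added reason for "still configuring" is covered without touching the launch path.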
-
Morris Jette authored
Add check to prevent step allocation logic from executing job configuration completion logic multiple times (check if job is configuring before clearing flag and resetting time limit). bug 3366
-
Morris Jette authored
slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld daemon is running or node boot in progress. bug 3366
-
Morris Jette authored
This is required to manage the configuration completion. bug 3366
-
Morris Jette authored
This will be required to lock the job structure. bug 3366
-
Morris Jette authored
Remove the return value from the agent_retry() function. It is not used anywhere and needs to be removed to run as a pthread. bug 3366
-
- 21 Jan, 2017 2 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Reasonable NFS systems do not need a minute to propagate changes.
-
- 20 Jan, 2017 1 commit
-
-
Brian Christiansen authored
If a lower-version client tried to communicate with a higher-version controller, the dbd would return the controller's version and the client would use that version to talk to the controller. When the controller responded, the client wouldn't know how to unpack the higher-version message.
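One common way to avoid this kind of mismatch is for the client to never speak a version newer than its own. The sketch below shows that general idea only; it is an assumption, not a description of what this commit changed, and `negotiate_version` is a hypothetical helper.

```python
def negotiate_version(client_version, controller_version):
    """Pick the protocol version both sides can pack and unpack."""
    # The client caps the reported controller version at its own version,
    # so replies arrive in a format it knows how to unpack.
    return min(client_version, controller_version)
```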
-