- 26 Jan, 2017 3 commits
-
-
Alejandro Sanchez authored
Bug 3431
-
Morris Jette authored
-
Alejandro Sanchez authored
bug 3433
-
- 25 Jan, 2017 10 commits
-
-
Morris Jette authored
burst_buffer/cray - Fix race condition that could cause multiple batch job launch requests resulting in downed nodes. bug 3366
-
Dominik Bartkiewicz authored
-
Danny Auble authored
This reverts commit b9bff82f.
-
Danny Auble authored
-
Morris Jette authored
It was leaking memory otherwise.
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the job if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory. Use --wrap to avoid creating a temporary job script as well.
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the step if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory. Use --wrap to avoid creating a temporary job script.
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the step if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying memory and, since only one step is checked for, run only a single step in the job. Use --wrap to avoid creating a temporary job script.
-
Tim Wickberg authored
-
Tim Wickberg authored
-
- 24 Jan, 2017 3 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
Some portions of tests 21.30 and 21.34 failed with accounting and priority basic. These changes disable portions of those tests as needed based upon configuration.
-
Morris Jette authored
test1.63 was failing periodically due to a race condition. A signal was being sent to srun before the signal handler thread was spawned.
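The commit message does not say how the race was closed. One common approach is to have the sender wait until the handler thread reports that it is running. The following is a minimal Python sketch of that idea, not the code used by srun or the test suite; `Target`, `send_when_ready`, and the "lost" list are hypothetical names invented for illustration.

```python
import threading


class Target:
    """Toy stand-in for a process whose signal handling lives in its own thread."""

    def __init__(self):
        self.handler_ready = threading.Event()  # set once the handler thread is up
        self.handled = []                       # signals seen by the handler
        self.lost = []                          # signals that arrived too early

    def start_handler(self):
        # Stands in for spawning the signal handler thread.
        self.handler_ready.set()

    def deliver(self, sig):
        # A signal that arrives before the handler exists is not handled.
        if self.handler_ready.is_set():
            self.handled.append(sig)
        else:
            self.lost.append(sig)


def send_when_ready(target, sig, timeout=5.0):
    """Block until the handler is ready, then deliver the signal."""
    if not target.handler_ready.wait(timeout):
        raise TimeoutError("handler thread never became ready")
    target.deliver(sig)


target = Target()
target.deliver("SIGINT")                      # sent too early: recorded as lost
starter = threading.Timer(0.05, target.start_handler)
starter.start()                               # handler comes up a little later
send_when_ready(target, "SIGINT")             # waits, then the signal is handled
starter.join()
```

After the run, `target.lost` holds the early signal and `target.handled` holds the one sent after the readiness check.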
-
- 23 Jan, 2017 9 commits
-
-
Morris Jette authored
Reset a job's memory limit based upon what's available after node reboot, which can change on a KNL if the MCDRAM mode is changed on reboot.
-
Morris Jette authored
This bug was likely the root cause of bug 3366. If the backfill scheduler allocates resources for a batch job and a node reboot is required, the batch launch RPC would be sent to the agent. At that point, there is a race condition between the agent and the job_time_limit() function testing for boot completion. If the job_time_limit() function ran first, it would trigger a second launch RPC request getting sent to the agent. bug 3366
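The race described above, two code paths each able to queue the same launch request, can be illustrated with a small guard that allows the request to be queued only once. This is a hedged Python sketch, not slurmctld's code: the `Job` class, the `launch_queued` marker, and `queue_batch_launch()` are assumptions made for illustration.

```python
import threading


class Job:
    """Hypothetical job record; not Slurm's job structure."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.lock = threading.Lock()
        self.launch_queued = False  # hypothetical "launch already requested" marker


def queue_batch_launch(job, sent_rpcs):
    """Queue a launch request at most once; return True if this call queued it."""
    with job.lock:
        if job.launch_queued:
            return False              # another path already queued the launch
        job.launch_queued = True
        sent_rpcs.append(job.job_id)  # stands in for handing the RPC to the agent
        return True


job = Job(42)
sent_rpcs = []
# Two callers stand in for the two code paths that raced.
callers = [
    threading.Thread(target=queue_batch_launch, args=(job, sent_rpcs))
    for _ in range(2)
]
for t in callers:
    t.start()
for t in callers:
    t.join()
print(len(sent_rpcs))  # 1
```

Whichever caller takes the lock first queues the request; the other sees the marker and returns without sending a second one.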
-
Morris Jette authored
Clean up logic to test if job is configuring. bug 3366
-
Morris Jette authored
Do not launch a batch step while the job is configuring. Previous logic checked for the PrologSlurmctld running, but not nodes booting. Checking the job's CONFIGURING state flag will validate both. bug 3366
-
Morris Jette authored
Add check to prevent step allocation logic from executing job configuration completion logic multiple times (check if job is configuring before clearing flag and resetting time limit). bug 3366
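The check described here amounts to making configuration completion idempotent: test the flag first, and only then clear it and reset the time limit. Below is a minimal Python sketch under stated assumptions; the bit value of `JOB_CONFIGURING`, the `Job` fields, and `finish_configuring()` are hypothetical and do not mirror Slurm's data structures.

```python
from dataclasses import dataclass

JOB_CONFIGURING = 0x1  # hypothetical bit value for this sketch


@dataclass
class Job:
    state: int        # bit flags
    time_limit: int   # seconds
    end_time: int = 0


def finish_configuring(job: Job, now: int) -> bool:
    """Run the completion logic only if the job is still configuring."""
    if not job.state & JOB_CONFIGURING:
        return False                     # already completed: do nothing
    job.state &= ~JOB_CONFIGURING        # clear the flag
    job.end_time = now + job.time_limit  # reset the time limit from "now"
    return True
```

A second call for the same job returns `False` and leaves `end_time` untouched, so a later step allocation cannot push the job's end time back again.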
-
Morris Jette authored
slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld daemon is running or node boot in progress. bug 3366
-
Morris Jette authored
This is required to manage the configuration completion. bug 3366
-
Morris Jette authored
This will be required to lock the job structure. bug 3366
-
Morris Jette authored
Remove the return value from the agent_retry() function. It is not used anywhere and needs to be removed to run as a pthread. bug 3366
-
- 21 Jan, 2017 2 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Reasonable NFS systems do not need a minute to propagate changes.
-
- 20 Jan, 2017 1 commit
-
-
Brian Christiansen authored
If a lower-version client tried to communicate with a higher-version controller, the dbd would return the controller's version and the client would use that version to talk to the controller. When the controller responded, the client wouldn't know how to unpack the higher-version message.
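The commit message describes the failure but not the fix. A usual way to avoid this class of problem is for the client to cap the version it uses at the highest one it can itself pack and unpack. The sketch below shows only that idea; `negotiate_version()` and the version numbers are illustrative assumptions, not Slurm's protocol constants or API.

```python
def negotiate_version(client_max: int, peer_version: int) -> int:
    """Pick a protocol version no higher than the client can pack and unpack."""
    return min(client_max, peer_version)


# An older client (max 30) talking to a newer controller (31) stays at 30.
print(negotiate_version(30, 31))  # 30
# A newer client (max 31) talking to an older controller (30) drops to 30.
print(negotiate_version(31, 30))  # 30
```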
-
- 19 Jan, 2017 4 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
'assoc_limit_stop'.
-
Danny Auble authored
-
Danny Auble authored
condition later when looking at a step's env. Bug 3394
-
- 18 Jan, 2017 4 commits
-
-
Danny Auble authored
Bug 3398
-
Danny Auble authored
-
Morris Jette authored
bug 3399
-
Morris Jette authored
bug 3099
-
- 17 Jan, 2017 4 commits
-
-
Danny Auble authored
This reverts commit e92b49d3.
-
Tim Shaw authored
No functional change.
-
Dominik Bartkiewicz authored
instead of also in the backfill scheduler.
-
Josh Samuelson authored
Bug 3405.
-