- 25 Jan, 2017 6 commits
-
-
Morris Jette authored
It was leaking memory otherwise
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the job if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory. Use --wrap to avoid creating a temporary job script as well.
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the step if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory. Use --wrap to avoid creating a temporary job script.
-
Tim Wickberg authored
Commit 63b7e3a8 changed the --mem limit to 1MB for the step if not using a memory SelectType, but this can cause the job to fail if the JobAcctGatherFrequency is frequent enough to notice that the "sleep" command is using more than 1MB of resources. Refactor test to avoid specifying job memory; and since only one step is checked for, only run a single step in the job. Use --wrap to avoid creating a temporary job script.
-
Tim Wickberg authored
-
Tim Wickberg authored
-
- 24 Jan, 2017 3 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
Some portions of tests 21.30 and 21.34 failed with accounting and priority basic. These changes disable portions of those tests as needed based upon configuration.
-
Morris Jette authored
test1.63 was failing periodically due to a race condition. A signal was being sent to srun before the signal handler thread was spawned.
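A race of this kind is commonly closed with a readiness handshake: the sender waits until the handler side announces it is listening. The sketch below only illustrates that idea and is not the change made to test1.63 or srun; the `Target` class, its `send` method, and `send_when_ready` are hypothetical names, and the "signal" is modelled as a plain method call.

```python
import threading


class Target:
    """Hypothetical stand-in for a process whose handler starts in a thread."""

    def __init__(self):
        self.handler = None              # no handler installed yet
        self.ready = threading.Event()   # set once the handler is installed
        self.received = []

    def start(self):
        # Spawn the thread that installs the handler.
        thread = threading.Thread(target=self._install_handler)
        thread.start()
        return thread

    def _install_handler(self):
        self.handler = self.received.append
        self.ready.set()                 # announce that signals are now handled

    def send(self, signame):
        # A signal that arrives before the handler exists is lost.
        if self.handler is None:
            return False
        self.handler(signame)
        return True


def send_when_ready(target, signame, timeout=5.0):
    """Block until the handler is installed, then deliver the signal."""
    if not target.ready.wait(timeout):
        raise TimeoutError("handler thread never became ready")
    return target.send(signame)
```

Calling `target.send()` before `target.start()` returns `False`, which corresponds to the lost signal; `send_when_ready()` removes the dependence on thread scheduling.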
-
- 23 Jan, 2017 9 commits
-
-
Morris Jette authored
Reset a job's memory limit based upon what's available after node reboot, which can change on a KNL if the MCDRAM mode is changed on reboot.
-
Morris Jette authored
This bug was likely the root cause of bug 3366. If the backfill scheduler allocates resources for a batch job and a node reboot is required, the batch launch RPC would be sent to the agent. At that point, there is a race condition between the agent and the job_time_limit() function testing for boot completion. If the job_time_limit() function ran first, it would trigger a second launch RPC request getting sent to the agent. bug 3366
-
Morris Jette authored
Clean up logic to test if job is configuring. bug 3366
-
Morris Jette authored
Do not launch a batch step while the job is configuring. Previous logic checked for the PrologSlurmctld running, but not nodes booting. Checking the job's CONFIGURING state flag will validate both. bug 3366
-
Morris Jette authored
Add check to prevent step allocation logic from executing job configuration completion logic multiple times (check if job is configuring before clearing flag and resetting time limit). bug 3366
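The check described above can be sketched as a flag test done under a lock, so the completion work runs at most once however many callers reach it. This is a minimal illustration, not the slurmctld code: the `Job` class, the `CONFIGURING` bit value, the `resets` counter, and `finish_configuration` are all assumptions made for the example.

```python
import threading

CONFIGURING = 0x1  # hypothetical flag bit, not Slurm's actual value


class Job:
    """Hypothetical job record holding only what the sketch needs."""

    def __init__(self):
        self.state_flags = CONFIGURING
        self.resets = 0                  # counts time-limit resets
        self.lock = threading.Lock()


def finish_configuration(job):
    """Run completion logic once; return True only for the call that did it."""
    with job.lock:
        if not (job.state_flags & CONFIGURING):
            return False                 # already completed by another caller
        job.state_flags &= ~CONFIGURING  # clear the flag
        job.resets += 1                  # stands in for resetting the time limit
        return True
```

A second call on the same job returns `False` and leaves `resets` at 1.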
-
Morris Jette authored
slurmctld/agent race condition fix: Prevent job launch while PrologSlurmctld daemon is running or node boot in progress. bug 3366
-
Morris Jette authored
This is required to manage the configuration completion. bug 3366
-
Morris Jette authored
This will be required to lock the job structure. bug 3366
-
Morris Jette authored
Remove the return value from the agent_retry() function. It is not used anywhere and needs to be removed to run as a pthread. bug 3366
-
- 21 Jan, 2017 2 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Reasonable NFS systems do not need a minute to propagate changes.
-
- 20 Jan, 2017 1 commit
-
-
Brian Christiansen authored
If a lower-version client tried to communicate with a higher-version controller, the dbd would return the controller's version and the client would use that version to talk to the controller. When the controller responded, the client would not know how to unpack the higher-version message.
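The commit message does not say how the fix was implemented. One common approach to this class of problem is for the client to cap the advertised version at the highest one it can itself unpack. The sketch below shows only that generic idea; the function name and the small integer version numbers are hypothetical and are not Slurm protocol values.

```python
def negotiate_version(client_max, peer_version):
    """Pick the protocol version to use with a peer.

    client_max:   highest version this client can pack and unpack
    peer_version: version the peer (or a directory service) advertised
    """
    # Never adopt a version newer than the client itself understands.
    return min(client_max, peer_version)
```

With `client_max=2` and `peer_version=3` the client keeps speaking version 2, so the newer controller can reply in a format the client is able to unpack.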
-
- 19 Jan, 2017 4 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
'assoc_limit_stop'.
-
Danny Auble authored
-
Danny Auble authored
condition later when looking at a step's env. Bug 3394
-
- 18 Jan, 2017 4 commits
-
-
Danny Auble authored
Bug 3398
-
Danny Auble authored
-
Morris Jette authored
bug 3399
-
Morris Jette authored
bug 3099
-
- 17 Jan, 2017 5 commits
-
-
Danny Auble authored
This reverts commit e92b49d3.
-
Tim Shaw authored
No functional change.
-
Dominik Bartkiewicz authored
instead of also in the backfill scheduler.
-
Josh Samuelson authored
Bug 3405.
-
Josh Samuelson authored
acct_policy_job_runnable_pre_select() calls assoc_mgr_set_qos_tres_cnt() without tres READ_LOCK. Note that existing code does not modify the tres structures, so this cannot currently lead to a race condition. Bug 3406.
-
- 15 Jan, 2017 1 commit
-
-
Michael Robbert authored
job_submit/cnode was previously removed by commit 63bc71ed. Bug 3403.
-
- 12 Jan, 2017 4 commits
-
-
Isaac Hartung authored
Bug 3395
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray - Avoid "pre_run" operation if not using buffer (i.e. just creating or deleting a persistent burst buffer). bug 3391
-
Morris Jette authored
Previous job state information was "PENDING" rather than "REQUEUED" for each job requeued due to a burst buffer error. bug 3388
-
- 11 Jan, 2017 1 commit
-
-
Danny Auble authored
scheduling a Datawarp job. The assoc_mgr lock needs to happen before the bb_state.bb_mutex. One place this could cause deadlock is from src/slurmctld/controller.c _accounting_cluster_ready(), which calls clusteracct_storage_g_cluster_tres, which in turn calls bb_g_job_set_tres_cnt, which calls bb_p_job_set_tres_cnt, which will lock the bb_mutex after the assoc_mgr is already locked. Bug 3389
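The ordering rule stated above (assoc_mgr lock first, then the burst buffer mutex) is an instance of a fixed lock hierarchy: if every code path takes the two locks in the same order, no pair of threads can each hold one lock while waiting for the other. The sketch below illustrates the rule with two Python locks named after the ones in the message. It is not the Slurm code; `update_tres_cnt`, `run_workers`, and the counter are hypothetical.

```python
import threading

assoc_mgr_lock = threading.Lock()  # stands in for the assoc_mgr lock
bb_mutex = threading.Lock()        # stands in for bb_state.bb_mutex
counter = {"updates": 0}


def update_tres_cnt():
    # Always take assoc_mgr_lock first, then bb_mutex, on every code path.
    with assoc_mgr_lock:
        with bb_mutex:
            counter["updates"] += 1


def run_workers(n_threads=4, iterations=500):
    """Run concurrent updaters; return True if all threads finished."""
    def worker():
        for _ in range(iterations):
            update_tres_cnt()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return all(not t.is_alive() for t in threads)
```

If a path took `bb_mutex` first and then `assoc_mgr_lock`, it could interleave with `update_tres_cnt()` so that each thread holds one lock and waits forever for the other.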
-