- 17 May, 2017 13 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
With the change to the asynchronous model, it's better to have a cluster always get the lock from the origin cluster. Previously, the origin cluster would try to pick the one cluster that could start the job the soonest, so the scenario where there would be only one sibling was more common. Now that sibling jobs are sent to all clusters, this is less common.
-
Brian Christiansen authored
Queue up the fed job completions.
-
Brian Christiansen authored
Federated submissions now happen asynchronously. Sibling jobs are submitted to the sibling cluster. The sibling cluster queues up the request to be handled later when it can get the job write lock. The sibling cluster submits the job and sends a message back to the origin cluster, which is queued up as well. If the submission failed, then the sibling cluster is removed from the job's active siblings.
-
Brian Christiansen authored
The problem was that the origin cluster had to get the internal job write lock to test and set the fed cluster lock. This would hold up the persistent connection and lead to a deadlock. The solution is to create a separate table for tracking the federated job and the cluster lock, which is controlled by a separate lock. In addition, all communication on the persistent connection must be quick. Thus all communications that need to modify the actual job are put onto a queue for the scheduler to handle later, so that the persistent connection isn't held up. The response is sent back when the request is processed. This moves to an asynchronous model for communications between clusters in a federation.
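The queueing idea described above can be illustrated with a minimal Python sketch. This is not the Slurm implementation (which is written in C); every name here (`job_write_lock`, `pending_requests`, `on_persist_message`, `scheduler_drain`) is hypothetical and chosen only for illustration. The point it shows is that the connection handler never takes the job write lock, and the reply is produced only after the scheduler has processed the request.

```python
import queue
import threading

# Hypothetical names throughout; none of these are Slurm identifiers.
job_write_lock = threading.Lock()
pending_requests = queue.Queue()
jobs = {}        # job_id -> state, modified only under job_write_lock
responses = []   # replies the scheduler would send back to the requester


def on_persist_message(request):
    """Connection handler: enqueue only, never take the job write lock."""
    pending_requests.put(request)


def scheduler_drain():
    """Scheduler pass: apply queued requests under the job write lock."""
    while True:
        try:
            request = pending_requests.get_nowait()
        except queue.Empty:
            break
        with job_write_lock:
            jobs[request["job_id"]] = request["new_state"]
        # The reply goes out only once the request has been processed.
        responses.append((request["job_id"], "ok"))


on_persist_message({"job_id": 101, "new_state": "PENDING"})
on_persist_message({"job_id": 102, "new_state": "CANCELLED"})
scheduler_drain()
```

Because the handler returns as soon as the request is queued, a slow or contended job lock delays only the scheduler pass, not the connection itself.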
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Prevents a second call to the database. This could happen when the origin job is cancelled and the sibling jobs report back that the job is gone as well.
-
Brian Christiansen authored
-
Brian Christiansen authored
This mimics how cancelled jobs are handled. The database will show that the job start_time is 0, but in the controller the start time will be the same as the end time. sacct will set the start time to the end time if there is an end time and the start time is 0.
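The display rule described above can be written as a small Python sketch. This is an illustration of the stated rule only, not the sacct source; the function name `displayed_start_time` is hypothetical, and times are assumed to be integer epoch seconds with 0 meaning "unset".

```python
def displayed_start_time(start_time, end_time):
    """Return the start time to display for a job record.

    Assumption: 0 means "unset". If the job has an end time but no start
    time, the end time is shown as the start time.
    """
    if start_time == 0 and end_time != 0:
        return end_time
    return start_time
```

For example, a record with a start time of 0 and an end time of 1494979200 would be displayed with a start time of 1494979200, while a record with both values set is left unchanged.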
-
Brian Christiansen authored
-
Brian Christiansen authored
-
- 16 May, 2017 5 commits
-
-
Tim Shaw authored
bug 805
-
Brian Christiansen authored
To be able to set a default federated view for all status commands.
-
Brian Christiansen authored
even if --sibling is specified.
-
Brian Christiansen authored
to show federated view. sacct, scontrol, sinfo, sprio, squeue, sreport
-
Brian Christiansen authored
-
- 15 May, 2017 2 commits
-
-
Brian Christiansen authored
Show a tab in the cluster combo box to select a federated view for a given federation.
-
Morris Jette authored
-
- 13 May, 2017 3 commits
-
-
Morris Jette authored
-
Isaac Hartung authored
Bug 3695
-
Morris Jette authored
bug 3779
-
- 12 May, 2017 4 commits
-
-
Morris Jette authored
If capmc reports a node name, but not mcdram_cfg for the node, then log the missing data rather than assume the value is zero and report a value mismatch with cnselect.
-
Alejandro Sanchez authored
When requesting an operation on jobs, where the operation permits specifying more than one job in the same request, and a job array appears before a regular (non-array) job in the list of jobs to operate on, the job_array_resp_msg_t pointer was not properly NULL'ed and thus incorrectly accessed when processing the non-array job. This fix prevents the crash from happening in the following scontrol operations: uhold, hold, suspend, requeue, requeuehold, update, release when the same request has <array_jobid>,<non-array_jobid> in this order in the job list to process. Bug 3759
-
Morris Jette authored
Job expansion example in FAQ enhanced to demonstrate operation in heterogeneous environments. bug 2979
-
Alejandro Sanchez authored
Do not attempt to schedule jobs after changing the power cap if there are already many active threads.
-
- 11 May, 2017 2 commits
-
-
Danny Auble authored
# Conflicts: # META # NEWS
-
Danny Auble authored
-
- 10 May, 2017 3 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
Bug 3760
-
Danny Auble authored
didn't work at all. Bug 3712.
-
- 09 May, 2017 8 commits
-
-
Danny Auble authored
This reverts commit ecfd007f.
-
Dominik Bartkiewicz authored
-
Brian Christiansen authored
Continuation of 9a1370e3 CID 168995
-
Danny Auble authored
It was noticed that while doing any update to a job the admin comment would be blown away. This patch fixes that.
-
Dominik Bartkiewicz authored
Bug 3789
-
Danny Auble authored
run multiple tasks on multiple nodes. Changing the max nodes setting from 3 to 6 fixes the issue without apparent compromise to the test.
-
Danny Auble authored
destroying a mutex.
-
Brian Christiansen authored
When running sacct from a federated client, the db returns jobs for each cluster with duplicate jobs removed on each cluster. A federated job could have run on a different cluster before the job IDs rolled over. This patch filters out the older federated jobs and leaves the newest ones. Reverted d31965 which was too slow.
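The filtering step described above can be sketched in Python as a single pass that keeps one record per job ID. This is only an illustration: the commit message does not say which field decides which record is "newest", so the `submit_time` key, the record layout, and the function name `keep_newest_per_job_id` are all assumptions.

```python
def keep_newest_per_job_id(records):
    """Keep only the newest record for each job ID.

    Assumption: each record is a dict with "job_id", "cluster" and
    "submit_time" keys, and a larger "submit_time" means a newer job.
    """
    newest = {}
    for rec in records:
        current = newest.get(rec["job_id"])
        if current is None or rec["submit_time"] > current["submit_time"]:
            newest[rec["job_id"]] = rec
    # Sort by job ID so the result order does not depend on input order.
    return sorted(newest.values(), key=lambda r: r["job_id"])


records = [
    {"job_id": 7, "cluster": "alpha", "submit_time": 1000},
    {"job_id": 7, "cluster": "beta", "submit_time": 5000},
    {"job_id": 9, "cluster": "alpha", "submit_time": 2000},
]
filtered = keep_newest_per_job_id(records)
```

A single dictionary pass like this touches each record once, which is one way to avoid the cost of comparing every record against every other.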
-