- 17 May, 2017 20 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
This prevents deadlocks that occurred when the fed_job_list_mutex was locked higher up and job_completion_logger was called while the mutex was still held.
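The fix follows a common pattern: copy whatever the callback needs while the mutex is held, release the mutex, then invoke the callback. A minimal illustrative sketch in Python (Slurm itself is C; the names mirror the message but the code is hypothetical):

```python
# Sketch: avoid calling a callback that may take other locks
# while fed_job_list_mutex is held.
import threading

fed_job_list_mutex = threading.Lock()
fed_job_list = {1234: {"state": "COMPLETED"}}

def job_completion_logger(job):
    # May acquire other locks internally; must not run under
    # fed_job_list_mutex, or lock ordering can deadlock.
    return f"logged job {job['id']} state={job['state']}"

def complete_fed_job(job_id):
    with fed_job_list_mutex:
        # Copy what the logger needs while the mutex is held ...
        job = dict(fed_job_list[job_id], id=job_id)
    # ... then call the logger after the mutex is released.
    return job_completion_logger(job)
```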
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Don't need to have the fed_write_lock when destroying the persist_conn_server.
-
Brian Christiansen authored
-
Brian Christiansen authored
Since federated submissions are now asynchronous, and because the working_cluster_rec can be accessed from multiple threads, it's better to perform the federated will_run requests in the client. This prevents the deadlocks and stalled persistent connections that could happen in the previous model.
-
Brian Christiansen authored
-
Brian Christiansen authored
With the change to the asynchronous model, it's better to have each cluster always request the lock from the origin cluster. Previously, the origin cluster would try to pick the one cluster that could start the job the soonest, so the scenario with only one sibling was more common. Now that sibling jobs are sent to all clusters, that case is less common.
-
Brian Christiansen authored
Queue up the fed job completions.
-
Brian Christiansen authored
Federated submissions now happen asynchronously. Sibling jobs are submitted to the sibling clusters. A sibling cluster queues up the request to be handled later, when it can get the job write lock. The sibling cluster then submits the job and sends a message back to the origin cluster, which is queued up as well. If the submission failed, the sibling cluster is removed from the job's active siblings.
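The request/reply queues described above can be sketched as follows. This is an illustrative Python analogue, not Slurm source (which is C); the function and queue names are hypothetical:

```python
# Minimal sketch of the asynchronous federated-submission flow.
import queue

sib_requests = queue.Queue()    # submissions queued on the sibling cluster
origin_replies = queue.Queue()  # results queued back on the origin cluster

def sibling_worker(submit_fn):
    """Drain queued submissions once the job write lock is available."""
    while not sib_requests.empty():
        job_id = sib_requests.get()
        ok = submit_fn(job_id)            # the actual job submission
        origin_replies.put((job_id, ok))  # reply is queued, not sent inline

def origin_process_replies(active_siblings, sibling):
    """On the origin: drop a sibling whose submission failed."""
    while not origin_replies.empty():
        job_id, ok = origin_replies.get()
        if not ok:
            active_siblings[job_id].discard(sibling)
    return active_siblings
```

Because both sides only enqueue work, neither end blocks the persistent connection while waiting for locks.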
-
Brian Christiansen authored
The problem was that the origin cluster had to take the internal job write lock to test and set the fed cluster lock. This would hold up the persistent connection and lead to a deadlock. The solution is to create a separate table for tracking the federated jobs and the cluster lock, controlled by its own separate lock. In addition, all communication on the persistent connection must be quick, so any request that needs to modify the actual job is put onto a queue for the scheduler to handle later, keeping the persistent connection from being held up. The response is sent back when the request is processed. This moves communications between clusters in a federation to an asynchronous model.
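The key idea is that the fed cluster lock can be tested and set under its own small lock, without ever touching the internal job write lock. A hedged Python sketch (Slurm is C; the table and function names here are hypothetical):

```python
# Sketch: a federated-job table guarded by its own lock, so the
# persistent-connection handler never needs the job write lock.
import threading

fed_job_lock = threading.Lock()  # separate from the job write lock
fed_job_table = {}               # job_id -> cluster holding the fed lock

def fed_cluster_lock_set(job_id, cluster):
    """Test-and-set the federated cluster lock; quick enough to run
    directly in the persistent-connection handler."""
    with fed_job_lock:
        holder = fed_job_table.get(job_id)
        if holder is None:
            fed_job_table[job_id] = cluster
            return True
        return holder == cluster  # already held by this cluster
```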
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Prevents a second call to the database. This could happen when the origin job is cancelled and the sibling jobs report back that the job is gone as well.
-
Brian Christiansen authored
-
Brian Christiansen authored
Mimicking how cancelled jobs are handled. The database will show that the job start_time is 0, but in the controller the start time will be the same as the end time. sacct will set the start time to the end time if there is an end time and the start time is 0.
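The sacct display rule stated above is small enough to sketch directly (illustrative Python, not sacct's actual C code):

```python
# Sketch: if the stored start time is 0 and an end time exists,
# sacct displays the end time as the start time.
def displayed_start_time(start_time, end_time):
    if start_time == 0 and end_time:
        return end_time
    return start_time
```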
-
Brian Christiansen authored
-
Brian Christiansen authored
-
- 16 May, 2017 5 commits
-
-
Tim Shaw authored
bug 805
-
Brian Christiansen authored
To be able to set a default federated view for all status commands.
-
Brian Christiansen authored
even if --sibling is specified.
-
Brian Christiansen authored
to show a federated view: sacct, scontrol, sinfo, sprio, squeue, sreport
-
Brian Christiansen authored
-
- 15 May, 2017 2 commits
-
-
Brian Christiansen authored
Show a tab in the cluster combo box to select a federated view for a given federation.
-
Morris Jette authored
-
- 13 May, 2017 3 commits
-
-
Morris Jette authored
-
Isaac Hartung authored
Bug 3695
-
Morris Jette authored
bug 3779
-
- 12 May, 2017 4 commits
-
-
Morris Jette authored
If capmc reports a node name, but not mcdram_cfg for the node, then log the missing data rather than assume the value is zero and report a value mismatch with cnselect.
-
Alejandro Sanchez authored
When requesting an operation on jobs, where the operation permits specifying more than one job in the same request, and a job array appears before a regular (non-array) job in the list of jobs, the job_array_resp_msg_t pointer was not properly NULL'ed and was thus incorrectly accessed when processing the non-array job. This fix prevents the crash in the following scontrol operations: uhold, hold, suspend, requeue, requeuehold, update, and release, when the same request has <array_jobid>,<non-array_jobid> in this order in the job list. Bug 3759
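The bug class is a stale per-iteration response left over from the previous job. A Python analogue of the fix (the real code is C, where the pointer must be set to NULL; here the stand-in object is reset to None each iteration):

```python
# Sketch: reset the per-job array response every iteration, or a
# stale response from a preceding job array is wrongly consulted
# for the non-array job that follows it.
def operate_on_jobs(job_list, is_array):
    results = []
    for job_id in job_list:
        array_resp = None  # the fix: clear the response each iteration
        if is_array(job_id):
            array_resp = {"array_job": job_id}  # stand-in for the msg
        if array_resp is not None:
            results.append(("array", job_id))
        else:
            results.append(("plain", job_id))
    return results
```

Without the reset, the second iteration would still see the first job's response and treat the plain job as an array job.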
-
Morris Jette authored
Job expansion example in FAQ enhanced to demonstrate operation in heterogeneous environments. bug 2979
-
Alejandro Sanchez authored
Do not attempt to schedule jobs after changing the power cap if there are already many active threads.
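The guard described above amounts to skipping the scheduling attempt when the active thread count is already high. A hedged sketch (illustrative Python; the threshold and names are hypothetical stand-ins, not Slurm's actual values):

```python
# Sketch: after a power-cap change, only kick off scheduling if the
# controller is not already saturated with active threads.
MAX_ACTIVE_THREADS = 10  # hypothetical threshold

def maybe_schedule_after_powercap(active_thread_count, schedule_fn):
    if active_thread_count >= MAX_ACTIVE_THREADS:
        return False  # defer; a later scheduling cycle will run anyway
    schedule_fn()
    return True
```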
-
- 11 May, 2017 2 commits
-
-
Danny Auble authored
# Conflicts:
#	META
#	NEWS
-
Danny Auble authored
-
- 10 May, 2017 3 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
Bug 3760
-
Danny Auble authored
didn't work at all. Bug 3712.
-
- 09 May, 2017 1 commit
-
-
Danny Auble authored
This reverts commit ecfd007f.
-