- 17 May, 2017 40 commits

- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  Running jobs can happen out of order.
- Brian Christiansen authored
- Brian Christiansen authored
  The original list was being freed before the copy was.
- Isaac Hartung authored
  sbatch|srun --test-only Bug 3740
- Isaac Hartung authored
  Bug 3802
- Isaac Hartung authored
  Bug 3641
- Isaac Hartung authored
  and steps Bug 3700
- Isaac Hartung authored
  Bug 3699
- Isaac Hartung authored
  Bug 3698
- Isaac Hartung authored
  Bug 3697
- Isaac Hartung authored
  Bug 3667
- Isaac Hartung authored
  to validate scontrol --local and --sibling options Bug 3662
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  Only used in fed_mgr.c
- Brian Christiansen authored
  This will make it easier to add new proto types without having to modify protocol_defs.[ch]. job_lock and job_unlock are left to be handled by slurmctld_req since they aren't a "queued" type.
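
As a rough sketch of the idea, a dispatch table kept local to fed_mgr.c could look like the following; the enum values and handler names are invented for illustration and are not the actual Slurm symbols:

```c
/* Hypothetical illustration only: a new queued type is added by
 * extending the enum and the table, without touching protocol_defs.[ch]. */
typedef enum {
	FED_Q_JOB_SUBMIT,
	FED_Q_JOB_COMPLETE,
	FED_Q_JOB_CANCEL,
	FED_Q_TYPE_CNT
} fed_q_type_t;

typedef int (*fed_q_handler_t)(void *msg);

static int _handle_job_submit(void *msg)   { (void) msg; return 0; }
static int _handle_job_complete(void *msg) { (void) msg; return 0; }
static int _handle_job_cancel(void *msg)   { (void) msg; return 0; }

static const fed_q_handler_t fed_q_handlers[FED_Q_TYPE_CNT] = {
	[FED_Q_JOB_SUBMIT]   = _handle_job_submit,
	[FED_Q_JOB_COMPLETE] = _handle_job_complete,
	[FED_Q_JOB_CANCEL]   = _handle_job_cancel,
};

static int _dispatch_queued_msg(fed_q_type_t type, void *msg)
{
	return fed_q_handlers[type](msg);
}
```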
- Brian Christiansen authored
  Prevent a possible memory leak.
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  All handled in _proc_multi_msg except for sib_job_[un]lock.
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  This prevents deadlocks when fed_job_list_mutex is locked higher up and job_completion_logger would otherwise be called while that mutex is still held.
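
A self-contained sketch of that lock-ordering pattern, with stand-in names for the list bookkeeping and the completion logger (only the mutex name is taken from the commit message):

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in types and helpers; the real slurmctld structures differ. */
struct fed_job { int job_id; };

static pthread_mutex_t fed_job_list_mutex = PTHREAD_MUTEX_INITIALIZER;

static void _remove_from_fed_job_list(struct fed_job *job) { (void) job; }
static void _completion_logger(struct fed_job *job, bool requeue)
{
	(void) job; (void) requeue;
}

static void _complete_fed_job(struct fed_job *job)
{
	/* Do the bookkeeping that needs fed_job_list_mutex... */
	pthread_mutex_lock(&fed_job_list_mutex);
	_remove_from_fed_job_list(job);
	pthread_mutex_unlock(&fed_job_list_mutex);

	/* ...and only call the completion logger after the mutex is
	 * released, so it is never entered with the list still locked. */
	_completion_logger(job, false);
}
```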
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  Don't need to have the fed_write_lock when destroying the persist_conn_server.
- Brian Christiansen authored
- Brian Christiansen authored
  Since federated submissions are now asynchronous and the working_cluster_rec can be used by multiple threads, it's better to have the federated will_runs in the client. This prevents the deadlocks and blocked persistent connections that could happen in the previous model.
- Brian Christiansen authored
- Brian Christiansen authored
  With the change to the asynchronous model, it's better to have the cluster always get the lock from the origin cluster. Previously, the origin cluster would try to pick the one cluster that could start the job the soonest, so the scenario with only one sibling was more common. Now that sibling jobs are sent to all clusters, this is less common.
- Brian Christiansen authored
  Queue up the fed job completions.
- Brian Christiansen authored
  Federated submissions now happen asynchronously. Sibling jobs are submitted to the sibling cluster, which queues up the request to be handled later, when it can get the job write lock. The sibling cluster then submits the job and sends a message back to the origin cluster, which is queued up as well. If the submission failed, the sibling cluster is removed from the job's active siblings.
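
A rough, self-contained sketch of that queue-and-drain pattern with invented types; the real slurmctld steps are only paraphrased in the comments:

```c
#include <pthread.h>
#include <stdlib.h>

/* Invented request type for illustration. */
typedef struct fed_req {
	int             job_id;
	struct fed_req *next;
} fed_req_t;

static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond  = PTHREAD_COND_INITIALIZER;
static fed_req_t      *queue_head;

/* Called from the persistent connection thread: just enqueue and
 * return, so the connection is never held up waiting for job locks. */
static void _enqueue_sib_request(fed_req_t *req)
{
	pthread_mutex_lock(&queue_mutex);
	req->next  = queue_head;
	queue_head = req;
	pthread_cond_signal(&queue_cond);
	pthread_mutex_unlock(&queue_mutex);
}

/* Worker thread: drains the queue when it can get the job write lock. */
static void *_sib_request_worker(void *arg)
{
	(void) arg;
	while (1) {
		pthread_mutex_lock(&queue_mutex);
		while (!queue_head)
			pthread_cond_wait(&queue_cond, &queue_mutex);
		fed_req_t *req = queue_head;
		queue_head = req->next;
		pthread_mutex_unlock(&queue_mutex);

		/* With the job write lock held: submit the sibling job,
		 * then queue a response back to the origin cluster; on
		 * failure the origin removes this cluster from the job's
		 * active siblings. */
		free(req);
	}
	return NULL;
}
```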
- Brian Christiansen authored
  The problem was that the origin cluster had to get the internal job write lock to test and set the fed cluster lock. This would hold up the persistent connection and lead to a deadlock. The solution is to create a separate table for tracking the federated job and the cluster lock, controlled by its own lock. In addition, all communication on the persistent connection must be quick, so any communication that needs to modify the actual job is put onto a queue for the scheduler to handle later, keeping the persistent connection from being held up. The response is sent back when the request is processed. This moves communication between clusters in a federation to an asynchronous model.
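
A sketch of what a separate federated-job table with its own lock could look like; the fixed-size array and every name below are illustrative assumptions, not the actual implementation:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define FED_TABLE_SIZE 1024   /* arbitrary size for the example */

/* Per-job federation state, tracked outside the main job record. */
struct fed_job_entry {
	bool     used;
	uint32_t job_id;
	uint32_t cluster_lock;   /* cluster id holding the lock, 0 = free */
};

static pthread_mutex_t fed_table_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct fed_job_entry fed_table[FED_TABLE_SIZE];

/* Test-and-set of the per-job cluster lock, done entirely under the
 * table's own mutex: no internal job write lock is required, so the
 * persistent connection can be answered immediately. */
static bool _fed_job_lock(uint32_t job_id, uint32_t cluster_id)
{
	bool locked = false;

	pthread_mutex_lock(&fed_table_mutex);
	for (int i = 0; i < FED_TABLE_SIZE; i++) {
		if (!fed_table[i].used || fed_table[i].job_id != job_id)
			continue;
		if (!fed_table[i].cluster_lock ||
		    fed_table[i].cluster_lock == cluster_id) {
			fed_table[i].cluster_lock = cluster_id;
			locked = true;
		}
		break;
	}
	pthread_mutex_unlock(&fed_table_mutex);
	return locked;
}
```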