- 17 May, 2017 40 commits

- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  Running jobs can happen out of order.
- Brian Christiansen authored
- Brian Christiansen authored
  The original list was being freed before the copy was.
- Isaac Hartung authored
  sbatch|srun --test-only Bug 3740
- Isaac Hartung authored
  Bug 3802
- Isaac Hartung authored
  Bug 3641
- Isaac Hartung authored
  and steps Bug 3700
- Isaac Hartung authored
  Bug 3699
- Isaac Hartung authored
  Bug 3698
- Isaac Hartung authored
  Bug 3697
- Isaac Hartung authored
  Bug 3667
- Isaac Hartung authored
  to validate scontrol --local and --sibling options Bug 3662
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  Only used in fed_mgr.c
- Brian Christiansen authored
  This will make it easier to add new proto types without having to modify protocol_defs.[ch]. job_lock and job_unlock are left to be handled by slurmctld_req since they aren't a "queued" type.
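
As a rough sketch of the idea, a dispatch table kept local to fed_mgr.c could look like the following; the enum values and handler names are invented for illustration and are not the actual Slurm symbols:

```c
/* Hypothetical illustration only: a new queued type is added by
 * extending the enum and the table, without touching protocol_defs.[ch]. */
typedef enum {
	FED_Q_JOB_SUBMIT,
	FED_Q_JOB_COMPLETE,
	FED_Q_JOB_CANCEL,
	FED_Q_TYPE_CNT
} fed_q_type_t;

typedef int (*fed_q_handler_t)(void *msg);

static int _handle_job_submit(void *msg)   { (void) msg; return 0; }
static int _handle_job_complete(void *msg) { (void) msg; return 0; }
static int _handle_job_cancel(void *msg)   { (void) msg; return 0; }

static const fed_q_handler_t fed_q_handlers[FED_Q_TYPE_CNT] = {
	[FED_Q_JOB_SUBMIT]   = _handle_job_submit,
	[FED_Q_JOB_COMPLETE] = _handle_job_complete,
	[FED_Q_JOB_CANCEL]   = _handle_job_cancel,
};

static int _dispatch_queued_msg(fed_q_type_t type, void *msg)
{
	return fed_q_handlers[type](msg);
}
```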
- Brian Christiansen authored
  Prevent a possible memory leak.
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  All handled in _proc_multi_msg except for sib_job_[un]lock.
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  This prevents deadlocks when fed_job_list_mutex is locked higher up and job_completion_logger would otherwise be called while that mutex is still held.
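
A self-contained sketch of that lock-ordering pattern, with stand-in names for the list bookkeeping and the completion logger (only the mutex name is taken from the commit message):

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in types and helpers; the real slurmctld structures differ. */
struct fed_job { int job_id; };

static pthread_mutex_t fed_job_list_mutex = PTHREAD_MUTEX_INITIALIZER;

static void _remove_from_fed_job_list(struct fed_job *job) { (void) job; }
static void _completion_logger(struct fed_job *job, bool requeue)
{
	(void) job; (void) requeue;
}

static void _complete_fed_job(struct fed_job *job)
{
	/* Do the bookkeeping that needs fed_job_list_mutex... */
	pthread_mutex_lock(&fed_job_list_mutex);
	_remove_from_fed_job_list(job);
	pthread_mutex_unlock(&fed_job_list_mutex);

	/* ...and only call the completion logger after the mutex is
	 * released, so it is never entered with the list still locked. */
	_completion_logger(job, false);
}
```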
- Brian Christiansen authored
- Brian Christiansen authored
- Brian Christiansen authored
  Don't need to have the fed_write_lock when destroying the persist_conn_server.
- Brian Christiansen authored
- Brian Christiansen authored
  Since federated submissions are now asynchronous and the working_cluster_rec can be used by multiple threads, it's better to have the federated will_runs in the client. This prevents the deadlocks and blocked persistent connections that could happen in the previous model.
- Brian Christiansen authored
- Brian Christiansen authored
  With the change to the asynchronous model, it's better to have the cluster always get the lock from the origin cluster. Previously, the origin cluster would try to pick the one cluster that could start the job the soonest, so the scenario with only one sibling was more common. Now that sibling jobs are sent to all clusters, this is less common.
- Brian Christiansen authored
  Queue up the fed job completions.
- Brian Christiansen authored
  Federated submissions now happen asynchronously. Sibling jobs are submitted to the sibling cluster, which queues up the request to be handled later, when it can get the job write lock. The sibling cluster then submits the job and sends a message back to the origin cluster, which is queued up as well. If the submission failed, the sibling cluster is removed from the job's active siblings.
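
A rough, self-contained sketch of that queue-and-drain pattern with invented types; the real slurmctld steps are only paraphrased in the comments:

```c
#include <pthread.h>
#include <stdlib.h>

/* Invented request type for illustration. */
typedef struct fed_req {
	int             job_id;
	struct fed_req *next;
} fed_req_t;

static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond  = PTHREAD_COND_INITIALIZER;
static fed_req_t      *queue_head;

/* Called from the persistent connection thread: just enqueue and
 * return, so the connection is never held up waiting for job locks. */
static void _enqueue_sib_request(fed_req_t *req)
{
	pthread_mutex_lock(&queue_mutex);
	req->next  = queue_head;
	queue_head = req;
	pthread_cond_signal(&queue_cond);
	pthread_mutex_unlock(&queue_mutex);
}

/* Worker thread: drains the queue when it can get the job write lock. */
static void *_sib_request_worker(void *arg)
{
	(void) arg;
	while (1) {
		pthread_mutex_lock(&queue_mutex);
		while (!queue_head)
			pthread_cond_wait(&queue_cond, &queue_mutex);
		fed_req_t *req = queue_head;
		queue_head = req->next;
		pthread_mutex_unlock(&queue_mutex);

		/* With the job write lock held: submit the sibling job,
		 * then queue a response back to the origin cluster; on
		 * failure the origin removes this cluster from the job's
		 * active siblings. */
		free(req);
	}
	return NULL;
}
```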
- Brian Christiansen authored
  The problem was that the origin cluster had to get the internal job write lock to test and set the fed cluster lock. This would hold up the persistent connection and lead to a deadlock. The solution is to create a separate table for tracking the federated job and the cluster lock, controlled by its own lock. In addition, all communication on the persistent connection must be quick, so any communication that needs to modify the actual job is put onto a queue for the scheduler to handle later, keeping the persistent connection from being held up. The response is sent back when the request is processed. This moves communication between clusters in a federation to an asynchronous model.
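
A sketch of what a separate federated-job table with its own lock could look like; the fixed-size array and every name below are illustrative assumptions, not the actual implementation:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define FED_TABLE_SIZE 1024   /* arbitrary size for the example */

/* Per-job federation state, tracked outside the main job record. */
struct fed_job_entry {
	bool     used;
	uint32_t job_id;
	uint32_t cluster_lock;   /* cluster id holding the lock, 0 = free */
};

static pthread_mutex_t fed_table_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct fed_job_entry fed_table[FED_TABLE_SIZE];

/* Test-and-set of the per-job cluster lock, done entirely under the
 * table's own mutex: no internal job write lock is required, so the
 * persistent connection can be answered immediately. */
static bool _fed_job_lock(uint32_t job_id, uint32_t cluster_id)
{
	bool locked = false;

	pthread_mutex_lock(&fed_table_mutex);
	for (int i = 0; i < FED_TABLE_SIZE; i++) {
		if (!fed_table[i].used || fed_table[i].job_id != job_id)
			continue;
		if (!fed_table[i].cluster_lock ||
		    fed_table[i].cluster_lock == cluster_id) {
			fed_table[i].cluster_lock = cluster_id;
			locked = true;
		}
		break;
	}
	pthread_mutex_unlock(&fed_table_mutex);
	return locked;
}
```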