Commits · 7752d7e5ae044505cffcca3cb3a8c7cfa23cca8d · Manuel G. Marciani / ces_slurm_simulator

17 May, 2017 31 commits
- Remove dead extern func declaration · 7752d7e5
  Brian Christiansen authored May 15, 2017
  
  7752d7e5
- Make function static · 3c7cd702
  Brian Christiansen authored May 15, 2017
```
Only used in fed_mgr.c
```
  3c7cd702
- Refactor fed_mgr to handle fed proto types · ca42335a
  Brian Christiansen authored May 15, 2017
```
This will make it easier to add new proto types without having to
modify protocol_defs.[ch].

Leaving job_lock and job_unlock to be handled by slurmctld_req since
they aren't a "queued" type.
```
  ca42335a
- NULL out var after security check · 8249e04e
  Brian Christiansen authored May 12, 2017
```
prevent possible memory leak.
```
  8249e04e
- Fix curly brace placement · dbf383fa
  Brian Christiansen authored May 12, 2017
  
  dbf383fa
- Handle sib job cancelling in async model · c60416d5
  Brian Christiansen authored May 12, 2017
  
  c60416d5
- Remove sib requests from main proto handling · 406d6e10
  Brian Christiansen authored May 11, 2017
```
All handled in _proc_multi_msg except for sib_job_[un]lock.
```
  406d6e10
- Update test37.4 with async mods · e363fda1
  Brian Christiansen authored May 11, 2017
  
  e363fda1
- Update test37.5 after async mods · 90d84870
  Brian Christiansen authored May 09, 2017
  
  90d84870
- Update test37.6 after async mods · 8ffee56e
  Brian Christiansen authored May 09, 2017
  
  8ffee56e
- Update test37.4 for async model changes · 07c5379c
  Brian Christiansen authored May 08, 2017
  
  07c5379c
- Move fed job syncing to async model · d9b439b6
  Brian Christiansen authored May 08, 2017
  
  d9b439b6
- Check job instead of fed_job_info · 16c8011d
  Brian Christiansen authored May 08, 2017
```
This prevents deadlocks when having the fed_job_list_mutex locked higher
up and calling job_completion_logger inside of the locked mutex.
```
  16c8011d
- Fix line over 80 chars · 1c652da0
  Brian Christiansen authored May 08, 2017
  
  1c652da0
- Save/Load fed_job_list to/from state file · 8542c50e
  Brian Christiansen authored May 08, 2017
  
  8542c50e
- Adjust locks · 305a35d9
  Brian Christiansen authored May 08, 2017
```
Don't need to have the fed_write_lock when destroying the
persist_conn_server.
```
  305a35d9
- Move federated requeue of jobs to async model · d7e932a1
  Brian Christiansen authored May 02, 2017
  
  d7e932a1
- Move fed will_runs to client · e559d553
  Brian Christiansen authored Apr 27, 2017
```
Since federated submissions are now asynchronous and because the
working_cluster_rec can be multithreaded, it's better to have the
federated will_runs in the client. This prevents the deadlocks and
holding up the persistent connections as could happen in the previous
model.
```
  e559d553
- Move federated job updates to async model · 8d424fbe
  Brian Christiansen authored Apr 26, 2017
  
  8d424fbe
- Always check with origin cluster to get job lock · 6fcd0ffd
  Brian Christiansen authored Apr 26, 2017
```
With the change to the asynchronous model, it's better to have the
cluster always get the lock from the origin cluster. Previously, the
origin cluster would try to pick one cluster that could start the job
the soonest and the scenario where there would be only one sibling was
more common. Now that sibling jobs are sent to all clusters this is less
common.
```
  6fcd0ffd
- Move fed_job_complete to async model · 34e96f0d
  Brian Christiansen authored Apr 25, 2017
```
Queue up the fed job completions.
```
  34e96f0d
- Move fed submissions to async model · f3998831
  Brian Christiansen authored Apr 25, 2017
```
Federated submissions now happen asynchronously. Sibling jobs are
submitted to the sibling cluster. The sibling cluster queues up the
request to be handled later when it can get the job write lock. The
sibling cluster submits the job and sends a message back to the origin
cluster which is queued up as well. If the submission failed then the
sibling cluster is removed from the job's active siblings.
```
  f3998831
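The last step of the message above (dropping a sibling whose submission failed) can be pictured with a small sketch. It assumes the active siblings are tracked as a bitmask with one bit per cluster id (ids 1 to 64). The names `SIB_BIT`, `struct fed_job_sketch`, and `handle_submit_response` are hypothetical and are not taken from this commit.

```c
#include <stdint.h>

/* Hypothetical: bit (id - 1) is set while cluster `id` holds an active sibling. */
#define SIB_BIT(id) ((uint64_t)1 << ((id) - 1))

/* Hypothetical tracking record; the real structure may differ. */
struct fed_job_sketch {
	uint32_t job_id;
	uint64_t siblings_active;
};

/*
 * Process a queued submission response from a sibling cluster.
 * rc == 0 means the sibling accepted the job; any other value means the
 * submission failed, so the sibling is removed from the active set.
 */
static void handle_submit_response(struct fed_job_sketch *job,
				   unsigned int sib_id, int rc)
{
	if (rc != 0)
		job->siblings_active &= ~SIB_BIT(sib_id);
}
```

A failed response from cluster 2 on a job whose siblings are clusters 1, 2, and 3 would leave clusters 1 and 3 active.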
- Improve fed job locking · 9fb07473
  Brian Christiansen authored Apr 25, 2017
```
The problem was that the origin cluster had to get the internal job
write lock to test and set the fed cluster lock. This would hold up the
persistent connection and get into a deadlock. The solution is to create a
separate table for tracking the federated job and the cluster lock, which
is controlled by a separate lock.

Plus all communication on the persist connection must be quick. Thus all
communications that need to modify the actual job need to be put onto
a queue for the scheduler to handle later so that the persistent
connection isn't being held up. The response will be sent back when the
request is processed. This moves to an asynchronous model for
communications between clusters in a federation.
```
  9fb07473
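The queueing idea described in this message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the connection handler only takes a short-lived queue lock to record the request, and a separate agent drains the queue later, when it is able to take whatever job locks it needs. All names here (`fed_req`, `fed_queue`, `fed_queue_push`, `fed_queue_drain`, `sum_job_ids`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical request types; not Slurm's actual protocol definitions. */
enum fed_req_type { FED_REQ_SUBMIT, FED_REQ_UPDATE, FED_REQ_COMPLETE };

struct fed_req {
	enum fed_req_type type;
	unsigned int job_id;
	struct fed_req *next;
};

struct fed_queue {
	pthread_mutex_t lock;	/* protects only the list, never job data */
	struct fed_req *head;
	struct fed_req *tail;
	size_t len;
};

static void fed_queue_init(struct fed_queue *q)
{
	assert(q != NULL);
	pthread_mutex_init(&q->lock, NULL);
	q->head = q->tail = NULL;
	q->len = 0;
}

/* Connection-handler side: record the request and return immediately. */
static int fed_queue_push(struct fed_queue *q, enum fed_req_type type,
			  unsigned int job_id)
{
	struct fed_req *req = malloc(sizeof(*req));

	if (!req)
		return -1;
	req->type = type;
	req->job_id = job_id;
	req->next = NULL;

	pthread_mutex_lock(&q->lock);
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
	q->len++;
	pthread_mutex_unlock(&q->lock);
	return 0;
}

/*
 * Agent side: detach the whole list under the queue lock, then process the
 * requests in arrival order without holding it. Returns the number handled.
 */
static size_t fed_queue_drain(struct fed_queue *q,
			      void (*handle)(const struct fed_req *, void *),
			      void *arg)
{
	struct fed_req *list;
	size_t n = 0;

	pthread_mutex_lock(&q->lock);
	list = q->head;
	q->head = q->tail = NULL;
	q->len = 0;
	pthread_mutex_unlock(&q->lock);

	while (list) {
		struct fed_req *next = list->next;
		handle(list, arg);
		free(list);
		list = next;
		n++;
	}
	return n;
}

/* Example handler: accumulates job ids so the processing can be observed. */
static void sum_job_ids(const struct fed_req *req, void *arg)
{
	*(unsigned int *)arg += req->job_id;
}
```

Detaching the list before processing keeps the queue lock hold time short, so a connection thread pushing a new request is never blocked behind slow job handling.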
- Preserve auth_cred when handling multi msg · 7587038a
  Brian Christiansen authored Apr 19, 2017
  
  7587038a
- Display which multi msg is being handled · d0404a29
  Brian Christiansen authored Apr 19, 2017
  
  d0404a29
- Update comment · c19a273b
  Brian Christiansen authored Apr 19, 2017
  
  c19a273b
- Don't revoke a job that is already completed · 293c2bce
  Brian Christiansen authored Apr 19, 2017
```
Prevents a second call to the database. This could happen when the
origin job is cancelled and the sibling jobs report back that the job is
gone as well.
```
  293c2bce
- Clear state_reason for revoked jobs · 811fcb97
  Brian Christiansen authored Apr 19, 2017
  
  811fcb97
- Set revoked job's start time to end time · cf178ae6
  Brian Christiansen authored Apr 19, 2017
```
Mimics how cancelled jobs are handled. The database will show that the job
start_time is 0 but in the controller the start time will be the
same as the end time. sacct will set the start time to the end time if
there is an end time and the start time is 0.
```
  cf178ae6
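The display rule attributed to sacct in this message amounts to a one-line fallback. The sketch below only restates that rule; `display_start_time` is a hypothetical helper, not a function from sacct.

```c
#include <time.h>

/*
 * Hypothetical helper: if the job has an end time but a start time of 0
 * (it never ran), report the end time as the start time.
 */
static time_t display_start_time(time_t start, time_t end)
{
	if (start == 0 && end != 0)
		return end;
	return start;
}
```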
- Make sure revoked jobs have correct end times · 1fe16e52
  Brian Christiansen authored Apr 19, 2017
  
  1fe16e52
- Name fed agent thread · f1aa850e
  Brian Christiansen authored Apr 19, 2017
  
  f1aa850e
16 May, 2017 5 commits
- Log the down nodes whenever slurmctld restarts · d8e95394
  Tim Shaw authored May 16, 2017
```
bug 805
```
  d8e95394
- Add FederationParameters=fed_display to slurm.conf · d3ee28fa
  Brian Christiansen authored May 16, 2017
```
To be able to set a default federated view for all status commands.
```
  d3ee28fa
- Don't show cluster column if sprio --local · 262bdb0e
  Brian Christiansen authored May 15, 2017
```
even if --sibling is specified.
```
  262bdb0e
- Add --federation option to status commands · 21680136
  Brian Christiansen authored May 15, 2017
```
to show federated view.
sacct, scontrol, sinfo, sprio, squeue, sreport
```
  21680136
- Rename SHOW_ALL flag to SHOW_FEDERATION · 8a5f00c4
  Brian Christiansen authored May 15, 2017
  
  8a5f00c4
15 May, 2017 2 commits
- Add federated views to sview · 97e9dfd5
  Brian Christiansen authored May 15, 2017
```
Show a tab in the cluster combo box to select a federated view for a
given federation.
```
  97e9dfd5
- sbatch cosmetic changes, no changes to logic · 2f313292
  Morris Jette authored May 15, 2017
  
  2f313292
13 May, 2017 2 commits
- Merge branch 'slurm-17.02' · e0ca803c
  Morris Jette authored May 12, 2017
  
  e0ca803c
- Remove log files from test20.12 · 7bb4d9a1
  Isaac Hartung authored May 12, 2017
```
Bug 3695
```
  7bb4d9a1