Commits · 4874f9882ba7e513db97e4cefc62d883891065b8 · Manuel G. Marciani / ces_slurm_simulator

20 Jan, 2017 40 commits
- Enable federated interactive jobs · 4874f988
  Brian Christiansen authored Jan 20, 2017
  
  4874f988
- Remove old and unnecessary check for V2.2 · 493a4dc6
  Brian Christiansen authored Jan 20, 2017
  
  493a4dc6
- Rename RPC SIB_JOB_REVOKE to SIB_JOB_COMPLETE · 03782571
  Brian Christiansen authored Jan 20, 2017
  
  03782571
- Add extra null check. · 41595dbe
  Brian Christiansen authored Jan 20, 2017
  
  41595dbe
- Send fed job completes when a partition is deleted · 7aabeeb6
  Brian Christiansen authored Jan 20, 2017
  
  7aabeeb6
- Add test37.5 to federated requeue · b5b81c1b
  Brian Christiansen authored Jan 20, 2017
  
  b5b81c1b
- Enable canceling fed jobs from origin cluster · 15ce1cbf
  Brian Christiansen authored Jan 16, 2017
  
  15ce1cbf
- Remove squeue --fedtrack option · 832a0118
  Brian Christiansen authored Jan 12, 2017
```
In favor of just using the -a option to show the tracking federated
jobs. This allows scontrol -a show jobs to show the tracking jobs as
well.
```
  832a0118
- Add federation job requeueing · 9cd13bb5
  Brian Christiansen authored Jan 12, 2017
  
  9cd13bb5
- Change job_hold_requeue to return a bool · 8b3edd5f
  Brian Christiansen authored Jan 06, 2017
```
to indicate whether the job was requeue held or not. This enables the
federation to trigger off whether the job was requeue held or not.
```
  8b3edd5f
- Add KILL_FED_REQUEUE flag to KILL_* flags · 10595c92
  Brian Christiansen authored Jan 06, 2017
```
So that the origin job can tell a remote cluster to cancel the job but mark
the job as requeued in the database.

See note about the KILL_* flags actually using 12 bits instead of the noted
8 bits.
```
  10595c92
- Allow non-origin jobs to purge before minjobage · 285b5cdd
  Brian Christiansen authored Jan 06, 2017
  
  285b5cdd
- Make comments on one line · 17917228
  Brian Christiansen authored Jan 06, 2017
  
  17917228
- Fix memory leak. · 23a98db4
  Brian Christiansen authored Jan 06, 2017
```
Follows pattern from c5ace562
```
  23a98db4
- Add comment · 7f88c9c2
  Brian Christiansen authored Jan 06, 2017
  
  7f88c9c2
- Fix comment · fb66df28
  Brian Christiansen authored Jan 06, 2017
  
  fb66df28
- Change info's to debug's · 846656b4
  Brian Christiansen authored Jan 06, 2017
  
  846656b4
- Requeue completing jobs in db · 4f74ad06
  Brian Christiansen authored Jan 05, 2017
```
If a job was requeued while in the completing state, the database wasn't
being updated with the requeue state.
```
  4f74ad06
- Add submitted clusters to job_record · d0bf5ed8
  Brian Christiansen authored Jan 05, 2017
```
When a fed job is requeued, it needs to be requeued to the clusters that it was
submitted to.
```
  d0bf5ed8
- Put restart_cnt on job_record · b1793f92
  Brian Christiansen authored Jan 05, 2017
```
When a fed job is requeued and new sibling jobs are submitted to the
other siblings, the restart_cnt needs to go to the siblings in case the
job runs on a remote sibling.
```
  b1793f92
- Make copy_job_record_to_job_desc externally accessible · a8c75742
  Brian Christiansen authored Jan 05, 2017
```
The federation needs to make a job_desc when requeueing jobs to
siblings.
```
  a8c75742
- Make _purge_job_record() externally accessible · 6a2b41f6
  Brian Christiansen authored Jan 05, 2017
  
  6a2b41f6
- Safely free job_step_kill_msg_t · f5177888
  Brian Christiansen authored Jan 05, 2017
  
  f5177888
- Federation sib* rpcs must have a persistent con · aa171568
  Brian Christiansen authored Jan 04, 2017
```
Since a persistent connection can only be established by SlurmUser, this
prevents non-SlurmUser users from calling the rpcs. It also requires that all
slurmctlds in the federation have the same SlurmUser.
```
  aa171568
- Don't submit sibling jobs if the job is held. · c7b07a09
  Brian Christiansen authored Dec 22, 2016
  
  c7b07a09
- Don't will_run sib clusters if begintime in future · ea4573ed
  Brian Christiansen authored Dec 22, 2016
```
If the job can't start now, just submit the job to all siblings.
```
  ea4573ed
- Extract helper function · 95e7d8ba
  Brian Christiansen authored Dec 21, 2016
```
_update_sibling_job_siblings()
```
  95e7d8ba
- Make comment on one line. · 5c55672e
  Brian Christiansen authored Dec 21, 2016
  
  5c55672e
- Fix indenting · d4967aec
  Brian Christiansen authored Dec 07, 2016
  
  d4967aec
- Refactor out common fed_mgr_job_revoke function · a2638db4
  Brian Christiansen authored Dec 05, 2016
  
  a2638db4
- Add helper function for determining fed job · 847bb657
  Brian Christiansen authored Nov 29, 2016
  
  847bb657
- Save fed job details to state · 77af869c
  Brian Christiansen authored Nov 29, 2016
  
  77af869c
- Init resp inside of _send_recv_msg · b214caa4
  Brian Christiansen authored Nov 29, 2016
```
like it does in slurm_send_recv_msg. The resp needs to be inited before
_check_send is called.
```
  b214caa4
- Add scheduling of federated batch jobs · 616de6f3
  Brian Christiansen authored Nov 22, 2016
```
Sibling jobs have to get a lock from the origin cluster in order to attempt to
allocate nodes. If a sibling gets the allocation then it lets the origin cluster
know and the origin cluster will set the other sibling jobs, if any, into a
REVOKED state and purge the jobs. If the sibling job is the only sibling then it
assumes the lock and attempts to start the job to avoid extra communications. If
nodes can't be allocated then the job releases the lock for another cluster to try.
```
  616de6f3
- Only [un]pack sib_msg data_buffer if one exists · 0f855ef3
  Brian Christiansen authored Nov 22, 2016
  
  0f855ef3
- Add JOB_REVOKED state · 7427659e
  Brian Christiansen authored Nov 22, 2016
```
for fed sibling jobs that don't start.
```
  7427659e
- Extract helper function to get fed cluster by id · 22bbba85
  Brian Christiansen authored Nov 22, 2016
  
  22bbba85
- Refactor protocol funcs in fed_mgr · 88739c7c
  Brian Christiansen authored Nov 22, 2016
  
  88739c7c
- Make state in db an unsigned 32bit · edd2e602
  Brian Christiansen authored Nov 22, 2016
```
To handle JOB_REVOKED
```
  edd2e602
- Fit on one line. · e98948d6
  Brian Christiansen authored Nov 21, 2016
  
  e98948d6
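
Commits c7b07a09 and ea4573ed together describe a small decision made before sibling jobs are created: a held job gets no siblings, a job whose begin time is still in the future skips the will_run query and goes to all siblings, and anything else is a candidate for will_run. The sketch below only illustrates that ordering. The type and function names (`job_view_t`, `sib_submit_plan_t`, `plan_sibling_submit`) are hypothetical and are not taken from the Slurm source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical outcomes; these names are illustrative only. */
typedef enum {
	SIB_SUBMIT_NONE,     /* job is held: do not submit sibling jobs */
	SIB_SUBMIT_ALL,      /* cannot start now: submit to all siblings */
	SIB_SUBMIT_WILL_RUN  /* may start now: query siblings with will_run */
} sib_submit_plan_t;

/* Hypothetical, minimal view of the job fields the decision needs. */
typedef struct {
	bool held;
	time_t begin_time;   /* 0 means no begin time was requested */
} job_view_t;

/* Decide how sibling submission should proceed for one job. */
static sib_submit_plan_t plan_sibling_submit(const job_view_t *job, time_t now)
{
	assert(job != NULL);

	if (job->held)
		return SIB_SUBMIT_NONE;
	if (job->begin_time > now)
		return SIB_SUBMIT_ALL;
	return SIB_SUBMIT_WILL_RUN;
}
```

Checking the held flag first is an assumption about ordering; the commit messages do not say which check runs first when a job is both held and has a future begin time.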
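
The lock handshake described in 616de6f3 can be modelled as a tiny state machine kept on the origin cluster. This is a minimal sketch under stated assumptions: the struct, the function names, and the use of integer cluster ids are all hypothetical, and the real code exchanges these steps as RPCs between slurmctlds rather than as local function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_CLUSTER (-1)

/* Hypothetical origin-side bookkeeping for one federated job. */
typedef struct {
	int lock_owner;     /* cluster currently holding the lock */
	int started_on;     /* cluster that started the job */
	int sibling_count;  /* clusters that hold a copy of the job */
} fed_job_t;

/* A lone sibling assumes the lock and skips the extra round trip. */
static bool fed_job_needs_lock_rpc(const fed_job_t *job)
{
	assert(job != NULL);
	return job->sibling_count > 1;
}

/* A sibling asks for the lock before trying to allocate nodes. */
static bool fed_job_lock(fed_job_t *job, int cluster_id)
{
	assert(job != NULL);
	if (job->started_on != NO_CLUSTER)
		return false;  /* job already started elsewhere */
	if (job->lock_owner != NO_CLUSTER && job->lock_owner != cluster_id)
		return false;  /* another sibling is trying right now */
	job->lock_owner = cluster_id;
	return true;
}

/* Nodes could not be allocated: release so another cluster can try. */
static bool fed_job_unlock(fed_job_t *job, int cluster_id)
{
	assert(job != NULL);
	if (job->lock_owner != cluster_id)
		return false;
	job->lock_owner = NO_CLUSTER;
	return true;
}

/*
 * Allocation succeeded: record the winner and return how many other
 * sibling jobs the origin should mark REVOKED and purge (-1 on error).
 */
static int fed_job_start(fed_job_t *job, int cluster_id)
{
	assert(job != NULL);
	if (job->lock_owner != cluster_id)
		return -1;
	job->started_on = cluster_id;
	return job->sibling_count - 1;
}
```

A typical sequence would be: cluster 1 locks, fails to allocate, unlocks; cluster 2 locks, allocates, and reports the start, after which further lock requests are refused.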
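
Commit edd2e602 widens the job state stored in the database to an unsigned 32-bit value "to handle JOB_REVOKED". The sketch below shows why a narrower field can lose a high state flag. It assumes the previous field was 16 bits wide and that the new flag sits above bit 15; the commit message states neither, and `EXAMPLE_JOB_REVOKED` is a placeholder value, not the real definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Placeholder flag chosen only so that it lies above bit 15.
 * The bit position is an assumption made for illustration.
 */
#define EXAMPLE_JOB_REVOKED ((uint32_t)1 << 19)

/* Storing through a 16-bit field silently drops the upper bits. */
static uint32_t store_state_16(uint32_t state)
{
	return (uint16_t)state;
}

/* Storing through a 32-bit field keeps every bit. */
static uint32_t store_state_32(uint32_t state)
{
	return state;
}

/* Check whether the placeholder flag survived the round trip. */
static bool is_revoked(uint32_t stored)
{
	return (stored & EXAMPLE_JOB_REVOKED) != 0;
}
```

With a base state of 3 combined with the placeholder flag, the 16-bit path keeps the base state but loses the flag, while the 32-bit path preserves both.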