- 22 Jun, 2017 29 commits
-
-
Brian Christiansen authored
The controller keeps the job in the job_list until the origin cluster comes back up and finds out about it then.
-
Brian Christiansen authored
This allows the origin cluster to sync up jobs after it has been down.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
(void *)(intptr_t)0 is treated as a NULL pointer.
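A minimal sketch of the cast pattern involved (illustrative code, not the actual Slurm change): when an integer is passed through a void * argument, a value of 0 compares equal to NULL, so the callee cannot tell "value 0" apart from "no argument".

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only (not the Slurm code): integer values smuggled
 * through a void * argument, as list/callback APIs often do. */
static void handle(void *arg)
{
	if (arg == NULL) {
		/* (void *)(intptr_t)0 compares equal to NULL, so a stored
		 * value of 0 is indistinguishable from "no argument". */
		printf("got NULL\n");
		return;
	}
	printf("got %ld\n", (long)(intptr_t)arg);
}

int main(void)
{
	handle((void *)(intptr_t)0);	/* prints "got NULL" */
	handle((void *)(intptr_t)42);	/* prints "got 42" */
	return 0;
}
```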
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
_cleanup_removed_origin_jobs() could have been called even if the cluster had never been part of a federation.
-
Brian Christiansen authored
Job could have been requeued if the nodes failed.
-
Isaac Hartung authored
Commit bef69448 changed slurm_addto_char_list() so that it now adds an empty string to the list if no constraints or clusters are given; the code here previously expected an empty List.
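A minimal caller-side sketch of the difference this creates (hypothetical data layout and helper name, not the actual Slurm List API): a list built from an empty constraints/clusters string now holds one empty-string element instead of being empty, so code that only checked for an empty list has to handle both cases.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a parsed name list: a NULL-terminated
 * array of strings instead of Slurm's List type. */
static int list_is_effectively_empty(char **names)
{
	if (!names || !names[0])
		return 1;			/* no elements at all */
	if (!names[1] && names[0][0] == '\0')
		return 1;			/* single empty string */
	return 0;
}

int main(void)
{
	char *old_style[] = { NULL };		/* previously: empty list */
	char *new_style[] = { "", NULL };	/* now: one empty string */

	printf("%d %d\n",
	       list_is_effectively_empty(old_style),
	       list_is_effectively_empty(new_style));	/* prints "1 1" */
	return 0;
}
```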
-
Brian Christiansen authored
Like sview, it wasn't mapping the job's node indexes to the correct nodes, since federated nodes are merged into one array.
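A hedged illustration of the indexing problem (made-up structures, not the sview/smap code): once every cluster's nodes are concatenated into one federated array, a job's per-cluster node index only points at the right node after adding the owning cluster's offset into that merged array.

```c
#include <stdio.h>

/* Illustrative only: map a job's per-cluster node index into a merged
 * federated node array by adding the owning cluster's offset. */
struct cluster_nodes {
	const char *name;
	int offset;		/* index of this cluster's first node */
	int node_cnt;
};

static int fed_node_index(const struct cluster_nodes *c, int local_index)
{
	return c->offset + local_index;
}

int main(void)
{
	struct cluster_nodes clusters[] = {
		{ "cluster_a",   0, 100 },
		{ "cluster_b", 100,  50 },
	};

	/* Node 3 on cluster_b is node 103 in the merged array. */
	printf("%d\n", fed_node_index(&clusters[1], 3));
	return 0;
}
```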
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
while the cluster is down. The cluster will figure out what changed after starting up or after recovering from running off the cache.
-
Brian Christiansen authored
When the controller starts up and the dbd is not up, it waits until the dbd comes up. At that point, the controller needs to find out whether anything has changed in the federation (e.g. other clusters, or the cluster itself, removed from the federation).
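A rough, runnable sketch of that ordering with hypothetical stand-in functions (the real slurmctld/slurmdbd interfaces are not shown): keep running from the cached federation state, and once the dbd connection is established, re-query the federation so anything that changed while the controller was down gets picked up.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stubs; the real controller talks to the slurmdbd. */
static bool dbd_up = false;

static void dbd_connected(void)
{
	dbd_up = true;
}

/* Sketch of the step described above: after the dbd finally comes up,
 * re-read the federation so clusters added or removed while we were
 * down (including this cluster itself) are noticed. */
static void refresh_federation(void)
{
	if (!dbd_up)
		return;		/* keep using cached federation state */
	printf("re-syncing federation info from the dbd\n");
}

int main(void)
{
	refresh_federation();	/* dbd still down: nothing to do yet */
	dbd_connected();
	refresh_federation();	/* dbd up: re-sync federation state */
	return 0;
}
```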
-
Brian Christiansen authored
-
Isaac Hartung authored
-
Isaac Hartung authored
When a non-origin cluster is removed:
- running jobs remain
- fed_details are removed so the job can't call home
- the origin cluster removes the tracking job for running jobs
- pending jobs are removed
- pending srun/salloc commands don't get notified
- other clusters remove the removed cluster from their viable and active siblings

When an origin cluster is removed:
- all pending jobs are removed from all clusters that had the job
- pending srun/salloc commands are notified of termination
- running jobs remain
-
Isaac Hartung authored
-
Brian Christiansen authored
-
Isaac Hartung authored
-
Brian Christiansen authored
-
Danny Auble authored
The SLURM_ID_HASH used for Cray systems has changed to use the full 64 bits of the hash. Previously the stepid was multiplied by 10,000,000,000 to make it easy to read both the jobid and the stepid in the hash, separated by at least a couple of zeros, but this led to overflow of the hash for steps like the batch and extern steps, which use all 32 bits to represent the step. While the new method loses the easy readability, it fixes the more important overflow issue. This will most likely go unnoticed by most; it is just a note of the change.
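The overflow can be seen with a few lines of arithmetic (the constants below only mirror the description; the exact SLURM_ID_HASH macro definitions are not reproduced here):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Step ids such as the batch/extern steps use the full 32-bit
	 * range; the values below are only illustrative. */
	uint64_t step_id = 0xfffffffe;
	uint64_t job_id  = 1234;

	/* Old scheme per the description: stepid * 10,000,000,000 keeps
	 * jobid and stepid visually separated, but 0xfffffffe * 1e10
	 * (~4.3e19) exceeds UINT64_MAX (~1.8e19) and wraps. */
	uint64_t old_hash = step_id * 10000000000ULL + job_id;

	/* A full-64-bit packing (e.g. stepid in the high 32 bits) cannot
	 * overflow, at the cost of readability. */
	uint64_t new_hash = (step_id << 32) | job_id;

	printf("old: %" PRIu64 " (wrapped)\nnew: %" PRIu64 "\n",
	       old_hash, new_hash);
	return 0;
}
```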
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Danny Auble authored
# Conflicts:
#	NEWS
-
Hongjia Cao authored
Bug 3919
-
- 21 Jun, 2017 1 commit
-
-
Dominik Bartkiewicz authored
Bug 3757
-
- 20 Jun, 2017 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
more than 1 partition or when the partition is changed with scontrol. Bug 3849
-
- 19 Jun, 2017 7 commits
-
-
Danny Auble authored
-
Danny Auble authored
submitted to a QOS/association. Bug 3849
-
Isaac Hartung authored
Continuation of b9719be2
-
Danny Auble authored
-
Brian Christiansen authored
CID: 170772, 170773
Introduced by commit: 250378c2
-
Danny Auble authored
-
Morris Jette authored
Correct error message when ClusterName in configuration files does not match the name in the slurmctld daemon's state save file.
-