- 28 Oct, 2016 3 commits
-
Danny Auble authored
A job could be accounted for more than it should be in the _decay_thread inside the priority/multifactor plugin. Previously, the job's end_time_exp was not stored, and that value is what is used to determine whether the job had already been processed. In 16.05 we were able to mostly fix this, but the TRES numbers could still be accounted for multiple times. Since a pack change was needed to fix this, it had to wait until 17.02.
-
Danny Auble authored
-
Danny Auble authored
More time than should be allowed could be accounted for. This only happened for jobs in the completing state when the slurmctld was shut down. This will also be enhanced in 17.02, as the job's end_time_exp is not stored, which is needed to determine whether the job has already been through the decay_thread at end of job. Bug 3162
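The two fixes above hinge on persisting end_time_exp so the decay thread can tell whether a finished job was already charged. A minimal sketch of that guard, with hypothetical names and a simplified job struct (this is not Slurm's actual code):

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical job record: end_time_exp remembers the end time the
 * decay thread last processed; 0 means never processed. */
struct job {
	time_t end_time;     /* when the job finished */
	time_t end_time_exp; /* end time already seen by the decay pass */
};

/* Return true if the decay pass should still charge this finished job.
 * Persisting end_time_exp across a slurmctld restart is what prevents
 * the job from being accounted for more than once. */
static bool decay_should_process(struct job *j)
{
	if (j->end_time_exp == j->end_time)
		return false; /* already accounted for this end time */
	j->end_time_exp = j->end_time;
	return true;
}
```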
-
- 27 Oct, 2016 37 commits
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Bug 3139
-
Danny Auble authored
-
Danny Auble authored
# Conflicts:
#	META
-
Danny Auble authored
-
Danny Auble authored
issue with gang scheduling. Bug 3211
-
Brian Christiansen authored
MAX_BUF_SIZE is a uint32_t, so comparing size (an int) against it didn't make sense.
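The pitfall behind this fix is C's usual arithmetic conversions: comparing an int against a uint32_t converts the int to unsigned, so a negative value compares as a huge number. A small illustration (function names and the explicit fix are hypothetical, not Slurm's code):

```c
#include <stdint.h>

/* Buggy pattern: `size` is implicitly converted to uint32_t for the
 * comparison, so -1 becomes 4294967295 and "exceeds" any sane limit. */
int int_exceeds_u32(int size, uint32_t limit)
{
	return size > limit; /* -Wsign-compare warns about exactly this */
}

/* Safer pattern: handle negative sizes explicitly before comparing
 * in the unsigned domain. */
int int_exceeds_u32_fixed(int size, uint32_t limit)
{
	if (size < 0)
		return 0; /* a negative size is not "too large" */
	return (uint32_t)size > limit;
}
```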
-
Tim Wickberg authored
Unhook it from the build, and remove the relevant section from the slurm.spec file as well.
-
Brian Christiansen authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Federated submissions
-
Brian Christiansen authored
e.g. allocation failure: Unspecified error
-
Brian Christiansen authored
-
Brian Christiansen authored
get_next_job_id() was returning a local id, and then the fed_mgr was turning that into a fed job id. This was a problem because get_next_job_id() couldn't check whether an existing job already had the fed job id; it was only checking for the local job id. This was exposed in tests that did a reconfigure: the reconfigure loaded in an old job_id_sequence, so the next job got an id that was already in use.
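The fix above amounts to checking uniqueness of the id in the same (federated) form that jobs are actually stored under. A sketch under assumed conventions (the bit layout and function names are hypothetical, chosen only for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: cluster id in the upper bits, local id in the
 * lower bits. The exact layout is assumed for this example. */
#define FED_LOCAL_ID_BITS 26
static uint32_t fed_job_id(uint32_t cluster_id, uint32_t local_id)
{
	return (cluster_id << FED_LOCAL_ID_BITS) | local_id;
}

static bool id_in_use(const uint32_t *jobs, int njobs, uint32_t id)
{
	for (int i = 0; i < njobs; i++)
		if (jobs[i] == id)
			return true;
	return false;
}

/* Advance the sequence past any local id whose *federated* form already
 * exists, e.g. after a reconfigure reloaded an old job_id_sequence. */
static uint32_t next_free_local_id(const uint32_t *jobs, int njobs,
				   uint32_t cluster_id, uint32_t seq)
{
	while (id_in_use(jobs, njobs, fed_job_id(cluster_id, seq)))
		seq++;
	return seq;
}
```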
-
Brian Christiansen authored
The logic to talk to the correct compute nodes still needs to be implemented. It will come at a later date.
-
Brian Christiansen authored
-
Brian Christiansen authored
Will submit using federation submission logic. Scheduling logic to come.
-
Brian Christiansen authored
to make sure job ptr is accessed within locks.
-
Brian Christiansen authored
In prep for refactoring _slurm_rpc_submit_batch_job to make sure the job_ptr is accessed within locks.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
It was picking a higher weighted federation cluster over lower weighted ones because it had an earlier start time. This shouldn't happen because that's what the weights are for. e.g.
will_run_resp for fed1: start:2016-10-13T15:19:47 sys_usage:0.00 weight:2
will_run_resp for fed2: start:2016-10-13T15:19:48 sys_usage:0.00 weight:1
will_run_resp for fed3: start:2016-10-13T15:19:48 sys_usage:0.00 weight:1
Earliest cluster:fed1 time:1476393587 now:1476393588
Submitted federated job 67119254 to fed1(self)
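The intended ordering in the message above is weight-first, with start time only breaking ties. A minimal comparator sketch mirroring the fields in the log (struct and function names are assumed for illustration; lower weight is treated as preferred, as the log implies):

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical per-cluster will_run response. */
struct will_run_resp {
	time_t start;    /* earliest start time offered */
	uint32_t weight; /* lower weight is preferred */
};

/* Pick the better of two responses: weight decides first; the earlier
 * start time only breaks a tie between equal weights. */
static const struct will_run_resp *
pick_cluster(const struct will_run_resp *a, const struct will_run_resp *b)
{
	if (a->weight != b->weight)
		return (a->weight < b->weight) ? a : b;
	return (a->start <= b->start) ? a : b;
}
```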
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
fedorigin, fedoriginraw, fedsiblings, fedsiblingsraw
-
Brian Christiansen authored
-
Brian Christiansen authored
-