Commits · 8d424fbe8ea53dc51440919a9eef7255f83956b8 · Manuel G. Marciani / ces_slurm_simulator

17 May, 2017 13 commits

Move federated job updates to async model · 8d424fbe
Brian Christiansen authored Apr 26, 2017

8d424fbe

Always check with origin cluster to get job lock · 6fcd0ffd

Brian Christiansen authored Apr 26, 2017

With the change to the asynchronous model, it's better to have the
cluster always get the lock from the origin cluster. Previously, the
origin cluster would try to pick one cluster that could start the job
the soonest and the scenario where there would be only one sibling was
more common. Now that sibling jobs are sent to all clusters this is less
common.

6fcd0ffd

Move fed_job_complete to async model · 34e96f0d
Brian Christiansen authored Apr 25, 2017
```
Queue up the fed job completions.
```
34e96f0d

Move fed submissions to async model · f3998831

Brian Christiansen authored Apr 25, 2017

Federated submissions now happen asynchronously. Sibling jobs are
submitted to the sibling cluster. The sibling cluster queues up the
request to be handled later when it can get the job write lock. The
sibling cluster submits the job and sends a message back to the origin
cluster which is queued up as well. If the submission failed then the
sibling cluster is removed from the job's active siblings.

f3998831
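
The failure handling described in f3998831 can be sketched as follows. This is an illustration only, not the Slurm C code: it assumes the active siblings are tracked as a bitmask with one bit per cluster id, and the names `sibling_bit` and `handle_submit_response` are hypothetical.

```python
def sibling_bit(cluster_id: int) -> int:
    """Bit for a cluster id (assumed to be 1-based) in the siblings bitmask."""
    return 1 << (cluster_id - 1)


def handle_submit_response(active_siblings: int, cluster_id: int, success: bool) -> int:
    """Return the updated active-siblings mask after a sibling reports back.

    A failed submission clears that sibling's bit; a successful one
    leaves the mask unchanged.
    """
    if success:
        return active_siblings
    return active_siblings & ~sibling_bit(cluster_id)


# Job sent to clusters 1, 2 and 3; cluster 2 fails to submit it.
active = sibling_bit(1) | sibling_bit(2) | sibling_bit(3)
active = handle_submit_response(active, 2, success=False)
```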

Improve fed job locking · 9fb07473

Brian Christiansen authored Apr 25, 2017

The problem was that the origin cluster had to get the internal job
write lock to test and set the fed cluster lock. This would hold up the
persistent connection and lead to a deadlock. The solution is to create a
separate table for tracking the federated job and the cluster lock, which
is controlled by a separate lock.

In addition, all communication on the persistent connection must be quick.
Thus all communications that need to modify the actual job are put onto
a queue for the scheduler to handle later so that the persistent
connection isn't being held up. The response will be sent back when the
request is processed. This moves to an asynchronous model for
communications between clusters in a federation.

9fb07473
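
A minimal sketch of the queueing pattern described in 9fb07473, written in Python rather than Slurm's C. All names here (`job_table`, `job_write_lock`, `pending`, `on_persist_message`, `fed_agent`) are hypothetical stand-ins: the connection handler only enqueues a request and returns, and a separate worker later applies it under the job lock and records a response.

```python
import queue
import threading

# Hypothetical names throughout; this is not Slurm's implementation.
job_table = {}                      # job_id -> job state, guarded by job_write_lock
job_write_lock = threading.Lock()   # stands in for the internal job write lock
pending = queue.Queue()             # requests queued by the connection handler
responses = []                      # stands in for replies sent back to the sender


def on_persist_message(msg):
    """Connection handler: enqueue and return at once, never take the job lock."""
    pending.put(msg)


def fed_agent():
    """Worker: apply queued requests under the job lock, then respond."""
    while True:
        msg = pending.get()
        if msg is None:             # sentinel: stop the worker
            break
        with job_write_lock:
            job_table[msg["job_id"]] = msg["state"]
        responses.append((msg["job_id"], "ok"))


worker = threading.Thread(target=fed_agent, name="fed_agent")
worker.start()
on_persist_message({"job_id": 101, "state": "PENDING"})
on_persist_message({"job_id": 101, "state": "RUNNING"})
pending.put(None)
worker.join()
```

Because the handler never blocks on the job lock, a slow or contended job update cannot stall the connection; the cost is that the sender learns the outcome only when the queued request is processed.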

Preserve auth_cred when handling multi msg · 7587038a
Brian Christiansen authored Apr 19, 2017

7587038a
Display which multi msg is being handled · d0404a29
Brian Christiansen authored Apr 19, 2017

d0404a29
Update comment · c19a273b
Brian Christiansen authored Apr 19, 2017

c19a273b

Don't revoke a job that is already completed · 293c2bce

Brian Christiansen authored Apr 19, 2017

Prevents a second call to the database. This could happen when the
origin job is cancelled and the sibling jobs report back that the job is
gone as well.

293c2bce

Clear state_reason for revoked jobs · 811fcb97
Brian Christiansen authored Apr 19, 2017

811fcb97

Set revoked job's start time to end time · cf178ae6

Brian Christiansen authored Apr 19, 2017

This mimics how cancelled jobs are handled. The database will show that the
job start_time is 0, but in the controller the start time will be the
same as the end time. sacct will set the start time to the end time if
there is an end time and the start time is 0.

cf178ae6
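
The display rule stated in cf178ae6 amounts to the following, shown as a small sketch. The function name `display_start_time` is hypothetical and times are assumed to be integer timestamps; this is not sacct's actual code.

```python
def display_start_time(start_time: int, end_time: int) -> int:
    """Return the start time to show for a job record.

    If the job has an end time but a start time of 0 (as with a
    revoked or cancelled job), show the end time as the start time.
    """
    if start_time == 0 and end_time != 0:
        return end_time
    return start_time
```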

Make sure revoked jobs have correct end times · 1fe16e52
Brian Christiansen authored Apr 19, 2017

1fe16e52
Name fed agent thread · f1aa850e
Brian Christiansen authored Apr 19, 2017

f1aa850e

16 May, 2017 5 commits
- Log the down nodes whenever slurmctld restarts · d8e95394
  Tim Shaw authored May 16, 2017
```
bug 805
```
  d8e95394
- Add FederationParameters=fed_display to slurm.conf · d3ee28fa
  Brian Christiansen authored May 16, 2017
```
To be able to set a default federated view for all status commands.
```
  d3ee28fa
- Don't show cluster column if sprio --local · 262bdb0e
  Brian Christiansen authored May 15, 2017
```
even if --sibling is specified.
```
  262bdb0e
- Add --federation option to status commands · 21680136
  Brian Christiansen authored May 15, 2017
```
to show federated view.
sacct, scontrol, sinfo, sprio, squeue, sreport
```
  21680136
- Rename SHOW_ALL flag to SHOW_FEDERATION · 8a5f00c4
  Brian Christiansen authored May 15, 2017
  
  8a5f00c4
15 May, 2017 2 commits
- Add federated views to sview · 97e9dfd5
  Brian Christiansen authored May 15, 2017
```
Show a tab in the cluster combo box to select a federated view for a
given federation.
```
  97e9dfd5
- sbatch cosmetic changes, no changes to logic · 2f313292
  Morris Jette authored May 15, 2017
  
  2f313292
13 May, 2017 3 commits
- Merge branch 'slurm-17.02' · e0ca803c
  Morris Jette authored May 12, 2017
  
  e0ca803c
- Remove log files from test20.12 · 7bb4d9a1
  Isaac Hartung authored May 12, 2017
```
Bug 3695
```
  7bb4d9a1
- knl_cray plugin: Change capmc parsing of mcdram_pct from string to number · 7bd276b1
  Morris Jette authored May 12, 2017
```
bug 3779
```
  7bd276b1
12 May, 2017 4 commits

knl_cray plugin: Log incomplete capmc output for a node · 80b27490

Morris Jette authored May 12, 2017

If capmc reports a node name, but not mcdram_cfg for the node, then
  log the missing data rather than assume the value is zero and
  report a value mismatch with cnselect.

80b27490

Prevent scontrol crash when operating on array and no-array jobs at once. · 006f7eeb

Alejandro Sanchez authored May 12, 2017

When requesting an operation on jobs, where the operation permits specifying
more than one job in the same request, and a job array appears before a
regular (non-array) job in the list of jobs to operate on, the
job_array_resp_msg_t pointer was not properly NULL'ed and thus incorrectly
accessed when processing the non-array job. This fix prevents the crash from
happening in the following scontrol operations:

uhold, hold, suspend, requeue, requeuehold, update, release

when the same request has <array_jobid>,<non-array_jobid> in this order in
the job list to process.

Bug 3759

006f7eeb

Enhance job expansion example · 02b790bc

Morris Jette authored May 12, 2017

Job expansion example in FAQ enhanced to demonstrate operation in
heterogeneous environments.

bug 2979

02b790bc

avoid starting scheduler on busy system after power cap change · e29e8511
Alejandro Sanchez authored May 12, 2017
```
Do not attempt to schedule jobs after changing the power cap if there are
    already many active threads.
```
e29e8511

11 May, 2017 2 commits
- Merge remote-tracking branch 'origin/slurm-17.02' · 4fa4f65e
  Danny Auble authored May 10, 2017
```
# Conflicts:
#	META
#	NEWS
```
  4fa4f65e
- Update NEWS for next release. · d65ed698
  Danny Auble authored May 10, 2017
  
  d65ed698
10 May, 2017 3 commits
- Update META for v17.02.3 tag · b6f8ca23
  Danny Auble authored May 10, 2017
  
  b6f8ca23
- Return error when bad separator is given for scontrol update job licenses. · 521a574c
  Dominik Bartkiewicz authored May 10, 2017
```
Bug 3760
```
  521a574c
- Partial revert of commit c6a144c1 which made it so CR_ONE_TASK_PER_CORE · 9556b4ab
  Danny Auble authored May 09, 2017
```
didn't work at all.

Bug 3712.
```
  9556b4ab
09 May, 2017 8 commits
- Revert "Return error when bad separator is given for scontrol update job licenses." · 36718220
  Danny Auble authored May 09, 2017
```
This reverts commit ecfd007f.
```
  36718220
- Return error when bad separator is given for scontrol update job licenses. · ecfd007f
  Dominik Bartkiewicz authored May 09, 2017
  
  ecfd007f
- Fix casting · c865bd0c
  Brian Christiansen authored May 09, 2017
```
Continuation of 9a1370e3
CID 168995
```
  c865bd0c
- Don't remove admin comment when updating a job. · 6cf363c4
  Danny Auble authored May 09, 2017
```
It was noticed that while doing any update to a job the admin comment would
be blown away.  This patch fixes that.
```
  6cf363c4
- Fix updating job priority on multiple partitions to be correct. · bf7e0e7b
  Dominik Bartkiewicz authored May 09, 2017
```
Bug 3789
```
  bf7e0e7b
- It was found on a system running openmpi using multiple-slurmds we couldn't · 3a4f38d0
  Danny Auble authored May 09, 2017
```
run multiple tasks on multiple nodes.  Changing the max nodes setting from
3 to 6 fixes the issue without apparent compromise to the test.
```
  3a4f38d0
- Message Aggr - Remove race condition on slurmd shutdown with respect to · bc3cdabf
  Danny Auble authored May 09, 2017
```
destroying a mutex.
```
  bc3cdabf
- Filter out duplicate federated jobs · 9a1370e3
  Brian Christiansen authored May 08, 2017
```
When running sacct from a federated client, the db returns jobs for each
cluster with duplicate jobs removed on each cluster. A federated job could have
run on a different cluster before the job IDs rolled over. This patch
filters out old federated jobs and leaves the newest ones.

Reverted d31965 which was too slow.
```
  9a1370e3
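
The filtering described in 9a1370e3 can be sketched as below. This is an illustration under stated assumptions, not the patch itself: records are assumed to be dicts with `job_id`, `cluster` and `submit` fields, "newest" is assumed to mean the largest submit time, and the name `filter_duplicate_fed_jobs` is hypothetical.

```python
def filter_duplicate_fed_jobs(records):
    """Keep one record per job id, preferring the newest submit time.

    Assumes each record is a dict with 'job_id', 'cluster' and 'submit'
    keys; results are ordered by each job id's first appearance.
    """
    newest = {}
    for rec in records:
        kept = newest.get(rec["job_id"])
        if kept is None or rec["submit"] > kept["submit"]:
            newest[rec["job_id"]] = rec
    return list(newest.values())


records = [
    {"job_id": 7, "cluster": "alpha", "submit": 100},   # old job, before ids rolled
    {"job_id": 8, "cluster": "alpha", "submit": 900},
    {"job_id": 7, "cluster": "beta", "submit": 950},    # newer job reusing id 7
]
filtered = filter_duplicate_fed_jobs(records)
```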