- 22 Sep, 2016 (8 commits)
- Morris Jette authored
- Morris Jette authored
- Morris Jette authored
- Brian Christiansen authored
- Alejandro Sanchez authored: license of a certain type.
- Brian Christiansen authored: This reverts commit 54a270f7.
- Danny Auble authored: the fed_mgr.
- Brian Christiansen authored
- 21 Sep, 2016 (11 commits)
- Tim Wickberg authored
- Morris Jette authored: node_features/knl_cray plugin: increase the default CapmcTimeout parameter from 10 to 60 seconds (bug 3100).
- Morris Jette authored: capmc_suspend/resume: if a request to modify NUMA or MCDRAM state on a set of nodes, or to reboot a set of nodes, fails, then requeue the job and abort the entire operation rather than trying to operate on individual nodes (bug 3100).
- Morris Jette authored: Allow a node's PowerUp state flag to be cleared using the update_node RPC (bug 3100).
- Morris Jette authored: When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode), pass the job ID assigned to the nodes to the ResumeProgram via the SLURM_JOB_ID environment variable (bug 3100).
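A site-provided ResumeProgram can now read the job ID from its environment. The sketch below is illustrative only (the function name and output format are assumptions, not Slurm's shipped code); it models how such a script might combine the node-list argument Slurm passes with the newly exported SLURM_JOB_ID variable.

```python
import os
import sys

def describe_resume(argv, environ):
    # ResumeProgram is invoked with the node list as its argument; after
    # this change, SLURM_JOB_ID names the job assigned to those nodes.
    # (Illustrative sketch, not Slurm's actual code.)
    nodes = argv[1] if len(argv) > 1 else ""
    job_id = environ.get("SLURM_JOB_ID", "unknown")
    return "resuming %s for job %s" % (nodes, job_id)

if __name__ == "__main__":
    print(describe_resume(sys.argv, os.environ))
```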
- Morris Jette authored: Don't log an error for a job end_time of zero if the node health check is still running (bug 3053).
- Morris Jette authored: capmc_suspend/resume: if a request to modify NUMA or MCDRAM state on a set of nodes, or to reboot a set of nodes, fails, then requeue the job and abort the entire operation rather than trying to operate on individual nodes (bug 3100).
- Morris Jette authored: Allow a node's PowerUp state flag to be cleared using the update_node RPC (bug 3100).
- Morris Jette authored: When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode), pass the job ID assigned to the nodes to the ResumeProgram via the SLURM_JOB_ID environment variable (bug 3100).
- Brian Christiansen authored: Previous logic duplicated checking of the error codes returned from job_allocate(); job_allocate() already sets the job state to FAILED if there was an actual issue.
- Brian Christiansen authored: Was checking only for ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE and ENFORCE_ALL; however, _slurm_rpc_allocate_resources() and _slurm_rpc_submit_batch_job() both check for ANY and ALL.
- 20 Sep, 2016 (13 commits)
- Danny Auble authored: to siblings (if not already connected). This will happen when the next message is sent to them.
- Tim Wickberg authored
- Danny Auble authored
- Danny Auble authored: sibling clusters.
- Danny Auble authored: back a message to the caller.
- Danny Auble authored: a federation connection; with someone adding and removing the cluster from the federation many times simultaneously, the cluster could fail to be found.
- Danny Auble authored
- Danny Auble authored
- Danny Auble authored
- Danny Auble authored
- Morris Jette authored: Don't log an error for a job end_time of zero if the node health check is still running (bug 3053).
- Tim Wickberg authored: Fixes build issue caused by 844830d4.
- Ben Matthews authored
- 19 Sep, 2016 (8 commits)
- Danny Auble authored: connection
- Danny Auble authored
- Danny Auble authored: error.
- Danny Auble authored
- Danny Auble authored: at startup. Starting it up when you get a connection from another cluster could cause delays in processing the request.
- Danny Auble authored: want to wait only for message_timeout instead of forever; otherwise we could hit deadlock if the other side is trying to do the same thing.
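The deadlock this commit avoids is the classic one where two clusters each hold their own resource and wait forever for the other's. A minimal sketch of the bounded-wait idea (names and structure are assumptions for illustration, not fed_mgr's actual code):

```python
import threading

def send_to_sibling(peer_lock, message_timeout):
    # Bound the wait: if both sides blocked forever on each other's
    # lock, neither would make progress. With a timeout, the loser
    # backs off and can retry later.
    # (Illustrative sketch, not Slurm's actual code.)
    if peer_lock.acquire(timeout=message_timeout):
        try:
            return "sent"
        finally:
            peer_lock.release()
    return "timed out"
```

If the acquire times out, the caller can report an error or retry rather than hanging, which is the behavior the commit describes.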
- Danny Auble authored
- Danny Auble authored: processed at a time. Otherwise you could hit issues when rapidly adding and removing a cluster from a federation; probably not likely in real life, but in testing it is a different story.
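Serializing federation updates with a single lock is the simplest way to keep rapid add/remove sequences from interleaving. A hedged sketch of that pattern (class and method names are assumptions for illustration, not slurmctld's data structures):

```python
import threading

class FederationRegistry:
    # One mutex serializes add/remove so only one cluster update is
    # processed at a time; concurrent updates cannot interleave.
    # (Illustrative sketch, not Slurm's actual code.)
    def __init__(self):
        self._lock = threading.Lock()
        self.clusters = set()

    def add_cluster(self, name):
        with self._lock:
            self.clusters.add(name)

    def remove_cluster(self, name):
        with self._lock:
            self.clusters.discard(name)
```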