- 09 Apr, 2019 19 commits
-
-
Morris Jette authored
Allow it to support use of any GPU in a multi-GPU system
-
Brian Christiansen authored
-
Brian Christiansen authored
This allows jobs to be placed on booting nodes rather than being given a whole node, even if it would have been better to wait for the node to boot. Bug 6782
-
Brian Christiansen authored
-
Brian Christiansen authored
Make nodes available after being suspended, even if they are down, drained, or failed. Bug 6212
-
Brian Christiansen authored
Bug 6333
-
Brian Christiansen authored
Bug 6333
-
Brian Christiansen authored
Rely on the POWERING_DOWN bit. This allows POWERING_DOWN nodes to be cleared after a restart, since suspend_node_bitmap was local to power_save.c. Bug 6333
-
Brian Christiansen authored
Bug 6333
-
Brian Christiansen authored
Instead of just guessing the time, let's use the original time. Bug 6333
-
Brian Christiansen authored
-
Brian Christiansen authored
While "powering down" the nodes aren't eligible to be allocated. Nodes will remain "powering down" for SuspendTimeout time. Bug 6333
-
Brian Christiansen authored
NODE_STATE_POWER_SAVE == node is actually off. Bug 6333
-
Brian Christiansen authored
Run suspend and resume more often than once per ResumeTimeout after the last suspend. Don't allocate suspending nodes until after SuspendTimeout. Bug 6333
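As a rough illustration of the timers these commits adjust, a minimal slurm.conf power-saving sketch (scripts, paths, and values are assumptions, not taken from the commits):

    # Illustrative power-saving settings; programs and values are placeholders
    SuspendProgram=/usr/local/sbin/node_suspend.sh
    ResumeProgram=/usr/local/sbin/node_resume.sh
    SuspendTime=600      # seconds idle before a node is suspended
    SuspendTimeout=30    # window during which a node stays "powering down"
    ResumeTimeout=300    # max seconds to wait for a resumed node to boot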
-
Morris Jette authored
If the MPS server is started with the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE set, then the MPS server will be limited to that percentage of the GPU's total, which will not work as desired if additional jobs are initiated.
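For context, a hedged shell sketch of starting the MPS daemon with that variable set (the value 50 is arbitrary and illustrative):

    # Cap the MPS server at 50% of the GPU's threads; as noted above,
    # this limit does not behave as desired once additional jobs start
    export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
    nvidia-cuda-mps-control -d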
-
Morris Jette authored
The user percentage logic was incorrect and unnecessary. Removed the logic and the associated test.
-
Danny Auble authored
Bug 5667
-
Danny Auble authored
Bug 5667
-
Danny Auble authored
Bug 5667
-
- 08 Apr, 2019 2 commits
-
-
Morris Jette authored
Make tests able to work in a greater variety of configurations
-
Morris Jette authored
This will start and stop the MPS server as needed
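A sketch of the standard NVIDIA MPS controls a test like this might wrap (assumed; the actual test script is not shown here):

    nvidia-cuda-mps-control -d            # start the MPS control daemon
    echo quit | nvidia-cuda-mps-control   # stop it again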
-
- 07 Apr, 2019 5 commits
-
-
Morris Jette authored
This should only happen if something bad happens in the test
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
The previous logic could let a job requiring a persistent burst buffer start and fail without checking that the buffer already exists. This was causing regression tests 35.1 and 35.4 to fail.
-
- 06 Apr, 2019 7 commits
-
-
Morris Jette authored
-
Morris Jette authored
It would require many changes to the slurm.conf files used in testing, and the functionality being tested here should work the same on non-Cray systems anyway (if it works on a non-Cray system, the functionality would be fine on a real Cray system too).
-
Morris Jette authored
-
Morris Jette authored
It would require many changes to the slurm.conf files used in testing, and the functionality being tested here should work the same on non-Cray systems anyway (if it works on a non-Cray system, the functionality would be fine on a real Cray system too).
-
Morris Jette authored
Change the test from select/cray (also used for testing on non-Cray systems) to switch/cray (only used on real Cray systems)
-
Morris Jette authored
-
Morris Jette authored
This makes the test work properly if the default partition configuration includes "OverSubscribe=Exclusive"
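For reference, a partition definition of the kind described, with hypothetical partition and node names:

    PartitionName=debug Nodes=tux[1-4] Default=YES OverSubscribe=EXCLUSIVE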
-
- 05 Apr, 2019 7 commits
-
-
Morris Jette authored
Some conditions were resulting in an srun error about no SGI job container
-
Morris Jette authored
-
Alejandro Sanchez authored
-
Alejandro Sanchez authored
-
Morris Jette authored
This fixes a potential task layout problem. Specifically: if a cluster allocates by core, each core contains 2 CPUs, a job requests 3 tasks with --gpus-per-task=1, and one core is available on each of two nodes but only one GPU is available on the first node, then without this patch cons_tres would try to put 2 tasks on the first node (one per CPU). This adds a check of the GPU count in order to prevent that. Observed sporadically when running regression tests 39.[10-15] in immediate succession on a Cray system or on a system configured with an Epilog that takes a few seconds to complete.
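A hedged sketch of a job request that would hit the layout described above (the application name is a placeholder):

    # 3 tasks across 2 nodes, one GPU bound to each task; without the
    # patch, cons_tres could place 2 tasks on a node with only 1 free GPU
    srun -N2 -n3 --gpus-per-task=1 ./app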
-
Ben Roberts authored
Updated ControlAddr to point to 127.0.0.1 rather than 123.4.5.6. Bug 6794
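The corresponding slurm.conf fragment would look something like this (the host name is illustrative):

    ControlMachine=testhost
    ControlAddr=127.0.0.1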
-
Alejandro Sanchez authored
-