- 09 Apr, 2019 19 commits
-
-
Morris Jette authored
Allow it to support use of any GPU in a multi-GPU system
-
Brian Christiansen authored
-
Brian Christiansen authored
This allows jobs to be placed on booting nodes rather than being given a whole node, even if it would have been better to wait for the node to boot. Bug 6782
-
Brian Christiansen authored
-
Brian Christiansen authored
Make nodes available after being suspended, even if they are down, drained, or failed. Bug 6212
-
Brian Christiansen authored
Bug 6333
-
Brian Christiansen authored
Bug 6333
-
Brian Christiansen authored
Rely on the POWERING_DOWN bit. This allows POWERING_DOWN nodes to be cleared after a restart, since suspend_node_bitmap was local to power_save.c. Bug 6333
-
Brian Christiansen authored
Bug 6333
-
Brian Christiansen authored
Instead of just guessing the time, let's use the original time. Bug 6333
-
Brian Christiansen authored
-
Brian Christiansen authored
While "powering down" the nodes aren't eligible to be allocated. Nodes will remain "powering down" for SuspendTimeout time. Bug 6333
-
Brian Christiansen authored
NODE_STATE_POWER_SAVE == node is actually off. Bug 6333
-
Brian Christiansen authored
Run suspend and resume more often than once per ResumeTimeout after the last suspend. Don't allocate suspending nodes until after SuspendTimeout. Bug 6333
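As a rough illustration of the timers these commits adjust, a minimal slurm.conf power-saving sketch (scripts, paths, and values are assumptions, not taken from the commits):

    # Illustrative power-saving settings; programs and values are placeholders
    SuspendProgram=/usr/local/sbin/node_suspend.sh
    ResumeProgram=/usr/local/sbin/node_resume.sh
    SuspendTime=600      # seconds idle before a node is suspended
    SuspendTimeout=30    # window during which a node stays "powering down"
    ResumeTimeout=300    # max seconds to wait for a resumed node to boot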
-
Morris Jette authored
If the MPS server is started with the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE set, then the MPS server will be limited to that percentage of the GPU's total, which will not work as desired if additional jobs are initiated.
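For context, a hedged shell sketch of starting the MPS daemon with that variable set (the value 50 is arbitrary and illustrative):

    # Cap the MPS server at 50% of the GPU's threads; as noted above,
    # this limit does not behave as desired once additional jobs start
    export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
    nvidia-cuda-mps-control -d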
-
Morris Jette authored
The user percentage logic was incorrect and unnecessary. Removed the logic and the associated test.
-
Danny Auble authored
Bug 5667
-
Danny Auble authored
Bug 5667
-
Danny Auble authored
Bug 5667
-
- 08 Apr, 2019 2 commits
-
-
Morris Jette authored
Make tests able to work in a greater variety of configurations
-
Morris Jette authored
This will start and stop the MPS server as needed
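A sketch of the standard NVIDIA MPS controls a test like this might wrap (assumed; the actual test script is not shown here):

    nvidia-cuda-mps-control -d            # start the MPS control daemon
    echo quit | nvidia-cuda-mps-control   # stop it again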
-
- 07 Apr, 2019 5 commits
-
-
Morris Jette authored
This should only happen if something bad happens in the test
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
The previous logic could let a job requiring a persistent burst buffer start and fail without checking that the buffer already exists. This was causing regression tests 35.1 and 35.4 to fail.
-
- 06 Apr, 2019 7 commits
-
-
Morris Jette authored
-
Morris Jette authored
It would require many changes to the slurm.conf files used in testing, and the functionality being tested here should work the same on non-Cray systems anyway (if it works on a non-Cray system, the functionality would be fine on a real Cray system too).
-
Morris Jette authored
-
Morris Jette authored
It would require many changes to the slurm.conf files used in testing, and the functionality being tested here should work the same on non-Cray systems anyway (if it works on a non-Cray system, the functionality would be fine on a real Cray system too).
-
Morris Jette authored
Change the test from select/cray (also used for testing on non-Cray systems) to switch/cray (only used on real Cray systems)
-
Morris Jette authored
-
Morris Jette authored
This makes the test work properly if the default partition configuration includes "OverSubscribe=Exclusive"
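For reference, a partition definition of the kind described, with hypothetical partition and node names:

    PartitionName=debug Nodes=tux[1-4] Default=YES OverSubscribe=EXCLUSIVE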
-
- 05 Apr, 2019 7 commits
-
-
Morris Jette authored
Some conditions were resulting in an srun error about no SGI job container
-
Morris Jette authored
-
Alejandro Sanchez authored
-
Alejandro Sanchez authored
-
Morris Jette authored
This fixes a potential task layout problem. Specifically: if a cluster allocates by core, each core contains 2 CPUs, a job requests 3 tasks with --gpus-per-task=1, and one core is available on each of two nodes but only one GPU is available on the first node, then without this patch cons_tres would try to put 2 tasks on the first node (one per CPU). This adds a check of the GPU count in order to prevent that. Observed sporadically when running regression tests 39.[10-15] in immediate succession on a Cray system or on a system configured with an Epilog that takes a few seconds to complete.
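A hedged sketch of a job request that would hit the layout described above (the application name is a placeholder):

    # 3 tasks across 2 nodes, one GPU bound to each task; without the
    # patch, cons_tres could place 2 tasks on a node with only 1 free GPU
    srun -N2 -n3 --gpus-per-task=1 ./app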
-
Ben Roberts authored
Updated ControlAddr to point to 127.0.0.1 rather than 123.4.5.6. Bug 6794
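The corresponding slurm.conf fragment would look something like this (the host name is illustrative):

    ControlMachine=testhost
    ControlAddr=127.0.0.1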
-
Alejandro Sanchez authored
-