- 10 Apr, 2019 11 commits
- Alejandro Sanchez authored
- Dominik Bartkiewicz authored
  Bug 6807.
- Alejandro Sanchez authored
- Alejandro Sanchez authored
  ==8640== Thread 5 bckfl:
  ==8640== Syscall param openat(filename) points to unaddressable byte(s)
  ==8640==    at 0x4A81D0E: open (open64.c:48)
  ==8640==    by 0x5934ABB: _update_job_env (burst_buffer_cray.c:3338)
  ==8640==    by 0x5934ABB: bb_p_job_begin (burst_buffer_cray.c:3962)
  ...
  ==8640== Address 0x6b96120 is 16 bytes inside a block of size 61 free'd
  ==8640==    at 0x48369AB: free (vg_replace_malloc.c:530)
  ==8640==    by 0x49D4873: slurm_xfree (xmalloc.c:244)
  ==8640==    by 0x490C317: free_command_argv (run_command.c:249)
  ==8640==    by 0x5934A5C: bb_p_job_begin (burst_buffer_cray.c:3947)
  ...
  ==8640== Block was alloc'd at
  ==8640==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
  ==8640==    by 0x49D4566: slurm_xmalloc (xmalloc.c:87)
  ==8640==    by 0x49D4B67: makespace (xstring.c:103)
  ==8640==    by 0x49D4C91: _xstrcat (xstring.c:134)
  ==8640==    by 0x49D4ECF: _xstrfmtcat (xstring.c:280)
  ==8640==    by 0x593497C: bb_p_job_begin (burst_buffer_cray.c:3936)
  ...
  Bug 6807.
- Doug Jacobsen authored
  Bug 6807.
- Doug Jacobsen authored
  Bug 6807.
- Doug Jacobsen authored
  Bug 6807.
- Ben Roberts authored
  Changed the behavior of "scontrol reboot" to require the user to specify the nodes to reboot rather than defaulting to ALL. Bug 6465
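For illustration, the new expectation on the command line looks like this (node names are hypothetical; a sketch, not output from a real cluster):

```shell
# Reboot only the named nodes (a nodelist is now required):
scontrol reboot node[01-04]

# Rebooting every node must now be requested explicitly:
scontrol reboot ALL
```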
- Morris Jette authored
  This corrects the gres/mps test to ensure that CUDA_VISIBLE_DEVICES is always zero (it is dependent upon the devices under MPS control and not related to cgroup-constrained devices). Also corrects some logic related to how the percentage calculation works in the test.
- Morris Jette authored
  Permit any GPU to be used for gres/mps mode, but only one GPU can be used at a time.
- Morris Jette authored
  A request for --gres=mps:1 (specifically with a count of one) was in some places being treated like a request for a full GPU.
- 09 Apr, 2019 20 commits
- Morris Jette authored
  The variable is relative to which GPUs are managed by MPS. Currently Slurm only allows one GPU to be managed by MPS at a time, so the environment variable should always be zero.
- Morris Jette authored
  Allow it to support use of any GPU in a multi-GPU system.
- Brian Christiansen authored
- Brian Christiansen authored
  This allows jobs to be placed on booting nodes rather than being given a whole node, even if it would have been better to wait for the node to boot. Bug 6782
- Brian Christiansen authored
- Brian Christiansen authored
  Make nodes available after being suspended, even if down, drained, or failed. Bug 6212
- Brian Christiansen authored
  Bug 6333
- Brian Christiansen authored
  Bug 6333
- Brian Christiansen authored
  Rely on the POWERING_DOWN bit. This allows POWERING_DOWN nodes to be cleared after a restart, since suspend_node_bitmap was local to power_save.c. Bug 6333
- Brian Christiansen authored
  Bug 6333
- Brian Christiansen authored
  Instead of just guessing the time, use the original time. Bug 6333
- Brian Christiansen authored
- Brian Christiansen authored
  While "powering down", nodes aren't eligible to be allocated. Nodes will remain "powering down" for the SuspendTimeout period. Bug 6333
- Brian Christiansen authored
  NODE_STATE_POWER_SAVE == node is actually off. Bug 6333
- Brian Christiansen authored
  Run suspend and resume more often than ResumeTimeout after the last suspend. Don't allocate suspending nodes until after SuspendTimeout. Bug 6333
- Morris Jette authored
  If the MPS server is started with the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE set, the MPS server will be limited to that percentage of the GPU's total capacity, which will not work as desired if additional jobs are initiated.
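For context, this is the environment variable in question when the MPS control daemon is launched by hand (a sketch; the device index 0 and the 50% value are arbitrary illustration values):

```shell
# Presetting the percentage at daemon start caps EVERY MPS client at
# that share of the GPU -- the behavior the commit above avoids.
export CUDA_VISIBLE_DEVICES=0                  # GPU placed under MPS control
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50    # caps all clients at 50%
nvidia-cuda-mps-control -d                     # launch MPS control daemon
```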
- Morris Jette authored
  The user percentage logic was incorrect and unnecessary. Removed the logic and the associated test.
- Danny Auble authored
  Bug 5667
- Danny Auble authored
  Bug 5667
- Danny Auble authored
  Bug 5667
- 08 Apr, 2019 2 commits
- Morris Jette authored
  Make tests able to work in a greater variety of configurations.
- Morris Jette authored
  This will start and stop the MPS server as needed.
- 07 Apr, 2019 5 commits
- Morris Jette authored
  This should only happen if something bad happens in the test.
- Morris Jette authored
- Morris Jette authored
- Morris Jette authored
- Morris Jette authored
  The previous logic could let a job requiring a persistent burst buffer start and fail without checking that the buffer already exists. This was causing regression tests 35.1 and 35.4 to fail.
- 06 Apr, 2019 2 commits
- Morris Jette authored
- Morris Jette authored
  It would require many changes to the slurm.conf files used in testing, and the functionality being tested here should work the same on non-Cray systems anyway (if it works on a non-Cray system, the functionality would be fine on a real Cray system too).