- 18 Dec, 2018 3 commits
-
-
Morris Jette authored
The commit introduced other problems and still needs more work.
-
Morris Jette authored
The previous select/cons_tres logic would under some circumstances allocate a job more cores/CPUs than requested. One specific example is a cluster having nodes with 4 cores, 2 hyperthreads each. A job request for 20 tasks would launch 8 tasks on 8 CPUs on each of 2 nodes and 4 tasks on 6 CPUs on a third node (i.e. a total of 22 CPUs when only 20 are needed).
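The arithmetic in this example can be checked with a short sketch. It assumes one CPU per task and that CPUs are handed out in whole cores (2 hyperthreads each); the function name `cpus_needed` is hypothetical and is not part of Slurm.

```python
import math

def cpus_needed(tasks, threads_per_core=2, cpus_per_task=1):
    """CPUs needed for `tasks` on one node, rounded up to whole cores.

    Assumption for illustration: allocation granularity is one core.
    """
    cores = math.ceil(tasks * cpus_per_task / threads_per_core)
    return cores * threads_per_core

# Task layout from the commit message: 8 + 8 + 4 tasks on three nodes.
tasks_per_node = [8, 8, 4]
expected = [cpus_needed(t) for t in tasks_per_node]

# CPUs allocated per node by the old logic, as reported in the commit message.
reported = [8, 8, 6]

over_allocation = sum(reported) - sum(expected)
print(sum(expected), sum(reported), over_allocation)
```

Under these assumptions the third node needs only 4 CPUs (2 cores), so the 6 CPUs reported there account for the whole difference between 22 and 20.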
-
Tim Wickberg authored
The only use was removed prior to 17.11 in 04bc96f6. Bug 6261.
-
- 17 Dec, 2018 4 commits
-
-
Morris Jette authored
Without this logic, a job could be allocated more GRES on a node than there are CPUs for.
-
Morris Jette authored
-
Michael Hinton authored
Update GRES docs regarding ignore records and links. Update tests to get rid of extra fields for GRES ignore lines. Add tests to check for improved ignore syntax. Test the gres.conf examples. Add mps to regression slurm.conf to test mps record parsing. Bug 5520.
-
Tim Wickberg authored
Add --version option to slurmd, and document that new option.
-
- 15 Dec, 2018 5 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
This has been disabled since before Slurm 1.0.0.
-
Tim Wickberg authored
No functional change, all these are in comments.
-
Morris Jette authored
Ensure that output appears in a fixed order for parsing by the test.
-
Morris Jette authored
This supports heterogeneous environments (i.e. different MPS counts on different GPUs within a node).
-
- 14 Dec, 2018 4 commits
-
-
Morris Jette authored
If the GRES count on a node with topology changes when slurmctld restarts, the GRES data structures were left in an inconsistent state. Namely, the bitmaps would reflect the old size while the count reflected the new size, which resulted in asserts. In addition, the gres/mps data structure sizes need to match the GPU count on each node. This new logic synchronizes the mps data structures when the GPU count changes.
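The invariant described here (bitmap size always equal to the count) can be sketched as follows. This is an illustration only, not Slurm's C implementation: the `NodeGres` class, its fields, and the choice to preserve allocation bits that still fit are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class NodeGres:
    """Hypothetical per-node GRES state: a count plus an allocation bitmap."""
    count: int
    alloc_bitmap: list = field(default_factory=list)

    def __post_init__(self):
        if not self.alloc_bitmap:
            self.alloc_bitmap = [False] * self.count

def sync_gres_count(state, new_count):
    """Resize the bitmap together with the count so they never disagree."""
    old = state.alloc_bitmap
    # Assumption: keep bits that still fit; new devices start unallocated.
    state.alloc_bitmap = old[:new_count] + [False] * max(0, new_count - len(old))
    state.count = new_count
    # The inconsistency from the commit message would trip a check like this.
    assert len(state.alloc_bitmap) == state.count
    return state
```

Updating the count and the bitmap in one function is the point of the sketch: no caller can change one without the other.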
-
Michael Hinton authored
-
Tim Wickberg authored
-
Morris Jette authored
-
- 13 Dec, 2018 2 commits
-
-
Morris Jette authored
Add support for co-scheduling of gres/gpu and gres/mps. GPUs that are allocated to one are avoided for the other GRES type. Add gres/mps documentation. Recover job gres/mps state on slurmctld restart. We need to use job gres/mps state to recover node info, since we will not know the count of mps on each device file until the node registers.
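A minimal sketch of the co-scheduling rule, assuming allocations are tracked as sets of GPU indexes per GRES type. The function name and data layout are hypothetical, and the sketch only filters out devices claimed by the other type; it does not model sharing among mps jobs or exclusivity among gpu jobs.

```python
def not_claimed_by_other(gres_type, allocations, num_gpus):
    """Return GPU indexes that the other GRES type has not allocated.

    `allocations` maps "gpu" and "mps" to sets of GPU indexes in use
    (an assumed layout for illustration).
    """
    other = "mps" if gres_type == "gpu" else "gpu"
    return [i for i in range(num_gpus) if i not in allocations[other]]

allocations = {"gpu": {0}, "mps": {2}}
mps_candidates = not_claimed_by_other("mps", allocations, 4)
gpu_candidates = not_claimed_by_other("gpu", allocations, 4)
```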
-
Michael Hinton authored
Check for cgroup usage and change GPU indexes accordingly. Fix formatting errors in docs. bug 5520
-
- 11 Dec, 2018 12 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec and slurm.spec-legacy as well.
-
Tim Wickberg authored
-
Morris Jette authored
-
Tim Wickberg authored
Bug 6029.
-
Morris Jette authored
-
Morris Jette authored
Duplicate file names will cause problems for gres/mps, which needs to map 1-to-1 to gres/gpu devices.
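A check of this kind can be sketched as below, assuming the device file names have already been collected into a list. The function name is hypothetical and the sketch does not parse gres.conf.

```python
from collections import Counter

def duplicate_files(device_files):
    """Return device file names that appear more than once, sorted."""
    counts = Counter(device_files)
    return sorted(name for name, n in counts.items() if n > 1)

dups = duplicate_files(["/dev/nvidia0", "/dev/nvidia1", "/dev/nvidia0"])
```

An empty result means every device file is unique, which is the precondition for a 1-to-1 mapping between gres/mps and gres/gpu records.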
-
Morris Jette authored
Support undocumented "Files=" in addition to "File=". Note that multiple file names can be used as an argument, and this minor change eliminates some possible confusion.
-
Morris Jette authored
-
Morris Jette authored
bug 5520
-
Michael Hinton authored
bug 5520
-
- 10 Dec, 2018 6 commits
-
-
Morris Jette authored
Without this, jobs were being assigned the wrong CUDA_VISIBLE_DEVICES value.
-
Morris Jette authored
-
Morris Jette authored
The CPU frequency set by the user is not exact with current kernels. There seems to be a fair amount of variation depending upon timing and other events. This results in test1.76 failing sporadically. This changes the logic to retry if the frequency differs by more than 10 percent rather than failing immediately.
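The retry-on-tolerance idea can be sketched as follows. This is not the test's actual code: the function names, the `read_freq` callback, and the number of attempts are assumptions; only the 10 percent threshold comes from the commit message.

```python
def within_tolerance(requested_khz, measured_khz, tolerance=0.10):
    """True if the measured frequency is within `tolerance` of the request."""
    return abs(measured_khz - requested_khz) <= tolerance * requested_khz

def check_with_retry(requested_khz, read_freq, attempts=3):
    """Re-read the frequency up to `attempts` times before giving up.

    `read_freq` is a hypothetical zero-argument callable returning kHz.
    """
    for _ in range(attempts):
        if within_tolerance(requested_khz, read_freq()):
            return True
    return False

# Simulated readings: the first is far off, the second is within 10 percent.
readings = iter([1_500_000, 2_150_000])
ok = check_with_retry(2_000_000, lambda: next(readings))
```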
-
Morris Jette authored
The device numbers are set using the same mechanism used to set CUDA_VISIBLE_DEVICES. Bug 5520.
-
Morris Jette authored
-
Michael Hinton authored
Add step_unconfigure_hardware() to GRES plugin API. Update test39.18 regarding links. Update GRES docs. Update docs related to links. Document GPU frequency resetting behavior. Specify what the default is for GpuFreqDef. Move NVML init and shutdown to configure() and unconfigure(). Get rid of superfluous `!= 0`-style statements. Print note when GPU index != minor number. Clean up various formatting and other errors. Bug 5520.
-
- 09 Dec, 2018 4 commits
-
-
Tim Wickberg authored
No functional change.
-
Tim Wickberg authored
-
Tim Wickberg authored
Due to upcoming changes in the X11 forwarding subsystem, support for older-style X11 tunnels will be removed. Older client commands cannot support the newer style. Rather than have the tunnel fail, reject the job allocation request up front. Bug 3647.
-
Tim Wickberg authored
Also tweak the one info() message here to match these others.
-