Commits · fb361fd24745ba89835943bda2ed508a3b092e09 · Manuel G. Marciani / ces_slurm_simulator

18 Dec, 2018 3 commits

Revert commit 97c0c50e · fb361fd2
Morris Jette authored Dec 18, 2018
```
The commit introduced other problems and still needs more work.
```
fb361fd2

Morris Jette authored Dec 18, 2018

The previous select/cons_tres logic would under some circumstances
allocate a job more cores/CPUs than requested. One specific example
is a cluster having nodes with 4 cores, 2 hyperthreads each. A
job request for 20 tasks would launch 8 tasks on 8 CPUs on each
of 2 nodes and 4 tasks on 6 CPUs on a third node (i.e. a total of
22 CPUs when only 20 are needed).
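
The arithmetic in this example can be checked with a short sketch. It assumes one CPU (hyperthread) per task, allocation in whole cores, and nodes filled in order. `pack_tasks` is a hypothetical helper written for illustration and is not part of select/cons_tres.

```python
import math


def pack_tasks(num_tasks, cores_per_node, threads_per_core):
    """Return a list of (tasks, cpus) per node for a tightly packed job.

    Assumptions (not taken from the Slurm source): one CPU per task,
    CPUs are handed out in whole cores, and nodes are filled in order.
    """
    cpus_per_node = cores_per_node * threads_per_core
    layout = []
    remaining = num_tasks
    while remaining > 0:
        tasks_here = min(remaining, cpus_per_node)
        # Round up to whole cores, then convert back to CPUs.
        cores_needed = math.ceil(tasks_here / threads_per_core)
        layout.append((tasks_here, cores_needed * threads_per_core))
        remaining -= tasks_here
    return layout


# The cluster from the commit message: 4 cores per node, 2 hyperthreads each.
layout = pack_tasks(20, cores_per_node=4, threads_per_core=2)
total_cpus = sum(cpus for _, cpus in layout)
print(layout, total_cpus)
```

Under these assumptions the third node needs only 4 CPUs for its 4 tasks, for a total of 20, compared with the 22 CPUs described above.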

97c0c50e

Remove SHOW_DETAIL2 flag, and handling from scontrol. · e55e892d
Tim Wickberg authored Dec 18, 2018
```
The only use was removed prior to 17.11 in 04bc96f6.

Bug 6261.
```
e55e892d

17 Dec, 2018 4 commits

gres-per-job with cpus-per-tres · 65c0f8f8
Morris Jette authored Dec 17, 2018
```
Without this logic, a job could be allocated more GRES on a node than
there are CPUs to support.
```
65c0f8f8
Clarify gres.conf cores option use · b84ace89
Morris Jette authored Dec 17, 2018

b84ace89

gres.conf ignore lines now only need Name and File fields · cdc92fa2

Michael Hinton authored Dec 17, 2018

Update GRES docs regarding ignore records and links.
Update tests to get rid of extra fields for GRES ignores lines.
Add tests to check for improved ignore syntax.
Test the gres.conf examples.
Add mps to regression slurm.conf, to test mps record parsing.

Bug 5520

cdc92fa2

Add getopt_long() handling to slurmd to support long option names. · e6318e30
Tim Wickberg authored Dec 15, 2018
```
Add --version option to slurmd, and document that new option.
```
e6318e30

15 Dec, 2018 5 commits
- Fix typo in comment. · 96cc5f84
  Tim Wickberg authored Dec 15, 2018
  
  96cc5f84
- Remove dead code. · 282baa7c
  Tim Wickberg authored Dec 15, 2018
```
This has been disabled since before Slurm 1.0.0.
```
  282baa7c
- Grammatical fix. s/Insure/Ensure/. · f02164b6
  Tim Wickberg authored Dec 15, 2018
```
No functional change, all these are in comments.
```
  f02164b6
- Add sleep to test · e752fa8a
  Morris Jette authored Dec 14, 2018
```
Ensure that output appears in a fixed order for parsing by the test.
```
  e752fa8a
- gres/mps job allocation logic · 76e3f6b9
  Morris Jette authored Dec 14, 2018
```
This supports heterogeneous environments (i.e. different
MPS counts on different GPUs within a node)
```
  76e3f6b9
14 Dec, 2018 4 commits

handle change in gres count · 0469cca2

Morris Jette authored Dec 14, 2018

If the gres count on a node with topology changed when the slurmctld
restarted, the gres data structures were left in an inconsistent
state. Namely, the bitmaps would reflect the old size while the count
reflected the new size, which resulted in asserts. In addition, the
gres/mps data structure sizes need to match the gpu count on each
node. This new logic will synchronize mps data structures on gpu count
changes.
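
The inconsistency described here, a bitmap sized for the old count next to a count holding the new value, can be illustrated with a minimal sketch. `NodeGres` and `sync_bitmap` are hypothetical names, and keeping the surviving allocation bits on resize is a choice made for this example, not something the commit message states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeGres:
    """Hypothetical stand-in for a node's GRES state (not Slurm's struct)."""
    count: int = 0
    alloc_bitmap: List[bool] = field(default_factory=list)


def sync_bitmap(state: NodeGres, new_count: int) -> None:
    """Resize the bitmap so its length always equals the count."""
    if new_count < 0:
        raise ValueError("count must be non-negative")
    old = state.alloc_bitmap
    # Keep the bits that still fit and pad with unallocated entries.
    padding = [False] * max(0, new_count - len(old))
    state.alloc_bitmap = old[:new_count] + padding
    state.count = new_count


state = NodeGres(count=4, alloc_bitmap=[True, False, True, False])
sync_bitmap(state, 2)            # count shrinks after a restart
shrunk = list(state.alloc_bitmap)
sync_bitmap(state, 3)            # count grows again
grown = list(state.alloc_bitmap)
```

Updating the count and the bitmap in one place keeps the invariant `len(alloc_bitmap) == count`, which is the kind of condition an assert would otherwise trip on.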

0469cca2

Fix minor typos in GRES docs · f9b19c62
Michael Hinton authored Dec 14, 2018

f9b19c62
Remove redundant listen() call. · d24adda1
Tim Wickberg authored Dec 13, 2018

d24adda1
Add gres/mps step allocation logic and test · 70996715
Morris Jette authored Dec 13, 2018

70996715

13 Dec, 2018 2 commits

gres/mps scheduling · b1d7368e

Morris Jette authored Dec 12, 2018

Add support for co-scheduling of gres/gpu and gres/mps.
  GPUs that are allocated to one are avoided for the other
  GRES type.
Add gres/mps documentation.
Recover job gres/mps state on slurmctld restart. We need to use
  job gres/mps state to recover node info since we will not know
  the count of mps on each device file until the node registers.

b1d7368e

Fix bug where frequencies of GPUs in a cgroup were not set · 8a11ea4e
Michael Hinton authored Dec 13, 2018
```
Check for cgroup usage and change GPU indexes accordingly.
Fix formatting errors in docs.
bug 5520
```
8a11ea4e

11 Dec, 2018 12 commits
- Merge branch 'slurm-18.08' · d583c9ed
  Tim Wickberg authored Dec 11, 2018
  
  d583c9ed
- Start NEWS for v18.08.5 · 17e96ba6
  Tim Wickberg authored Dec 11, 2018
  
  17e96ba6
- Update META for v18.08.4 release. · 35da90df
  Tim Wickberg authored Dec 11, 2018
```
Update slurm.spec and slurm.spec-legacy as well.
```
  35da90df
- Merge branch 'slurm-18.08' · 75aa5195
  Tim Wickberg authored Dec 11, 2018
  
  75aa5195
- Add logging of gres socket details · 9cfc7bc2
  Morris Jette authored Dec 11, 2018
  
  9cfc7bc2
- Docs - change dual-factor to multi-factor. · 5a2e926b
  Tim Wickberg authored Dec 11, 2018
```
Bug 6029.
```
  5a2e926b
- gres/mps get core-binding info from gres/gpu if available · d3f45190
  Morris Jette authored Dec 11, 2018
  
  d3f45190
- gres.conf parsing, treat duplicate Files as an error · d008e388
  Morris Jette authored Dec 11, 2018
```
Duplicate file names will cause problems for gres/mps,
which needs to map 1-to-1 to gres/gpu devices
```
  d008e388
- gres.conf parsing change · 64ec6538
  Morris Jette authored Dec 11, 2018
```
Support undocumented "Files=" in addition to "File=".
Note that multiple file names can be used as an argument
and this minor change eliminates some possible confusion.
```
  64ec6538
- Fix for uninitialized variable and mem leak without NVML · 97961caa
  Morris Jette authored Dec 10, 2018
  
  97961caa
- cosmetic changes to commit 371afa7c · 6550af12
  Morris Jette authored Dec 10, 2018
```
bug 5520
```
  6550af12
- At step end, reset GPU frequency to default boot value · 189cdc0d
  Michael Hinton authored Dec 10, 2018
```
bug 5520
```
  189cdc0d
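
The two gres.conf parsing commits listed above (d008e388 and 64ec6538) describe accepting `Files=` as well as `File=` and treating a repeated file as an error. A minimal sketch of that behavior follows. `parse_gres_line` is a hypothetical helper, not Slurm's parser: it compares file values as literal strings and does not expand any range expressions.

```python
def parse_gres_line(line, seen_files):
    """Parse one 'Key=Value ...' line into a dict with lowercase keys.

    Hypothetical sketch: 'Files' is accepted as an alias for 'File', and
    a file value that was already seen raises ValueError.
    """
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"malformed token: {token!r}")
        key = key.lower()
        if key == "files":
            key = "file"  # treat Files= the same as File=
        fields[key] = value
    path = fields.get("file")
    if path is not None:
        if path in seen_files:
            raise ValueError(f"duplicate File: {path}")
        seen_files.add(path)
    return fields


seen = set()
first = parse_gres_line("Name=gpu File=/dev/nvidia0", seen)
second = parse_gres_line("Name=gpu Files=/dev/nvidia1", seen)
try:
    parse_gres_line("Name=gpu File=/dev/nvidia1", seen)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Rejecting duplicates at parse time means later code can rely on each device file appearing once, which is what a 1-to-1 mapping between gres/mps and gres/gpu devices needs.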
10 Dec, 2018 6 commits

Fix for gres/gpu with cgroup constrained devices · 5c2af4b2
Morris Jette authored Dec 10, 2018
```
without this, the jobs were being assigned the wrong
CUDA_VISIBLE_DEVICES value
```
5c2af4b2
Merge branch 'slurm-18.08' · e08e51ff
Morris Jette authored Dec 10, 2018

e08e51ff

Make CPU frequency test more forgiving · 8ee71c24

Morris Jette authored Dec 10, 2018

The cpu frequency set by the user is not exact with current kernels.
There seems to be a fair variation depending upon timing and other
events. This is resulting in test1.76 failing sporadically. This
changes the logic to retry if the frequency differs by more than
10 percent rather than failing immediately.
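
The retry rule can be sketched as follows. This illustrates the 10 percent tolerance only and is not the code of test1.76. The function names, the kHz units, and the number of attempts are assumptions made for the example.

```python
def within_tolerance(requested_khz, observed_khz, tolerance=0.10):
    """True if the observed frequency is within 10 percent of the request."""
    return abs(observed_khz - requested_khz) <= tolerance * requested_khz


def check_with_retry(requested_khz, read_frequency, max_attempts=3):
    """Re-read the frequency instead of failing on the first mismatch.

    read_frequency is any zero-argument callable returning the current
    frequency; max_attempts is an assumed limit, not taken from the test.
    """
    for _ in range(max_attempts):
        if within_tolerance(requested_khz, read_frequency()):
            return True
    return False


# First reading is off by more than 10 percent, second is close enough.
samples = iter([2000000, 2350000])
ok = check_with_retry(2400000, lambda: next(samples))
```

A relative tolerance scales with the requested frequency, so the same check works for low and high settings without per-frequency thresholds.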

8ee71c24

Fix GPU device numbers frequency control · f582b8db

Morris Jette authored Dec 10, 2018

The device numbers are set using the same mechanism used to set
CUDA_VISIBLE_DEVICES
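
One way to read this is that a single index-mapping helper feeds both the environment variable and the frequency control, so the two cannot disagree. The sketch below assumes that when devices are constrained by cgroup a step sees only its allocated GPUs, renumbered from 0. All function names are hypothetical and none of this is taken from the Slurm source.

```python
def visible_indexes(allocated, cgroup_constrained):
    """Return device indexes as the step would see them (assumed model)."""
    if cgroup_constrained:
        # Only the allocated devices are visible, so they count up from 0.
        return list(range(len(allocated)))
    return sorted(allocated)


def cuda_visible_devices(allocated, cgroup_constrained):
    """Build the CUDA_VISIBLE_DEVICES string from the shared mapping."""
    indexes = visible_indexes(allocated, cgroup_constrained)
    return ",".join(str(i) for i in indexes)


def frequency_targets(allocated, cgroup_constrained):
    """Devices whose frequency should be set, using the same mapping."""
    return visible_indexes(allocated, cgroup_constrained)


env_value = cuda_visible_devices([2, 3], cgroup_constrained=True)
targets = frequency_targets([2, 3], cgroup_constrained=True)
unconstrained = cuda_visible_devices([3, 2], cgroup_constrained=False)
```

With GPUs 2 and 3 allocated under cgroup constraint, both the environment value and the frequency targets refer to devices 0 and 1 in this model.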

bug 5520

f582b8db

Minor cosmetic changes, no change in logic · 96cb1908
Morris Jette authored Dec 10, 2018

96cb1908

GPU frequencies now reset after job is finished · 7a48b58f

Michael Hinton authored Dec 10, 2018

Add step_unconfigure_hardware() to GRES plugin API
Update test39.18 regarding links.
Update GRES docs.
Update docs related to links.
Document GPU frequency resetting behavior.
Specify what the default is for GpuFreqDef.
Move NVML init and shutdown to configure() and unconfigure().
Get rid of superfluous `!= 0`-style statements.
Print note when GPU index != minor number.
Clean up various formatting and other errors.

bug 5520

7a48b58f

09 Dec, 2018 4 commits
- Add 19.05 protocol_version blocks to _{pack,unpack}_job_desc_msg. · 64948c7e
  Tim Wickberg authored Dec 08, 2018
```
No functional change.
```
  64948c7e
- Block batch jobs from requesting X11 forwarding. · 688c2c91
  Tim Wickberg authored Dec 08, 2018
  
  688c2c91
- Block jobs requesting X11 forwarding from older client commands. · f2da4d7c
  Tim Wickberg authored Dec 08, 2018
```
Due to upcoming changes in the X11 forwarding subsystem, support
for older-style X11 tunnels will be removed. Older client commands
cannot support the newer style. Rather than have the tunnel fail,
reject the job allocation request up front.

Bug 3647.
```
  f2da4d7c
- Use __func__ in info() messages in _validate_job_desc(). · 61dc0965
  Tim Wickberg authored Dec 08, 2018
```
Also tweak the one info() message here to match these others.
```
  61dc0965