- 03 Apr, 2019 1 commit
-
-
Morris Jette authored
They were a bit too verbose for my taste
-
- 02 Apr, 2019 6 commits
-
-
Felip Moll authored
In 0e149092, not setting the variable when the job was not requesting any GRES was considered a bug. The CUDA API will use all devices if the variable is not set; if it is set to some unknown or empty value, it will use no devices. This variable should be used only for testing purposes, and ConstrainDevices=yes in cgroup.conf is recommended. Bug 6412
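As a hedged sketch of the recommended alternative (ConstrainDevices and TaskPlugin are real Slurm parameters, but this excerpt is illustrative and not taken from the commit), constraining devices through cgroups rather than relying on CUDA_VISIBLE_DEVICES looks like this:

```
# slurm.conf (excerpt) -- enable task-level cgroup enforcement
TaskPlugin=task/cgroup

# cgroup.conf -- restrict each job to only the devices it was allocated,
# instead of relying on CUDA_VISIBLE_DEVICES alone
ConstrainDevices=yes
```

With this in place, a job that requested no GPUs simply cannot open the GPU device files, regardless of what CUDA_VISIBLE_DEVICES contains.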
-
Felip Moll authored
The gres plugins will set up the environment for every GRES in the system, even if the job has not requested it. Bug 6412
-
Felip Moll authored
than one GRES of the same name but different type". This reverts f7fca7ba. Bug 6412
-
Morris Jette authored
initial work needed for bug 6761 support
-
Morris Jette authored
comment format and change some log messages
-
Morris Jette authored
This problem was triggered with a configuration of PrologFlags=Alloc,Contain
-
- 01 Apr, 2019 2 commits
-
-
Morris Jette authored
This eliminates a slurmctld error message when a job shrinks to size zero. There is no need to re-compute the CPU count when the job_resources node_bitmap is empty. The logic works fine without this change if the job size shrinks, but not to size zero. Bug 6472
-
Morris Jette authored
When a job size was reset to zero, this error message was printed:
  slurm_allocation_lookup: Job/step already completing or completed
which may lead the user to believe the operation failed when it worked as planned. Now it prints this:
  To reset Slurm environment variables, execute
    For bash or sh shells:  . ./slurm_job_43565_resize.sh
    For csh shells:         source ./slurm_job_43565_resize.csh
Where the reset scripts contain zero node count information:
  export SLURM_NODELIST=""
  export SLURM_JOB_NODELIST=""
  export SLURM_NNODES=0
  export SLURM_JOB_NUM_NODES=0
  export SLURM_JOB_CPUS_PER_NODE=""
  unset SLURM_NPROCS
  unset SLURM_NTASKS
  unset SLURM_TASKS_PER_NODE
-
- 31 Mar, 2019 4 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
Continuation of 2764f3fd Bug 6589
-
Brian Christiansen authored
Continuation of 9a243a1a Bug 6592
-
Brian Christiansen authored
-
- 30 Mar, 2019 1 commit
-
-
Morris Jette authored
Many comments were modified to follow the Linux kernel standard. Many log messages were using the old function name and now print __func__ instead. A few log messages lacked the function name and those were added.
-
- 29 Mar, 2019 2 commits
-
-
Morris Jette authored
No change in any logic
-
Morris Jette authored
This adds logic to validate the count of GRES by device Type and not just Name and modifies the data structures as needed for consistency within slurmctld.
-
- 28 Mar, 2019 3 commits
-
-
Morris Jette authored
-
Broderick Gardner authored
Removed the linear search and replaced it with direct record references and a hashmap. This is faster and avoids potential collisions between assoc ids and user ids. Bug 4811
-
Broderick Gardner authored
Fixed existing usages as well. Bug 4811
-
- 27 Mar, 2019 11 commits
-
-
Morris Jette authored
Coverity CID 197448 bug 6303
-
Morris Jette authored
Remove reference to REQUEST_SIGNAL_PROCESS_GROUP in slurmstepd. It has been defunct since July 2013
-
Morris Jette authored
Sort the expected and actual output of the GRES APIs so that record order is irrelevant. Depending upon the GRES plugins loaded (specifically gres/gpu plus gres/mps), the GRES records can be sorted by File name to ensure the GRES records line up (the same position in both lists should refer to the same device file).
-
Alejandro Sanchez authored
-
Dominik Bartkiewicz authored
Bug 6750.
-
Danny Auble authored
-
Morris Jette authored
This logic could allocate a GRES device with an availability count of zero to a job.
-
Morris Jette authored
-
Morris Jette authored
This should only happen if there is flawed logic somewhere, but avoiding an abort is better than aborting.
-
Morris Jette authored
If the count of GPUs configured in slurm.conf and gres.conf differ and FastSchedule>=1, then the bitmap identifying the GPU allocation sent from slurmctld to slurmd will differ. Previously this resulted in CUDA_VISIBLE_DEVICES being set to NULL. Now it will be set correctly. Bug 6725
-
Morris Jette authored
If slurmd finds GRES with files and slurmctld can't use them (i.e. slurm.conf has a GRES count of 0), then avoid trying to create zero length bitmaps in the GRES data structure. bug 6725
-
- 26 Mar, 2019 10 commits
-
-
Morris Jette authored
This makes the gres bitmap size equal to the number of records for shared gres (i.e. gres/mps), otherwise it is the gres count (i.e. gres/gpu). bug 6733
-
Morris Jette authored
If the device files for gres/gpu are out of order or grouped in an unordered fashion (e.g. "Name=gpu Files=/dev/nvidia[2,8,10]"), then split the gres/gpu records to one record per file and make sure the gres/mps records are in an identical order. This is required for matching gres/gpu and gres/mps records (one GPU can be allocated either as gres/gpu or as gres/mps, but not both, so we need to be able to match records in slurmctld).
-
Morris Jette authored
Coverity CID 197447
-
Alejandro Sanchez authored
Bug 6710.
-
Marshall Garey authored
Bug 6590.
-
Morris Jette authored
Make some tests better able to work with CR_ONE_TASK_PER_CORE
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
More testing required. This configuration is still disabled in select_cons_tres.c
-
Morris Jette authored
Add --ntasks-per-core option to execute line as needed
-