- 04 Apr, 2019 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
gres needs to locally keep the mps_table size rather than use node_record_count, which gets reset to zero at shutdown.
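A minimal sketch of the pattern described above, with hypothetical names (not the actual Slurm code): the table size is captured when the table is built so cleanup does not depend on node_record_count, which may already be zero at shutdown.

    #include <stdlib.h>

    typedef struct { char *data; } mps_entry_t;   /* illustrative only */

    static mps_entry_t *mps_table = NULL;
    static int mps_table_size = 0;                 /* locally kept size */

    static void mps_table_build(int node_cnt)
    {
        mps_table_size = node_cnt;                 /* capture the size now */
        mps_table = calloc(mps_table_size, sizeof(*mps_table));
    }

    static void mps_table_free(void)
    {
        for (int i = 0; i < mps_table_size; i++)   /* not node_record_count */
            free(mps_table[i].data);
        free(mps_table);
        mps_table = NULL;
        mps_table_size = 0;
    }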
-
Morris Jette authored
Check for out of range node index. Not observed, but prevents possible invalid memory reference.
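A hedged illustration of this kind of defensive check, with hypothetical names (not the actual Slurm code):

    #include <stddef.h>

    /* Hypothetical helper: return NULL instead of indexing out of bounds,
     * avoiding a possible invalid memory reference. */
    static void *node_entry_get(void **table, int table_cnt, int node_inx)
    {
        if (node_inx < 0 || node_inx >= table_cnt)
            return NULL;        /* out-of-range node index */
        return table[node_inx];
    }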
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 03 Apr, 2019 22 commits
-
-
Morris Jette authored
Copied array without including the array size pointer, so it did not get freed.
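A hedged sketch of the bug class described above, with hypothetical names: if the element count is not carried over when the array is duplicated, a free routine that loops over that count frees nothing and the copy leaks.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        int    cnt;      /* number of elements; must be copied too */
        char **elems;
    } str_array_t;

    static str_array_t *str_array_dup(const str_array_t *src)
    {
        str_array_t *dst = calloc(1, sizeof(*dst));
        dst->cnt = src->cnt;                       /* the piece that was missing */
        dst->elems = calloc(dst->cnt, sizeof(char *));
        for (int i = 0; i < dst->cnt; i++)
            dst->elems[i] = src->elems[i] ? strdup(src->elems[i]) : NULL;
        return dst;
    }

    static void str_array_free(str_array_t *a)
    {
        for (int i = 0; i < a->cnt; i++)  /* frees nothing if cnt was not copied */
            free(a->elems[i]);
        free(a->elems);
        free(a);
    }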
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
It was failing due to an Epilog, but could also fail when run in parallel with other jobs.
-
Morris Jette authored
This includes information about how to get a clean HWLOC report.
-
Morris Jette authored
Without this change I was able to fairly consistently cause "scontrol shutdown" to NOT cause the slurmd to exit:
1. Start slurmd and slurmctld
2. Immediately execute "scontrol reconfig" and "scontrol shutdown"
-
Morris Jette authored
Format changes only
-
Morris Jette authored
log message and comment format changes
-
Alejandro Sanchez authored
Bug 5851.
-
Danny Auble authored
# Conflicts:
#	slurm/slurm.h.in
-
Danny Auble authored
-
Alejandro Sanchez authored
This prevents rebuilding a job's dependency string when it has at least one invalid (never satisfiable) dependency, regardless of whether that invalid dependency has already been purged (after MinJobAge) or not. This can be useful to track down the culprit invalid dependencies even after they are gone from slurmctld's in-memory job list. The flag is cleared upon a successful job dependency update, or after another job in the dependency list is satisfied when the list is composed with the '?' symbol (OR'ed). Bug 5851.
-
Alejandro Sanchez authored
Job dependencies separated by "?" (OR'ed) should make the dependent job become independent as soon as any one of the dependencies is satisfied. Without this patch, if an invalid (non-satisfiable) dependency was resolved before a satisfiable one, the dependent job would never become independent, even after the satisfiable one was eventually resolved. Bug 5851.
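A hedged sketch (hypothetical types, not the actual Slurm code) of the OR semantics described above: one satisfied entry in a '?'-separated list is enough, and an invalid entry must not block the job when another entry is satisfied.

    #include <stdbool.h>

    typedef enum { DEP_PENDING, DEP_SATISFIED, DEP_INVALID } dep_state_t;

    /* Return true if an OR'ed ('?'-separated) dependency list is satisfied. */
    static bool or_deps_satisfied(const dep_state_t *deps, int dep_cnt)
    {
        for (int i = 0; i < dep_cnt; i++) {
            if (deps[i] == DEP_SATISFIED)
                return true;    /* any satisfied entry makes the job independent */
        }
        return false;           /* pending or invalid entries alone do not */
    }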
-
Alejandro Sanchez authored
No functional change, just preparation for a following commit with the actual fix. Bug 5851.
-
Felip Moll authored
The response to the XCC raw command is always 16 bytes; log the unexpected length and return if the answer is not of that size. Bug 6743.
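A hedged sketch of that length check, with hypothetical names (not the actual plugin code):

    #include <stdio.h>
    #include <stddef.h>

    #define XCC_RAW_RESP_LEN 16   /* the XCC raw command always returns 16 bytes */

    /* Hypothetical helper: reject a short or oversized response instead of
     * parsing it, logging the unexpected length for debugging. */
    static int xcc_check_resp_len(size_t resp_len)
    {
        if (resp_len != XCC_RAW_RESP_LEN) {
            fprintf(stderr, "xcc: unexpected response length %zu (expected %d)\n",
                    resp_len, XCC_RAW_RESP_LEN);
            return -1;
        }
        return 0;
    }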
-
Morris Jette authored
-
Morris Jette authored
If GRES configuration data is unavailable from gres.conf, then use the node's "Gres=" information from slurm.conf. This will eliminate or minimize the gres.conf file in many situations. Bug 6761.
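An illustrative (assumed) configuration showing the idea: with GRES counts declared on the node line in slurm.conf, a gres.conf may be reduced or omitted for simple setups. Node names, counts, and device paths below are examples only.

    # slurm.conf (illustrative)
    GresTypes=gpu
    NodeName=tux[1-4] Gres=gpu:4 CPUs=32 RealMemory=128000 State=UNKNOWN

    # gres.conf can then be omitted or minimized; previously it would have
    # needed explicit entries such as:
    #   NodeName=tux[1-4] Name=gpu File=/dev/nvidia[0-3]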
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
They were a bit too verbose for my taste
-
- 02 Apr, 2019 6 commits
-
-
Felip Moll authored
In 0e149092, not setting the variable when the job was not requesting any GRES was considered a bug. The CUDA API will use all devices if the variable is not set; if it is set to some unknown or empty value, it will use no devices. This variable should be used only for testing purposes; ConstrainDevices=yes in cgroup.conf is recommended instead. Bug 6412.
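A small sketch (illustrative, not Slurm code) of the environment-variable semantics described above:

    #include <stdlib.h>

    /* CUDA_VISIBLE_DEVICES semantics, per the commit message:
     *   unset            -> the CUDA API exposes all devices
     *   "" or unknown id -> the CUDA API exposes no devices
     *   "0,1"            -> only the listed devices are exposed */
    static void demo_gpu_env(void)
    {
        unsetenv("CUDA_VISIBLE_DEVICES");           /* all devices visible */
        setenv("CUDA_VISIBLE_DEVICES", "", 1);      /* no devices visible */
        setenv("CUDA_VISIBLE_DEVICES", "0,1", 1);   /* only GPUs 0 and 1 */
    }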
-
Felip Moll authored
GRES plugins will set up the environment for every GRES in the system even if the job has not requested it. Bug 6412.
-
Felip Moll authored
"... than one GRES of the same name but different type" (title truncated). This reverts f7fca7ba. Bug 6412.
-
Morris Jette authored
Initial work needed for Bug 6761 support.
-
Morris Jette authored
Comment format changes and some log message updates.
-
Morris Jette authored
This problem was triggered with a configuration of PrologFlags=Alloc,Contain.
-
- 01 Apr, 2019 2 commits
-
-
Morris Jette authored
This eliminates a slurmctld error message when a job shrinks to size zero. There is no need to re-compute the CPU count when the job_resources node_bitmap is empty. The logic works fine without this change if the job size shrinks, but not when it shrinks to size zero. Bug 6472.
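For reference, a job is shrunk with "scontrol update"; the case fixed here is shrinking all the way to zero nodes (the job id below is illustrative, matching the example in the next commit message):

    scontrol update JobId=43565 NumNodes=0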
-
Morris Jette authored
When a job size was reset to zero, this error message was printed:
    slurm_allocation_lookup: Job/step already completing or completed
which may lead the user to believe the operation failed when it worked as planned. Now it prints this:
    To reset Slurm environment variables, execute
      For bash or sh shells:  . ./slurm_job_43565_resize.sh
      For csh shells:         source ./slurm_job_43565_resize.csh
Where the reset scripts contain zero node count information:
    export SLURM_NODELIST=""
    export SLURM_JOB_NODELIST=""
    export SLURM_NNODES=0
    export SLURM_JOB_NUM_NODES=0
    export SLURM_JOB_CPUS_PER_NODE=""
    unset SLURM_NPROCS
    unset SLURM_NTASKS
    unset SLURM_TASKS_PER_NODE
-
- 31 Mar, 2019 4 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
Continuation of 2764f3fd. Bug 6589.
-
Brian Christiansen authored
Continuation of 9a243a1a. Bug 6592.
-
Brian Christiansen authored
-