Commits · a243f6527dad7effdd3b65f972cea5a90ad40681 · Manuel G. Marciani / ces_slurm_simulator

12 Apr, 2019 5 commits
- Expand gres/mps web page documentation · a243f652
  Morris Jette authored Apr 12, 2019
  
  a243f652
- Enhance gres/mps prolog script · e59e5ad9
  Morris Jette authored Apr 12, 2019
  
  e59e5ad9
- Merge branch 'slurm-18.08' · 514a5cb5
  Tim Wickberg authored Apr 11, 2019
  
  514a5cb5
- Start NEWS for v18.08.8 · ec8a13d5
  Tim Wickberg authored Apr 11, 2019
  
  ec8a13d5
- Update META for v18.08.7 release. · 806826a3
  Tim Wickberg authored Apr 11, 2019
```
Update slurm.spec and slurm.spec-legacy as well.
```
  806826a3
11 Apr, 2019 7 commits

Add missing unlock · d4b81c40
Morris Jette authored Apr 11, 2019
```
Coverity CID 197586
```
d4b81c40
Major update of gres/mps web page · f2cee387
Morris Jette authored Apr 11, 2019

f2cee387

Job's --time-min used when not needed · 7dcde848

Morris Jette authored Apr 11, 2019

This bug was observed once while running test1.103 and found to be
reproducible under some circumstances. When a job is submitted with
a --time-min value, the backfill scheduler sets the job's time limit
to that value (see below), tries to start it, and, if successful,
sets the time limit to the largest value possible without delaying
the expected start time of any higher priority jobs. If the job
can't be started, the job's (maximum) time limit is supposed to be
restored to its previous value. That was happening in some, but not
all, places in the code. This patch resets the time limit in the
missing cases.

Note: The job's time limit is set to the --time-min value here:
diff --git a/src/plugins/sched/backfill/backfill.c b/src/plugins/sched/backfill/backfill.c
index 5600495e1d..e22a810394 100644
--- a/src/plugins/sched/backfill/backfill.c
+++ b/src/plugins/sched/backfill/backfill.c
@@ -1930,7 +1930,7 @@ next_task:
 		    slurm_get_preempt_mode())
 	...
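
The save-and-restore behaviour described in this commit message can be sketched roughly as follows. This is a Python illustration only, not the C code in backfill.c; `Job`, `try_backfill`, `can_start`, and `longest_safe_limit` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    time_limit: int                 # maximum time limit, in minutes
    time_min: Optional[int] = None  # --time-min value, in minutes (if any)


def try_backfill(
    job: Job,
    can_start: Callable[[Job], bool],
    longest_safe_limit: Callable[[Job], int],
) -> bool:
    """Try to start a job, temporarily using its --time-min as the limit.

    Both callbacks are stand-ins (assumptions) for scheduler logic:
    can_start says whether the job fits now, longest_safe_limit gives the
    largest limit that does not delay higher priority jobs.
    """
    saved_limit = job.time_limit
    if job.time_min is not None:
        job.time_limit = job.time_min  # schedule with the minimum first

    if not can_start(job):
        # The case the patch addresses: always restore the original limit.
        job.time_limit = saved_limit
        return False

    if job.time_min is not None:
        # Grow the limit as far as possible, never past the original maximum.
        job.time_limit = max(job.time_min,
                             min(saved_limit, longest_safe_limit(job)))
    return True
```

The key point is that every failure path goes through the restore step, so a job that could not be started never keeps its --time-min value as its time limit.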

7dcde848

Merge branch 'slurm-18.08' · f1837a40
Tim Wickberg authored Apr 11, 2019

f1837a40
Docs - add "accruetime" to the list of fields accessible through "squeue -O". · b297b19d
Tim Wickberg authored Apr 11, 2019

b297b19d
Add EndTime and CompletingTime fields to 'scontrol completing'. · c39a9715
Doug Jacobsen authored Apr 11, 2019
```
Bug 6787.
```
c39a9715

Add sleep to test · be016fd7

Morris Jette authored Apr 10, 2019

What can happen without this sleep is the "JOBID" output from
task 0 comes after the "HOST" output from task 1, which breaks
the parsing just below this.

be016fd7

10 Apr, 2019 17 commits

Add test to ensure gres/mps and gres/gpu don't overlap · ff3cabfe

Morris Jette authored Apr 10, 2019

Make sure that jobs using gres/mps and gres/gpu on the same
node either run at different times or use different GPUs.
This restores part of test40.2 that was bad and removed in
commit 5a6409bb. It also ensures that the logic added in
commit ac1182cf works as desired.
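
The condition this test checks can be expressed as a small predicate. The sketch below is an illustration in Python, not the actual test40.2 code; the `Alloc` record and the `conflicts` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Alloc:
    start: float          # job start time
    end: float            # job end time
    gpus: FrozenSet[int]  # GPU indexes used on the node


def conflicts(mps_job: Alloc, gpu_job: Alloc) -> bool:
    """Return True if the two jobs use a common GPU at the same time."""
    overlap_in_time = mps_job.start < gpu_job.end and gpu_job.start < mps_job.end
    share_gpu = bool(mps_job.gpus & gpu_job.gpus)
    return overlap_in_time and share_gpu
```

A gres/mps job and a gres/gpu job on the same node are acceptable when `conflicts` returns False, that is, when they run at different times or use different GPUs.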

ff3cabfe

Fix gres/gpu gres/mps overlap issue without core binding · ac1182cf

Morris Jette authored Apr 10, 2019

Without this change, if core binding was not specified (i.e. no
cores/cpus in the gres.conf file for gpu), then gres/mps could
be allocated to overlap with gres/gpu that were already allocated.

ac1182cf

Remove direct support for Cray NHC system. · f5e3b151
Tim Wickberg authored Apr 10, 2019
```
Bug 4188.
```
f5e3b151
Merge branch 'slurm-18.08' · f65be327
Alejandro Sanchez authored Apr 10, 2019

f65be327
Add extra debug2 logs to identify why BadConstraints reason is set. · 20c2b615
Albert Gil authored Apr 10, 2019
```
Bug 6608.
```
20c2b615
Add Chad to SchedMD support team. · ae9cb8aa
Chad Vizino authored Apr 09, 2019

ae9cb8aa
Merge branch 'slurm-18.08' · e8bacd36
Alejandro Sanchez authored Apr 10, 2019

e8bacd36
Fix invalid reads of size 1 due to non null-terminated string reads. · c496c812
Dominik Bartkiewicz authored Apr 10, 2019
```
Bug 6807.
```
c496c812
Merge branch 'slurm-18.08' · 7342ea21
Alejandro Sanchez authored Apr 10, 2019

7342ea21

burst_buffer/cray - fix script_argv use-after-free. · 81b9d7bd

Alejandro Sanchez authored Apr 09, 2019

==8640== Thread 5 bckfl:
==8640== Syscall param openat(filename) points to unaddressable byte(s)
==8640==    at 0x4A81D0E: open (open64.c:48)
==8640==    by 0x5934ABB: _update_job_env (burst_buffer_cray.c:3338)
==8640==    by 0x5934ABB: bb_p_job_begin (burst_buffer_cray.c:3962)
...
==8640==  Address 0x6b96120 is 16 bytes inside a block of size 61 free'd
==8640==    at 0x48369AB: free (vg_replace_malloc.c:530)
==8640==    by 0x49D4873: slurm_xfree (xmalloc.c:244)
==8640==    by 0x490C317: free_command_argv (run_command.c:249)
==8640==    by 0x5934A5C: bb_p_job_begin (burst_buffer_cray.c:3947)
...
==8640==  Block was alloc'd at
==8640==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==8640==    by 0x49D4566: slurm_xmalloc (xmalloc.c:87)
==8640==    by 0x49D4B67: makespace (xstring.c:103)
==8640==    by 0x49D4C91: _xstrcat (xstring.c:134)
==8640==    by 0x49D4ECF: _xstrfmtcat (xstring.c:280)
==8640==    by 0x593497C: bb_p_job_begin (burst_buffer_cray.c:3936)
...

Bug 6807.

81b9d7bd

node_features/knl_cray - fix script_argv use-after-free. · 97c833b3
Doug Jacobsen authored Apr 09, 2019
```
Bug 6807.
```
97c833b3
burst_buffer/cray - fix memory leak due to unfreed job script content. · f21be235
Doug Jacobsen authored Apr 09, 2019
```
Bug 6807.
```
f21be235
burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes. · f1e4527d
Doug Jacobsen authored Apr 09, 2019
```
Bug 6807.
```
f1e4527d

Changed "scontrol reboot" to not default to ALL nodes. · 85324ad5

Ben Roberts authored Mar 12, 2019

Changed the behavior of "scontrol reboot" to require the user to specify
the nodes to reboot rather than defaulting to ALL.

Bug 6465

85324ad5

Corrections to gres/mps test · 5a6409bb

Morris Jette authored Apr 09, 2019

This corrects the gres/mps test to ensure that CUDA_VISIBLE_DEVICES
is always zero (it depends on the devices under MPS control and is
not related to cgroup-constrained devices). Also corrects some
logic related to how the percentage calculation works in the test.
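
As a rough illustration of what the test expects, the sketch below builds the environment for a gres/mps job. It is Python written for this example, not Slurm code. `mps_job_env` and its parameters are hypothetical, and the percentage formula (the job's share of the GPU's MPS count) is an assumption, since the commit message does not spell it out. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is the CUDA MPS variable that limits a client's share of the GPU.

```python
from typing import Dict


def mps_job_env(job_mps: int, gpu_mps_total: int) -> Dict[str, str]:
    """Build the CUDA environment for a job sharing one GPU through MPS.

    job_mps:       MPS count requested by the job (e.g. --gres=mps:50)
    gpu_mps_total: MPS count configured for the GPU (assumed known)
    """
    if not 0 < job_mps <= gpu_mps_total:
        raise ValueError("job_mps must be between 1 and gpu_mps_total")

    # Assumed formula: the job's share of the GPU, at least 1 percent.
    percent = max(1, (job_mps * 100) // gpu_mps_total)

    return {
        # Only one GPU is under MPS control, so its index is always 0.
        "CUDA_VISIBLE_DEVICES": "0",
        "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE": str(percent),
    }
```

Under these assumptions, a job requesting 50 of 100 MPS units would see CUDA_VISIBLE_DEVICES=0 regardless of which physical GPU on the node is serving MPS.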

5a6409bb

Ensure only one GPU is used for gres/mps · 992e8a7c
Morris Jette authored Apr 09, 2019
```
Permit any GPU to be used for gres/mps mode, but only
one GPU can be used
```
992e8a7c

Correct some logic for determining if a gres is shared · 6d59f7f4

Morris Jette authored Apr 09, 2019

A request for --gres=mps:1 (specifically with a count of one)
was in some places being treated like a request for a full GPU.

6d59f7f4

09 Apr, 2019 11 commits
- Correct how CUDA_VISIBLE_DEVICES env var is set for gres/mps · 08c908ca
  Morris Jette authored Apr 09, 2019
```
The variable is relative to which GPUs are managed by
MPS. Currently Slurm only allows one GPU to be managed
by MPS at a time, so the env var should always be zero.
```
  08c908ca
- Make gres/mps prolog more generic · 4347b7b5
  Morris Jette authored Apr 09, 2019
```
Allow it to support use of any GPU in multi-GPU system
```
  4347b7b5
- Merge branch 'cloud_enhancements' · 38447760
  Brian Christiansen authored Apr 09, 2019
  
  38447760
- Add alloc_booting_nodes SchedulerParameters · 36c30487
  Brian Christiansen authored Mar 30, 2019
```
This allows jobs to be placed on booting nodes rather than being given a
whole node even if it would have been better to wait for the node boot.

Bug 6782
```
  36c30487
- Add PowerSave debug flag · 1f54730d
  Brian Christiansen authored Mar 29, 2019
  
  1f54730d
- Add idle_on_node_suspend SlurmctldParameter · 7abadb99
  Brian Christiansen authored Feb 26, 2019
```
to make nodes available after being suspended even if down, drain,
failed.

Bug 6212
```
  7abadb99
- Add NEWS and RELEASE NOTES · 4ae21d13
  Brian Christiansen authored Mar 30, 2019
```
Bug 6333
```
  4ae21d13
- Allow scontrol update nodename= state=resume to clear powering down · f596b4b6
  Brian Christiansen authored Mar 26, 2019
```
Bug 6333
```
  f596b4b6
- Don't need suspend_node_bitmap · c2c6af91
  Brian Christiansen authored Mar 22, 2019
```
Rely on POWERING_DOWN bit. This allows POWERING_DOWN nodes to be cleared
after a restart -- since suspend_node_bitmap was local to power_save.c.

Bug 6333
```
  c2c6af91
- Make sure powering down node is not available to be allocated · 73f428a2
  Brian Christiansen authored Mar 22, 2019
```
Bug 6333
```
  73f428a2
- State save node last_response time · 6a3c20b0
  Brian Christiansen authored Mar 22, 2019
```
Instead of just guessing the time, let's use the original time.

Bug 6333
```
  6a3c20b0