Commits · 71688f3289b3c1c17d05fc2e8bfc6abd63010ba1 · Manuel G. Marciani / ces_slurm_simulator

29 Dec, 2014 6 commits

Test to check CpuFreqGovernors configuration value · 71688f32

Morris Jette authored Dec 29, 2014

The test was assuming the governors configured on a CPU are
available to the job, ignoring the configured CpuFreqGovernors
value.
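The corrected assumption can be modeled as an intersection: a job may only use the governors that the CPU offers and that CpuFreqGovernors also permits. This is a minimal sketch. The helper name and the comma-separated, case-insensitive parsing of the value are assumptions for illustration, not the test's actual code.

```python
def usable_governors(cpu_available, cpu_freq_governors):
    """Return the governors a job may use (hypothetical helper).

    cpu_available: governor names offered by the CPU, e.g. as listed in
        /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors.
    cpu_freq_governors: the CpuFreqGovernors value, assumed here to be a
        comma-separated, case-insensitive list.
    """
    allowed = {g.strip().lower() for g in cpu_freq_governors.split(",") if g.strip()}
    # Keep the CPU's ordering, but drop anything the configuration does not allow.
    return [g for g in cpu_available if g.lower() in allowed]


if __name__ == "__main__":
    on_cpu = ["conservative", "ondemand", "userspace", "powersave", "performance"]
    print(usable_governors(on_cpu, "OnDemand,Performance"))
```

A governor present on the CPU but absent from the configuration (here "powersave") is not offered to the job.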

Critical bug fix · b503f45b

Morris Jette authored Dec 29, 2014

Removed a mkdir call (not the critical part). The critical part: the code
then tested a vestigial errno value, so the original CPU environment was
never being reset and test1.76 was consistently failing.

Fix search path issues in cpu-freq test · 513a6a2d
Morris Jette authored Dec 29, 2014

translate CPU governor values between versions · 3d4ff78e
Morris Jette authored Dec 29, 2014
```
The values are changing between v14.11 and v15.08.
```

Revert CPU_FREQ enum changes · 7b8744ec

Morris Jette authored Dec 29, 2014

The CPU_FREQ values cannot be changed unless logic is added to
translate values from old-format commands and from saved state
(save/restore). Since that logic does not exist, the values were
restored to their original values.
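This is a minimal sketch of the kind of translation logic the commit says is missing. Each flag bit is mapped between an old and a new encoding in both directions, so that old-format requests and previously saved state still decode correctly. The bit values below are hypothetical placeholders, not Slurm's actual CPU_FREQ constants.

```python
# Hypothetical encodings: NOT the real Slurm CPU_FREQ_* values.
OLD_TO_NEW = {
    0x01: 0x0100,  # flag A
    0x02: 0x0200,  # flag B
    0x04: 0x0400,  # flag C
    0x08: 0x0800,  # flag D
}
# Reverse table, used when writing state or replying in the old format.
NEW_TO_OLD = {new: old for old, new in OLD_TO_NEW.items()}


def translate_flags(value, mapping):
    """Re-encode every set bit of `value` using `mapping`; unknown bits are dropped."""
    result = 0
    for src_bit, dst_bit in mapping.items():
        if value & src_bit:
            result |= dst_bit
    return result


if __name__ == "__main__":
    old = 0x01 | 0x04
    new = translate_flags(old, OLD_TO_NEW)
    print(hex(new))                                 # 0x500
    print(translate_flags(new, NEW_TO_OLD) == old)  # True
```

Silently dropping unknown bits is one design choice. A stricter version could reject them instead.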

test enhancements · e278d3a8

Morris Jette authored Dec 29, 2014

Make the test work if Slurm commands or the working directory are not in the search path.
Move the test for the FastScheduler value into globals.

25 Dec, 2014 1 commit
- CPU frequencies: updates from code review and basic testing · 263a17c5
  Morris Jette authored Dec 24, 2014
24 Dec, 2014 12 commits
- Minor improvements to document formats · 000729e6
  Morris Jette authored Dec 24, 2014
- Add new cpu-freq test · bfeadafe
  Rod Schultz authored Dec 24, 2014
- Fourth patch for cpu_max_frequency · fe1b8e19
  Rod Schultz authored Dec 24, 2014
- Third patch for cpu_max_frequency · 14679bcf
  Rod Schultz authored Dec 24, 2014
- Second patch for cpu_max_frequency · e1d68ab8
  Rod Schultz authored Dec 24, 2014
- First patch of cpu_max_frequency · ad818b31
  Rod Schultz authored Dec 24, 2014
- Enable per-partition gang sched resolution · 5e02af31
  Morris Jette authored Dec 24, 2014
  ```
  Enable per-partition gang scheduling resource resolution (e.g. the partition
  can have SelectTypeParameters=CR_CORE, while the global value is CR_SOCKET).
  bug 1299
  ```
- sview burst buffer fixes · 367ffbff
  Morris Jette authored Dec 24, 2014
  ```
  Added the user name rather than just printing the user ID number.
  Fixed the format for a job array record ("_" rather than "." separator).
  Added a GRES field.
  ```
- sview support for burst buffers fleshed out · f3b79bec
  Nathan Yee authored Dec 24, 2014
- Enforce partition shared option · f8fb79d5
  Morris Jette authored Dec 23, 2014
  ```
  Properly enforce partition Shared=YES option. Previously, oversubscribing
  resources required gang scheduling to also be configured.
  ```
- Fix bad job array task ID value · 46a2e9a1
  Morris Jette authored Dec 23, 2014
  ```
  Prevent an invalid job array task ID value if a task is started using gang
  scheduling (i.e. the task starts in a SUSPENDED state). The task ID gets
  set to NO_VAL and the task string is also cleared.
  ```
- Document some more TCP config params for HTC · c949f2b7
  Morris Jette authored Dec 23, 2014
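The per-partition gang scheduling change in 5e02af31 can be pictured as a lookup. It prefers the partition's SelectTypeParameters and falls back to the global value when the partition sets none. This is a simplified sketch. The granularity names, and the prefix matching that lets a value such as CR_Core_Memory count as core, are assumptions rather than the code in Slurm's gang scheduler.

```python
# Hypothetical mapping from a SelectTypeParameters value to the resource
# granularity the gang scheduler would time-slice on.
GRANULARITY = [
    ("CR_SOCKET", "socket"),
    ("CR_CORE", "core"),
    ("CR_CPU", "cpu"),
]


def gang_resolution(partition_params, global_params):
    """Pick the gang scheduling granularity for one partition.

    partition_params may be None or empty when the partition does not set
    SelectTypeParameters; the global slurm.conf value is then used.
    """
    params = (partition_params or global_params).upper()
    for prefix, granularity in GRANULARITY:
        if params.startswith(prefix):  # also matches e.g. CR_CORE_MEMORY
            return granularity
    return "node"  # fall back to whole-node resolution


if __name__ == "__main__":
    print(gang_resolution("CR_Core", "CR_Socket"))  # core
    print(gang_resolution(None, "CR_Socket"))       # socket
```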
23 Dec, 2014 7 commits

Disable portion of a test if gang scheduling · c4772156
Morris Jette authored Dec 23, 2014

Correct test for gang scheduling config · dad7a3ac
Morris Jette authored Dec 23, 2014

Merge branch 'slurm-14.11' · 2cbae725
Morris Jette authored Dec 23, 2014

Prevent gang resume of suspended job · 161d0336

Morris Jette authored Dec 23, 2014

Prevent a manually suspended job from being resumed by the gang scheduler once
free resources are available.
bug 1335
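One way to express the rule in 161d0336 is to record who suspended each job. The gang scheduler then resumes only the jobs it suspended itself. The sketch below is a simplified model with hypothetical names and a hypothetical "suspended_by" tag. It is not Slurm's gang scheduler logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: int
    state: str = "RUNNING"
    suspended_by: Optional[str] = None  # "gang" or "user" (hypothetical tag)


def suspend(job, by):
    job.state = "SUSPENDED"
    job.suspended_by = by


def gang_try_resume(job, resources_free):
    """Resume only jobs the gang scheduler itself suspended."""
    if job.state == "SUSPENDED" and job.suspended_by == "gang" and resources_free:
        job.state = "RUNNING"
        job.suspended_by = None
        return True
    return False


if __name__ == "__main__":
    manual, timesliced = Job(1), Job(2)
    suspend(manual, by="user")  # e.g. a manual `scontrol suspend`
    suspend(timesliced, by="gang")
    print(gang_try_resume(manual, resources_free=True))      # False
    print(gang_try_resume(timesliced, resources_free=True))  # True
```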

Correct ntasks_per_core test logic · 732b0d9c

Morris Jette authored Dec 23, 2014

Now that Slurm is checking that the job's ntasks_per_core is valid,
this test's bad value was causing the job submission to fail. Change
the option to use ntasks_per_socket instead, which matches the
test logic.

Merge branch 'slurm-14.11' · d87f9867
Morris Jette authored Dec 22, 2014

set node state RESERVED on maint reservation delete · cf846644

Dorian Krause authored Dec 22, 2014

We have hit the following problem, which seems to be present in
slurm-14-11-2-1 and previous versions. When a node is reserved and an
overlapping maint reservation is created and later deleted, the scontrol
output reports the node as IDLE rather than RESERVED:

+ scontrol show node node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ grep State
+ scontrol show nodes node1
   State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Note that after the deletion of reservation "Y" the node reports State=IDLE
instead of State=RESERVED. I think that the delete_resv() function in
slurmctld/reservation.c should call set_node_maint_mode(true) like
update_resv() does. With the patch pasted at the end of this e-mail, I
get the following output, which matches my expectation:

+ scontrol show node node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000 nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Thanks,
Dorian
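The expected behavior in this report can be modeled by recomputing a node's state from every reservation that still covers it whenever one is deleted, instead of clearing the state outright. The sketch below is a simplified, hypothetical model. It ignores allocated jobs, reservation time windows, and other flags, and it does not reproduce delete_resv() or set_node_maint_mode().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    name: str
    nodes: frozenset
    maint: bool = False


def node_state(node, reservations):
    """Derive an idle node's displayed state from the active reservations."""
    covering = [r for r in reservations if node in r.nodes]
    if any(r.maint for r in covering):
        return "MAINT"
    if covering:
        return "RESERVED"
    return "IDLE"


if __name__ == "__main__":
    x = Reservation("X", frozenset({"node1"}))
    y = Reservation("Y", frozenset({"node1", "node2"}), maint=True)
    active = {x, y}
    print(node_state("node1", active))  # MAINT
    active.discard(y)                   # delete the maint reservation
    print(node_state("node1", active))  # RESERVED
    active.discard(x)
    print(node_state("node1", active))  # IDLE
```

The state sequence MAINT, RESERVED, IDLE matches the expected output in the second transcript.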

22 Dec, 2014 10 commits

initialize a variable to avoid bad xfree · d2a81ccf
Morris Jette authored Dec 22, 2014
```
Bug introduced earlier today with new logic in
commit 54396196
```
Auth/munge - Correct AccountingStoragePass parsing · 2edef50d
Daniel Ahlin authored Dec 22, 2014
```
Correct parsing of AccountingStoragePass when specified in the old format
(just a path name).
```
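This is a minimal sketch of accepting both forms of the value. It assumes that the newer format is a comma-separated list of key=value pairs containing a socket= key, and that the old format is a bare socket path. The function name is hypothetical, and this is not the auth/munge plugin's actual parser.

```python
def munge_socket(storage_pass):
    """Return the MUNGE socket path named by AccountingStoragePass, or None."""
    if not storage_pass:
        return None
    value = storage_pass.strip()
    if "=" not in value:
        # Old format: the whole value is just a path name.
        return value
    # Assumed newer format: "key=value[,key=value...]".
    for token in value.split(","):
        key, _, val = token.partition("=")
        if key.strip().lower() == "socket":
            return val.strip()
    return None


if __name__ == "__main__":
    print(munge_socket("/var/run/munge/global.socket.2"))
    print(munge_socket("socket=/var/run/munge/global.socket.2"))
```

Both calls yield the same path, which is the point of the fix: an old-format value should not be rejected or misread.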

Deny use of nodes based on ntasks_per_core/socket · 54396196

Morris Jette authored Dec 22, 2014

If a job specifies ntasks_per_core and/or ntasks_per_socket, deny
use of nodes which lack sufficient resources. Previously this was
ignored.
bug 1296

Documentation updates. · 270e24ce
Brian Christiansen authored Dec 22, 2014

Add SLURM_CPUS_PER_TASK to salloc,sbatch,srun man pages. · e5f20824
Brian Christiansen authored Dec 22, 2014
```
Bug 1331
```
Prevent reference of NULL pointer · f6a5cff9
Morris Jette authored Dec 22, 2014

Merge branch 'slurm-14.11' · 1449c3f1
Morris Jette authored Dec 22, 2014

avoid delay on commit for PMI task at rank 0 · fcc11e22

Rémi Palancher authored Dec 22, 2014

During MPI job initialisation through PMI, Intel MPI typically calls PMI_KVS_Put()
many times from the task at rank 0, and each of these calls is followed by
PMI_KVS_Commit(). Slurm's implementation of PMI_KVS_Commit() imposes a delay
to avoid a DDoS on the original srun. This delay is proportional to the total
number of tasks and can be up to 3 seconds for large jobs, for example with
7168 tasks. Therefore, when Intel MPI calls PMI_KVS_Commit() 475 times (measured
on a test case) from the task at rank 0, 28 minutes are spent in the delay function.
All other tasks in the job are waiting in a PMI_Barrier, so there is
no risk of a DDoS from this single task 0. The patch alters the delay
calculation to make sure the task at rank 0 is not delayed. All other
tasks are spread globally across the same time range as before.
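The idea behind the patch can be sketched as a delay function that returns zero for rank 0 and spreads the remaining ranks over the same window as before. The linear spreading and the max_delay_us parameter are assumptions for illustration, not the formula used in Slurm's PMI code.

```python
def commit_delay_us(rank, ntasks, max_delay_us):
    """Hypothetical per-task delay before a PMI_KVS_Commit() reaches srun.

    Rank 0 is never delayed; ranks 1..ntasks-1 are spread linearly
    across (0, max_delay_us].
    """
    if rank == 0 or ntasks <= 1:
        return 0
    return (rank * max_delay_us) // (ntasks - 1)


if __name__ == "__main__":
    ntasks, window = 7168, 3_000_000  # 3 s window, as in the large-job example
    print(commit_delay_us(0, ntasks, window))           # 0
    print(commit_delay_us(ntasks - 1, ntasks, window))  # 3000000
```

With rank 0 at zero delay, its repeated commits no longer accumulate a per-commit wait, while the other ranks still reach srun staggered across the window.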

Fix to set more job env vars · c2b6d81f

Morris Jette authored Dec 22, 2014

This moves a bzero() call checked in with commit 30e45f8a.
I also noticed that test1.14 was generating errors like this:
"srun: error: cpus_per_node array is not set"
This was due to previously uninitialized variables now being
cleared by bzero (i.e. the old data was garbage, but it avoided
the error message). The properly cleared variables were introduced
in commit 0252a63e.
bug 1306

Fix to set more job env vars · 30e45f8a

Morris Jette authored Dec 22, 2014

This is a correction to commit 0252a63e.
The previous logic failed to populate the data structure as used in another RPC.
bug 1306

20 Dec, 2014 4 commits
- Add sview burst buffer display. · 64d14d0e
  Nathan Yee authored Dec 19, 2014
- Merge remote-tracking branch 'origin/slurm-14.11' · 6bd35cc8
  Danny Auble authored Dec 19, 2014
- Make it so previous versions of salloc/srun work with newer versions of Slurm daemons · d11ece80
  Danny Auble authored Dec 19, 2014
  ```
  The slurmstepd still needs to be fixed, which most likely can't be fixed
  until 15.08.
  ```
- Add the account, qos and reservation to srun. · 0252a63e
  David Bigagli authored Dec 19, 2014