Commits · 48016f8642765350f4b19aec60a2dfebf9937724 · Manuel G. Marciani / ces_slurm_simulator

23 Dec, 2014 4 commits

Fix bad job array task ID value · 48016f86

Morris Jette authored Dec 23, 2014

Prevent invalid job array task ID value if a task is started using gang
scheduling (i.e. the task starts in a SUSPENDED state). The task ID gets
set to NO_VAL and the task string is also cleared.

48016f86

Document some more TCP config params for HTC · 2da84715
Morris Jette authored Dec 23, 2014

2da84715

Prevent gang resume of suspended job · 161d0336

Morris Jette authored Dec 23, 2014

Prevent a job manually suspended from being resumed by gang scheduler once
free resources are available.
bug 1335

161d0336

set node state RESERVED on maint reservation delete · cf846644

Dorian Krause authored Dec 22, 2014

We have hit the following problem that seems to be present in Slurm
slurm-14-11-2-1 and previous versions. When a node is reserved and an
overlapping maint reservation is created and later deleted, the scontrol
output will report the node as IDLE rather than RESERVED:

+ scontrol show node node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ grep State
+ scontrol show nodes node1
   State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Note that after the deletion of reservation "Y" the node reports State=IDLE
instead of State=RESERVED. I think that the delete_resv() function in
slurmctld/reservation.c should call set_node_maint_mode(true) like
update_resv() does. With the patch pasted at the end of this e-mail I
get the following output which matches my expectation:

+ scontrol show node node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Thanks,
Dorian

cf846644
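
The expected behaviour in the report above can be pictured with a toy model: a node's displayed state is recomputed from whatever reservations still cover it, so deleting the maint reservation "Y" leaves the node RESERVED by "X". This is only a sketch in Python, not slurmctld code. `Reservation`, `node_state` and `delete_reservation` are hypothetical names, and the single-string state is a simplification of what scontrol prints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    """Hypothetical stand-in for a reservation record (not a Slurm type)."""
    name: str
    nodes: frozenset
    maint: bool = False


def node_state(node, reservations):
    """Derive the state shown for a node from the reservations covering it."""
    covering = [r for r in reservations if node in r.nodes]
    if any(r.maint for r in covering):
        return "MAINT"
    if covering:
        return "RESERVED"
    return "IDLE"


def delete_reservation(name, reservations):
    """Drop one reservation; node states are then recomputed from the rest."""
    return [r for r in reservations if r.name != name]


if __name__ == "__main__":
    resvs = [
        Reservation("X", frozenset({"node1"})),
        Reservation("Y", frozenset({"node1", "node2"}), maint=True),
    ]
    print(node_state("node1", resvs))   # MAINT
    resvs = delete_reservation("Y", resvs)
    print(node_state("node1", resvs))   # RESERVED
    resvs = delete_reservation("X", resvs)
    print(node_state("node1", resvs))   # IDLE
```

The point of the sketch is that the state after a delete depends on the remaining reservations, not on the state before the deleted reservation was created.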

22 Dec, 2014 4 commits

Auth/munge - Correct AccountingStoragePass parsing · 2edef50d
Daniel Ahlin authored Dec 22, 2014
```
Correct parsing of AccountingStoragePass when specified in old format
(just a path name)
```
2edef50d
Documentation updates. · 270e24ce
Brian Christiansen authored Dec 22, 2014

270e24ce
Add SLURM_CPUS_PER_TASK to salloc,sbatch,srun man pages. · e5f20824
Brian Christiansen authored Dec 22, 2014
```
Bug 1331
```
e5f20824

avoid delay on commit for PMI task at rank 0 · fcc11e22

Rémi Palancher authored Dec 22, 2014

During MPI job initialisation through PMI, Intel MPI calls PMI_KVS_Put() a great
many times from the task at rank 0, and each of these calls is followed by
PMI_KVS_Commit(). The Slurm implementation of PMI_KVS_Commit() imposes a delay
to avoid a DDOS on the original srun. This delay is proportional to the total
number of tasks and can reach 3 seconds for large jobs, e.g. with 7168 tasks.
Therefore, when Intel MPI calls PMI_KVS_Commit() 475 times (measured on a test
case) from the task at rank 0, 28 minutes are spent in the delay function.
All other tasks in the job are waiting for a PMI_Barrier, so there is no risk
of a DDOS from this single task 0. The patch alters the delay time calculation
to make sure the task at rank 0 is not delayed. All other tasks are spread over
the same time range as before.

fcc11e22
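
A back-of-the-envelope check of the figures above: at up to 3 seconds per commit, 475 commits from rank 0 add up to roughly 24 minutes, the same order as the 28 minutes reported. The Python sketch below assumes a delay window that grows linearly with job size. The function names and the `seconds_per_task` constant are hypothetical and are not taken from the Slurm source.

```python
def worst_case_commit_delay(rank, size, seconds_per_task, exempt_rank0):
    """Upper bound on the delay one commit call can incur (assumed model)."""
    if exempt_rank0 and rank == 0:
        # Behaviour described by the patch: rank 0 is never delayed.
        return 0.0
    # Assumption: the delay window is proportional to the number of tasks.
    return size * seconds_per_task


def total_delay(commits, rank, size, seconds_per_task, exempt_rank0):
    """Cumulative worst-case delay over a number of commit calls."""
    per_commit = worst_case_commit_delay(rank, size, seconds_per_task, exempt_rank0)
    return commits * per_commit


if __name__ == "__main__":
    size = 7168
    seconds_per_task = 3.0 / size  # chosen so a 7168-task job waits about 3 s
    before = total_delay(475, 0, size, seconds_per_task, exempt_rank0=False)
    after = total_delay(475, 0, size, seconds_per_task, exempt_rank0=True)
    print(f"before: {before / 60:.1f} min, after: {after / 60:.1f} min")
```

Only the rank 0 exemption changes between the two calls. In this model the bound for every other rank is the same as before, which matches the statement that the other tasks keep their time range.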

20 Dec, 2014 3 commits
- Make it so previous versions of salloc/srun work with newer versions · d11ece80
  Danny Auble authored Dec 19, 2014
```
    of Slurm daemons.

The slurmstepd still needs to be fixed, which most likely can't be fixed
until 15.08.
```
  d11ece80
- Merge remote-tracking branch 'origin/slurm-14.03' into slurm-14.11 · 18c282bd
  Danny Auble authored Dec 19, 2014
  
  18c282bd
- Addition to 61db2e34 · 3858a8c8
  Danny Auble authored Dec 19, 2014
  
  3858a8c8
19 Dec, 2014 4 commits
- Make it so previous versions of salloc/srun work with newer versions · 61db2e34
  Danny Auble authored Dec 19, 2014
```
of Slurm daemons.
```
  61db2e34
- Fix for task/affinity if an admin configures a node for having threads · 731b6ded
  Danny Auble authored Dec 18, 2014
```
but then sets CPUs to only represent the number of cores on the node.
```
  731b6ded
- Updated documentation on the -c option for slurmctld · 4924e7e5
  Danny Auble authored Dec 18, 2014
  
  4924e7e5
- MySQL - Enhanced coordinator security checks. · 75af062c
  Danny Auble authored Dec 18, 2014
  
  75af062c
17 Dec, 2014 2 commits
- Fix ghost job when submitting job after all jobids are exhausted. · c8754578
  Brian Christiansen authored Dec 17, 2014
```
Bug 1327
```
  c8754578
- In srun honor ntasks_per_node before looking at cpu count when the user · 6a389a7a
  Danny Auble authored Dec 16, 2014
```
doesn't request a number of tasks.
```
  6a389a7a
16 Dec, 2014 4 commits
- Fix job hash table bug · f293ce7c
  Morris Jette authored Dec 16, 2014
```
Fix job array hash table bug, could result in slurmctld infinite loop or
invalid memory reference.
bug 1309
```
  f293ce7c
- Fix for test21.26. Before it would remove all the QOS from all clusters. · b100af68
  Nathan Yee authored Dec 16, 2014
  
  b100af68
- Update news file. · 9c6f34d8
  David Bigagli authored Dec 15, 2014
  
  9c6f34d8
- Revert "Commit 38068d21 expanded the reason for unavailable jobs but" · 7497fb94
  David Bigagli authored Dec 15, 2014
```
as it may cause a core dump in squeue.

This reverts commit 322c783c.
```
  7497fb94
12 Dec, 2014 9 commits
- Prevent vestigial job array record · 42d75a09
  Morris Jette authored Dec 12, 2014
```
If a master job array record is complete, then consider all pending
tasks as also complete. This problem happens when a master job array
record is pending (has pending tasks) and is cancelled. The result
previously was a job record not visible to squeue/scontrol, but occupying
memory.
The same type of problem happened with respect to a dependency on a job
array which was cancelled.
```
  42d75a09
- Cosmetic mod to test · 2e2ccca3
  Morris Jette authored Dec 12, 2014
  
  2e2ccca3
- Do not prevent dumping jobs ready for purge · d5a45670
  Morris Jette authored Dec 12, 2014
```
This change will better reveal any vestigial job records not being purged
```
  d5a45670
- Update news for next tag · e437094d
  Danny Auble authored Dec 12, 2014
  
  e437094d
- setup for new tag · a136a5b0
  Danny Auble authored Dec 12, 2014
  
  a136a5b0
- Merge remote-tracking branch 'origin/slurm-14.03' into slurm-14.11 · 3e6f6a32
  Danny Auble authored Dec 12, 2014
```
Conflicts:
	META
```
  3e6f6a32
- Update news for potential next tag · f84b4724
  Danny Auble authored Dec 12, 2014
  
  f84b4724
- new tag for 14.03.11 · ff2b2a02
  Danny Auble authored Dec 12, 2014
  
  ff2b2a02
- Update FastSchedule documentation · 0b5fe2fa
  Morris Jette authored Dec 12, 2014
```
Done in response to bug 1323
```
  0b5fe2fa
11 Dec, 2014 10 commits
- Update to test to work correctly when dealing with systems that have · 3e0f9642
  Danny Auble authored Dec 11, 2014
```
other config files like bluegene.conf and such.
```
  3e0f9642
- Merge remote-tracking branch 'origin/slurm-14.03' into slurm-14.11 · 885a8d0b
  Danny Auble authored Dec 11, 2014
  
  885a8d0b
- Sanity check for Correct QOS on startup. · 7f5b20c9
  Danny Auble authored Dec 11, 2014
```
If a QOS was added for a job and then removed, and it happened to have the
largest QOS ID, restarting the slurmctld before the job was flushed out
could mess things up.
```
  7f5b20c9
- Merge remote-tracking branch 'origin/slurm-14.03' into slurm-14.11 · b47b9c22
  Danny Auble authored Dec 11, 2014
```
Conflicts:
	src/plugins/task/cray/task_cray.c
```
  b47b9c22
- Add Cray standard debug message. · 9c99c4cd
  David Bigagli authored Dec 11, 2014
  
  9c99c4cd
- Update RELEASE_NOTES and documentation. · c9195388
  Brian Christiansen authored Dec 11, 2014
  
  c9195388
- After commit b1c5716d which added support for NO_VAL in · ad9917a5
  Nicolas Joly authored Dec 11, 2014
```
slurmdb_purge_string(), there is no reason to check for this specific
value anymore in read_config().
```
  ad9917a5
- Update news file · 78739a24
  David Bigagli authored Dec 11, 2014
  
  78739a24
- Exports eio_new_initial_obj to the plugins · a140dfd3
  Hongjia Cao authored Dec 11, 2014
```
and initialize kvs_seq on mpi/pmi2 setup to support launching.
```
  a140dfd3
- Commit 38068d21 expanded the reason for unavailable jobs but · 322c783c
  David Bigagli authored Dec 11, 2014
```
it also needs client side.
```
  322c783c