Commits · 096d534cc3147887ea351697ac13791ac847e5c6 · Manuel G. Marciani / ces_slurm_simulator

07 Jan, 2015 1 commit
- Increase maximum node list size (for incoming RPC) from 64k to num_tasks * MAXHOSTNAMELEN. · 096d534c
  Brian Christiansen authored Jan 07, 2015
```
Bug 1352
```
  096d534c
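To make the size change concrete, here is a minimal sketch of the arithmetic. It is not Slurm code: the function names are hypothetical, "64k" is read as 65536 bytes, and the value 64 for MAXHOSTNAMELEN is an assumption (a common Linux value; other platforms differ).

```python
OLD_LIMIT_BYTES = 64 * 1024   # "64k", assumed to mean 65536 bytes
ASSUMED_MAXHOSTNAMELEN = 64   # assumption: typical Linux value, platform dependent


def new_node_list_limit(num_tasks, max_hostname_len=ASSUMED_MAXHOSTNAMELEN):
    """Upper bound on the incoming node list size after the change."""
    return num_tasks * max_hostname_len


def fits_old_limit(num_tasks, max_hostname_len=ASSUMED_MAXHOSTNAMELEN):
    """True if a worst-case node list for num_tasks fits in the old fixed limit."""
    return new_node_list_limit(num_tasks, max_hostname_len) <= OLD_LIMIT_BYTES
```

Under these assumptions, a worst-case node list stops fitting in the old fixed limit once a job has more than 1024 tasks, whereas the new bound grows with the task count.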
06 Jan, 2015 5 commits

Morris Jette authored Jan 06, 2015

Added Makefile for contribs/sgi file.
Moved hypercube symbol definitions from select/linear to common.
Minor format changes for consistency with other Slurm code.
Moved a variable definition (l_distance) to start of code block
   to avoid error with some compilers.
Fix for possible uninitialized variable use (leftover_nodes).

526d9d62

job array dependency fix · 744f114b

Morris Jette authored Jan 05, 2015

Fix race condition that could start a job that is dependent upon a job array
before all tasks of that job array complete.
bug 1324

744f114b

Fix segfault in slurmstepd when job exceeded memory limit. · a4b05bad
Brian Christiansen authored Jan 05, 2015
```
Bug 1350
```
a4b05bad
BLUEGENE - Remove check that would erroneously remove the CONFIGURING · 77787999
Danny Auble authored Jan 05, 2015
```
flag from a job while the job is waiting for a block to boot.
```
77787999
BGQ - Put print statement under a DebugFlag. This was just an oversight. · 8681a0f5
Danny Auble authored Jan 05, 2015

8681a0f5

05 Jan, 2015 1 commit
- Correct the pbs parser. · e35c6c4b
  David Bigagli authored Jan 05, 2015
  
  e35c6c4b
02 Jan, 2015 3 commits
- Revert "Make srun reload configuration if conf is different after receiving allocation." · 534dfb67
  Brian Christiansen authored Jan 02, 2015
```
This reverts commit abc435fd.
```
  534dfb67
- Fix segfault with job arrays. · db98d624
  Brian Christiansen authored Jan 02, 2015
```
Bug 1346
```
  db98d624
- Fix cosmetic info statements when dealing with a job array task instead of · 70837b3f
  Danny Auble authored Jan 02, 2015
```
a normal job.
```
  70837b3f
01 Jan, 2015 1 commit
- Fix sacct when searching by nodelist. · 99440d95
  Brian Christiansen authored Dec 31, 2014
  
  99440d95
31 Dec, 2014 1 commit
- Remove trailing spaces from cpu-freq documents · f1bf381b
  Morris Jette authored Dec 31, 2014
  
  f1bf381b
30 Dec, 2014 3 commits
- Update openmpi documentation. · 10577d87
  David Bigagli authored Dec 30, 2014
  
  10577d87
- Make srun reload configuration if conf is different after receiving allocation. · abc435fd
  Brian Christiansen authored Dec 30, 2014
```
Bug 1333
```
  abc435fd
- Restore the SLURM_STEP_RESV_PORTS env variable. · 5170be55
  David Bigagli authored Dec 30, 2014
  
  5170be55
29 Dec, 2014 2 commits
- Make it so a newer version of a slurmstepd can talk to an older srun. · 2a5271c1
  Danny Auble authored Dec 29, 2014
  
  2a5271c1
- Fix documentation issues in slurm.conf. · 4fcc08e2
  David Bigagli authored Dec 29, 2014
  
  4fcc08e2
26 Dec, 2014 1 commit
- Fixes for clean build on FreeBSD. · a8f11909
  Jason Bacon authored Dec 26, 2014
  
  a8f11909
24 Dec, 2014 3 commits

Enable per-partition gang sched resolution · 5e02af31

Morris Jette authored Dec 24, 2014

Enable per-partition gang scheduling resource resolution (e.g. the partition
can have SelectTypeParameters=CR_CORE, while the global value is CR_SOCKET).
bug 1299

5e02af31
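The override described in this commit can be pictured as a simple fallback rule. The sketch below is a toy model under that reading, not the gang scheduler's implementation; the function name is hypothetical.

```python
from typing import Optional


def effective_select_params(global_value: str, partition_value: Optional[str]) -> str:
    """Return the partition's SelectTypeParameters if set, else the global value."""
    return partition_value if partition_value else global_value
```

With a global value of CR_SOCKET, a partition that sets CR_CORE resolves to CR_CORE, while a partition that sets nothing keeps CR_SOCKET.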

Enforce partition shared option · f8fb79d5

Morris Jette authored Dec 23, 2014

Properly enforce partition Shared=YES option. Previously oversubscribing
resources required gang scheduling to also be configured.

f8fb79d5

Fix bad job array task ID value · 46a2e9a1

Morris Jette authored Dec 23, 2014

Prevent invalid job array task ID value if a task is started using gang
scheduling (i.e. the task starts in a SUSPENDED state). The task ID gets
set to NO_VAL and the task string is also cleared.

46a2e9a1

23 Dec, 2014 3 commits

Fix bad job array task ID value · 48016f86

Morris Jette authored Dec 23, 2014

Prevent invalid job array task ID value if a task is started using gang
scheduling (i.e. the task starts in a SUSPENDED state). The task ID gets
set to NO_VAL and the task string is also cleared.

48016f86

Prevent gang resume of suspended job · 161d0336

Morris Jette authored Dec 23, 2014

Prevent a job manually suspended from being resumed by gang scheduler once
free resources are available.
bug 1335

161d0336

set node state RESERVED on maint reservation delete · cf846644

Dorian Krause authored Dec 22, 2014

We have hit the following problem, which seems to be present in Slurm
slurm-14-11-2-1 and previous versions. When a node is reserved and an
overlapping maint reservation is created and later deleted, the scontrol
output reports the node as IDLE rather than RESERVED:

+ scontrol show node node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ grep State
+ scontrol show nodes node1
   State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Note that after the deletion of reservation "Y" the node reports State=IDLE
instead of State=RESERVED. I think that the delete_resv() function in
slurmctld/reservation.c should call set_node_maint_mode(true) like
update_resv() does. With the patch pasted at the end of this e-mail I
get the following output, which matches my expectation:

+ scontrol show node node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=node1 ReservationName=X
Reservation created: X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol create reservation starttime=now duration=120 user=usr01000
nodes=ALL flags=maint,ignore_jobs ReservationName=Y
Reservation created: Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=MAINT ThreadsPerCore=1 TmpDisk=0 Weight=1
+ scontrol delete ReservationName=Y
+ sleep 10
+ scontrol show nodes node1
+ grep State
*   State=RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=1*
+ scontrol delete ReservationName=X
+ sleep 10
+ scontrol show nodes node1
+ grep State
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

Thanks,
Dorian

cf846644
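The expected behaviour in the second trace can be modelled by recomputing a node's state from the reservations that remain after a delete. The sketch below is a toy model only: the `Reservation` class and both functions are hypothetical, job allocations are ignored, and it does not reproduce slurmctld's data structures or the actual patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    name: str
    nodes: frozenset
    maint: bool = False


def node_state(node, reservations):
    """Derive the node state from the reservations that currently cover it."""
    covering = [r for r in reservations if node in r.nodes]
    if any(r.maint for r in covering):
        return "MAINT"
    if covering:
        return "RESERVED"
    return "IDLE"


def delete_reservation(reservations, name):
    """Return the reservations left after deleting the named one."""
    return [r for r in reservations if r.name != name]
```

Replaying the trace in this model: with reservation X on node1 and maint reservation Y on all nodes, node1 is MAINT; after deleting Y it falls back to RESERVED, and only after deleting X does it become IDLE.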

22 Dec, 2014 2 commits

Auth/munge - Correct AccountingStoragePass parsing · 2edef50d
Daniel Ahlin authored Dec 22, 2014
```
Correct parsing of AccountingStoragePass when specified in old format
(just a path name)
```
2edef50d

avoid delay on commit for PMI task at rank 0 · fcc11e22

Rémi Palancher authored Dec 22, 2014

During MPI job initialisation through PMI, Intel MPI calls PMI_KVS_Put()
many times from the task at rank 0, and each of these calls is followed by
PMI_KVS_Commit(). The Slurm implementation of PMI_KVS_Commit() imposes a delay
to avoid a DDOS on the original srun. This delay is proportional to the total
number of tasks and can be up to 3 seconds for large jobs, e.g. with 7168 tasks.
Therefore, when Intel MPI calls PMI_KVS_Commit() 475 times (measured on a test
case) from the task at rank 0, 28 minutes are spent in the delay function.
All other tasks in the job are waiting for a PMI_Barrier, so there is no risk
of a DDOS from this single task 0. The patch alters the delay calculation so
that the task at rank 0 is not delayed. All other tasks are spread over the
same time range as before.

fcc11e22
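The figures in this commit message can be checked with a short sketch. Everything here is illustrative rather than the patch itself: the function names are hypothetical, the 500 microsecond per-task constant is an assumption (it happens to reproduce the roughly 3 seconds and 28 minutes quoted above), and the rank-based slot is just one possible way to spread the other tasks over the window.

```python
ASSUMED_USEC_PER_TASK = 500  # assumption: per-task delay constant, in microseconds


def delay_window_usec(ntasks, usec_per_task=ASSUMED_USEC_PER_TASK):
    """Length of the window over which commits are spread, proportional to ntasks."""
    return ntasks * usec_per_task


def commit_delay_usec(rank, ntasks, usec_per_task=ASSUMED_USEC_PER_TASK):
    """Delay before a commit: none for rank 0, a rank-based slot for the others."""
    if rank == 0:
        return 0
    return rank * delay_window_usec(ntasks, usec_per_task) // ntasks


def worst_case_minutes(calls, ntasks, usec_per_task=ASSUMED_USEC_PER_TASK):
    """Total minutes spent if every call waits for the full window."""
    return calls * delay_window_usec(ntasks, usec_per_task) / 1e6 / 60
```

Under these assumptions the window for 7168 tasks is about 3.6 seconds, so 475 commits that each wait for the full window add up to roughly 28 minutes, while a rank 0 task that skips the delay spends none of that time waiting.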

20 Dec, 2014 3 commits
- Add sview burst buffer display. · 64d14d0e
  Nathan Yee authored Dec 19, 2014
  
  64d14d0e
- Make it so previous versions of salloc/srun work with newer versions · d11ece80
  Danny Auble authored Dec 19, 2014
```
of Slurm daemons.

The slurmstepd still needs to be fixed, which most likely can't be fixed
until 15.08.
```
  d11ece80
- Add the account, qos and reservation to srun. · 0252a63e
  David Bigagli authored Dec 19, 2014
  
  0252a63e
19 Dec, 2014 4 commits
- Make it so previous versions of salloc/srun work with newer versions · 61db2e34
  Danny Auble authored Dec 19, 2014
```
of Slurm daemons.
```
  61db2e34
- Add the environment variables SLURM_JOB_ACCOUNT, SLURM_JOB_QOS · 167d9ef2
  David Bigagli authored Dec 18, 2014
```
and SLURM_JOB_RESERVATION in the batch job.
```
  167d9ef2
- Fix for task/affinity if an admin configures a node for having threads · 731b6ded
  Danny Auble authored Dec 18, 2014
```
but then sets CPUs to only represent the number of cores on the node.
```
  731b6ded
- MySQL - Enhanced coordinator security checks. · 75af062c
  Danny Auble authored Dec 18, 2014
  
  75af062c
17 Dec, 2014 2 commits
- Fix ghost job when submitting job after all jobids are exhausted. · c8754578
  Brian Christiansen authored Dec 17, 2014
```
Bug 1327
```
  c8754578
- In srun honor ntasks_per_node before looking at cpu count when the user · 6a389a7a
  Danny Auble authored Dec 16, 2014
```
doesn't request a number of tasks.
```
  6a389a7a
16 Dec, 2014 2 commits
- Fix job hash table bug · f293ce7c
  Morris Jette authored Dec 16, 2014
```
Fix job array hash table bug, could result in slurmctld infinite loop or
invalid memory reference.
bug 1309
```
  f293ce7c
- Update news file. · 9c6f34d8
  David Bigagli authored Dec 15, 2014
  
  9c6f34d8
12 Dec, 2014 3 commits

Prevent vestigial job array record · 42d75a09

Morris Jette authored Dec 12, 2014

If a master job array record is complete, then consider all pending
tasks as also complete. This problem happens when a master job array
record is pending (has pending tasks) and is cancelled. The result
previously was a job record not visible to squeue/scontrol, but occupying
memory.
The same type of problem happened with respect to a dependency on a job
array which was cancelled.

42d75a09

update news for next tag · ed352bce
Danny Auble authored Dec 12, 2014

ed352bce
Update news for next tag · e437094d
Danny Auble authored Dec 12, 2014

e437094d