Commits · 3bbcc89cfb547ded64c8538dc585044c5367273a · Manuel G. Marciani / ces_slurm_simulator

17 Oct, 2014 16 commits
- Merge branch 'slurm-14.11' · 3bbcc89c
  Morris Jette authored Oct 17, 2014
```
Conflicts:
	META
	NEWS
```
  3bbcc89c
- create v14.11.0-rc2 tag · b33b0b90
  Morris Jette authored Oct 17, 2014
  
  b33b0b90
- Merge branch 'slurm-14.03' into slurm-14.11 · 4bf02c18
  Morris Jette authored Oct 17, 2014
```
Conflicts:
	META
	src/slurmd/slurmd/req.c
```
  4bf02c18
- Update META for v14.03.9 tag · b074bc7f
  Morris Jette authored Oct 17, 2014
  
  b074bc7f
- Fix build failure. · e4c70ad2
  David Bigagli authored Oct 17, 2014
  
  e4c70ad2
- If failed to launch a batch job requeue it in hold. · 2bc9bc29
  David Bigagli authored Oct 17, 2014
  
  2bc9bc29
- If failed to launch a batch job requeue it in hold. · 1797273a
  David Bigagli authored Oct 17, 2014
  
  1797273a
- Correct license count for suspended jobs · 77a0bb65
  Morris Jette authored Oct 17, 2014
```
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or
restart. Previously licenses for suspended jobs were not counted, so
the license count could be exceeded when those jobs were resumed.
```
  77a0bb65
- Merge remote-tracking branch 'origin/slurm-14.11' · 6fc9954f
  Danny Auble authored Oct 17, 2014
  
  6fc9954f
- Merge remote-tracking branch 'origin/slurm-14.03' into slurm-14.11 · e74d480a
  Danny Auble authored Oct 17, 2014
  
  e74d480a
- ALPS - Only set -M if node_min_mem is specified. In conjunction with · 348a30b3
  Danny Auble authored Oct 17, 2014
```
commit a9dc50d4.
```
  348a30b3
- Add sbcast support for job steps · 2c604442
  Morris Jette authored Oct 17, 2014
```
Add sbcast support for file transfer to resources allocated to a job step
rather than a job allocation.
```
  2c604442
- Fix typos in slurmdbd messages. · 89fe9fa1
  Nicolas Joly authored Oct 17, 2014
  
  89fe9fa1
- Merge remote-tracking branch 'origin/slurm-14.11' · be5d4ee4
  Danny Auble authored Oct 16, 2014
  
  be5d4ee4
- Fix various backwards compatibility issues. · 8eba37bd
  Danny Auble authored Oct 16, 2014
  
  8eba37bd
- Merge branch 'slurm-14.11' · 5657eda2
  Morris Jette authored Oct 16, 2014
  
  5657eda2
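
The license-count fix in 77a0bb65 above can be illustrated with a small sketch. This is not Slurm's code: the `Job` class, the state strings, and `licenses_in_use` are hypothetical names chosen for illustration. It only shows why leaving suspended jobs out of the tally overstates the number of free licenses.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    state: str      # "RUNNING", "SUSPENDED" or "PENDING" (hypothetical labels)
    licenses: int   # licenses requested or held by this job


def licenses_in_use(jobs, count_suspended=True):
    """Sum the licenses held by jobs that still own their allocation."""
    held_states = {"RUNNING"}
    if count_suspended:
        # Suspended jobs keep their licenses and need them again on resume.
        held_states.add("SUSPENDED")
    return sum(j.licenses for j in jobs if j.state in held_states)


TOTAL_LICENSES = 5
jobs = [Job(1, "RUNNING", 2), Job(2, "SUSPENDED", 3), Job(3, "PENDING", 4)]

# Tally rebuilt without suspended jobs (the behavior the commit describes as wrong).
old_free = TOTAL_LICENSES - licenses_in_use(jobs, count_suspended=False)
# Tally rebuilt with suspended jobs included.
new_free = TOTAL_LICENSES - licenses_in_use(jobs)
```

With the old tally, three licenses appear free even though job 2 will need its three back on resume, so a newly started job could push usage past the total.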
16 Oct, 2014 16 commits

Remove functions needed to translate the now defunct SLURMDBD_*_VERSION to · b8cbad26
Danny Auble authored Oct 16, 2014
```
be the SLURM_*_PROTOCOL_VERSION.
```
b8cbad26
sched/backfill: don't clear running job start_time · e6290537
Morris Jette authored Oct 16, 2014
```
bug 1178
```
e6290537
Minor code refactoring · 3bccf24f
Morris Jette authored Oct 16, 2014
```
No change in functionality, just eliminated a call to time().
```
3bccf24f
Create new API/Protocol version (remove 2.6) · 8374c793
Danny Auble authored Oct 16, 2014

8374c793

Morris Jette authored Oct 16, 2014

Ensure that a file needed for the test exists. This problem happens
on emulated Cray systems.

10288dfd

Start NEWS for v15.08 branch (master) · cb213564
Morris Jette authored Oct 16, 2014

cb213564
Merge remote-tracking branch 'origin/slurm-14.03' · dcbee39c
Danny Auble authored Oct 16, 2014
```
Conflicts:
	src/plugins/jobcomp/mysql/jobcomp_mysql.c
```
dcbee39c
Fix small memory leak in jobcomp/mysql. · e1c42895
Brian Christiansen authored Oct 16, 2014

e1c42895
Merge remote-tracking branch 'origin/slurm-14.03' · 09bf9b94
Danny Auble authored Oct 16, 2014

09bf9b94
ALPS - Remove sanity code to work like it did in 2.5. This is an addition · 2c95e2d2
Danny Auble authored Oct 16, 2014
```
to commit a9dc50d4.
```
2c95e2d2
Merge remote-tracking branch 'origin/slurm-14.03' · 2792bc2e
Danny Auble authored Oct 16, 2014
```
Conflicts:
	src/plugins/task/cray/task_cray.c
```
2792bc2e
Remove vestigial variable · 463df8fd
Morris Jette authored Oct 16, 2014

463df8fd

Merge branch 'slurm-14.03' · 23a1fd67

Morris Jette authored Oct 16, 2014

Conflicts:
	contribs/cray/slurm.conf.template
	src/plugins/task/cray/task_cray.c
	src/slurmctld/job_mgr.c

23a1fd67

Cray PMI refinements · eeb97050

Morris Jette authored Oct 16, 2014

Refine commit 5f89223f based upon
feedback from David Gloe:
* It's not only MPI jobs, but anything that uses PMI. That includes MPI,
shmem, etc, so you may want to reword the error message.
* I added the terminated flag because if multiple tasks on a node exit,
you would get an error message from each of them. That reduces it to one
error message per node. Cray bug 810310 prompted that change.
* Since we're now relying on --kill-on-bad-exit, I think we should update
the Cray slurm.conf template to default to 1 (set KillOnBadExit=1 in
contribs/cray/slurm.conf.template).
bug 1171

eeb97050
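
The "terminated flag" idea from the feedback above (one error message per node rather than one per exiting task) can be sketched as follows. This is a minimal illustration, not the plugin's implementation: `NodeStepState`, its method name, and the message wording are all hypothetical.

```python
class NodeStepState:
    """Per-node state for one job step (hypothetical structure)."""

    def __init__(self, node_name):
        self.node_name = node_name
        self.terminated = False   # set once the first bad exit is reported
        self.messages = []        # stands in for the error log

    def task_exited_without_finalize(self, task_id):
        """Report a task that exited without finalizing PMI.

        Returns True if an error was logged, False if it was suppressed
        because this node already reported one.
        """
        if self.terminated:
            return False
        self.terminated = True
        self.messages.append(
            f"{self.node_name}: task {task_id} exited without finalizing PMI"
        )
        return True


state = NodeStepState("tux2")
results = [state.task_exited_without_finalize(t) for t in (1, 2, 3)]
```

Only the first exiting task on the node produces a message; later ones see the flag already set and stay silent.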

Change Cray mpi_fini failure logic · 5f89223f

Morris Jette authored Oct 16, 2014

Treat a Cray MPI job calling exit() without mpi_fini() as a fatal error
for that specific task and let srun handle all timeout logic.
The previous logic canceled the entire job step and ignored the srun
options for wait time and kill on exit. The new logic provides
users with the following type of response:

$ srun -n3 -K0 -N3 --wait=60 ./tmp
Task:0 Cycle:1
Task:2 Cycle:1
Task:1 Cycle:1
Task:0 Cycle:2
Task:2 Cycle:2
slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
srun: error: tux2: task 1: Killed
Task:0 Cycle:3
Task:2 Cycle:3
Task:0 Cycle:4
...

bug 1171

5f89223f

Expand explanation of job end_time · 9022d62f

jette authored Oct 15, 2014

Specifically, if a job is terminated from the suspended state, end_time
will be the time at which the job's suspension began.

9022d62f
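
The end_time rule described in 9022d62f can be written as a one-branch function. This is a sketch under assumed names (`job_end_time`, the `"SUSPENDED"` label, and the timestamp parameters are hypothetical), not the documented API.

```python
def job_end_time(state_at_termination, suspend_began, terminated_at):
    """Return the end_time to record for a terminated job.

    A job killed while suspended did no work after it was suspended,
    so its end_time is the moment the suspension started.
    """
    if state_at_termination == "SUSPENDED":
        return suspend_began
    return terminated_at


# Timestamps are plain epoch seconds chosen for illustration.
ended_while_suspended = job_end_time("SUSPENDED", 1000, 1600)
ended_while_running = job_end_time("RUNNING", None, 1600)
```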

15 Oct, 2014 8 commits

Add PrivateData value of "cloud" · dca13a92
Morris Jette authored Oct 15, 2014
```
If set, powered down nodes in the cloud will be visible.
```
dca13a92

Avoid duplicate PowerUp of node on slurmctld start · bb1f77e0

Morris Jette authored Oct 15, 2014

This fixes a race condition if the slurmctld needed to power up
a node shortly after startup. Previously it would execute the
ResumeProgram twice for affected nodes.

bb1f77e0
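
The commit message does not say how the race was closed. One common way to keep a resume script from running twice for the same node is to track which nodes already have a power-up in flight. The sketch below shows that technique with hypothetical names (`PowerManager`, `request_power_up`, `node_registered`); it is not taken from slurmctld.

```python
class PowerManager:
    """Tracks in-flight power-up requests so each node is resumed once."""

    def __init__(self, resume_program):
        self._resume_program = resume_program   # callable taking a node name
        self._powering_up = set()

    def request_power_up(self, node):
        """Run the resume program unless a power-up is already pending."""
        if node in self._powering_up:
            return False
        self._powering_up.add(node)
        self._resume_program(node)
        return True

    def node_registered(self, node):
        """Called once the node is up; a later power-up may run again."""
        self._powering_up.discard(node)


calls = []
manager = PowerManager(calls.append)
first = manager.request_power_up("cloud1")    # startup path
second = manager.request_power_up("cloud1")   # scheduler path, moments later
```

The second request finds the node already marked and returns without invoking the resume program again.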

if DOWN node set to PowerDown, clear NoResp flag · 9aa2a727

Morris Jette authored Oct 15, 2014

Without this change, a node in the cloud that failed to power up
would not have its NoResponding flag cleared, which would prevent
its later use. The NoResponding flag is now cleared when the node
is manually modified to PowerDown.

9aa2a727
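
The state change described above amounts to setting one flag and clearing another in the same update. The sketch below uses Python's `enum.IntFlag`; the flag names and bit values are hypothetical and do not correspond to Slurm's node state constants.

```python
from enum import IntFlag


class NodeFlag(IntFlag):
    DOWN = 1
    NO_RESPOND = 2    # node failed to respond (for example, never powered up)
    POWER_SAVE = 4    # node is powered down


def set_power_down(flags):
    """Mark a node as powered down and drop any stale NO_RESPOND flag."""
    return (flags | NodeFlag.POWER_SAVE) & ~NodeFlag.NO_RESPOND


before = NodeFlag.DOWN | NodeFlag.NO_RESPOND
after = set_power_down(before)
```

After the update the node is still DOWN and is marked as powered down, but no longer carries the not-responding flag that would block its later use.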

Permit more batch requeues for cloud bursting · 3e571581

Morris Jette authored Oct 15, 2014

If a batch job launch to the cloud fails, permit an unlimited
number of job requeues. Previously the job would abort on the
second launch failure.

3e571581
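
The requeue policy change can be summarized as a small decision function. This is a sketch only: `MAX_BATCH_REQUEUE`, `on_batch_launch_failure`, and the returned strings are hypothetical, with the limit chosen to match the "abort on the second launch failure" behavior the message describes.

```python
MAX_BATCH_REQUEUE = 1  # hypothetical limit: one requeue, then abort


def on_batch_launch_failure(requeue_count, is_cloud_node):
    """Decide what to do after a batch job fails to launch.

    requeue_count is the number of times the job was already requeued.
    Cloud nodes are exempt from the limit, as in the commit above.
    """
    if is_cloud_node or requeue_count < MAX_BATCH_REQUEUE:
        return "REQUEUE"
    return "ABORT"


# Second failure (one requeue already used) on a regular node versus a cloud node.
regular = on_batch_launch_failure(1, is_cloud_node=False)
cloud = on_batch_launch_failure(1, is_cloud_node=True)
```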

Avoid duplicate PowerUp of node on slurmctld start · d99cf552

Morris Jette authored Oct 15, 2014

This fixes a race condition if the slurmctld needed to power up
a node shortly after startup. Previously it would execute the
ResumeProgram twice for affected nodes.

d99cf552

if DOWN node set to PowerDown, clear NoResp flag · 13023913

Morris Jette authored Oct 15, 2014

Without this change, a node in the cloud that failed to power up
would not have its NoResponding flag cleared, which would prevent
its later use. The NoResponding flag is now cleared when the node
is manually modified to PowerDown.

13023913

Permit more batch requeues for cloud bursting · 22a01dc7

Morris Jette authored Oct 15, 2014

If a batch job launch to the cloud fails, permit an unlimited
number of job requeues. Previously the job would abort on the
second launch failure.

22a01dc7

Further fixing done to get archive load to work with 14.11. Continuation · 67d901e5
Nicolas Joly authored Oct 15, 2014
```
to commit fa73701e.
```
67d901e5