Commits · cd96dcc53fa6d7637e2a499485484b5b1a8a3846 · Manuel G. Marciani / ces_slurm_simulator

04 Nov, 2014 3 commits
- Update news for potential next 14.03 version · cd96dcc5
  Danny Auble authored Nov 03, 2014
  
  cd96dcc5
- Update META for new tag · f90b17c0
  Danny Auble authored Nov 03, 2014
  
  f90b17c0
- BLUEGENE - Fix issue where requeuing jobs could cause an assert. · fed60eb2
  Danny Auble authored Nov 03, 2014
```
This was an unrealized regression from commit 0da01963.  The problem
is we were clearing the job_ptr->job_resrcs too early.  This patch fixes
it to wait until the job is actually being requeued so it does the right
thing.
```
  fed60eb2
31 Oct, 2014 5 commits
- Better check when related to commit e5635a76, and it was deemed ok to · d30417a3
  Danny Auble authored Oct 31, 2014
```
pack it this way so we will not change it in 15.08
```
  d30417a3
- Give better estimates on pending node count if no node count is requested. · e5635a76
  Danny Auble authored Oct 31, 2014
  
  e5635a76
- Fix potential buffer overflow. · 219826b0
  Danny Auble authored Oct 31, 2014
```
This isn't that big of an issue for 14.03, but 14.11 added more to this
string which could overflow the buffer since sprintf is used instead of
snprintf.  Using xstrfmtcat fixes the issue and makes the code easier to read.
```
  219826b0
- ALPS - Don't set the env var APRUN_DEFAULT_MEMORY, it is not needed anymore. · d62896db
  Danny Auble authored Oct 30, 2014
  
  d62896db
- ALPS - If an allocation requests -n set the BASIL -N option to the · 2e2de6a4
  Danny Auble authored Oct 30, 2014
```
number of tasks / number of nodes.
```
  2e2de6a4
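The overflow described in commit 219826b0 comes from formatting into a fixed-size buffer with `sprintf`, which has no length limit. The commit itself switches to Slurm's `xstrfmtcat`; the sketch below does not use that helper and instead shows the standard-library alternative the message mentions, `snprintf`, with an explicit truncation check. The function name `format_pair` and its arguments are hypothetical and exist only for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not Slurm code): write "<name>=<value>" into buf.
 * snprintf never writes more than buf_size bytes, including the
 * terminating NUL, and returns the length the full string would have had.
 * Returns 0 on success, -1 if the output was truncated or an error occurred.
 */
int format_pair(char *buf, size_t buf_size, const char *name, const char *value)
{
    int needed = snprintf(buf, buf_size, "%s=%s", name, value);

    if (needed < 0 || (size_t)needed >= buf_size)
        return -1;   /* would not fit: caller can grow the buffer or fail */
    return 0;
}
```

With `sprintf` the same call would write past the end of a too-small buffer; here the caller gets an error and a NUL-terminated, truncated string.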
30 Oct, 2014 2 commits
- Check the status of the database connection before using it. · 18fb57b7
  David Bigagli authored Oct 30, 2014
  
  18fb57b7
- Check the status of the db connection first before trying · 1d485262
  David Bigagli authored Oct 30, 2014
```
to dereference it.
```
  1d485262
27 Oct, 2014 2 commits

Fix issue with squeue/scontrol showing correct node_cnt when only tasks · af19cb55

Danny Auble authored Oct 27, 2014

are specified.

This is a fix to commit b9cc5b31 which just didn't know
mc_ptr->ntasks_per_core is initialized to INFINITE.  Without it the
node_cnt packed would be set to 1 on the user tools.

This fixes bug 1148.

af19cb55
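
The fix in af19cb55 hinges on a field whose "unset" value is a maximum-value sentinel rather than zero, so a plain non-zero test treats it as a real limit. The following is a minimal sketch of that general pattern only; it does not reproduce Slurm's node count calculation. `SKETCH_UNSET` and `estimate_node_cnt` are hypothetical names, and using a per-node task limit here is an assumption made to keep the example small.

```c
#include <stdint.h>

/* Assumption: stand-in for a "no limit" sentinel stored in a 16-bit field. */
#define SKETCH_UNSET UINT16_MAX

/*
 * Hypothetical helper (not Slurm code): estimate how many nodes are needed
 * for num_tasks when at most tasks_per_node tasks fit on one node.
 * Returns 0 when the limit is unset, so the caller can fall back to
 * another estimate instead of dividing by the sentinel.
 */
uint32_t estimate_node_cnt(uint32_t num_tasks, uint16_t tasks_per_node)
{
    if (tasks_per_node == 0 || tasks_per_node == SKETCH_UNSET)
        return 0;

    /* Ceiling division without risking overflow in the addition. */
    return num_tasks / tasks_per_node + (num_tasks % tasks_per_node != 0);
}
```

If the sentinel check is dropped, any realistic task count divided by the sentinel rounds up to a single node, which is the kind of misleading value the commit message describes.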

Improve slurmctld error message · 8889108a
Morris Jette authored Oct 27, 2014
```
bug 1207
```
8889108a

24 Oct, 2014 3 commits
- Fix test if "partition" not defined in globals · 790310cb
  Morris Jette authored Oct 24, 2014
```
Use default_partition by default for the test
```
  790310cb
- Fix variable name typo in test · 0c1798ba
  Morris Jette authored Oct 24, 2014
```
The test reported a pass when it should have failed because of a bad
variable name.
```
  0c1798ba
- Prevent negative job array index · b02facfc
  David Singleton authored Oct 23, 2014
```
We've seen slurmctld crashes due to negative job array indices.
```
  b02facfc
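Commit b02facfc guards against negative job array indices reaching slurmctld. As a rough illustration of that kind of input check, and not the actual patch, the sketch below parses an index from text and rejects anything negative, non-numeric, or above a caller-supplied maximum. The function `parse_array_index` and the `max_index` parameter are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not Slurm code): parse a job array index.
 * Accepts only a complete base-10 number in the range [0, max_index].
 * Returns 0 and stores the value in *index_out on success, -1 otherwise.
 */
int parse_array_index(const char *text, long max_index, long *index_out)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;   /* out of range for long, or trailing garbage */

    if (value < 0 || value > max_index)
        return -1;   /* negative indices are rejected up front */

    *index_out = value;
    return 0;
}
```

Rejecting the value at parse time keeps a negative number from ever being used as an array subscript or bitmap offset later on.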
23 Oct, 2014 4 commits
- BGQ - Refine logic to handle down cnodes · 447a498f
  Morris Jette authored Oct 23, 2014
```
The previous patch should work in most cases, but this should
work more reliably and the comment is clearer.
bug 1196
```
  447a498f
- Revert "Commit image files for the nonstop.html document." · 0dcc431d
  David Bigagli authored Oct 22, 2014
```
This reverts commit 7e65f924.
```
  0dcc431d
- Commit image files for the nonstop.html document. · 7e65f924
  David Bigagli authored Oct 22, 2014
  
  7e65f924
- BGQ fix race condition causing slurmctld abort · 69184880
  Morris Jette authored Oct 22, 2014
```
BGQ: Fix race condition when job fails due to hardware failure and is
requeued. Previous code could result in slurmctld abort with NULL pointer.
bug 1096
```
  69184880
22 Oct, 2014 1 commit
- slurm.conf.5 man page spelling error · eef344d2
  Gennaro Oliva authored Oct 21, 2014
  
  eef344d2
21 Oct, 2014 1 commit

Fix job gres info clear on slurmctld restart · 1209a664

Morris Jette authored Oct 21, 2014

Fix bug that prevented preservation of a job's GRES bitmap on slurmctld
restart or reconfigure (bug was introduced in 14.03.5 "Clear record of a
job's gres when requeued" and only applies when GRES mapped to specific
files).
bug 1192

1209a664

20 Oct, 2014 4 commits
- Fix minor memory leak in jobcomp/mysql on slurmctld reconfig. · d151a215
  Danny Auble authored Oct 20, 2014
  
  d151a215
- When using gres and cgroup ConstrainDevices set correct access · bcebd453
  David Bigagli authored Oct 20, 2014
```
permission for the batch step.
```
  bcebd453
- Fix wrong documentation. · faf059c9
  David Bigagli authored Oct 20, 2014
  
  faf059c9
- Require SlurmSchedLogFile if SlurmSchedLogLevel set · 4fb81073
  jette authored Oct 19, 2014
```
Otherwise there will be no log file to write to, resulting in an
abort
bug 1185
```
  4fb81073
18 Oct, 2014 1 commit
- Fix a few sacctmgr error messages. · c1a43890
  Nicolas Joly authored Oct 17, 2014
  
  c1a43890
17 Oct, 2014 6 commits
- Remove unused variable · d6d0b9f1
  Morris Jette authored Oct 17, 2014
  
  d6d0b9f1
- Update META for v14.03.9 tag · b074bc7f
  Morris Jette authored Oct 17, 2014
  
  b074bc7f
- Fix build failure. · e4c70ad2
  David Bigagli authored Oct 17, 2014
  
  e4c70ad2
- If failed to launch a batch job requeue it in hold. · 2bc9bc29
  David Bigagli authored Oct 17, 2014
  
  2bc9bc29
- Correct license count for suspended jobs · 77a0bb65
  Morris Jette authored Oct 17, 2014
```
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or
restart. Previously licenses for suspended jobs were not counted, so
the license count could be exceeded when those jobs were resumed.
```
  77a0bb65
- ALPS - Only set -M if node_min_mem is specified. In conjunction with · 348a30b3
  Danny Auble authored Oct 17, 2014
```
commit a9dc50d4.
```
  348a30b3
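The license fix in commit 77a0bb65 amounts to counting suspended jobs as license holders when usage is rebuilt after a restart or reconfigure. The sketch below illustrates that accounting rule with simplified, hypothetical types (`sketch_job`, `sketch_job_state`, `licenses_in_use`); it is not the slurmctld data model.

```c
#include <stddef.h>

/* Hypothetical, simplified job states for illustration only. */
enum sketch_job_state {
    SKETCH_PENDING,
    SKETCH_RUNNING,
    SKETCH_SUSPENDED,
    SKETCH_COMPLETE
};

struct sketch_job {
    enum sketch_job_state state;
    unsigned licenses;   /* licenses held by this job */
};

/*
 * Recompute the number of licenses in use from the job table.
 * Suspended jobs still hold their licenses, so they are counted
 * together with running jobs; pending and completed jobs are not.
 */
unsigned licenses_in_use(const struct sketch_job *jobs, size_t job_count)
{
    unsigned used = 0;

    for (size_t i = 0; i < job_count; i++) {
        if (jobs[i].state == SKETCH_RUNNING ||
            jobs[i].state == SKETCH_SUSPENDED)
            used += jobs[i].licenses;
    }
    return used;
}
```

Counting only running jobs would under-report usage, and new jobs could then be granted licenses that suspended jobs need back on resume.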
16 Oct, 2014 5 commits

Fix small memory leak in jobcomp/mysql. · e1c42895
Brian Christiansen authored Oct 16, 2014

e1c42895
ALPS - Remove sanity code to work like it did in 2.5. This is an addition · 2c95e2d2
Danny Auble authored Oct 16, 2014
```
to commit a9dc50d4.
```
2c95e2d2
Remove vestigial variable · 463df8fd
Morris Jette authored Oct 16, 2014

463df8fd

Cray PMI refinements · eeb97050

Morris Jette authored Oct 16, 2014

Refine commit 5f89223f based upon
feedback from David Gloe:
* It's not only MPI jobs, but anything that uses PMI. That includes MPI,
shmem, etc, so you may want to reword the error message.
* I added the terminated flag because if multiple tasks on a node exit,
you would get an error message from each of them. That reduces it to one
error message per node. Cray bug 810310 prompted that change.
* Since we're now relying on --kill-on-bad-exit, I think we should update
the Cray slurm.conf template to default to 1 (set KillOnBadExit=1 in
contribs/cray/slurm.conf.template).
bug 1171

eeb97050

Change Cray mpi_fini failure logic · 5f89223f

Morris Jette authored Oct 16, 2014

Treat Cray MPI job calling exit() without mpi_fini() as fatal error for
that specific task and let srun handle all timeout logic.
Previous logic would cancel the entire job step and srun options
for wait time and kill on exit were ignored. The new logic provides
users with the following type of response:

$ srun -n3 -K0 -N3 --wait=60 ./tmp
Task:0 Cycle:1
Task:2 Cycle:1
Task:1 Cycle:1
Task:0 Cycle:2
Task:2 Cycle:2
slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
srun: error: tux2: task 1: Killed
Task:0 Cycle:3
Task:2 Cycle:3
Task:0 Cycle:4
...

bug 1171

5f89223f

15 Oct, 2014 3 commits

Avoid duplicate PowerUp of node on slurmctld start · d99cf552

Morris Jette authored Oct 15, 2014

This fixes a race condition if the slurmctld needed to power up
a node shortly after startup. Previously it would execute the
ResumeProgram twice for affected nodes.

d99cf552

if DOWN node set to PowerDown, clear NoResp flag · 13023913

Morris Jette authored Oct 15, 2014

Without this change, a node in the cloud that failed to power up
would not have its NoResponding flag cleared, which would prevent
its later use. The NoResponding flag is now cleared when the node
is manually modified to PowerDown.

13023913

Permit more batch requeues for cloud bursting · 22a01dc7

Morris Jette authored Oct 15, 2014

If a batch job launch to the cloud fails, permit an unlimited
number of job requeues. Previously the job would abort on the
second launch failure.

22a01dc7