Commits · 8889108a6ab0c2afa550ddb2b66a3dd5a75c3d33 · Manuel G. Marciani / ces_slurm_simulator

27 Oct, 2014 - 1 commit
- Improve slurmctld error message · 8889108a
  Morris Jette authored Oct 27, 2014
```
bug 1207
```
  8889108a
24 Oct, 2014 - 3 commits
- Fix test if "partition" not defined in globals · 790310cb
  Morris Jette authored Oct 24, 2014
```
Use default_partition by default for the test
```
  790310cb
- Fix variable name typo in test · 0c1798ba
  Morris Jette authored Oct 24, 2014
```
Returned test passed, when it should have failed because of bad
variable name.
```
  0c1798ba
- Prevent negative job array index · b02facfc
  David Singleton authored Oct 23, 2014
```
We've seen slurmctld crashes due to negative job array indices.
```
  b02facfc
23 Oct, 2014 - 4 commits
- BGQ - Refine logic to handle down cnodes · 447a498f
  Morris Jette authored Oct 23, 2014
```
The previous patch should work in most cases, but this should
work more reliably and the comment is more clear
bug 1196
```
  447a498f
- Revert "Commit image files for the nonstop.html document." · 0dcc431d
  David Bigagli authored Oct 22, 2014
```
This reverts commit 7e65f924.
```
  0dcc431d
- Commit image files for the nonstop.html document. · 7e65f924
  David Bigagli authored Oct 22, 2014
  
  7e65f924
- BGQ fix race condition causing slurmctld abort · 69184880
  Morris Jette authored Oct 22, 2014
```
BGQ: Fix race condition when job fails due to hardware failure and is
requeued. Previous code could result in slurmctld abort with NULL pointer.
bug 1096
```
  69184880
22 Oct, 2014 - 1 commit
- slurm.conf.5 man page spelling error · eef344d2
  Gennaro Oliva authored Oct 21, 2014
  
  eef344d2
21 Oct, 2014 - 1 commit

Fix job gres info clear on slurmctld restart · 1209a664

Morris Jette authored Oct 21, 2014

Fix bug that prevented preservation of a job's GRES bitmap on slurmctld
restart or reconfigure (bug was introduced in 14.03.5 "Clear record of a
job's gres when requeued" and only applies when GRES mapped to specific
files).
bug 1192

1209a664

20 Oct, 2014 - 4 commits
- Fix minor memory leak in jobcomp/mysql on slurmctld reconfig. · d151a215
  Danny Auble authored Oct 20, 2014
  
  d151a215
- When using gres and cgroup ConstrainDevices set correct access · bcebd453
  David Bigagli authored Oct 20, 2014
```
permission for the batch step.
```
  bcebd453
- Fix wrong documentation. · faf059c9
  David Bigagli authored Oct 20, 2014
  
  faf059c9
- Require SlurmSchedLogFile if SlurmSchedLogLevel set · 4fb81073
  jette authored Oct 19, 2014
```
Otherwise there will be no log file to write to, resulting in an
abort
bug 1185
```
  4fb81073
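
The rule enforced by commit 4fb81073 can be pictured as a configuration validation step. The following is a minimal Python sketch, assuming the options have already been parsed into a plain dict; `validate_sched_log` and the dict layout are hypothetical and are not Slurm's actual (C) implementation.

```python
def validate_sched_log(conf):
    """Return an error string if the scheduler log settings are inconsistent.

    `conf` is a hypothetical dict of parsed slurm.conf options.
    Returns None when the settings are acceptable.
    """
    level = int(conf.get("SlurmSchedLogLevel", 0))
    log_file = conf.get("SlurmSchedLogFile")
    # A non-zero log level with no file leaves nothing to write to,
    # so reject the configuration up front instead of failing later.
    if level > 0 and not log_file:
        return "SlurmSchedLogLevel requires SlurmSchedLogFile"
    return None
```

Rejecting the configuration at load time turns a later abort into an early, readable error.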
18 Oct, 2014 - 1 commit
- Fix a few sacctmgr error messages. · c1a43890
  Nicolas Joly authored Oct 17, 2014
  
  c1a43890
17 Oct, 2014 - 6 commits
- Remove unused variable · d6d0b9f1
  Morris Jette authored Oct 17, 2014
  
  d6d0b9f1
- Update META for v14.03.9 tag · b074bc7f
  Morris Jette authored Oct 17, 2014
  
  b074bc7f
- Fix build failure. · e4c70ad2
  David Bigagli authored Oct 17, 2014
  
  e4c70ad2
- If failed to launch a batch job requeue it in hold. · 2bc9bc29
  David Bigagli authored Oct 17, 2014
  
  2bc9bc29
- Correct license count for suspended jobs · 77a0bb65
  Morris Jette authored Oct 17, 2014
```
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or
restart. Previously licenses for suspended jobs were not counted, so
the license count could be exceeded when those jobs get resumed.
```
  77a0bb65
- ALPS - Only set -M if node_min_mem is specified. In conjunction with · 348a30b3
  Danny Auble authored Oct 17, 2014
```
commit a9dc50d4.
```
  348a30b3
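
The license fix in commit 77a0bb65 above amounts to counting suspended jobs when the in-use totals are rebuilt after a restart or reconfigure. A minimal Python sketch of that idea follows; the job records, state names, and `licenses_in_use` are assumptions made for illustration, not Slurm's data structures.

```python
from collections import Counter

# States whose licenses stay reserved (names chosen for illustration).
HOLDS_LICENSES = {"RUNNING", "SUSPENDED"}

def licenses_in_use(jobs):
    """Rebuild per-license usage from a list of hypothetical job records."""
    used = Counter()
    for job in jobs:
        # Suspended jobs still hold their licenses, so they must be counted;
        # skipping them would let the total be exceeded once they resume.
        if job["state"] in HOLDS_LICENSES:
            used.update(job["licenses"])
    return used
```

With one running job holding 2 licenses and one suspended job holding 1, the rebuilt count is 3 rather than 2.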
16 Oct, 2014 - 5 commits

Fix small memory leak in jobcomp/mysql. · e1c42895
Brian Christiansen authored Oct 16, 2014

e1c42895
ALPS - Remove sanity code to work like it did in 2.5. This is an addition · 2c95e2d2
Danny Auble authored Oct 16, 2014
```
to commit a9dc50d4.
```
2c95e2d2
Remove vestigial variable · 463df8fd
Morris Jette authored Oct 16, 2014

463df8fd

Cray PMI refinements · eeb97050

Morris Jette authored Oct 16, 2014

Refine commit 5f89223f based upon
feedback from David Gloe:
* It's not only MPI jobs, but anything that uses PMI. That includes MPI,
shmem, etc, so you may want to reword the error message.
* I added the terminated flag because if multiple tasks on a node exit,
you would get an error message from each of them. That reduces it to one
error message per node. Cray bug 810310 prompted that change.
* Since we're now relying on --kill-on-bad-exit, I think we should update
the Cray slurm.conf template to default to 1 (set KillOnBadExit=1 in
contribs/cray/slurm.conf.template).
bug 1171

eeb97050
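
The "terminated" flag mentioned in the feedback above can be sketched as a per-node latch that suppresses repeated messages. This Python sketch only illustrates the idea; `StepNode`, its fields, and the message text are hypothetical and do not reflect Slurm's actual code or wording.

```python
class StepNode:
    """Hypothetical per-node state for one job step."""

    def __init__(self, name):
        self.name = name
        self.terminated = False  # set once the first bad exit is reported
        self.messages = []       # stands in for the real error log

    def task_exited_without_fini(self, step_id, task_id):
        # Report only the first offending task on this node; later tasks
        # exiting the same way are ignored to avoid duplicate messages.
        if self.terminated:
            return
        self.terminated = True
        self.messages.append(
            f"step {step_id} task {task_id} exited without finalizing PMI"
        )
```

Calling `task_exited_without_fini` for several tasks on the same node records a single message.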

Change Cray mpi_fini failure logic · 5f89223f

Morris Jette authored Oct 16, 2014

Treat Cray MPI job calling exit() without mpi_fini() as fatal error for
that specific task and let srun handle all timeout logic.
Previous logic would cancel the entire job step and srun options
for wait time and kill on exit were ignored. The new logic provides
users with the following type of response:

$ srun -n3 -K0 -N3 --wait=60 ./tmp
Task:0 Cycle:1
Task:2 Cycle:1
Task:1 Cycle:1
Task:0 Cycle:2
Task:2 Cycle:2
slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
srun: error: tux2: task 1: Killed
Task:0 Cycle:3
Task:2 Cycle:3
Task:0 Cycle:4
...

bug 1171

5f89223f

15 Oct, 2014 - 9 commits

Avoid duplicate PowerUp of node on slurmctld start · d99cf552

Morris Jette authored Oct 15, 2014

This fixes a race condition if the slurmctld needed to power up
a node shortly after startup. Previously it would execute the
ResumeProgram twice for affected nodes.

d99cf552

if DOWN node set to PowerDown, clear NoResp flag · 13023913

Morris Jette authored Oct 15, 2014

Without this change, a node in the cloud that failed to power up
would not have its NoResponding flag cleared, which would prevent
its later use. The NoResponding flag is now cleared when the node
is manually modified to PowerDown.

13023913

Permit more batch requeues for cloud bursting · 22a01dc7

Morris Jette authored Oct 15, 2014

If a batch job launch to the cloud fails, permit an unlimited
number of job requeues. Previously the job would abort on the
second launch failure.

22a01dc7

Revert "Revert "MYSQL - Fix load of archive files."" · fa73701e
Nicolas Joly authored Oct 15, 2014
```
This reverts commit 4d03d0b4.

Make sure the correct Author is attributed here.
```
fa73701e
Revert "MYSQL - Fix load of archive files." · 4d03d0b4
Danny Auble authored Oct 15, 2014
```
This reverts commit 1891936e.
```
4d03d0b4

MYSQL - Fix load of archive files. · 1891936e

Danny Auble authored Oct 15, 2014

This has apparently been broken from the get go.  This fixes bug 1172.

test21.22 should be updated to test the dump and load of a file that is
generated.

1891936e

Only print warning message on test1.91 if running with FastSchedule=2 · 9c1cdee9
Danny Auble authored Oct 15, 2014
```
since this indicates the user could be exaggerating their system.
```
9c1cdee9
Fix misleading comments. These represent a core mask, not CPU. · 3b9ecd1c
Danny Auble authored Oct 15, 2014

3b9ecd1c

Fix issue with task/affinity oversubscribing cpus erroneously when · 03dc6ea7

Danny Auble authored Oct 15, 2014

using --ntasks-per-node.

This is related to bug 1145.  What was happening is all the cpus were
allocated on one socket instead of being distributed cyclically.  While this
is allowed, it is strange and resulted in this bug.  There appears to be a different
bug as to why the tasks were laid out in a block fashion in the first
place.

03dc6ea7
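
The block and cyclic layouts mentioned in commit 03dc6ea7 can be contrasted with a small Python sketch. It assumes one task per core and maps task numbers to socket numbers; the function names are hypothetical, and this is not the task/affinity plugin's algorithm.

```python
def block_layout(ntasks, sockets, cores_per_socket):
    """Fill one socket completely before moving on to the next."""
    return [(t // cores_per_socket) % sockets for t in range(ntasks)]

def cyclic_layout(ntasks, sockets):
    """Place consecutive tasks on consecutive sockets, round robin."""
    return [t % sockets for t in range(ntasks)]
```

For four tasks on two sockets with four cores each, `block_layout(4, 2, 4)` puts every task on socket 0, while `cyclic_layout(4, 2)` alternates between sockets 0 and 1.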

14 Oct, 2014 - 5 commits

ALPS - Don't drain nodes if epilog fails. It leaves them in drain state · 8b479a41

Danny Auble authored Oct 14, 2014

with no way to get them out.

This fixes bug 1134.  It is advised that the pro/epilog script call
xtprocadmin instead of returning a non-zero exit code.

8b479a41

Fix invalid read by always finding trigger's job. · aa4b5b74

Brian Christiansen authored Oct 14, 2014

The job could have been purged due to a short MinJobAge and the
trigger would then point to an invalid job.
Bug #1144

aa4b5b74
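
The approach named in the title of commit aa4b5b74, always finding the trigger's job, can be sketched as looking the job up by ID each time instead of keeping a reference to a record that may have been purged. This Python sketch is an illustration only; `Trigger`, `find_trigger_job`, and the dict-based job table are hypothetical.

```python
class Trigger:
    """Hypothetical trigger that remembers only the job's ID."""

    def __init__(self, job_id):
        self.job_id = job_id  # an ID stays valid even if the job is purged

def find_trigger_job(trigger, job_table):
    """Look the job up on every use; returns None if it was purged."""
    return job_table.get(trigger.job_id)
```

After the job is removed from the table, the lookup yields `None`, which the caller can handle explicitly rather than reading freed memory.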

Make undocumented --alps return error if used on any system. · a52c8147
Danny Auble authored Oct 14, 2014

a52c8147

Don't invoke spank prolog/epilog if no config file · 967e11d5

Morris Jette authored Oct 14, 2014

Note that PlugStackConfig defaults to plugstack.conf in the same
directory as slurm.conf. The added logic tests if the file actually
exists (using stat) and, if it is not found, does not fork/exec slurmstepd
to invoke the spank prolog/epilog. This saves about 14msec on startup
and 14msec on shutdown if no spank plugins are configured. It also
eliminates some possible failures (e.g. if fork() fails, or the
slurmstepd processes can not exec()). This logic also caches the
PlugStackConfig value and reads it again on reconfigure, but avoids
reading the value for each job.
bug 982

967e11d5
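
The logic described in commit 967e11d5 can be outlined in Python as follows. This is a sketch under stated assumptions: `SpankConfigCache` and its methods are hypothetical names, and whether the real code runs stat once or per job is not stated in the commit message (here it runs per call).

```python
import os

class SpankConfigCache:
    """Hypothetical cache of the PlugStackConfig path."""

    def __init__(self, slurm_conf_path, plugstack_config=None):
        self.reconfigure(slurm_conf_path, plugstack_config)

    def reconfigure(self, slurm_conf_path, plugstack_config=None):
        # Resolve the path once (and again on reconfigure), not per job.
        # Default: plugstack.conf in the same directory as slurm.conf.
        self.path = plugstack_config or os.path.join(
            os.path.dirname(slurm_conf_path), "plugstack.conf"
        )

    def should_run_spank(self):
        # If the file does not exist, skip the fork/exec of the helper
        # that would run the spank prolog/epilog.
        try:
            os.stat(self.path)
        except OSError:
            return False
        return True
```

A caller would consult `should_run_spank()` before launching the helper process, which avoids the process creation cost when no plugins are configured.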

Cosmetic changes · 5b7c84cc

Morris Jette authored Oct 14, 2014

Add "void" argument to a function and rename a local function to have
a prefix of "_"

5b7c84cc