- 17 Oct, 2014 3 commits
-
David Bigagli authored
-
Morris Jette authored
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or restart. Previously licenses for suspended jobs were not counted, so the license count could be exceeded when those jobs were resumed.
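For context, licenses are defined cluster-wide in slurm.conf and requested per job; a minimal sketch (the license name and counts are hypothetical):

# slurm.conf: ten cluster-wide licenses named "foo" (hypothetical name/count)
Licenses=foo:10

$ sbatch -L foo:2 job.sh   # the job holds two "foo" licenses, including while suspended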
-
Danny Auble authored
commit a9dc50d4.
-
- 16 Oct, 2014 5 commits
-
Brian Christiansen authored
-
Danny Auble authored
to commit a9dc50d4.
-
Morris Jette authored
-
Morris Jette authored
Refine commit 5f89223f based upon feedback from David Gloe:
* It's not only MPI jobs, but anything that uses PMI. That includes MPI, shmem, etc., so you may want to reword the error message.
* I added the terminated flag because if multiple tasks on a node exit, you would get an error message from each of them. That reduces it to one error message per node. Cray bug 810310 prompted that change.
* Since we're now relying on --kill-on-bad-exit, I think we should update the Cray slurm.conf template to default to 1 (set KillOnBadExit=1 in contribs/cray/slurm.conf.template).
bug 1171
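The suggested template change is a single slurm.conf setting; a sketch of that line together with the equivalent per-run srun flag (the ./app program is hypothetical):

# contribs/cray/slurm.conf.template: terminate a step if any task exits abnormally
KillOnBadExit=1

$ srun -K1 ./app   # per-invocation equivalent (--kill-on-bad-exit=1)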
-
Morris Jette authored
Treat a Cray MPI job calling exit() without mpi_fini() as a fatal error for that specific task and let srun handle all timeout logic. Previous logic would cancel the entire job step, and the srun options for wait time and kill on exit were ignored. The new logic provides users with the following type of response:

$ srun -n3 -K0 -N3 --wait=60 ./tmp
Task:0 Cycle:1
Task:2 Cycle:1
Task:1 Cycle:1
Task:0 Cycle:2
Task:2 Cycle:2
slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
srun: error: tux2: task 1: Killed
Task:0 Cycle:3
Task:2 Cycle:3
Task:0 Cycle:4
...

bug 1171
-
- 15 Oct, 2014 9 commits
-
Morris Jette authored
This fixes a race condition if the slurmctld needed to power up a node shortly after startup. Previously it would execute the ResumeProgram twice for affected nodes.
-
Morris Jette authored
Without this change, a node in the cloud that failed to power up would not have its NoResponding flag cleared, which would prevent its later use. The NoResponding flag is now cleared when the node is manually modified to PowerDown.
-
Morris Jette authored
If a batch job launch to the cloud fails, permit an unlimited number of job requeues. Previously the job would abort on the second launch failure.
-
Nicolas Joly authored
This reverts commit 4d03d0b4. Make sure the correct author is attributed here.
-
Danny Auble authored
This reverts commit 1891936e.
-
Danny Auble authored
This has apparently been broken from the get-go. This fixes bug 1172. test21.22 should be updated to test the dump and load of a generated file.
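A plausible way to exercise that dump-and-load cycle with sacctmgr (the cluster name, file name, and exact argument syntax are assumptions, not taken from the commit):

$ sacctmgr dump mycluster file=mycluster.cfg   # generate a dump file
$ sacctmgr load file=mycluster.cfg             # reload it; this path was broken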
-
Danny Auble authored
since this would mean the user could be exaggerating their system.
-
Danny Auble authored
-
Danny Auble authored
using --ntasks-per-node. This is related to bug 1145. What was happening is that all the CPUs were allocated on one socket instead of being distributed cyclically. While this is allowed, it is unusual and resulted in this bug. There appears to be a separate bug explaining why the tasks were laid out in a block fashion in the first place.
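For illustration, the CPU distribution can be requested explicitly with srun's -m/--distribution option; a hypothetical invocation (the actual layout effect depends on node topology, and ./app is a placeholder):

$ srun -N1 --ntasks-per-node=4 -m block:cyclic ./app   # distribute allocated CPUs cyclically across sockets
$ srun -N1 --ntasks-per-node=4 -m block:block  ./app   # pack allocated CPUs onto one socket first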
-
- 14 Oct, 2014 8 commits
-
Danny Auble authored
with no way to get them out. This fixes bug 1134. It is advised that the prolog/epilog call xtprocadmin in the script instead of returning a non-zero exit code.
-
Brian Christiansen authored
The job could have been purged due to a short MinJobAge, and the trigger would then point to an invalid job. Bug #1144
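For context, the purge window and the trigger mechanism look roughly like this; a sketch with hypothetical values (the job id and program path are placeholders):

# slurm.conf: keep completed job records at least 300 seconds before purging
MinJobAge=300

$ strigger --set --jobid=1234 --fini --program=/usr/local/bin/job_done.sh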
-
Danny Auble authored
-
Morris Jette authored
Note that PlugStackConfig defaults to plugstack.conf in the same directory as slurm.conf. The added logic tests whether the file actually exists (using stat) and, if it is not found, does not fork/exec slurmstepd to invoke the spank prolog/epilog. This saves about 14 msec on startup and 14 msec on shutdown if no spank plugins are configured. It also eliminates some possible failures (e.g. if fork() fails, or the slurmstepd processes can not exec()). This logic also caches the PlugStackConfig value and re-reads it on reconfigure, but avoids reading the value for each job. bug 982
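The file whose existence is tested is the one named by PlugStackConfig; a minimal sketch of the two files involved (paths and plugin name are hypothetical):

# slurm.conf: optional; defaults to plugstack.conf next to slurm.conf
PlugStackConfig=/etc/slurm/plugstack.conf

# /etc/slurm/plugstack.conf: one spank plugin per line;
# if this file is absent, slurmstepd is no longer forked at all
optional /usr/lib64/slurm/spank_example.so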
-
Morris Jette authored
Add "void" argument to a function and rename a local function to have a prefix of "_"
-
Nicolas Joly authored
-
Danny Auble authored
9b00f12c
-
Nicolas Joly authored
Signed-off-by: Danny Auble <da@schedmd.com>
-
- 11 Oct, 2014 3 commits
-
Morris Jette authored
If a node is down, then permit setting its state to power down, which causes the SuspendProgram to run and sets the node state back to cloud.
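The manual power-down described here is driven through scontrol; a sketch (the node name is hypothetical):

$ scontrol update NodeName=cloud01 State=POWER_DOWN   # runs SuspendProgram; node returns to cloud state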
-
Morris Jette authored
If a node is powered down, then do not power it up on slurmctld restart.
-
Morris Jette authored
The power up/down request only takes effect after the ResumeTimeout or SuspendTimeout is reached in order to avoid a race condition.
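These timeouts are the standard power-saving knobs in slurm.conf; a minimal sketch (program paths and values are hypothetical):

SuspendProgram=/usr/local/sbin/node_poweroff
ResumeProgram=/usr/local/sbin/node_poweron
SuspendTimeout=30     # seconds allowed for a node to complete power-down
ResumeTimeout=300     # seconds allowed for a node to complete power-up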
-
- 10 Oct, 2014 12 commits
-
Danny Auble authored
-
Brian Christiansen authored
Bug #1143
-
Danny Auble authored
-
Dorian Krause authored
This commit fixes a bug we observed when combining select/linear with gres. If an allocation was requested with a --gres argument, an srun execution within that allocation would stall indefinitely:

-bash-4.1$ salloc -N 1 --gres=gpfs:100
salloc: Granted job allocation 384049
bash-4.1$ srun -w j3c017 -n 1 hostname
srun: Job step creation temporarily disabled, retrying

The slurmctld log showed:

debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
debug3: host=j3l02 port=33608 name=hostname network=(null) exclusive=0
debug3: checkpoint-dir=/home/user checkpoint_int=0
debug3: mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
debug: Configuration for job 384049 complete
_pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
_slurm_rpc_job_step_create for job 384049: Requested nodes are busy

If srun --exclusive had been used instead, everything would have worked fine. The reason is that in exclusive mode the code properly checks whether memory is a reserved resource in the _pick_step_nodes() function. This commit modifies the alternate code path to do the same.
-
Morris Jette authored
-
Brian Christiansen authored
-
Danny Auble authored
(i.e. ArchiveJobs, PurgeJobs). This is only a cosmetic change.
-
Nicolas Joly authored
on slurmdbd startup.
-
Danny Auble authored
-
Danny Auble authored
lots of jobs.
-
Danny Auble authored
-
Danny Auble authored
-