1. 12 Nov, 2014 2 commits
  2. 10 Nov, 2014 1 commit
    • Danny Auble · 7461c119
      Fix issue where exclusive allocations wouldn't lay tasks out correctly
      with CR_PACK_NODES.
      
      This re-does commit d388dd67 in a different way, obtaining the same
      information so that tasks are laid out correctly when --hint=nomultithread
      is used.
      
      Tests on a 4-core, 8-thread system:
      srun -n6 --hint=nomultithread --exclusive whereami | sort -h
      srun: cpu count 6
         0 snowflake0 - MASK:0x1
         1 snowflake0 - MASK:0x2
         2 snowflake0 - MASK:0x4
         3 snowflake0 - MASK:0x8
         4 snowflake1 - MASK:0x1
         5 snowflake1 - MASK:0x2
      
      and
      
      srun -n10 -N5 --hint=nomultithread --exclusive whereami | sort -h
      srun: cpu count 10
         0 snowflake0 - MASK:0x1
         1 snowflake0 - MASK:0x2
         2 snowflake0 - MASK:0x4
         3 snowflake0 - MASK:0x8
         4 snowflake1 - MASK:0x1
         5 snowflake1 - MASK:0x2
         6 snowflake1 - MASK:0x4
         7 snowflake2 - MASK:0x1
         8 snowflake3 - MASK:0x1
         9 snowflake4 - MASK:0x1
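      Each MASK in the output above has exactly one bit set: with
      --hint=nomultithread every task is bound to its own physical core and the
      sibling hyperthreads stay idle. A minimal C sketch (illustrative only, not
      part of the commit) that decodes such masks into CPU indices:

      #include <stdio.h>

      /* Decode an affinity mask such as 0x8 into the logical CPU numbers it
       * selects; illustrative only. */
      static void print_cpus(unsigned long mask)
      {
          printf("MASK:0x%lx ->", mask);
          for (int cpu = 0; cpu < (int)(8 * sizeof(mask)); cpu++)
              if (mask & (1UL << cpu))
                  printf(" cpu%d", cpu);
          printf("\n");
      }

      int main(void)
      {
          unsigned long masks[] = { 0x1, 0x2, 0x4, 0x8 };  /* from the output above */
          for (int i = 0; i < 4; i++)
              print_cpus(masks[i]);
          return 0;
      }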
  3. 07 Nov, 2014 2 commits
  4. 06 Nov, 2014 4 commits
  5. 05 Nov, 2014 1 commit
  6. 04 Nov, 2014 2 commits
  7. 31 Oct, 2014 4 commits
  8. 30 Oct, 2014 1 commit
  9. 27 Oct, 2014 1 commit
  10. 24 Oct, 2014 1 commit
  11. 23 Oct, 2014 1 commit
  12. 21 Oct, 2014 1 commit
    • Morris Jette · 1209a664
      Fix job gres info clear on slurmctld restart
      Fix a bug that prevented preservation of a job's GRES bitmap on slurmctld
      restart or reconfigure. The bug was introduced in 14.03.5 ("Clear record of
      a job's gres when requeued") and only applies when GRES are mapped to
      specific files.
      bug 1192
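      "GRES mapped to specific files" refers to gres.conf entries that pin each
      resource to a device file. A hedged, illustrative configuration (the paths
      and types are invented, not taken from the bug report):

      $ cat /etc/slurm/gres.conf
      Name=gpu Type=tesla File=/dev/nvidia0
      Name=gpu Type=tesla File=/dev/nvidia1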
  13. 20 Oct, 2014 4 commits
  14. 18 Oct, 2014 1 commit
  15. 17 Oct, 2014 3 commits
  16. 16 Oct, 2014 2 commits
    • Brian Christiansen · e1c42895
    • Morris Jette · 5f89223f
      Change Cray mpi_fini failure logic
      Treat a Cray MPI job calling exit() without mpi_fini() as a fatal error for
      that specific task, and let srun handle all timeout logic.
      The previous logic cancelled the entire job step, and the srun options for
      wait time and kill-on-exit were ignored. The new logic gives users the
      following type of response:
      
      $ srun -n3 -K0 -N3 --wait=60 ./tmp
      Task:0 Cycle:1
      Task:2 Cycle:1
      Task:1 Cycle:1
      Task:0 Cycle:2
      Task:2 Cycle:2
      slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
      srun: error: tux2: task 1: Killed
      Task:0 Cycle:3
      Task:2 Cycle:3
      Task:0 Cycle:4
      ...
      
      bug 1171
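      A hedged reproducer sketch for the situation above (the source of the
      original ./tmp program is not shown in the commit, and the build line is an
      assumption, e.g. mpicc repro.c -o tmp && srun -n3 -K0 -N3 --wait=60 ./tmp):

      /* repro.c -- illustrative only.  Rank 1 calls exit() without
       * MPI_Finalize(), which slurmstepd reports on Cray systems as
       * "exited without calling mpi_fini()". */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          int rank;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          for (int cycle = 1; cycle <= 10; cycle++) {
              printf("Task:%d Cycle:%d\n", rank, cycle);
              fflush(stdout);
              if (rank == 1 && cycle == 1)
                  exit(1);               /* bail out without MPI_Finalize() */
              sleep(1);
          }
          MPI_Finalize();                /* surviving tasks finish cleanly */
          return 0;
      }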
  17. 15 Oct, 2014 4 commits
  18. 14 Oct, 2014 2 commits
  19. 10 Oct, 2014 3 commits
    • Danny Auble · 6bf40ed9
    • Brian Christiansen · 5d6a2dc2
    • Dorian Krause · 0dd12469
      Job step memory allocation logic fix
      This commit fixes a bug we observed when combining select/linear with
      gres: if an allocation was requested with a --gres argument, an srun
      execution within that allocation would stall indefinitely:
      
      -bash-4.1$ salloc -N 1 --gres=gpfs:100
      salloc: Granted job allocation 384049
      bash-4.1$ srun -w j3c017 -n 1 hostname
      srun: Job step creation temporarily disabled, retrying
      
      The slurmctld log showed:
      
      debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
      debug3:    cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
      debug3:    host=j3l02 port=33608 name=hostname network=(null) exclusive=0
      debug3:    checkpoint-dir=/home/user checkpoint_int=0
      debug3:    mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
      debug3:    overcommit=0 time_limit=0 gres=(null) constraints=(null)
      debug:  Configuration for job 384049 complete
      _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
      _slurm_rpc_job_step_create for job 384049: Requested nodes are busy
      
      If srun --exclusive had been used instead, everything would have worked
      fine. The reason is that in exclusive mode the code properly checks whether
      memory is a reserved resource in the _pick_step_nodes() function.
      This commit modifies the alternate code path to do the same.
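      The actual change lives inside slurmctld; the following is only a
      hypothetical, self-contained C sketch of the guard it describes (the struct
      and helper names are invented, not Slurm symbols): the non-exclusive path
      must skip the "memory used by other steps" test whenever memory is not a
      tracked resource, just as the exclusive path already does.

      /* Hypothetical model only -- not the slurmctld data structures. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct toy_job {
          bool     mem_tracked;   /* is memory a reserved/consumable resource?   */
          uint64_t mem_alloc;     /* MB already claimed by other steps on a node */
          uint64_t mem_total;     /* MB usable on that node                      */
      };

      /* The guard the commit adds: only reject a node for lack of memory when
       * memory is actually being tracked for the job. */
      static bool node_usable_for_step(const struct toy_job *job, uint64_t step_mem)
      {
          if (!job->mem_tracked)
              return true;        /* memory not enforced: the node stays eligible */
          return job->mem_alloc + step_mem <= job->mem_total;
      }

      int main(void)
      {
          /* 62720 MB matches the mem_per_node value in the log above. */
          struct toy_job untracked = { false, 62720, 62720 };
          struct toy_job tracked   = { true,  62720, 62720 };
          printf("untracked: usable=%d\n", node_usable_for_step(&untracked, 62720));
          printf("tracked:   usable=%d\n", node_usable_for_step(&tracked, 62720));
          return 0;
      }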