- 03 Jun, 2013 2 commits
-
-
jette authored
Previously, if the required node had no available CPUs left, other nodes in the job allocation would be used.
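A hedged illustration of the scenario (node and program names are hypothetical): inside an existing allocation, a step is pinned to a specific node with srun -w; before this fix, if that node had no free CPUs the step would spill onto other nodes in the allocation.
  $ salloc -N 2
  $ srun -w node01 -n 1 ./app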
-
Hongjia Cao authored
We're having some trouble getting our Slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, BLCR is 0.8.5.
I'm submitting the job using sbatch:
  $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
I am able to create the checkpoint and vacate the node:
  $ scontrol checkpoint create 137
  .... time passes ....
  $ scontrol vacate 137
At that point, I see the checkpoint file from BLCR in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:
  $ scontrol checkpoint restart 137
  scontrol_checkpoint error: Node count specification invalid
In slurmctld's log (at level 7) I see:
  [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
  [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
  [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
  [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- 30 May, 2013 1 commit
-
-
Morris Jette authored
Uninitialized variables resulted in the error "cons_res: sync loop not progressing, holding job #".
-
- 29 May, 2013 1 commit
-
-
jette authored
The most notable problem case is on a Cray, where a job step specifically requests one or more nodes that are not the first nodes in the job allocation.
-
- 23 May, 2013 8 commits
-
-
Morris Jette authored
The problem we have observed is that the backfill scheduler temporarily gives up its locks (for one second), but then reclaims them before the backlog of other work completes, effectively keeping the backfill scheduler running for a very long time under heavy load. Bug 297.
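As general context rather than part of this commit, backfill load can also be bounded from slurm.conf; a minimal sketch, assuming the bf_interval and bf_max_job_test SchedulerParameters are available in the release in use:
  SchedulerParameters=bf_interval=60,bf_max_job_test=100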
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Fix minor bug in the sdiag backfill scheduling time reported on BlueGene systems. Improve the explanation of the backfill scheduling cycle time calculation.
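For reference, the statistics in question can be inspected with the sdiag client; a minimal invocation (field names may differ by release, but backfill cycle times are reported in its backfilling section):
  $ sdiag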
-
Morris Jette authored
Defers (rather than forgets) a reboot request when a job is running on the node within a reservation.
-
Danny Auble authored
-
Danny Auble authored
-
- 22 May, 2013 5 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
jette authored
-
jette authored
-
- 21 May, 2013 1 commit
-
-
Morris Jette authored
-
- 18 May, 2013 2 commits
-
-
Danny Auble authored
all preemptable jobs on the midplane instead of just the ones it needed to.
-
Danny Auble authored
-
- 16 May, 2013 2 commits
-
-
Morris Jette authored
This bug was introduced in commit f1cf6d2d, a fix for bug 290.
-
Danny Auble authored
-
- 14 May, 2013 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
- 13 May, 2013 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
Downing the node will kill all jobs allocated to the node, which is very bad on something like a BlueGene system.
-
- 11 May, 2013 1 commit
-
-
David Bigagli authored
-
- 10 May, 2013 1 commit
-
-
Hongjia Cao authored
Fix of the following problem: if a node is excised from a job and a reconfiguration (e.g., a partition update) is done while the job is still running, the node will be left in the idle state but no longer available until the next reconfiguration/restart of slurmctld after the job finishes.
-
- 08 May, 2013 3 commits
-
-
David Bigagli authored
-
David Bigagli authored
-
Danny Auble authored
the node tab and we didn't notice.
-
- 07 May, 2013 1 commit
-
-
David Bigagli authored
-
- 04 May, 2013 1 commit
-
-
Morris Jette authored
Response to bug 274
-
- 03 May, 2013 1 commit
-
-
jette authored
Make the test work if the current working directory is not in the search path. Check for the appropriate task rank on POE-based systems. Disable the entire test on POE systems.
-
- 02 May, 2013 4 commits
-
-
jette authored
Without this change, pmdv12 was bound to one CPU and could not use all of the resources allocated to the job step for the tasks that it launches.
-
jette authored
This only changes behaviour when the --ntasks option is not used but the --cpus-per-task option is used.
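A hedged illustration of the affected invocation, with a hypothetical script name job.sh: the first command gives --cpus-per-task without --ntasks (the case whose behaviour this commit changes); the second gives both options and is unaffected.
  $ sbatch --cpus-per-task=4 job.sh
  $ sbatch --ntasks=2 --cpus-per-task=4 job.sh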
-
jette authored
-
Danny Auble authored
-
- 01 May, 2013 2 commits
-
-
Danny Auble authored
we used cpus erroneously but now we use tasks. The cpus variable will be taken out in 2.6.
-
Morris Jette authored
-