Commits · d235f891e5786dc6af3cfe614029fabee8be5556 · Manuel G. Marciani / ces_slurm_simulator

05 Jul, 2013 1 commit
- switch/nrt - Don't allocate network resources unless job step has 2+ nodes · 7ea11af2
  jette authored Jul 05, 2013
  
  7ea11af2
28 Jun, 2013 3 commits
- Select/cons_res - Correct total CPU count allocated to a job · 9a17ba1c
  Morris Jette authored Jun 28, 2013
```
Affects jobs with --exclusive and --cpus-per-task options
bug 355
```
  9a17ba1c
- When a job is aborted send a message for any tasks that have completed. · ee125a47
  Danny Auble authored Jun 28, 2013
  
  ee125a47
- Fix it so bluegene and serial systems don't get warnings over new NODEDATA · 44662565
  Danny Auble authored Jun 27, 2013
```
enum.
```
  44662565
26 Jun, 2013 2 commits
- Update news for start of v2.6.0 development · 7395805c
  Morris Jette authored Jun 26, 2013
  
  7395805c
- Fix memory corruption, but leave the merge sort as not part of 2.6 · e8b341d3
  Dominik Friedrich authored Jun 25, 2013
  
  e8b341d3
25 Jun, 2013 3 commits
- IPMI - fix adjustment on poll when using EnergyIPMICalcAdjustment. · 58e87d46
  Thomas Cadeau authored Jun 25, 2013
  
  58e87d46
- sstat - Fix issue where if -j wasn't given allow last argument to be checked · 1b5701bf
  Danny Auble authored Jun 25, 2013
```
for as the job/step id.
```
  1b5701bf
- Updated the automake min version in autogen.sh to be correct. · 390e7558
  David Gloe authored Jun 24, 2013
```
The SLURM Makefile.am scripts use pkglibexecdir. One source indicates
that this was not added until automake 1.10.2
(https://github.com/rerun/rerun/issues/167).

So we made that the minimum.
```
  390e7558
24 Jun, 2013 1 commit

Modify slurmctld locking to improve performance · ba58d59c

jette authored Jun 24, 2013

Under very heavy load with many thousands of batch job submissions
or job signals, the write lock can be held for very long periods of
time preventing job scheduling, squeue response, etc. This code
inserts a timing break to permit other functions to get the locks.

ba58d59c
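
The "timing break" described in ba58d59c can be illustrated with a minimal sketch. This is not slurmctld's code (which is C and uses its own lock structures); the function name, parameters, and default thresholds below are all hypothetical. It only shows the general pattern: a long-running loop gives up its lock after a time budget so other waiters can make progress.

```python
import threading
import time


def process_with_lock_breaks(lock, items, handle,
                             max_hold_seconds=0.5, pause_seconds=0.001):
    """Handle every item under `lock`, releasing it periodically.

    Hypothetical helper: once the lock has been held for at least
    `max_hold_seconds`, it is released for `pause_seconds` so that
    other threads get a chance to acquire it. Returns the number of
    breaks taken.
    """
    pending = iter(items)
    breaks = 0
    done = False
    while not done:
        with lock:
            deadline = time.monotonic() + max_hold_seconds
            for item in pending:
                handle(item)
                if time.monotonic() >= deadline:
                    break  # time budget used up: leave the locked region
            else:
                done = True  # iterator exhausted without hitting the budget
        if not done:
            breaks += 1
            time.sleep(pause_seconds)  # window for other lock users
    return breaks


if __name__ == "__main__":
    seen = []
    n = process_with_lock_breaks(threading.Lock(), range(5), seen.append)
    print(seen, "breaks:", n)
```

The trade-off is that work done under the lock is no longer atomic across a break, so each item must leave shared state consistent before the lock is released.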

21 Jun, 2013 4 commits
- Remove hardcoded /usr/local from slurm.spec. · 9c36b72f
  Danny Auble authored Jun 21, 2013
  
  9c36b72f
- Remove --program-prefix from spec file since it appears to be added by · fdba3fb5
  Danny Auble authored Jun 21, 2013
```
default and appeared to break other things.
```
  fdba3fb5
- Get html/man files to install in correct places with rpms. · 73df7da7
  Danny Auble authored Jun 21, 2013
  
  73df7da7
- Make SLURM_DISTRIBUTION env var hold both types of distribution if · 971706cd
  Martin Perry authored Jun 20, 2013
```
specified.
```
  971706cd
18 Jun, 2013 2 commits
- NEWS for checkin for list algo change · e109cebe
  Danny Auble authored Jun 18, 2013
  
  e109cebe
- ACCT_GATHER - handle suspending correctly for polling threads. · 3aeda666
  Danny Auble authored Jun 17, 2013
  
  3aeda666
12 Jun, 2013 1 commit
- Fix bug that would leak memory and over-write the AllowGroups field · 7d47017b
  Morris Jette authored Jun 11, 2013
```
on "scontrol reconfig" when AllowNodes is manually changed using
scontrol since last slurmctld restart.
```
  7d47017b
10 Jun, 2013 1 commit

Avoid gres step allocation errors when a job shrinks in size · 9c216b9d

Morris Jette authored Jun 10, 2013

due to either down nodes or explicit resizing.
Generated slurmctld errors of this type:
[2013-06-04T12:43:46+06:00] error: gres/gpu: step_test 68662.4294967294 gres_bit_alloc is NULL
This is a movement of the logic introduced in commit
https://github.com/SchedMD/slurm/commit/6fff97bb77d2d88aa808c47fd7880246a0c1d090
to eliminate a memory leak.

9c216b9d

07 Jun, 2013 1 commit
- HDF5 - Fix issue with Ubuntu where HDF5 development headers are · a7b5262d
  Danny Auble authored Jun 07, 2013
```
overwritten by the parallel versions, so we need to handle
both cases.
```
  a7b5262d
06 Jun, 2013 1 commit
- Fix for slurmctld segfault on NULL front-end reason field. · df858ff1
  Mark Nelson authored Jun 06, 2013
  
  df858ff1
05 Jun, 2013 5 commits

- energy - On a single node only use the last task for gathering energy. · 09601d60
  Danny Auble authored Jun 05, 2013
```
Since we don't currently track energy usage per task (only per step).
Otherwise we get double the energy.
```
  09601d60
- srun - Don't check for executable if --test-only flag is used. · 3309d164
  Danny Auble authored Jun 05, 2013
  
  3309d164

priority/multifactor2 - Prevent possible divide by zero. · fc3997f9

Janne Blomqvist authored Jun 05, 2013

Andy Wettstein (University of Chicago) reported privately to me that slurmctld
2.5.4 crashed after he enabled the priority/multifactor2 plugin due to a
division by zero error.

I was able to reproduce the crash by creating an account hierarchy where all
the accounts and users had zero shares.
See bug 315

fc3997f9

- Add another test of job step node selection within a job allocation · 8f6bf3cb
  David Bigagli authored Jun 04, 2013
```
Revert premature change of META
```
  8f6bf3cb

launch/poe - Fix for hostlist file support with repeated host names. · cc9187f5

jette authored Jun 04, 2013

Without this change, it appears that POE ignores the -procs argument
resulting in a job step request with multiple host names, but only
one ntask required

cc9187f5
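
The crash described in fc3997f9 (an account hierarchy where every account and user has zero shares) is the classic unguarded-ratio problem. The actual change to priority/multifactor2 is not shown in this log, so the following is only a generic sketch of such a guard; the function name and the choice of 0.0 as the fallback value are assumptions.

```python
def normalized_shares(shares, sibling_shares_total):
    """Return this entity's fraction of the shares at its level.

    Hypothetical helper. If every sibling (including this one) has zero
    shares, the denominator is zero; return 0.0 instead of dividing.
    """
    if sibling_shares_total == 0:
        return 0.0  # assumed fallback, not taken from the Slurm source
    return shares / sibling_shares_total


if __name__ == "__main__":
    print(normalized_shares(1, 4))
    print(normalized_shares(0, 0))
```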

04 Jun, 2013 3 commits
- Start NEWS for v2.5.8 · 6102d42c
  Morris Jette authored Jun 03, 2013
  
  6102d42c
- Add ability to specify host repetition count in the srun hostfile · a3ae22b7
  jette authored Jun 04, 2013
```
For example "host1*2" is equivalent to "host1,host1".
```
  a3ae22b7
- launch/poe - Fix for hostlist file support with repeated host names. · 58c21140
  jette authored Jun 04, 2013
```
Without this change, it appears that POE ignores the -procs argument
resulting in a job step request with multiple host names, but only
one ntask required
```
  58c21140
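
The repetition syntax added in a3ae22b7 ("host1*2" is equivalent to "host1,host1") can be sketched as a small expansion routine. This is not srun's parser; the function names are hypothetical, and handling of commas and multiple lines is an assumption made for illustration. Only the `name*count` equivalence comes from the commit message.

```python
def expand_hostfile_entry(entry):
    """Expand 'host*N' into N copies of 'host'; other names pass through."""
    name, sep, count = entry.strip().partition("*")
    if not sep:
        return [name]
    repeat = int(count)  # raises ValueError on a non-numeric count
    if repeat < 1:
        raise ValueError("repetition count must be at least 1: %r" % entry)
    return [name] * repeat


def expand_hostfile(text):
    """Expand every comma- or newline-separated entry in `text`."""
    hosts = []
    for line in text.splitlines():
        for entry in line.split(","):
            if entry.strip():
                hosts.extend(expand_hostfile_entry(entry))
    return hosts


if __name__ == "__main__":
    print(expand_hostfile("host1*2\nhost2"))
```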
03 Jun, 2013 2 commits

- Start NEWS for v2.5.8 · c795724d
  Morris Jette authored Jun 03, 2013
  
  c795724d

restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8

Hongjia Cao authored Jun 03, 2013

We're having some trouble getting our slurm jobs to successfully
restart after a checkpoint.  For this test, I'm using sbatch and a
simple, single-threaded executable.  Slurm is 2.5.4, blcr is 0.8.5.
I'm submitting the job using sbatch:

$ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

$ scontrol checkpoint create 137
.... time passes ....
$ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current
directory and the checkpoint file from Slurm
in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
restart the job:

$ scontrol checkpoint restart 137
scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

[2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
[2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
[2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
[2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid

f82e0fb8

31 May, 2013 1 commit

Fix srun support for job step --memory specification (--mem-per-cpu OK) · f032de75

Morris Jette authored May 30, 2013

Rename slurm_step_ctx_params_t field from "mem_per_cpu" to "pn_min_memory".
Job steps now accept a memory specification on either a per-CPU or per-node basis.

f032de75

30 May, 2013 1 commit
- Select/cons_res - Fix bug resulting in held job · b574168b
  Morris Jette authored May 29, 2013
```
Uninitialized variables resulted in error of
"cons_res: sync loop not progressing, holding job #"
```
  b574168b
29 May, 2013 1 commit

Fix job step allocation with --exclusive and --hostlist option · 85cab0cb

jette authored May 28, 2013

The most notable problem case is on a Cray where a job step
specifically requests one or more nodes that are not the first
nodes in the job allocation.

85cab0cb

24 May, 2013 2 commits
- Added sbatch option "--ignore-pbs" to ignore "#PBS" options in the batch script. · d41ddb20
  Morris Jette authored May 24, 2013
  
  d41ddb20
- Added "PriorityFlags" value of "SMALL_RELATIVE_TO_TIME". · d8ec48cd
  Morris Jette authored May 24, 2013
```
If set, the job's size component will be based not on the job size
alone, but on the job's size divided by its time limit.
```
  d8ec48cd
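
The SMALL_RELATIVE_TO_TIME description in d8ec48cd (job size divided by time limit) can be illustrated with a rough sketch. The function name, the use of node count as "size", and minutes as the time unit are assumptions; the real priority plugin applies further normalization and weighting that this log does not describe.

```python
def job_size_component(job_nodes, time_limit_minutes,
                       small_relative_to_time=False):
    """Return an unnormalized size component for a job.

    Hypothetical helper. Without the flag the component depends on job
    size alone; with it, size is divided by the job's time limit.
    """
    if not small_relative_to_time:
        return float(job_nodes)
    if time_limit_minutes <= 0:
        raise ValueError("time limit must be positive")
    return job_nodes / time_limit_minutes


if __name__ == "__main__":
    # Same size, different time limits: only the flagged form tells them apart.
    print(job_size_component(8, 60, small_relative_to_time=True))
    print(job_size_component(8, 600, small_relative_to_time=True))
```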
23 May, 2013 5 commits
- sched/backfill - Modify logic to reduce overhead under heavy load. · 941a5ac9
  Morris Jette authored May 23, 2013
```
The problem we have observed is the backfill scheduler temporarily
gives up its locks (one second), but then reclaims them before the
backlog of work completes, basically keeping the backfill scheduler
running for a really long time when under a heavy load.
bug 297
```
  941a5ac9
- switch/nrt - Correct network_id use logic. · 7d4a8441
  Morris Jette authored May 23, 2013
  
  7d4a8441
- Node reboot logic correction · 024787b6
  Morris Jette authored May 22, 2013
```
Defers (rather than forgets) reboot request with job running on the
node within a reservation.
```
  024787b6
- switch/nrt - Correct network_id use logic. · 7faae23f
  Morris Jette authored May 23, 2013
  
  7faae23f
- Node reboot logic correction · fcc63508
  Morris Jette authored May 22, 2013
```
Defers (rather than forgets) reboot request with job running on the
node within a reservation.
```
  fcc63508