Commits · 04f06338896c14e83a3d5fb0dec81c60d4aca071 · Manuel G. Marciani / ces_slurm_simulator

08 Oct, 2013 1 commit

EpilogSlurmctld race condition/SEGV fix · 04f06338

Morris Jette authored Oct 08, 2013

EpilogSlurmctld pthread is passed required arguments rather than a pointer
to the job record, which under some conditions could be purged and result
in an invalid memory reference.

04f06338
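
The pattern this commit describes, copying the values a thread needs before it starts instead of handing it a reference to a record that may be purged, can be sketched as follows. This is a minimal Python illustration, not Slurm's C code; `JobRecord`, `EpilogArgs`, `run_epilog`, and `start_epilog_thread` are hypothetical names, and the "purge" is only simulated.

```python
import threading
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical stand-in for a controller-side job record."""
    job_id: int
    user_id: int
    nodes: str


@dataclass(frozen=True)
class EpilogArgs:
    """Immutable snapshot of the fields the epilog thread needs."""
    job_id: int
    user_id: int
    nodes: str


def run_epilog(args: EpilogArgs, results: list) -> None:
    # The thread only touches its private snapshot, never the shared record.
    results.append(
        f"epilog job={args.job_id} uid={args.user_id} nodes={args.nodes}"
    )


def start_epilog_thread(job: JobRecord, results: list) -> threading.Thread:
    # Copy the required fields while the record is still known to be valid.
    args = EpilogArgs(job.job_id, job.user_id, job.nodes)
    thread = threading.Thread(target=run_epilog, args=(args, results))
    thread.start()
    return thread


jobs = {137: JobRecord(job_id=137, user_id=1000, nodes="tux[1-2]")}
results = []
worker = start_epilog_thread(jobs[137], results)

# Simulate a purge: the record leaves the table and its contents are wiped.
purged = jobs.pop(137)
purged.nodes = None

worker.join()
```

Because the thread received its own copy of the arguments, the simulated purge cannot change what it sees, regardless of how the two are interleaved.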

02 Oct, 2013 1 commit
- Enforce QOS MaxCPUsMin limit when job has no time limit · 6db7a305
  Morris Jette authored Oct 02, 2013
```
bug 436
```
  6db7a305
23 Sep, 2013 1 commit
- Reorder get config logic to avoid deadlock. · 262374a8
  Morris Jette authored Sep 23, 2013
```
bug 428
```
  262374a8
13 Aug, 2013 1 commit

select/cons_res - Avoid extraneous "oversubscribe" error messages · 302d8b3f

jette authored Aug 13, 2013

This problem was reported by Harvard University and could be
reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id".
With other job types, the error message could be logged many times
for each job. This change logs the error once per job and only if
the job request does not include the -O/--overcommit option.

302d8b3f
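
The behaviour described above (log the error once per job, and only when the job did not request overcommit) can be sketched like this. It is a minimal Python illustration under assumptions: the function name, the `already_warned` set, and the message text are hypothetical and are not taken from select/cons_res.

```python
import logging

logger = logging.getLogger("cons_res_sketch")


def warn_oversubscribe_once(job_id, overcommit, already_warned):
    """Log at most one oversubscribe error per job; skip overcommit jobs.

    Returns True if a message was logged, False otherwise.
    """
    if overcommit or job_id in already_warned:
        return False
    already_warned.add(job_id)
    # Illustrative message text only; not the wording Slurm uses.
    logger.error("job %s: oversubscribe", job_id)
    return True


warned = set()
# The same job hits the condition three times but is reported once.
outcomes = [warn_oversubscribe_once(42, False, warned) for _ in range(3)]
# A job submitted with -O/--overcommit is never reported.
overcommit_outcome = warn_oversubscribe_once(43, True, warned)
```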

05 Jul, 2013 1 commit
- switch/nrt - Don't allocate network resources unless job step has 2+ nodes · 7ea11af2
  jette authored Jul 05, 2013
  
  7ea11af2
28 Jun, 2013 1 commit
- Select/cons_res - Correct total CPU count allocated to a job · 9a17ba1c
  Morris Jette authored Jun 28, 2013
```
Affects jobs with --exclusive and --cpus-per-task options
bug 355
```
  9a17ba1c
25 Jun, 2013 1 commit

Updated the automake min version in autogen.sh to be correct. · 390e7558

David Gloe authored Jun 24, 2013

The SLURM Makefile.am scripts use pkglibexecdir. One source indicates
that this was not added until automake 1.10.2
(https://github.com/rerun/rerun/issues/167).

So we made that the minimum.

390e7558

21 Jun, 2013 2 commits
- Remove --program-prefix from spec file since it appears to be added by · fdba3fb5
  Danny Auble authored Jun 21, 2013
```
default and appeared to break other things.
```
  fdba3fb5
- Get html/man files to install in correct places with rpms. · 73df7da7
  Danny Auble authored Jun 21, 2013
  
  73df7da7
12 Jun, 2013 1 commit
- Fix bug that would leak memory and over-write the AllowGroups field · 7d47017b
  Morris Jette authored Jun 11, 2013
```
on "scontrol reconfig" when AllowNodes has been manually changed using
scontrol since the last slurmctld restart.
```
  7d47017b
10 Jun, 2013 1 commit

Avoid gres step allocation errors when a job shrinks in size · 9c216b9d

Morris Jette authored Jun 10, 2013

due to either down nodes or explicit resizing.
Generated slurmctld errors of this type:
[2013-06-04T12:43:46+06:00] error: gres/gpu: step_test 68662.4294967294 gres_bit_alloc is NULL
This moves the logic introduced in commit
https://github.com/SchedMD/slurm/commit/6fff97bb77d2d88aa808c47fd7880246a0c1d090
to eliminate a memory leak.

9c216b9d

06 Jun, 2013 1 commit
- Fix for slurmctld segfault on NULL front-end reason field. · df858ff1
  Mark Nelson authored Jun 06, 2013
  
  df858ff1
05 Jun, 2013 3 commits

energy - On a single node only use the last task for gathering energy. · 09601d60
Danny Auble authored Jun 05, 2013
```
We don't currently track energy usage per task (only per step),
so otherwise we get double the energy.
```
09601d60
srun - Don't check for executable if --test-only flag is used. · 3309d164
Danny Auble authored Jun 05, 2013

3309d164

priority/multifactor2 - Prevent possible divide by zero. · fc3997f9

Janne Blomqvist authored Jun 05, 2013

Andy Wettstein (University of Chicago) reported privately to me that slurmctld
2.5.4 crashed after he enabled the priority/multifactor2 plugin due to a
division by zero error.

I was able to reproduce the crash by creating an account hierarchy where all
the accounts and users had zero shares.
See bug 315

fc3997f9
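
The failure mode reported here (every account and user has zero shares, so a total used as a divisor is zero) can be guarded against as in the sketch below. This is a generic Python illustration, not the formula used by priority/multifactor2, which this log does not show; `normalized_shares` is a hypothetical helper.

```python
def normalized_shares(raw_shares):
    """Return each sibling's fraction of the total shares.

    When the total is zero, every fraction is reported as 0.0 instead of
    attempting the division.
    """
    total = sum(raw_shares)
    if total == 0:
        # All siblings have zero shares: skip the division entirely.
        return [0.0 for _ in raw_shares]
    return [share / total for share in raw_shares]


all_zero = normalized_shares([0, 0, 0])
mixed = normalized_shares([1, 3])
```

Returning 0.0 for the all-zero case is one possible policy; the point is only that the zero total is checked before it is used as a divisor.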

04 Jun, 2013 2 commits
- Start NEWS for v2.5.8 · 6102d42c
  Morris Jette authored Jun 03, 2013
  
  6102d42c
- launch/poe - Fix for hostlist file support with repeated host names. · 58c21140
  jette authored Jun 04, 2013
```
Without this change, it appears that POE ignores the -procs argument
resulting in a job step request with multiple host names, but only
one ntask required
```
  58c21140
03 Jun, 2013 1 commit

restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8

Hongjia Cao authored Jun 03, 2013

We're having some trouble getting our slurm jobs to successfully
restart after a checkpoint.  For this test, I'm using sbatch and a
simple, single-threaded executable.  Slurm is 2.5.4, blcr is 0.8.5.
I'm submitting the job using sbatch:

$ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

$ scontrol checkpoint create 137
.... time passes ....
$ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current
directory and the checkpoint file from Slurm
in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
restart the job:

$ scontrol checkpoint restart 137
scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

[2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
[2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
[2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
[2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid

f82e0fb8
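
The log above shows the restart being rejected because the saved job description carried `max_nodes == 0`. A minimal Python sketch of the idea in the commit title, restoring an unset `max_nodes` to the "no value" sentinel before the description is saved, might look like this. The dictionary layout and both function names are hypothetical, and the sentinel value is an assumption meant to mirror Slurm's `NO_VAL`.

```python
NO_VAL = 0xFFFFFFFE  # assumed to mirror Slurm's "not set" sentinel


def prepare_desc_for_checkpoint(desc):
    """Return a copy of a job description that is safe to save.

    A max_nodes of 0 is treated as "not set" and restored to NO_VAL.
    """
    saved = dict(desc)
    if saved.get("max_nodes", 0) == 0:
        saved["max_nodes"] = NO_VAL
    return saved


def node_count_is_valid(desc):
    """Mimic the check that rejected the restart: max_nodes must not be 0."""
    return desc["max_nodes"] != 0


broken = {"job_id": 137, "max_nodes": 0}
fixed = prepare_desc_for_checkpoint(broken)
```

With this change the saved description passes the same kind of check that previously produced "Node count specification invalid", while an explicitly requested node limit is left untouched.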

30 May, 2013 1 commit
- Select/cons_res - Fix bug resulting in held job · b574168b
  Morris Jette authored May 29, 2013
```
Uninitialized variables resulted in error of
"cons_res: sync loop not progressing, holding job #"
```
  b574168b
29 May, 2013 1 commit

Fix job step allocation with --exclusive and --hostlist option · 85cab0cb

jette authored May 28, 2013

The most notable problem case is on a Cray, where a job step
specifically requests one or more nodes that are not the first
nodes in the job allocation.

85cab0cb

23 May, 2013 4 commits
- sched/backfill - Modify logic to reduce overhead under heavy load. · 941a5ac9
  Morris Jette authored May 23, 2013
```
The problem we have observed is the backfill scheduler temporarily
gives up its locks (one second), but then reclaims them before the
backlog of work completes, basically keeping the backfill scheduler
running for a really long time when under a heavy load.
bug 297
```
  941a5ac9
- switch/nrt - Correct network_id use logic. · 7faae23f
  Morris Jette authored May 23, 2013
  
  7faae23f
- Node reboot logic correction · fcc63508
  Morris Jette authored May 22, 2013
```
Defers (rather than forgets) a reboot request when a job is running on
the node within a reservation.
```
  fcc63508
- CRAY - Support CLE 4.2.0 · b7b4b7d5
  Danny Auble authored May 22, 2013
  
  b7b4b7d5
22 May, 2013 2 commits
- BGQ - When --geo is requested do not impose the default conn_types. · 8f1d9c6b
  Danny Auble authored May 22, 2013
  
  8f1d9c6b
- switch/nrt - Validate dynamic window allocation size. · 922251e5
  jette authored May 22, 2013
  
  922251e5
18 May, 2013 1 commit
- BGQ - Fix issue with preemption on sub-block jobs where a job would kill · 3a849f26
  Danny Auble authored May 17, 2013
```
all preemptable jobs on the midplane instead of just the ones it needed to.
```
  3a849f26
16 May, 2013 2 commits
- Prevent clearing reason field for pending jobs. · 1f8e47ba
  Morris Jette authored May 16, 2013
```
This bug was introduced in commit f1cf6d2d
fix for bug 290
```
  1f8e47ba
- POE - pack missing variable to allow fanout (more than 32 nodes) · f45b7e9a
  Danny Auble authored May 16, 2013
  
  f45b7e9a
14 May, 2013 1 commit
- Priority/multifactor - Avoid underflow in half-life calculation. · 5d70ccce
  Morris Jette authored May 14, 2013
  
  5d70ccce
13 May, 2013 1 commit
- Drain node on prolog or epilog failure, rather than downing the node · e43239ae
  Morris Jette authored May 13, 2013
```
Downing the node will kill all jobs allocated to the node, which is very
bad on something like a BlueGene system.
```
  e43239ae
08 May, 2013 2 commits
- Update NEWS file for Bug#284. · bae01305
  David Bigagli authored May 08, 2013
  
  bae01305
- sview - Fix race condition where new information could have slipped past · 68f0f5db
  Danny Auble authored May 07, 2013
```
the node tab and we didn't notice.
```
  68f0f5db
02 May, 2013 2 commits

POE - Fix logic binding tasks to CPUs. · 48e164e0

jette authored May 02, 2013

Without this change, pmdv12 was bound to one CPU and could not use
all of the resources allocated to the job step for the tasks that
it launches.

48e164e0

POE - Correct task count for srun --launch-cmd option · d89d7cd9

jette authored May 02, 2013

This only changes behaviour when the --ntasks option is not used,
but the --cpus-per-task option is used.

d89d7cd9

01 May, 2013 4 commits
- POE - Correct logic to support srun network instances count with POE. · 2fe37e32
  Morris Jette authored May 01, 2013
  
  2fe37e32
- Accounting - Fix minor initialization error. · df0faeac
  Danny Auble authored May 01, 2013
  
  df0faeac
- POE - Correct logic to support poe option "-euidevice sn_all" · 878d67f1
  Morris Jette authored May 01, 2013
```
also "-euidevice sn_single".
```
  878d67f1
- CRAY - Change logging of transient ALPS errors from error() to debug(). · fd456175
  Morris Jette authored May 01, 2013
  
  fd456175
30 Apr, 2013 1 commit
- Accounting - make average by task not cpu. · 81ccec93
  Danny Auble authored Apr 29, 2013
  
  81ccec93