Commits · 1e16b4bf25e94745e395d4450ea3d6a47d690257 · Manuel G. Marciani / ces_slurm_simulator

05 Feb, 2013 7 commits
- Fix for not building sview if glib exists on a system but not the gtk libs. · 1e16b4bf
  Danny Auble authored Feb 05, 2013
  
  1e16b4bf
- BGQ - Changed order of library inclusions and fixed incorrect declaration to compile correctly on newer compilers · 530c3ff5
  Danny Auble authored Feb 04, 2013
```
Signed-off-by: Danny Auble <da@schedmd.com>
```
  530c3ff5
- Set default sacct format using SACCT_FORMAT environment variable · 872ce12d
  Don Lipari authored Feb 05, 2013
  
  872ce12d
- Merge branch 'slurm-2.4' into slurm-2.5 · 6253538c
  jette authored Feb 05, 2013
  
  6253538c
- Merge branch 'slurm-2.3' into slurm-2.4 · 29c2661a
  jette authored Feb 05, 2013
  
  29c2661a
- Fix race condition in job dependency logic · ff26cc50
  Morris Jette authored Feb 05, 2013
```
If a job involved in a dependency completes and is purged, the logic
used to test for circular dependencies can use the invalid pointer
and generate an invalid memory reference before the pointer is
cleared from the dependency list data structure.
```
  ff26cc50
- Fix some typos and expand CPU bind explanation in srun man page · 7578099f
  Morris Jette authored Feb 05, 2013
  
  7578099f
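
The race described in ff26cc50 comes from following a reference to a job that has already been purged. The Python sketch below is not Slurm's C code. It shows one way to make a circular-dependency test safe: dependencies are stored as job IDs and looked up in a table of live jobs, so a purged job is skipped instead of dereferenced. The function name and the `deps` mapping are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Set


def has_circular_dependency(job_id: int, deps: Dict[int, List[int]]) -> bool:
    """Return True if job_id can reach itself by following dependencies.

    deps maps each live job ID to the IDs it depends on (hypothetical layout).
    A purged job has no entry in deps, so it is skipped rather than followed.
    """
    stack = list(deps.get(job_id, []))
    seen: Set[int] = set()
    while stack:
        current = stack.pop()
        if current == job_id:
            return True  # walked back to the starting job: a cycle
        if current in seen or current not in deps:
            continue  # already visited, or purged and no longer live
        seen.add(current)
        stack.extend(deps[current])
    return False
```

Looking jobs up by ID trades a small lookup cost for never holding a stale reference. The commit itself only says that the invalid pointer could be used before it was cleared from the dependency list.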
04 Feb, 2013 5 commits
- Fix broken link, typo in name · 9ed43d86
  Morris Jette authored Feb 04, 2013
  
  9ed43d86
- task/affinity - improve logic for Power7 without hyperthreading · c7994bfb
  jette authored Feb 04, 2013
```
Without this change, allocations of less than a whole node can result
in an incorrect task binding for Power7 processors.
```
  c7994bfb
- Remove vestigial logging · 9c34cced
  Morris Jette authored Feb 04, 2013
  
  9c34cced
- Fix bug in CPU bind logic · 5065c774
  jette authored Feb 04, 2013
  
  5065c774
- task/affinity plugin - Fix bug in CPU masks for some processors · 5fb4f738
  Morris Jette authored Feb 04, 2013
  
  5fb4f738
01 Feb, 2013 7 commits
- Remove vestigial logging messages · 1b14e272
  Morris Jette authored Feb 01, 2013
  
  1b14e272
- Enable task affinity testing with task/cgroup · 7b03c12c
  jette authored Feb 01, 2013
```
Without this change, the testing only happens with task/affinity
```
  7b03c12c
- Change to reservation test for some systems · 0ed32fb9
  jette authored Feb 01, 2013
  
  0ed32fb9
- Gres/gpu plugin - If no GPUs requested, set CUDA_VISIBLE_DEVICES=NoDevFiles · 0e14909e
  Morris Jette authored Feb 01, 2013
```
This bug was introduced in 2.5.2 for the case where a GPU count was
configured, but without device files. Unfortunately it left the
CUDA_VISIBLE_DEVICES environment variable unset if GPUs were configured,
but the job did not request any of them.
```
  0e14909e
- Modify some tests for Bluegene systems · cfc7a1f6
  Nathan Yee authored Jan 31, 2013
  
  cfc7a1f6
- Modify test for Bluegene system · 66bb5da1
  Morris Jette authored Jan 31, 2013
  
  66bb5da1
- Disable a core reservation test except on select/cons_res · 0341a1a9
  Morris Jette authored Jan 31, 2013
  
  0341a1a9
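
The behavior described in 0e14909e can be sketched as a small rule: CUDA_VISIBLE_DEVICES is always defined, and it takes the value NoDevFiles when the job requested no GPUs. The Python below is an illustration under assumptions, not the gres/gpu plugin's code. The helper name and the comma-separated device ID format are hypothetical.

```python
from typing import Dict, Sequence

NO_DEVICE_FILES = "NoDevFiles"  # value named in the commit title


def gpu_environment(allocated_gpu_ids: Sequence[int]) -> Dict[str, str]:
    """Build the GPU-related environment for a job (hypothetical helper).

    The variable is set in both cases, so a job that requested no GPUs
    does not run with CUDA_VISIBLE_DEVICES unset.
    """
    if allocated_gpu_ids:
        # Assumed format: comma-separated device indices.
        value = ",".join(str(gpu_id) for gpu_id in allocated_gpu_ids)
    else:
        value = NO_DEVICE_FILES
    return {"CUDA_VISIBLE_DEVICES": value}
```

For example, `gpu_environment([])` yields `{"CUDA_VISIBLE_DEVICES": "NoDevFiles"}`, while `gpu_environment([0, 2])` yields `{"CUDA_VISIBLE_DEVICES": "0,2"}`.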
31 Jan, 2013 4 commits
- BLUEGENE - Fix in reservation logic that could cause abort. · c301e599
  Morris Jette authored Jan 31, 2013
  
  c301e599
- Update META for v2.5.2 tag · ac6909cc
  Morris Jette authored Jan 31, 2013
  
  ac6909cc
- Expanded BGQ allocation test, needs more work · 51425724
  Nathan Yee authored Jan 31, 2013
  
  51425724
- Switch/nrt - Dynamically load libnrt.so from within the plugin as needed. · 7c6c6fb4
  Morris Jette authored Jan 31, 2013
```
This eliminates the need for libnrt.so on the head node.
```
  7c6c6fb4
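
Commit 7c6c6fb4 defers loading libnrt.so until the plugin needs it, so hosts without the library can still run. The sketch below shows the same idea in Python with `ctypes`. It is an analogy rather than the plugin's C implementation, and the `LazyLibrary` class is hypothetical.

```python
import ctypes
from typing import Optional


class LazyLibrary:
    """Load a shared library on first use instead of at start-up."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attempted = False
        self._handle: Optional[ctypes.CDLL] = None

    def get(self) -> Optional[ctypes.CDLL]:
        """Return the library handle, or None if it is not installed."""
        if not self.attempted:
            self.attempted = True
            try:
                self._handle = ctypes.CDLL(self.name)
            except OSError:
                # Library missing on this host: callers must handle None.
                self._handle = None
        return self._handle
```

Code paths that never call `get()` never touch the library, which is why a host that does not use the switch functions would not need it installed.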
30 Jan, 2013 3 commits
- Note that slurmctld restart is needed to add or remove nodes · 799869bd
  Morris Jette authored Jan 30, 2013
  
  799869bd
- Update maximum line length in slurm configuration file documentation · 745d8f5c
  David Bigagli authored Jan 30, 2013
  
  745d8f5c
- Describe how to add nodes to slurm.conf in FAQ · 9bd2b996
  Morris Jette authored Jan 30, 2013
  
  9bd2b996
29 Jan, 2013 9 commits
- BLUEGENE - If we made a block that isn't runnable because of an overlapping block, destroy it correctly. · 73d46476
  Danny Auble authored Jan 29, 2013
  
  73d46476
- Avoid apparent kernel bug in 2.6.32 which apparently is solved in at least 3.5.0. · 43ffa23a
  Danny Auble authored Jan 29, 2013
```
This avoids a stack overflow when running jobs on more than 120k nodes.
```
  43ffa23a
- Flesh out perl callbacks for step launching · 82f97a86
  Danny Auble authored Jan 29, 2013
  
  82f97a86
- Fix typo · 453b5608
  Danny Auble authored Jan 29, 2013
  
  453b5608
- Add missing perl callbacks for step_ctx.c (task launch callbacks) · 0ffa571a
  Morris Jette authored Jan 29, 2013
```
The new callbacks are not fleshed out, but adding them eliminates a build error.
```
  0ffa571a
- Avoid invalid memory when GRES has no available plugin · 8a63416d
  David Bigagli authored Jan 29, 2013
  
  8a63416d
- Fix for write off end of allocated memory · e60849a5
  David Bigagli authored Jan 29, 2013
  
  e60849a5
- Change variable info to be io_info to avoid confusion with the info function. · 23bf863f
  Danny Auble authored Jan 28, 2013
  
  23bf863f
- Fix typo in log message · 3164de16
  Morris Jette authored Jan 28, 2013
  
  3164de16
28 Jan, 2013 1 commit
- Fix typo in squeue man page · 282b964c
  David Bigagli authored Jan 28, 2013
  
  282b964c
26 Jan, 2013 1 commit
- Reset errno to 0 if able to communicate with a socket · eedbceda
  Danny Auble authored Jan 26, 2013
  
  eedbceda
23 Jan, 2013 2 commits

- In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046
  jette authored Jan 23, 2013

I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be
allocated to jobs. This can be reproduced as follows:

First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set to
allocate nodes exclusively by default). Then set one of the nodes
allocated to the job (cn2) to state DOWN:

srun: error: Node failure on cn2
srun: error: Node failure on cn2
srun: error: cn2: task 0: Killed
^Csrun: interrupt (one more within 1 sec to abort)
srun: task 1: running
srun: task 0: exited abnormally
^Csrun: sending Ctrl-C to job 22605.0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Force Terminated job step 22605.0

Then change the state of the node back to IDLE. It still cannot be
allocated to jobs:

srun: job 22606 queued and waiting for resources

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  22606      work hostname     root  PD       0:00      1 (Resources)
  22604      work   sbatch     root   R       3:06      1 cn1

NodeName=cn2 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
   Gres=(null)
   NodeAddr=cn2 NodeHostName=cn2
   OS=Linux RealMemory=30000 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

I traced and located the problem in select/cons_res. The call sequence
is:

slurmctld/node_mgr.c: update_node() =>
slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
excise_node_from_job() =>
plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
_rm_job_from_one_node() => _build_row_bitmaps() =>
common/job_resources: remove_job_from_cores()

If there are other jobs running in the partition, the partition row
bitmap will not be set correctly. In the example above, before
_build_row_bitmaps(), the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47

Cores of cn2 are not marked as available. Instead, cores of other nodes
are released. When another job requires the node cn2, the following log
message appears:

[2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy

I do not understand the design of select/cons_res well and I do not know
how to fix this. However, it seems that _build_row_bitmaps() should not be
called, since the job is not removed entirely; only one of its nodes is
released.

eb3c1046

- Correction to comment in spank.h · 8e0ee95a
  Morris Jette authored Jan 22, 2013

8e0ee95a
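
The report in eb3c1046 turns on which bits of the partition row bitmap should be cleared when a single node is removed from a job. The Python sketch below models that expectation and is not the select/cons_res code. It assumes 16 cores per node (CPUTot=16 in the report) and contiguous core numbering in which cn2 owns bits 32-47. The numbering is inferred from the logged bitmaps, not stated in the report. The function names are hypothetical.

```python
from typing import Iterable, Set


def release_node(row_bitmap: Set[int], node_index: int,
                 cores_per_node: int = 16) -> Set[int]:
    """Clear only the cores that belong to the removed node."""
    first = node_index * cores_per_node
    node_cores = set(range(first, first + cores_per_node))
    return row_bitmap - node_cores


def format_bitmap(bits: Iterable[int]) -> str:
    """Render set bits as ranges, in the style of the log lines above."""
    ordered = sorted(bits)
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current range while the next bit is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


before = {16} | set(range(32, 64))            # logged as "16,32-63"
after = release_node(before, node_index=2)    # cn2 assumed to own 32-47
print(format_bitmap(after))                   # prints "16,48-63"
```

Under these assumptions the bitmap after releasing cn2 would read 16,48-63, whereas the log shows 16,32-47: the cores of the other node were released and those of cn2 stayed marked as in use.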

22 Jan, 2013 1 commit
- Minor fix to doc · cf28bf32
  Danny Auble authored Jan 22, 2013
  
  cf28bf32