Commits · 4b1f66088cf2fb05cfef4e035a76fac1b98d4210 · Manuel G. Marciani / ces_slurm_simulator

18 Oct, 2012 8 commits
- BGQ - Add logic to make it so blocks can't use a midplane with a nodeboard · 4b1f6608
  Danny Auble authored Oct 18, 2012
```
in error for passthrough.
```
- Fixed InactiveLimit math to work correctly · 13a8882a
  Danny Auble authored Oct 17, 2012
- BGQ - Fixed InactiveLimit to work correctly to avoid scenarios where a · 65fef1ff
  Danny Auble authored Oct 17, 2012
```
user's pending allocation was started with srun and then for some reason
the slurmctld was brought down and while it was down the srun was removed.
```
- BGQ - fix thread id to handle before realtime thread correctly · 4bebf027
  Danny Auble authored Oct 17, 2012
```
previously it overwrote the poll_thread id
```
- BGQ - increase size of variable for larger systems · 80e58be9
  Danny Auble authored Oct 17, 2012
- remove unused variable. · a0b7be77
  Danny Auble authored Oct 17, 2012
- BGQ - Add functionality to make it so we track the actions on a block. · baf267e0
  Danny Auble authored Oct 17, 2012
```
This is needed for when a free request is added to a block but there are
jobs finishing up so we don't start new jobs on the block since they will
fail on start.
```
- BGQ - add extra debugging to the runjob_mux plugin to print a job · 3338207e
  Danny Auble authored Oct 17, 2012
17 Oct, 2012 1 commit
- BlueGene - don't change pending job's node count when changing partition. · 2b59f495
  Morris Jette authored Oct 17, 2012
```
Previously the node count would change from c-node count to midplane count
(but still be interpreted as a c-node count).
```
16 Oct, 2012 1 commit
- Fix for older < glibc 2.4 systems to use euidaccess instead of eaccess. · d9e28215
  Danny Auble authored Oct 15, 2012
04 Oct, 2012 2 commits
- PerlAPI - handle null node names correctly (as sview and others do). · 4cad1159
  Danny Auble authored Oct 04, 2012
- BGQ - fix ordering of locks to prevent deadlock - related to commit · eca7bfb4
  Danny Auble authored Oct 04, 2012
```
4b1aed73
```
03 Oct, 2012 2 commits
- Add Slurm User Group Meeting 2012 info to publication web page · e65638f0
  Morris Jette authored Oct 03, 2012
- Change comment for better clarity · fee9ae4a
  Morris Jette authored Oct 03, 2012
02 Oct, 2012 4 commits

- Correct --mem-per-cpu logic for multiple threads per core · 6a103f2e
  Morris Jette authored Oct 02, 2012

See bugzilla bug 132

When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be
overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD
(2.5.0-pre3).

Conditions:
-----------
* SelectType=select/cons_res
* SelectTypeParameters=CR_Core_Memory
* Using threads
  - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2
RealMemory=400"

Description:
------------
In the cons_res plugin, _verify_node_state() in job_test.c checks if a node has
sufficient memory for a job. However, the per-CPU memory limits appear to be
scaled by the number of threads. This new value may exceed the available memory
on the node. And, once a node is overcommitted on memory, future memory checks
in _verify_node_state() will always succeed.

Scenario to reproduce:
----------------------
With the example node linux0, we run a single-core job with 250MB/core
    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((real - alloc) >= job mem)
    ((400 - 0) >= 250) and the job starts

Then, the memory requirement is doubled:
    "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for
job X"
    "slurmd: scaling CPU count by factor of 2"

This job should not have started

While the first job is still running, we submit a second, identical job
    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit:
    ((400 - 500) >= 250), the unsigned int wraps, the test passes, and the job
starts

This second job also should not have started
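The wrap-around in the second check can be reproduced outside Slurm. The following is a minimal sketch in Python, not the plugin's code: it assumes the memory counters behave like 32-bit unsigned integers and emulates that arithmetic with a mask. The function names `fits_unsigned` and `fits_checked` are hypothetical, and the rearranged test is only one possible way to avoid the wrap; this log does not show how commit 6a103f2e actually corrects the logic.

```python
UINT32_MASK = 0xFFFFFFFF  # assumption: counters behave like 32-bit unsigned ints


def fits_unsigned(real_mem, alloc_mem, job_mem):
    """Emulate (real - alloc) >= job_mem in 32-bit unsigned arithmetic."""
    free_mem = (real_mem - alloc_mem) & UINT32_MASK  # wraps when alloc > real
    return free_mem >= job_mem


def fits_checked(real_mem, alloc_mem, job_mem):
    """Same test rearranged so that nothing is subtracted below zero."""
    return alloc_mem + job_mem <= real_mem


# Numbers from the scenario: RealMemory=400, --mem-per-cpu=250, scale factor 2.
print(fits_unsigned(400, 0, 250))      # True: first job passes the unscaled check
print(fits_unsigned(400, 500, 250))    # True: 400 - 500 wraps to 4294967196
print(fits_checked(400, 500, 250))     # False: second job is rejected
print(fits_checked(400, 0, 2 * 250))   # False: first job with scaled requirement
```

Checking the scaled requirement (500) rather than the per-CPU value (250) is what rejects the first job here; the rearranged comparison only stops an already overcommitted node from accepting more work.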


- Modify strigger so that a filter option of "--user=0" is supported · 7166976e
  Morris Jette authored Oct 02, 2012
- one more fix · fb0269f3
  Danny Auble authored Oct 01, 2012
- BGQ - make regression tests work correctly on real systems · 4585b4c0
  Danny Auble authored Oct 01, 2012
01 Oct, 2012 1 commit
- BGQ - Make it so bluegene test only runs on an emulated system. · 283c860b
  Danny Auble authored Oct 01, 2012
28 Sep, 2012 1 commit
- BGQ - Fixes to tests on a real BGQ system · 4488ae30
  Don Lipari authored Sep 28, 2012
27 Sep, 2012 5 commits
- BGQ - Logic added to make sure a job has finished on a block before it is · 0badb119
  Danny Auble authored Sep 27, 2012
```
purged from the system if its front-end node goes down.
```
- remove extra magic clear · dd3704ed
  Danny Auble authored Sep 27, 2012
- BGQ - If a job goes away while still trying to free it up in the · 064ee393
  Danny Auble authored Sep 27, 2012
```
database, and the job is running on a small block make sure we free up
the correct node count.
```
- update various documentation · 7d2aa36e
  Danny Auble authored Sep 26, 2012
- Fix for srun --test-only to work correctly with timelimits · 36e819e5
  Bill Brophy authored Sep 26, 2012
25 Sep, 2012 1 commit
- Fixed typo (backgroud -> background) · 527d7eb9
  Danny Auble authored Sep 25, 2012
24 Sep, 2012 1 commit
- Execute slurm_spank_job_epilog when there is no system Epilog configured. · c57ab123
  Morris Jette authored Sep 24, 2012
```
This addresses bug 130
```
21 Sep, 2012 1 commit
- BGQ - Fix issue when a cnode going to an error (not SoftwareError) state · 4b1aed73
  Danny Auble authored Sep 21, 2012
```
with a job running or trying to run on it.
```
20 Sep, 2012 1 commit
- BGQ - Fix if large block goes into error and the next highest priority jobs · 7e53c48f
  Danny Auble authored Sep 19, 2012
```
are planning on using the block.  Previously it would fail those jobs
erroneously.
```
19 Sep, 2012 1 commit
- BGQ - minor fix to make build work in emulated mode. · c9f14f80
  Danny Auble authored Sep 19, 2012
18 Sep, 2012 7 commits
- Updates for v2.4.3 tag · 1410b960
  Morris Jette authored Sep 18, 2012
- Cosmetic changes, no changes in logic · 79ddd5dc
  Morris Jette authored Sep 18, 2012
- fix for default partition having a '*' after it · fb9bd16f
  Danny Auble authored Sep 18, 2012
- up memory constraint (I think something happened with glibc or the kernel) · a772fc04
  Danny Auble authored Sep 18, 2012
```
no big deal, but now we get just a bit more memory allocated (4152 over
instead of the 4100 we were previously looking for)
```
- Added all available limits to the output of sacctmgr list qos · 2c500639
  Danny Auble authored Sep 18, 2012
- Correct path in Cray web page · e6238a4f
  Morris Jette authored Sep 18, 2012
- Minor updates to SLURM/Cray guide, mostly typos · 773b7121
  Morris Jette authored Sep 17, 2012
17 Sep, 2012 3 commits
- Fix sacct to work with QOS' that have previously been deleted. · 48bf06d8
  Danny Auble authored Sep 17, 2012
- Remove unneeded variables. · 6ddd55bf
  Danny Auble authored Sep 17, 2012
- DBD - if list is filling up remove step_complete messages as well since · a2fc224d
  Danny Auble authored Sep 17, 2012
```
they are a no-op when they get to the DBD and we would favor having
job completion messages instead of step completion messages if possible.
```