1. 19 Feb, 2013 1 commit
  2. 15 Feb, 2013 1 commit
  3. 13 Feb, 2013 2 commits
  4. 12 Feb, 2013 6 commits
  5. 11 Feb, 2013 2 commits
    • Various updates for new slurmctld/dynalloc plugin · 463d2388
      Morris Jette authored
      1. Removed the job_submit and job_modify functions from the plugin; they are not required for the "slurmctld" plugin type.
      2. Renamed the new parameter from "JobSubmitDynAllocPort" to "DynAllocPort" and renamed the variable accordingly (you need to change this in your slurm.conf file).
      3. Added logic so you can see the DynAllocPort value using "scontrol show config" or "sview".
      4. Made some minor formatting changes, mostly for lines that were too long.
      5. Added an #ifdef to the msg.h header file.
      6. Changed the #ifdef variables in the header files to start with "DYNALLOC_". Perhaps not needed, but it should be safer, especially with common names like "INFO_H" (see the sketch after this list).
      7. Re-wrote much of info.c; there was no need to get a copy of the node information and process the copy when we can work directly with the data structures.
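      For item 6, a minimal sketch of what such a renamed guard looks like;
      the file contents here are placeholders, not the actual plugin source:

      #ifndef DYNALLOC_INFO_H
      #define DYNALLOC_INFO_H

      /* declarations for the dynalloc info interface would go here */

      #endif /* DYNALLOC_INFO_H */

      A generic guard name like "INFO_H" can collide with an identically
      named guard in an unrelated header and silently hide declarations;
      the "DYNALLOC_" prefix avoids that.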
    • Added slurmctld/dynalloc plugin, JobSubmitDynAllocPort config parameter · 835b902b
      Jimmy Cao authored
      These provide support for MapReduce+
  6. 08 Feb, 2013 2 commits
  7. 07 Feb, 2013 2 commits
  8. 06 Feb, 2013 3 commits
  9. 05 Feb, 2013 6 commits
  10. 04 Feb, 2013 1 commit
  11. 01 Feb, 2013 2 commits
  12. 31 Jan, 2013 2 commits
  13. 30 Jan, 2013 2 commits
  14. 29 Jan, 2013 4 commits
  15. 25 Jan, 2013 1 commit
  16. 23 Jan, 2013 3 commits
    • ded211de
    • In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046
      jette authored
      I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be
      allocated to jobs. This can be reproduced as follows:
      
      First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE
      set to allocate nodes exclusively by default). Then set one of the
      nodes allocated to the job (cn2) to state DOWN:
      
      srun: error: Node failure on cn2
      srun: error: Node failure on cn2
      srun: error: cn2: task 0: Killed
      ^Csrun: interrupt (one more within 1 sec to abort)
      srun: task 1: running
      srun: task 0: exited abnormally
      ^Csrun: sending Ctrl-C to job 22605.0
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: Force Terminated job step 22605.0
      
      Then change the state of the node back to IDLE. But it still cannot
      be allocated to jobs:
      
      srun: job 22606 queued and waiting for resources
      
        JOBID PARTITION     NAME     USER  ST       TIME  NODES
      NODELIST(REASON)
        22606      work hostname     root  PD       0:00      1 (Resources)
        22604      work   sbatch     root   R       3:06      1 cn1
      
      NodeName=cn2 Arch=x86_64 CoresPerSocket=8
         CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
         Gres=(null)
         NodeAddr=cn2 NodeHostName=cn2
         OS=Linux RealMemory=30000 Sockets=2 Boards=1
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
         BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
         CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
      
      I traced and located the problem in select/cons_res. The call sequence
      is:
      
      slurmctld/node_mgr.c: update_node() =>
      slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
      excise_node_from_job() =>
      plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
      _rm_job_from_one_node() => _build_row_bitmaps() =>
      common/job_resources: remove_job_from_cores()
      
      If there are other jobs running in the partition, the partition row
      bitmap will not be set correctly. In the example above, before
      _build_row_bitmaps(), the output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63
      
      After setting the node DOWN, the output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47
      
      The cores of cn2 are not marked as available; instead, cores of other
      nodes are released. When another job requests node cn2, the following
      log message appears:
      
      [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy
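      
      The dumps suggest the rebuild compacts the surviving allocations
      toward lower-numbered nodes: the excised node's cores stay busy while
      the highest node's cores are freed. Below is a self-contained toy
      model of that mechanism in plain C. This is not SLURM code; the node
      layout, node names, and the stale-slot replay are assumptions made
      only to illustrate the symptom:

      #include <stdio.h>
      #include <stdint.h>

      #define CORES_PER_NODE 16

      /* Mark all cores of global node index "node" busy in the row bitmap. */
      static void add_node_cores(uint64_t *row, int node)
      {
              *row |= (uint64_t)0xFFFF << (node * CORES_PER_NODE);
      }

      /* Rebuild a row bitmap from a job's node list, the way
       * _build_row_bitmaps() rebuilds rows from the jobs assigned to them. */
      static void rebuild_row(uint64_t *row, const int *nodes, int nnodes)
      {
              *row = 0;
              for (int i = 0; i < nnodes; i++)
                      add_node_cores(row, nodes[i]);
      }

      int main(void)
      {
              uint64_t row;

              /* A job allocated on cn2..cn4 (global node indices 1..3). */
              int before[3] = { 1, 2, 3 };
              rebuild_row(&row, before, 3);
              printf("before: 0x%016llx\n", (unsigned long long)row);

              /* cn2 (index 1) is set DOWN and excised from the job.  A
               * correct rebuild would use node indices { 2, 3 }; replaying
               * a stale mapping instead keeps the first two slots of the
               * old list, { 1, 2 }, so cn2 stays busy and cn4 is freed. */
              int stale[2] = { 1, 2 };
              rebuild_row(&row, stale, 2);
              printf("after:  0x%016llx (cn2 still busy, cn4 wrongly freed)\n",
                     (unsigned long long)row);
              return 0;
      }

      Running it shows the top node's core range vanish from the row while
      the DOWN node's range stays set, matching the 16,32-63 -> 16,32-47
      change above.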
      
      I do not understand the design of select/cons_res well enough to know
      how to fix this. But it seems that _build_row_bitmaps() should not be
      called here, since the job is not removed entirely; only one of its
      nodes is released. (A sketch of that alternative follows.)
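      
      For what it's worth, a minimal sketch of that alternative in the same
      toy model as above; the helper name is hypothetical and this is not
      the actual patch:

      /* Hypothetical helper: on a partial resize, clear only the excised
       * node's core range (CORES_PER_NODE and uint64_t as in the toy model
       * above).  The row is not rebuilt, so allocations of other jobs and
       * of this job's remaining nodes are left exactly as they were. */
      static void rm_node_cores(uint64_t *row, int node)
      {
              *row &= ~((uint64_t)0xFFFF << (node * CORES_PER_NODE));
      }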