1. 08 Jan, 2013 7 commits
    • Remove redundant malloc failure tests · 16d7318d
      Nathan Yee authored
    • Merge branch 'slurm-2.5' · e5c8de12
      Morris Jette authored
    • Report node state as MAINT only if no jobs allocated · 2af5ce33
      Rod Schultz authored
      One of our testers has observed that when a long-running job continues to run after a maintenance reservation comes into effect, sinfo reports the node as being in the ALLOCATED state while scontrol shows it in the MAINT state.
      
      This can happen when a node is not completely allocated (select/cons_res, a partition that is not Shared=EXCLUSIVE, jobs submitted without --exclusive, or jobs that are allocated only some of the CPUs on a node).
      
      Execution paths leading up to calls to node_state_string (slurm_protocol_defs.c) or node_state_string_compact in scontrol test for allocated_cpus less than total_cpus on the node and set the node state to MIXED rather than ALLOCATED, while similar paths in sinfo do not.
      
      I think this is probably a bug: the MIXED state is defined, and it is desirable that both commands return the same result.
      
      The problem can be fixed with two logic changes (in multiple places):
      
      1) node_state_string and node_state_string_compact have to check for
      MIXED as well as ALLOCATED before returning the MAINT state. This means
      that the reported state for the node with the allocated job will be MIXED.
      
      2) sinfo must also check for allocated_cpus less than total_cpus and set
      the state to MIXED before calling either node_state_string or
      node_state_string_compact.
      
      The attached patch (against 2.5.1) makes these changes. The attached script is a test case.
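      A minimal sketch of change 1, with illustrative state encodings rather
      than SLURM's actual values from slurm_protocol_defs.h (base state in
      the low bits, MAINT as a flag bit):

        /* Sketch only: the state values here are hypothetical. */
        #include <stdint.h>

        enum {
            NODE_STATE_IDLE      = 0x2,
            NODE_STATE_ALLOCATED = 0x3,
            NODE_STATE_MIXED     = 0x4,   /* some, not all, CPUs allocated */
        };
        #define NODE_STATE_BASE  0x000f
        #define NODE_STATE_MAINT 0x0100   /* hypothetical flag bit */

        const char *node_state_string(uint32_t state)
        {
            uint32_t base = state & NODE_STATE_BASE;

            if (state & NODE_STATE_MAINT) {
                /* Change 1: check MIXED as well as ALLOCATED, so a node
                 * running a job during a maintenance reservation reports
                 * its usage state rather than MAINT. */
                if ((base != NODE_STATE_ALLOCATED) &&
                    (base != NODE_STATE_MIXED))
                    return "MAINT";
            }
            switch (base) {
            case NODE_STATE_MIXED:     return "MIXED";
            case NODE_STATE_ALLOCATED: return "ALLOCATED";
            default:                   return "IDLE";
            }
        }

      Change 2 is the sinfo side: set the base state to MIXED whenever
      allocated_cpus is less than total_cpus before calling this function,
      so both commands then report MIXED.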
    • Added support for job arrays. · 2993b423
      Morris Jette authored
      Phase 1 of this effort. See the -a/--array option in "man sbatch" for
      details. Creates job records using sbatch. Reports job arrays using
      scontrol or squeue. More work coming soon...
    • Get rid of errors when using 64-bit bitmaps · 18c9ecd7
      Danny Auble authored
      Nothing sets USE_64BIT_BITSTR today, so bitmaps are always 32 bits. If
      one would like 64-bit bitmaps, just #define USE_64BIT_BITSTR in config.h.
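      A short sketch of how such a compile-time switch can look; the real
      typedefs live in SLURM's bitstring header and may differ in detail:

        #include <stdint.h>

        #ifdef USE_64BIT_BITSTR        /* define in config.h to opt in */
        typedef uint64_t bitstr_t;     /* 64-bit bitmap words */
        #else
        typedef uint32_t bitstr_t;     /* default: 32-bit bitmap words */
        #endif

        /* Bits per word follows the chosen type automatically. */
        #define BITSTR_BITS ((int)(sizeof(bitstr_t) * 8))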
    • Convert hostlist functions on a multi-dimensional system to use a bitmap · eb7500c9
      Danny Auble authored
      Use a bitmap instead of a large array. This appears to speed up the
      process considerably: previously we were seeing times of over 6000 usecs
      just to memset the array for a 5D system. With this patch the whole
      process takes around 1000 usecs on average, with many runs well under
      that.
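      A hypothetical sketch of the idea: flatten each multi-dimensional
      coordinate to one bit index so membership lives in a single bitmap
      instead of a large array that must be cleared on every call. The
      geometry and names are illustrative, not SLURM's hostlist code:

        #include <stdbool.h>
        #include <stdint.h>

        #define DIM 5
        static const int dim_size[DIM] = {4, 4, 4, 4, 4};  /* 4^5 coords */

        /* Row-major flattening of a coordinate tuple to a bit offset. */
        static int coord_to_bit(const int coord[DIM])
        {
            int bit = 0;
            for (int i = 0; i < DIM; i++)
                bit = bit * dim_size[i] + coord[i];
            return bit;
        }

        static void set_coord(uint32_t bitmap[], const int coord[DIM])
        {
            int bit = coord_to_bit(coord);
            bitmap[bit / 32] |= 1u << (bit % 32);
        }

        static bool test_coord(const uint32_t bitmap[], const int coord[DIM])
        {
            int bit = coord_to_bit(coord);
            return (bitmap[bit / 32] >> (bit % 32)) & 1u;
        }

      Each coordinate then costs one bit rather than at least one byte, so
      clearing and scanning the set touch far less memory.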
  2. 07 Jan, 2013 1 commit
  3. 04 Jan, 2013 5 commits
    • Use local no-mem functions · 3a6bd336
      jette authored
      Make sure out-of-memory conditions get logged properly for slurmctld in
      the foreground.
      
      Fix slurmd and slurmdbd to log out-of-memory to stdout in the foreground.
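      A hedged sketch of what a local no-mem handler can look like: on
      allocation failure it logs via write(2), which allocates nothing
      itself, and targets stdout when running in the foreground. The names
      and the foreground flag are assumptions, not SLURM's actual code:

        #include <stdlib.h>
        #include <unistd.h>

        static int foreground = 1;   /* hypothetical daemon-mode flag */

        static void *xmalloc_nomem(size_t size)
        {
            void *ptr = malloc(size);
            if (!ptr) {
                /* The regular logger may itself allocate, so write the
                 * message directly, then abort. */
                static const char msg[] = "fatal: malloc failed\n";
                if (write(foreground ? STDOUT_FILENO : STDERR_FILENO,
                          msg, sizeof(msg) - 1) < 0) {
                    /* best effort; nothing more we can do */
                }
                abort();
            }
            return ptr;
        }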
    • Use local no-mem functions · 5e1d0210
      jette authored
    • mpi/mvapich: Don't set MPIRUN_PROCESSES by default · fd5b0e56
      Mark A. Grondona authored
      The MPIRUN_PROCESSES variable set by the mpi/mvapich plugin is probably
      not needed for most, if not all, recent versions of mvapich.
      This environment variable also negatively affects job scalability
      since its length is proportional to the number of tasks in a job.
      In fact, for very large jobs, the increased environment size can
      lead to failures in execve(2).
      
      Since MPIRUN_PROCESSES *might* be required in some older versions of
      mvapich, this patch disables the setting of that variable unless
      SLURM_NEED_MVAPICH_MPIRUN_PROCESSES is set in the job's environment.
      (Thus, MPIRUN_PROCESSES is disabled by default, but the old behavior
      may be restored by setting the environment variable above.)
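      A sketch of the opt-in logic described above; only the two environment
      variable names come from this commit, while the helper name and the
      colon-separated value format are assumptions for illustration:

        #include <stdlib.h>
        #include <string.h>

        static void maybe_set_mpirun_processes(const char *host, int ntasks)
        {
            /* New default: skip the O(ntasks)-sized variable entirely. */
            if (!getenv("SLURM_NEED_MVAPICH_MPIRUN_PROCESSES"))
                return;

            /* Old behavior, restored on request: one entry per task, so
             * the value grows with job size (the execve(2) risk above). */
            size_t len = (size_t)ntasks * (strlen(host) + 1) + 1;
            char *val = malloc(len);
            if (!val)
                return;
            val[0] = '\0';
            for (int i = 0; i < ntasks; i++) {
                strcat(val, host);
                if (i + 1 < ntasks)
                    strcat(val, ":");
            }
            setenv("MPIRUN_PROCESSES", val, 1);
            free(val);
        }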
    • b196f153
      jette authored
    • Fix logic in hostset_create for invalid input · 33cb1e40
      jette authored
  4. 03 Jan, 2013 16 commits
  5. 02 Jan, 2013 1 commit
    • Revert commit b2c18ec1 · ac27d503
      Morris Jette authored
      The original patch works fine to avoid cancelling a job when all of its
      nodes go unresponsive, but I don't see any way to easily address nodes
      coming back into service. We want to cancel jobs that have some nodes
      up and some nodes down, but the nodes will come back into service
      individually rather than all at once.
  6. 31 Dec, 2012 1 commit
  7. 29 Dec, 2012 3 commits
  8. 28 Dec, 2012 6 commits