1. 24 Aug, 2011 5 commits
  2. 23 Aug, 2011 1 commit
  3. 22 Aug, 2011 2 commits
  4. 19 Aug, 2011 1 commit
    • Morris Jette's avatar
      Treat duplicate switch name in topology.conf as fatal error · d2a30013
      Morris Jette authored
      One of our testers created an illegal topology.conf file.
      
      He has a config you probably wouldn't see in production, but can see in
      testing when you are sometimes given a collection of miscellaneous
      resources.
      
                |-- nodes
      switch1 --|
                |-- switch2 -- nodes
      
      He tried the topology.conf file below. Switch s1 is defined twice. Slurm
      accepted this config, but wouldn't allocate nodes from both switches to
      one job.
      
      SwitchName=s1 Nodes=xna[14-26]
      SwitchName=s2 Nodes=xna[41-43]
      SwitchName=s1 Switches=s2
      
      I believe slurm shouldn't allow the second definition of switch s1. The
      attached patch checks for duplicate switch names.
      Patch from Rod Schultz, Bull.
      d2a30013
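The duplicate check the patch describes can be sketched as follows. This is a minimal illustration, not Slurm's actual topology parser: the struct and the function name `find_duplicate_switch` are assumptions.

```c
#include <string.h>

/* Illustrative record type; Slurm's real topology parser differs. */
struct switch_record {
    const char *name;
};

/* Return the index of the first record whose name repeats an earlier
 * one, or -1 if every switch name is unique. A caller would treat a
 * non-negative result as a fatal topology.conf error. */
int find_duplicate_switch(const struct switch_record *recs, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (strcmp(recs[i].name, recs[j].name) == 0)
                return j;
    return -1;
}
```

With the config above, the second `SwitchName=s1` line would be flagged and the daemon would refuse to start rather than silently keeping one definition.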
  5. 17 Aug, 2011 1 commit
  6. 16 Aug, 2011 1 commit
  7. 12 Aug, 2011 2 commits
  8. 11 Aug, 2011 2 commits
  9. 10 Aug, 2011 3 commits
  10. 09 Aug, 2011 3 commits
    • Morris Jette's avatar
      Cray srun wrapper, map --share and --exclusive options · 08538cb8
      Morris Jette authored
This change applies only to Cray systems and only when the srun
wrapper for aprun is used. Map --exclusive to "-F exclusive" and
--share to "-F share". Note this does not consider the partition's
Shared configuration, so it is an imperfect mapping of options.
      08538cb8
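The option mapping described above can be sketched as a small helper; the function name is assumed for illustration and is not the wrapper's actual code.

```c
#include <stddef.h>

/* Map srun's --exclusive / --share flags to the aprun -F argument.
 * Returns NULL when neither flag is set, in which case the wrapper
 * would add no -F option at all. */
const char *map_share_option(int exclusive, int share)
{
    if (exclusive)
        return "-F exclusive";
    if (share)
        return "-F share";
    return NULL;
}
```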
    • Morris Jette's avatar
      Cray DOWN node will be treated as transient condition · 493aa97a
      Morris Jette authored
A node DOWN to ALPS will be marked DOWN to SLURM only after reaching
SlurmdTimeout. In the interim, the node state will be NO_RESPOND. This
change makes SLURM's handling of the node DOWN state more consistent
with ALPS. This change affects only Cray systems.
      493aa97a
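The timeout behavior above amounts to a simple state decision; a sketch under assumed names (the enum and function are illustrative, not Slurm's internal API):

```c
#include <time.h>

typedef enum { NODE_NO_RESPOND, NODE_DOWN } node_state_t;

/* Decide the Slurm-visible state of a node that ALPS reports DOWN:
 * treat the outage as transient (NO_RESPOND) until SlurmdTimeout
 * seconds have elapsed, then mark the node DOWN. */
node_state_t alps_down_state(time_t down_since, time_t now,
                             time_t slurmd_timeout)
{
    if (now - down_since >= slurmd_timeout)
        return NODE_DOWN;
    return NODE_NO_RESPOND;
}
```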
    • Morris Jette's avatar
Fix node state acctg for Cray. · acfa9aca
      Morris Jette authored
      Fix the node state accounting to be consistent with the node state
      set by ALPS.
      acfa9aca
  11. 05 Aug, 2011 2 commits
  12. 04 Aug, 2011 2 commits
    • Morris Jette's avatar
      Require SchedulerTimeSlice be at least 5 secs · c9b0eafe
      Morris Jette authored
      Require SchedulerTimeSlice configuration parameter to be at least 5 seconds
      to avoid thrashing slurmd daemon.
      Addresses Cray bug 774692
      c9b0eafe
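The new floor can be sketched as a validation step at configuration-read time; the function name and message text here are assumptions, not Slurm's actual code.

```c
#include <stdio.h>

#define SCHED_TIMESLICE_MIN 5  /* seconds; the floor this commit adds */

/* Return 1 if the configured SchedulerTimeSlice is acceptable,
 * 0 if it is below the minimum and should be rejected. */
int validate_timeslice(int secs)
{
    if (secs < SCHED_TIMESLICE_MIN) {
        fprintf(stderr,
                "SchedulerTimeSlice=%d is below the %d second minimum\n",
                secs, SCHED_TIMESLICE_MIN);
        return 0;
    }
    return 1;
}
```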
    • Morris Jette's avatar
      Job step now gets all of job's GRES by default · 1078426e
      Morris Jette authored
Change in GRES behavior for job steps: a job step's default generic
resource allocation will be set to that of the job. If a job step's --gres
value is set to "none" then none of the generic resources which have been
allocated to the job will be allocated to the job step.
Add srun environment variable SLURM_STEP_GRES to set the default --gres
value for a job step.
      1078426e
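The precedence described above (inherit by default, empty on "none", otherwise the step's own value) can be sketched as below; `resolve_step_gres` is a hypothetical name for illustration.

```c
#include <string.h>
#include <stddef.h>

/* Resolve a step's generic-resource request: with no --gres (and no
 * SLURM_STEP_GRES) the step inherits the job's GRES; an explicit
 * "none" yields an empty allocation; otherwise the step's own value
 * is used. */
const char *resolve_step_gres(const char *step_gres, const char *job_gres)
{
    if (step_gres == NULL)
        return job_gres;
    if (strcmp(step_gres, "none") == 0)
        return "";
    return step_gres;
}
```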
  13. 03 Aug, 2011 2 commits
  14. 02 Aug, 2011 2 commits
  15. 01 Aug, 2011 2 commits
  16. 29 Jul, 2011 1 commit
  17. 28 Jul, 2011 1 commit
    • Morris Jette's avatar
      Add ability to limit job's leaf switch count · 08e9f248
      Morris Jette authored
Add the ability for a user to limit the number of leaf switches in a job's
allocation using the --switches option of salloc, sbatch and srun. There is
also a new SchedulerParameters value of max_switch_wait, which a SLURM
administrator can use to set a maximum job delay and prevent a user job
from blocking lower priority jobs for too long. Based on work by Rod
Schultz, Bull.
      08e9f248
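The administrative cap amounts to clamping the user's requested wait; a minimal sketch with an assumed function name:

```c
/* Clamp the user's requested maximum wait for the desired leaf-switch
 * count to the administrator's max_switch_wait limit, so one job
 * cannot block lower-priority work indefinitely. */
int effective_switch_wait(int requested_secs, int max_switch_wait_secs)
{
    if (requested_secs > max_switch_wait_secs)
        return max_switch_wait_secs;
    return requested_secs;
}
```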
  18. 22 Jul, 2011 2 commits
  19. 21 Jul, 2011 1 commit
    • Morris Jette's avatar
      Restore node configuration information on slurmctld restart · f729d72b
      Morris Jette authored
Restore node configuration information (CPUs, memory, etc.) for powered
down nodes when the slurmctld daemon restarts, rather than waiting for the
node to be restored to service and getting the information from the node.
(NOTE: Only relevant if FastSchedule=0.)
      f729d72b
  20. 20 Jul, 2011 1 commit
    • Morris Jette's avatar
      Fix select/cons_res task distribution bug · b70cc235
      Morris Jette authored
      Fix bug in select/cons_res task distribution logic when tasks-per-node=0.
      Eliminates misleading slurmctld message
      "error:  cons_res: _compute_c_b_task_dist oversubscribe."
      This problem was introduced in SLURM version 2.2.5 in order to fix
      a task distribution problem when cpus_per_task=0. Patch from Rod Schultz, Bull.
      b70cc235
  21. 14 Jul, 2011 1 commit
    • Morris Jette's avatar
Set environment variables with job memory limits · dbd292c7
      Morris Jette authored
      Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both
      interactive (salloc) and batch jobs if the job has a memory limit. For Cray
      systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the
      memory limit.
      dbd292c7
  22. 13 Jul, 2011 1 commit
    • Morris Jette's avatar
Limit batch jobs in front-end mode to a single CPU · 344daaa1
      Morris Jette authored
      For front-end configurations (Cray and IBM BlueGene), bind each batch job to
      a unique CPU to limit the damage which a single job can cause. Previously any
      single job could use all CPUs causing problems for other jobs or system
      daemons. This addresses a problem reported by Steve Trofinoff, CSCS.
      344daaa1
  23. 12 Jul, 2011 1 commit