Commits · 5cdcd7e09d3bbed82c3079a7fdfdb91417d41c6a · Manuel G. Marciani / ces_slurm_simulator

17 Aug, 2011 1 commit
- Revert "BLUEGENE - updated smap to compile correctly on real bluegene systems." · 5cdcd7e0
  Danny Auble authored Aug 16, 2011
```
This reverts commit 350ef5dc.
```
  5cdcd7e0
16 Aug, 2011 1 commit
- BLUEGENE - updated smap to compile correctly on real bluegene systems. · 350ef5dc
  Danny Auble authored Aug 15, 2011
  
  350ef5dc
12 Aug, 2011 2 commits
- BGQ - fix issue where if first job step is the entire block and then the · 3d293786
  Danny Auble authored Aug 12, 2011
```
next parallel step is run on a sub-block, SLURM won't
oversubscribe cnodes.
```
  3d293786
- Memory leak fixed for rolling up accounting with down clusters. · eb285254
  Danny Auble authored Aug 12, 2011
  
  eb285254
11 Aug, 2011 2 commits
- Code cleanup on step request to get the correct select_jobinfo. · 1a55b75d
  Danny Auble authored Aug 11, 2011
  
  1a55b75d
- BLUEGENE - Modify "scontrol show step" · ad985bba
  Morris Jette authored Aug 11, 2011
```
BLUEGENE - Modify "scontrol show step" to show I/O nodes (BGL and BGP) or
c-nodes (BGQ) allocated to each step. Change field name from "Nodes=" to
"BP_List=".
```
  ad985bba
10 Aug, 2011 3 commits
- BGQ - Improved c-node selection when asked for a sub-block job that · 09b44d54
  Danny Auble authored Aug 10, 2011
```
cannot fit into the available shape.
```
  09b44d54
- BLUEGENE - Fix job step scalability issue with large task count. · 86cea6ef
  Morris Jette authored Aug 10, 2011
```
Previous code would fail when trying to launch more than 4096 tasks,
which is a problem on BGQ systems where SLURM actually launches job
steps.
```
  86cea6ef
- BLUEGENE - Added notice in the print config to tell if you are emulated · 163966c7
  Danny Auble authored Aug 09, 2011
```
or not.
```
  163966c7
09 Aug, 2011 3 commits

Cray srun wrapper, map --share and --exclusive options · 08538cb8

Morris Jette authored Aug 09, 2011

This change applies only to Cray systems and only when using the srun
wrapper for aprun. Map --exclusive to -F exclusive and --share to
-F share. Note this does not consider the partition's Shared
configuration, so it is an imperfect mapping of options.

08538cb8
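
A minimal sketch of the option mapping described in 08538cb8, for illustration only. The helper name `map_sharing_options` and the lookup table are hypothetical and are not taken from the srun wrapper itself. Like the commit, the sketch ignores the partition's Shared configuration.

```python
# Hypothetical helper: translate srun sharing options to aprun arguments.
# Only the two mappings named in the commit message are covered.
SRUN_TO_APRUN = {
    "--exclusive": ["-F", "exclusive"],
    "--share": ["-F", "share"],
}


def map_sharing_options(srun_args):
    """Return the aprun arguments for any sharing options found in srun_args."""
    aprun_args = []
    for arg in srun_args:
        # Options without a known mapping are skipped in this sketch.
        aprun_args.extend(SRUN_TO_APRUN.get(arg, []))
    return aprun_args
```

For example, `map_sharing_options(["--exclusive", "-n", "4"])` yields the two aprun arguments `-F` and `exclusive`.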

Cray DOWN node will be treated as transient condition · 493aa97a

Morris Jette authored Aug 08, 2011

A node DOWN to ALPS will be marked DOWN to SLURM only after reaching
SlurmdTimeout. In the interim, the node state will be NO_RESPOND. This
change makes SLURM's handling of the node DOWN state more
consistent with ALPS. This change affects only Cray systems.

493aa97a
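
The transient handling described in 493aa97a could be sketched roughly as follows. This is a simplified model, not SLURM's code: the function name, the string state labels, and the assumption that non-DOWN states pass through unchanged are all hypothetical.

```python
def slurm_node_state(alps_state, seconds_unresponsive, slurmd_timeout):
    """Hypothetical model of the SLURM-side state of a Cray node.

    alps_state: state reported by ALPS (plain string in this sketch).
    seconds_unresponsive: how long ALPS has reported the node as DOWN.
    slurmd_timeout: the configured SlurmdTimeout, in seconds.
    """
    if alps_state != "DOWN":
        # Assumption: states other than DOWN are passed through unchanged.
        return alps_state
    if seconds_unresponsive >= slurmd_timeout:
        # Only after SlurmdTimeout is reached is the node marked DOWN.
        return "DOWN"
    # In the interim the node is treated as not responding.
    return "NO_RESPOND"
```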

Fix node state acctg for cray. · acfa9aca
Morris Jette authored Aug 08, 2011
```
Fix the node state accounting to be consistent with the node state
set by ALPS.
```
acfa9aca

05 Aug, 2011 2 commits
- CRAY - Fix to work with 4.0.* instead of just 4.0.0 which is supposed to · 0fc7e998
  Danny Auble authored Aug 05, 2011
```
be the same.
```
  0fc7e998
- Cray - fix to make nodes come back up in accounting if they were · 7e1609c8
  Danny Auble authored Aug 05, 2011
```
previously marked down by alps.
```
  7e1609c8
04 Aug, 2011 2 commits

Require SchedulerTimeSlice be at least 5 secs · c9b0eafe

Morris Jette authored Aug 04, 2011

Require SchedulerTimeSlice configuration parameter to be at least 5 seconds
to avoid thrashing the slurmd daemon.
Addresses Cray bug 774692

c9b0eafe

Job step now gets all of job's GRES by default · 1078426e

Morris Jette authored Aug 04, 2011

Change in GRES behavior for job steps: A job step's default generic
resource allocation will be set to that of the job. If a job step's --gres
value is set to "none" then none of the generic resources which have been
allocated to the job will be allocated to the job step.
Add srun environment value of SLURM_STEP_GRES to set default --gres value
for a job step.

1078426e
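
A rough sketch of the default resolution described in 1078426e. The function name and its parameters are hypothetical. Giving an explicit --gres value precedence over SLURM_STEP_GRES is an assumption based on the variable being described as a default, not something the commit states.

```python
import os


def step_gres(job_gres, step_option=None, environ=None):
    """Hypothetical: pick the generic resources a job step receives.

    job_gres: GRES string allocated to the job, e.g. "gpu:2".
    step_option: value of the step's --gres option, or None if not given.
    environ: mapping to read SLURM_STEP_GRES from (defaults to os.environ).
    """
    environ = os.environ if environ is None else environ
    # Assumption: an explicit --gres wins over the SLURM_STEP_GRES default.
    requested = step_option if step_option is not None else environ.get("SLURM_STEP_GRES")
    if requested is None:
        # Default behavior: the step gets all of the job's GRES.
        return job_gres
    if requested == "none":
        # "none" means the step gets none of the job's generic resources.
        return ""
    return requested
```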

03 Aug, 2011 2 commits
- Fix to smap command-line mode display · 88d152fa
  Morris Jette authored Aug 02, 2011
```
On Bluegene systems, smap's command-line mode would generate an invalid
memory reference due to an uninitialized variable.
```
  88d152fa
- Fixed issue where if the DBD connection from the ctld goes away because of · 375e2d38
  Danny Auble authored Aug 02, 2011
```
a POLLERR the dbd_fail callback is called.
```
  375e2d38
02 Aug, 2011 2 commits
- Fixed issue where if there was a network issue between the slurmctld and · eb1f2ed3
  Danny Auble authored Aug 02, 2011
```
the DBD where both remained up but were disconnected the slurmctld would
get registered again with the DBD.
```
  eb1f2ed3
- BLUEGENE - fix to run steps correctly in a BGL/P emulated system. · f2df2e7e
  Danny Auble authored Aug 02, 2011
  
  f2df2e7e
01 Aug, 2011 2 commits
- ensure moab/maui requeued job prio is set to zero · 6d8b2cac
  Morris Jette authored Aug 01, 2011
```
With sched/wiki or sched/wiki2 (Maui or Moab scheduler), ensure that a
requeued job's priority is reset to zero.
```
  6d8b2cac
- Start NEWS for slurm v2.3.0-rc2 · 0f8895f4
  Morris Jette authored Jul 28, 2011
  
  0f8895f4
29 Jul, 2011 1 commit
- updated news · cc36eb3a
  Danny Auble authored Jul 28, 2011
  
  cc36eb3a
28 Jul, 2011 1 commit

Add ability to limit job's leaf switch count · 08e9f248

Morris Jette authored Jul 28, 2011

Add the ability for a user to limit the number of leaf switches in a job's
allocation using the --switch option of salloc, sbatch and srun. There is
also a new SchedulerParameters value of max_switch_wait, which a SLURM
administrator can use to set a maximum job delay and prevent a user job
from blocking lower priority jobs for too long. Based on work by Rod
Schultz, Bull.

08e9f248

22 Jul, 2011 2 commits

Permit multiple conn-type parameters · f67f54f8

Morris Jette authored Jul 19, 2011

BlueGene: Permit users to specify a separate connection type for each
dimension (e.g. "--conn-type=torus,mesh,torus").

f67f54f8
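
As an illustration of the per-dimension syntax in f67f54f8, the value could be split as sketched below. The function name, the default of three dimensions (taken from the example in the commit message), and the rule that a single value applies to every dimension are assumptions. Connection type names are not validated here.

```python
def parse_conn_type(value, dimensions=3):
    """Hypothetical parser for a --conn-type value such as "torus,mesh,torus"."""
    parts = [part.strip().lower() for part in value.split(",")]
    if len(parts) == 1:
        # Assumption: one value is applied to every dimension.
        parts = parts * dimensions
    if len(parts) != dimensions:
        raise ValueError(
            "expected 1 or %d connection types, got %d" % (dimensions, len(parts))
        )
    return parts
```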

For Cray systems, build srun man page with proper options · b6a9470d

Morris Jette authored Jul 21, 2011

On Cray systems with the srun2aprun wrapper, build an srun man page
that describes which options are available with the wrapper.

b6a9470d

21 Jul, 2011 1 commit

Restore node configuration information on slurmctld restart · f729d72b

Morris Jette authored Jul 20, 2011

Restore node configuration information (CPUs, memory, etc.) for powered
down nodes when the slurmctld daemon restarts rather than waiting for the node
to be restored to service and getting the information from the node (NOTE:
Only relevant if FastSchedule=0).

f729d72b

20 Jul, 2011 1 commit

Fix select/cons_res task distribution bug · b70cc235

Morris Jette authored Jul 20, 2011

Fix bug in select/cons_res task distribution logic when tasks-per-node=0.
Eliminates misleading slurmctld message
"error: cons_res: _compute_c_b_task_dist oversubscribe."
This problem was introduced in SLURM version 2.2.5 by a fix for
a task distribution problem when cpus_per_task=0. Patch from Rod Schultz, Bull.

b70cc235

14 Jul, 2011 1 commit

Set environment variables with job memory limits · dbd292c7

Morris Jette authored Jul 14, 2011

Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both
interactive (salloc) and batch jobs if the job has a memory limit. For Cray
systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the
memory limit.

dbd292c7
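
A job script might read the variables from dbd292c7 along these lines. This is a sketch: the helper name is hypothetical, and the values are assumed to be plain integers in megabytes.

```python
import os


def job_memory_limit(environ=None):
    """Hypothetical: return (scope, limit) from the job's environment, or None.

    scope is "cpu" for SLURM_MEM_PER_CPU and "node" for SLURM_MEM_PER_NODE.
    """
    environ = os.environ if environ is None else environ
    for name, scope in (("SLURM_MEM_PER_CPU", "cpu"), ("SLURM_MEM_PER_NODE", "node")):
        value = environ.get(name)
        if value is not None:
            # Assumption: the value is an integer number of megabytes.
            return scope, int(value)
    # Neither variable is set, so the job has no memory limit.
    return None
```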

13 Jul, 2011 1 commit

limit batch jobs in front-end mode to a single CPU · 344daaa1

Morris Jette authored Jul 13, 2011

For front-end configurations (Cray and IBM BlueGene), bind each batch job to
a unique CPU to limit the damage which a single job can cause. Previously any
single job could use all CPUs causing problems for other jobs or system
daemons. This addresses a problem reported by Steve Trofinoff, CSCS.

344daaa1

12 Jul, 2011 3 commits
- Fixed documentation (html) for PriorityUsageResetPeriod to match that in the · 5e100b2e
  Danny Auble authored Jul 12, 2011
```
man pages. Patch by Nancy Kritkausky, Bull.
```
  5e100b2e
- Fixed issue where preempt mode is skipped when creating a QOS. Patch by · 56bd49a2
  Danny Auble authored Jul 12, 2011
```
Bill Brophy, Bull.
```
  56bd49a2
- Note change in state save files · 7b406233
  Morris Jette authored Jul 11, 2011
```
Note the job and partition state file formats have changed and RPCs
with information for jobs and partitions have changed.
```
  7b406233
06 Jul, 2011 2 commits

Fix for GRES with topology · 6a8ff8b0

Morris Jette authored Jul 06, 2011

Fix bug in generic resource tracking of gres associated with specific CPUs.
Resources were being over-allocated.

6a8ff8b0

Fix AllowGroups memory buffering bug · 5f60da0a

Morris Jette authored Jul 05, 2011

Fix memory buffering bug if an AllowGroups parameter of a partition has 100
or more users. Patch by Andriy Grytsenko (Massive Solutions Limited).

5f60da0a

05 Jul, 2011 3 commits

Add cgroup support for device files · ac469ca5

Morris Jette authored Jul 05, 2011

Add cgroup support for device files in both the task/cgroup plugin and generic
resource (GRES) logic. Based upon a patch by Yiannis Georgiou.

ac469ca5

Wait 2 secs between SIGTSTP and SIGSTOP · 4c0b9de8

Morris Jette authored Jul 05, 2011

When suspending a job, wait 2 seconds instead of 1 second between sending
SIGTSTP and SIGSTOP. Some MPI implementations were not stopping within the
1 second delay.

4c0b9de8
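
The signal sequence described in 4c0b9de8 can be sketched as below. The function name and the injectable `kill` and `sleep` parameters are hypothetical and exist only so the sketch can be exercised without signalling a real process. SLURM's own implementation is in C and is not reproduced here.

```python
import os
import signal
import time


def suspend_process(pid, delay=2.0, kill=os.kill, sleep=time.sleep):
    """Hypothetical: suspend a process politely first, then forcibly.

    SIGTSTP can be caught, giving the process a chance to stop cleanly.
    SIGSTOP cannot be caught and stops the process unconditionally.
    """
    kill(pid, signal.SIGTSTP)
    # Wait 2 seconds (previously 1) before forcing the stop.
    sleep(delay)
    kill(pid, signal.SIGSTOP)
```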

Add support for job arrays · 912cff2a

Morris Jette authored Jul 05, 2011

Add contribs/arrayrun tool providing support for job arrays. Contributed by
Bjørn-Helge Mevik, University of Oslo. NOTE: Not currently packaged as RPM
and manual file editing is required.

912cff2a

02 Jul, 2011 1 commit

Do not preempt more jobs than needed · 8a5d5cdf

Morris Jette authored Jul 01, 2011

If a job needed to preempt other jobs to start and those jobs were
not completed by the time of the next scheduling cycle, other jobs
might be selected for preemption in that next cycle resulting in
more jobs being preempted than necessary.

8a5d5cdf

01 Jul, 2011 1 commit

Correct job run time reported by smap for suspended jobs · b887709d

Morris Jette authored Jul 01, 2011

Previous logic reported the run time as the current time minus the job start time,
ignoring any suspended time.

b887709d
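
The correction in b887709d amounts to subtracting suspended time from the elapsed time. A minimal sketch, with a hypothetical function name and all times in seconds:

```python
def job_run_time(now, start_time, suspended_seconds):
    """Hypothetical: run time of a job, excluding time spent suspended."""
    # The previous logic was effectively `now - start_time`.
    elapsed = now - start_time
    # Clamp at zero so inconsistent inputs never give a negative run time.
    return max(0, elapsed - suspended_seconds)
```

With this sketch, a job that started 600 seconds ago and was suspended for 100 of them reports 500 seconds instead of 600.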