- 19 Aug, 2011 1 commit
-
-
Morris Jette authored
One of our testers created an illegal topology.conf file. He has a config you probably wouldn't see in production, but can see in testing when you are sometimes given a collection of miscellaneous resources.

              |-- nodes
    switch1 --|
              |-- switch2 -- nodes

He tried the topology.conf file below. Switch s1 is defined twice. Slurm accepted this config, but wouldn't allocate nodes from both switches to one job.

    SwitchName=s1 Nodes=xna[14-26]
    SwitchName=s2 Nodes=xna[41-43]
    SwitchName=s1 Switches=s2

I believe slurm shouldn't allow the second definition of switch s1. The attached patch checks for duplicate switch names. Patch from Rod Schultz, Bull.
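The duplicate-name check described in this commit can be illustrated with a short sketch. This is not the patch itself (the real check lives in SLURM's C sources); the function name `find_duplicate_switches` is hypothetical, and comparing switch names as exact strings is an assumption.

```python
import re


def find_duplicate_switches(conf_text):
    """Return switch names defined more than once, in order of first repeat.

    Hypothetical helper for illustration only; names are compared as exact
    strings, which is an assumption about how SLURM treats them.
    """
    seen = set()
    duplicates = []
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        match = re.match(r"SwitchName=(\S+)", line, re.IGNORECASE)
        if not match:
            continue
        name = match.group(1)
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


# The configuration from the commit message: s1 is defined twice.
TOPOLOGY = """\
SwitchName=s1 Nodes=xna[14-26]
SwitchName=s2 Nodes=xna[41-43]
SwitchName=s1 Switches=s2
"""

print(find_duplicate_switches(TOPOLOGY))  # ['s1']
```

A checker like this would reject the file above, since a single definition of s1 would have to carry both its Nodes and its Switches.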
-
- 17 Aug, 2011 1 commit
-
-
Danny Auble authored
This reverts commit 350ef5dc.
-
- 16 Aug, 2011 1 commit
-
-
Danny Auble authored
-
- 12 Aug, 2011 2 commits
-
-
Danny Auble authored
next parallel step is run on a sub block, SLURM won't oversubscribe cnodes.
-
Danny Auble authored
-
- 11 Aug, 2011 2 commits
-
-
Danny Auble authored
-
Morris Jette authored
BLUEGENE - Modify "scontrol show step" to show I/O nodes (BGL and BGP) or c-nodes (BGQ) allocated to each step. Change field name from "Nodes=" to "BP_List=".
-
- 10 Aug, 2011 3 commits
-
-
Danny Auble authored
cannot fit into the available shape.
-
Morris Jette authored
Previous code would fail when trying to launch more than 4096 tasks, which is a problem on BGQ systems where SLURM actually launches job steps.
-
Danny Auble authored
or not.
-
- 09 Aug, 2011 3 commits
-
-
Morris Jette authored
This change applies only to Cray systems and only when using the srun wrapper for aprun. Map --exclusive to -F exclusive and --share to -F share. Note this does not consider the partition's Shared configuration, so it is an imperfect mapping of options.
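The option mapping can be pictured with a minimal sketch. It covers only the two flags named in this commit; the function name `map_sharing_options` is hypothetical, and passing every other argument through unchanged is a simplification made for illustration, not a description of how the real wrapper behaves.

```python
def map_sharing_options(srun_args):
    """Translate srun sharing flags into the aprun -F form.

    Illustrative only: every other argument is passed through untouched,
    which is an assumption and not how the real wrapper treats them.
    """
    mapping = {
        "--exclusive": ["-F", "exclusive"],
        "--share": ["-F", "share"],
    }
    aprun_args = []
    for arg in srun_args:
        aprun_args.extend(mapping.get(arg, [arg]))
    return aprun_args


print(map_sharing_options(["--exclusive", "a.out"]))  # ['-F', 'exclusive', 'a.out']
```

Because the mapping looks only at the command line, it cannot account for the partition's Shared setting, which is the imperfection the commit message notes.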
-
Morris Jette authored
A node DOWN to ALPS will be marked DOWN to SLURM only after reaching SlurmdTimeout. In the interim, the node state will be NO_RESPOND. This change makes SLURM handling of the node DOWN state more consistent with ALPS. This change affects only Cray systems.
-
Morris Jette authored
Fix the node state accounting to be consistent with the node state set by ALPS.
-
- 05 Aug, 2011 2 commits
-
-
Danny Auble authored
be the same.
-
Danny Auble authored
previously marked down by alps.
-
- 04 Aug, 2011 2 commits
-
-
Morris Jette authored
Require SchedulerTimeSlice configuration parameter to be at least 5 seconds to avoid thrashing slurmd daemon. Addresses Cray bug 774692.
-
Morris Jette authored
Change in GRES behavior for job steps: A job step's default generic resource allocation will be set to that of the job. If a job step's --gres value is set to "none" then none of the generic resources which have been allocated to the job will be allocated to the job step. Add srun environment value of SLURM_STEP_GRES to set default --gres value for a job step.
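The new defaulting rules can be sketched as follows. This is an illustration under stated assumptions, not SLURM's implementation: the function name `resolve_step_gres` is hypothetical, the `name[:count]` parsing is simplified, and giving the command-line value precedence over SLURM_STEP_GRES is assumed from the usual srun convention.

```python
def resolve_step_gres(job_gres, step_option=None, environ=None):
    """Return the generic resources a job step would receive.

    job_gres    -- dict of resources allocated to the job, e.g. {"gpu": 4}
    step_option -- value of the step's --gres option, or None if not given
    environ     -- mapping consulted for SLURM_STEP_GRES

    Assumption: an explicit --gres value overrides the environment variable.
    """
    environ = environ or {}
    requested = step_option
    if requested is None:
        requested = environ.get("SLURM_STEP_GRES")
    if requested is None:
        return dict(job_gres)  # default: same allocation as the job
    if requested == "none":
        return {}  # step gets none of the job's generic resources
    step_gres = {}
    for item in requested.split(","):
        name, _, count = item.partition(":")
        step_gres[name.strip()] = int(count) if count else 1
    return step_gres


job = {"gpu": 4}
print(resolve_step_gres(job))                                    # {'gpu': 4}
print(resolve_step_gres(job, step_option="none"))                # {}
print(resolve_step_gres(job, environ={"SLURM_STEP_GRES": "gpu:2"}))  # {'gpu': 2}
```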
-
- 03 Aug, 2011 2 commits
-
-
Morris Jette authored
On Bluegene systems, smap's command-line mode would generate an invalid memory reference due to an uninitialized variable.
-
Danny Auble authored
a POLLERR the dbd_fail callback is called.
-
- 02 Aug, 2011 2 commits
-
-
Danny Auble authored
the DBD where both remained up but were disconnected the slurmctld would get registered again with the DBD.
-
Danny Auble authored
-
- 01 Aug, 2011 2 commits
-
-
Morris Jette authored
With sched/wiki or sched/wiki2 (Maui or Moab scheduler), ensure that a requeued job's priority is reset to zero.
-
Morris Jette authored
-
- 29 Jul, 2011 1 commit
-
-
Danny Auble authored
-
- 28 Jul, 2011 1 commit
-
-
Morris Jette authored
Add the ability for a user to limit the number of leaf switches in a job's allocation using the --switch option of salloc, sbatch and srun. There is also a new SchedulerParameters value of max_switch_wait, which a SLURM administrator can use to set a maximum job delay and prevent a user job from blocking lower priority jobs for too long. Based on work by Rod Schultz, Bull.
-
- 22 Jul, 2011 2 commits
-
-
Morris Jette authored
BlueGene: Permit users to specify a separate connection type for each dimension (e.g. "--conn-type=torus,mesh,torus").
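A per-dimension value such as the one in the example is a comma-separated list, one entry per dimension. The sketch below shows one way such a string could be split and validated. The function name `parse_conn_type`, the set of accepted names, and the case-insensitive matching are all assumptions made for illustration; how SLURM handles a single value or a short list is not covered here.

```python
# Assumed subset of connection type names; the real list may differ.
KNOWN_CONN_TYPES = {"mesh", "torus", "nav"}


def parse_conn_type(value):
    """Split a --conn-type value into one connection type per dimension.

    Hypothetical helper: matching is case-insensitive by assumption, and
    unknown or empty entries raise ValueError.
    """
    types = [part.strip().lower() for part in value.split(",")]
    for conn in types:
        if conn not in KNOWN_CONN_TYPES:
            raise ValueError(f"unknown connection type: {conn!r}")
    return types


print(parse_conn_type("torus,mesh,torus"))  # ['torus', 'mesh', 'torus']
```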
-
Morris Jette authored
On Cray systems with the srun2aprun wrapper, build an srun man page that describes which options are available with the wrapper.
-
- 21 Jul, 2011 1 commit
-
-
Morris Jette authored
Restore node configuration information (CPUs, memory, etc.) for powered down nodes when the slurmctld daemon restarts, rather than waiting for the node to be restored to service and getting the information from the node (NOTE: Only relevant if FastSchedule=0).
-
- 20 Jul, 2011 1 commit
-
-
Morris Jette authored
Fix bug in select/cons_res task distribution logic when tasks-per-node=0. Eliminates misleading slurmctld message "error: cons_res: _compute_c_b_task_dist oversubscribe." This problem was introduced in SLURM version 2.2.5 in order to fix a task distribution problem when cpus_per_task=0. Patch from Rod Schultz, Bull.
-
- 14 Jul, 2011 1 commit
-
-
Morris Jette authored
Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both interactive (salloc) and batch jobs if the job has a memory limit. For Cray systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the memory limit.
-
- 13 Jul, 2011 1 commit
-
-
Morris Jette authored
For front-end configurations (Cray and IBM BlueGene), bind each batch job to a unique CPU to limit the damage which a single job can cause. Previously any single job could use all CPUs causing problems for other jobs or system daemons. This addresses a problem reported by Steve Trofinoff, CSCS.
-
- 12 Jul, 2011 3 commits
-
-
Danny Auble authored
man pages. Patch by Nancy Kritkausky, Bull.
-
Danny Auble authored
Bill Brophy, Bull.
-
Morris Jette authored
Note the job and partition state file formats have changed and RPCs with information for jobs and partitions have changed.
-
- 06 Jul, 2011 2 commits
-
-
Morris Jette authored
Fix bug in generic resource tracking of gres associated with specific CPUs. Resources were being over-allocated.
-
Morris Jette authored
Fix memory buffering bug if an AllowGroups parameter of a partition has 100 or more users. Patch by Andriy Grytsenko (Massive Solutions Limited).
-
- 05 Jul, 2011 3 commits
-
-
Morris Jette authored
Add cgroup support for device files in both the task/cgroup plugin and generic resource (GRES) logic. Based upon patch from Yiannis Georgiou.
-
Morris Jette authored
When suspending a job, wait 2 seconds instead of 1 second between sending SIGTSTP and SIGSTOP. Some MPI implementations were not stopping within the 1 second delay.
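The two-signal sequence can be sketched as below. This is a standalone illustration on a POSIX system, not slurmd code: the function name `suspend_process` is hypothetical, and the injectable `kill` and `sleep` parameters exist only so the ordering can be checked without signalling a real process.

```python
import os
import signal
import time


def suspend_process(pid, delay=2.0, kill=os.kill, sleep=time.sleep):
    """Ask a process to stop politely, then force it to stop.

    SIGTSTP can be caught, giving the application a chance to quiesce;
    SIGSTOP cannot be caught and stops the process unconditionally.
    The 2 second default mirrors the delay described in the commit.
    """
    kill(pid, signal.SIGTSTP)
    sleep(delay)
    kill(pid, signal.SIGSTOP)


# Dry run with stand-ins for os.kill and time.sleep to show the ordering.
calls = []
suspend_process(
    1234,
    kill=lambda pid, sig: calls.append(("kill", pid, sig)),
    sleep=lambda seconds: calls.append(("sleep", seconds)),
)
print(calls)
```

The longer delay matters because the handler for SIGTSTP needs time to finish before the unconditional SIGSTOP arrives.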
-
Morris Jette authored
Add contribs/arrayrun tool providing support for job arrays. Contributed by Bjørn-Helge Mevik, University of Oslo. NOTE: Not currently packaged as RPM and manual file editing is required.
-
- 02 Jul, 2011 1 commit
-
-
Morris Jette authored
If a job needed to preempt other jobs to start and those jobs were not completed by the time of the next scheduling cycle, other jobs might be selected for preemption in that next cycle, resulting in more jobs being preempted than necessary.
-