- 02 Aug, 2011 2 commits
-
-
Danny Auble authored
the DBD where both remained up but were disconnected the slurmctld would get registered again with the DBD.
-
Danny Auble authored
-
- 01 Aug, 2011 2 commits
-
-
Morris Jette authored
With sched/wiki or sched/wiki2 (Maui or Moab scheduler), ensure that a requeued job's priority is reset to zero.
-
Morris Jette authored
-
- 29 Jul, 2011 1 commit
-
-
Danny Auble authored
-
- 28 Jul, 2011 1 commit
-
-
Morris Jette authored
Add the ability for a user to limit the number of leaf switches in a job's allocation using the --switch option of salloc, sbatch and srun. There is also a new SchedulerParameters value of max_switch_wait, which a SLURM administrator can use to set a maximum job delay and prevent a user job from blocking lower priority jobs for too long. Based on work by Rod Schultz, Bull.
-
- 22 Jul, 2011 2 commits
-
-
Morris Jette authored
BlueGene: Permit users to specify a separate connection type for each dimension (e.g. "--conn-type=torus,mesh,torus").
-
Morris Jette authored
On Cray systems with the srun2aprun wrapper, build an srun man page that describes which options are available with the wrapper.
-
- 21 Jul, 2011 1 commit
-
-
Morris Jette authored
Restore node configuration information (CPUs, memory, etc.) for powered-down nodes when the slurmctld daemon restarts rather than waiting for the node to be restored to service and getting the information from the node (NOTE: Only relevant if FastSchedule=0).
-
- 20 Jul, 2011 1 commit
-
-
Morris Jette authored
Fix bug in select/cons_res task distribution logic when tasks-per-node=0. Eliminates misleading slurmctld message "error: cons_res: _compute_c_b_task_dist oversubscribe." This problem was introduced in SLURM version 2.2.5 in order to fix a task distribution problem when cpus_per_task=0. Patch from Rod Schultz, Bull.
-
- 14 Jul, 2011 1 commit
-
-
Morris Jette authored
Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both interactive (salloc) and batch jobs if the job has a memory limit. For Cray systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the memory limit.
-
- 13 Jul, 2011 1 commit
-
-
Morris Jette authored
For front-end configurations (Cray and IBM BlueGene), bind each batch job to a unique CPU to limit the damage which a single job can cause. Previously any single job could use all CPUs causing problems for other jobs or system daemons. This addresses a problem reported by Steve Trofinoff, CSCS.
-
- 12 Jul, 2011 3 commits
-
-
Danny Auble authored
man pages. Patch by Nancy Kritkausky, Bull.
-
Danny Auble authored
Bill Brophy, Bull.
-
Morris Jette authored
Note the job and partition state file formats have changed and RPCs with information for jobs and partitions have changed.
-
- 06 Jul, 2011 2 commits
-
-
Morris Jette authored
Fix bug in generic resource tracking of gres associated with specific CPUs. Resources were being over-allocated.
-
Morris Jette authored
Fix memory buffering bug if an AllowGroups parameter of a partition has 100 or more users. Patch by Andriy Grytsenko (Massive Solutions Limited).
-
- 05 Jul, 2011 3 commits
-
-
Morris Jette authored
Add cgroup support for device files in both the task/cgroup plugin and generic resource (GRES) logic. Based upon patch by Yiannis Georgiou.
-
Morris Jette authored
When suspending a job, wait 2 seconds instead of 1 second between sending SIGTSTP and SIGSTOP. Some MPI implementations were not stopping within the 1 second delay.
-
Morris Jette authored
Add contribs/arrayrun tool providing support for job arrays. Contributed by Bjørn-Helge Mevik, University of Oslo. NOTE: Not currently packaged as RPM and manual file editing is required.
-
- 02 Jul, 2011 1 commit
-
-
Morris Jette authored
If a job needed to preempt other jobs to start and those jobs were not completed by the time of the next scheduling cycle, other jobs might be selected for preemption in that next cycle resulting in more jobs being preempted than necessary.
-
- 01 Jul, 2011 1 commit
-
-
Morris Jette authored
Previous logic reported the run time as the current time minus the job start time, ignoring any suspended time.
-
- 30 Jun, 2011 1 commit
-
-
Morris Jette authored
Enhancements to sched/backfill performance with select/cons_res plugin. Major improvements would be seen with large job counts. Based upon bf_build_row_bitmaps_2.2.6.patch patch from Bjørn-Helge Mevik, University of Oslo.
-
- 28 Jun, 2011 1 commit
-
-
Danny Auble authored
association limit.
-
- 27 Jun, 2011 1 commit
-
-
Morris Jette authored
Add default and maximum memory limits on a per-partition basis. If not specified, the system-wide memory limits will apply.
-
- 24 Jun, 2011 3 commits
-
-
Morris Jette authored
Add select_jobinfo to the task launch RPC so that all nodes have access to the information and not just the head node. Based upon patch by Andriy Grytsenko (Massive Solutions Limited).
-
Morris Jette authored
Fix possible invalid memory reference in sched/backfill. Patch by Andriy Grytsenko (Massive Solutions Limited).
-
Morris Jette authored
Add flag to the select APIs for job suspend/resume indicating if the action is for gang scheduling or an explicit job suspend/resume by the user. Only an explicit job suspend/resume will reset the job's priority and make resources exclusively held by the job available to other jobs. This change is also needed for Cray systems with ALPS.
-
- 22 Jun, 2011 3 commits
-
-
Morris Jette authored
Add squeue support to display a job's license information. Patch by Andy Roosen (University of Delaware).
-
Morris Jette authored
For front-end architectures on which job steps are run (emulated Cray and BlueGene systems only), fix bug that would free memory still in use.
-
Morris Jette authored
Processes suspended and resumed are determined by using process group ID and parent process ID, so some processes may be missed. Since salloc runs as a normal user, its ability to identify processes associated with a job is limited.
-
- 21 Jun, 2011 2 commits
- 20 Jun, 2011 3 commits
-
-
Moe Jette authored
Cray systems: Add support to suspend/resume salloc command to ensure that aprun does not get initiated when the job is suspended.
-
moe authored
With regard to forthcoming Accelerator support in Basil 1.2/Alps 4.0, this adds interface support for passing the following Accelerator parameters: accelerator type (currently only "GPU" is supported), model/rank information (uninterpreted "family" string), and amount of on-board memory in MB. 02_Cray-Accelerator-params.diff. Patch from Gerrit Renker and Stephen Trofinoff, CSCS.
-
moe authored
This adds support to parse Basil 1.2/Alps 4.0 per-node accelerator information. 01_Cray-Accelerator-basic-support.diff. Patch from Gerrit Renker and Stephen Trofinoff, CSCS.
-
- 17 Jun, 2011 3 commits
-
-
Moe Jette authored
-
Moe Jette authored
NOTE: THERE HAS BEEN A NEW FIELD ADDED TO THE CONFIGURATION RESPONSE RPC AS SHOWN BY "SCONTROL SHOW CONFIG". THIS FUNCTION WILL ONLY WORK WHEN THE SERVER AND CLIENT ARE BOTH RUNNING SLURM VERSION 2.3.0.pre6
-
Moe Jette authored
Fix bug in layout of job step with --nodelist option plus node count. Old code could allocate too few nodes by double counting some nodes.
-
- 16 Jun, 2011 1 commit
-
-
Danny Auble authored
-