- 06 Jul, 2011 1 commit
-
-
Morris Jette authored
Fix memory buffering bug that occurs when a partition's AllowGroups parameter has 100 or more users. Patch by Andriy Grytsenko (Massive Solutions Limited).
-
- 05 Jul, 2011 3 commits
-
-
Morris Jette authored
Add cgroup support for device files in both the task/cgroup plugin and generic resource (GRES) logic. Based upon patch by Yiannis Georgiou.
-
Morris Jette authored
When suspending a job, wait 2 seconds instead of 1 second between sending SIGTSTP and SIGSTOP. Some MPI implementations were not stopping within the 1-second delay.
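The two-stage suspend described in this commit can be sketched as follows. This is a minimal illustration, not SLURM's C implementation: the function name `suspend_process` and its `kill`/`sleep` parameters are hypothetical, added so the signal ordering can be exercised without touching a real process.

```python
import os
import signal
import time


def suspend_process(pid, delay=2.0, kill=os.kill, sleep=time.sleep):
    """Suspend `pid` in two stages: a catchable SIGTSTP, then SIGSTOP.

    `delay` is the grace period in seconds between the two signals
    (2 seconds per the commit message above). `kill` and `sleep` are
    injectable only for illustration and testing.
    """
    # SIGTSTP can be caught, so the process may tidy up before stopping.
    kill(pid, signal.SIGTSTP)
    # Give the process time to stop on its own.
    sleep(delay)
    # SIGSTOP cannot be caught or ignored, so the process is stopped now.
    kill(pid, signal.SIGSTOP)
```

A longer delay only matters for programs that handle SIGTSTP themselves and need time to quiesce; programs that do not handle it stop immediately on the first signal.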
-
Morris Jette authored
Add contribs/arrayrun tool providing support for job arrays. Contributed by Bjørn-Helge Mevik, University of Oslo. NOTE: Not currently packaged as RPM and manual file editing is required.
-
- 02 Jul, 2011 1 commit
-
-
Morris Jette authored
If a job needed to preempt other jobs to start and those jobs were not completed by the time of the next scheduling cycle, other jobs might be selected for preemption in that next cycle resulting in more jobs being preempted than necessary.
-
- 01 Jul, 2011 1 commit
-
-
Morris Jette authored
Previous logic reported the run time as the current time minus the job start time, ignoring any suspended time.
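The difference between the two calculations can be sketched as below. The commit message does not say how suspended time is tracked, so the list of `(suspend_at, resume_at)` pairs and the function name `job_run_time` are assumptions made for illustration only.

```python
def job_run_time(now, start_time, suspend_intervals):
    """Return a job's run time in seconds, excluding suspended time.

    `suspend_intervals` is a hypothetical list of (suspend_at, resume_at)
    timestamps; resume_at is None while the job is still suspended.
    """
    suspended = 0
    for suspend_at, resume_at in suspend_intervals:
        # An open interval counts up to the current time.
        end = now if resume_at is None else min(resume_at, now)
        suspended += max(0, end - max(suspend_at, start_time))
    # The previous logic amounted to `now - start_time` alone.
    return max(0, (now - start_time) - suspended)
```

For a job started at time 0 and suspended from 100 to 300, `job_run_time(1000, 0, [(100, 300)])` gives 800 rather than the 1000 that wall-clock subtraction alone would report.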
-
- 30 Jun, 2011 1 commit
-
-
Morris Jette authored
Enhancements to sched/backfill performance with the select/cons_res plugin. The largest improvements are seen with large job counts. Based upon the bf_build_row_bitmaps_2.2.6.patch patch from Bjørn-Helge Mevik, University of Oslo.
-
- 28 Jun, 2011 1 commit
-
-
Danny Auble authored
association limit.
-
- 27 Jun, 2011 1 commit
-
-
Morris Jette authored
Add default and maximum memory limits on a per-partition basis. If not specified, the system-wide memory limits will apply.
-
- 24 Jun, 2011 3 commits
-
-
Morris Jette authored
Add select_jobinfo to the task launch RPC so that all nodes have access to the information and not just the head node. Based upon patch by Andriy Grytsenko (Massive Solutions Limited).
-
Morris Jette authored
Fix possible invalid memory reference in sched/backfill. Patch by Andriy Grytsenko (Massive Solutions Limited).
-
Morris Jette authored
Add flag to the select APIs for job suspend/resume indicating if the action is for gang scheduling or an explicit job suspend/resume by the user. Only an explicit job suspend/resume will reset the job's priority and make resources exclusively held by the job available to other jobs. This change is also needed for Cray systems with ALPS.
-
- 22 Jun, 2011 3 commits
-
-
Morris Jette authored
Add squeue support to display a job's license information. Patch by Andy Roosen (University of Delaware).
-
Morris Jette authored
For front-end architectures on which job steps are run (emulated Cray and BlueGene systems only), fix bug that would free memory still in use.
-
Morris Jette authored
Processes suspended and resumed are determined by using process group ID and parent process ID, so some processes may be missed. Since salloc runs as a normal user, its ability to identify processes associated with a job is limited.
-
- 21 Jun, 2011 2 commits
- 20 Jun, 2011 3 commits
-
-
Moe Jette authored
Cray systems: Add support to suspend/resume the salloc command to ensure that aprun does not get initiated when the job is suspended.
-
moe authored
With regard to forthcoming Accelerator support in Basil 1.2/Alps 4.0, this adds interface support for passing the following Accelerator parameters: * accelerator type (currently only "GPU" is supported), * model/rank information (uninterpreted "family" string), * amount of on-board memory in MB. 02_Cray-Accelerator-params.diff Patch from Gerrit Renker and Stephen Trofinoff, CSCS.
-
moe authored
This adds support to parse Basil 1.2/Alps 4.0 per-node accelerator information. 01_Cray-Accelerator-basic-support.diff Patch from Gerrit Renker and Stephen Trofinoff, CSCS
-
- 17 Jun, 2011 3 commits
-
-
Moe Jette authored
-
Moe Jette authored
NOTE: THERE HAS BEEN A NEW FIELD ADDED TO THE CONFIGURATION RESPONSE RPC AS SHOWN BY "SCONTROL SHOW CONFIG". THIS FUNCTION WILL ONLY WORK WHEN THE SERVER AND CLIENT ARE BOTH RUNNING SLURM VERSION 2.3.0.pre6
-
Moe Jette authored
Fix bug in layout of job step with --nodelist option plus node count. Old code could allocate too few nodes by double counting some nodes.
-
- 16 Jun, 2011 1 commit
-
-
Danny Auble authored
-
- 15 Jun, 2011 1 commit
-
-
Moe Jette authored
The original logic had a problem if you shrank a job and later grew it. Nodes previously released would reappear when the job grows, but have zero CPUs associated with them. The problem was due to the original nodes list of a job being preserved in the job_resources data structure. The new logic confirms that those nodes are still in the job's allocation before rebuilding the job_resources data structure.
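The membership check described in this commit can be sketched as follows. This is a rough illustration only: `rebuild_job_resources` and its arguments are hypothetical stand-ins, not the actual job_resources data structure or SLURM function names.

```python
def rebuild_job_resources(original_nodes, allocated_nodes, cpus_by_node):
    """Rebuild a node -> CPU count map, keeping only nodes still allocated.

    `original_nodes` is the preserved node list from before any shrink,
    `allocated_nodes` is the job's current allocation, and `cpus_by_node`
    maps each currently allocated node to its CPU count.
    """
    allocated = set(allocated_nodes)
    # Nodes released by an earlier shrink are skipped instead of being
    # carried forward with zero CPUs.
    return {node: cpus_by_node[node]
            for node in original_nodes if node in allocated}
```

With an original list of `n1`, `n2`, `n3` and `n2` released earlier, only `n1` and `n3` survive the rebuild, so no zero-CPU entry for `n2` can reappear.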
-
- 14 Jun, 2011 2 commits
-
-
Danny Auble authored
UMBC.
-
Moe Jette authored
Prevent background salloc disconnecting terminal at termination. Patch by Don Albert, Bull.
-
- 10 Jun, 2011 1 commit
-
-
Moe Jette authored
-
- 09 Jun, 2011 2 commits
- 08 Jun, 2011 2 commits
-
-
Moe Jette authored
Avoid clearing a node's Arch, OS, BootTime and SlurmdStartTime when "scontrol reconfig" is run. Patch from Martin Perry, Bull.
-
Morris Jette authored
-
- 07 Jun, 2011 2 commits
-
-
Danny Auble authored
-
Moe Jette authored
Added scontrol ability to increment or decrement a job or step time limit.
-
- 06 Jun, 2011 1 commit
-
-
Danny Auble authored
would not be set correctly in the added child association.
-
- 02 Jun, 2011 1 commit
-
-
Moe Jette authored
With default configuration on non-Cray systems, enable salloc to be spawned as a background process. Based upon work by Don Albert (Bull) and Gerrit Renker (CSCS).
-
- 01 Jun, 2011 3 commits
-
-
Moe Jette authored
Add support to salloc for a new environment variable SALLOC_KILL_CMD, which is equivalent to the -K/--kill-command option.
-
Moe Jette authored
This fixes a bug reported by Don Albert. The problem is that whenever salloc exits with a child process in stopped state (suspended or stopped on terminal input/output), a zombie process is generated, since this case is not caught by the code evaluating the child status. This patch adds the missing case. It uses SIGKILL, which is the only signal that changes the state of a stopped process. It was decided not to try to re-awaken the process using SIGCONT, since (a) this happens during session clean-up and (b) if the condition is due to SIGTTIN, the process immediately becomes stopped again. Patch from Gerrit Renker, CSCS.
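The missing case can be illustrated with the sketch below, which uses Python's `os.waitpid` wrappers rather than salloc's C code. The names `reap_child` and `demo` are hypothetical, and the `sleep 30` child is only a throwaway stand-in for the command salloc would have spawned.

```python
import os
import signal
import subprocess
import time


def reap_child(pid):
    """Reap child `pid` without blocking; kill it first if it is stopped.

    Returns the final wait status, or None if the child is still running.
    """
    # WUNTRACED makes waitpid() report stopped children, not only exited ones.
    wpid, status = os.waitpid(pid, os.WNOHANG | os.WUNTRACED)
    if wpid == 0:
        return None
    if os.WIFSTOPPED(status):
        # SIGKILL takes effect even while the process is stopped.
        os.kill(pid, signal.SIGKILL)
        _, status = os.waitpid(pid, 0)  # collect it so no zombie remains
    return status


def demo():
    """Stop a throwaway child, then reap it; returns the final wait status."""
    child = subprocess.Popen(["sleep", "30"])
    os.kill(child.pid, signal.SIGSTOP)
    status = None
    for _ in range(500):  # the stop is reported asynchronously, so poll briefly
        status = reap_child(child.pid)
        if status is not None:
            break
        time.sleep(0.01)
    return status
```

Without the `WIFSTOPPED` branch, a caller that only checks for exited or signalled children would leave the stopped child behind when it exits itself.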
-
Moe Jette authored
Treat the specification of multiple cluster names as a fatal error.
-
- 31 May, 2011 1 commit
-
-
Moe Jette authored
Note that scontrol can only support a single cluster at one time.
-