- 09 Jul, 2012 1 commit
-
-
Martin Perry authored
See Bugzilla #73 for a more complete description of the problem. Patch by Martin Perry, Bull.
-
- 06 Jul, 2012 1 commit
-
-
Carles Fenoy authored
If a job is submitted to more than one partition, its partition pointer can be set to an invalid value. This can make the count of CPUs allocated on a node wrong, resulting in over- or under-allocation of its CPUs. Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1. I was able to reproduce the explained behavior by submitting jobs to 2 partitions. The job is allocated in one partition, but in the schedule function the partition of the job is changed to the NON-allocated one. As a result, the resources cannot be freed at the end of the job.

I've solved this by moving the IS_PENDING test a few lines up in the schedule function (in job_scheduler.c). This is the code from the git HEAD (line 801). As this file has changed a lot from 2.4.x I have not made a patch, but I'm describing the solution here. I've moved the if (!IS_JOB_PENDING) to after the 2nd line (part_ptr...). This prevents the partition of the job from being changed if it is already starting in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
        continue; /* started in other partition */

Hope this is enough information to solve it.

I've just realized (while writing this mail) that my solution has a memory leak, as job_queue_rec is not freed.

Regards,
Carles Fenoy
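The reordering described in this message can be sketched as follows. This is a minimal illustration and not the committed patch: the struct definitions, the `pending` field (standing in for IS_JOB_PENDING()), the function name take_queue_entry() and the use of free() in place of xfree() are all assumptions made so the example compiles on its own. It checks the pending state before touching the job's partition pointer, and it frees the queue record before the early return so that the leak mentioned at the end of the mail does not occur.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the slurmctld structures. */
struct part_record {
    const char *name;
};

struct job_record {
    bool pending;                 /* stand-in for IS_JOB_PENDING() */
    struct part_record *part_ptr; /* partition the job is tied to */
};

struct job_queue_rec {
    struct job_record *job_ptr;
    struct part_record *part_ptr;
};

/*
 * Consume one queue entry. The record is always freed here, so the
 * early return cannot leak it. Returns true and stores the job in
 * *job_out if the job is still pending; returns false if the job has
 * already started in another partition, leaving its part_ptr untouched.
 */
bool take_queue_entry(struct job_queue_rec *rec, struct job_record **job_out)
{
    assert(rec != NULL && job_out != NULL);

    struct job_record *job_ptr = rec->job_ptr;
    struct part_record *part_ptr = rec->part_ptr;
    free(rec); /* free before any early return */

    if (!job_ptr->pending)
        return false; /* started in other partition */

    /* Only a pending job may be re-pointed at this queue entry's partition. */
    job_ptr->part_ptr = part_ptr;
    *job_out = job_ptr;
    return true;
}
```

In a scheduling loop, a false return would correspond to the `continue` in the snippet quoted above.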
-
- 03 Jul, 2012 1 commit
-
-
Danny Auble authored
there are jobs running on that hardware.
-
- 02 Jul, 2012 1 commit
-
-
Carles Fenoy authored
correctly when transitioning. This also applies for 2.4.0 -> 2.4.1, no state will be lost. (Thanks to Carles Fenoy)
-
- 28 Jun, 2012 1 commit
-
-
Danny Auble authored
-
- 26 Jun, 2012 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
bg.properties in order for the runjob_mux to run correctly. Signed-off-by: Danny Auble <da@schedmd.com>
-
Danny Auble authored
but job is going to be canceled because it is interactive or other reason it now receives the grace time.
-
Morris Jette authored
-
- 25 Jun, 2012 3 commits
-
-
Danny Auble authored
check if a block is still makable if the cable wasn't in error.
-
Danny Auble authored
removal of the job on the block failed.
-
Danny Auble authored
-
- 22 Jun, 2012 3 commits
-
-
Danny Auble authored
29d79ef8
-
Danny Auble authored
same time a block is destroyed and that block just happens to be the smallest overlapping block over the bad hardware.
-
Danny Auble authored
-
- 20 Jun, 2012 2 commits
-
-
Danny Auble authored
but not node count the node count is correctly figured out.
-
Morris Jette authored
Without this fix, gang scheduling mode could start without creating a list resulting in an assert when jobs are submitted.
-
- 18 Jun, 2012 2 commits
-
-
Danny Auble authored
packing the step layout structure.
-
Danny Auble authored
we must use a small block instead of a shared midplane block.
-
- 13 Jun, 2012 2 commits
-
-
Danny Auble authored
still messages we find when we poll but haven't given it back to the real time yet.
-
Danny Auble authored
-
- 12 Jun, 2012 1 commit
-
-
Danny Auble authored
-
- 05 Jun, 2012 1 commit
-
-
Danny Auble authored
a job kill timeout aren't always reported to the system. This is now handled by the runjob_mux plugin.
-
- 01 Jun, 2012 2 commits
-
-
Danny Auble authored
sub-blocks.
-
Danny Auble authored
to make a larger small block and are running with sub-blocks.
-
- 31 May, 2012 1 commit
-
-
Danny Auble authored
function didn't always work correctly.
-
- 30 May, 2012 3 commits
-
-
Danny Auble authored
the next step in the allocation only uses part of the allocation it gets the correct cnodes.
-
Morris Jette authored
-
Andy Wettstein authored
In etc/init.d/slurm move check for scontrol after sourcing /etc/sysconfig/slurm. Patch from Andy Wettstein, University of Chicago.
-
- 29 May, 2012 1 commit
-
-
Don Lipari authored
-
- 25 May, 2012 2 commits
-
-
Rod Schultz authored
This change makes the code consistent with the documentation. Note that "bf_res=" will continue to be recognized for now. Patch from Rod Schultz, Bull.
-
Don Albert authored
I have implemented the changes as you suggested: using a "-dd" option to indicate that the display of the script is wanted, and setting both the "SHOW_DETAIL" and a new "SHOW_DETAIL2" flag. Since "scontrol" can be run interactively as well, I added a new "script" option to indicate that display of both the script and the details is wanted if the job is a batch job.

Here are the man page updates for "man scontrol".

For the "-d, --details" option:

    -d, --details
        Causes the show command to provide additional details where available. Repeating the option more than once (e.g., "-dd") will cause the show job command to also list the batch script, if the job was a batch job.

For the interactive "details" option:

    details
        Causes the show command to provide additional details where available. Job information will include CPUs and NUMA memory allocated on each node. Note that on computers with hyperthreading enabled and SLURM configured to allocate cores, each listed CPU represents one physical core. Each hyperthread on that core can be allocated a separate task, so a job's CPU count and task count may differ. See the --cpu_bind and --mem_bind option descriptions in srun man pages for more information. The details option is currently only supported for the show job command. To also list the batch script for batch jobs, in addition to the details, use the script option described below instead of this option.

And for the new interactive "script" option:

    script
        Causes the show job command to list the batch script for batch jobs in addition to the detail information described under the details option above.

Attached are the patch file for the changes and a text file with the results of the tests I did to check out the changes. The patches are against SLURM 2.4.0-rc1.

-Don Albert-
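The option handling described in this message can be sketched as below. It is an illustration under stated assumptions, not the code from the patch: the numeric values given to SHOW_DETAIL and SHOW_DETAIL2 are hypothetical, and the helper names detail_flags_from_args() and detail_flags_from_keyword() are invented for the example. Only the behavior comes from the message: one "-d" sets SHOW_DETAIL, repeating it (e.g. "-dd") also sets SHOW_DETAIL2, and the interactive "script" keyword requests both.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit values; the real definitions live in the SLURM headers. */
#define SHOW_DETAIL  0x0002
#define SHOW_DETAIL2 0x0004

/*
 * Count occurrences of the details option on a command line.
 * Recognizes "--details" and arguments made only of 'd' after a single
 * dash ("-d", "-dd", ...). Other arguments are ignored.
 */
uint16_t detail_flags_from_args(int argc, const char *const *argv)
{
    assert(argc == 0 || argv != NULL);

    size_t count = 0;
    for (int i = 0; i < argc; i++) {
        const char *arg = argv[i];
        if (strcmp(arg, "--details") == 0) {
            count++;
        } else if (arg[0] == '-' && arg[1] == 'd' &&
                   strspn(arg + 1, "d") == strlen(arg + 1)) {
            count += strlen(arg + 1);
        }
    }

    uint16_t flags = 0;
    if (count >= 1)
        flags |= SHOW_DETAIL;  /* additional details */
    if (count >= 2)
        flags |= SHOW_DETAIL2; /* also list the batch script */
    return flags;
}

/* Map the interactive keywords "details" and "script" to the same flags. */
uint16_t detail_flags_from_keyword(const char *word)
{
    if (strcmp(word, "details") == 0)
        return SHOW_DETAIL;
    if (strcmp(word, "script") == 0)
        return SHOW_DETAIL | SHOW_DETAIL2;
    return 0;
}
```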
-
- 24 May, 2012 3 commits
-
-
Danny Auble authored
compiling with --enable-debug
-
Jon Bringhurst authored
The purpose of this is to give Moab scripts and commands (such as 'checkjob') consistent access to the SUBMITHOST variable.
-
Danny Auble authored
-
- 23 May, 2012 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
isn't up at the time the slurmctld starts, not running the priority/multifactor plugin, and then the database is started up later.
-
Morris Jette authored
-
- 22 May, 2012 1 commit
-
-
Danny Auble authored
-
- 16 May, 2012 1 commit
-
-
Morris Jette authored
Cray - Improve support for zero compute node resource allocations. The partition used can now be configured with no nodes.
-