- 06 Apr, 2013 2 commits
-
-
jette authored
(at initiating pending job steps), interrupt driven rather than retry based.
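A minimal sketch of the distinction being described, using generic pthread primitives and hypothetical names (this is not the actual slurmctld code): a retry-based scheduler wakes periodically to re-attempt pending steps, while an interrupt-driven one blocks until it is explicitly notified that something changed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state guarded by a single lock. */
static pthread_mutex_t step_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  step_cond = PTHREAD_COND_INITIALIZER;
static bool step_state_changed = false;

/* Called whenever resources free up (e.g. a running step completes). */
void notify_step_scheduler(void)
{
    pthread_mutex_lock(&step_lock);
    step_state_changed = true;
    pthread_cond_signal(&step_cond);
    pthread_mutex_unlock(&step_lock);
}

/* Interrupt-driven wait: block until notified, instead of waking
 * every few seconds to retry the pending step creation. */
void wait_for_step_event(void)
{
    pthread_mutex_lock(&step_lock);
    while (!step_state_changed)
        pthread_cond_wait(&step_cond, &step_lock);
    step_state_changed = false;
    pthread_mutex_unlock(&step_lock);
}
```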
-
Morris Jette authored
Fix sched/backfill logic to initiate jobs whose maximum time limit exceeds the partition limit but whose minimum time limit permits them to start. Related to bug 251
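A sketch of the check this message describes, with assumed field names (not the real sched/backfill data structures): a job submitted with a time-limit range should be startable when its minimum limit fits under the partition limit, even though its maximum does not.

```c
#include <stdint.h>

/* Hypothetical job fields, in minutes. */
struct bf_job {
    uint32_t time_limit; /* requested maximum time limit */
    uint32_t time_min;   /* minimum acceptable time limit */
};

/* Return the time limit to start the job with, or 0 if it cannot
 * start in this partition. The fix is the middle case: fall back
 * to a limit within [time_min, part_time_limit] rather than
 * rejecting the job because its maximum exceeds the partition. */
static uint32_t usable_time_limit(const struct bf_job *job,
                                  uint32_t part_time_limit)
{
    if (job->time_limit <= part_time_limit)
        return job->time_limit;     /* fits as requested */
    if (job->time_min <= part_time_limit)
        return part_time_limit;     /* start with a reduced limit */
    return 0;                       /* even the minimum is too long */
}
```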
-
- 05 Apr, 2013 5 commits
-
-
Morris Jette authored
Conflicts: src/slurmctld/job_mgr.c
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
This is the only remaining use of the function. Also added a boards parameter to the function and removed some unused parameters.
-
Morris Jette authored
Mostly to split long lines
-
- 04 Apr, 2013 4 commits
-
-
Stephen Trofinoff authored
I am sending the latest update of my NPPCU-support patch for Slurm 2.5.0. As before, this patch is applied over my basic BASIL 1.3 support patch. The reason for this latest version is that it came to my attention that certain jobs that should have been rejected by Slurm were allowed through. I then further noticed that this would cause the backfill algorithm to slow down dramatically (often not being able to process any other jobs).

The cause of the problem was that when I introduced the functionality into Slurm to properly set the "nppcu" (number of processors per compute unit) attribute in the XML reservation request to ALPS, I didn't also adjust the tests earlier in the code that eliminate nodes from consideration that do not have sufficient resources. In other words, jobs that would exceed the absolute total number of processors on the node would be rejected as always (this is good). Jobs that required the reduced number of "visible" processors on the node or less were allocated and worked fine (this is good). Unfortunately, jobs that needed a number of processors somewhere in between these limits (let's call them the soft and hard limits) were allowed through by Slurm. Making matters worse, when Slurm would subsequently try to request the ALPS reservation, ALPS would correctly reject it, but Slurm would keep trying, which would then kill the backfilling.

In my opinion, these jobs should have been rejected from the outset by Slurm, as they are asking for more processors per node than can be supplied. If the user wants this number of processors, they should specify "--ntasks-per-core=..." (in our case "2", as that is the full number of hardware threads per core). Obviously, this problem only appeared when I used CR_ONE_TASK_PER_CORE in slurm.conf, as I had modified the code to set nppcu to 1 when Slurm was configured with that option and the user didn't explicitly specify a different value.

The patch appears to be working well for us now, so I am submitting it to you for your review.
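A minimal sketch of the per-node feasibility test described above, with hypothetical names (not the actual Slurm/BASIL code): with CR_ONE_TASK_PER_CORE and nppcu = 1, only one hardware thread per compute unit is usable, so a request must be checked against the reduced ("soft") CPU count rather than the full ("hard") count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node description; not Slurm's real node record. */
struct node_info {
    uint32_t total_cpus;       /* hard limit: all hardware threads */
    uint32_t threads_per_core; /* e.g. 2 on the nodes described */
    uint32_t nppcu;            /* processors used per compute unit */
};

/* Reject requests that exceed the CPUs actually reservable via ALPS.
 * Before the fix, requests between the soft and hard limits slipped
 * past Slurm, were rejected by ALPS, and stalled the backfill loop. */
static bool node_can_satisfy(const struct node_info *node,
                             uint32_t cpus_requested)
{
    uint32_t visible_cpus =
        (node->total_cpus / node->threads_per_core) * node->nppcu;
    return cpus_requested <= visible_cpus;
}
```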
-
Morris Jette authored
-
Morris Jette authored
This fixes a bug introduced in commit f1cf6d2d. Without this change, a job with a partition-related reason for not running (e.g. MaxNodes) plus some other reason (e.g. dependency) could, after the second reason (e.g. dependency) was satisfied, have the select_nodes() function executed and return SUCCESS on an allocation request without actually allocating resources, since select_nodes() would interpret the request as a test of whether the job could ever run (e.g. will_run). Bug 263
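A schematic of the failure mode, with invented names (the real select_nodes() interface is more involved): when the mode is derived from stale job state rather than the caller's intent, a genuine allocation request can be handled as a will-run test, returning success without allocating anything.

```c
#include <stdbool.h>

/* Hypothetical simplification of the caller's intent. */
enum select_mode {
    SELECT_MODE_ALLOCATE, /* allocate resources now */
    SELECT_MODE_WILL_RUN  /* only test whether the job could ever run */
};

static bool allocate_resources(void) { /* ... */ return true; }
static bool could_ever_run(void)     { /* ... */ return true; }

/* The bug pattern: if 'mode' is inferred from a leftover
 * partition-related pending reason instead of from the request
 * itself, an allocation call falls into the will-run branch and
 * reports success even though no resources were allocated. */
static bool select_nodes_sketch(enum select_mode mode)
{
    if (mode == SELECT_MODE_WILL_RUN)
        return could_ever_run();
    return allocate_resources();
}
```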
-
Morris Jette authored
-
- 03 Apr, 2013 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
This logic strips leading zeros off node IDs for use with the aprun -L option. Without this change, node IDs with leading zeros are interpreted as octal. Partial fix for bug 260.
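A small illustration of why the stripping matters (a standalone example, not the actual aprun or Slurm code): parsers that auto-detect the radix treat a leading zero as octal, so a node ID such as "012" is read as 10 unless the zeros are stripped or base 10 is forced.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *node_id = "012";

    /* Base 0 auto-detects the radix: the leading zero marks the
     * string as octal, so "012" parses as 10. */
    long octal_view = strtol(node_id, NULL, 0);

    /* Forcing base 10 (or stripping leading zeros first) yields
     * the intended node ID. */
    long decimal_view = strtol(node_id, NULL, 10);

    printf("base 0: %ld, base 10: %ld\n", octal_view, decimal_view);
    return 0;
}
```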
-
Danny Auble authored
-
- 02 Apr, 2013 9 commits
-
-
Danny Auble authored
never looked at to determine the eligibility of a backfillable job.
-
Morris Jette authored
-
Morris Jette authored
A fix for this problem will require more study. This one causes an xassert failure when an attempt to start a job results in it not being started by sched/backfill due to the partition time limit.
-
Danny Auble authored
and when reading in state from DB2 we find a block that can't be created. You can now do a clean start to get rid of the bad block.
-
Danny Auble authored
the slurmctld there were software errors on some nodes.
-
Danny Auble authored
without it still existing there. This is extremely rare.
-
Danny Auble authored
a pending job on it we don't kill the job.
-
Danny Auble authored
while it was free cnodes would go into software error and kill the job.
-
Morris Jette authored
Fix sched/backfill logic to initiate jobs whose maximum time limit exceeds the partition limit but whose minimum time limit permits them to start. Related to bug 251
-
- 01 Apr, 2013 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Fix for bug 224
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 30 Mar, 2013 1 commit
-
-
Morris Jette authored
-
- 29 Mar, 2013 8 commits
-
-
Danny Auble authored
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
Conflicts: src/plugins/priority/multifactor/priority_multifactor.c
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-