- 18 Jan, 2013 1 commit
-
-
Phil Eckert authored
About a year ago I submitted a modification that you incorporated into SLURM 2.4, which was to allow an admin to modify a job to use a QOS even though the user did not have access to the QOS. However, I must have tested it without having the Accounting set to enforce QOS's. So, if an admin modifies a job to a QOS they don't have access to, it will be modified, but the job will result in a state of InvalidQOS, which is reasonable, since this would handle the case where a user has their QOS removed. A problem, however, is that even though the scheduler won't schedule the job, backfill still will. One approach would be to fix backfill to be consistent with the scheduler (which should probably occur regardless), but my thought would be to modify the scheduler to allow the QOS as long as it was set by an admin, since that was the intent of the modification to begin with. I believe it would only take a single line to change, just adding a check on the job_ptr->limit_set_qos, to make sure it was set by an admin: if (job_ptr->qos_id) { slurmdb_association_rec_t *assoc_ptr; assoc_ptr = (slurmdb_association_rec_t *)job_ptr->assoc_ptr; if (assoc_ptr && !bit_test(assoc_ptr->usage->valid_qos, job_ptr->qos_id) && !job_ptr->limit_set_qos) { info("sched: JobId=%u has invalid QOS", job_ptr->job_id); xfree(job_ptr->state_desc); job_ptr->state_reason = FAIL_QOS; continue; } else if (job_ptr->state_reason == FAIL_QOS) { xfree(job_ptr->state_desc); job_ptr->state_reason = WAIT_NO_REASON; } } Phil
-
- 16 Jan, 2013 4 commits
-
-
Morris Jette authored
Without this patch, if the first listed partition lacks nodes with required features the job would be rejected.
-
Morris Jette authored
While this will validate job at submit time, it results in redundant looping when scheduling jobs. Working on alternate patch now.
-
Danny Auble authored
submission.
-
Danny Auble authored
-
- 15 Jan, 2013 1 commit
-
-
Matthieu Hautreux authored
QoS limits enforcement on the controller side is based on a list of used_limits per user. When a user is not yet added to the list, which is common when the controller is restarted and the user has no running jobs, the current logic is to not check some of the "per user limits" and let the submission succeed. However, if one of these limits is a zero-valued limit, the check chould failed as it means that no job should be submitted at all as it would necessarily result in a crossing of the limit. This patch ensures that even when a user is not yet present in the per user used_limits list, the 0-valued limits are correctly treated.
-
- 14 Jan, 2013 1 commit
-
-
jette authored
-
- 11 Jan, 2013 1 commit
-
-
Morris Jette authored
-
- 10 Jan, 2013 2 commits
-
-
jette authored
-
Morris Jette authored
-
- 09 Jan, 2013 1 commit
-
-
Danny Auble authored
-
- 08 Jan, 2013 2 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
- 03 Jan, 2013 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 28 Dec, 2012 1 commit
-
-
Morris Jette authored
-
- 22 Dec, 2012 1 commit
-
-
Danny Auble authored
stack.
-
- 21 Dec, 2012 1 commit
-
-
Morris Jette authored
If sched/backfill starts a job with a QOS having NO_RESERVE and not job time limit, start it with the partition time limit (or one year if the partition has no time limit) rather than NO_VAL (140 year time limit); If a standby job, which in this case has the NO_RESERVE flag set, is submitted without a time limit, and is backfilled, it will get an EndTime waaayyyy into the future. JobId=99 Name=cmdll UserId=eckert(1043) GroupId=eckert(1043) Priority=12083 Account=sa QOS=standby JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0 RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36 StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00 so I looked at the code in /src/plugins/sched/backfill: if (job_ptr->start_time <= now) { int rc = _start_job(job_ptr, resv_bitmap); if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){ job_ptr->time_limit = orig_time_limit; job_ptr->end_time = job_ptr->start_time + (orig_time_limit * 60); Using the debugger I found that if the job does not have a specified time limit, the job_ptr->time_limit is equal to NO_VAL when it hits this code.
-
- 20 Dec, 2012 2 commits
-
-
Danny Auble authored
slurm.conf with NodeAddr's signals going to a step could be handled incorrectly.
-
Danny Auble authored
would of also killed the allocation.
-
- 19 Dec, 2012 4 commits
-
-
Danny Auble authored
to make one job run.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-N1 -n#.
-
- 17 Dec, 2012 2 commits
-
-
Danny Auble authored
-
Chris Read authored
-
- 14 Dec, 2012 4 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Chris Reed authored
Without this patch, use of sched/builtin would always result in FIFO scheduling, even if priority/multifactor was configured
-
Danny Auble authored
-
- 13 Dec, 2012 3 commits
-
-
jette authored
-
Danny Auble authored
each block independently.
-
Danny Auble authored
-
- 12 Dec, 2012 1 commit
-
-
Morris Jette authored
-
- 07 Dec, 2012 1 commit
-
-
Morris Jette authored
Correction to hostlist sorting for hostnames that contain two numeric components and the first numeric component has various sizes (e.g. "rack9blade1" should come before "rack10blade1")
-
- 06 Dec, 2012 1 commit
-
-
Morris Jette authored
-
- 05 Dec, 2012 1 commit
-
-
Danny Auble authored
job on future step creation attempts.
-