- 29 Jan, 2013 4 commits
-
-
Danny Auble authored
block, destroy it correctly.
-
Danny Auble authored
at least 3.5.0. This avoids a stack overflow when running jobs on more than 120k nodes.
-
jette authored
-
Morris Jette authored
Used for a generic stack of slurmctld daemon plugins
-
- 25 Jan, 2013 1 commit
-
-
Morris Jette authored
-
- 23 Jan, 2013 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be allocated to jobs. This can be reproduced as follows. First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set to allocate nodes exclusively by default). Then set one of the nodes allocated to the job (cn2) to state DOWN:

    srun: error: Node failure on cn2
    srun: error: Node failure on cn2
    srun: error: cn2: task 0: Killed
    ^Csrun: interrupt (one more within 1 sec to abort)
    srun: task 1: running
    srun: task 0: exited abnormally
    ^Csrun: sending Ctrl-C to job 22605.0
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
    srun: Force Terminated job step 22605.0

Then change the state of the node back to IDLE. But it cannot be allocated to jobs:

    srun: job 22606 queued and waiting for resources

    JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
    22606      work hostname root PD  0:00     1 (Resources)
    22604      work   sbatch root  R  3:06     1 cn1

    NodeName=cn2 Arch=x86_64 CoresPerSocket=8
       CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05
       Features=abc Gres=(null)
       NodeAddr=cn2 NodeHostName=cn2 OS=Linux
       RealMemory=30000 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
       BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

I traced and located the problem in select/cons_res. The call sequence is:

    slurmctld/node_mgr.c: update_node()
    => slurmctld/job_mgr.c: kill_running_job_by_node_name()
    => excise_node_from_job()
    => plugins/select/cons_res/select_cons_res.c: select_p_job_resized()
    => _rm_job_from_one_node()
    => _build_row_bitmaps()
    => common/job_resources: remove_job_from_cores()

If there are other jobs running in the partition, the partition row bitmap will not be set correctly. In the example above, before _build_row_bitmaps() the output of _dump_part() is:

    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47

Cores of cn2 are not marked as available. Instead, cores of other nodes are released. When another job requires node cn2, the following log message appears:

    [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy

I do not understand the design of select/cons_res well and do not know how to fix this, but it seems that _build_row_bitmaps() should not be called, since the job is not removed entirely; only one of its nodes is released.
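Schematically, the behavior the reporter seems to expect is something like the following sketch (hypothetical names and a simplified flat core bitmap, not the actual cons_res data structures): when a single node is released from a running job, only that node's cores should be cleared from the partition row bitmap, rather than rebuilding all rows from the remaining jobs.

    /* Hypothetical sketch: clear only the cores of the released node
     * in the row's core bitmap, leaving other jobs' cores untouched. */
    static void release_node_cores(bool *row_core_bitmap,
                                   int first_core, int core_cnt)
    {
        for (int i = 0; i < core_cnt; i++)
            row_core_bitmap[first_core + i] = false;  /* cores of the downed node */
    }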
-
- 22 Jan, 2013 1 commit
-
-
Magnus Jonsson authored
-
- 18 Jan, 2013 3 commits
-
-
Morris Jette authored
From Chris Holmes, HP: After several days of brainstorming and debugging, I have identified a bug in SLURM 2.5.0rc2 related to the 'tree' topology. It occurs so early in the execution of the whole SLURM machinery that it took me some time to figure it out (on the order of 100 or 200 jobs showing the issue, with various debugging levels and extra instrumentation, and sometimes uncertain reliability). For every "switch", a bitmap of the nodes below that switch is built as the topology is discovered through 'topology.conf'. There is code in read_config.c, executed when the SLURM control daemon starts, that reorders the nodes (by hostname by default) after the switch table (i.e. the bitmaps) has already been built. Reordering the nodes means the switch bitmaps become wrong.
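A minimal illustration of the failure mode (hypothetical names and a simplified bool array rather than SLURM's actual bitmap type): each switch bitmap stores node-table indices, so any permutation of the node table must be mirrored in every switch bitmap, otherwise a bit set for "node 3" now describes whichever node was moved into slot 3.

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical sketch: switch_bitmap[i] == true means the node at
     * index i of the node table hangs below this switch. If the node
     * table is permuted (old index -> new index in old_to_new), the
     * bitmap must be remapped the same way or it becomes stale. */
    static void remap_switch_bitmap(bool *bitmap, const int *old_to_new,
                                    int node_cnt)
    {
        bool tmp[node_cnt];
        memset(tmp, 0, sizeof(tmp));
        for (int old = 0; old < node_cnt; old++) {
            if (bitmap[old])
                tmp[old_to_new[old]] = true;  /* follow the node to its new slot */
        }
        memcpy(bitmap, tmp, sizeof(tmp));
    }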
-
Morris Jette authored
slurm.MEM_PER_CPU, slurm.NO_VAL, etc.
-
Phil Eckert authored
About a year ago I submitted a modification that you incorporated into SLURM 2.4, which was to allow an admin to modify a job to use a QOS even though the user did not have access to the QOS. However, I must have tested it without having Accounting set to enforce QOS's. So, if an admin modifies a job to a QOS they don't have access to, the job will be modified, but it will end up in a state of InvalidQOS. That is reasonable, since it handles the case where a user has their QOS removed. A problem, however, is that even though the scheduler won't schedule the job, backfill still will. One approach would be to fix backfill to be consistent with the scheduler (which should probably occur regardless), but my thought would be to modify the scheduler to allow the QOS as long as it was set by an admin, since that was the intent of the modification to begin with. I believe it would only take a single line to change, just adding a check on job_ptr->limit_set_qos to make sure it was set by an admin:

    if (job_ptr->qos_id) {
        slurmdb_association_rec_t *assoc_ptr;
        assoc_ptr = (slurmdb_association_rec_t *)job_ptr->assoc_ptr;
        if (assoc_ptr &&
            !bit_test(assoc_ptr->usage->valid_qos, job_ptr->qos_id) &&
            !job_ptr->limit_set_qos) {
            info("sched: JobId=%u has invalid QOS", job_ptr->job_id);
            xfree(job_ptr->state_desc);
            job_ptr->state_reason = FAIL_QOS;
            continue;
        } else if (job_ptr->state_reason == FAIL_QOS) {
            xfree(job_ptr->state_desc);
            job_ptr->state_reason = WAIT_NO_REASON;
        }
    }

Phil
-
- 17 Jan, 2013 1 commit
-
-
Morris Jette authored
Added "KeepAliveTime" configuration parameter From Matthieu Hautreux: TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP exchanges, thus letting the system potentially deadlocked if a remote host dissapear and the local host is waiting on a read (the write would result in a EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes), could enhance the resilience of SLURM for unexpected packet/connection loss without too much implication on the scalability of the solution. The timeout could be configurable in case it is find too aggresive for particular configurations.
-
- 16 Jan, 2013 4 commits
-
-
Morris Jette authored
Without this patch, if the first listed partition lacks nodes with the required features, the job would be rejected.
-
Morris Jette authored
While this will validate the job at submit time, it results in redundant looping when scheduling jobs. Working on an alternate patch now.
-
Danny Auble authored
submission.
-
Danny Auble authored
-
- 15 Jan, 2013 2 commits
-
-
Matthieu Hautreux authored
QoS limits enforcement on the controller side is based on a list of used_limits per user. When a user is not yet in the list, which is common when the controller is restarted and the user has no running jobs, the current logic is to skip some of the "per user" limit checks and let the submission succeed. However, if one of these limits is zero-valued, the check should fail: it means no job should be submitted at all, as any job would necessarily exceed the limit. This patch ensures that even when a user is not yet present in the per-user used_limits list, zero-valued limits are correctly enforced.
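A schematic sketch of the intended decision (hypothetical names, not the actual slurmctld code): a missing used_limits entry is treated as zero usage, but a zero-valued limit must still reject the submission.

    #include <stdbool.h>

    /* Hypothetical sketch: 'used' comes from the per-user used_limits list
     * and is NULL when the user has no entry yet; a limit of 0 must reject
     * the submission regardless of current usage. */
    static bool job_within_user_limit(unsigned int limit,       /* e.g. a max-jobs-per-user value */
                                      const unsigned int *used) /* NULL if user not listed yet */
    {
        unsigned int current = used ? *used : 0;  /* no entry -> nothing used yet */

        if (limit == 0)                 /* zero-valued limit: nothing may ever run */
            return false;
        return (current + 1) <= limit;  /* would one more job still fit? */
    }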
-
David Bigagli authored
Add PriorityFlags value of "TICKET_BASED".
-
- 14 Jan, 2013 3 commits
-
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
Correction to CPU allocation count logic for cores without hyperthreading.
-
- 11 Jan, 2013 1 commit
-
-
Morris Jette authored
-
- 10 Jan, 2013 3 commits
-
-
Morris Jette authored
Used to specify the communication protocol to be used for ALPS/BASIL.
-
jette authored
-
Morris Jette authored
-
- 09 Jan, 2013 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
allow the use of accounting features like associations, QOS and limits, but not keep track of jobs or steps in accounting.
-
- 08 Jan, 2013 3 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
Phase 1 of this effort. See the -a/--array option in "man sbatch" for details. Creates job records using sbatch. Reports job arrays using scontrol or squeue. More work coming soon...
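As a rough usage sketch (the array range and script name below are made up; the option syntax is documented in "man sbatch"):

    sbatch --array=1-10 my_job.sh      # submit a job array with 10 tasks
    squeue -u $USER                    # array jobs are reported in the queue listing
    scontrol show job <jobid>          # detailed information for the job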
-
- 03 Jan, 2013 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 28 Dec, 2012 1 commit
-
-
Morris Jette authored
-
- 22 Dec, 2012 1 commit
-
-
Danny Auble authored
stack.
-
- 21 Dec, 2012 2 commits
-
-
Morris Jette authored
If sched/backfill starts a job with a QOS having NO_RESERVE and no job time limit, start it with the partition time limit (or one year if the partition has no time limit) rather than NO_VAL (a 140 year time limit). If a standby job, which in this case has the NO_RESERVE flag set, is submitted without a time limit and is backfilled, it gets an EndTime far into the future:

    JobId=99 Name=cmdll
       UserId=eckert(1043) GroupId=eckert(1043)
       Priority=12083 Account=sa QOS=standby
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
       RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
       SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
       StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00

So I looked at the code in /src/plugins/sched/backfill:

    if (job_ptr->start_time <= now) {
        int rc = _start_job(job_ptr, resv_bitmap);
        if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
            job_ptr->time_limit = orig_time_limit;
            job_ptr->end_time = job_ptr->start_time +
                                (orig_time_limit * 60);

Using the debugger I found that if the job does not have a specified time limit, job_ptr->time_limit is equal to NO_VAL when it hits this code.
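A schematic sketch of the fallback described in the first sentence (hypothetical helper, simplified from the actual backfill code): when the saved time limit is NO_VAL, derive the end time from the partition limit, or from one year if the partition is also unlimited.

    #define NO_VAL        (0xfffffffe)   /* "no value" marker, as assumed from slurm.h */
    #define YEAR_MINUTES  (365 * 24 * 60)

    /* Hypothetical helper: pick the time limit (in minutes) used to compute
     * end_time for a NO_RESERVE job that was submitted without a limit. */
    static unsigned int effective_limit(unsigned int orig_time_limit,
                                        unsigned int part_max_time)
    {
        if (orig_time_limit != NO_VAL)
            return orig_time_limit;           /* job supplied its own limit */
        if (part_max_time != NO_VAL && part_max_time != 0)
            return part_max_time;             /* fall back to the partition limit */
        return YEAR_MINUTES;                  /* last resort: one year */
    }

    /* end_time = start_time + effective_limit(...) * 60 seconds */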
-
Danny Auble authored
2.5.
-