Commit eb3c1046 authored by jette's avatar jette

In select/cons_res, correct logic when job removed from only some nodes.

I ran into a problem with slurm-2.5.1 in which IDLE nodes cannot be
allocated to jobs. It can be reproduced as follows:

First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set,
so nodes are allocated exclusively by default). Then set one of the nodes
allocated to the job (cn2) to state DOWN (e.g. scontrol update
NodeName=cn2 State=DOWN Reason=test):

srun: error: Node failure on cn2
srun: error: Node failure on cn2
srun: error: cn2: task 0: Killed
^Csrun: interrupt (one more within 1 sec to abort)
srun: task 1: running
srun: task 0: exited abnormally
^Csrun: sending Ctrl-C to job 22605.0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Force Terminated job step 22605.0

Then return the node to state IDLE (e.g. scontrol update NodeName=cn2
State=RESUME). But it still cannot be allocated to jobs:

srun: job 22606 queued and waiting for resources

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  22606      work hostname     root  PD       0:00      1 (Resources)
  22604      work   sbatch     root   R       3:06      1 cn1

NodeName=cn2 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
   Gres=(null)
   NodeAddr=cn2 NodeHostName=cn2
   OS=Linux RealMemory=30000 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

Note that although the node's state is IDLE again, CPUAlloc is still 16,
which matches the (Resources) reason above. I traced and located the
problem in select/cons_res. The call sequence is:

slurmctld/node_mgr.c: update_node() =>
slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
excise_node_from_job() =>
plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
_rm_job_from_one_node() => _build_row_bitmaps() =>
common/job_resources.c: remove_job_from_cores()
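
To make the failure mode concrete, here is a toy model of what the
rebuild conceptually does (illustrative only: the names and the flat
64-bit bitmap are mine, not Slurm's, and I assume cn2 owns core bits
32-47). The row bitmap is recomputed purely by OR-ing together the core
bitmaps recorded for the jobs in that row, so if a job's recorded
allocation is already wrong after the resize, the rebuild faithfully
reproduces a wrong row:

#include <stdint.h>
#include <stdio.h>

struct model_job {
    uint64_t core_bitmap;            /* bit i set => global core i allocated */
};

/* Stand-in for _build_row_bitmaps(): recompute the row purely from the
 * jobs' recorded core allocations. */
static uint64_t rebuild_row(const struct model_job *jobs, int njobs)
{
    uint64_t row = 0;
    for (int i = 0; i < njobs; i++)
        row |= jobs[i].core_bitmap;
    return row;
}

int main(void)
{
    /* Matches "row0: num_jobs 2: bitmap: 16,32-63" below. */
    struct model_job jobs[2] = {
        { 1ULL << 16 },              /* job 22604: core 16 */
        { 0xFFFFFFFFULL << 32 },     /* --no-kill job: cores 32-63 */
    };
    printf("before: %016llx\n", (unsigned long long)rebuild_row(jobs, 2));

    /* If the resize left the second job's bitmap wrong (cores 48-63
     * dropped instead of cn2's 32-47), the rebuild reproduces it: */
    jobs[1].core_bitmap = 0xFFFFULL << 32;
    printf("after:  %016llx\n", (unsigned long long)rebuild_row(jobs, 2));
    return 0;
}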

If there are other jobs running in the partition, the partition row
bitmap is not updated correctly. In the example above, before
_build_row_bitmaps() the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47

Each node here owns a block of 16 consecutive core bits (Sockets=2 x
CoresPerSocket=8), so the change from 16,32-63 to 16,32-47 shows a whole
16-core block (bits 48-63) being released while cn2's block apparently
stays set: the cores of cn2 are not marked as available, and the cores
of another node are released instead. When another job then requests
cn2, the following log message appears:

[2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy
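
That message is consistent with the row bitmap above: under the same toy
model, every one of cn2's 16 core bits is still set, so the allocator
sees no free cores on the node (again illustrative only; the real check
in select/cons_res involves more state than this):

#include <stdint.h>
#include <stdio.h>

#define CORES_PER_NODE 16            /* Sockets=2 x CoresPerSocket=8 */

/* Toy check: count core bits of node node_inx that are clear in row. */
static int free_cores(uint64_t row, int node_inx)
{
    int n = 0;
    for (int c = node_inx * CORES_PER_NODE;
         c < (node_inx + 1) * CORES_PER_NODE; c++) {
        if (!(row & (1ULL << c)))
            n++;
    }
    return n;
}

int main(void)
{
    uint64_t row = (1ULL << 16) | (0xFFFFULL << 32); /* "16,32-47" */
    printf("free cores on cn2: %d\n", free_cores(row, 2)); /* prints 0 */
    return 0;
}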

I do not understand the design of select/cons_res well enough to know
how to fix this, but it seems that _build_row_bitmaps() should not be
called here: the job is not being removed entirely, only one of its
nodes is released.
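
A sketch of that suggestion, under the same toy model (rm_node_from_row()
is a hypothetical helper, not Slurm's API; the real code would clear a
range of a bitstr_t, e.g. with bit_nclear(), using per-node core offsets
rather than assuming uniform 16-core nodes): when a single node is
excised, clear just that node's core block from each row bitmap and
leave every other job's cores alone:

#include <stdint.h>
#include <stdio.h>

#define CORES_PER_NODE 16

/* Hypothetical helper: release only the excised node's cores. */
static void rm_node_from_row(uint64_t *row, int node_inx)
{
    uint64_t mask = ((1ULL << CORES_PER_NODE) - 1)
                    << (node_inx * CORES_PER_NODE);
    *row &= ~mask;
}

int main(void)
{
    uint64_t row = (1ULL << 16) | (0xFFFFFFFFULL << 32); /* "16,32-63" */
    rm_node_from_row(&row, 2);       /* excise cn2 (cores 32-47) */
    printf("row: %016llx\n", (unsigned long long)row);
    /* cores 32-47 now free; core 16 and cores 48-63 untouched */
    return 0;
}

Whether this, or instead teaching _build_row_bitmaps() to cope with a
partially removed job, is the right fix, I leave to someone who knows
the plugin better.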