    In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046
    jette authored
    I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be
    allocated to jobs. This can be reproduced as follows:
    
    First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE
    set to allocate nodes exclusively by default). Then set one of the
    nodes allocated to the job (cn2) to state DOWN:
    
    srun: error: Node failure on cn2
    srun: error: Node failure on cn2
    srun: error: cn2: task 0: Killed
    ^Csrun: interrupt (one more within 1 sec to abort)
    srun: task 1: running
    srun: task 0: exited abnormally
    ^Csrun: sending Ctrl-C to job 22605.0
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
    srun: Force Terminated job step 22605.0
    
    Then change the state of the node back to IDLE. But it cannot be
    allocated to jobs:
    
    srun: job 22606 queued and waiting for resources
    
      JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
      22606      work hostname     root  PD       0:00      1 (Resources)
      22604      work   sbatch     root   R       3:06      1 cn1
    
    NodeName=cn2 Arch=x86_64 CoresPerSocket=8
       CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
       Gres=(null)
       NodeAddr=cn2 NodeHostName=cn2
       OS=Linux RealMemory=30000 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
       BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
    
    I traced the problem to select/cons_res. The call sequence is:
    
    slurmctld/node_mgr.c: update_node() =>
    slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
    excise_node_from_job() =>
    plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
    _rm_job_from_one_node() => _build_row_bitmaps() =>
    common/job_resources: remove_job_from_cores()
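
    As a rough mental model only (with invented names, not the actual SLURM
    data structures): the partition row bitmap has one bit per core of every
    node, while each job's core bitmap covers only the nodes allocated to
    that job, so rebuilding the row bitmap means translating each remaining
    job's node-relative core indices into partition-wide bit positions. A
    minimal standalone sketch of that translation, assuming four identical
    nodes with 16 cores each:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NODE_CNT        4     /* hypothetical partition size         */
    #define CORES_PER_NODE 16     /* matches CPUTot=16 in the node dump  */
    #define TOTAL_CORES    (NODE_CNT * CORES_PER_NODE)

    /* Simplified stand-in for job_resources: which nodes a job holds and,
     * for those nodes only, which cores it uses (node-relative indices). */
    struct fake_job {
        int     node_used[NODE_CNT];
        uint8_t core_used[NODE_CNT][CORES_PER_NODE];
    };

    /* Conceptual analogue of _build_row_bitmaps(): wipe the row bitmap and
     * re-add every remaining job, mapping node-relative core indices to
     * partition-wide bit positions.  If a job's node list and core map get
     * out of step, the wrong partition-wide bits end up set or cleared.  */
    static void rebuild_row(uint8_t row[TOTAL_CORES],
                            const struct fake_job *jobs, int job_cnt)
    {
        memset(row, 0, TOTAL_CORES);
        for (int j = 0; j < job_cnt; j++)
            for (int n = 0; n < NODE_CNT; n++) {
                if (!jobs[j].node_used[n])
                    continue;
                for (int c = 0; c < CORES_PER_NODE; c++)
                    if (jobs[j].core_used[n][c])
                        row[n * CORES_PER_NODE + c] = 1;
            }
    }

    int main(void)
    {
        uint8_t row[TOTAL_CORES];
        struct fake_job jobs[2];

        memset(jobs, 0, sizeof(jobs));
        /* Job 0 holds all cores of node 0; job 1 holds all cores of node 1. */
        jobs[0].node_used[0] = 1;
        jobs[1].node_used[1] = 1;
        for (int c = 0; c < CORES_PER_NODE; c++) {
            jobs[0].core_used[0][c] = 1;
            jobs[1].core_used[1][c] = 1;
        }

        rebuild_row(row, jobs, 2);
        for (int b = 0; b < TOTAL_CORES; b++)
            if (row[b])
                printf("core bit %d busy\n", b);
        return 0;
    }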
    
    If there are other jobs running in the partition, the partition row
    bitmap will not be set correctly. In the example above, before
    _build_row_bitmaps() is called, the output of _dump_part() is:
    
    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63
    
    After setting the node down, the output of _dump_part() is:
    
    [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
    [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47
    
    The cores of cn2 are not marked as available; instead, the cores of
    other nodes are released. When another job requests node cn2, the
    following log message appears:
    
    [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy
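
    To make the difference between the two dumps concrete: the bits that
    disappear from row0 are 48-63, i.e. exactly one node's worth of cores
    given CPUTot=16, and per the observation above they belong to nodes
    other than cn2. A throwaway check of that arithmetic (the two bit sets
    are copied verbatim from the dumps; no node-to-bit mapping is assumed):

    #include <stdio.h>

    #define CORES_PER_NODE 16   /* CPUTot=16 in the node dump */

    int main(void)
    {
        int released = 0;

        for (int b = 0; b < 64; b++) {
            /* row0 before: 16,32-63   row0 after: 16,32-47 (from the dumps) */
            int before = (b == 16) || (b >= 32 && b <= 63);
            int after  = (b == 16) || (b >= 32 && b <= 47);

            if (before && !after) {
                printf("core bit %d was released\n", b);
                released++;
            }
        }
        printf("%d bits released (%d node's worth of cores)\n",
               released, released / CORES_PER_NODE);
        return 0;
    }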
    
    I do not understand the design of select/cons_res well enough to know
    how to fix this. But it seems that _build_row_bitmaps() should not be
    called, since the job is not removed entirely; only one of its nodes
    is released.
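
    If that reading is right, one possible direction (sketched here only in
    terms of the toy model above, not the real cons_res structures, and not
    necessarily what this commit actually does) would be to clear just the
    released node's cores from the existing row bitmap, leaving the other
    jobs' cores alone:

    /* Reuses the fake_job/row layout from the earlier sketch.  Within one
     * row, jobs do not share cores, so clearing this job's cores on the
     * released node cannot disturb the bits owned by other jobs.         */
    static void release_one_node(uint8_t row[TOTAL_CORES],
                                 struct fake_job *job, int node)
    {
        if (!job->node_used[node])
            return;                         /* node not held by this job  */
        for (int c = 0; c < CORES_PER_NODE; c++) {
            if (job->core_used[node][c])
                row[node * CORES_PER_NODE + c] = 0;
            job->core_used[node][c] = 0;    /* job no longer owns the core */
        }
        job->node_used[node] = 0;
    }

    Whether something like this is safe in the real plugin depends on
    details (multiple rows per partition, memory accounting, jobs spanning
    rows) that the plain bitmap sketch above ignores.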