1. 05 Feb, 2013 7 commits
  2. 04 Feb, 2013 5 commits
  3. 01 Feb, 2013 7 commits
  4. 31 Jan, 2013 4 commits
  5. 30 Jan, 2013 3 commits
  6. 29 Jan, 2013 9 commits
  7. 28 Jan, 2013 1 commit
  8. 26 Jan, 2013 1 commit
  9. 23 Jan, 2013 2 commits
    • In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046
      jette authored
      I ran into a problem with slurm-2.5.1 in which IDLE nodes cannot be
      allocated to jobs. This can be reproduced as follows:
      
      First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set
      to allocate nodes exclusively by default). Then set one of the nodes
      allocated to the job (cn2) to state DOWN:
      
      srun: error: Node failure on cn2
      srun: error: Node failure on cn2
      srun: error: cn2: task 0: Killed
      ^Csrun: interrupt (one more within 1 sec to abort)
      srun: task 1: running
      srun: task 0: exited abnormally
      ^Csrun: sending Ctrl-C to job 22605.0
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: Force Terminated job step 22605.0
      
      Then change the state of the node back to IDLE. It still cannot be
      allocated to jobs:
      
      srun: job 22606 queued and waiting for resources
      
        JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
        22606      work hostname     root  PD       0:00      1 (Resources)
        22604      work   sbatch     root   R       3:06      1 cn1
      
      NodeName=cn2 Arch=x86_64 CoresPerSocket=8
         CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
         Gres=(null)
         NodeAddr=cn2 NodeHostName=cn2
         OS=Linux RealMemory=30000 Sockets=2 Boards=1
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
         BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
         CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
      
      I traced and located the problem in select/cons_res. The call sequence
      is:
      
      slurmctld/node_mgr.c: update_node() =>
      slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
      excise_node_from_job() =>
      plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
      _rm_job_from_one_node() => _build_row_bitmaps() =>
      common/job_resources: remove_job_from_cores()
      
      If there are other jobs running in the partition, the partition row
      bitmap will not be set correctly. In the example above, before
      _build_row_bitmaps() is called, the output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63
      
      After setting the node down, the output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47
      
      Cores of cn2 are not marked as available. Instead, cores of other nodes
      are released. When another job requires the node cn2, the following log
      message appears:
      
      [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy
      
      I do not understand the design of select/cons_res well and I do not know
      how to fix this. But it seems that _build_row_bitmaps() should not be
      called here, since the job is not removed entirely; only one of its
      nodes is released.
    • Correction to comment in spank.h · 8e0ee95a
      Morris Jette authored
  10. 22 Jan, 2013 1 commit