1. 19 Feb, 2013 1 commit
  2. 15 Feb, 2013 1 commit
  3. 13 Feb, 2013 2 commits
  4. 12 Feb, 2013 6 commits
  5. 11 Feb, 2013 2 commits
    • Various updates for new slurmctld/dynalloc plugin · 463d2388
      Morris Jette authored
      1. Removed the job_submit and job_modify functions from the plugin; they are not required for the "slurmctld" plugin type.
      2. Renamed the new parameter from "JobSubmitDynAllocPort" to "DynAllocPort" and renamed the variable accordingly (you need to change this in your slurm.conf file).
      3. Added logic so you can see the DynAllocPort value using "scontrol show config" or "sview".
      4. Made some minor formatting changes, mostly for lines that were too long.
      5. Added an #ifdef to the msg.h header file.
      6. Changed the #ifdef variables in the header files to start with "DYNALLOC_". Perhaps not needed, but it should be safer, especially with common names like "INFO_H" (see the sketch after this list).
      7. Re-wrote much of info.c; there was no need to get a copy of the node information and process the copy when we can work directly with the data structures.
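      For item 6, a minimal sketch of what such a renamed guard looks like;
      the file contents here are placeholders, not the actual plugin source:

      #ifndef DYNALLOC_INFO_H
      #define DYNALLOC_INFO_H

      /* declarations for the dynalloc info interface would go here */

      #endif /* DYNALLOC_INFO_H */

      A generic guard name like "INFO_H" can collide with an identically
      named guard in an unrelated header and silently hide declarations;
      the "DYNALLOC_" prefix avoids that.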
    • Added slurmctld/dynalloc plugin, JobSubmitDynAllocPort config parameter · 835b902b
      Jimmy Cao authored
      These provide support for MapReduce+
  6. 08 Feb, 2013 2 commits
  7. 07 Feb, 2013 2 commits
  8. 06 Feb, 2013 3 commits
  9. 05 Feb, 2013 6 commits
  10. 04 Feb, 2013 1 commit
  11. 01 Feb, 2013 2 commits
  12. 31 Jan, 2013 2 commits
  13. 30 Jan, 2013 2 commits
  14. 29 Jan, 2013 4 commits
  15. 25 Jan, 2013 1 commit
  16. 23 Jan, 2013 3 commits
    • ded211de
    • In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046
      jette authored
      I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be
      allocated to jobs. This can be reproduced as follows:
      
      First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE
      set to allocate nodes exclusively by default). Then set one of the
      nodes allocated to the job (cn2) to state DOWN:
      
      srun: error: Node failure on cn2
      srun: error: Node failure on cn2
      srun: error: cn2: task 0: Killed
      ^Csrun: interrupt (one more within 1 sec to abort)
      srun: task 1: running
      srun: task 0: exited abnormally
      ^Csrun: sending Ctrl-C to job 22605.0
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: Force Terminated job step 22605.0
      
      Then change the state of the node back to IDLE. But it still cannot
      be allocated to jobs:
      
      srun: job 22606 queued and waiting for resources
      
        JOBID PARTITION     NAME     USER  ST       TIME  NODES
      NODELIST(REASON)
        22606      work hostname     root  PD       0:00      1 (Resources)
        22604      work   sbatch     root   R       3:06      1 cn1
      
      NodeName=cn2 Arch=x86_64 CoresPerSocket=8
         CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
         Gres=(null)
         NodeAddr=cn2 NodeHostName=cn2
         OS=Linux RealMemory=30000 Sockets=2 Boards=1
         State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
         BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
         CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
      
      I traced and located the problem in select/cons_res. The call sequence
      is:
      
      slurmctld/node_mgr.c: update_node() =>
      slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
      excise_node_from_job() =>
      plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
      _rm_job_from_one_node() => _build_row_bitmaps() =>
      common/job_resources: remove_job_from_cores()
      
      If there are other jobs running in the partition, the partition row
      bitmap will not be set correctly. In the example above, before
      _build_row_bitmaps(), the output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63
      
      After setting the node DOWN, the output of _dump_part() is:
      
      [2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
      [2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47
      
      The cores of cn2 are not marked as available; instead, cores of other
      nodes are released. When another job requests node cn2, the following
      log message appears:
      
      [2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy
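      
      The dumps suggest the rebuild compacts the surviving allocations
      toward lower-numbered nodes: the excised node's cores stay busy while
      the highest node's cores are freed. Below is a self-contained toy
      model of that mechanism in plain C. This is not SLURM code; the node
      layout, node names, and the stale-slot replay are assumptions made
      only to illustrate the symptom:

      #include <stdio.h>
      #include <stdint.h>

      #define CORES_PER_NODE 16

      /* Mark all cores of global node index "node" busy in the row bitmap. */
      static void add_node_cores(uint64_t *row, int node)
      {
              *row |= (uint64_t)0xFFFF << (node * CORES_PER_NODE);
      }

      /* Rebuild a row bitmap from a job's node list, the way
       * _build_row_bitmaps() rebuilds rows from the jobs assigned to them. */
      static void rebuild_row(uint64_t *row, const int *nodes, int nnodes)
      {
              *row = 0;
              for (int i = 0; i < nnodes; i++)
                      add_node_cores(row, nodes[i]);
      }

      int main(void)
      {
              uint64_t row;

              /* A job allocated on cn2..cn4 (global node indices 1..3). */
              int before[3] = { 1, 2, 3 };
              rebuild_row(&row, before, 3);
              printf("before: 0x%016llx\n", (unsigned long long)row);

              /* cn2 (index 1) is set DOWN and excised from the job.  A
               * correct rebuild would use node indices { 2, 3 }; replaying
               * a stale mapping instead keeps the first two slots of the
               * old list, { 1, 2 }, so cn2 stays busy and cn4 is freed. */
              int stale[2] = { 1, 2 };
              rebuild_row(&row, stale, 2);
              printf("after:  0x%016llx (cn2 still busy, cn4 wrongly freed)\n",
                     (unsigned long long)row);
              return 0;
      }

      Running it shows the top node's core range vanish from the row while
      the DOWN node's range stays set, matching the 16,32-63 -> 16,32-47
      change above.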
      
      I do not understand the design of select/cons_res well enough to know
      how to fix this. But it seems that _build_row_bitmaps() should not be
      called here, since the job is not removed entirely; only one of its
      nodes is released. (A sketch of that alternative follows.)
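      
      For what it's worth, a minimal sketch of that alternative in the same
      toy model as above; the helper name is hypothetical and this is not
      the actual patch:

      /* Hypothetical helper: on a partial resize, clear only the excised
       * node's core range (CORES_PER_NODE and uint64_t as in the toy model
       * above).  The row is not rebuilt, so allocations of other jobs and
       * of this job's remaining nodes are left exactly as they were. */
      static void rm_node_cores(uint64_t *row, int node)
      {
              *row &= ~((uint64_t)0xFFFF << (node * CORES_PER_NODE));
      }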