1. 05 Apr, 2012 3 commits
  2. 04 Apr, 2012 2 commits
  3. 03 Apr, 2012 8 commits
  4. 02 Apr, 2012 9 commits
    • Morris Jette's avatar
      Merge branch 'slurm-2.3' · 0b7a56ca
      Morris Jette authored
      Conflicts:
      	NEWS
      0b7a56ca
    • Morris Jette's avatar
    • Morris Jette's avatar
      Improve MPI document formatting · c5436151
      Morris Jette authored
      c5436151
    • Morris Jette's avatar
      Add UPC documentation · 1dcdfba2
      Morris Jette authored
      1dcdfba2
    • Morris Jette's avatar
      06c92c25
    • Morris Jette's avatar
      Update another web pointer to mail archive · e262bd02
      Morris Jette authored
      e262bd02
    • Morris Jette's avatar
      Fix in select/cons_res+topology+job with node range count · cd84134c
      Morris Jette authored
      The problem was conflicting logic in the select/cons_res plugin. Some of the code was trying to give the job the maximum node count in the range, while other logic was trying to minimize the spread of the job across multiple switches. As you note, this problem only happens when a range of node counts is specified and both the select/cons_res plugin and the topology/tree plugin are in use, and even then it is not easy to reproduce (you included all of the details below; a simplified sketch of the selection logic follows this entry).
      
      Quoting Martin.Perry@Bull.com:
      
      > Certain combinations of topology configuration and srun -N option produce
      > spurious job rejection with "Requested node configuration is not
      > available" with select/cons_res. The following example illustrates the
      > problem.
      >
      > [sulu] (slurm) etc> cat slurm.conf
      > ...
      > TopologyPlugin=topology/tree
      > SelectType=select/cons_res
      > SelectTypeParameters=CR_Core
      > ...
      >
      > [sulu] (slurm) etc> cat topology.conf
      > SwitchName=s1 Nodes=xna[13-26]
      > SwitchName=s2 Nodes=xna[41-45]
      > SwitchName=s3 Switches=s[1-2]
      >
      > [sulu] (slurm) etc> sinfo
      > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
      > ...
      > jkob         up   infinite      4   idle xna[14,19-20,41]
      > ...
      >
      > [sulu] (slurm) etc> srun -N 2-4 -n 4 -p jkob hostname
      > srun: Force Terminated job 79
      > srun: error: Unable to allocate resources: Requested node configuration is
      > not available
      >
      > The problem does not occur with select/linear, or topology/none, or if -N
      > is omitted, or for certain other values for -N (for example, -N 4-4 and -N
      > 2-3 work ok). The problem seems to be in function _eval_nodes_topo in
      > src/plugins/select/cons_res/job_test.c. The srun man page states that when
      > -N is used, "the job will be allocated as many nodes as possible within
      > the range specified and without delaying the initiation of the job."
      > Consistent with this description, the requested number of nodes in the
      > above example is 4 (req_nodes=4).  However, the code that selects the
      > best-fit topology switches appears to make the selection based on the
      > minimum required number of nodes (min_nodes=2). It therefore selects
      > switch s1.  s1 has only 3 nodes from partition jkob. Since this is fewer
      > than req_nodes the job is rejected with the "node configuration" error.
      >
      > I'm not sure where the code is going wrong.  It could be in the
      > calculation of the number of needed nodes in function _enough_nodes.  Or
      > it could be in the code that initializes/updates req_nodes or rem_nodes. I
      > don't feel confident that I understand the logic well enough to propose a
      > fix without introducing a regression.
      >
      > Regards,
      > Martin
      cd84134c
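      A minimal, hypothetical C sketch of the failure mode described above. This is not the
      actual _eval_nodes_topo() code; the struct and variable names are made up. It only
      illustrates how a best-fit leaf-switch search keyed on min_nodes can settle on a switch
      that cannot satisfy req_nodes, producing the "node configuration" rejection from the
      report.

      #include <stdio.h>

      struct leaf_switch {
          const char *name;
          int avail_nodes;              /* idle partition nodes reachable under this switch */
      };

      int main(void)
      {
          /* Topology from the example: s1 reaches 3 idle jkob nodes, s2 reaches 1. */
          struct leaf_switch sw[] = { { "s1", 3 }, { "s2", 1 } };
          int min_nodes = 2, req_nodes = 4;     /* srun -N 2-4 */
          int best = -1;

          /* Buggy behavior: the best-fit search is satisfied by min_nodes alone... */
          for (int i = 0; i < 2; i++) {
              if (sw[i].avail_nodes >= min_nodes &&
                  (best < 0 || sw[i].avail_nodes < sw[best].avail_nodes))
                  best = i;
          }
          /* ...but the allocation is later checked against req_nodes (4), which the
           * chosen switch cannot provide, so the job is rejected. */
          if (best >= 0 && sw[best].avail_nodes < req_nodes)
              printf("best-fit switch %s has only %d usable nodes (need %d): "
                     "Requested node configuration is not available\n",
                     sw[best].name, sw[best].avail_nodes, req_nodes);
          return 0;
      }

      Running this prints that s1 offers only 3 of the 4 requested nodes, mirroring the
      rejection of job 79 reported above.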
    • Morris Jette's avatar
      Format change, no change in logic · 92d99010
      Morris Jette authored
      92d99010
    • Morris Jette's avatar
      Use site maximum for optional switch wait time. · 2581fe62
      Morris Jette authored
      When the optional max_time is not specified for --switches=count, the site
      max (SchedulerParameters=max_switch_wait=seconds) is used for the job.
      Based on patch from Rod Schultz.
      2581fe62
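      A minimal C sketch of the fallback this commit describes: when --switches=count is
      given without the optional max_time, the job inherits the site-wide
      SchedulerParameters=max_switch_wait value. The function and parameter names are
      illustrative, not the actual slurmctld code, and the capping of an explicit request
      at the site maximum is an assumption added for completeness.

      #include <stdint.h>
      #include <stdio.h>

      #define WAIT_NOT_SET 0    /* hypothetical sentinel: no max_time given with --switches */

      /* job_wait: seconds requested via --switches=count@max_time;
       * site_max: SchedulerParameters=max_switch_wait in seconds. */
      static uint32_t effective_switch_wait(uint32_t job_wait, uint32_t site_max)
      {
          if (job_wait == WAIT_NOT_SET)
              return site_max;          /* no max_time given: use the site maximum */
          if (job_wait > site_max)
              return site_max;          /* assumed: never exceed the site maximum */
          return job_wait;
      }

      int main(void)
      {
          /* Example: site max_switch_wait=300 seconds; the job gave no max_time. */
          printf("effective wait: %u seconds\n",
                 (unsigned) effective_switch_wait(WAIT_NOT_SET, 300));
          return 0;
      }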
  5. 30 Mar, 2012 3 commits
  6. 29 Mar, 2012 3 commits
    • Mark Nelson's avatar
      Added CrpCPUMins to the output of sshare -l for those using hard limit · d1ae3d81
      Mark Nelson authored
      accounting.  Work contributed by Mark Nelson.
      d1ae3d81
    • Morris Jette's avatar
      Fix in select/cons_res+topology+job with node range count · f64b29a2
      Morris Jette authored
      The problem was conflicting logic in the select/cons_res plugin. Some of the code was trying to give the job the maximum node count in the range, while other logic was trying to minimize the spread of the job across multiple switches. As you note, this problem only happens when a range of node counts is specified and both the select/cons_res plugin and the topology/tree plugin are in use, and even then it is not easy to reproduce (you included all of the details below).
      
      Quoting Martin.Perry@Bull.com:
      
      > Certain combinations of topology configuration and srun -N option produce
      > spurious job rejection with "Requested node configuration is not
      > available" with select/cons_res. The following example illustrates the
      > problem.
      >
      > [sulu] (slurm) etc> cat slurm.conf
      > ...
      > TopologyPlugin=topology/tree
      > SelectType=select/cons_res
      > SelectTypeParameters=CR_Core
      > ...
      >
      > [sulu] (slurm) etc> cat topology.conf
      > SwitchName=s1 Nodes=xna[13-26]
      > SwitchName=s2 Nodes=xna[41-45]
      > SwitchName=s3 Switches=s[1-2]
      >
      > [sulu] (slurm) etc> sinfo
      > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
      > ...
      > jkob         up   infinite      4   idle xna[14,19-20,41]
      > ...
      >
      > [sulu] (slurm) etc> srun -N 2-4 -n 4 -p jkob hostname
      > srun: Force Terminated job 79
      > srun: error: Unable to allocate resources: Requested node configuration is
      > not available
      >
      > The problem does not occur with select/linear, or topology/none, or if -N
      > is omitted, or for certain other values for -N (for example, -N 4-4 and -N
      > 2-3 work ok). The problem seems to be in function _eval_nodes_topo in
      > src/plugins/select/cons_res/job_test.c. The srun man page states that when
      > -N is used, "the job will be allocated as many nodes as possible within
      > the range specified and without delaying the initiation of the job."
      > Consistent with this description, the requested number of nodes in the
      > above example is 4 (req_nodes=4).  However, the code that selects the
      > best-fit topology switches appears to make the selection based on the
      > minimum required number of nodes (min_nodes=2). It therefore selects
      > switch s1.  s1 has only 3 nodes from partition jkob. Since this is fewer
      > than req_nodes the job is rejected with the "node configuration" error.
      >
      > I'm not sure where the code is going wrong.  It could be in the
      > calculation of the number of needed nodes in function _enough_nodes.  Or
      > it could be in the code that initializes/updates req_nodes or rem_nodes. I
      > don't feel confident that I understand the logic well enough to propose a
      > fix without introducing a regression.
      >
      > Regards,
      > Martin
      f64b29a2
    • Morris Jette's avatar
      Format change, no change in logic · ebca432e
      Morris Jette authored
      ebca432e
  7. 28 Mar, 2012 12 commits
    • Danny Auble's avatar
    • Danny Auble's avatar
      8db1b04f
    • Danny Auble's avatar
      Always call slurm_select_fini when slurmctld ends to clean up · c5535b20
      Danny Auble authored
      any underlying infrastructure.
      c5535b20
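      A rough sketch of the intent of this commit, not actual slurmctld code: the shutdown
      path should always invoke slurm_select_fini() so the select plugin can tear down any
      underlying infrastructure it set up. Only the slurm_select_fini() name is taken from
      the commit message; the surrounding helper is hypothetical.

      extern int slurm_select_fini(void);   /* select plugin teardown entry point */

      /* Hypothetical cleanup helper run when slurmctld ends; the real shutdown
       * sequence does much more (state save, agent shutdown, etc.). */
      static void ctld_shutdown_cleanup(void)
      {
          /* ... other teardown ... */
          (void) slurm_select_fini();       /* always called, even on error/exit paths */
      }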
    • Danny Auble's avatar
      BGQ - when calling bridge_status_fini use the rt_mutex in the right spot · a1a5e4b4
      Danny Auble authored
      to avoid deadlock.
      a1a5e4b4
    • Danny Auble's avatar
      06511698
    • Danny Auble's avatar
      BLUEGENE - if a system is doing a clean start and there happens to be · 85173ad0
      Danny Auble authored
      hardware in an error state and jobs running on blocks, this fix makes it so
      new blocks are formed around the bad hardware and the old ones are freed.
      85173ad0
    • Danny Auble's avatar
      smap - remove debug · eca531ab
      Danny Auble authored
      eca531ab
    • Morris Jette's avatar
      Change select/cons_res logic for socket allocations · 0dce9e1c
      Morris Jette authored
      Patch from Martin Perry.
      
      SelectType=select/cons_res
      SelectTypeParameters=CR_Socket
      
      Slurm built with ALLOCATE_FULL_SOCKET = 1
      
      Node n8 has the following layout:
      Socket 0: CPUs 0-3
      Socket 1: CPUs 4-7
      
      Without fix to _allocate_sockets (incorrect allocation for -c values of 3, 5, 6, and 7):
      
      [sulu] (slurm) etc> srun -c1 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=4-7 Mem=0
      [sulu] (slurm) etc> srun -c2 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=4-7 Mem=0
      [sulu] (slurm) etc> srun -c3 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-3 Mem=0
      [sulu] (slurm) etc> srun -c4 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=4-7 Mem=0
      [sulu] (slurm) etc> srun -c5 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-4 Mem=0
      [sulu] (slurm) etc> srun -c6 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-5 Mem=0
      [sulu] (slurm) etc> srun -c7 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-6 Mem=0
      [sulu] (slurm) etc> srun -c8 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-7 Mem=0
      
      With fix to _allocate_sockets (allocation appears correct for all values of -c):
      
      [sulu] (slurm) etc> srun -c1 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=4-7 Mem=0
      [sulu] (slurm) etc> srun -c2 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=4-7 Mem=0
      [sulu] (slurm) etc> srun -c3 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=4-7 Mem=0
      [sulu] (slurm) etc> srun -c4 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=4-7 Mem=0
      [sulu] (slurm) etc> srun -c5 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-7 Mem=0
      [sulu] (slurm) etc> srun -c6 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-7 Mem=0
      [sulu] (slurm) etc> srun -c7 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-7 Mem=0
      [sulu] (slurm) etc> srun -c8 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
           Nodes=n8 CPU_IDs=0-7 Mem=0
      0dce9e1c
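      A simplified sketch (not the real _allocate_sockets() logic) of the behavior the
      "with fix" output above shows: with SelectTypeParameters=CR_Socket and
      ALLOCATE_FULL_SOCKET, a -c CPU request is rounded up to whole sockets, so on node n8
      (2 sockets x 4 CPUs) values -c1 through -c4 get one full socket and -c5 through -c8
      get both.

      #include <stdio.h>

      /* Round a per-task CPU request up to a whole number of sockets. */
      static int sockets_needed(int cpus_requested, int cpus_per_socket)
      {
          return (cpus_requested + cpus_per_socket - 1) / cpus_per_socket;
      }

      int main(void)
      {
          const int cpus_per_socket = 4;    /* node n8: 2 sockets x 4 CPUs each */

          for (int c = 1; c <= 8; c++) {
              int socks = sockets_needed(c, cpus_per_socket);
              printf("-c%d -> %d socket(s), %d CPUs allocated\n",
                     c, socks, socks * cpus_per_socket);
          }
          return 0;
      }

      The printed CPU totals (4 for -c1 through -c4, 8 for -c5 through -c8) match the
      CPU_IDs ranges in the corrected output above.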
    • Morris Jette's avatar
      Move mailing list pointer to gmane.org · 728a3600
      Morris Jette authored
      728a3600
    • Morris Jette's avatar
      Fix for bad malloc size in gres/gpu logic · fb12314d
      Morris Jette authored
      fb12314d
    • Morris Jette's avatar
      In GRES logic, validate that the CPU count matches the node configuration · 45cc1422
      Morris Jette authored
      Without this change, an assert can occur when operating on bitmaps of different sizes.
      45cc1422
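      An illustrative sketch of the guard this commit describes: confirm that a GRES CPU
      bitmap covers the same number of CPUs as the node's configuration before combining
      bitmaps, instead of letting a size mismatch trip an assert later. The type and
      function names are made up for the example and are not SLURM's bitstring API.

      #include <stdbool.h>
      #include <stdio.h>

      struct cpu_bitmap {
          unsigned int nbits;           /* number of CPUs the bitmap covers */
          unsigned long long bits;      /* one bit per CPU (toy representation) */
      };

      /* Return false (and log) if the gres.conf CPU map does not match the node. */
      static bool gres_cpu_cnt_valid(const struct cpu_bitmap *gres_cpus,
                                     unsigned int node_cpu_cnt)
      {
          if (gres_cpus->nbits != node_cpu_cnt) {
              fprintf(stderr, "GRES CPU count (%u) does not match node "
                      "configuration (%u CPUs); skipping bitmap operation\n",
                      gres_cpus->nbits, node_cpu_cnt);
              return false;
          }
          return true;
      }

      int main(void)
      {
          struct cpu_bitmap gpu_cpus = { .nbits = 8, .bits = 0x0f };
          return gres_cpu_cnt_valid(&gpu_cpus, 16) ? 0 : 1;  /* mismatch: 8 vs 16 */
      }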
    • Morris Jette's avatar
      Correct typo in log message · 45d80575
      Morris Jette authored
      45d80575