- 03 Apr, 2012 6 commits
-
Morris Jette authored
-
Morris Jette authored
Add documentation for the mpi/pmi2 plugin. Minor changes to code formatting and logic, but old code should work fine.
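A usage sketch for the plugin described above (the executable name and task count are illustrative; a PMI2-capable MPI library is assumed to be installed):
    srun --mpi=pmi2 -n 4 ./my_mpi_app
Setting MpiDefault=pmi2 in slurm.conf would make the plugin the default rather than selecting it per job.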
-
Morris Jette authored
No change in logic
-
Hongjia Cao authored
-
Morris Jette authored
-
Morris Jette authored
Add support for a new SchedulerParameters option, max_depend_depth, defining the maximum number of jobs to test for circular dependencies (i.e. job A waits for job B to start and job B waits for job A to start). The default value is 10 jobs.
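A minimal slurm.conf sketch using the parameter named above (the value shown is simply the stated default):
    SchedulerParameters=max_depend_depth=10
With this in place, the scheduler follows at most 10 jobs through a dependency chain when checking whether the chain loops back on itself.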
-
- 02 Apr, 2012 9 commits
-
Morris Jette authored
Conflicts: NEWS
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
The problem was conflicting logic in the select/cons_res plugin. Some of the code was trying to give the job the maximum node count in the range, while other logic was trying to minimize spreading the job across multiple switches. As you note, this problem only happens when a range of node counts is specified and both the select/cons_res and topology/tree plugins are in use, and even then it is not easy to reproduce (you included all of the details below). Quoting Martin.Perry@Bull.com:
> Certain combinations of topology configuration and srun -N option produce
> spurious job rejection with "Requested node configuration is not
> available" with select/cons_res. The following example illustrates the
> problem.
>
> [sulu] (slurm) etc> cat slurm.conf
> ...
> TopologyPlugin=topology/tree
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> ...
>
> [sulu] (slurm) etc> cat topology.conf
> SwitchName=s1 Nodes=xna[13-26]
> SwitchName=s2 Nodes=xna[41-45]
> SwitchName=s3 Switches=s[1-2]
>
> [sulu] (slurm) etc> sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> ...
> jkob up infinite 4 idle xna[14,19-20,41]
> ...
>
> [sulu] (slurm) etc> srun -N 2-4 -n 4 -p jkob hostname
> srun: Force Terminated job 79
> srun: error: Unable to allocate resources: Requested node configuration is
> not available
>
> The problem does not occur with select/linear, or topology/none, or if -N
> is omitted, or for certain other values for -N (for example, -N 4-4 and -N
> 2-3 work ok). The problem seems to be in function _eval_nodes_topo in
> src/plugins/select/cons_res/job_test.c. The srun man page states that when
> -N is used, "the job will be allocated as many nodes as possible within
> the range specified and without delaying the initiation of the job."
> Consistent with this description, the requested number of nodes in the
> above example is 4 (req_nodes=4). However, the code that selects the
> best-fit topology switches appears to make the selection based on the
> minimum required number of nodes (min_nodes=2). It therefore selects
> switch s1. s1 has only 3 nodes from partition jkob. Since this is fewer
> than req_nodes the job is rejected with the "node configuration" error.
>
> I'm not sure where the code is going wrong. It could be in the
> calculation of the number of needed nodes in function _enough_nodes. Or
> it could be in the code that initializes/updates req_nodes or rem_nodes. I
> don't feel confident that I understand the logic well enough to propose a
> fix without introducing a regression.
>
> Regards,
> Martin
-
Morris Jette authored
-
Morris Jette authored
When the optional max_time is not specified for --switches=count, the site max (SchedulerParameters=max_switch_wait=seconds) is used for the job. Based on patch from Rod Schultz.
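A hedged sketch of the two pieces involved (the counts and times are illustrative):
    # Per job: request at most 2 switches, waiting up to 60 minutes for them:
    srun --switches=2@60 -N 4 hostname
    # Site-wide cap, applied when no @max_time is given, set in slurm.conf:
    SchedulerParameters=max_switch_wait=86400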
-
- 30 Mar, 2012 3 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 29 Mar, 2012 3 commits
-
Mark Nelson authored
accounting. Work contributed by Mark Nelson.
-
Morris Jette authored
The problem was conflicting logic in the select/cons_res plugin. Some of the code was trying to give the job the maximum node count in the range, while other logic was trying to minimize spreading the job across multiple switches. As you note, this problem only happens when a range of node counts is specified and both the select/cons_res and topology/tree plugins are in use, and even then it is not easy to reproduce (you included all of the details below). Quoting Martin.Perry@Bull.com:
> Certain combinations of topology configuration and srun -N option produce
> spurious job rejection with "Requested node configuration is not
> available" with select/cons_res. The following example illustrates the
> problem.
>
> [sulu] (slurm) etc> cat slurm.conf
> ...
> TopologyPlugin=topology/tree
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> ...
>
> [sulu] (slurm) etc> cat topology.conf
> SwitchName=s1 Nodes=xna[13-26]
> SwitchName=s2 Nodes=xna[41-45]
> SwitchName=s3 Switches=s[1-2]
>
> [sulu] (slurm) etc> sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> ...
> jkob up infinite 4 idle xna[14,19-20,41]
> ...
>
> [sulu] (slurm) etc> srun -N 2-4 -n 4 -p jkob hostname
> srun: Force Terminated job 79
> srun: error: Unable to allocate resources: Requested node configuration is
> not available
>
> The problem does not occur with select/linear, or topology/none, or if -N
> is omitted, or for certain other values for -N (for example, -N 4-4 and -N
> 2-3 work ok). The problem seems to be in function _eval_nodes_topo in
> src/plugins/select/cons_res/job_test.c. The srun man page states that when
> -N is used, "the job will be allocated as many nodes as possible within
> the range specified and without delaying the initiation of the job."
> Consistent with this description, the requested number of nodes in the
> above example is 4 (req_nodes=4). However, the code that selects the
> best-fit topology switches appears to make the selection based on the
> minimum required number of nodes (min_nodes=2). It therefore selects
> switch s1. s1 has only 3 nodes from partition jkob. Since this is fewer
> than req_nodes the job is rejected with the "node configuration" error.
>
> I'm not sure where the code is going wrong. It could be in the
> calculation of the number of needed nodes in function _enough_nodes. Or
> it could be in the code that initializes/updates req_nodes or rem_nodes. I
> don't feel confident that I understand the logic well enough to propose a
> fix without introducing a regression.
>
> Regards,
> Martin
-
Morris Jette authored
-
- 28 Mar, 2012 19 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
any underlying infrastructure.
-
Danny Auble authored
to avoid deadlock.
-
Danny Auble authored
ionode_str
-
Danny Auble authored
hardware in error and jobs running on those blocks, this fix makes it so new blocks are formed around the bad hardware and the old ones are freed.
-
Danny Auble authored
-
Morris Jette authored
Patch from Martin Perry.

SelectType=select/cons_res
SelectTypeParameters=CR_Socket
Slurm built with ALLOCATE_FULL_SOCKET = 1

Node n8 has the following layout:
Socket 0: CPUs 0-3
Socket 1: CPUs 4-7

Without fix to _allocate_sockets (incorrect allocation for -c values of 3, 5, 6, and 7):

[sulu] (slurm) etc> srun -c1 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=4-7 Mem=0
[sulu] (slurm) etc> srun -c2 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=4-7 Mem=0
[sulu] (slurm) etc> srun -c3 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-3 Mem=0
[sulu] (slurm) etc> srun -c4 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=4-7 Mem=0
[sulu] (slurm) etc> srun -c5 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-4 Mem=0
[sulu] (slurm) etc> srun -c6 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-5 Mem=0
[sulu] (slurm) etc> srun -c7 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-6 Mem=0
[sulu] (slurm) etc> srun -c8 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-7 Mem=0

With fix to _allocate_sockets (allocation appears correct for all values of -c):

[sulu] (slurm) etc> srun -c1 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=4-7 Mem=0
[sulu] (slurm) etc> srun -c2 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=4-7 Mem=0
[sulu] (slurm) etc> srun -c3 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=4-7 Mem=0
[sulu] (slurm) etc> srun -c4 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=4-7 Mem=0
[sulu] (slurm) etc> srun -c5 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-7 Mem=0
[sulu] (slurm) etc> srun -c6 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-7 Mem=0
[sulu] (slurm) etc> srun -c7 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-7 Mem=0
[sulu] (slurm) etc> srun -c8 -m block:block --jobid 1 scontrol --details show job 1 | grep CPU_ID
Nodes=n8 CPU_IDs=0-7 Mem=0
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Without this change, an assert can occur when operating on bitmaps of different sizes.
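For illustration only, a self-contained sketch (not SLURM's actual bitstring code, which lives in src/common/bitstring.[ch]) of the kind of size assertion such bitmap operations rely on:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical bitmap type, for illustration. */
    typedef struct {
        size_t    nbits;
        uint64_t *words;   /* (nbits + 63) / 64 words long */
    } bitmap_t;

    /* AND src into dst. Both maps must describe the same number of
     * bits; passing maps of different sizes fails the assertion,
     * which is the class of failure the commit message refers to. */
    static void bitmap_and(bitmap_t *dst, const bitmap_t *src)
    {
        assert(dst->nbits == src->nbits);
        for (size_t i = 0; i < (dst->nbits + 63) / 64; i++)
            dst->words[i] &= src->words[i];
    }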
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-