SLURM, using the default node allocation plugin, allocates nodes to jobs in
exclusive mode, which means that even when a job does not utilize all of the
resources within a node, another job will not have access to those resources.
Nodes possess resources such as processors, memory, swap, local
disk, etc., and jobs consume these resources. The exclusive-use default policy
in SLURM can result in inefficient utilization of the cluster and of its nodes'
resources.
A plugin supporting CPUs as a consumable resource is available in
SLURM 0.5.0 and newer versions of SLURM. Information on how to use
this plugin is described below.
Using CPU Consumable Resource Node Allocation Plugin
- This plugin is available in SLURM 0.5.0 and newer versions of SLURM.
- The consumable resource plugin can be enabled by defining
SelectType in slurm.conf (e.g. SelectType=select/cons_res); see the example
configuration fragment after this list.
- The select/cons_res plugin is enabled or disabled cluster-wide.
- Partitions labeled as SHARED=Yes and SHARED=FORCE do not
make sense in connection with consumable resource support. Consumable
resource support only makes sense for SHARED=No. We have chosen to set
SHARED to No within the SLURM code if the select/cons_res plugin is
enabled. In cases where the select/cons_res plugin is not
enabled, the normal SLURM behaviors are not disrupted.
The only change users will see when using the select/cons_res plugin
is that jobs can be co-scheduled
on nodes when CPU resources permit it. The rest of SLURM, such as srun and
its switches, is not affected by this plugin. From a user's point of
view, SLURM works the same way as when using the default node selection scheme.
- We have introduced a new switch, --exclusive, which allows users to reserve and use nodes in
exclusive mode even when consumable resources are enabled. See "man srun" for details.
- SLURM's default select/linear plugin uses a best-fit algorithm based on the
number of consecutive nodes. We have chosen the same node allocation
approach for consistency.
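For example, a minimal slurm.conf sketch enabling the plugin might look like the
following; the node and partition definitions are illustrative only, describing the
small cluster used in the example below:
SelectType=select/cons_res
NodeName=linux[01-03] Procs=2
NodeName=linux04 Procs=4
PartitionName=lsf Nodes=linux[01-04] Default=YES Shared=NO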
Example of Node Allocations Using Consumable Resource Plugin
The following example illustrates the different ways four jobs
are allocated across a cluster using (1) SLURM's default allocation
(exclusive mode) and (2) the processor-as-consumable-resource
approach.
It is important to understand that the example listed below is a
contrived example and is only given here to illustrate the use of CPUs as
consumable resources. Job 2 and Job 3 call for the node count to equal
the processor count. This would typically be done because
one task per node requires all of the memory, disk space, etc.; the
bottleneck would not be processor count.
Trying to execute more than one job per node will almost certainly severely
impact a parallel job's performance.
The biggest beneficiaries of CPUs as consumable resources will be serial jobs and
jobs with modest parallelism, which can effectively share resources. On many
systems with larger processor counts, jobs typically run with one fewer task than
there are processors to minimize interference from the kernel and daemons.
The example cluster is composed of 4 nodes (10 CPUs in total):
- linux01 (with 2 processors),
- linux02 (with 2 processors),
- linux03 (with 2 processors), and
- linux04 (with 4 processors).
The four jobs are the following:
- [2] srun -n 4 -N 4 sleep 120 &
- [3] srun -n 3 -N 3 sleep 120 &
- [4] srun -n 1 sleep 120 &
- [5] srun -n 3 sleep 120 &
The user launches them in the same order as listed above.
Using SLURM's Default Node Allocation (Non-shared Mode)
The four jobs have been launched and 3 of the jobs are now
pending, waiting for resources to be allocated to them. Only Job 2 is running,
since it uses one CPU on each of the 4 nodes. This means that linux01 through linux03 each
have one idle CPU and linux04 has 3 idle CPUs.
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root PD 0:00 3 (Resources)
4 lsf sleep root PD 0:00 1 (Resources)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:14 4 linux[01-04]
Once Job 2 is finished, Job 3 is scheduled and runs on
linux01, linux02, and linux03. Job 3 uses only one CPU on each of the 3
nodes. Job 4 can be allocated onto the remaining idle node (linux04), so Job 3
and Job 4 can run concurrently on the cluster.
Job 5 has to wait for idle nodes before it can run.
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
3 lsf sleep root R 0:11 3 linux[01-03]
4 lsf sleep root R 0:11 1 linux04
Once Job 3 finishes, Job 5 is allocated resources and can run.
The advantage of the exclusive-mode scheduling policy is
that a job gets all the resources of its assigned nodes for optimal
parallel performance. The drawback is
that a job can tie up a large amount of resources that it does not use and which
cannot be shared with other jobs.
Using a Processor Consumable Resource Approach
The output of squeue below shows that we
have 3 out of the 4 jobs allocated and running. This is an increase of two running
jobs over the default SLURM approach.
Job 2 is running on nodes linux01
through linux04. Job 2's allocation is the same as with SLURM's default allocation:
it uses one CPU on each of the 4 nodes. Once Job 2 is scheduled
and running, nodes linux01, linux02, and linux03 still have one idle CPU each
and node linux04 has 3 idle CPUs. The main difference between this approach and
the exclusive-mode approach described above is that idle CPUs within a node
can now be assigned to other jobs.
It is important to note that
assigned does not mean oversubscribed. The consumable resource approach
tracks how much of each available resource (in our case, CPUs) must be dedicated
to a given job. This allows us to prevent per-node oversubscription of
resources (CPUs).
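For illustration, once Job 2 is running, the per-node CPU accounting looks like this:
NODE CPUS ALLOCATED IDLE
linux01 2 1 1
linux02 2 1 1
linux03 2 1 1
linux04 4 1 3
Further jobs can only be assigned to the idle CPUs, so no node is ever asked to run
more tasks than it has CPUs.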
Once Job 2 is running, Job 3 is
scheduled onto nodes linux01, linux02, and linux03 (using one CPU on each of the
nodes), and Job 4 is scheduled onto one of the remaining idle CPUs on linux04.
Job 2, Job 3, and Job 4 are now running concurrently on the cluster.
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:13 4 linux[01-04]
3 lsf sleep root R 0:09 3 linux[01-03]
4 lsf sleep root R 0:05 1 linux04
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* allocated 4 3813 1 1 (null) none
Once Job 2 finishes, Job 5, which was pending, is allocated the available resources and starts
running, as illustrated below:
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root R 1:58 3 linux[01-03]
4 lsf sleep root R 1:54 1 linux04
5 lsf sleep root R 0:02 3 linux[01-03]
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* idle 4 3813 1 1 (null) none
Job 3, Job 4, and Job 5 ran concurrently on the cluster for a period of time. (By the
time the sinfo command above was issued, Job 4 had presumably already completed, which
is why linux04 is reported as idle.) A short while later, squeue reports:
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root R 1:52 3 linux[01-03]
Job 3 and Job 4 have finished and Job 5 is still running on nodes linux[01-03].
The advantage of the consumable resource scheduling policy
is that job throughput can increase dramatically. The overall
throughput and productivity of the cluster increase, reducing the amount of
time users have to wait for their jobs to complete and increasing the
overall efficiency of use of the cluster. The drawback is that users do not
have entire nodes dedicated to their jobs, since they have to share nodes with
other jobs if they do not use all of the resources on those nodes.
We have added an "--exclusive" switch to srun, which allows users
to specify that they would like their allocated
nodes to be in exclusive mode. For more information see "man srun".
The reason for this is to accommodate users with MPI, threaded, or OpenMP
programs that will take advantage of all the CPUs within a node but need
only one MPI process per node.
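For example, the following sketch of a command line (the program name my_hybrid_app
is a placeholder) requests 4 nodes in exclusive mode and launches one task per node,
even with the consumable resource plugin enabled:
# srun -N 4 -n 4 --exclusive my_hybrid_app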