SLURM: A Highly Scalable Resource Manager for Linux Clusters
Lawrence Livermore National Laboratory (LLNL)
and Linux Networx
are designing and developing SLURM, the Simple Linux Utility for
Resource Management.
SLURM provides three key functions:
First, it allocates exclusive and/or non-exclusive access to
resources (compute nodes) to users for some duration of time
so they can perform work.
Second, it provides a framework for starting, executing, and
monitoring work (typically a parallel job) on a set of allocated
nodes.
Finally, it arbitrates conflicting requests for resources by
managing a queue of pending work.
SLURM is not a sophisticated batch system, but it does provide
an Application Programming Interface (API) for integration
with external schedulers such as the Maui Scheduler.
While other resource managers do exist, SLURM is unique in
several respects:
- Its source code is freely available under the GNU General
Public License.
- It is designed to operate in a heterogeneous cluster with
up to thousands of nodes.
- It is portable: written in C with a GNU autoconf configuration
engine. While initially written for Linux, porting to other UNIX-like
operating systems should be straightforward. Quadrics Elan3 is the
first interconnect supported, and support for other interconnects is planned.
- SLURM is highly tolerant of system failures including failure
of the node executing its control functions.
- It is simple enough for the motivated end user to understand
its source and add functionality.
Architecture
SLURM has a centralized manager, slurmctld, to monitor
resources and work.
There may also be a backup manager to assume those responsibilities
in the event of failure.
Each compute server (node) has a slurmd daemon, which can be
compared to a remote shell: it waits for work, executes that work,
returns status, and waits for more work.
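As a rough illustration of that remote-shell-like cycle, the sketch below
(hypothetical C, not actual slurmd source; wait_for_work and return_status
are illustrative stand-ins for the daemon's real message handling) executes
one requested command and reports its exit status.

/* Hypothetical sketch of slurmd's "wait for work, execute it, return
 * status" cycle; not taken from the SLURM source. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

static char *wait_for_work(void)        /* stand-in: block until work arrives */
{
    return strdup("/bin/hostname");
}

static void return_status(int status)   /* stand-in: report back to slurmctld */
{
    printf("work exited with status %d\n", status);
}

int main(void)
{
    /* A real daemon repeats this cycle forever; one iteration shown. */
    char *cmd = wait_for_work();
    pid_t pid = fork();
    if (pid == 0) {                      /* child: execute the requested work */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                      /* exec failed */
    }
    int status = 0;
    waitpid(pid, &status, 0);            /* parent: wait for completion */
    return_status(WEXITSTATUS(status));
    free(cmd);
    return 0;
}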
User tools include srun to initiate jobs,
scancel to terminate queued or running jobs,
sinfo to report system status, and
squeue to report the status of jobs.
There is also an administrative tool scontrol available to
monitor and/or modify configuration and state information.
APIs are available for all functions.
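As a hedged sketch of the kind of integration those APIs allow, the C
fragment below shows an external scheduler examining the pending-work queue
and starting the highest-priority job. Every type and function name here is
hypothetical, chosen for illustration only, and is not the actual SLURM API.

/* Hypothetical sketch of an external scheduler driving SLURM through an
 * API; all names below are illustrative, not the real interface. */
#include <stdio.h>

typedef struct {
    unsigned job_id;
    unsigned nodes_requested;
    unsigned priority;
} pending_job_t;

/* Dummy stand-ins so the sketch runs standalone. */
static int query_pending_jobs(pending_job_t *jobs, int max_jobs)
{
    (void)max_jobs;
    jobs[0] = (pending_job_t){ 101, 4, 10 };
    jobs[1] = (pending_job_t){ 102, 2, 30 };
    return 2;
}

static int start_job(unsigned job_id)
{
    printf("asking the controller to start job %u\n", job_id);
    return 0;
}

int main(void)
{
    pending_job_t jobs[128];
    int n = query_pending_jobs(jobs, 128);

    /* Trivial policy: run the highest-priority pending job. A scheduler
     * such as Maui would apply far richer policy at this point. */
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || jobs[i].priority > jobs[best].priority)
            best = i;

    if (best >= 0)
        start_job(jobs[best].job_id);
    return 0;
}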
Security is supported through Pluggable Authentication
Modules (PAM), presently interfaced with authd
to provide flexible user authentication.
Configurability
Node state monitored includes: count of processors, size of real memory,
size of temporary disk space, and state (UP, DOWN, etc.).
Additional node information includes weight (preference in being allocated
work) and features (arbitrary information such as processor speed or type).
Nodes are grouped into disjoint partitions.
Partition information includes: name, list of associated nodes,
state (UP or DOWN), maximum job time limit, maximum node count per job,
group access list, and shared node access (YES, NO or FORCE).
Bit maps are used to represent nodes, so scheduling decisions can be made
by performing a small number of comparisons and a series of fast bit map
manipulations.
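A minimal sketch of that idea follows (illustrative C, assuming 64-bit words
with one bit per node; not the SLURM source): intersecting a partition's node
bit map with the set of idle nodes and counting the result is enough to decide
whether a request can be satisfied.

/* Illustrative bit map node selection; one bit per node, not actual
 * SLURM code. */
#include <stdint.h>
#include <stdio.h>

#define NWORDS 2                         /* room for 128 nodes */
typedef struct { uint64_t w[NWORDS]; } node_bitmap_t;

/* Candidate nodes = nodes in the partition AND nodes currently idle. */
static node_bitmap_t bitmap_and(const node_bitmap_t *a, const node_bitmap_t *b)
{
    node_bitmap_t r;
    for (int i = 0; i < NWORDS; i++)
        r.w[i] = a->w[i] & b->w[i];
    return r;
}

/* Count how many nodes (set bits) a bit map contains. */
static int bitmap_count(const node_bitmap_t *b)
{
    int n = 0;
    for (int i = 0; i < NWORDS; i++)
        for (uint64_t v = b->w[i]; v; v &= v - 1)
            n++;
    return n;
}

int main(void)
{
    node_bitmap_t partition = { { 0xFFULL, 0 } };   /* nodes 0-7 in partition */
    node_bitmap_t idle      = { { 0xF0ULL, 0 } };   /* nodes 4-7 idle */
    int needed = 3;

    node_bitmap_t candidates = bitmap_and(&partition, &idle);
    if (bitmap_count(&candidates) >= needed)
        printf("request for %d nodes can be satisfied\n", needed);
    else
        printf("insufficient idle nodes; the job must wait\n");
    return 0;
}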
A sample (partial) SLURM configuration file follows.
#
# Sample /etc/slurm.conf
#
ControlMachine=linux0001
BackupController=linux0002
Epilog=/usr/local/slurm/epilog Prolog=/usr/local/slurm/prolog
SlurmctldPort=7002 SlurmdPort=7003
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096
Status
As of December 2002, most SLURM functionality was in place.
Execution of a simple program (/bin/hostname) across 1900
tasks on 950 nodes could be completed in under five seconds.
Additional work remains for production use in the areas of fault-tolerance,
TotalView debugger support,
performance enhancements and security.
We plan to have a beta-test version of SLURM running on
LLNL development platforms in January 2003 and deployed on
production systems in March 2003.
Our next goal will be the support of the
IBM BlueGene/L
architecture in the summer of 2003.
For additional information please contact
jette1@llnl.gov.