SLURM: A Highly Scalable Resource Manager for Linux Clusters

Lawrence Livermore National Laboratory (LLNL) and Linux Networx are designing and developing SLURM, the Simple Linux Utility for Resource Management. SLURM provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. SLURM is not a sophisticated batch system, but it does provide an Application Programming Interface (API) for integration with external schedulers such as the Maui Scheduler. While other resource managers do exist, SLURM is unique in several respects, notably its simplicity, scalability, portability, and distribution as open source.
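
As a sketch of that scheduler-facing API, the short C program below reads the queue of work from slurmctld, roughly what an external scheduler would do before making placement decisions. It relies on functions published in slurm.h (slurm_load_jobs, slurm_print_job_info_msg, slurm_free_job_info_msg); exact signatures may differ between SLURM releases, so treat this as an illustration rather than a reference.

/*
 * sched_peek.c - illustrative sketch only: how an external scheduler
 * might read SLURM's job queue through the C API.  Function names
 * follow slurm.h, but exact signatures can vary between releases.
 *
 * Assumed build line:  gcc sched_peek.c -o sched_peek -lslurm
 */
#include <stdio.h>
#include <time.h>
#include <slurm/slurm.h>
#include <slurm/slurm_errno.h>

int main(void)
{
    job_info_msg_t *jobs = NULL;

    /* Request the current job table from slurmctld (0 = no cached copy). */
    if (slurm_load_jobs((time_t) 0, &jobs, 0) != SLURM_SUCCESS) {
        slurm_perror("slurm_load_jobs");
        return 1;
    }

    printf("%u jobs known to slurmctld\n", jobs->record_count);

    /* One-line summary per job, much as squeue reports. */
    slurm_print_job_info_msg(stdout, jobs, 1);

    slurm_free_job_info_msg(jobs);
    return 0;
}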

Architecture

SLURM has a centralized manager, slurmctld, to monitor resources and work. There may also be a backup manager to assume those responsibilities in the event of failure. Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. User tools include srun to initiate jobs, scancel to terminate queued or running jobs, and squeue to report the status of jobs. An administrative tool, scontrol, is also available to monitor and/or modify configuration and state information. APIs are available for all functions. Security is supported through Pluggable Authentication Modules (PAM), presently interfaced with authd to provide flexible user authentication.
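
Because APIs exist for all functions, the same state the user tools report can also be read programmatically. The C sketch below pulls the node and partition tables from slurmctld, much as scontrol and squeue do internally. It assumes the slurm_load_node, slurm_load_partitions, and related print/free calls from slurm.h; signatures may vary by release, so this is a sketch under that assumption.

/*
 * cluster_state.c - minimal sketch of reading node and partition state
 * through the SLURM API.  Not a reference implementation; names follow
 * slurm.h but exact signatures may differ between releases.
 *
 * Assumed build line:  gcc cluster_state.c -o cluster_state -lslurm
 */
#include <stdio.h>
#include <time.h>
#include <slurm/slurm.h>
#include <slurm/slurm_errno.h>

int main(void)
{
    node_info_msg_t      *nodes = NULL;
    partition_info_msg_t *parts = NULL;

    /* Pull the node table from slurmctld (update_time 0 = full dump). */
    if (slurm_load_node((time_t) 0, &nodes, 0) != SLURM_SUCCESS) {
        slurm_perror("slurm_load_node");
        return 1;
    }

    /* Pull the partition table as well. */
    if (slurm_load_partitions((time_t) 0, &parts, 0) != SLURM_SUCCESS) {
        slurm_perror("slurm_load_partitions");
        slurm_free_node_info_msg(nodes);
        return 1;
    }

    printf("nodes: %u, partitions: %u\n",
           nodes->record_count, parts->record_count);

    /* One-line-per-record summaries, as the command-line tools print. */
    slurm_print_node_info_msg(stdout, nodes, 1);
    slurm_print_partition_info_msg(stdout, parts, 1);

    slurm_free_node_info_msg(nodes);
    slurm_free_partition_info_msg(parts);
    return 0;
}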

Configurability

Node state monitored includes: count of processors, size of real memory, size of temporary disk space, and state (UP, DOWN, etc.). Additional node information includes weight (preference in being allocated work) and features (arbitrary information such as processor speed or type). Nodes are grouped into disjoint partitions. Partition information includes: name, list of associated nodes, state (UP or DOWN), maximum job time limit, maximum node count per job, group access list, and shared node access (YES, NO or FORCE). Bit maps are used to represent nodes, so scheduling decisions can be made by performing a small number of comparisons and a series of fast bit map manipulations (a brief illustration of this approach appears after the configuration file). A sample (partial) SLURM configuration file follows.
# 
# Sample /etc/slurm.conf
#
ControlMachine=linux0001.llnl.gov
BackupController=linux0002.llnl.gov
Epilog=/usr/local/slurm/epilog
Prolog=/usr/local/slurm/prolog
SlurmctldPort=7002
SlurmdPort=7003
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP    Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096
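
The bit map approach mentioned above can be illustrated with a few lines of plain C (this is not SLURM source code): each partition and each node state is represented as a bit map indexed by node number, so a question such as "which idle nodes belong to the batch partition?" reduces to word-wide AND operations followed by a population count. The node numbers below follow the sample configuration (batch partition on lx[0041-9999]); the idle set is hypothetical.

/*
 * Illustrative sketch of bit map based node selection.
 * Requires GCC/Clang for __builtin_popcountll.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 9999
#define WORDS ((MAX_NODES + 64) / 64)

typedef struct { uint64_t w[WORDS]; } bitmap_t;

static void bit_set(bitmap_t *b, int n)
{
    b->w[n / 64] |= (uint64_t) 1 << (n % 64);
}

/* dst = a AND b: one pass of word-wide ANDs, independent of node count. */
static void bit_and(bitmap_t *dst, const bitmap_t *a, const bitmap_t *b)
{
    for (int i = 0; i < WORDS; i++)
        dst->w[i] = a->w[i] & b->w[i];
}

static int bit_count(const bitmap_t *b)
{
    int count = 0;
    for (int i = 0; i < WORDS; i++)
        count += __builtin_popcountll(b->w[i]);
    return count;
}

int main(void)
{
    bitmap_t batch = {{0}}, idle = {{0}}, candidates = {{0}};

    /* Batch partition owns lx0041-lx9999 (per the sample configuration);
     * assume, for illustration, that nodes lx0041-lx0056 are idle. */
    for (int n = 41; n <= 9999; n++)
        bit_set(&batch, n);
    for (int n = 41; n <= 56; n++)
        bit_set(&idle, n);

    bit_and(&candidates, &batch, &idle);
    printf("idle nodes available to the batch partition: %d\n",
           bit_count(&candidates));
    return 0;
}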

Status

As of August 2002, basic SLURM functionality is available, although much work remains before production use in the areas of fault tolerance, Quadrics Elan3 integration, and security. We plan to have these issues fully addressed by November 2002 and to have the system deployed in a production environment. Our next goal will be support for the IBM Blue Gene/L architecture in the summer of 2003. For additional information, please contact jette@llnl.gov.

UCRL-WEB-149399

Last Modified July 30, 2002

Maintained by Moe Jette jette1@llnl.gov