SLURM is an open-source resource manager designed for Linux clusters of all sizes. It was developed jointly by Lawrence Livermore National Laboratory (LLNL) and Linux NetworX. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work.
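The three functions can be illustrated with a toy model. This is only a sketch under simple assumptions (strict first-in, first-out ordering and whole-node allocation); the class and method names are hypothetical and do not reflect SLURM's actual data structures or scheduling logic.

```python
from collections import deque


class ToyResourceManager:
    """Toy model of allocate / run / queue; not SLURM's actual algorithm."""

    def __init__(self, nodes):
        self.idle = list(nodes)      # nodes not currently allocated
        self.pending = deque()       # FIFO queue of (job_id, node_count)
        self.running = {}            # job_id -> list of allocated nodes

    def submit(self, job_id, node_count):
        """Queue a request, then start whatever fits."""
        self.pending.append((job_id, node_count))
        self._schedule()

    def finish(self, job_id):
        """Release a job's nodes and retry the pending queue."""
        self.idle.extend(self.running.pop(job_id))
        self._schedule()

    def _schedule(self):
        # Strict FIFO (an assumption): stop at the first job that does not fit.
        while self.pending and self.pending[0][1] <= len(self.idle):
            job_id, node_count = self.pending.popleft()
            self.running[job_id] = [self.idle.pop(0) for _ in range(node_count)]


rm = ToyResourceManager(["lx0003", "lx0004", "lx0005"])
rm.submit("job1", 2)   # starts at once on two idle nodes
rm.submit("job2", 2)   # only one node is idle, so it waits in the queue
rm.finish("job1")      # frees two nodes; job2 can now start
```

In this sketch the conflict between job1 and job2 is resolved purely by arrival order; an external scheduler could apply a different policy to the same pending queue.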
SLURM is not a sophisticated batch system, but it does provide an Application Programming Interface (API) for integration with external schedulers such as The Maui Scheduler. While other resource managers do exist, SLURM is unique in several respects:
SLURM has a centralized manager, slurmctld, to monitor resources and work. There may also be a backup manager to assume those responsibilities in the event of failure. Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. User tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, and squeue to report the status of jobs. An administrative tool, scontrol, is also available to monitor and/or modify configuration and state information. APIs are available for all functions.
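The remote-shell comparison for slurmd can be sketched as a simple worker loop. The example below is an illustration only: the function name, the in-process queues standing in for the controller's network messages, and the None shutdown signal are all assumptions, not part of SLURM.

```python
import queue
import subprocess
import sys
import threading


def node_daemon(work, results):
    """Wait for work, execute it, return status, wait for more."""
    while True:
        job = work.get()                 # block until the controller sends work
        if job is None:                  # hypothetical shutdown signal
            break
        job_id, argv = job
        status = subprocess.run(argv).returncode
        results.put((job_id, status))    # report the exit status back


work, results = queue.Queue(), queue.Queue()
daemon = threading.Thread(target=node_daemon, args=(work, results))
daemon.start()

# Stand-in for the controller dispatching two jobs to one node.
work.put((1, [sys.executable, "-c", "raise SystemExit(0)"]))
work.put((2, [sys.executable, "-c", "raise SystemExit(3)"]))
work.put(None)
daemon.join()

statuses = dict(results.get() for _ in range(2))   # job_id -> exit status
```

The real daemon communicates with slurmctld over the network and handles authentication, prolog and epilog scripts, and timeouts; none of that is modeled here.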
SLURM has a general-purpose plugin mechanism that makes it easy to support a variety of infrastructure. These plugins presently include:
#
# Sample /etc/slurm.conf
#
ControlMachine=linux0001
BackupController=linux0002
#
AuthType=auth/authd
Epilog=/usr/local/slurm/sbin/epilog
HeartbeatInterval=60
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
SwitchType=switch/elan
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096
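The sample configuration uses Key=Value fields and bracketed node ranges such as lx[0003-8000]. The sketch below shows one way such lines could be read. Both helper functions are hypothetical and handle only the single zero-padded range form that appears in the sample; SLURM's own parser is not shown here and may accept other forms.

```python
import re


def expand_nodes(expr):
    """Expand 'lx[0003-0005]' into ['lx0003', 'lx0004', 'lx0005']."""
    match = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        return [expr]                    # plain name, nothing to expand
    prefix, first, last = match.groups()
    width = len(first)                   # keep the zero padding
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]


def parse_line(line):
    """Split one 'Key=Value Key=Value' line into a dict; comments yield {}."""
    line = line.split("#", 1)[0].strip()
    return dict(field.split("=", 1) for field in line.split())


node = parse_line("NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16")
names = expand_nodes(node["NodeName"])
debug = parse_line("PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES")
debug_nodes = expand_nodes(debug["Nodes"])
```

With these helpers, the debug partition's Nodes field expands to the individual host names lx0003 through lx0030, which is how the partition and node definitions in the sample refer to the same machines.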
URL = http://www.llnl.gov/linux/slurm/overview.html
UCRL-WEB-201790
Last Modified January 7, 2004
Maintained by slurm-dev@lists.llnl.gov