SLURM Quick Start Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters large and small.
SLURM requires no kernel modifications for its operation and is
relatively self-contained.
As a cluster resource manager, SLURM has three key functions. First,
it allocates exclusive and/or non-exclusive access to resources
(compute nodes) to users for
some duration of time so they can perform work. Second, it provides
a framework for starting, executing, and monitoring work (normally a
parallel job) on the set of allocated nodes. Finally, it arbitrates
conflicting requests for resources by managing a queue of pending work.
Architecture
As depicted in Figure 1, SLURM consists of a slurmd daemon
running on each compute node, a central slurmctld daemon running on
a management node (with optional fail-over twin), and five command line
utilities: srun, scancel, sinfo, squeue, and
scontrol, which can run anywhere in the cluster.
Figure 1: SLURM components
The entities managed by these SLURM daemons are shown in Figure 2
and include:
nodes, the compute resource in SLURM;
partitions, which group nodes into logical, disjoint sets;
jobs, or allocations of resources assigned to a user for a
specified amount of time; and
job steps, which are sets of (possibly parallel) tasks within a job.
Priority-ordered jobs are allocated nodes within a partition until the
resources (nodes) within that partition are exhausted.
Once a job is assigned a set of nodes, the user is able to initiate
parallel work in the form of job steps in any configuration within the
allocation. For instance, a single job step may be started that utilizes
all nodes allocated to the job, or several job steps may independently
use a portion of the allocation (see the example below).
Figure 2: SLURM entities
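For example, a script running inside a job that was allocated four nodes might launch job steps in either fashion. The program names below are purely illustrative; only the -N (node count) and -n (task count) options of srun, described later in this guide, are assumed:

#!/bin/sh
# One job step spanning the entire four-node allocation
srun -N4 -n4 /usr/local/bin/whole_app
# Two smaller job steps, each using half of the allocation, running concurrently
srun -N2 -n2 /usr/local/bin/part_one &
srun -N2 -n2 /usr/local/bin/part_two &
wait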
Daemons
slurmctld is sometimes called the controller daemon.
It orchestrates all SLURM activities including: queuing of jobs,
monitoring node state, and allocating resources (nodes) to jobs.
There is an optional backup controller that automatically assumes
control in the event the primary controller fails.
The primary controller resumes control whenever
it is restored to service. The controller saves its state to disk
whenever there is a change. This state can be recovered by the controller
at startup time. slurmctld typically executes as a special user
created specifically for this purpose (see SlurmUser in the sample
configuration below).
A man page exists for slurmctld as well as all other SLURM daemons,
commands, and API functions.
The slurmd daemon executes on every compute node.
It resembles a remote shell daemon that exports control of the node to SLURM.
Since slurmd initiates and manages user jobs, it must execute as
the user root.
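For example, while first bringing up a cluster, an administrator might start the daemons manually in the foreground with verbose logging. The -D (do not daemonize) and -v (verbose) options are assumed here; consult the slurmctld and slurmd man pages for the options supported by your version:

slurmctld -D -v    # on the management node, run as the SlurmUser
slurmd -D -v       # on each compute node, run as root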
Commands
srun is used to submit a job for execution.
Jobs can be submitted for immediate or later execution (e.g. batch).
srun has a wide variety of options to specify resource requirements
including: minimum and maximum node count, processor count, specific
nodes to use or not use, and specific node characteristics (such as a
minimum amount of memory or disk space, or certain required features).
Besides securing a resource allocation, srun is used to initiate
job steps (sets of parallel tasks).
These job steps can execute sequentially or in parallel on independent
or common nodes within the job's node allocation.
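A couple of hypothetical invocations follow; the programs, node counts, and partition name (pdebug, from the sample configuration later in this guide) are only illustrative:

srun -N2 -n4 hostname          # run four tasks on a minimum of two nodes
srun -p pdebug -N4 -n8 a.out   # request four nodes from the pdebug partition for eight tasks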
scancel is used to cancel a pending or running job or job step.
It can also be used to send an arbitrary signal to all processes
associated with a running job or job step.
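For example, assuming a job with ID 1234 (the job ID and signal here are purely illustrative):

scancel 1234           # cancel job 1234 and all of its job steps
scancel -s USR1 1234   # send SIGUSR1 to all processes of job 1234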
scontrol is the administrative tool used to view and/or modify
SLURM state.
Many scontrol commands can only be executed as user root.
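For example, using the node and partition names from the sample configuration later in this guide (the update form is only a sketch and must be run as user root):

scontrol show partition pdebug                    # display the pdebug partition's configuration
scontrol show node mcr0                           # display the state of node mcr0
scontrol update PartitionName=pdebug MaxTime=60   # raise the partition's time limit (root only)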
sinfo reports the state of partitions and nodes managed by SLURM.
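For example (the exact output format depends on the SLURM version and local configuration):

sinfo             # summarize every partition and the state of its nodes
sinfo -p pdebug   # restrict the report to the pdebug partition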
squeue reports the state of jobs or job steps.
It has a wide variety of filtering, sorting, and formatting options.
By default, it reports the running jobs in priority order and then the
pending jobs in priority order.
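For example (the user name is purely illustrative):

squeue                  # running jobs in priority order, then pending jobs in priority order
squeue -u alice -t PD   # only the pending (PD) jobs belonging to user alice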
Authentication
All communications between SLURM components are authenticated.
The authentication infrastructure used is specified in the SLURM
configuration file and options include: none,
authd, and munged.
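For example, a site using munge would select the corresponding authentication plugin in its configuration file along the lines of the following (the sample configuration later in this guide selects authd instead):

AuthType=auth/munge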
Configuration
The SLURM configuration file includes a wide variety of parameters.
A full description of the parameters is included in the slurm.conf
man page.
Rather than duplicate that information, a sample configuration file
is shown below and a few points will be made about it.
Any text following a "#" is considered a comment.
The keywords in the file are not case sensitive,
although the argument typically is (e.g. "SlurmUser=slurm"
might be specified as "slurmuser=slurm").
The control machine, like all other machine specifications, can
include both the host name and the name used for communications.
In this case, the host's name is "mcri" and the name "emcri" is
used for communications. The "e" prefix identifies this as an
ethernet address at this site.
Port numbers to be used for communications are specified as
well as various timer values.
Partition and node specifications use a regular expression to
identify nodes in a concise fashion.
This configuration file was used for a 1154 node cluster, but
might be used for a much larger cluster by just changing a
few regular expressions.
#
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=mcri ControlAddr=emcri
#
AuthType=auth/authd
Epilog=/usr/local/slurm/etc/epilog
HeartbeatInterval=30
PluginDir=/usr/local/slurm/lib/slurm
Prolog=/usr/local/slurm/etc/prolog
SlurmUser=slurm
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
#
# Node Configurations
#
NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=mcr[0-1151] NodeAddr=emcr[0-1151]
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-1151]
URL = http://www-lc.llnl.gov/dctg-lc/slurm/quick.start.guide.html
Last Modified March 18, 2003
Maintained by
slurm-dev@lists.llnl.gov