SLURM Quick Start Guide

Overview

Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters large and small. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work.

Architecture

As depicted in Figure 1, SLURM consists of a slurmd daemon running on each compute node, a central slurmctld daemon running on a management node (with optional fail-over twin), and five command line utilities: srun, scancel, sinfo, squeue, and scontrol, which can run anywhere in the cluster.

Figure 1: SLURM components

The entities managed by these SLURM daemons, shown in Figure 2, include nodes (the compute resource in SLURM), partitions (which group nodes into logically disjoint sets), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job). Priority-ordered jobs are allocated nodes within a partition until the resources (nodes) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation.

Figure 2: SLURM entities

Commands

Man pages exist for all SLURM daemons, commands, and API functions. The command option "--help" also provides a brief summary of options. Note that the command options are all case insensitive.

srun is used to submit a job for execution, allocate resources, attach to an existing allocation, or initiate job steps. Jobs can be submitted for immediate or later (e.g., batch) execution. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (a minimum amount of memory or temporary disk space, certain required features, etc.). Besides securing a resource allocation, srun is used to initiate job steps. These job steps can execute sequentially or in parallel on independent or shared nodes within the job's node allocation.
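
As a rough sketch of the option syntax (hedged; the exact flags supported depend on your srun version, so check "srun --help"), a node-count range or an excluded node might be specified as follows:

# Request at least two and at most four nodes (node-count range)
adev0: srun -N2-4 -l /bin/hostname
# Request four tasks, excluding node adev8 from the allocation
adev0: srun -n4 -x adev8 -l /bin/hostname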

scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
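
For example (a hedged sketch; the job and step IDs are placeholders, and the --signal option should be verified against your scancel version):

# Cancel job 473 and all of its job steps
adev0: scancel 473
# Cancel only job step 1 of job 473
adev0: scancel 473.1
# Send SIGUSR1 to all processes of running job 473
adev0: scancel --signal=USR1 473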

scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.

sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
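
For instance (a hedged sketch; the user name is a placeholder, and the filter and format options should be verified with "squeue --help" on your system):

# Report only the jobs belonging to user jette
adev0: squeue -u jette
# Report only running jobs, using a custom output format
adev0: squeue -t RUNNING -o "%i %P %j %u %T"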

Daemons

slurmctld is sometimes called the controller daemon. It orchestrates SLURM activities, including: queuing of jobs, monitoring node state, and allocating resources (nodes) to jobs. There is an optional backup controller that automatically assumes control in the event the primary controller fails. The primary controller resumes control whenever it is restored to service. The controller saves its state to disk whenever there is a change, and this state can be recovered by the controller at startup time. slurmctld typically executes as a special, unprivileged user created specifically for this purpose (not user root).

The slurmd daemon executes on every compute node. It can be thought of as a remote shell daemon that exports control of the node to SLURM. Since slurmd initiates and manages user jobs, it must execute as the user root.

slurmctld and/or slurmd should be initiated at node startup time per the SLURM configuration.
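
A minimal sketch of the startup step, assuming a SysV-style init script named slurm has been installed (the script name and path are assumptions; your site's startup mechanism may differ):

# On the management node (starts slurmctld)
mcri: /etc/init.d/slurm start
# On each compute node (starts slurmd)
mcr0: /etc/init.d/slurm start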

Examples

Execute /bin/hostname on four nodes (-N4). Include task numbers on the output (-l). The default partition will be used. One task per node will be used by default.
adev0: srun -N4 -l /bin/hostname
0: adev9
1: adev10
2: adev11
3: adev12

Execute /bin/hostname in four tasks (-n4). Include task numbers on the output (-l). The default partition will be used. One processor per task will be used by default (note that we don't specify a node count).

adev0: srun -n4 -l /bin/hostname
0: adev9
1: adev9
2: adev10
3: adev10

Submit the script my.script for later execution (-b). Explicitly use the nodes adev9 and adev10 (-w "adev[9-10]", note the use of a node range expression). One processor per task will be used by default. The output will appear in the file my.stdout (-o my.stdout). By default, one task will be initiated per processor on the allocated nodes. Note that my.script contains the command /bin/hostname, which executes on the first node in the allocation (where the script runs), plus two job steps initiated using the srun command and executed sequentially.

adev0: cat my.script
#!/bin/sh
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd

adev0: srun -w "adev[9-10]" -o my.stdout -b my.script
srun: jobid 469 submitted

adev0: cat my.stdout
adev9
0: adev9
1: adev9
2: adev10
3: adev10
0: /home/jette
1: /home/jette
2: /home/jette
3: /home/jette

Submit a job, get its status and cancel it.

adev0: srun -b my.sleeper
srun: jobid 473 submitted

adev0: squeue
  JobId Partition Name     User     St TimeLim Prio Nodes                        
    473 batch     my.sleep jette    R  UNLIMIT 0.99 adev9 
                       
adev0: scancel 473

adev0: squeue
  JobId Partition Name     User     St TimeLim Prio Nodes            

Get the SLURM partition and node status.

adev0: sinfo
PARTITION  NODES STATE     CPUS    MEMORY    TMP_DISK NODES
--------------------------------------------------------------------------------
debug          8 IDLE         2      3448       82306 adev[0-7]
batch          1 DOWN         2      3448       82306 adev8
               7 IDLE         2 3448-3458       82306 adev[9-15]

SLURM Administration

The remainder of this document provides basic SLURM administration information. Individuals interested only in making use of SLURM need not read further.

Authentication

All communications between SLURM components are authenticated. The authentication infrastructure used is specified in the SLURM configuration file, and the options include: authd, munge, and none.
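
These correspond to AuthType lines in the configuration file such as the following (a sketch; auth/munge requires the munged daemon to be running on every node, and auth/none performs no authentication at all):

# Choose exactly one authentication plugin in slurm.conf
AuthType=auth/authd
#AuthType=auth/munge
#AuthType=auth/none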

Configuration

The SLURM configuration file includes a wide variety of parameters. A full description of the parameters is included in the slurm.conf man page. Rather than duplicate that information, a sample configuration file is shown below. Any text following a "#" is considered a comment. The keywords in the file are not case sensitive, although the argument typically is (e.g., "SlurmUser=slurm" might be specified as "slurmuser=slurm"). The control machine, like all other machine specifications, can include both the host name and the name used for communications. In this case, the host's name is "mcri" and the name "emcri" is used for communications. The "e" prefix identifies this as an ethernet address at this site. Port numbers to be used for communications are specified, as well as various timer values.

A description of the nodes and their grouping into non-overlapping partitions is required. Partition and node specifications use node range expressions to identify nodes in a concise fashion. This configuration file defines a 1154-node cluster for SLURM, but it could be used for a much larger cluster by changing just a few node range expressions, as sketched after the sample file.

# 
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=mcri   ControlAddr=emcri 
#
AuthType=auth/authd
Epilog=/usr/local/slurm/etc/epilog
HeartbeatInterval=30
PluginDir=/usr/local/slurm/lib/slurm
Prolog=/usr/local/slurm/etc/prolog
SlurmUser=slurm
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
#
# Node Configurations
#
NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=mcr[0-1151]  NodeAddr=emcr[0-1151]
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP    
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-1151] 
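
For example, to reuse this file for a hypothetical cluster of 3200 compute nodes, only the node range expressions would need to change (the node counts and partition boundary below are illustrative assumptions, not a recommendation):

NodeName=mcr[0-3199]  NodeAddr=emcr[0-3199]
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-3199]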

Administration Examples

scontrol can be used to print all system information and modify most of it. Only a few examples are shown below. Please see the scontrol man page for full details. The commands and options are all case insensitive.

Print detailed state of all jobs in the system.

adev0: scontrol
scontrol: show job
JobId=475 UserId=bob(6885) Name=sleep JobState=COMPLETED
   Priority=4294901286 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:53:41 EndTime=03/19-12:53:59
   NodeList=adev8 NodeListIndecies=-1
   ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1

JobId=476 UserId=bob(6885) Name=sleep JobState=RUNNING
   Priority=4294901285 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:54:01 EndTime=NONE
   NodeList=adev8 NodeListIndecies=8,8,-1
   ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1

Print the detailed state of job 477 and change its priority to zero. A priority of zero prevents a job from being initiated (it is held in pending state).

adev0: scontrol
scontrol: show job 477
JobId=477 UserId=bob(6885) Name=sleep JobState=PENDING
   Priority=4294901286 Partition=batch BatchFlag=0
   more data removed....
scontrol: update JobId=477 Priority=0

Print the state of node adev13 and drain it. Return it to service later.

adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=ALLOCATED CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null) 
scontrol: update NodeName=adev13 State=DRAINING
scontrol: show node adev13
NodeName=adev13 State=DRAINING CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null) 
scontrol: quit
Later
adev0: scontrol update NodeName=adev13 State=IDLE

Reconfigure all SLURM daemons on all nodes. This should be done after changing the SLURM configuration file.

adev0: scontrol reconfig

Print the current slurm configuration.

adev0: scontrol show config
Configuration data as of 03/19-13:04:12
AuthType          = auth/munge
BackupAddr        = eadevj
BackupController  = adevj
ControlAddr       = eadevi
ControlMachine    = adevi
Epilog            = (null)
FastSchedule      = 0
FirstJobId        = 0
NodeHashBase      = 10
HeartbeatInterval = 60
InactiveLimit     = 0
JobCredPrivateKey = /etc/slurm/slurm.key
JobCredPublicKey  = /etc/slurm/slurm.cert
KillWait          = 30
PluginDir         = /usr/lib/slurm
Prioritize        = (null)
Prolog            = (null)
ReturnToService   = 1
SlurmUser         = slurm(97)
SlurmctldDebug    = 4
SlurmctldLogFile  = /tmp/slurmctld.log
SlurmctldPidFile  = (null)
SlurmctldPort     = 0
SlurmctldTimeout  = 300
SlurmdDebug       = 65534
SlurmdLogFile     = /tmp/slurmd.log
SlurmdPidFile     = (null)
SlurmdPort        = 0
SlurmdSpoolDir    = /tmp/slurmd
SlurmdTimeout     = 300
SLURM_CONFIG_FILE = /etc/slurm/slurm.conf
StateSaveLocation = /usr/local/tmp/slurm/adev
TmpFS             = /tmp

Shutdown all SLURM daemons on all nodes.

adev0: scontrol shutdown

URL = http://www-lc.llnl.gov/dctg-lc/slurm/quick.start.guide.html

Last Modified March 20, 2003

Maintained by slurm-dev@lists.llnl.gov