Simple Linux Utility for Resource Management


Quick Start Administrator Guide

Overview

Please see the Quick Start User Guide for a general overview.

Daemons

slurmctld is sometimes called the "controller" daemon. It orchestrates SLURM activities, including queuing of jobs, monitoring node state, and allocating resources (nodes) to jobs. There is an optional backup controller that automatically assumes control in the event the primary controller fails; the primary controller resumes control whenever it is restored to service. The controller saves its state to disk whenever there is a change, and that state can be recovered at startup time. slurmctld should typically execute as a special unprivileged user created specifically for this purpose (not user root). Because state changes are saved, jobs and other state are preserved when slurmctld is moved to another host or restarted.

The slurmd daemon executes on every compute node. It resembles a remote shell daemon, exporting control of the node to SLURM. Because slurmd initiates and manages user jobs, it must execute as the user root.

slurmctld and/or slurmd should be initiated at node startup time per the SLURM configuration.
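
How the daemons are launched at boot is site specific. As a rough sketch, either an init script or direct invocation can be used; the init script path shown here is an assumption and varies by installation:

/etc/init.d/slurm start    # assumed install path for a site init script
# or launch the appropriate daemon directly on each node:
slurmctld                  # on the control machine
slurmd                     # on each compute node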

Infrastructure

All communications between SLURM components are authenticated. The authentication infrastructure is specified in the SLURM configuration file, and the options include authd, munge, and none. The default authentication infrastructure is "none", which permits any user to execute any job as any other user. This may be fine for testing purposes, but certainly not for production use. Configure some AuthType value other than "none" if you want any security.
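
For example, to use Munge for authentication, the configuration file would contain a line such as the following (auth/munge is one of the plugin names discussed in the Configuration section below):

AuthType=auth/munge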

Quadrics MPI works directly with SLURM on systems having Quadrics interconnects. For non-Quadrics interconnect systems, LAM/MPI is the preferred MPI infrastructure. LAM/MPI uses the command lamboot to initiate job-specific daemons on each node using SLURM's srun command. This places all MPI processes in a process-tree under the control of the slurmd daemon. LAM/MPI version 7.0.4 or higher contains support for SLURM.
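
As a rough sketch of that workflow (the allocation option and application name are illustrative and depend on your srun and LAM/MPI versions):

srun -N2 -A           # create a SLURM allocation (here, two nodes)
lamboot               # LAM starts its daemons on the allocated nodes via srun
mpirun C ./my_app     # run the MPI application (my_app is a placeholder)
lamhalt               # shut down the LAM daemons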

SLURM's default scheduler is FIFO (First-In First-Out). A backfill scheduler plugin is also available. Backfill scheduling will initiate a lower-priority job if doing so does not delay the expected initiation time of higher priority jobs; essentially using smaller jobs to fill holes in the resource allocation plan. The Maui Scheduler offers sophisticated scheduling algorithms to control SLURM's workload. Motivated users can even develop their own scheduler plugin if so desired.
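
The scheduler is selected with the SchedulerType parameter in the configuration file; for example, the sample configuration below requests backfill scheduling:

SchedulerType=sched/backfill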

SLURM uses the syslog function to record events. It uses a range of importance levels for these messages. Be certain that your system's syslog functionality is operational.

There is no necessity for synchronized clocks on the nodes. Events occur either in real-time or based upon message traffic. However, synchronized clocks will permit easier analysis of SLURM logs from multiple nodes.

Building and Installing

Basic instructions to build and install SLURM are shown below. See the INSTALL file for more details.

  1. cd to the directory containing the SLURM source and type ./configure with appropriate options.
  2. Type make to compile SLURM.
  3. Type make install to install the programs, documentation, libraries, header files, etc.

The most commonly used arguments to the configure command include:

--enable-debug
Enable debugging of individual modules.

--prefix=PREFIX
Install architecture-independent files in PREFIX; default value is /usr/local.

--sysconfdir=DIR
Specify location of SLURM configuration file.

--with-totalview
Compile with support for the TotalView debugger (see http://www.etnus.com).
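
For example, a typical build might combine these options as follows; the directories shown are placeholders for your site's layout:

./configure --prefix=/usr/local --sysconfdir=/etc/slurm --enable-debug
make
make install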

Configuration

The SLURM configuration file includes a wide variety of parameters. A full description of the parameters is included in the slurm.conf man page. Rather than duplicate that information, a sample configuration file is shown below. Any text following a "#" is considered a comment. The keywords in the file are not case sensitive, although the argument values typically are (e.g., "SlurmUser=slurm" might equivalently be specified as "slurmuser=slurm"). The control machine, like all other machine specifications, can include both the host name and the name used for communications. In this case, the host's name is "mcri" and the name "emcri" is used for communications; the "e" prefix identifies this as an ethernet address at this site. Port numbers to be used for communications are specified, as well as various timer values.

A description of the nodes and their grouping into non-overlapping partitions is required. Partition and node specifications use node range expressions to identify nodes in a concise fashion. This configuration file defines a 1154-node cluster for SLURM, but it might be used for a much larger cluster by just changing a few node range expressions. Specify the minimum processor count (Procs), real memory space (RealMemory, megabytes), and temporary disk space (TmpDisk, megabytes) that a node should have to be considered available for use. Any node lacking these minimum configuration values will be considered DOWN and not scheduled.

# 
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=mcri   ControlAddr=emcri 
#
AuthType=auth/authd
Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/tmp/jette/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
PluginDir=/usr/local/slurm/lib/slurm
Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SlurmUser=slurm
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
SwitchType=switch/elan
#
# Node Configurations
#
NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=mcr[0-1151]  NodeAddr=emcr[0-1151]
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP    
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-1151] 

You should create unique job credential keys for your site using the program openssl. An example of how to do this is shown below. Specify file names that match the values of JobCredentialPrivateKey and JobCredentialPublicCertificate in your configuration file. The JobCredentialPrivateKey file must be readable only by SlurmUser. The JobCredentialPublicCertificate file must be readable by all users.

openssl genrsa -out /usr/local/etc/slurm.key 1024
openssl rsa -in /usr/local/etc/slurm.key -pubout -out /usr/local/etc/slurm.cert
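
The required ownership and permissions might then be set as follows, assuming SlurmUser is "slurm" as in the sample configuration above:

chown slurm /usr/local/etc/slurm.key
chmod 600 /usr/local/etc/slurm.key     # readable only by SlurmUser
chmod 644 /usr/local/etc/slurm.cert    # readable by all users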

SLURM does not use reserved ports to authenticate communication between components. You will need to have at least one "auth" plugin. Currently, only three authentication plugins are supported: auth/none, auth/authd, and auth/munge. The auth/none plugin is built and used by default, but either Brent Chun's authd or Chris Dunlap's Munge should be installed in order to get properly authenticated communications. The configure script in the top-level directory of this distribution will determine which authentication plugins may be built. The configuration file specifies which of the available plugins will be utilized.

A PAM module (Pluggable Authentication Module) is available for SLURM that can prevent users from accessing nodes they have not been allocated, if that mode of operation is desired.
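
As a sketch, the module (commonly named pam_slurm.so; the module name and the PAM service file shown are assumptions for your installation) would be added to the account stack of the login service, for example:

# in /etc/pam.d/sshd (example service file)
account    required     pam_slurm.so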

Starting the Daemons

For testing purposes you may want to start by just running slurmctld and slurmd on one node. By default, they execute in the background. Use the -D option to execute each daemon in the foreground; logging will then be written to your terminal. The -v option logs events in more detail, with more v's increasing the level of detail (e.g. -vvvvvv). You can use one window to execute slurmctld -D -vvvvvv, a second window to execute slurmd -D -vvvvv, and a third window to execute commands such as srun -N1 /bin/hostname to confirm basic functionality.
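
That is, the three windows would run something like:

slurmctld -D -vvvvvv         # window 1: controller in the foreground, very verbose logging
slurmd -D -vvvvv             # window 2: compute node daemon in the foreground
srun -N1 /bin/hostname       # window 3: a trivial one-node job to confirm basic functionality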

Another important option for the daemons is -c to clear previous state information. Without the -c option, the daemons will restore any previously saved state information: node state, job state, etc. With the -c option, all previously running jobs are purged and node state is restored to the values specified in the configuration file. This means that a node configured down manually using the scontrol command will be returned to service unless also noted as being down in the configuration file. In practice, the daemons are normally restarted without the -c option so that job and node state is preserved.

A thorough battery of tests written in the "expect" language is also available.

Administration Examples

scontrol can be used to print all system information and modify most of it. Only a few examples are shown below. Please see the scontrol man page for full details. The commands and options are all case insensitive.

Print detailed state of all jobs in the system.

adev0: scontrol
scontrol: show job
JobId=475 UserId=bob(6885) Name=sleep JobState=COMPLETED
   Priority=4294901286 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:53:41 EndTime=03/19-12:53:59
   NodeList=adev8 NodeListIndecies=-1
   ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1

JobId=476 UserId=bob(6885) Name=sleep JobState=RUNNING
   Priority=4294901285 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:54:01 EndTime=NONE
   NodeList=adev8 NodeListIndecies=8,8,-1
   ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1

Print the detailed state of job 477 and change its priority to zero. A priority of zero prevents a job from being initiated (it is held in "pending" state).

adev0: scontrol
scontrol: show job 477
JobId=477 UserId=bob(6885) Name=sleep JobState=PENDING
   Priority=4294901286 Partition=batch BatchFlag=0
   more data removed....
scontrol: update JobId=477 Priority=0

Print the state of node adev13 and drain it. To drain a node specify a new state of DRAIN, DRAINED, or DRAINING. SLURM will automatically set it to the appropriate value of either DRAINING or DRAINED depending on whether the node is allocated or not. Return it to service later.

adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=ALLOCATED CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null) 
scontrol: update NodeName=adev13 State=DRAIN
scontrol: show node adev13
NodeName=adev13 State=DRAINING CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null) 
scontrol: quit
Later
adev0: scontrol 
scontrol: show node adev13
NodeName=adev13 State=DRAINED CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null) 
scontrol: update NodeName=adev13 State=IDLE

Reconfigure all SLURM daemons on all nodes. This should be done after changing the SLURM configuration file.

adev0: scontrol reconfig

Print the current SLURM configuration. This also reports if the primary and secondary controllers (slurmctld daemons) are responding. To just see the state of the controllers, use the scontrol ping command.

adev0: scontrol show config
Configuration data as of 03/19-13:04:12
AuthType          = auth/munge
BackupAddr        = eadevj
BackupController  = adevj
ControlAddr       = eadevi
ControlMachine    = adevi
Epilog            = (null)
FastSchedule      = 1
FirstJobId        = 1
NodeHashBase      = 10
HeartbeatInterval = 60
InactiveLimit     = 0
JobCompLoc        = /var/tmp/jette/slurm.job.log
JobCompType       = jobcomp/filetxt
JobCredPrivateKey = /etc/slurm/slurm.key
JobCredPublicKey  = /etc/slurm/slurm.cert
KillWait          = 30
MaxJobCnt         = 2000
MinJobAge         = 300
PluginDir         = /usr/lib/slurm
Prolog            = (null)
ReturnToService   = 1
SchedulerAuth     = (null)
SchedulerPort     = 65534
SchedulerType     = sched/backfill
SlurmUser         = slurm(97)
SlurmctldDebug    = 4
SlurmctldLogFile  = /tmp/slurmctld.log
SlurmctldPidFile  = /tmp/slurmctld.pid
SlurmctldPort     = 7002 
SlurmctldTimeout  = 300
SlurmdDebug       = 65534
SlurmdLogFile     = /tmp/slurmd.log
SlurmdPidFile     = /tmp/slurmd.pid
SlurmdPort        = 7003
SlurmdSpoolDir    = /tmp/slurmd
SlurmdTimeout     = 300
SLURM_CONFIG_FILE = /etc/slurm/slurm.conf
StateSaveLocation = /usr/local/tmp/slurm/adev
SwitchType        = switch/elan
TmpFS             = /tmp
WaitTime          = 0

Slurmctld(primary/backup) at adevi/adevj are UP/UP
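
To check only the controller state, for example:

adev0: scontrol ping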

Shutdown all SLURM daemons on all nodes.

adev0: scontrol shutdown

For information about this page, contact slurm-dev@lists.llnl.gov.