SLURM Administrator's Guide

Overview

Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure. SLURM requires no kernel modifications and is relatively self-contained. There is a "master" or "control" machine, which orchestrates SLURM activities. There are also compute servers on which applications execute, which can number in the thousands.

Configuration

There are three SLURM configuration files that you need to establish: overall SLURM options, node configuration, and partition configuration. The overall SLURM configuration options indicate where to find the other configuration files, where to find the daemons, how often to perform certain actions, etc. The node configuration tells SLURM what nodes it is to manage as well as their anticipated hardware and system software configurations. The partition configuration permits you to establish different job limits or access lists for various groups (or partitions) of nodes.

Overall SLURM Configuration

The overall SLURM configuration is provided by the file "/etc/SLURM.conf". This file specifies administrator names, time limits, configuration file names, key resource names, etc. This file or a copy of it must be available on each computer under SLURM management. The following parameters may be specified:
Administrators
A comma-separated list of users permitted to execute SLURM administrative commands (the default value is "root").
ControlMachine
The name of the machine where SLURM control functions are executed (e.g. "lx01"). This value must be specified.
BackupController
The name of the machine where SLURM control functions are to be executed in the event that ControlMachine fails (e.g. "lx02"). This node may be used as a compute server by default. It will come into service as a controller only upon the failure of ControlMachine and will revert to a "standby" mode when the ControlMachine becomes available once again. While not essential, it is highly recommended that you specify a backup controller.
NodeSpecConf
Fully qualified pathname of the file containing node configuration information as described below (the default value is "/usr/local/SLURM/NodeSpecConf"). This file must be accessible to the ControlMachine and BackupController.
PartitionConf
Fully qualified pathname of the file containing partition configuration information as described below (the default value is "/usr/local/SLURM/PartitionConf"). This file must be accessible to the ControlMachine and BackupController.
MasterDaemon
The fully qualified pathname of the SLURM daemon to execute on all machines. This daemon executes InitProgram, then starts the ControlDaemon and/or ServerDaemon as appropriate (the default value is "/usr/local/SLURM/Slurmd.Master"). This file must be accessible to all machines.
InitProgram
This program must execute and return an exit code of zero before the ControlDaemon or ServerDaemon enter into service. This would normally be used to ensure that the computer is fully ready to execute user jobs. It may, for example, wait until every required file system has been mounted. By default there is no initialization program.
ControlDaemon
The fully qualified pathname of the file containing the SLURM daemon to execute on the ControlMachine (the default value is "/usr/local/SLURM/Slurmd.Control"). This file must be accessible to the ControlMachine and BackupController.
ServerDaemon
The fully qualified pathname of the file containing the SLURM daemon to execute on every compute server node (the default value is "/usr/local/SLURM/Slurmd.Server"). This file must be accessible to every SLURM compute server.
ControllerTimeout
How long to wait for the ControlDaemon to respond before assuming it has failed and starting control functions on the BackupController. The default value is 300 seconds.
ServerTimeout
How long to wait for the ServerDaemon to respond before assuming it has failed, declaring the node DOWN, and terminating any jobs running on it. The default value is 300 seconds.

Lines in the configuration file having "#" in column one will be considered comments. Each line of the configuration file should contain one keyword, an equal sign, and the value (one keyword per line). In the interest of simplicity (for the developers), the field descriptors above are case sensitive. The size of each line in the file is limited to 1024 characters. A sample SLURM configuration file is included at the end of this document.
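
For example, a minimal "/etc/SLURM.conf" might specify only the required ControlMachine keyword plus a few commonly set options (the host names and values shown here are purely illustrative; a fuller sample appears at the end of this document):

ControlMachine=lx01
BackupController=lx02
ControllerTimeout=120
ServerTimeout=90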

Node Configuration

The node configuration permits you to identify the nodes (or machines) to be managed by SLURM. You may identify the hardware and/or software characteristics of each node in the configuration file. SLURM operates in a heterogeneous environment and users are able to specify resource requirements to achieve the desired scheduling characteristics. Note that some of these values should not normally be set by the administrator, as described in detail below. The node configuration file contains the following information:
Name
Name of a node as returned by hostname (e.g. "lx12").
Partition
List of partition numbers this node belongs to. Partition numbers range from 0 to 31 and are specified with comma separators (e.g. "1,3"). This range can be altered by resetting MAX_PARTITION in the slurm.h file before building. In no case should this value exceed the number of bits in an integer on the computer, as a bit-mask is used to record partition information. The default partition value is zero.
OS
Operating System name and level (output of the command "/bin/uname -s -r | /bin/sed 's/ /./g'", e.g. "Linux.2.4.7-10"). The default value is "UNKNOWN".
CPUs
Number of processors on the node (e.g. "2"). The default value is 1.
Speed
Relative speed of these processors. The value is an arbitrary floating point number, but the MHz rating is recommended (e.g. "863.8"). The default value is 1.
RealMemory
Size of real memory on the node in MegaBytes (e.g. "2048"). The default value is 1.
VirtualMemory
Size of virtual memory on the node in MegaBytes (e.g. "4096"). The default value is 1.
TmpDisk
Total size of temporary disk storage on "/tmp" in MegaBytes (e.g. "16384"). Note this does not indicate the amount of free space available to the user on the node, only the total file system size. The default value is 1.
LastResponse
Time of last contact from the node; the format is time_t as returned by the "time" function. The default value is 0.
State
State of the node with respect to the initiation of user jobs. Acceptable values are "UNKNOWN", "IDLE", "BUSY", "DOWN", "DRAINED", "DRAINING". For example, the SLURM control machine may very well not be used for initiation of user jobs and its state would thus be set to "DOWN". The default value is "UNKNOWN".

Only the Name must be supplied in the configuration file; all other items are optional. If you operate with more than one partition, Partition should also be specified. Other configuration information can be established through communications with the SLURM daemon, slurmd, actually running on each node. Alternately, you can explicitly establish baseline values in the configuration file. Nodes which register to the system with less than the configured resources (e.g. too little memory) will be placed in the "DRAINED" state to avoid scheduling jobs on them. By default all nodes will be in partition zero, but it is possible to configure your system with multiple overlapping partitions (more on that below). If a node is not to be included in any partition, indicate this with the expression "Partition= ".

Lines in the configuration file having "#" in column one will be considered comments. The configuration file should contain the information about one node on a single line. If more than one line is used to describe a node's configuration, be sure to include "Name=" on each line. In the interest of simplicity (for the developers), the field descriptors above are case sensitive. Each field should contain the field's name, an equal sign, and the value. Fields should be space or tab separated. The default values for each node can be specified with a record in which "Name" is "DEFAULT". The default entry values apply only to the lines following it in the configuration file, and the defaults can be reset multiple times with additional "Name=DEFAULT" records. In order to support the concept of jobs requiring consecutive nodes, nodes should be placed in this file in consecutive order. The size of each line in the file is limited to 1024 characters. A sample node configuration file is included at the end of this document.
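
For example, in the following illustrative fragment the first "Name=DEFAULT" record applies only to lx03 and lx04 (with lx04 overriding RealMemory), while the second "Name=DEFAULT" record resets the defaults used for lx05 and lx06 (all names and values here are hypothetical):

Name=DEFAULT Partition=1 CPUs=2 RealMemory=2048
Name=lx03
Name=lx04 RealMemory=1024
Name=DEFAULT Partition=0 CPUs=4 RealMemory=4096
Name=lx05
Name=lx06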

SLURM uses a hash table in order to locate table entries rapidly. Each table entry can be directly accessed without any searching if the node name contains a base-ten sequence number suffix. If you use a different naming convention, it may be desirable to modify the hashing functions Hash_Index and Rehash in the Mach_Stat_Mgr.c module as appropriate. This is especially important in clusters having large numbers of nodes. Non-numeric information in the name is ignored (if desired, it could differ for each node). The sequence numbers can start at zero, one, or any other desired number, but should be consecutive. The sequence number portion can contain leading zeros for a consistent name length, if so desired. For example, naming nodes lx01, lx02, lx03, lx04, lx05, lx06, lx07, lx08, lx09, lx10, lx11, lx12, lx13, lx14, lx15, lx16 provides excellent performance. Note that correct operation will be provided with any node names, but performance will suffer without this optimization.
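
As a rough illustration of this suffix-based scheme (a sketch only, under the assumption that the index is simply the trailing sequence number reduced modulo the table size; the actual Hash_Index and Rehash code in Mach_Stat_Mgr.c may differ), such a function might be written as:

#include <ctype.h>
#include <stdlib.h>

/* Sketch only: derive a hash table index from the trailing base-ten
 * sequence number in a node name (e.g. "lx12" yields index 12 modulo
 * the table size). Not the actual Mach_Stat_Mgr.c implementation. */
static int hash_index(const char *name, int table_size)
{
    const char *digits = NULL;          /* start of the trailing digit run */
    const char *p;

    for (p = name; *p != '\0'; p++) {
        if (isdigit((unsigned char)*p)) {
            if (digits == NULL)
                digits = p;             /* remember where the digits begin */
        } else {
            digits = NULL;              /* digits interrupted; not a suffix */
        }
    }

    if (digits == NULL)                 /* no numeric suffix: fall back to 0 */
        return 0;

    return (int)(strtol(digits, NULL, 10) % table_size);
}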

Partition Configuration

The partition configuration permits you to establish different job limits or access lists for various groups (or partitions) of nodes. Nodes may be in more than one partition. The partition configuration file contains the following information:
Name
Name by which the partition may be referenced (e.g. "Interactive"). This name can be used by users when submitting their jobs.
Number
Unique number by which the partition can be referenced. This is used in the node configuration file.
JobType
Job types which may execute in the partition. Possible values are "BATCH", "INTERACTIVE", and "ALL". The default value is "ALL".
MaxTime
Maximum wall-time limit for any job in minutes. The default value is "UNLIMITED", which is represented internally as -1.
MaxCpus
Maximum count of CPUs which may be allocated to any single job. The default value is "UNLIMITED", which is represented internally as -1.
State
State of partition or availability for use. Possible values are "UP" or "DOWN". The default value is "UP".
AllowUsers
Names of users who may use the partition, separated by commas. The default value is "ALL". If AllowUsers is specified, then the value of DenyUsers will be ignored.
DenyUsers
Names of users who may not use the partition, separated by commas. The default value is "NONE".

Only the first two items, Name and Number, must be supplied in the configuration file. If not otherwise specified, all nodes will be in partition zero. If user controls are desired, set either AllowUsers or DenyUsers, but not both; if AllowUsers is set, then DenyUsers is ignored.

Lines in the configuration file having "#" in the first column will be considered comments. It is recommended that the configuration file contain the information about one partition on a single line. If more than one line is used to describe a partition's configuration, be sure to specify "Name=" on each line. In the interest of simplicity (for the developers), the field descriptors above are case sensitive. Each field should contain the field's name, an equal sign, and the value. Fields should be space or tab separated. Default values for each partition can be specified with a record in which "Name" is "DEFAULT" if other default values are desired. The default entry values apply only to the lines following it in the configuration file, and the defaults can be reset multiple times with additional "Name=DEFAULT" records. The size of each line in the file is limited to 1024 characters. A sample partition configuration file is included at the end of this document.
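
For example, the following illustrative entries (the partition names, numbers, and user names are hypothetical) show the two forms of user control; if both AllowUsers and DenyUsers were given for one partition, DenyUsers would be ignored:

Name=open   Number=2 JobType=ALL DenyUsers=guest1,guest2
Name=secure Number=5 JobType=ALL AllowUsers=admin1,admin2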

Job Configuration

The job configuration format specified below is used by the SLURM daemons to save state information. The same format is used by the slurm_admin administration tool to modify job state information:
Number
Unique number by which the job can be referenced. This value may not be changed by slurm_admin.
Name
Name by which the job may be referenced (e.g. "Simulation"). This name can be specified by users when submitting their jobs.
JobType
Permitted job type values are "BATCH" and "INTERACTIVE".
MaxTime
Maximum wall-time limit for the job in minutes. An "UNLIMITED" value is represented internally as -1.
Nodes
Comma separated list of nodes which are allocated to the job. This value may not be changed by slurm_admin.
State
State of the job. Possible values are "PENDING", "STARTING", "RUNNING", and "ENDING".
User
Name of the user executing this job.
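
For example, a saved job record using these fields might look like the following line (all values are hypothetical):

Number=1234 Name=Simulation JobType=BATCH MaxTime=100 Nodes=lx29,lx30 State=RUNNING User=grondo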

Slurm_admin Administration Tool

The tool you will primarily use in the administration of SLURM is slurm_admin. It provides the means of viewing and updating node and partition configurations. It can also be used to update some job state information. You can execute slurm_admin with a single keyword on the command line, or it will query you for input and process keywords on an interactive basis. The slurm_admin keywords are shown below. A sample slurm_admin session with examples is appended.

Usage: slurm_admin [-q | -v] [<keyword>]
-q is equivalent to the "quiet" keyword
-v is equivalent to the "verbose" keyword

exit
Terminate slurm_admin.
help
Display this list of slurm_admin commands and options.
quiet
Print no messages other than error messages.
quit
Terminate slurm_admin.
reconfigure [<NodeName>]
The SLURM daemons on the specified node are instructed to re-read the configuration files. The default is that all daemons on all nodes are reconfigured.
restart [<NodeName>]
The SLURM daemons on the specified node are stopped and restarted. The default is that all daemons on all nodes are stopped and restarted. All state information for the daemons is preserved.
show <entity> [<ID>]
Show the configuration for a given entity. Entity must be "job", "node", or "partition". By default, state information for all records is reported. If you only wish to see the state of one entity record, specify either its ID number (assumed if entirely numeric) or its name.
start [<NodeName>]
The SLURM daemons on the specified node are started as needed. The default is that all daemons on all nodes are started.
stop [<NodeName>]
The SLURM daemons on the specified node are stopped. The default is that all daemons on all nodes are stopped. All state information for the daemons is preserved and restored when the daemons are restarted. Note this will not terminate any job executing on the node. To leave no active jobs on a node, you may set the node's state to "DRAINING", set any running job's state to "ENDING", wait for the job to terminate, and then stop the daemon.
update <entity> <options>
Update the configuration for a given entity. Entity must be "job", "node", or "partition". Options are of the same format as the configuration files.
upload [<NodeName>]
Upload into the SLURM node configuration table the configuration actually reported by the node (memory, CPU count, temporary disk, etc.). This can be used to establish a baseline configuration rather than entering the configurations manually into a file. By default, information from all nodes is uploaded.
verbose
Enable detailed logging of slurm_admin execution state information.
version
Display the slurm_admin tool version number.
write <entity> <filename>
Write current entity configuration information to the specified file. Entity must be "job", "node", or "partition".

Miscellaneous

It is advisable to start the ControlMachine before any of the cluster's other nodes. There is no necessity for synchronized clocks on the nodes. The hierarchical communication infrastructure provides excellent scalability. Fault tolerance will be provided through mechanisms that save and restore the SLURM database using local and global file systems.

Sample SLURM Configuration File

# 
# Sample /etc/SLURM.conf
# Author: John Doe
# Date: 11/06/2001
#
Administrators=cdunlap,garlick,grondo,jette
#
ControlMachine=lx01
BackupController=lx02
#
NodeSpecConf=/usr/local/SLURM/NodeSpecConf
PartitionConf=/usr/local/SLURM/PartitionConf
#
MasterDaemon=/usr/local/SLURM/Slurmd.Master
InitProgram=/usr/local/SLURM/Slurmd.Prolog
ControlDaemon=/usr/local/SLURM/Slurmd.Control
ServerDaemon=/usr/local/SLURM/Slurmd.Server
ControllerTimeout=120
ServerTimeout=90

Sample Node Configuration File

# 
# Sample /usr/local/SLURM/NodeSpecConf
# Author: John Doe
# Date: 11/06/2001
#
Name=DEFAULT OS=Linux.2.4.7-1 CPUs=16 Speed=345.0 RealMemory=2048 VirtualMemory=4096 TmpDisk=16384 State=IDLE
#
# lx01-lx02 for login and control functions only, node state is DOWN for SLURM initiated jobs
Name=lx01 State=DOWN
Name=lx02 State=DOWN
#
# lx03-lx09 for partitions 1 (debug) and 3 (super)
Name=DEFAULT Partition=1,3
Name=lx03
Name=lx04
Name=lx05 
Name=lx06 
Name=lx07 TmpDisk=4096
Name=lx08 
Name=lx09 
#
# lx10-lx30 for partitions 0 (pbatch) and 3 (super)
Name=DEFAULT Partition=0,3
Name=lx10 
Name=lx11 VirtualMemory=2048
Name=lx12 RealMemory=1024 
Name=lx13 
Name=lx14 CPUs=32
Name=lx15 
Name=lx16 
Name=lx17 
Name=lx18 State=DOWN
Name=lx19 
Name=lx20 
Name=lx21 
Name=lx22 CPUs=8
Name=lx23 
Name=lx24 
Name=lx25 
Name=lx26 
Name=lx27 
Name=lx28 
Name=lx29 
Name=lx30 
#
# lx31-lx32 for partitions 4 (class) and 3 (super)
Name=DEFAULT Partition=3,4
Name=lx31 
Name=lx32 

Sample Partition Configuration File

# 
# Example /usr/local/SLURM/PartitionConf
# Author: John Doe
# Date: 12/14/2001
#
Name=pbatch  Number=0 JobType=BATCH       MaxCpus=128 MaxTime=UNLIMITED
Name=debug   Number=1 JobType=INTERACTIVE MaxCpus=16  MaxTime=60
Name=super   Number=3 JobType=ALL   MaxCpus=UNLIMITED MaxTime=UNLIMITED AllowUsers=cdunlap,garlick,grondo,jette
Name=class   Number=4 JobType=ALL         MaxCpus=16  MaxTime=10        AllowUsers=student1,student2,student3

Sample slurm_admin Execution

Upload actual node configurations to review:
  # slurm_admin
  slurm_admin: upload
  slurm_admin: write node baseline_node_config
  slurm_admin: reconfigure
  slurm_admin: quit
  # cat baseline_node_config
  .....

Remove node lx30 from service, removing jobs as needed:
  # slurm_admin
  slurm_admin: update node Name=lx30 State=DRAINING
  slurm_admin: show job
  ID=1234 Name=Simulation JobType=BATCH MaxTime=100 Nodes=lx29,lx30 State=RUNNING User=grondo
  ID=1235 Name=MyBigTest  JobType=BATCH MaxTime=100 Nodes=lx20,lx21 State=RUNNING User=grondo
  slurm_admin: update job ID=1234 State=ENDING
  slurm_admin: show job 1234
  Job 1234 not found
  slurm_admin: stop lx30
  slurm_admin: show node lx30
  Name=lx30 Partition=0,3 State=DOWN OS=Linux.2.4.7-1 CPUs=16 Speed=345.0 RealMemory=2048 VirtualMemory=4096 TmpDisk=16384
  slurm_admin: quit

URL = http://www-lc.llnl.gov/dctg-lc/slurm/user.administrator.html

Last Modified January 23, 2002

Maintained by Moe Jette jette1@llnl.gov