SLURM Administrator's Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters having
thousands of nodes. Components include machine status, partition
management, job management, and scheduling modules. The design also
includes a scalable, general-purpose communication infrastructure.
SLURM requires no kernel modifications and is relatively self-contained.
There is a "control" machine, which orchestrates SLURM activities. A "backup
controller" is also desirable to assume control functions in the event
of a failure in the control machine. There are also compute servers,
which can number in the thousands, on which applications execute.
Configuration
There is a single SLURM configuration file containing
overall SLURM options, node configurations, and partition configurations.
This file is located at "/etc/SLURM.conf" by default.
The file location can be modified at system build time using the
SLURM_CONF parameter.
The overall SLURM configuration options specify the control and backup
control machines.
The locations of daemons, state information storage, and other details
are specified at build time.
See the Build Parameters section for details.
The node configuration tells SLURM what nodes it is to manage as well as
their expected configuration.
The partition configuration permits you to define sets (or partitions)
of nodes and establish distinct job limits or access control for them.
Configuration information may be read or updated using SLURM APIs.
This configuration file or a copy of it must be accessible on every computer under
SLURM management.
The following parameters may be specified:
- ControlMachine
- The name of the machine where SLURM control functions are executed
(e.g. "lx01"). This value must be specified.
- BackupController
- The name of the machine where SLURM control functions are to be
executed in the event that ControlMachine fails (e.g. "lx02"). This node
may also be used as a compute server if so desired. It will come into service
as a controller only upon the failure of ControlMachine and will revert
to a "standby" mode when the ControlMachine becomes available once again.
While not essential, it is highly recommended that you specify a backup
controller.
Any text after "#" until the end of the line in the configuration file
will be considered a comment.
If you need to use "#" in a value within the configuration file, precede
it with a backslash ("\").
The configuration file should contain a keyword followed by an
equal sign, followed by the value.
Keyword value pairs should be separated from each other by white space.
The field descriptor keywords are case sensitive.
The size of each line in the file is limited to 1024 characters.
A sample SLURM configuration file (without node or partition information)
follows.
# /etc/SLURM.conf
# Built by John Doe, 1/29/2002
ControlMachine=lx01
BackupController=lx02
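To make the syntax rules above concrete, the following is a minimal sketch,
in C, of how one line of the configuration file might be tokenized. It is an
illustration only, not the parser SLURM itself uses, and it assumes nothing
beyond the rules just stated: text after an unescaped "#" is a comment, "\#"
yields a literal "#", and keyword=value pairs are separated by white space.

#include <stdio.h>
#include <string.h>

/* Strip comments from one configuration line in place.
 * Text after an unescaped "#" is discarded; "\#" is kept as a literal "#". */
static void strip_comment(char *line)
{
        char *src = line, *dst = line;

        while (*src) {
                if (src[0] == '\\' && src[1] == '#') {
                        *dst++ = '#';   /* keep escaped "#" */
                        src += 2;
                } else if (*src == '#') {
                        break;          /* comment: ignore rest of line */
                } else {
                        *dst++ = *src++;
                }
        }
        *dst = '\0';
}

int main(void)
{
        char line[1024 + 1];    /* lines are limited to 1024 characters */

        while (fgets(line, sizeof(line), stdin)) {
                char *save = NULL, *pair;

                line[strcspn(line, "\n")] = '\0';
                strip_comment(line);

                /* keyword=value pairs are separated by white space */
                for (pair = strtok_r(line, " \t", &save); pair;
                     pair = strtok_r(NULL, " \t", &save)) {
                        char *eq = strchr(pair, '=');
                        if (eq) {
                                *eq = '\0';
                                printf("keyword=%s value=%s\n", pair, eq + 1);
                        }
                }
        }
        return 0;
}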
The node configuration permits you to identify the nodes (or machines)
to be managed by SLURM. You may also identify the
characteristics of the node in the configuration file. SLURM operates
in a heterogeneous environment and users are able to specify resource
requirements for each job.
The node configuration specifies the following information:
- NodeName
- Name of a node as returned by hostname (e.g. "lx12").
A simple regular expression may optionally be used to specify ranges
of nodes to avoid building a configuration file with thousands
of entries. The expression can contain one pair of square brackets
enclosing an optional "o", "x", or "X" (for octal, lower case
hexadecimal, or upper case hexadecimal respectively; the default is
decimal), followed by a number, a "-", and another number.
SLURM considers every number in the specified range to
identify a valid node. Some possible NodeName values include:
"solo", "lx[00-64]", "linux[0-64]", "slurm[o00-77]" and "cluster[x00-ff]".
If the NodeName is "DEFAULT", the values specified
with that record will apply to subsequent node specifications
unless explicitly set to other values in that node record or
replaced with a different set of default values.
For architectures in which the node order is significant,
nodes will be considered consecutive in the order defined.
For example, if the configuration for NodeName=charlie immediately
follows the configuration for NodeName=baker they will be
considered adjacent in the computer.
- CPUs
- Number of processors on the node (e.g. "2"). The default
value is 1.
- RealMemory
- Size of real memory on the node in MegaBytes (e.g. "2048").
The default value is 1.
- TmpDisk
- Total size of temporary disk storage in TMP_FS in MegaBytes
(e.g. "16384"). TMP_FS (for "Temporary File System")
identifies the location which jobs should use for temporary storage. The
value of TMP_FS is set at SLURM build time.
Note this does not indicate the amount of free
space available to the user on the node, only the total file
system size. The system administrator should ensure this file
system is purged as needed so that user jobs have access to
most of this space.
The PROLOG and/or EPILOG programs (specified at build time) might
be used to ensure the file system is kept clean.
The default value is 1.
- State
- State of the node with respect to the initiation of user jobs.
Acceptable values are "UNKNOWN", "IDLE", "BUSY", "DOWN", "DRAINED",
"DRAINING". For example, the SLURM control machine may very well
not be used for initiation of user jobs and its state could thus
be set to "DRAINED". The default value is "UNKNOWN".
Only the NodeName must be supplied in the configuration file; all other
items are optional.
Other configuration information can be gathered through communication
with the SLURM daemon, slurmd, which runs on each node.
Alternately, you can explicitly establish baseline values in the
configuration file.
Nodes which register to the system with less than the configured resources
(e.g. too little memory) will be placed in the "DOWN" state to
avoid scheduling jobs on them.
The resources checked at node registration time are: CPUs,
RealMemory and TmpDisk.
The default values for each node can be specified with a record in which
"NodeName" is "DEFAULT".
The "NodeName=" specification must be placed on every line
describing the configuration of that node(s).
When a NodeName specification exists on two or more separate lines
in the configuration, only values specified in the second
or subsequent lines will be set (SLURM will not re-apply default values).
All required information can typically be placed on a single line.
The field descriptors above are case sensitive.
The default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where "NodeName=DEFAULT".
In order to support the concept of jobs requiring consecutive nodes
on some architectures,
node specifications should be placed in this file in consecutive order.
The size of each line in the file is limited to 1024 characters.
The node states have the following meanings:
- UNKNOWN
- Default initial node state upon startup of SLURM.
An attempt will be made to contact the node and acquire current state information.
- IDLE
- The node is idle and available for use.
- BUSY
- The node has been allocated work (one or more user jobs) and is
processing it.
- DOWN
- The node is unavailable for use. It has been explicitly configured
DOWN or failed to respond to system state inquiries or has
explicitly removed itself from service due to a failure. This state
typically indicates some problem requiring administrator intervention.
- DRAINING
- The node has been made unavailable for new work by explicit administrator
intervention. It is processing some work at present and will enter state
"DRAINED" when that work has been completed. This might be used to
prepare some nodes for maintenance work.
- DRAINED
- The node is idle, but not available for use. The state of a node
will automatically change from DRAINING to DRAINED when user job(s) executing
on that node terminate. Since this state is entered by explicit
administrator request, additional SLURM administrator intervention is typically
not required.
SLURM uses a hash table in order to locate table entries rapidly.
Each table entry can be directly accessed without any searching
if the name contains a sequence number suffix. SLURM can be built
with the HASH_BASE set at build time to indicate the hashing mechanism.
Possible values are "16", "10" and "8" for names containing
hexadecimal, decimal, and octal sequence numbers respectively
or "0" which processes mixed alpha-numeric without sequence numbers.
If you use a naming convention lacking a sequence number, it may be
desirable to review the hashing function Hash_Index in the
Mach_Stat_Mgr.c module. This is especially important in clusters having
large numbers of nodes. The sequence numbers can start at any
desired number, but should contain consecutive numbers. The
sequence number portion may contain leading zeros for a consistent
name length, if so desired. Note that correct operation
will be provided with any node names, but performance will suffer
without this optimization.
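To illustrate the idea, the following is a minimal sketch of such a
sequence-number hash. It is not the Hash_Index function from the
Mach_Stat_Mgr.c module, and the table size used here is an assumption made
only for the example.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define HASH_TABLE_SIZE 1024    /* assumed table size, for illustration only */

/* Return non-zero if c is a digit of a trailing sequence number in the
 * given base (8, 10, or 16). */
static int is_seq_digit(char c, int base)
{
        if (base == 16)
                return isxdigit((unsigned char)c);
        if (base == 8)
                return (c >= '0' && c <= '7');
        return isdigit((unsigned char)c);
}

/* Compute a hash index from the trailing sequence number of a node name,
 * interpreted in the given base.  A base of 0 (no sequence numbers) falls
 * back to summing the characters of the whole name. */
static int hash_index(const char *name, int base)
{
        unsigned long index = 0;
        size_t len = strlen(name);
        size_t i;

        if (base == 0) {
                for (i = 0; i < len; i++)
                        index += (unsigned char)name[i];
                return (int)(index % HASH_TABLE_SIZE);
        }

        i = len;                        /* find start of the sequence number */
        while (i > 0 && is_seq_digit(name[i - 1], base))
                i--;

        for (; i < len; i++) {          /* convert it to an integer */
                char c = name[i];
                int digit = isdigit((unsigned char)c) ?
                            c - '0' : tolower((unsigned char)c) - 'a' + 10;
                index = index * base + digit;
        }
        return (int)(index % HASH_TABLE_SIZE);
}

int main(void)
{
        printf("%d\n", hash_index("lx12", 10));         /* decimal suffix */
        printf("%d\n", hash_index("clusterff", 16));    /* hexadecimal suffix */
        printf("%d\n", hash_index("solo", 0));          /* no sequence number */
        return 0;
}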
A sample SLURM configuration file (node information only) follows.
# Node specifications
NodeName=DEFAULT CPUs=16 RealMemory=2048 TmpDisk=16384
NodeName=lx[01-02] State=DRAINED
NodeName=lx[03-16]
NodeName=lx[17-32] CPUs=32 RealMemory=4096
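The bracketed node names in the sample above expand into individual names.
The following is a minimal sketch, not taken from the SLURM sources, of how
such an expression might be expanded; it assumes the bracketed range appears
at the end of the name, as in all of the examples in this guide.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Expand a NodeName expression such as "lx[03-16]" or "slurm[o00-77]",
 * printing one node name per line.  Names with no bracketed range
 * ("solo") are printed as-is. */
static void expand_nodename(const char *spec)
{
        const char *lb = strchr(spec, '[');
        const char *rb = lb ? strchr(lb, ']') : NULL;
        const char *num, *fmt;
        int base = 10, upper = 0, width;
        unsigned long i, start, end;
        char *dash;

        if (!lb || !rb) {                       /* no range given */
                printf("%s\n", spec);
                return;
        }

        num = lb + 1;                           /* optional base prefix */
        if (*num == 'o')      { base = 8;  num++; }
        else if (*num == 'x') { base = 16; num++; }
        else if (*num == 'X') { base = 16; upper = 1; num++; }

        start = strtoul(num, &dash, base);      /* first number */
        end   = strtoul(dash + 1, NULL, base);  /* number after the "-" */
        width = (int)(dash - num);              /* keep leading zeros */

        if (base == 8)
                fmt = "%.*s%0*lo\n";
        else if (base == 16)
                fmt = upper ? "%.*s%0*lX\n" : "%.*s%0*lx\n";
        else
                fmt = "%.*s%0*lu\n";

        for (i = start; i <= end; i++)
                printf(fmt, (int)(lb - spec), spec, width, i);
}

int main(int argc, char **argv)
{
        int i;

        for (i = 1; i < argc; i++)
                expand_nodename(argv[i]);
        return 0;
}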
The partition configuration permits you to establish different job
limits or access controls for various groups (or partitions) of nodes.
Nodes may be in more than one partition. The partition configuration
file contains the following information:
- PartitionName
- Name by which the partition may be referenced (e.g. "Interactive").
This name can be specified by users when submitting jobs.
- MaxTime
- Maximum wall-time limit for any job in minutes. The default
value is "UNLIMITED", which is represented internally as -1.
- MaxNodes
- Maximum count of nodes which may be allocated to any single job.
The default value is "UNLIMITED", which is represented internally as -1.
- State
- State of partition or availability for use. Possible values
are "UP" or "DOWN". The default value is "UP".
- RootKey
- If this keyword is set, the job must be submitted with a
valid "Key=value" specified.
Valid key values are provided to user root upon request.
This mechanism can be used to restrict access to a partition.
For example, a batch system might execute as root, acquire
a key via the SLURM API, then set its user ID to that of a
non-privileged user and initiate his job.
The user's job has no special privileges other than access
to the partition.
The non-privileged user would not be able to submit jobs
directly to this partition for lack of a key.
Issued keys will remain valid for a single use only.
- AllowGroups
- Comma separated list of group IDs which may use the partition.
If at least one group associated with the user submitting the
job is in AllowGroups, he will be permitted to use this partition.
The default value is "ALL".
- Nodes
- Comma separated list of nodes which are associated with this
partition. Node names may be specified using the
regular expression syntax described above. A blank list of nodes
(i.e. "Nodes= ") can be used if one wants a partition to exist,
but have no resources (possibly on a temporary basis).
- Shared
- Specify if more than one job may execute on each node in
a partition simultaneously. Possible values are
"YES" and "NO". The default value is "NO".
If nodes are shared, job performance will vary.
Only the PartitionName must be supplied in the configuration file.
It is recommended that the configuration file contain information
about one partition per line. If more than one line is used to
describe the configuration of a partition, specify the "PartitionName="
on each line.
The field descriptors above are case sensitive.
The default values for each partition can be specified with a record in which
"PartitionName" is "DEFAULT" if other default values are desired.
The default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where "PartitionName=DEFAULT".
When a PartitionName specification exists on two separate lines
in the configuration, only values explicitly set in the second
or subsequent lines will be set (SLURM will not re-apply default values).
The size of each line in the file is limited to 1024 characters.
A single job may be allocated nodes from more than one partition if the
job satisfies the configuration specifications from all partitions used.
We could restrict jobs to use of a single partition, but this could
reduce flexibility in scheduling. - Moe
The job may specify a particular PartitionName, if so desired.
A sample SLURM configuration file (partition information only) follows.
# Partition specifications
PartitionName=batch MaxNodes=10 MaxTime=UNLIMITED Nodes=lx[10-30] RootKey
PartitionName=debug MaxNodes=2 MaxTime=60 Nodes=lx[03-09]
PartitionName=super MaxNodes=UNLIMITED MaxTime=UNLIMITED Nodes=lx[03-32] AllowGroups=dunlap,garlick,grondo,jette
PartitionName=class MaxNodes=1 MaxTime=10 Nodes=lx[31-32] AllowGroups=students
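The AllowGroups check amounts to testing whether any group associated with
the submitting user appears in the partition's list. The following is a
minimal sketch of such a check using standard POSIX group lookups; it is an
illustration of the concept, not the code SLURM itself uses, and the 64-group
limit is an assumption made for brevity.

#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS 64   /* assumed limit, for illustration only */

/* Return 1 if "user" belongs to at least one of the comma separated group
 * names in "allow_groups", or if allow_groups is "ALL"; otherwise return 0. */
static int user_in_allow_groups(const char *user, const char *allow_groups)
{
        gid_t groups[MAX_GROUPS];
        int ngroups = MAX_GROUPS, i, allowed = 0;
        struct passwd *pw;
        char *list, *name, *save = NULL;

        if (strcmp(allow_groups, "ALL") == 0)
                return 1;

        pw = getpwnam(user);
        if (pw == NULL)
                return 0;
        if (getgrouplist(user, pw->pw_gid, groups, &ngroups) < 0)
                return 0;       /* user in more than MAX_GROUPS groups */

        list = strdup(allow_groups);
        for (name = strtok_r(list, ",", &save); name && !allowed;
             name = strtok_r(NULL, ",", &save)) {
                struct group *gr = getgrnam(name);

                if (gr == NULL)
                        continue;
                for (i = 0; i < ngroups; i++) {
                        if (groups[i] == gr->gr_gid) {
                                allowed = 1;
                                break;
                        }
                }
        }
        free(list);
        return allowed;
}

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <user> <allow_groups>\n", argv[0]);
                return 1;
        }
        printf("%s\n",
               user_in_allow_groups(argv[1], argv[2]) ? "allowed" : "denied");
        return 0;
}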
Several sets of system configurations may be placed into a single
SLURM configuration file with associated names.
For example, one might establish a "night" and "weekend"
configuration.
One should establish a default configuration to be used
when SLURM is initialized.
The named configurations may be used when explicitly
requested using the slurm_admin tool described below.
Each named configuration is followed by the set of configuration
parameters to be applied, enclosed within braces ("{}").
I am not entirely happy with this. It seems we would need
to specify time ranges, day of week ranges, holiday flags, etc.
It would seem simpler to specify a general partition configuration
in the file and use CRON to make changes. For example, we could
create a PartitionName=class with State=DOWN and change the State
as desired. - Moe
A SLURM configuration file is included at the end of this document.
Job Configuration
The job configuration format specified below is used by the
slurm_admin administration tool to modify job state information:
- Number
- Unique number by which the job can be referenced. This value
may not be changed by slurm_admin.
- Name
- Name by which the job may be referenced (e.g. "Simulation").
This name can be specified by users when submitting their jobs.
- MaxTime
- Maximum wall-time limit for the job in minutes. An "UNLIMITED"
value is represented internally as -1.
- Nodes
- Comma separated list of nodes which are allocated to the job.
This value may not be changed by slurm_admin.
- State
- State of the job. Possible values are "PENDING", "STARTING",
"RUNNING", and "ENDING".
- User
- Name of the user executing this job.
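For illustration, the fields above might be held in a record such as the
following. This structure is a sketch only, not SLURM's internal job table,
and the name length limit is an assumption made for the example; the sample
values mirror the slurm_admin session shown later in this document.

#include <stdio.h>

#define MAX_NAME_LEN 16         /* assumed length limit, for illustration */

/* Job states as listed above. */
enum job_state { PENDING, STARTING, RUNNING, ENDING };

/* Illustrative job record holding the fields described above. */
struct job_record {
        unsigned int number;    /* unique job number (may not be changed) */
        char name[MAX_NAME_LEN];/* job name, e.g. "Simulation" */
        int max_time;           /* wall-time limit in minutes; -1 is UNLIMITED */
        const char *nodes;      /* allocated nodes (may not be changed) */
        enum job_state state;   /* PENDING, STARTING, RUNNING, or ENDING */
        char user[MAX_NAME_LEN];/* name of the user executing the job */
};

int main(void)
{
        struct job_record job = { 1234, "Simulation", 100, "lx29,lx30",
                                  RUNNING, "grondo" };

        printf("job %u (%s) on %s, owner %s\n",
               job.number, job.name, job.nodes, job.user);
        return 0;
}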
Build Parameters
The following configuration parameters are established at SLURM build time.
State and configuration information may be read or updated using SLURM APIs.
- BACKUP_INTERVAL
- How long to wait between saving SLURM state. The default
value is 60 and the units are seconds.
- BACKUP_LOCATION
- The fully qualified pathname of the file where the SLURM
state information is saved. This file must be
accessible to both the ControlMachine and the BackupController.
The default value is "/usr/local/SLURM/Slurm.state".
- CONTROL_DAEMON
- The fully qualified pathname of the file containing the SLURM daemon
to execute on the ControlMachine. The default value is "/usr/local/SLURM/bin/Slurmd.Control".
This file must be accessible to the ControlMachine and BackupController.
- CONTROLLER_TIMEOUT
- How long the BackupController waits for the CONTROL_DAEMON to respond
before assuming it has failed and taking over control functions itself.
The default value is 300 and the units are seconds.
- EPILOG
- This program is executed on each node allocated to a job upon its termination.
This can be used to remove temporary files created by the job or other clean-up.
This file must be accessible to every SLURM compute server.
By default there is no epilog program.
- HASH_BASE
- SLURM uses a hash table in order to locate table entries rapidly.
Each table entry can be directly accessed without any searching
if the name contains a sequence number suffix. SLURM can be built
with the HASH_BASE set to indicate the hashing mechanism. Possible
values are "16", "10", and "8" for names containing hexadecimal,
decimal and octal sequence numbers respectively
or "0" which processes mixed alpha-numeric without sequence numbers.
If you use a naming convention lacking a sequence number, it may be
desirable to review the hashing function Hash_Index in the
Mach_Stat_Mgr.c module. This is especially important in clusters having
large numbers of nodes. The default value is "10".
- HEARTBEAT_INTERVAL
- How frequently each SERVER_DAEMON should report its state to the CONTROL_DAEMON.
Also, how frequently the CONTROL_DAEMON should report its state to the BackupController.
The default value is 60 and the units are seconds.
- INIT_PROGRAM
- The fully qualified pathname of a program that must execute and
return an exit code of zero before the CONTROL_DAEMON or SERVER_DAEMON
enter into service. This would normally be used to insure that the
computer is fully ready for executing user jobs. It may, for example,
wait until every required file system has been mounted.
By default there is no initialization program.
- KILL_WAIT
- How long to wait between sending SIGTERM and SIGKILL signals to jobs at
termination time; a sketch of this sequence follows the list of build parameters.
The default value is 60 and the units are seconds.
- PROLOG
- This program is executed on each node allocated to a job prior to its initiation.
This file must be accessible to every SLURM compute server. By default no prolog is executed.
- SERVER_DAEMON
- The fully qualified pathname of the file containing the SLURM daemon
to execute on every compute server node. The default value is "/usr/local/SLURM/bin/Slurmd.Server".
This file must be accessible to every SLURM compute server.
- SERVER_TIMEOUT
- How long the CONTROL_DAEMON waits for a SERVER_DAEMON to respond before assuming it
has failed, declaring the node DOWN, and terminating any jobs running on
it. The default value is 300 and the units are seconds.
- SLURM_CONF
- The fully qualified pathname of the SLURM
configuration file. The default value is "/etc/SLURM.conf".
- TMP_FS
- The fully qualified pathname of the file system which jobs should use for
temporary storage. The default value is "/tmp".
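As a concrete illustration of the KILL_WAIT sequence described above, the
sketch below sends SIGTERM to a job's processes, waits, and then sends
SIGKILL. It is an illustration only, not SLURM's own termination code, and
the use of a process group to represent the job is an assumption made for
the example.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Terminate every process in a job's process group: first politely with
 * SIGTERM, then, after kill_wait seconds, forcibly with SIGKILL. */
static void terminate_job(pid_t pgid, unsigned int kill_wait)
{
        if (kill(-pgid, SIGTERM) < 0 && errno != ESRCH) {
                perror("SIGTERM");
                return;
        }
        sleep(kill_wait);
        if (kill(-pgid, SIGKILL) < 0 && errno != ESRCH)
                perror("SIGKILL");
}

int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {                 /* child: stand-in for a job step */
                setpgid(0, 0);          /* give it its own process group */
                execlp("sleep", "sleep", "600", (char *)NULL);
                _exit(127);
        }
        setpgid(pid, pid);              /* parent: avoid a race with the child */
        sleep(1);

        terminate_job(pid, 2);          /* 2 seconds here; the default is 60 */
        waitpid(pid, NULL, 0);
        printf("job terminated\n");
        return 0;
}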
Slurm_admin Administration Tool
The tool you will primarily use in the administration of SLURM is slurm_admin.
It provides the means of viewing and updating node and partition
configurations. It can also be used to update some job state
information. You can execute slurm_admin with a single keyword on
the command line, or run it interactively, in which case it will prompt
for keywords and process them as they are entered. The slurm_admin keywords are shown below.
A sample slurm_admin session with examples is appended.
Usage: slurm_admin [-q | -v] [<keyword>]
-q is equivalent to the "quiet" keyword
-v is equivalent to the "verbose" keyword
- exit
- Terminate slurm_admin.
- help
- Display this list of slurm_admin commands and options.
- quiet
- Print no messages other than error messages.
- quit
- Terminate slurm_admin.
- reconfigure [<NodeName>]
- The SLURM daemons on the specified node are instructed to re-read
the configuration files. The default is that all daemons on all nodes
are reconfigured.
- show <entity> [<ID>]
- Show the configuration for a given entity. Entity must
be "job", "node", or "partition". By default, state information
for all records is reported. If you only wish to see the
state of one entity record, specify either its ID number
(assumed if entirely numeric) or its name.
- update <options>
- Update the configuration information.
Options are of the same format as the configuration file.
This command can only be issued by user root.
- upload [<NodeName>]
- Upload into the SLURM node configuration table the actual configuration
reported by the SERVER_DAEMON on each node (memory, CPU count, temporary disk, etc.).
This can be used to establish a baseline configuration rather than
entering the configurations manually into a file.
By default information from all nodes is uploaded.
This command can only be issued by user root.
- verbose
- Enable detailed logging of slurm_admin execution state information.
- version
- Display the slurm_admin tool version number.
- write <filename>
- Write current configuration information to the specified file.
This file can subsequently be used as a SLURM configuration file.
This file can be quite verbose as regular expressions will not be
used for node identification. (To do: add regular expressions)
Miscellaneous
There is no necessity for synchronized clocks on the nodes.
Events occur either in real time based upon message traffic
or based upon the passage of time on a node. However, synchronized
clocks will permit easier analysis of SLURM logs from multiple
nodes.
SLURM uses the syslog function to record events. It uses a
range of importance levels for these messages. Be certain
that your system's syslog functionality is operational.
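If you wish to verify that syslog is recording messages at the importance
levels SLURM uses, a few lines of C are enough to generate test entries. The
facility, program name, and message text below are assumptions made for
illustration only.

#include <syslog.h>

int main(void)
{
        /* Log as a daemon would; the facility, identifier, and messages here
         * are assumptions made for illustration, not those SLURM must use. */
        openlog("slurmd", LOG_PID, LOG_DAEMON);
        syslog(LOG_INFO, "slurmd started on node %s", "lx12");
        syslog(LOG_WARNING, "node %s registered with less memory than configured", "lx07");
        syslog(LOG_ERR, "cannot contact ControlMachine %s", "lx01");
        closelog();
        return 0;
}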
Sample Configuration File
# /etc/SLURM.conf
# Built by John Doe, 1/29/2002
ControlMachine=lx01
BackupController=lx02
#
# Node specifications
NodeName=DEFAULT CPUs=16 Speed=345.6 RealMemory=2048 TmpDisk=16384
NodeName=lx[01-02] State=DRAINED
NodeName=lx[03-16] Feature=CoolDebugger
#
# Default partition specification
PartitionName=batch MaxCpus=128 MaxTime=240 Nodes=lx[10-30] RootKey
PartitionName=debug MaxCpus=16 MaxTime=60 Nodes=lx[03-09] Shared=YES
PartitionName=super MaxCpus=UNLIMITED MaxTime=UNLIMITED Nodes=lx[03-32] AllowGroups=dunlap,garlick,grondo,jette
PartitionName=class MaxCpus=16 MaxTime=10 Nodes=lx[31-32] AllowGroups=students
#
# night configuration used 17:00 (Sunday through Thursday) until 8:00 (the next morning)
# absorb class partition into batch and reset time limit
night day-of-week_range time-of-day_range holiday_flag {
PartitionName=batch Nodes=lx[10-32] MaxTime=480
PartitionName=class Nodes=
}
#
# weekend configuration used Friday 17:00 until Sunday 17:00
# absorb class and merge debug partition into batch and reset time limit
weekend day-of-week_range time-of-day_range holiday_flag {
PartitionName=batch Nodes=lx[03-32] MaxTime=960
PartitionName=debug Nodes=lx[03-09] Shared=NO
PartitionName=class Nodes=
}
Sample slurm_admin Execution
Upload actual node configurations to review:
# slurm_admin
slurm_admin: upload
slurm_admin: write node baseline_node_config
slurm_admin: reconfigure
slurm_admin: quit
# cat baseline_node_config
.....
Remove node lx30 from service, removing jobs as needed:
# slurm_admin
slurm_admin: update NodeName=lx30 State=DRAINING
slurm_admin: show job
ID=1234 Name=Simulation MaxTime=100 Nodes=lx29,lx30 State=RUNNING User=grondo
ID=1235 Name=MyBigTest MaxTime=100 Nodes=lx20,lx21 State=RUNNING User=grondo
slurm_admin: update job ID=1234 State=ENDING
slurm_admin: show job 1234
Job 1234 not found
slurm_admin: show node lx30
Name=lx30 Partition=batch,super State=DRAINED CPUs=16 Speed=345.0 RealMemory=2048 TmpDisk=16384
slurm_admin: quit