SLURM Administrator's Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters having
thousands of nodes. Components include machine status, partition
management, job management, and scheduling modules. The design also
includes a scalable, general-purpose communication infrastructure.
SLURM requires no kernel modifications and is relatively self-contained.
There is a "control" machine, which orchestrates SLURM activities. A "backup
controller" is also desirable to assume control functions in the event
of a failure in the control machine. There are also compute servers on
which applications execute, which can number in the thousands.
Configuration
There is a single SLURM configuration file containing
overall SLURM options, node configurations, and partition configurations.
This file is located at "/etc/SLURM.conf" by default.
The file location can be modified at system build time using the
DEFAULT_SLURM_CONF parameter.
The overall SLURM configuration options specify the control and backup
control machines.
The locations of daemons, state information storage, and other details
are specified at build time.
See the Build Parameters section for details.
The node configuration tells SLURM what nodes it is to manage as well as
their expected configuration.
The partition configuration permits you to define sets (or partitions)
of nodes and establish distinct job limits or access control for them.
Configuration information may be read or updated using SLURM APIs.
This configuration file or a copy of it must be accessible on every computer under
SLURM management.
The following parameters may be specified:
- ControlMachine
- The name of the machine where SLURM control functions are executed
(e.g. "lx0001"). This value must be specified.
- BackupController
- The name of the machine where SLURM control functions are to be
executed in the event that ControlMachine fails (e.g. "lx0002"). This node
may also be used as a compute server if so desired. It will come into service
as a controller only upon the failure of ControlMachine and will revert
to a "standby" mode when the ControlMachine becomes available once again.
While not essential, it is highly recommended that you specify a backup
controller.
Any text after "#" until the end of the line in the configuration file
will be considered a comment.
If you need to use "#" in a value within the configuration file, precede
it with a backslash (i.e. "\#").
The configuration file should contain a keyword followed by an
equal sign, followed by the value.
Keyword value pairs should be separated from each other by white space.
The field descriptor keywords are case sensitive.
The size of each line in the file is limited to 1024 characters.
A sample SLURM configuration file (without node or partition information)
follows.
# /etc/SLURM.conf
# Built by John Doe, 1/29/2002
ControlMachine=lx0001
BackupController=lx0002
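The comment and keyword-value rules above can be illustrated with a
short sketch. The following C fragment is illustrative only and is not
SLURM source code; the function name strip_comment is hypothetical.
It removes an unescaped "#" comment, keeps "\#" as a literal "#", and
splits the remaining white-space separated Keyword=Value pairs.

#include <stdio.h>
#include <string.h>

/* Remove an unescaped "#" comment from a configuration line. */
static void strip_comment(char *line)
{
        char *out = line;
        for (char *in = line; *in; in++) {
                if (*in == '\\' && in[1] == '#') {
                        *out++ = '#';   /* "\#" yields a literal "#" */
                        in++;
                } else if (*in == '#') {
                        break;          /* rest of line is a comment */
                } else {
                        *out++ = *in;
                }
        }
        *out = '\0';
}

int main(void)
{
        char line[1024] = "ControlMachine=lx0001  # primary controller";
        strip_comment(line);

        /* White-space separated Keyword=Value pairs. */
        for (char *tok = strtok(line, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
                char *eq = strchr(tok, '=');
                if (eq) {
                        *eq = '\0';
                        printf("keyword=%s value=%s\n", tok, eq + 1);
                }
        }
        return 0;
}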
The node configuration permits you to identify the nodes (or machines)
to be managed by SLURM. You may also identify the
characteristics of the nodes in the configuration file. SLURM operates
in a heterogeneous environment and users are able to specify resource
requirements for each job.
The node configuration specifies the following information:
- NodeName
- Name of a node as returned by hostname (e.g. "lx0012").
A simple regular expression may optionally
be used to specify ranges
of nodes to avoid building a configuration file with thousands
of entries. The regular expression can contain one
pair of square brackets enclosing an optional "o"
for octal (the default is decimal), a number, a "-",
and another number.
SLURM considers every number in the specified range to
identify a valid node. Some possible NodeName values include:
"solo", "lx[00-64]", "linux[0-64]", and "slurm[o00-77]"
(a sketch of how such an expression expands follows this list).
If the NodeName is "DEFAULT", the values specified
with that record will apply to subsequent node specifications
unless explicitly set to other values in that node record or
replaced with a different set of default values.
For architectures in which the node order is significant,
nodes will be considered consecutive in the order defined.
For example, if the configuration for NodeName=charlie immediately
follows the configuration for NodeName=baker they will be
considered adjacent in the computer.
- Feature
- A comma delimited list of arbitrary strings indicative of some
characteristic associated with the node.
There is no value associated with a feature at this time; a node
either has a feature or it does not.
If desired a feature may contain a numeric component indicating,
for example, processor speed.
By default a node has no features.
- RealMemory
- Size of real memory on the node in MegaBytes (e.g. "2048").
The default value is 1.
- Procs
- Number of processors on the node (e.g. "2").
The default value is 1.
- State
- State of the node with respect to the initiation of user jobs.
Acceptable values are "DOWN", "UNKNOWN", "IDLE", and "DRAINING".
The node states are fully described below.
The default value is "UNKNOWN".
- TmpDisk
- Total size of temporary disk storage in TMP_FS in MegaBytes
(e.g. "16384"). TMP_FS (for "Temporary File System")
identifies the location which jobs should use for temporary storage. The
value of TMP_FS is set at SLURM build time.
Note this does not indicate the amount of free
space available to the user on the node, only the total file
system size. The system administrator should ensure this file
system is purged as needed so that user jobs have access to
most of this space.
The PROLOG and/or EPILOG programs (specified at build time) might
be used to ensure the file system is kept clean.
The default value is 1.
- Weight
- The priority of the node for scheduling purposes.
All things being equal, jobs will be allocated the nodes with
the lowest weight that satisfy their requirements.
For example, a heterogeneous collection of nodes might
be placed into a single partition for greater system
utilization, responsiveness and capability. It would be
preferable to allocate smaller memory nodes rather than larger
memory nodes if either will satisfy a job's requirements.
The units of weight are arbitrary, but larger weights
should be assigned to nodes with more processors, memory,
disk space, higher processor speed, etc.
Weight is an integer value with a default value of 1.
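As referenced in the NodeName description above, the bracketed range
notation expands into individual host names. The following C sketch is
illustrative only and is not the SLURM parser; the function name
expand_node_names is hypothetical. It handles a single bracket pair,
the optional "o" octal marker, and leading zeros.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void expand_node_names(const char *spec)
{
        const char *lb = strchr(spec, '[');
        if (lb == NULL) {                 /* plain name, e.g. "solo" */
                printf("%s\n", spec);
                return;
        }

        char prefix[64];
        snprintf(prefix, sizeof(prefix), "%.*s", (int)(lb - spec), spec);

        int base = 10;
        const char *range = lb + 1;
        if (*range == 'o') {              /* optional octal marker */
                base = 8;
                range++;
        }

        char *dash;
        long first = strtol(range, &dash, base);   /* number before "-" */
        long last  = strtol(dash + 1, NULL, base); /* number after "-"  */
        int width  = (int)(dash - range);          /* keep leading zeros */

        for (long i = first; i <= last; i++) {
                if (base == 8)
                        printf("%s%0*lo\n", prefix, width, (unsigned long)i);
                else
                        printf("%s%0*ld\n", prefix, width, i);
        }
}

int main(void)
{
        expand_node_names("lx[0003-0005]");  /* lx0003 lx0004 lx0005 */
        expand_node_names("slurm[o00-03]");  /* slurm00 ... slurm03 (octal) */
        return 0;
}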
Only the NodeName must be supplied in the configuration file; all other
items are optional.
It is advisable to establish baseline node configurations in
the configuration file, especially if the cluster is heterogeneous.
Nodes which register to the system with less than the configured resources
(e.g. too little memory), will be placed in the "DOWN" state to
avoid scheduling jobs on them.
Establishing baseline configurations will also speed SLURM's
scheduling process by permitting it to compare job requirements
against these (relatively few) configuration parameters and
possibly avoid having to check job requirements
against every individual node's configuration.
The resources checked at node registration time are: Procs,
RealMemory and TmpDisk.
While baseline values for each of these can be established
in the configuration file, the actual values upon node
registration are recorded and these actual values are
used for scheduling purposes.
Default values can be specified with a record in which
"NodeName" is "DEFAULT".
The default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where "NodeName=DEFAULT".
The "NodeName=" specification must be placed on every line
describing the configuration of nodes.
Each node's configuration must be specified on a single line
rather than having the various values established on multiple lines.
In fact, it is generally possible and desirable to define the
configurations of all nodes in only a few lines.
This convention permits significant optimization in the scheduling
of larger clusters.
The field descriptors above are case sensitive.
In order to support the concept of jobs requiring consecutive nodes
on some architectures,
node specifications should be placed in this file in consecutive order.
The size of each line in the file is limited to 1024 characters.
The node states have the following meanings:
- BUSY
- The node has been allocated work (one or more user jobs) and is
processing it.
- DOWN
- The node is unavailable for use. It has been explicitly configured
DOWN or failed to respond to system state inquiries or has
explicitly removed itself from service due to a failure. This state
typically indicates some problem requiring administrator intervention.
- DRAINED
- The node is idle, but not available for use. The state of a node
will automatically change from DRAINING to DRAINED when user job(s) executing
on that node terminate. Since this state is entered by explicit
administrator request, additional SLURM administrator intervention is typically
not required.
- DRAINING
- The node has been made unavailable for new work by explicit administrator
intervention. It is processing some work at present and will enter state
"DRAINED" when that work has been completed. This might be used to
prepare some nodes for maintenance work.
- IDLE
- The node is idle and available for use.
- STAGE_IN
- The node has been allocated to a job, which is being prepared for
the job's execution.
- STAGE_OUT
- The node has been allocated to a job, which has completed execution.
The node is performing job termination work.
- UNKNOWN
- Default initial node state upon startup of SLURM.
An attempt will be made to contact the node and acquire current state information.
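The states above might be represented as a simple enumeration. The
following C sketch is purely illustrative (the type and function names
are hypothetical and not SLURM source); it shows the DRAINING to
DRAINED and BUSY to IDLE transitions that occur when a node's last
job terminates.

/* Node states as described above. */
enum node_state {
        STATE_UNKNOWN,    /* initial state until the node is contacted */
        STATE_IDLE,
        STATE_BUSY,
        STATE_DOWN,
        STATE_DRAINING,   /* no new work; still running existing jobs */
        STATE_DRAINED,    /* no new work; now idle */
        STATE_STAGE_IN,
        STATE_STAGE_OUT
};

/* Called when a job running on the node terminates. */
static enum node_state job_done_transition(enum node_state s, int jobs_left)
{
        if (s == STATE_DRAINING && jobs_left == 0)
                return STATE_DRAINED;
        if (s == STATE_BUSY && jobs_left == 0)
                return STATE_IDLE;
        return s;
}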
SLURM uses a hash table in order to locate table entries rapidly.
Each table entry can be directly accessed without any searching
if the name contains a sequence number suffix. SLURM can be built
with the HASH_BASE set at build time to indicate the hashing algorithm.
Possible values are "10" and "8" for names containing
decimal and octal sequence numbers respectively,
or "0", which handles mixed alphanumeric names without sequence numbers.
The default value of HASH_BASE is "10".
If you use a naming convention lacking a sequence number, it may be
desirable to review the hashing function Hash_Index in the
node_mgr.c module. This is especially important in clusters having
large numbers of nodes. The sequence numbers can start at any
desired number, but should contain consecutive numbers. The
sequence number portion may contain leading zeros for a consistent
name length, if so desired. Note that correct operation
will be provided with any node names, but performance will suffer
without this optimization.
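The following C sketch illustrates the kind of sequence-number hashing
described above. It is a simplified stand-in and not the actual
Hash_Index function from node_mgr.c: with HASH_BASE of 10 or 8 the
trailing sequence number indexes the table directly, while 0 falls back
to a generic hash of the whole name.

#include <ctype.h>
#include <string.h>

#define HASH_BASE 10    /* illustrative value; set at build time */

static int hash_index(const char *node_name, int table_size)
{
        unsigned long index = 0;

        if (HASH_BASE == 10 || HASH_BASE == 8) {
                /* Use the trailing digits, e.g. "lx0030" -> 30. */
                const char *p = node_name + strlen(node_name);
                while (p > node_name && isdigit((unsigned char)p[-1]))
                        p--;
                for (; *p; p++)
                        index = index * HASH_BASE + (unsigned long)(*p - '0');
        } else {
                /* No sequence number: hash every character. */
                for (const char *p = node_name; *p; p++)
                        index = index * 31 + (unsigned char)*p;
        }
        return (int)(index % (unsigned long)table_size);
}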
A sample SLURM configuration file (node information only) follows.
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz,VizTools
The partition configuration permits you to establish different job
limits or access controls for various groups (or partitions) of nodes.
Nodes may be in only one partition. The partition configuration
specifies the following information:
- AllowGroups
- Comma separated list of group names whose members may use the partition.
If at least one group associated with the user submitting the
job is in AllowGroups, the user will be permitted to use this partition
(a sketch of this check follows the end of this list).
The default value is "ALL".
- Default
- If this keyword is set, jobs submitted without a partition
specification will utilize this partition.
Possible values are "YES" and "NO".
The default value is "NO".
- Key
- Specifies whether a SLURM-provided key is required for a job to
execute in this partition.
The key is provided to user root upon request and is invalidated
after one use or expiration.
The key may be used to delegate control of partitions to external
schedulers.
Possible values are "YES" and "NO".
The default value is "NO".
- MaxNodes
- Maximum count of nodes which may be allocated to any single job.
The default value is "UNLIMITED", which is represented internally as -1.
- MaxTime
- Maximum wall-time limit for any job in minutes. The default
value is "UNLIMITED", which is represented internally as -1.
- Nodes
- Comma separated list of nodes which are associated with this
partition. Node names may be specified using the
regular expression syntax described above. A blank list of nodes
(i.e. "Nodes= ") can be used if one wants a partition to exist,
but have no resources (possibly on a temporary basis).
- PartitionName
- Name by which the partition may be referenced (e.g. "Interactive").
This name can be specified by users when submitting jobs.
- Shared
- Ability of the partition to execute more than one job at a
time on each node. Shared nodes will offer unpredictable performance
for application programs, but can provide higher system utilization
and responsiveness than otherwise possible.
Possible values are "FORCE", "YES", and "NO".
The default value is "NO".
- State
- State of partition or availability for use. Possible values
are "UP" or "DOWN". The default value is "UP".
Only the PartitionName must be supplied in the configuration file.
Other parameters will assume default values if not specified.
The default values can be specified with a record in which
"PartitionName" is "DEFAULT" if non-standard default values are desired.
The default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where "PartitionName=DEFAULT".
Each partition's configuration should be specified on a single line.
The field descriptors above are case sensitive.
The size of each line in the file is limited to 1024 characters.
A single job may be allocated nodes from only one partition and
must satisfy the configuration specifications for that partition.
The job may specify a particular PartitionName, if so desired,
or use the system's default partition.
A sample SLURM configuration file (partition information only) follows.
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096 Key=YES
APIs and an administrative tool can be used to alter the SLURM
configuration in real time.
When the SLURM controller restarts, its state will be restored
to that at the time it terminated unless the SLURM configuration
file is newer, in which case the configuration will be rebuilt
from that file.
State information not incorporated in the configuration file,
such as job state, will be preserved.
A SLURM configuration file is included
at the end of this document.
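The restart rule above amounts to a timestamp comparison: rebuild from
the configuration file only if it is newer than the saved state. The
following C sketch is illustrative only; the function name is
hypothetical and not SLURM source.

#include <stdbool.h>
#include <sys/stat.h>

static bool config_is_newer(const char *conf_path, const char *state_path)
{
        struct stat conf_st, state_st;
        if (stat(conf_path, &conf_st) != 0)
                return false;            /* no config file: keep old state */
        if (stat(state_path, &state_st) != 0)
                return true;             /* no saved state: use the config */
        return conf_st.st_mtime > state_st.st_mtime;
}

/* e.g. config_is_newer("/etc/SLURM.conf", "/usr/local/SLURM/Slurm.state") */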
Job Configuration
The job configuration format specified below is used by the
scontrol administration tool to modify job state information:
- Contiguous
- Determines whether the nodes allocated to the job must be contiguous.
Acceptable values are "YES" and "NO" with the default being "NO".
- Features
- Required features of nodes to be allocated to this job.
Features may be combined using "|" for OR, "&" for AND,
and square brackets.
For example, "Features=1000MHz|1200MHz&CoolTool".
The feature list is processed left to right except for
the grouping by brackets
(a simplified evaluation sketch follows the end of this list).
Square brackets are used to identify alternate features,
one of which must apply to every node allocated to the job.
For example, some clusters are configured with more than
one parallel file system. These parallel file systems
may be accessible only to a subset of the nodes in a cluster.
The application may not care which parallel file system
is used, but all nodes allocated to it must be in the
subset of nodes accessing a single parallel file system.
This might be specified as
"Features=[PFS1|PFS2|PFS3|PFS4]".
- Groups
- Comma separated list of group names to which the user belongs.
- JobName
- Name to be associated with the job.
- JobId
- Identification for the job. By default this is the partition's
name, followed by a period, followed by a sequence number (e.g.
"batch.1234").
- Key
- Key granted to user root for optional access control to partitions.
- Name
- Name by which the job may be referenced (e.g. "Simulation").
This name can be specified by users when submitting their jobs.
- MaxTime
- Maximum wall-time limit for the job in minutes. An "UNLIMITED"
value is represented internally as -1.
- MinProcs
- Minimum number of processors per node.
- MinRealMemory
- Minimum number of megabytes of real memory per node.
- MinTmpDisk
- Minimum number of megabytes of temporary disk storage per node.
- ReqNodes
- Comma separated list of nodes which must be allocated to the job.
The nodes may be specified using regular expressions (e.g.
"lx[0010-0020],lx[0033-0040]".
This value may not be changed by scontrol.
- Number
- Unique number by which the job can be referenced. This value
may not be changed by scontrol.
- Partition
- Name of the partition in which this job should execute.
- Priority
- Floating point priority of the pending job. The value may
be specified for jobs initiated by user root; otherwise SLURM will
select a value. Generally, higher priority jobs will be initiated
before lower priority jobs. Backfill scheduling will permit
lower priority jobs to be initiated before higher priority jobs
only if doing so will not delay the anticipated initiation time
of the higher priority job.
- Script
- Pathname of the script to be executed for the job.
The script will typically contain "srun" commands to initiate
the parallel commands.
- Shared
- Whether the job can share nodes with other jobs. Possible values are 1
and 0 for YES and NO respectively.
- State
- State of the job. Possible values are "PENDING", "STARTING",
"RUNNING", and "ENDING".
- TotalNodes
- Minimum total number of nodes to be allocated to the job.
- TotalProcs
- Minimum total number of processors to be allocated to the job.
- User
- Name of the user executing this job.
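As referenced in the Features description above, feature expressions
are processed left to right. The following C sketch is a simplified
illustration and not SLURM source: it evaluates an expression
containing "|" and "&" against one node's comma-delimited feature
list and ignores square-bracket groups entirely. The function names
are hypothetical.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Does the node's comma-delimited feature list contain "name"? */
static bool node_has_feature(const char *node_features, const char *name)
{
        char buf[1024], *tok, *save;
        strncpy(buf, node_features, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        for (tok = strtok_r(buf, ",", &save); tok; tok = strtok_r(NULL, ",", &save))
                if (strcmp(tok, name) == 0)
                        return true;
        return false;
}

/* Evaluate "f1|f2&f3" strictly left to right: ((f1 OR f2) AND f3). */
static bool features_match(const char *node_features, const char *expr)
{
        bool result = true;
        char op = '&';          /* implicit AND with the first term */
        char name[256];
        size_t n = 0;

        for (const char *p = expr; ; p++) {
                if (*p && *p != '|' && *p != '&') {
                        if (n < sizeof(name) - 1)
                                name[n++] = *p;
                        continue;
                }
                name[n] = '\0';
                n = 0;
                bool has = node_has_feature(node_features, name);
                result = (op == '&') ? (result && has) : (result || has);
                if (*p == '\0')
                        break;
                op = *p;
        }
        return result;
}

int main(void)
{
        const char *node = "1200MHz,CoolTool";   /* hypothetical node */
        printf("%d\n", features_match(node, "1000MHz|1200MHz&CoolTool")); /* 1 */
        printf("%d\n", features_match(node, "1000MHz&CoolTool"));         /* 0 */
        return 0;
}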
Build Parameters
The following configuration parameters are established at SLURM build time.
State and configuration information may be read or updated using SLURM APIs.
- BACKUP_INTERVAL
- How long to wait between saving SLURM state. The default
value is 60 and the units are seconds.
- BACKUP_LOCATION
- The fully qualified pathname of the file where the SLURM
state information is saved. The file should
be accessible to both the ControlMachine and the BackupController.
The default value is "/usr/local/SLURM/Slurm.state".
- CONTROL_DAEMON
- The fully qualified pathname of the file containing the SLURM daemon
to execute on the ControlMachine. The default value is "/usr/local/SLURM/bin/Slurmd.Control".
This file must be accessible to the ControlMachine and BackupController.
- CONTROLLER_TIMEOUT
- How long the BackupController waits for the CONTROL_DAEMON to respond
before assuming it has failed and taking over control functions itself.
The default value is 300 and the units are seconds.
- EPILOG
- This program is executed on each node allocated to a job upon its termination.
This can be used to remove temporary files created by the job or other clean-up.
This file must be accessible to every SLURM compute server.
By default there is no epilog program.
- FAST_SCHEDULE
- If set, SLURM will check a job's memory, processor, and disk
constraints only against the configuration file entries;
the specific values reported by each node will not be tested, and scheduling
will be considerably faster for large clusters.
- HASH_BASE
- SLURM uses a hash table in order to locate table entries rapidly.
Each table entry can be directly accessed without any searching
if the name contains a sequence number suffix. SLURM can be built
with the HASH_BASE set to indicate the hashing mechanism. Possible
values are "10" and "8" for names containing
decimal and octal sequence numbers respectively
or "0" which processes mixed alpha-numeric without sequence numbers.
If you use a naming convention lacking a sequence number, it may be
desirable to review the hashing function Hash_Index in the
Mach_Stat_Mgr.c module. This is especially important in clusters having
large numbers of nodes. The default value is "10".
- HEARTBEAT_INTERVAL
- How frequently each SERVER_DAEMON should report its state to the CONTROL_DAEMON.
Also, how frequently the CONTROL_DAEMON should report its state to the BackupController.
The default value is 60 and the units are seconds.
- INIT_PROGRAM
- The fully qualified pathname of a program that must execute and
return an exit code of zero before the CONTROL_DAEMON or SERVER_DAEMON
enter into service. This would normally be used to ensure that the
computer is fully ready for executing user jobs. It may, for example,
wait until every required file system has been mounted.
By default there is no initialization program.
- KILL_WAIT
- How long to wait between sending SIGTERM and SIGKILL signals to jobs at termination time
(a sketch of this escalation follows the end of this list).
The default value is 60 and the units are seconds.
- PRIORITIZE
- Program to execute in order to establish the initial priority of a job.
The program is passed the job's specifications and returns the priority.
Details of message format TBD.
By default there is no prioritization program.
- PROLOG
- This program is executed on each node allocated to a job prior to its initiation.
This file must be accessible to every SLURM compute server. By default no prolog is executed.
- SERVER_DAEMON
- The fully qualified pathname of the file containing the SLURM daemon
to execute on every compute server node. The default value is "/usr/local/SLURM/bin/Slurmd.Server".
This file must be accessible to every SLURM compute server.
- SERVER_TIMEOUT
- How long the CONTROL_DAEMON waits for the SERVER_DAEMON to respond before assuming it
has failed and declaring the node DOWN then terminating any job running on
it. The default value is 300 and the units are seconds.
- SLURM_CONF
- The fully qualified pathname of the SLURM configuration file.
The default value is "/etc/SLURM.conf".
- TMP_FS
- The fully qualified pathname of the file system which jobs should use for
temporary storage. The default value is "/tmp".
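As referenced in the KILL_WAIT description above, job termination
escalates from SIGTERM to SIGKILL. The following C sketch is
illustrative only and not SLURM source; the function name is
hypothetical and KILL_WAIT stands in for the build-time parameter of
the same name.

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define KILL_WAIT 60    /* seconds; illustrative build-time default */

static void terminate_job(pid_t pid)
{
        kill(pid, SIGTERM);          /* ask the job to exit cleanly  */
        sleep(KILL_WAIT);            /* give it KILL_WAIT seconds    */
        if (kill(pid, 0) == 0)       /* still alive?                 */
                kill(pid, SIGKILL);  /* force termination            */
}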
scontrol Administration Tool
The tool you will primarily use in the administration of SLURM is scontrol.
It provides the means of viewing and updating node and partition
configurations. It can also be used to update some job state
information. You can execute scontrol with a single keyword on
the command line, or it will prompt you for input and process
keywords interactively. The scontrol keywords are shown below.
A sample scontrol session with examples is appended.
Usage: scontrol [-q | -v] [<keyword>]
-q is equivalent to the "quiet" keyword
-v is equivalent to the "verbose" keyword
- exit
- Terminate scontrol.
- help
- Display this list of scontrol commands and options.
- quiet
- Print no messages other than error messages.
- quit
- Terminate scontrol.
- reconfigure
- The SLURM control daemon re-reads its configuration files.
- show <entity> [<ID>]
- Show the configuration for a given entity. Entity must
be "build", "job", "node", or "partition" for SLURM build
parameters, job, node and partition information respectively.
By default, state information for all records is reported.
If you only wish to see the state of one entity record,
specify either its ID number (assumed if entirely numeric)
or its name. Regular expressions may
be used to identify node names.
- update <options>
- Update the configuration information.
Options are of the same format as the configuration file.
Not all configuration information can be modified using
this mechanism; for example, a node's configuration cannot be changed
after it has registered (only its state can be modified).
One can always modify the SLURM configuration file and
use the reconfigure command to rebuild all controller
information if required.
This command can only be issued by user root.
- verbose
- Enable detailed logging of scontrol execution state information.
- version
- Display the scontrol tool version number.
Miscellaneous
There is no necessity for synchronized clocks on the nodes.
Events occur either in real time based upon message traffic
or based upon the passage of time on an individual node. However, synchronized
clocks will permit easier analysis of SLURM logs from multiple
nodes.
SLURM uses the syslog function to record events. It uses a
range of importance levels for these messages. Be certain
that your system's syslog functionality is operational.
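For reference, the following minimal C example records events at
several syslog importance levels. The identifier "slurmd" and the
messages are illustrative only.

#include <syslog.h>

int main(void)
{
        openlog("slurmd", LOG_PID, LOG_DAEMON);
        syslog(LOG_INFO,    "node lx0030 registered: 16 processors");
        syslog(LOG_WARNING, "node lx0031 reported less memory than configured");
        syslog(LOG_ERR,     "lost contact with node lx0032, marking DOWN");
        closelog();
        return 0;
}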
Sample Configuration File
# /etc/SLURM.conf
# Built by John Doe, 1/29/2002
ControlMachine=lx0001
BackupController=lx0002
#
# Node Configurations
#
NodeName=DEFAULT TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] Procs=16 RealMemory=2048 Weight=16
NodeName=lx[8001-9999] Procs=32 RealMemory=4096 Weight=40 Feature=1200MHz
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=batch Nodes=lx[0041-9999] MaxTime=UNLIMITED MaxNodes=4096 Key=YES
Sample scontrol Execution
Remove node lx0030 from service, removing jobs as needed:
# scontrol
scontrol: update NodeName=lx0030 State=DRAINING
scontrol: show job
ID=1234 Name=Simulation MaxTime=100 Nodes=lx[0029-0030] State=RUNNING User=smith
ID=1235 Name=MyBigTest MaxTime=100 Nodes=lx0020,lx0023 State=RUNNING User=smith
scontrol: update job ID=1234 State=ENDING
scontrol: show job 1234
Job 1234 not found
scontrol: show node lx0030
Name=lx0030 Partition=class State=DRAINED Procs=16 RealMemory=2048 TmpDisk=16384
scontrol: quit
URL = http://www-lc.llnl.gov/dctg-lc/slurm/admin.guide.html
Last Modified April 5, 2002
Maintained by
slurm-dev@lists.llnl.gov