Quick Start Administrator Guide
Overview
Please see the Quick Start User Guide for a general
overview.
Building and Installing
Instructions to build and install SLURM manually are shown below.
See the README and INSTALL files in the source distribution for more details.
- gunzip the distributed tar-ball and
untar the files.
- cd to the directory containing the SLURM
source and type ./configure with appropriate
options.
- Type make to compile SLURM.
- Type make install to install the programs,
documentation, libraries, header files, etc.
The most commonly used arguments to the configure
command include:
--enable-debug
Enable additional debugging logic within SLURM.
--prefix=PREFIX
Install architecture-independent files in PREFIX; default value is /usr/local.
--sysconfdir=DIR
Specify location of SLURM configuration file.
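For example, a complete manual build might look like the following (the version number, unpacked directory name, and installation paths are only illustrative; adjust them for your site):
gunzip -c slurm-0.6.0-1.tgz | tar xf -
cd slurm-0.6.0-1
./configure --enable-debug --prefix=/usr/local --sysconfdir=/etc/slurm
make
make install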
Optional SLURM plugins will be built automatically when the
configure script detects that the required
build dependencies are present. Build dependencies for the various plugins
are noted below.
- Munge The auth/munge plugin will be built if Chris Dunlap's Munge
library is installed.
- Authd The auth/authd plugin will be built and installed if
the libauth library and its dependency libe are installed.
- Federation The switch/federation plugin will be built and installed
if the IBM Federation switch library is installed.
- QsNet support in the form of the switch/elan plugin requires
that the qsnetlibs package (from Quadrics) be installed along
with its development counterpart (i.e., the qsnetheaders
package). The switch/elan plugin also requires the
presence of the libelanhosts library and the /etc/elanhosts
configuration file. (See the elanhosts(5) man page in that
package for more details.)
Please see the Download page for references to
required software to build these plugins.
To build RPMs directly, copy the distributed tar-ball into the directory
/usr/src/redhat/SOURCES and execute a command of this sort (substitute
the appropriate SLURM version number):
rpmbuild -ta slurm-0.5.0-1.tgz
or
rpmbuild -ta slurm-0.6.0-1.tar.bz2
You can control some aspects of the RPM build with a .rpmmacros
file in your home directory. Special macro definitions will likely
only be required if files are installed in unconventional locations.
Some macro definitions that may be used in building SLURM include:
- _enable_debug
- Specify if debugging logic within SLURM is to be enabled
- _prefix
- Pathname of directory to contain the SLURM files
- _sysconfdir
- Pathname of directory containing the slurm.conf configuration file
- with_munge
- Specifies munge (authentication library) installation location
- with_proctrack
- Specifies AIX process tracking kernel extension header file location
- with_ssl
- Specifies SSL library installation location
To build SLURM on our AIX system, the following .rpmmacros file is used:
# .rpmmacros
# For AIX at LLNL
# Override some RPM macros from /usr/lib/rpm/macros
# Set other SLURM-specific macros for unconventional file locations
#
%_enable_debug "--with-debug"
%_prefix /admin/llnl
%_sysconfdir %{_prefix}/etc/slurm
%with_munge "--with-munge=/admin/llnl"
%with_proctrack "--with-proctrack=/admin/llnl/include"
%with_ssl "--with-ssl=/opt/freeware"
Daemons
slurmctld is sometimes called the "controller" daemon. It
orchestrates SLURM activities, including queuing of jobs, monitoring node state,
and allocating resources (nodes) to jobs. There is an optional backup controller
that automatically assumes control in the event the primary controller fails.
The primary controller resumes control whenever it is restored to service. The
controller saves its state to disk whenever there is a change.
This state can be recovered by the controller at startup time.
State changes are saved so that jobs and other state can be preserved when
the controller moves (to or from the backup controller) or is restarted.
We recommend that you create a Unix user slurm for use by
slurmctld. This user name should also be specified using the
SlurmUser parameter in the slurm.conf configuration file.
Note that files and directories used by slurmctld will need to be
readable or writable by the user SlurmUser (the slurm configuration
files must be readable; the log file directory and state save directory
must be writable).
The slurmd daemon executes on every compute node. It resembles a remote
shell daemon, exporting control of the node to SLURM. Because slurmd initiates and manages
user jobs, it must execute as the user root.
slurmctld and/or slurmd should be initiated at node startup time
per the SLURM configuration.
A file etc/init.d/slurm is provided for this purpose.
This script accepts commands start, startclean (ignores
all saved state), restart, and stop.
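For example, assuming the script is installed as /etc/init.d/slurm, the daemons might be managed with commands such as:
/etc/init.d/slurm start
/etc/init.d/slurm startclean
/etc/init.d/slurm restart
/etc/init.d/slurm stop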
Infrastructure
Authentication of SLURM communications
All communications between SLURM components are authenticated. The
authentication infrastructure is provided by a dynamically loaded
plugin chosen at runtime via the AuthType keyword in the SLURM
configuration file. Currently available authentication types include
authd,
munge, and none.
The default authentication infrastructure is "none". This permits any user to execute
any job as another user. This may be fine for testing purposes, but certainly not for production
use. Configure some AuthType value other than "none" if you want any security.
We recommend the use of Munge unless you are experienced with authd.
While SLURM itself does not rely upon synchronized clocks on all nodes
of a cluster for proper operation, its underlying authentication mechanism
may have this requirement. For instance, if SLURM is making use of the
auth/munge plugin for communication, the clocks on all nodes will need to
be synchronized.
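For example, to select the Munge plugin, slurm.conf would contain a line of this form (the same value appears in the sample configuration file later in this guide):
AuthType=auth/munge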
MPI support
Quadrics MPI works directly with SLURM on systems having Quadrics
interconnects and is the preferred version of MPI for those systems.
Set the MpiDefault=none configuration parameter in slurm.conf.
For Myrinet systems, MPICH-GM
is preferred. In order to use MPICH-GM, set the MpiDefault=mpichgm and
ProctrackType=proctrack/linuxproc configuration parameters in
slurm.conf.
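For example, the relevant slurm.conf entries for a Myrinet system using MPICH-GM would be:
MpiDefault=mpichgm
ProctrackType=proctrack/linuxproc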
HP customers would be well served by using
HP-MPI.
A good open-source MPI for use with SLURM is
LAM/MPI. LAM/MPI uses the command
lamboot to initiate job-specific daemons on each node using SLURM's
srun
command. This places all MPI processes in a process-tree under the control of
the slurmd daemon. LAM/MPI version 7.1 or higher contains support for
SLURM.
Set the MpiDefault=lam configuration parameter in slurm.conf.
Another good open-source MPI for use with SLURM is
Open MPI.
Set the MpiDefault=lam configuration parameter in slurm.conf
for use of Open MPI.
Note that the ordering of tasks within a job's allocation matches that of
nodes in the slurm.conf configuration file. SLURM presently lacks the ability
to arbitrarily order tasks across nodes.
Scheduler support
The scheduler used by SLURM is controlled by the SchedulerType configuration
parameter. This determines the order in which pending jobs are considered for initiation.
SLURM's default scheduler is FIFO (First-In First-Out). A backfill scheduler
plugin is also available. Backfill scheduling will initiate a lower-priority job
if doing so does not delay the expected initiation time of higher priority jobs;
essentially using smaller jobs to fill holes in the resource allocation plan.
SLURM also supports a plugin for use of
The Maui Scheduler, which offers sophisticated scheduling algorithms.
Motivated users can even develop their own scheduler plugin if so desired.
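For example, to enable the backfill scheduler described above, slurm.conf would contain (as in the sample configuration later in this guide):
SchedulerType=sched/backfill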
Node selection
The node selection mechanism used by SLURM is controlled by the
SelectType configuration parameter.
If you want to execute multiple jobs per node, but apportion the processors,
memory and other resources, the cons_res (consumable resources)
plugin is recommended.
If you tend to dedicate entire nodes to jobs, the linear plugin
is recommended.
For more information, please see
Consumable Resources in SLURM.
For BlueGene systems, the bluegene plugin is required (it is topology
aware and interacts with the BlueGene bridge API).
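For example, depending on whether nodes are shared or dedicated, slurm.conf would contain one of the following lines (any text after "#" is a comment):
SelectType=select/cons_res   # share nodes, apportioning processors and memory
SelectType=select/linear     # dedicate whole nodes to jobs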
Logging
SLURM uses the syslog function to record events. It uses a range of importance
levels for these messages. Be certain that your system's syslog functionality
is operational.
Corefile format
SLURM is designed to support generating a variety of core file formats for
application codes that fail (see the --core option of the srun
command). As of now, SLURM only supports a locally developed lightweight
corefile library which has not yet been released to the public. It is
expected that this library will be available in the near future.
Parallel debugger support
SLURM exports information for parallel debuggers using the specification
detailed here.
This is meant to be exploited by any parallel debugger (notably, TotalView),
and support is unconditionally compiled into SLURM code.
We use a patched version of TotalView that looks for a "totalview_jobid"
symbol in srun that it then uses (configurably) to perform a bulk
launch of the tvdsvr daemons via a subsequent srun. Otherwise
it is difficult to get TotalView to use srun for a bulk launch, since
srun will be unable to determine for which job it is launching tasks.
Another solution would be to run TotalView within an existing srun
--allocate session. Then the TotalView bulk launch command to srun
could be set to ensure only a single task per node. This functions properly
because the SLURM_JOBID environment variable is set in the allocation shell
environment.
Compute node access
SLURM does not by itself limit access to allocated compute nodes,
but it does provide mechanisms to accomplish this.
There is a Pluggable Authentication Module (PAM) for restricting access
to compute nodes available for download.
When installed, the SLURM PAM module will prevent users from logging
into any node that has not been assigned to that user.
On job termination, any processes initiated by the user outside of
SLURM's control may be killed using an Epilog script configured
in slurm.conf.
An example of such a script is included as etc/slurm.epilog.clean.
Without these mechanisms any user can log in to any compute node,
even those allocated to other users.
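For example, assuming the example script has been copied to /etc/slurm/slurm.epilog.clean (the path is only illustrative), slurm.conf might contain:
Epilog=/etc/slurm/slurm.epilog.clean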
Configuration
The SLURM configuration file includes a wide variety of parameters.
This configuration file must be available on each node of the cluster. A full
description of the parameters is included in the slurm.conf man page. Rather than
duplicate that information, a minimal sample configuration file is shown below.
Your slurm.conf file should define at least the configuration parameters defined
in this sample and likely additional ones. Any text
following a "#" is considered a comment. The keywords in the file are
not case sensitive, although the argument typically is (e.g., "SlurmUser=slurm"
might be specified as "slurmuser=slurm"). The control machine, like
all other machine specifications, can include both the host name and the name
used for communications. In this case, the host's name is "mcri" and
the name "emcri" is used for communications.
In this case "emcri" is the private management network interface
for the host "mcri". Port numbers to be used for
communications are specified as well as various timer values.
A description of the nodes and their grouping into non-overlapping partitions
is required. Partition and node specifications use node range expressions to identify
nodes in a concise fashion. This configuration file defines a 1154-node cluster
for SLURM, but it might be used for a much larger cluster by just changing a few
node range expressions. Specify the minimum processor count (Procs), real memory
space (RealMemory, megabytes), and temporary disk space (TmpDisk, megabytes) that
a node should have to be considered available for use. Any node lacking these
minimum configuration values will be considered DOWN and not scheduled.
Note that a more extensive sample configuration file is provided in
etc/slurm.conf.example.
#
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=mcri ControlAddr=emcri
BackupMachine=mcrj BackupAddr=emcrj
#
AuthType=auth/munge
Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/tmp/jette/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
PluginDir=/usr/local/slurm/lib/slurm
Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=slurm
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
SwitchType=switch/elan
#
# Node Configurations
#
NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=mcr[0-1151] NodeAddr=emcr[0-1151]
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-1151]
Security
You should create unique job credential keys for your site
using the program openssl.
You must use openssl and not ssh-keygen to construct these keys.
An example of how to do this is shown below. Specify file names that
match the values of JobCredentialPrivateKey and JobCredentialPublicCertificate
in your configuration file. The JobCredentialPrivateKey
file must be readable only by SlurmUser. The JobCredentialPublicCertificate file
must be readable by all users.
Both files must be available on all nodes in the cluster.
openssl genrsa -out /usr/local/etc/slurm.key 1024
openssl rsa -in /usr/local/etc/slurm.key -pubout -out /usr/local/etc/slurm.cert
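Assuming SlurmUser is set to "slurm" and the file names shown above, appropriate ownership and permissions might be set as follows:
chown slurm /usr/local/etc/slurm.key
chmod 600 /usr/local/etc/slurm.key
chmod 644 /usr/local/etc/slurm.cert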
SLURM does not use reserved ports to authenticate communication between components.
You will need to have at least one "auth" plugin. Currently, only three
authentication plugins are supported: auth/none, auth/authd, and
auth/munge. The auth/none plugin is built and used by default, but either
Brent Chun's authd, or Chris Dunlap's
munge should be installed in order to
get properly authenticated communications.
Unless you are experienced with authd, we recommend the use of munge.
The configure script in the top-level directory of this distribution will determine
which authentication plugins may be built. The configuration file specifies which
of the available plugins will be utilized.
A PAM module (Pluggable Authentication Module) is available for SLURM that
can prevent a user from accessing a node which has not been allocated to that
user, if that mode of operation is desired.
Starting the Daemons
For testing purposes you may want to start by just running slurmctld and slurmd
on one node. By default, they execute in the background. Use the -D
option for each daemon to execute them in the foreground; logging will be done
to your terminal. The -v option will log events
in more detail, with more v's increasing the level of detail (e.g. -vvvvvv).
You can use one window to execute slurmctld -D -vvvvvv
and a second window to execute slurmd -D -vvvvv.
You may see errors such as "Connection refused" or "Node X not responding"
while one daemon is operative and the other is being started, but the
daemons can be started in any order and proper communications will be
established once both daemons complete initialization.
You can use a third window to execute commands such as
srun -N1 /bin/hostname to confirm
functionality.
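A minimal test session, with the window assignments shown as comments, might look like this:
# window 1
slurmctld -D -vvvvvv
# window 2
slurmd -D -vvvvv
# window 3
srun -N1 /bin/hostname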
Another important option for the daemons is -c
to clear previous state information. Without the -c
option, the daemons will restore any previously saved state information: node
state, job state, etc. With the -c option all
previously running jobs will be purged and node state will be restored to the
values specified in the configuration file. This means that a node configured
down manually using the scontrol command will
be returned to service unless also noted as being down in the configuration file.
In practice, SLURM consistently restarts with its saved state preserved.
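For example, to restart while discarding all saved state, the daemons might be started as follows (the second form assumes the init script described earlier is installed as /etc/init.d/slurm):
slurmctld -c
slurmd -c
or, using the init script:
/etc/init.d/slurm startclean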
A thorough battery of tests written in the "expect" language is also
available.
Administration Examples
scontrol can be used to print all system information
and modify most of it. Only a few examples are shown below. Please see the scontrol
man page for full details. The commands and options are all case insensitive.
Print detailed state of all jobs in the system.
adev0: scontrol
scontrol: show job
JobId=475 UserId=bob(6885) Name=sleep JobState=COMPLETED
Priority=4294901286 Partition=batch BatchFlag=0
AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
StartTime=03/19-12:53:41 EndTime=03/19-12:53:59
NodeList=adev8 NodeListIndecies=-1
ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
ReqNodeList=(null) ReqNodeListIndecies=-1
JobId=476 UserId=bob(6885) Name=sleep JobState=RUNNING
Priority=4294901285 Partition=batch BatchFlag=0
AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
StartTime=03/19-12:54:01 EndTime=NONE
NodeList=adev8 NodeListIndecies=8,8,-1
ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
ReqNodeList=(null) ReqNodeListIndecies=-1
Print the detailed state of job 477 and change its priority to
zero. A priority of zero prevents a job from being initiated (it is held in "pending"
state).
adev0: scontrol
scontrol: show job 477
JobId=477 UserId=bob(6885) Name=sleep JobState=PENDING
Priority=4294901286 Partition=batch BatchFlag=0
more data removed....
scontrol: update JobId=477 Priority=0
Print the state of node adev13 and drain it. To drain a node specify a new
state of DRAIN, DRAINED, or DRAINING. SLURM will automatically set it to the appropriate
value of either DRAINING or DRAINED depending on whether the node is allocated
or not. Return it to service later.
adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=ALLOCATED CPUs=2 RealMemory=3448 TmpDisk=32000
Weight=16 Partition=debug Features=(null)
scontrol: update NodeName=adev13 State=DRAIN
scontrol: show node adev13
NodeName=adev13 State=DRAINING CPUs=2 RealMemory=3448 TmpDisk=32000
Weight=16 Partition=debug Features=(null)
scontrol: quit
Later
adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=DRAINED CPUs=2 RealMemory=3448 TmpDisk=32000
Weight=16 Partition=debug Features=(null)
scontrol: update NodeName=adev13 State=IDLE
Reconfigure all SLURM daemons on all nodes. This should
be done after changing the SLURM configuration file.
adev0: scontrol reconfig
Print the current SLURM configuration. This also reports if the
primary and secondary controllers (slurmctld daemons) are responding. To just
see the state of the controllers, use the scontrol command ping.
adev0: scontrol show config
Configuration data as of 03/19-13:04:12
AuthType = auth/munge
BackupAddr = eadevj
BackupController = adevj
ControlAddr = eadevi
ControlMachine = adevi
Epilog = (null)
FastSchedule = 1
FirstJobId = 1
HeartbeatInterval = 60
InactiveLimit = 0
JobCompLoc = /var/tmp/jette/slurm.job.log
JobCompType = jobcomp/filetxt
JobCredPrivateKey = /etc/slurm/slurm.key
JobCredPublicKey = /etc/slurm/slurm.cert
KillWait = 30
MaxJobCnt = 2000
MinJobAge = 300
PluginDir = /usr/lib/slurm
Prolog = (null)
ReturnToService = 1
SchedulerAuth = (null)
SchedulerPort = 65534
SchedulerType = sched/backfill
SlurmUser = slurm(97)
SlurmctldDebug = 4
SlurmctldLogFile = /tmp/slurmctld.log
SlurmctldPidFile = /tmp/slurmctld.pid
SlurmctldPort = 7002
SlurmctldTimeout = 300
SlurmdDebug = 65534
SlurmdLogFile = /tmp/slurmd.log
SlurmdPidFile = /tmp/slurmd.pid
SlurmdPort = 7003
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300
SLURM_CONFIG_FILE = /etc/slurm/slurm.conf
StateSaveLocation = /usr/local/tmp/slurm/adev
SwitchType = switch/elan
TmpFS = /tmp
WaitTime = 0
Slurmctld(primary/backup) at adevi/adevj are UP/UP
Shutdown all SLURM daemons on all nodes.
adev0: scontrol shutdown
Upgrades
When upgrading to a new major or minor release of SLURM (e.g. 0.3.x to 0.4.x)
all running and pending jobs will be purged due to changes in state save
information. It is possible to develop software to translate state information
between versions, but we do not normally expect to do so.
When upgrading to a new micro release of SLURM (e.g. 0.3.1 to 0.3.2) all
running and pending jobs will be preserved. Just install a new version of
SLURM and restart the daemons.
An exception to this is that jobs may be lost when installing new pre-release
versions (e.g. 0.4.0-pre1 to 0.4.0-pre2). We'll try to note these cases
in the NEWS file.