SLURM Programmer's Guide

Overview

Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters containing thousands of nodes. Components include machine status, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure (MONGO, to be described elsewhere). SLURM requires no kernel modifications for its operation and is relatively self-contained. Initial target platforms include Red Hat Linux clusters with Quadrics interconnect and the IBM Blue Gene product line.

An overview of the components and their interactions is available in a separate document, SLURM: Simple Linux Utility for Resource Management.

Code should adhere to the Linux kernel code style as described in http://www.linuxhq.com/kernel/v2.4/doc/CodingStyle.html.

Directory Structure

The contents of the SLURM directory structure will be described below in increasing detail as the structure is descended. The top level directory contains the scripts and tools required to build the entire SLURM system. It also contains a variety of subdirectories for each type of file.

General build tools/files include: autogen.sh, configure.ac, Makefile.am and the contents of the auxdir directory. autoconf and make are used to build and install SLURM in an automated fashion. NOTE: autoconf version 2.52 or higher is required to build SLURM. Execute "autoconf -V" to check your version number. The build process may be as simple as executing a sequence of three commands:

./autogen.sh
./configure
make

Copyright and disclaimer information is in the files COPYING and DISCLAIMER. Documentation, including man pages, is in the subdirectory doc. Sample configuration files are in the etc subdirectory. All source code and header files are in the directory src. DejaGnu is used as a testing framework and all of its files are in the testsuite directory.

Documentation

All of the documentation is in the subdirectory doc. Man pages for both the commands and APIs are in doc/man. Various documents suitable for public consumption are in doc/html. An overall SLURM design document including various figures is in doc/pubdesign. Various design documents (many of which are dated) can be found in doc/slides and doc/txt. A survey of available resource managers initiated at the start of the SLURM project is in doc/survey.

Source code

Functions are divided into several categories, each in its own subdirectory. The details of each directory's contents are provided below. The directories are as follows:

api
Application Program Interfaces into the SLURM code. Used to send and get SLURM information from the central manager.
common
General purpose functions for widespread use.
popt
General purpose parsing tools.
scancel
User command to cancel a job or job step.
scontrol
Administrator tool to manage SLURM.
slurmctld
SLURM central manager code.
slurmd
SLURM code to manage the compute server nodes including the execution of user applications.
squeue
User command to get information on SLURM jobs and allocations.
srun
User command to submit a job, get an allocation, and/or initiate a parallel job step.
test
Functions for testing individual SLURM modules. These tests are not under the DejaGnu framework.

API Modules

This directory contains modules supporting the SLURM API functions. The APIs to get SLURM information accept a time-stamp. If the data has not changed since the specified time, a return code will indicate this and return no other data. Otherwise a data structure is returned including its time-stamp, element count, and an array of structures describing the state of each node, job, partition, etc. Each of these functions also includes a corresponding function to release all storage associated with the data structure.
allocate.c
Allocates resources for a job's initiation. This creates a job entry and allocates resources to it. The resources can be claimed at a later time to actually run a parallel job. If the requested resources are not currently available, the request will fail.
allocate.c
Allocates resources for a job. Depending upon the parameters used, the allocation request may result in the immediate execution of a job step, the immediate allocation of resources for future job steps, or the queuing of the allocation request.
cancel.c
Cancels (i.e. terminates) a running or pending job or job step.
complete.c
Note the completion of a running job or job step.
config_info.c
Reports SLURM configuration parameter values.
job_info.c
Reports job state information.
Makefile.am
Information used by autoconf to build a Makefile for the api subdirectory.
node_info.c
Reports node state and configuration values.
partition_info.c
Reports partition state and configuration values.
reconfigure.c
Requests that slurmctld reload configuration information. Also includes the API to request slurmctld shutdown.
submit.c
Submits a job to SLURM. The job will be queued for initiation when resources are available.
update_config.c
Updates job, node or partition state information.
Future components to include: job step support (a set of parallel tasks associated with a job or allocation, multiple job steps may execute in serial or parallel within an allocation), issuing keys, getting Elan (Quadrics interconnect) capabilities, and resource accounting.
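
The time-stamp convention described at the start of this section can be illustrated with a minimal sketch. Every name below (load_node_info, free_node_info, node_info_t, the INFO_* codes, and the in-memory table standing in for the central manager) is hypothetical and chosen only for illustration; the actual API names and signatures are documented in the man pages.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical names throughout; this is not the real SLURM API. */

typedef struct {
	char name[32];
	int state;
} node_rec_t;

typedef struct {
	time_t last_update;	/* time-stamp of this snapshot */
	size_t record_count;	/* number of entries in node_array */
	node_rec_t *node_array;	/* one entry per node */
} node_info_t;

enum { INFO_OK = 0, INFO_NO_CHANGE = 1, INFO_ERROR = -1 };

/* Stand-in for the central manager: a table and the time it last changed. */
static node_rec_t ctl_nodes[] = { {"lx01", 0}, {"lx02", 1} };
static time_t ctl_last_update = 1000;

/*
 * Return a fresh copy of the node table only if it changed after
 * update_time. On INFO_NO_CHANGE or INFO_ERROR, *out is set to NULL.
 */
int load_node_info(time_t update_time, node_info_t **out)
{
	size_t count = sizeof(ctl_nodes) / sizeof(ctl_nodes[0]);
	node_info_t *info;

	assert(out != NULL);
	*out = NULL;
	if (update_time >= ctl_last_update)
		return INFO_NO_CHANGE;

	info = malloc(sizeof(*info));
	if (info == NULL)
		return INFO_ERROR;
	info->node_array = malloc(sizeof(ctl_nodes));
	if (info->node_array == NULL) {
		free(info);
		return INFO_ERROR;
	}
	for (size_t i = 0; i < count; i++)
		info->node_array[i] = ctl_nodes[i];
	info->record_count = count;
	info->last_update = ctl_last_update;
	*out = info;
	return INFO_OK;
}

/* Companion function: release all storage held by the structure. */
void free_node_info(node_info_t *info)
{
	if (info == NULL)
		return;
	free(info->node_array);
	free(info);
}
```

A caller would typically pass 0 on the first request, save the returned last_update value, and pass it on later requests so that unchanged data is not transferred again.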

Common Modules

This directory contains modules of general use throughout the SLURM code. The modules are described below.
bitstring.[ch]
A collection of general purpose functions for managing bitmaps. We use these for rapid node management functions, including scheduling and associating partitions and jobs with the nodes.
hostlist.[ch]
Tools that accept a host list expression (e.g. "lx[123-456,777]") and provide the individual node names in several fashions.
list.[ch]
A general purpose list manager. One can define a list, add and delete entries, search for entries, etc.
log.[ch]
A general purpose log manager. It can filter log messages based upon severity and route them to stderr, syslog, or a log file.
macros.h
General purpose SLURM macro definitions.
Makefile.am
autoconf input to build a Makefile for this subdirectory.
pack.[ch]
Functions for packing and unpacking unsigned integers and strings for transmission over the network. The unsigned integers are translated to/from machine independent form. Strings are transmitted with a length value.
parse_spec.[ch]
Parser functions for translating the configuration file or input to scontrol.
qsw.[ch]
Functions for interacting with the Quadrics interconnect.
qsw.h
Definitions for qsw.c and documentation for its functions.
safeopen.[ch]
Functions for opening files with simple sanity checks on the file.
slurm_errno.h
Slurmd specific error codes.
slurm_protocol_api.[ch]
TBD
slurm_protocol_common.h
TBD
slurm_protocol_defs.[ch]
TBD
slurm_protocol_errno.[ch]
General SLURM error functions and codes.
slurm_protocol_implementation.c
TBD
slurm_protocol_mongo_common.h
TBD
slurm_protocol_pack.[ch]
Functions to pack a variety of RPC specific data structures.
slurm_protocol_socket_common.h
TBD
slurm_protocol_socket_implementation.c
Socket-based communications protocol functions.
slurm_protocol_util.[ch]
TBD
slurm_return_codes.h
TBD
strlcpy.[ch]
String copy function with input/output length information.
util_signals.[ch]
TBD
xassert.[ch]
Assert function with configurable handling.
xerrno.[ch]
Quadrics Elan error management functions.
xmalloc.[ch]
"Safe" memory management functions. Includes a magic cookie to ensure that freed memory was in fact allocated by its functions.
xstring.[ch]
A collection of functions for string manipulations with automatic expansion of allocated memory on an as needed basis.
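
As an illustration of the packing approach described for pack.[ch] above (integers converted to a machine independent form, strings sent with a length value), here is a minimal sketch. The function names and buffer conventions are hypothetical and are not taken from pack.h. It assumes a 32-bit length prefix that counts the terminating NUL, and it uses the standard htonl and ntohl calls for byte order conversion.

```c
#include <arpa/inet.h>	/* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper names; not the real pack.[ch] interface. */

/* Store val in network byte order. Returns bytes written, 0 if no room. */
size_t pack_u32(uint32_t val, unsigned char *buf, size_t space)
{
	uint32_t net = htonl(val);

	if (space < sizeof(net))
		return 0;
	memcpy(buf, &net, sizeof(net));
	return sizeof(net);
}

/* Read a value written by pack_u32. Returns bytes consumed, 0 on error. */
size_t unpack_u32(uint32_t *val, const unsigned char *buf, size_t avail)
{
	uint32_t net;

	if (avail < sizeof(net))
		return 0;
	memcpy(&net, buf, sizeof(net));
	*val = ntohl(net);
	return sizeof(net);
}

/* Store a 32-bit length (including the NUL) followed by the bytes. */
size_t pack_str(const char *str, unsigned char *buf, size_t space)
{
	size_t len = strlen(str) + 1;

	if (len > UINT32_MAX || space < sizeof(uint32_t) + len)
		return 0;
	pack_u32((uint32_t)len, buf, space);
	memcpy(buf + sizeof(uint32_t), str, len);
	return sizeof(uint32_t) + len;
}

/* Read a string written by pack_str into str (capacity str_size). */
size_t unpack_str(char *str, size_t str_size,
		  const unsigned char *buf, size_t avail)
{
	uint32_t len;

	if (unpack_u32(&len, buf, avail) == 0)
		return 0;
	if (len == 0 || len > str_size || avail - sizeof(uint32_t) < len)
		return 0;
	memcpy(str, buf + sizeof(uint32_t), len);
	str[len - 1] = '\0';	/* never trust the sender's terminator */
	return sizeof(uint32_t) + len;
}
```

Sending the length first lets the receiver check the available buffer space before copying any string data.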

scancel Modules

scancel is a command to cancel running or pending jobs or job steps.
Makefile.am
autoconf input to build a Makefile for this subdirectory.
scancel.c
A command line interface to cancel jobs or job steps.

scontrol Modules

scontrol is the administrator tool for monitoring and modifying SLURM configuration and state. It has a command line interface only.
Makefile.am
autoconf input to build a Makefile for this subdirectory.
scontrol.c
A command line interface to slurmctld.

slurmctld Modules

slurmctld executes on the control machine and orchestrates SLURM activities across the entire cluster including monitoring node and partition state, scheduling, job queue management, job dispatching, and switch management. The slurmctld modules and their functionality are described below.
controller.c
Primary SLURM daemon to execute on the control machine. It has several threads to handle signals, process incoming RPCs, generate heartbeat requests for slurmd, etc. It manages the Partition Manager, Switch Manager, and Job Manager sub-systems.
job_mgr.c
Reads, writes, records, updates, and otherwise manages the state information for all jobs and allocations for jobs.
job_scheduler.c
Determines which pending job(s) should execute next and initiates them.
locks.[ch]
Provides read and write locks for the various slurmctld data structures.
Makefile.am
autoconf input to build a Makefile for this subdirectory.
node_mgr.c
Reads, writes, records, updates, and otherwise manages the state information for all nodes (machines) in the cluster managed by SLURM.
node_scheduler.c
Selects the nodes to be allocated to pending jobs. This makes extensive use of bit maps in representing the nodes. It also considers the locality of nodes to improve communications performance.
pack.c
Pack the slurmctld structures into buffers understood by slurm_protocol.
partition_mgr.c
Reads, writes, records, updates, and otherwise manages the state information associated with partitions in the cluster managed by SLURM.
read_config.c
Read the SLURM configuration file and use it to build node and partition data structures.
slurmctld.h
Defines data structures and functions for all of slurmctld.
step_mgr.c
Reads, writes, records, updates, and otherwise manages the state information for job steps.
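
The use of bit maps to represent nodes, mentioned for node_scheduler.c above, might look roughly like the following sketch. It is not the bitstring.[ch] interface: the type and function names are hypothetical, and the map is limited to 64 nodes so that it fits in a single integer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: one bit per node, limited to 64 nodes for brevity. */
typedef uint64_t node_bitmap_t;

/* Mark a node as a member of the map. */
void node_set(node_bitmap_t *map, unsigned int node)
{
	assert(node < 64);
	*map |= (node_bitmap_t)1 << node;
}

/* Non-zero if the node is a member of the map. */
int node_test(node_bitmap_t map, unsigned int node)
{
	assert(node < 64);
	return (int)((map >> node) & 1);
}

/* Number of nodes in the map (clears the lowest set bit each pass). */
unsigned int node_count(node_bitmap_t map)
{
	unsigned int count = 0;

	while (map) {
		map &= map - 1;
		count++;
	}
	return count;
}

/* Nodes that belong to the partition and are currently idle. */
node_bitmap_t usable_nodes(node_bitmap_t partition, node_bitmap_t idle)
{
	return partition & idle;
}

/* Non-zero if every node in job is also in partition. */
int is_subset(node_bitmap_t job, node_bitmap_t partition)
{
	return (job & ~partition) == 0;
}
```

With this representation, questions such as "which idle nodes may this partition use" reduce to a single bitwise operation regardless of how many nodes are involved.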

slurmd Modules

slurmd executes on each compute node. It initiates and terminates user jobs and monitors both system and job state. The slurmd modules and their functionality are described below.
get_mach_stat.c
Gets the machine's status and configuration in an operating system independent fashion. This configuration information includes the size of real memory, the size of temporary disk storage, and the number of processors.
read_proc.c
Collects job state information including real memory use, virtual memory use, and CPU time use. Although operating system independent code is desirable, this module is not completely portable.
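
One way to gather part of the configuration information listed for get_mach_stat.c is through sysconf, as in the sketch below. This is an assumption about a possible approach, not a description of the actual module: the structure and function names are hypothetical, the temporary disk size is omitted, and _SC_NPROCESSORS_ONLN and _SC_PHYS_PAGES are widely available extensions (for example in glibc on Linux) rather than names required by POSIX.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical structure; the fields mirror the items named in the text. */
typedef struct {
	long processors;	/* processors currently online */
	long real_memory_mb;	/* physical memory in megabytes */
} mach_stat_t;

/* Fill in stat. Returns 0 on success, -1 if a value is unavailable. */
int get_mach_stat(mach_stat_t *stat)
{
	long cpus, pages, page_size;

	assert(stat != NULL);
	cpus = sysconf(_SC_NPROCESSORS_ONLN);
	pages = sysconf(_SC_PHYS_PAGES);
	page_size = sysconf(_SC_PAGESIZE);
	if (cpus < 1 || pages < 1 || page_size < 1)
		return -1;

	stat->processors = cpus;
	/* Use long long so the product cannot overflow a 32-bit long. */
	stat->real_memory_mb =
		(long)(((long long)pages * page_size) / (1024 * 1024));
	return 0;
}
```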

Design Issues

Many of these modules have been built and tested on a variety of Unix computers including Red Hat Linux, IBM's AIX, Sun's Solaris, and Compaq's Tru64. The only module that is currently operating system dependent is slurmd/read_proc.c.

The node selection logic allocates nodes to jobs in the fashion that makes the most sense for a Quadrics switch interconnect. It allocates the smallest collection of consecutive nodes that satisfies the request (e.g. if there are 32 consecutive nodes and 16 consecutive nodes available, a job needing 16 or fewer nodes will be allocated nodes from the 16 node set rather than fragment the 32 node set). If the job cannot be allocated consecutive nodes, it will be allocated the smallest number of consecutive sets (e.g. if there are sets of available consecutive nodes of sizes 6, 4, 3, 3, 2, 1, and 1, then a request for 10 nodes will always be allocated the 6 and 4 node sets rather than use the smaller sets). These techniques minimize the job's communications overhead. A job can use hardware broadcast mechanisms given consecutive nodes. Without consecutive nodes, much slower software broadcast mechanisms must be used.
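
The selection policy described above can be sketched as follows. This is an illustrative reading of the two rules, not the code in node_scheduler.c: the function name and the array-based interface are hypothetical, and each set of consecutive available nodes is reduced to its size. When no single set is large enough, the sketch takes the largest remaining set at each step and finishes with the smallest set that covers what is still needed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch. sizes[i] is the length of the i-th run of
 * consecutive available nodes. On success, chosen[i] is 1 for each run
 * selected and the number of runs used is returned. Returns -1 if the
 * request cannot be satisfied; chosen is then unspecified.
 */
int select_runs(const int *sizes, size_t nruns, int request, int *chosen)
{
	int remaining = request;
	int used = 0;

	assert(sizes != NULL && chosen != NULL && request > 0);
	for (size_t i = 0; i < nruns; i++)
		chosen[i] = 0;

	while (remaining > 0) {
		int best_fit = -1;	/* smallest unused run covering the rest */
		int largest = -1;	/* largest unused run */

		for (size_t i = 0; i < nruns; i++) {
			if (chosen[i] || sizes[i] <= 0)
				continue;
			if (sizes[i] >= remaining &&
			    (best_fit < 0 || sizes[i] < sizes[best_fit]))
				best_fit = (int)i;
			if (largest < 0 || sizes[i] > sizes[largest])
				largest = (int)i;
		}
		if (largest < 0)
			return -1;	/* ran out of available runs */

		/* Prefer a run that finishes the request; else take the largest. */
		int pick = (best_fit >= 0) ? best_fit : largest;
		chosen[pick] = 1;
		remaining -= sizes[pick];
		used++;
	}
	return used;
}
```

Applied to the two examples in the text, a request for 16 nodes with runs of 32 and 16 selects only the 16 node run, and a request for 10 nodes with runs of 6, 4, 3, 3, 2, 1, and 1 selects the 6 and 4 node runs.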


URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html

Last Modified July 30, 2002

Maintained by slurm-dev@lists.llnl.gov