SLURM Programmer's Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters containing
thousands of nodes. Components include machine status,
job management, and scheduling modules. The design also
includes a scalable, general-purpose communication infrastructure
(MONGO, to be described elsewhere).
SLURM requires no kernel modifications for its operation and is
relatively self-contained.
Initial target platforms include Red Hat Linux clusters with
Quadrics interconnect and the IBM Blue Gene product line.
An overview of the components and their interactions is available
in a separate document, SLURM: Simple Linux Utility
for Resource Management.
Code should adhere to the Linux kernel code style as described in
http://www.linuxhq.com/kernel/v2.4/doc/CodingStyle.html.
Directory Structure
The contents of the SLURM directory structure will be described below in
increasing detail as the structure is descended. The top level directory
contains the scripts and tools required to build the entire SLURM system.
It also contains a variety of subdirectories for each type of file.
General build tools/files include: autogen.sh, configure.ac, Makefile.am
and the contents of the auxdir directory.
autoconf and make are used to build and install
SLURM in an automated fashion. NOTE: autoconf version 2.52
or higher is required to build SLURM. Execute "autoconf -V" to check
your version number. The build process may be as simple as executing
a sequence of three commands:
./autogen.sh
./configure
make
Copyright and disclaimer information are in the files COPYING and DISCLAIMER.
Documentation, including man pages, is in the subdirectory doc.
Sample configuration files are in the etc subdirectory.
All source code and header files are in the directory src.
DejaGnu is used as a testing framework and all of its files are in the
testsuite directory.
Documentation
All of the documentation is in the subdirectory doc.
Man pages for both the commands and APIs are in doc/man.
Various documents suitable for public consumption are in doc/html.
An overall SLURM design document including various figures is in doc/pubdesign.
Various design documents (many of which are dated) can be found in
doc/slides and doc/txt.
A survey of available resource managers initiated at the start of
the SLURM project is in doc/survey.
Source code
Functions are divided into several categories, each in its own
subdirectory. The details of each directory's contents are provided
below. The directories are as follows:
- api
- Application Program Interfaces into the SLURM code.
Used to send and get SLURM information from the central manager.
- common
- General purpose functions for widespread use.
- popt
- General purpose parsing tools.
- scancel
- User command to cancel a job or job step.
- scontrol
- Administrator tool to manage SLURM.
- slurmctld
- SLURM central manager code.
- slurmd
- SLURM code to manage the compute server nodes including the
execution of user applications.
- squeue
- User command to get information on SLURM jobs and allocations.
- srun
- User command to submit a job, get an allocation, and/or initiate
a parallel job step.
- test
- Functions for testing individual SLURM modules. These tests are
not under the DejaGnu framework.
API Modules
This directory contains modules supporting the SLURM API functions.
The APIs to get SLURM information accept a time-stamp. If the data
has not changed since the specified time, a return code will indicate
this and return no other data. Otherwise a data structure is returned
including its time-stamp, element count, and an array of structures
describing the state of each node, job, partition, etc.
Each of these functions also includes a corresponding function to
release all storage associated with the data structure.
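The time-stamp convention described above can be sketched as follows. This is only an illustration of the calling pattern, not the SLURM API: the names load_node_info, free_node_info, struct node_info_msg, and the INFO_* return codes are hypothetical, and a small static table stands in for the state held by the central manager.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical return codes; the real SLURM codes may differ. */
#define INFO_OK		0
#define INFO_NO_CHANGE	1
#define INFO_ERROR	-1

/* Hypothetical per-node record. */
struct node_info {
	char name[16];
	int cpus;
};

/* Hypothetical reply: time-stamp, element count, array of records. */
struct node_info_msg {
	time_t last_update;
	size_t record_count;
	struct node_info *node_array;
};

/* Stand-in for state held by the central manager. */
static const struct node_info demo_nodes[] = {
	{ "lx001", 2 },
	{ "lx002", 2 }
};
static const time_t demo_last_update = 1000;

/*
 * Return INFO_NO_CHANGE and no data if nothing has changed since
 * update_time; otherwise allocate and return a copy of the data.
 */
int load_node_info(time_t update_time, struct node_info_msg **resp)
{
	struct node_info_msg *msg;

	*resp = NULL;
	if (update_time >= demo_last_update)
		return INFO_NO_CHANGE;

	msg = malloc(sizeof(*msg));
	if (!msg)
		return INFO_ERROR;
	msg->node_array = malloc(sizeof(demo_nodes));
	if (!msg->node_array) {
		free(msg);
		return INFO_ERROR;
	}
	memcpy(msg->node_array, demo_nodes, sizeof(demo_nodes));
	msg->record_count = sizeof(demo_nodes) / sizeof(demo_nodes[0]);
	msg->last_update = demo_last_update;
	*resp = msg;
	return INFO_OK;
}

/* Release all storage associated with a reply. */
void free_node_info(struct node_info_msg *msg)
{
	if (msg) {
		free(msg->node_array);
		free(msg);
	}
}
```

A caller would pass 0 on the first request, remember last_update from the reply, and pass that value on later requests so that unchanged data is not transferred again.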
- allocate.c
- Allocates resources for a job's initiation.
This creates a job entry and allocates resources to it.
The resources can be claimed at a later time to actually
run a parallel job. If the requested resources are not
currently available, the request will fail.
- allocate.c
- Allocate resources for a job. The allocation request may
result in the immediate execution of a job step, the immediate
allocation of resources for future job steps, or the queuing
of the allocation request, depending upon the parameters used.
- cancel.c
- Cancels (i.e. terminates) a running or pending job or job step.
- complete.c
- Note the completion of a running job or job step.
- config_info.c
- Reports SLURM configuration parameter values.
- job_info.c
- Reports job state information.
- Makefile.am
- Information used by autoconf to build a Makefile for the api
subdirectory.
- node_info.c
- Reports node state and configuration values.
- partition_info.c
- Reports partition state and configuration values.
- reconfigure.c
- Requests that slurmctld reload configuration information.
Also includes the API to request slurmctld shutdown.
- submit.c
- Submits a job to SLURM. The job will be queued
for initiation when resources are available.
- update_config.c
- Updates job, node or partition state information.
Future components to include: job step support (a set of parallel
tasks associated with a job or allocation; multiple job steps may
execute in serial or parallel within an allocation),
issuing keys, getting Elan (Quadrics
interconnect) capabilities, and resource accounting.
Common Modules
This directory contains modules of general use throughout the SLURM code.
The modules are described below.
- bitstring.[ch]
- A collection of general purpose functions for managing bitmaps.
We use these for rapid node management functions including: scheduling
and associating partitions and jobs with the nodes.
- hostlist.[ch]
- Tools which accept a range expression for a host list (e.g.
"lx[123-456,777]") and provide the individual node names in several forms.
- list.[ch]
- A general purpose list manager.
One can define a list, add and delete entries, search for entries, etc.
- log.[ch]
- A general purpose log manager. It can filter log messages
based upon severity and route them to stderr, syslog, or a log file.
- macros.h
- General purpose SLURM macro definitions.
- Makefile.am
- autoconf input to build a Makefile for this subdirectory.
- pack.[ch]
- Functions for packing and unpacking unsigned integers and strings
for transmission over the network. The unsigned integers are translated
to/from machine independent form. Strings are transmitted with a length
value.
- parse_spec.[ch]
- Parser functions for translating the configuration file or input to scontrol.
- qsw.[ch]
- Functions for interacting with the Quadrics interconnect.
- qsw.h
- Definitions for qsw.c and documentation for its functions.
- safeopen.[ch]
- Functions for opening files with simple sanity checks on the file.
- slurm_errno.h
- Slurmd specific error codes.
- slurm_protocol_api.[ch]
- TBD
- slurm_protocol_common.h
- TBD
- slurm_protocol_defs.[ch]
- TBD
- slurm_protocol_errno.[ch]
- General SLURM error functions and codes.
- slurm_protocol_implementation.c
- TBD
- slurm_protocol_mongo_common.h
- TBD
- slurm_protocol_pack.[ch]
- Functions to pack a variety of RPC specific data structures.
- slurm_protocol_socket_common.h
- TBD
- slurm_protocol_socket_implementation.c
- Socket-based communications protocol functions.
- slurm_protocol_util.[ch]
- TBD
- slurm_return_codes.h
- TBD
- strlcpy.[ch]
- String copy function with input/output length information.
- util_signals.[ch]
- TBD
- xassert.[ch]
- Assert function with configurable handling.
- xerrno.[ch]
- Quadrics Elan error management functions.
- xmalloc.[ch]
- "Safe" memory management functions. Includes a magic cookie to ensure
that freed memory was in fact allocated by its functions.
- xstring.[ch]
- A collection of functions for string manipulations with automatic expansion
of allocated memory on an as needed basis.
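The behavior described for pack.[ch] above (unsigned integers in machine independent form, strings sent with a length value) can be sketched as follows. The function names are hypothetical and do not come from pack.h, and the use of network byte order with a 32-bit length prefix is an assumption about what "machine independent form" means here.

```c
#include <arpa/inet.h>	/* htonl(), ntohl() */
#include <stdint.h>
#include <string.h>

/*
 * Write a 32-bit unsigned integer in network byte order.
 * The caller must ensure buf has room. Returns bytes written.
 */
size_t pack_u32(uint32_t val, unsigned char *buf)
{
	uint32_t net = htonl(val);

	memcpy(buf, &net, sizeof(net));
	return sizeof(net);
}

/* Read a 32-bit unsigned integer. Returns bytes consumed. */
size_t unpack_u32(uint32_t *val, const unsigned char *buf)
{
	uint32_t net;

	memcpy(&net, buf, sizeof(net));
	*val = ntohl(net);
	return sizeof(net);
}

/*
 * Write a length prefix followed by the string bytes (no trailing NUL).
 * The caller must ensure buf has room. Returns bytes written.
 */
size_t pack_str(const char *str, unsigned char *buf)
{
	uint32_t len = (uint32_t)strlen(str);
	size_t off = pack_u32(len, buf);

	memcpy(buf + off, str, len);
	return off + len;
}

/*
 * Read a length-prefixed string into out (capacity out_size) and
 * NUL-terminate it. Returns bytes consumed, or 0 if out is too small.
 */
size_t unpack_str(char *out, size_t out_size, const unsigned char *buf)
{
	uint32_t len;
	size_t off = unpack_u32(&len, buf);

	if ((size_t)len + 1 > out_size)
		return 0;
	memcpy(out, buf + off, len);
	out[len] = '\0';
	return off + len;
}
```

Because each function returns the number of bytes it handled, a sender can chain calls by advancing an offset into one buffer, and the receiver can walk the same buffer in the same order.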
scancel Modules
scancel is a command to cancel running or pending jobs or job steps.
- Makefile.am
- autoconf input to build a Makefile for this subdirectory.
- scancel.c
- A command line interface to cancel jobs or job steps.
scontrol Modules
scontrol is the administrator tool for monitoring and modifying SLURM configuration
and state. It has a command line interface only.
- Makefile.am
- autoconf input to build a Makefile for this subdirectory.
- scontrol.c
- A command line interface to slurmctld.
slurmctld Modules
slurmctld executes on the control machine and orchestrates SLURM activities
across the entire cluster including monitoring node and partition state,
scheduling, job queue management, job dispatching, and switch management.
The slurmctld modules and their functionality are described below.
- controller.c
- Primary SLURM daemon to execute on control machine.
It has several threads to handle signals, incoming RPCs, generate heartbeat
requests for slurmd, etc. It manages the Partition Manager, Switch Manager,
and Job Manager sub-systems.
- job_mgr.c
- Reads, writes, records, updates, and otherwise
manages the state information for all jobs and allocations
for jobs.
- job_scheduler.c
- Determines which pending job(s) should execute next and initiates them.
- locks.[ch]
- Provides read and write locks for the various slurmctld data structures.
- Makefile.am
- autoconf input to build a Makefile for this subdirectory.
- node_mgr.c
- Reads, writes, records, updates, and otherwise
manages the state information for all nodes (machines) in the
cluster managed by SLURM.
- node_scheduler.c
- Selects the nodes to be allocated to pending jobs. This makes extensive use
of bit maps in representing the nodes. It also considers the locality of nodes
to improve communications performance.
- pack.c
- Pack the slurmctld structures into buffers understood by slurm_protocol.
- partition_mgr.c
- Reads, writes, records, updates, and otherwise
manages the state information associated with partitions in the
cluster managed by SLURM.
- read_config.c
- Read the SLURM configuration file and use it to build node and
partition data structures.
- slurmctld.h
- Defines data structures and functions for all of slurmctld.
- step_mgr.c
- Reads, writes, records, updates, and otherwise
manages the state information for job steps.
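The use of bit maps to represent nodes, mentioned for node_scheduler.c above, can be illustrated with a small sketch. It is not taken from bitstring.[ch] or node_scheduler.c: the type struct node_bitmap, the functions below, and the fixed MAX_NODES limit are all hypothetical. The point is that finding the nodes which are both in a partition and idle reduces to a word-by-word AND.

```c
#include <limits.h>
#include <stddef.h>

#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define MAX_NODES	128
#define NWORDS		((MAX_NODES + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical node bitmap: bit i set means node i is a member. */
struct node_bitmap {
	unsigned long word[NWORDS];
};

/* Mark a node as a member; out-of-range node numbers are ignored. */
void bitmap_set(struct node_bitmap *b, size_t node)
{
	if (node < MAX_NODES)
		b->word[node / WORD_BITS] |= 1UL << (node % WORD_BITS);
}

/* Return 1 if the node is a member, otherwise 0. */
int bitmap_test(const struct node_bitmap *b, size_t node)
{
	if (node >= MAX_NODES)
		return 0;
	return (int)((b->word[node / WORD_BITS] >> (node % WORD_BITS)) & 1UL);
}

/* out = a AND b, e.g. nodes in a partition that are also idle. */
void bitmap_and(struct node_bitmap *out, const struct node_bitmap *a,
		const struct node_bitmap *b)
{
	size_t i;

	for (i = 0; i < NWORDS; i++)
		out->word[i] = a->word[i] & b->word[i];
}

/* Number of nodes set in the bitmap. */
size_t bitmap_count(const struct node_bitmap *b)
{
	size_t i, count = 0;

	for (i = 0; i < NWORDS; i++) {
		unsigned long w = b->word[i];

		while (w) {
			w &= w - 1;	/* clear the lowest set bit */
			count++;
		}
	}
	return count;
}
```

With one bitmap per partition and one for idle nodes, a scheduler can test whether a partition has enough free nodes for a request by combining the two and counting the result.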
slurmd Modules
slurmd executes on each compute node. It initiates and terminates user
jobs and monitors both system and job state. The slurmd modules and their
functionality are described below.
- get_mach_stat.c
- Gets the machine's status and configuration in an operating system
independent fashion.
This configuration information includes: size of real memory,
size of temporary disk storage, and the number of processors.
- read_proc.c
- Collects job state information including real memory use, virtual
memory use, and CPU time use.
While it is desirable to maintain operating system independent code, this
module is not completely portable.
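One way to gather the values listed for get_mach_stat.c (processor count, real memory size, temporary disk size) without reading operating system specific files is through sysconf() and statvfs(). The sketch below is an assumption about a possible approach, not a description of the actual module: struct mach_stat and read_mach_stat are hypothetical names, and _SC_NPROCESSORS_ONLN and _SC_PHYS_PAGES are widely supported extensions rather than POSIX requirements.

```c
#include <sys/statvfs.h>
#include <unistd.h>

/* Hypothetical container for the values of interest. */
struct mach_stat {
	long cpus;			/* processors online */
	unsigned long real_mem_mb;	/* real memory, in megabytes */
	unsigned long tmp_disk_mb;	/* size of tmp_path file system, MB */
};

/*
 * Fill in *st. tmp_path names a directory on the temporary file system.
 * Returns 0 on success, -1 if any value is unavailable.
 */
int read_mach_stat(struct mach_stat *st, const char *tmp_path)
{
	long pages, page_size;
	struct statvfs fs;

	st->cpus = sysconf(_SC_NPROCESSORS_ONLN);
	pages = sysconf(_SC_PHYS_PAGES);
	page_size = sysconf(_SC_PAGESIZE);
	if (st->cpus < 1 || pages < 1 || page_size < 1)
		return -1;

	/* Widen before multiplying so 32-bit longs do not overflow. */
	st->real_mem_mb = (unsigned long)
		(((unsigned long long)pages *
		  (unsigned long long)page_size) >> 20);

	if (statvfs(tmp_path, &fs) != 0)
		return -1;
	st->tmp_disk_mb = (unsigned long)
		(((unsigned long long)fs.f_blocks *
		  (unsigned long long)fs.f_frsize) >> 20);
	return 0;
}
```

A call such as read_mach_stat(&st, "/tmp") would report on the file system holding /tmp; the choice of path is left to the caller because the location of temporary storage is site specific.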
Design Issues
Many of these modules have been built and tested on a variety of
Unix computers including Red Hat Linux, IBM's AIX, Sun's Solaris,
and Compaq's Tru64. The only module at this time which is operating
system dependent is slurmd/read_proc.c.
The node selection logic allocates nodes to jobs in a fashion which
makes most sense for a Quadrics switch interconnect. It allocates
the smallest collection of consecutive nodes that satisfies the
request (e.g. if there are 32 consecutive nodes and 16 consecutive
nodes available, a job needing 16 or fewer nodes will be allocated
those nodes from the 16 node set rather than fragment the 32 node
set). If the job cannot be allocated consecutive nodes, it will
be allocated the smallest number of consecutive sets (e.g. if there
are sets of available consecutive nodes of sizes 6, 4, 3, 3, 2, 1,
and 1 then a request for 10 nodes will always be allocated the 6
and 4 node sets rather than use the smaller sets).
These techniques minimize the job communications overhead.
A job can use hardware broadcast mechanisms given consecutive nodes.
Without consecutive nodes, much slower software broadcast mechanisms
must be used.
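The selection policy described above can be sketched at the level of whole consecutive sets. This is an illustration under stated assumptions, not the code in node_scheduler.c: the function select_node_sets and its interface are hypothetical, the input is simply the size of each available set of consecutive nodes, and the sketch selects entire sets without trimming the last one to the exact node count.

```c
#include <stddef.h>

/*
 * sizes[i] is the size of the i-th available set of consecutive nodes.
 * On return chosen[i] is 1 for each selected set, otherwise 0.
 * Returns the number of sets selected, or 0 if the request for
 * want nodes cannot be satisfied.
 */
size_t select_node_sets(const int *sizes, size_t n, int want, int *chosen)
{
	size_t i, best = n, used = 0;
	int remaining = want;

	for (i = 0; i < n; i++)
		chosen[i] = 0;
	if (want <= 0)
		return 0;

	/* Best fit: the smallest single set that is large enough. */
	for (i = 0; i < n; i++) {
		if (sizes[i] >= want &&
		    (best == n || sizes[i] < sizes[best]))
			best = i;
	}
	if (best != n) {
		chosen[best] = 1;
		return 1;
	}

	/* Otherwise take the largest unused set until satisfied. */
	while (remaining > 0) {
		size_t largest = n;

		for (i = 0; i < n; i++) {
			if (!chosen[i] &&
			    (largest == n || sizes[i] > sizes[largest]))
				largest = i;
		}
		if (largest == n || sizes[largest] <= 0)
			break;	/* nothing left to allocate */
		chosen[largest] = 1;
		remaining -= sizes[largest];
		used++;
	}
	if (remaining > 0) {
		for (i = 0; i < n; i++)
			chosen[i] = 0;
		return 0;
	}
	return used;
}
```

For the two examples in the text, sets of 32 and 16 with a request for 16 select only the 16 node set, and sets of 6, 4, 3, 3, 2, 1, and 1 with a request for 10 select the 6 and 4 node sets. Taking the largest remaining set first is what keeps the number of sets, and so the fragmentation of the job, to a minimum.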
URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html
Last Modified July 30, 2002
Maintained by
slurm-dev@lists.llnl.gov