SLURM Programmer's Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters large and small.
Components include machine status, partition
management, job management, scheduling and stream copy modules.
SLURM requires no kernel modifications for its operation and is
relatively self-contained.
An overview of the components and their interactions is available
in a separate document, SLURM: Simple Linux Utility for Resource Management
[PDF] [PS].
SLURM is written in the C language and uses a GNU autoconf
configuration engine.
Although SLURM was initially written for Linux, other UNIX-like operating
systems should be easy porting targets.
Code should adhere to the
Linux kernel coding style.
Many of these modules have been built and tested on a variety of
Unix computers including Red Hat Linux, IBM's AIX, Sun's Solaris,
and Compaq's Tru64. At this time, the only module that is operating
system dependent is src/slurmd/read_proc.c.
We will be porting and testing on additional platforms in future releases.
Plugins
In order to make the use of different infrastructures possible,
SLURM uses a general purpose plugin mechanism.
A SLURM plugin is a dynamically linked code object which is
loaded explicitly at run time by the SLURM libraries.
It provides a customized implementation of a well-defined
API connected to tasks such as authentication, interconnect fabric,
task scheduling, etc.
A set of functions is defined for use by all of the different
infrastructures of a particular variety.
When a SLURM daemon is initiated, it reads the configuration
file to determine which of the available plugins should be used.
For details, see plugins.html and
authplugins.html.
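The idea of one fixed set of functions per plugin class, with the implementation chosen from the configuration file, can be sketched in C as follows. This is only an illustration and not SLURM's actual plugin code: the struct auth_ops layout, the function names, and the "auth/demo" type are hypothetical. The "AuthType=" key is assumed here as the configuration setting that names the plugin. To stay self-contained, the sketch uses a static table rather than loading shared objects at run time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operations table: in this sketch every "auth" plugin
 * supplies the same two entry points. Names are illustrative only. */
struct auth_ops {
    const char *type;                      /* plugin type, e.g. "auth/none" */
    int (*create)(char *buf, size_t len);  /* write a credential into buf */
    int (*verify)(const char *cred);       /* 0 if accepted, -1 otherwise */
};

/* "auth/none": produces an empty credential and accepts anything. */
static int none_create(char *buf, size_t len)
{
    if (len == 0)
        return -1;
    buf[0] = '\0';
    return 0;
}

static int none_verify(const char *cred)
{
    (void)cred;  /* unused: every credential is accepted */
    return 0;
}

/* "auth/demo" (hypothetical): accepts only the fixed string "demo:ok". */
static int demo_create(char *buf, size_t len)
{
    static const char tag[] = "demo:ok";
    if (len < sizeof(tag))
        return -1;
    memcpy(buf, tag, sizeof(tag));  /* copies the terminating NUL too */
    return 0;
}

static int demo_verify(const char *cred)
{
    return (cred != NULL && strcmp(cred, "demo:ok") == 0) ? 0 : -1;
}

/* Stand-in for the set of available plugins. A real plugin mechanism
 * would locate and load shared objects instead of using a fixed table. */
static const struct auth_ops auth_table[] = {
    { "auth/none", none_create, none_verify },
    { "auth/demo", demo_create, demo_verify },
};

/* Return the operations table for the named type, or NULL if unknown. */
const struct auth_ops *auth_select(const char *type)
{
    size_t i;

    if (type == NULL)
        return NULL;
    for (i = 0; i < sizeof(auth_table) / sizeof(auth_table[0]); i++) {
        if (strcmp(auth_table[i].type, type) == 0)
            return &auth_table[i];
    }
    return NULL;
}

/* Select a plugin from one configuration line of the assumed form
 * "AuthType=<type>" with no surrounding whitespace. Returns NULL if the
 * line has a different key or names an unknown type. */
const struct auth_ops *auth_from_config_line(const char *line)
{
    static const char key[] = "AuthType=";
    size_t klen = sizeof(key) - 1;

    if (line == NULL || strncmp(line, key, klen) != 0)
        return NULL;
    return auth_select(line + klen);
}
```

A caller would pass a configuration line to auth_from_config_line() and then invoke create and verify through the returned pointer, without knowing which implementation was selected. That indirection is the point of defining one function set per plugin class.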
Our intent is to make fuller use of the plugin mechanism in the future.
Work is underway to support scheduling through a plugin, with the
Maui Scheduler
and FIFO plugin modules initially available.
Work is also underway to support additional interconnects via a plugin
with support for Myrinet being added to the currently supported Quadrics
Elan3 and TCP/IP communications.
Directory Structure
The contents of the SLURM directory structure will be described below in
increasing detail as the structure is descended. The top level directory
contains the scripts and tools required to build the entire SLURM system.
It also contains a variety of subdirectories for each type of file.
General build tools/files include: acinclude.m4, autogen.sh,
configure.ac, Makefile.am, Make-rpm.mk, META,
README, slurm.spec.in, and the contents of the auxdir
directory.
autoconf and make commands are used to build and install
SLURM in an automated fashion. NOTE: autoconf version 2.52
or higher is required to build SLURM. Execute "autoconf -V" to check
your version number. The build process is described in the README
file and may be as simple as executing a sequence of three commands:
./autogen.sh
./configure [OPTIONS]
make
Copyright and disclaimer information are in the files COPYING and DISCLAIMER.
All of the top-level subdirectories are described below.
- auxdir
- Used for building SLURM.
- doc
- Documentation including man pages.
- etc
- Sample configuration files.
- slurm
- Header files for API use. These files must be installed.
Placing these header files in this location makes for better code portability.
- src
- Contains all source code and header files not in the "slurm" subdirectory
described above.
- testsuite
- DejaGnu is used as a testing framework and all of its files are here.
Documentation
All of the documentation is in the subdirectory doc.
Man pages for the APIs, configuration file, commands, and daemons
are in doc/man.
Various documents suitable for public consumption are in doc/html.
Overall SLURM design documents including various figures are in doc/pubdesign.
Various design documents (many of which are dated) can be found in
doc/slides and doc/txt.
A survey of available resource managers as of 2001 is in
doc/survey.
Source Code
Functions are divided into several categories, each in its own
subdirectory. The details of each directory's contents are provided
below. The directories are as follows:
- api
- Application Program Interfaces into the SLURM code.
Used to send SLURM information to, and get it from, the central manager.
These are the functions user applications might utilize.
- common
- General purpose functions for widespread use throughout SLURM.
- plugins
- Plugin functions for various infrastructures.
A separate subdirectory is used for each plugin class: auth
for user authentication, prio for job prioritization, etc.
- popt
- Command line option parsing tools from Red Hat Software, Inc.
- scancel
- User command to cancel (or signal) a job or job step.
- scontrol
- Administrator tool to manage SLURM.
- sinfo
- User command to get information on SLURM nodes and partitions.
- slurmctld
- SLURM central manager daemon code.
- slurmd
- SLURM daemon code to manage the compute server nodes including the
execution of user applications.
- squeue
- User command to get information on SLURM jobs and job steps.
- srun
- User command to submit a job, get an allocation, and/or initiate
a parallel job step.
Configuration
Several configuration files are included in the etc subdirectory.
slurm.conf.example includes a description of all configuration
options and default settings. See doc/man/man5/slurm.conf.5 for
more details.
init.d.slurm is a script that determines which SLURM daemon(s)
should execute on any node based upon the configuration file contents.
This can be used as part of a daemon startup/shutdown mechanism.
Test Suite
The test suite uses the DejaGnu framework for testing.
Some of these tests directly test modules in the daemons.
Other tests are more general and exercise API functionality.
Be aware that some of these tests are dated and some no longer function.
We also have a set of Expect SLURM tests available as a separate
distribution. These tests are executed after SLURM has been installed
and the daemons initiated. About 100 test scripts exercise all SLURM
commands and options including stress tests.
URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html
Last Modified July 4, 2003
Maintained by
slurm-dev@lists.llnl.gov