SLURM Programmer's Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters of
thousands of nodes. Components include machine status, partition
management, job management, and scheduling modules. The design also
includes a scalable, general-purpose communication infrastructure.
SLURM requires no kernel modifications and is relatively self-contained.
Components
The Job Initiator (JI) is the tool used by the customer to initiate
a job. The job initiator can execute on any computer in the cluster. Its
request is sent to the controller executing on the control machine.
The controller orchestrates all SLURM activities including: accepting the
job initiation request, allocating nodes to the job, enforcing partition
constraints, enforcing job limits, and general record keeping. The three
primary components (threads) of the controller are the Partition Manager (PM),
Node Manager (NM), and Job Manager (JM). The partition manager
keeps track of partition state and constraints. The node manager keeps track
of node state and configuration. The job manager keeps track of job state
and enforces its limits. Since all of these functions are critical to the
overall SLURM operation, a backup controller assumes these responsibilities
in the event of control machine failure.
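As a rough illustration of this structure, the controller can be thought of as a daemon that starts the three manager threads and then waits on them. The sketch below uses POSIX threads; the entry-point names Partition_Mgr, Node_Mgr, and Job_Mgr are placeholders for illustration, not the actual functions in Controller.c.

    /* Illustrative sketch of the controller's thread structure; the actual
     * entry points in Controller.c have different names and arguments. */
    #include <pthread.h>
    #include <stdio.h>

    static void *Partition_Mgr(void *arg) { /* track partition state and constraints */ return NULL; }
    static void *Node_Mgr(void *arg)      { /* track node state and configuration    */ return NULL; }
    static void *Job_Mgr(void *arg)       { /* track job state and enforce limits    */ return NULL; }

    int main(void) {
        pthread_t pm, nm, jm;

        if (pthread_create(&pm, NULL, Partition_Mgr, NULL) ||
            pthread_create(&nm, NULL, Node_Mgr, NULL) ||
            pthread_create(&jm, NULL, Job_Mgr, NULL)) {
            fprintf(stderr, "Controller: unable to start manager threads\n");
            return 1;
        }

        /* The real controller also accepts job initiation requests here. */
        pthread_join(pm, NULL);
        pthread_join(nm, NULL);
        pthread_join(jm, NULL);
        return 0;
    }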
The final component of interest is the Job Shepherd (JS). The
job shepherd executes on each of the compute server nodes and initiates
the job's tasks. It allocates switch resources, monitors job state and
resource utilization, and delivers signals to the job's processes as needed.
Figure 1: SLURM components
Code Modules
- Controller.c
- Primary SLURM daemon to execute on control machine.
It manages the Partition Manager, Node Manager, and Job Manager threads.
- Get_Mach_Stat.c
- Module gets the machine's status and configuration.
This includes the operating system version, size of real memory, size
of virtual memory, size of /tmp disk storage, number of processors,
and processor speed. This is a module of the Job Shepherd component
(a sketch of this kind of status gathering appears after this list).
- list.c
- Module is a general purpose list manager. One can define a
list, add and delete entries, search for entries, etc. This module
is used by multiple SLURM components (a sketch of these list operations
appears after this list).
- list.h
- Module contains definitions for list.c and documentation for its functions.
- Mach_Stat_Mgr.c
- Module reads, writes, records, updates, and otherwise
manages the state information for all nodes (machines) in the
cluster managed by SLURM. This module performs much of the Node Manager
component functionality.
- Partition_Mgr.c
- Module reads, writes, records, updates, and otherwise
manages the state information associated with partitions in the
cluster managed by SLURM. This module is the Partition Manager component.
- Read_Config.c
- Module reads overall SLURM configuration file.
- Slurm_Admin.c
- Administration tool for reading, writing, and updating SLURM configuration.
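As an illustration of the status gathering performed by Get_Mach_Stat.c, the sketch below collects a few of the listed items with common Unix calls (uname, sysconf, statvfs). It is a simplified example rather than the module's actual code, and the sysconf queries shown are widely but not universally supported.

    /* Simplified example of gathering node status with common Unix calls;
     * not the actual contents of Get_Mach_Stat.c. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/utsname.h>
    #include <sys/statvfs.h>

    int main(void) {
        struct utsname un;
        struct statvfs tmp_fs;
        long cpus       = sysconf(_SC_NPROCESSORS_ONLN);
        long page_size  = sysconf(_SC_PAGESIZE);
        long phys_pages = sysconf(_SC_PHYS_PAGES);

        if (uname(&un) == 0)
            printf("OS: %s %s\n", un.sysname, un.release);
        printf("Processors: %ld\n", cpus);
        printf("Real memory: %ld MB\n", phys_pages / 1024 * page_size / 1024);
        if (statvfs("/tmp", &tmp_fs) == 0)
            printf("/tmp space: %lu MB\n",
                   (unsigned long)(tmp_fs.f_bavail / 1024 * tmp_fs.f_frsize / 1024));
        return 0;
    }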
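The list operations described for list.c can be pictured with a self-contained sketch like the one below; the real interface is documented in list.h and differs in names and details.

    /* Self-contained sketch of the kind of list operations described above;
     * the actual interface is defined in list.h and differs in detail. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct entry { char name[64]; struct entry *next; };

    static struct entry *list_head = NULL;

    /* Add an entry to the front of the list. */
    static void add_entry(const char *name) {
        struct entry *e = malloc(sizeof(*e));
        strncpy(e->name, name, sizeof(e->name) - 1);
        e->name[sizeof(e->name) - 1] = '\0';
        e->next = list_head;
        list_head = e;
    }

    /* Search the list for a matching entry. */
    static struct entry *find_entry(const char *name) {
        struct entry *e;
        for (e = list_head; e; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e;
        return NULL;
    }

    int main(void) {
        add_entry("linux01");
        add_entry("linux02");
        printf("linux01 %s\n", find_entry("linux01") ? "found" : "not found");
        return 0;
    }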
Design Issues
Most modules are constructed with some simple, built-in tests.
To run them, set the declarations of DEBUG_MODULE and DEBUG_SYSTEM
both to 1 near the top of the module's code, then compile and run the test.
Required input scripts and configuration files for these tests
will be kept in the "etc" subdirectory and the commands to execute
the tests are in the "Makefile". In some cases, the module under test
must be linked with other supporting modules. In those cases, the support
modules should be built with DEBUG_MODULE set to 0 and DEBUG_SYSTEM set to 1.
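The convention looks roughly like the sketch below; the definitions and the test itself are illustrative, not copied from any particular module.

    /* Sketch of the built-in test convention; not copied from any one module. */
    #define DEBUG_SYSTEM 1   /* enable debugging support throughout the module */
    #define DEBUG_MODULE 1   /* build this module's stand-alone test driver    */

    #include <stdio.h>

    /* ... normal module code ... */
    static int Do_Work(int value) {
        return value * 2;
    }

    #if DEBUG_MODULE
    /* Stand-alone test, compiled only when DEBUG_MODULE is 1.  Support
     * modules linked into the test are built with DEBUG_MODULE set to 0
     * so that only one main() is present. */
    int main(void) {
        if (Do_Work(2) != 4) {
            printf("Do_Work test FAILED\n");
            return 1;
        }
        printf("Do_Work test passed\n");
        return 0;
    }
    #endif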
Many of these modules have been built and tested on a variety of
Unix computers, including Red Hat Linux, IBM's AIX, Sun's Solaris,
and Compaq's Tru64. The only module that is operating system dependent
at this time is Get_Mach_Stat.c.
We have tried to develop the SLURM code to be quite general and
flexible, but compromises were made in several areas for the sake of
simplicity and ease of support. For example, entire nodes are dedicated to user
applications. Our customers at LLNL have expressed the opinion that sharing of
nodes can severely reduce their jobs' performance and even reliability.
This is due to contention for shared resources such as local disk space,
real memory, virtual memory and processor cycles. The proper support of
shared resources, including the enforcement of limits on these resources,
entails a substantial amount of additional effort. Given this cost-to-benefit
trade-off at LLNL, we have decided not to support shared nodes.
However, SLURM is designed so as not to preclude the addition of
such a capability at a later time if so desired.
To Do
- We need to build up a reasonable Makefile.
- The node selection process for contiguous nodes in Controller.c selects
the nodes on a best-fit basis. If there is no contiguous set of nodes, it
just selects nodes sequentially from a list. Other options are to allocate
the job in the minimum number of contiguous sets or to gather up all the
loose nodes (a sketch of the best-fit pass appears after this list).
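The best-fit pass mentioned above can be pictured roughly as follows; the idle/busy array and node indices are invented for illustration, and Controller.c organizes its node table differently.

    /* Illustrative best-fit scan over an idle/busy map; the node table and
     * selection logic in Controller.c are organized differently. */
    #include <stdio.h>

    /* Return the index of the start of the smallest contiguous run of idle
     * nodes able to hold "want" nodes, or -1 if no such run exists. */
    static int best_fit(const int idle[], int node_cnt, int want) {
        int best_start = -1, best_len = 0;
        int i = 0;
        while (i < node_cnt) {
            if (idle[i]) {
                int start = i, len = 0;
                while (i < node_cnt && idle[i]) { len++; i++; }
                if (len >= want && (best_start < 0 || len < best_len)) {
                    best_start = start;
                    best_len = len;
                }
            } else {
                i++;
            }
        }
        return best_start;
    }

    int main(void) {
        int idle[] = { 1, 1, 0, 1, 1, 1, 0, 1, 1 };
        int start  = best_fit(idle, 9, 2);
        if (start >= 0)
            printf("allocate contiguous nodes starting at %d\n", start);
        else
            printf("no contiguous set, fall back to sequential selection\n");
        return 0;
    }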
URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html
Last Modified January 9, 2002
Maintained by Moe Jette
jette1@llnl.gov