SLURM Programmer's Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters containing
thousands of nodes. Components include machine status,
job management, and scheduling modules. The design also
includes a scalable, general-purpose communication infrastructure
(MONGO, to be described elsewhere).
SLURM requires no kernel modifications for its operation and is
relatively self-contained.
Initial target platforms include Red Hat Linux clusters with
Quadrics interconnect and the IBM Blue Gene product line.
Modules
An overview of the components and their interactions is available
in a separate document, SLURM: Simple Linux Utility
for Resource Management.
Code should adhere to the Linux kernel code style as described in
http://www.linuxhq.com/kernel/v2.4/doc/CodingStyle.html.
Functions are divided into several categories, each in its own
directory. The details of each directory's contents are provided
below. The directories are as follows:
- api
- Application Program Interfaces into the SLURM code.
Used to send and get SLURM information from the central manager.
- common
- General purpose functions for widespread use.
- popt
- TBD
- scancel
- User command to cancel a job or allocation.
- scontrol
- Administrator tool to manage SLURM.
- slurmctld
- SLURM central manager code.
- slurmd
- SLURM code to manage the nodes used for executing user applications
under the control of slurmctld.
- squeue
- User command to get information on SLURM jobs and allocations.
- srun
- User command to submit a job, get an allocation, and/or initiate
a parallel job step.
- test
- Functions for testing individual SLURM modules.
API Modules
This directory contains modules supporting the SLURM API functions.
The APIs to get SLURM information accept a time-stamp. If the data
has not changed since the specified time, a return code indicates
this and no other data is returned. Otherwise a data structure is
returned including its time-stamp, element count, and an array of
structures describing the state of each node, job, partition, etc.
Each of these functions also includes a corresponding function to
release all storage associated with the data structure.
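The sketch below illustrates this time-stamp convention with slurm_load_node.
It is a minimal illustration, not production code: the exact return code used
to report "no change since update_time" and the handling of the returned
time-stamp are assumptions here; consult slurmlib.h for the authoritative
definitions.

	#include <time.h>
	#include <slurmlib.h>	/* SLURM API declarations for user applications */

	/* Illustration only: cache node information and refresh it only when
	 * slurmctld reports data newer than our last download.  The "no change"
	 * return convention is an assumption; see slurmlib.h. */
	static struct node_buffer *node_cache = NULL;
	static time_t node_cache_time = (time_t) 0;

	int refresh_node_cache (void)
	{
		struct node_buffer *new_buf = NULL;
		int rc;

		rc = slurm_load_node (node_cache_time, &new_buf);
		if (rc != 0)
			return rc;	/* error, or no change since node_cache_time */

		/* new data arrived: release the old copy and keep the new one */
		if (node_cache)
			slurm_free_node_info (node_cache);
		node_cache = new_buf;
		node_cache_time = time (NULL);	/* the buffer's own time-stamp could be used instead */
		return 0;
	}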
- allocate.c
- Allocates resources for a job's initiation.
This creates a job entry and allocates resources to it.
The resources can be claimed at a later time to actually
run a parallel job. If the requested resources are not
currently available, the request will fail.
- build_info.c
- Reports SLURM build parameter values.
- cancel.c
- Cancels (i.e. terminates) a running or pending job/allocation.
- job_info.c
- Reports job state information.
- node_info.c
- Reports node state and configuration values.
- partition_info.c
- Reports partition state and configuration values.
- reconfigure.c
- Requests that slurmctld reload configuration information.
- submit.c
- Submits a job to SLURM. The job will be queued
for initiation when resources are available.
- update_config.c
- Updates job, node or partition state information.
Future components will include: job step support (a set of parallel
tasks associated with a job or allocation; multiple job steps may
execute serially or in parallel within an allocation), association of
an allocation with a job step, issuing of keys, getting Elan (Quadrics
interconnect) capabilities, and resource accounting.
Common Modules
This directory contains modules of general use throughout the SLURM code.
The modules are described below.
- bits_bytes.c
- A collection of SLURM-specific functions for processing bitmaps and
parsing strings.
- bits_bytes.h
- Function definitions for bits_bytes.c.
- bitstring.c
- A collection of general purpose functions for managing bitmaps.
We use these for rapid node management functions, including scheduling
and associating partitions and jobs with the nodes.
- bitstring.h
- Function definitions for bitstring.c.
- list.c
- A general purpose list manager.
One can define a list, add and delete entries, search for entries, etc.
- list.h
- Definitions for list.c and documentation for its functions.
- log.c
- A general purpose log manager. It can filter log messages
based upon severity and route them to stderr, syslog, or a log file.
- log.h
- Definitions for log.c and documentation for its functions.
- macros.h
- General purpose SLURM macro definitions.
- pack.c
- Functions for packing and unpacking unsigned integers and strings
for transmission over the network. Unsigned integers are translated
to and from a machine-independent form, and strings are transmitted
with a length value (a sketch of this technique follows this list).
- pack.h
- Definitions for pack.c and documentation for its functions.
- qsw.c
- Functions for interacting with the Quadrics interconnect.
- qsw.h
- Definitions for qsw.c and documentation for its functions.
- strlcpy.c
- TBD
- slurm.h
- Definitions for common SLURM data structures and functions.
- slurmlib.h
- Definitions for SLURM API data structures and functions.
This is included by user applications that use the SLURM APIs.
- xassert.c
- TBD
- xassert.h
- Definitions for xassert.c and documentation for its functions.
- xmalloc.c
- "Safe" memory management functions. Includes magic cooking to insure
that freed memory was in fact allocated by its functions.
- xmalloc.h
- Definitions for xmalloc.c and documentation for its functions.
- xstring.c
- A collection of string manipulation functions that automatically expand
the allocated memory as needed.
- xstring.h
- Definitions for xstring.c and documentation for its functions.
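The following stand-alone sketch illustrates the technique pack.c uses, with
plain C and network byte order; it does not reproduce pack.c's actual
interface, so treat the function names and signatures here as illustrative
only.

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>
	#include <arpa/inet.h>	/* htonl */

	/* Illustration of the technique used by pack.c: unsigned integers are
	 * converted to a machine-independent (network) byte order and strings
	 * are preceded by their length.  This is not the pack.c interface. */
	static size_t pack_u32 (uint32_t val, unsigned char *buf)
	{
		uint32_t net = htonl (val);
		memcpy (buf, &net, sizeof (net));
		return sizeof (net);
	}

	static size_t pack_str (const char *str, unsigned char *buf)
	{
		uint32_t len = (uint32_t) strlen (str) + 1;	/* include trailing NUL */
		size_t offset = pack_u32 (len, buf);
		memcpy (buf + offset, str, len);
		return offset + len;
	}

	int main (void)
	{
		unsigned char buffer[64];
		size_t used = 0;

		used += pack_u32 (12345, buffer + used);
		used += pack_str ("linux123", buffer + used);
		printf ("packed %lu bytes\n", (unsigned long) used);
		return 0;
	}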
scancel Modules
scancel is a command to cancel running or pending jobs.
- scancel.c
- A command line interface to cancel jobs or their allocations.
scontrol Modules
scontrol is the administrator tool for monitoring and modifying SLURM configuration
and state. It has a command line interface only.
- scontrol.c
- A command line interface to slurmctld.
slurmctld Modules
slurmctld executes on the control machine and orchestrates SLURM activities
across the entire cluster including monitoring node and partition state,
scheduling, job queue management, job dispatching, and switch management.
The slurmctld modules and their functionality are described below.
- controller.c
- Primary SLURM daemon to execute on control machine.
It manages communications and the Partition Manager, Switch Manager, and Job Manager threads.
- job_mgr.c
- Module reads, writes, records, updates, and otherwise
manages the state information for all jobs and their allocations.
- job_scheduler.c
- Module determines which pending job(s) should execute next
and initiates them.
- node_mgr.c
- Module reads, writes, records, updates, and otherwise
manages the state information for all nodes (machines) in the
cluster managed by SLURM.
- node_scheduler.c
- Selects the nodes to be allocated to pending jobs. This makes extensive use
of bitmaps to represent the nodes (see the sketch following this list). It also
considers the locality of nodes to improve communications performance.
- partition_mgr.c
- Module reads, writes, records, updates, and otherwise
manages the state information associated with partitions in the
cluster managed by SLURM.
- read_config.c
- Reads the SLURM configuration file and uses it to build the node and
partition data structures.
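The sketch below illustrates why bitmaps make these node operations cheap:
finding the nodes that are both idle and within a partition is one AND per
machine word. It uses plain C bit operations rather than the bitstring.c
interface, and the fixed 128-node limit is only for illustration.

	#include <stdio.h>
	#include <stdint.h>

	#define MAX_NODES 128
	#define WORDS (MAX_NODES / 64)

	/* Illustration only: a node set as a fixed-size bitmap.  bitstring.c
	 * provides a more general, dynamically sized equivalent. */
	typedef struct { uint64_t word[WORDS]; } node_bitmap_t;

	/* Nodes usable by a job = nodes in the partition AND idle nodes. */
	static void bitmap_and (node_bitmap_t *dest,
				const node_bitmap_t *a, const node_bitmap_t *b)
	{
		int i;
		for (i = 0; i < WORDS; i++)
			dest->word[i] = a->word[i] & b->word[i];
	}

	static int bitmap_count (const node_bitmap_t *b)
	{
		int i, bit, count = 0;
		for (i = 0; i < WORDS; i++)
			for (bit = 0; bit < 64; bit++)
				if (b->word[i] & ((uint64_t) 1 << bit))
					count++;
		return count;
	}

	int main (void)
	{
		node_bitmap_t partition = { { 0x00000000ffffffffULL, 0 } };	/* nodes 0-31 in partition */
		node_bitmap_t idle      = { { 0x0000ffff0000ffffULL, 0 } };	/* some nodes idle */
		node_bitmap_t usable;

		bitmap_and (&usable, &partition, &idle);
		printf ("%d nodes are idle and in the partition\n",
			bitmap_count (&usable));
		return 0;
	}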
slurmd Modules
slurmd executes on each compute node. It initiates and terminates user
jobs and monitors both system and job state. The slurmd modules and their
functionality are described below.
- get_mach_stat.c
- Gets the machine's status and configuration in an operating system
independent fashion (a sketch follows this list).
This configuration information includes: size of real memory,
size of temporary disk storage, and the number of processors.
- read_proc.c
- Collects job state information including real memory use, virtual
memory use, and CPU time use.
While operating system independent code is desirable, this
module is not completely portable.
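As a rough illustration of how get_mach_stat.c's configuration values can be
gathered, the sketch below uses sysconf(), which is widely available though
not strictly standard, to obtain the processor count and real memory size.
It is not the module's actual implementation and omits temporary disk space,
which typically requires a filesystem-specific call.

	#include <stdio.h>
	#include <unistd.h>

	/* Illustration only: collect processor count and real memory size
	 * through sysconf(), one way to keep this largely operating system
	 * independent.  Temporary disk space is omitted here. */
	int main (void)
	{
		long cpus = sysconf (_SC_NPROCESSORS_ONLN);
		long pages = sysconf (_SC_PHYS_PAGES);
		long page_size = sysconf (_SC_PAGESIZE);

		if (cpus < 0 || pages < 0 || page_size < 0) {
			perror ("sysconf");
			return 1;
		}

		printf ("Procs=%ld RealMemory=%ld MB\n",
			cpus, (pages / 1024) * (page_size / 1024));
		return 0;
	}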
Design Issues
Most modules are constructed with some simple built-in tests.
NOTE: We need to convert these to a DejaGnu framework.
Set declarations for DEBUG_MODULE and DEBUG_SYSTEM both to 1 near
the top of the module's code. Then compile and run the test.
Required input scripts and configuration files for these tests
will be kept in the "etc" subdirectory and the commands to execute
the tests are in the "Makefile". In some cases, the module must
be loaded with some other components. In those cases, the support
modules should be built with the declaration for DEBUG_MODULE set
to 0 and for DEBUG_SYSTEM set to 1.
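In outline, the pattern looks like the following sketch; the actual test code
varies from module to module.

	/* Near the top of the module under test: */
	#define DEBUG_MODULE 1	/* build this file's stand-alone test driver */
	#define DEBUG_SYSTEM 1	/* enable extra debugging/validation code   */

	/* ... the module's normal functions ... */

	#if DEBUG_MODULE
	/* Stand-alone test driver, compiled only when DEBUG_MODULE is set.
	 * When this module is linked as a support module for another module's
	 * test, it is built with DEBUG_MODULE 0 and DEBUG_SYSTEM 1 so that no
	 * second main() is defined. */
	#include <stdio.h>

	int main (int argc, char *argv[])
	{
		/* exercise the module's functions using input scripts and
		 * configuration files from the "etc" subdirectory */
		printf ("all tests passed\n");
		return 0;
	}
	#endif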
Many of these modules have been built and tested on a variety of
Unix computers including Red Hat Linux, IBM's AIX, Sun's Solaris,
and Compaq's Tru64. The only module which is operating system
dependent at this time is slurmd/read_proc.c.
The node selection logic allocates nodes to jobs in a fashion which
makes most sense for a Quadrics switch interconnect. It allocates
the smallest collection of consecutive nodes that satisfies the
request (e.g. if there are 32 consecutive nodes and 16 consecutive
nodes available, a job needing 16 or fewer nodes will be allocated
those nodes from the 16 node set rather than fragment the 32 node
set). If the job cannot be allocated consecutive nodes, it will
be allocated the smallest number of consecutive sets (e.g. if there
are sets of available consecutive nodes of sizes 6, 4, 3, 3, 2, 1,
and 1 then a request for 10 nodes will always be allocated the 6
and 4 node sets rather than use the smaller sets).
These techniques minimize the job communications overhead.
A job can use hardware broadcast mechanisms given consecutive nodes.
Without consecutive nodes, much slower software broadcast mechanisms
must be used.
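A minimal sketch of the first selection rule follows. It works on a list of
consecutive-node set sizes rather than the bitmaps node_scheduler.c actually
uses, and it ignores locality, so it only illustrates the best-fit rule
described above.

	#include <stdio.h>

	/* Illustration of the selection rule described above: given the sizes
	 * of the available sets of consecutive nodes, pick the smallest single
	 * set that satisfies the request. */
	static int pick_best_fit (const int set_size[], int num_sets, int nodes_wanted)
	{
		int i, best = -1;

		for (i = 0; i < num_sets; i++) {
			if (set_size[i] < nodes_wanted)
				continue;	/* too small */
			if (best == -1 || set_size[i] < set_size[best])
				best = i;	/* smallest set that still fits */
		}
		return best;	/* -1 means no single set fits; combine sets instead */
	}

	int main (void)
	{
		int sets[] = { 32, 16 };

		/* A 16-node request is served from the 16-node set, leaving the
		 * 32-node set unfragmented. */
		printf ("chose set of %d nodes\n", sets[pick_best_fit (sets, 2, 16)]);
		return 0;
	}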
Application Program Interface (API)
All functions described below can be issued from any node in the SLURM cluster.
- int slurm_load_build (time_t update_time, struct build_buffer **build_buffer_ptr);
- If the SLURM build information has changed since update_time, then
download the current information from slurmctld. The information includes
the data's time of update, the machine on which the primary slurmctld
server runs, the pathname of the prolog program, the pathname of the
temporary file system, etc.
See slurmlib.h for a full description of the information available.
Execute slurm_free_build_info to release the memory allocated by slurm_load_build.
- void slurm_free_build_info (struct build_buffer *build_buffer_ptr);
- Release memory allocated by the slurm_load_build function.
- int slurm_load_job (time_t update_time, struct job_buffer **job_buffer_ptr);
- If any SLURM job information has changed since update_time, then
download the current information from slurmctld. The information includes
a count of job entries, and each job's name, job id, user id, allocated
nodes, etc. Included with the job information is an array of indices
into the node table information as downloaded with slurm_load_node.
See slurmlib.h for a full description of the information available.
Execute slurm_free_job_info to release the memory allocated by slurm_load_job.
- void slurm_free_job_info (struct job_buffer *job_buffer_ptr);
- Release memory allocated by the slurm_load_job function.
- int slurm_load_node (time_t update_time, struct node_buffer **node_buffer_ptr);
- If any SLURM node information has changed since update_time, then
download the current information from slurmctld. The information includes
a count of node entries, and each node's name, real memory size, temporary
disk space, processor count, features, etc.
See slurmlib.h for a full description of the information available.
Execute slurm_free_node_info to release the memory allocated by slurm_load_node.
- void slurm_free_node_info (struct node_buffer *node_buffer_ptr);
- Release memory allocated by the slurm_load_node function.
- int slurm_load_part (time_t update_time, struct part_buffer **part_buffer_ptr);
- If any SLURM partition information has changed since update_time, then
download the current information from slurmctld. The information includes
a count of partition entries, and each partition's name, node count limit
(per job), time limit (per job), group access restrictions, associated
nodes, etc. Included with the partition information is an array of indices
into the node table information as downloaded with slurm_load_node.
See slurmlib.h for a full description of the information available.
Execute slurm_free_part_info to release the memory allocated by slurm_load_part.
- void slurm_free_part_info (struct part_buffer *part_buffer_ptr);
- Release memory allocated by the slurm_load_part function.
Examples of API Use
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <time.h>
#include <slurm.h>

int
main (int argc, char *argv[])
{
	int i, j, k;
	slurm_ctl_conf_info_msg_t *conf_info_msg_ptr = NULL;
	job_info_msg_t *job_buffer_ptr = NULL;
	node_info_msg_t *node_buffer_ptr = NULL;
	partition_info_msg_t *part_buffer_ptr = NULL;

	/* get and dump some configuration information */
	if (slurm_load_ctl_conf ((time_t) NULL, &conf_info_msg_ptr)) {
		printf ("slurm_load_ctl_conf errno %d\n", errno);
		exit (1);
	}
	printf ("control_machine = %s\n", conf_info_msg_ptr->control_machine);
	printf ("server_timeout = %u\n", conf_info_msg_ptr->server_timeout);
	slurm_free_ctl_conf (conf_info_msg_ptr);

	/* get and dump some job information */
	if (slurm_load_jobs ((time_t) NULL, &job_buffer_ptr)) {
		printf ("slurm_load_jobs errno %d\n", errno);
		exit (1);
	}
	printf ("Jobs updated at %lx, record count %d\n",
		job_buffer_ptr->last_update, job_buffer_ptr->record_count);
	for (i = 0; i < job_buffer_ptr->record_count; i++) {
		printf ("JobId=%u UserId=%u\n",
			job_buffer_ptr->job_array[i].job_id,
			job_buffer_ptr->job_array[i].user_id);
	}
	slurm_free_job_info (job_buffer_ptr);

	/* get and dump some node information */
	if (slurm_load_node ((time_t) NULL, &node_buffer_ptr)) {
		printf ("slurm_load_node errno %d\n", errno);
		exit (1);
	}
	for (i = 0; i < node_buffer_ptr->node_count; i++) {
		printf ("NodeName=%s CPUs=%u\n",
			node_buffer_ptr->node_array[i].name,
			node_buffer_ptr->node_array[i].cpus);
	}

	/* get and dump some partition information */
	/* note that we use the node information loaded above and */
	/* we assume the node table entries have not changed since */
	if (slurm_load_partitions ((time_t) NULL, &part_buffer_ptr)) {
		printf ("slurm_load_partitions errno %d\n", errno);
		exit (1);
	}
	printf ("Partitions updated at %lx, record count %d\n",
		part_buffer_ptr->last_update, part_buffer_ptr->record_count);
	for (i = 0; i < part_buffer_ptr->record_count; i++) {
		printf ("PartitionName=%s MaxTime=%u Nodes=%s:",
			part_buffer_ptr->partition_array[i].name,
			part_buffer_ptr->partition_array[i].max_time,
			part_buffer_ptr->partition_array[i].nodes);
		/* node_inx holds pairs of (first, last) node table indices,
		 * terminated by a value of -1 */
		for (j = 0; part_buffer_ptr->partition_array[i].node_inx[j] != -1;
		     j += 2) {
			for (k = part_buffer_ptr->partition_array[i].node_inx[j];
			     k <= part_buffer_ptr->partition_array[i].node_inx[j+1];
			     k++) {
				printf ("%s ", node_buffer_ptr->node_array[k].name);
			}
		}
		printf ("\n\n");
	}
	slurm_free_node_info (node_buffer_ptr);
	slurm_free_partition_info (part_buffer_ptr);
	exit (0);
}
To Do
- How do we interface with TotalView?
- Deadlines: MCR to be built in July 2002, accepted August 2002.
URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html
Last Modified May 13, 2002
Maintained by
slurm-dev@lists.llnl.gov