SLURM Programmer's Guide

Overview

Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters containing thousands of nodes. Components include machine status, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure (MONGO, to be described elsewhere). SLURM requires no kernel modifications for its operation and is relatively self-contained. Initial target platforms include Red Hat Linux clusters with the Quadrics interconnect and the IBM Blue Gene product line.

Modules

An overview of the components and their interactions is available in a separate document, SLURM: Simple Linux Utility for Resource Management.

Code should adhere to the Linux kernel coding style as described in http://www.linuxhq.com/kernel/v2.4/doc/CodingStyle.html.

Functions are divided into several categories, each in its own directory. The details of each directory's contents are provided below. The directories are as follows:

api
Application Program Interfaces into the SLURM code. Used to send requests to and get SLURM information from the central manager.
common
General purpose functions for widespread use.
popt
Bundled copy of the popt library, used by the SLURM commands for command line option parsing.
scancel
User command to cancel a job or allocation.
scontrol
Administrator tool to manage SLURM.
slurmctld
SLURM central manager code.
slurmd
SLURM code to manage the nodes used for executing user applications under the control of slurmctld.
squeue
User command to get information on SLURM jobs and allocations.
srun
User command to submit a job, get an allocation, and/or initiate a parallel job step.
test
Functions for testing individual SLURM modules.

API Modules

This directory contains modules supporting the SLURM API functions. The APIs to get SLURM information accept a time-stamp. If the data has not changed since the specified time, a return code indicates this and no other data is returned. Otherwise a data structure is returned that includes its time-stamp, an element count, and an array of structures describing the state of each node, job, partition, etc. Each of these functions also has a corresponding function to release all storage associated with the data structure. A sketch of this calling pattern appears at the end of this section.
allocate.c
Allocates resources for a job's initiation. This creates a job entry and allocates resources to it. The resources can be claimed at a later time to actually run a parallel job. If the requested resources are not currently available, the request will fail.
build_info.c
Reports SLURM build parameter values.
cancel.c
Cancels (i.e. terminates) a running or pending job/allocation.
job_info.c
Reports job state information.
node_info.c
Reports node state and configuration values.
partition_info.c
Reports partition state and configuration values.
reconfigure.c
Requests that slurmctld reload configuration information.
submit.c
Submits a job to SLURM. The job will be queued for initiation when resources are available.
update_config.c
Updates job, node or partition state information.
Future components will include: job step support (a set of parallel tasks associated with a job or allocation; multiple job steps may execute serially or in parallel within an allocation), association of an allocation with a job step, issuing of keys, retrieval of Elan (Quadrics interconnect) capabilities, and resource accounting.
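
The sketch below illustrates the intended caching pattern using the slurm_load_node interface described later in this document. The "no change" return code name SLURM_NO_CHANGE and the node_buffer field name last_update are illustrative assumptions; see slurmlib.h for the actual definitions.

#include <stdio.h>
#include <time.h>
#include "slurmlib.h"

static struct node_buffer *node_cache = NULL;
static time_t cache_time = (time_t) 0;

/* Refresh our cached node table only if slurmctld has newer data. */
void refresh_node_cache (void)
{
	struct node_buffer *new_buf = NULL;
	int rc = slurm_load_node (cache_time, &new_buf);

	if (rc == SLURM_NO_CHANGE)	/* assumed name for "unchanged" */
		return;			/* cached copy is still current */
	if (rc != 0) {
		fprintf (stderr, "slurm_load_node error %d\n", rc);
		return;
	}
	if (node_cache)			/* release the stale copy */
		slurm_free_node_info (node_cache);
	node_cache = new_buf;
	cache_time = new_buf->last_update;	/* assumed field name */
}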

Common Modules

This directory contains modules of general use throughout the SLURM code. The modules are described below.
bits_bytes.c
A collection of SLURM-specific functions for processing bitmaps and parsing strings.
bits_bytes.h
Function definitions for bits_bytes.c.
bitstring.c
A collection of general purpose functions for managing bitmaps. These are used for rapid node management functions, including scheduling and associating partitions and jobs with nodes.
bitstring.h
Function definitions for bitstring.c.
list.c
A general purpose list manager. One can define a list, add and delete entries, search for entries, etc.
list.h
Definitions for list.c and documentation for its functions.
log.c
A general purpose log manager. It can filter log messages based upon severity and route them to stderr, syslog, or a log file.
log.h
Definitions for log.c and documentation for its functions.
macros.h
General purpose SLURM macro definitions.
pack.c
Functions for packing and unpacking unsigned integers and strings for transmission over the network. The unsigned integers are translated to and from machine independent form; strings are transmitted with a length value. A sketch of these conventions appears at the end of this section.
pack.h
Definitions for pack.c and documentation for its functions.
qsw.c
Functions for interacting with the Quadrics interconnect.
qsw.h
Definitions for qsw.c and documentation for its functions.
strlcpy.c
Provides the BSD strlcpy function (size-bounded string copying) on systems that lack it.
slurm.h
Definitions for common SLURM data structures and functions.
slurmlib.h
Definitions for SLURM API data structures and functions. This is included by user applications that link with the SLURM APIs.
xassert.c
TBD
xassert.h
Definitions for xassert.c and documentation for its functions.
xmalloc.c
"Safe" memory management functions. Includes magic cooking to insure that freed memory was in fact allocated by its functions.
xmalloc.h
Definitions for xmalloc.c and documentation for its functions.
xstring.c
A collection of functions for string manipulation with automatic expansion of the allocated memory as needed.
xstring.h
Definitions for xstring.c and documentation for its functions.
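
The packing conventions implemented by pack.c (network byte order integers, length-prefixed strings) are sketched below. The function names demo_pack32, demo_packstr, and demo_unpack32 are illustrative, not the actual pack.c interface.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl/ntohl */

/* Pack a 32-bit unsigned integer in machine independent (network)
 * byte order and advance the buffer pointer. */
static void demo_pack32 (uint32_t value, char **buf_ptr)
{
	uint32_t net = htonl (value);
	memcpy (*buf_ptr, &net, sizeof (net));
	*buf_ptr += sizeof (net);
}

/* Pack a string as a 32-bit length followed by the bytes themselves
 * (here including the terminating '\0'). */
static void demo_packstr (const char *str, char **buf_ptr)
{
	uint32_t len = (uint32_t) strlen (str) + 1;
	demo_pack32 (len, buf_ptr);
	memcpy (*buf_ptr, str, len);
	*buf_ptr += len;
}

/* Unpacking reverses the process, converting back to host order. */
static void demo_unpack32 (uint32_t *value, char **buf_ptr)
{
	uint32_t net;
	memcpy (&net, *buf_ptr, sizeof (net));
	*value = ntohl (net);
	*buf_ptr += sizeof (net);
}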

scancel Modules

scancel is a command to cancel running or pending jobs.
scancel.c
A command line interface to cancel jobs or their allocations.

scontrol Modules

scontrol is the administrator tool for monitoring and modifying SLURM configuration and state. It has a command line interface only.
scontrol.c
A command line interface to slurmctld.

slurmctld Modules

slurmctld executes on the control machine and orchestrates SLURM activities across the entire cluster including monitoring node and partition state, scheduling, job queue management, job dispatching, and switch management. The slurmctld modules and their functionality are described below.
controller.c
The primary SLURM daemon, which executes on the control machine. It manages communications with the Partition Manager, Switch Manager, and Job Manager threads.
job_mgr.c
Module reads, writes, records, updates, and otherwise manages the state information for all jobs and job allocations.
job_scheduler.c
Module determines which pending job(s) should execute next and initiates them.
node_mgr.c
Module reads, writes, records, updates, and otherwise manages the state information for all nodes (machines) in the cluster managed by SLURM.
node_scheduler.c
Selects the nodes to be allocated to pending jobs. This makes extensive use of bitmaps in representing the nodes. It also considers the locality of nodes to improve communications performance (a simplified sketch of the bitmap representation follows this list).
partition_mgr.c
Module reads, writes, records, updates, and otherwise manages the state information associated with partitions in the cluster managed by SLURM.
read_config.c
Reads the SLURM configuration file and uses it to build the node and partition data structures.
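
As a rough illustration of the bitmap representation used by node_mgr.c and node_scheduler.c, the sketch below intersects a partition's node bitmap with the bitmap of idle nodes to find candidates for a job. It uses plain unsigned words for clarity; the actual code uses the bitstring.c functions from the common directory.

#include <stdint.h>

#define MAX_NODES	1024
#define BITMAP_WORDS	(MAX_NODES / 64)

/* One bit per node: bit i is set if node i has the given property. */
typedef struct {
	uint64_t word[BITMAP_WORDS];
} node_bitmap_t;

/* Candidate nodes for a job are the intersection of the partition's
 * node bitmap and the bitmap of currently idle nodes. */
static void candidate_nodes (const node_bitmap_t *partition_nodes,
			     const node_bitmap_t *idle_nodes,
			     node_bitmap_t *candidates)
{
	int i;

	for (i = 0; i < BITMAP_WORDS; i++)
		candidates->word[i] =
			partition_nodes->word[i] & idle_nodes->word[i];
}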

slurmd Modules

slurmd executes on each compute node. It initiates and terminates user jobs and monitors both system and job state. The slurmd modules and their functionality are described below.
get_mach_stat.c
Gets the machine's status and configuration in an operating system independent fashion. This configuration information includes: size of real memory, size of temporary disk storage, and the number of processors (a portable sketch follows this list).
read_proc.c
Collects job state information including real memory use, virtual memory use, and CPU time use. While operating system independent code is desirable, this module is not completely portable.
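
The kind of probing get_mach_stat.c performs can be sketched with POSIX interfaces, as below. Note that _SC_NPROCESSORS_ONLN and _SC_PHYS_PAGES are widely supported but not guaranteed by POSIX, which is part of why complete portability is difficult.

#include <stdio.h>
#include <unistd.h>
#include <sys/statvfs.h>

int main (void)
{
	long cpus  = sysconf (_SC_NPROCESSORS_ONLN);	/* processor count */
	long pages = sysconf (_SC_PHYS_PAGES);		/* real memory pages */
	long psize = sysconf (_SC_PAGESIZE);		/* bytes per page */
	struct statvfs fs;

	printf ("Procs=%ld RealMemory=%lluMB\n", cpus,
		(unsigned long long) pages * psize / (1024 * 1024));

	/* temporary disk space available in /tmp */
	if (statvfs ("/tmp", &fs) == 0)
		printf ("TmpDisk=%lluMB\n",
			(unsigned long long) fs.f_bavail * fs.f_frsize /
			(1024 * 1024));
	return 0;
}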

Design Issues

Most modules are constructed with some simple built-in tests. NOTE: We need to convert this to a DejaGnu framework. To run a module's tests, set the declarations for DEBUG_MODULE and DEBUG_SYSTEM both to 1 near the top of the module's code, then compile and run the test. Required input scripts and configuration files for these tests are kept in the "etc" subdirectory, and the commands to execute the tests are in the "Makefile". In some cases, the module must be linked with other components. In those cases, the support modules should be built with the declaration for DEBUG_MODULE set to 0 and for DEBUG_SYSTEM set to 1.
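
The convention looks roughly like the following sketch of a module's source file; the test body shown is illustrative.

#define DEBUG_MODULE 1	/* 1 = compile this module's own test driver */
#define DEBUG_SYSTEM 1	/* 1 = enable test support code */

/* ... the module's regular functions ... */

#if DEBUG_MODULE
#include <stdio.h>

/* When built as a stand-alone test, the module supplies its own main()
 * that exercises its functions and reports the results. */
int main (int argc, char *argv[])
{
	/* invoke the module's functions with test input here */
	printf ("test complete\n");
	return 0;
}
#endif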

Many of these modules have been built and tested on a variety of Unix computers including Red Hat Linux, IBM's AIX, Sun's Solaris, and Compaq's Tru64. The only operating system dependent module at this time is slurmd/read_proc.c.

The node selection logic allocates nodes to jobs in the fashion which makes most sense for a Quadrics switch interconnect. It allocates the smallest collection of consecutive nodes that satisfies the request (e.g. if there are 32 consecutive nodes and 16 consecutive nodes available, a job needing 16 or fewer nodes will be allocated nodes from the 16 node set rather than fragment the 32 node set). If the job cannot be allocated consecutive nodes, it will be allocated the smallest number of consecutive sets (e.g. if there are sets of available consecutive nodes of sizes 6, 4, 3, 3, 2, 1, and 1, then a request for 10 nodes will always be allocated the 6 and 4 node sets rather than use the smaller sets). These techniques minimize job communications overhead. A job can use hardware broadcast mechanisms given consecutive nodes; without consecutive nodes, much slower software broadcast mechanisms must be used.
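
The first rule, best-fit selection of a single run of consecutive nodes, can be sketched as follows. The representation (an idle flag per node) is simplified; the actual implementation works on the bitmaps described earlier.

/* Return the starting index of the smallest run of consecutive idle
 * nodes that satisfies the request, or -1 if no single run is large
 * enough (in which case the selection logic combines the largest
 * available runs to minimize the number of fragments). */
static int best_fit_run (const char *idle, int node_cnt, int want)
{
	int i = 0, best_start = -1, best_len = node_cnt + 1;

	while (i < node_cnt) {
		if (!idle[i]) {
			i++;
			continue;
		}
		/* measure this run of consecutive idle nodes */
		int start = i, len = 0;
		while (i < node_cnt && idle[i]) {
			i++;
			len++;
		}
		/* keep the smallest run that is still big enough */
		if (len >= want && len < best_len) {
			best_len = len;
			best_start = start;
		}
	}
	return best_start;
}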

Application Program Interface (API)

All functions described below can be issued from any node in the SLURM cluster.
int slurm_load_build (time_t update_time, struct build_buffer **build_buffer_ptr);
If the SLURM build information has changed since update_time, then download the current information from slurmctld. The information includes the time of the data's last update, the machine serving as the primary slurmctld server, the pathname of the prolog program, the pathname of the temporary file system, etc. See slurmlib.h for a full description of the information available. Execute slurm_free_build_info to release the memory allocated by slurm_load_build.
void slurm_free_build_info (struct build_buffer *build_buffer_ptr);
Release memory allocated by the slurm_load_build function.
int slurm_load_job (time_t update_time, struct job_buffer **job_buffer_ptr);
If any SLURM job information has changed since update_time, then download the current information from slurmctld. The information includes a count of job entries and each job's name, job id, user id, allocated nodes, etc. Included with the job information is an array of indices into the node table information as downloaded with slurm_load_node. See slurmlib.h for a full description of the information available. Execute slurm_free_job_info to release the memory allocated by slurm_load_job.
void slurm_free_job_info (struct job_buffer *job_buffer_ptr);
Release memory allocated by the slurm_load_job function.
int slurm_load_node (time_t update_time, struct node_buffer **node_buffer_ptr);
If any SLURM node information has changed since update_time, then download the current information from slurmctld. The information includes a count of node entries and each node's name, real memory size, temporary disk space, processor count, features, etc. See slurmlib.h for a full description of the information available. Execute slurm_free_node_info to release the memory allocated by slurm_load_node.
void slurm_free_node_info (struct node_buffer *node_buffer_ptr);
Release memory allocated by the slurm_load_node function.
int slurm_load_part (time_t update_time, struct part_buffer **part_buffer_ptr);
If any SLURM partition information has changed since update_time, then download the current information from slurmctld. The information includes a count of partition entries and each partition's name, node count limit (per job), time limit (per job), group access restrictions, associated nodes, etc. Included with the partition information is an array of indices into the node table information as downloaded with slurm_load_node. See slurmlib.h for a full description of the information available. Execute slurm_free_part_info to release the memory allocated by slurm_load_part.
void slurm_free_part_info (struct part_buffer *part_buffer_ptr);
Release memory allocated by the slurm_load_part function.

Examples of API Use

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <slurm.h>

int
main (int argc, char *argv[])
{
	int i, j, k;
	slurm_ctl_conf_info_msg_t *conf_info_msg_ptr = NULL;
	job_info_msg_t *job_buffer_ptr = NULL;
	node_info_msg_t *node_buffer_ptr = NULL;
	partition_info_msg_t *part_buffer_ptr = NULL;


	/* get and dump some configuration information */
	if ( slurm_load_ctl_conf ((time_t) 0, &conf_info_msg_ptr ) ) {
		printf ("slurm_load_ctl_conf errno %d\n", errno);
		exit (1);
	}

	printf ("control_machine = %s\n",
		conf_info_msg_ptr->control_machine);
	printf ("server_timeout  = %u\n",
		conf_info_msg_ptr->server_timeout);
	slurm_free_ctl_conf (conf_info_msg_ptr);


	/* get and dump some job information */
	if ( slurm_load_jobs ((time_t) 0, &job_buffer_ptr) ) {
		printf ("slurm_load_jobs errno %d\n", errno);
		exit (1);
	}

	printf ("Jobs updated at %lx, record count %d\n",
		job_buffer_ptr->last_update, job_buffer_ptr->record_count);

	for (i = 0; i < job_buffer_ptr->record_count; i++) {
		printf ("JobId=%u UserId=%u\n",
			job_buffer_ptr->job_array[i].job_id,
			job_buffer_ptr->job_array[i].user_id);
	}
	slurm_free_job_info (job_buffer_ptr);


	/* get and dump some node information */
	if ( slurm_load_node ((time_t) 0, &node_buffer_ptr) ) {
		printf ("slurm_load_node errno %d\n", errno);
		exit (1);
	}

	for (i = 0; i < node_buffer_ptr->node_count; i++) {
		printf ("NodeName=%s CPUs=%u\n",
			node_buffer_ptr->node_array[i].name,
			node_buffer_ptr->node_array[i].cpus);
	}


	/* get and dump some partition information */
	/* note that we use the node information loaded above and */
	/* assume the node table entries have not changed since then */
	if ( slurm_load_partitions ((time_t) 0, &part_buffer_ptr) ) {
		printf ("slurm_load_partitions errno %d\n", errno);
		exit (1);
	}
	printf ("Partitions updated at %lx, record count %d\n",
		part_buffer_ptr->last_update, part_buffer_ptr->record_count);

	for (i = 0; i < part_buffer_ptr->record_count; i++) {
		printf ("PartitionName=%s MaxTime=%u Nodes=%s:",
			part_buffer_ptr->partition_array[i].name,
			part_buffer_ptr->partition_array[i].max_time,
			part_buffer_ptr->partition_array[i].nodes );
		/* node_inx holds pairs of (first, last) node table
		 * indices, terminated by -1 */
		if (part_buffer_ptr->partition_array[i].node_inx) {
			for (j = 0;
			     part_buffer_ptr->partition_array[i].node_inx[j] != -1;
			     j += 2) {
				for (k = part_buffer_ptr->partition_array[i].node_inx[j];
				     k <= part_buffer_ptr->partition_array[i].node_inx[j+1];
				     k++) {
					printf ("%s ",
						node_buffer_ptr->node_array[k].name);
				}
			}
		}
		printf ("\n\n");
	}
	slurm_free_node_info (node_buffer_ptr);
	slurm_free_partition_info (part_buffer_ptr);
	exit (0);
}

To Do


URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html

Last Modified May 13, 2002

Maintained by slurm-dev@lists.llnl.gov