SLURM Programmer's Guide

Overview

Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure. SLURM requires no kernel modifications and is relatively self-contained.

The components and their interactions are described in a separate document, SLURM: Simple Linux Utility for Resource Management.

Code should adhere to the Linux kernel code style as described in http://www.linuxhq.com/kernel/v2.4/doc/CodingStyle.html.

API Modules

This directory contains modules supporting the SLURM API functions.
allocate.c
API to allocate resources for a job's initiation.
build_info.c
API to report SLURM build parameter values.
node_info.c
API to report node state and configuration values.
partition_info.c
API to report partition state and configuration values.
reconfigure.c
API to request that slurmctld reload configuration information.
update_config.c
API to update job, node or partition state information.

Common Modules

This directory contains modules of general use throughout the SLURM code. The modules are described below.
bits_bytes.c
A collection of functions for processing bit maps and strings for parsing.
list.c
Module is a general purpose list manager. One can define a list, add and delete entries, search for entries, etc.
list.h
Module contains definitions for list.c and documentation for its functions.
slurm.h
Definitions for common SLURM data structures and functions.
slurmlib.h
Definitions for SLURM API data structures and functions.

scancel Modules

scancel is a command to cancel running or pending jobs.
scancel.c
A command line interface to cancel jobs.

scontrol Modules

scontrol is the administrator tool for monitoring and modifying SLURM configuration and state. It has a command line interface only.
scontrol.c
A command line interface to slurmctld.

slurmctld Modules

slurmctld executes on the control machine and orchestrates SLURM activities across the entire cluster including monitoring node and partition state, scheduling, job queue management, job dispatching, and switch management. The slurmctld modules and their functionality are described below.
controller.c
Primary SLURM daemon, which executes on the control machine. It manages communications among the Partition Manager, Switch Manager, and Job Manager threads.
node_mgr.c
Module reads, writes, records, updates, and otherwise manages the state information for all nodes (machines) in the cluster managed by SLURM.
node_scheduler.c
Selects the nodes to be allocated to pending jobs. This makes extensive use of bit maps in representing the nodes. It also considers the locality of nodes to improve communications performance.
partition_mgr.c
Module reads, writes, records, updates, and otherwise manages the state information associated with partitions in the cluster managed by SLURM.
read_config.c
Read the SLURM configuration file and use it to build node and partition data structures.

slurmd Modules

slurmd executes on each compute node. It initiates and terminates user jobs and monitors both system and job state. The slurmd modules and their functionality are described below.
get_mach_stat.c
This module gets the machine's status and configuration. This includes: size of real memory, size of temporary disk storage, and the number of processors.
read_proc.c
This module collects job state information including real memory use, virtual memory use, and CPU time use.

Design Issues

Most modules are constructed with some simple, built-in tests. To run them, set the declarations for DEBUG_MODULE and DEBUG_SYSTEM both to 1 near the top of the module's code, then compile and run the test. Required input scripts and configuration files for these tests will be kept in the "etc" subdirectory, and the commands to execute the tests are in the "Makefile". In some cases, the module must be loaded with some other components. In those cases, the support modules should be built with the declaration for DEBUG_MODULE set to 0 and for DEBUG_SYSTEM set to 1.
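The built-in test convention described above can be sketched as follows. This is only an illustration, not code from the SLURM source: the function count_set_bits and its test are hypothetical, and the use of DEBUG_SYSTEM to enable diagnostic output is an assumption, since this guide names the two declarations but does not say exactly what each one controls. The #ifndef guards are also a choice made for this sketch so that the values can be set from the compiler command line.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Set both to 1 to build this module as a stand-alone test program.
 * When this module only supports another module's test, build it with
 * DEBUG_MODULE 0 and DEBUG_SYSTEM 1.
 */
#ifndef DEBUG_MODULE
#define DEBUG_MODULE 0
#endif
#ifndef DEBUG_SYSTEM
#define DEBUG_SYSTEM 0
#endif

/* Hypothetical module function: count the bits set in a word. */
int count_set_bits(unsigned int word)
{
	int count = 0;

	while (word) {
		count += word & 1u;
		word >>= 1;
	}
#if DEBUG_SYSTEM
	/* Assumed meaning of DEBUG_SYSTEM: extra diagnostic output. */
	fprintf(stderr, "count_set_bits: %d\n", count);
#endif
	return count;
}

#if DEBUG_MODULE
/* Built-in test, compiled only when DEBUG_MODULE is 1. */
int main(void)
{
	assert(count_set_bits(0x0fu) == 4);
	assert(count_set_bits(0u) == 0);
	printf("count_set_bits tests passed\n");
	return 0;
}
#endif
```

With a layout like this, a command such as "cc -DDEBUG_MODULE=1 -DDEBUG_SYSTEM=1 module.c" would build the stand-alone test, while the default build leaves out main so the module can be linked into other programs.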

Many of these modules have been built and tested on a variety of Unix computers including Red Hat Linux, IBM AIX, Sun Solaris, and Compaq Tru64. At this time, the only operating system dependent module is get_mach_stat.c.

The node selection logic allocates nodes to jobs in the fashion that makes the most sense for a Quadrics switch interconnect. It allocates the smallest collection of consecutive nodes that satisfies the request (e.g. if sets of 32 consecutive nodes and 16 consecutive nodes are available, a job needing 16 or fewer nodes will be allocated nodes from the 16 node set rather than fragment the 32 node set). If the job cannot be allocated consecutive nodes, it will be allocated the smallest number of consecutive sets (e.g. if the available sets of consecutive nodes have sizes 6, 4, 3, 3, 2, 1, and 1, then a request for 10 nodes will always be allocated the 6 and 4 node sets rather than the smaller sets).
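The selection policy described above can be sketched in a few lines of C. This is a simplified illustration under stated assumptions, not the code in node_scheduler.c: node availability is an array of ints rather than a bit map, the names pick_nodes and MAX_NODES are hypothetical, and when several sets are needed the sketch takes the largest sets first and uses only as many nodes from the last set as the job still needs.

```c
#include <string.h>

#define MAX_NODES 1024	/* hypothetical limit for this sketch */

/*
 * pick_nodes - choose "need" nodes from avail[0..count-1] (1 = idle).
 * On success, picked[i] is set to 1 for each chosen node and the number
 * of consecutive sets used is returned. Returns -1 if the request
 * cannot be satisfied.
 */
int pick_nodes(const int *avail, int count, int need, int *picked)
{
	int start[MAX_NODES], len[MAX_NODES], used[MAX_NODES];
	int runs = 0, total = 0, sets = 0;
	int i, j, best;

	if (count < 1 || count > MAX_NODES || need < 1)
		return -1;
	memset(picked, 0, count * sizeof(int));

	/* Record each set of consecutive idle nodes. */
	for (i = 0; i < count; i++) {
		if (!avail[i])
			continue;
		if (i == 0 || !avail[i - 1]) {
			start[runs] = i;
			len[runs] = 0;
			used[runs] = 0;
			runs++;
		}
		len[runs - 1]++;
		total++;
	}
	if (total < need)
		return -1;

	/* Best fit: the smallest single set that holds the whole job. */
	best = -1;
	for (i = 0; i < runs; i++) {
		if (len[i] >= need && (best < 0 || len[i] < len[best]))
			best = i;
	}
	if (best >= 0) {
		for (j = 0; j < need; j++)
			picked[start[best] + j] = 1;
		return 1;
	}

	/* No single set fits: take the largest unused sets in turn. */
	while (need > 0) {
		best = -1;
		for (i = 0; i < runs; i++) {
			if (!used[i] && (best < 0 || len[i] > len[best]))
				best = i;
		}
		used[best] = 1;
		for (j = 0; j < len[best] && need > 0; j++, need--)
			picked[start[best] + j] = 1;
		sets++;
	}
	return sets;
}
```

Taking the largest sets first is one straightforward way to keep the number of sets small. For the second example in the text (sets of 6, 4, 3, 3, 2, 1, and 1 with a request for 10 nodes) it selects the 6 and 4 node sets.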

Application Program Interface (API)

All functions described below can be issued from any node in the SLURM cluster.
void free_node_info(void);
Free the node information buffer (if allocated)
NOTE: Buffer is loaded by load_node and used by load_node_config.
void free_part_info(void);
Free the partition information buffer (if allocated)
NOTE: Buffer is loaded by load_part and used by load_part_name.
int get_job_info(TBD);
Function to be defined.
int load_node(time_t *last_update_time);
Update the node information buffer for use by info gathering APIs if node records have changed since the time specified.
Input: last_update_time - Pointer to time of last buffer
Output: last_update_time - Time reset if buffer is updated
Returns 0 if no error, EINVAL if the buffer is invalid, ENOMEM if malloc failure
NOTE: Buffer is used by load_node_config and freed by free_node_info.
int load_node_config(char *req_name, char *next_name, int *cpus, int *real_memory, int *tmp_disk, int *weight, char *features, char *partition, char *node_state);
Load the state information about the named node
Input: req_name - Name of the node for which information is requested; if "", get info for the first node in the list
next_name - Location into which the name of the next node is stored, "" if no more
cpus, etc. - Pointers into which the information is to be stored
Output: next_name - Name of the next node in the list
cpus, etc. - The node's state information
Returns 0 on success, ENOENT if not found, or EINVAL if buffer is bad
NOTE: req_name, next_name, partition, and node_state must be declared by the caller and have length MAX_NAME_LEN or larger. features must be declared by the caller and have length FEATURE_SIZE or larger.
NOTE: Buffer is loaded by load_node and freed by free_node_info.
int load_part(time_t *last_update_time);
Update the partition information buffer for use by info gathering APIs if partition records have changed since the time specified.
Input: last_update_time - Pointer to time of last buffer
Output: last_update_time - Time reset if buffer is updated
Returns 0 if no error, EINVAL if the buffer is invalid, ENOMEM if malloc failure
NOTE: Buffer is used by load_part_name and freed by free_part_info.
int load_part_name(char *req_name, char *next_name, int *max_time, int *max_nodes, int *total_nodes, int *total_cpus, int *key, int *state_up, int *shared, int *default, char *nodes, char *allow_groups);
Load the state information about the named partition
Input: req_name - Name of the partition for which information is requested; if "", get info for the first partition in the list
next_name - Location into which the name of the next partition is stored, "" if no more
max_time, etc. - Pointers into which the information is to be stored
Output: req_name - The partition's name is stored here
next_name - The name of the next partition in the list is stored here
max_time, etc. - The partition's state information
Returns 0 on success, ENOENT if not found, or EINVAL if buffer is bad
NOTE: req_name and next_name must be declared by the caller with length MAX_NAME_LEN or larger.
nodes and allow_groups must be declared by the caller with length FEATURE_SIZE or larger.
NOTE: Buffer is loaded by load_part and freed by free_part_info.
int reconfigure(void);
Request that slurmctld re-read the configuration files
Output: Returns 0 on success, errno otherwise
int slurm_allocate(char *spec, char **node_list);
Allocate nodes for a job with the supplied constraints.
Input: spec - Specification of the job's constraints
node_list - Place into which a node list pointer can be placed
Output: node_list - List of allocated nodes
Returns 0 if no error, EINVAL if the request is invalid, EAGAIN if the request cannot be satisfied at present
NOTE: Acceptable specifications include: JobName=, NodeList=, Features=, Groups=, Partition=, Contiguous, TotalCPUs=, TotalNodes=, MinCPUs=, MinMemory=, MinTmpDisk=, Key=, Shared=<0|1>
NOTE: The calling function must free the allocated storage at node_list[0].
void slurm_free_build_info(void);
Free the build information buffer (if allocated).
NOTE: Buffer is loaded by slurm_load_build and used by slurm_load_build_name.
int slurm_get_key(? *key);
Load into the location key the value of an authorization key.
To be defined.
int slurm_kill_job(int job_id);
Terminate the specified SLURM job.
TBD.
int slurm_load_build(void);
Update the build information buffer for use by info gathering APIs
Output: Returns 0 if no error, EINVAL if the buffer is invalid, ENOMEM if malloc failure.
int slurm_load_build_name(char *req_name, char *next_name, char *value);
Load the state information about the named build parameter
Input: req_name - Name of the parameter for which information is requested; if "", get info for the first parameter in the list
next_name - Location into which the name of the next parameter is stored, "" if no more
value - Pointer to location into which the information is to be stored
Output: req_name - The parameter's name is stored here
next_name - The name of the next parameter in the list is stored here
value - The parameter's value is stored here
Returns 0 on success, ENOENT if not found, or EINVAL if buffer is bad
NOTE: req_name, next_name, and value must be declared by the caller with length BUILD_SIZE or larger.
NOTE: Buffer is loaded by slurm_load_build and freed by slurm_free_build_info.
See the SLURM administrator guide for valid build parameter names.
int slurm_run_job(char *job_spec);
Initiate the job with the specification job_spec.
TBD.
int slurm_signal_job(int job_id, int signal);
Send the specified signal to the specified SLURM job.
TBD.
int slurm_transfer_resources(pid_t pid, int job_id);
Transfer the ownership of resources associated with the specified
TBD.
int update(char *spec);
Request that slurmctld update its configuration per request
Input: A line containing configuration information per the configuration file format
Output: Returns 0 on success, errno otherwise
int slurm_will_job_run(char *job_spec);
TBD.
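
As an illustration of the spec argument to slurm_allocate, the sketch below formats a specification string from a few of the keywords listed for that function. The helper build_alloc_spec is hypothetical and not part of the SLURM API, and separating the keywords with spaces is an assumption, since this guide lists the keywords but does not define the separator. The call to slurm_allocate appears only in a comment because it requires the SLURM library.

```c
#include <stdio.h>

/*
 * build_alloc_spec - hypothetical helper that formats a specification
 * string for slurm_allocate(). Keywords are separated by spaces, which
 * is an assumption of this sketch.
 * Returns 0 on success, -1 if buf is too small or formatting fails.
 */
int build_alloc_spec(char *buf, size_t len, const char *job_name,
		     const char *partition, int total_nodes, int min_memory)
{
	int n;

	n = snprintf(buf, len,
		     "JobName=%s Partition=%s TotalNodes=%d MinMemory=%d",
		     job_name, partition, total_nodes, min_memory);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}

/*
 * Intended use with the API (needs <stdlib.h> and the SLURM library):
 *
 *	char spec[256], *node_list = NULL;
 *
 *	if (build_alloc_spec(spec, sizeof(spec), "test", "batch",
 *			     4, 1024) == 0 &&
 *	    slurm_allocate(spec, &node_list) == 0) {
 *		printf("allocated nodes: %s\n", node_list);
 *		free(node_list);	storage at node_list[0] is ours
 *	}
 */
```

The job name "test" and partition name "batch" are made-up sample values. Checking the snprintf return value guards against a silently truncated specification being sent to slurmctld.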

Examples of API Use

Please see the source code of scancel, scontrol, squeue, and srun for examples of all APIs.
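
The req_name and next_name arguments of load_node_config, load_part_name, and slurm_load_build_name form a simple iteration protocol: pass "" to get the first record, then pass each returned next_name until it comes back as "". The sketch below walks a node list this way. So that it is self-contained, it calls fake_load_node_config, a stand-in written for this example with the same argument list as load_node_config. The node names, the fixed field values, and the values chosen for MAX_NAME_LEN and FEATURE_SIZE are all made up for illustration.

```c
#include <errno.h>
#include <string.h>

#define MAX_NAME_LEN 32		/* assumed value for this sketch */
#define FEATURE_SIZE 128	/* assumed value for this sketch */

/* Made-up node table used by the stand-in below. */
struct fake_node {
	const char *name;
	int cpus;
};
static const struct fake_node fake_nodes[] = {
	{ "lx01", 2 }, { "lx02", 2 }, { "lx03", 4 },
};
#define FAKE_COUNT 3

/* Stand-in with the argument list of load_node_config(). */
static int fake_load_node_config(char *req_name, char *next_name,
		int *cpus, int *real_memory, int *tmp_disk, int *weight,
		char *features, char *partition, char *node_state)
{
	int i;

	for (i = 0; i < FAKE_COUNT; i++) {
		if (req_name[0] == '\0' ||
		    strcmp(req_name, fake_nodes[i].name) == 0)
			break;
	}
	if (i == FAKE_COUNT)
		return ENOENT;
	*cpus = fake_nodes[i].cpus;
	*real_memory = 1024;
	*tmp_disk = 0;
	*weight = 1;
	strcpy(features, "");
	strcpy(partition, "debug");
	strcpy(node_state, "IDLE");
	strcpy(next_name, i + 1 < FAKE_COUNT ? fake_nodes[i + 1].name : "");
	return 0;
}

/* Walk the node list and return the total CPU count, or -1 on error. */
int total_cpus(void)
{
	char req_name[MAX_NAME_LEN] = "", next_name[MAX_NAME_LEN];
	char partition[MAX_NAME_LEN], node_state[MAX_NAME_LEN];
	char features[FEATURE_SIZE];
	int cpus, real_memory, tmp_disk, weight, total = 0;

	do {
		if (fake_load_node_config(req_name, next_name, &cpus,
					  &real_memory, &tmp_disk, &weight,
					  features, partition, node_state))
			return -1;
		total += cpus;
		strcpy(req_name, next_name);	/* "" ends the walk */
	} while (req_name[0] != '\0');
	return total;
}
```

In a real program the loop would call load_node_config instead of the stand-in, preceded by a call to load_node to fill the node information buffer and followed by a call to free_node_info to release it.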

To Do


URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html

Last Modified April 15, 2002

Maintained by slurm-dev@lists.llnl.gov