SLURM Programmer's Guide
Overview
Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job
scheduling system for Linux clusters of
thousands of nodes. Components include machine status, partition
management, job management, and scheduling modules. The design also
includes a scalable, general-purpose communication infrastructure.
SLURM requires no kernel modifications and is relatively self-contained.
A description of the components and their interactions is available
in a separate document, SLURM: Simple Linux Utility
for Resource Management.
Code should adhere to the Linux kernel coding style as described in
http://www.linuxhq.com/kernel/v2.4/doc/CodingStyle.html.
API Modules
This directory contains modules supporting the SLURM API functions.
- allocate.c: API to allocate resources for a job's initiation.
- build_info.c: API to report SLURM build parameter values.
- node_info.c: API to report node state and configuration values.
- partition_info.c: API to report partition state and configuration values.
- reconfigure.c: API to request that slurmctld reload configuration information.
- update_config.c: API to update job, node, or partition state information.
Common Modules
This directory contains modules of general use throughout the SLURM code.
The modules are described below.
- bits_bytes.c: A collection of functions for processing bit maps and parsing strings.
- list.c: A general-purpose list manager. One can define a list, add and delete entries, search for entries, etc. (A usage sketch follows this list.)
- list.h: Definitions for list.c and documentation of its functions.
- slurm.h: Definitions for common SLURM data structures and functions.
- slurmlib.h: Definitions for SLURM API data structures and functions.
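As an illustration of the usage pattern list.c supports, a brief sketch follows. The type and function names used here (List, list_create, list_append, list_count, list_destroy) are assumptions made for the example; consult list.h for the actual interface and its documentation.

/* Illustrative sketch only: the type and function names below are assumed,
 * not taken from list.h.  The point is the usage pattern: create a list,
 * add entries, query it, and destroy it when finished. */
#include <stdio.h>
#include "list.h"

int main(void)
{
	List node_list = list_create(NULL);	/* NULL: no per-entry destructor */

	list_append(node_list, "linux001");	/* add a couple of entries */
	list_append(node_list, "linux002");
	printf("list holds %d entries\n", list_count(node_list));

	list_destroy(node_list);		/* free the list itself */
	return 0;
}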
scancel Modules
scancel is a command to cancel running or pending jobs.
- scancel.c: A command line interface to cancel jobs.
scontrol Modules
scontrol is the administrator tool for monitoring and modifying SLURM configuration
and state. It has a command line interface only.
- scontrol.c: A command line interface to slurmctld.
slurmctld Modules
slurmctld executes on the control machine and orchestrates SLURM activities
across the entire cluster including monitoring node and partition state,
scheduling, job queue management, job dispatching, and switch management.
The slurmctld modules and their functionality are described below.
- controller.c: The primary SLURM daemon; it executes on the control machine and manages communications with the Partition Manager, Switch Manager, and Job Manager threads.
- node_mgr.c: Reads, writes, records, updates, and otherwise manages the state information for all nodes (machines) in the cluster managed by SLURM.
- node_scheduler.c: Selects the nodes to be allocated to pending jobs. This module makes extensive use of bit maps in representing the nodes, and it considers the locality of nodes to improve communications performance.
- partition_mgr.c: Reads, writes, records, updates, and otherwise manages the state information associated with partitions in the cluster managed by SLURM.
- read_config.c: Reads the SLURM configuration file and uses it to build node and partition data structures.
slurmd Modules
slurmd executes on each compute node. It initiates and terminates user
jobs and monitors both system and job state. The slurmd modules and their
functionality are described below.
- get_mach_stat.c: Gets the machine's status and configuration, including the size of real memory, the size of temporary disk storage, and the number of processors. (A brief sketch of this sort of probing follows this list.)
- read_proc.c: Collects job state information, including real memory use, virtual memory use, and CPU time use.
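For illustration, the sketch below shows one way two of the values reported by get_mach_stat.c (processor count and real memory size) could be obtained on a typical system. It is not the actual module, which also measures temporary disk space and, as noted under Design Issues, is operating system dependent.

/* Sketch only, in the spirit of get_mach_stat.c: report processor count and
 * real memory size.  _SC_NPROCESSORS_ONLN and _SC_PHYS_PAGES are common
 * extensions (Linux, AIX, Solaris); the real module is OS dependent and
 * also determines temporary disk space. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);	/* processors online   */
	long pages = sysconf(_SC_PHYS_PAGES);		/* real memory (pages) */
	long page_size = sysconf(_SC_PAGESIZE);		/* bytes per page      */
	long long real_memory = (long long) pages * page_size / (1024 * 1024);

	printf("Procs=%ld RealMemory=%lld MB\n", cpus, real_memory);
	return 0;
}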
Design Issues
Most modules are constructed with some simple, built-in tests.
Set the declarations of DEBUG_MODULE and DEBUG_SYSTEM both to 1 near
the top of the module's code, then compile and run the test.
The required input scripts and configuration files for these tests
are kept in the "etc" subdirectory, and the commands to execute
the tests are in the "Makefile". In some cases, the module must
be built together with some other components. In those cases, the support
modules should be built with DEBUG_MODULE set to 0 and DEBUG_SYSTEM
set to 1.
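The convention might look roughly like the sketch below within a module (the function shown is hypothetical): DEBUG_MODULE compiles a stand-alone test driver into the module, while support modules are compiled with that driver disabled.

/* Sketch of the built-in test convention (the function below is hypothetical).
 * Setting DEBUG_MODULE to 1 compiles a stand-alone test driver into the
 * module; support modules are built with DEBUG_MODULE 0 and DEBUG_SYSTEM 1. */
#define DEBUG_MODULE 1
#define DEBUG_SYSTEM 1

#include <stdio.h>

int demo_module_function(int value)
{
	return value * 2;	/* stand-in for the module's real work */
}

#if DEBUG_MODULE
/* Built-in test, executed via the commands in the Makefile; any input
 * scripts or configuration files it needs come from the etc subdirectory. */
int main(void)
{
	if (demo_module_function(21) != 42) {
		fprintf(stderr, "test failed\n");
		return 1;
	}
	printf("test passed\n");
	return 0;
}
#endif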
Many of these modules have been built and tested on a variety of
Unix computers including Red Hat Linux, IBM's AIX, Sun's Solaris,
and Compaq's Tru64. The only module which is operating system
dependent at this time is get_mach_stat.c.
The node selection logic allocates nodes to jobs in a fashion that
makes the most sense for a Quadrics switch interconnect. It allocates
the smallest collection of consecutive nodes that satisfies the
request (e.g. if there are 32 consecutive nodes and 16 consecutive
nodes available, a job needing 16 or fewer nodes will be allocated
nodes from the 16 node set rather than fragmenting the 32 node
set). If the job cannot be allocated consecutive nodes, it will
be allocated the smallest number of consecutive sets (e.g. if the
available sets of consecutive nodes have sizes 6, 4, 3, 3, 2, 1,
and 1, then a request for 10 nodes will always be allocated the 6
and 4 node sets rather than using the smaller sets).
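The following is a simplified sketch of this selection policy. The real node_scheduler.c works with bit maps of nodes; this illustration operates directly on the sizes of the available sets of consecutive nodes.

/* Sketch of the consecutive-node selection policy described above. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
	return (*(const int *) b - *(const int *) a);
}

static void pick_sets(int *set_size, int set_cnt, int nodes_wanted)
{
	int i, best = -1, remaining = nodes_wanted;

	for (i = 0; i < set_cnt; i++) {
		if ((set_size[i] >= nodes_wanted) &&
		    ((best == -1) || (set_size[i] < set_size[best])))
			best = i;	/* smallest single set that still fits */
	}
	if (best != -1) {
		printf("use %d nodes from the %d-node set\n",
		       nodes_wanted, set_size[best]);
		return;
	}

	/* No single set is large enough: take the largest sets first so the
	 * fewest sets are consumed. */
	qsort(set_size, set_cnt, sizeof(int), cmp_desc);
	for (i = 0; (i < set_cnt) && (remaining > 0); i++) {
		int take = (set_size[i] < remaining) ? set_size[i] : remaining;
		printf("use %d nodes from the %d-node set\n", take, set_size[i]);
		remaining -= take;
	}
}

int main(void)
{
	int sets[] = { 6, 4, 3, 3, 2, 1, 1 };

	pick_sets(sets, sizeof(sets) / sizeof(sets[0]), 10);	/* selects the 6 and 4 node sets */
	return 0;
}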
Application Program Interface (API)
All functions described below can be issued from any node in the SLURM cluster.
- void free_node_info(void);
- Free the node information buffer (if allocated)
- NOTE: Buffer is loaded by load_node and used by load_node_config.
- void free_part_info(void);
- Free the partition information buffer (if allocated)
- NOTE: Buffer is loaded by load_part and used by load_part_name.
- int get_job_info(TBD);
- Function to be defined.
- int load_node(time_t *last_update_time);
- Load the node information buffer for use by info gathering APIs if
node records have changed since the time specified.
- Input: last_update_time - Pointer to time of last buffer update
- Output: last_update_time - Time reset if buffer is updated
- Returns 0 if no error, EINVAL if the buffer is invalid, ENOMEM if malloc failure
- NOTE: Buffer is used by load_node_config and freed by free_node_info.
- int load_node_config(char *req_name, char *next_name, int *cpus,
int *real_memory, int *tmp_disk, int *weight, char *features,
char *partition, char *node_state);
- Load the state information about the named node
- Input: req_name - Name of the node for which information is requested
if "", then get info for the first node in list
- next_name - Location into which the name of the next node is
stored, "" if no more
- cpus, etc. - Pointers into which the information is to be stored
- Output: next_name - Name of the next node in the list
- cpus, etc. - The node's state information
- Returns 0 on success, ENOENT if not found, or EINVAL if buffer is bad
- NOTE: req_name, next_name, partition, and node_state must be declared by the
caller and have length MAX_NAME_LEN or larger.
The features argument must be declared by the caller and have length FEATURE_SIZE or larger.
- NOTE: Buffer is loaded by load_node and freed by free_node_info.
- int load_part(time_t *last_update_time);
- Update the partition information buffer for use by info gathering APIs if
partition records have changed since the time specified.
- Input: last_update_time - Pointer to time of last buffer
- Output: last_update_time - Time reset if buffer is updated
- Returns 0 if no error, EINVAL if the buffer is invalid, ENOMEM if malloc failure
- NOTE: Buffer is used by load_part_name and freed by free_part_info.
- int load_part_name(char *req_name, char *next_name, int *max_time, int *max_nodes,
int *total_nodes, int *total_cpus, int *key, int *state_up, int *shared, int *default,
char *nodes, char *allow_groups);
- Load the state information about the named partition
- Input: req_name - Name of the partition for which information is requested
if "", then get info for the first partition in list
- next_name - Location into which the name of the next partition is
stored, "" if no more
- max_time, etc. - Pointers into which the information is to be stored
- Output: req_name - The partition's name is stored here
- next_name - The name of the next partition in the list is stored here
- max_time, etc. - The partition's state information
- Returns 0 on success, ENOENT if not found, or EINVAL if buffer is bad
- NOTE: req_name and next_name must be declared by the caller and have length MAX_NAME_LEN or larger.
- The nodes and allow_groups arguments must be declared by the caller with length FEATURE_SIZE or larger.
- NOTE: Buffer is loaded by load_part and freed by free_part_info.
- int reconfigure(void);
- Request that slurmctld re-read the configuration files
- Output: Returns 0 on success, errno otherwise
- int slurm_allocate(char *spec, char **node_list);
- Allocate nodes for a job with the supplied constraints.
- Input: spec - Specification of the job's constraints
- node_list - Place into which a node list pointer can be placed
- Output: node_list - List of allocated nodes
- Returns 0 if no error, EINVAL if the request is invalid,
EAGAIN if the request cannot be satisfied at present
- NOTE: Acceptable specifications include: JobName=, NodeList=, Features=,
Groups=, Partition=, Contiguous, TotalCPUs=, TotalNodes=, MinCPUs=,
MinMemory=, MinTmpDisk=, Key=, Shared=<0|1>
- NOTE: The calling function must free the allocated storage at node_list[0]
- void slurm_free_build_info(void);
- Free the build information buffer (if allocated).
- NOTE: Buffer is loaded by slurm_load_build and used by slurm_load_build_name.
- int slurm_get_key(? *key);
- Load into the location key the value of an authorization key.
- To be defined.
- int slurm_kill_job(int job_id);
- Terminate the specified SLURM job.
- TBD.
- int slurm_load_build(void);
- Update the build information buffer for use by info gathering APIs
- Output: Returns 0 if no error, EINVAL if the buffer is invalid, ENOMEM if malloc failure.
- int slurm_load_build_name(char *req_name, char *next_name, char *value);
- Load the state information about the named build parameter
- Input: req_name - Name of the parameter for which information is requested
if "", then get info for the first parameter in list
- next_name - Location into which the name of the next parameter is
stored, "" if no more
- value - Pointer to location into which the information is to be stored
- Output: req_name - The parameter's name is stored here
- next_name - The name of the next parameter in the list is stored here
- value - The parameter's value is stored here
- Returns 0 on success, ENOENT if not found, or EINVAL if buffer is bad
- NOTE: req_name, next_name, and value must be declared by the caller and have
length BUILD_SIZE or larger
- NOTE: Buffer is loaded by slurm_load_build and freed by slurm_free_build_info.
- See the SLURM administrator guide
for valid build parameter names.
- int slurm_run_job(char *job_spec);
- Initiate the job with the specification job_spec.
- TBD.
- int slurm_signal_job(int job_id, int signal);
- Send the specified signal to the specified SLURM job.
- TBD.
- int slurm_transfer_resources(pid_t pid, int job_id);
- Transfer the ownership of resources associated with the specified
- TBD.
- int update(char *spec);
- Request that slurmctld update its configuration per request
- Input: A line containing configuration information per the configuration file format
- Output: Returns 0 on success, errno otherwise
- int slurm_will_job_run(char *job_spec);
- TBD.
Examples of API Use
Please see the source code of scancel, scontrol, squeue, and srun for examples
of all APIs.
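The fragment below is a minimal sketch of the load, iterate, and free pattern these APIs follow, using the partition calls documented above. It assumes MAX_NAME_LEN and FEATURE_SIZE are available from slurmlib.h and abbreviates error handling.

/* Sketch: print every partition using the load/iterate/free pattern of the
 * partition information APIs documented above.  Error handling abbreviated. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "slurmlib.h"

int main(void)
{
	time_t last_update = 0;
	char part_name[MAX_NAME_LEN], next_name[MAX_NAME_LEN];
	char nodes[FEATURE_SIZE], allow_groups[FEATURE_SIZE];
	int max_time, max_nodes, total_nodes, total_cpus;
	int key, state_up, shared, default_flag;

	if (load_part(&last_update))		/* fill the partition buffer */
		return 1;

	strcpy(part_name, "");			/* "" requests the first partition */
	do {
		if (load_part_name(part_name, next_name, &max_time, &max_nodes,
				   &total_nodes, &total_cpus, &key, &state_up,
				   &shared, &default_flag, nodes, allow_groups))
			break;
		printf("%s: MaxTime=%d TotalNodes=%d TotalCPUs=%d\n",
		       part_name, max_time, total_nodes, total_cpus);
		strcpy(part_name, next_name);	/* "" means no more partitions */
	} while (part_name[0] != '\0');

	free_part_info();			/* release the buffer */
	return 0;
}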
To Do
- How do we interface with TotalView?
- Deadlines: MCR to be built in July 2002, accepted August 2002.
URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html
Last Modified April 15, 2002
Maintained by slurm-dev@lists.llnl.gov