SLURM Programmer's Guide

Overview

Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure. SLURM requires no kernel modifications and is relatively self-contained.

Components

config.h is used only to define "out_of_memory" for list.c.

errors.h is required by config.h.

Get_Mach_Stat.c gets the machine's status and configuration. This includes: operating system version, size of real memory, size of virtual memory, size of /tmp disk storage, number of processors, and speed of processors.
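
As an illustration of the kind of data gathered, here is a minimal sketch of how some of these values can be read on a Unix-like system. It is not the contents of Get_Mach_Stat.c: the structure mach_stat, its field names, and the function get_mach_stat are hypothetical names chosen for this example. It assumes uname(), statvfs(), and the widely available (but not strictly POSIX) sysconf() names _SC_PHYS_PAGES and _SC_NPROCESSORS_ONLN. Virtual memory size and processor speed are left out because reading them is operating system specific.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/statvfs.h>
#include <sys/utsname.h>

/* Hypothetical record; the field names are illustrative only. */
struct mach_stat {
    char os_release[256];  /* operating system release string */
    long real_memory_mb;   /* physical memory in MB, -1 if unknown */
    long tmp_disk_mb;      /* total size of the /tmp file system in MB, -1 if unknown */
    long cpus;             /* processors currently online, -1 if unknown */
};

/* Fill *ms from the running system.
 * Returns 0 on success, -1 if uname() fails. */
int get_mach_stat(struct mach_stat *ms)
{
    struct utsname un;
    struct statvfs fs;

    assert(ms != NULL);

    if (uname(&un) != 0)
        return -1;
    snprintf(ms->os_release, sizeof(ms->os_release), "%s", un.release);

    /* _SC_PHYS_PAGES is an extension found on Linux and some other systems. */
    ms->real_memory_mb = -1;
#ifdef _SC_PHYS_PAGES
    {
        long pages = sysconf(_SC_PHYS_PAGES);
        long page_size = sysconf(_SC_PAGESIZE);
        if (pages > 0 && page_size > 0)
            ms->real_memory_mb =
                (long)(((long long)pages * page_size) / (1024LL * 1024LL));
    }
#endif

    /* f_blocks counts units of f_frsize bytes. */
    ms->tmp_disk_mb = -1;
    if (statvfs("/tmp", &fs) == 0)
        ms->tmp_disk_mb = (long)(((unsigned long long)fs.f_blocks * fs.f_frsize)
                                 / (1024ULL * 1024ULL));

    /* sysconf() itself returns -1 when the value cannot be determined. */
    ms->cpus = -1;
#ifdef _SC_NPROCESSORS_ONLN
    ms->cpus = sysconf(_SC_NPROCESSORS_ONLN);
#endif

    return 0;
}
```

A caller would declare a struct mach_stat, pass its address to get_mach_stat(), and treat any field equal to -1 as unavailable on that platform.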

list.c is a general purpose list manager. One can define a list, add and delete entries, search for entries, etc.

list.h contains definitions for list.c and documentation for its functions.
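
The operations named above (define a list, add and delete entries, search for entries) can be sketched as follows. This is a minimal singly linked list written for illustration only. It does not reproduce the interface documented in list.h, and every name beginning with demo_ is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_node {
    void *data;
    struct demo_node *next;
};

struct demo_list {
    struct demo_node *head;
    struct demo_node **tail;  /* address of the final NULL link */
    int count;
};

/* Match callback: returns non-zero when item matches key. */
typedef int (*demo_match_fn)(void *item, void *key);

/* Create an empty list; returns NULL if memory is exhausted. */
struct demo_list *demo_list_create(void)
{
    struct demo_list *l = malloc(sizeof(*l));
    if (l == NULL)
        return NULL;
    l->head = NULL;
    l->tail = &l->head;
    l->count = 0;
    return l;
}

/* Append data at the end; returns 0 on success, -1 if memory is exhausted. */
int demo_list_append(struct demo_list *l, void *data)
{
    struct demo_node *n;

    assert(l != NULL);
    n = malloc(sizeof(*n));
    if (n == NULL)
        return -1;
    n->data = data;
    n->next = NULL;
    *l->tail = n;
    l->tail = &n->next;
    l->count++;
    return 0;
}

/* Return the first item for which match() is non-zero, or NULL. */
void *demo_list_find(struct demo_list *l, demo_match_fn match, void *key)
{
    struct demo_node *n;

    for (n = l->head; n != NULL; n = n->next)
        if (match(n->data, key))
            return n->data;
    return NULL;
}

/* Remove every matching entry; returns the number removed.
 * The items themselves are not freed, only the list nodes. */
int demo_list_delete(struct demo_list *l, demo_match_fn match, void *key)
{
    struct demo_node **pp = &l->head;
    int removed = 0;

    while (*pp != NULL) {
        struct demo_node *n = *pp;
        if (match(n->data, key)) {
            *pp = n->next;
            free(n);
            l->count--;
            removed++;
        } else {
            pp = &n->next;
        }
    }
    l->tail = pp;  /* pp now addresses the final NULL link */
    return removed;
}

/* Free the list and its nodes (not the items). */
void demo_list_destroy(struct demo_list *l)
{
    struct demo_node *n = l->head;

    while (n != NULL) {
        struct demo_node *next = n->next;
        free(n);
        n = next;
    }
    free(l);
}

/* Example match callback comparing C strings. */
int demo_match_string(void *item, void *key)
{
    return strcmp((const char *)item, (const char *)key) == 0;
}
```

Storing items as void pointers and searching through a caller-supplied match function keeps such a list general purpose: the same code can hold node records, partition records, or plain strings.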

Mach_Stat_Mgr.c reads, writes, records, updates, and otherwise manages the state information for all nodes (machines) in the cluster managed by SLURM.

Partition_Mgr.c reads, writes, records, updates, and otherwise manages the state information associated with partitions in the cluster managed by SLURM.

Design Issues

Most modules are constructed with some simple, built-in tests. To enable them, set the declarations for both DEBUG_MODULE and DEBUG_SYSTEM to 1 near the top of the module's code, then compile and run the test. Input scripts and configuration files required by these tests will be kept in the "etc" subdirectory, and the commands to execute the tests are in the "Makefile". In some cases, a module must be linked with other components. In those cases, the support modules should be built with DEBUG_MODULE set to 0 and DEBUG_SYSTEM set to 1.
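
The built-in test convention can be pictured with a small sketch. The function parse_cpu_count and its "CPUs=" input format are hypothetical and exist only to give the test something to exercise. The sketch also assumes that DEBUG_SYSTEM controls diagnostic output, which this guide does not state, and it wraps the declarations in #ifndef so they can be overridden from the compiler command line as well as edited in place.

```c
#include <assert.h>
#include <stdio.h>

#ifndef DEBUG_MODULE
#define DEBUG_MODULE 0  /* 1 = compile this module's built-in test driver */
#endif
#ifndef DEBUG_SYSTEM
#define DEBUG_SYSTEM 0  /* assumption for this sketch: 1 = print diagnostics */
#endif

/* Hypothetical module function: parse "CPUs=<n>" and return n,
 * or -1 if the line is malformed or n is less than 1. */
int parse_cpu_count(const char *line)
{
    int cpus;

    if (sscanf(line, "CPUs=%d", &cpus) != 1 || cpus < 1) {
#if DEBUG_SYSTEM
        fprintf(stderr, "parse_cpu_count: bad input \"%s\"\n", line);
#endif
        return -1;
    }
    return cpus;
}

#if DEBUG_MODULE
/* Built-in test driver, compiled only when DEBUG_MODULE is 1.
 * A support module built with DEBUG_MODULE set to 0 contributes
 * no main() and can be linked into another module's test. */
int main(void)
{
    assert(parse_cpu_count("CPUs=4") == 4);
    assert(parse_cpu_count("CPUs=zero") == -1);
    assert(parse_cpu_count("CPUs=0") == -1);
    printf("parse_cpu_count tests passed\n");
    return 0;
}
#endif
```

With this layout, only the module under test defines main(), so several modules can be combined into one test program without duplicate entry points.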

Many of these modules have been built and tested on a variety of Unix computers, including Red Hat Linux, IBM's AIX, Sun's Solaris, and Compaq's Tru64. At this time, the only module that is operating system dependent is Get_Mach_Stat.c.

We have tried to develop the SLURM code to be quite general and flexible, but compromises were made in several areas for the sake of simplicity and ease of support. Entire nodes are dedicated to user applications. Our customers at LLNL have expressed the opinion that sharing of nodes can severely reduce their jobs' performance and even reliability. This is due to contention for shared resources such as local disk space, real memory, virtual memory, and processor cycles. Proper support of shared resources, including the enforcement of limits on these resources, entails a substantial amount of additional effort. Given this balance of cost and benefit at LLNL, we have decided not to support shared nodes. However, we have designed SLURM so as not to preclude the addition of such a capability at a later time, if so desired.

To Do


URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html

Last Modified December 21, 2001

Maintained by Moe Jette jette1@llnl.gov