Blue Gene User and Administrator Guide

Overview

This document describes the unique features of SLURM on IBM Blue Gene systems. You should be familiar with SLURM's mode of operation on Linux clusters before studying the relatively few differences in Blue Gene operation described here.

Blue Gene systems have several unique features that result in a few differences in how SLURM operates on them. The basic unit of resource allocation is a base partition. The base partitions are connected in a three-dimensional torus. Each base partition includes 512 c-nodes, each containing two processors: one intended primarily for computation and the other primarily for managing communications. SLURM considers each base partition to be one node with 1024 processors. A c-node can execute only one process and therefore cannot run both the user's job and SLURM's slurmd daemon, so the slurmd daemon executes on one of the Blue Gene front end nodes. This single slurmd daemon provides (almost) all of the normal SLURM services for every base partition on the system.

User Tools

The normal set of SLURM user tools (srun, scancel, sinfo, squeue and scontrol) provides all of the expected services except support for job steps. SLURM performs resource allocation for the job, but initiation of tasks is performed using the mpirun command; SLURM has no concept of a job step on Blue Gene. Four new srun options are available: --geometry (specify the job size in each dimension), --no-rotate (disable rotation of the geometry), --conn-type (specify the interconnect type between base partitions, mesh or torus), and --node-use (specify how the second processor on each c-node is to be used, coprocessor or virtual). You can also continue to use the --nodes option with a minimum and (optionally) maximum node count, and the --ntasks option continues to be supported. See the srun man page for details.
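
For example, a job allocation request combining these options might look like the following sketch. The option value formats (such as expressing the geometry as dimensions separated by "x") and the script name are illustrative only and should be checked against the srun man page.

   # Request 8 base partitions connected as a 2x2x2 torus, with the second
   # c-node processor left to manage communications (coprocessor mode).
   # "myscript" is a hypothetical user script that invokes mpirun to launch tasks.
   srun --nodes=8 --geometry=2x2x2 --conn-type=torus --node-use=coprocessor myscript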

The naming of nodes includes a three-digit suffix representing the base partition's location in the X, Y and Z dimensions with a zero origin. For example, "bgl012" represents the base partition whose location is at X=0, Y=1 and Z=2. Since jobs must be allocated consecutive nodes in all three dimensions, we have developed an abbreviated format for describing the nodes in one of these three-dimensional blocks. The node prefix is followed by the end-points of the block enclosed in square brackets. For example, "bgl[620x731]" represents the eight nodes enclosed in the block with endpoints bgl620 and bgl731 (bgl620, bgl621, bgl630, bgl631, bgl720, bgl721, bgl730 and bgl731).

One new tool provided is smap. Smap is aware of the system topology and provides a map of which nodes are allocated to jobs, partitions, etc. See the smap man page for details. A sample of smap output is provided below, showing the location of five jobs. Note the format of the list of nodes allocated to each job. Also note that idle (unassigned) base partitions are indicated by a period. Down and drained base partitions (those not available for use) are indicated by a number sign (bgl703 in the display below). The legend is for illustrative purposes only. The origin (zero in every dimension) is shown at the rear left corner of the bottom plane. Each set of four consecutive lines represents a plane in the Y dimension. Values in the X dimension increase to the right. Values in the Z dimension increase down and toward the left.

   a a a a b b d d       ID JOBID PARTITION USER   NAME ST TIME NODES NODELIST
  a a a a b b d d        a  12345 batch     joseph tst1 R  43:12   64 bgl[000x333]
 a a a a b b c c         b  12346 debug     chris  sim3 R  12:34   16 bgl[420x533]
a a a a b b c c          c  12350 debug     danny  job3 R   0:12    8 bgl[622x733]
                         d  12356 debug     dan    colu R  18:05   16 bgl[600x731]
   a a a a b b d d       e  12378 debug     joseph asx4 R   0:34    4 bgl[612x713]
  a a a a b b d d
 a a a a b b c c
a a a a b b c c

   a a a a . . d d
  a a a a . . d d
 a a a a . . e e              Y
a a a a . . e e               |
                              |
   a a a a . . d d            0----X
  a a a a . . d d            /
 a a a a . . . .            /
a a a a . . . #            Z

System Administration

Building a Blue Gene compatible system depends upon the configure program locating some expected files. You should see "#define HAVE_BGL 1" and "#define HAVE_FRONT_END 1" in the "config.h" file before making SLURM.
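
One simple way to verify this (a minimal check, run from the directory in which configure wrote config.h):

   # Both defines should be present before running make
   grep -E "HAVE_BGL|HAVE_FRONT_END" config.h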

The slurmctld daemon should execute on the system's service node, with an optional backup daemon on one of the front end nodes. One slurmd daemon should be configured to execute on one of the front end nodes. That one slurmd daemon represents the communications channel for every base partition. You can use the scontrol command to drain individual nodes as desired and return them to service.
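
For example, a base partition might be drained and later returned to service with commands along the following lines. This is a sketch only; the exact node state names accepted by scontrol can vary between SLURM versions, so consult the scontrol man page.

   # Remove bgl012 from service, recording the reason
   scontrol update NodeName=bgl012 State=DRAIN Reason="hardware maintenance"
   # Return it to service when the work is complete
   scontrol update NodeName=bgl012 State=RESUME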

The slurm.conf (configuration) file needs to have the value of InactiveLimit set to zero or not specified (it defaults to zero). This is because there are no job steps and we don't want to purge jobs prematurely. The value of SelectType must be set to "select/bluegene" so that node selection is performed by a plugin aware of the system's topology and interfaces. The value of SchedulerType should be set to "sched/builtin". Since jobs with different geometries or other characteristics do not interfere with each other's scheduling, backfill scheduling is not presently meaningful. SLURM's builtin scheduler on Blue Gene will sort pending jobs and then attempt to schedule all of them in priority order.
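
The corresponding slurm.conf entries would therefore resemble the following (other configuration parameters are omitted):

   # Blue Gene specific selection and scheduling settings
   InactiveLimit=0
   SelectType=select/bluegene
   SchedulerType=sched/builtin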

SLURM node and partition descriptions should make use of the naming conventions described above. For example, "NodeName=bgl[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024". Note that the value of both NodeAddr and NodeHostname for all 128 base partitions is the name of the front end node executing the slurmd daemon. The NodeName values represent base partitions. No computer is actually expected to return a value of "bgl000" in response to the hostname command, nor will any attempt be made to route message traffic to this address.
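
Putting this together, the node definition and an associated SLURM partition definition might look something like the lines below. The PartitionName line is a hypothetical illustration; its parameters should be taken from the slurm.conf man page.

   NodeName=bgl[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024
   # Hypothetical SLURM partition spanning all 128 base partitions
   PartitionName=batch Nodes=bgl[000x733] Default=YES State=UP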

While users are unable to initiate SLURM job steps on Blue Gene systems, this restriction does not apply to user root or SlurmUser. Be advised that the one slurmd supporting all nodes is unable to manage a large number of job steps, so this ability should be used only to verify normal SLURM operation. If large numbers of job steps are initiated by slurmd, expect the daemon to fail due to lack of memory.

Presently the system administrator must explicitly define each of the Blue Gene job partitions available to execute jobs. (NOTE: Blue Gene job partitions are unrelated to SLURM partitions.) Jobs must then execute in one of these pre-defined Blue Gene job partitions. This is known as static partitioning. Each of these Blue Gene job partitions is explicitly configured with either a mesh or torus interconnect and either coprocessor or virtual c-node usage. In addition to the normal slurm.conf file, a new bluegene.conf configuration file is required with this information. Put bluegene.conf into the SLURM configuration directory with slurm.conf. System administrators should use the smap tool to build appropriate configuration files for static partitioning. See the smap man page for more information.

Two other changes are required to support SLURM interactions with the DB2 database. The db2profile script must be executed prior to the execution of the slurmctld daemon. This may be accomplished by copying the appropriate file into /etc/sysconfig/slurm, which will be executed by /etc/init.d/slurm. The second required file is db.properties, which should be copied into the SLURM configuration directory with slurm.conf.
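
For example (a sketch only; the source locations of db2profile and db.properties are installation specific, and /etc/slurm is assumed here as the SLURM configuration directory):

   # Have the SLURM init script execute db2profile before starting slurmctld
   cp /path/to/db2profile /etc/sysconfig/slurm
   # Place db.properties alongside slurm.conf
   cp /path/to/db.properties /etc/slurm/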

At some time in the future, we expect SLURM to support dynamic partitioning, in which Blue Gene job partitions are created and destroyed as needed to accommodate the workload. At that time the bluegene.conf configuration file will become obsolete. Dynamic partitioning does involve substantial overhead, including the rebooting of c-nodes and I/O nodes.


For information about this page, contact slurm-dev@lists.llnl.gov.