Simple Linux Utility for Resource Management

Blue Gene User and Administrator Guide

Overview

This document describes the unique features of SLURM on IBM Blue Gene systems. You should be familiar with SLURM's mode of operation on Linux clusters before studying the relatively few differences in Blue Gene operation described in this document.

Blue Gene systems have several unique features that result in a few differences in how SLURM operates there. The basic unit of resource allocation is a base partition or midplane. The base partitions are connected in a three-dimensional torus. Each base partition includes 512 c-nodes, each containing two processors: one designed primarily for computation and the other primarily for managing communications. SLURM considers each base partition to be one node with 1024 processors. The c-nodes can each execute only one process, so they cannot run both the user's job and SLURM's slurmd daemon. The slurmd daemon therefore executes on one of the Blue Gene Front End Nodes and provides (almost) all of the normal SLURM services for every base partition on the system.

User Tools

The normal set of SLURM user tools (srun, scancel, sinfo, squeue and scontrol) provides all of the expected services except support for job steps. SLURM performs resource allocation for the job, but initiation of tasks is performed using the mpirun command; SLURM has no concept of a job step on Blue Gene. Four new srun options are available: --geometry (specify job size in each dimension), --no-rotate (disable rotation of geometry), --conn-type (specify the interconnect type between base partitions, mesh or torus), and --node-use (specify how the second processor on each c-node is to be used, coprocessor or virtual). You can also continue to use the --nodes option with a minimum and (optionally) maximum node count. The --ntasks option continues to be supported. See the srun man page for details.
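For example, a request for eight base partitions arranged as a 2x2x2 torus, with the second processor on each c-node used as a coprocessor, might look like the following (a sketch; the script name and the requested sizes are hypothetical, and the job is submitted as a batch script as discussed below):

# Hypothetical request: 8 base partitions, 2x2x2 geometry, torus interconnect,
# coprocessor mode, geometry rotation disabled
srun --batch --nodes=8 --geometry=2x2x2 --conn-type=torus \
     --node-use=coprocessor --no-rotate myscript.sh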

To reiterate: srun is used to submit a job script, but mpirun is used to launch the parallel tasks. It is highly recommended that the srun --batch option be used to submit a script. Note that a SLURM batch job's default stdout and stderr file names are generated using the SLURM job ID. Because SLURM job ID values can be repeated when the SLURM control daemon is restarted, it is recommended that batch jobs explicitly specify unique names for their stdout and stderr files using the srun options --output and --error respectively. While the srun --allocate option may be used to create an interactive SLURM job, it is then the user's responsibility to ensure that the bglblock is ready for use before initiating any mpirun commands; SLURM assumes this responsibility for batch jobs. The script that you submit to SLURM can contain multiple invocations of mpirun as well as any desired commands for pre- and post-processing. The mpirun command will get its bglblock (BGL partition) information from the MPIRUN_PARTITION environment variable, which is set by SLURM. A sample script is shown below.

#!/bin/bash
# pre-processing
date
# processing
mpirun -exec /home/user/prog -cwd /home/user -args 123
mpirun -exec /home/user/prog -cwd /home/user -args 124
# post-processing
date 
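Such a script might be submitted with explicitly named stdout and stderr files along these lines (a sketch; the file and script names are assumptions):

srun --batch --output=tst1.out --error=tst1.err myscript.sh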

The naming of nodes includes a three-digit suffix representing the base partition's location in the X, Y and Z dimensions with a zero origin. For example, "bgl012" represents the base partition whose location is at X=0, Y=1 and Z=2. Since jobs must be allocated consecutive nodes in all three dimensions, we have developed an abbreviated format for describing the nodes in one of these three-dimensional blocks. The node prefix "bgl" is followed by the end-points of the block enclosed in square brackets. For example, "bgl[620x731]" is used to represent the eight nodes enclosed in a block with endpoints bgl620 and bgl731 (bgl620, bgl621, bgl630, bgl631, bgl720, bgl721, bgl730 and bgl731).
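As an illustration of this convention, the following loop prints the eight base partition names contained in bgl[620x731] (a sketch for illustration only):

#!/bin/bash
# Enumerate the base partitions in the block bgl[620x731]:
# X spans 6-7, Y spans 2-3, Z spans 0-1
for x in 6 7; do
  for y in 2 3; do
    for z in 0 1; do
      echo "bgl${x}${y}${z}"
    done
  done
done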

One new tool provided is smap. Smap is aware of the system topology and provides a map of which nodes are allocated to jobs, partitions, etc. See the smap man page for details. A sample of smap output is provided below showing the location of five jobs. Note the format of the list of nodes allocated to each job. Also note that idle (unassigned) base partitions are indicated by a period. Down and drained base partitions (those not available for use) are indicated by a number sign (bgl703 in the display below). The legend is for illustrative purposes only. The origin (zero in every dimension) is shown at the rear left corner of the bottom plane. Each set of four consecutive lines represents a plane in the Y dimension. Values in the X dimension increase to the right. Values in the Z dimension increase down and toward the left.

   a a a a b b d d       ID JOBID PARTITION USER   NAME ST TIME NODES NODELIST
  a a a a b b d d        a  12345 batch     joseph tst1 R  43:12   64 bgl[000x333]
 a a a a b b c c         b  12346 debug     chris  sim3 R  12:34   16 bgl[420x533]
a a a a b b c c          c  12350 debug     danny  job3 R   0:12    8 bgl[622x733]
                         d  12356 debug     dan    colu R  18:05   16 bgl[600x731]
   a a a a b b d d       e  12378 debug     joseph asx4 R   0:34    4 bgl[612x713]
  a a a a b b d d
 a a a a b b c c
a a a a b b c c

   a a a a . . d d
  a a a a . . d d
 a a a a . . e e              Y
a a a a . . e e               |
                              |
   a a a a . . d d            0----X
  a a a a . . d d            /
 a a a a . . . .            /
a a a a . . . #            Z

Note that jobs enter the SLURM state RUNNING as soon as they have been allocated a bglblock. If the bglblock is in a READY state, the job will begin execution almost immediately. Otherwise the execution of the job will not actually begin until the bglblock is in a READY state, which can require booting the block and a delay of several minutes. You can identify the bglblock associated with your job using the command smap -Dj -c and the state of the bglblock with the command smap -Db -c. The time to boot a bglblock is related to its size, but can range from a few minutes to about 15 minutes for a bglblock containing 64 base partitions. Only after the bglblock is READY will your job's output file be created and the script execution begin. If the bglblock boot fails, SLURM will attempt to reboot several times before draining the associated nodes and aborting the job.
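For example (these smap options are described above; see the smap man page for the output format):

# Identify the bglblock allocated to each job
smap -Dj -c
# Check the state of each bglblock (e.g. READY)
smap -Db -c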

The job will continue to be in a RUNNING state until the bgljob has completed and the bglblock ownership is changed. The time to complete a bgljob has frequently been on the order of five minutes. In summary, your job may appear in SLURM as RUNNING from as much as 15 minutes before the script actually begins until about 5 minutes after it completes. These delays are the result of BGL infrastructure issues and are not due to anything in SLURM.

When using smap in curses mode you can scroll through the different windows using the arrow keys. The up and down arrow keys scroll the window containing the grid, and the left and right arrow keys scroll the window containing the text information.

System Administration

If you are running on a system with multiple base partitions in the X dimension, the external wiring may differ from what is programmed into SLURM. If you have more than two nodes in the X dimension, please have a wiring diagram of your system available and query slurm-dev@lists.llnl.gov for advice.

Building a Blue Gene compatible system depends upon the configure program locating some expected files. In particular, the configure script searches for libdb2.so in the directories /home/bgdb2cli/sqllib and /u/bgdb2cli/sqllib. If your DB2 library file is in a different location, use the configure option --with-db2-dir=PATH to specify the parent directory. If you have the same version of the operating system on both the Service Node (SN) and the Front End Nodes (FEN), then you can configure and build one set of files on the SN and install them on both the SN and FEN. Note that if your FENs lack an installed libdb2.so, an smap built on the SN will be unable to execute at all on those nodes (it calls BGL Bridge APIs, which dynamically load libdb2.so, entirely outside of SLURM's control). You can handle this in two different ways. One option is to build two versions of smap (in the main SLURM RPM), one for the SN and the other for the FENs. The second option is to create a dummy libdb2.so on the FENs (it can just point to libslurm.so) so that smap can be initiated. Smap will discover if libdb2.so is invalid and avoid using any BGL Bridge function calls, which would fail. In either case, all smap functionality will be provided on the FEN except for the ability to map SLURM node names to and from row/rack/midplane data.
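If you choose the dummy-library approach, something along these lines on each FEN may suffice (a sketch; the library locations are assumptions for your installation):

# Let the dynamic loader resolve libdb2.so on a FEN by pointing it at libslurm.so
ln -s /usr/lib/libslurm.so /usr/lib/libdb2.so
ldconfig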

If you have different versions of the operating system on the SN and FEN (as was the case for some early system installations), then you will need to configure and build two sets of files for installation. One set will be for the Service Node (SN), which has direct access to the BGL Bridge APIs. The second set will be for the Front End Nodes (FEN), which lack access to the Bridge APIs and interact with them using Remote Procedure Calls to the slurmctld daemon. You should see "#define HAVE_BGL 1" and "#define HAVE_FRONT_END 1" in the "config.h" file for both the SN and FEN builds. You should also see "#define HAVE_BGL_FILES 1" in config.h on the SN before building SLURM.
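A build might therefore proceed along these lines (a sketch; the DB2 path is an example only, and the grep is just a quick way to confirm the defines mentioned above):

# Configure and build (here on the Service Node)
./configure --with-db2-dir=/u/bgdb2cli
make
# Confirm the expected defines before installing each build
grep -E 'HAVE_BGL|HAVE_FRONT_END|HAVE_BGL_FILES' config.h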

The slurmctld daemon should execute on the system's Service Node. If an optional backup daemon is used, it must be in some location where it is capable of executing BGL Bridge APIs. One slurmd daemon should be configured to execute on one of the front end nodes. That one slurmd daemon represents the communications channel for every base partition. A future release of SLURM will support multiple slurmd daemons on multiple front end nodes. You can use the scontrol command to drain individual nodes as desired and return them to service.
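For example, a base partition might be drained and later returned to service with commands along these lines (a sketch; the exact state keywords can vary between SLURM versions, so consult the scontrol man page):

# Remove a base partition from service
scontrol update NodeName=bgl012 State=DRAIN Reason="hardware problem"
# Return it to service
scontrol update NodeName=bgl012 State=RESUME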

The slurm.conf (configuration) file needs to have the value of InactiveLimit set to zero or not specified (it defaults to a value of zero). This is because there are no job steps and we do not want to purge jobs prematurely. The value of SelectType must be set to "select/bluegene" in order to have node selection performed by a mechanism aware of the system's topology and interfaces. The value of SchedulerType should be set to "sched/builtin". The value of Prolog should be set to the full pathname of a program that will delay execution until the bglblock identified by the MPIRUN_PARTITION environment variable is ready for use. It is recommended that you construct a script that serves this function and calls the supplied program sbin/slurm_prolog. The value of Epilog should be set to the full pathname of a program that will wait until the bglblock identified by the MPIRUN_PARTITION environment variable is no longer usable by this job. It is recommended that you construct a script that serves this function and calls the supplied program sbin/slurm_epilog. The prolog and epilog programs are used to ensure proper synchronization between the slurmctld daemon, the user job, and MMCS. A multitude of other functions may also be placed into the prolog and epilog as desired (e.g. enabling/disabling user logins, purging file systems, etc.). Sample prolog and epilog scripts follow.

#!/bin/bash
# Sample Blue Gene Prolog script
#
# Wait for bglblock to be ready for this job's use
/usr/sbin/slurm_prolog


#!/bin/bash
# Sample Blue Gene Epilog script
#
# Cancel job to start the termination process for this job
# and release the bglblock
/usr/bin/scancel $SLURM_JOBID
#
# Wait for bglblock to be released from this job's use
/usr/sbin/slurm_epilog
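
Putting the configuration values described above together, the relevant slurm.conf entries might look like the following (a sketch; the prolog and epilog script paths are assumptions):

InactiveLimit=0
SelectType=select/bluegene
SchedulerType=sched/builtin
Prolog=/usr/local/slurm/etc/bgl_prolog
Epilog=/usr/local/slurm/etc/bgl_epilog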

Since jobs with different geometries or other characteristics do not interfere with each other's scheduling, backfill scheduling is not presently meaningful. SLURM's builtin scheduler on Blue Gene will sort pending jobs and then attempt to schedule all of them in priority order. This essentially functions as if there is a separate queue for each job size. Note that SLURM does support different partitions with an assortment of different scheduling parameters. For example, SLURM can have a partition defined for full-system jobs that is enabled to execute jobs only at certain times, while a default partition could be configured to execute jobs at other times. Jobs could still be queued in a partition that is configured in a DOWN state and scheduled to execute when it is changed to an UP state. Nodes can also be moved between SLURM partitions either by changing the slurm.conf file and restarting the slurmctld daemon or by using the scontrol reconfig command.

SLURM node and partition descriptions should make use of the naming conventions described above. For example, "NodeName=bgl[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024". Note that the values of both NodeAddr and NodeHostname for all 128 base partitions are the name of the front end node executing the slurmd daemon. The NodeName values represent base partitions. No computer is actually expected to return a value of "bgl000" in response to the hostname command, nor will any attempt be made to route message traffic to this address.
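A corresponding set of node and partition definitions might look like the following (a sketch; the partition names, node ranges and time limits are assumptions):

NodeName=bgl[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024
PartitionName=debug Nodes=bgl[000x133] Default=YES MaxTime=30 State=UP
PartitionName=batch Nodes=bgl[000x733] MaxTime=INFINITE State=UP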

While users are unable to initiate SLURM job steps on Blue Gene systems, this restriction does not apply to user root or SlurmUser. Be advised that the one slurmd daemon supporting all nodes is unable to manage a large number of job steps, so this ability should be used only to verify normal SLURM operation. If large numbers of job steps are initiated, expect the slurmd daemon to fail due to lack of memory or other resources. It is best to minimize other work on the front end node executing slurmd so as to maximize its performance and minimize other risk factors.

Presently the system administrator must explicitly define each of the Blue Gene partitions (or bglblocks) available to execute jobs. (NOTE: Blue Gene partitions are unrelated to SLURM partitions.) Jobs must then execute in one of these pre-defined bglblocks. This is known as static partitioning. Each of these bglblocks is explicitly configured with either a mesh or torus interconnect, and the defined bglblocks may not overlap (except for the full-system bglblock, which is implicitly created).

In addition to the normal slurm.conf file, a new bluegene.conf configuration file containing this information is required. Put bluegene.conf into the SLURM configuration directory with slurm.conf. A sample file is installed as bluegene.conf.example. System administrators should use the smap tool to build an appropriate configuration file for static partitioning. Note that smap -Dc can be run without the SLURM daemons active to establish the initial configuration. See the smap man page for more information. You must ensure that the nodes defined in bluegene.conf are consistent with those defined in slurm.conf.

Note that the Image and Numpsets values defined in bluegene.conf are used only when SLURM creates bglblocks; if previously defined bglblocks are used by SLURM, their configurations are not altered. If you wish to modify the Image and Numpsets values for existing bglblocks, either modify them manually or destroy the bglblocks and let SLURM recreate them. In addition to the bglblocks defined in bluegene.conf, an additional bglblock is created containing all of the resources in the other defined bglblocks. If you change the bglblock layout, both slurmctld and slurmd should be cold-started without preserving state (e.g. /etc/init.d/slurm startclean). Note that SLURM wiring decisions are based upon the link-cards being interconnected in a specific fashion. If your Blue Gene system is wired in an unconventional fashion, modifications to the file src/partition_allocator/partition_allocator.c may be required. Make use of the SLURM partition mechanism to control access to these bglblocks. A sample bluegene.conf file is shown below.

#
# Global specifications for Blue Gene system
#
# BlrtsImage:     BlrtsImage used for creation of all bglblocks.
# LinuxImage:     LinuxImage used for creation of all bglblocks.
# MloaderImage:   MloaderImage used for creation of all bglblocks.
# RamDiskImage:   RamDiskImage used for creation of all bglblocks.
# Numpsets:       The Numpsets used for creation of all bglblocks 
#                 equals this value multiplied by the number of 
#                 base partitions in the bglblock.
#
# BridgeAPILogFile : Pathname of file in which to write the BGL 
#                    Bridge API logs.
# BridgeAPIVerbose:  How verbose the BGL Bridge API logs should be
#                    0: Log only error and warning messages
#                    1: Log level 0 and information messages
#                    2: Log level 1 and basic debug messages
#                    3: Log level 2 and more debug messages
#                    4: Log all messages
# 
# NOTE: The bgl_serial value is set at configuration time using the 
#       "--with-bgl-serial=" option. Its default value is "BGL".
#
BlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
LinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
MloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
RamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
Numpsets=8
#
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=0

#
# Define the static partitions (bglblocks)
#
# Nodes: The base partitions (midplanes) in the bglblock using XYZ coordinates
# Type:  Connection type "mesh" or "torus", default is "torus"
# 
# NOTE: A bglblock is implicitly created containing all resources on the system
# NOTE: All Nodes defined here must also be defined in the slurm.conf file
#

# volume = 1x1x1 = 1
Nodes=bgl[000x000]
Nodes=bgl[001x001]

# volume = 1x1x2 = 2
# Nodes=bgl[000x001]   Full-system bglblock, implicitly created 

One more thing is required to support SLURM interactions with the DB2 database (at least as of the time this was written). DB2 database access is required by the slurmctld daemon only. All other SLURM daemons and commands interact with DB2 using remote procedure calls, which are processed by slurmctld. DB2 access is dependent upon the environment variable BRIDGE_CONFIG_FILE. Make sure this is set appropriately before initiating the slurmctld daemon. If desired, this environment variable and any other required setup can be established in the script /etc/sysconfig/slurm, which is automatically executed by /etc/init.d/slurm prior to initiating the SLURM daemons.
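For example, /etc/sysconfig/slurm might contain something like the following (a sketch; the bridge configuration file path is an assumption for your installation):

# Make DB2 access information available to slurmctld
BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config
export BRIDGE_CONFIG_FILE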

At some time in the future, we expect SLURM to support dynamic partitioning, in which Blue Gene job partitions are created and destroyed as needed to accommodate the workload. At that time the bluegene.conf configuration file will become obsolete. Dynamic partitioning does involve substantial overhead, including the rebooting of c-nodes and I/O nodes.

SLURM versions 0.4.23 and higher are designed to utilize Blue Gene driver 141 (2005) or higher. This combination avoids rebooting bglblocks whenever possible so as to minimize the system overhead for boots (which can be tens of minutes on large systems). When slurmctld is initially started on an idle system, the bglblocks already defined in MMCS are read using the BGL Bridge APIs. If these bglblocks do not correspond to those defined in the bluegene.conf file, the old bglblocks with a prefix of "RMP" are destroyed and new ones created. When a job is scheduled, the appropriate bglblock is identified, its node use (virtual or coprocessor) set, its user set, and it is booted. Subsequent jobs use this same bglblock without rebooting by changing the associated user field. The bglblock will be freed and then rebooted in order to change its node use (from virtual to coprocessor or vice versa). Bglblocks will also be freed and rebooted when going to or from full-system jobs (two or more bglblocks sharing base partitions cannot be in a ready state at the same time). When this logic became available at LLNL, approximately 85 percent of bglblock boots were eliminated and the overhead of job startup went from about 24% to about 6% of total job time. Note that bglblocks will remain in a ready (booted) state when the SLURM daemons are stopped. This permits SLURM daemon restarts without loss of running jobs or rebooting of bglblocks.

Be aware that SLURM will issue multiple bglblock boot requests as needed (e.g. when the boot fails). If the bglblock boot requests repeatedly fail, SLURM will configure the failing nodes to a DRAINED state so as to avoid continuing repeated reboots and the likely failure of user jobs. A system administrator should address the problem before returning the nodes to service.

If you cold-start slurmctld (/etc/init.d/slurm startclean or slurmctld -c) it is recommended that you also cold-start the slurmd at the same time. Failure to do so may result in errors being reported by both slurmd and slurmctld due to bglblocks that previously existed being deleted.

A new tool, sfree, has also been added to help administrators free a BGL partition on request. Run sfree -u for usage information and sfree -h for help.

Debugging

All of the testing and debugging guidance provided in the Quick Start Administrator Guide applies to Blue Gene systems. One can start slurmctld and slurmd in the foreground with extensive debugging to establish basic functionality. Once running in production, the configured SlurmctldLog and SlurmdLog files will provide historical system information. On Blue Gene systems, there is also a BridgeAPILogFile defined in bluegene.conf which can be configured to contain detailed information about every Bridge API call issued.
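For example, the daemons can be started in the foreground with verbose logging along these lines (a sketch; see the slurmctld and slurmd man pages for the options supported by your version):

slurmctld -D -vvvv
slurmd -D -vvvv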

Note that slurmctld log messages of the sort "Nodes bgl[000x133] not responding" indicate that the slurmd daemon serving as a front end to those nodes is not responding (on non-Blue Gene systems, slurmd actually does run on the compute nodes, so the message is more meaningful there).


For information about this page, contact slurm-dev@lists.llnl.gov.