Blue Gene User Guide
Overview
This document describes the unique features of SLURM on
IBM Blue Gene systems.
You should be familiar with SLURM's mode of operation on Linux clusters
before studying the differences described in this document.
Users familiar with SLURM will find that there are relatively few Blue Gene
specific differences.
Blue Gene systems have several unique features that result in a few
differences in how SLURM operates on them.
The basic unit of resource allocation is a base partition.
The base partitions are connected in a three-dimensional torus.
Each base partition includes 512 c-nodes, each containing two processors:
one designed primarily for computation and the other primarily for managing communications.
SLURM considers each base partition as one node with 1024 processors.
Each c-node can execute only one process and therefore cannot run both
the user's job and SLURM's slurmd daemon.
Thus the slurmd daemon executes on one of the Blue Gene Front End Nodes.
This slurmd daemon provides (almost) all of the normal SLURM services
for every base partition on the system.
User Tools
The normal set of SLURM user tools (srun, scancel, sinfo, squeue and scontrol)
provides all of the expected services except support for job steps.
SLURM performs resource allocation for the job, but initiation of job steps is performed
using the mpirun command and daemons provided with the Blue Gene system.
Three new srun options are available: --geometry (specify job size in each dimension),
--rotate (permit rotation of geometry), and --connect (specify connection type of
mesh or torus). See the srun man pages for details.
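For example, a request combining these options might look like the sketch
below. The -A (allocate-only) flag and the XxYxZ geometry value syntax are
assumptions here; consult the srun man page for the authoritative format.

   # Sketch: request eight base partitions arranged as a 2x2x2 torus,
   # with rotation of the geometry disabled.  -A requests an allocation
   # only; job steps are then launched with mpirun.
   srun -N8 --geometry=2x2x2 --connect=torus --rotate=no -A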
The naming of nodes includes a three-digit suffix representing the base partition's
location in the X, Y and Z dimensions with a zero origin.
For example, "bgl012" represents the base partition whose location is at X=0, Y=1 and Z=2.
Since jobs must be allocated consecutive nodes in all three dimensions, we have developed
an abbreviated format for describing the nodes in one of these three-dimensional blocks.
The node's prefix is followed by the end-points of the block enclosed in square-brackets.
For example, " bgl[620x731]" is used to represent the eight nodes enclosed in a block
with endpoints bgl620 and bgl731 (bgl620, bgl621, bgl630, bgl631, bgl720, bgl721,
bgl730 and bgl731).
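The following shell sketch enumerates the base partitions in such a block,
using the bgl[620x731] example:

   # Expand bgl[620x731] into its individual base partition names:
   # X ranges over 6-7, Y over 2-3, and Z over 0-1.
   for x in 6 7; do
     for y in 2 3; do
       for z in 0 1; do
         echo "bgl${x}${y}${z}"
       done
     done
   done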
One new tool provided is smap.
Smap is aware of the system topology and provides a map of which nodes
are allocated to jobs, partitions, etc.
A sample of smap output is provided below showing the location of five jobs.
Note the format of the list of nodes allocated to each job.
Also note that idle (unassigned) base partitions are indicated by a period.
   a a a a b b d d    Key  JobId  User    Nodes  NodeList
   a a a a b b d d    a    12345  joseph     64  bgl[000x333]
   a a a a b b c c    b    12346  chris      16  bgl[420x533]
   a a a a b b c c    c    12350  danny       8  bgl[620x731]
                      d    12356  dan        16  bgl[603x733]
   a a a a b b d d    e    12378  joseph      4  bgl[610x711]
   a a a a b b d d
   a a a a b b c c
   a a a a b b c c
   a a a a . . d d
   a a a a . . d d
   a a a a . . e e
   a a a a . . e e
   a a a a . . d d
   a a a a . . d d
   a a a a . . . .
   a a a a . . . .
System Administration
Building a Blue Gene compatible version of SLURM depends upon the configure
program locating some expected files. You should see "#define HAVE_BGL 1" in
the "config.h" file before making SLURM.
The slurmctld daemon should execute on the system's service node with
an optional backup daemon on one of the front end nodes.
One slurmd daemon should be configured to execute on one of the front end nodes.
That one slurmd daemon provides the communications channel for every base partition.
You can use the scontrol command to drain individual nodes as desired and
return them to service.
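For example (the node name and reason string are illustrative, and the
RESUME state name is an assumption; see the scontrol man page):

   # Drain one base partition, then return it to service.
   scontrol update NodeName=bgl000 State=DRAIN Reason="hardware check"
   scontrol update NodeName=bgl000 State=RESUME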
The slurm.conf (configuration) file needs to have the value of InactiveLimit
set to zero or not specified (it defaults to a value of zero).
This is because there are no job steps and we don't want to purge jobs prematurely.
The value of SelectType must be set to "select/bluegene" in order to have
node selection performed by a plugin that is aware of the system's topology
and interfaces.
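Together, the relevant slurm.conf lines would look like this:

   InactiveLimit=0              # the default; never purge jobs for inactivity
   SelectType=select/bluegene   # topology-aware node selection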
SLURM node and partition descriptions should make use of the
naming conventions described above. For example,
"NodeName=bgl[000x733] NodeAddr=frontend0 Procs=1024".
Note that the NodeAddr value for all 128 base partitions is the name
of the front end node executing the slurmd daemon.
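A fuller slurm.conf sketch combining these conventions (the partition name
and options shown are illustrative):

   NodeName=bgl[000x733] NodeAddr=frontend0 Procs=1024
   PartitionName=batch Nodes=bgl[000x733] Default=YES State=UP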
While users are unable to initiate SLURM job steps on Blue Gene systems,
this restriction does not apply to user root or SlurmUser.
Be advised that the one slurmd supporting all nodes is unable to manage a
large number of job steps, so this ability should be used only to verify normal
SLURM operation.
If large numbers of job steps are initiated by slurmd, expect the daemon to
fail due to lack of memory.
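For example, user root might verify operation with a single small job step:

   # Run one trivial job step; large numbers of job steps can
   # exhaust the memory of the single slurmd daemon.
   srun -N1 /bin/hostname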