Figure 1: SLURM components
The entities managed by these SLURM daemons are shown in Figure 2 and include nodes (the compute resource in SLURM), partitions (which group nodes into logical, disjoint sets), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job). Priority-ordered jobs are allocated nodes within a partition until the resources (nodes) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started which utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation.
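As a sketch of this model (the script name, node counts, and submission command below are chosen purely for illustration), a batch job allocated four nodes could run one job step across the whole allocation and then two concurrent job steps on halves of it, e.g. a script submitted with "srun -N4 -b my.steps":

#!/bin/sh
# One job step spanning all four nodes of the allocation:
srun -N4 -l /bin/hostname
# Two job steps run concurrently, each on two of the allocated nodes:
srun -N2 -l /bin/hostname &
srun -N2 -l /bin/hostname &
wait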
Figure 2: SLURM entities
srun is used to submit a job for execution, allocate resources, attach to an existing allocation, or initiate job steps. Jobs can be submitted for immediate or later execution (e.g. batch). srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (a minimum amount of memory or disk space, certain required features, etc.). Besides securing a resource allocation, srun is used to initiate job steps. These job steps can execute sequentially or in parallel on independent or shared nodes within the job's node allocation.
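For example (a hedged sketch; exact option syntax varies between SLURM releases, so consult the srun man page for your version), a request for two to four nodes, eight tasks, a minimum amount of real memory, and the exclusion of one node might look like this:

# Ask for 2 to 4 nodes (-N2-4), 8 tasks (-n8), at least 2000 MB of real
# memory per node (--mem=2000), and do not use node adev8 (-x adev8):
adev0: srun -N2-4 -n8 --mem=2000 -x adev8 -l /bin/hostname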
scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
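For instance (a sketch; the job and step IDs are assumed to already exist, and the signal option may differ by version), a signal can be delivered to a whole job, or a single job step can be cancelled on its own:

# Send SIGUSR1 to all processes of job 473 (-s names the signal):
adev0: scancel -s USR1 473
# Cancel only job step 1 of job 473, leaving the job itself running:
adev0: scancel 473.1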
scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
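For example (a sketch; the available filters depend on the SLURM version), the report can be restricted to one partition or to nodes in a particular state:

# Report only the debug partition:
adev0: sinfo -p debug
# Report only nodes that are currently idle:
adev0: sinfo --state=idle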
squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
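For example (a sketch; the user name is chosen for illustration), the listing can be limited to one user's pending jobs:

# Show only pending jobs belonging to user jette:
adev0: squeue -u jette -t PENDING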
The slurmd daemon executes on every compute node. It resembles a remote shell daemon, exporting control over the node to SLURM. Since slurmd initiates and manages user jobs, it must execute as the user root.
slurmctld and/or slurmd should be initiated at node startup time per the SLURM configuration.
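One possible arrangement (a sketch only; installation paths and init mechanisms vary by site) is to start the appropriate daemon from each node's startup scripts:

# On the machine named by ControlMachine in slurm.conf:
/usr/local/sbin/slurmctld
# On every compute node (slurmd must run as user root):
/usr/local/sbin/slurmd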
Execute /bin/hostname on four nodes (-N4). Include task numbers on the output (-l). The default partition will be used. One task per node will be used by default (note that we don't specify a task count).

adev0: srun -N4 -l /bin/hostname
0: adev9
1: adev10
2: adev11
3: adev12
Execute /bin/hostname in four tasks (-n4). Include task numbers on the output (-l). The default partition will be used. One processor per task will be used by default (note that we don't specify a node count).
adev0: srun -n4 -l /bin/hostname
0: adev9
1: adev9
2: adev10
3: adev10
Submit the script my.script for later execution (-b). Explicitly use the nodes adev9 and adev10 (-w "adev[9-10]", note the use of a node range expression). One processor per task will be used by default. The output will appear in the file my.stdout (-o my.stdout). By default, one task will be initiated per processor on the nodes. Note that my.script contains the command /bin/hostname, which is executed on the first node in the allocation (where the script runs), plus two job steps initiated using the srun command and executed sequentially.
adev0: cat my.script
#!/bin/sh
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd

adev0: srun -w "adev[9-10]" -o my.stdout -b my.script
srun: jobid 469 submitted

adev0: cat my.stdout
adev9
0: adev9
1: adev9
2: adev10
3: adev10
0: /home/jette
1: /home/jette
2: /home/jette
3: /home/jette
Submit a job, get its status and cancel it.
adev0: srun -b my.sleeper
srun: jobid 473 submitted

adev0: squeue
JobId Partition Name     User  St TimeLim Prio Nodes
473   batch     my.sleep jette R  UNLIMIT 0.99 adev9

adev0: scancel 473

adev0: squeue
JobId Partition Name     User  St TimeLim Prio Nodes
Get the SLURM partition and node status.
adev0: sinfo
PARTITION NODES STATE CPUS MEMORY    TMP_DISK NODES
--------------------------------------------------------------------------------
debug     8     IDLE  2    3448      82306    adev[0-7]
batch     1     DOWN  2    3448      82306    adev8
          7     IDLE  2    3448-3458 82306    adev[9-15]
SLURM uses the syslog function to record events. It uses a range of importance levels for these messages. Be certain that your system's syslog functionality is operational.
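A quick way to confirm this (a sketch; the log file location is distribution-dependent) is to write a test message and verify that it is recorded:

# Send a test message at the daemon.info level, then check the system log:
adev0: logger -p daemon.info "syslog test before starting SLURM"
adev0: tail /var/log/messages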
There is no necessity for synchronized clocks on the nodes. Events occur in real time based upon message traffic. However, synchronized clocks will permit easier analysis of SLURM logs from multiple nodes.
A description of the nodes and their grouping into non-overlapping partitions is required. Partition and node specifications use node range expressions to identify nodes in a concise fashion. The sample configuration file below defines a 1152-node cluster for SLURM, but it could describe a much larger cluster by changing only a few node range expressions, as sketched after the sample file.
#
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=mcri
ControlAddr=emcri
#
AuthType=auth/authd
Epilog=/usr/local/slurm/etc/epilog
FirstJobId=65536
HeartbeatInterval=30
PluginDir=/usr/local/slurm/lib/slurm
Prolog=/usr/local/slurm/etc/prolog
SlurmUser=slurm
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
StateSaveLocation=/tmp/slurm.state
#
# Node Configurations
#
NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000 State=UNKNOWN
NodeName=mcr[0-1151] NodeAddr=emcr[0-1151]
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-1151]
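As a sketch of that scaling (the node count of 4096 is chosen purely for illustration), growing the cluster mostly means widening the node range expressions; the rest of the file is unchanged:

# Same layout scaled to a 4096-node cluster:
NodeName=mcr[0-4095] NodeAddr=emcr[0-4095]
PartitionName=pdebug Nodes=mcr[0-191] MaxTime=30 MaxNodes=32 Default=YES
PartitionName=pbatch Nodes=mcr[192-4095]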
Print detailed state of all jobs in the system.
adev0: scontrol
scontrol: show job
JobId=475 UserId=bob(6885) Name=sleep JobState=COMPLETED
   Priority=4294901286 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:53:41 EndTime=03/19-12:53:59
   NodeList=adev8 NodeListIndecies=-1
   ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1
JobId=476 UserId=bob(6885) Name=sleep JobState=RUNNING
   Priority=4294901285 Partition=batch BatchFlag=0
   AllocNode:Sid=adevi:21432 TimeLimit=UNLIMITED
   StartTime=03/19-12:54:01 EndTime=NONE
   NodeList=adev8 NodeListIndecies=8,8,-1
   ReqProcs=0 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   ReqNodeList=(null) ReqNodeListIndecies=-1
Print the detailed state of job 477 and change its priority to zero. A priority of zero prevents a job from being initiated (it is held in pending state).
adev0: scontrol
scontrol: show job 477
JobId=477 UserId=bob(6885) Name=sleep JobState=PENDING
   Priority=4294901286 Partition=batch BatchFlag=0
   more data removed....
scontrol: update JobId=477 Priority=0
Print the state of node adev13 and drain it. To drain a node, specify a new state of "DRAIN", "DRAINED", or "DRAINING". SLURM will automatically set it to the appropriate value of either "DRAINING" or "DRAINED" depending on whether or not the node is currently allocated. Return it to service later.
adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=ALLOCATED CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null)
scontrol: update NodeName=adev13 State=DRAIN
scontrol: show node adev13
NodeName=adev13 State=DRAINING CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null)
scontrol: quit

Later

adev0: scontrol
scontrol: show node adev13
NodeName=adev13 State=DRAINED CPUs=2 RealMemory=3448 TmpDisk=32000
   Weight=16 Partition=debug Features=(null)
scontrol: update NodeName=adev13 State=IDLE
Reconfigure all SLURM daemons on all nodes. This should be done after changing the SLURM configuration file.
adev0: scontrol reconfig
Print the current SLURM configuration. This also reports whether the primary and backup controllers (slurmctld daemons) are responding. To see only the state of the controllers, use the scontrol command "ping".
adev0: scontrol show config
Configuration data as of 03/19-13:04:12
AuthType          = auth/munge
BackupAddr        = eadevj
BackupController  = adevj
ControlAddr       = eadevi
ControlMachine    = adevi
Epilog            = (null)
FastSchedule      = 0
FirstJobId        = 0
NodeHashBase      = 10
HeartbeatInterval = 60
InactiveLimit     = 0
JobCredPrivateKey = /etc/slurm/slurm.key
JobCredPublicKey  = /etc/slurm/slurm.cert
KillWait          = 30
PluginDir         = /usr/lib/slurm
Prioritize        = (null)
Prolog            = (null)
ReturnToService   = 1
SlurmUser         = slurm(97)
SlurmctldDebug    = 4
SlurmctldLogFile  = /tmp/slurmctld.log
SlurmctldPidFile  = (null)
SlurmctldPort     = 0
SlurmctldTimeout  = 300
SlurmdDebug       = 65534
SlurmdLogFile     = /tmp/slurmd.log
SlurmdPidFile     = (null)
SlurmdPort        = 0
SlurmdSpoolDir    = /tmp/slurmd
SlurmdTimeout     = 300
SLURM_CONFIG_FILE = /etc/slurm/slurm.conf
StateSaveLocation = /usr/local/tmp/slurm/adev
TmpFS             = /tmp
Slurmctld(primary/backup) at adevi/adevj are UP/UP
Shutdown all SLURM daemons on all nodes.
adev0: scontrol shutdown
Last Modified March 27, 2003
Maintained by slurm-dev@lists.llnl.gov