The controller (ControlDaemon) orchestrates all SLURM activities, including accepting the job initiation request, allocating nodes to the job, enforcing partition constraints, enforcing job limits, and general record keeping. The three primary components (threads) of the controller are the Partition Manager (PM), Node Manager (NM), and Job Manager (JM). The partition manager keeps track of partition state and constraints. The node manager keeps track of node state and configuration. The job manager keeps track of job state and enforces its limits. Since all of these functions are critical to overall SLURM operation, a backup controller assumes these responsibilities in the event of control machine failure.
The final component of interest is the Job Shepherd (JS), which is part of the ServerDaemon. The ServerDaemon executes on every SLURM compute server. The job shepherd initiates the job's tasks, allocates switch resources, monitors job state and resource utilization, and delivers signals to the processes as needed.
Figure 1: SLURM components
Interconnecting all of these components is a highly scalable and reliable communications library. The general mode of operation is for every node to initiate a MasterDaemon. This daemon will in turn execute any defined InitProgram to ensure the node is fully ready for service. The InitProgram can, for example, ensure that all required file systems are mounted. MasterDaemon will subsequently initiate a ControlDaemon and/or ServerDaemon as defined in the SLURM configuration file and terminate itself. This model eliminates the need for unique configuration (RC) files on the controller and backup controller nodes.
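The sketch below illustrates this startup sequence in C. The daemon paths, the InitProgram location, and the choice of which daemons to start are assumptions for illustration; the real MasterDaemon takes all of these from the SLURM configuration file.

    /* Hypothetical sketch of the MasterDaemon startup sequence.
     * Paths and the choice of daemons are illustrative; the real
     * daemon takes both from the SLURM configuration file. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void spawn(const char *path) {
        pid_t pid = fork();
        if (pid == 0) {                      /* child: become the daemon */
            execl(path, path, (char *)NULL);
            perror("execl");
            _exit(1);
        } else if (pid < 0) {
            perror("fork");
        }
    }

    int main(void) {
        /* Run InitProgram (e.g. file system mount checks) and wait. */
        if (system("/etc/slurm/init_program") != 0)   /* assumed path */
            fprintf(stderr, "InitProgram reported a problem\n");

        /* Start the daemons named in the configuration, then exit. */
        spawn("/usr/sbin/ControlDaemon");   /* control machines only */
        spawn("/usr/sbin/ServerDaemon");    /* compute servers only  */
        return 0;                           /* MasterDaemon terminates */
    }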
The ControlDaemon will read the node and partition information from the appropriate SLURM configuration files. It will then contact each ServerDaemon to gather current job and system state information. The BackupController will ping the ControlDaemon periodically to ensure that it is operative. If the ControlDaemon fails to respond within the period specified by ControllerTimeout, the BackupController will assume its responsibilities. The original ControlDaemon will reclaim those responsibilities when returned to service. Whenever the machine responsible for control changes, it must notify every other SLURM daemon to ensure that messages are routed appropriately.
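A minimal sketch of reading keyword=value pairs such as ControllerTimeout from a configuration file follows; the file name, keyword spellings, and default values are assumptions made for illustration.

    /* Minimal sketch: read keyword=value pairs such as ControllerTimeout
     * from a SLURM-style configuration file. The file name, keywords,
     * and defaults are assumptions. */
    #include <stdio.h>

    int main(void) {
        FILE *fp = fopen("/etc/slurm.conf", "r");
        if (fp == NULL) { perror("fopen"); return 1; }

        int controller_timeout = 120;   /* seconds (assumed default) */
        int server_timeout     = 120;
        char line[256];

        while (fgets(line, sizeof line, fp)) {
            if (line[0] == '#')         /* skip comment lines */
                continue;
            if (sscanf(line, "ControllerTimeout=%d", &controller_timeout) == 1)
                continue;
            if (sscanf(line, "ServerTimeout=%d", &server_timeout) == 1)
                continue;
            /* node and partition records would be parsed here */
        }
        fclose(fp);

        printf("ControllerTimeout=%d ServerTimeout=%d\n",
               controller_timeout, server_timeout);
        return 0;
    }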
The Job Initiator will contact the ControlDaemon to be allocated appropriate resources, including authorization for interconnect use. The Job Initiator itself will be responsible for distributing the program, environment variables, identification of the current directory, standard input, etc. Standard output and standard error from the program will be transmitted to the Job Initiator. Should the Job Initiator terminate prior to the parallel job's termination (for example, if the node fails), the ControlDaemon will initiate a new Job Initiator. While the new Job Initiator will not be capable of transmitting additional standard input data, it will log the standard output and error data.
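The hypothetical structures below suggest the kind of information exchanged in this allocation dialog; the field names and sizes are illustrative only and do not describe the actual message format.

    /* Hypothetical request/reply structures for the Job Initiator's
     * allocation dialog with the ControlDaemon. Field names and sizes
     * are illustrative only. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct allocate_request {
        uint32_t user_id;         /* requesting user                  */
        uint32_t min_nodes;       /* number of nodes needed           */
        uint32_t time_limit;      /* minutes; 0 for partition default */
        char     partition[16];   /* requested partition name         */
        char     cwd[256];        /* current working directory        */
    };

    struct allocate_reply {
        uint32_t job_id;          /* assigned job identifier          */
        uint32_t error_code;      /* zero on success                  */
        char     node_list[1024]; /* nodes allocated to the job       */
        char     switch_auth[64]; /* interconnect use authorization   */
    };

    int main(void) {
        struct allocate_request req = { 0 };
        req.user_id    = 500;
        req.min_nodes  = 16;
        req.time_limit = 60;
        strcpy(req.partition, "batch");
        strcpy(req.cwd, "/tmp");
        printf("requesting %u nodes in partition %s\n",
               (unsigned)req.min_nodes, req.partition);
        return 0;
    }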
ServerDaemon's Job Shepherd will initiate the user program's tasks and monitor their state. The ServerDaemon will also monitor and report overall node state information periodically to the ControlDaemon. Should any node associated with a user task fail (ServerDaemon fails to respond within ServerTimeout), the entire application will be terminated by the Job Initiator.
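The following sketch shows one way a Job Shepherd might spawn and reap a node's tasks with fork, exec, and waitpid; the task count and command are placeholders, not the actual implementation.

    /* Sketch of a Job Shepherd spawning and reaping a node's tasks.
     * The task count and command are placeholders for illustration. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NTASKS 4                 /* tasks assigned to this node */

    int main(void) {
        pid_t tasks[NTASKS];

        for (int i = 0; i < NTASKS; i++) {
            tasks[i] = fork();
            if (tasks[i] == 0) {     /* child: run the user's program */
                execl("/bin/hostname", "hostname", (char *)NULL);
                _exit(127);          /* exec failed */
            }
        }

        /* Monitor task state: reap each task and record its status. */
        for (int i = 0; i < NTASKS; i++) {
            int status;
            if (tasks[i] > 0 && waitpid(tasks[i], &status, 0) > 0)
                printf("task %d exit status %d\n", i, WEXITSTATUS(status));
        }
        return 0;
    }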
The controller will then load the last known node, partition, and job state information from the primary or secondary backup locations. This state recovery mechanism facilitates the recovery process, especially if the control machine changes. Each SLURM machine is then requested to send current state information. State is saved on a periodic basis from that point forward, based upon the interval and file name specifications in the SLURM configuration file. Both primary and secondary intervals and files can be configured. Ideally the primary and secondary backup files will be written to distinct file systems and/or devices for greater fault tolerance. Upon receipt of a shutdown request, the controller will save state to both the primary and secondary files and terminate.
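A simplified sketch of this periodic checkpointing follows, assuming illustrative file names, a fixed interval, and a trivial state payload; the real daemon would dump its job, node, and partition records.

    /* Sketch of periodic state checkpointing to primary and secondary
     * files. File names, interval, and payload are assumptions. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void save_state(const char *path, const char *state) {
        FILE *fp = fopen(path, "w");
        if (fp == NULL) { perror(path); return; }
        fputs(state, fp);
        fflush(fp);                /* push data toward disk before returning */
        fclose(fp);
    }

    int main(void) {
        const char *primary   = "/var/slurm/state.primary";  /* assumed  */
        const char *secondary = "/alt_fs/slurm/state.backup"; /* ideally a
                                     distinct file system or device      */
        for (;;) {
            char state[128];
            snprintf(state, sizeof state, "checkpoint at %ld\n",
                     (long)time(NULL));
            save_state(primary, state);
            save_state(secondary, state);
            sleep(300);            /* primary/secondary intervals come
                                      from the configuration file */
        }
    }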
At this point, the controller enters a reactive mode. Node and job state information is logged when received, requests for getting and/or setting state information are processed, resources are allocated to jobs, etc.
The allocation of resources to jobs is fairly complex. When a job initiation request is received, a record is made of each partition that might be used to satisfy the request. Each available node is then checked for possible use. This involves many tests:
The controller expects each SLURM Job Shepherd (on the compute servers) to report its state every ServerTimeout seconds. If a node fails to do so, its state is set to DOWN and no further jobs will be scheduled on that node until it reports a valid state. The controller will also send a state request message to the wayward node. The controller collects node and job resource use information. When a job has reached its prescribed time limit, its termination is initiated through signals to the appropriate Job Shepherds.
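The sketch below illustrates this timeout check, assuming a hypothetical node_record structure and a fixed ServerTimeout value; it is not the controller's actual node table.

    /* Sketch of the node-timeout check: mark a node DOWN when its Job
     * Shepherd has not reported within ServerTimeout seconds. The
     * data structures and names are illustrative assumptions. */
    #include <stdio.h>
    #include <time.h>

    #define SERVER_TIMEOUT 120       /* seconds, from the configuration */

    enum node_state { NODE_IDLE, NODE_ALLOCATED, NODE_DOWN };

    struct node_record {
        char            name[32];
        enum node_state state;
        time_t          last_report; /* time of last Job Shepherd report */
    };

    /* Called periodically by the controller's Node Manager thread. */
    static void check_timeouts(struct node_record *nodes, int count) {
        time_t now = time(NULL);
        for (int i = 0; i < count; i++) {
            if (nodes[i].state == NODE_DOWN)
                continue;
            if (now - nodes[i].last_report > SERVER_TIMEOUT) {
                nodes[i].state = NODE_DOWN; /* no further scheduling here */
                printf("node %s set DOWN, requesting state\n", nodes[i].name);
                /* a state request message would also be sent to the node */
            }
        }
    }

    int main(void) {
        struct node_record nodes[2] = {
            { "linux01", NODE_IDLE,      time(NULL)       },
            { "linux02", NODE_ALLOCATED, time(NULL) - 999 }  /* stale */
        };
        check_timeouts(nodes, 2);
        return 0;
    }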
The controller also reports its state to the backup controller (if any) at the HeartbeatInterval. If the backup controller has not received any state information from the primary controller in ControllerTimeout seconds, it begins to provide controller functions using an identical startup process. When the primary controller resumes operation, it notifies the backup controller, sleeps for HeartbeatInterval to permit the backup controller to save state and terminate, reads the saved state files, and resumes operation.
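A sketch of the primary controller's resumption sequence follows, with stub helper functions standing in for the real notification, state reload, and service logic; the function names are assumptions.

    /* Sketch of the primary controller resuming operation after a
     * fail-over. The helper functions are empty stubs whose names are
     * assumptions for illustration. */
    #include <stdio.h>
    #include <unistd.h>

    #define HEARTBEAT_INTERVAL 30    /* seconds, from the configuration */

    static void notify_backup_controller(void) { puts("notify backup");  }
    static void read_saved_state_files(void)   { puts("reload state");   }
    static void enter_reactive_mode(void)      { puts("resume service"); }

    int main(void) {
        notify_backup_controller();  /* tell the backup to stand down      */
        sleep(HEARTBEAT_INTERVAL);   /* let it save state and terminate    */
        read_saved_state_files();    /* recover the state the backup wrote */
        enter_reactive_mode();       /* resume normal controller operation */
        return 0;
    }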
The controller, like all other SLURM daemons, logs all significant activities using the syslog function. This identifies not only the event, but also its significance.
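For example, logging through syslog with an explicit severity level might look like the following; the daemon name, facility, and message contents are illustrative.

    /* Example of syslog usage: each message carries both a description
     * of the event and a severity level. Messages are illustrative. */
    #include <syslog.h>

    int main(void) {
        openlog("ControlDaemon", LOG_PID, LOG_DAEMON);

        syslog(LOG_INFO,    "job 1234 initiated on 16 nodes");
        syslog(LOG_WARNING, "node linux09 not responding, state set DOWN");
        syslog(LOG_ERR,     "unable to save state file");

        closelog();
        return 0;
    }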
While the job is running, standard output and standard error are collected and reported back to the Job Initiator. Signals sent to the job from the controller (e.g. time-limit enforcement) or from the Job Initiator (e.g. user-initiated termination) are forwarded.
The job shepherd collects resource usage for all processes on the node. The monitored resources include:
The job shepherd accepts connections from the SLURM administrative tool and from Job Initiators. It then confirms the identity of the user executing the command and forwards the authenticated request to the control machine. Responses from the control machine are forwarded as needed.
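One possible way to confirm a local user's identity is shown below, using Linux's SO_PEERCRED option on a UNIX-domain socket. SLURM's actual authentication mechanism is not specified here, so this is only an illustration of the idea.

    /* One way to confirm a local user's identity: Linux's SO_PEERCRED
     * option on a UNIX-domain socket. Illustrative only; it does not
     * describe SLURM's actual authentication mechanism. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Return 0 and set *uid to the peer's UID on a connected socket. */
    static int peer_uid(int fd, uid_t *uid) {
        struct ucred cred;
        socklen_t len = sizeof cred;
        if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
            return -1;
        *uid = cred.uid;
        return 0;
    }

    int main(void) {
        int fds[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
            perror("socketpair");
            return 1;
        }
        uid_t uid;
        if (peer_uid(fds[0], &uid) == 0)
            printf("request authenticated for uid %d\n", (int)uid);
        /* the authenticated request would now be forwarded on to the
         * control machine */
        close(fds[0]);
        close(fds[1]);
        return 0;
    }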
The ControlDaemon collects state information from each ServerDaemon. If there have been no communications for an extended period, it pings the ServerDaemon. If there is no response within ServerTimeout, the node is considered DOWN and unavailable for use. The appropriate Job Initiator is also notified so that it can terminate the job. The ControlDaemon also processes administrator and user requests.
The ServerDaemon waits for work requests from Job Initiators. It spawns user tasks as required, transfers standard input, output, and error as required, and reports job and system state information as requested by the Job Initiator and ControlDaemon.
Many of these modules have been built and tested on a variety of Unix computers including Redhat's Linux, IBM's AIX, Sun's Solaris, and Compaq's Tru-64. The only module at this time which is operating system dependent is Get_Mach_Stat.c.
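The sketch below hints at why such code is operating system dependent: it relies on interfaces, such as certain sysconf options, that are not available on every system. It does not reproduce the actual contents of Get_Mach_Stat.c.

    /* Illustration of OS-dependent machine status gathering. Uses
     * interfaces available on Linux and some other POSIX systems;
     * not the actual contents of Get_Mach_Stat.c. */
    #include <stdio.h>
    #include <sys/utsname.h>
    #include <unistd.h>

    int main(void) {
        struct utsname un;
        if (uname(&un) == 0)
            printf("node %s running %s %s\n",
                   un.nodename, un.sysname, un.release);

        long cpus     = sysconf(_SC_NPROCESSORS_ONLN);  /* not universal */
        long pages    = sysconf(_SC_PHYS_PAGES);        /* not universal */
        long pagesize = sysconf(_SC_PAGESIZE);

        if (cpus > 0)
            printf("processors: %ld\n", cpus);
        if (pages > 0 && pagesize > 0)
            printf("real memory: %ld MB\n", (pages / 1024) * pagesize / 1024);
        return 0;
    }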
The node selection logic allocates nodes to jobs in a fashion that makes the most sense for a Quadrics switch interconnect. It allocates the smallest collection of consecutive nodes that satisfies the request (e.g. if there are 32 consecutive nodes and 16 consecutive nodes available, a job needing 16 or fewer nodes will be allocated nodes from the 16-node set rather than fragment the 32-node set). If the job cannot be allocated consecutive nodes, it will be allocated the smallest number of consecutive sets (e.g. if the available consecutive node sets are of sizes 6, 4, 3, 3, 2, 1, and 1, then a request for 10 nodes will always be allocated the 6- and 4-node sets rather than use the smaller sets).
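A sketch of this selection logic follows, operating directly on the sizes of the available consecutive node sets; building those sets from the node table is omitted, and this is not the controller's actual code.

    /* Sketch of best-fit selection over consecutive node sets: prefer
     * the smallest single set that fits, otherwise take the largest
     * sets first to minimize the number of sets used. */
    #include <stdio.h>
    #include <stdlib.h>

    /* qsort comparator: descending order of set size */
    static int desc(const void *a, const void *b) {
        return *(const int *)b - *(const int *)a;
    }

    static void pick_sets(int *sets, int count, int request) {
        int best = -1;
        for (int i = 0; i < count; i++)      /* smallest set that fits */
            if (sets[i] >= request &&
                (best < 0 || sets[i] < sets[best]))
                best = i;
        if (best >= 0) {
            printf("allocate %d nodes from the %d-node set\n",
                   request, sets[best]);
            return;
        }
        qsort(sets, count, sizeof *sets, desc);  /* largest sets first */
        for (int i = 0; i < count && request > 0; i++) {
            printf("allocate %d-node set\n", sets[i]);
            request -= sets[i];
        }
    }

    int main(void) {
        int sets[] = { 6, 4, 3, 3, 2, 1, 1 };    /* example from the text */
        pick_sets(sets, 7, 10);                  /* uses the 6 and 4 sets */
        return 0;
    }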
We have tried to develop the SLURM code to be quite general and flexible, but compromises were made in several areas for the sake of simplicity and ease of support. Entire nodes are dedicated to user applications. Our customers at LLNL have expressed the opinion that sharing of nodes can severely reduce their jobs' performance and even reliability, due to contention for shared resources such as local disk space, real memory, virtual memory, and processor cycles. The proper support of shared resources, including the enforcement of limits on these resources, entails a substantial amount of additional effort. Given this cost-to-benefit trade-off at LLNL, we have decided not to support shared nodes. However, we have designed SLURM so as not to preclude the addition of such a capability at a later time if desired.
Last Modified January 25, 2002
Maintained by Moe Jette jette1@llnl.gov