SLURM Programmer's Guide

Overview

Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure. SLURM requires no kernel modifications and is relatively self-contained.

Component Overview

The Job Initiator (JI) is the tool used by the customer to initiate a job. The job initiator can execute on any computer in the cluster. Its request is sent to the controller executing on the control machine.

The controller (ControlDaemon) orchestrates all SLURM activities including: accepting the job initiation request, allocating nodes to the job, enforcing partition constraints, enforcing job limits, and general record keeping. The three primary components (threads) of the controller are the Partition Manager (PM), Node Manager (NM), and Job Manager (JM). The partition manager keeps track of partition state and constraints. The node manager keeps track of node state and configuration. The job manager keeps track of job state and enforces its limits. Since all of these functions are critical to the overall SLURM operation, a backup controller assumes these responsibilities in the event of control machine failure.

The final component of interest is the Job Shepherd (JS), which is part of the ServerDaemon. The ServerDaemon executes on every SLURM compute server. The job shepherd initiates the job's tasks. It allocates switch resources. It also monitors job state and resource utilization. Finally, it delivers signals to the processes as needed.

Figure 1: SLURM components

Interconnecting all of these components is a highly scalable and reliable communications library. The general mode of operation is for every node to initiate a MasterDaemon. This daemon will in turn execute any defined InitProgram to insure the node is fully ready for service. The InitProgram can, for example, insure that all required file systems are mounted. MasterDaemon will subsequently initiate a ControlDaemon and/or ServerDaemon as defined in the SLURM configuration file and terminate itself. Is this model a good one? It does eliminate unique configuration files (RC files) on the controller and backup controller nodes.

The ControlDaemon will read the node and partition information from the appropriate SLURM configuration files. It will then contact each ServerDaemon to gather current job and system state information. The BackupController will ping the ControlDaemon periodically to insure that it is operative. If the ControlDaemon fails to respond for a period specified as ControllerTimeout, the BackupController will assume those responsibilities. The original ControlDaemon will reclaim those responsibilities when returned to service. Whenever the machine responsible for control responsibilities changes, it must notify every other SLURM daemon to insure that messages are routed in an appropriate fashion.
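
The failover timing can be sketched roughly as follows. This is only an illustration of the ControllerTimeout behavior described above; the ping and takeover routines and the configuration variables are hypothetical placeholders, not actual SLURM communication library calls.

    /* Sketch of the BackupController failover timing described above.
     * The ping and takeover functions are hypothetical placeholders,
     * not actual SLURM communication library calls. */
    #include <time.h>
    #include <unistd.h>

    extern int  Heartbeat_Interval;               /* from SLURM configuration */
    extern int  Controller_Timeout;               /* from SLURM configuration */
    extern int  Ping_Primary_Controller(void);    /* hypothetical: 0 if alive */
    extern void Assume_Control_Duties(void);      /* hypothetical takeover    */

    void Backup_Controller_Loop(void)
    {
        time_t last_response = time(NULL);

        while (1) {
            sleep(Heartbeat_Interval);
            if (Ping_Primary_Controller() == 0) {
                last_response = time(NULL);       /* primary is healthy */
            } else if (time(NULL) - last_response >= Controller_Timeout) {
                Assume_Control_Duties();          /* primary presumed down */
                return;   /* continue with the normal controller startup */
            }
        }
    }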

The Job Initiator will contact the ControlDaemon in order to be allocated appropriate resources, including authorization for interconnect use. The Job Initiator itself will be responsible for distributing the program, environment variables, identification of the current directory, standard input, etc. Standard output and standard error from the program will be transmitted to the Job Initiator. Should the Job Initiator terminate prior to the parallel job's termination (for example, if the node fails), the ControlDaemon will initiate a new Job Initiator. While the new Job Initiator will not be capable of transmitting additional standard input data, it will log the standard output and error data.

ServerDaemon's Job Shepherd will initiate the user program's tasks and monitor their state. The ServerDaemon will also monitor and report overall node state information periodically to the ControlDaemon. Should any node associated with a user task fail (ServerDaemon fails to respond within ServerTimeout), the entire application will be terminated by the Job Initiator.

Controller Details

The controller is the overall manager of SLURM activities. For scalability, the controller code is multi-threaded. Upon initiation, the controller reads the SLURM configuration files: /etc/SLURM.conf (overall SLURM configuration), plus node and partition configurations as described in the SLURM Administrator's Guide. SLURM is designed to support thousands of nodes; to locate node records quickly, it uses a hash table. Several different hashing schemes are supported based upon the node name. Each table entry can be directly accessed without any searching if the name contains a sequence number suffix. SLURM can be built with HASH_BASE set to indicate the hashing algorithm. Possible values are "10" and "8" for names containing decimal or octal sequence numbers, or "0" which processes mixed alpha-numeric names without sequence numbers. HASH_BASE is defined in the Mach_Stat_Mgr.c module. If you use a naming convention lacking a sequence number, it may be desirable to review the hashing function Hash_Index in the Mach_Stat_Mgr.c module.
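
For illustration, a hash on a trailing sequence number might look like the sketch below. This is not the actual Hash_Index code from Mach_Stat_Mgr.c; the table size is an assumed value and the fallback for names without sequence numbers is only one possibility.

    /* Illustrative sketch of hashing on a node name's sequence number suffix;
     * not the actual Hash_Index code from Mach_Stat_Mgr.c. */
    #include <ctype.h>
    #include <string.h>

    #define HASH_BASE  10      /* 10 or 8 for decimal/octal suffix, 0 for none */
    #define TABLE_SIZE 1024    /* assumed hash table size for this sketch */

    int Hash_Index_Sketch(const char *node_name)
    {
        const char *p;
        int index = 0;

        if (HASH_BASE == 10 || HASH_BASE == 8) {
            /* Hash on the trailing sequence number, e.g. "lx042" -> 42 */
            p = node_name + strlen(node_name);
            while (p > node_name && isdigit((unsigned char)p[-1]))
                p--;
            for (; *p; p++)
                index = index * HASH_BASE + (*p - '0');
        } else {
            /* No sequence number: fold all characters of the name */
            for (p = node_name; *p; p++)
                index = index * 31 + (unsigned char)*p;
        }
        if (index < 0)
            index = -index;
        return index % TABLE_SIZE;
    }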

The controller will then load the last known node, partition, and job state information from the primary or secondary backup locations. This state recovery mechanism facilitates the recovery process, especially if the control machine changes. Each SLURM machine is then requested to send current state information. State is saved on a periodic basis from that point forward, based upon the interval and file name specifications identified in the SLURM configuration file. Both primary and secondary intervals and files can be configured. Ideally the primary and secondary backup files will be written to distinct file systems and/or devices for greater fault tolerance. Upon receipt of a shutdown request, the controller will save state to both the primary and backup files and terminate.
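
A minimal sketch of the dual-location save cycle follows; the configuration variables and the Dump_State routine are hypothetical placeholders for whatever the controller actually uses.

    /* Sketch of periodic state saves to primary and secondary locations.
     * The configuration fields and Dump_State routine are hypothetical. */
    #include <time.h>

    extern int   Primary_Save_Interval, Secondary_Save_Interval;  /* seconds */
    extern char *Primary_State_File, *Secondary_State_File;
    extern int   Dump_State(const char *file_name);               /* 0 on success */

    void Save_State_If_Due(time_t now, time_t *last_primary, time_t *last_secondary)
    {
        if ((now - *last_primary) >= Primary_Save_Interval &&
            Dump_State(Primary_State_File) == 0)
            *last_primary = now;
        if ((now - *last_secondary) >= Secondary_Save_Interval &&
            Dump_State(Secondary_State_File) == 0)
            *last_secondary = now;
    }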

At this point, the controller enters a reactive mode. Node and job state information is logged when received, requests for getting and/or setting state information are processed, resources are allocated to jobs, etc.

The allocation of resources to jobs is fairly complex. When a job initiation request is received, a record is made of each partition that might be used to satisfy the request. Each available node is then checked for possible use, which involves many tests.

The node selection process can have a great influence upon job performance with some interconnects. If SLURM is built with INTERCONNECT defined as QUADRICS, the selection process will build a list of all possible nodes. The nodes are selected so as to allocate the smallest set of consecutive nodes satisfying the request. If no single set of consecutive nodes satisfies the request, the smallest number of such sets will be allocated to the job. If INTERCONNECT is not defined as QUADRICS, the node selection process is much faster. As soon as sufficient resources have been identified which can satisfy the request, the allocation is made and the selection process ends.

The controller expects each SLURM Job Shepherd (on the compute servers) to report its state every ServerTimeout seconds. If it fails to do so, the node will have its state set to DOWN and no further jobs will be scheduled on that node until it reports a valid state. The controller will also send a state request message to the wayward node. The controller collects node and job resource use information. When a job has reached its prescribed time-limit, its termination is initiated through signals to the appropriate Job Shepherds.

The controller also reports its state to the backup controller (if any) at the HeartbeatInterval. If the backup controller has not received any state information from the primary controller in ControllerTimeout seconds, it begins to provide controller functions using an identical startup process. When the primary controller resumes operation, it notifies the backup controller to save state and terminate, waits for the backup controller to notify the primary controller of termination (or waits for the HeartbeatInterval if no response), reads the saved state files, and resumes operation.

The controller, like all other SLURM daemons, logs all significant activities using the syslog function. This not only identifies the event, but also its significance.
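
For example, an event might be recorded through the standard syslog(3) interface as sketched below; the program identifier and message text are illustrative only, not taken from the SLURM sources.

    /* Minimal example of event logging via syslog(3); the program name and
     * message text are illustrative, not taken from the SLURM sources. */
    #include <syslog.h>

    void Log_Examples(void)
    {
        openlog("slurm.controller", LOG_PID, LOG_DAEMON);
        syslog(LOG_INFO, "node %s registered with %d processors", "lx03", 2);
        syslog(LOG_ERR,  "node %s failed to respond, setting state DOWN", "lx07");
        closelog();
    }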

Job Shepherd

The job shepherd is a relatively light-weight daemon. It too is multi-threaded and performs five primary functions:
  1. Initiate jobs
  2. Manage running jobs
  3. Monitor job state
  4. Monitor system state
  5. Forward authenticated user and administrator requests to the controller
The job shepherd, as its name implies, is primarily responsible for managing the tasks of a user job. When a request to initiate a job is received, the job's environment is established, the executable and standard-input files are received, the interconnect is configured and allocated, the prolog is executed, and the executable is forked and executed.
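
The launch step itself follows the usual Unix pattern. The sketch below shows only the fork/exec portion, with standard I/O attached to descriptors assumed to be already connected back to the Job Initiator; environment setup, the prolog, and interconnect allocation are omitted.

    /* Sketch of the task launch step only: fork a task with its standard I/O
     * attached to descriptors already connected to the Job Initiator. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    pid_t Spawn_Task(char *const argv[], int stdin_fd, int stdout_fd, int stderr_fd)
    {
        pid_t pid = fork();

        if (pid == 0) {                  /* child: becomes the user task */
            dup2(stdin_fd,  STDIN_FILENO);
            dup2(stdout_fd, STDOUT_FILENO);
            dup2(stderr_fd, STDERR_FILENO);
            setsid();                    /* own session, so usage can later be
                                            coalesced by session ID */
            execv(argv[0], argv);
            perror("execv");             /* reached only if the exec fails */
            _exit(127);
        }
        return pid;                      /* parent: the Job Shepherd keeps the pid */
    }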

While the job is running, standard-output and standard-error are collected and reported back to the Job Initiator. Signals sent to the job from the controller (e.g. time-limit enforcement) or from the Job Initiator (e.g. user initiated termination) are forwarded.

The job shepherd collects resource use information for all processes on the node.

This data is then coalesced by session ID for all sessions, not only those which can be associated with the running job (e.g. kernel resource use, idle time, system daemon time, interactively initiated jobs, and multiple parallel jobs if SLURM is so configured). This data is reported to the controller every HeartbeatInterval seconds. The job shepherd is stateless and maintains no record of past resource use (unlike the controller). If there are no executing jobs, system state information (e.g. kernel resource use, idle time, system daemon time) is still reported.
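
The coalescing step can be sketched as follows, assuming per-process usage records have already been read from the process table (as Read_Proc.c does). The record layouts and field names here are hypothetical.

    /* Sketch of coalescing per-process usage records by session ID.
     * The record layouts are hypothetical; the real data would come from
     * the process table via Read_Proc.c. */
    #include <sys/types.h>

    struct Proc_Usage {                 /* hypothetical per-process record */
        pid_t         Session_Id;
        unsigned long CPU_Seconds;
        unsigned long RSS_Pages;
    };

    struct Session_Usage {              /* hypothetical per-session summary */
        pid_t         Session_Id;
        unsigned long CPU_Seconds;
        unsigned long RSS_Pages;
    };

    /* Sum usage into Sessions[]; returns the number of distinct sessions. */
    int Coalesce_By_Session(const struct Proc_Usage *Procs, int Proc_Count,
                            struct Session_Usage *Sessions, int Max_Sessions)
    {
        int i, j, Session_Count = 0;

        for (i = 0; i < Proc_Count; i++) {
            for (j = 0; j < Session_Count; j++)
                if (Sessions[j].Session_Id == Procs[i].Session_Id)
                    break;
            if (j == Session_Count) {
                if (Session_Count == Max_Sessions)
                    continue;           /* table full; skip in this sketch */
                Sessions[j].Session_Id  = Procs[i].Session_Id;
                Sessions[j].CPU_Seconds = 0;
                Sessions[j].RSS_Pages   = 0;
                Session_Count++;
            }
            Sessions[j].CPU_Seconds += Procs[i].CPU_Seconds;
            Sessions[j].RSS_Pages   += Procs[i].RSS_Pages;
        }
        return Session_Count;
    }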

The job shepherd accepts connections from the SLURM administrative tool and Job Initiators. It can then confirm the identity of the user executing the command and forward the authenticated request to the control machine. Responses to the request from the control machine are forwarded as needed.

Communications Summary

BackupController pings ControlDaemon periodically and assumes control after ControllerTimeout. When there is a change in the node on which the ControlDaemon executes, all SLURM daemons are notified in order to route their messages appropriately.

ControlDaemon collects state information from ServerDaemon. If there have been no communications for a while, it pings the ServerDaemon. If there is no response within ServerTimeout, the node is considered DOWN and unavailable for use. The appropriate Job Initiator is also notified in order to terminate the job. The ControlDaemon also processes administrator and user requests.

The ServerDaemon waits for work requests from the Job Initiators. It spawns user tasks as required. It transfers standard input, output, and error as required. It reports job and system state information as requested by the Job Initiator and ControlDaemon.

Authentication and Authorization

I am inclined to have the administrator tool and job initiator work through a SLURM daemon. The SLURM daemon can confirm the identity of the user and forward the communications through low-numbered sockets. This eliminates the authentication problems without introducing the complexity of Kerberos or PKI, which I would really like to avoid. - Moe

Code Modules

Controller.c
Primary SLURM daemon to execute on control machine. It manages the Partition Manager, Node Manager, and Job Manager threads.
Get_Mach_Stat.c
Module gets the machine's status and configuration. This includes: operating system version, size of real memory, size of virtual memory, size of /tmp disk storage, number of processors, and speed of processors. This is a module of the Job Shepherd component. (A portable sketch of this kind of status gathering appears after this list.)
list.c
Module is a general purpose list manager. One can define a list, add and delete entries, search for entries, etc. This module is used by multiple SLURM components.
list.h
Module contains definitions for list.c and documentation for its functions.
Mach_Stat_Mgr.c
Module reads, writes, records, updates, and otherwise manages the state information for all nodes (machines) in the cluster managed by SLURM. This module performs much of the Node Manager component functionality.
Partition_Mgr.c
Module reads, writes, records, updates, and otherwise manages the state information associated with partitions in the cluster managed by SLURM. This module is the Partition Manager component.
Read_Config.c
Module reads overall SLURM configuration file.
Read_Proc.c
Module reads system process table state. Used to determine job state including resource usage.
Slurm_Admin.c
Administration tool for reading, writing, and updating SLURM configuration.
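
As a point of reference for Get_Mach_Stat.c (described above), much of the node status it reports can be gathered through standard interfaces. The sketch below uses uname(2), sysconf(3), and statvfs(3) where the system provides them; it is not the actual module, and processor speed is omitted.

    /* Portable sketch of the kind of status gathering Get_Mach_Stat.c performs;
     * not the actual module.  Uses uname(2), sysconf(3), and statvfs(3) where
     * the system provides them; processor speed is omitted here. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/utsname.h>
    #include <sys/statvfs.h>

    int Print_Mach_Stat_Sketch(void)
    {
        struct utsname uts;
        struct statvfs tmp_fs;
        long page_size, phys_pages, cpus;

        if (uname(&uts) != 0)
            return -1;
        page_size  = sysconf(_SC_PAGESIZE);
        phys_pages = sysconf(_SC_PHYS_PAGES);        /* not available everywhere */
        cpus       = sysconf(_SC_NPROCESSORS_ONLN);  /* not available everywhere */

        printf("OS=%s %s\n", uts.sysname, uts.release);
        printf("Procs=%ld\n", cpus);
        printf("RealMemory=%ld MB\n", (phys_pages / 1024) * (page_size / 1024));
        if (statvfs("/tmp", &tmp_fs) == 0)
            printf("TmpDisk=%lu MB\n", (unsigned long)
                   ((tmp_fs.f_bavail / 1024) * (tmp_fs.f_frsize / 1024)));
        return 0;
    }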

Design Issues

Most modules are constructed with some simple, built-in tests. Set both the DEBUG_MODULE and DEBUG_SYSTEM declarations near the top of the module's code to 1. Then compile and run the test. Required input scripts and configuration files for these tests are kept in the "etc" subdirectory and the commands to execute the tests are in the "Makefile". In some cases, the module must be linked with some other components. In those cases, the support modules should be built with the declaration for DEBUG_MODULE set to 0 and for DEBUG_SYSTEM set to 1.
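
The convention looks roughly like the following; the exact arrangement inside each module may differ.

    /* Illustration of the built-in test convention described above; the
     * actual modules may arrange their test code somewhat differently. */
    #define DEBUG_MODULE 1   /* 1 builds this module's stand-alone test driver */
    #define DEBUG_SYSTEM 1   /* 1 enables the module's debugging support       */

    /* ... normal module code here ... */

    #if DEBUG_MODULE
    #include <stdio.h>
    int main(int argc, char **argv)
    {
        /* Exercise the module's functions, driven by the input scripts and
         * configuration files kept in the "etc" subdirectory. */
        printf("module self-test complete\n");
        return 0;
    }
    #endif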

Many of these modules have been built and tested on a variety of Unix computers including Red Hat Linux, IBM's AIX, Sun's Solaris, and Compaq's Tru64. The only module which is operating system dependent at this time is Get_Mach_Stat.c.

The node selection logic allocates nodes to jobs in a fashion which makes most sense for a Quadrics switch interconnect. It allocates the smallest collection of consecutive nodes that satisfies the request (e.g. if there are 32 consecutive nodes and 16 consecutive nodes available, a job needing 16 or fewer nodes will be allocated those nodes from the 16 node set rather than fragment the 32 node set). If the job can not be allocated consecutive nodes, it will be allocated the smallest number of consecutive sets (e.g. if there are sets of available consecutive nodes of sizes 6, 4, 3, 3, 2, 1, and 1 then a request for 10 nodes will always be allocated the 6 and 4 node sets rather than use the smaller sets).
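
A greedy sketch of this selection policy is shown below, operating on a simple 0/1 map of idle nodes. It reproduces the behavior of both examples above but is not the controller's actual implementation.

    /* Greedy sketch of best-fit selection over sets of consecutive idle nodes.
     * Idle[] holds 1 for an available node, 0 otherwise; selected nodes are
     * marked 0.  This illustrates the policy, not the controller's code. */

    static int Take_From_Set(int *Idle, int Start, int Length, int Needed)
    {
        int i, Take = (Length < Needed) ? Length : Needed;

        for (i = 0; i < Take; i++)
            Idle[Start + i] = 0;
        return Needed - Take;
    }

    int Best_Fit_Allocate(int *Idle, int Node_Count, int Needed)
    {
        while (Needed > 0) {
            int i, Start = -1, Len = 0, Best_Start = -1, Best_Len = 0;

            /* Find the smallest consecutive idle set that satisfies the request,
             * or the largest available set if none is big enough. */
            for (i = 0; i <= Node_Count; i++) {
                if (i < Node_Count && Idle[i]) {
                    if (Len++ == 0)
                        Start = i;
                } else if (Len > 0) {
                    int Fits = (Len >= Needed), Best_Fits = (Best_Len >= Needed);
                    if (Best_Len == 0 ||
                        (Fits && (!Best_Fits || Len < Best_Len)) ||
                        (!Fits && !Best_Fits && Len > Best_Len)) {
                        Best_Start = Start;
                        Best_Len   = Len;
                    }
                    Len = 0;
                }
            }
            if (Best_Len == 0)
                return -1;                  /* insufficient idle nodes */
            Needed = Take_From_Set(Idle, Best_Start, Best_Len, Needed);
        }
        return 0;
    }

With available sets of sizes 6, 4, 3, 3, 2, 1, and 1 and a request for 10 nodes, the first pass takes the 6-node set (nothing fits) and the second pass takes the 4-node set (the smallest fitting set), matching the example above.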

We have tried to develop the SLURM code to be quite general and flexible, but compromises were made in several areas for the sake of simplicity and ease of support. Entire nodes are dedicated to user applications. Our customers at LLNL have expressed the opinion that sharing of nodes can severely reduce their jobs' performance and even reliability. This is due to contention for shared resources such as local disk space, real memory, virtual memory, and processor cycles. The proper support of shared resources, including the enforcement of limits on these resources, entails a substantial amount of additional effort. Given this cost-to-benefit trade-off at LLNL, we have decided not to support shared nodes. However, we have designed SLURM so as not to preclude the addition of such a capability at a later time if so desired.

Application Program Interface (API)

All functions described below can be issued from any node in the SLURM cluster.
int Allocate_Resources(char *Job_Spec);
Allocate resources for the job with the specification Job_Spec. This call can only be successfully executed by user root. Returns -2 if Job_Spec can not be successfully parsed. Returns -1 if the job can not be initiated given current SLURM configuration. Returns 0 if the job can not presently be initiated due to busy nodes. Returns a SLURM job ID greater than zero if resources are allocated.
Get_Acctg_Info(TBD);
Return job and system accounting information. This function has yet to be defined.
int Deallocate_Resources(int Job_Id);
Deallocate the resources associated with the specified SLURM Job_Id. This call can only be successfully executed by user root. If there is an active job associated with this resource allocation, it will be terminated. Returns zero or an error code. Possible error codes include: TBD.
int Get_Build_Info(char *Info_Req, char **Build_Info);
Return SLURM build information. Specify the names of configuration parameters requested in the string Info_Req. All configuration information is returned if the length of Info_Req is zero. The keywords and values are returned in the buffer Build_Info using the format "keyword=value" with white-space between each pair. The buffer Build_Info is created or its size changed as needed. The application is responsible for setting *Build_Info to NULL initially and executing "free(*Build_Info)" when the buffer is no longer needed. Returns an error code or zero if no error. Possible error codes include: TBD.
int Get_Job_Info(time_t *Last_Update, int *Version_Job_Record, struct Job_Record *Job_Info, int *Job_Records);
Load into the buffer Job_Info the current job state information only if changed since Last_Update. The buffer Job_Info is created or its size changed as needed. The application is responsible for setting *Job_Info to NULL initially and executing "free(*Job_Info)" when the buffer is no longer needed. The value of Last_Update is set with the time of last update. The value of Version_Job_Record is set with the version number of the structure format. The value of Job_Records is set with the count of records returned. Version_Job_Record can be checked by the application to insure it is built with the appropriate structure format. Returns an error code or zero if no error. Possible error codes include: TBD.
int Get_Key(int *key);
Load into the location key the value of an authorization key. This key can be used as part of a job specification (see Job_Spec in the Run_Job and Will_Job_Run functions) to grant access to partitions with access restrictions. This call can only be successfully executed by user root. The key can only be used once to initiate a job. A key that has been issued and not utilized in KEY_TIMEOUT seconds (defined at SLURM build time) will be revoked. Returns an error code or zero if no error. Possible error codes include: TBD.
int Get_Node_Info(time_t *Last_Update, int *Version_Node_Record, struct Node_Record *Node_Info, int *Node_Records);
Load into the buffer Node_Info the current node state information only if changed since Last_Update. The buffer Node_Info is created or its size changed as needed. The application is responsible for setting *Node_Info to NULL initially and executing "free(*Node_Info)" when the buffer is no longer needed. The value of Last_Update is set with the time of last update. The value of Version_Node_Record is set with the version number of the structure format. The value of Node_Records is set with the count of records returned. Version_Node_Record can be checked by the application to insure it is built with the appropriate structure format. Returns an error code or zero if no error. Possible error codes include: TBD.
int Get_Part_Info(time_t *Last_Update, int *Version_Part_Record, struct Part_Record *Part_Info, int *Part_Records);
Load into the buffer Part_Info the current partition state information only if changed since Last_Update. The buffer Part_Info is created or its size changed as needed. The application is responsible for setting *Part_Info to NULL initially and executing "free(*Part_Info)" when the buffer is no longer needed. The value of Last_Update is set with the time of last update. The value of Version_Part_Record is set with the version number of the structure format. The value of Part_Records is set with the count of records returned. Version_Part_Record can be checked by the application to insure it is built with the appropriate structure format. Returns an error code or zero if no error. Possible error codes include: TBD.
int Kill_Job(int Job_Id);
Terminate the specified SLURM job. The SIGTERM signal is sent to task zero of the job followed by SIGKILL to all processes KILL_WAIT seconds later. KILL_WAIT is specified at SLURM build time. This command can only be issued by user root or the user whose job is specified by Job_Id. The Kill_Job request must succeed in removing the job record and releasing its nodes for re-use even if one or more of the nodes allocated to the job is not responding. The job will be terminated on that node when it returns to service. Returns zero or an error code. Possible error codes include: TBD.
int NodeBitMap2List(char **NodeList, char *BitMap, time_t BitMapTime);
Translate the supplied node BitMap into its equivalent NodeList. The calling program must execute free(NodeList[0]) to release allocated memory. A time stamp associated with the BitMap is supplied in order to invalidate old BitMaps when the nodes defined to SLURM change. Returns zero or an error code. Possible error codes include: TBD.
int NodeList2BitMap(char *NodeList, char **BitMap, time_t *BitMapTime);
Translate the supplied NodeList string into its equivalent BitMap. The calling program must execute free(BitMap[0]) to release allocated memory. A time stamp associated with the BitMap is returned in order to invalidate old BitMaps when the nodes defined to SLURM change. Returns zero or an error code. Possible error codes include: TBD.
int Reconfigure(char *NodeList);
The SLURM daemons on the specified nodes will re-read the configuration file. NodeList contains a comma separated list of nodes. All nodes are reconfigured if NodeList has zero length. This command can only be issued by user root. Returns zero or an error code. Possible error codes include: TBD.
int Run_Job(char *Job_Spec);
Initiate the job with the specification Job_Spec. Returns -2 if Job_Spec can not be successfully parsed. Returns -1 if the job can not be initiated given current SLURM configuration. Returns 0 if the job can not presently be initiated due to busy nodes. Returns a SLURM job ID greater than zero if the job is being initiated.
int Signal_Job(int Job_Id, int Signal);
Send the specified signal to the specified SLURM job. The signal is sent only to task zero of the job. This command can only be issued by user root or the user whose job is specified by Job_Id. Returns zero or an error code. Possible error codes include: TBD.
int Transfer_Resources(pid_t Pid, int Job_Id);
Transfer the ownership of resources associated with the specified SLURM Job_Id to the indicated process. This call can only be successfully executed by user root. Returns zero or an error code. Possible error codes include: TBD.
int Update(char *Config_Spec);
Update the SLURM configuration per Config_Spec. The format of Config_Spec is identical to that of the SLURM configuration file as described in the SLURM Administrator's Guide. This command can only be issued by user root. Returns zero or an error code. Possible error codes include: TBD.
int Upload(char *NodeList);
Upload into the SLURM node configuration table the configuration as actually reported by SERVER_DAEMON on each node (memory, CPU count, temporary disk, etc.). This could be used to establish a baseline configuration rather than entering the configurations manually into a file. Information from all nodes is uploaded if NodeList has zero length. This command can only be issued by user root. Returns zero or an error code. Possible error codes include: TBD.
int Will_Job_Run(char *Job_Spec);
Determine if a job with the specification Job_Spec can be initiated. Returns -2 if Job_Spec can not be successfully parsed. Returns -1 if the job can not be initiated given current SLURM configuration. Returns 0 if the job can not presently be initiated due to busy nodes. Returns 1 if the job can be initiated immediately.

Examples of API Use

    /* Headers assumed for this fragment; "slurm.h" is a placeholder name
       for the SLURM API declarations, not an actual distribution file. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <sys/types.h>
    #include "slurm.h"

    char *Build_Info;
    int Error_Code, i, Job_Id, Signal;
    pid_t Proc_Id;
    time_t Last_Update;
    struct Job_Record  *Job_Info;
    struct Node_Record *Node_Info;
    struct Part_Record *Part_Info;
    int Job_Records, Node_Records, Part_Records;
    int Version_Job_Record, Version_Node_Record, Version_Part_Record;
    int Key;
    char Scratch[128];
    char *BitMap, *NodeList;

    Build_Info = NULL;
    Error_Code = Get_Build_Info("PROLOG", &Build_Info);
    if (Error_Code != 0) 
        printf("Error %d executing Get_Build_Info for PROLOG\n", Error_Code);
    else
	printf("Get_Build_Info for PROLOG returns %s\n", Build_Info[0]);
    Error_Code = Get_Build_Info("", Build_Info);
    if (Error_Code != 0) 
        printf("Error %d executing Get_Build_Info for everything\n", Error_Code);
    else
	printf("Get_Build_Info for everything returns %s\n", Build_Info[0]);
    free(Build_Info[0]);

    Last_Update = (time_t) 0;
    Job_Info = (struct Job_Record *)NULL;
    Error_Code = Get_Job_Info(&Last_Update, &Version_Job_Record, &Job_Info, &Job_Records);
    if (Error_Code != 0) 
        printf("Error %d executing Get_Job_Info\n", Error_Code);
    else if (Version_Job_Record != JOB_STRUCT_VERSION) 
        printf("Get_Job_Info returned version %d, expected version %d\n", Version_Job_Record, JOB_STRUCT_VERSION);
    else {
        printf("Get_Job_Info returned %d records\n", Job_Records);
        for (i=0; i<Job_Records; i++) {
            printf("Job_Id=%d\n", Job_Info[i].Job_Id);
        } /* for */
    } /* else */
    free(Job_Info);

    Error_Code = Get_Key(&Key);
    if (Error_Code != 0) 
        printf("Error %d executing Get_Key\n", Error_Code);
    else 
        printf("Get_Key value is %d\n", Key);

    Last_Update = (time_t) 0;
    Node_Info = (struct Node_Record *)NULL;
    Error_Code = Get_Node_Info(&Last_Update, &Version_Node_Record, &Node_Info, &Node_Records);
    if (Error_Code != 0) 
        printf("Error %d executing Get_Node_Info\n", Error_Code);
    else if (Version_Node_Record != NODE_STRUCT_VERSION) 
        printf("Get_Node_Info returned version %d, expected version %d\n", Version_Node_Record, NODE_STRUCT_VERSION);
    else {
        printf("Get_Node_Info returned %d records\n", Node_Records);
        for (i=0; i<Node_Records; i++) {
            printf("NodeName=%s\n", Node_Info[i].Name);
        } /* for */
    } /* else */
    free(Node_Info);

    Last_Update = (time_t) 0;
    Part_Info = (struct Part_Record *)NULL;
    Error_Code = Get_Part_Info(&Last_Update, &Version_Part_Record, &Part_Info, &Part_Records);
    if (Error_Code != 0) 
        printf("Error %d executing Get_Part_Info\n", Error_Code);
    else if (Version_Part_Record != PART_STRUCT_VERSION) 
        printf("Get_Part_Info returned version %d, expected version %d\n", Version_Part_Record, PART_STRUCT_VERSION);
    else {
        printf("Get_Part_Info returned %d records\n", Part_Records);
        /* Format TBD */
    } /* else */
    free(Part_Info);

    printf("Enter SLURM Job_Id of job to be killed: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Job_Id = atoi(Scratch);
    Error_Code = Kill_Job(Job_Id);
    if (Error_Code != 0) 
        printf("Error %d executing Kill_Job on job %d\n", Error_Code, Job_Id);

    printf("Enter name of node to reconfigure: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Error_Code = Reconfigure(Scratch);
    if (Error_Code != 0) 
        printf("Error %d executing Reconfigure on node %s\n", Error_Code, Scratch);

    strcpy(Scratch, "lx[01-10]");
    Error_Code = NodeList2BitMap(Scratch, &BitMap, &Last_Update);
    if (Error_Code != 0) 
        printf("Error %d executing NodeList2BitMap on nodes %s\n", Error_Code, Scratch);

    Error_Code = NodeBitMap2List(&NodeList, BitMap, Last_Update);
    if (Error_Code != 0) 
        printf("Error %d executing NodeBitMap2List on nodes %s\n", Error_Code, Scratch);
    else {
	printf("NodeBitMap2List returned %s, expected %s\n", NodeList, Scratch);
	free(BitMap);
	free(NodeList);
    } /* else */

    printf("Enter job specification: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Error_Code = Will_Job_Run(Scratch);
    if (Error_Code <= 0) 
        printf("Error %d executing Will_Job_Run on specification %s\n", Error_Code, Scratch);
    Job_Id = Run_Job(Scratch);
    if (Job_Id <= 0) 
        printf("Error %d executing Run_Job on specification %s\n", Job_Id, Scratch);

    Job_Id = Allocate_Resources(Scratch);
    if (Job_Id <= 0) 
        printf("Error %d executing Allocate_Resources on specification %s\n", Job_Id, Scratch);
    else
        printf("Allocate_Resources assigned Job ID %d with specification %s\n", Job_Id, Scratch);

    printf("Enter process ID of process to be given the allocated resources: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Proc_Id = atoi(Scratch);
    Error_Code = Transfer_Resources(Proc_Id, Job_Id);
    if (Error_Code != 0) 
        printf("Error %d executing Transfer_Resources on Job ID %d to Proc ID %d\n", Error_Code, Job_Id, Proc_Id);

    Error_Code = Deallocate_Resources(Job_Id);
    if (Error_Code != 0) 
        printf("Error %d executing Deallocate_Resources on Job ID %d\n", Error_Code, Job_Id);

    printf("Enter SLURM Job_Id of job to be signalled: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Job_Id = atoi(Scratch);
    printf("Enter signal number: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Signal = atoi(Scratch);
    Error_Code = Signal_Job(Job_Id, Signal);
    if (Error_Code != 0) 
        printf("Error %d executing Signal_Job on job %d and signal %d\n", Error_Code, Job_Id, Signal);

    printf("Enter configuration update specification: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Error_Code = Update(Scratch);
    if (Error_Code != 0) 
        printf("Error %d executing Update on specification %s\n", Error_Code, Scratch);

    printf("Enter name of node to upload state from: ");
    fgets(Scratch, sizeof(Scratch), stdin);
    Error_Code = Upload(Scratch);
    if (Error_Code != 0) 
        printf("Error %d executing Upload on node %s\n", Error_Code, Scratch);

To Do


URL = http://www-lc.llnl.gov/dctg-lc/slurm/programmer.guide.html

Last Modified February 12, 2002

Maintained by slurm-dev@lists.llnl.gov