Simple Linux Utility for Resource Management

SLURM Switch Plugin API

Overview

This document describes SLURM switch (interconnect) plugins and the API that defines them. It is intended as a resource for programmers wishing to write their own SLURM switch plugins. This is version 0 of the API. Note that many of the API functions are used by only one of the daemons. For example, the slurmctld daemon builds a job step's switch credential (switch_p_build_jobinfo), while the slurmd daemon enables and disables that credential for the job step's tasks on a particular node (switch_p_job_init, etc.).

SLURM switch plugins are SLURM plugins that implement the SLURM switch or interconnect API described herein. They must conform to the SLURM Plugin API with the following specifications:

const char plugin_type[]
The major type must be "switch." The minor type can be any recognizable abbreviation for the type of switch. We recommend, for example:

  • none: A plugin that implements the API without providing any actual switch service. This is the case for Ethernet and Myrinet interconnects.
  • elan: Quadrics Elan3 or Elan4 interconnect.
  • federation: IBM Federation interconnects (presently under development).

The plugin_name and plugin_version symbols required by the SLURM Plugin API require no specialization for switch support. Note carefully, however, the versioning discussion below.
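
As an illustration, the identifying symbols of a switch plugin might be declared as shown here. This is a minimal sketch under stated assumptions: the minor type "example", the name string, the version value, and the helper is_switch_plugin_type() are placeholders invented for illustration and are not taken from SLURM. The "major/minor" form of plugin_type follows the convention of the SLURM Plugin API.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder values, for illustration only. */
const char plugin_name[] = "Example switch plugin";
const char plugin_type[] = "switch/example";  /* major "switch", hypothetical minor */
const uint32_t plugin_version = 1;

/* Hypothetical helper: returns 1 if a type string has the major type "switch". */
int is_switch_plugin_type(const char *type)
{
    return strncmp(type, "switch/", 7) == 0;
}
```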

The programmer is urged to study src/plugins/switch/switch_elan.c and src/plugins/switch/switch_none.c for sample implementations of a SLURM switch plugin.

Data Objects

The implementation must support two opaque data classes. One is used as a job's switch "credential." This class must encapsulate all job-specific information necessary for the operation of the API specification below. The second is a node's switch state record. Both data classes are referred to in SLURM code using an anonymous pointer (void *).

The implementation must maintain (though not necessarily directly export) an enumerated errno that allows SLURM to discover, as far as is practical, the reason for any failed API call. Plugin-specific enumerated integer values should be used when appropriate. It is desirable that these values be mapped into the range between ESLURM_SWITCH_MIN and ESLURM_SWITCH_MAX as defined in slurm/slurm_errno.h. The error number should be returned by the function switch_p_get_errno(), and it can be converted to an appropriate string description using the switch_p_strerror() function described below.

These values must not be used as return values in integer-valued functions in the API. The proper error return value from integer-valued functions is SLURM_ERROR. The implementation should endeavor to provide useful and pertinent information by whatever means is practical. In some cases this means an errno for each credential, since plugins must be re-entrant. If a plugin maintains a global errno in place of or in addition to a per-credential errno, it is not required to enforce mutual exclusion on it. Successful API calls are not required to reset any errno to a known value. However, the initial value of any errno, prior to any error condition arising, should be SLURM_SUCCESS.

API Functions

The following functions must appear. Functions which are not implemented should be stubbed.

Global Switch State Functions

int switch_p_libstate_save (char *dir_name);

Description: Save any global switch state to a file within the specified directory. The actual file name used is plugin specific. It is recommended that the global switch state contain a magic number for validation purposes. This function is called by the slurmctld daemon on shutdown. Note that if the slurmctld daemon fails, this function will not be called. The plugin may save state independently and/or make use of the switch_p_job_step_allocated function to restore state.

Arguments: dir_name    (input) fully-qualified pathname of a directory into which user SlurmUser (as defined in slurm.conf) can create a file and write state information into that file. Cannot be NULL.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_libstate_restore(char *dir_name, bool recover);

Description: Restore any global switch state from a file within the specified directory. The actual file name used is plugin specific. It is recommended that any magic number associated with the global switch state be verified. This function is called by the slurmctld daemon on startup.

Arguments:
dir_name    (input) fully-qualified pathname of a directory containing a state information file from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.
recover    (input) true if restarting with state preserved, false if no state recovery.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
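
The save and restore pair, including the recommended magic number check, could look roughly like the following. This is a sketch, not SLURM's implementation: the file name, the magic value, and the single-counter "state" are hypothetical, SLURM_SUCCESS and SLURM_ERROR are defined locally as stand-ins for the values in slurm/slurm_errno.h, and setting the plugin errno on failure is omitted for brevity.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SLURM_SUCCESS 0     /* local stand-ins for slurm/slurm_errno.h */
#define SLURM_ERROR  (-1)

#define DEMO_STATE_MAGIC 0x5357u              /* hypothetical magic number */
#define DEMO_STATE_FILE  "switch_demo_state"  /* hypothetical file name */

/* Hypothetical global switch state: a single counter. */
uint32_t demo_next_context = 0;

/* Build "<dir_name>/<file>" and fail if it does not fit. */
int demo_path(char *path, size_t size, const char *dir_name)
{
    int n = snprintf(path, size, "%s/%s", dir_name, DEMO_STATE_FILE);
    return (n < 0 || (size_t)n >= size) ? SLURM_ERROR : SLURM_SUCCESS;
}

int switch_p_libstate_save(char *dir_name)
{
    char path[4096];
    uint32_t rec[2] = { DEMO_STATE_MAGIC, demo_next_context };
    FILE *fp;
    int rc = SLURM_SUCCESS;

    if (dir_name == NULL ||
        demo_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "wb")) == NULL)
        return SLURM_ERROR;
    if (fwrite(rec, sizeof(rec[0]), 2, fp) != 2)
        rc = SLURM_ERROR;
    if (fclose(fp) != 0)
        rc = SLURM_ERROR;
    return rc;
}

int switch_p_libstate_restore(char *dir_name, bool recover)
{
    char path[4096];
    uint32_t rec[2];
    FILE *fp;
    size_t n;

    if (!recover) {                 /* no state recovery: start clean */
        demo_next_context = 0;
        return SLURM_SUCCESS;
    }
    if (dir_name == NULL ||
        demo_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "rb")) == NULL)
        return SLURM_ERROR;
    n = fread(rec, sizeof(rec[0]), 2, fp);
    fclose(fp);
    if (n != 2 || rec[0] != DEMO_STATE_MAGIC)  /* short or foreign file */
        return SLURM_ERROR;
    demo_next_context = rec[1];
    return SLURM_SUCCESS;
}
```

Writing the magic number first lets the restore path reject a truncated or unrelated file before any state is overwritten.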

int switch_p_libstate_clear (void);

Description: Clear switch state information.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

bool switch_p_no_frag(void);

Description: Report if resource fragmentation is important. If so, delay scheduling a new job while another is in the process of terminating.

Arguments: None

Returns: TRUE if job scheduling should be delayed while any other job is in the process of terminating.

Node's Switch State Monitoring Functions

Nodes will register with current switch state information when the slurmd daemon is initiated. The slurmctld daemon will also request that slurmd supply current switch state information on a periodic basis.

int switch_p_clear_node_state (void);

Description: Initialize node state. If any switch state has previously been established for a job, it will be cleared. This will be used to establish a "clean" state for the switch on the node upon which it is executed.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_alloc_node_info(switch_node_info_t *switch_node);

Description: Allocate storage for a node's switch state record. It is recommended that the record contain a magic number for validation purposes.

Arguments: switch_node    (output) location into which a pointer to the newly allocated node switch state record is written.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_build_node_info(switch_node_info_t switch_node);

Description: Fill in a previously allocated switch state record for the node on which this function is executed. It is recommended that the magic number be validated.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_pack_node_info (switch_node_info_t switch_node, Buf buffer);

Description: Pack the data associated with a node's switch state into a buffer for network transmission.

Arguments:
switch_node    (input) an existing node's switch state record.
buffer    (input/output) buffer onto which the switch state information is appended.

Returns: The number of bytes written should be returned if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_unpack_node_info (switch_node_info_t switch_node, Buf buffer);

Description: Unpack the data associated with a node's switch state record from a buffer.

Arguments:
switch_node    (input/output) a previously allocated node switch state record to be filled in with data read from the buffer.
buffer    (input/output) buffer from which the record's contents are read.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

void switch_p_free_node_info (switch_node_info_t switch_node);

Description: Release the storage associated with a node's switch state record.

Arguments: switch_node    (input/output) a previously allocated node switch state record.

Returns: None

char * switch_p_sprintf_node_info (switch_node_info_t switch_node, char *buf, size_t size);

Description: Print the contents of a node's switch state record to a buffer.

Arguments:
switch_node    (input) a node's switch state record.
buf    (input/output) pointer to buffer into which the switch state record is to be written.
size    (input) size of buf in bytes.

Returns: Location of buffer, same as buf.
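
The life cycle of a node's switch state record (allocate, build, print, free) might be sketched as follows. The record layout, its link_state field, and the magic value are hypothetical, and the status codes are local stand-ins. The pack and unpack functions are left out here because they depend on SLURM's Buf type.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SLURM_SUCCESS 0     /* local stand-ins for slurm/slurm_errno.h */
#define SLURM_ERROR  (-1)
#define DEMO_NODE_MAGIC 0x4e4fu   /* hypothetical magic number */

/* Hypothetical record layout; SLURM sees only the opaque pointer. */
struct demo_node_info {
    uint32_t magic;
    uint32_t link_state;          /* hypothetical: 1 means link up */
};
typedef struct demo_node_info *switch_node_info_t;

int switch_p_alloc_node_info(switch_node_info_t *switch_node)
{
    struct demo_node_info *n;

    if (switch_node == NULL)
        return SLURM_ERROR;
    n = calloc(1, sizeof(*n));
    if (n == NULL)
        return SLURM_ERROR;
    n->magic = DEMO_NODE_MAGIC;
    *switch_node = n;
    return SLURM_SUCCESS;
}

int switch_p_build_node_info(switch_node_info_t switch_node)
{
    if (switch_node == NULL || switch_node->magic != DEMO_NODE_MAGIC)
        return SLURM_ERROR;
    switch_node->link_state = 1;  /* a real plugin would query the hardware */
    return SLURM_SUCCESS;
}

char *switch_p_sprintf_node_info(switch_node_info_t switch_node,
                                 char *buf, size_t size)
{
    if (buf == NULL || size == 0)
        return buf;
    if (switch_node == NULL || switch_node->magic != DEMO_NODE_MAGIC)
        snprintf(buf, size, "(invalid)");
    else
        snprintf(buf, size, "link_state=%u",
                 (unsigned)switch_node->link_state);
    return buf;
}

void switch_p_free_node_info(switch_node_info_t switch_node)
{
    if (switch_node != NULL) {
        switch_node->magic = 0;   /* make stale pointers easier to detect */
        free(switch_node);
    }
}
```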

Job's Switch Credential Management Functions

int switch_p_alloc_jobinfo(switch_jobinfo_t *switch_job);

Description: Allocate storage for a job's switch credential. It is recommended that the credential contain a magic number for validation purposes.

Arguments: switch_job    (output) location into which a pointer to the newly allocated job switch credential is written.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_build_jobinfo (switch_jobinfo_t switch_job, char *nodelist, int *tasks_per_node, int cyclic_alloc, char *network);

Description: Build a job's switch credential. It is recommended that the credential's magic number be validated.

Arguments:
switch_job    (input/output) Job's switch credential to be updated
nodelist    (input) List of nodes allocated to the job. This may contain expressions to specify node ranges (e.g. "linux[1-20]" or "linux[2,4,6,8]").
tasks_per_node    (input) List of processes per node to be initiated as part of the job.
cyclic_alloc    (input) Non-zero if job's processes are to be allocated across nodes in a cyclic fashion (task 0 on node 0, task 1 on node 1, etc.). If zero, processes are allocated sequentially on a node before moving to the next node (tasks 0 and 1 on node 0, tasks 2 and 3 on node 1, etc.).
network    (input) Job's network specification from srun command.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
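
The difference between the two values of cyclic_alloc can be made concrete with a small mapping routine. The function demo_map_tasks() is a hypothetical helper written for this illustration and is not part of the API. It assumes tasks_per_node holds one count per node and that the caller provides an output array large enough for all tasks.

```c
/*
 * Fill node_of_task[t] with the index of the node that runs task t.
 * tasks_per_node[i] is the number of tasks to start on node i.
 * Returns the total number of tasks placed.
 */
int demo_map_tasks(const int *tasks_per_node, int nnodes,
                   int cyclic_alloc, int *node_of_task)
{
    int task = 0;

    if (!cyclic_alloc) {
        /* Sequential: fill each node before moving to the next. */
        for (int node = 0; node < nnodes; node++)
            for (int i = 0; i < tasks_per_node[node]; i++)
                node_of_task[task++] = node;
        return task;
    }

    /* Cyclic: one task per node per pass, skipping nodes that are full. */
    for (int pass = 0; ; pass++) {
        int placed = 0;
        for (int node = 0; node < nnodes; node++) {
            if (pass < tasks_per_node[node]) {
                node_of_task[task++] = node;
                placed = 1;
            }
        }
        if (!placed)
            break;
    }
    return task;
}
```

With two nodes and two tasks per node, the sequential layout places tasks 0 and 1 on node 0 and tasks 2 and 3 on node 1, while the cyclic layout alternates nodes 0, 1, 0, 1.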

switch_jobinfo_t switch_p_copy_jobinfo (switch_jobinfo_t switch_job);

Description: Allocate storage for a job's switch credential and copy an existing credential to that location.

Arguments: switch_job    (input) an existing job switch credential.

Returns: A newly allocated job switch credential containing a copy of the function argument.

void switch_p_free_jobinfo (switch_jobinfo_t switch_job);

Description: Release the storage associated with a job's switch credential.

Arguments: switch_job    (input) an existing job switch credential.

Returns: None

int switch_p_pack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);

Description: Pack the data associated with a job's switch credential into a buffer for network transmission.

Arguments:
switch_job    (input) an existing job switch credential.
buffer    (input/output) buffer onto which the credential's contents are appended.

Returns: The number of bytes written should be returned if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_unpack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);

Description: Unpack the data associated with a job's switch credential from a buffer.

Arguments:
switch_job    (input/output) a previously allocated job switch credential to be filled in with data read from the buffer.
buffer    (input/output) buffer from which the credential's contents are read.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
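
A pack and unpack pair for a credential might be structured as below. This sketch deliberately does not use SLURM's own Buf type or packing routines: struct demo_buf, demo_pack32() and demo_unpack32() are hypothetical stand-ins so that the example is self-contained, and the credential layout and magic value are invented. Values are written in network byte order so that the sender and receiver may differ in endianness.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

#define SLURM_SUCCESS 0     /* local stand-ins for slurm/slurm_errno.h */
#define SLURM_ERROR  (-1)
#define DEMO_JOB_MAGIC 0x4a4fu    /* hypothetical magic number */

/* Hypothetical stand-in for SLURM's Buf type. */
struct demo_buf {
    unsigned char data[64];
    size_t used;      /* bytes appended so far */
    size_t offset;    /* current read position */
};
typedef struct demo_buf *Buf;

/* Hypothetical credential layout. */
struct demo_jobinfo {
    uint32_t magic;
    uint32_t context_id;
};
typedef struct demo_jobinfo *switch_jobinfo_t;

int demo_pack32(uint32_t v, Buf b)
{
    uint32_t net = htonl(v);

    if (b->used + sizeof(net) > sizeof(b->data))
        return SLURM_ERROR;
    memcpy(b->data + b->used, &net, sizeof(net));
    b->used += sizeof(net);
    return SLURM_SUCCESS;
}

int demo_unpack32(uint32_t *v, Buf b)
{
    uint32_t net;

    if (b->offset + sizeof(net) > b->used)
        return SLURM_ERROR;
    memcpy(&net, b->data + b->offset, sizeof(net));
    b->offset += sizeof(net);
    *v = ntohl(net);
    return SLURM_SUCCESS;
}

int switch_p_pack_jobinfo(switch_jobinfo_t switch_job, Buf buffer)
{
    size_t start;

    if (buffer == NULL || switch_job == NULL ||
        switch_job->magic != DEMO_JOB_MAGIC)
        return SLURM_ERROR;
    start = buffer->used;
    if (demo_pack32(switch_job->magic, buffer) != SLURM_SUCCESS ||
        demo_pack32(switch_job->context_id, buffer) != SLURM_SUCCESS) {
        buffer->used = start;     /* drop any partial record */
        return SLURM_ERROR;
    }
    return (int)(buffer->used - start);   /* number of bytes written */
}

int switch_p_unpack_jobinfo(switch_jobinfo_t switch_job, Buf buffer)
{
    uint32_t magic, id;

    if (buffer == NULL || switch_job == NULL)
        return SLURM_ERROR;
    if (demo_unpack32(&magic, buffer) != SLURM_SUCCESS ||
        demo_unpack32(&id, buffer) != SLURM_SUCCESS ||
        magic != DEMO_JOB_MAGIC)
        return SLURM_ERROR;
    switch_job->magic = magic;
    switch_job->context_id = id;
    return SLURM_SUCCESS;
}
```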

int switch_p_get_jobinfo (switch_jobinfo_t switch_job, int data_type, void *data);

Description: Get some specific data from a job's switch credential.

Arguments:
switch_job    (input) a job's switch credential.
data_type    (input) identification as to the type of data requested. The interpretation of this value is plugin dependent.
data    (output) filled in with the desired data. The form of this data is dependent upon the value of data_type and the plugin.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_step_complete (switch_jobinfo_t switch_job, char *nodelist);

Description: Note that the job step associated with the specified nodes has completed execution.

Arguments:
switch_job    (input) The completed job's switch credential.
nodelist    (input) A list of nodes on which the job has completed. This may contain expressions to specify node ranges (e.g. "linux[1-20]" or "linux[2,4,6,8]").

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

void switch_p_print_jobinfo(FILE *fp, switch_jobinfo_t switch_job);

Description: Print the contents of a job's switch credential to a file.

Arguments:
fp    (input) pointer to an open file.
switch_job    (input) a job's switch credential.

Returns: None.

char *switch_p_sprint_jobinfo(switch_jobinfo_t switch_job, char *buf, size_t size);

Description: Print the contents of a job's switch credential to a buffer.

Arguments:
switch_job    (input) a job's switch credential.
buf    (input/output) pointer to buffer into which the job credential information is to be written.
size    (input) size of buf in bytes.

Returns: Location of buffer, same as buf.

int switch_p_get_data_jobinfo(switch_jobinfo_t switch_job, int key, void *resulting_data);

Description: Get data from a job's switch credential.

Arguments:
switch_job    (input) a job's switch credential.
key    (input) identification of the type of data to be retrieved from the switch credential. NOTE: The interpretation of this key is dependent upon the switch type.
resulting_data    (input/output) pointer to where the requested data should be stored.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
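
Because the interpretation of the key is left to the plugin, a key-dispatched accessor is one natural shape for this function. In the sketch below the keys, the credential fields, and the magic value are all hypothetical, and the status codes are local stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS 0     /* local stand-ins for slurm/slurm_errno.h */
#define SLURM_ERROR  (-1)
#define DEMO_JOB_MAGIC 0x4a4fu    /* hypothetical magic number */

/* Hypothetical keys; each plugin defines its own. */
enum demo_key {
    DEMO_KEY_CONTEXT_ID  = 1,
    DEMO_KEY_NUM_WINDOWS = 2
};

/* Hypothetical credential layout. */
struct demo_jobinfo {
    uint32_t magic;
    uint32_t context_id;
    uint32_t num_windows;
};
typedef struct demo_jobinfo *switch_jobinfo_t;

int switch_p_get_data_jobinfo(switch_jobinfo_t switch_job, int key,
                              void *resulting_data)
{
    if (switch_job == NULL || resulting_data == NULL ||
        switch_job->magic != DEMO_JOB_MAGIC)
        return SLURM_ERROR;

    switch (key) {
    case DEMO_KEY_CONTEXT_ID:
        *(uint32_t *)resulting_data = switch_job->context_id;
        return SLURM_SUCCESS;
    case DEMO_KEY_NUM_WINDOWS:
        *(uint32_t *)resulting_data = switch_job->num_windows;
        return SLURM_SUCCESS;
    default:
        return SLURM_ERROR;       /* key not known to this plugin */
    }
}
```

The caller and the plugin must agree on the type that resulting_data points to for each key, since the void pointer carries no type information.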

Node Specific Switch Management Functions

int switch_p_node_init (void);

Description: This function is run from the top level slurmd only once per slurmd run. It may be used, for instance, to perform some one-time interconnect setup or spawn an error handling thread.

Arguments: None

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_node_fini (void);

Description: This function is called once as slurmd exits (slurmd will wait for this function to return before continuing the exit process).

Arguments: None

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

Job Management Functions

=========================================================================
Process 1 (root)        Process 2 (root, user)  |  Process 3 (user task) 
                                                |                        
switch_p_job_preinit                            |                        
fork ------------------ switch_p_job_init       |                        
waitpid                 setuid, chdir, etc.     |                        
                        fork N procs -----------+--- switch_p_job_attach 
                        wait all                |    exec mpi process    
                        switch_p_job_fini*      |                        
switch_p_job_postfini                           |                        
=========================================================================

int switch_p_job_preinit (switch_jobinfo_t switch_job);

Description: Preinit is run as root in the first slurmd process, the so called job manager. This function can be used to perform any initialization that needs to be performed in the same process as switch_p_job_fini().

Arguments: switch_job    (input) a job's switch credential.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_init (switch_jobinfo_t switch_job, uid_t uid);

Description: Initialize interconnect on node for a job. This function is run from the second slurmd process. Some interconnect implementations (e.g. Quadrics Elan) may require switch_p_job_init() to be executed in a different process from the one executing switch_p_job_fini().

Arguments:
switch_job    (input) a job's switch credential.
uid    (input) the user id under which the job is to execute.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_attach ( switch_jobinfo_t switch_job, char ***env, uint32_t nodeid, uint32_t procid, uint32_t nnodes, uint32_t nprocs, uint32_t rank );

Description: Attach process to interconnect (Called from within the process, so it is appropriate to set interconnect specific environment variables here).

Arguments:
switch_job    (input) a job's switch credential.
env    (input/output) the environment variables to be set upon job initiation. Switch specific environment variables are added as needed.
nodeid    (input) zero-origin id of this node.
procid    (input) zero-origin process id local to slurmd and not equivalent to the global task id or MPI rank.
nnodes    (input) count of nodes allocated to this job.
nprocs    (input) total count of processes or tasks to be initiated for this job.
rank    (input) zero-origin id of this task.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
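
Adding a switch specific variable through the env argument might look like the following. This is a sketch under stated assumptions: the environment is taken to be a NULL-terminated array allocated with malloc (how slurmd actually manages this memory is not specified here), demo_env_append() is a hypothetical helper, the variable name DEMO_SWITCH_RANK is invented, and switch_jobinfo_t is reduced to a plain void pointer.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLURM_SUCCESS 0     /* local stand-ins for slurm/slurm_errno.h */
#define SLURM_ERROR  (-1)

typedef void *switch_jobinfo_t;   /* opaque credential, unused in this sketch */

/*
 * Hypothetical helper: append "name=value" to a NULL-terminated,
 * malloc-allocated environment array. *env may be NULL (empty).
 */
int demo_env_append(char ***env, const char *name, const char *value)
{
    size_t count = 0, len;
    char **grown, *entry;

    if (env == NULL || name == NULL || value == NULL)
        return SLURM_ERROR;
    if (*env != NULL)
        while ((*env)[count] != NULL)
            count++;

    len = strlen(name) + strlen(value) + 2;   /* '=' and terminating NUL */
    entry = malloc(len);
    if (entry == NULL)
        return SLURM_ERROR;
    snprintf(entry, len, "%s=%s", name, value);

    /* Room for the existing entries, the new one, and the NULL terminator. */
    grown = realloc(*env, (count + 2) * sizeof(*grown));
    if (grown == NULL) {
        free(entry);
        return SLURM_ERROR;
    }
    grown[count] = entry;
    grown[count + 1] = NULL;
    *env = grown;
    return SLURM_SUCCESS;
}

int switch_p_job_attach(switch_jobinfo_t switch_job, char ***env,
                        uint32_t nodeid, uint32_t procid, uint32_t nnodes,
                        uint32_t nprocs, uint32_t rank)
{
    char value[16];

    (void)switch_job; (void)nodeid; (void)procid;
    (void)nnodes; (void)nprocs;

    /* Export the task's rank under a made-up variable name. */
    snprintf(value, sizeof(value), "%u", (unsigned)rank);
    return demo_env_append(env, "DEMO_SWITCH_RANK", value);
}
```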

int switch_p_job_fini (switch_jobinfo_t switch_job);

Description: This function is run from the same process as switch_p_job_init() after all job tasks have exited. It is *not* run as root, because the process in question has already called setuid to become the job owner.

Arguments: switch_job    (input) a job's switch credential.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_postfini ( switch_jobinfo_t switch_job, uid_t pgid, uint32_t job_id, uint32_t step_id );

Description: This function is run from the initial slurmd process (same process as switch_p_job_preinit()), and is run as root. Any cleanup routines that need to be run with root privileges should be run from this function.

Arguments:
switch_job    (input) a job's switch credential.
pgid    (input) The process group id associated with this task.
job_id    (input) the associated SLURM job id.
step_id    (input) the associated SLURM job step id.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
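
The three-process call sequence of preinit, init, attach, fini and postfini can be sketched with fork() and waitpid(). This is an illustration of the ordering only, not slurmd's actual code: demo_step() is a hypothetical stub that prints the name of the plugin function that would be called, and the setuid, chdir and exec steps are reduced to comments.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stub standing in for a call into the plugin. */
int demo_step(const char *name)
{
    printf("[pid %ld] %s\n", (long)getpid(), name);
    fflush(stdout);   /* avoid duplicated buffered output after fork */
    return 0;
}

/* Run one job step with ntasks tasks; returns 0 on success. */
int demo_run_job_step(int ntasks)
{
    pid_t mgr;
    int status = 0;

    demo_step("switch_p_job_preinit");        /* process 1, as root */

    mgr = fork();
    if (mgr < 0)
        return -1;

    if (mgr == 0) {                           /* process 2 */
        int failed = 0, started = 0;

        demo_step("switch_p_job_init");
        /* setuid(), chdir(), etc. would happen here */

        for (int i = 0; i < ntasks; i++) {
            pid_t task = fork();
            if (task == 0) {                  /* process 3: a user task */
                demo_step("switch_p_job_attach");
                /* exec of the user's program would happen here */
                _exit(0);
            }
            if (task < 0)
                failed = 1;
            else
                started++;
        }

        while (started-- > 0) {               /* wait for all tasks */
            if (wait(&status) < 0 || !WIFEXITED(status) ||
                WEXITSTATUS(status) != 0)
                failed = 1;
        }

        demo_step("switch_p_job_fini");       /* same process as job_init */
        _exit(failed);
    }

    if (waitpid(mgr, &status, 0) < 0)         /* process 1 waits for process 2 */
        return -1;
    demo_step("switch_p_job_postfini");       /* process 1, still root */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Keeping preinit and postfini in the first process is what allows cleanup that needs root privileges, since the second process gives up those privileges before the tasks start.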

int switch_p_job_step_allocated (switch_jobinfo_t switch_job, char *nodelist);

Description: Note that the identified job step is active at restart time. This function can be used to restore global switch state information based upon job steps known to be active at restart time. Use of this function is preferred over switch state saved and restored by the switch plugin. Direct use of job step switch information eliminates the possibility of inconsistent state information between the switch and job steps.

Arguments:
switch_job    (input) a job's switch credential.
nodelist    (input) the nodes allocated to a job step.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

Error Handling Functions

int switch_p_get_errno (void);

Description: Return the number of a switch specific error.

Arguments: None

Returns: Error number for the last failure encountered by the switch plugin.

char *switch_p_strerror(int errnum);

Description: Return a string description of a switch specific error code.

Arguments: errnum    (input) a switch specific error code.

Returns: Pointer to string describing the error or NULL if no description found in this plugin.
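
The errno conventions described under Data Objects and the two functions above could be combined as in the following sketch. Everything prefixed with demo_ or DEMO_ is hypothetical, including the error codes and the base value. A real plugin would be expected to place its codes between ESLURM_SWITCH_MIN and ESLURM_SWITCH_MAX from slurm/slurm_errno.h rather than use the placeholder base shown here. A single global errno is used for simplicity.

```c
#include <stddef.h>

#define SLURM_SUCCESS 0     /* local stand-ins for slurm/slurm_errno.h */
#define SLURM_ERROR  (-1)

/* Placeholder base; not the real ESLURM_SWITCH_MIN. */
#define DEMO_SWITCH_ERR_BASE 3000

enum {
    DEMO_ESWITCH_BAD_MAGIC = DEMO_SWITCH_ERR_BASE,
    DEMO_ESWITCH_NO_CONTEXT
};

/* Global plugin errno; its initial value is SLURM_SUCCESS. */
int demo_errno = SLURM_SUCCESS;

const struct {
    int num;
    const char *msg;
} demo_errtab[] = {
    { DEMO_ESWITCH_BAD_MAGIC,  "credential magic number is invalid" },
    { DEMO_ESWITCH_NO_CONTEXT, "no switch context available" }
};

int switch_p_get_errno(void)
{
    return demo_errno;
}

char *switch_p_strerror(int errnum)
{
    size_t n = sizeof(demo_errtab) / sizeof(demo_errtab[0]);

    for (size_t i = 0; i < n; i++)
        if (demo_errtab[i].num == errnum)
            return (char *)demo_errtab[i].msg;
    return NULL;                  /* no description in this plugin */
}

/* Hypothetical API-style function showing the failure convention. */
int demo_check_magic(unsigned magic)
{
    if (magic != 0x4a4fu) {
        demo_errno = DEMO_ESWITCH_BAD_MAGIC;
        return SLURM_ERROR;       /* never return the errno value itself */
    }
    return SLURM_SUCCESS;         /* success need not reset demo_errno */
}
```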

Versioning

This document describes version 0 of the SLURM Switch API. Future releases of SLURM may revise this API. A switch plugin conveys its ability to implement a particular API version using the mechanism outlined for SLURM plugins. In addition, the credential is transmitted along with the version number of the plugin that transmitted it. It is at the discretion of the plugin author whether to maintain data format compatibility across different versions of the plugin.


For information about this page, contact slurm-dev@lists.llnl.gov.