Simple Linux Utility for Resource Management

SLURM Switch Plugin API

Overview

This document describes SLURM switch (interconnect) plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own SLURM switch plugins. This is version 0 of the API.

SLURM switch plugins are SLURM plugins that implement the SLURM switch or interconnect API described herein. They must conform to the SLURM Plugin API with the following specifications:

const char plugin_type[]
The major type must be "switch". The minor type can be any recognizable abbreviation for the type of switch. We recommend, for example:

  • none: A plugin that implements the API without providing any actual switch service. This is the case for Ethernet and Myrinet interconnects.
  • elan: Quadrics Elan3 or Elan4 interconnect.
  • federation: IBM Federation interconnects (presently under development).

The plugin_name and plugin_version symbols required by the SLURM Plugin API require no specialization for switch support. Note carefully, however, the versioning discussion below.
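As a rough illustration of the identification symbols discussed above, a plugin source file might begin as follows. The minor type "example", the plugin name string, the version value, the integer type chosen for plugin_version, and the helper plugin_has_major_type() are all assumptions made for this sketch. The "major/minor" form of the type string follows the convention of the generic SLURM Plugin API.

```c
#include <stdint.h>
#include <string.h>

/* Identification symbols required by the generic SLURM Plugin API.
 * The values below are illustrative assumptions, not taken from a real plugin. */
const char plugin_name[] = "Example switch plugin (illustrative only)";
const char plugin_type[] = "switch/example";   /* major type "switch", hypothetical minor type */
const uint32_t plugin_version = 90;            /* hypothetical version number */

/* Hypothetical helper: return 1 if a type string has the given major type. */
int plugin_has_major_type(const char *type, const char *major)
{
    size_t len = strlen(major);

    /* The major type must be followed immediately by '/' and the minor type. */
    return strncmp(type, major, len) == 0 && type[len] == '/';
}
```

A loader could use a check like plugin_has_major_type(plugin_type, "switch") to reject plugins of another major type before resolving any of the switch_p_* symbols.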

The programmer is urged to study src/plugins/switch/switch_elan.c and src/plugins/switch/switch_none.c for sample implementations of a SLURM switch plugin.

Data Objects

The implementation must support an opaque class, which it defines, to be used as a job's switch "credential." This class must encapsulate all job-specific information necessary for the operation of the API specification below. The credential is referred to in SLURM code by an anonymous pointer (void *).

The implementation must maintain (though not necessarily directly export) an enumerated errno to allow SLURM to discover as practically as possible the reason for any failed API call. Plugin-specific enumerated integer values should be used when appropriate. It is desirable that these values be mapped into the range from ESLURM_SWITCH_MIN to ESLURM_SWITCH_MAX as defined in slurm/slurm_errno.h. The error number should be returned by the function switch_p_get_errno(). It can be converted to an appropriate string description using the switch_p_strerror() function described below.

These values must not be used as return values in integer-valued functions in the API. The proper error return value from integer-valued functions is SLURM_ERROR. The implementation should endeavor to provide useful and pertinent information by whatever means is practical. In some cases this means an errno for each credential, since plugins must be re-entrant. If a plugin maintains a global errno in place of or in addition to a per-credential errno, it is not required to enforce mutual exclusion on it. Successful API calls are not required to reset any errno to a known value. However, the initial value of any errno, prior to any error condition arising, should be SLURM_SUCCESS.
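The error conventions above might be sketched as follows. The ESWITCH_* names, the message strings, and the helpers set_error() and example_validate_magic() are hypothetical. The numeric values given for ESLURM_SWITCH_MIN and ESLURM_SWITCH_MAX are placeholders standing in for the definitions in slurm/slurm_errno.h, and a single global errno is used for brevity where a re-entrant plugin may prefer one errno per credential.

```c
#include <stddef.h>

/* Stand-ins for definitions that a real plugin takes from the SLURM headers. */
#define SLURM_SUCCESS      0
#define SLURM_ERROR        (-1)
#define ESLURM_SWITCH_MIN  3000   /* placeholder value, see slurm/slurm_errno.h */
#define ESLURM_SWITCH_MAX  3099   /* placeholder value, see slurm/slurm_errno.h */

/* Hypothetical plugin-specific error numbers, mapped into the switch range. */
enum {
    ESWITCH_BAD_MAGIC = ESLURM_SWITCH_MIN,
    ESWITCH_NO_MEMORY,
    ESWITCH_STATE_IO
};

/* Global errno; its initial value is SLURM_SUCCESS as the text requires. */
static int plugin_errno = SLURM_SUCCESS;

static char msg_bad_magic[] = "switch credential has a bad magic number";
static char msg_no_memory[] = "out of memory";
static char msg_state_io[]  = "error reading or writing switch state";

/* Record the reason for a failure and return the generic error value. */
static int set_error(int errnum)
{
    plugin_errno = errnum;
    return SLURM_ERROR;
}

int switch_p_get_errno(void)
{
    return plugin_errno;
}

char *switch_p_strerror(int errnum)
{
    switch (errnum) {
    case ESWITCH_BAD_MAGIC: return msg_bad_magic;
    case ESWITCH_NO_MEMORY: return msg_no_memory;
    case ESWITCH_STATE_IO:  return msg_state_io;
    default:                return NULL;   /* not an error known to this plugin */
    }
}

/* Hypothetical integer-valued call: it returns SLURM_ERROR on failure,
 * never the plugin-specific error number itself. */
int example_validate_magic(unsigned int magic, unsigned int expected)
{
    if (magic != expected)
        return set_error(ESWITCH_BAD_MAGIC);
    return SLURM_SUCCESS;   /* success does not need to reset plugin_errno */
}
```

A caller that receives SLURM_ERROR would then call switch_p_get_errno() and pass the result to switch_p_strerror() to obtain a message for logging.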

API Functions

The following functions must appear. Functions which are not implemented should be stubbed.

Global Switch State Functions

int switch_p_libstate_save (char *dir_name);

Description: Save any global switch state to a file within the specified directory. The actual file name used is plugin specific. It is recommended that the global switch state contain a magic number for validation purposes. This function is called by the slurmctld daemon on shutdown.

Arguments: dir_name    (input) fully-qualified pathname of a directory into which user SlurmUser (as defined in slurm.conf) can create a file and write state information into that file. Cannot be NULL.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_libstate_restore(char *dir_name);

Description: Restore any global switch state from a file within the specified directory. The actual file name used is plugin specific. It is recommended that any magic number associated with the global switch state be verified. This function is called by the slurmctld daemon on startup.

Arguments: dir_name    (input) fully-qualified pathname of a directory containing a state information file from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
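One possible shape for the save and restore pair is sketched below. The state structure, the file name example_switch_state, the magic value, and the helper state_path() are assumptions for illustration. The structure is written in host byte order for brevity, and the step of setting the plugin errno on failure is left out.

```c
#include <stdint.h>
#include <stdio.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

#define STATE_MAGIC 0x53574954u              /* hypothetical magic number */
#define STATE_FILE  "example_switch_state"   /* hypothetical, plugin-specific file name */

/* Hypothetical global switch state kept by the plugin. */
struct switch_state {
    uint32_t magic;
    uint32_t next_context;
};

static struct switch_state global_state = { STATE_MAGIC, 0 };

/* Build "<dir_name>/<STATE_FILE>", failing if it does not fit. */
static int state_path(char *out, size_t size, const char *dir_name)
{
    int n = snprintf(out, size, "%s/%s", dir_name, STATE_FILE);

    return (n < 0 || (size_t)n >= size) ? SLURM_ERROR : SLURM_SUCCESS;
}

int switch_p_libstate_save(char *dir_name)
{
    char path[4096];
    FILE *fp;
    size_t written;

    if (dir_name == NULL || state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "wb")) == NULL)
        return SLURM_ERROR;
    written = fwrite(&global_state, sizeof(global_state), 1, fp);
    if (fclose(fp) != 0 || written != 1)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}

int switch_p_libstate_restore(char *dir_name)
{
    char path[4096];
    struct switch_state loaded;
    FILE *fp;
    size_t got;

    if (dir_name == NULL || state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "rb")) == NULL)
        return SLURM_ERROR;
    got = fread(&loaded, sizeof(loaded), 1, fp);
    fclose(fp);
    /* Reject short reads and files that do not carry the expected magic number. */
    if (got != 1 || loaded.magic != STATE_MAGIC)
        return SLURM_ERROR;
    global_state = loaded;
    return SLURM_SUCCESS;
}
```

The global state is replaced only after the magic number has been checked, so a truncated or foreign file leaves the in-memory state untouched.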

bool switch_p_no_frag(void);

Description: Report if resource fragmentation is important. If so, delay scheduling a new job while another is in the process of terminating.

Arguments: None

Returns: TRUE if job scheduling should be delayed while any other job is in the process of terminating, FALSE otherwise.

Job's Switch Credential Management Functions

int switch_p_alloc_jobinfo(switch_jobinfo_t *switch_job);

Description: Allocate storage for a job's switch credential. It is recommended that the credential contain a magic number for validation purposes.

Arguments: switch_job    (output) location in which to store a pointer to the job's newly allocated switch credential.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_build_jobinfo (switch_jobinfo_t switch_job, char *nodelist, int nprocs, int cyclic_alloc);

Description: Build a job's switch credential. It is recommended that the credential's magic number be validated.

Arguments:
switch_job    (input/output) Job's switch credential to be updated
nodelist    (input) List of nodes allocated to the job. This may contain expressions to specify node ranges (e.g. "linux[1-20]" or "linux[2,4,6,8]").
nprocs    (input) Number of processes to be initiated as part of the job.
cyclic_alloc    (input) Non-zero if job's processes are to be allocated across nodes in a cyclic fashion (task 0 on node 0, task 1 on node 1, etc). If zero, processes are allocated sequentially on a node before moving to the next node (tasks 0 and 1 on node 0, tasks 2 and 3 on node 1, etc.).

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
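The effect of cyclic_alloc on task placement can be illustrated with a small mapping function. The name node_for_task() is hypothetical, and the sketch assumes that sequential allocation places the same number of tasks on each node (rounded up), which matches the examples given for the argument but may not match every allocation SLURM produces.

```c
/* Hypothetical helper: return the zero-origin index of the node that hosts
 * a task, or -1 if the arguments are out of range. */
int node_for_task(int task, int nnodes, int nprocs, int cyclic_alloc)
{
    int per_node;

    if (nnodes <= 0 || nprocs <= 0 || task < 0 || task >= nprocs)
        return -1;

    /* Cyclic: task 0 on node 0, task 1 on node 1, and so on, wrapping around. */
    if (cyclic_alloc)
        return task % nnodes;

    /* Sequential: fill one node before moving to the next.
     * Assumes an even spread of ceil(nprocs / nnodes) tasks per node. */
    per_node = (nprocs + nnodes - 1) / nnodes;
    return task / per_node;
}
```

With four processes on two nodes, the cyclic layout places tasks 0 and 2 on node 0, while the sequential layout places tasks 0 and 1 there.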

switch_jobinfo_t switch_p_copy_jobinfo (switch_jobinfo_t switch_job);

Description: Allocate storage for a job's switch credential and copy an existing credential to that location.

Arguments: switch_job    (input) an existing job switch credential.

Returns: A newly allocated job switch credential containing a copy of the function argument.

void switch_p_free_jobinfo (switch_jobinfo_t switch_job);

Description: Release the storage associated with a job's switch credential.

Arguments: switch_job    (input) an existing job switch credential.

Returns: None
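A minimal sketch of the allocate, copy, and free functions follows. The layout of struct switch_jobinfo, the magic value, and the choice to clear the magic number before freeing are assumptions. The typedef stands in for the opaque type that SLURM code handles as an anonymous pointer, and setting the plugin errno on failure is omitted.

```c
#include <stdint.h>
#include <stdlib.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

#define JOBINFO_MAGIC 0x4a4f4249u   /* hypothetical magic number */

/* Hypothetical credential layout; a real plugin stores whatever its
 * interconnect needs for the job. */
struct switch_jobinfo {
    uint32_t magic;
    uint32_t nprocs;
};
typedef struct switch_jobinfo *switch_jobinfo_t;   /* stand-in for the opaque type */

int switch_p_alloc_jobinfo(switch_jobinfo_t *switch_job)
{
    switch_jobinfo_t j;

    if (switch_job == NULL)
        return SLURM_ERROR;
    j = calloc(1, sizeof(*j));
    if (j == NULL)
        return SLURM_ERROR;
    j->magic = JOBINFO_MAGIC;
    *switch_job = j;
    return SLURM_SUCCESS;
}

switch_jobinfo_t switch_p_copy_jobinfo(switch_jobinfo_t switch_job)
{
    switch_jobinfo_t copy;

    /* Refuse to copy something that is not a valid credential. */
    if (switch_job == NULL || switch_job->magic != JOBINFO_MAGIC)
        return NULL;
    copy = malloc(sizeof(*copy));
    if (copy != NULL)
        *copy = *switch_job;
    return copy;
}

void switch_p_free_jobinfo(switch_jobinfo_t switch_job)
{
    if (switch_job == NULL || switch_job->magic != JOBINFO_MAGIC)
        return;
    switch_job->magic = 0;   /* clear the magic before releasing the storage */
    free(switch_job);
}
```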

int switch_p_pack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);

Description: Pack the data associated with a job's switch credential into a buffer for network transmission.

Arguments:
switch_job    (input) an existing job switch credential.
buffer    (input/output) buffer onto which the credential's contents are appended.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_unpack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);

Description: Unpack the data associated with a job's switch credential from a buffer.

Arguments:
switch_job    (input/output) a previously allocated job switch credential to be filled in with data read from the buffer.
buffer    (input/output) buffer from which the credential's contents are read.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
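The pack and unpack pair might look roughly like the following. SLURM's Buf type and its pack routines are not reproduced here: struct example_buf, put_u32(), get_u32(), and the example_* function names are hypothetical stand-ins, and the credential layout is assumed. Fields are written in network (big-endian) byte order so that nodes of differing endianness agree on the contents.

```c
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

#define JOBINFO_MAGIC 0x4a4f4249u   /* hypothetical magic number */

/* Hypothetical stand-in for SLURM's Buf type. */
struct example_buf {
    unsigned char data[64];
    size_t used;       /* bytes appended so far */
    size_t read_pos;   /* next byte to be read */
};

/* Hypothetical credential layout. */
struct example_jobinfo {
    uint32_t magic;
    uint32_t nprocs;
};

/* Append a 32-bit value in network byte order. */
static int put_u32(struct example_buf *b, uint32_t v)
{
    if (b->used + 4 > sizeof(b->data))
        return SLURM_ERROR;
    b->data[b->used++] = (unsigned char)(v >> 24);
    b->data[b->used++] = (unsigned char)(v >> 16);
    b->data[b->used++] = (unsigned char)(v >> 8);
    b->data[b->used++] = (unsigned char)v;
    return SLURM_SUCCESS;
}

/* Read a 32-bit value written by put_u32(). */
static int get_u32(struct example_buf *b, uint32_t *v)
{
    const unsigned char *p = b->data + b->read_pos;

    if (b->read_pos + 4 > b->used)
        return SLURM_ERROR;
    *v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    b->read_pos += 4;
    return SLURM_SUCCESS;
}

int example_pack_jobinfo(const struct example_jobinfo *j, struct example_buf *b)
{
    if (j == NULL || b == NULL || j->magic != JOBINFO_MAGIC)
        return SLURM_ERROR;
    if (put_u32(b, j->magic) != SLURM_SUCCESS || put_u32(b, j->nprocs) != SLURM_SUCCESS)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}

int example_unpack_jobinfo(struct example_jobinfo *j, struct example_buf *b)
{
    uint32_t magic, nprocs;

    if (j == NULL || b == NULL)
        return SLURM_ERROR;
    if (get_u32(b, &magic) != SLURM_SUCCESS || get_u32(b, &nprocs) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if (magic != JOBINFO_MAGIC)   /* validate before touching the credential */
        return SLURM_ERROR;
    j->magic = magic;
    j->nprocs = nprocs;
    return SLURM_SUCCESS;
}
```

Unpacking reads the fields in exactly the order in which they were packed, and the target credential is modified only once the whole record has been read and validated.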

void switch_p_print_jobinfo(FILE *fp, switch_jobinfo_t switch_job);

Description: Print the contents of a job's switch credential to a file.

Arguments:
fp    (input) pointer to an open file.
switch_job    (input) a job's switch credential.

Returns: None.

char *switch_p_sprint_jobinfo(switch_jobinfo_t switch_job, char *buf, size_t size);

Description: Print the contents of a job's switch credential to a buffer.

Arguments:
switch_job    (input) a job's switch credential.
buf    (input/output) pointer to buffer into which the job credential information is to be written.
size    (input) size of buf in bytes

Returns: location of buffer, same as buf.
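A bounded implementation of the buffer-printing function can be built on snprintf(), as in this sketch. The credential layout and the output format are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical credential layout. */
struct switch_jobinfo {
    uint32_t magic;
    uint32_t nprocs;
};
typedef struct switch_jobinfo *switch_jobinfo_t;   /* stand-in for the opaque type */

char *switch_p_sprint_jobinfo(switch_jobinfo_t switch_job, char *buf, size_t size)
{
    if (buf == NULL || size == 0)
        return buf;

    /* snprintf() never writes more than size bytes and always terminates
     * the string, truncating the output if necessary. */
    if (switch_job == NULL)
        snprintf(buf, size, "(null credential)");
    else
        snprintf(buf, size, "magic=0x%08x nprocs=%u",
                 (unsigned int)switch_job->magic,
                 (unsigned int)switch_job->nprocs);
    return buf;
}
```

Because the function returns buf, a caller can pass the call directly as an argument to a logging function.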

Node Specific Switch Management Functions

int switch_p_node_init (void);

Description: This function is run from the top level slurmd only once per slurmd run. It may be used, for instance, to perform some one-time interconnect setup or spawn an error handling thread.

Arguments: None

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_node_fini (void);

Description: This function is called once as slurmd exits (slurmd will wait for this function to return before continuing the exit process).

Arguments: None

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

Job Management Functions

=========================================================================
Process 1 (root)        Process 2 (root, user)  |  Process 3 (user task) 
                                                |                        
switch_p_job_preinit                            |                        
fork ------------------ switch_p_job_init       |                        
waitpid                 setuid, chdir, etc.     |                        
                        fork N procs -----------+--- switch_p_job_attach 
                        wait all                |    exec mpi process    
                        switch_p_job_fini*      |                        
switch_p_job_postfini                           |                        
=========================================================================

int switch_p_job_preinit (switch_jobinfo_t switch_job);

Description: Preinit is run as root in the first slurmd process, the so called job manager. This function can be used to perform any initialization that needs to be performed in the same process as switch_p_job_fini().

Arguments: switch_job    (input) a job's switch credential.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_init (switch_jobinfo_t switch_job, uid_t uid);

Description: Initialize interconnect on node for a job. This function is run from the second slurmd process. Some interconnect implementations (e.g. Quadrics Elan) may require switch_p_job_init() to be executed in a different process from the one executing switch_p_job_fini().

Arguments:
switch_job    (input) a job's switch credential.
uid    (input) the user id to execute a job.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_attach ( switch_jobinfo_t switch_job, char ***env, uint32_t nodeid, uint32_t procid, uint32_t nnodes, uint32_t nprocs, uint32_t rank );

Description: Attach process to interconnect. This function is called from within the task's own process, so it is appropriate to set interconnect-specific environment variables here.

Arguments:
switch_job    (input) a job's switch credential.
env    (input/output) the environment variables to be set upon job initiation. Switch specific environment variables are added as needed.
nodeid    (input) zero-origin id of this node.
procid    (input) zero-origin process id local to slurmd and not equivalent to the global task id or MPI rank.
nnodes    (input) count of nodes allocated to this job.
nprocs    (input) total count of processes or tasks to be initiated for this job.
rank    (input) zero-origin id of this task.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
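Adding a switch-specific variable through the env argument could be done along these lines. The helper env_append() and the variable name EXAMPLE_SWITCH_RANK are hypothetical, and the sketch assumes the environment is a NULL-terminated array allocated with malloc(), which may differ from how slurmd manages it.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

/* Hypothetical helper: append "name=value" to a NULL-terminated environment
 * array. Assumes *env is NULL or was allocated with malloc(). */
int env_append(char ***env, const char *name, uint32_t value)
{
    size_t count = 0, len;
    char **grown;
    char *entry;

    if (env == NULL || name == NULL)
        return SLURM_ERROR;

    /* Count the existing entries. */
    if (*env != NULL)
        while ((*env)[count] != NULL)
            count++;

    /* name + '=' + at most 10 decimal digits of a 32-bit value + NUL. */
    len = strlen(name) + 1 + 10 + 1;
    entry = malloc(len);
    if (entry == NULL)
        return SLURM_ERROR;
    snprintf(entry, len, "%s=%lu", name, (unsigned long)value);

    /* Make room for the new entry and the terminating NULL. */
    grown = realloc(*env, (count + 2) * sizeof(*grown));
    if (grown == NULL) {
        free(entry);
        return SLURM_ERROR;
    }
    grown[count] = entry;
    grown[count + 1] = NULL;
    *env = grown;
    return SLURM_SUCCESS;
}
```

Inside switch_p_job_attach() a plugin could then call, for instance, env_append(env, "EXAMPLE_SWITCH_RANK", rank) before the task's program is executed.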

int switch_p_job_fini (switch_jobinfo_t switch_job);

Description: This function is run from the same process as switch_p_job_init() after all job tasks have exited. It is *not* run as root, because the process in question has already setuid to the job owner.

Arguments: switch_job    (input) a job's switch credential.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_postfini ( switch_jobinfo_t switch_job, uid_t pgid, uint32_t job_id, uint32_t step_id );

Description: This function is run from the initial slurmd process (same process as switch_p_job_preinit()), and is run as root. Any cleanup routines that need to be run with root privileges should be run from this function.

Arguments:
switch_job    (input) a job's switch credential.
pgid    (input) The process group id associated with this task.
job_id    (input) the associated SLURM job id.
step_id    (input) the associated SLURM job step id.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
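The process structure shown in the diagram at the start of this section can be imitated with plain fork() and wait calls. This is only a sketch of the call order, not slurmd's actual code: the functions stub(), run_session(), and run_job_step() are hypothetical, the plugin calls are replaced by a stub that logs its name and process id, and the setuid and exec steps are reduced to comments.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a plugin entry point: log which process made the call. */
static int stub(const char *name)
{
    fprintf(stderr, "[pid %ld] %s\n", (long)getpid(), name);
    return 0;
}

/* Process 2: initialize the interconnect, start the tasks, clean up. */
static void run_session(int ntasks)
{
    int i, status, failed = 0;

    stub("switch_p_job_init");
    /* setuid, chdir, etc. would happen here */
    for (i = 0; i < ntasks; i++) {
        pid_t pid = fork();

        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            /* Process 3: one per task */
            stub("switch_p_job_attach");
            /* exec of the user's program would happen here */
            _exit(0);
        }
    }
    /* Wait for every task before finalizing. */
    while (wait(&status) > 0)
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    stub("switch_p_job_fini");
    _exit(failed);
}

/* Process 1: the job manager. Returns 0 on success, -1 on failure. */
int run_job_step(int ntasks)
{
    int status;
    pid_t pid;

    stub("switch_p_job_preinit");
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        run_session(ntasks);   /* does not return */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    stub("switch_p_job_postfini");
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling run_job_step(2) logs preinit and postfini under one process id, init and fini under a second, and one attach line under each of two further process ids, which mirrors the three columns of the diagram.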

Error Handling Functions

int switch_p_get_errno (void);

Description: Return the number of a switch specific error.

Arguments: None

Returns: Error number for the last failure encountered by the switch plugin.

char *switch_p_strerror(int errnum);

Description: Return a string description of a switch specific error code.

Arguments: errnum    (input) a switch specific error code.

Returns: Pointer to string describing the error or NULL if no description found in this plugin.

Versioning

This document describes version 0 of the SLURM Switch API. Future releases of SLURM may revise this API. A switch plugin conveys its ability to implement a particular API version using the mechanism outlined for SLURM plugins. In addition, the credential is transmitted along with the version number of the plugin that transmitted it. It is at the discretion of the plugin author whether to maintain data format compatibility across different versions of the plugin.


For information about this page, contact slurm-dev@lists.llnl.gov.