Overview
This document describes SLURM switch (interconnect) plugins and the API that defines
them. It is intended as a resource to programmers wishing to write their own SLURM
switch plugins. This is version 0 of the API.
Note that many of the API functions are used by only one of the daemons. For
example, the slurmctld daemon builds a job step's switch credential
(switch_p_build_jobinfo), while the
slurmd daemon enables and disables that credential for the job step's
tasks on a particular node (switch_p_job_init,
etc.).
SLURM switch plugins are SLURM plugins that implement the SLURM switch or interconnect
API described herein. They must conform to the SLURM Plugin API with the following
specifications:
const char plugin_type[]
The major type must be "switch." The minor type can be any recognizable
abbreviation for the type of switch. We recommend, for example:
- none: A plugin that implements the API without providing any actual
switch service. This is the case for Ethernet and Myrinet interconnects.
- elan: Quadrics Elan3 or Elan4
interconnect.
- federation: IBM Federation interconnects (presently under development).
The plugin_name and
plugin_version
symbols required by the SLURM Plugin API require no specialization for switch support.
Note carefully, however, the versioning discussion below.
The programmer is urged to study
src/plugins/switch/switch_elan.c and
src/plugins/switch/switch_none.c
for sample implementations of a SLURM switch plugin.
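As a minimal sketch, the identification symbols for a plugin of minor type "none" might be declared as shown below. The name string and the version value are illustrative assumptions, and has_switch_major_type() is a hypothetical helper that is not part of the API.

```c
#include <stdint.h>
#include <string.h>

/* Symbols required by the SLURM Plugin API. Values are illustrative only. */
const char plugin_name[] = "switch NONE plugin (example)";
const char plugin_type[] = "switch/none";  /* major type "switch", minor type "none" */
const uint32_t plugin_version = 100;       /* assumed value; see Versioning */

/* Hypothetical helper: returns 1 if a type string has major type "switch". */
int has_switch_major_type(const char *type)
{
    return type != NULL && strncmp(type, "switch/", 7) == 0;
}
```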
Data Objects
The implementation must support two opaque data classes.
One is used as a job's switch "credential."
This class must encapsulate all job-specific information necessary
for the operation of the API specification below.
The second is a node's switch state record.
Both data classes are referred to in SLURM code using an anonymous
pointer (void *).
The implementation must maintain (though not necessarily directly export) an
enumerated errno to allow SLURM to discover
as practically as possible the reason for any failed API call. Plugin-specific enumerated
integer values should be used when appropriate. It is desirable that these values
be mapped into the range between ESLURM_SWITCH_MIN and ESLURM_SWITCH_MAX
as defined in slurm/slurm_errno.h.
The error number should be returned by the function
switch_p_get_errno()
and this error number can be converted to an appropriate string description using the
switch_p_strerror()
function described below.
These values must not be used as return values in integer-valued functions
in the API. The proper error return value from integer-valued functions is SLURM_ERROR.
The implementation should endeavor to provide useful and pertinent information by
whatever means is practical. In some cases this means an errno for each credential,
since plugins must be re-entrant. If a plugin maintains a global errno in place of or in
addition to a per-credential errno, it is not required to enforce mutual exclusion on it.
Successful API calls are not required to reset any errno to a known value. However,
the initial value of any errno, prior to any error condition arising, should be
SLURM_SUCCESS.
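The errno conventions above can be sketched as follows. The numeric values given for ESLURM_SWITCH_MIN and ESLURM_SWITCH_MAX are placeholders (the real ones come from slurm/slurm_errno.h), and the EDEMO_* codes and demo_open_adapter() are hypothetical names used only for illustration.

```c
/* Stand-ins for values normally taken from slurm/slurm_errno.h. */
#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)
#endif
#ifndef ESLURM_SWITCH_MIN
#define ESLURM_SWITCH_MIN 3000   /* placeholder value */
#define ESLURM_SWITCH_MAX 3099   /* placeholder value */
#endif

/* Hypothetical plugin-specific error codes inside the reserved range. */
enum demo_switch_errno {
    EDEMO_BAD_MAGIC = ESLURM_SWITCH_MIN,
    EDEMO_NO_ADAPTER
};

/* Global errno; its initial value is SLURM_SUCCESS. */
static int demo_errno = SLURM_SUCCESS;

int switch_p_get_errno(void)
{
    return demo_errno;
}

/* Hypothetical API-style function: fails when no adapter is present. */
int demo_open_adapter(int adapter_present)
{
    if (!adapter_present) {
        demo_errno = EDEMO_NO_ADAPTER;  /* record the reason for failure */
        return SLURM_ERROR;             /* never return the errno itself */
    }
    return SLURM_SUCCESS;               /* errno is not reset on success */
}
```

A caller checks the integer return value first and consults switch_p_get_errno() only after seeing SLURM_ERROR.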
API Functions
The following functions must appear. Functions which are not implemented should
be stubbed.
Global Switch State Functions
int switch_p_libstate_save (char *dir_name);
Description: Save any global switch state to a file
within the specified directory. The actual file name used is plugin specific. It is recommended
that the global switch state contain a magic number for validation purposes. This function
is called by the slurmctld daemon on shutdown.
Arguments: dir_name
(input) fully-qualified pathname of a directory into which user SlurmUser (as defined
in slurm.conf) can create a file and write state information into that file. Cannot be NULL.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_libstate_restore(char *dir_name);
Description: Restore any global switch state from a file
within the specified directory. The actual file name used is plugin specific. It is recommended
that any magic number associated with the global switch state be verified. This function
is called by the slurmctld daemon on startup.
Arguments: dir_name
(input) fully-qualified pathname of a directory containing a state information file
from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
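One possible shape for the save and restore pair is sketched below. The state layout, the magic number, and the file name are all hypothetical, and for brevity the sketch omits setting the plugin errno on failure. Writing a raw struct is not portable across architectures; a real plugin may prefer an explicit byte format.

```c
#include <stdint.h>
#include <stdio.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

#define DEMO_STATE_MAGIC 0x5357494eU          /* hypothetical magic number */
#define DEMO_STATE_FILE  "demo_switch_state"  /* hypothetical file name */

/* Hypothetical global switch state. */
struct demo_libstate {
    uint32_t magic;
    uint32_t next_context_id;
};

static struct demo_libstate demo_state = { DEMO_STATE_MAGIC, 0 };

int switch_p_libstate_save(char *dir_name)
{
    char path[4096];
    FILE *fp;
    size_t n;

    if (dir_name == NULL)
        return SLURM_ERROR;
    snprintf(path, sizeof(path), "%s/%s", dir_name, DEMO_STATE_FILE);
    if ((fp = fopen(path, "wb")) == NULL)
        return SLURM_ERROR;
    n = fwrite(&demo_state, sizeof(demo_state), 1, fp);
    if (fclose(fp) != 0 || n != 1)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}

int switch_p_libstate_restore(char *dir_name)
{
    char path[4096];
    struct demo_libstate tmp;
    FILE *fp;
    size_t n;

    if (dir_name == NULL)
        return SLURM_ERROR;
    snprintf(path, sizeof(path), "%s/%s", dir_name, DEMO_STATE_FILE);
    if ((fp = fopen(path, "rb")) == NULL)
        return SLURM_ERROR;
    n = fread(&tmp, sizeof(tmp), 1, fp);
    fclose(fp);
    if (n != 1 || tmp.magic != DEMO_STATE_MAGIC)
        return SLURM_ERROR;   /* truncated file or wrong magic number */
    demo_state = tmp;         /* accept the state only after validation */
    return SLURM_SUCCESS;
}
```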
bool switch_p_no_frag(void);
Description: Report if resource fragmentation is important.
If so, delay scheduling a new job while another is in the process of terminating.
Arguments: None
Returns: TRUE if job scheduling should be delayed while
any other job is in the process of terminating, FALSE otherwise.
Node's Switch State Monitoring Functions
Nodes will register with current switch state information when the slurmd daemon
is initiated. The slurmctld daemon will also request that slurmd supply current
switch state information on a periodic basis.
int switch_p_clear_node_state(void);
Description: Initialize node state.
If any switch state has previously been established for a job, it will be cleared.
This will be used to establish a "clean" state for the switch on the node upon
which it is executed.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_alloc_node_info(switch_node_info_t *switch_node);
Description: Allocate storage for a node's switch
state record. It is recommended that the record contain a magic number for validation
purposes.
Arguments: switch_node
(output) location in which to store the newly allocated node switch state record.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_build_node_info(switch_node_info_t switch_node);
Description: Fill in a previously allocated switch state
record for the node on which this function is executed.
It is recommended that the magic number be validated.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
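The allocate, build, and free steps with magic number validation might look like the sketch below. The record layout, the magic value, and the typedef are stand-ins (the real type is opaque and comes from the SLURM headers), and the node name and adapter count are placeholder values that a real plugin would obtain from the system.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)
#define DEMO_NODE_MAGIC 0x4e4f4445U  /* hypothetical magic number */

/* Hypothetical node switch state record; opaque to SLURM. */
struct demo_node_info {
    uint32_t magic;
    char     node_name[64];
    uint32_t adapter_count;
};
typedef struct demo_node_info *switch_node_info_t;  /* stand-in typedef */

int switch_p_alloc_node_info(switch_node_info_t *switch_node)
{
    struct demo_node_info *rec;

    if (switch_node == NULL)
        return SLURM_ERROR;
    rec = calloc(1, sizeof(*rec));   /* zero-filled, so strings are terminated */
    if (rec == NULL)
        return SLURM_ERROR;
    rec->magic = DEMO_NODE_MAGIC;
    *switch_node = rec;
    return SLURM_SUCCESS;
}

int switch_p_build_node_info(switch_node_info_t switch_node)
{
    if (switch_node == NULL || switch_node->magic != DEMO_NODE_MAGIC)
        return SLURM_ERROR;
    /* Placeholder values; a real plugin would query the local hardware. */
    strncpy(switch_node->node_name, "localhost",
            sizeof(switch_node->node_name) - 1);
    switch_node->adapter_count = 1;
    return SLURM_SUCCESS;
}

void switch_p_free_node_info(switch_node_info_t switch_node)
{
    if (switch_node != NULL && switch_node->magic == DEMO_NODE_MAGIC) {
        switch_node->magic = 0;      /* helps detect use after free */
        free(switch_node);
    }
}
```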
int switch_p_pack_node_info (switch_node_info_t switch_node,
Buf buffer);
Description: Pack the data associated with a
node's switch state into a buffer for network transmission.
Arguments:
switch_node (input) an existing
node's switch state record.
buffer (input/output) buffer onto
which the switch state information is appended.
Returns:
The number of bytes written should be returned if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_unpack_node_info (switch_node_info_t switch_node,
Buf buffer);
Description: Unpack the data associated with a
node's switch state record from a buffer.
Arguments:
switch_node (input/output) a
previously allocated node switch state record to be filled in with data read from
the buffer.
buffer (input/output) buffer from
which the record's contents are read.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
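The pack and unpack conventions (byte count returned on pack, SLURM_SUCCESS on unpack, magic number checked on both sides) can be illustrated with the sketch below. It does not use SLURM's Buf type; struct demo_buf is a hypothetical fixed-size stand-in, and the demo_* function names and the record layout are likewise assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)
#define DEMO_NODE_MAGIC 0x4e4f4445U  /* hypothetical magic number */

/* Hypothetical stand-in for SLURM's Buf type. */
struct demo_buf {
    unsigned char data[256];
    size_t        used;    /* bytes appended so far */
    size_t        offset;  /* read position for unpacking */
};

struct demo_node_info {
    uint32_t magic;
    uint32_t adapter_count;
};

/* Append a 32-bit value in network (big-endian) byte order. */
static int demo_pack32(struct demo_buf *b, uint32_t v)
{
    if (b->used + 4 > sizeof(b->data))
        return SLURM_ERROR;
    b->data[b->used++] = (unsigned char)(v >> 24);
    b->data[b->used++] = (unsigned char)(v >> 16);
    b->data[b->used++] = (unsigned char)(v >> 8);
    b->data[b->used++] = (unsigned char)v;
    return SLURM_SUCCESS;
}

static int demo_unpack32(struct demo_buf *b, uint32_t *v)
{
    const unsigned char *p = b->data + b->offset;

    if (b->offset + 4 > b->used)
        return SLURM_ERROR;
    *v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    b->offset += 4;
    return SLURM_SUCCESS;
}

/* Returns the number of bytes written, or SLURM_ERROR. */
int demo_pack_node_info(const struct demo_node_info *n, struct demo_buf *b)
{
    size_t start = b->used;

    if (n == NULL || n->magic != DEMO_NODE_MAGIC)
        return SLURM_ERROR;
    if (demo_pack32(b, n->magic) != SLURM_SUCCESS ||
        demo_pack32(b, n->adapter_count) != SLURM_SUCCESS) {
        b->used = start;             /* roll back a partial write */
        return SLURM_ERROR;
    }
    return (int)(b->used - start);
}

int demo_unpack_node_info(struct demo_node_info *n, struct demo_buf *b)
{
    uint32_t magic, count;

    if (n == NULL)
        return SLURM_ERROR;
    if (demo_unpack32(b, &magic) != SLURM_SUCCESS || magic != DEMO_NODE_MAGIC)
        return SLURM_ERROR;
    if (demo_unpack32(b, &count) != SLURM_SUCCESS)
        return SLURM_ERROR;
    n->magic = magic;
    n->adapter_count = count;
    return SLURM_SUCCESS;
}
```

Packing each field explicitly in a fixed byte order, rather than copying the struct, keeps the wire format independent of the host's endianness and padding.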
void switch_p_free_node_info (switch_node_info_t switch_node);
Description: Release the storage associated with
a node's switch state record.
Arguments: switch_node
(input/output) a previously allocated node switch state record.
Returns: None
char * switch_p_sprintf_node_info (switch_node_info_t switch_node,
char *buf, size_t size);
Description: Print the contents of a node's switch state
record to a buffer.
Arguments:
switch_node (input) a
node's switch state record.
buf (input/output) pointer to
buffer into which the switch state record is to be written.
size (input) size
of buf in bytes.
Returns: Location of buffer, same as buf.
Job's Switch Credential Management Functions
int switch_p_alloc_jobinfo(switch_jobinfo_t *switch_job);
Description: Allocate storage for a job's switch credential.
It is recommended that the credential contain a magic number for validation purposes.
Arguments: switch_job
(output) location in which to store the newly allocated job switch credential.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_build_jobinfo (switch_jobinfo_t switch_job,
char *nodelist, int nprocs, int cyclic_alloc);
Description: Build a job's switch credential.
It is recommended that the credential's magic number be validated.
Arguments:
switch_job (input/output) Job's
switch credential to be updated
nodelist (input) List of nodes
allocated to the job. This may contain expressions to specify node ranges (e.g.
"linux[1-20]" or "linux[2,4,6,8]").
nprocs (input) Number of
processes to be initiated as part of the job.
cyclic_alloc (input) Non-zero
if job's processes are to be allocated across nodes in a cyclic fashion (task 0 on node 0,
task 1 on node 1, etc). If zero, processes are allocated sequentially on a node before
moving to the next node (tasks 0 and 1 on node 0, tasks 2 and 3 on node 1, etc.).
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
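The effect of cyclic_alloc on task placement can be shown with a small helper. demo_task_to_node() is hypothetical and not part of the API. The handling of a task count that does not divide evenly across nodes (rounding the per-node count up) is an assumption of this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the zero-origin index of the node that
 * hosts a given task. Assumes nnodes > 0 and task_id < nprocs.
 */
uint32_t demo_task_to_node(uint32_t task_id, uint32_t nprocs,
                           uint32_t nnodes, int cyclic_alloc)
{
    uint32_t tasks_per_node;

    if (cyclic_alloc)
        return task_id % nnodes;      /* task 0 on node 0, task 1 on node 1, ... */

    tasks_per_node = (nprocs + nnodes - 1) / nnodes;  /* round up (assumption) */
    return task_id / tasks_per_node;  /* fill one node before the next */
}
```

With four tasks on two nodes, cyclic allocation places tasks 0 and 2 on node 0, while sequential allocation places tasks 0 and 1 on node 0.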
switch_jobinfo_t switch_p_copy_jobinfo (switch_jobinfo_t switch_job);
Description: Allocate storage for a job's switch credential
and copy an existing credential to that location.
Arguments: switch_job
(input) an existing job switch credential.
Returns: A newly allocated job switch credential containing a
copy of the function argument.
void switch_p_free_jobinfo (switch_jobinfo_t switch_job);
Description: Release the storage associated with a job's
switch credential.
Arguments: switch_job
(input) an existing job switch credential.
Returns: None
int switch_p_pack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);
Description: Pack the data associated with a job's
switch credential into a buffer for network transmission.
Arguments:
switch_job (input) an existing job
switch credential.
buffer (input/output) buffer onto
which the credential's contents are appended.
Returns:
The number of bytes written should be returned if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_unpack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);
Description: Unpack the data associated with a job's
switch credential from a buffer.
Arguments:
switch_job (input/output) a previously
allocated job switch credential to be filled in with data read from the buffer.
buffer (input/output) buffer from
which the credential's contents are read.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_get_jobinfo (switch_jobinfo_t switch_job, int data_type, void *data);
Description: Get some specific data from a job's switch credential.
Arguments:
switch_job (input) a job's switch credential.
data_type (input) identification
as to the type of data requested. The interpretation of this value is plugin dependent.
data (output) filled in with the desired
data. The form of this data is dependent upon the value of data_type and the plugin.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
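Because the meaning of data_type is left to the plugin, an implementation typically dispatches on it. The sketch below assumes a hypothetical credential layout and two hypothetical keys; the typedef is a stand-in for the opaque type from the SLURM headers.

```c
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)
#define DEMO_JOB_MAGIC 0x4a4f4249U  /* hypothetical magic number */

/* Hypothetical data_type keys; real keys are defined by each plugin. */
enum demo_jobinfo_key {
    DEMO_KEY_NPROCS     = 1,
    DEMO_KEY_CONTEXT_ID = 2
};

/* Hypothetical job switch credential. */
struct demo_jobinfo {
    uint32_t magic;
    uint32_t nprocs;
    uint32_t context_id;
};
typedef struct demo_jobinfo *switch_jobinfo_t;  /* stand-in typedef */

int switch_p_get_jobinfo(switch_jobinfo_t switch_job, int data_type, void *data)
{
    if (switch_job == NULL || data == NULL ||
        switch_job->magic != DEMO_JOB_MAGIC)
        return SLURM_ERROR;

    switch (data_type) {
    case DEMO_KEY_NPROCS:
        *(uint32_t *) data = switch_job->nprocs;      /* caller passes uint32_t * */
        return SLURM_SUCCESS;
    case DEMO_KEY_CONTEXT_ID:
        *(uint32_t *) data = switch_job->context_id;  /* caller passes uint32_t * */
        return SLURM_SUCCESS;
    default:
        return SLURM_ERROR;                           /* unknown key */
    }
}
```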
int switch_p_job_step_complete (switch_jobinfo_t switch_job,
char *nodelist);
Description: Note that the job step has completed
execution on the specified nodes.
Arguments: switch_job
(input) The completed job's switch credential.
nodelist (input) A list of nodes
on which the job has completed. This may contain expressions to specify node ranges
(e.g. "linux[1-20]" or "linux[2,4,6,8]").
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
void switch_p_print_jobinfo(FILE *fp, switch_jobinfo_t switch_job);
Description: Print the contents of a job's
switch credential to a file.
Arguments:
fp (input) pointer to an open file.
switch_job (input) a job's
switch credential.
Returns: None.
char *switch_p_sprint_jobinfo(switch_jobinfo_t switch_job,
char *buf, size_t size);
Description: Print the contents of a job's
switch credential to a buffer.
Arguments:
switch_job (input) a job's
switch credential.
buf (input/output) pointer to
buffer into which the job credential information is to be written.
size (input) size of buf in
bytes.
Returns: location of buffer, same as buf.
int switch_p_get_data_jobinfo(switch_jobinfo_t switch_job,
int key, void *resulting_data);
Description: Get data from a job's
switch credential.
Arguments:
switch_job (input) a job's
switch credential.
key (input) identification
of the type of data to be retrieved from the switch credential. NOTE: The
interpretation of this key is dependent upon the switch type.
resulting_data (input/output)
pointer to where the requested data should be stored.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
Node Specific Switch Management Functions
int switch_p_node_init (void);
Description: This function is run from the top level slurmd
only once per slurmd run. It may be used, for instance, to perform some one-time
interconnect setup or spawn an error handling thread.
Arguments: None
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_node_fini (void);
Description: This function is called once as slurmd exits
(slurmd will wait for this function to return before continuing the exit process).
Arguments: None
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
Job Management Functions
=========================================================================
Process 1 (root)        Process 2 (root, user)  |  Process 3 (user task)
                                                |
switch_p_job_preinit                            |
fork ------------------ switch_p_job_init       |
waitpid                 setuid, chdir, etc.     |
                        fork N procs -----------+--- switch_p_job_attach
                        wait all                |    exec mpi process
                        switch_p_job_fini*      |
switch_p_job_postfini                           |
=========================================================================
* switch_p_job_fini is run as the job owner, not as root.
int switch_p_job_preinit (switch_jobinfo_t switch_job);
Description: Preinit is run as root in the first slurmd process,
the so called job manager. This function can be used to perform any initialization
that needs to be performed in the same process as switch_p_job_fini().
Arguments:
switch_job (input) a job's
switch credential.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_job_init (switch_jobinfo_t switch_job, uid_t uid);
Description: Initialize interconnect on node for a job.
This function is run from the second slurmd process (some interconnect implementations
may require the switch_p_job_init functions to be executed from a separate process
than the process executing switch_p_job_fini() [e.g. Quadrics Elan]).
Arguments:
switch_job (input) a job's
switch credential.
uid (input) the user id
to execute a job.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_job_attach ( switch_jobinfo_t switch_job, char ***env,
uint32_t nodeid, uint32_t procid, uint32_t nnodes, uint32_t nprocs, uint32_t rank );
Description: Attach process to interconnect
(Called from within the process, so it is appropriate to set interconnect specific
environment variables here).
Arguments:
switch_job (input) a job's
switch credential.
env (input/output) the
environment variables to be set upon job initiation. Switch specific environment
variables are added as needed.
nodeid (input) zero-origin
id of this node.
procid (input) zero-origin
process id local to slurmd and not equivalent to the global task id or MPI rank.
nnodes (input) count of
nodes allocated to this job.
nprocs (input) total count of
processes or tasks to be initiated for this job.
rank (input) zero-origin
id of this task.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_job_fini (switch_jobinfo_t switch_job);
Description: This function is run from the same process
as switch_p_job_init() after all job tasks have exited. It is *not* run as root, because
the process in question has already setuid to the job owner.
Arguments:
switch_job (input) a job's
switch credential.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int switch_p_job_postfini ( switch_jobinfo_t switch_job, uid_t pgid,
uint32_t job_id, uint32_t step_id );
Description: This function is run from the initial slurmd
process (same process as switch_p_job_preinit()), and is run as root. Any cleanup routines
that need to be run with root privileges should be run from this function.
Arguments:
switch_job (input) a job's
switch credential.
pgid (input) The process
group id associated with this task.
job_id (input) the
associated SLURM job id.
step_id (input) the
associated SLURM job step id.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
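The calling sequence across the three processes can be sketched with fork() and waitpid(), as below. This is not slurmd's actual code: the demo_* stubs stand in for the plugin entry points with simplified signatures, and the setuid, chdir, and exec steps are only indicated in comments.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

/* Stubs standing in for a plugin's entry points (signatures simplified). */
static int demo_job_preinit(void)      { return SLURM_SUCCESS; }
static int demo_job_init(void)         { return SLURM_SUCCESS; }
static int demo_job_attach(int procid) { (void) procid; return SLURM_SUCCESS; }
static int demo_job_fini(void)         { return SLURM_SUCCESS; }
static int demo_job_postfini(void)     { return SLURM_SUCCESS; }

/* Process 2: init, fork the tasks, wait for all of them, then fini. */
static int demo_session_manager(int ntasks)
{
    int i, status, failed = 0;

    if (demo_job_init() != SLURM_SUCCESS)
        return 1;
    /* A real slurmd would setuid() to the job owner and chdir() here. */
    for (i = 0; i < ntasks; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            /* Process 3: attach, then exec the user's program (omitted). */
            _exit(demo_job_attach(i) == SLURM_SUCCESS ? 0 : 1);
        }
    }
    while (wait(&status) > 0) {
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }
    if (demo_job_fini() != SLURM_SUCCESS)   /* same process as init */
        failed = 1;
    return failed;
}

/* Process 1: the job manager. Runs preinit and postfini around process 2. */
int demo_run_job(int ntasks)
{
    int status;
    pid_t pid;

    if (demo_job_preinit() != SLURM_SUCCESS)
        return SLURM_ERROR;
    pid = fork();
    if (pid < 0)
        return SLURM_ERROR;
    if (pid == 0)
        _exit(demo_session_manager(ntasks));
    if (waitpid(pid, &status, 0) < 0)
        return SLURM_ERROR;
    if (demo_job_postfini() != SLURM_SUCCESS)
        return SLURM_ERROR;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        ? SLURM_SUCCESS : SLURM_ERROR;
}
```

Child processes leave through _exit() so that they do not flush stdio buffers inherited from the parent.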
Error Handling Functions
int switch_p_get_errno (void);
Description: Return the number of a switch
specific error.
Arguments: None
Returns: Error number for the last failure encountered by
the switch plugin.
char *switch_p_strerror(int errnum);
Description: Return a string description of a switch
specific error code.
Arguments:
errnum (input) a switch
specific error code.
Returns: Pointer to string describing the error
or NULL if no description found in this plugin.
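A table lookup is one simple way to implement switch_p_strerror(). In the sketch below, the error codes, the message strings, and the value used for ESLURM_SWITCH_MIN are hypothetical placeholders.

```c
#include <stddef.h>

#ifndef ESLURM_SWITCH_MIN
#define ESLURM_SWITCH_MIN 3000   /* placeholder; see slurm/slurm_errno.h */
#endif

/* Hypothetical plugin-specific error codes. */
enum demo_switch_errno {
    EDEMO_BAD_MAGIC = ESLURM_SWITCH_MIN,
    EDEMO_NO_ADAPTER
};

/* Table mapping each error code to its description. */
static const struct {
    int         errnum;
    const char *msg;
} demo_errtab[] = {
    { EDEMO_BAD_MAGIC,  "Credential magic number is invalid" },
    { EDEMO_NO_ADAPTER, "No switch adapter found on node" },
};

char *switch_p_strerror(int errnum)
{
    size_t i;

    for (i = 0; i < sizeof(demo_errtab) / sizeof(demo_errtab[0]); i++) {
        if (demo_errtab[i].errnum == errnum)
            return (char *) demo_errtab[i].msg;
    }
    return NULL;   /* no description found in this plugin */
}
```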
Versioning
This document describes version 0 of the SLURM Switch API. Future
releases of SLURM may revise this API. A switch plugin conveys its ability
to implement a particular API version using the mechanism outlined for SLURM plugins.
In addition, the credential is transmitted along with the version number of the
plugin that transmitted it. It is at the discretion of the plugin author whether
to maintain data format compatibility across different versions of the plugin.