Overview
This document describes SLURM mpi selection plugins and the API that defines
them. It is intended as a resource to programmers wishing to write their own SLURM
mpi selection plugins. This is version 0 of the API.
SLURM mpi selection plugins are SLURM plugins that implement the API described herein,
which determines which version of mpi is used during execution of a new SLURM job. They are
intended to provide a mechanism for both selecting mpi versions for pending jobs and
performing any mpi-specific tasks for job launch or termination. The plugins must
conform to the SLURM Plugin API with the following specifications:
const char plugin_type[]
The major type must be "mpi". The minor type can be any recognizable
abbreviation for the type of mpi. We recommend, for example:
- mpich-gm: For use with Myrinet.
- mvapich: For use with Infiniband.
- lam: A no-op right now. LAM is not implemented yet.
The plugin_name and plugin_version symbols required by the SLURM Plugin API
require no specialization for mpi support.
Note carefully, however, the versioning discussion below.
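As a minimal sketch of the identification symbols discussed above, a plugin source file might declare them as follows. The "major/minor" form of plugin_type, the uint32_t type of plugin_version, and the name and version values are assumptions made for illustration; the SLURM Plugin API is the authoritative reference for their exact definitions.

```c
#include <stdint.h>

/* Human-readable description of the plugin (illustrative value only). */
const char plugin_name[] = "Example mpi plugin (sketch)";

/* Assumed "major/minor" form: major type "mpi", minor type "lam". */
const char plugin_type[] = "mpi/lam";

/* Assumed integer version symbol; the value 1 is a placeholder. */
const uint32_t plugin_version = 1;
```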
A simplified flow of logic follows:
The user specifies the mpi type to use with the srun option --mpi=MPITYPE.
srun calls
mpi_p_thr_create((srun_job_t *)job);
which will set up the correct environment for the specified mpi.
The slurmd daemon runs
mpi_p_init((slurmd_job_t *)job, (int)rank);
which will configure slurmd to use the correct mpi as well as to interact with srun.
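The first step of this flow, turning the --mpi=MPITYPE option into a plugin type, could be sketched as below. The helper mpi_type_from_option is hypothetical and is not part of SLURM, and the "mpi/MPITYPE" result format is an assumption about how the major and minor types are combined.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SLURM: convert an argument of the
 * form "--mpi=MPITYPE" into a plugin type string "mpi/MPITYPE".
 * Returns 0 on success, or -1 if the argument does not match, names
 * no type, or the result does not fit in the output buffer.
 */
int mpi_type_from_option(const char *arg, char *out, size_t out_len)
{
    static const char prefix[] = "--mpi=";
    const size_t prefix_len = sizeof(prefix) - 1;
    int n;

    if (arg == NULL || out == NULL)
        return -1;
    if (strncmp(arg, prefix, prefix_len) != 0)
        return -1;                  /* not an --mpi option */
    if (arg[prefix_len] == '\0')
        return -1;                  /* no MPITYPE given */

    n = snprintf(out, out_len, "mpi/%s", arg + prefix_len);
    if (n < 0 || (size_t)n >= out_len)
        return -1;                  /* formatting failed or truncated */
    return 0;
}
```

For example, passing "--mpi=mvapich" with a sufficiently large buffer leaves "mpi/mvapich" in the buffer.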
Data Objects
These functions are expected to read and/or modify data structures directly in
the memory of the slurmd daemon and srun. Slurmd is a multi-threaded program with independent
read and write locks on each data structure type. Therefore the type of operations
permitted on various data structures is identified for each function.
API Functions
The following functions must appear. Functions which are not implemented should
be stubbed.
int mpi_p_init (slurmd_job_t *job, int rank);
Description: Used by slurmd to configure the slurmd's environment
to that of the correct mpi.
Arguments: job
(input) Pointer to the slurmd_job that is running. Cannot be NULL.
rank
(input) Primarily there for MVAPICH. Used to send the rank of the mpirun job.
This can be 0 if no rank information is needed for the mpi type.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.
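A minimal sketch of how an mpi_p_init implementation might hand the rank to the mpi library through the environment is shown below. The slurmd_job_t definition and the SLURM_SUCCESS and SLURM_ERROR values are stand-ins so the fragment compiles outside the SLURM source tree, EXAMPLE_MPI_RANK is a hypothetical variable name, and setting the process environment with the POSIX setenv call is a simplification rather than a description of what slurmd actually does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* for setenv() */
#endif
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for definitions normally provided by the SLURM headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR (-1)
typedef struct slurmd_job { int placeholder; } slurmd_job_t;

int mpi_p_init(slurmd_job_t *job, int rank)
{
    char value[16];

    if (job == NULL)
        return SLURM_ERROR;         /* job cannot be NULL */

    /* Publish the rank under a hypothetical variable name. */
    snprintf(value, sizeof(value), "%d", rank);
    if (setenv("EXAMPLE_MPI_RANK", value, 1) != 0)
        return SLURM_ERROR;

    return SLURM_SUCCESS;
}
```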
int mpi_p_thr_create (srun_job_t *job);
Description: Used by srun to spawn the thread for the mpi processes.
Most of the real processing happens here.
Arguments: job
(input) Pointer to the srun_job that is running. Cannot be NULL.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return -1.
int mpi_p_single_task ();
Description: Tells the system whether or not multiple tasks
can run at the same time.
Arguments:
none
Returns: false if multiple tasks can run and true if only
a single task can run at one time.
int mpi_p_exit();
Description: Cleans up anything that needs cleaning up after
execution.
Arguments:
none
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR, causing slurmctld to exit.
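Because every function must appear even when it does nothing, a fully stubbed plugin could look roughly like the sketch below. The type definitions and the SLURM_SUCCESS and SLURM_ERROR values are stand-ins so the fragment compiles on its own; a real plugin would take them from the SLURM headers, and the NULL checks are an illustrative choice rather than a documented requirement.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for definitions normally provided by the SLURM headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR (-1)
typedef struct slurmd_job { int placeholder; } slurmd_job_t;
typedef struct srun_job { int placeholder; } srun_job_t;

/* slurmd side: nothing to configure for this mpi type. */
int mpi_p_init(slurmd_job_t *job, int rank)
{
    (void)rank;                     /* rank is unused by this stub */
    return (job != NULL) ? SLURM_SUCCESS : SLURM_ERROR;
}

/* srun side: no helper thread is needed for this mpi type. */
int mpi_p_thr_create(srun_job_t *job)
{
    return (job != NULL) ? SLURM_SUCCESS : SLURM_ERROR;
}

/* Multiple tasks may run at the same time. */
int mpi_p_single_task(void)
{
    return false;
}

/* Nothing to clean up. */
int mpi_p_exit(void)
{
    return SLURM_SUCCESS;
}
```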
Versioning
This document describes version 0 of the SLURM mpi selection API. Future
releases of SLURM may revise this API. An mpi selection plugin conveys its ability
to implement a particular API version using the mechanism outlined for SLURM plugins.
In addition, the credential is transmitted along with the version number of the
plugin that transmitted it. It is at the discretion of the plugin author whether
to maintain data format compatibility across different versions of the plugin.