Overview
This document describes SLURM job checkpoint plugins and the API that defines
them. It is intended as a resource to programmers wishing to write their own SLURM
job checkpoint plugins. This is version 0 of the API.
SLURM job checkpoint plugins are SLURM plugins that implement the SLURM
API for checkpointing and restarting jobs.
The plugins must conform to the SLURM Plugin API with the following specifications:
const char plugin_type[]
The major type must be "checkpoint". The minor type can be any recognizable
abbreviation for the type of checkpoint mechanism. We recommend, for example:
- none: No job checkpoint.
- aix: AIX system checkpoint.
The plugin_name and plugin_version symbols required by the SLURM Plugin API
require no specialization for job checkpoint support.
Note carefully, however, the versioning discussion below.
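As a rough illustration of the three symbols discussed above, a plugin source file might begin as follows. This is a sketch only: it assumes the plugin_type string joins the major and minor types in the form "major/minor", that plugin_version is a 32-bit unsigned integer, and the name and version values shown are placeholders rather than values taken from any real plugin.

```c
#include <stdint.h>

/* Symbols required by the SLURM Plugin API. All values are illustrative. */

/* Free-form, human-readable description (placeholder text). */
const char plugin_name[] = "Example job checkpoint plugin";

/* Major type "checkpoint", minor type "none" (assumed "major/minor" form). */
const char plugin_type[] = "checkpoint/none";

/* Placeholder version number; see the versioning discussion. */
const uint32_t plugin_version = 90;
```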
The programmer is urged to study
src/plugins/checkpoint/checkpoint_aix.c
for a sample implementation of a SLURM job checkpoint plugin.
Data Objects
The implementation must maintain (though not necessarily directly export) an
enumerated errno so that SLURM can discover, as far as is practical, the reason
for any failed API call. Plugin-specific enumerated
integer values may be used when appropriate.
These values must not be used as return values in integer-valued functions
in the API. The proper error return value from integer-valued functions is SLURM_ERROR.
The implementation should endeavor to provide useful and pertinent information by
whatever means is practical.
Successful API calls are not required to reset any errno to a known value. However,
the initial value of any errno, prior to any error condition arising, should be
SLURM_SUCCESS.
There is also a checkpoint-specific error code and message that may be associated
with each job step.
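The errno convention described above could be sketched as follows. The enumeration, the plugin_errno variable, and example_call are all hypothetical names introduced for illustration, and the SLURM_SUCCESS and SLURM_ERROR definitions are local stand-ins for the values that SLURM's own headers would supply.

```c
/* Stand-ins for the values normally provided by SLURM's headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

/* Hypothetical plugin-specific error values. */
enum ckpt_example_errno {
    CKPT_EX_NOT_SUPPORTED = 1000,
    CKPT_EX_NO_MEMORY
};

/* The errno starts out as SLURM_SUCCESS, before any error has occurred. */
static int plugin_errno = SLURM_SUCCESS;

/* Hypothetical API call that fails when 'supported' is zero. */
static int example_call(int supported)
{
    if (!supported) {
        plugin_errno = CKPT_EX_NOT_SUPPORTED; /* record the reason */
        return SLURM_ERROR;   /* never return the enumerated value itself */
    }
    /* A successful call is not required to reset the errno. */
    return SLURM_SUCCESS;
}
```

Note that after a failure followed by a success, plugin_errno still holds the earlier error value, which is permitted by the rules above.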
API Functions
The following functions must appear. Functions which are not implemented should
be stubbed.
int slurm_ckpt_alloc_job (check_jobinfo_t *jobinfo);
Description: Allocate storage for job-step specific
checkpoint data.
Argument: jobinfo
(output) returns pointer to the allocated storage.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int slurm_ckpt_free_job (check_jobinfo_t jobinfo);
Description: Release storage for job-step specific
checkpoint data that was previously allocated by slurm_ckpt_alloc_job.
Argument: jobinfo
(input) pointer to the previously allocated storage.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
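A minimal sketch of the allocate and free pair is shown below. The layout of struct check_job_info is hypothetical, check_jobinfo_t is defined locally as a stand-in for SLURM's opaque type, and plain calloc/free are used in place of whatever allocator a real plugin would use. Treating a NULL argument to the free function as a harmless no-op is a design choice of this sketch, not something the API text specifies.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Stand-ins for the values normally provided by SLURM's headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

/* Hypothetical job-step specific checkpoint data. */
struct check_job_info {
    uint16_t disabled;    /* non-zero if checkpoints are disabled */
    time_t   time_stamp;  /* start time of the last operation */
    uint32_t error_code;  /* checkpoint-specific error code */
    char    *error_msg;   /* checkpoint-specific error message */
};

/* Local stand-in for SLURM's opaque check_jobinfo_t. */
typedef struct check_job_info *check_jobinfo_t;

int slurm_ckpt_alloc_job(check_jobinfo_t *jobinfo)
{
    if (jobinfo == NULL) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    /* calloc zero-fills, so all fields start in a known state. */
    *jobinfo = calloc(1, sizeof(struct check_job_info));
    if (*jobinfo == NULL) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}

int slurm_ckpt_free_job(check_jobinfo_t jobinfo)
{
    if (jobinfo == NULL)
        return SLURM_SUCCESS;   /* nothing to release */
    free(jobinfo->error_msg);   /* free(NULL) is a no-op */
    free(jobinfo);
    return SLURM_SUCCESS;
}
```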
int slurm_ckpt_pack_job (check_jobinfo_t jobinfo, Buf buffer);
Description: Store job-step specific checkpoint data
into a buffer.
Arguments:
jobinfo
(input) pointer to the previously allocated storage.
buffer (input/output) buffer to which
jobinfo has been appended.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
int slurm_ckpt_unpack_job (check_jobinfo_t jobinfo, Buf buffer);
Description: Retrieve job-step specific checkpoint data
from a buffer.
Arguments:
jobinfo
(output) pointer to the previously allocated storage.
buffer (input/output) buffer from which
jobinfo has been removed.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.
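The pack and unpack pair can be illustrated with a self-contained round trip. Everything here other than the two function names is an assumption: the Buf type is replaced by a small fixed-size stand-in (struct example_buf), the jobinfo layout is hypothetical, and fields are copied in host byte order for brevity. A real plugin would use SLURM's own buffer type and packing routines instead.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLURM_SUCCESS 0     /* stand-ins for SLURM's header values */
#define SLURM_ERROR  (-1)

/* Hypothetical fixed-size stand-in for SLURM's Buf type. */
struct example_buf {
    unsigned char data[64];
    size_t used;            /* bytes appended so far */
    size_t read;            /* bytes already consumed */
};
typedef struct example_buf *Buf;

/* Hypothetical job-step specific checkpoint data. */
struct check_job_info {
    uint16_t disabled;
    uint32_t error_code;
};
typedef struct check_job_info *check_jobinfo_t;

static int put_bytes(Buf b, const void *src, size_t n)
{
    if (n > sizeof(b->data) - b->used)
        return SLURM_ERROR;             /* not enough room */
    memcpy(b->data + b->used, src, n);
    b->used += n;
    return SLURM_SUCCESS;
}

static int get_bytes(Buf b, void *dst, size_t n)
{
    if (n > b->used - b->read)
        return SLURM_ERROR;             /* not enough data */
    memcpy(dst, b->data + b->read, n);
    b->read += n;
    return SLURM_SUCCESS;
}

int slurm_ckpt_pack_job(check_jobinfo_t jobinfo, Buf buffer)
{
    size_t start;

    if (jobinfo == NULL || buffer == NULL) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    start = buffer->used;
    if (put_bytes(buffer, &jobinfo->disabled, sizeof(jobinfo->disabled)) ||
        put_bytes(buffer, &jobinfo->error_code, sizeof(jobinfo->error_code))) {
        buffer->used = start;           /* undo any partial write */
        errno = ENOSPC;
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}

int slurm_ckpt_unpack_job(check_jobinfo_t jobinfo, Buf buffer)
{
    size_t start;

    if (jobinfo == NULL || buffer == NULL) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    start = buffer->read;
    /* Fields must be read back in the order they were packed. */
    if (get_bytes(buffer, &jobinfo->disabled, sizeof(jobinfo->disabled)) ||
        get_bytes(buffer, &jobinfo->error_code, sizeof(jobinfo->error_code))) {
        buffer->read = start;           /* undo any partial read */
        errno = EINVAL;
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}
```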
int slurm_ckpt_op ( uint16_t op, uint16_t data,
struct step_record * step_ptr, time_t * event_time,
uint32_t *error_code, char **error_msg );
Description: Perform some checkpoint operation on a
specific job step.
Arguments:
op
(input) specifies the operation to be performed. Currently supported
operations include CHECK_ABLE (is job step currently able to be checkpointed),
CHECK_DISABLE (disable checkpoints for this job step),
CHECK_ENABLE (enable checkpoints for this job step),
CHECK_CREATE (create a checkpoint for this job step and continue its execution),
CHECK_VACATE (create a checkpoint for this job step and terminate it),
CHECK_RESTART (restart this previously checkpointed job step), and
CHECK_ERROR (return checkpoint-specific error information for this job step).
data (input) operation-specific
data.
step_ptr (input/output) identifies
the job step to be operated upon.
event_time (output) identifies
the time of a checkpoint or restart operation.
error_code (output) returns
checkpoint-specific error code associated with an operation.
error_msg (output) identifies
checkpoint-specific error message associated with an operation.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set error_code and error_msg to
appropriate values to indicate the reason for failure.
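The dispatch on op might be organized roughly as in the sketch below. The numeric values of the CHECK_* operations, the contents of struct step_record and struct check_job_info, the EX_CKPT_* error codes, and the decision to report CHECK_CREATE, CHECK_VACATE and CHECK_RESTART as unsupported are all assumptions made to keep the example self-contained. It also hands back pointers to static message strings, whereas a real plugin would need to follow SLURM's actual ownership rules for error_msg.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define SLURM_SUCCESS 0     /* stand-ins for SLURM's header values */
#define SLURM_ERROR  (-1)

/* Operations named in the text; the numeric values are assumed. */
enum check_opts {
    CHECK_ABLE, CHECK_DISABLE, CHECK_ENABLE, CHECK_CREATE,
    CHECK_VACATE, CHECK_RESTART, CHECK_ERROR
};

/* Hypothetical checkpoint-specific error codes. */
enum { EX_CKPT_DISABLED = 1, EX_CKPT_NOT_SUPPORTED = 2 };

/* Hypothetical per-step checkpoint state and step record stand-in. */
struct check_job_info {
    uint16_t disabled;
    time_t   time_stamp;
    uint32_t error_code;
    char    *error_msg;
};
struct step_record {
    struct check_job_info *check_job;
};

static char msg_disabled[] = "checkpoints are disabled for this job step";
static char msg_unsupported[] = "operation not supported by this sketch";

int slurm_ckpt_op(uint16_t op, uint16_t data, struct step_record *step_ptr,
                  time_t *event_time, uint32_t *error_code, char **error_msg)
{
    struct check_job_info *info;

    (void) data;    /* operation-specific data is unused in this sketch */
    if (!step_ptr || !step_ptr->check_job ||
        !event_time || !error_code || !error_msg)
        return SLURM_ERROR;
    info = step_ptr->check_job;

    switch (op) {
    case CHECK_ABLE:
        if (info->disabled) {
            *error_code = EX_CKPT_DISABLED;
            *error_msg = msg_disabled;
            return SLURM_ERROR;
        }
        *event_time = info->time_stamp;
        return SLURM_SUCCESS;
    case CHECK_DISABLE:
        info->disabled = 1;
        return SLURM_SUCCESS;
    case CHECK_ENABLE:
        info->disabled = 0;
        return SLURM_SUCCESS;
    case CHECK_ERROR:
        /* Report the error information stored for this job step. */
        *error_code = info->error_code;
        *error_msg = info->error_msg;
        return SLURM_SUCCESS;
    case CHECK_CREATE:
    case CHECK_VACATE:
    case CHECK_RESTART:
    default:
        *error_code = EX_CKPT_NOT_SUPPORTED;
        *error_msg = msg_unsupported;
        return SLURM_ERROR;
    }
}
```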
int slurm_ckpt_comp ( struct step_record * step_ptr, time_t event_time,
uint32_t error_code, char *error_msg );
Description: Note the completion of a checkpoint operation.
Arguments:
step_ptr (input/output) identifies
the job step to be operated upon.
event_time (input) identifies
the time that the checkpoint operation began.
error_code (input)
checkpoint-specific error code associated with an operation.
error_msg (input)
checkpoint-specific error message associated with an operation.
Returns: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the error_code and error_msg to an
appropriate value to indicate the reason for failure.
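One way the completion call could record its inputs is sketched below. The structure layouts are hypothetical, and two behaviors are assumptions of this sketch rather than requirements stated above: a completion whose event_time does not match the recorded start time is rejected, and the error message is copied so the plugin does not depend on the caller's storage.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SLURM_SUCCESS 0     /* stand-ins for SLURM's header values */
#define SLURM_ERROR  (-1)

/* Hypothetical per-step checkpoint state and step record stand-in. */
struct check_job_info {
    time_t   time_stamp;    /* time the current operation began */
    uint32_t error_code;
    char    *error_msg;     /* heap copy owned by the plugin */
};
struct step_record {
    struct check_job_info *check_job;
};

int slurm_ckpt_comp(struct step_record *step_ptr, time_t event_time,
                    uint32_t error_code, char *error_msg)
{
    struct check_job_info *info;
    char *copy = NULL;

    if (step_ptr == NULL || step_ptr->check_job == NULL)
        return SLURM_ERROR;
    info = step_ptr->check_job;

    /* Assumed policy: ignore completions for some other operation. */
    if (event_time != info->time_stamp)
        return SLURM_ERROR;

    if (error_msg != NULL) {
        size_t n = strlen(error_msg) + 1;
        copy = malloc(n);
        if (copy == NULL)
            return SLURM_ERROR;
        memcpy(copy, error_msg, n);
    }

    /* Replace any previously stored error information. */
    free(info->error_msg);
    info->error_msg = copy;
    info->error_code = error_code;
    return SLURM_SUCCESS;
}
```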
Versioning
This document describes version 0 of the SLURM checkpoint API. Future
releases of SLURM may revise this API. A checkpoint plugin conveys its ability
to implement a particular API version using the mechanism outlined for SLURM plugins.