lamssi_cr

Section: LAM SSI CR OVERVIEW (7)
Updated: May, 2004
 

NAME

LAM SSI checkpoint / restart - overview of LAM's MPI checkpoint / restart SSI modules  

DESCRIPTION

The "kind" for checkpoint / restart SSI modules is "cr". Specifically, the string "cr" (without the quotes) is the prefix to use with the -ssi switch on the mpirun command line. For example:

    mpirun -ssi cr blcr C my_mpi_program

LAM/MPI supports involuntary checkpoint and restart of parallel MPI jobs. This requires that LAM/MPI was compiled with thread support and that a back-end checkpointing system is available at run-time. MPI jobs must run with at least MPI_THREAD_SERIALIZED support. If a job elects to run with checkpoint/restart support and an available cr module is found, the job's thread level is automatically promoted to MPI_THREAD_SERIALIZED. See the User's Guide for more details.

AVAILABLE MODULES

LAM currently has only one cr module: blcr. For an MPI job to be checkpointed and restarted, all of its MPI SSI modules must support checkpoint/restart. Currently, this means using the crtcp RPI module.
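
As an illustration of the requirement above, a checkpointable job would select both the blcr cr module and the crtcp RPI module. The sketch below assumes that the RPI kind is selected with "-ssi rpi crtcp", following the same form as the cr example in DESCRIPTION. The helper name cr_mpirun_cmd and the program name my_mpi_program are placeholders, and the command is printed rather than executed so it can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: assemble an mpirun command line for a job that
# should be checkpointable.
cr_mpirun_cmd() {
    # -ssi cr blcr   : select the blcr checkpoint/restart module
    # -ssi rpi crtcp : select the crtcp RPI module (assumed -ssi form)
    # C              : the same placement argument used in the example above
    echo "mpirun -ssi cr blcr -ssi rpi crtcp C $1"
}

# Print the command line instead of running it.
cr_mpirun_cmd my_mpi_program
```

Running the printed command requires a booted LAM universe and a LAM/MPI build with thread support, as described in DESCRIPTION.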

BLCR

The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is a software system from Lawrence Berkeley Labs. See the project web page for more details: http://www.nersc.gov/research/ftg/checkpoint/.

The blcr module has one SSI parameter:

cr_blcr_priority
blcr's default priority is 50.
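
A minimal sketch of overriding this parameter follows. It assumes that SSI parameters are passed to mpirun as "-ssi <name> <value>", in the same form as the module selection shown in DESCRIPTION. The value 75, the helper name blcr_priority_cmd, and the program name are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: assemble an mpirun command line that selects the
# blcr module and sets its priority parameter.
blcr_priority_cmd() {
    # $1 = priority value to assign to cr_blcr_priority (default is 50)
    # $2 = MPI program to launch
    echo "mpirun -ssi cr blcr -ssi cr_blcr_priority $1 C $2"
}

# Print the command line instead of running it.
blcr_priority_cmd 75 my_mpi_program
```

Because blcr is currently the only cr module, changing its priority is unlikely to affect which module is chosen; the example only shows the parameter syntax.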
 

SEE ALSO

lamssi(7), mpirun(1)
