## startR - Retrieval and processing of multidimensional datasets
startR is an R package developed at the Barcelona Supercomputing Center which implements the MapReduce paradigm (a.k.a. domain decomposition) on HPCs, in a way that is transparent to the user and especially oriented to complex multidimensional datasets.
Following the startR framework, the user can represent in a one-page startR script all the information that defines a use case, including:
- the involved (multidimensional) data sources and the distribution of the data files
- the workflow of operations to be applied, over which data sources and dimensions
- the HPC platform properties and the configuration of the execution
When run, the script triggers the execution of the defined workflow. Furthermore, the EC-Flow workflow manager is transparently used to dispatch tasks onto the HPC, and the user can employ its graphical interface for monitoring and control purposes.
An extensive part of this package is devoted to the automatic retrieval (from disk or remote storage into RAM) and arrangement of multi-dimensional distributed data sets. This functionality is encapsulated in a single function called `Start()`, which is explained in detail in the [**Start()**](inst/doc/start.md) documentation page and in `?Start`.
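As a small sketch of what this looks like (the file path and dimension names here are hypothetical; a real call must match your own files), a single `Start()` call declares the file pattern and the dimensions to read, and with `retrieve = TRUE` loads the selection directly into an array in RAM:

```r
library(startR)

# Hypothetical path pattern: $var$ and $sdate$ are tags that Start() expands
# in order to discover the files of the distributed data set.
data <- Start(dat = '/path/to/experiment/$var$/$var$_$sdate$.nc',
              var = 'tas',
              sdate = c('19930101', '19940101'),
              ensemble = 'all',
              time = 'all',
              latitude = 'all',
              longitude = 'all',
              retrieve = TRUE)

# The result is a named multidimensional array arranged by Start().
dim(data)
```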
startR is an open source project that is open to external collaboration and funding, and will continuously evolve to support as many data set formats as possible while maximizing its efficiency.
See the [**Deployment**](inst/doc/deployment.md) documentation page for details on the setup steps. The most relevant system dependencies are listed next:
- netCDF-4
- R with the startR, bigmemory and easyNCDF R packages
- For distributed computation:
  - a UNIX-based HPC (cluster of multi-processor nodes)
  - a job scheduler (Slurm, PBS or LSF)
  - EC-Flow >= 4.9.0
In order to install and load the latest published version of the package on CRAN, you can run the following lines in your R session:
```r
install.packages('startR')
library(startR)
```

Also, you can install the latest stable version from the GitLab repository as follows:

```r
devtools::install_git('https://earth.bsc.es/gitlab/es/startR.git')
```
An overview example of how to process a large data set is shown below. You can see real use cases in the [**Practical guide for processing large data sets with startR**](inst/doc/practical_guide.md), and you can find more information on the use of the `Start()` function in the [**Start()**](inst/doc/start.md) documentation page, as well as in the documentation of the functions in the package.
The purpose of the example in this section is simply to illustrate how the user is expected to use startR once the framework is deployed on the workstation and HPC. It shows how a simple addition and averaging operation is performed on BSC's CTE-Power HPC, over a multi-dimensional climate data set, which lives in the BSC-ES storage infrastructure. As mentioned in the introduction, the user will need to declare the involved data sources, the workflow of operations to carry out, and the computing environment and parameters.
#### 1. Declaration of data sources

```r
repos <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$sdate$.nc'
# A Start() call is built with the indices to operate
data <- Start(dat = repos,
              var = 'tas',
              sdate = '20180101',
              ensemble = 'all',
              time = 'all',
              latitude = indices(1:40),
              longitude = indices(1:40),
              retrieve = FALSE)
```
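Since `retrieve = FALSE`, no data is read at this point: `Start()` only returns a lightweight declaration of the data set to be used later in the workflow. As a quick check (a small sketch; the `'Dimensions'` attribute name follows the startR documentation and could differ across versions), the declared dimensions can be inspected without loading anything:

```r
# Named vector with the size of each declared dimension; no data is loaded.
attr(data, 'Dimensions')
```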
#### 2. Declaration of the workflow

```r
# The function to be applied is defined.
# It only operates on the essential 'target' dimensions.
fun <- function(x) {
  # Expected inputs:
  #   x: array with dimensions ('ensemble', 'time')
  apply(x + 1, 2, mean)
  # Outputs:
  #   single array with dimensions ('time')
}

# A startR Step is defined, specifying its expected input and
# output dimensions.
step <- Step(fun,
             target_dims = c('ensemble', 'time'),
             output_dims = c('time'))

# The workflow of operations is cast before execution.
wf <- AddStep(data, step)
```
#### 3. Declaration of the HPC platform and execution
```r
res <- Compute(wf,
               chunks = list(latitude = 2,
                             longitude = 2),
               threads_load = 1,
               threads_compute = 2,
               cluster = list(queue_host = 'p9login1.bsc.es',
                              queue_type = 'slurm',
                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_tests/',
                              lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/',
                              r_module = 'R/3.5.0-foss-2018b',
                              cores_per_job = 2,
                              job_wallclock = '00:10:00',
                              max_jobs = 4,
                              extra_queue_params = list('#SBATCH --qos=bsc_es'),
                              bidirectional = FALSE,
                              polling_period = 10),
               ecflow_suite_dir = '/home/Earth/nmanuben/test_startR/',
               wait = TRUE)
```
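With `wait = TRUE`, `Compute()` blocks until all chunks have finished and returns the gathered results together with profiling metadata. Assuming the default output name `output1` used in the startR documentation, the result can then be inspected as a regular R array:

```r
# Dimensions of the first (and here only) output of the workflow: the 'time'
# dimension produced by the Step plus the chunked spatial dimensions.
dim(res$output1)
str(res$output1)
```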
#### 4. Monitoring the execution
During the execution of the workflow, which is orchestrated by EC-Flow and a job scheduler (either Slurm, LSF or PBS), the status can be monitored using the EC-Flow graphical user interface. Pending tasks are coloured in blue, ongoing in green, and finished in yellow.
<img src="inst/doc/ecflow_monitoring.png" width="600" />
#### 5. Profiling of the execution
Additionally, profiling measurements of the execution are provided together with the output data. Such measurements can be visualized with the `PlotProfiling()` function made available in the source code of the startR package.
This function is not included in the official set of package functions because it depends on a number of heavyweight plotting libraries that take time to load; since the startR package is loaded in each of the worker jobs on the HPC or cluster, bundling them would mean a substantial amount of time spent repeatedly loading visualization libraries that the computing stage never uses.
```r
source('https://earth.bsc.es/gitlab/es/startR/raw/master/inst/PlotProfiling.R')
PlotProfiling(attr(res, 'startR_compute_profiling'))
```
You can click on the image to expand it.
<img src="inst/doc/compute_profiling.png" width="800" />