## startR
The startR package, developed at the Barcelona Supercomputing Center, implements the MapReduce paradigm (a.k.a. domain decomposition) on HPCs in a way that is transparent to the user and especially oriented to complex multidimensional datasets.
Following the startR framework, the user can represent in a one-page startR script all the information that defines a use case, including:
- the involved (multidimensional) data sources and the distribution of the data files
- the workflow of operations to be applied, over which data sources and dimensions
- the HPC platform properties and the configuration of the execution
When run, the script triggers the execution of the defined workflow. Furthermore, the EC-Flow workflow manager is transparently used to dispatch tasks onto the HPC, and the user can employ its graphical interface for monitoring and control purposes.
startR is a project started at BSC with the aim of developing a tool that allows the user to automatically retrieve, homogenize and process multidimensional distributed data sets. It is an open source project, open to external collaboration and funding, and will continuously evolve to support as many data set formats as possible while maximizing its efficiency.
An extensive part of this package is devoted to the automatic retrieval (from disk or storage into RAM) and arrangement of multi-dimensional distributed data sets. This functionality is encapsulated in a single function called `Start()`, which is explained in detail in the [**Start()**](vignettes/start.md) documentation page and in `?Start`.
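As a minimal illustration, a `Start()` call that loads a selection of data directly into RAM could look as follows. The file path and the `year` dimension below are hypothetical; see the linked documentation for real use cases.

```r
library(startR)

# Hypothetical path pattern: the pieces wrapped in '$' are declared as
# dimensions of the data set and are expanded when listing the files.
path <- '/path/to/data/$var$/$var$_$year$.nc'

# Select the slices of interest and load them into RAM (retrieve = TRUE).
data <- Start(dat = path,
              var = 'tas',
              year = c('2000', '2001'),
              time = 'all',
              retrieve = TRUE)
```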
### Installation
In order to install and load the latest published version of the package on CRAN, you can run the following lines in your R session:
```r
install.packages('startR')
library(startR)
```

Also, you can install the latest stable version from this GitLab repository as follows:

```r
devtools::install_git('https://earth.bsc.es/gitlab/es/startR')
```
See the [**Deployment**](vignettes/deployment.md) documentation page or the details in `?Compute` for a guide on the deployment and set-up steps, as well as additional technical aspects.
### How to use
An overview example of how to process a large data set is shown below. See the [**Start()**](vignettes/start.md) documentation page, as well as the documentation of the functions in the package, for further details on usage.
The purpose of the example in this section is simply to illustrate how the user is expected to interact with the startR loading and distributed computing capabilities once the framework is deployed on the user's workstation and on the computing cluster or HPC.
This example shows how a simple addition and averaging operation is performed on BSC's CTE-Power HPC over a multi-dimensional climate data set stored in the BSC-ES storage infrastructure. As mentioned in the introduction, the user needs to declare the involved data sources, the workflow of operations to carry out, and the computing environment and parameters.

#### 1. Declaration of data sources
```r
library(startR)
repos <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$sdate$.nc'
data <- Start(dat = repos,
              var = 'tas',
              sdate = '20180101',
              ensemble = 'all',
              time = 'all',
              latitude = indices(1:40),
              longitude = indices(1:40),
              retrieve = FALSE)
```
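
Note that, with `retrieve = FALSE`, no data is actually read at this point: `data` is a light-weight declaration of the involved files and selected dimensions. As a quick sanity check, the declared dimensions can be inspected as sketched below (assuming the `'Dimensions'` attribute attached by `Start()`):

```r
# Inspect the dimensions of the declared (not yet loaded) data set.
attr(data, 'Dimensions')
```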

#### 2. Declaration of the workflow

```r
# The function to be applied is defined.
# It only operates on the essential 'target' dimensions.
fun <- function(x) {
  # Expected inputs:
  #   x: array with dimensions ('ensemble', 'time')
  apply(x + 1, 2, mean)
}

# A startR Step is defined, specifying its expected input and 
# output dimensions.
step <- Step(fun, 
             target_dims = c('ensemble', 'time'), 
             output_dims = c('time'))

# The workflow of operations is cast before execution.
wf <- AddStep(data, step)
```
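
If the step function needs extra fixed parameters besides the data inputs, they can be supplied when the step is added to the workflow. The sketch below is illustrative (the `offset` parameter is hypothetical, assuming additional named arguments passed to `AddStep()` are forwarded to the step function):

```r
# A variant of the previous function taking a constant parameter.
fun_offset <- function(x, offset) {
  # x: array with dimensions ('ensemble', 'time')
  apply(x + offset, 2, mean)
}

step_offset <- Step(fun_offset,
                    target_dims = c('ensemble', 'time'),
                    output_dims = c('time'))

# The constant parameter is provided when the step is added to the workflow.
wf_offset <- AddStep(data, step_offset, offset = 1)
```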

#### 3. Declaration of the HPC platform and execution

```r
res <- Compute(wf,
               chunks = list(latitude = 2,
                             longitude = 2),
               threads_load = 1,
               threads_compute = 2,
               cluster = list(queue_host = 'p9login1.bsc.es',
                              queue_type = 'slurm',
                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_tests/',
                              lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/',
                              r_module = 'R/3.5.0-foss-2018b',
                              cores_per_job = 2,
                              job_wallclock = '00:10:00',
                              max_jobs = 4,
                              extra_queue_params = list('#SBATCH --qos=bsc_es'),
                              bidirectional = FALSE,
                              polling_period = 10
                             ),
               ecflow_suite_dir = '/home/Earth/nmanuben/test_startR/',
               wait = TRUE)
```
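
For quick testing, the same workflow can also be computed locally on the workstation by omitting the `cluster` configuration; a sketch, with the chunking and threading options still applying:

```r
# Run the same workflow on the local machine, processing the array in chunks.
res_local <- Compute(wf,
                     chunks = list(latitude = 2,
                                   longitude = 2),
                     threads_load = 1,
                     threads_compute = 2)
```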

#### 4. Profiling of the execution

Additionally, profiling measurements of the execution are preserved together with the output data. Such measurements can be visualized with the `PlotProfiling` function made available in the source code of the startR package.

This function is not included in the official set of functions of the package because it requires a number of plotting libraries which can take time to load; since the startR package is loaded in each of the worker jobs on the HPC or cluster, this could imply a substantial amount of time spent repeatedly loading unused visualization libraries during the computing stage.

```r
source('https://earth.bsc.es/gitlab/es/startR/raw/master/inst/PlotProfiling.R')
PlotProfiling(attr(res, 'startR_compute_profiling'))
```

You can click on the image to expand it.

<img src="vignettes/compute_profiling.png" width="800" />