class: center, middle, inverse, title-slide

# startR: A tool for large multi-dimensional data processing

### An-Chi Ho*, Núria Pérez-Zanón, Nicolau Manubens, Francesco Benincasa, Pierre-Antoine Bretonnière

### 8th July 2021

---

<style type="text/css">
pre {
  max-height: 350px;
  overflow-y: auto;
}
pre[class] {
  max-height: 100px;
}
</style>

## Outline

### 1. startR’s background and purpose
### 2. Functions and workflow
### 3. Use case
### 4. Summary and resources

---
class: chapter-slide

# startR’s background and purpose

---

### Data analysis process in the Earth science field

Good .red[data manipulation] can greatly facilitate the subsequent data analysis work.

<div class="figure" style="text-align: center">
<img src="fig1.png" alt="General data analysis process in the Earth science domain" width="90%" />
<p class="caption">...</p>
</div>

---

### Potential problems

**1. Big data**
- Higher resolution in all dimensions
  - More ensemble members in experiments
  - Higher temporal and spatial resolutions
- Various types of data
  - The analysis may require combining experimental, reanalysis, and observational data, which may have different data structures.

--

**2. Complex analysis**
- Bootstrapping
- Machine learning models
- etc.

---

### Potential problems

**1. Big data**

**2. Complex analysis**

<br>

**↠ Issues**
- Longer data loading and processing times
- Limited memory space on the local machine

--

<div class="figure" style="text-align: center">
<img src="fig44.png" alt="startR is a new tool to manage large multi-dimensional data retrieval and manipulation" width="80%" />
<p class="caption">...</p>
</div>

---

### startR features

- An R package tailored for **big multi-dimensional data** retrieval and processing
- Applies the **multiApply** paradigm, which provides flexibility in multi-dimensional data processing
- Implements the MapReduce paradigm (i.e., chunking) on HPCs for **parallel distributed data processing**
- Pre-processing: data **transformation** or **reordering/reshaping/renaming** of dimensions before performing the analysis
- Well-preserved **metadata** throughout the whole process
- Uses the **ecFlow** workflow manager for job distribution and monitoring on HPCs
- Accepted data format: **netCDF** for now; support for other formats may be added

---
class: chapter-slide

# startR functions and workflow

---

### startR functions and workflow

With startR, users can create a concise data analysis script that contains all the information needed.

<img src="fig2.png" title="The workflow and corresponding function at each step" alt="The workflow and corresponding function at each step" width="90%" style="display: block; margin: auto auto auto 0;" />

---

#### 1. Data declaration

Identify which files are needed and where they are, and define the desired dimensions.

```r
data <- Start(
  # data source
  dat = '/esarchive/exp/ecmwf/system5_m1/monthly_mean/$var$_f6h/$var$_$sdate$.nc',
  # file dimensions
  var = 'tas',
  sdate = c('20170101', '20170201'),
  # inner dimensions
  ensemble = indices(1:50),
  time = 'all',
  latitude = values(list(-90, 90)),
  longitude = values(list(0, 359.9)),
  # parameters for pre-processing, metadata, definition, etc.
  ...,
  retrieve = FALSE)
```

.blue[`retrieve = TRUE`] → Load the data into the workstation’s memory. Returns a multi-dimensional array.

.blue[`retrieve = FALSE`] → Create an object of class `startR_cube` that points to the data repository and stores only the metadata.
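For a quick look before building a workflow, the same kind of call can be retrieved directly into memory. A minimal sketch (the subset below is arbitrary, chosen small enough to load comfortably):

```r
# Load a small, arbitrary subset with retrieve = TRUE; the result is a plain
# multi-dimensional R array that can be inspected right away.
small <- Start(dat = '/esarchive/exp/ecmwf/system5_m1/monthly_mean/$var$_f6h/$var$_$sdate$.nc',
               var = 'tas',
               sdate = '20170101',
               ensemble = indices(1),
               time = indices(1),
               latitude = indices(1:10),
               longitude = indices(1:10),
               retrieve = TRUE)
dim(small)  # named dimensions: dat, var, sdate, ensemble, time, latitude, longitude
```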
---

#### 1. Data declaration: **`Start()` parameters**

These parameters help users organize the data structure with just a few lines in the call.

**[define dimension]** pattern_dims, metadata_dims, path_glob_permissive, return_vars, synonims, \*_depends, \*_across, \*_var

**[reshape]** merge_across_dims, merge_across_dims_narm, split_multiselected_dims

**[interpolate]** transform, transform_params, transform_vars, transform_extra_cells, apply_indices_after_transform

**[interface function]** file_opener, file_var_reader, file_dim_reader, file_data_reader, file_closer, selector_checker

**[operation]** num_procs, silent, debug

.blue[_→ Good data manipulation can greatly facilitate the subsequent data analysis work!_]

---

#### 2. Operation defining

Define the operation in the **R function format**.

- The operation acts only on **the essential dimensions**, not on the whole data array; this is the concept behind `apply` (and `multiApply`, the package used by `startR`).
- The output size should be small enough to fit in the workstation.

--

_E.g._, the Start() call detected the data size:

```r
Detected dimension size:
  dat   var sdate member  time latitude longitude
    1     1    32     25     7      640      1296
Total size of involved data:
  1 x 1 x 32 x 25 x 7 x 640 x 1296 x 8 bytes = 34.6 Gb
```

→ Too large to fit in the workstation. The operation has to reduce the size.

---

#### 2. Operation defining

```r
Detected dimension size:
  dat   var sdate member  time latitude longitude
    1     1    32     25     7      640      1296
Total size of involved data:
  1 x 1 x 32 x 25 x 7 x 640 x 1296 x 8 bytes = 34.6 Gb
```

Compute the ensemble mean and the temporal trend...

```r
fun <- function(x) {
  # x: [sdate, member]
  # ensemble mean
  x <- apply(x, 1, mean)
  # trend
  x <- s2dv:::.Trend(x)$trend[2]
  return(x)
}
```

The input `x` will change from a two-dimensional array `[sdate, member]` to a single number.

---

#### 2. Operation defining

```r
Detected dimension size:
  dat   var sdate member  time latitude longitude
    1     1    32     25     7      640      1296
Total size of involved data:
  1 x 1 x 32 x 25 x 7 x 640 x 1296 x 8 bytes = 34.6 Gb
```

The dimensions and size will become:

```r
  dat   var sdate member  time latitude longitude
    1     1     1      1     7      640      1296
Total size of involved data:
  1 x 1 x 1 x 1 x 7 x 640 x 1296 x 8 bytes = 44.3 Mb
```

→ Small enough to fit in the workstation for the subsequent operations (e.g., plotting).

---

#### 3. Workflow defining

Join the data (i.e., the startR_cube object from Start()) and the user-defined function together.

_(Review the function defined previously to identify the target and output dimensions to fill in below.)_

```r
fun <- function(x) {
  # x: [sdate, member]
  x <- apply(x, 1, mean)
  x <- s2dv:::.Trend(x)$trend[2]
  return(x)
}
```

→ The input `x` should be a two-dimensional array `[sdate, member]`, and the output `x` should be a number without dimensions.

```r
step <- Step(fun = fun,
             # Which dimensions does the operation perform on?
             target_dims = c('sdate', 'member'),
             # Which dimensions are expected in the output?
             output_dims = NULL)

wf <- AddStep(data, step, ...)
```
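Since Step() follows the same dimension semantics as multiApply, the function can be sanity-checked locally on a toy array before the workflow is submitted. A hedged sketch (the dimension sizes and the name `fun_test` are invented, and a plain `lm()` stands in for `s2dv:::.Trend` so the snippet is self-contained):

```r
library(multiApply)  # provides Apply(), the engine startR uses for operations

# Toy array carrying only the two target dimensions: [sdate = 32, member = 25]
x <- array(rnorm(32 * 25), dim = c(sdate = 32, member = 25))

fun_test <- function(x) {
  # x: [sdate, member]
  x <- apply(x, 1, mean)                  # ensemble mean for each start date
  unname(coef(lm(x ~ seq_along(x)))[2])   # slope of the temporal trend
}

res <- Apply(data = list(x), target_dims = c('sdate', 'member'), fun = fun_test)
str(res$output1)  # a single number, consistent with output_dims = NULL
```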
---

#### 4. Job execution

Execute the workflow either locally or on HPCs.

- Decide the chunking. Ensure that each chunk fits in the machine’s memory.
- Specify the configuration of the remote cluster if needed.
- Decide the resource usage, e.g., nodes/cores/threads/wallclock/jobs/etc.

```r
res <- Compute(wf,
               chunks = list(latitude = 2, longitude = 2),
               threads_load = 2,
               threads_compute = 4,
               cluster = list(
                 queue_host = 'nord3',
                 queue_type = 'lsf',
                 temp_dir = '<path_on_cluster>/user_id/startR_hpc/',
                 cores_per_job = 2,
                 job_wallclock = '05:00',
                 max_jobs = 4,
                 extra_queue_params = list('#BSUB -q bsc_es'),
                 bidirectional = FALSE,
                 polling_period = 10,
                 ...),
               ecflow_suite_dir = '<local_path>/user_id/startR_local/',
               wait = TRUE)
```

---

#### 5. Result collection

The results of the chunks are combined and returned automatically if we wait for the execution to finish (`wait = TRUE`).

If the computation is long and we don’t want to wait (`wait = FALSE`), `Collect()` is used to combine the results and return them to the workstation once the jobs are finished.

```r
res <- Compute(wf, ..., wait = FALSE)
saveRDS(res, file = './res_collect.Rds')
```

Now you can close the R console and come back later when the jobs are done.

```r
collect_info <- readRDS('./res_collect.Rds')
result <- Collect(collect_info, wait = TRUE)
```

The object `result` is a list containing multi-dimensional arrays.

---

#### 6. Execution monitoring

After submitting the jobs, we can monitor the execution through the **ecFlow UI**. We can see the progress of each chunk and control the chunks separately (for example, restart one chunk if it fails).

<img src="fig55.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---
class: chapter-slide

# Use case

---

### Use case: Calibration and Climatology

Data:
- tas/monthly_mean/January 1981 - 2010
- experiment: ECMWF/system5_m1
- observation: ECMWF/ERA-Interim

.pull-left[
Pre-processing:
- Reshape the dimensions
- Regrid to 1° resolution
- Reorder latitudes (-90° to 90°)

Operations:
- Bias adjustment, using CSTools:::.cal
- Ensemble mean
- Monthly climatology
]

.pull-right[
<img src="fig66.png" width="100%" style="display: block; margin: auto;" />
]

---

### Use case: Calibration and Climatology

Examine the original netCDF files, and use `Start()` parameters to align the experimental and observational data.

<img src="fig77.png" width="100%" style="display: block; margin: auto;" />

- Reshape `sdate` into `smonth` and `syear`
- Regrid `latitude` and `longitude`
- Rename `lat` and `lon`
- Reorder `lat` into ascending order

---

### Use case: Calibration and Climatology

#### 1. Data declaration: Experimental data

```r
exp <- Start(dat = '/esarchive/exp/ecmwf/system5_m1/monthly_mean/$var$_f6h/$var$_$sdate$.nc',
             var = 'tas',
             sdate = sdates,
             # reshape
             split_multiselected_dims = TRUE,
             ensemble = 'all',
             time = indices(1),
             latitude = values(list(lats.min, lats.max)),
             longitude = values(list(lons.min, lons.max)),
             # reorder
             latitude_reorder = Sort(),
             longitude_reorder = CircularSort(0, 360),
             # regrid
             transform = CDORemapper,
             transform_extra_cells = 2,
             transform_params = list(grid = 'r360x181',
                                     method = 'conservative',
                                     crop = c(lons.min, lons.max,
                                              lats.min, lats.max)),
             transform_vars = c('latitude', 'longitude'),
             # metadata
             return_vars = list(latitude = 'dat',
                                longitude = 'dat',
                                time = c('sdate')),
             retrieve = FALSE)
```
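The `sdates` selector above is not a flat vector: with `split_multiselected_dims = TRUE`, `Start()` splits `sdate` according to the dimensions of the selector array. A hypothetical way to build it, assuming monthly start dates over 1981-2010 (which matches the 12 x 30 output dimensions shown later):

```r
# Hypothetical construction: a [smonth = 12, syear = 30] array of date
# strings ('19810101', '19810201', ...), so Start() returns separate
# 'smonth' and 'syear' dimensions instead of one flat 'sdate' of length 360.
sdates <- paste0(rep(1981:2010, each = 12), sprintf('%02d', 1:12), '01')
dim(sdates) <- c(smonth = 12, syear = 30)
```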
---

### Use case: Calibration and Climatology

#### 1. Data declaration: Observational data

```r
obs <- Start(dat = '/esarchive/recon/ecmwf/erainterim/monthly_mean/$var$_f6h/$var$_$sdate$.nc',
             var = 'tas',
             sdate = sdates_obs,
             # reshape
             split_multiselected_dims = TRUE,
             time = indices(1),
             latitude = values(list(lats.min, lats.max)),
             longitude = values(list(lons.min, lons.max)),
             # reorder
             latitude_reorder = Sort(),
             longitude_reorder = CircularSort(0, 360),
             # regrid
             transform = CDORemapper,
             transform_extra_cells = 2,
             transform_params = list(grid = 'r360x181',
                                     method = 'conservative',
                                     crop = c(lons.min, lons.max,
                                              lats.min, lats.max)),
             transform_vars = c('latitude', 'longitude'),
             # rename
             synonims = list(latitude = c('latitude', 'lat'),
                             longitude = c('longitude', 'lon')),
             # metadata
             return_vars = list(latitude = 'dat',
                                longitude = 'dat',
                                time = c('sdate')),
             retrieve = FALSE)
```

---

### Use case: Calibration and Climatology

#### 2. Operation defining

```r
wrap_cal <- function(obs, exp) {
  # obs: [syear]
  # exp: [ensemble, syear]
  # (1) Calibration
  calibrated <- CSTools:::.cal(exp = exp, obs = obs,
                               cal.method = "bias",
                               eval.method = "leave-one-out",
                               multi.model = FALSE,
                               na.fill = TRUE, na.rm = TRUE,
                               apply_to = NULL, alpha = NULL)
  # calibrated: [ensemble, syear]
  # (2) Ensemble mean
  ens_mean <- apply(calibrated, 2, mean, na.rm = TRUE)
  # (3) Climatology
  clim <- mean(ens_mean, na.rm = TRUE)
  return(clim)
}
```

---

### Use case: Calibration and Climatology

#### 2. Operation defining

Calculate the data size...

.bg-grey[Total size of involved data:]
- exp: 1 x 1 x 360 x 25 x 1 x 640 x 1296 x 8 bytes = .blue[55.6 Gb]
- obs: 1 x 1 x 360 x 1 x 256 x 512 x 8 bytes = .blue[360 Mb]

--

<br>

`[ensemble, syear]` becomes 1, so...

.bg-grey[Estimated size of output data:]
- 1 x 1 x 12 x 1 x 1 x 1 x 181 x 360 x 8 bytes = .blue[6.25 Mb]

→ Small enough to fit in local memory.

---

### Use case: Calibration and Climatology

#### 3. Workflow defining

```r
step <- Step(wrap_cal,
             target_dims = list(obs = c('syear'),
                                exp = c('ensemble', 'syear')),
             output_dims = NULL)

wf <- AddStep(list(obs = obs, exp = exp), step)
```

---

### Use case: Calibration and Climatology

#### 4. Job execution

Run on a remote cluster.

.bg-grey[//] Divide the data into two chunks. .bg-grey[//] Use 2 threads to load the data and 4 threads for the computation. .bg-grey[//] The cluster uses an LSF queue, and the connection between the local workstation and the cluster is uni-directional. .bg-grey[//] Use 2 cores for each job, and the maximum number of concurrent jobs is 4. .bg-grey[//] The polling period is 10 seconds, and the reserved wall-clock time is 5 hours.

```r
res <- Compute(wf,
               chunks = list(smonth = 2),
               threads_load = 2,
               threads_compute = 4,
               cluster = list(queue_host = queue_host,
                              queue_type = 'lsf',
                              bidirectional = FALSE,
                              temp_dir = temp_dir,
                              cores_per_job = 2,
                              max_jobs = 4,
                              polling_period = 10,
                              job_wallclock = '05:00',
                              extra_queue_params = list('#BSUB -q bsc_es')),
               ecflow_suite_dir = ecflow_suite_dir,
               wait = TRUE)
```
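For a test without an HPC (e.g., on a reduced domain), the `cluster` argument can simply be omitted; `Compute()` then processes the chunks on the local workstation. A minimal sketch:

```r
# Local execution: with no 'cluster' argument, Compute() runs the chunks on
# the workstation itself, which is handy before requesting cluster resources.
res_local <- Compute(wf,
                     chunks = list(smonth = 2),
                     threads_load = 2,
                     threads_compute = 4)
```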
<img src="fig8.png" title="The overview of startR workflow script" alt="The overview of startR workflow script" width="100%" style="display: block; margin: auto;" /> --- class: chapter-slide # Summary and resources --- ### Summary **startR’s advantages:** - Proficient in big and complex multi-dimensional data retrieval and processing - Highly adaptable to data structure and users’ needs - Clear and concise workflow, easy to be reused and adapted to other analyses - Automatically chunking and dispatching jobs in parallel on HPCs - Compatible with other R tools developed in the department, forming a strong toolset for climate research - With the plug-in of interface functions, startR can be exploited in different scientific domains where large multi-dimensional data is involved **startR’s disadvantages:** - Long learning curve - Not-so-intuitive coding style (functional paradigm) .red[→ We provide resources and user support!] --- ### Resources **[ Relative talk ]** 10:15am - 10:35am (UTC) Climate Forecast Analysis Tools Framework: from the storage to the HPC to get reproducible climate research results and services _by Núria Pérez-Zanón_ **[ GitLab ]** https://earth.bsc.es/gitlab/es/startR README, pracitcal guide, FAQ, Use cases, issues, etc. **[ CRAN ]** The manual and installation https://cran.r-project.org/web/packages/startR/index.html **[ Contact ]** An-Chi Ho (an.ho@bsc.es) Núria Pérez-Zanón (nuria.perez@bsc.es) -- .pull-left[ ] .pull-right[ ### Thank you and let’s .red[start_R_] ! ] --- ## Do not forget to include alt-text to your figures! Knitr (version >= 1.31) have a new feature to add alt-text to your figures. Just add fig.alt = "Your alt-text” in the chunk options.