diff --git a/README.md b/README.md
index b137729badb238cd58e6800dec6b67602b4cf92d..7fd50eabdcb3800a75f155a8bf5fece14774098b 100644
--- a/README.md
+++ b/README.md
@@ -1,286 +1,115 @@
-Automatically retrieve multidimensional distributed data sets in R
-==================================================================
+## startR
-The first step in data analysis made easy
------------------------------------------
-Data retrieval and alignment is the first step in data analysis in any field and is often highly complex and time-consuming, especially nowadays in the era of Big Data, where large multidimensional data sets from diverse sources need to be combined and processed. Taking subsets of these datasets (Divide) to then be processed efficiently (and Conquer) becomes an indispensable technique.
+The startR package, developed at the Barcelona Supercomputing Center, implements the MapReduce paradigm (a.k.a. domain decomposition) on HPCs in a way that is transparent to the user and especially oriented to complex multidimensional datasets.
+Following the startR framework, the user can represent in a one-page startR script all the information that defines a use case, including:
+- the involved (multidimensional) data sources and the distribution of the data files
+- the workflow of operations to be applied, over which data sources and dimensions
+- the HPC platform properties and the configuration of the execution
-`startR` (Subset, Transform, Arrange and ReTrieve multidimensional subsets in R) is an R project started at BSC with the aim to develop a tool that allows the user to automatically retrieve, homogenize and align subsets of multidimensional distributed data sets. It is an open source project that is open to external collaboration and funding, and will continuously evolve to support as many data set formats as possible while maximizing its efficiency.
+When run, the script triggers the execution of the defined workflow. Furthermore, the EC-Flow workflow manager is transparently used to dispatch tasks onto the HPC, and the user can employ its graphical interface for monitoring and control purposes.
+startR is a project started at BSC with the aim of developing a tool that allows the user to automatically retrieve, homogenize and process multidimensional distributed data sets. It is an open source project that is open to external collaboration and funding, and will continuously evolve to support as many data set formats as possible while maximizing its efficiency.
-What it does
-------------
+An extensive part of this package is devoted to the automatic retrieval (from disk or remote store into RAM) and arrangement of multi-dimensional distributed data sets. This functionality is encapsulated in a single function called `Start()`, which is explained in detail in the [**Start()**](vignettes/start.md) documentation page and in `?Start`.
-`startR`, through its main function `Start()`, provides an interface that allows to perceive and access one or a collection of data sets as if they all were part of a large multidimensional array. Indices or bounds can be specified for each of the dimensions in order to crop the whole array into a smaller sub-array. `Start()` will perform the required operations to fetch the corresponding regions of the corresponding files (potentially distributed over various remote servers) and arrange them into a local R multidimensional array. By default, as many cores as available locally are used in this procedure.
+### Installation
-Usually, in retrieval processes previous to multidimensional data analysis, it is needed to apply a set of common transformations, pre-processes or reorderings to the data as it comes in. `Start()` accepts user-defined transformation or reordering functions to be applied for such purposes.
+In order to install and load the latest published version of the package on CRAN, you can run the following lines in your R session:
-`Start()` is not bound to a specific file format. Interface functions to custom file formats can be provided for `Start()` to read them. As of April 2017 `startR` includes interface functions to the following file formats:
-- NetCDF
-
-Metadata and auxilliary data is also preserved and arranged by `Start()` in the measure that it is retrieved by the interface functions for a specific file format.
-
-
-How to use it
--------------
-
-`Start()` is the only function in the package intended for direct use. This function has a rather steep learning curve but it makes the retrieval process straightforward and highly customizable. The header looks as follows:
-
-```R
-Start(...,
-      return_vars = NULL,
-      synonims = NULL,
-      file_opener = NcOpener,
-      file_var_reader = NcVarReader,
-      file_dim_reader = NcDimReader,
-      file_data_reader = NcDataReader,
-      file_closer = NcCloser,
-      transform = NULL,
-      transform_params = NULL,
-      transform_vars = NULL,
-      transform_extra_cells = 0,
-      apply_indices_after_transform = FALSE,
-      pattern_dims = NULL,
-      metadata_dims = NULL,
-      selector_checker = SelectorChecker,
-      num_procs = NULL,
-      silent = FALSE)
+```r
+install.packages('startR')
+library(startR)
 ```
-Usually most of the required information will be provided through the `...` parameter and only a few of the parameters in the function header will be used.
-
-The parameters can be grouped as follows:
-- The parameters provided via `...`, with information on the structure of the datasets which to take data from, information on which dimensions they have, which indices to take from each of the dimensions, and how to reorder each of the dimensions if needed.
-- `synonims` with information to identify dimensions and variablse across the multiple files/datasets (useful if they don't follow a homogeneous naming convention).
-- `return_vars` with information on auxilliary data to be retrieved.
-- `file_*` parameters, which allow to specify interface functions to the desired file format.
-- `transform_*` parameters, which allow to specify a transformation to be applied to the data as it comes in.
-- Other general configuration parameters.
-
-Examples / Tutorial
--------------------
-
-Next, an explanation on how to use `Start()`, starting from a simple example and progressively adding complexity.
-
-### Retrieving a single entire file
-`Start()` can be used in the simplest situation to take a subset of data from a single file.
+Also, you can install the latest stable version from this GitLab repository as follows:
-Let's imagine we have a file named 'file.nc' which contains an array of data (a single variable) with the following dimensions:
-```
-c(a = 5, time = 12, b = 100, c = 100)
+```r
+devtools::install_git('https://earth.bsc.es/gitlab/es/startR')
 ```
-It is then possible to take the entire data file as follows:
-```R
-data <- Start(dataset = '/path/to/file.nc',
-              a = 'all',
-              time = 'all',
-              b = 'all',
-              c = 'all')
-* Exploring files... This will take a variable amount of time depending
-* on the issued request and the performance of the file server...
-* Detected dimension sizes:
-* dataset: 1
-* a: 5
-* time: 3
-* b: 100
-* c: 100
-* Total size of requested data:
-* 1 x 5 x 3 x 100 x 100 x 8 bytes = 1.2 Mb
-* If the size of the requested data is close to or above the free shared
-* RAM memory, R may crash.
-* If the size of the requested data is close to or above the half of the
-* free RAM memory, R may crash.
-* Will now proceed to read and process 1 data files:
-* /path/to/file.nc
-* Loading... This may take several minutes...
-starting worker pid=30733 on localhost:11276 at 15:30:39.997
-starting worker pid=30749 on localhost:11276 at 15:30:40.180
-starting worker pid=30765 on localhost:11276 at 15:30:40.382
-starting worker pid=30781 on localhost:11276 at 15:30:40.557
-* Successfully retrieved data.
-```
+See the [**Deployment**](vignettes/deployment.md) documentation page or the details in `?Compute` for a guide on deployment and set-up steps, and additional technical aspects.
-None of the provided parameters to `Start()` are in the set of known parameters (`return_vars`, `synonims`, ...). Each unknown parameter is interpreted as a specification of a dimension of the data set you want to retrieve data from, where the name of the parameter matches the name of the dimension and the value associated expresses the indices you want to retrieve from the dimension. In this example, we have defined that the data set has 5 dimensions, with the names 'dataset', 'a', 'time', 'b', and 'c', and we want to take all indices of each of these dimensions.
+### How to use
-It is mandatory to make `Start()` aware of all the existing dimensions in the file (unless they are of length 1).
+An overview example of how to process a large data set is shown below. See the [**Start()**](vignettes/start.md) documentation page, as well as the documentation of the functions in the package, for further details on usage.
-Note that the file to read is considered to be an element that belongs to the 'dataset' dimension (could be any other name). `Start()` automatically looks for at least one of the dimension specifications with an expression pointing to a set of files or URLs.
+The purpose of the example in this section is simply to illustrate how the user is expected to interact with the startR loading and distributed computing capability once the framework is deployed on the user workstation and computing cluster or HPC.
-The returned result looks as follows:
-```R
-str(data)
-List of 5
- $ Data         : num [1, 1:5, 1:3, 1:100, 1:100] 1 2 3 4 5 6 7 8 ...
- $ Variables    :List of 2
-  ..$ common  : NULL
-  ..$ dataset1: NULL
- $ Files        : chr [1(1d)] "/path/to/file.nc"
- $ NotFoundFiles: NULL
- $ FileSelectors:List of 1
-  ..$ dataset1:List of 1
-  .. ..$ dataset:List of 1
-  .. .. ..$ : chr "dataset1"
-```
+In this example, we show how a simple addition and averaging operation is performed on BSC's CTE-Power HPC, over a multi-dimensional climate data set stored in the BSC-ES storage infrastructure. As mentioned in the introduction, the user will need to declare the involved data sources, the workflow of operations to carry out, and the computing environment and parameters.
-In this case, most of the returned information is empty.
+#### 1. 
Declaration of data sources -These are the dimensions of the actual data array: -```R -dim(data$Data) -dataset a time b c - 1 5 3 100 100 -``` +```r +library(startR) -### Reordering array dimensions -If the dimensions are specified and requested in a different order, the resulting array will be arranged following the same order: -```R -data <- Start(dataset = '/path/to/file.nc', - a = 'all', - b = 'all', - c = 'all', - time = 'all') -dim(data$Data) -dataset a b c time - 1 5 100 100 3 +repos <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$sdate$.nc' +data <- Start(dat = repos, + var = 'tas', + sdate = '20180101', + ensemble = 'all', + time = 'all', + latitude = indices(1:40), + longitude = indices(1:40), + retrieve = FALSE) ``` -### Retrieving multiple files -Assuming we have the files -``` - /path/to/ - |-> group_a/ - | |-> file_1.nc - | |-> file_2.nc - | |-> file_3.nc - |-> group_b/ - |-> file_1.nc - |-> file_2.nc - |-> file_3.nc -``` -We can load them as follows: -```R -data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc', - group = 'all', - number = 'all', - a = 'all', - b = 'all', - c = 'all', - time = 'all') -dim(data$Data) -dataset group number a b c time - 1 2 3 5 100 100 3 -``` +#### 2. Declaration of the workflow -### Path pattern tags that depend on other tags -Assuming we have the files -``` - /path/to/ - |-> group_a/ - | |-> file_1.nc - | |-> file_2.nc - | |-> file_3.nc - |-> group_b/ - |-> file_4.nc - |-> file_5.nc - |-> file_6.nc -``` -We can load them as follows: -```R -data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc', - group = 'all', - number = 'all', - number_depends = 'group', - a = 'all', - b = 'all', - c = 'all', - time = 'all') -dim(data$Data) -dataset group number a b c time - 1 2 3 5 100 100 3 -``` +```r +# The function to be applied is defined. +# It only operates on the essential 'target' dimensions. +fun <- function(x) { + # Expected inputs: + # x: array with dimensions ('ensemble', 'time') + apply(x + 1, 2, mean) +} -### Dimensions inside the files that go across files -Assuming the 'time' dimension goes across all the 'number' files in a group. We would like to select time indices e.g. 3 to 7 without `Start()` crashing because of indices out of bounds. We can do so as follows: -```R -data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc', - group = 'all', - number = 'all', - number_depends = 'group', - a = 'all', - b = 'all', - c = 'all', - time = indices(list(2, 5)), - time_across = 'number') -dim(data$Data) -dataset group number a b c time - 1 2 3 5 100 100 5 -``` -In this case, the dimension 'number' is of length 3 because we have retrieved data from 3 different 'number's: the 'time' index 3 from 'number' 1, the 'time' indices 4 to 6 from 'number' 2 and the 'time' index 7 from 'number 3. The non-taken indices from a 'number' are filled in with NA in the returned array. +# A startR Step is defined, specifying its expected input and +# output dimensions. +step <- Step(fun, + target_dims = c('ensemble', 'time'), + output_dims = c('time')) -### Taking specific indices of a dimension -```R -data <- Start(dataset = '/path/to/file.nc', - a = indices(c(1, 3)), - b = 'all', - c = indices(list(10, 20)), - time = 'all') -dim(data$Data) -dataset a b c time - 1 2 100 11 3 +# The workflow of operations is cast before execution. 
+wf <- AddStep(data, step)
 ```
-### Taking specific indices of a dimension in function of associated values
-```R
-data <- Start(dataset = '/path/to/file.nc',
-              a = c('value1', 'value2', 'value5'),
-              a_var = 'x',
-              b = 'all',
-              c = indices(list(10, 20)),
-              time = 'all',
-              return_vars = list(a_var = NULL))
-dim(data$Data)
-dataset       a       b       c    time
-      1       3     100      11       3
-```
+#### 3. Declaration of the HPC platform and execution
-### Taking data from NetCDF files with multiple variables
-Now let us imagine the data array in the file has an extra dimension, 'var', of length 2, and a variable 'var_names' with the names of the variables at each position along the dimension 'var'. The names of the 2 variables are 'x' and 'y'. We would like being able to tell `Start()` to take only the variable 'y', regardless of its position along the 'var' dimension. This can be achieved by defining the 'var' dimension with more detail, using the '*_var' parameters:
-
-```R
-data <- Start(dataset = '/path/to/file.nc',
-              var = 'y',
-              var_var = 'var_names',
-              a = 'all',
-              b = 'all',
-              c = 'all',
-              time = 'all',
-              return_vars = list(var_names = NULL))
-dim(data$Data)
-dataset     var       a       b       c    time
-      1       1       5     100     100       3
+```r
+res <- Compute(wf,
+               chunks = list(latitude = 2,
+                             longitude = 2),
+               threads_load = 1,
+               threads_compute = 2,
+               cluster = list(queue_host = 'p9login1.bsc.es',
+                              queue_type = 'slurm',
+                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_tests/',
+                              lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/',
+                              r_module = 'R/3.5.0-foss-2018b',
+                              cores_per_job = 2,
+                              job_wallclock = '00:10:00',
+                              max_jobs = 4,
+                              extra_queue_params = list('#SBATCH --qos=bsc_es'),
+                              bidirectional = FALSE,
+                              polling_period = 10
+                             ),
+               ecflow_suite_dir = '/home/Earth/nmanuben/test_startR/',
+               wait = TRUE)
 ```
-### Taking specific indices of a dimension in function of associated values, with tolerance
-
-### Dimension and variable name synonims
+#### 4. Profiling of the execution
+Additionally, profiling measurements of the execution are preserved together with the output data. Such measurements can be visualized with the `PlotProfiling` function made available in the source code of the startR package.
-### Reordering inner dims with associated values
+This function has not been included as part of the official set of functions of the package because it requires a number of plotting libraries which can take time to load. Since the startR package is loaded in each of the worker jobs on the HPC or cluster, this could imply a substantial amount of time spent repeatedly loading unused visualization libraries during the computing stage.
+```r
+source('https://earth.bsc.es/gitlab/es/startR/raw/master/inst/PlotProfiling.R')
+PlotProfiling(attr(res, 'startR_compute_profiling'))
+```
-### Transformations
-
-
-### Defining interface functions to a custom file format
-
-
-### Explanation of outputs
-
-
-### Fetching metadata
-
+![Compute profiling](vignettes/compute_profiling.png)
+
+You can click on the image to expand it.
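+
+As a closing note on this example: assuming, for illustration, that the resulting array of the workflow is returned as the element named 'output1' of `res` (an assumption of this sketch, not shown above), the output could be inspected as follows:
+
+```r
+str(res)          # overview of the object returned by Compute()
+dim(res$output1)  # dimensions of the (assumed) output array
+```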
-### Other configuration parameters + diff --git a/inst/PlotProfiling.R b/inst/PlotProfiling.R new file mode 100644 index 0000000000000000000000000000000000000000..2cd7ed7c8837faf4872fb2eceb07ca3183b755cf --- /dev/null +++ b/inst/PlotProfiling.R @@ -0,0 +1,247 @@ +PlotProfiling <- function(configs, n_test = 1, file_name = NULL, + config_names = NULL, items = NULL, + total_timings = TRUE, + ideal_timings = FALSE, + crop = NULL, subtitle = NULL) { + check_package <- function(x) { + if (!(x %in% installed.packages())) { + stop("This function requires ", x, " to be installed.") + } + } + sapply(c('reshape2', 'ggplot2', 'gridExtra'), check_package) + library(reshape2) + library(ggplot2) + library(gridExtra) + + gglegend <- function(x) { + tmp <- ggplot_gtable(ggplot_build(x)) + leg <- which(sapply(tmp$grobs, function(y) y$name) == "guide-box") + tmp$grobs[[leg]] + } + + # Check configs + if (is.list(configs)) { + if (!is.list(configs[[1]])) { + configs <- list(setNames(list(configs), Sys.time())) + } else { + if (!is.list(configs[[1]][[1]])) { + configs <- list(configs) + } + } + } else { + stop("Expected one or a list of 'startR_compute_profiling' objects in configs.") + } + + # Check config names + if (is.null(config_names)) { + config_names <- paste0('config_', 1:length(configs)) + } + + # Check items + if (is.null(items)) { + items <- c('bychunks_setup', 'transfer', 'all_chunks', 'queue', 'job_setup', + 'load', 'compute', 'transfer_back', 'merge') + } + items <- c('nchunks', 'concurrent_chunks', 'cores_per_job', 'threads_load', + 'threads_compute', items, 'total', 'all_chunks') + + all_timings <- NULL + config_total_times <- NULL + config_long_names <- NULL + config_index <- 1 + for (timings in configs) { + #config_name <- config_name[length(config_name)] + #config_name <- strsplit(config_name, '.timings')[[1]] + #config_name <- config_name[1] + #config_name <- strsplit(config, 'tests/')[[1]] + #config_name <- config_name[length(config_name)] + + #timings <- readRDS(config) + dates <- names(timings) + if (n_test > length(timings)) { + selected_sample <- 1 + } else { + selected_sample <- length(timings) + 1 - n_test + } + timings <- timings[[selected_sample]] +# crop_value <- 400 +# timings[['total']] <- sapply(timings[['total']], function(x) min(crop_value, x)) +# timings[['queue']] <- sapply(timings[['queue']], function(x) min(crop_value, x)) + config_name <- config_names[config_index] + config_long_name <- paste0('\n', config_name, + '\nDate: ', dates[selected_sample], + '\nN. chunks: ', timings[['nchunks']], + '\nMax. jobs: ', timings[['concurrent_chunks']], + '\nAsk cores: ', timings[['cores_per_job']], + '\nLoad thr: ', timings[['threads_load']], + '\nComp. 
thr: ', timings[['threads_compute']],
+                                '\n')
+    config_long_names <- c(config_long_names, config_long_name)
+    config_total_times <- c(config_total_times, timings[['total']])
+    timings <- as.data.frame(timings)
+    # Time spent on the chunk phase: total time minus the setup and merge overheads.
+    t_all_chunks <- timings[['total']] - timings[['bychunks_setup']] - timings[['merge']]
+    if (!is.na(timings[['transfer_back']])) {
+      t_all_chunks <- t_all_chunks - timings[['transfer_back']]
+    }
+    timings$all_chunks <- t_all_chunks
+    timings <- timings[items[which(items %in% names(timings))]]
+    if (ideal_timings) {
+      timings[['T - [q] * N / M']] <- timings[['total']] -
+        mean(timings[['queue']]) * timings[['nchunks']] / timings[['concurrent_chunks']]
+      timings[['b_s + ([j_s] + [l] + [c]) * N / M + m']] <- timings[['bychunks_setup']] +
+        (mean(timings[['job_setup']]) + mean(timings[['load']]) +
+         mean(timings[['compute']])) * timings[['nchunks']] /
+        timings[['concurrent_chunks']] + timings[['merge']]
+    }
+    timings$config <- config_long_name
+    #timings$confign <- config_index
+    timings <- melt(timings, id.vars = c('config'))
+    if (is.null(all_timings)) {
+      all_timings <- timings
+    } else {
+      all_timings <- rbind(all_timings, timings)
+    }
+    config_index <- config_index + 1
+  }
+  if (!is.null(crop)) {
+    all_timings$value <- sapply(all_timings$value, function(x) min(crop, x))
+  }
+  a <- as.factor(all_timings$config)
+  all_timings$config <- a
+  #new_levels <- levels(a)[order(nchar(levels(a)), levels(a))]
+  new_levels <- config_long_names
+  all_timings$config <- factor(all_timings$config, levels = new_levels)
+  cols <- colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))
+  myPal <- cols(length(configs))
+  items_total <- c('total')
+  if (ideal_timings) {
+    items_total <- c(items_total, 'T - [q] * N / M')
+  }
+  timings_total <- subset(all_timings, variable %in% items_total)
+  if (is.null(subtitle)) {
+    n_lines_subtitle <- 0
+  } else {
+    n_lines_subtitle <- length(strsplit(subtitle, "\n")[[1]])
+  }
+  plot_total <- ggplot(timings_total, aes(x = config,
+                                          y = value, fill = config, label = round(value))) +
+    geom_bar(stat = 'summary', fun.y = 'mean') + facet_wrap(~variable, nrow = 1) +
+    #geom_text(angle = 90, nudge_y = -10) +
+    labs(y = 'time (s)',
+         title = ' ',
+         subtitle = paste0(rep("\n", n_lines_subtitle), collapse = '')) +
+    guides(fill = guide_legend(title = 'configurations')) +
+    theme(axis.title.x = element_blank(),
+          axis.text.x = element_blank(),
+          axis.ticks.x = element_blank()) +
+    scale_fill_manual(values = myPal)
+
+  if (ideal_timings) {
+    items_ideal <- c('b_s + ([j_s] + [l] + [c]) * N / M + m')
+    timings_ideal <- subset(all_timings, variable %in% items_ideal)
+    plot_ideal <- ggplot(timings_ideal, aes(x = config, y = value, fill = config)) +
+      geom_bar(stat = 'summary', fun.y = 'mean') +
+      facet_wrap(~variable, nrow = 1) +
+      labs(y = 'time (s)',
+           title = ' ') +
+      guides(fill = guide_legend(title = 'configurations')) +
+      theme(axis.title.x = element_blank(),
+            axis.text.x = element_blank(),
+            axis.ticks.x = element_blank()) +
+      scale_fill_manual(values = myPal)
+  }
+  items_general <- items[which(items %in% c('bychunks_setup', 'transfer', 'all_chunks', 'merge'))]
+  timings_general <- subset(all_timings, variable %in% items_general)
+  plot_general <- ggplot(timings_general, aes(x = config, y = value, fill = config)) +
+    geom_bar(stat = 'summary', fun.y = 'mean') + facet_wrap(~variable, nrow = 1) +
+    labs(y = 'time (s)',
+         title = 'startR::Compute profiling',
+         subtitle = subtitle) +
+    guides(fill = guide_legend(title = 'configurations')) 
+ + theme(axis.title.x = element_blank(), + axis.text.x = element_blank(), + axis.ticks.x = element_blank()) + + scale_fill_manual(values = myPal) + + items_chunk <- items[which(items %in% c('queue', 'job_setup', 'load', 'compute', 'transfer_back'))] + timings_chunk <- subset(all_timings, variable %in% items_chunk) + plot_chunk <- ggplot(timings_chunk, aes(x = config, y = value, fill = config)) + + geom_boxplot() + facet_wrap(~variable, nrow = 1) + + labs(y = 'time (s)', + title = 'summary of performance of all chunks') + + # subtitle = subtitle) + + guides(fill = guide_legend(title = 'configurations', ncol = ceiling(length(configs) / 10))) + + theme(axis.title.x = element_blank(), + axis.text.x = element_blank(), + axis.ticks.x = element_blank()) + + scale_fill_manual(values = myPal) + + legend_cols <- ceiling(length(configs) / 10) + legend_rows <- ceiling(length(configs) / legend_cols) + if (legend_rows > 8) { + height <- 30 + } else if (legend_rows > 6) { + height <- 25 + } else if (legend_rows > 4) { + height <- 20 + } else { + height <- 15 + } + width <- 25 + 5 * legend_cols + if (!total_timings) { + if (!ideal_timings) { + plot <- list(plot_general + guides(fill = FALSE), + plot_chunk + guides(fill = FALSE), + gglegend(plot_chunk), + #top = 'startR::Compute() profiling', + widths = c(3, legend_cols), + layout_matrix = rbind(c(1, 3), + c(2, 3))) + } else { + extra <- legend_cols - 1 + width <- width + (legend_cols + 1) * 6 + plot <- list(plot_general + guides(fill = FALSE), + plot_ideal + guides(fill = FALSE), + plot_chunk + guides(fill = FALSE), + gglegend(plot_chunk), + #top = 'startR::Compute() profiling', + widths = c(0.7 + extra / 4, 0.3 + extra / 4, + 3 + extra / 2, legend_cols), + layout_matrix = rbind(c(1, 1, 1, 4), + c(2, 3, 3, 4))) + } + } else { + if (!ideal_timings) { + plot <- list(plot_total + guides(fill = FALSE), + plot_general + guides(fill = FALSE), + plot_chunk + guides(fill = FALSE), + gglegend(plot_chunk), + #top = 'startR::Compute() profiling', + widths = c(1, 3, legend_cols), + layout_matrix = rbind(c(1, 2, 4), + c(3, 3, 4))) + } else { + extra <- legend_cols - 1 + width <- width + (legend_cols + 1) * 6 + plot <- list(plot_total + guides(fill = FALSE), + plot_general + guides(fill = FALSE), + plot_ideal + guides(fill = FALSE), + plot_chunk + guides(fill = FALSE), + gglegend(plot_chunk), + #top = 'startR::Compute() profiling', + widths = c(0.7 + extra / 4, 0.3 + extra / 4, + 3 + extra / 2, legend_cols), + layout_matrix = rbind(c(1, 1, 2, 5), + c(3, 4, 4, 5))) + } + } + if (!is.null(file_name)) { + plot <- do.call('arrangeGrob', plot) + ggsave(file_name, plot, units = 'cm', width = width, height = height) + } else { + do.call('grid.arrange', plot) + } +} diff --git a/vignettes/compute_profiling.png b/vignettes/compute_profiling.png new file mode 100644 index 0000000000000000000000000000000000000000..fb39056054629ad4a355b71506883cb29c9dca59 Binary files /dev/null and b/vignettes/compute_profiling.png differ diff --git a/vignettes/deployment.md b/vignettes/deployment.md new file mode 100644 index 0000000000000000000000000000000000000000..f2dd3e1fa7d9f532212e22de60ba2d8884095a83 --- /dev/null +++ b/vignettes/deployment.md @@ -0,0 +1,3 @@ +## Deployment of startR + +This documentation page is work in progress. 
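+
+In the meantime, a minimal pre-flight sketch is given below. It assumes (as in the README example) that jobs are dispatched through the EC-Flow workflow manager, which provides an `ecflow_client` binary, and that the workstation can reach the HPC login node over SSH; the host name is the one used in the README example.
+
+```r
+# Check that the EC-Flow client is available on the local PATH.
+Sys.which('ecflow_client')
+# Check SSH connectivity to the queue host declared in the Compute() call.
+system('ssh p9login1.bsc.es true')
+```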
diff --git a/vignettes/start.md b/vignettes/start.md
new file mode 100644
index 0000000000000000000000000000000000000000..be096562cd65e82295cb12d1e7a60ecd3525bf02
--- /dev/null
+++ b/vignettes/start.md
@@ -0,0 +1,274 @@
+## Documentation and examples on the Start() function
+
+Data retrieval and alignment is the first step in data analysis in any field and is often highly complex and time-consuming, especially nowadays in the era of Big Data, where large multidimensional data sets from diverse sources need to be combined and processed. Taking subsets of these datasets (Divide) to then be processed efficiently (and Conquer) becomes an indispensable technique.
+
+`Start()` has been designed to automatically retrieve multidimensional distributed data sets in R. It provides an interface that allows the user to perceive and access one or a collection of data sets as if they all were part of a large multidimensional array. Indices or bounds can be specified for each of the dimensions in order to crop the whole array into a smaller sub-array. `Start()` will perform the required operations to fetch the corresponding regions of the corresponding files (potentially distributed over various remote servers) and arrange them into a local R multidimensional array. By default, as many cores as available locally are used in this procedure.
+
+Usually, in retrieval processes prior to multidimensional data analysis, it is necessary to apply a set of common transformations, pre-processes or reorderings to the data as it comes in. `Start()` accepts user-defined transformation or reordering functions to be applied for such purposes.
+
+`Start()` is not bound to a specific file format. Interface functions to custom file formats can be provided for `Start()` to read them. As of April 2017 `startR` includes interface functions to the following file formats:
+- NetCDF
+
+Metadata and auxiliary data are also preserved and arranged by `Start()` to the extent that they are retrieved by the interface functions for a specific file format.
+
+### How to use
+
+`Start()` has a rather steep learning curve but it makes the retrieval process straightforward and highly customizable. The header looks as follows:
+
+```R
+Start(...,
+      return_vars = NULL,
+      synonims = NULL,
+      file_opener = NcOpener,
+      file_var_reader = NcVarReader,
+      file_dim_reader = NcDimReader,
+      file_data_reader = NcDataReader,
+      file_closer = NcCloser,
+      transform = NULL,
+      transform_params = NULL,
+      transform_vars = NULL,
+      transform_extra_cells = 0,
+      apply_indices_after_transform = FALSE,
+      pattern_dims = NULL,
+      metadata_dims = NULL,
+      selector_checker = SelectorChecker,
+      num_procs = NULL,
+      silent = FALSE)
+```
+
+Usually most of the required information will be provided through the `...` parameter and only a few of the parameters in the function header will be used.
+
+The parameters can be grouped as follows:
+- The parameters provided via `...`, with information on the structure of the datasets from which to take data, which dimensions they have, which indices to take from each of the dimensions, and how to reorder each of the dimensions if needed.
+- `synonims` with information to identify dimensions and variables across the multiple files/datasets (useful if they don't follow a homogeneous naming convention).
+- `return_vars` with information on auxiliary data to be retrieved.
+- `file_*` parameters, which allow specifying interface functions to the desired file format.
+- `transform_*` parameters, which allow specifying a transformation to be applied to the data as it comes in.
+- Other general configuration parameters.
+
+## Examples / Tutorial
+
+Next, we explain how to use `Start()`, starting from a simple example and progressively adding complexity.
+
+### Retrieving a single entire file
+`Start()` can be used in the simplest situation to take a subset of data from a single file.
+
+Let's imagine we have a file named 'file.nc' which contains an array of data (a single variable) with the following dimensions:
+```
+c(a = 5, time = 3, b = 100, c = 100)
+```
+
+It is then possible to take the entire data file as follows:
+```R
+data <- Start(dataset = '/path/to/file.nc',
+              a = 'all',
+              time = 'all',
+              b = 'all',
+              c = 'all')
+* Exploring files... This will take a variable amount of time depending
+* on the issued request and the performance of the file server...
+* Detected dimension sizes:
+* dataset: 1
+* a: 5
+* time: 3
+* b: 100
+* c: 100
+* Total size of requested data:
+* 1 x 5 x 3 x 100 x 100 x 8 bytes = 1.2 Mb
+* If the size of the requested data is close to or above the free shared
+* RAM memory, R may crash.
+* If the size of the requested data is close to or above the half of the
+* free RAM memory, R may crash.
+* Will now proceed to read and process 1 data files:
+* /path/to/file.nc
+* Loading... This may take several minutes...
+starting worker pid=30733 on localhost:11276 at 15:30:39.997
+starting worker pid=30749 on localhost:11276 at 15:30:40.180
+starting worker pid=30765 on localhost:11276 at 15:30:40.382
+starting worker pid=30781 on localhost:11276 at 15:30:40.557
+* Successfully retrieved data.
+```
+
+None of the provided parameters to `Start()` are in the set of known parameters (`return_vars`, `synonims`, ...). Each unknown parameter is interpreted as a specification of a dimension of the data set you want to retrieve data from, where the name of the parameter matches the name of the dimension and the value associated expresses the indices you want to retrieve from the dimension. In this example, we have defined that the data set has 5 dimensions, with the names 'dataset', 'a', 'time', 'b', and 'c', and we want to take all indices of each of these dimensions.
+
+It is mandatory to make `Start()` aware of all the existing dimensions in the file (unless they are of length 1).
+
+Note that the file to read is considered to be an element that belongs to the 'dataset' dimension (could be any other name). `Start()` automatically looks for at least one of the dimension specifications with an expression pointing to a set of files or URLs.
+
+The returned result looks as follows:
+```R
+str(data)
+List of 5
+ $ Data         : num [1, 1:5, 1:3, 1:100, 1:100] 1 2 3 4 5 6 7 8 ...
+ $ Variables    :List of 2
+  ..$ common  : NULL
+  ..$ dataset1: NULL
+ $ Files        : chr [1(1d)] "/path/to/file.nc"
+ $ NotFoundFiles: NULL
+ $ FileSelectors:List of 1
+  ..$ dataset1:List of 1
+  .. ..$ dataset:List of 1
+  .. .. ..$ : chr "dataset1"
+```
+
+In this case, most of the returned information is empty.
+
+These are the dimensions of the actual data array:
+```R
+dim(data$Data)
+dataset       a    time       b       c
+      1       5       3     100     100
+```
+
+### Reordering array dimensions
+If the dimensions are specified and requested in a different order, the resulting array will be arranged following the same order:
+```R
+data <- Start(dataset = '/path/to/file.nc',
+              a = 'all',
+              b = 'all',
+              c = 'all',
+              time = 'all')
+dim(data$Data)
+dataset       a       b       c    time
+      1       5     100     100       3
+```
+
+### Retrieving multiple files
+Assuming we have the files
+```
+ /path/to/
+  |-> group_a/
+  |    |-> file_1.nc
+  |    |-> file_2.nc
+  |    |-> file_3.nc
+  |-> group_b/
+       |-> file_1.nc
+       |-> file_2.nc
+       |-> file_3.nc
+```
+We can load them as follows:
+```R
+data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc',
+              group = 'all',
+              number = 'all',
+              a = 'all',
+              b = 'all',
+              c = 'all',
+              time = 'all')
+dim(data$Data)
+dataset   group  number       a       b       c    time
+      1       2       3       5     100     100       3
+```
+
+### Path pattern tags that depend on other tags
+Assuming we have the files
+```
+ /path/to/
+  |-> group_a/
+  |    |-> file_1.nc
+  |    |-> file_2.nc
+  |    |-> file_3.nc
+  |-> group_b/
+       |-> file_4.nc
+       |-> file_5.nc
+       |-> file_6.nc
+```
+We can load them as follows:
+```R
+data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc',
+              group = 'all',
+              number = 'all',
+              number_depends = 'group',
+              a = 'all',
+              b = 'all',
+              c = 'all',
+              time = 'all')
+dim(data$Data)
+dataset   group  number       a       b       c    time
+      1       2       3       5     100     100       3
+```
+
+### Dimensions inside the files that go across files
+Assume the 'time' dimension goes across all the 'number' files in a group, and that we would like to select time indices, e.g. 3 to 7, without `Start()` crashing because of out-of-bounds indices. We can do so as follows:
+```R
+data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc',
+              group = 'all',
+              number = 'all',
+              number_depends = 'group',
+              a = 'all',
+              b = 'all',
+              c = 'all',
+              time = indices(list(3, 7)),
+              time_across = 'number')
+dim(data$Data)
+dataset   group  number       a       b       c    time
+      1       2       3       5     100     100       5
+```
+In this case, the dimension 'number' is of length 3 because we have retrieved data from 3 different 'number's: the 'time' index 3 from 'number' 1, the 'time' indices 4 to 6 from 'number' 2 and the 'time' index 7 from 'number' 3. The indices not taken from a 'number' are filled with NA in the returned array.
+
+### Taking specific indices of a dimension
+```R
+data <- Start(dataset = '/path/to/file.nc',
+              a = indices(c(1, 3)),
+              b = 'all',
+              c = indices(list(10, 20)),
+              time = 'all')
+dim(data$Data)
+dataset       a       b       c    time
+      1       2     100      11       3
+```
+
+### Taking specific indices of a dimension as a function of associated values
+```R
+data <- Start(dataset = '/path/to/file.nc',
+              a = c('value1', 'value2', 'value5'),
+              a_var = 'x',
+              b = 'all',
+              c = indices(list(10, 20)),
+              time = 'all',
+              return_vars = list(a_var = NULL))
+dim(data$Data)
+dataset       a       b       c    time
+      1       3     100      11       3
+```
+
+### Taking data from NetCDF files with multiple variables
+Now let us imagine the data array in the file has an extra dimension, 'var', of length 2, and a variable 'var_names' with the names of the variables at each position along the dimension 'var'. The names of the 2 variables are 'x' and 'y'. We would like to be able to tell `Start()` to take only the variable 'y', regardless of its position along the 'var' dimension. This can be achieved by defining the 'var' dimension in more detail, using the '*_var' parameters:
+
+```R
+data <- Start(dataset = '/path/to/file.nc',
+              var = 'y',
+              var_var = 'var_names',
+              a = 'all',
+              b = 'all',
+              c = 'all',
+              time = 'all',
+              return_vars = list(var_names = NULL))
+dim(data$Data)
+dataset     var       a       b       c    time
+      1       1       5     100     100       3
+```
+
+### Taking specific indices of a dimension as a function of associated values, with tolerance
+
+
+### Dimension and variable name synonims
+
+
+### Reordering inner dims with associated values
+
+
+### Transformations
+
+
+### Defining interface functions to a custom file format
+
+
+### Explanation of outputs
+
+
+### Fetching metadata
+
+
+### Other configuration parameters
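+
+While this section is completed, here is a minimal sketch of two of the general configuration parameters shown in the header above, `num_procs` and `silent` (the file and dimension names reuse the single-file example):
+
+```R
+data <- Start(dataset = '/path/to/file.nc',
+              a = 'all',
+              time = 'all',
+              b = 'all',
+              c = 'all',
+              num_procs = 1,  # read with a single local process instead of all available cores
+              silent = TRUE)  # suppress the progress messages shown in the examples above
+```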