% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Start.R
\name{Start}
\alias{Start}
\title{Declare, discover, subset and retrieve multidimensional distributed data sets}
\usage{
Start(
  ...,
  return_vars = NULL,
  synonims = NULL,
  file_opener = NcOpener,
  file_var_reader = NcVarReader,
  file_dim_reader = NcDimReader,
  file_data_reader = NcDataReader,
  file_closer = NcCloser,
  transform = NULL,
  transform_params = NULL,
  transform_vars = NULL,
  transform_extra_cells = 2,
  apply_indices_after_transform = FALSE,
  pattern_dims = NULL,
  metadata_dims = NULL,
  selector_checker = SelectorChecker,
  merge_across_dims = FALSE,
  merge_across_dims_narm = TRUE,
  split_multiselected_dims = FALSE,
  path_glob_permissive = FALSE,
  largest_dims_length = FALSE,
  retrieve = FALSE,
  num_procs = 1,
  ObjectBigmemory = NULL,
  silent = FALSE,
  debug = FALSE
)
}
\arguments{
\item{\dots}{A selection of customized parameters depending on the data 
format. When we retrieve data from one or a collection of data sets, 
the involved data can be perceived as belonging to a large multi-dimensional 
array. For instance, let us consider an example case. We want to retrieve data
from a source, which contains data for the number of monthly sales of various 
items, and also for their retail price each month. The data on source is 
stored as follows:\cr\cr
\command{
\cr #  /data/
\cr #    |-> sales/
\cr #    |    |-> electronics
\cr #    |    |    |-> item_a.data
\cr #    |    |    |-> item_b.data
\cr #    |    |    |-> item_c.data
\cr #    |    |-> clothing
\cr #    |         |-> item_d.data
\cr #    |         |-> item_e.data
\cr #    |         |-> item_f.data
\cr #    |-> prices/
\cr #         |-> electronics
\cr #         |    |-> item_a.data
\cr #         |    |-> item_b.data
\cr #         |    |-> item_c.data
\cr #         |-> clothing
\cr #              |-> item_d.data
\cr #              |-> item_e.data
\cr #              |-> item_f.data
}\cr\cr
Each item file contains data, stored in whichever format, for the sales or 
prices over a time period, e.g. for the past 24 months, registered at 100 
different stores over the world. Whichever format it is stored in, each 
file can be perceived as a container of a data array of 2 dimensions, time and
store. Let us assume the '.data' format allows keeping a name for each of 
these dimensions, and the actual names are 'month' and 'store'.\cr\cr
The different item files for sales or prices can be perceived as belonging to 
an 'item' dimension of length 3, and the two groups of three items to a 
'section' dimension of length 2, and the two groups of two sections (one with
the sales and the other with the prices) can be perceived as belonging also to
another dimension 'variable' of length 2. Even the source can be perceived as 
belonging to a dimension 'source' of length 1.\cr\cr
All in all, in this example, the whole data could be perceived as belonging to
a multidimensional 'large array' of dimensions\cr
\command{
\cr #  source variable  section      item    store    month
\cr #       1        2        2         3      100       24
}
\cr\cr
The dimensions of this 'large array' can be classified in two types. The ones 
that group actual files (the file dimensions) and the ones that group data 
values inside the files (the inner dimensions). In the example, the file 
dimensions are 'source', 'variable', 'section' and 'item', whereas the inner 
dimensions are 'store' and 'month'.
\cr\cr
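As a rough illustration of this classification, the following base R sketch reproduces the dimensions of the example 'large array' and derives how many files and how many values per file they imply. The variable names are illustrative only and are not part of startR.

```r
# Dimensions of the example 'large array' (values taken from the text above).
large_dims <- c(source = 1, variable = 2, section = 2, item = 3,
                store = 100, month = 24)

# File dimensions group files; the remaining ones live inside each file.
file_dims <- c('source', 'variable', 'section', 'item')
inner_dims <- setdiff(names(large_dims), file_dims)

# Number of files involved and number of values stored in each file.
n_files <- prod(large_dims[file_dims])
values_per_file <- prod(large_dims[inner_dims])
```

In this sketch, n_files is 12 and values_per_file is 2400.
\cr\cr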
Having the dimensions of our target sources in mind, the parameter \code{\dots} 
expects to receive information on:
   \itemize{
     \item{
The names of the expected dimensions of the 'large dataset' we want to 
retrieve data from
     }
     \item{
The indices to take from each dimension (and other constraints)
     }
     \item{
How to reorder the dimension if needed
     }
     \item{
The location and organization of the files of the data sets
     }
   }
For each dimension, the first three information items can be specified with a 
set of parameters to be provided through \code{\dots}. For a given dimension 
'dimname', six parameters can be specified:\cr
\command{
\cr # dimname = <indices_to_take>,  # 'all' / 'first' / 'last' /
\cr #                               # indices(c(1, 10, 20)) /
\cr #                               # indices(c(1:20)) /
\cr #                               # indices(list(1, 20)) /
\cr #                               # c(1, 10, 20) / c(1:20) /
\cr #                               # list(1, 20)
\cr # dimname_var = <name_of_associated_coordinate_variable>,
\cr # dimname_tolerance = <tolerance_value>,
\cr # dimname_reorder = <reorder_function>,
\cr # dimname_depends = <name_of_another_dimension>,
\cr # dimname_across = <name_of_another_dimension>
}
\cr\cr
The \bold{indices to take} can be specified in three possible formats (see 
code comments above for examples). The first format consists of character 
tags, such as 'all' (take all the indices available for that dimension), 
'first' (take only the first) and 'last' (only the last). The second format 
consists of numeric indices, which have to be wrapped in a call to the 
indices() helper function. For the second format, either a vector of numeric 
indices can be provided, or a list with two numeric indices can be provided 
to take all the indices in the range between the two specified indices (both 
extremes inclusive). The third format consists of a vector of character 
strings (for file dimensions) or of values of whichever type (for inner 
dimensions). For file dimensions, the provided character strings will be used 
as components to build up the final path to the files (read further). For 
inner dimensions, the provided values will be compared to the values of an 
associated coordinate variable (which must be specified in '<dimname>_var', 
read further), and the indices of the closest values will be retrieved. When 
using the third format, a list with two values can also be provided to take 
all the indices of the values within the specified range.
\cr\cr
The \bold{name of the associated coordinate variable} must be a character 
string with the name of an associated coordinate variable to be found in the 
data files (in all of them). For this to work, a 'file_var_reader' 
function must be specified when calling Start() (see parameter 
'file_var_reader'). The coordinate variable must also be requested in the 
parameter 'return_vars' (see its section for details). This feature only 
works for inner dimensions.
\cr\cr
The \bold{tolerance value} is useful when indices for an inner dimension are 
specified in the third format (values of whichever type). In that case, the 
indices of the closest values in the coordinate variable are sought. However, 
the closest value might be too distant, and we would want to consider that no 
real match exists for the provided value. This is possible via the tolerance, 
which sets a threshold beyond which no matching value is sought and the index 
is marked as a missing value.
\cr\cr
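The closest-value matching and the effect of the tolerance can be sketched in base R as follows. This is only an illustration of the idea under simple assumptions (numeric coordinate values, absolute distance); the helper closest_index() is hypothetical and is not how startR's selector checker is implemented.

```r
# Hypothetical helper, not part of startR: for each requested value, return
# the index of the closest coordinate value, or NA if it lies beyond the
# tolerance.
closest_index <- function(requested, coord_values, tolerance = NULL) {
  vapply(requested, function(value) {
    distances <- abs(coord_values - value)
    best <- which.min(distances)
    if (!is.null(tolerance) && distances[best] > tolerance) {
      return(NA_integer_)
    }
    best
  }, integer(1))
}

# Illustrative coordinate variable and requested values.
lat <- c(-60, -30, 0, 30, 60)
without_tolerance <- closest_index(c(28, 44, 90), lat)
with_tolerance <- closest_index(c(28, 44, 90), lat, tolerance = 5)
```

Without a tolerance every requested value is matched to some index, whereas with a tolerance of 5 only the first value (28, closest to 30) finds a match and the other two are marked as missing.
\cr\cr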
The \bold{reorder_function} is useful when indices for an inner dimension are
specified in the third format, and the retrieved indices need to be reordered 
according to the values of their associated coordinate variable. A function 
can be provided, which receives as input a vector of values, and returns as 
output a list with the components \code{$x} with the reordered values, and 
\code{$ix} with the permutation indices. Two reordering functions are 
included in startR: Sort() and CircularSort().
\cr\cr
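A function with the expected output shape can be sketched with base R's sort(), which returns the components x and ix when called with index.return = TRUE. The function decreasing_sort() below is a hypothetical example, not one of the reordering functions shipped with startR.

```r
# Hypothetical reorder function: sorts values in decreasing order and
# returns a list with the reordered values (x) and the permutation (ix).
decreasing_sort <- function(values) {
  sort(values, decreasing = TRUE, index.return = TRUE)
}

# Illustrative coordinate values.
lats <- c(-30, 60, 0, 30)
reordered <- decreasing_sort(lats)
```

Here reordered$x holds the values from largest to smallest, and reordered$ix holds the positions those values had in the input vector.
\cr\cr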
The \bold{name of another dimension} to be specified in '<dimname>_depends',
only available for file dimensions, must be a character string with the name 
of another requested \bold{file dimension} in \code{\dots}, and will make 
Start() aware that the path components of a file dimension can vary 
depending on the path component of another file dimension. For instance, in 
the example above, specifying \code{item_depends = 'section'} will make 
Start() aware that the item names vary depending on the section, i.e. 
section 'electronics' has items 'a', 'b' and 'c' but section 'clothing' has 
items 'd', 'e', 'f'. Otherwise Start() would expect to find the same 
item names in all the sections.
\cr\cr
The \bold{name of another dimension} to be specified in '<dimname>_across',
only available for inner dimensions, must be a character string with the name 
of another requested \bold{inner dimension} in \code{\dots}, and will make 
Start() aware that an inner dimension extends along multiple files. For
instance, let us imagine that in the example above, the records for each item 
are so large that it becomes necessary to split them in multiple files, each 
one containing the records for a different period of time, e.g. in 10 files 
with 100 months each ('item_a_period1.data', 'item_a_period2.data', and so on).
In that case, the data can be perceived as having an extra file dimension, the 
'period' dimension. The inner dimension 'month' would extend across multiple 
files, and providing the parameter \code{month = indices(list(1, 300))} would 
make Start() crash because it would perceive we have made a request out of 
bounds (each file contains 100 'month' indices, but we requested 1 to 300). 
This can be solved by specifying the parameter 
\code{month_across = 'period'} (along with the full specification of the 
dimension 'period').
\cr\cr
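The bookkeeping implied by an 'across' relationship can be sketched as follows: a global 'month' index is translated into the file ('period') that holds it and the index inside that file. The sketch assumes, as in the example, that every file holds exactly 100 months; the variable names are illustrative and this is not startR's internal code.

```r
# Assumption from the example: each period file holds 100 months.
months_per_file <- 100

# A few global 'month' indices within the requested range 1 to 300.
requested <- c(1, 100, 101, 250, 300)

# File (period) that holds each requested month, and its index in that file.
period <- ceiling(requested / months_per_file)
local_index <- requested - (period - 1) * months_per_file
```

With these values, months 1 and 100 fall in the first file, month 101 is the first month of the second file, and months 250 and 300 are the 50th and 100th months of the third file.
\cr\cr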
\bold{Defining the path pattern}
\cr
As mentioned above, the parameter \code{\dots} also expects to receive information 
with the location of the data files. In order to do this, a special dimension 
must be defined. In that special dimension, in place of specifying indices to 
take, a path pattern must be provided. The path pattern is a character string 
that encodes the way the files are organized in their source. It must be a 
path to one of the data set files in an accessible local or remote file system,
or a URL to one of the files provided by a local or remote server. The regions
of this path that vary across files (along the file dimensions) must be 
replaced by wildcards. The wildcards must match any of the defined file 
dimensions in the call to Start() and must be delimited with leading 
and trailing '$'. Shell globbing expressions can be used in the path pattern. 
See the next code snippet for an example of a path pattern.
\cr\cr
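To make the role of the wildcards concrete, the following base R sketch substitutes one value per file dimension into a path pattern to obtain the path of a single file. The helper expand_pattern() is hypothetical and only mimics the idea; it is not a startR function.

```r
# Hypothetical helper, not part of startR: replace each '$dimname$' wildcard
# in a path pattern by the value selected for that file dimension.
expand_pattern <- function(pattern, selectors) {
  path <- pattern
  for (dim_name in names(selectors)) {
    wildcard <- paste0('$', dim_name, '$')
    path <- gsub(wildcard, selectors[[dim_name]], path, fixed = TRUE)
  }
  path
}

# Path of one of the files of the store item sales example.
one_file <- expand_pattern('/data/$variable$/$section$/$item$.data',
                           list(variable = 'sales',
                                section = 'electronics',
                                item = 'item_a'))
```

The resulting one_file is '/data/sales/electronics/item_a.data', which is one of the files in the directory tree shown at the beginning of this section.
\cr\cr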
All in all, the call to Start() to load the entire data set in the 
example of store item sales, would look as follows:
\cr
\command{
\cr # data <- Start(source = paste0('/data/$variable$/',
\cr #                               '$section$/$item$.data'),
\cr #               variable = 'all',
\cr #               section = 'all',
\cr #               item = 'all',
\cr #               item_depends = 'section',
\cr #               store = 'all',
\cr #               month = 'all')
}
\cr\cr
Note that in this example it would still be necessary to properly define the 
parameters 'file_opener', 'file_closer', 'file_dim_reader', 
'file_var_reader' and 'file_data_reader' for the '.data' file format
(see the corresponding sections).
\cr\cr
The call to Start() will return a multidimensional R array with the 
following dimensions:
\cr
\command{
\cr #  source variable  section      item    store    month
\cr #       1        2        2         3      100       24
}
\cr
The dimension specifications in the \code{\dots} do not have to follow any 
particular order. The returned array will have the dimensions in the same order
as they have been specified in the call. For example, the following call:
\cr
\command{
\cr # data <- Start(source = paste0('/data/$variable$/',
\cr #                               '$section$/$item$.data'),
\cr #               month = 'all',
\cr #               store = 'all',
\cr #               item = 'all',
\cr #               item_depends = 'section',
\cr #               section = 'all',
\cr #               variable = 'all')
}
\cr\cr
would return an array with the following dimensions:
\cr
\command{
\cr #  source    month    store      item  section variable
\cr #       1       24      100         3        2        2
}
\cr\cr
Next, a more advanced example to retrieve data for only the sales records, for
the first section ('electronics'), for the 1st and 3rd items and for the 
stores located in Barcelona (assuming the files contain the variable 
'store_location' with the name of the city where each of the 100 stores is 
located):
\cr
\command{
\cr # data <- Start(source = paste0('/data/$variable$/',
\cr #                               '$section$/$item$.data'),
\cr #               variable = 'sales',
\cr #               section = 'first',
\cr #               item = indices(c(1, 3)),
\cr #               item_depends = 'section',
\cr #               store = 'Barcelona',
\cr #               store_var = 'store_location',
\cr #               month = 'all',
\cr #               return_vars = list(store_location = NULL))
}
\cr\cr
The defined names for the dimensions do not necessarily have to match the 
names of the dimensions inside the file. Lists of alternative names to be 
sought can be defined in the parameter 'synonims'.
\cr\cr
If data from multiple sources (not necessarily following the same structure) 
has to be retrieved, it can be done by providing a vector of character strings
with path pattern specifications, or, in the extended form, by providing a 
list of lists with the components 'name' and 'path', and the name of the 
dataset and path pattern as values, respectively. For example:
\cr
\command{
\cr # data <- Start(source = list(
\cr #                 list(name = 'sourceA',
\cr #                      path = paste0('/sourceA/$variable$/',
\cr #                                    '$section$/$item$.data')),
\cr #                 list(name = 'sourceB',
\cr #                      path = paste0('/sourceB/$section$/',
\cr #                                    '$variable$/$item$.data'))
\cr #               ),
\cr #               variable = 'sales',
\cr #               section = 'first',
\cr #               item = indices(c(1, 3)),
\cr #               item_depends = 'section',
\cr #               store = 'Barcelona',
\cr #               store_var = 'store_location',
\cr #               month = 'all',
\cr #               return_vars = list(store_location = NULL))
}
\cr}

\item{return_vars}{A named list where the names are the names of the 
variables to be fetched in the files, and the values are vectors of 
character strings with the names of the file dimensions along which to 
retrieve each variable, or NULL if the variable has to be retrieved only once 
from any (the first) of the involved files.\cr\cr
Apart from retrieving a multidimensional data array, retrieving auxiliary 
variables inside the files can also be needed. The parameter 
'return_vars' allows for requesting such variables, as long as a 
'file_var_reader' function is also specified in the call to 
Start() (see documentation on the corresponding parameter). 
\cr\cr
In the case of the item sales example (see documentation on parameter 
\code{\dots}), the store location variable is requested with the parameter\cr 
\code{return_vars = list(store_location = NULL)}.\cr This will cause 
Start() to fetch the variable 'store_location' once and return it in 
the component\cr \code{$Variables$common$store_location},\cr as an 
array of character strings with the location names, with the dimensions 
\code{c('store' = 100)}. Although useless in this example, we could ask 
Start() to fetch and return such variable for each file along the 
items dimension as follows: \cr 
\code{return_vars = list(store_location = c('item'))}.\cr In that case, the 
variable will be fetched once from a file of each of the items, and will be 
returned as an array with the dimensions \code{c('item' = 3, 'store' = 100)}.
\cr\cr
If a variable is requested along a file dimension that contains path pattern 
specifications ('source' in the example), the fetched variable values will be 
returned in the component\cr \code{$Variables$<dataset_name>$<variable_name>}.\cr 
For example:
\cr
\command{
\cr # data <- Start(source = list(
\cr #                 list(name = 'sourceA',
\cr #                      path = paste0('/sourceA/$variable$/',
\cr #                                    '$section$/$item$.data')),
\cr #                 list(name = 'sourceB',
\cr #                      path = paste0('/sourceB/$section$/',
\cr #                                    '$variable$/$item$.data'))
\cr #               ),
\cr #               variable = 'sales',
\cr #               section = 'first',
\cr #               item = indices(c(1, 3)),
\cr #               item_depends = 'section',
\cr #               store = 'Barcelona',
\cr #               store_var = 'store_location',
\cr #               month = 'all',
\cr #               return_vars = list(store_location = c('source',
\cr #                                                     'item')))
\cr # # Checking the structure of the returned variables
\cr # str(data$Variables)
\cr # Named list
\cr # ..$common: NULL
\cr # ..$sourceA: Named list
\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
\cr # ..$sourceB: Named list
\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
\cr # # Checking the dimensions of the returned variable
\cr # # for the source A
\cr # dim(data$Variables$sourceA$store_location)
\cr #     item   store
\cr #        3       3
}
\cr\cr
The names of the requested variables do not necessarily have to match the 
actual variable names inside the files. A list of alternative names to be 
sought can be specified via the parameter 'synonims'.}

\item{synonims}{A named list where the names are the requested variable or 
dimension names, and the values are vectors of character strings with 
alternative names to seek for such dimension or variable.\cr\cr
In some requests, data from different sources may follow different naming 
conventions for the dimensions or variables, or even files in the same source
could have varying names. This parameter allows Start() to 
properly identify the dimensions or variables with different names.
\cr\cr
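The kind of lookup a reader function can perform with the 'synonims' list can be sketched in base R as follows. The helper resolve_name() is hypothetical; it only illustrates how a requested name and its alternatives may be checked against the names actually present in a file.

```r
# Hypothetical helper, not part of startR: find which of the requested name
# or its alternative names is present among the names available in a file.
resolve_name <- function(requested, available, synonims) {
  candidates <- unique(c(requested, synonims[[requested]]))
  found <- intersect(candidates, available)
  if (length(found) == 0) {
    return(NULL)
  }
  found[1]
}

# Illustrative file whose sections dimension is named 'sec'.
dims_in_file <- c('sec', 'store', 'month')
alternatives <- list(section = c('sec', 'section'))
section_name <- resolve_name('section', dims_in_file, alternatives)
store_name <- resolve_name('store', dims_in_file, alternatives)
```

In this sketch the requested dimension 'section' resolves to 'sec', while 'store' resolves to itself because it is present under the requested name.
\cr\cr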
In the example used in parameter 'return_vars', it may be the case that 
the two involved data sources follow slightly different naming conventions. 
For example, source A uses 'sec' as name for the sections dimension, whereas 
source B uses 'section'; source A uses 'store_loc' as variable name for the 
store locations, whereas source B uses 'store_location'. This can be taken 
into account as follows:
\cr
\command{
\cr # data <- Start(source = list(
\cr #                 list(name = 'sourceA',
\cr #                      path = paste0('/sourceA/$variable$/',
\cr #                                    '$section$/$item$.data')),
\cr #                 list(name = 'sourceB',
\cr #                      path = paste0('/sourceB/$section$/',
\cr #                                    '$variable$/$item$.data'))
\cr #               ),
\cr #               variable = 'sales',
\cr #               section = 'first',
\cr #               item = indices(c(1, 3)),
\cr #               item_depends = 'section',
\cr #               store = 'Barcelona',
\cr #               store_var = 'store_location',
\cr #               month = 'all',
\cr #               return_vars = list(store_location = c('source',
\cr #                                                     'item')),
\cr #               synonims = list(
\cr #                 section = c('sec', 'section'),
\cr #                 store_location = c('store_loc',
\cr #                                    'store_location')
\cr #               ))
}
\cr}

\item{file_opener}{A function that receives as a single parameter 
 'file_path' a character string with the path to a file to be opened, 
 and returns an object with an open connection to the file (optionally with 
 header information) on success, or returns NULL on failure.
\cr\cr
This parameter takes by default NcOpener() (an opener function for NetCDF
files).
\cr\cr
See NcOpener() for a template to build a file opener for your own file 
format.}

\item{file_var_reader}{A function with the header \code{file_path = NULL}, 
 \code{file_object = NULL}, \code{file_selectors = NULL}, \code{var_name}, 
 \code{synonims} that returns an array with auxiliary data (i.e. data from a
 variable) inside a file. Start() will provide automatically either a 
 'file_path' or a 'file_object' to the 'file_var_reader'
 function (the function has to be ready to work whichever of these two is 
 provided). The parameter 'file_selectors' will also be provided 
 automatically to the variable reader, containing a named list where the 
 names are the names of the file dimensions of the queried data set (see 
 documentation on \code{\dots}) and the values are single character strings 
 with the components used to build the path to the file being read (the one 
 provided in 'file_path' or 'file_object'). The parameter 'var_name'
 will be filled in automatically by Start() also, with the name of one
 of the variables to be read. The parameter 'synonims' will be filled in 
 with exactly the same value as provided in the parameter 'synonims' in 
 the call to Start(), and has to be used in the code of the variable 
 reader to check for alternative variable names inside the target file. The 
 'file_var_reader' must return a (multi)dimensional array with named 
 dimensions, and optionally with the attribute 'variables' with other 
 additional metadata on the retrieved variable.
\cr\cr
Usually, the 'file_var_reader' should be a degenerate case of the 
'file_data_reader' (see documentation on the corresponding parameter), 
so it is recommended to code the 'file_data_reader' first.
\cr\cr
This parameter takes by default NcVarReader() (a variable reader function
for NetCDF files).
\cr\cr
See NcVarReader() for a template to build a variable reader for your own 
file format.}

\item{file_dim_reader}{A function with the header \code{file_path = NULL}, 
 \code{file_object = NULL}, \code{file_selectors = NULL}, \code{synonims} 
 that returns a named numeric vector where the names are the names of the 
 dimensions of the multidimensional data array in the file and the values are
 the sizes of such dimensions. Start() will provide automatically 
 either a 'file_path' or a 'file_object' to the 
 'file_dim_reader' function (the function has to be ready to work 
 whichever of these two is provided). The parameter 'file_selectors'
 will also be provided automatically to the dimension reader, containing a
 named list where the names are the names of the file dimensions of the 
 queried data set (see documentation on \code{\dots}) and the values are 
 single character strings with the components used to build the path to the 
 file being read (the one provided in 'file_path' or 'file_object'). 
 The parameter 'synonims' will be filled in with exactly the same value 
 as provided in the parameter 'synonims' in the call to Start(), 
 and can optionally be used in advanced configurations.
\cr\cr
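As a minimal sketch of a dimension reader with this header, the following function assumes a hypothetical file format in which each file is an RDS file holding one R array with named dimensions. The function name and the RDS format are assumptions made for illustration only, and the sketch ignores 'file_selectors' and 'synonims'.

```r
# Hypothetical dimension reader for files that each store one R array with
# named dimensions, saved with saveRDS(). Not part of startR.
rds_dim_reader_sketch <- function(file_path = NULL, file_object = NULL,
                                  file_selectors = NULL, synonims) {
  # Work with whichever of 'file_path' or 'file_object' has been provided.
  if (is.null(file_object)) {
    file_object <- readRDS(file_path)
  }
  # Named vector with the sizes of the dimensions found in the file.
  dim(file_object)
}

# Illustrative file with 4 stores and 6 months.
example_file <- tempfile(fileext = '.rds')
saveRDS(array(1:24, dim = c(store = 4, month = 6)), example_file)
file_dims <- rds_dim_reader_sketch(file_path = example_file,
                                   synonims = list())
```

The returned file_dims is a named vector with the entries store = 4 and month = 6.
\cr\cr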
This parameter takes by default NcDimReader() (a dimension reader 
function for NetCDF files).
\cr\cr
See NcDimReader() for (an advanced) template to build a dimension reader
for your own file format.}

\item{file_data_reader}{A function with the header \code{file_path = NULL}, 
 \code{file_object = NULL}, \code{file_selectors = NULL}, 
 \code{inner_indices = NULL}, \code{synonims} that returns a subset of the 
 multidimensional data array inside a file (even if internally it is not an 
 array). Start() will provide automatically either a 'file_path'
 or a 'file_object' to the 'file_data_reader' function (the 
 function has to be ready to work whichever of these two is provided). The
 parameter 'file_selectors' will also be provided automatically to the
 data reader, containing a named list where the names are the names of the
 file dimensions of the queried data set (see documentation on \code{\dots})
 and the values are single character strings with the components used to 
 build the path to the file being read (the one provided in 'file_path' or 
 'file_object'). The parameter 'inner_indices' will be filled in 
 automatically by Start() also, with a named list of numeric vectors, 
 where the names are the names of all the expected inner dimensions in a file
 to be read, and the numeric vectors are the indices to be taken from the 
 corresponding dimension (the indices may not be consecutive nor in order).
 The parameter 'synonims' will be filled in with exactly the same value 
 as provided in the parameter 'synonims' in the call to Start(), 
 and has to be used in the code of the data reader to check for alternative 
 dimension names inside the target file. The 'file_data_reader' must 
 return a (multi)dimensional array with named dimensions, and optionally with
 the attribute 'variables' with other additional metadata on the retrieved 
 data.
\cr\cr
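A data reader with this header can be sketched for the same kind of hypothetical format, in which each file is an RDS file holding one R array with named dimensions. The function name and the format are assumptions for illustration; the sketch takes all indices of any dimension missing from 'inner_indices' and does not handle 'synonims'.

```r
# Hypothetical data reader for files that each store one R array with named
# dimensions, saved with saveRDS(). Not part of startR.
rds_data_reader_sketch <- function(file_path = NULL, file_object = NULL,
                                   file_selectors = NULL,
                                   inner_indices = NULL, synonims) {
  if (is.null(file_object)) {
    file_object <- readRDS(file_path)
  }
  dims <- dim(file_object)
  # One vector of indices per dimension, in the order found in the file.
  indices <- lapply(names(dims), function(dim_name) {
    if (is.null(inner_indices[[dim_name]])) {
      seq_len(dims[[dim_name]])
    } else {
      inner_indices[[dim_name]]
    }
  })
  # Take the subset without dropping dimensions of length 1.
  result <- do.call(`[`, c(list(file_object), indices, list(drop = FALSE)))
  # Restore the dimension names on the subset.
  names(dim(result)) <- names(dims)
  result
}

# Illustrative file and request: stores 3 and 1, months 2 and 3.
example_file <- tempfile(fileext = '.rds')
saveRDS(array(1:24, dim = c(store = 4, month = 6)), example_file)
subset_read <- rds_data_reader_sketch(
  file_path = example_file,
  inner_indices = list(store = c(3, 1), month = 2:3),
  synonims = list()
)
```

Note that the requested indices are honoured in the order given, so the first row of subset_read corresponds to store 3 and the second to store 1.
\cr\cr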
Usually, 'file_data_reader' should use 'file_dim_reader'
(see documentation on the corresponding parameter), so it is recommended to 
code 'file_dim_reader' first.
\cr\cr
This parameter takes by default NcDataReader() (a data reader function 
for NetCDF files).
\cr\cr
See NcDataReader() for a template to build a data reader for your own 
file format.}

\item{file_closer}{A function that receives as a single parameter 
 'file_object' an open connection (as returned by 'file_opener') 
 to one of the files to be read, optionally with header information, and 
 closes the open connection. Always returns NULL.
\cr\cr
This parameter takes by default NcCloser() (a closer function for NetCDF 
files).
\cr\cr
See NcCloser() for a template to build a file closer for your own file 
format.}

\item{transform}{A function with the header \code{data_array}, 
\code{variables}, \code{file_selectors = NULL}, \code{\dots}. It receives as
input, through the parameter \code{data_array}, a subset of a 
multidimensional array (as returned by 'file_data_reader'), applies a 
transformation to it and returns it, preserving the number of dimensions but
potentially modifying their size. This transformation may require data from 
other auxiliary variables, automatically provided to 'transform' 
through the parameter 'variables', in the form of a named list where
the names are the variable names and the values are (multi)dimensional
arrays. Which variables need to be sent to 'transform' can be specified
with the parameter 'transform_vars' in Start(). The parameter 
'file_selectors' will also be provided automatically to 
'transform', containing a named list where the names are the names of 
the file dimensions of the queried data set (see documentation on 
\code{\dots}) and the values are single character strings with the 
components used to build the path to the file the subset being processed 
belongs to. The parameter \code{\dots} will be filled in with other 
additional parameters to adjust the transformation, exactly as provided in 
the call to Start() via the parameter 'transform_params'.}

\item{transform_params}{A named list with additional parameters to be sent to 
the 'transform' function (if specified). See documentation on parameter
'transform' for details.}

\item{transform_vars}{A vector of character strings with the names of 
auxiliary variables to be sent to the 'transform' function (if 
specified). All the variables to be sent to 'transform' must also 
have been requested as return variables in the parameter 'return_vars' 
of Start().}

\item{transform_extra_cells}{An integer number of extra indices to retrieve 
from the data set, beyond the requested indices in \code{\dots}, in order for 
'transform' to have the additional information it needs to properly apply 
whichever transformation (if needed). As many as 
'transform_extra_cells' will be retrieved beyond each of the limits for
each of those inner dimensions associated to a coordinate variable and sent 
to 'transform' (i.e. present in 'transform_vars'). After 
'transform' has finished, Start() will subset the result again, so that the 
returned data falls within the bounds specified in \code{\dots}. The default 
value is 2.}

\item{apply_indices_after_transform}{A logical value indicating, when a 
'transform' is specified in Start() and numeric indices are 
provided for any of the inner dimensions that depend on coordinate variables,
whether these numeric indices are made effective (retrieved) after applying 
the transformation (TRUE) or before it (FALSE). 
It takes FALSE by default (numeric indices are applied before sending
data to 'transform').}

\item{pattern_dims}{A character string indicating the name of the dimension 
with path pattern specifications (see \code{\dots} for details). If not  
specified, Start() assumes the first provided dimension is the pattern 
dimension, with a warning.}

\item{metadata_dims}{A vector of character strings with the names of the file 
dimensions which to return metadata for. As noted in 'file_data_reader', 
the data reader can optionally return auxiliary data via the attribute 
'variables' of the returned array. Start() by default returns the 
auxiliary data read for only the first file of each source (or data set) in 
the pattern dimension (see \code{\dots} for info on what the pattern 
dimension is). However, it can be configured to return the metadata for all 
the files along any set of file dimensions. The default value is NULL, in 
which case it is automatically set to the value of parameter 'pattern_dims'.}

\item{selector_checker}{A function used internally by Start() to 
translate a set of selectors (values for a dimension associated to a 
coordinate variable) into a set of numeric indices. It takes by default 
SelectorChecker() and, in principle, it should not be required to 
change it for customized file formats. The option to replace it is left open
for more versatility. See the code of SelectorChecker() for details on
the inputs, functioning and outputs of a selector checker.}

\item{merge_across_dims}{A logical value indicating whether to merge 
dimensions across which another dimension extends (according to the 
'<dimname>_across' parameters). Takes the value FALSE by default. For 
example, if the dimension 'time' extends across the dimension 'chunk' and 
\code{merge_across_dims = TRUE}, the resulting data array will contain
only the dimension 'time', as long as all the chunks together.}

\item{merge_across_dims_narm}{A logical value indicating whether to remove
the additional NAs from data when parameter 'merge_across_dims' is TRUE.
It is helpful when the length of the to-be-merged dimension is different 
across another dimension. For example, if the dimension 'time' extends 
across dimension 'chunk', and the time length along the first chunk is 2 
while along the second chunk it is 10, setting this parameter to TRUE 
removes the additional 8 NAs at positions 3 to 10. The default value is TRUE,
but it will be automatically turned to FALSE if 'merge_across_dims = FALSE'.}

\item{split_multiselected_dims}{A logical value indicating whether to split a 
dimension that has been selected with a multidimensional array of selectors
into as many dimensions as present in the selector array. The default value
is FALSE.}
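As a rough base R illustration of what 'split_multiselected_dims' means for 
the shape of the result, the following sketch reshapes a selected dimension 
to the dimensions of its selector array. It does not call Start(); the names 
\code{selectors}, \code{values} and \code{split_values} and all the values 
are hypothetical:
\preformatted{
# Hypothetical selectors for 'time', arranged as a 2 x 3 array
selectors <- array(1:6, dim = c(sdate = 2, time = 3))

# Data selected along a single 'time' dimension of length 6
values <- array(seq(10, 60, by = 10), dim = c(time = 6))

# Splitting: 'time' takes the dimensions of the selector array
split_values <- array(values, dim = dim(selectors))
dim(split_values)    # sdate = 2, time = 3
}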

\item{path_glob_permissive}{A logical value or an integer specifying for how
 many folder levels in the path pattern, beginning from the end, the shell glob
 expressions must be preserved and worked out for each file. The default 
 value is FALSE, which is equivalent to 0. TRUE is equivalent to 1.\cr\cr
When specifying a path pattern for a dataset, it might contain shell glob 
expressions. For each dataset, the first file matching the path pattern is 
found, and the found file is used to work out fixed values for the glob 
expressions that will be used for all the files of the dataset. However, in 
some cases, the values of the shell glob expressions may not be constant for 
all files in a dataset, and they need to be worked out for each file 
involved.\cr\cr
For example, a path pattern could be as follows: \cr
\code{'/path/to/dataset/$var$_*/$date$_*_foo.nc'}. \cr Leaving 
\code{path_glob_permissive = FALSE} will trigger an automatic search for the 
 contents to replace the asterisks (e.g. the first asterisk matches with 
 \code{'bar'} and the second with \code{'baz'}). The found contents will be 
 used for all files in the dataset (in the example, the path pattern will be
 fixed to\cr \code{'/path/to/dataset/$var$_bar/$date$_baz_foo.nc'}). However, if
 any of the files in the dataset have other contents in the position of the
 asterisks, Start() will not find them (in the example, a file like \cr
 \code{'/path/to/dataset/precipitation_bar/19901101_bin_foo.nc'} would not be
 found). Setting \code{path_glob_permissive = 1} would preserve glob
 expressions in the last level (in the example, the fixed path pattern
 would be\cr \code{'/path/to/dataset/$var$_bar/$date$_*_foo.nc'}, and the
 problematic file mentioned before would be found), but of course this would
 slow down the Start() call if the dataset involves a large number of
 files. Setting \code{path_glob_permissive = 2} would leave the original path
 pattern with the original glob expressions in the 1st and 2nd levels (in the
 example, both asterisks would be preserved, thus would allow Start()
 to recognize files such as \cr
 \code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'}).\cr\cr
Note that each glob expression can only represent one possibility (Start() 
chooses the first). Because \code{*} is not a tag, it cannot be a dimension 
of the output array, so only one possibility can be adopted. For example, if \cr
\code{'/path/to/dataset/precipitation_*/19901101_*_foo.nc'}\cr
has two matches:\cr
\code{'/path/to/dataset/precipitation_xxx/19901101_yyy_foo.nc'} and\cr
\code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'},\cr
only the first found file will be used.}
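The matching behaviour described for 'path_glob_permissive' can be pictured 
with a small base R sketch. It does not call Start() or read any file: the 
file names are the ones from the example above, the tags are assumed to be 
already replaced by 'precipitation' and '19901101', and 
\code{utils::glob2rx()} is used only to mimic glob matching (its 
\code{*} also matches the path separator, unlike a shell glob):
\preformatted{
# File names taken from the example above; no files are read.
files <- c('/path/to/dataset/precipitation_bar/19901101_baz_foo.nc',
           '/path/to/dataset/precipitation_bar/19901101_bin_foo.nc',
           '/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc')

# Path patterns left after working out the globs, for each setting
fixed  <- '/path/to/dataset/precipitation_bar/19901101_baz_foo.nc'  # FALSE
level1 <- '/path/to/dataset/precipitation_bar/19901101_*_foo.nc'    # 1
level2 <- '/path/to/dataset/precipitation_*/19901101_*_foo.nc'      # 2

# Hypothetical helper: which of the file names a pattern can match
matches <- function(pattern) files[grepl(utils::glob2rx(pattern), files)]
length(matches(fixed))   # 1
length(matches(level1))  # 2
length(matches(level2))  # 3
}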

\item{largest_dims_length}{A logical value or a named integer vector
 indicating if Start() should examine all the files to get the largest 
 length of the inner dimensions (TRUE) or use the first valid file of each 
 dataset as the returned dimension length (FALSE). Since examining all the 
 files could be time-consuming, a vector can be used to explicitly specify
 the expected length of the inner dimensions. For those inner dimensions not
 specified, the first valid file will be used. The default value is FALSE.\cr\cr
 This parameter is useful when the required files do not have a consistent 
 inner dimension length. For example, suppose there are 10 required 
 experimental data files for a series of start dates, and the data contain 
 only 25 members for the first 2 years but 51 members for the later years. 
 If \code{largest_dims_length = FALSE}, the returned member dimension length 
 will be 25 only, and the 26th to 51st members in the later 8 years will be 
 discarded. If \code{largest_dims_length = TRUE}, the returned member 
 dimension length will be 51. To save resources,
\code{largest_dims_length = c(member = 51)} can also be used.}
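The member example for 'largest_dims_length' can be sketched in base R as 
follows. It does not call Start(); the vector \code{n_members} is 
hypothetical, and filling the missing members with NA is an assumption of 
this sketch:
\preformatted{
# Hypothetical member counts: 25 for the first 2 start dates, 51 afterwards
n_members <- c(25, 25, rep(51, 8))

# largest_dims_length = FALSE: length taken from the first valid file
len_first <- n_members[1]        # 25
# largest_dims_length = TRUE: every file is examined and the largest kept
len_largest <- max(n_members)    # 51

# With length 51, the files holding 25 members leave the rest empty
# (assumed here to be filled with NA)
member_present <- outer(seq_len(len_largest), n_members, '<=')
dim(member_present)              # 51 10
sum(!member_present)             # 52 empty positions
}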

\item{retrieve}{A logical value indicating whether to retrieve the data
defined in the Start() call or to explore only its dimension lengths 
and names, and the values for the file and inner dimensions. The default
value is FALSE.}

\item{num_procs}{An integer indicating the number of processes to be created for the
parallel execution of the retrieval/transformation/arrangement of the
multiple involved files in a call to Start(). If set to NULL,
takes the number of available cores (as detected by detectCores() in 
the package 'future'). The default value is 1 (no parallel execution).}

\item{ObjectBigmemory}{A character string to be included as part of the 
bigmemory object name. This parameter is intended to be used internally by the
chunking capabilities of startR.}

\item{silent}{A logical value indicating whether to display progress messages (FALSE)
or not (TRUE). The default value is FALSE.}

\item{debug}{A logical value indicating whether to return detailed messages on 
the progress and operations in a Start() call (TRUE) or not (FALSE). The
default value is FALSE.}
}
\value{
If \code{retrieve = TRUE} the involved data is loaded into RAM
 and an object of the class 'startR_cube' with the following components is
 returned:\cr
 \item{Data}{
 Multidimensional data array with named dimensions, with the data values
 requested via \code{\dots} and other parameters. This array can potentially 
 contain metadata in the attribute 'variables'.
 }
 \item{Variables}{
 Named list of 1 + N components, containing lists of retrieved variables (as
 requested in 'return_vars') common to all the data sources (in the 1st
 component, \code{$common}), and for each of the N data sources (named after 
 the source name, as specified in \dots, or, if not specified, \code{$dat1},
 \code{$dat2}, ..., \code{$datN}). Each of the variables is contained in a
 multidimensional array with named dimensions, and potentially with the
 attribute 'variables' with additional auxiliary data.
 }
 \item{Files}{
 Multidimensional character string array with named dimensions. Its dimensions
 are the file dimensions (as requested in \code{\dots}). Each cell in this
 array contains a path to a retrieved file, or NULL if the corresponding
 file was not found.
 }
 \item{NotFoundFiles}{
 Array with the same shape as \code{$Files} but with NULL in the
 positions for which the corresponding file was found, and a path to the
 expected file in the positions for which the corresponding file was not
 found.
 }
 \item{FileSelectors}{
 Multidimensional character string array with named dimensions, with the same
 shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
 components used to build up the paths to each of the files in the data
 sources.
 }
If \code{retrieve = FALSE} the involved data is not loaded into RAM and
an object of the class 'startR_header' with the following components is
returned:\cr
 \item{Dimensions}{
 Named vector with the dimension lengths and names of the data involved in
 the Start() call.
 }
 \item{Variables}{
 Named list of 1 + N components, containing lists of retrieved variables (as
 requested in 'return_vars') common to all the data sources (in the 1st
 component, \code{$common}), and for each of the N data sources (named after
 the source name, as specified in \dots, or, if not specified, \code{$dat1},
 \code{$dat2}, ..., \code{$datN}). Each of the variables is contained in a
 multidimensional array with named dimensions, and potentially with the
 attribute 'variables' with additional auxiliary data.
 }
 \item{Files}{
 Multidimensional character string array with named dimensions. Its dimensions
 are the file dimensions (as requested in \dots). Each cell in this array
 contains a path to a file to be retrieved (which may exist or not).
 }
 \item{FileSelectors}{
 Multidimensional character string array with named dimensions, with the same
 shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
 components used to build up the paths to each of the files in the data
 sources.
 }
 \item{StartRCall}{
 List of parameters sent to the Start() call, with the parameter
 'retrieve' set to TRUE. Intended to be passed to do.call() in order to
 retrieve the associated data a posteriori.
 }
}
\description{
See the \href{https://earth.bsc.es/gitlab/es/startR}{startR documentation and
tutorial} for a step-by-step explanation on how to use Start().\cr\cr
Nowadays in the era of big data, large multidimensional data sets from 
diverse sources need to be combined and processed. Analysis of big data in any
field is often highly complex and time-consuming. Taking subsets of these data
sets and processing them efficiently has become an indispensable practice. This 
technique is also known as Domain Decomposition, Map Reduce or, more commonly,
'chunking'.\cr\cr
startR (Subset, TrAnsform, ReTrieve, arrange and process large 
multidimensional data sets in R) is an R project started at BSC with the aim 
to develop a tool that allows the user to automatically process large 
multidimensional distributed data sets. It is an open source project that is 
open to external collaboration and funding, and will continuously evolve to 
support as many data set formats as possible while maximizing its efficiency.\cr\cr
startR provides a framework under which a data set (a collection of one 
or multiple data files, potentially distributed over various remote servers) 
is perceived as a single large multidimensional 
array. Once such a multidimensional array is declared, any user-defined function
can be applied to the data in an \code{apply}-like fashion, where startR
transparently implements the Map Reduce paradigm. The steps to follow in order
to process a collection of big data sets are as follows:\cr
\itemize{
 \item{
Declaring the data set, i.e. declaring the distribution of the data files 
involved, the dimensions and shape of the multidimensional array, and the 
boundaries of the target data. This step can be performed with the 
Start() function. Numeric indices or coordinate values can be used when
fixing the boundaries. It is common to need to apply transformations, 
pre-processing or reordering to the data. Start() accepts user-defined 
transformation or reordering functions to be applied for such purposes. Once a
data set is declared, a list of involved files, dimension lengths, memory size
and other metadata is made available. Optionally, the data set can be 
retrieved and loaded onto the current R session if it is small enough. 
 }
 \item{
Declaring the workflow of operations to perform on the involved data set(s).
This step can be performed with the Step() and AddStep() functions.
 }
 \item{
Defining the computation settings. The mandatory settings include a) how many
subsets to divide the data sets into and along which dimensions; b) which 
platform to perform the workflow of operations on (local machine or remote 
machine/HPC?), how to communicate with it (unidirectional or bidirectional 
connection? shared or separate file systems?), which queuing system it uses 
(slurm, PBS, LSF, none?); and c) how many parallel jobs and execution threads
per job to use when running the calculations. This step can be performed when 
building up the call to the Compute() function.
 }
 \item{
Running the computation. startR transparently implements the Map Reduce 
paradigm, according to the settings in the previous steps. The progress can 
optionally be monitored with the EC-Flow workflow management tool. When the 
computation ends, a report of performance timings is displayed. This step can 
be triggered with the Compute() function.
 }
}
startR is not bound to a specific file format. Interface functions to
custom file formats can be provided for Start() to read them. As of this
version, startR includes interface functions to the following file formats:
\itemize{
 \item{
NetCDF
 }
}
Metadata and auxiliary data are also preserved and arranged by Start()
to the extent that they are retrieved by the interface functions for a specific 
file format.
}
\examples{
 data_path <- system.file('extdata', package = 'startR')
 path_obs <- file.path(data_path, 'obs/monthly_mean/$var$/$var$_$sdate$.nc')
 sdates <- c('200011', '200012')
 data <- Start(dat = list(list(path = path_obs)),
               var = 'tos',
               sdate = sdates,
               time = 'all',
               latitude = 'all',
               longitude = 'all',
               return_vars = list(latitude = 'dat', 
                                  longitude = 'dat', 
                                  time = 'sdate'),
               retrieve = FALSE)

}