Start.R

#'Declare, discover, subset and retrieve multidimensional distributed data sets
#'
#'See the \href{https://earth.bsc.es/gitlab/es/startR}{startR documentation and
#'tutorial} for a step-by-step explanation on how to use Start().\cr\cr
#'Nowadays in the era of big data, large multidimensional data sets from 
#'diverse sources need to be combined and processed. Analysis of big data in any
#'field is often highly complex and time-consuming. Taking subsets of these data
#'sets and processing them efficiently become an indispensable practice. This 
#'technique is also known as Domain Decomposition, Map Reduce or, more commonly,
#''chunking'.\cr\cr
#'startR (Subset, TrAnsform, ReTrieve, arrange and process large 
#'multidimensional data sets in R) is an R project started at BSC with the aim 
#'to develop a tool that allows the user to automatically process large 
#'multidimensional distributed data sets. It is an open source project that is 
#'open to external collaboration and funding, and will continuously evolve to 
#'support as many data set formats as possible while maximizing its efficiency.\cr\cr
#'startR provides a framework under which a data set (collection of one 
#'or multiple data files, potentially distributed over various remote servers) 
#'are perceived as if they all were part of a single large multidimensional 
#'array. Once such multidimensional array is declared, any user-defined function
#'can be applied to the data in a \code{apply}-like fashion, where startR
#'transparently implements the Map Reduce paradigm. The steps to follow in order
#'to process a collection of big data sets are as follows:\cr
#'\itemize{
#'  \item{
#'Declaring the data set, i.e. declaring the distribution of the data files 
#'involved, the dimensions and shape of the multidimensional array, and the 
#'boundaries of the target data. This step can be performed with the 
#'Start() function. Numeric indices or coordinate values can be used when
#'fixing the boundaries. It is common having the need to apply transformations, 
#'pre-processing or reordering to the data. Start() accepts user-defined 
#'transformation or reordering functions to be applied for such purposes. Once a
#'data set is declared, a list of involved files, dimension lengths, memory size
#'and other metadata is made available. Optionally, the data set can be 
#'retrieved and loaded onto the current R session if it is small enough. 
#'  }
#'  \item{
#'Declaring the workflow of operations to perform on the involved data set(s).
#'This step can be performed with the Step() and AddStep() functions.
#'  }
#'  \item{
#'Defining the computation settings. The mandatory settings include a) how many
#'subsets to divide the data sets into and along which dimensions; b) which 
#'platform to perform the workflow of operations on (local machine or remote 
#'machine/HPC?), how to communicate with it (unidirectional or bidirectional 
#'connection? shared or separate file systems?), which queuing system it uses 
#'(slurm, PBS, LSF, none?); and c) how many parallel jobs and execution threads
#'per job to use when running the calculations. This step can be performed when 
#'building up the call to the Compute() function.
#'  }
#'  \item{
#'Running the computation. startR transparently implements the Map Reduce 
#'paradigm, according to the settings in the previous steps. The progress can 
#'optionally be monitored with the EC-Flow workflow management tool. When the 
#'computation ends, a report of performance timings is displayed. This step can 
#'be triggered with the Compute() function.
#'  }
#'}
#'startR is not bound to a specific file format. Interface functions to
#'custom file formats can be provided for Start() to read them. As this
#'version, startR includes interface functions to the following file formats:
#'\itemize{
#'  \item{
#'NetCDF
#'  }
#'}
#'Metadata and auxilliary data is also preserved and arranged by Start()
#'in the measure that it is retrieved by the interface functions for a specific 
#'file format.
#'
#'@param \dots A selection of custemized parameters depending on the data 
#'format. When we retrieve data from one or a collection of data sets, 
#'the involved data can be perceived as belonging to a large multi-dimensional 
#'array. For instance, let us consider an example case. We want to retrieve data
#'from a source, which contains data for the number of monthly sales of various 
#'items, and also for their retail price each month. The data on source is 
#'stored as follows:\cr\cr
#'\command{
#'\cr #  /data/
#'\cr #    |-> sales/
#'\cr #    |    |-> electronics
#'\cr #    |    |    |-> item_a.data
#'\cr #    |    |    |-> item_b.data
#'\cr #    |    |    |-> item_c.data
#'\cr #    |    |-> clothing
#'\cr #    |         |-> item_d.data
#'\cr #    |         |-> idem_e.data
#'\cr #    |         |-> idem_f.data
#'\cr #    |-> prices/
#'\cr #         |-> electronics
#'\cr #         |    |-> item_a.data
#'\cr #         |    |-> item_b.data
#'\cr #         |    |-> item_c.data
#'\cr #         |-> clothing
#'\cr #              |-> item_d.data
#'\cr #              |-> item_e.data
#'\cr #              |-> item_f.data
#'}\cr\cr
#'Each item file contains data, stored in whichever format, for the sales or 
#'prices over a time period, e.g. for the past 24 months, registered at 100 
#'different stores over the world. Whichever the format it is stored in, each 
#'file can be perceived as a container of a data array of 2 dimensions, time and
#'store. Let us assume the '.data' format allows to keep a name for each of 
#'these dimensions, and the actual names are 'time' and 'store'.\cr\cr
#'The different item files for sales or prices can be perceived as belonging to 
#'an 'item' dimension of length 3, and the two groups of three items to a 
#''section' dimension of length 2, and the two groups of two sections (one with
#'the sales and the other with the prices) can be perceived as belonging also to
#'another dimension 'variable' of length 2. Even the source can be perceived as 
#'belonging to a dimension 'source' of length 1.\cr\cr
#'All in all, in this example, the whole data could be perceived as belonging to
#'a multidimensional 'large array' of dimensions\cr
#'\command{
#'\cr #  source variable  section      item    store    month
#'\cr #       1        2        2         3      100       24
#'}
#'\cr\cr
#'The dimensions of this 'large array' can be classified in two types. The ones 
#'that group actual files (the file dimensions) and the ones that group data 
#'values inside the files (the inner dimensions). In the example, the file 
#'dimensions are 'source', 'variable', 'section' and 'item', whereas the inner 
#'dimensions are 'store' and 'month'.
#'\cr\cr
#'Having the dimensions of our target sources in mind, the parameter \code{\dots} 
#'expects to receive information on:
#'    \itemize{
#'      \item{
#'The names of the expected dimensions of the 'large dataset' we want to 
#'retrieve data from
#'      }
#'      \item{
#'The indices to take from each dimension (and other constraints)
#'      }
#'      \item{
#'How to reorder the dimension if needed
#'      }
#'      \item{
#'The location and organization of the files of the data sets
#'      }
#'    }
#'For each dimension, the 3 first information items can be specified with a set
#'of parameters to be provided through \code{\dots}. For a given dimension 
#''dimname', six parameters can be specified:\cr
#'\command{
#'\cr # dimname = <indices_to_take>,  # 'all' / 'first' / 'last' /
#'\cr #                               # indices(c(1, 10, 20)) /
#'\cr #                               # indices(c(1:20)) /
#'\cr #                               # indices(list(1, 20)) /
#'\cr #                               # c(1, 10, 20) / c(1:20) /
#'\cr #                               # list(1, 20)
#'\cr # dimname_var = <name_of_associated_coordinate_variable>,
#'\cr # dimname_tolerance = <tolerance_value>,
#'\cr # dimname_reorder = <reorder_function>,
#'\cr # dimname_depends = <name_of_another_dimension>,
#'\cr # dimname_across = <name_of_another_dimension>
#'}
#'\cr\cr
#'The \bold{indices to take} can be specified in three possible formats (see 
#'code comments above for examples). The first format consists in using 
#'character tags, such as 'all' (take all the indices available for that 
#'dimension), 'first' (take only the first) and 'last' (only the last). The 
#'second format consists in using numeric indices, which have to be wrapped in a
#'call to the indices() helper function. For the second format, either a
#'vector of numeric indices can be provided, or a list with two numeric indices 
#'can be provided to take all the indices in the range between the two specified
#'indices (both extremes inclusive). The third format consists in providing a 
#'vector character strings (for file dimensions) or of values of whichever type
#'(for inner dimensions). For the file dimensions, the provided character 
#'strings in the third format will be used as components to build up the final 
#'path to the files (read further). For inner dimensions, the provided values in
#'the third format will be compared to the values of an associated coordinate 
#'variable (must be specified in '<dimname>_reorder', read further), and the 
#'indices of the closest values will be retrieved. When using the third format, 
#'a list with two values can also be provided to take all the indices of the 
#'values within the specified range.
#'\cr\cr
#'The \bold{name of the associated coordinate variable} must be a character 
#'string with the name of an associated coordinate variable to be found in the 
#'data files (in all* of them). For this to work, a 'file_var_reader' 
#'function must be specified when calling Start() (see parameter 
#''file_var_reader'). The coordinate variable must also be requested in the 
#'parameter 'return_vars' (see its section for details). This feature only 
#'works for inner dimensions.
#'\cr\cr
#'The \bold{tolerance value} is useful when indices for an inner dimension are 
#'specified in the third format (values of whichever type). In that case, the 
#'indices of the closest values in the coordinate variable are seeked. However 
#'the closest value might be too distant and we would want to consider no real 
#'match exists for such provided value. This is possible via the tolerance,
#'which allows to specify a threshold beyond which not to seek for matching 
#'values and mark that index as missing value.
#'\cr\cr
#'The \bold{reorder_function} is useful when indices for an inner dimension are
#'specified in the third fromat, and the retrieved indices need to be reordered 
#'in function of their provided associated variable values. A function can be 
#'provided, which receives as input a vector of values, and returns as outputs a
#'list with the components \code{$x} with the reordered values, and \code{$ix} 
#'with the permutation indices. Two reordering functions are included in 
#'startR, the Sort() and the CircularSort().
#'\cr\cr
#'The \bold{name of another dimension} to be specified in <dimname>_depends,
#'only available for file dimensions, must be a character string with the name 
#'of another requested \bold{file dimension} in \code{\dots}, and will make 
#'Start() aware that the path components of a file dimension can vary in
#'function of the path component of another file dimension. For instance, in the
#'example above, specifying \code{item_depends = 'section'} will make 
#'Start() aware that the item names vary in function of the section, i.e.
#'section 'electronics' has items 'a', 'b' and 'c' but section 'clothing' has 
#'items 'd', 'e', 'f'. Otherwise Start() would expect to find the same 
#'item names in all the sections.
#'\cr\cr
#'The \bold{name of another dimension} to be specified in '<dimname>_across',
#'only available for inner dimensions, must be a character string with the name 
#'of another requested \bold{inner dimension} in \code{\dots}, and will make 
#'Start() aware that an inner dimension extends along multiple files. For
#'instance, let us imagine that in the example above, the records for each item 
#'are so large that it becomes necessary to split them in multiple files each 
#'one containing the registers for a different period of time, e.g. in 10 files 
#'with 100 months each ('item_a_period1.data', 'item_a_period2.data', and so on).
#'In that case, the data can be perceived as having an extra file dimension, the 
#''period' dimension. The inner dimension 'month' would extend across multiple 
#'files, and providing the parameter \code{month = indices(1, 300)} would make 
#'Start() crash because it would perceive we have made a request out of 
#'bounds (each file contains 100 'month' indices, but we requested 1 to 300). 
#'This can be solved by specifying the parameter \code{month_across = period} (a
#'long with the full specification of the dimension 'period').
#'\cr\cr
#'\bold{Defining the path pattern}
#'\cr
#'As mentioned above, the parameter \dots also expects to receive information 
#'with the location of the data files. In order to do this, a special dimension 
#'must be defined. In that special dimension, in place of specifying indices to 
#'take, a path pattern must be provided. The path pattern is a character string 
#'that encodes the way the files are organized in their source. It must be a 
#'path to one of the data set files in an accessible local or remote file system,
#'or a URL to one of the files provided by a local or remote server. The regions
#'of this path that vary across files (along the file dimensions) must be 
#'replaced by wildcards. The wildcards must match any of the defined file 
#'dimensions in the call to Start() and must be delimited with heading 
#'and trailing '$'. Shell globbing expressions can be used in the path pattern. 
#'See the next code snippet for an example of a path pattern.
#'\cr\cr
#'All in all, the call to Start() to load the entire data set in the 
#'example of store item sales, would look as follows:
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr #                               '$section$/$item$.data'),
#'\cr #               variable = 'all',
#'\cr #               section = 'all',
#'\cr #               item = 'all',
#'\cr #               item_depends = 'section',
#'\cr #               store = 'all',
#'\cr #               month = 'all')
#'}
#'\cr\cr
#'Note that in this example it would still be pending to properly define the 
#'parameters 'file_opener', 'file_closer', 'file_dim_reader', 
#''file_var_reader' and 'file_data_reader' for the '.data' file format
#'(see the corresponding sections).
#'\cr\cr
#'The call to Start() will return a multidimensional R array with the 
#'following dimensions:
#'\cr
#'\command{
#'\cr #  source variable  section      item    store    month
#'\cr #       1        2        2         3      100       24
#'}
#'\cr
#'The dimension specifications in the \code{\dots} do not have to follow any 
#'particular order. The returned array will have the dimensions in the same order
#'as they have been specified in the call. For example, the following call:
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr #                               '$section$/$item$.data'),
#'\cr #               month = 'all',
#'\cr #               store = 'all',
#'\cr #               item = 'all',
#'\cr #               item_depends = 'section',
#'\cr #               section = 'all',
#'\cr #               variable = 'all')
#'}
#'\cr\cr
#'would return an array with the following dimensions:
#'\cr
#'\command{
#'\cr #  source    month    store      item  section variable
#'\cr #       1       24      100         3        2        2
#'}
#'\cr\cr
#'Next, a more advanced example to retrieve data for only the sales records, for
#'the first section ('electronics'), for the 1st and 3rd items and for the 
#'stores located in Barcelona (assuming the files contain the variable 
#''store_location' with the name of the city each of the 100 stores are located 
#'at):
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr #                               '$section$/$item$.data'),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = NULL))
#'}
#'\cr\cr
#'The defined names for the dimensions do not necessarily have to match the 
#'names of the dimensions inside the file. Lists of alternative names to be 
#'seeked can be defined in the parameter 'synonims'.
#'\cr\cr
#'If data from multiple sources (not necessarily following the same structure) 
#'has to be retrieved, it can be done by providing a vector of character strings
#'with path pattern specifications, or, in the extended form, by providing a 
#'list of lists with the components 'name' and 'path', and the name of the 
#'dataset and path pattern as values, respectively. For example:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr #                 list(name = 'sourceA',
#'\cr #                      path = paste0('/sourceA/$variable$/',
#'\cr #                                    '$section$/$item$.data')),
#'\cr #                 list(name = 'sourceB',
#'\cr #                      path = paste0('/sourceB/$section$/',
#'\cr #                                    '$variable$/$item$.data'))
#'\cr #               ),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = NULL))
#'}
#'\cr
#'
#'@param return_vars A named list where the names are the names of the 
#'variables to be fetched in the files, and the values are vectors of 
#'character strings with the names of the file dimension which to retrieve each
#'variable for, or NULL if the variable has to be retrieved only once 
#'from any (the first) of the involved files.\cr\cr
#'Apart from retrieving a multidimensional data array, retrieving auxiliary 
#'variables inside the files can also be needed. The parameter 
#''return_vars' allows for requesting such variables, as long as a 
#''file_var_reader' function is also specified in the call to 
#'Start() (see documentation on the corresponding parameter). 
#'\cr\cr
#'In the case of the the item sales example (see documentation on parameter 
#'\code{\dots)}, the store location variable is requested with the parameter\cr 
#'\code{return_vars = list(store_location = NULL)}.\cr This will cause 
#'Start() to fetch once the variable 'store_location' and return it in 
#'the component\cr \code{$Variables$common$store_location},\cr and will be an 
#'array of character strings with the location names, with the dimensions 
#'\code{c('store' = 100)}. Although useless in this example, we could ask 
#'Start() to fetch and return such variable for each file along the 
#'items dimension as follows: \cr 
#'\code{return_vars = list(store_location = c('item'))}.\cr In that case, the 
#'variable will be fetched once from a file of each of the items, and will be 
#'returned as an array with the dimensions \code{c('item' = 3, 'store' = 100)}.
#'\cr\cr
#'If a variable is requested along a file dimension that contains path pattern 
#'specifications ('source' in the example), the fetched variable values will be 
#'returned in the component\cr \code{$Variables$<dataset_name>$<variable_name>}.\cr 
#'For example:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr #                 list(name = 'sourceA',
#'\cr #                      path = paste0('/sourceA/$variable$/',
#'\cr #                                    '$section$/$item$.data')),
#'\cr #                 list(name = 'sourceB',
#'\cr #                      path = paste0('/sourceB/$section$/',
#'\cr #                                    '$variable$/$item$.data'))
#'\cr #               ),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = c('source',
#'\cr #                                                     'item')))
#'\cr # # Checking the structure of the returned variables
#'\cr # str(found_data$Variables)
#'\cr # Named list
#'\cr # ..$common: NULL
#'\cr # ..$sourceA: Named list
#'\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
#'\cr # ..$sourceB: Named list
#'\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
#'\cr # # Checking the dimensions of the returned variable
#'\cr # # for the source A
#'\cr # dim(found_data$Variables$sourceA)
#'\cr #     item   store
#'\cr #        3       3
#'}
#'\cr\cr
#'The names of the requested variables do not necessarily have to match the 
#'actual variable names inside the files. A list of alternative names to be 
#'seeked can be specified via the parameter 'synonims'.
#'
#'@param synonims A named list where the names are the requested variable or 
#'dimension names, and the values are vectors of character strings with 
#'alternative names to seek for such dimension or variable.\cr\cr
#'In some requests, data from different sources may follow different naming 
#'conventions for the dimensions or variables, or even files in the same source
#'could have varying names. This parameter is in order for Start() to 
#'properly identify the dimensions or variables with different names.
#'\cr\cr
#'In the example used in parameter 'return_vars', it may be the case that 
#'the two involved data sources follow slightly different naming conventions. 
#'For example, source A uses 'sect' as name for the sections dimension, whereas 
#'source B uses 'section'; source A uses 'store_loc' as variable name for the 
#'store locations, whereas source B uses 'store_location'. This can be taken 
#'into account as follows:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr #                 list(name = 'sourceA',
#'\cr #                      path = paste0('/sourceA/$variable$/',
#'\cr #                                    '$section$/$item$.data')),
#'\cr #                 list(name = 'sourceB',
#'\cr #                      path = paste0('/sourceB/$section$/',
#'\cr #                                    '$variable$/$item$.data'))
#'\cr #               ),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = c('source',
#'\cr #                                                     'item')),
#'\cr #               synonims = list(
#'\cr #                 section = c('sec', 'section'),
#'\cr #                 store_location = c('store_loc',
#'\cr #                                    'store_location')
#'\cr #               ))
#'}
#'\cr
#'
#'@param file_opener A function that receives as a single parameter 
#'  'file_path' a character string with the path to a file to be opened, 
#'  and returns an object with an open connection to the file (optionally with 
#'  header information) on success, or returns NULL on failure.
#'\cr\cr
#'This parameter takes by default NcOpener() (an opener function for NetCDF
#'files).
#'\cr\cr
#'See NcOpener() for a template to build a file opener for your own file 
#'format.
#'
#'@param file_var_reader A function with the header \code{file_path = NULL}, 
#'  \code{file_object = NULL}, \code{file_selectors = NULL}, \code{var_name}, 
#'  \code{synonims} that returns an array with auxiliary data (i.e. data from a
#'  variable) inside a file. Start() will provide automatically either a 
#'  'file_path' or a 'file_object' to the 'file_var_reader'
#'  function (the function has to be ready to work whichever of these two is 
#'  provided). The parameter 'file_selectors' will also be provided 
#'  automatically to the variable reader, containing a named list where the 
#'  names are the names of the file dimensions of the queried data set (see 
#'  documentation on \code{\dots}) and the values are single character strings 
#'  with the components used to build the path to the file being read (the one 
#'  provided in 'file_path' or 'file_object'). The parameter 'var_name'
#'  will be filled in automatically by Start() also, with the name of one
#'  of the variales to be read. The parameter 'synonims' will be filled in 
#'  with exactly the same value as provided in the parameter 'synonims' in 
#'  the call to Start(), and has to be used in the code of the variable 
#'  reader to check for alternative variable names inside the target file. The 
#'  'file_var_reader' must return a (multi)dimensional array with named 
#'  dimensions, and optionally with the attribute 'variales' with other 
#'  additional metadata on the retrieved variable.
#'\cr\cr
#'Usually, the 'file_var_reader' should be a degenerate case of the 
#''file_data_reader' (see documentation on the corresponding parameter), 
#'so it is recommended to code the 'file_data_reder' in first place.
#'\cr\cr
#'This parameter takes by default NcVarReader() (a variable reader function
#'for NetCDF files).
#'\cr\cr
#'See NcVarReader() for a template to build a variale reader for your own 
#'file format.
#'
#'@param file_dim_reader A function with the header \code{file_path = NULL}, 
#'  \code{file_object = NULL}, \code{file_selectors = NULL}, \code{synonims} 
#'  that returns a named numeric vector where the names are the names of the 
#'  dimensions of the multidimensional data array in the file and the values are
#'  the sizes of such dimensions. Start() will provide automatically 
#'  either a 'file_path' or a 'file_object' to the 
#'  'file_dim_reader' function (the function has to be ready to work 
#'  whichever of these two is provided). The parameter 'file_selectors'
#'  will also be provided automatically to the dimension reader, containing a
#'  named list where the names are the names of the file dimensions of the 
#'  queried data set (see documentation on \code{\dots}) and the values are 
#'  single character strings with the components used to build the path to the 
#'  file being read (the one provided in 'file_path' or 'file_object'). 
#'  The parameter 'synonims' will be filled in with exactly the same value 
#'  as provided in the parameter 'synonims' in the call to Start(), 
#'  and can optionally be used in advanced configurations.
#'\cr\cr
#'This parameter takes by default NcDimReader() (a dimension reader 
#'function for NetCDF files).
#'\cr\cr
#'See NcDimReader() for (an advanced) template to build a dimension reader
#'for your own file format.
#'
#'@param file_data_reader A function with the header \code{file_path = NULL}, 
#'  \code{file_object = NULL}, \code{file_selectors = NULL}, 
#'  \code{inner_indices = NULL}, \code{synonims} that returns a subset of the 
#'  multidimensional data array inside a file (even if internally it is not an 
#'  array). Start() will provide automatically either a 'file_path'
#'  or a 'file_object' to the 'file_data_reader' function (the 
#'  function has to be ready to work whichever of these two is provided). The
#'  parameter 'file_selectors' will also be provided automatically to the
#'  data reader, containing a named list where the names are the names of the
#'  file dimensions of the queried data set (see documentation on \code{\dots})
#'  and the values are single character strings with the components used to 
#'  build the path to the file being read (the one provided in 'file_path' or 
#'  'file_object'). The parameter 'inner_indices' will be filled in 
#'  automatically by Start() also, with a named list of numeric vectors, 
#'  where the names are the names of all the expected inner dimensions in a file
#'  to be read, and the numeric vectors are the indices to be taken from the 
#'  corresponding dimension (the indices may not be consecutive nor in order).
#'  The parameter 'synonims' will be filled in with exactly the same value 
#'  as provided in the parameter 'synonims' in the call to Start(), 
#'  and has to be used in the code of the data reader to check for alternative 
#'  dimension names inside the target file. The 'file_data_reader' must 
#'  return a (multi)dimensional array with named dimensions, and optionally with
#'  the attribute 'variables' with other additional metadata on the retrieved 
#'  data.
#'\cr\cr
#'Usually, 'file_data_reader' should use 'file_dim_reader'
#'(see documentation on the corresponding parameter), so it is recommended to 
#'code 'file_dim_reder' in first place.
#'\cr\cr
#'This parameter takes by default NcDataReader() (a data reader function 
#'for NetCDF files).
#'\cr\cr
#'See NcDataReader() for a template to build a data reader for your own 
#'file format.
#'
#'@param file_closer A function that receives as a single parameter 
#'  'file_object' an open connection (as returned by 'file_opener') 
#'  to one of the files to be read, optionally with header information, and 
#'  closes the open connection. Always returns NULL.
#'\cr\cr
#'This parameter takes by default NcCloser() (a closer function for NetCDF 
#'files).
#'\cr\cr
#'See NcCloser() for a template to build a file closer for your own file 
#'format.
#'
#'@param transform A function with the header \code{dara_array}, 
#'  \code{variables}, \code{file_selectors = NULL}, \code{\dots}. It receives as
#'  input, through the parameter \code{data_array}, a subset of a 
#'  multidimensional array (as returned by 'file_data_reader'), applies a 
#'  transformation to it and returns it, preserving the amount of dimensions but
#'  potentially modifying their size. This transformation may require data from 
#'  other auxiliary variables, automatically provided to 'transform' 
#'  through the parameter 'variables', in the form of a named list where
#'  the names are the variable names and the values are (multi)dimensional
#'  arrays. Which variables need to be sent to 'transform' can be specified
#'  with the parameter 'transform_vars' in Start(). The parameter 
#'  'file_selectors' will also be provided automatically to 
#'  'transform', containing a named list where the names are the names of 
#'  the file dimensions of the queried data set (see documentation on 
#'  \code{\dots}) and the values are single character strings with the 
#'  components used to build the path to the file the subset being processed 
#'  belongs to. The parameter \code{\dots} will be filled in with other 
#'  additional parameters to adjust the transformation, exactly as provided in 
#'  the call to Start() via the parameter 'transform_params'.
#'@param transform_params A named list with additional parameters to be sent to 
#'  the 'transform' function (if specified). See documentation on parameter
#'  'transform' for details.
#'@param transform_vars A vector of character strings with the names of 
#'  auxiliary variables to be sent to the 'transform' function (if 
#'  specified). All the variables to be sent to 'transform' must also 
#'  have been requested as return variables in the parameter 'return_vars' 
#'  of Start().
#'@param transform_extra_cells An integer of extra indices to retrieve from the 
#'  data set, beyond the requested indices in \code{\dots}, in order for 
#'  'transform' to dispose of additional information to properly apply 
#'  whichever transformation (if needed). As many as 
#'  'transform_extra_cells' will be retrieved beyond each of the limits for
#'  each of those inner dimensions associated to a coordinate variable and sent 
#'  to 'transform' (i.e. present in 'transform_vars'). After 
#'  'transform' has finished, Start() will take again and return a 
#'  subset of the result, for the returned data to fall within the specified 
#'  bounds in \code{\dots}. The default value is 2.
#'@param apply_indices_after_transform A logical value indicating when a 
#'  'transform' is specified in Start() and numeric indices are 
#'  provided for any of the inner dimensions that depend on coordinate variables,
#'  these numeric indices can be made effective (retrieved) before applying the 
#'  transformation or after. The boolean flag allows to adjust this behaviour. 
#'  It takes FALSE by default (numeric indices are applied before sending
#'  data to 'transform').
#'@param pattern_dims A character string indicating the name of the dimension 
#'  with path pattern specifications (see \code{\dots} for details). If not  
#'  specified, Start() assumes the first provided dimension is the pattern 
#'  dimension, with a warning.
#'@param metadata_dims A vector of character strings with the names of the file 
#'  dimensions which to return metadata for. As noted in 'file_data_reader', 
#'  the data reader can optionally return auxiliary data via the attribute 
#'  'variables' of the returned array. Start() by default returns the 
#'  auxiliary data read for only the first file of each source (or data set) in 
#'  the pattern dimension (see \code{\dots} for info on what the pattern 
#'  dimension is). However it can be configured to return the metadata for all 
#'  the files along any set of file dimensions. The default value is NULL, and
#'  it will be assigned automatically as parameter 'pattern_dims'.
#'@param selector_checker A function used internaly by Start() to 
#'  translate a set of selectors (values for a dimension associated to a 
#'  coordinate variable) into a set of numeric indices. It takes by default 
#'  SelectorChecker() and, in principle, it should not be required to 
#'  change it for customized file formats. The option to replace it is left open
#'  for more versatility. See the code of SelectorChecker() for details on
#'  the inputs, functioning and outputs of a selector checker.
#'@param merge_across_dims A logical value indicating whether to merge 
#'  dimensions across which another dimension extends (according to the 
#'  '<dimname>_across' parameters). Takes the value FALSE by default. For 
#'  example, if the dimension 'time' extends across the dimension 'chunk' and 
#'  \code{merge_across_dims = TRUE}, the resulting data array will only contain
#'  only the dimension 'time' as long as all the chunks together.
#'@param merge_across_dims_narm A logical value indicating whether to remove
#'  the additional NAs from data when parameter 'merge_across_dims' is TRUE.
#'  It is helpful when the length of the to-be-merged dimension is different 
#'  across another dimension. For example, if the dimension 'time' extends 
#'  across dimension 'chunk', and the time length along the first chunk is 2 
#'  while along the second chunk is 10. Setting this parameter as TRUE can 
#'  remove the additional 8 NAs at position 3 to 10. The default value is FALSE.
#'@param split_multiselected_dims A logical value indicating whether to split a 
#'  dimension that has been selected with a multidimensional array of selectors
#'  into as many dimensions as present in the selector array. The default value
#'  is FALSE.
#'@param path_glob_permissive A logical value or an integer specifying how many
#'  folder levels in the path pattern, beginning from the end, the shell glob
#'  expressions must be preserved and worked out for each file. The default 
#'  value is FALSE, which is equivalent to 0. TRUE is equivalent to 1.\cr\cr
#'When specifying a path pattern for a dataset, it might contain shell glob 
#'experissions. For each dataset, the first file matching the path pattern is 
#'found, and the found file is used to work out fixed values for the glob 
#'expressions that will be used for all the files of the dataset. However, in 
#'some cases, the values of the shell glob expressions may not be constant for 
#'all files in a dataset, and they need to be worked out for each file 
#'involved.\cr\cr
#'For example, a path pattern could be as follows: \cr
#'\code{'/path/to/dataset/$var$_*/$date$_*_foo.nc'}. \cr Leaving 
#'\code{path_glob_permissive = FALSE} will trigger automatic seek of the 
#'  contents to replace the asterisks (e.g. the first asterisk matches with 
#'  \code{'bar'} and the second with \code{'baz'}. The found contents will be 
#'  used for all files in the dataset (in the example, the path pattern will be
#'  fixed to\cr \code{'/path/to/dataset/$var$_bar/$date$_baz_foo.nc'}. However, if
#'  any of the files in the dataset have other contents in the position of the
#'  asterisks, Start() will not find them (in the example, a file like \cr
#'  \code{'/path/to/dataset/precipitation_bar/19901101_bin_foo.nc'} would not be
#'  found). Setting \code{path_glob_permissive = 1} would preserve global
#'  expressions in the latest level (in the example, the fixed path pattern
#'  would be\cr \code{'/path/to/dataset/$var$_bar/$date$_*_foo.nc'}, and the
#'  problematic file mentioned before would be found), but of course this would
#'  slow down the Start() call if the dataset involves a large number of
#'  files. Setting \code{path_glob_permissive = 2} would leave the original path
#'  pattern with the original glob expressions in the 1st and 2nd levels (in the
#'  example, both asterisks would be preserved, thus would allow Start()
#'  to recognize files such as \cr
#'  \code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'}).\cr\cr
#'Note that each glob expression can only represent one possibility (Start() 
#'chooses the first). Because /code{*} is not the tag, which means it cannot
#'be a dimension of the output array. Therefore, only one possibility can be
#'adopted. For example, if \cr
#'\code{'/path/to/dataset/precipitation_*/19901101_*_foo.nc'}\cr
#'has two matches:\cr
#'\code{'/path/to/dataset/precipitation_xxx/19901101_yyy_foo.nc'} and\cr
#'\code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'},\cr
#'only the first found file will be used.
#'@param retrieve A logical value indicating whether to retrieve the data
#'  defined in the Start() call or to explore only its dimension lengths 
#'  and names, and the values for the file and inner dimensions. The default
#'  value is FALSE.
#'@param num_procs An integer of number of processes to be created for the
#'  parallel execution of the retrieval/transformation/arrangement of the
#'  multiple involved files in a call to Start(). If set to NULL,
#'  takes the number of available cores (as detected by detectCores() in 
#'  the package 'future'). The default value is 1 (no parallel execution).
#'@param silent A logical value of whether to display progress messages (FALSE)
#'   or not (TRUE). The default value is FALSE.
#'@param debug A logical value of whether to return detailed messages on the
#'  progress and operations in a Start() call (TRUE) or not (FALSE). The
#'  default value is FALSE.
#'
#'@return If \code{retrieve = TRUE} the involved data is loaded into RAM memory
#'  and an object of the class 'startR_cube' with the following components is
#'  returned:\cr
#'  \item{Data}{
#'  Multidimensional data array with named dimensions, with the data values
#'  requested via \code{\dots} and other parameters. This array can potentially 
#'  contain metadata in the attribute 'variables'.
#'  }
#'  \item{Variables}{
#'  Named list of 1 + N components, containing lists of retrieved variables (as
#'  requested in 'return_vars') common to all the data sources (in the 1st
#'  component, \code{$common}), and for each of the N dara sources (named after 
#'  the source name, as specified in \dots, or, if not specified, \code{$dat1},
#'  \code{$dat2}, ..., \code{$datN}). Each of the variables are contained in a
#'  multidimensional array with named dimensions, and potentially with the
#'  attribute 'variables' with additional auxiliary data.
#'  }
#'  \item{Files}{
#'  Multidimensonal character string array with named dimensions. Its dimensions
#'  are the file dimensions (as requested in \code{\dots}). Each cell in this
#'  array contains a path to a retrieved file, or NULL if the corresponding
#'  file was not found.
#'  }
#'  \item{NotFoundFiles}{
#'  Array with the same shape as \code{$Files} but with NULL in the
#'  positions for which the corresponding file was found, and a path to the
#'  expected file in the positions for which the corresponding file was not
#'  found.
#'  }
#'  \item{FileSelectors}{
#'  Multidimensional character string array with named dimensions, with the same
#'  shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
#'  components used to build up the paths to each of the files in the data
#'  sources.
#'  }
#'If \code{retrieve = FALSE} the involved data is not loaded into RAM memory and
#'an object of the class 'startR_header' with the following components is
#' returned:\cr
#'  \item{Dimensions}{
#'  Named vector with the dimension lengths and names of the data involved in
#'  the Start() call.
#'  }
#'  \item{Variables}{
#'  Named list of 1 + N components, containing lists of retrieved variables (as
#'  requested in 'return_vars') common to all the data sources (in the 1st
#'  component, \code{$common}), and for each of the N dara sources (named after
#'  the source name, as specified in \dots, or, if not specified, \code{$dat1},
#'  \code{$dat2}, ..., \code{$datN}). Each of the variables are contained in a
#'  multidimensional array with named dimensions, and potentially with the
#'  attribute 'variables' with additional auxiliary data.
#'  }
#'  \item{Files}{
#'  Multidimensonal character string array with named dimensions. Its dimensions
#'  are the file dimensions (as requested in \dots). Each cell in this array
#'  contains a path to a file to be retrieved (which may exist or not).
#'  }
#'  \item{FileSelectors}{
#'  Multidimensional character string array with named dimensions, with the same
#'  shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
#'  components used to build up the paths to each of the files in the data
#'  sources.
#'  }
#'  \item{StartRCall}{
#'  List of parameters sent to the Start() call, with the parameter
#'  'retrieve' set to TRUE. Intended for calling in order to
#'  retrieve the associated data a posteriori with a call to do.call().
#'  }
#'
#'@examples
#'  data_path <- system.file('extdata', package = 'startR')
#'  path_obs <- file.path(data_path, 'obs/monthly_mean/$var$/$var$_$sdate$.nc')
#'  sdates <- c('200011', '200012')
#'  data <- Start(dat = list(list(path = path_obs)),
#'                var = 'tos',
#'                sdate = sdates,
#'                time = 'all',
#'                latitude = 'all',
#'                longitude = 'all',
#'                return_vars = list(latitude = 'dat', 
#'                                   longitude = 'dat', 
#'                                   time = 'sdate'),
#'                retrieve = FALSE)
#'
#'@import bigmemory multiApply parallel abind future
#'@importFrom utils str
#'@importFrom stats na.omit setNames 
#'@importFrom ClimProjDiags Subset
#'@export
Start <- function(..., # dim = indices/selectors, 
                  # dim_var = 'var', 
                  # dim_reorder = Sort/CircularSort, 
                  # dim_tolerance = number, 
                  # dim_depends = 'file_dim', 
                  # dim_across = 'file_dim', 
                  return_vars = NULL, 
                  synonims = NULL, 
                  file_opener = NcOpener, 
                  file_var_reader = NcVarReader, 
                  file_dim_reader = NcDimReader, 
                  file_data_reader = NcDataReader, 
                  file_closer = NcCloser, 
                  transform = NULL, 
                  transform_params = NULL, 
                  transform_vars = NULL, 
                  transform_extra_cells = 2, 
                  apply_indices_after_transform = FALSE, 
                  pattern_dims = NULL,
                  metadata_dims = NULL, 
                  selector_checker = SelectorChecker,
                  merge_across_dims = FALSE,
                  merge_across_dims_narm = FALSE,
                  split_multiselected_dims = FALSE,
                  path_glob_permissive = FALSE,
                  retrieve = FALSE, 
                  num_procs = 1, 
                  silent = FALSE, debug = FALSE) {
  #, config_file = NULL
  #dictionary_dim_names = ,
  #dictionary_var_names =  

  # Specify Subset() is from ClimProjDiags
  Subset <- ClimProjDiags::Subset

  dim_params <- list(...)
  
  # Take *_var parameters apart
  var_params_ind <- grep('_var$', names(dim_params))
  var_params <- dim_params[var_params_ind]
  # Check all *_var are NULL or vectors of character strings, and 
  # that they all have a matching dimension param.
  i <- 1
  for (var_param in var_params) {
    if (!is.character(var_param)) {
      stop("All '*_var' parameters must be character strings.")
    } else if (!any(grepl(paste0('^', strsplit(names(var_params)[i], 
                                               '_var$')[[1]][1], '$'),
                          names(dim_params)))) {
      stop(paste0("All '*_var' parameters must be associated to a dimension parameter. Found parameter '", 
                  names(var_params)[i], "' but no parameter '", 
                  strsplit(names(var_params)[i], '_var$')[[1]][1], "'."))
    }
    i <- i + 1
  }
  # Make the keys of 'var_params' to be the name of 
  # the corresponding dimension.
  if (length(var_params) < 1) {
    var_params <- NULL
  } else {
    names(var_params) <- gsub('_var$', '', names(var_params))
  }
  
  # Take *_reorder parameters apart
  dim_reorder_params_ind <- grep('_reorder$', names(dim_params))
  dim_reorder_params <- dim_params[dim_reorder_params_ind]
  # Make the keys of 'dim_reorder_params' to be the name of 
  # the corresponding dimension.
  if (length(dim_reorder_params) < 1) {
    dim_reorder_params <- NULL
  } else {
    names(dim_reorder_params) <- gsub('_reorder$', '', names(dim_reorder_params))
  }
  
  # Take *_tolerance parameters apart
  tolerance_params_ind <- grep('_tolerance$', names(dim_params))
  tolerance_params <- dim_params[tolerance_params_ind]
  
  # Take *_depends parameters apart
  depends_params_ind <- grep('_depends$', names(dim_params))
  depends_params <- dim_params[depends_params_ind]
  # Check all *_depends are NULL or vectors of character strings, and 
  # that they all have a matching dimension param.
  i <- 1
  for (depends_param in depends_params) {
    if (!is.character(depends_param) || (length(depends_param) > 1)) {
      stop("All '*_depends' parameters must be single character strings.")
    } else if (!any(grepl(paste0('^', strsplit(names(depends_params)[i], 
                                               '_depends$')[[1]][1], '$'),
                          names(dim_params)))) {
      stop(paste0("All '*_depends' parameters must be associated to a dimension parameter. Found parameter '", 
                  names(depends_params)[i], "' but no parameter '", 
                  strsplit(names(depends_params)[i], '_depends$')[[1]][1], "'."))
    }
    i <- i + 1
  }
  # Make the keys of 'depends_params' to be the name of 
  # the corresponding dimension.
  if (length(depends_params) < 1) {
    depends_params <- NULL
  } else {
    names(depends_params) <- gsub('_depends$', '', names(depends_params))
  }
  # Change name to depending_file_dims
  depending_file_dims <- depends_params
  
  # Take *_across parameters apart
  across_params_ind <- grep('_across$', names(dim_params))
  across_params <- dim_params[across_params_ind]
  # Check all *_across are NULL or vectors of character strings, and 
  # that they all have a matching dimension param.
  i <- 1
  for (across_param in across_params) {
    if (!is.character(across_param) || (length(across_param) > 1)) {
      stop("All '*_across' parameters must be single character strings.")
    } else if (!any(grepl(paste0('^', strsplit(names(across_params)[i], 
                                               '_across$')[[1]][1], '$'),
                          names(dim_params)))) {
      stop(paste0("All '*_across' parameters must be associated to a dimension parameter. Found parameter '", 
                  names(across_params)[i], "' but no parameter '", 
                  strsplit(names(across_params)[i], '_across$')[[1]][1], "'."))
    }
    i <- i + 1
  }
  # Make the keys of 'across_params' to be the name of 
  # the corresponding dimension.
  if (length(across_params) < 1) {
    across_params <- NULL
  } else {
    names(across_params) <- gsub('_across$', '', names(across_params))
  }
  # Change name to inner_dims_across_files
  inner_dims_across_files <- across_params
  
  # Check merge_across_dims
  if (!is.logical(merge_across_dims)) {
    stop("Parameter 'merge_across_dims' must be TRUE or FALSE.")
  }
  
  # Check merge_across_dims_narm
  if (!is.logical(merge_across_dims_narm)) {
    stop("Parameter 'merge_across_dims_narm' must be TRUE or FALSE.")
  }
  if (!merge_across_dims & merge_across_dims_narm) {
    merge_across_dims_narm <- FALSE
    warning(paste0("Parameter 'merge_across_dims_narm' can only be TRUE when ",
                   "'merge_across_dims' is TRUE. Set 'merge_across_dims_narm'",
                   " to FALSE."))
  }
  
  # Leave alone the dimension parameters in the variable dim_params
  if (length(c(var_params_ind, dim_reorder_params_ind, tolerance_params_ind,
               depends_params_ind, across_params_ind)) > 0) {
    dim_params <- dim_params[-c(var_params_ind, dim_reorder_params_ind,
                                tolerance_params_ind, depends_params_ind,
                                across_params_ind)]
    # Reallocating pairs of across file and inner dimensions if they have
    # to be merged. They are put one next to the other to ease merge later.
    if (merge_across_dims) {
      for (inner_dim_across in names(inner_dims_across_files)) {
        inner_dim_pos <- which(names(dim_params) == inner_dim_across)
        file_dim_pos <- which(names(dim_params) == inner_dims_across_files[[inner_dim_across]])
        new_pos <- inner_dim_pos
        if (file_dim_pos < inner_dim_pos) {
          new_pos <- new_pos - 1
        }
        dim_params_to_move <- dim_params[c(inner_dim_pos, file_dim_pos)]
        dim_params <- dim_params[-c(inner_dim_pos, file_dim_pos)]
        new_dim_params <- list()
        if (new_pos > 1) {
          new_dim_params <- c(new_dim_params, dim_params[1:(new_pos - 1)])
        }
        new_dim_params <- c(new_dim_params, dim_params_to_move)
        if (length(dim_params) >= new_pos) {
          new_dim_params <- c(new_dim_params, dim_params[new_pos:length(dim_params)])
        }
        dim_params <- new_dim_params
      }
    }
  }
  dim_names <- names(dim_params)
  if (is.null(dim_names)) {
    stop("At least one pattern dim must be specified.")
  }
  
  # Look for chunked dims
  chunks <- vector('list', length(dim_names))
  names(chunks) <- dim_names
  for (dim_name in dim_names) {
    if (!is.null(attr(dim_params[[dim_name]], 'chunk'))) {
      chunks[[dim_name]] <- attr(dim_params[[dim_name]], 'chunk')
      attributes(dim_params[[dim_name]]) <- attributes(dim_params[[dim_name]])[-which(names(attributes(dim_params[[dim_name]])) == 'chunk')]
    } else {
      chunks[[dim_name]] <- c(chunk = 1, n_chunks = 1)
    }
  }
  # This is a helper function to compute the chunk indices to take once the total
  # number of indices for a dimension has been discovered.
  chunk_indices <- function(n_indices, chunk, n_chunks, dim_name) {
    if (n_chunks > n_indices) {
      stop("Requested to divide dimension '", dim_name, "' of length ", 
           n_indices, " in ", n_chunks, " chunks, which is not possible.")
    }
    chunk_sizes <- rep(floor(n_indices / n_chunks), n_chunks)
    chunks_to_extend <- n_indices - chunk_sizes[1] * n_chunks
    if (chunks_to_extend > 0) {
      chunk_sizes[1:chunks_to_extend] <- chunk_sizes[1:chunks_to_extend] + 1
    }
    chunk_size <- chunk_sizes[chunk]
    offset <- 0
    if (chunk > 1) {
      offset <- sum(chunk_sizes[1:(chunk - 1)])
    }
    indices <- 1:chunk_sizes[chunk] + offset
    array(indices, dim = setNames(length(indices), dim_name))
  }
  
  # Check pattern_dims