Start.R 238 KB
Newer Older
aho's avatar
aho committed
#'Declare, discover, subset and retrieve multidimensional distributed data sets
#'
#'See the \href{https://earth.bsc.es/gitlab/es/startR}{startR documentation and
#'tutorial} for a step-by-step explanation on how to use Start().\cr\cr
#'Nowadays in the era of big data, large multidimensional data sets from 
#'diverse sources need to be combined and processed. Analysis of big data in any
#'field is often highly complex and time-consuming. Taking subsets of these data
#'sets and processing them efficiently become an indispensable practice. This 
#'technique is also known as Domain Decomposition, Map Reduce or, more commonly,
#''chunking'.\cr\cr
#'startR (Subset, TrAnsform, ReTrieve, arrange and process large 
#'multidimensional data sets in R) is an R project started at BSC with the aim 
#'to develop a tool that allows the user to automatically process large 
#'multidimensional distributed data sets. It is an open source project that is 
#'open to external collaboration and funding, and will continuously evolve to 
#'support as many data set formats as possible while maximizing its efficiency.\cr\cr
#'startR provides a framework under which a data set (collection of one 
#'or multiple data files, potentially distributed over various remote servers) 
#'are perceived as if they all were part of a single large multidimensional 
#'array. Once such multidimensional array is declared, any user-defined function
#'can be applied to the data in a \code{apply}-like fashion, where startR
#'transparently implements the Map Reduce paradigm. The steps to follow in order
#'to process a collection of big data sets are as follows:\cr
#'\itemize{
#'  \item{
#'Declaring the data set, i.e. declaring the distribution of the data files 
#'involved, the dimensions and shape of the multidimensional array, and the 
#'boundaries of the target data. This step can be performed with the 
#'Start() function. Numeric indices or coordinate values can be used when
#'fixing the boundaries. It is common having the need to apply transformations, 
#'pre-processing or reordering to the data. Start() accepts user-defined 
#'transformation or reordering functions to be applied for such purposes. Once a
#'data set is declared, a list of involved files, dimension lengths, memory size
#'and other metadata is made available. Optionally, the data set can be 
#'retrieved and loaded onto the current R session if it is small enough. 
#'  }
#'  \item{
#'Declaring the workflow of operations to perform on the involved data set(s).
#'This step can be performed with the Step() and AddStep() functions.
#'  }
#'  \item{
#'Defining the computation settings. The mandatory settings include a) how many
#'subsets to divide the data sets into and along which dimensions; b) which 
#'platform to perform the workflow of operations on (local machine or remote 
#'machine/HPC?), how to communicate with it (unidirectional or bidirectional 
#'connection? shared or separate file systems?), which queuing system it uses 
#'(slurm, PBS, LSF, none?); and c) how many parallel jobs and execution threads
#'per job to use when running the calculations. This step can be performed when 
#'building up the call to the Compute() function.
#'  }
#'  \item{
#'Running the computation. startR transparently implements the Map Reduce 
#'paradigm, according to the settings in the previous steps. The progress can 
#'optionally be monitored with the EC-Flow workflow management tool. When the 
#'computation ends, a report of performance timings is displayed. This step can 
#'be triggered with the Compute() function.
#'  }
#'}
#'startR is not bound to a specific file format. Interface functions to
#'custom file formats can be provided for Start() to read them. As this
#'version, startR includes interface functions to the following file formats:
#'\itemize{
#'  \item{
#'NetCDF
#'  }
#'}
#'Metadata and auxilliary data is also preserved and arranged by Start()
#'in the measure that it is retrieved by the interface functions for a specific 
#'file format.
#'
#'@param \dots A selection of custemized parameters depending on the data 
#'format. When we retrieve data from one or a collection of data sets, 
#'the involved data can be perceived as belonging to a large multi-dimensional 
#'array. For instance, let us consider an example case. We want to retrieve data
#'from a source, which contains data for the number of monthly sales of various 
#'items, and also for their retail price each month. The data on source is 
#'stored as follows:\cr\cr
#'\command{
#'\cr #  /data/
#'\cr #    |-> sales/
#'\cr #    |    |-> electronics
#'\cr #    |    |    |-> item_a.data
#'\cr #    |    |    |-> item_b.data
#'\cr #    |    |    |-> item_c.data
#'\cr #    |    |-> clothing
#'\cr #    |         |-> item_d.data
#'\cr #    |         |-> idem_e.data
#'\cr #    |         |-> idem_f.data
#'\cr #    |-> prices/
#'\cr #         |-> electronics
#'\cr #         |    |-> item_a.data
#'\cr #         |    |-> item_b.data
#'\cr #         |    |-> item_c.data
#'\cr #         |-> clothing
#'\cr #              |-> item_d.data
#'\cr #              |-> item_e.data
#'\cr #              |-> item_f.data
#'}\cr\cr
#'Each item file contains data, stored in whichever format, for the sales or 
#'prices over a time period, e.g. for the past 24 months, registered at 100 
#'different stores over the world. Whichever the format it is stored in, each 
#'file can be perceived as a container of a data array of 2 dimensions, time and
#'store. Let us assume the '.data' format allows to keep a name for each of 
#'these dimensions, and the actual names are 'time' and 'store'.\cr\cr
#'The different item files for sales or prices can be perceived as belonging to 
#'an 'item' dimension of length 3, and the two groups of three items to a 
#''section' dimension of length 2, and the two groups of two sections (one with
#'the sales and the other with the prices) can be perceived as belonging also to
#'another dimension 'variable' of length 2. Even the source can be perceived as 
#'belonging to a dimension 'source' of length 1.\cr\cr
#'All in all, in this example, the whole data could be perceived as belonging to
#'a multidimensional 'large array' of dimensions\cr
#'\command{
#'\cr #  source variable  section      item    store    month
#'\cr #       1        2        2         3      100       24
#'}
#'\cr\cr
#'The dimensions of this 'large array' can be classified in two types. The ones 
#'that group actual files (the file dimensions) and the ones that group data 
#'values inside the files (the inner dimensions). In the example, the file 
#'dimensions are 'source', 'variable', 'section' and 'item', whereas the inner 
#'dimensions are 'store' and 'month'.
#'\cr\cr
#'Having the dimensions of our target sources in mind, the parameter \code{\dots} 
#'expects to receive information on:
#'    \itemize{
#'      \item{
#'The names of the expected dimensions of the 'large dataset' we want to 
#'retrieve data from
#'      }
#'      \item{
#'The indices to take from each dimension (and other constraints)
#'      }
#'      \item{
#'How to reorder the dimension if needed
#'      }
#'      \item{
#'The location and organization of the files of the data sets
#'      }
#'    }
#'For each dimension, the 3 first information items can be specified with a set
#'of parameters to be provided through \code{\dots}. For a given dimension 
#''dimname', six parameters can be specified:\cr
#'\command{
#'\cr # dimname = <indices_to_take>,  # 'all' / 'first' / 'last' /
#'\cr #                               # indices(c(1, 10, 20)) /
#'\cr #                               # indices(c(1:20)) /
#'\cr #                               # indices(list(1, 20)) /
#'\cr #                               # c(1, 10, 20) / c(1:20) /
#'\cr #                               # list(1, 20)
#'\cr # dimname_var = <name_of_associated_coordinate_variable>,
#'\cr # dimname_tolerance = <tolerance_value>,
#'\cr # dimname_reorder = <reorder_function>,
#'\cr # dimname_depends = <name_of_another_dimension>,
#'\cr # dimname_across = <name_of_another_dimension>
#'}
#'\cr\cr
#'The \bold{indices to take} can be specified in three possible formats (see 
#'code comments above for examples). The first format consists in using 
#'character tags, such as 'all' (take all the indices available for that 
#'dimension), 'first' (take only the first) and 'last' (only the last). The 
#'second format consists in using numeric indices, which have to be wrapped in a
#'call to the indices() helper function. For the second format, either a
#'vector of numeric indices can be provided, or a list with two numeric indices 
#'can be provided to take all the indices in the range between the two specified
#'indices (both extremes inclusive). The third format consists in providing a 
#'vector character strings (for file dimensions) or of values of whichever type
#'(for inner dimensions). For the file dimensions, the provided character 
#'strings in the third format will be used as components to build up the final 
#'path to the files (read further). For inner dimensions, the provided values in
#'the third format will be compared to the values of an associated coordinate 
#'variable (must be specified in '<dimname>_reorder', read further), and the 
#'indices of the closest values will be retrieved. When using the third format, 
#'a list with two values can also be provided to take all the indices of the 
#'values within the specified range.
#'\cr\cr
#'The \bold{name of the associated coordinate variable} must be a character 
#'string with the name of an associated coordinate variable to be found in the 
#'data files (in all* of them). For this to work, a 'file_var_reader' 
#'function must be specified when calling Start() (see parameter 
#''file_var_reader'). The coordinate variable must also be requested in the 
#'parameter 'return_vars' (see its section for details). This feature only 
#'works for inner dimensions.
#'\cr\cr
#'The \bold{tolerance value} is useful when indices for an inner dimension are 
#'specified in the third format (values of whichever type). In that case, the 
#'indices of the closest values in the coordinate variable are seeked. However 
#'the closest value might be too distant and we would want to consider no real 
#'match exists for such provided value. This is possible via the tolerance,
#'which allows to specify a threshold beyond which not to seek for matching 
#'values and mark that index as missing value.
#'\cr\cr
#'The \bold{reorder_function} is useful when indices for an inner dimension are
#'specified in the third fromat, and the retrieved indices need to be reordered 
#'in function of their provided associated variable values. A function can be 
#'provided, which receives as input a vector of values, and returns as outputs a
#'list with the components \code{$x} with the reordered values, and \code{$ix} 
#'with the permutation indices. Two reordering functions are included in 
#'startR, the Sort() and the CircularSort().
#'\cr\cr
#'The \bold{name of another dimension} to be specified in <dimname>_depends,
#'only available for file dimensions, must be a character string with the name 
#'of another requested \bold{file dimension} in \code{\dots}, and will make 
#'Start() aware that the path components of a file dimension can vary in
#'function of the path component of another file dimension. For instance, in the
#'example above, specifying \code{item_depends = 'section'} will make 
#'Start() aware that the item names vary in function of the section, i.e.
#'section 'electronics' has items 'a', 'b' and 'c' but section 'clothing' has 
#'items 'd', 'e', 'f'. Otherwise Start() would expect to find the same 
#'item names in all the sections.
#'If values() is used to define dimensions, it is possible to provide different 
#'values of the depending dimension for each depended dimension values. For 
#'example, if \code{section = c('electronics', 'clothing')}, we can use
#'\code{item = list(electronics = c('a', 'b', 'c'), clothing = c('d', 'e', 'f'))}.
aho's avatar
aho committed
215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638
#'\cr\cr
#'The \bold{name of another dimension} to be specified in '<dimname>_across',
#'only available for inner dimensions, must be a character string with the name 
#'of another requested \bold{inner dimension} in \code{\dots}, and will make 
#'Start() aware that an inner dimension extends along multiple files. For
#'instance, let us imagine that in the example above, the records for each item 
#'are so large that it becomes necessary to split them in multiple files each 
#'one containing the registers for a different period of time, e.g. in 10 files 
#'with 100 months each ('item_a_period1.data', 'item_a_period2.data', and so on).
#'In that case, the data can be perceived as having an extra file dimension, the 
#''period' dimension. The inner dimension 'month' would extend across multiple 
#'files, and providing the parameter \code{month = indices(1, 300)} would make 
#'Start() crash because it would perceive we have made a request out of 
#'bounds (each file contains 100 'month' indices, but we requested 1 to 300). 
#'This can be solved by specifying the parameter \code{month_across = period} (a
#'long with the full specification of the dimension 'period').
#'\cr\cr
#'\bold{Defining the path pattern}
#'\cr
#'As mentioned above, the parameter \dots also expects to receive information 
#'with the location of the data files. In order to do this, a special dimension 
#'must be defined. In that special dimension, in place of specifying indices to 
#'take, a path pattern must be provided. The path pattern is a character string 
#'that encodes the way the files are organized in their source. It must be a 
#'path to one of the data set files in an accessible local or remote file system,
#'or a URL to one of the files provided by a local or remote server. The regions
#'of this path that vary across files (along the file dimensions) must be 
#'replaced by wildcards. The wildcards must match any of the defined file 
#'dimensions in the call to Start() and must be delimited with heading 
#'and trailing '$'. Shell globbing expressions can be used in the path pattern. 
#'See the next code snippet for an example of a path pattern.
#'\cr\cr
#'All in all, the call to Start() to load the entire data set in the 
#'example of store item sales, would look as follows:
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr #                               '$section$/$item$.data'),
#'\cr #               variable = 'all',
#'\cr #               section = 'all',
#'\cr #               item = 'all',
#'\cr #               item_depends = 'section',
#'\cr #               store = 'all',
#'\cr #               month = 'all')
#'}
#'\cr\cr
#'Note that in this example it would still be pending to properly define the 
#'parameters 'file_opener', 'file_closer', 'file_dim_reader', 
#''file_var_reader' and 'file_data_reader' for the '.data' file format
#'(see the corresponding sections).
#'\cr\cr
#'The call to Start() will return a multidimensional R array with the 
#'following dimensions:
#'\cr
#'\command{
#'\cr #  source variable  section      item    store    month
#'\cr #       1        2        2         3      100       24
#'}
#'\cr
#'The dimension specifications in the \code{\dots} do not have to follow any 
#'particular order. The returned array will have the dimensions in the same order
#'as they have been specified in the call. For example, the following call:
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr #                               '$section$/$item$.data'),
#'\cr #               month = 'all',
#'\cr #               store = 'all',
#'\cr #               item = 'all',
#'\cr #               item_depends = 'section',
#'\cr #               section = 'all',
#'\cr #               variable = 'all')
#'}
#'\cr\cr
#'would return an array with the following dimensions:
#'\cr
#'\command{
#'\cr #  source    month    store      item  section variable
#'\cr #       1       24      100         3        2        2
#'}
#'\cr\cr
#'Next, a more advanced example to retrieve data for only the sales records, for
#'the first section ('electronics'), for the 1st and 3rd items and for the 
#'stores located in Barcelona (assuming the files contain the variable 
#''store_location' with the name of the city each of the 100 stores are located 
#'at):
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr #                               '$section$/$item$.data'),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = NULL))
#'}
#'\cr\cr
#'The defined names for the dimensions do not necessarily have to match the 
#'names of the dimensions inside the file. Lists of alternative names to be 
#'seeked can be defined in the parameter 'synonims'.
#'\cr\cr
#'If data from multiple sources (not necessarily following the same structure) 
#'has to be retrieved, it can be done by providing a vector of character strings
#'with path pattern specifications, or, in the extended form, by providing a 
#'list of lists with the components 'name' and 'path', and the name of the 
#'dataset and path pattern as values, respectively. For example:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr #                 list(name = 'sourceA',
#'\cr #                      path = paste0('/sourceA/$variable$/',
#'\cr #                                    '$section$/$item$.data')),
#'\cr #                 list(name = 'sourceB',
#'\cr #                      path = paste0('/sourceB/$section$/',
#'\cr #                                    '$variable$/$item$.data'))
#'\cr #               ),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = NULL))
#'}
#'\cr
#'
#'@param return_vars A named list where the names are the names of the 
#'variables to be fetched in the files, and the values are vectors of 
#'character strings with the names of the file dimension which to retrieve each
#'variable for, or NULL if the variable has to be retrieved only once 
#'from any (the first) of the involved files.\cr\cr
#'Apart from retrieving a multidimensional data array, retrieving auxiliary 
#'variables inside the files can also be needed. The parameter 
#''return_vars' allows for requesting such variables, as long as a 
#''file_var_reader' function is also specified in the call to 
#'Start() (see documentation on the corresponding parameter). 
#'\cr\cr
#'In the case of the the item sales example (see documentation on parameter 
#'\code{\dots)}, the store location variable is requested with the parameter\cr 
#'\code{return_vars = list(store_location = NULL)}.\cr This will cause 
#'Start() to fetch once the variable 'store_location' and return it in 
#'the component\cr \code{$Variables$common$store_location},\cr and will be an 
#'array of character strings with the location names, with the dimensions 
#'\code{c('store' = 100)}. Although useless in this example, we could ask 
#'Start() to fetch and return such variable for each file along the 
#'items dimension as follows: \cr 
#'\code{return_vars = list(store_location = c('item'))}.\cr In that case, the 
#'variable will be fetched once from a file of each of the items, and will be 
#'returned as an array with the dimensions \code{c('item' = 3, 'store' = 100)}.
#'\cr\cr
#'If a variable is requested along a file dimension that contains path pattern 
#'specifications ('source' in the example), the fetched variable values will be 
#'returned in the component\cr \code{$Variables$<dataset_name>$<variable_name>}.\cr 
#'For example:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr #                 list(name = 'sourceA',
#'\cr #                      path = paste0('/sourceA/$variable$/',
#'\cr #                                    '$section$/$item$.data')),
#'\cr #                 list(name = 'sourceB',
#'\cr #                      path = paste0('/sourceB/$section$/',
#'\cr #                                    '$variable$/$item$.data'))
#'\cr #               ),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = c('source',
#'\cr #                                                     'item')))
#'\cr # # Checking the structure of the returned variables
#'\cr # str(found_data$Variables)
#'\cr # Named list
#'\cr # ..$common: NULL
#'\cr # ..$sourceA: Named list
#'\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
#'\cr # ..$sourceB: Named list
#'\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
#'\cr # # Checking the dimensions of the returned variable
#'\cr # # for the source A
#'\cr # dim(found_data$Variables$sourceA)
#'\cr #     item   store
#'\cr #        3       3
#'}
#'\cr\cr
#'The names of the requested variables do not necessarily have to match the 
#'actual variable names inside the files. A list of alternative names to be 
#'seeked can be specified via the parameter 'synonims'.
#'
#'@param synonims A named list where the names are the requested variable or 
#'dimension names, and the values are vectors of character strings with 
#'alternative names to seek for such dimension or variable.\cr\cr
#'In some requests, data from different sources may follow different naming 
#'conventions for the dimensions or variables, or even files in the same source
#'could have varying names. This parameter is in order for Start() to 
#'properly identify the dimensions or variables with different names.
#'\cr\cr
#'In the example used in parameter 'return_vars', it may be the case that 
#'the two involved data sources follow slightly different naming conventions. 
#'For example, source A uses 'sect' as name for the sections dimension, whereas 
#'source B uses 'section'; source A uses 'store_loc' as variable name for the 
#'store locations, whereas source B uses 'store_location'. This can be taken 
#'into account as follows:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr #                 list(name = 'sourceA',
#'\cr #                      path = paste0('/sourceA/$variable$/',
#'\cr #                                    '$section$/$item$.data')),
#'\cr #                 list(name = 'sourceB',
#'\cr #                      path = paste0('/sourceB/$section$/',
#'\cr #                                    '$variable$/$item$.data'))
#'\cr #               ),
#'\cr #               variable = 'sales',
#'\cr #               section = 'first',
#'\cr #               item = indices(c(1, 3)),
#'\cr #               item_depends = 'section',
#'\cr #               store = 'Barcelona',
#'\cr #               store_var = 'store_location',
#'\cr #               month = 'all',
#'\cr #               return_vars = list(store_location = c('source',
#'\cr #                                                     'item')),
#'\cr #               synonims = list(
#'\cr #                 section = c('sec', 'section'),
#'\cr #                 store_location = c('store_loc',
#'\cr #                                    'store_location')
#'\cr #               ))
#'}
#'\cr
#'
#'@param file_opener A function that receives as a single parameter 
#'  'file_path' a character string with the path to a file to be opened, 
#'  and returns an object with an open connection to the file (optionally with 
#'  header information) on success, or returns NULL on failure.
#'\cr\cr
#'This parameter takes by default NcOpener() (an opener function for NetCDF
#'files).
#'\cr\cr
#'See NcOpener() for a template to build a file opener for your own file 
#'format.
#'
#'@param file_var_reader A function with the header \code{file_path = NULL}, 
#'  \code{file_object = NULL}, \code{file_selectors = NULL}, \code{var_name}, 
#'  \code{synonims} that returns an array with auxiliary data (i.e. data from a
#'  variable) inside a file. Start() will provide automatically either a 
#'  'file_path' or a 'file_object' to the 'file_var_reader'
#'  function (the function has to be ready to work whichever of these two is 
#'  provided). The parameter 'file_selectors' will also be provided 
#'  automatically to the variable reader, containing a named list where the 
#'  names are the names of the file dimensions of the queried data set (see 
#'  documentation on \code{\dots}) and the values are single character strings 
#'  with the components used to build the path to the file being read (the one 
#'  provided in 'file_path' or 'file_object'). The parameter 'var_name'
#'  will be filled in automatically by Start() also, with the name of one
#'  of the variales to be read. The parameter 'synonims' will be filled in 
#'  with exactly the same value as provided in the parameter 'synonims' in 
#'  the call to Start(), and has to be used in the code of the variable 
#'  reader to check for alternative variable names inside the target file. The 
#'  'file_var_reader' must return a (multi)dimensional array with named 
#'  dimensions, and optionally with the attribute 'variales' with other 
#'  additional metadata on the retrieved variable.
#'\cr\cr
#'Usually, the 'file_var_reader' should be a degenerate case of the 
#''file_data_reader' (see documentation on the corresponding parameter), 
#'so it is recommended to code the 'file_data_reder' in first place.
#'\cr\cr
#'This parameter takes by default NcVarReader() (a variable reader function
#'for NetCDF files).
#'\cr\cr
#'See NcVarReader() for a template to build a variale reader for your own 
#'file format.
#'
#'@param file_dim_reader A function with the header \code{file_path = NULL}, 
#'  \code{file_object = NULL}, \code{file_selectors = NULL}, \code{synonims} 
#'  that returns a named numeric vector where the names are the names of the 
#'  dimensions of the multidimensional data array in the file and the values are
#'  the sizes of such dimensions. Start() will provide automatically 
#'  either a 'file_path' or a 'file_object' to the 
#'  'file_dim_reader' function (the function has to be ready to work 
#'  whichever of these two is provided). The parameter 'file_selectors'
#'  will also be provided automatically to the dimension reader, containing a
#'  named list where the names are the names of the file dimensions of the 
#'  queried data set (see documentation on \code{\dots}) and the values are 
#'  single character strings with the components used to build the path to the 
#'  file being read (the one provided in 'file_path' or 'file_object'). 
#'  The parameter 'synonims' will be filled in with exactly the same value 
#'  as provided in the parameter 'synonims' in the call to Start(), 
#'  and can optionally be used in advanced configurations.
#'\cr\cr
#'This parameter takes by default NcDimReader() (a dimension reader 
#'function for NetCDF files).
#'\cr\cr
#'See NcDimReader() for (an advanced) template to build a dimension reader
#'for your own file format.
#'
#'@param file_data_reader A function with the header \code{file_path = NULL}, 
#'  \code{file_object = NULL}, \code{file_selectors = NULL}, 
#'  \code{inner_indices = NULL}, \code{synonims} that returns a subset of the 
#'  multidimensional data array inside a file (even if internally it is not an 
#'  array). Start() will provide automatically either a 'file_path'
#'  or a 'file_object' to the 'file_data_reader' function (the 
#'  function has to be ready to work whichever of these two is provided). The
#'  parameter 'file_selectors' will also be provided automatically to the
#'  data reader, containing a named list where the names are the names of the
#'  file dimensions of the queried data set (see documentation on \code{\dots})
#'  and the values are single character strings with the components used to 
#'  build the path to the file being read (the one provided in 'file_path' or 
#'  'file_object'). The parameter 'inner_indices' will be filled in 
#'  automatically by Start() also, with a named list of numeric vectors, 
#'  where the names are the names of all the expected inner dimensions in a file
#'  to be read, and the numeric vectors are the indices to be taken from the 
#'  corresponding dimension (the indices may not be consecutive nor in order).
#'  The parameter 'synonims' will be filled in with exactly the same value 
#'  as provided in the parameter 'synonims' in the call to Start(), 
#'  and has to be used in the code of the data reader to check for alternative 
#'  dimension names inside the target file. The 'file_data_reader' must 
#'  return a (multi)dimensional array with named dimensions, and optionally with
#'  the attribute 'variables' with other additional metadata on the retrieved 
#'  data.
#'\cr\cr
#'Usually, 'file_data_reader' should use 'file_dim_reader'
#'(see documentation on the corresponding parameter), so it is recommended to 
#'code 'file_dim_reder' in first place.
#'\cr\cr
#'This parameter takes by default NcDataReader() (a data reader function 
#'for NetCDF files).
#'\cr\cr
#'See NcDataReader() for a template to build a data reader for your own 
#'file format.
#'
#'@param file_closer A function that receives as a single parameter 
#'  'file_object' an open connection (as returned by 'file_opener') 
#'  to one of the files to be read, optionally with header information, and 
#'  closes the open connection. Always returns NULL.
#'\cr\cr
#'This parameter takes by default NcCloser() (a closer function for NetCDF 
#'files).
#'\cr\cr
#'See NcCloser() for a template to build a file closer for your own file 
#'format.
#'
#'@param transform A function with the header \code{dara_array}, 
#'  \code{variables}, \code{file_selectors = NULL}, \code{\dots}. It receives as
#'  input, through the parameter \code{data_array}, a subset of a 
#'  multidimensional array (as returned by 'file_data_reader'), applies a 
#'  transformation to it and returns it, preserving the amount of dimensions but
#'  potentially modifying their size. This transformation may require data from 
#'  other auxiliary variables, automatically provided to 'transform' 
#'  through the parameter 'variables', in the form of a named list where
#'  the names are the variable names and the values are (multi)dimensional
#'  arrays. Which variables need to be sent to 'transform' can be specified
#'  with the parameter 'transform_vars' in Start(). The parameter 
#'  'file_selectors' will also be provided automatically to 
#'  'transform', containing a named list where the names are the names of 
#'  the file dimensions of the queried data set (see documentation on 
#'  \code{\dots}) and the values are single character strings with the 
#'  components used to build the path to the file the subset being processed 
#'  belongs to. The parameter \code{\dots} will be filled in with other 
#'  additional parameters to adjust the transformation, exactly as provided in 
#'  the call to Start() via the parameter 'transform_params'.
#'@param transform_params A named list with additional parameters to be sent to 
#'  the 'transform' function (if specified). See documentation on parameter
#'  'transform' for details.
#'@param transform_vars A vector of character strings with the names of 
#'  auxiliary variables to be sent to the 'transform' function (if 
#'  specified). All the variables to be sent to 'transform' must also 
#'  have been requested as return variables in the parameter 'return_vars' 
#'  of Start().
#'@param transform_extra_cells An integer of extra indices to retrieve from the 
#'  data set, beyond the requested indices in \code{\dots}, in order for 
#'  'transform' to dispose of additional information to properly apply 
#'  whichever transformation (if needed). As many as 
#'  'transform_extra_cells' will be retrieved beyond each of the limits for
#'  each of those inner dimensions associated to a coordinate variable and sent 
#'  to 'transform' (i.e. present in 'transform_vars'). After 
#'  'transform' has finished, Start() will take again and return a 
#'  subset of the result, for the returned data to fall within the specified 
#'  bounds in \code{\dots}. The default value is 2.
#'@param apply_indices_after_transform A logical value indicating when a 
#'  'transform' is specified in Start() and numeric indices are 
#'  provided for any of the inner dimensions that depend on coordinate variables,
#'  these numeric indices can be made effective (retrieved) before applying the 
#'  transformation or after. The boolean flag allows to adjust this behaviour. 
#'  It takes FALSE by default (numeric indices are applied before sending
#'  data to 'transform').
#'@param pattern_dims A character string indicating the name of the dimension 
#'  with path pattern specifications (see \code{\dots} for details). If not  
#'  specified, Start() assumes the first provided dimension is the pattern 
#'  dimension, with a warning.
#'@param metadata_dims A vector of character strings with the names of the file 
#'  dimensions which to return metadata for. As noted in 'file_data_reader', 
#'  the data reader can optionally return auxiliary data via the attribute 
#'  'variables' of the returned array. Start() by default returns the 
#'  auxiliary data read for only the first file of each source (or data set) in 
#'  the pattern dimension (see \code{\dots} for info on what the pattern 
#'  dimension is). However it can be configured to return the metadata for all 
#'  the files along any set of file dimensions. The default value is NULL, and
#'  it will be assigned automatically as parameter 'pattern_dims'.
#'@param selector_checker A function used internaly by Start() to 
#'  translate a set of selectors (values for a dimension associated to a 
#'  coordinate variable) into a set of numeric indices. It takes by default 
#'  SelectorChecker() and, in principle, it should not be required to 
#'  change it for customized file formats. The option to replace it is left open
#'  for more versatility. See the code of SelectorChecker() for details on
#'  the inputs, functioning and outputs of a selector checker.
#'@param merge_across_dims A logical value indicating whether to merge 
#'  dimensions across which another dimension extends (according to the 
#'  '<dimname>_across' parameters). Takes the value FALSE by default. For 
#'  example, if the dimension 'time' extends across the dimension 'chunk' and 
#'  \code{merge_across_dims = TRUE}, the resulting data array will only contain
#'  only the dimension 'time' as long as all the chunks together.
#'@param merge_across_dims_narm A logical value indicating whether to remove
#'  the additional NAs from data when parameter 'merge_across_dims' is TRUE.
#'  It is helpful when the length of the to-be-merged dimension is different 
#'  across another dimension. For example, if the dimension 'time' extends 
#'  across dimension 'chunk', and the time length along the first chunk is 2 
#'  while along the second chunk is 10. Setting this parameter as TRUE can 
#'  remove the additional 8 NAs at position 3 to 10. The default value is TRUE,
#'  but will be automatically turned to FALSE if 'merge_across_dims = FALSE'.
aho's avatar
aho committed
#'@param split_multiselected_dims A logical value indicating whether to split a 
#'  dimension that has been selected with a multidimensional array of selectors
#'  into as many dimensions as present in the selector array. The default value
#'  is FALSE.
#'@param path_glob_permissive A logical value or an integer specifying how many
#'  folder levels in the path pattern, beginning from the end, the shell glob
#'  expressions must be preserved and worked out for each file. The default 
#'  value is FALSE, which is equivalent to 0. TRUE is equivalent to 1.\cr\cr
#'When specifying a path pattern for a dataset, it might contain shell glob 
#'experissions. For each dataset, the first file matching the path pattern is 
#'found, and the found file is used to work out fixed values for the glob 
aho's avatar
aho committed
#'expressions that will be used for all the files of the dataset. However, in 
#'some cases, the values of the shell glob expressions may not be constant for 
aho's avatar
aho committed
#'all files in a dataset, and they need to be worked out for each file 
#'involved.\cr\cr
#'For example, a path pattern could be as follows: \cr
#'\code{'/path/to/dataset/$var$_*/$date$_*_foo.nc'}. \cr Leaving 
#'\code{path_glob_permissive = FALSE} will trigger automatic seek of the 
#'  contents to replace the asterisks (e.g. the first asterisk matches with 
#'  \code{'bar'} and the second with \code{'baz'}. The found contents will be 
#'  used for all files in the dataset (in the example, the path pattern will be
#'  fixed to\cr \code{'/path/to/dataset/$var$_bar/$date$_baz_foo.nc'}. However, if
#'  any of the files in the dataset have other contents in the position of the
#'  asterisks, Start() will not find them (in the example, a file like \cr
#'  \code{'/path/to/dataset/precipitation_bar/19901101_bin_foo.nc'} would not be
#'  found). Setting \code{path_glob_permissive = 1} would preserve global
#'  expressions in the latest level (in the example, the fixed path pattern
#'  would be\cr \code{'/path/to/dataset/$var$_bar/$date$_*_foo.nc'}, and the
#'  problematic file mentioned before would be found), but of course this would
#'  slow down the Start() call if the dataset involves a large number of
#'  files. Setting \code{path_glob_permissive = 2} would leave the original path
#'  pattern with the original glob expressions in the 1st and 2nd levels (in the
#'  example, both asterisks would be preserved, thus would allow Start()
#'  to recognize files such as \cr
aho's avatar
aho committed
#'  \code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'}).\cr\cr
#'Note that each glob expression can only represent one possibility (Start() 
aho's avatar
aho committed
#'chooses the first). Because \code{*} is not the tag, which means it cannot
aho's avatar
aho committed
#'be a dimension of the output array. Therefore, only one possibility can be
#'adopted. For example, if \cr
#'\code{'/path/to/dataset/precipitation_*/19901101_*_foo.nc'}\cr
#'has two matches:\cr
#'\code{'/path/to/dataset/precipitation_xxx/19901101_yyy_foo.nc'} and\cr
#'\code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'},\cr
#'only the first found file will be used.
#'@param largest_dims_length A logical value or a named integer vector
#'  indicating if Start() should examine all the files to get the largest 
#'  length of the inner dimensions (TRUE) or use the first valid file of each 
#'  dataset as the returned dimension length (FALSE). Since examining all the 
#'  files could be time-consuming, a vector can be used to explicitly specify
#'  the expected length of the inner dimensions. For those inner dimensions not
#'  specified, the first valid file will be used. The default value is FALSE.\cr\cr
#'  This parameter is useful when the required files don't have consistent 
#'  inner dimension. For example, there are 10 required experimental data files
#'  of a series of start dates. The data only contain 25 members for the first
#'  2 years while 51 members for the later years. If \code{'largest_dims_length = FALSE'},
#'  the returned member dimension length will be 25 only. The 26th to 51st 
aho's avatar
aho committed
#'  members in the later 8 years will be discarded. If \code{'largest_dims_length = TRUE'},
#'  the returned member dimension length will be 51. To save the resource,
#' \code{'largest_dims_length = c(member = 51)'} can also be used.
aho's avatar
aho committed
#'@param retrieve A logical value indicating whether to retrieve the data
#'  defined in the Start() call or to explore only its dimension lengths 
#'  and names, and the values for the file and inner dimensions. The default
#'  value is FALSE.
#'@param num_procs An integer of number of processes to be created for the
#'  parallel execution of the retrieval/transformation/arrangement of the
#'  multiple involved files in a call to Start(). If set to NULL,
#'  takes the number of available cores (as detected by future::availableCores).
#'  The default value is 1 (no parallel execution).
#'@param ObjectBigmemory a character string to be included as part of the 
#'  bigmemory object name. This parameter is thought to be used internally by the
#'  chunking capabilities of startR.
aho's avatar
aho committed
#'@param silent A logical value of whether to display progress messages (FALSE)
#'   or not (TRUE). The default value is FALSE.
#'@param debug A logical value of whether to return detailed messages on the
#'  progress and operations in a Start() call (TRUE) or not (FALSE). The
#'  default value is FALSE.
#'
#'@return If \code{retrieve = TRUE} the involved data is loaded into RAM memory
#'  and an object of the class 'startR_cube' with the following components is
#'  returned:\cr
#'  \item{Data}{
#'  Multidimensional data array with named dimensions, with the data values
#'  requested via \code{\dots} and other parameters. This array can potentially 
#'  contain metadata in the attribute 'variables'.
#'  }
#'  \item{Variables}{
#'  Named list of 1 + N components, containing lists of retrieved variables (as
#'  requested in 'return_vars') common to all the data sources (in the 1st
#'  component, \code{$common}), and for each of the N dara sources (named after 
#'  the source name, as specified in \dots, or, if not specified, \code{$dat1},
#'  \code{$dat2}, ..., \code{$datN}). Each of the variables are contained in a
#'  multidimensional array with named dimensions, and potentially with the
#'  attribute 'variables' with additional auxiliary data.
#'  }
#'  \item{Files}{
#'  Multidimensonal character string array with named dimensions. Its dimensions
#'  are the file dimensions (as requested in \code{\dots}). Each cell in this
#'  array contains a path to a retrieved file, or NULL if the corresponding
#'  file was not found.
#'  }
#'  \item{NotFoundFiles}{
#'  Array with the same shape as \code{$Files} but with NULL in the
#'  positions for which the corresponding file was found, and a path to the
#'  expected file in the positions for which the corresponding file was not
#'  found.
#'  }
#'  \item{FileSelectors}{
#'  Multidimensional character string array with named dimensions, with the same
#'  shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
#'  components used to build up the paths to each of the files in the data
#'  sources.
#'  }
#'  \item{PatternDim}{
#'  Character string containing the name of the file pattern dimension.
#'  }
aho's avatar
aho committed
#'If \code{retrieve = FALSE} the involved data is not loaded into RAM memory and
#'an object of the class 'startR_header' with the following components is
#' returned:\cr
#'  \item{Dimensions}{
#'  Named vector with the dimension lengths and names of the data involved in
#'  the Start() call.
#'  }
#'  \item{Variables}{
#'  Named list of 1 + N components, containing lists of retrieved variables (as
#'  requested in 'return_vars') common to all the data sources (in the 1st
#'  component, \code{$common}), and for each of the N dara sources (named after
#'  the source name, as specified in \dots, or, if not specified, \code{$dat1},
#'  \code{$dat2}, ..., \code{$datN}). Each of the variables are contained in a
#'  multidimensional array with named dimensions, and potentially with the
#'  attribute 'variables' with additional auxiliary data.
#'  }
aho's avatar
aho committed
#'  Multidimensonal character string array with named dimensions. Its dimensions
#'  are the file dimensions (as requested in \dots). Each cell in this array
#'  contains a path to a file to be retrieved (which may exist or not).
#'  }
#'  \item{FileSelectors}{
#'  Multidimensional character string array with named dimensions, with the same
#'  shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
#'  components used to build up the paths to each of the files in the data
#'  sources.
#'  }
#'  \item{PatternDim}{
#'  Character string containing the name of the file pattern dimension.
#'  }
aho's avatar
aho committed
#'  \item{StartRCall}{
#'  List of parameters sent to the Start() call, with the parameter
#'  'retrieve' set to TRUE. Intended for calling in order to
#'  retrieve the associated data a posteriori with a call to do.call().
#'  }
#'
#'@examples
#'  data_path <- system.file('extdata', package = 'startR')
#'  path_obs <- file.path(data_path, 'obs/monthly_mean/$var$/$var$_$sdate$.nc')
#'  sdates <- c('200011', '200012')
#'  data <- Start(dat = list(list(path = path_obs)),
#'                var = 'tos',
#'                sdate = sdates,
#'                time = 'all',
#'                latitude = 'all',
#'                longitude = 'all',
#'                return_vars = list(latitude = 'dat', 
#'                                   longitude = 'dat', 
#'                                   time = 'sdate'),
#'                retrieve = FALSE)
#'
#'@import bigmemory multiApply parallel abind future
#'@importFrom utils str
#'@importFrom stats na.omit setNames 
#'@importFrom ClimProjDiags Subset
aho's avatar
aho committed
#'@importFrom methods is
aho's avatar
aho committed
#'@export
Start <- function(..., # dim = indices/selectors, 
                  # dim_var = 'var', 
                  # dim_reorder = Sort/CircularSort, 
                  # dim_tolerance = number, 
                  # dim_depends = 'file_dim', 
                  # dim_across = 'file_dim', 
                  return_vars = NULL, 
                  synonims = NULL, 
                  file_opener = NcOpener, 
                  file_var_reader = NcVarReader, 
                  file_dim_reader = NcDimReader, 
                  file_data_reader = NcDataReader, 
                  file_closer = NcCloser, 
                  transform = NULL, 
                  transform_params = NULL, 
                  transform_vars = NULL, 
                  transform_extra_cells = 2, 
                  apply_indices_after_transform = FALSE, 
                  pattern_dims = NULL,
                  metadata_dims = NULL, 
                  selector_checker = SelectorChecker,
                  merge_across_dims = FALSE,
                  merge_across_dims_narm = TRUE,
aho's avatar
aho committed
                  split_multiselected_dims = FALSE,
                  path_glob_permissive = FALSE,
aho's avatar
aho committed
                  retrieve = FALSE, 
                  num_procs = 1, 
                  ObjectBigmemory = NULL,
aho's avatar
aho committed
                  silent = FALSE, debug = FALSE) {
  #, config_file = NULL
  #dictionary_dim_names = ,
  #dictionary_var_names =  

  # Specify Subset() is from ClimProjDiags
  Subset <- ClimProjDiags::Subset

  dim_params <- list(...)
  # Take *_var parameters apart
  var_params <- take_var_params(dim_params) 
aho's avatar
aho committed
  
  # Take *_reorder parameters apart
  dim_reorder_params <- take_var_reorder(dim_params)

aho's avatar
aho committed
  # Take *_tolerance parameters apart
  tolerance_params_ind <- grep('_tolerance$', names(dim_params))
  tolerance_params <- dim_params[tolerance_params_ind]
  
  # Take *_depends parameters apart
  depending_file_dims <- take_var_depends(dim_params)

aho's avatar
aho committed
  # Take *_across parameters apart
  inner_dims_across_files <- take_var_across(dim_params)
aho's avatar
aho committed
  
  # Check merge_across_dims
  if (!is.logical(merge_across_dims)) {
    stop("Parameter 'merge_across_dims' must be TRUE or FALSE.")
  }
  if (merge_across_dims & is.null(inner_dims_across_files)) {
    merge_across_dims <- FALSE
    .warning("Parameter 'merge_across_dims' is changed to FALSE because there is no *_across argument.")
  }
aho's avatar
aho committed
  
  # Check merge_across_dims_narm
  if (!is.logical(merge_across_dims_narm)) {
    stop("Parameter 'merge_across_dims_narm' must be TRUE or FALSE.")
  }
  if (!merge_across_dims & merge_across_dims_narm) {
    merge_across_dims_narm <- FALSE
  }
  
  # Leave alone the dimension parameters in the variable dim_params
  dim_params <- rebuild_dim_params(dim_params, merge_across_dims,
                                   inner_dims_across_files)
aho's avatar
aho committed
  dim_names <- names(dim_params)
  # Look for chunked dims
  chunks <- look_for_chunks(dim_params, dim_names) 
aho's avatar
aho committed
  # Check pattern_dims
  # Function found_pattern_dims may change pattern_dims in the .GlobalEnv
  found_pattern_dim <- found_pattern_dims(pattern_dims, dim_names, var_params,
                                          dim_params, dim_reorder_params)
aho's avatar
aho committed
  # Check all *_reorder are NULL or functions, and that they all have 
  # a matching dimension param.
  i <- 1
  for (dim_reorder_param in dim_reorder_params) {
    if (!is.function(dim_reorder_param)) {
      stop("All '*_reorder' parameters must be functions.")
    } else if (!any(grepl(paste0('^', strsplit(names(dim_reorder_params)[i], 
                                               '_reorder$')[[1]][1], '$'),
                          names(dim_params)))) {
      stop(paste0("All '*_reorder' parameters must be associated to a dimension parameter. Found parameter '", 
                  names(dim_reorder_params)[i], "' but no parameter '", 
                  strsplit(names(dim_reorder_params)[i], '_reorder$')[[1]][1], "'."))
      #} else if (!any(grepl(paste0('^', strsplit(names(dim_reorder_params)[i], 
      #                                           '_reorder$')[[1]][1], '$'),
      #                      names(var_params)))) {
      #  stop(paste0("All '*_reorder' parameters must be associated to a dimension parameter associated to a ",
      #              "variable. Found parameter '", names(dim_reorder_params)[i], "' and dimension parameter '", 
      #              strsplit(names(dim_reorder_params)[i], '_reorder$')[[1]][1], "' but did not find variable ",
      #              "parameter '", strsplit(names(dim_reorder_params)[i], '_reorder$')[[1]][1], "_var'."))
    }
    i <- i + 1
  }
  
  # Check all *_tolerance are NULL or vectors of character strings, and 
  # that they all have a matching dimension param.
  i <- 1
  for (tolerance_param in tolerance_params) {
    if (!any(grepl(paste0('^', strsplit(names(tolerance_params)[i], 
                                        '_tolerance$')[[1]][1], '$'),
                   names(dim_params)))) {
      stop(paste0("All '*_tolerance' parameters must be associated to a dimension parameter. Found parameter '", 
                  names(tolerance_params)[i], "' but no parameter '", 
                  strsplit(names(tolerance_params)[i], '_tolerance$')[[1]][1], "'."))
      #} else if (!any(grepl(paste0('^', strsplit(names(tolerance_params)[i], 
      #                                           '_tolerance$')[[1]][1], '$'),
      #                      names(var_params)))) {
      #  stop(paste0("All '*_tolerance' parameters must be associated to a dimension parameter associated to a ",
      #              "variable. Found parameter '", names(tolerance_params)[i], "' and dimension parameter '", 
      #              strsplit(names(tolerance_params)[i], '_tolerance$')[[1]][1], "' but did not find variable ",
      #              "parameter '", strsplit(names(tolerance_params)[i], '_tolerance$')[[1]][1], "_var'."))
    }
    i <- i + 1
  }
  # Make the keys of 'tolerance_params' to be the name of 
  # the corresponding dimension.
  if (length(tolerance_params) < 1) {
    tolerance_params <- NULL
  } else {
    names(tolerance_params) <- gsub('_tolerance$', '', names(tolerance_params))
  }
  
  # Check metadata_dims
  if (!is.null(metadata_dims)) {
    if (any(is.na(metadata_dims))) {
      metadata_dims <- NULL
    } else if (!is.character(metadata_dims) || (length(metadata_dims) < 1)) {
      stop("Parameter 'metadata' dims must be a vector of at least one character string.")
    }
  } else {
    metadata_dims <- pattern_dims
  }
  
  # Check if pattern_dims is the first item in metadata_dims
  if ((pattern_dims %in% metadata_dims) & metadata_dims[1] != pattern_dims) {
aho's avatar
aho committed
    metadata_dims <- c(pattern_dims, metadata_dims[-which(metadata_dims == pattern_dims)])
aho's avatar
aho committed
  }
  # Check if metadata_dims has more than 2 elements
  if ((metadata_dims[1] == pattern_dims & length(metadata_dims) > 2)) {
aho's avatar
aho committed
    .warning(paste0("Parameter 'metadata_dims' has too many elements which serve repetitive ",
                    "function. Keep '", metadata_dims[1], "' and '", metadata_dims[2], "' only."))
    metadata_dims <- metadata_dims[1:2]
aho's avatar
aho committed
  } else if (!(pattern_dims %in% metadata_dims) & length(metadata_dims) > 1) {
aho's avatar
aho committed
    .warning(paste0("Parameter 'metadata_dims' has too many elements which serve repetitive ",
                    "function. Keep '", metadata_dims[1], "' only."))
    metadata_dims <- metadata_dims[1]
aho's avatar
aho committed
  }

  # Once the pattern dimension with dataset specifications is found,
  # the variable 'dat' is mounted with the information of each
  # dataset.
  # Take only the datasets for the requested chunk
  dats_to_take <- get_chunk_indices(length(dim_params[[found_pattern_dim]]), 
                                    chunks[[found_pattern_dim]]['chunk'], 
                                    chunks[[found_pattern_dim]]['n_chunks'],
                                    found_pattern_dim)
aho's avatar
aho committed
  dim_params[[found_pattern_dim]] <- dim_params[[found_pattern_dim]][dats_to_take]
aho's avatar
aho committed
  dat <- dim_params[[found_pattern_dim]]
  #NOTE: This function creates the object 'dat_names'
aho's avatar
aho committed
  dat_names <- c()
  dat <- mount_dat(dat, pattern_dims, found_pattern_dim, dat_names) 
aho's avatar
aho committed

  dim_params[[found_pattern_dim]] <- dat_names

aho's avatar
aho committed
  # Reorder inner_dims_across_files (to make the keys be the file dimensions,
  # and the values to be the inner dimensions that go across it).
  if (!is.null(inner_dims_across_files)) {
    # Reorder: example, convert list(ftime = 'chunk', ensemble = 'member', xx = 'chunk')
    #                        to list(chunk = c('ftime', 'xx'), member = 'ensemble')
    new_idaf <- list()
    for (i in names(inner_dims_across_files)) {
      if (!(inner_dims_across_files[[i]] %in% names(new_idaf))) {
        new_idaf[[inner_dims_across_files[[i]]]] <- i
      } else {
        new_idaf[[inner_dims_across_files[[i]]]] <- c(new_idaf[[inner_dims_across_files[[i]]]], i)
      }