#'Declare, discover, subset and retrieve multidimensional distributed data sets
#'
#'See the \href{https://earth.bsc.es/gitlab/es/startR}{startR documentation and
#'tutorial} for a step-by-step explanation on how to use Start().\cr\cr
#'Nowadays in the era of big data, large multidimensional data sets from
#'diverse sources need to be combined and processed. Analysis of big data in any
#'field is often highly complex and time-consuming. Taking subsets of these data
#'sets and processing them efficiently becomes an indispensable practice. This
#'technique is also known as Domain Decomposition, Map Reduce or, more commonly,
#''chunking'.\cr\cr
#'startR (Subset, TrAnsform, ReTrieve, arrange and process large
#'multidimensional data sets in R) is an R project started at BSC with the aim
#'to develop a tool that allows the user to automatically process large
#'multidimensional distributed data sets. It is an open source project that is
#'open to external collaboration and funding, and will continuously evolve to
#'support as many data set formats as possible while maximizing its efficiency.\cr\cr
#'startR provides a framework under which a data set (a collection of one
#'or multiple data files, potentially distributed over various remote servers)
#'is perceived as if it were a single large multidimensional
#'array. Once such a multidimensional array is declared, any user-defined function
#'can be applied to the data in an \code{apply}-like fashion, where startR
#'transparently implements the Map Reduce paradigm. The steps to follow in order
#'to process a collection of big data sets are as follows:\cr
#'\itemize{
#' \item{
#'Declaring the data set, i.e. declaring the distribution of the data files
#'involved, the dimensions and shape of the multidimensional array, and the
#'boundaries of the target data. This step can be performed with the
#'Start() function. Numeric indices or coordinate values can be used when
#'fixing the boundaries. It is often necessary to apply transformations,
#'pre-processing or reordering to the data. Start() accepts user-defined
#'transformation or reordering functions to be applied for such purposes. Once a
#'data set is declared, a list of involved files, dimension lengths, memory size
#'and other metadata is made available. Optionally, the data set can be
#'retrieved and loaded onto the current R session if it is small enough.
#' }
#' \item{
#'Declaring the workflow of operations to perform on the involved data set(s).
#'This step can be performed with the Step() and AddStep() functions.
#' }
#' \item{
#'Defining the computation settings. The mandatory settings include a) how many
#'subsets to divide the data sets into and along which dimensions; b) which
#'platform to perform the workflow of operations on (local machine or remote
#'machine/HPC?), how to communicate with it (unidirectional or bidirectional
#'connection? shared or separate file systems?), which queuing system it uses
#'(slurm, PBS, LSF, none?); and c) how many parallel jobs and execution threads
#'per job to use when running the calculations. This step can be performed when
#'building up the call to the Compute() function.
#' }
#' \item{
#'Running the computation. startR transparently implements the Map Reduce
#'paradigm, according to the settings in the previous steps. The progress can
#'optionally be monitored with the EC-Flow workflow management tool. When the
#'computation ends, a report of performance timings is displayed. This step can
#'be triggered with the Compute() function.
#' }
#'}
#'startR is not bound to a specific file format. Interface functions to
#'custom file formats can be provided for Start() to read them. As of this
#'version, startR includes interface functions to the following file formats:
#'\itemize{
#' \item{
#'NetCDF
#' }
#'}
#'Metadata and auxiliary data are also preserved and arranged by Start(),
#'to the extent that they are retrieved by the interface functions for a specific
#'file format.
#'
#'@param \dots A selection of customized parameters depending on the data
#'format. When we retrieve data from one or a collection of data sets,
#'the involved data can be perceived as belonging to a large multi-dimensional
#'array. For instance, let us consider an example case. We want to retrieve data
#'from a source, which contains data for the number of monthly sales of various
#'items, and also for their retail price each month. The data on source is
#'stored as follows:\cr\cr
#'\command{
#'\cr # /data/
#'\cr # |-> sales/
#'\cr # | |-> electronics
#'\cr # | | |-> item_a.data
#'\cr # | | |-> item_b.data
#'\cr # | | |-> item_c.data
#'\cr # | |-> clothing
#'\cr # | |-> item_d.data
#'\cr # | |-> item_e.data
#'\cr # | |-> item_f.data
#'\cr # |-> prices/
#'\cr # |-> electronics
#'\cr # | |-> item_a.data
#'\cr # | |-> item_b.data
#'\cr # | |-> item_c.data
#'\cr # |-> clothing
#'\cr # |-> item_d.data
#'\cr # |-> item_e.data
#'\cr # |-> item_f.data
#'}\cr\cr
#'Each item file contains data, stored in whichever format, for the sales or
#'prices over a time period, e.g. for the past 24 months, registered at 100
#'different stores around the world. Whatever format it is stored in, each
#'file can be perceived as a container of a data array of 2 dimensions, month and
#'store. Let us assume the '.data' format allows keeping a name for each of
#'these dimensions, and the actual names are 'month' and 'store'.\cr\cr
#'The different item files for sales or prices can be perceived as belonging to
#'an 'item' dimension of length 3, and the two groups of three items to a
#''section' dimension of length 2, and the two groups of two sections (one with
#'the sales and the other with the prices) can be perceived as belonging also to
#'another dimension 'variable' of length 2. Even the source can be perceived as
#'belonging to a dimension 'source' of length 1.\cr\cr
#'All in all, in this example, the whole data could be perceived as belonging to
#'a multidimensional 'large array' of dimensions\cr
#'\command{
#'\cr # source variable section item store month
#'\cr # 1 2 2 3 100 24
#'}
#'\cr\cr
#'The dimensions of this 'large array' can be classified in two types. The ones
#'that group actual files (the file dimensions) and the ones that group data
#'values inside the files (the inner dimensions). In the example, the file
#'dimensions are 'source', 'variable', 'section' and 'item', whereas the inner
#'dimensions are 'store' and 'month'.
#'\cr\cr
#'Having the dimensions of our target sources in mind, the parameter \code{\dots}
#'expects to receive information on:
#' \itemize{
#' \item{
#'The names of the expected dimensions of the 'large dataset' we want to
#'retrieve data from
#' }
#' \item{
#'The indices to take from each dimension (and other constraints)
#' }
#' \item{
#'How to reorder the dimension if needed
#' }
#' \item{
#'The location and organization of the files of the data sets
#' }
#' }
#'For each dimension, the first three information items can be specified with a set
#'of parameters to be provided through \code{\dots}. For a given dimension
#''dimname', six parameters can be specified:\cr
#'\command{
#'\cr # dimname = <indices_to_take>, # 'all' / 'first' / 'last' /
#'\cr # # indices(c(1, 10, 20)) /
#'\cr # # indices(c(1:20)) /
#'\cr # # indices(list(1, 20)) /
#'\cr # # c(1, 10, 20) / c(1:20) /
#'\cr # # list(1, 20)
#'\cr # dimname_var = <name_of_associated_coordinate_variable>,
#'\cr # dimname_tolerance = <tolerance_value>,
#'\cr # dimname_reorder = <reorder_function>,
#'\cr # dimname_depends = <name_of_another_dimension>,
#'\cr # dimname_across = <name_of_another_dimension>
#'}
#'\cr\cr
#'The \bold{indices to take} can be specified in three possible formats (see
#'code comments above for examples). The first format consists of using
#'character tags, such as 'all' (take all the indices available for that
#'dimension), 'first' (take only the first) and 'last' (take only the last). The
#'second format consists of using numeric indices, which have to be wrapped in a
#'call to the indices() helper function. For the second format, either a
#'vector of numeric indices can be provided, or a list with two numeric indices
#'can be provided to take all the indices in the range between the two specified
#'indices (both extremes inclusive). The third format consists of providing a
#'vector of character strings (for file dimensions) or of values of any type
#'(for inner dimensions). For the file dimensions, the provided character
#'strings in the third format will be used as components to build up the final
#'path to the files (read further). For inner dimensions, the provided values in
#'the third format will be compared to the values of an associated coordinate
#'variable (must be specified in '<dimname>_var', read further), and the
#'indices of the closest values will be retrieved. When using the third format,
#'a list with two values can also be provided to take all the indices of the
#'values within the specified range.
#'\cr\cr
#'The \bold{name of the associated coordinate variable} must be a character
#'string with the name of an associated coordinate variable to be found in the
#'data files (in all* of them). For this to work, a 'file_var_reader'
#'function must be specified when calling Start() (see parameter
#''file_var_reader'). The coordinate variable must also be requested in the
#'parameter 'return_vars' (see its section for details). This feature only
#'works for inner dimensions.
#'\cr\cr
#'The \bold{tolerance value} is useful when indices for an inner dimension are
#'specified in the third format (values of any type). In that case, the
#'indices of the closest values in the coordinate variable are sought. However,
#'the closest value might be too distant, in which case we would want to consider
#'that no real match exists for the provided value. This is possible via the
#'tolerance, which allows specifying a threshold beyond which matching values are
#'not sought, marking that index as a missing value.
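#'For instance, a hedged sketch (assuming the files contain a numeric coordinate
#'variable 'month_number' for the 'month' inner dimension; both names are
#'illustrative, not part of the example data set) could be:
#'\cr
#'\command{
#'\cr # month = values(list(1, 6)),
#'\cr # month_var = 'month_number',
#'\cr # month_tolerance = 0.5 # coordinate values further than 0.5
#'\cr # # from a requested value count as no match
#'}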
#'\cr\cr
#'The \bold{reorder_function} is useful when indices for an inner dimension are
#'specified in the third format, and the retrieved indices need to be reordered
#'according to their associated coordinate variable values. A function can be
#'provided, which receives as input a vector of values, and returns as output a
#'list with the components \code{$x} with the reordered values, and \code{$ix}
#'with the permutation indices. Two reordering functions are included in
#'startR: Sort() and CircularSort().
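#'As a hedged sketch outside the store-sales example (assuming a data set with a
#''longitude' inner dimension and an associated coordinate variable of the same
#'name), one could request:
#'\cr
#'\command{
#'\cr # longitude = values(list(-10, 40)),
#'\cr # longitude_var = 'longitude',
#'\cr # longitude_reorder = CircularSort(0, 360)
#'}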
#'\cr\cr
#'The \bold{name of another dimension} to be specified in <dimname>_depends,
#'only available for file dimensions, must be a character string with the name
#'of another requested \bold{file dimension} in \code{\dots}, and will make
#'Start() aware that the path components of a file dimension can vary
#'depending on the path component of another file dimension. For instance, in the
#'example above, specifying \code{item_depends = 'section'} will make
#'Start() aware that the item names vary depending on the section, i.e.
#'section 'electronics' has items 'a', 'b' and 'c' but section 'clothing' has
#'items 'd', 'e', 'f'. Otherwise Start() would expect to find the same
#'item names in all the sections.
#'If values() is used to define dimensions, it is possible to provide different
#'values of the depending dimension for each value of the dimension it depends on.
#'example, if \code{section = c('electronics', 'clothing')}, we can use
#'\code{item = list(electronics = c('a', 'b', 'c'), clothing = c('d', 'e', 'f'))}.
#'\cr\cr
#'The \bold{name of another dimension} to be specified in '<dimname>_across',
#'only available for inner dimensions, must be a character string with the name
#'of another requested \bold{file dimension} in \code{\dots}, and will make
#'Start() aware that an inner dimension extends along multiple files. For
#'instance, let us imagine that in the example above, the records for each item
#'are so large that it becomes necessary to split them into multiple files, each
#'one containing the records for a different period of time, e.g. 10 files
#'with 100 months each ('item_a_period1.data', 'item_a_period2.data', and so on).
#'In that case, the data can be perceived as having an extra file dimension, the
#''period' dimension. The inner dimension 'month' would extend across multiple
#'files, and providing the parameter \code{month = indices(1:300)} would make
#'Start() fail because it would interpret the request as out of
#'bounds (each file contains 100 'month' indices, but we requested 1 to 300).
#'This can be solved by specifying the parameter \code{month_across = 'period'}
#'(along with the full specification of the dimension 'period').
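#'A hedged sketch of the relevant arguments (the 'period' file dimension is
#'hypothetical, introduced only for this illustration):
#'\cr
#'\command{
#'\cr # period = 'all',
#'\cr # month = indices(1:300),
#'\cr # month_across = 'period',
#'\cr # merge_across_dims = TRUE
#'}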
#'\cr\cr
#'\bold{Defining the path pattern}
#'\cr
#'As mentioned above, the parameter \dots also expects to receive information
#'with the location of the data files. In order to do this, a special dimension
#'must be defined. In that special dimension, in place of specifying indices to
#'take, a path pattern must be provided. The path pattern is a character string
#'that encodes the way the files are organized in their source. It must be a
#'path to one of the data set files in an accessible local or remote file system,
#'or a URL to one of the files provided by a local or remote server. The regions
#'of this path that vary across files (along the file dimensions) must be
#'replaced by wildcards. The wildcards must match any of the defined file
#'dimensions in the call to Start() and must be delimited with heading
#'and trailing '$'. Shell globbing expressions can be used in the path pattern.
#'See the next code snippet for an example of a path pattern.
#'\cr\cr
#'All in all, the call to Start() to load the entire data set in the
#'example of store item sales would look as follows:
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr # '$section$/$item$.data'),
#'\cr # variable = 'all',
#'\cr # section = 'all',
#'\cr # item = 'all',
#'\cr # item_depends = 'section',
#'\cr # store = 'all',
#'\cr # month = 'all')
#'}
#'\cr\cr
#'Note that in this example the parameters 'file_opener', 'file_closer',
#''file_dim_reader', 'file_var_reader' and 'file_data_reader' would still
#'need to be properly defined for the '.data' file format
#'(see the corresponding sections).
#'\cr\cr
#'The call to Start() will return a multidimensional R array with the
#'following dimensions:
#'\cr
#'\command{
#'\cr # source variable section item store month
#'\cr # 1 2 2 3 100 24
#'}
#'\cr
#'The dimension specifications in \code{\dots} do not have to follow any
#'particular order. The returned array will have the dimensions in the same order
#'as they have been specified in the call. For example, the following call:
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr # '$section$/$item$.data'),
#'\cr # month = 'all',
#'\cr # store = 'all',
#'\cr # item = 'all',
#'\cr # item_depends = 'section',
#'\cr # section = 'all',
#'\cr # variable = 'all')
#'}
#'\cr\cr
#'would return an array with the following dimensions:
#'\cr
#'\command{
#'\cr # source month store item section variable
#'\cr # 1 24 100 3 2 2
#'}
#'\cr\cr
#'Next, a more advanced example to retrieve data for only the sales records, for
#'the first section ('electronics'), for the 1st and 3rd items and for the
#'stores located in Barcelona (assuming the files contain the variable
#''store_location' with the name of the city in which each of the 100 stores is
#'located):
#'\cr
#'\command{
#'\cr # data <- Start(source = paste0('/data/$variable$/',
#'\cr # '$section$/$item$.data'),
#'\cr # variable = 'sales',
#'\cr # section = 'first',
#'\cr # item = indices(c(1, 3)),
#'\cr # item_depends = 'section',
#'\cr # store = 'Barcelona',
#'\cr # store_var = 'store_location',
#'\cr # month = 'all',
#'\cr # return_vars = list(store_location = NULL))
#'}
#'\cr\cr
#'The defined names for the dimensions do not necessarily have to match the
#'names of the dimensions inside the file. Lists of alternative names to be
#'sought can be defined via the parameter 'synonims'.
#'\cr\cr
#'If data from multiple sources (not necessarily following the same structure)
#'has to be retrieved, it can be done by providing a vector of character strings
#'with path pattern specifications, or, in the extended form, by providing a
#'list of lists with the components 'name' and 'path', and the name of the
#'dataset and path pattern as values, respectively. For example:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr # list(name = 'sourceA',
#'\cr # path = paste0('/sourceA/$variable$/',
#'\cr # '$section$/$item$.data')),
#'\cr # list(name = 'sourceB',
#'\cr # path = paste0('/sourceB/$section$/',
#'\cr # '$variable$/$item$.data'))
#'\cr # ),
#'\cr # variable = 'sales',
#'\cr # section = 'first',
#'\cr # item = indices(c(1, 3)),
#'\cr # item_depends = 'section',
#'\cr # store = 'Barcelona',
#'\cr # store_var = 'store_location',
#'\cr # month = 'all',
#'\cr # return_vars = list(store_location = NULL))
#'}
#'\cr
#'
#'@param return_vars A named list where the names are the names of the
#'variables to be fetched in the files, and the values are vectors of
#'character strings with the names of the file dimensions along which to retrieve
#'each variable, or NULL if the variable has to be retrieved only once
#'from any (the first) of the involved files.\cr\cr
#'Apart from retrieving a multidimensional data array, retrieving auxiliary
#'variables inside the files can also be needed. The parameter
#''return_vars' allows for requesting such variables, as long as a
#''file_var_reader' function is also specified in the call to
#'Start() (see documentation on the corresponding parameter).
#'\cr\cr
#'In the case of the item sales example (see documentation on parameter
#'\code{\dots}), the store location variable is requested with the parameter\cr
#'\code{return_vars = list(store_location = NULL)}.\cr This will cause
#'Start() to fetch the variable 'store_location' once and return it in
#'the component\cr \code{$Variables$common$store_location},\cr which will be an
#'array of character strings with the location names, with the dimensions
#'\code{c('store' = 100)}. Although of little use in this example, we could ask
#'Start() to fetch and return such variable for each file along the
#'items dimension as follows: \cr
#'\code{return_vars = list(store_location = c('item'))}.\cr In that case, the
#'variable will be fetched once from a file of each of the items, and will be
#'returned as an array with the dimensions \code{c('item' = 3, 'store' = 100)}.
#'\cr\cr
#'If a variable is requested along a file dimension that contains path pattern
#'specifications ('source' in the example), the fetched variable values will be
#'returned in the component\cr \code{$Variables$<dataset_name>$<variable_name>}.\cr
#'For example:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr # list(name = 'sourceA',
#'\cr # path = paste0('/sourceA/$variable$/',
#'\cr # '$section$/$item$.data')),
#'\cr # list(name = 'sourceB',
#'\cr # path = paste0('/sourceB/$section$/',
#'\cr # '$variable$/$item$.data'))
#'\cr # ),
#'\cr # variable = 'sales',
#'\cr # section = 'first',
#'\cr # item = indices(c(1, 3)),
#'\cr # item_depends = 'section',
#'\cr # store = 'Barcelona',
#'\cr # store_var = 'store_location',
#'\cr # month = 'all',
#'\cr # return_vars = list(store_location = c('source',
#'\cr # 'item')))
#'\cr # # Checking the structure of the returned variables
#'\cr # str(data$Variables)
#'\cr # Named list
#'\cr # ..$common: NULL
#'\cr # ..$sourceA: Named list
#'\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
#'\cr # ..$sourceB: Named list
#'\cr # .. ..$store_location: char[1:18(3d)] 'Barcelona' 'Barcelona' ...
#'\cr # # Checking the dimensions of the returned variable
#'\cr # # for the source A
#'\cr # dim(data$Variables$sourceA$store_location)
#'\cr # item store
#'\cr # 3 3
#'}
#'\cr\cr
#'The names of the requested variables do not necessarily have to match the
#'actual variable names inside the files. A list of alternative names to be
#'sought can be specified via the parameter 'synonims'.
#'
#'@param synonims A named list where the names are the requested variable or
#'dimension names, and the values are vectors of character strings with
#'alternative names to seek for such dimension or variable.\cr\cr
#'In some requests, data from different sources may follow different naming
#'conventions for the dimensions or variables, or even files in the same source
#'could have varying names. This parameter allows Start() to
#'properly identify the dimensions or variables with different names.
#'\cr\cr
#'In the example used in parameter 'return_vars', it may be the case that
#'the two involved data sources follow slightly different naming conventions.
#'For example, source A uses 'sect' as name for the sections dimension, whereas
#'source B uses 'section'; source A uses 'store_loc' as variable name for the
#'store locations, whereas source B uses 'store_location'. This can be taken
#'into account as follows:
#'\cr
#'\command{
#'\cr # data <- Start(source = list(
#'\cr # list(name = 'sourceA',
#'\cr # path = paste0('/sourceA/$variable$/',
#'\cr # '$section$/$item$.data')),
#'\cr # list(name = 'sourceB',
#'\cr # path = paste0('/sourceB/$section$/',
#'\cr # '$variable$/$item$.data'))
#'\cr # ),
#'\cr # variable = 'sales',
#'\cr # section = 'first',
#'\cr # item = indices(c(1, 3)),
#'\cr # item_depends = 'section',
#'\cr # store = 'Barcelona',
#'\cr # store_var = 'store_location',
#'\cr # month = 'all',
#'\cr # return_vars = list(store_location = c('source',
#'\cr # 'item')),
#'\cr # synonims = list(
#'\cr # section = c('sect', 'section'),
#'\cr # store_location = c('store_loc',
#'\cr # 'store_location')
#'\cr # ))
#'}
#'\cr
#'
#'@param file_opener A function that receives as a single parameter
#' 'file_path' a character string with the path to a file to be opened,
#' and returns an object with an open connection to the file (optionally with
#' header information) on success, or returns NULL on failure.
#'\cr\cr
#'This parameter takes by default NcOpener() (an opener function for NetCDF
#'files).
#'\cr\cr
#'See NcOpener() for a template to build a file opener for your own file
#'format.
#'
#'@param file_var_reader A function with the header \code{file_path = NULL},
#' \code{file_object = NULL}, \code{file_selectors = NULL}, \code{var_name},
#' \code{synonims} that returns an array with auxiliary data (i.e. data from a
#' variable) inside a file. Start() will provide automatically either a
#' 'file_path' or a 'file_object' to the 'file_var_reader'
#' function (the function has to be ready to work whichever of these two is
#' provided). The parameter 'file_selectors' will also be provided
#' automatically to the variable reader, containing a named list where the
#' names are the names of the file dimensions of the queried data set (see
#' documentation on \code{\dots}) and the values are single character strings
#' with the components used to build the path to the file being read (the one
#' provided in 'file_path' or 'file_object'). The parameter 'var_name'
#' will be filled in automatically by Start() also, with the name of one
#' of the variables to be read. The parameter 'synonims' will be filled in
#' with exactly the same value as provided in the parameter 'synonims' in
#' the call to Start(), and has to be used in the code of the variable
#' reader to check for alternative variable names inside the target file. The
#' 'file_var_reader' must return a (multi)dimensional array with named
#' dimensions, and optionally with the attribute 'variables' with other
#' additional metadata on the retrieved variable.
#'\cr\cr
#'Usually, the 'file_var_reader' should be a degenerate case of the
#''file_data_reader' (see documentation on the corresponding parameter),
#'so it is recommended to code the 'file_data_reader' first.
#'\cr\cr
#'This parameter takes by default NcVarReader() (a variable reader function
#'for NetCDF files).
#'\cr\cr
#'See NcVarReader() for a template to build a variable reader for your own
#'file format.
#'
#'@param file_dim_reader A function with the header \code{file_path = NULL},
#' \code{file_object = NULL}, \code{file_selectors = NULL}, \code{synonims}
#' that returns a named numeric vector where the names are the names of the
#' dimensions of the multidimensional data array in the file and the values are
#' the sizes of such dimensions. Start() will provide automatically
#' either a 'file_path' or a 'file_object' to the
#' 'file_dim_reader' function (the function has to be ready to work
#' whichever of these two is provided). The parameter 'file_selectors'
#' will also be provided automatically to the dimension reader, containing a
#' named list where the names are the names of the file dimensions of the
#' queried data set (see documentation on \code{\dots}) and the values are
#' single character strings with the components used to build the path to the
#' file being read (the one provided in 'file_path' or 'file_object').
#' The parameter 'synonims' will be filled in with exactly the same value
#' as provided in the parameter 'synonims' in the call to Start(),
#' and can optionally be used in advanced configurations.
#'\cr\cr
#'This parameter takes by default NcDimReader() (a dimension reader
#'function for NetCDF files).
#'\cr\cr
#'See NcDimReader() for (an advanced) template to build a dimension reader
#'for your own file format.
#'
#'@param file_data_reader A function with the header \code{file_path = NULL},
#' \code{file_object = NULL}, \code{file_selectors = NULL},
#' \code{inner_indices = NULL}, \code{synonims} that returns a subset of the
#' multidimensional data array inside a file (even if internally it is not an
#' array). Start() will provide automatically either a 'file_path'
#' or a 'file_object' to the 'file_data_reader' function (the
#' function has to be ready to work whichever of these two is provided). The
#' parameter 'file_selectors' will also be provided automatically to the
#' data reader, containing a named list where the names are the names of the
#' file dimensions of the queried data set (see documentation on \code{\dots})
#' and the values are single character strings with the components used to
#' build the path to the file being read (the one provided in 'file_path' or
#' 'file_object'). The parameter 'inner_indices' will be filled in
#' automatically by Start() also, with a named list of numeric vectors,
#' where the names are the names of all the expected inner dimensions in a file
#' to be read, and the numeric vectors are the indices to be taken from the
#' corresponding dimension (the indices may not be consecutive nor in order).
#' The parameter 'synonims' will be filled in with exactly the same value
#' as provided in the parameter 'synonims' in the call to Start(),
#' and has to be used in the code of the data reader to check for alternative
#' dimension names inside the target file. The 'file_data_reader' must
#' return a (multi)dimensional array with named dimensions, and optionally with
#' the attribute 'variables' with other additional metadata on the retrieved
#' data.
#'\cr\cr
#'Usually, 'file_data_reader' should use 'file_dim_reader'
#'(see documentation on the corresponding parameter), so it is recommended to
#'code 'file_dim_reader' first.
#'\cr\cr
#'This parameter takes by default NcDataReader() (a data reader function
#'for NetCDF files).
#'\cr\cr
#'See NcDataReader() for a template to build a data reader for your own
#'file format.
#'
#'@param file_closer A function that receives as a single parameter
#' 'file_object' an open connection (as returned by 'file_opener')
#' to one of the files to be read, optionally with header information, and
#' closes the open connection. Always returns NULL.
#'\cr\cr
#'This parameter takes by default NcCloser() (a closer function for NetCDF
#'files).
#'\cr\cr
#'See NcCloser() for a template to build a file closer for your own file
#'format.
#'
#'@param transform A function with the header \code{data_array},
#' \code{variables}, \code{file_selectors = NULL}, \code{\dots}. It receives as
#' input, through the parameter \code{data_array}, a subset of a
#' multidimensional array (as returned by 'file_data_reader'), applies a
#' transformation to it and returns it, preserving the amount of dimensions but
#' potentially modifying their size. This transformation may require data from
#' other auxiliary variables, automatically provided to 'transform'
#' through the parameter 'variables', in the form of a named list where
#' the names are the variable names and the values are (multi)dimensional
#' arrays. Which variables need to be sent to 'transform' can be specified
#' with the parameter 'transform_vars' in Start(). The parameter
#' 'file_selectors' will also be provided automatically to
#' 'transform', containing a named list where the names are the names of
#' the file dimensions of the queried data set (see documentation on
#' \code{\dots}) and the values are single character strings with the
#' components used to build the path to the file the subset being processed
#' belongs to. The parameter \code{\dots} will be filled in with other
#' additional parameters to adjust the transformation, exactly as provided in
#' the call to Start() via the parameter 'transform_params'.
#'@param transform_params A named list with additional parameters to be sent to
#' the 'transform' function (if specified). See documentation on parameter
#' 'transform' for details.
#'@param transform_vars A vector of character strings with the names of
#' auxiliary variables to be sent to the 'transform' function (if
#' specified). All the variables to be sent to 'transform' must also
#' have been requested as return variables in the parameter 'return_vars'
#' of Start().
#'@param transform_extra_cells An integer indicating the number of extra indices
#' to retrieve from the data set, beyond the requested indices in \code{\dots},
#' so that 'transform' has additional information available to properly apply
#' the transformation (if needed). As many as
#' 'transform_extra_cells' indices will be retrieved beyond each of the limits of
#' each of the inner dimensions associated to a coordinate variable and sent
#' to 'transform' (i.e. present in 'transform_vars'). After
#' 'transform' has finished, Start() will again take a subset of
#' the result, so that the returned data falls within the specified
#' bounds in \code{\dots}. The default value is 2.
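#'\cr\cr
#' A hedged sketch of how the 'transform*' parameters are typically combined
#' (CDORemapper() is the transformation function included in startR; the target
#' grid, method and dimension names are illustrative):
#'\cr
#'\command{
#'\cr # data <- Start(..., # path pattern and dimensions
#'\cr # transform = CDORemapper,
#'\cr # transform_params = list(grid = 'r360x181',
#'\cr # method = 'conservative'),
#'\cr # transform_vars = c('latitude', 'longitude'),
#'\cr # transform_extra_cells = 2,
#'\cr # return_vars = list(latitude = NULL,
#'\cr # longitude = NULL))
#'}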
#'@param apply_indices_after_transform A logical value indicating, when a
#' 'transform' is specified in Start() and numeric indices are
#' provided for any of the inner dimensions that depend on coordinate variables,
#' whether these numeric indices should be made effective (retrieved) before
#' applying the transformation or after. It takes FALSE by default (numeric
#' indices are applied before sending data to 'transform').
#'@param pattern_dims A character string indicating the name of the dimension
#' with path pattern specifications (see \code{\dots} for details). If not
#' specified, Start() assumes the first provided dimension is the pattern
#' dimension, with a warning.
#'@param metadata_dims A vector of character strings with the names of the file
#' dimensions which to return metadata for. As noted in 'file_data_reader',
#' the data reader can optionally return auxiliary data via the attribute
#' 'variables' of the returned array. Start() by default returns the
#' auxiliary data read for only the first file of each source (or data set) in
#' the pattern dimension (see \code{\dots} for info on what the pattern
#' dimension is). However, it can be configured to return the metadata for all
#' the files along any set of file dimensions. The default value is NULL, in
#' which case it is automatically set to the value of the parameter 'pattern_dims'.
#'@param selector_checker A function used internally by Start() to
#' translate a set of selectors (values for a dimension associated to a
#' coordinate variable) into a set of numeric indices. It takes by default
#' SelectorChecker() and, in principle, it should not be required to
#' change it for customized file formats. The option to replace it is left open
#' for more versatility. See the code of SelectorChecker() for details on
#' the inputs, functioning and outputs of a selector checker.
#'@param merge_across_dims A logical value indicating whether to merge
#' dimensions across which another dimension extends (according to the
#' '<dimname>_across' parameters). Takes the value FALSE by default. For
#' example, if the dimension 'time' extends across the dimension 'chunk' and
#' \code{merge_across_dims = TRUE}, the resulting data array will contain
#' only the dimension 'time', with a length equal to all the chunks put together.
#'@param merge_across_dims_narm A logical value indicating whether to remove
#' the additional NAs from data when parameter 'merge_across_dims' is TRUE.
#' It is helpful when the length of the to-be-merged dimension is different
#' across another dimension. For example, if the dimension 'time' extends
#' across dimension 'chunk', and the time length is 2 along the first chunk
#' but 10 along the second, setting this parameter to TRUE removes
#' the additional 8 NAs at positions 3 to 10. The default value is TRUE,
#' but will be automatically turned to FALSE if 'merge_across_dims = FALSE'.
#'@param split_multiselected_dims A logical value indicating whether to split a
#' dimension that has been selected with a multidimensional array of selectors
#' into as many dimensions as present in the selector array. The default value
#' is FALSE.
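#'\cr\cr
#' A hedged sketch (the 'sdate' file dimension and its selector array are
#' illustrative, not part of the store-sales example):
#'\cr
#'\command{
#'\cr # sdates <- array(c('200011', '200012', '200111', '200112'),
#'\cr # dim = c(smonth = 2, syear = 2))
#'\cr # data <- Start(..., # path pattern and other dimensions
#'\cr # sdate = sdates,
#'\cr # split_multiselected_dims = TRUE)
#'\cr # # The returned array has the dimensions 'smonth' and 'syear'
#'\cr # # in place of a single 'sdate' dimension.
#'}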
#'@param path_glob_permissive A logical value or an integer specifying for how
#' many folder levels in the path pattern, beginning from the end, the shell glob
#' expressions must be preserved and worked out for each file. The default
#' value is FALSE, which is equivalent to 0. TRUE is equivalent to 1.\cr\cr
#'When specifying a path pattern for a dataset, it might contain shell glob
#'expressions. For each dataset, the first file matching the path pattern is
#'found, and the found file is used to work out fixed values for the glob
#'expressions that will be used for all the files of the dataset. However, in
#'some cases, the values of the shell glob expressions may not be constant for
#'all files in a dataset, and they need to be worked out for each file
#'involved.\cr\cr
#'For example, a path pattern could be as follows: \cr
#'\code{'/path/to/dataset/$var$_*/$date$_*_foo.nc'}. \cr Leaving
#'\code{path_glob_permissive = FALSE} will trigger an automatic seek of the
#' contents to replace the asterisks (e.g. the first asterisk matches with
#' \code{'bar'} and the second with \code{'baz'}). The found contents will be
#' used for all files in the dataset (in the example, the path pattern will be
#' fixed to\cr \code{'/path/to/dataset/$var$_bar/$date$_baz_foo.nc'}). However, if
#' any of the files in the dataset have other contents in the position of the
#' asterisks, Start() will not find them (in the example, a file like \cr
#' \code{'/path/to/dataset/precipitation_bar/19901101_bin_foo.nc'} would not be
#' found). Setting \code{path_glob_permissive = 1} would preserve the glob
#' expressions in the last level (in the example, the fixed path pattern
#' would be\cr \code{'/path/to/dataset/$var$_bar/$date$_*_foo.nc'}, and the
#' problematic file mentioned before would be found), but of course this would
#' slow down the Start() call if the dataset involves a large number of
#' files. Setting \code{path_glob_permissive = 2} would leave the original path
#' pattern with the original glob expressions in the last two levels (in the
#' example, both asterisks would be preserved, thus allowing Start()
#' to recognize files such as \cr
#' \code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'}).\cr\cr
#'Note that each glob expression can only represent one possibility (Start()
#'chooses the first match). Since \code{*} is not a dimension tag, it cannot
#'become a dimension of the output array; therefore, only one possibility can be
#'adopted. For example, if \cr
#'\code{'/path/to/dataset/precipitation_*/19901101_*_foo.nc'}\cr
#'has two matches:\cr
#'\code{'/path/to/dataset/precipitation_xxx/19901101_yyy_foo.nc'} and\cr
#'\code{'/path/to/dataset/precipitation_zzz/19901101_yyy_foo.nc'},\cr
#'only the first found file will be used.
#'@param largest_dims_length A logical value or a named integer vector
#' indicating if Start() should examine all the files to get the largest
#' length of the inner dimensions (TRUE) or use the first valid file of each
#' dataset as the returned dimension length (FALSE). Since examining all the
#' files could be time-consuming, a vector can be used to explicitly specify
#' the expected length of the inner dimensions. For those inner dimensions not
#' specified, the first valid file will be used. The default value is FALSE.\cr\cr
#' This parameter is useful when the required files don't have consistent
#' inner dimension lengths. For example, consider 10 required experimental data
#' files over a series of start dates, where the data contain only 25 members
#' for the first 2 years but 51 members for the later years. If \code{'largest_dims_length = FALSE'},
#' the returned member dimension length will be 25 only, and the 26th to 51st
#' members in the later 8 years will be discarded. If \code{'largest_dims_length = TRUE'},
#' the returned member dimension length will be 51. To save resources,
#' \code{'largest_dims_length = c(member = 51)'} can also be used.
#'@param retrieve A logical value indicating whether to retrieve the data
#' defined in the Start() call or to explore only its dimension lengths
#' and names, and the values for the file and inner dimensions. The default
#' value is FALSE.
#'@param num_procs An integer of number of processes to be created for the
#' parallel execution of the retrieval/transformation/arrangement of the
#' multiple involved files in a call to Start(). If set to NULL,
#' takes the number of available cores (as detected by detectCores() from
#' the package 'parallel'). The default value is 1 (no parallel execution).
#'@param ObjectBigmemory A character string to be included as part of the
#' bigmemory object name. This parameter is intended to be used internally by the
#' chunking capabilities of startR.
#'@param silent A logical value indicating whether to display progress messages
#' (FALSE) or not (TRUE). The default value is FALSE.
#'@param debug A logical value indicating whether to return detailed messages on
#' the progress and operations in a Start() call (TRUE) or not (FALSE). The
#' default value is FALSE.
#'
#'@return If \code{retrieve = TRUE} the involved data is loaded into RAM
#' and an object of the class 'startR_cube' with the following components is
#' returned:\cr
#' \item{Data}{
#' Multidimensional data array with named dimensions, with the data values
#' requested via \code{\dots} and other parameters. This array can potentially
#' contain metadata in the attribute 'variables'.
#' }
#' \item{Variables}{
#' Named list of 1 + N components, containing lists of retrieved variables (as
#' requested in 'return_vars') common to all the data sources (in the 1st
#' component, \code{$common}), and for each of the N data sources (named after
#' the source name, as specified in \dots, or, if not specified, \code{$dat1},
#' \code{$dat2}, ..., \code{$datN}). Each of the variables is contained in a
#' multidimensional array with named dimensions, and potentially with the
#' attribute 'variables' with additional auxiliary data.
#' }
#' \item{Files}{
#' Multidimensional character string array with named dimensions. Its dimensions
#' are the file dimensions (as requested in \code{\dots}). Each cell in this
#' array contains a path to a retrieved file, or NULL if the corresponding
#' file was not found.
#' }
#' \item{NotFoundFiles}{
#' Array with the same shape as \code{$Files} but with NULL in the
#' positions for which the corresponding file was found, and a path to the
#' expected file in the positions for which the corresponding file was not
#' found.
#' }
#' \item{FileSelectors}{
#' Multidimensional character string array with named dimensions, with the same
#' shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
#' components used to build up the paths to each of the files in the data
#' sources.
#' }
#'If \code{retrieve = FALSE} the involved data is not loaded into RAM and
#'an object of the class 'startR_header' with the following components is
#' returned:\cr
#' \item{Dimensions}{
#' Named vector with the dimension lengths and names of the data involved in
#' the Start() call.
#' }
#' \item{Variables}{
#' Named list of 1 + N components, containing lists of retrieved variables (as
#' requested in 'return_vars') common to all the data sources (in the 1st
#' component, \code{$common}), and for each of the N data sources (named after
#' the source name, as specified in \dots, or, if not specified, \code{$dat1},
#' \code{$dat2}, ..., \code{$datN}). Each of the variables is contained in a
#' multidimensional array with named dimensions, and potentially with the
#' attribute 'variables' with additional auxiliary data.
#' }
#' \item{Files}{
#' Multidimensional character string array with named dimensions. Its dimensions
#' are the file dimensions (as requested in \dots). Each cell in this array
#' contains a path to a file to be retrieved (which may exist or not).
#' }
#' \item{FileSelectors}{
#' Multidimensional character string array with named dimensions, with the same
#' shape as \code{$Files} and \code{$NotFoundFiles}, which contains the
#' components used to build up the paths to each of the files in the data
#' sources.
#' }
#' \item{StartRCall}{
#' List of parameters sent to the Start() call, with the parameter
#' 'retrieve' set to TRUE. Intended to be used for retrieving the
#' associated data a posteriori with a call to do.call().
#' }
#'
#'@examples
#' data_path <- system.file('extdata', package = 'startR')
#' path_obs <- file.path(data_path, 'obs/monthly_mean/$var$/$var$_$sdate$.nc')
#' sdates <- c('200011', '200012')
#' data <- Start(dat = list(list(path = path_obs)),
#' var = 'tos',
#' sdate = sdates,
#' time = 'all',
#' latitude = 'all',
#' longitude = 'all',
#' return_vars = list(latitude = 'dat',
#' longitude = 'dat',
#' time = 'sdate'),
#' retrieve = FALSE)
#'
#'@import bigmemory multiApply parallel abind future
#'@importFrom utils str
#'@importFrom stats na.omit setNames
#'@importFrom ClimProjDiags Subset
#'@export
Start <- function(..., # dim = indices/selectors,
# dim_var = 'var',
# dim_reorder = Sort/CircularSort,
# dim_tolerance = number,
# dim_depends = 'file_dim',
# dim_across = 'file_dim',
return_vars = NULL,
synonims = NULL,
file_opener = NcOpener,
file_var_reader = NcVarReader,
file_dim_reader = NcDimReader,
file_data_reader = NcDataReader,
file_closer = NcCloser,
transform = NULL,
transform_params = NULL,
transform_vars = NULL,
transform_extra_cells = 2,
apply_indices_after_transform = FALSE,
pattern_dims = NULL,
metadata_dims = NULL,
selector_checker = SelectorChecker,
merge_across_dims = FALSE,
merge_across_dims_narm = TRUE,
split_multiselected_dims = FALSE,
path_glob_permissive = FALSE,
largest_dims_length = FALSE,
retrieve = FALSE,
num_procs = 1,
ObjectBigmemory = NULL, # default assumed; not stated in the parameter docs
silent = FALSE, debug = FALSE) {
#, config_file = NULL
#dictionary_dim_names = ,
#dictionary_var_names =
# Specify Subset() is from ClimProjDiags
Subset <- ClimProjDiags::Subset
dim_params <- list(...)
# Take *_var parameters apart
var_params <- take_var_params(dim_params)
dim_reorder_params <- take_var_reorder(dim_params)
# Take *_tolerance parameters apart
tolerance_params_ind <- grep('_tolerance$', names(dim_params))
tolerance_params <- dim_params[tolerance_params_ind]
# Take *_depends parameters apart
depending_file_dims <- take_var_depends(dim_params)
inner_dims_across_files <- take_var_across(dim_params)
# Check merge_across_dims
if (!is.logical(merge_across_dims)) {
stop("Parameter 'merge_across_dims' must be TRUE or FALSE.")
}
if (merge_across_dims & is.null(inner_dims_across_files)) {
merge_across_dims <- FALSE
.warning("Parameter 'merge_across_dims' is changed to FALSE because there is no *_across argument.")
}
# Check merge_across_dims_narm
if (!is.logical(merge_across_dims_narm)) {
stop("Parameter 'merge_across_dims_narm' must be TRUE or FALSE.")
}
if (!merge_across_dims & merge_across_dims_narm) {
merge_across_dims_narm <- FALSE
}
# Leave alone the dimension parameters in the variable dim_params
dim_params <- rebuild_dim_params(dim_params, merge_across_dims,
inner_dims_across_files)
# Derive the dimension names from the (rebuilt) dimension parameters
dim_names <- names(dim_params)
chunks <- look_for_chunks(dim_params, dim_names)
# Function found_pattern_dims may change pattern_dims in the .GlobalEnv
found_pattern_dim <- found_pattern_dims(pattern_dims, dim_names, var_params,
dim_params, dim_reorder_params)
# Check all *_reorder are NULL or functions, and that they all have
# a matching dimension param.
i <- 1
for (dim_reorder_param in dim_reorder_params) {
if (!is.function(dim_reorder_param)) {
stop("All '*_reorder' parameters must be functions.")
} else if (!any(grepl(paste0('^', strsplit(names(dim_reorder_params)[i],
'_reorder$')[[1]][1], '$'),
names(dim_params)))) {
stop(paste0("All '*_reorder' parameters must be associated to a dimension parameter. Found parameter '",
names(dim_reorder_params)[i], "' but no parameter '",
strsplit(names(dim_reorder_params)[i], '_reorder$')[[1]][1], "'."))
#} else if (!any(grepl(paste0('^', strsplit(names(dim_reorder_params)[i],
# '_reorder$')[[1]][1], '$'),
# names(var_params)))) {
# stop(paste0("All '*_reorder' parameters must be associated to a dimension parameter associated to a ",
# "variable. Found parameter '", names(dim_reorder_params)[i], "' and dimension parameter '",
# strsplit(names(dim_reorder_params)[i], '_reorder$')[[1]][1], "' but did not find variable ",
# "parameter '", strsplit(names(dim_reorder_params)[i], '_reorder$')[[1]][1], "_var'."))
}
i <- i + 1
}
# Check that all *_tolerance parameters have a matching dimension param.
i <- 1
for (tolerance_param in tolerance_params) {
if (!any(grepl(paste0('^', strsplit(names(tolerance_params)[i],
'_tolerance$')[[1]][1], '$'),
names(dim_params)))) {
stop(paste0("All '*_tolerance' parameters must be associated to a dimension parameter. Found parameter '",
names(tolerance_params)[i], "' but no parameter '",
strsplit(names(tolerance_params)[i], '_tolerance$')[[1]][1], "'."))
#} else if (!any(grepl(paste0('^', strsplit(names(tolerance_params)[i],
# '_tolerance$')[[1]][1], '$'),
# names(var_params)))) {
# stop(paste0("All '*_tolerance' parameters must be associated to a dimension parameter associated to a ",
# "variable. Found parameter '", names(tolerance_params)[i], "' and dimension parameter '",
# strsplit(names(tolerance_params)[i], '_tolerance$')[[1]][1], "' but did not find variable ",
# "parameter '", strsplit(names(tolerance_params)[i], '_tolerance$')[[1]][1], "_var'."))
}
i <- i + 1
}
# Make the keys of 'tolerance_params' to be the name of
# the corresponding dimension.
if (length(tolerance_params) < 1) {
tolerance_params <- NULL
} else {
names(tolerance_params) <- gsub('_tolerance$', '', names(tolerance_params))
}
# Check metadata_dims
if (!is.null(metadata_dims)) {
if (any(is.na(metadata_dims))) {
metadata_dims <- NULL
} else if (!is.character(metadata_dims) || (length(metadata_dims) < 1)) {
stop("Parameter 'metadata' dims must be a vector of at least one character string.")
}
} else {
metadata_dims <- pattern_dims
}
# Check if pattern_dims is the first item in metadata_dims
if ((pattern_dims %in% metadata_dims) & metadata_dims[1] != pattern_dims) {
metadata_dims <- c(pattern_dims, metadata_dims[-which(metadata_dims == pattern_dims)])
}
# Check if metadata_dims has more than 2 elements
if ((metadata_dims[1] == pattern_dims & length(metadata_dims) > 2)) {
.warning(paste0("Parameter 'metadata_dims' has too many elements which serve repetitive ",
"function. Keep '", metadata_dims[1], "' and '", metadata_dims[2], "' only."))
metadata_dims <- metadata_dims[1:2]
} else if (!(pattern_dims %in% metadata_dims) & length(metadata_dims) > 1) {
.warning(paste0("Parameter 'metadata_dims' has too many elements which serve repetitive ",
"function. Keep '", metadata_dims[1], "' only."))
metadata_dims <- metadata_dims[1]
}
# Once the pattern dimension with dataset specifications is found,
# the variable 'dat' is mounted with the information of each
# dataset.
# Take only the datasets for the requested chunk
dats_to_take <- get_chunk_indices(length(dim_params[[found_pattern_dim]]),
chunks[[found_pattern_dim]]['chunk'],
chunks[[found_pattern_dim]]['n_chunks'],
found_pattern_dim)
dim_params[[found_pattern_dim]] <- dim_params[[found_pattern_dim]][dats_to_take]
dat <- dim_params[[found_pattern_dim]]
#NOTE: This function creates the object 'dat_names'
dat_names <- c()
dat <- mount_dat(dat, pattern_dims, found_pattern_dim, dat_names)
dim_params[[found_pattern_dim]] <- dat_names
# Reorder inner_dims_across_files (to make the keys be the file dimensions,
# and the values to be the inner dimensions that go across it).
if (!is.null(inner_dims_across_files)) {
# Reorder: example, convert list(ftime = 'chunk', ensemble = 'member', xx = 'chunk')
# to list(chunk = c('ftime', 'xx'), member = 'ensemble')
new_idaf <- list()
for (i in names(inner_dims_across_files)) {
if (!(inner_dims_across_files[[i]] %in% names(new_idaf))) {
new_idaf[[inner_dims_across_files[[i]]]] <- i
} else {
new_idaf[[inner_dims_across_files[[i]]]] <- c(new_idaf[[inner_dims_across_files[[i]]]], i)
}
}
inner_dims_across_files <- new_idaf
}
# Check return_vars
if (is.null(return_vars)) {
return_vars <- list()
# if (length(var_params) > 0) {
# return_vars <- as.list(paste0(names(var_params), '_var'))
# } else {
# return_vars <- list()
# }
}
if (!is.list(return_vars)) {
stop("Parameter 'return_vars' must be a list or NULL.")
}
if (length(return_vars) > 0 && is.null(names(return_vars))) {
# names(return_vars) <- rep('', length(return_vars))
stop("Parameter 'return_vars' must be a named list.")
}
i <- 1
while (i <= length(return_vars)) {
# if (names(return_vars)[i] == '') {
# if (!(is.character(return_vars[[i]]) && (length(return_vars[[i]]) == 1))) {
# stop("The ", i, "th specification in 'return_vars' is malformed.")
# }
# if (!grepl('_var$', return_vars[[i]])) {
# stop("The ", i, "th specification in 'return_vars' is malformed.")
# }
# dim_name <- strsplit(return_vars[[i]], '_var$')[[1]][1]
# if (!(dim_name %in% names(var_params))) {
# stop("'", dim_name, "_var' requested in 'return_vars' but ",
# "no '", dim_name, "_var' specified in the .Load call.")
# }
# names(return_vars)[i] <- var_params[[dim_name]]
# return_vars[[i]] <- found_pattern_dim
# } else
if (length(return_vars[[i]]) > 0) {
if (!is.character(return_vars[[i]])) {
stop("The ", i, "th specification in 'return_vars' is malformed. It ",
"must be a vector of character strings of valid file dimension ",
"names.")
}
}
i <- i + 1
}
# Check synonims
if (!is.null(synonims)) {
error <- FALSE
if (!is.list(synonims)) {
error <- TRUE
}
for (synonim_entry in names(synonims)) {
if (!(synonim_entry %in% names(dim_params)) &&
!(synonim_entry %in% names(return_vars))) {
error <- TRUE
}
if (!is.character(synonims[[synonim_entry]]) ||
length(synonims[[synonim_entry]]) < 1) {
error <- TRUE
}
}
if (error) {
stop("Parameter 'synonims' must be a named list, where the names are ",
"a name of a requested dimension or variable and the values are ",
"vectors of character strings with at least one alternative name ",
" for each dimension or variable in 'synonims'.")
}
}
if (length(unique(names(synonims))) < length(names(synonims))) {
stop("There must not be repeated entries in 'synonims'.")
}
if (length(unique(unlist(synonims))) < length(unlist(synonims))) {
stop("There must not be repeated values in 'synonims'.")
}
# Make that all dims and vars have an entry in synonims, even if only dim_name = dim_name
dim_entries_to_add <- which(!(names(dim_params) %in% names(synonims)))
if (length(dim_entries_to_add) > 0) {
synonims[names(dim_params)[dim_entries_to_add]] <- as.list(names(dim_params)[dim_entries_to_add])
}
var_entries_to_add <- which(!(names(var_params) %in% names(synonims)))
if (length(var_entries_to_add) > 0) {
synonims[names(var_params)[var_entries_to_add]] <- as.list(names(var_params)[var_entries_to_add])
}
# Check if return_vars name is inner dim name. If it is synonim, change back to inner dim name
# and return a warning.
use_syn_names <- which(names(return_vars) %in% unlist(synonims) &
!names(return_vars) %in% names(synonims))
if (!identical(use_syn_names, integer(0))) {
for (use_syn_name in use_syn_names) {
wrong_name <- names(return_vars)[use_syn_name]
names(return_vars)[use_syn_name] <- names(unlist(
lapply(lapply(synonims, '%in%',
names(return_vars)[use_syn_name]),
which)))
.warning(paste0("The name '", wrong_name, "' in parameter 'return_vars' in synonim. ",
"Change it back to the inner dimension name, '",
names(return_vars)[use_syn_name], "'."))
}
}
# Check selector_checker
if (is.null(selector_checker) || !is.function(selector_checker)) {
stop("Parameter 'selector_checker' must be a function.")
}
# Check file_opener
if (is.null(file_opener) || !is.function(file_opener)) {
stop("Parameter 'file_opener' must be a function.")
}
# Check file_var_reader
if (!is.null(file_var_reader) && !is.function(file_var_reader)) {
stop("Parameter 'file_var_reader' must be a function.")
}
# Check file_dim_reader
if (!is.null(file_dim_reader) && !is.function(file_dim_reader)) {
stop("Parameter 'file_dim_reader' must be a function.")
}
# Check file_data_reader
if (is.null(file_data_reader) || !is.function(file_data_reader)) {
stop("Parameter 'file_data_reader' must be a function.")
}
# Check file_closer
if (is.null(file_closer) || !is.function(file_closer)) {
stop("Parameter 'file_closer' must be a function.")
}
# Check transform
if (!is.null(transform)) {
if (!is.function(transform)) {
stop("Parameter 'transform' must be a function.")
}
}
# Check transform_params
if (!is.null(transform_params)) {
if (!is.list(transform_params)) {
stop("Parameter 'transform_params' must be a list.")
}
if (is.null(names(transform_params))) {
stop("Parameter 'transform_params' must be a named list.")
}
}
# Check transform_vars
if (!is.null(transform_vars)) {
if (!is.character(transform_vars)) {
stop("Parameter 'transform_vars' must be a vector of character strings.")
}
}
if (any(!(transform_vars %in% names(return_vars)))) {
stop("All the variables specified in 'transform_vars' must also be specified in 'return_vars'.")
}
# Check apply_indices_after_transform
if (!is.logical(apply_indices_after_transform)) {
stop("Parameter 'apply_indices_after_transform' must be either TRUE or FALSE.")
}
aiat <- apply_indices_after_transform
# Check transform_extra_cells
if (!is.numeric(transform_extra_cells)) {
stop("Parameter 'transform_extra_cells' must be numeric.")
}
transform_extra_cells <- round(transform_extra_cells)
# Check split_multiselected_dims
if (!is.logical(split_multiselected_dims)) {
stop("Parameter 'split_multiselected_dims' must be TRUE or FALSE.")
}
# Check path_glob_permissive
if (!is.numeric(path_glob_permissive) && !is.logical(path_glob_permissive)) {
stop("Parameter 'path_glob_permissive' must be TRUE, FALSE or an integer.")
}
if (length(path_glob_permissive) != 1) {
stop("Parameter 'path_glob_permissive' must be of length 1.")
}
# Check largest_dims_length
if (!is.numeric(largest_dims_length) && !is.logical(largest_dims_length)) {
stop("Parameter 'largest_dims_length' must be TRUE, FALSE or a named integer vector.")
}
if (is.numeric(largest_dims_length)) {
if (any(largest_dims_length %% 1 != 0) | any(largest_dims_length < 0) | is.null(names(largest_dims_length))) {
stop("Parameter 'largest_dims_length' must be TRUE, FALSE or a named integer vector.")
}
}
if (is.logical(largest_dims_length) && length(largest_dims_length) != 1) {
stop("Parameter 'path_glob_permissive' must be TRUE, FALSE or a named integer vector.")
}
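  # e.g. (hypothetical): largest_dims_length = c(time = 124) declares that the largest 'time'
  # length among the files is 124, so that length is used instead of the one read from the
  # first file; largest_dims_length = TRUE asks Start() to find it by scanning the files.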
# Check retrieve
if (!is.logical(retrieve)) {
stop("Parameter 'retrieve' must be TRUE or FALSE.")
}
# Check num_procs
if (!is.null(num_procs)) {
if (!is.numeric(num_procs)) {
stop("Parameter 'num_procs' must be numeric.")
} else {
num_procs <- round(num_procs)
}
}
# Check silent
if (!is.logical(silent)) {
stop("Parameter 'silent' must be logical.")
}
if (!silent) {
.message(paste0("Exploring files... This will take a variable amount ",
"of time depending on the issued request and the ",
"performance of the file server..."))
}
if (!is.character(debug)) {
dims_to_check <- c('time')
} else {
dims_to_check <- debug
debug <- TRUE
}
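  # e.g. (hypothetical): debug = 'latitude' turns debugging on and restricts the debug
  # messages to the 'latitude' dimension; debug = TRUE keeps the default 'time' dimension.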
############################## READING FILE DIMS ############################
# Check that no unrecognized variables are present in the path patterns
# and also that no file dimensions are requested to THREDDs catalogs.
# And in the mean time, build all the work pieces and look for the
# first available file of each dataset.
array_of_files_to_load <- NULL
array_of_not_found_files <- NULL
indices_of_first_files_with_data <- vector('list', length(dat))
selectors_of_first_files_with_data <- vector('list', length(dat))
dataset_has_files <- rep(FALSE, length(dat))
found_file_dims <- vector('list', length(dat))
expected_inner_dims <- vector('list', length(dat))
#print("A")
for (i in 1:length(dat)) {
#print("B")
dat_selectors <- dim_params
dat_selectors[[found_pattern_dim]] <- dat_selectors[[found_pattern_dim]][i]
dim_vars <- paste0('$', dim_names, '$')
file_dims <- which(sapply(dim_vars, grepl, dat[[i]][['path']], fixed = TRUE))
if (length(file_dims) > 0) {
file_dims <- dim_names[file_dims]
}
file_dims <- unique(c(pattern_dims, file_dims))
found_file_dims[[i]] <- file_dims
expected_inner_dims[[i]] <- dim_names[which(!(dim_names %in% file_dims))]
# (Check the depending_file_dims).
if (any(c(names(depending_file_dims), unlist(depending_file_dims)) %in%
expected_inner_dims[[i]])) {
stop(paste0("The dimension dependancies specified in ",
"'depending_file_dims' can only be between file ",
"dimensions, but some inner dimensions found in ",
"dependancies for '", dat[[i]][['name']], "', which ",
"has the following file dimensions: ",
paste(paste0("'", file_dims, "'"), collapse = ', '), "."))
} else {
a <- names(depending_file_dims) %in% file_dims
b <- unlist(depending_file_dims) %in% file_dims
ab <- a & b
if (any(!ab)) {
.warning(paste0("Detected some dependancies in 'depending_file_dims' with ",
"non-existing dimension names. These will be disregarded."))
depending_file_dims <- depending_file_dims[-which(!ab)]
}
if (any(names(depending_file_dims) == unlist(depending_file_dims))) {
depending_file_dims <- depending_file_dims[-which(names(depending_file_dims) == unlist(depending_file_dims))]
}
}
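    # e.g. (hypothetical): a path pattern containing both '$chunk$' and '$sdate$' together with
    # chunk_depends = 'sdate' yields depending_file_dims = list(chunk = 'sdate'): the 'chunk'
    # selectors can then be a named list with one vector of chunks per requested 'sdate'.
    # Both dimensions must be file dimensions (i.e., appear as tags in the path pattern).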
# (Check the inner_dims_across_files).
if (any(!(names(inner_dims_across_files) %in% file_dims)) ||
any(!(unlist(inner_dims_across_files) %in% expected_inner_dims[[i]]))) {
stop(paste0("All relationships specified in ",
"'_across' parameters must be between a inner ",
"dimension and a file dimension. Found wrong ",
"specification for '", dat[[i]][['name']], "', which ",
"has the following file dimensions: ",
paste(paste0("'", file_dims, "'"), collapse = ', '),
", and the following inner dimensions: ",
paste(paste0("'", expected_inner_dims[[i]], "'"),
collapse = ', '), "."))
}
# (Check the return_vars).
j <- 1
while (j <= length(return_vars)) {
if (any(!(return_vars[[j]] %in% file_dims))) {
if (any(return_vars[[j]] %in% expected_inner_dims[[i]])) {
stop("Found variables in 'return_vars' requested ",
"for some inner dimensions (for dataset '",
dat[[i]][['name']], "'), but variables can only be ",
"requested for file dimensions.")
} else {
stop("Found variables in 'return_vars' requested ",
"for non-existing dimensions.")
}
}
j <- j + 1
}
# (Check the metadata_dims).
if (!is.null(metadata_dims)) {
if (any(!(metadata_dims %in% file_dims))) {
stop("All dimensions in 'metadata_dims' must be file dimensions.")
}
}
    # Add attributes indicating whether each dimension selector is given as values or as indices
tmp <- lapply(dat_selectors[which(dim_names != pattern_dims)], add_value_indices_flag)
dat_selectors <- c(dat_selectors[pattern_dims], tmp)
for (dim_name in dim_names[-which(dim_names == pattern_dims)]) {
## The following code 'rewrites' var_params for all datasets. If providing different
## path pattern repositories with different file/inner dimensions, var_params might
## have to be handled for each dataset separately.
if ((attr(dat_selectors[[dim_name]], 'values') || (dim_name %in% c('var', 'variable'))) &&
!(dim_name %in% names(var_params)) && !(dim_name %in% file_dims)) {
if (dim_name %in% c('var', 'variable')) {
var_params <- c(var_params, setNames(list('var_names'), dim_name))
.warning(paste0("Found specified values for dimension '", dim_name, "' but no '",
dim_name, "_var' provided. ", '"', dim_name, "_var = '",
'var_names', "'", '"', " has been automatically added to ",
"the Start call."))
} else {
var_params <- c(var_params, setNames(list(dim_name), dim_name))
.warning(paste0("Found specified values for dimension '", dim_name, "' but no '",
dim_name, "_var' requested. ", '"', dim_name, "_var = '",
dim_name, "'", '"', " has been automatically added to ",
"the Start call."))
        }
      } else if (attr(dat_selectors[[dim_name]], 'indices') & !(dim_name %in% names(var_params))) {
if (dim_name %in% transform_vars) {
var_params <- c(var_params, setNames(list(dim_name), dim_name))
.warning(paste0("Found dimension '", dim_name, "' is required to transform but no '",
dim_name, "_var' provided. ", '"', dim_name, "_var = '",
dim_name, "'", '"', " has been automatically added to ",
"the Start call."))
} else if (dim_name %in% names(dim_reorder_params)) {
var_params <- c(var_params, setNames(list(dim_name), dim_name))
.warning(paste0("Found dimension '", dim_name, "' is required to reorder but no '",
dim_name, "_var' provided. ", '"', dim_name, "_var = '",
dim_name, "'", '"', " has been automatically added to ",
"the Start call."))
## (Check the *_var parameters).
if (any(!(unlist(var_params) %in% names(return_vars)))) {
vars_to_add <- which(!(unlist(var_params) %in% names(return_vars)))
new_return_vars <- vector('list', length(vars_to_add))
names(new_return_vars) <- unlist(var_params)[vars_to_add]
return_vars <- c(return_vars, new_return_vars)
.warning(paste0("All '*_var' params must associate a dimension to one of the ",
"requested variables in 'return_vars'. The following variables",
" have been added to 'return_vars': ",
paste(paste0("'", unlist(var_params), "'"), collapse = ', ')))
}
# Examine the selectors of file dim and create 'replace_values', which uses the first
# explicit selector (i.e., character) for all file dimensions.
replace_values <- vector('list', length = length(file_dims))
names(replace_values) <- file_dims
for (file_dim in file_dims) {
if (file_dim %in% names(var_params)) {
.warning(paste0("The '", file_dim, "_var' param will be ignored since '",
file_dim, "' is a file dimension (for the dataset with pattern ",
dat[[i]][['path']], ")."))
}
# If the selector is a vector or a list of 2 without names (represent the value range)
if (!is.list(dat_selectors[[file_dim]]) ||
(is.list(dat_selectors[[file_dim]]) &&
length(dat_selectors[[file_dim]]) == 2 &&
is.null(names(dat_selectors[[file_dim]])))) {
dat_selectors[[file_dim]] <- list(dat_selectors[[file_dim]])
}
first_class <- class(dat_selectors[[file_dim]][[1]])
first_length <- length(dat_selectors[[file_dim]][[1]])
# Length will be > 1 if it is list since beginning, e.g., depending dim is a list with
# names as depended dim.
for (j in 1:length(dat_selectors[[file_dim]])) {
sv <- selector_vector <- dat_selectors[[file_dim]][[j]]
        if (!identical(first_class, class(sv)) ||
            !identical(first_length, length(sv))) {
stop("All provided selectors for depending dimensions must ",
"be vectors of the same length and of the same class.")
}
if (is.character(sv) && !((length(sv) == 1) && (sv[1] %in% c('all', 'first', 'last')))) {
#NOTE: ???? It doesn't make any changes.
dat_selectors[[file_dim]][[j]] <- selector_checker(selectors = sv,
return_indices = FALSE)
          # Take the chunk if needed (only for defined dims; undefined dims will be chunked
          # later in find_ufd_value()).
if (chunks[[file_dim]]['n_chunks'] > 1) {
desired_chunk_indices <- get_chunk_indices(
length(dat_selectors[[file_dim]][[j]]),
chunks[[file_dim]]['chunk'],
chunks[[file_dim]]['n_chunks'],
file_dim)
dat_selectors[[file_dim]][[j]] <- dat_selectors[[file_dim]][[j]][desired_chunk_indices]
# chunk the depending dim as well
if (file_dim %in% depending_file_dims) {
depending_dim_name <- names(which(file_dim == depending_file_dims))
# Chunk it only if it is defined dim (i.e., list of character with names of depended dim)
if (!(length(dat_selectors[[depending_dim_name]]) == 1 &&
dat_selectors[[depending_dim_name]] %in% c('all', 'first', 'last'))) {
if (any(sapply(dat_selectors[[depending_dim_name]], is.character))) {
dat_selectors[[depending_dim_name]] <-
dat_selectors[[depending_dim_name]][desired_chunk_indices]
}
}
}
}
} else if (!(is.numeric(sv) ||
(is.character(sv) && (length(sv) == 1) && (sv %in% c('all', 'first', 'last'))) ||
(is.list(sv) && (length(sv) == 2) && (all(sapply(sv, is.character)) ||
all(sapply(sv, is.numeric)))))) {
stop("All explicitly provided selectors for file dimensions must be character strings.")
}
}
sv <- dat_selectors[[file_dim]][[1]]
# 'replace_values' has the first selector (if it's character) or NULL (if it's not explicitly
# defined) for each file dimension.
      if (is.character(sv) && !((length(sv) == 1) && (sv[1] %in% c('all', 'first', 'last')))) {
        replace_values[[file_dim]] <- sv[1]
      }
}
#print("C")
    # Now we know which dimensions have selectors provided non-explicitly.
undefined_file_dims <- file_dims[which(sapply(replace_values, is.null))]
defined_file_dims <- file_dims[which(!(file_dims %in% undefined_file_dims))]
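    # e.g. (hypothetical): with a path pattern '/data/$var$/$sdate$_$member$.nc' and selectors
    # var = 'tas', sdate = 'all', member = indices(1:2), only 'var' has an explicit character
    # selector, so 'replace_values' keeps 'tas' for 'var' and NULL for the rest, and 'sdate'
    # and 'member' become the undefined file dimensions whose values are discovered below
    # from the first matching file.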
    # Quickly check if the depending dimensions are provided properly. The check is only
    # performed when both the depending and depended file dims are explicitly defined.
for (file_dim in file_dims) {
if (file_dim %in% names(depending_file_dims)) {
# Return error if depended dim is a list of values while depending dim is not
# defined (i.e., indices or 'all')
if (file_dim %in% defined_file_dims &
!(depending_file_dims[[file_dim]] %in% defined_file_dims)) {
stop(paste0("The depended dimension, ", file_dim, ", is explictly defined ",
"by a list of values, while the depending dimension, ",
depending_file_dims[[file_dim]], ", is not explictly defined. ",
"Specify ", depending_file_dims[[file_dim]], " by characters."))
}
#NOTE: The if statement below is tricky. It tries to distinguish if the depending dim
# has the depended dim as the names of the list. However, if the depending dim
# doesn't have list names and its length is 2 (i.e., list( , )), Start() thinks
# it means the range, just like `lat = values(list(10, 20))`. And because of this,
# we won't enter the following if statement, and the error will occur later in
# SelectorChecker(). Need to find a way to distinguish if list( , ) means range or
# just the values.
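        # e.g.: an unnamed two-element list such as list(10, 20) is read as a value range
        # (like 'lat = values(list(10, 20))'), not as two per-file selector vectors; naming
        # the list entries with the values of the depended dimension avoids the ambiguity.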
if (all(c(file_dim, depending_file_dims[[file_dim]]) %in% defined_file_dims)) {
if (length(dat_selectors[[file_dim]]) != length(dat_selectors[[depending_file_dims[[file_dim]]]][[1]])) {
stop(paste0("If providing selectors for the depending ",
"dimension '", file_dim, "', a ",
"vector of selectors must be provided for ",
"each selector of the dimension it depends on, '",
depending_file_dims[[file_dim]], "'."))
} else if (!all(names(dat_selectors[[file_dim]]) == dat_selectors[[depending_file_dims[[file_dim]]]][[1]])) {
stop(paste0("If providing selectors for the depending ",
"dimension '", file_dim, "', the name of the ",
"provided vectors of selectors must match ",
"exactly the selectors of the dimension it ",
"depends on, '", depending_file_dims[[file_dim]], "'."))
} else if (is.null(names(dat_selectors[[file_dim]]))) {
.warning(paste0("The selectors for the depending dimension '", file_dim, "' do not ",
"have list names. Assume that the order of the selectors matches the ",
"depended dimensions '", depending_file_dims[[file_dim]], "''s order."))
# Find the possible values for the selectors that are provided as
    # indices. If the requested files are on a remote server, this operation is not possible.
if (length(grep("^http", dat[[i]][['path']])) > 0) {
if (length(undefined_file_dims) > 0) {
stop(paste0("All selectors for the file dimensions must be ",
"character strings if requesting data to a remote ",
"server. Found invalid selectors for the file dimensions ",
paste(paste0("'", undefined_file_dims, "'"), collapse = ', '), "."))
}
dataset_has_files[i] <- TRUE
} else {
dat[[i]][['path']] <- path.expand(dat[[i]][['path']])
# Iterate over the known dimensions to find the first existing file.
# The path to the first existing file will be used to find the
# values for the non explicitly defined selectors.
first_file <- NULL
first_file_selectors <- NULL
if (length(undefined_file_dims) > 0) {
replace_values[undefined_file_dims] <- '*'
}
## TODO: What if length of defined_file_dims is 0? code might crash (in practice it worked for an example case)
files_to_check <- sapply(dat_selectors[defined_file_dims], function(x) length(x[[1]]))
sub_array_of_files_to_check <- array(1:prod(files_to_check), dim = files_to_check)
j <- 1
#print("D")
while (j <= prod(files_to_check) && is.null(first_file)) {
selector_indices <- which(sub_array_of_files_to_check == j, arr.ind = TRUE)[1, ]
selectors <- sapply(1:length(defined_file_dims),
function (x) {
vector_to_pick <- 1
if (defined_file_dims[x] %in% names(depending_file_dims)) {
vector_to_pick <- selector_indices[which(defined_file_dims == depending_file_dims[[defined_file_dims[x]]])]
}
dat_selectors[defined_file_dims][[x]][[vector_to_pick]][selector_indices[x]]
})
replace_values[defined_file_dims] <- selectors
file_path <- .ReplaceVariablesInString(dat[[i]][['path']], replace_values)
file_path <- Sys.glob(file_path)
if (length(file_path) > 0) {
first_file <- file_path[1]
first_file_selectors <- selectors
}
j <- j + 1
}
#print("E")
# Start looking for values for the non-explicitly defined selectors.
if (is.null(first_file)) {
.warning(paste0("No found files for the datset '", dat[[i]][['name']],
"'. Provide existing selectors for the file dimensions ",
" or check and correct its path pattern: ", dat[[i]][['path']]))
} else {
dataset_has_files[i] <- TRUE
## TODO: Improve message here if no variable found:
if (length(undefined_file_dims) > 0) {
# Note: "dat[[i]][['path']]" is changed by the function below.
dat_selectors <- find_ufd_value(undefined_file_dims, dat, i, replace_values,
first_file, file_dims, path_glob_permissive,
depending_file_dims, dat_selectors, selector_checker,
chunks)
#NOTE: If there is no non-explicitly defined dim, use the first found file
          #      to modify. Problem: '*' doesn't catch all the possible files. Only use
# the first file.
dat[[i]][['path']] <- .ReplaceGlobExpressions(dat[[i]][['path']], first_file, replace_values,
defined_file_dims, dat[[i]][['name']], path_glob_permissive)
}
}
}
dat[[i]][['selectors']] <- dat_selectors
    # Now look for the first available file
if (dataset_has_files[i]) {
known_dims <- file_dims
} else {
known_dims <- defined_file_dims
}
replace_values <- vector('list', length = length(known_dims))
names(replace_values) <- known_dims
files_to_load <- sapply(dat_selectors[known_dims], function(x) length(x[[1]]))
files_to_load[found_pattern_dim] <- 1
sub_array_of_files_to_load <- array(1:prod(files_to_load),
dim = files_to_load)
names(dim(sub_array_of_files_to_load)) <- known_dims
sub_array_of_not_found_files <- array(!dataset_has_files[i],
dim = files_to_load)
names(dim(sub_array_of_not_found_files)) <- known_dims
if (largest_dims_length) {
if (!exists('selector_indices_save')) {
selector_indices_save <- vector('list', length = length(dat))
}
if (!exists('selectors_total_list')) {
selectors_total_list <- vector('list', length = length(dat))
}
selector_indices_save[[i]] <- vector('list', length = prod(files_to_load))
selectors_total_list[[i]] <- vector('list', length = prod(files_to_load))
j <- 1
# NOTE: This while loop has these objects that are used afterward: 'sub_array_of_files_to_load',
# 'sub_array_of_not_found_files', 'indices_of_first_files_with_data', 'selectors_of_first_files_with_data';
# 'selector_indices_save' and 'selectors_total_list' are used if 'largest_dims_length = T'.
while (j <= prod(files_to_load)) {
selector_indices <- which(sub_array_of_files_to_load == j, arr.ind = TRUE)[1, ]
names(selector_indices) <- known_dims
if (largest_dims_length) {
tmp <- selector_indices
tmp[which(known_dims == found_pattern_dim)] <- i
selector_indices_save[[i]][[j]] <- tmp
}
# This 'selectors' is only used in this while loop
selectors <- sapply(1:length(known_dims),
function (x) {
vector_to_pick <- 1
if (known_dims[x] %in% names(depending_file_dims)) {
vector_to_pick <- selector_indices[which(known_dims == depending_file_dims[[known_dims[x]]])]
}
dat_selectors[known_dims][[x]][[vector_to_pick]][selector_indices[x]]
})
names(selectors) <- known_dims
if (largest_dims_length) {
selectors_total_list[[i]][[j]] <- selectors
names(selectors_total_list[[i]][[j]]) <- known_dims
}
# 'replace_values' and 'file_path' are only used in this while loop
replace_values[known_dims] <- selectors
if (!dataset_has_files[i]) {
if (any(is.na(selectors))) {
replace_values <- replace_values[-which(names(replace_values) %in% names(selectors[which(is.na(selectors))]))]
}
file_path <- .ReplaceVariablesInString(dat[[i]][['path']], replace_values, TRUE)
sub_array_of_files_to_load[j] <- file_path
#sub_array_of_not_found_files[j] <- TRUE???
} else {
if (any(is.na(selectors))) {
replace_values <- replace_values[-which(names(replace_values) %in% names(selectors[which(is.na(selectors))]))]
file_path <- .ReplaceVariablesInString(dat[[i]][['path']], replace_values, TRUE)
sub_array_of_files_to_load[j] <- file_path
sub_array_of_not_found_files[j] <- TRUE
} else {
file_path <- .ReplaceVariablesInString(dat[[i]][['path']], replace_values)
#NOTE: After replacing tags, there is still * if path_glob_permissive is not FALSE.
# Find the possible value to substitute *.
if (grepl('\\*', file_path)) {
found_files <- Sys.glob(file_path)
file_path <- found_files[1] # choose only the first file.
                  #NOTE: The line above keeps only the first found file. '*' is not a tag, so it
                  #      is not a dimension and cannot hold more than one item. If '*' is used to
                  #      define the path, it should represent a single possibility.
if (length(found_files) > 1) {
.warning("Using glob expression * to define the path, but more ",
"than one match is found. Choose the first match only.")
}
}
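                # e.g. (hypothetical): with path_glob_permissive = TRUE and a pattern like
                # '/data/exp-*/$var$_$sdate$.nc', the '*' left after replacing the $...$ tags is
                # resolved here by keeping only the first file returned by Sys.glob().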
if (!(length(grep("^http", file_path)) > 0)) {
                    # Check whether the path still contains a glob '*' to resolve
                    if (grepl('*', file_path, fixed = TRUE)) {
file_path_full <- Sys.glob(file_path)[1]
if (nchar(file_path_full) > 0) {
file_path <- file_path_full
}
}
}
sub_array_of_files_to_load[j] <- file_path
if (is.null(indices_of_first_files_with_data[[i]])) {
if (!(length(grep("^http", file_path)) > 0)) {
if (!file.exists(file_path)) {
file_path <- NULL
}
}
if (!is.null(file_path)) {
test_file <- NULL
## TODO: suppress error messages
test_file <- file_opener(file_path)
if (!is.null(test_file)) {
selector_indices[which(known_dims == found_pattern_dim)] <- i
indices_of_first_files_with_data[[i]] <- selector_indices
selectors_of_first_files_with_data[[i]] <- selectors
file_closer(test_file)
}
}
}
}
}
j <- j + 1
}
# Extend array as needed progressively
if (is.null(array_of_files_to_load)) {
array_of_files_to_load <- sub_array_of_files_to_load
array_of_not_found_files <- sub_array_of_not_found_files
} else {
array_of_files_to_load <- .MergeArrays(array_of_files_to_load, sub_array_of_files_to_load,
along = found_pattern_dim)
## TODO: file_dims, and variables like that.. are still ok now? I don't think so
array_of_not_found_files <- .MergeArrays(array_of_not_found_files, sub_array_of_not_found_files,
along = found_pattern_dim)
}
}
if (all(sapply(indices_of_first_files_with_data, is.null))) {
stop("No data files found for any of the specified datasets.")
}
########################### READING INNER DIMS. #############################
#print("J")
## TODO: To be run in parallel (local multi-core)
# Now time to work out the inner file dimensions.
# First pick the requested variables.
#//// This part is moved below the new code////
  # NOTE: 'dims_to_iterate' only considers common_return_vars. Moving it below saves some work
  #       and uses the revised common_return_vars if it is changed.
# dims_to_iterate <- NULL
# for (return_var in names(return_vars)) {
# dims_to_iterate <- unique(c(dims_to_iterate, return_vars[[return_var]]))
# }
# if (found_pattern_dim %in% dims_to_iterate) {
# dims_to_iterate <- dims_to_iterate[-which(dims_to_iterate == found_pattern_dim)]
# }
#//////////////////////////////////////////////
# Separate 'return_vars' into 'common_return_vars' and 'return_vars' (those = 'dat').
common_return_vars <- NULL
common_first_found_file <- NULL
common_return_vars_pos <- NULL
if (length(return_vars) > 0) {
common_return_vars_pos <- which(sapply(return_vars, function(x) !(found_pattern_dim %in% x)))
}
if (length(common_return_vars_pos) > 0) {
common_return_vars <- return_vars[common_return_vars_pos]
return_vars <- return_vars[-common_return_vars_pos]
common_first_found_file <- rep(FALSE, length(which(sapply(common_return_vars, length) == 0)))
names(common_first_found_file) <- names(common_return_vars[which(sapply(common_return_vars, length) == 0)])
}
#!!!!!!!Check here. return_vars has removed the common ones, and here remove 'dat' value????
#It seems like it does some benefits to later parts
return_vars <- lapply(return_vars,
function(x) {
if (found_pattern_dim %in% x) {
x[-which(x == found_pattern_dim)]
} else {
x
}
})
#////////////////////////////////////////////
# Force return_vars = (time = NULL) to (time = 'sdate') if (1) selector = [sdate = 2, time = 4] or
# (2) time_across = 'sdate'.
# NOTE: Not sure if the loop over dat is needed here. In theory, all the dat
# should have the same dimensions (?) so expected_inner_dims and
# found_file_dims are the same. The selector_array may possible be
# different, but then the attribute will be correct? If it's different,
# it should depend on 'dat' (but here we only consider common_return_vars)
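  # e.g.: if 'time' is requested as an array with dimensions [sdate = 2, time = 4] and
  # return_vars = list(time = NULL), the entry is corrected below to list(time = 'sdate'),
  # so the time values are collected separately for each start date file.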
for (i in 1:length(dat)) {
for (inner_dim in expected_inner_dims[[i]]) {
# The selectors for the inner dimension are taken.
selector_array <- dat[[i]][['selectors']][[inner_dim]]
file_dim_as_selector_array_dim <- 1
      if (any(found_file_dims[[i]] %in% names(dim(selector_array)))) {
        file_dim_as_selector_array_dim <-
          found_file_dims[[i]][which(found_file_dims[[i]] %in% names(dim(selector_array)))]
      }
      if (inner_dim %in% inner_dims_across_files |
          is.character(file_dim_as_selector_array_dim)) { #(2) or (1)
        # inner_dim is not in return_vars or is NULL
        need_correct <- FALSE
        if (((!inner_dim %in% names(common_return_vars)) &
             (!inner_dim %in% names(return_vars))) |
            (inner_dim %in% names(common_return_vars) &
             is.null(common_return_vars[[inner_dim]]))) {
          need_correct <- TRUE
        } else if (inner_dim %in% names(common_return_vars) &
                   (inner_dim %in% inner_dims_across_files) &
                   !is.null(names(inner_dims_across_files))) { #(2)
          if (!names(inner_dims_across_files) %in% common_return_vars[[inner_dim]]) need_correct <- TRUE
        } else if (inner_dim %in% names(common_return_vars) &
                   is.character(file_dim_as_selector_array_dim)) { #(1)
          if (!all(file_dim_as_selector_array_dim %in% common_return_vars[[inner_dim]])) {
            file_dim_as_selector_array_dim <-
              file_dim_as_selector_array_dim[which(!file_dim_as_selector_array_dim %in% common_return_vars[[inner_dim]])]
            need_correct <- TRUE
          }
        }
        if (need_correct) {
          common_return_vars[[inner_dim]] <-
            c(common_return_vars[[inner_dim]],
              correct_return_vars(inner_dim, inner_dims_across_files,
                                  found_pattern_dim, file_dim_as_selector_array_dim))
        }
      }
    }
  }
#////////////////////////////////////////////
  # This part was above, where return_vars is separated into return_vars and common_return_vars.
  # NOTE: 'dims_to_iterate' only considers common_return_vars. Moving it here saves some work
  #       and uses the revised common_return_vars if it was changed in the part right above.
dims_to_iterate <- NULL
for (common_return_var in names(common_return_vars)) {
dims_to_iterate <- unique(c(dims_to_iterate, common_return_vars[[common_return_var]]))
}
#////////////////////////////////////////////
  # Change the structure of 'dat'. If a selector is not a list, or is a list of 2 that represents
  # a range, wrap it in a list. The dimensions that go across files will later be extended to have
  # lists of lists/vectors of selectors.
for (i in 1:length(dat)) {
if (dataset_has_files[i]) {
for (inner_dim in expected_inner_dims[[i]]) {
if (!is.list(dat[[i]][['selectors']][[inner_dim]]) || # not list, or
(is.list(dat[[i]][['selectors']][[inner_dim]]) && # list of 2 that represents range
length(dat[[i]][['selectors']][[inner_dim]]) == 2 &&
is.null(names(dat[[i]][['selectors']][[inner_dim]])))) {
dat[[i]][['selectors']][[inner_dim]] <- list(dat[[i]][['selectors']][[inner_dim]])
}
}
}
}
# Use 'common_return_vars' and 'return_vars' to generate the initial picked(_common)_vars,
# picked(_common)_vars_ordered, and picked(_common)_vars_unorder_indices.
## Create 'picked_common_vars'
if (length(common_return_vars) > 0) {
picked_common_vars <- vector('list', length = length(common_return_vars))
names(picked_common_vars) <- names(common_return_vars)
} else {
picked_common_vars <- NULL
}
picked_common_vars_ordered <- picked_common_vars
picked_common_vars_unorder_indices <- picked_common_vars
## Create 'picked_vars'
picked_vars <- vector('list', length = length(dat))
names(picked_vars) <- dat_names
if (length(return_vars) > 0) {
picked_vars <- lapply(picked_vars, function(x) {
x <- vector('list', length = length(return_vars))} )
picked_vars <- lapply(picked_vars, setNames, names(return_vars))
}
for (i in 1:length(dat)) {
if (dataset_has_files[i]) {
indices_of_first_file <- as.list(indices_of_first_files_with_data[[i]])
array_file_dims <- sapply(dat[[i]][['selectors']][found_file_dims[[i]]], function(x) length(x[[1]]))
names(array_file_dims) <- found_file_dims[[i]]
if (length(dims_to_iterate) > 0) {
indices_of_first_file[dims_to_iterate] <- lapply(array_file_dims[dims_to_iterate], function(x) 1:x)
}
array_of_var_files <- do.call('[', c(list(x = array_of_files_to_load), indices_of_first_file, list(drop = FALSE)))
array_of_var_indices <- array(1:length(array_of_var_files), dim = dim(array_of_var_files))
array_of_not_found_var_files <- do.call('[', c(list(x = array_of_not_found_files), indices_of_first_file, list(drop = FALSE)))
# Create previous_indices. The initial value is -1 because there is no 'previous' before the
# 1st current_indices.
previous_indices <- rep(-1, length(indices_of_first_file))
names(previous_indices) <- names(indices_of_first_file)
      # Create first_found_file for defining vars_to_read. It is for the entries in return_vars
      # whose value is NULL or character(0). Because these dims only need to be read once,
      # first_found_file indicates whether they have been read or not.
      # Once read, the entry turns to TRUE and is not included in vars_to_read again in the next
      # 'for j loop'.
first_found_file <- NULL
if (length(return_vars) > 0) {
first_found_file <- rep(FALSE, length(which(sapply(return_vars, length) == 0)))
names(first_found_file) <- names(return_vars[which(sapply(return_vars, length) == 0)])
}
for (j in 1:length(array_of_var_files)) {
current_indices <- which(array_of_var_indices == j, arr.ind = TRUE)[1, ]
names(current_indices) <- names(indices_of_first_file)
if (!is.na(array_of_var_files[j]) && !array_of_not_found_var_files[j]) {
changed_dims <- which(current_indices != previous_indices)
# Prepare vars_to_read for this dataset (i loop) and this file (j loop)
vars_to_read <- generate_vars_to_read(return_vars, changed_dims, first_found_file,
common_return_vars, common_first_found_file, i)
file_object <- file_opener(array_of_var_files[j])
if (!is.null(file_object)) {
for (var_to_read in vars_to_read) {
if (var_to_read %in% unlist(var_params)) {
associated_dim_name <- names(var_params)[which(unlist(var_params) == var_to_read)]
}
var_name_to_reader <- var_to_read
names(var_name_to_reader) <- 'var'
var_dims <- file_dim_reader(NULL, file_object, var_name_to_reader, NULL,
synonims)
# file_dim_reader returns dimension names as found in the file.
                  # Need to translate according to synonims:
names(var_dims) <- replace_with_synonmins(var_dims, synonims)
## (1) common_return_vars
if (var_to_read %in% names(common_return_vars)) {
var_to_check <- common_return_vars[[var_to_read]]
list_picked_var_of_read <- generate_picked_var_of_read(
var_to_read, var_to_check, array_of_files_to_load, var_dims,
array_of_var_files = array_of_var_files[j], file_var_reader,
file_object, synonims, associated_dim_name, dim_reorder_params,
aiat, current_indices, var_params,
either_picked_vars = picked_common_vars[[var_to_read]],
either_picked_vars_ordered = picked_common_vars_ordered[[var_to_read]],
either_picked_vars_unorder_indices = picked_common_vars_unorder_indices[[var_to_read]]
)
picked_common_vars[[var_to_read]] <- list_picked_var_of_read$either_picked_vars
picked_common_vars_ordered[[var_to_read]] <-
list_picked_var_of_read$either_picked_vars_ordered
picked_common_vars_unorder_indices[[var_to_read]] <-
list_picked_var_of_read$either_picked_vars_unorder_indices
                  } else {
                    ## (2) return_vars
                    var_to_check <- return_vars[[var_to_read]]
list_picked_var_of_read <- generate_picked_var_of_read(
var_to_read, var_to_check, array_of_files_to_load, var_dims,
array_of_var_files = array_of_var_files[j], file_var_reader,
file_object, synonims, associated_dim_name, dim_reorder_params,
aiat, current_indices, var_params,
either_picked_vars = picked_vars[[i]][[var_to_read]],
either_picked_vars_ordered = picked_vars_ordered[[i]][[var_to_read]],
either_picked_vars_unorder_indices = picked_vars_unorder_indices[[i]][[var_to_read]]
)
picked_vars[[i]][[var_to_read]] <- list_picked_var_of_read$either_picked_vars
picked_vars_ordered[[i]][[var_to_read]] <-
list_picked_var_of_read$either_picked_vars_ordered
picked_vars_unorder_indices[[i]][[var_to_read]] <-
list_picked_var_of_read$either_picked_vars_unorder_indices
}
if (var_to_read %in% names(first_found_file)) {
first_found_file[var_to_read] <- TRUE
}
if (var_to_read %in% names(common_first_found_file)) {
common_first_found_file[var_to_read] <- TRUE
}
} else {
stop("Could not find variable '", var_to_read,
"' in the file ", array_of_var_files[j])
}
}
file_closer(file_object)
}
}
previous_indices <- current_indices
}
}
}
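  # For illustration: after this loop, 'picked_common_vars' may hold, e.g., the 'time' values
  # as an array with one slice per file of the associated file dimension(s) and coordinate
  # variables read once as vectors, while 'picked_vars[[i]]' holds the variables whose
  # 'return_vars' entry includes the pattern dimension, i.e. one copy per dataset.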
# Once we have the variable values, we can work out the indices
# for the implicitly defined selectors.
beta <- transform_extra_cells
dims_to_crop <- vector('list')
transformed_vars <- vector('list', length = length(dat))
names(transformed_vars) <- dat_names
transformed_vars_ordered <- transformed_vars
transformed_vars_unorder_indices <- transformed_vars
transformed_common_vars <- NULL
transformed_common_vars_ordered <- NULL
transformed_common_vars_unorder_indices <- NULL
# store warning messages from transform
warnings1 <- NULL
warnings2 <- NULL
for (i in 1:length(dat)) {
if (dataset_has_files[i]) {
indices <- indices_of_first_files_with_data[[i]]
if (!is.null(indices)) {
#//////////////////////////////////////////////////
# Find data_dims
## If largest_dims_length is a number & !merge_across_dims,
## directly assign this dim as the number;
## If largest_dims_length is a number & this dim is across files, find the dim length of each file
find_largest_dims_length_by_files <- FALSE
if (is.numeric(largest_dims_length)) {
        if (any(names(largest_dims_length) %in% inner_dims_across_files)) {
find_largest_dims_length_by_files <- TRUE
}
} else if (largest_dims_length) {
find_largest_dims_length_by_files <- TRUE
}
if (!find_largest_dims_length_by_files) { # old code
file_path <- do.call("[", c(list(array_of_files_to_load), as.list(indices_of_first_files_with_data[[i]])))
# The following 5 lines should go several lines below, but were moved
# here for better performance.
# If any of the dimensions comes without defining variable, then we read
# the data dimensions.
data_dims <- NULL
# if (length(unlist(var_params[expected_inner_dims[[i]]])) < length(expected_inner_dims[[i]])) {
file_to_open <- file_path
data_dims <- file_dim_reader(file_to_open, NULL, selectors_of_first_files_with_data[[i]],
lapply(dat[[i]][['selectors']][expected_inner_dims[[i]]], '[[', 1),
synonims)
# file_dim_reader returns dimension names as found in the file.
            # Need to translate according to synonims:
names(data_dims) <- replace_with_synonmins(data_dims, synonims)
# }
if (is.numeric(largest_dims_length)) { # largest_dims_length is a named vector
# Check if the names fit the inner dimension names
if (!all(names(largest_dims_length) %in% names(data_dims))) {
#NOTE: stop or warning?
stop("Parameter 'largest_dims_length' has inconsistent names with inner dimensions.")
} else {
match_ind <- match(names(largest_dims_length), names(data_dims))
data_dims[match_ind] <- largest_dims_length
}
            }
          } else {
            ## largest_dims_length = TRUE, or is a number & merge_across_dims is across this dim
tmp <- find_largest_dims_length(
selectors_total_list[[i]], array_of_files_to_load,
selector_indices_save[[i]], dat[[i]], expected_inner_dims[[i]],
synonims, file_dim_reader)
data_dims <- tmp$largest_data_dims
# 'data_dims_each_file' is used when merge_across_dims = TRUE &
# the files have different length of inner dim.
data_dims_each_file <- tmp$data_dims_all_files
# file_dim_reader returns dimension names as found in the file.
            # Need to translate according to synonims:
names(data_dims) <- replace_with_synonmins(data_dims, synonims)
} # end if (largest_dims_length == TRUE)
#//////////////////////////////////////////////////
# Some dimension is defined in Start() call but doesn't exist in data
if (!all(expected_inner_dims[[i]] %in% names(data_dims))) {
tmp <- expected_inner_dims[[i]][which(!expected_inner_dims[[i]] %in% names(data_dims))]
stop("Could not find the dimension '", tmp, "' in the file. Either ",
"change the dimension name in your request, adjust the ",
"parameter 'dim_names_in_files' or fix the dimension name in ",
"the file.\n", file_path)
}
# Not all the inner dims are defined in Start() call
if (!all(names(data_dims) %in% expected_inner_dims[[i]])) {
tmp <- names(data_dims)[which(!names(data_dims) %in% expected_inner_dims[[i]])]
if (data_dims[tmp] != 1) {
stop("The dimension '", tmp, "' is found in the file ", file_path,
" but not defined in the Start call.")
}
}
#///////////////////////////////////////////////////////////////////
# Transform the variables if needed and keep them apart.
if (!is.null(transform) && (length(transform_vars) > 0)) {
if (!all(transform_vars %in% c(names(picked_vars[[i]]), names(picked_common_vars)))) {
stop("Could not find all the required variables in 'transform_vars' ",
"for the dataset '", dat[[i]][['name']], "'.")
}
            # Accumulate the variables to transform, starting from an empty list
            vars_to_transform <- NULL
            # picked_vars[[i]]
vars_to_transform <- generate_vars_to_transform(vars_to_transform, picked_vars[[i]], transform_vars, picked_vars_ordered[[i]])
# picked_common_vars
vars_to_transform <- generate_vars_to_transform(vars_to_transform, picked_common_vars, transform_vars, picked_common_vars_ordered)
# Save the crop domain from selectors of transformed vars
# PROB: It doesn't consider aiat. If aiat, the indices are for
# after transformed data; we don't know the corresponding
# values yet.
transform_crop_domain <- vector('list')
for (transform_var in transform_vars) {
transform_crop_domain[[transform_var]] <- dat[[i]][['selectors']][[transform_var]][[1]]
# Turn indices into values
if (attr(transform_crop_domain[[transform_var]], 'indices')) {
if (transform_var %in% names(common_return_vars)) {
if (transform_var %in% names(dim_reorder_params)) {
transform_crop_domain[[transform_var]] <-
generate_transform_crop_domain_values(
transform_crop_domain[[transform_var]],
picked_vars = picked_common_vars_ordered[[transform_var]])
} else {
transform_crop_domain[[transform_var]] <-
generate_transform_crop_domain_values(
transform_crop_domain[[transform_var]],
picked_vars = picked_common_vars[[transform_var]])
}
} else { # return_vars
if (transform_var %in% names(dim_reorder_params)) {
transform_crop_domain[[transform_var]] <-
generate_transform_crop_domain_values(
transform_crop_domain[[transform_var]],
picked_vars = picked_vars_ordered[[i]][[transform_var]])
} else {
transform_crop_domain[[transform_var]] <-
generate_transform_crop_domain_values(
transform_crop_domain[[transform_var]],
picked_vars = picked_vars[[i]][[transform_var]])
}
}
} else if (is.atomic(transform_crop_domain[[transform_var]])) {
# if it is values but vector
transform_crop_domain[[transform_var]] <-
c(transform_crop_domain[[transform_var]][1],
tail(transform_crop_domain[[transform_var]], 1))
}
# For CDORemapper (not sure if it's also suitable for other transform functions):
              # If lon_reorder is not used and the lon selector goes from big to small,
              # lonmax and lonmin need to be exchanged. The ideal place to exchange them
              # would be CDORemapper(), but whether lon_reorder is used is not known
              # by CDORemapper().
# NOTE: lat's order doesn't matter, big to small and small to big
# both work. Since we shouldn't assume transform_var in Start(),
# e.g., transform_var can be anything transformable in the assigned transform function,
# we exchange whichever parameter here anyway.
if (!transform_var %in% names(dim_reorder_params) &
diff(unlist(transform_crop_domain[[transform_var]])) < 0) {
transform_crop_domain[[transform_var]] <- rev(transform_crop_domain[[transform_var]])
}
}
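              # e.g.: a longitude crop domain collected as c(350, 10) (decreasing, no
              # lon_reorder) has a negative difference, so it is reversed to c(10, 350)
              # before being passed to the transform as 'crop_domain'.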
          tmp <- .withWarnings(
            do.call(transform, c(list(data_array = NULL,
variables = vars_to_transform,
file_selectors = selectors_of_first_files_with_data[[i]],
crop_domain = transform_crop_domain),
transform_params))
)
transformed_data <- tmp$value
warnings1 <- c(warnings1, tmp$warnings)
# Discard the common transformed variables if already transformed before
if (!is.null(transformed_common_vars)) {
common_ones <- which(names(picked_common_vars) %in% names(transformed_data$variables))
if (length(common_ones) > 0) {
transformed_data$variables <- transformed_data$variables[-common_ones]
}
}
transformed_vars[[i]] <- list()
transformed_vars_ordered[[i]] <- list()
transformed_vars_unorder_indices[[i]] <- list()
# Order the transformed variables if needed
# 'var_to_read' should be 'transformed_var', but is kept to reuse the same code as above.
for (var_to_read in names(transformed_data$variables)) {
if (var_to_read %in% unlist(var_params)) {
associated_dim_name <- names(var_params)[which(unlist(var_params) == var_to_read)]
if ((associated_dim_name %in% names(dim_reorder_params))) {
## Is this check really needed?
if (length(dim(transformed_data$variables[[associated_dim_name]])) > 1) {
stop("Requested a '", associated_dim_name, "_reorder' for a dimension ",
"whose coordinate variable that has more than 1 dimension (after ",
"transform). This is not supported.")
}
ordered_var_values <- dim_reorder_params[[associated_dim_name]](transformed_data$variables[[associated_dim_name]])
attr(ordered_var_values, 'variables') <- attr(transformed_data$variables[[associated_dim_name]], 'variables')
if (!all(c('x', 'ix') %in% names(ordered_var_values))) {
stop("All the dimension reorder functions must return a list with the components 'x' and 'ix'.")
}
# Save the indices to reorder back the ordered variable values.
# This will be used to define the first round indices (if aiat) or second round
# indices (if !aiat).
unorder <- sort(ordered_var_values$ix, index.return = TRUE)$ix
if (var_to_read %in% names(picked_common_vars)) {
transformed_common_vars_ordered[[var_to_read]] <- ordered_var_values$x
transformed_common_vars_unorder_indices[[var_to_read]] <- unorder
} else {
transformed_vars_ordered[[i]][[var_to_read]] <- ordered_var_values$x
transformed_vars_unorder_indices[[i]][[var_to_read]] <- unorder
}
}
}
}
transformed_picked_vars_names <- which(names(picked_vars[[i]]) %in% names(transformed_data$variables))
if (length(transformed_picked_vars_names) > 0) {
transformed_picked_vars_names <- names(picked_vars[[i]])[transformed_picked_vars_names]
transformed_vars[[i]][transformed_picked_vars_names] <- transformed_data$variables[transformed_picked_vars_names]
transformed_picked_common_vars_names <- which(names(picked_common_vars) %in% names(transformed_data$variables))
if (length(transformed_picked_common_vars_names) > 0) {
transformed_picked_common_vars_names <- names(picked_common_vars)[transformed_picked_common_vars_names]
transformed_common_vars <- transformed_data$variables[transformed_picked_common_vars_names]
}
}
}
# Once the variables are transformed, we compute the indices to be
# taken for each inner dimension.
# In all cases, indices will have to be computed to know which data
# values to take from the original data for each dimension (if a
# variable is specified for that dimension, it will be used to
# convert the provided selectors into indices). These indices are
# referred to as 'first round of indices'.
# The taken data will then be transformed if needed, together with
# the dimension variable if specified, and, in that case, indices
# will have to be computed again to know which values to take from the
# transformed data. These are the 'second round of indices'. In the
# case there is no transformation, the second round of indices will
# be all the available indices, i.e. from 1 to the number of taken
# values with the first round of indices.
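      # For example, when latitude values are requested and a 'transform' is used: the first
      # round of indices selects which latitudes to read from each file (extended by the
      # 'transform_extra_cells' margin), and the second round selects, among the transformed
      # latitude values, the ones matching the request. Without a transform, the second round
      # is simply 1:length(first-round indices).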
for (inner_dim in expected_inner_dims[[i]]) {
if (debug) {
print("-> DEFINING INDICES FOR INNER DIMENSION:")
print(inner_dim)
}
        crossed_file_dim <- NULL
        if (inner_dim %in% unlist(inner_dims_across_files)) {
          crossed_file_dim <- names(inner_dims_across_files)[which(sapply(inner_dims_across_files, function(x) inner_dim %in% x))[1]]
chunk_amount <- length(dat[[i]][['selectors']][[crossed_file_dim]][[1]])
names(chunk_amount) <- crossed_file_dim
} else if (!is.null(names(dim(dat[[i]][['selectors']][[inner_dim]][[1]]))) &
inner_dim %in% names(dim(dat[[i]][['selectors']][[inner_dim]][[1]])) &
any(found_file_dims[[i]] %in% names(dim(dat[[i]][['selectors']][[inner_dim]][[1]])))) {
# inner dim is dependent on file dim in the form of selector array (e.g., time = [sdate = 2, time = 4])
crossed_file_dim <- found_file_dims[[i]][which(found_file_dims[[i]] %in%
names(dim(dat[[i]][['selectors']][[inner_dim]][[1]])))]
if (length(crossed_file_dim) == 1) {
chunk_amount <- length(dat[[i]][['selectors']][[crossed_file_dim]][[1]])
names(chunk_amount) <- crossed_file_dim
} else {
# e.g., region = [memb = 2, sdate = 3, region = 1]
chunk_amount <- prod(
sapply(lapply(
dat[[i]][['selectors']][crossed_file_dim], "[[", 1), length))
names(chunk_amount) <- paste(crossed_file_dim, collapse = '.')
}
} else {
chunk_amount <- 1
}
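        # e.g.: if 'time' goes across the file dimension 'sdate' and 3 start dates are
        # requested, chunk_amount is c(sdate = 3) and the time indices are later distributed
        # among those 3 files; for an inner dimension fully contained in one file,
        # chunk_amount stays 1.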
# In the special case that the selectors for a dimension are 'all', 'first', ...
# and chunking (dividing in more than 1 chunk) is requested, the selectors are
        # replaced by equivalent indices.
if ((any(dat[[i]][['selectors']][[inner_dim]][[1]] %in% c('all', 'first', 'last'))) &&
(chunks[[inner_dim]]['n_chunks'] != 1)) {
          dat[[i]][['selectors']][[inner_dim]][[1]] <-
            replace_character_with_indices(selectors = dat[[i]][['selectors']][[inner_dim]][[1]], data_dims = data_dims[[inner_dim]], chunk_amount)
# The selectors for the inner dimension are taken.
selector_array <- dat[[i]][['selectors']][[inner_dim]][[1]]
if (debug) {
if (inner_dim %in% dims_to_check) {
print(paste0("-> DEBUG MESSAGES FOR THE DATASET", i, " AND INNER DIMENSION '", inner_dim, "':"))
print("-> STRUCTURE OF SELECTOR ARRAY:")
print(str(selector_array))
print("-> PICKED VARS:")
print(picked_vars)
print("-> TRANSFORMED VARS:")
print(transformed_vars)
}
}
if (is.null(dim(selector_array))) {
dim(selector_array) <- length(selector_array)
}
if (is.null(names(dim(selector_array)))) {
if (length(dim(selector_array)) == 1) {
names(dim(selector_array)) <- inner_dim
} else {
stop("Provided selector arrays must be provided with dimension ",
"names. Found an array of selectors without dimension names ",
"for the dimension '", inner_dim, "'.")
}
}
selectors_are_indices <- FALSE
if (!is.null(attr(selector_array, 'indices'))) {
if (!is.logical(attr(selector_array, 'indices'))) {
stop("The atribute 'indices' for the selectors for the dimension '",
inner_dim, "' must be TRUE or FALSE.")
}
selectors_are_indices <- attr(selector_array, 'indices')
}
taken_chunks <- rep(FALSE, chunk_amount)
selector_file_dims <- 1
#NOTE: Change 'selector_file_dims' (from 1) if selector is an array with a file_dim dimname.
# I.e., If time = [sdate = 2, time = 4], selector_file_dims <- c(sdate = 2)
if (any(found_file_dims[[i]] %in% names(dim(selector_array)))) {
selector_file_dims <- dim(selector_array)[which(names(dim(selector_array)) %in% found_file_dims[[i]])]
}
selector_inner_dims <- dim(selector_array)[which(!(names(dim(selector_array)) %in% found_file_dims[[i]]))]
var_with_selectors <- NULL
var_with_selectors_name <- var_params[[inner_dim]]
var_ordered <- NULL
var_unorder_indices <- NULL
with_transform <- FALSE
# If the selectors come with an associated variable
if (!is.null(var_with_selectors_name)) {
if ((var_with_selectors_name %in% transform_vars) && (!is.null(transform))) {
with_transform <- TRUE
stop("Requested a transformation over the dimension '",
inner_dim, "', wich goes across files. This feature ",
"is not supported. Either do the request without the ",
"transformation or request it over dimensions that do ",
"not go across files.")
}
}
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> NAME OF THE VARIABLE WITH THE SELECTOR VALUES FOR THE CURRENT INNER DIMENSION:")
print(var_with_selectors_name)
print("-> NAMES OF THE VARIABLES TO BE TRANSFORMED:")
print(transform_vars)
print("-> STRUCTURE OF THE TRANSFORMATION FUNCTION:")
print(str(transform))
}
}
# For fri
if (var_with_selectors_name %in% names(picked_vars[[i]])) {
var_with_selectors <- picked_vars[[i]][[var_with_selectors_name]]
var_ordered <- picked_vars_ordered[[i]][[var_with_selectors_name]]
var_unorder_indices <- picked_vars_unorder_indices[[i]][[var_with_selectors_name]]
} else if (var_with_selectors_name %in% names(picked_common_vars)) {
var_with_selectors <- picked_common_vars[[var_with_selectors_name]]
var_ordered <- picked_common_vars_ordered[[var_with_selectors_name]]
var_unorder_indices <- picked_common_vars_unorder_indices[[var_with_selectors_name]]
}
n <- prod(dim(var_with_selectors))
# if no _reorder, var_unorder_indices is NULL
if (is.null(var_unorder_indices)) {
var_unorder_indices <- 1:n
}
# For sri
          if (with_transform) {
            ## var in 'dat'
            if (var_with_selectors_name %in% names(transformed_vars[[i]])) {
m <- prod(dim(transformed_vars[[i]][[var_with_selectors_name]]))
if (aiat) {
var_with_selectors <- transformed_vars[[i]][[var_with_selectors_name]]
var_ordered <- transformed_vars_ordered[[i]][[var_with_selectors_name]]
var_unorder_indices <- transformed_vars_unorder_indices[[i]][[var_with_selectors_name]]
}
# For making sri ordered later
transformed_var_unordered_indices <- transformed_vars_unorder_indices[[i]][[var_with_selectors_name]]
if (is.null(transformed_var_unordered_indices)) {
transformed_var_unordered_indices <- 1:m
}
transformed_var_with_selectors <- transformed_vars[[i]][transformed_picked_vars_names][[var_with_selectors_name]][transformed_var_unordered_indices]
# Sorting the transformed variable and working out the indices again after transform.
if (!is.null(dim_reorder_params[[var_with_selectors_name]])) {
transformed_var_with_selectors_reorder <- dim_reorder_params[[var_with_selectors_name]](transformed_var_with_selectors)
transformed_var_with_selectors <- transformed_var_with_selectors_reorder$x
transformed_var_with_selectors_unorder <- transformed_var_with_selectors_reorder$ix
} else {
transformed_var_with_selectors_unorder <- 1:length(transformed_var_with_selectors)
}
## var in common
} else if (var_with_selectors_name %in% names(transformed_common_vars)) {
m <- prod(dim(transformed_common_vars[[var_with_selectors_name]]))
if (aiat) {
var_with_selectors <- transformed_common_vars[[var_with_selectors_name]]
var_ordered <- transformed_common_vars_ordered[[var_with_selectors_name]]
var_unorder_indices <- transformed_common_vars_unorder_indices[[var_with_selectors_name]]
}
# For making sri ordered later
transformed_var_unordered_indices <- transformed_common_vars_unorder_indices[[var_with_selectors_name]]
if (is.null(transformed_var_unordered_indices)) {
transformed_var_unordered_indices <- 1:m
}
transformed_var_with_selectors <- transformed_common_vars[[var_with_selectors_name]][transformed_var_unordered_indices]
# Sorting the transformed variable and working out the indices again after transform.
if (!is.null(dim_reorder_params[[var_with_selectors_name]])) {
transformed_var_with_selectors_reorder <- dim_reorder_params[[var_with_selectors_name]](transformed_var_with_selectors)
transformed_var_with_selectors <- transformed_var_with_selectors_reorder$x
transformed_var_with_selectors_unorder <- transformed_var_with_selectors_reorder$ix
} else {
transformed_var_with_selectors_unorder <- 1:length(transformed_var_with_selectors)
}
}
if (is.null(var_unorder_indices)) {
var_unorder_indices <- 1:m
}
}
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> SIZE OF ORIGINAL VARIABLE:")
print(n)
print("-> SIZE OF TRANSFORMED VARIABLE:")
if (with_transform) print(m)
print("-> STRUCTURE OF ORDERED VAR:")
print(str(var_ordered))
print("-> UNORDER INDICES:")
print(var_unorder_indices)
}
}
# If this inner dim's selector (var_with_selectors) is an array
# that has file dim as dimension (e.g., across or depend relationship)
if (any(names(var_dims) %in% found_file_dims[[i]])) {
if (with_transform) {
stop("Requested transformation for inner dimension '",
inner_dim, "' but provided selectors for such dimension ",
"over one or more file dimensions. This is not ",
"supported. Either request no transformation for the ",
"dimension '", inner_dim, "' or specify the ",
"selectors for this dimension without the file dimensions.")
}
var_file_dims <- var_dims[which(names(var_dims) %in% found_file_dims[[i]])]
var_dims <- var_dims[-which(names(var_dims) %in% found_file_dims[[i]])]
}
## # Keep the selectors if they correspond to a variable that will be transformed.
## if (with_transform) {
## if (var_with_selectors_name %in% names(picked_vars[[i]])) {
## transformed_var_with_selectors <- transformed_vars[[i]][[var_with_selectors_name]]
## } else if (var_with_selectors_name %in% names(picked_common_vars)) {
## transformed_var_with_selectors <- transformed_common_vars[[var_with_selectors_name]]
## }
## transformed_var_dims <- dim(transformed_var_with_selectors)
## transformed_var_file_dims <- 1
## if (any(names(transformed_var_dims) %in% found_file_dims[[i]])) {
## transformed_var_file_dims <- transformed_var_dims[which(names(transformed_var_dims) %in% found_file_dims[[i]])]
## transformed_var_dims <- tranasformed_var_dims[-which(names(transformed_var_dims) %in% found_file_dims[[i]])]
## }
##if (inner_dim %in% dims_to_check) {
##print("111m")
##print(str(transformed_var_dims))
##}
##
## m <- prod(transformed_var_dims)
## }
# Work out var file dims and inner dims.
if (inner_dim %in% unlist(inner_dims_across_files)) {
#TODO: if (chunk_amount != number of chunks in selector_file_dims), crash
if (length(var_dims) > 1) {
stop("Specified a '", inner_dim, "_var' for the dimension '",
inner_dim, "', which goes across files (across '", crossed_file_dim,
"'). The specified variable, '", var_with_selectors_name, "', has more ",
"than one dimension and can not be used as selector variable. ",
"Select another variable or fix it in the files.")
}
}
## TODO HERE::
#- indices_of_first_files_with_data may change, because array is now extended
var_full_dims <- dim(var_with_selectors)
} else if (((is.numeric(selector_array) || is.list(selector_array)) && selectors_are_indices) ||
(is.character(selector_array) && (length(selector_array) == 1) &&
(selector_array %in% c('all', 'first', 'last')) &&
!is.null(file_dim_reader))) {
#### TODO HERE::
###- indices_of_first_files_with_data may change, because array is now extended
# Lines moved above for better performance.
##data_dims <- file_dim_reader(file_path, NULL, selectors_of_first_files_with_data[[i]],
## lapply(dat[[i]][['selectors']][expected_inner_dims[[i]]], '[[', 1))
} else {
stop(paste0("Can not translate the provided selectors for '", inner_dim,
"' to numeric indices. Provide numeric indices and a ",
"'file_dim_reader' function, or a '", inner_dim,
"_var' in order to calculate the indices."))
}
# At this point, if no selector variable was provided, the variable
# data_dims has been populated. If a selector variable was provided,
# the variables var_dims, var_file_dims and var_full_dims have been
# populated instead.
#////////////////////////////////////////////////////////////////////
# If the inner dim lengths differ among files,
# need to know each length to create the indices for each file later.
inner_dim_lengths <- NULL
# inner_dim_lengths here includes all the files, but we only want
# the files of fyear for certain "sdate". We will categorize it later.
inner_dim_lengths <- tryCatch({
sapply(data_dims_each_file, '[[', inner_dim)
}, error = function(x) {
sapply(data_dims_each_file, '[[',
synonims[[inner_dim]][which(synonims[[inner_dim]] != inner_dim)])
})
# Use other file dims as the factors to categorize.
other_file_dims <- dim(array_of_files_to_load)[which(!found_file_dims[[i]] %in% crossed_file_dim)]
other_file_dims <- lapply(lapply(other_file_dims, seq, 1), rev)
other_file_dims_factor <- expand.grid(other_file_dims)
selector_indices_save_subset <-
lapply(selector_indices_save[[i]], '[', which(!found_file_dims[[i]] %in% crossed_file_dim))
# Put the fyear with the same other file dims (sdate, etc.) together, and find the largest length (in theory all of them should be the same)
inner_dim_lengths_cat <- vector('list', dim(other_file_dims_factor)[1])
for (i_factor in 1:length(inner_dim_lengths_cat)) {
inner_dim_lengths_cat[[i_factor]] <-
inner_dim_lengths[which(sapply(lapply(
selector_indices_save_subset, '==',
other_file_dims_factor[i_factor, ]), all))]
}
# Find the largest length of each time step
inner_dim_lengths <- do.call(pmax, inner_dim_lengths_cat)
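# pmax() takes the element-wise maximum across the categorized length vectors,
# so each position keeps the longest inner-dim length found among the grouped files.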
}
fri <- first_round_indices <- NULL
sri <- second_round_indices <- NULL
# This variable will keep the indices needed to crop the transformed
# variable (the one that has been transformed without being subset
# with the first round indices).
tvi <- transformed_variable_indices <- NULL
ordered_fri <- NULL
ordered_sri <- NULL
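# In short: 'fri' are the indices to read from the original data in each file
# (first round), while 'sri' are the indices applied on the transformed data
# afterwards (second round).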
if ((length(selector_array) == 1) && is.character(selector_array) &&
(selector_array %in% c('all', 'first', 'last')) &&
(chunks[[inner_dim]]['n_chunks'] == 1)) {
if (is.null(var_with_selectors_name)) {
fri <- vector('list', length = chunk_amount)
dim(fri) <- c(chunk_amount)
sri <- vector('list', length = chunk_amount)
dim(sri) <- c(chunk_amount)
if (selector_array == 'all') {
if (is.null(inner_dim_lengths) | length(unique(inner_dim_lengths)) <= 1) { #old code
fri[] <- replicate(chunk_amount, list(1:(data_dims[inner_dim])))
} else { # files have different inner dim length
for (i_chunk in 1:length(fri)) {
fri[[i_chunk]] <- 1:inner_dim_lengths[i_chunk]
}
}
taken_chunks <- rep(TRUE, chunk_amount)
#sri <- NULL
} else if (selector_array == 'first') {
fri[[1]] <- 1
taken_chunks[1] <- TRUE
#sri <- NULL
} else if (selector_array == 'last') {
fri[[chunk_amount]] <- data_dims[inner_dim]
taken_chunks[length(taken_chunks)] <- TRUE
#sri <- NULL
}
} else {
if (!is.null(crossed_file_dim) & any(!(crossed_file_dim %in% names(var_file_dims)))) {
stop("The variable '", var_with_selectors_name, "' must also be ",
"requested for the file dimension '", crossed_file_dim, "' in ",
"this configuration.")
}
fri <- vector('list', length = prod(var_file_dims))
dim(fri) <- var_file_dims
ordered_fri <- fri
sri <- vector('list', length = prod(var_file_dims))
dim(sri) <- var_file_dims
ordered_sri <- sri
if (selector_array == 'all') {
# TODO: Populate ordered_fri
ordered_fri[] <- replicate(prod(var_file_dims), list(1:n))
fri[] <- replicate(prod(var_file_dims), list(var_unorder_indices[1:n]))
taken_chunks <- rep(TRUE, chunk_amount)
if (!with_transform) {
#fri[] <- replicate(prod(var_file_dims), list(1:n))
#taken_chunks <- rep(TRUE, chunk_amount)
#sri <- NULL
} else {
ordered_sri[] <- replicate(prod(var_file_dims), list(1:m))
if (inner_dim %in% names(dim_reorder_params)) {
sri[] <- replicate(prod(var_file_dims), list(transformed_var_unordered_indices[1:m]))
} else {
sri[] <- replicate(prod(var_file_dims), list(1:m))
}
## var_file_dims instead??
#if (!aiat) {
#fri[] <- replicate(prod(var_file_dims), list(1:n))
#taken_chunks <- rep(TRUE, chunk_amount)
#sri[] <- replicate(prod(transformed_var_file_dims), list(1:m))
#} else {
#fri[] <- replicate(prod(var_file_dims), list(1:n))
#taken_chunks <- rep(TRUE, chunk_amount)
#sri[] <- replicate(prod(transformed_var_file_dims), list(1:m))
#}
tvi <- 1:m
}
} else if (selector_array == 'first') {
taken_chunks[1] <- TRUE
if (!with_transform) {
ordered_fri[[1]] <- 1
fri[[1]] <- var_unorder_indices[1]
#taken_chunks[1] <- TRUE
#sri <- NULL
} else {
if (!aiat) {
ordered_fri[[1]] <- 1
fri[[1]] <- var_unorder_indices[1]
# TODO: TO BE IMPROVED
#taken_chunks[1] <- TRUE
ordered_sri[[1]] <- 1:ceiling(m / n)
sri[[1]] <- 1:ceiling(m / n)
tvi <- 1:ceiling(m / n)
} else {
ordered_fri[[1]] <- 1:ceiling(m / n)
fri[[1]] <- var_unorder_indices[1:ceiling(m / n)]
#taken_chunks[1] <- TRUE
ordered_sri[[1]] <- 1
sri[[1]] <- 1
tvi <- 1
}
}
} else if (selector_array == 'last') {
taken_chunks[length(taken_chunks)] <- TRUE
if (!with_transform) {
ordered_fri[[prod(var_file_dims)]] <- n
fri[[prod(var_file_dims)]] <- var_unorder_indices[n]
#taken_chunks[length(taken_chunks)] <- TRUE
#sri <- NULL
} else {
if (!aiat) {
ordered_fri[[prod(var_file_dims)]] <- prod(var_dims)
fri[[prod(var_file_dims)]] <- var_unorder_indices[prod(var_dims)]
#taken_chunks[length(taken_chunks)] <- TRUE
ordered_sri[[prod(var_file_dims)]] <- 1:ceiling(m / n)
sri[[prod(var_file_dims)]] <- 1:ceiling(m / n)
# TODO: TO BE IMPROVED. THE TVI MAY BE WRONG IF THERE'S BEEN A REORDERING.
tvi <- 1:ceiling(m / n)
} else {
ordered_fri[[prod(var_file_dims)]] <- (n - ceiling(m / n) + 1):n
fri[[prod(var_file_dims)]] <- var_unorder_indices[(n - ceiling(m / n) + 1):n]
#taken_chunks[length(taken_chunks)] <- TRUE
ordered_sri[[prod(var_file_dims)]] <- 1
sri[[prod(var_file_dims)]] <- 1
tvi <- 1
}
}
}
}
# If the selectors are not 'all', 'first', 'last', ...
} else {
if (!is.null(var_with_selectors_name)) {
unmatching_file_dims <- which(!(names(var_file_dims) %in% names(selector_file_dims)))
if ((length(unmatching_file_dims) > 0)) {
raise_error <- FALSE
if (!(length(unmatching_file_dims) == 1 &
names(var_file_dims)[unmatching_file_dims] %in% crossed_file_dim &
inner_dim %in% names(selector_inner_dims))) {
raise_error <- TRUE
}
if (raise_error) {
stop("Provided selectors for the dimension '", inner_dim, "' must have as many ",
"file dimensions as the variable the dimension is defined along, '",
var_with_selectors_name, "', with the exceptions of the file pattern dimension ('",
found_pattern_dim, "') and any depended file dimension (if specified as ",
"depended dimension in parameter 'inner_dims_across_files' and the ",
"depending file dimension is present in the provided selector array).")
}
}
}
if (any(names(selector_file_dims) %in% names(dim(var_with_selectors)))) {
if (any(dim(var_with_selectors)[names(selector_file_dims)] != selector_file_dims)) {
stop("Size of selector file dimensions must match size of the corresponding ",
"variable dimensions.")
}
}
}
## TODO: If var dimensions are not in the same order as selector dimensions, reorder
if (is.null(names(selector_file_dims))) {
if (!is.null(crossed_file_dim)) {
fri_dim_names <- c(fri_dim_names, crossed_file_dim)
}
fri_dim_names <- found_file_dims[[i]][which(found_file_dims[[i]] %in% fri_dim_names)]
fri_dims <- rep(NA, length(fri_dim_names))
names(fri_dims) <- fri_dim_names
fri_dims[names(selector_file_dims)] <- selector_file_dims
#NOTE: Not sure how it works here, but "chunk_amount" is the same as
# "selector_file_dims" above in the cases we've seen so far,
# and it causes problem when crossed_file_dim is more than one.
# if (!is.null(crossed_file_dim)) {
# fri_dims[crossed_file_dim] <- chunk_amount
# }
}
fri <- vector('list', length = prod(fri_dims))
dim(fri) <- fri_dims
sri <- vector('list', length = prod(fri_dims))
dim(sri) <- fri_dims
selector_file_dim_array <- array(1:prod(selector_file_dims), dim = selector_file_dims)
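# 'selector_file_dim_array' just enumerates every combination of the selector's
# file dimensions; each iteration below picks one combination and subsets the
# selector array accordingly.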
selector_store_position <- fri_dims
for (j in 1:prod(dim(selector_file_dim_array))) {
selector_indices_to_take <- which(selector_file_dim_array == j, arr.ind = TRUE)[1, ]
names(selector_indices_to_take) <- names(selector_file_dims)
selector_store_position[names(selector_indices_to_take)] <- selector_indices_to_take
# "selector_indices_to_take" is an array if "selector_file_dims" is not 1 (if
# selector is an array with a file_dim dimname, i.e., time = [sdate = 2, time = 4].
if (!is.null(names(selector_indices_to_take))) {
sub_array_of_selectors <- Subset(selector_array, names(selector_indices_to_take),
as.list(selector_indices_to_take), drop = 'selected')
} else {
sub_array_of_selectors <- selector_array
}
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> ITERATING OVER FILE DIMENSIONS OF THE SELECTORS.")
print("-> STRUCTURE OF A SUB ARRAY:")
print(str(sub_array_of_selectors))
print("-> STRUCTURE OF THE VARIABLE WITH SELECTORS:")
print(str(var_with_selectors))
print(dim(var_with_selectors))
}
}
if (selectors_are_indices) {
sub_array_of_values <- NULL
#} else if (!is.null(var_ordered)) {
# sub_array_of_values <- var_ordered
} else {
var_indices_to_take <- selector_indices_to_take[which(names(selector_indices_to_take) %in% names(var_file_dims))]
if (!is.null(names(var_indices_to_take))) {
sub_array_of_values <- Subset(var_with_selectors, names(var_indices_to_take),
as.list(var_indices_to_take), drop = 'selected')
} else {
# time across some file dim (e.g., "file_date") but doesn't have
# this file dim as dimension (e.g., time: [sdate, time])
sub_array_of_values <- var_with_selectors
}
}
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> STRUCTURE OF THE SUB ARRAY FROM THE VARIABLE CORRESPONDING TO THE SUB ARRAY OF SELECTORS")
print(str(sub_array_of_values))
print(dim(sub_array_of_values))
print("-> NAME OF THE FILE DIMENSION THE CURRENT INNER DIMENSION EXTENDS ALONG:")
# The inner dim selector is an array in which have file dim (e.g., time = [sdate = 2, time = 4],
# or the inner dim doesn't go across any file dim (e.g., no time_across = 'sdate')
if ((!is.null(crossed_file_dim) & (any(crossed_file_dim %in% names(selector_file_dims)))) || is.null(crossed_file_dim)) {
if (length(sub_array_of_selectors) > 0) {
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> THE INNER DIMENSION DOES NOT GO ACROSS ANY FILE DIMENSION OR IT DOES BUT IS IN THE PROVIDED SELECTOR ARRAY.")
}
}
if (selectors_are_indices) {
if (!is.null(var_with_selectors_name)) {
max_allowed <- ifelse(aiat, m, n)
} else {
max_allowed <- data_dims[inner_dim]
}
if (any(na.omit(unlist(sub_array_of_selectors)) > max_allowed) ||
any(na.omit(unlist(sub_array_of_selectors)) < 1)) {
stop("Provided indices out of range for dimension '", inner_dim, "' ",
"for dataset '", dat[[i]][['name']], "' (accepted range: 1 to ",
max_allowed, ").")
}
}
# The selector_checker will return either a vector of indices or a list
# with the first and last desired indices.
#NOTE: goes_across_prime_meridian may be TRUE only if the selector is list of values
# If selectors are indices and _reorder = CircularSort() is used, change
# is_circular_dim to TRUE.
if (!is.null(var_ordered) & selectors_are_indices &
!is.null(dim_reorder_params[[inner_dim]])) {
if ('circular' %in% names(attributes(dim_reorder_params[[inner_dim]]))) {
is_circular_dim <- attr(dim_reorder_params[[inner_dim]], "circular")
if (is_circular_dim & is.list(sub_array_of_selectors)) {
tmp <- dim_reorder_params[[inner_dim]](unlist(sub_array_of_selectors))$ix
goes_across_prime_meridian <- tmp[1] > tmp[2]
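# If the first bound falls after the second once reordered, the requested
# range wraps around the circular border.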
}
}
}
# If selectors are values and _reorder is defined.
if (!is.null(var_ordered) && !selectors_are_indices) {
if (!is.null(dim_reorder_params[[inner_dim]])) {
if ('circular' %in% names(attributes(dim_reorder_params[[inner_dim]]))) {
is_circular_dim <- attr(dim_reorder_params[[inner_dim]], "circular")
}
if (is.list(sub_array_of_selectors)) {
## NOTE: The check of 'goes_across_prime_meridian' is moved forward to here.
if (is_circular_dim) {
# NOTE: Use CircularSort() to put the values in the assigned range, and get the order.
# For example, [-10, 20] in CircularSort(0, 360) is [350, 20]. The $ix list is [2, 1].
# 'goes_across_prime_meridian' means the selector range across the border. For example,
# CircularSort(-180, 180) with selector [170, 190] -> goes_across_prime_meridian = TRUE.
# dim_reorder_params is a list of Reorder function, i.e.,
# Sort() or CircularSort().
tmp <- dim_reorder_params[[inner_dim]](unlist(sub_array_of_selectors))$ix
goes_across_prime_meridian <- tmp[1] > tmp[2]
}
#NOTE: HERE change to the same code as below (under 'else'). Not sure why originally
# it uses additional lines, which make reorder not work.
# If "_reorder" is used, here 'sub_array_of_selectors' is adjusted to
# follow the reorder rule. E.g., if lat = values(list(-90, 90)) and
# lat_reorder = Sort(decreasing = T), 'sub_array_of_selectors' changes
# from list(-90, 90) to list(90, -90).
sub_array_of_selectors <- as.list(dim_reorder_params[[inner_dim]](unlist(sub_array_of_selectors))$x)
#sub_array_reordered <- dim_reorder_params[[inner_dim]](unlist(sub_array_of_selectors))
#sub_array_unorder <- sort(sub_array_reordered$ix, index.return = TRUE)$ix
#sub_array_of_selectors <- as.list(sub_array_reordered$x[sub_array_unorder])
# Add warning if the boundary is out of range
if (min(unlist(sub_array_of_selectors)) < range(var_ordered)[1]) {
show_out_of_range_warning(inner_dim, range = range(var_ordered),
bound = 'lower')
}
if (max(unlist(sub_array_of_selectors)) > range(var_ordered)[2]) {
show_out_of_range_warning(inner_dim, range = range(var_ordered), bound = 'upper')
}
} else {
sub_array_of_selectors <- dim_reorder_params[[inner_dim]](sub_array_of_selectors)$x
}
}
# NOTE: The ideal solution for selecting indices in the goes_across_prime_meridian case
# is a modified SelectorChecker.R. But for now SelectorChecker() doesn't know about
# goes_across_prime_meridian, so the adjustment is done here after calling it.
sub_array_of_indices <- selector_checker(sub_array_of_selectors, var_ordered,
tolerance = if (aiat) {
NULL
} else {
tolerance_params[[inner_dim]]
})
if (goes_across_prime_meridian & sub_array_of_indices[[1]] < sub_array_of_indices[[2]]) {
if (!(sub_array_of_selectors[[1]] %in% var_ordered)){
sub_array_of_indices[[1]] <- sub_array_of_indices[[1]] - 1
}
if (!(sub_array_of_selectors[[2]] %in% var_ordered)){
sub_array_of_indices[[2]] <- sub_array_of_indices[[2]] + 1
}
}
#NOTE: the possible case?
if (goes_across_prime_meridian & sub_array_of_indices[[1]] > sub_array_of_indices[[2]]) {
stop("The case is goes_across_prime_meridian but no adjustion for the indices!")
}
if (any(is.na(sub_array_of_indices))) {
stop(paste0("The selectors of ", inner_dim,
" are out of range [", min(var_ordered),
", ", max(var_ordered), "]."))
}
} else {
# Add warning if the boundary is out of range
if (is.list(sub_array_of_selectors) & !selectors_are_indices) {
if (min(unlist(sub_array_of_selectors)) < min(sub_array_of_values)) {
show_out_of_range_warning(inner_dim, range = range(sub_array_of_values),
bound = 'lower')
}
if (max(unlist(sub_array_of_selectors)) > max(sub_array_of_values)) {
show_out_of_range_warning(inner_dim, range = range(sub_array_of_values),
bound = 'upper')
}
}
# sub_array_of_values here is NULL if selectors are indices, and
# 'sub_array_of_indices' will be sub_array_of_selectors, i.e., the indices
# assigned (but rounded).
sub_array_of_indices <- selector_checker(sub_array_of_selectors, sub_array_of_values,
tolerance = if (aiat) {
NULL
} else {
tolerance_params[[inner_dim]]
})
if (any(is.na(sub_array_of_indices))) {
stop(paste0("The selectors of ", inner_dim,
" are out of range [", min(sub_array_of_values),
", ", max(sub_array_of_values), "]."))
}
}
#////////////////////////////////////////////////////////////
# If chunking along this inner dim, this part creates the indices for each chunk.
# For example, if 'sub_array_of_indices' is c(5:10) and chunked into 2,
# 'sub_array_of_indices' becomes c(5:7) for chunk = 1, c(8:10) for chunk = 2.
# If 'sub_array_of_indices' is list(55, 62) and chunked into 2, it becomes
# list(55, 58) for chunk = 1 and list(59, 62) for chunk = 2.
#TODO: The list can be turned into vector here? So afterward no need to judge if
# it is list or vector.
#NOTE: chunking cannot be done if goes_across_prime_meridian = TRUE.
#TODO: Change the algorithm to make chunking works for goes_across_prime_meridian = TRUE.
# If goes_across_prime_meridian = TRUE, "sub_array_of_indices" are not
# continuous numbers. For example, list(37, 1243) means sub_array_of_fri
# that will be generated based on sub_array_of_indices later is c(1:37, 1243:1296).
# The longitudes are thus separated into 2 parts and cannot be chunked here.
if (chunks[[inner_dim]]["n_chunks"] > 1) {
if (goes_across_prime_meridian) {
stop(paste0("Chunking over ", inner_dim, " that goes across the circular border assigned by '", inner_dim, "_reorder' is not supported by startR now. Adjust the ", inner_dim, " selector to be within the border or change the borders." ))
}
if (!is.list(sub_array_of_indices)) {
sub_array_of_indices <-
sub_array_of_indices[get_chunk_indices(length(sub_array_of_indices),
chunks[[inner_dim]]["chunk"],
chunks[[inner_dim]]["n_chunks"],
inner_dim)]
} else {
tmp <-
get_chunk_indices(length(sub_array_of_indices[[1]]:sub_array_of_indices[[2]]),
chunks[[inner_dim]]["chunk"], chunks[[inner_dim]]["n_chunks"],
inner_dim)
vect <- sub_array_of_indices[[1]]:sub_array_of_indices[[2]]
sub_array_of_indices[[1]] <- vect[tmp[1]]
sub_array_of_indices[[2]] <- vect[tmp[length(tmp)]]
}
}
# The sub_array_of_indices now contains numeric indices of the values to be taken by each chunk.
#----------------------------------------------------------
# 'sub_sub_array_of_values' is for sri chunking. If this inner dim is chunked,
# the sri has to follow the chunking of fri. Therefore, we save the original
# value of this chunk here for later use. We'll find the corresponding
# transformed value within 'sub_sub_array_of_values' and chunk sri.
if (with_transform & chunks[[inner_dim]]["n_chunks"] > 1) {
if (!is.null(var_ordered)) {
input_array_of_values <- var_ordered
} else {
if (is.null(sub_array_of_values)) { # selectors are indices
#NOTE: Not sure if 'vars_to_transform' is the correct one to use.
input_array_of_values <- vars_to_transform[[var_with_selectors_name]]
} else {
input_array_of_values <- sub_array_of_values
}
}
tmp <- generate_sub_sub_array_of_values(
input_array_of_values, sub_array_of_indices,
number_of_chunk = chunks[[inner_dim]]["chunk"])
sub_sub_array_of_values <- tmp$sub_sub_array_of_values
previous_sub_sub_array_of_values <- tmp$previous_sub_sub_array_of_values
}
#----------------------------------------------------------
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> TRANSFORMATION REQUESTED?")
print(with_transform)
print("-> BETA:")
print(beta)
}
}
if (with_transform) {
# If there is a transformation and selector values are provided, these
# selectors will be processed in the same way either if aiat = TRUE or
# aiat = FALSE.
## TODO: If sub_array_of_selectors was integer and aiat then... do what's commented 50 lines below.
## otherwise, do what's coded.
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> SELECTORS REQUESTED BEFORE TRANSFORM.")
}
}
# Generate sub_array_of_fri
sub_array_of_fri <- generate_sub_array_of_fri(
with_transform, goes_across_prime_meridian, sub_array_of_indices, n, beta,
is_circular_dim)
# May be useful for crop = T. 'subset_vars_to_transform' may not need
# to include extra cells, but currently it shows mistake if not include.
sub_array_of_fri_no_beta <- generate_sub_array_of_fri(
with_transform, goes_across_prime_meridian, sub_array_of_indices, n, beta,
is_circular_dim, add_beta = FALSE)
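# 'beta' is the number of extra cells kept at each border (when add_beta = TRUE),
# so the transformation has neighbouring values to work with at the edges.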
subset_vars_to_transform <- vars_to_transform
if (!is.null(var_ordered)) {
#NOTE: If var_ordered is common_vars, it doesn't have attributes and it is a vector.
# Turn it into array and add dimension name.
if (!is.array(var_ordered)) {
var_ordered <- as.array(var_ordered)
names(dim(var_ordered)) <- inner_dim
}
subset_vars_to_transform[[var_with_selectors_name]] <- Subset(var_ordered, inner_dim, sub_array_of_fri)
} else {
if (!selectors_are_indices) { # selectors are values
#NOTE: It should be redundant because without reordering the var should remain array
## But just stay same with above...
if (!is.array(sub_array_of_values)) {
sub_array_of_values <- as.array(sub_array_of_values)
names(dim(sub_array_of_values)) <- inner_dim
}
subset_vars_to_transform[[var_with_selectors_name]] <- Subset(sub_array_of_values, inner_dim, sub_array_of_fri)
} else { # selectors are indices
subset_vars_to_transform[[var_with_selectors_name]] <-
Subset(subset_vars_to_transform[[var_with_selectors_name]],
inner_dim, sub_array_of_fri)
}
}
tmp <- .withWarnings(
do.call(transform, c(list(data_array = NULL,
variables = subset_vars_to_transform,
file_selectors = selectors_of_first_files_with_data[[i]],
crop_domain = transform_crop_domain),
transform_params))$variables[[var_with_selectors_name]]
)
transformed_subset_var <- tmp$value
warnings2 <- c(warnings2, tmp$warnings)
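# 'transformed_subset_var' now holds the selector variable on the destination
# grid (transformed from the subset taken with 'sub_array_of_fri'); the second
# round indices are derived from it below.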
# Sorting the transformed variable and working out the indices again after transform.
if (!is.null(dim_reorder_params[[inner_dim]])) {
transformed_subset_var_reorder <- dim_reorder_params[[inner_dim]](transformed_subset_var)
transformed_subset_var <- transformed_subset_var_reorder$x
#NOTE: The fix here solves the mis-ordered lon when across_meridian.
transformed_subset_var_unorder <- transformed_subset_var_reorder$ix
# transformed_subset_var_unorder <- sort(transformed_subset_var_reorder$ix, index.return = TRUE)$ix
} else {
transformed_subset_var_unorder <- 1:length(transformed_subset_var)
}
if (!selectors_are_indices) { # selectors are values
sub_array_of_sri <- selector_checker(
sub_array_of_selectors, transformed_subset_var,
tolerance = if (aiat) {
tolerance_params[[inner_dim]]
} else {
NULL
})
sub_array_of_sri <- unique(sub_array_of_sri)
}
} else { # selectors are indices
# Need to transfer to values first, then use the values to get the new
# indices in transformed_subset_var.
if (is.list(sub_array_of_selectors)) {
ori_values <- vars_to_transform[[var_with_selectors_name]][sub_array_of_selectors[[1]]:sub_array_of_selectors[[2]]]
} else {
ori_values <- vars_to_transform[[var_with_selectors_name]][sub_array_of_selectors]
}
sub_array_of_sri <- selector_checker(
ori_values, transformed_subset_var,
tolerance = if (aiat) {
tolerance_params[[inner_dim]]
} else {
NULL
})
# Here may need to further modify considering aiat. If aiat = FALSE,
# (i.e., indices are taken before transform), unique() is needed.
sub_array_of_sri <- unique(sub_array_of_sri)
}
# Check if selectors fall out of the range of the transform grid
# It may happen when original lon is [-180, 180] while want to regrid to
# [0, 360], and lon selector = [-20, -10].
if (any(is.na(sub_array_of_sri))) {
stop(paste0("The selectors of ",
inner_dim, " are out of range of transform grid '",
transform_params$grid, "'. Use parameter '",
inner_dim, "_reorder' or change ", inner_dim,
" selectors."))
}
if (goes_across_prime_meridian) {
if (sub_array_of_sri[[1]] == sub_array_of_sri[[2]]) {
# global longitude
sub_array_of_sri <- c(1:length(transformed_subset_var))
} else {
# the common case, i.e., non-global
# # NOTE: Because sub_array_of_sri order is exchanged due to
# # previous development, here [[1]] and [[2]] should exchange
# sub_array_of_sri <- c(1:sub_array_of_sri[[1]],
# sub_array_of_sri[[2]]:length(transformed_subset_var))
#NOTE: the old code above is not suitable for all the possible cases.
# If sub_array_of_selectors is not exactly the value in transformed_subset_var, sub_array_of_sri[[1]] will be larger than sub_array_of_sri[[2]].
# Though here is not global case, we already have transformed_subset_var cropped as the desired region, so it is okay to use the whole length. Not sure if it will cause other problems...
sub_array_of_sri <- 1:length(transformed_subset_var)
}
} else if (is.list(sub_array_of_sri)) {
sub_array_of_sri <- sub_array_of_sri[[1]]:sub_array_of_sri[[2]]
}
#========================================================
# Instead of using values to find sri, directly use the destination grid to count.
#NOTE: sub_array_of_sri seems to start at 1 always (because crop = c(lonmin, lonmax, latmin, latmax) already?)
if (chunks[[inner_dim]]["n_chunks"] > 1) {
sub_array_of_sri <- sub_array_of_sri[get_chunk_indices(
length(sub_array_of_sri),
chunks[[inner_dim]]["chunk"],
chunks[[inner_dim]]["n_chunks"],
inner_dim)]
}
#========================================================
ordered_sri <- sub_array_of_sri
sub_array_of_sri <- transformed_subset_var_unorder[sub_array_of_sri]
###########################old##################################
# if (chunks[[inner_dim]]["n_chunks"] > 1) {
# tmp <- which(transformed_subset_var >= min(sub_sub_array_of_values) &
# transformed_subset_var <= max(sub_sub_array_of_values))
# sub_array_of_sri <- sub_array_of_sri[tmp]
# }
################################################################
# In this case, the tvi are not defined and the 'transformed_subset_var'
# will be taken instead of the var transformed before in the code.
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> FIRST INDEX:")
# print(first_index)
print("NOTE: Check function generate_sub_array_of_fri() in zzz.R")
# print(last_index)
print("NOTE: Check function generate_sub_array_of_fri() in zzz.R")
print("-> STRUCTURE OF FIRST ROUND INDICES:")
print(str(sub_array_of_fri))
print("-> STRUCTURE OF SECOND ROUND INDICES:")
print(str(sub_array_of_sri))
print("-> STRUCTURE OF TRANSFORMED VARIABLE INDICES:")
print(str(tvi))
}
}
### # If the selectors are expressed after transformation
### } else {
###if (debug) {
###if (inner_dim %in% dims_to_check) {
###print("-> SELECTORS REQUESTED AFTER TRANSFORM.")
###}
###}
### if (goes_across_prime_meridian) {
### sub_array_of_indices <- c(sub_array_of_indices[[1]]:m,
### 1:sub_array_of_indices[[2]])
### }
### first_index <- min(unlist(sub_array_of_indices))
### last_index <- max(unlist(sub_array_of_indices))
### first_index_before_transform <- max(transform_indices(first_index, m, n) - beta, 1)
### last_index_before_transform <- min(transform_indices(last_index, m, n) + beta, n)
### sub_array_of_fri <- first_index_before_transform:last_index_before_transform
### n_of_extra_cells <- round(beta / n * m)
### if (is.list(sub_array_of_indices) && (length(sub_array_of_indices) > 1)) {
### sub_array_of_sri <- 1:(last_index - first_index + 1)
### if (is.null(tvi)) {
### tvi <- sub_array_of_sri + first_index - 1
### }
### } else {
### sub_array_of_sri <- sub_array_of_indices - first_index + 1
### if (is.null(tvi)) {
### tvi <- sub_array_of_indices
### }
### }
### sub_array_of_sri <- sub_array_of_sri + n_of_extra_cells
sri <- do.call('[[<-', c(list(x = sri), as.list(selector_store_position),
list(value = sub_array_of_sri)))
} else { # !with_transform
sub_array_of_fri <- generate_sub_array_of_fri(
with_transform, goes_across_prime_meridian, sub_array_of_indices, n, beta,
is_circular_dim)
# Reorder sub_array_of_fri if reordering function is used.
# It was index in the assigned order (e.g., in [-180, 180] if CircularSort(-180, 180)), and here is changed to the index in the original order.
if (!is.null(var_unorder_indices)) {
if (is.null(ordered_fri)) {
ordered_fri <- sub_array_of_fri
}
sub_array_of_fri <- var_unorder_indices[sub_array_of_fri]
}
fri <- do.call('[[<-', c(list(x = fri), as.list(selector_store_position),
list(value = sub_array_of_fri)))
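# '[[<-' stores this chunk's first round indices in 'fri' at the list position
# given by 'selector_store_position'.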
#NOTE: This part existed always but never was used. taken_chunks
# is related to merge_across_dims, but I don't know how it is
# used (maybe for higher efficiency?)
# if (!is.null(crossed_file_dim)) {
# taken_chunks[selector_store_position[[crossed_file_dim]]] <- TRUE
# } else {
taken_chunks <- TRUE
# }
} else {
# The inner dim goes across a file dim (e.g., time_across = 'sdate')
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> THE INNER DIMENSION GOES ACROSS A FILE DIMENSION.")
}
}
# If "<inner_dim>_across = <crossed_file_dim> + merge_across_dims = FALSE + chunk over <inner_dim>", return error because this instance is not logically correct.
if (chunks[[inner_dim]]["n_chunks"] > 1 & inner_dim %in% inner_dims_across_files) {
stop("Chunk over dimension '", inner_dim, "' is not allowed because '",
inner_dim, "' is across '",
names(inner_dims_across_files)[which(inner_dim %in% inner_dims_across_files)], "'.")
}
if (inner_dim %in% names(dim(sub_array_of_selectors))) {
if (is.null(var_with_selectors_name)) {
if (!largest_dims_length | (largest_dims_length & length(unique(inner_dim_lengths)) <= 1)) { #old code
maximal_indice <- data_dims[inner_dim] * chunk_amount
} else { # files have different length of inner dim
maximal_indice <- sum(inner_dim_lengths)
}
if (any(na.omit(unlist(sub_array_of_selectors)) < 1) ||
any(na.omit(unlist(sub_array_of_selectors)) > maximal_indice)) {
stop("Provided indices out of range for dimension '", inner_dim, "' ",
"for dataset '", dat[[i]][['name']], "' (accepted range: 1 to ",
maximal_indice, ").")
}
} else {
if (inner_dim %in% names(dim(sub_array_of_values))) {
# NOTE: Put across-inner-dim at the 1st position.
# POSSIBLE PROB!! Only the inner dim is organized; the rest of the dims may not be in the same order as sub_array_of_selectors below.
inner_dim_pos_in_sub_array <- which(names(dim(sub_array_of_values)) == inner_dim)
if (inner_dim_pos_in_sub_array != 1) {
new_sub_array_order <- (1:length(dim(sub_array_of_values)))[-inner_dim_pos_in_sub_array]
new_sub_array_order <- c(inner_dim_pos_in_sub_array, new_sub_array_order)
sub_array_of_values <- .aperm2(sub_array_of_values, new_sub_array_order)
}
}
}
# NOTE: Put across-inner-dim at the 1st position.
# POSSIBLE PROB!! Only the inner dim is organized; the rest of the dims may not be in the same order as sub_array_of_values above.
inner_dim_pos_in_sub_array <- which(names(dim(sub_array_of_selectors)) == inner_dim)
if (inner_dim_pos_in_sub_array != 1) {
new_sub_array_order <- (1:length(dim(sub_array_of_selectors)))[-inner_dim_pos_in_sub_array]
new_sub_array_order <- c(inner_dim_pos_in_sub_array, new_sub_array_order)
sub_array_of_selectors <- .aperm2(sub_array_of_selectors, new_sub_array_order)
}
sub_array_of_indices <- selector_checker(sub_array_of_selectors, sub_array_of_values,
tolerance = tolerance_params[[inner_dim]])
# It is needed to expand the indices here, otherwise for
# values(list(date1, date2)) only 2 values are picked.
if (is.list(sub_array_of_indices)) {
sub_array_of_indices <- sub_array_of_indices[[1]]:sub_array_of_indices[[2]]
}
sub_array_of_indices <- sub_array_of_indices[get_chunk_indices(length(sub_array_of_indices),
chunks[[inner_dim]]['chunk'],
chunks[[inner_dim]]['n_chunks'],
inner_dim)]
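# get_chunk_indices() returns the positions belonging to this particular chunk
# out of the 'n_chunks' pieces, so only the part of the indices assigned to this
# chunk is kept.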
sub_array_is_list <- FALSE
if (is.list(sub_array_of_indices)) {
sub_array_is_list <- TRUE
sub_array_of_indices <- unlist(sub_array_of_indices)
}
# "indices_chunk" refers to in which file the
# sub_array_of_indices is; "transformed_indices"
# refers to the indices of sub_array_of_indices in each file.
if (!largest_dims_length |
(largest_dims_length & length(unique(inner_dim_lengths)) <= 1)) {
# old code; all the files have the same length of inner_dim
if (is.null(var_with_selectors_name)) {
indices_chunk <- floor((sub_array_of_indices - 1) / data_dims[inner_dim]) + 1
transformed_indices <- ((sub_array_of_indices - 1) %% data_dims[inner_dim]) + 1
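# E.g., with data_dims[inner_dim] = 12 and sub_array_of_indices = c(11, 12, 13),
# indices_chunk is c(1, 1, 2) and transformed_indices is c(11, 12, 1).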
} else {
indices_chunk <- floor((sub_array_of_indices - 1) / var_full_dims[inner_dim]) + 1
transformed_indices <- ((sub_array_of_indices - 1) %% var_full_dims[inner_dim]) + 1
}
} else { # files have different inner dim length
indices_chunk <- c()
for (item in 1:length(inner_dim_lengths)) {
tmp <- which(sub_array_of_indices <= cumsum(inner_dim_lengths)[item])
indices_chunk <- c(indices_chunk, rep(item, length(tmp) - length(indices_chunk)))
}
sub_array_of_indices_by_file <- split(sub_array_of_indices, indices_chunk)
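# Indices falling in the 2nd, 3rd, ... file are shifted down by the cumulative
# length of the preceding files, turning them into within-file indices.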
for (item in 2:length(sub_array_of_indices_by_file)) {
sub_array_of_indices_by_file[[item]] <- sub_array_of_indices_by_file[[item]] - cumsum(inner_dim_lengths)[item - 1]
}
transformed_indices <- unlist(sub_array_of_indices_by_file, use.names = FALSE)
}
if (sub_array_is_list) {
sub_array_of_indices <- as.list(sub_array_of_indices)
}
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> GOING TO ITERATE ALONG CHUNKS.")
}
}
for (chunk in 1:chunk_amount) {
if (!is.null(names(selector_store_position))) {
selector_store_position[crossed_file_dim] <- chunk
} else {
selector_store_position <- chunk
}
sub_array_of_indices <- transformed_indices[which(indices_chunk == chunk)]
#NOTE: This 'with_transform' part is probably not tested because
# here is for the inner dim that goes across a file dim, which
# is normally not lat and lon dimension. If in the future, we
# can interpolate time, this part needs to be examined.
if (with_transform) {
# If the provided selectors are expressed in the world
# before transformation
if (!aiat) {
first_index <- min(unlist(sub_array_of_indices))
last_index <- max(unlist(sub_array_of_indices))
sub_array_of_fri <- max(c(first_index - beta, 1)):min(c(last_index + beta, n))
sub_array_of_sri <- transform_indices(unlist(sub_array_of_indices) - first_index + 1, n, m)
if (is.list(sub_array_of_indices)) {
if (length(sub_array_of_sri) > 1) {
sub_array_of_sri <- sub_array_of_sri[[1]]:sub_array_of_sri[[2]]
}
}
##TODO: TRANSFORM SUBSET VARIABLE AS ABOVE, TO COMPUTE SRI
# If the selectors are expressed after transformation
} else {
first_index <- min(unlist(sub_array_of_indices))
last_index <- max(unlist(sub_array_of_indices))
first_index_before_transform <- max(transform_indices(first_index, m, n) - beta, 1)
last_index_before_transform <- min(transform_indices(last_index, m, n) + beta, n)
sub_array_of_fri <- first_index_before_transform:last_index_before_transform
if (is.list(sub_array_of_indices) && (length(sub_array_of_indices) > 1)) {
sub_array_of_sri <- 1:(last_index - first_index + 1) +
round(beta / n * m)
} else {
sub_array_of_sri <- sub_array_of_indices - first_index + 1 +
round(beta / n * m)
}
##TODO: FILL IN TVI
}
sri <- do.call('[[<-', c(list(x = sri), as.list(selector_store_position),
list(value = sub_array_of_sri)))
if (length(sub_array_of_sri) > 0) {
taken_chunks[chunk] <- TRUE
}
} else {
sub_array_of_fri <- sub_array_of_indices
if (length(sub_array_of_fri) > 0) {
taken_chunks[chunk] <- TRUE
}
}
if (!is.null(var_unorder_indices)) {
ordered_fri <- sub_array_of_fri
sub_array_of_fri <- var_unorder_indices[sub_array_of_fri]
}
fri <- do.call('[[<-', c(list(x = fri), as.list(selector_store_position),
list(value = sub_array_of_fri)))
}
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> FINISHED ITERATING ALONG CHUNKS")
}
}
} else {
stop("Provided array of indices for dimension '", inner_dim, "', ",
"which goes across the file dimension '", crossed_file_dim, "', but ",
"the provided array does not have the dimension '", inner_dim,
"', which is mandatory.")
}
}
}
}
if (debug) {
if (inner_dim %in% dims_to_check) {
print("-> PROCEEDING TO CROP VARIABLES")
}
}
#if ((length(selector_array) == 1) && (selector_array %in% c('all', 'first', 'last'))) {
#if (!is.null(var_with_selectors_name) || (is.null(var_with_selectors_name) && is.character(selector_array) &&
# (length(selector_array) == 1) && (selector_array %in% c('all', 'first', 'last')))) {
empty_chunks <- which(!taken_chunks)
if (length(empty_chunks) >= length(taken_chunks)) {
stop("Selectors do not match any of the possible values for the dimension '", inner_dim, "'.")
}
if (length(empty_chunks) > 0) {
# # Get the first group of chunks to remove, and remove them.
# # E.g., from c(1, 2, 4, 5, 6, 8, 9) remove only 1 and 2
# dist <- abs(rev(empty_chunks) - c(rev(empty_chunks)[1] - 1, head(rev(empty_chunks), length(rev(empty_chunks)) - 1)))
# if (all(dist == 1)) {
# start_chunks_to_remove <- NULL
# } else {
# first_chunk_to_remove <- tail(which(dist > 1), 1)
# start_chunks_to_remove <- rev(rev(empty_chunks)[first_chunk_to_remove:length(empty_chunks)])
# }
# # Get the last group of chunks to remove, and remove them.
# # E.g., from c(1, 2, 4, 5, 6, 8, 9) remove only 8 and 9
# dist <- abs(empty_chunks - c(empty_chunks[1] - 1, head(empty_chunks, length(empty_chunks) - 1)))
# if (all(dist == 1)) {
# first_chunk_to_remove <- 1
# } else {
# first_chunk_to_remove <- tail(which(dist > 1), 1)
# }
# end_chunks_to_remove <- empty_chunks[first_chunk_to_remove:length(empty_chunks)]
# chunks_to_keep <- which(!((1:length(taken_chunks)) %in% c(start_chunks_to_remove, end_chunks_to_remove)))
chunks_to_keep <- which(taken_chunks)
dims_to_crop[[crossed_file_dim]] <- c(dims_to_crop[[crossed_file_dim]], list(chunks_to_keep))
# found_indices <- Subset(found_indices, crossed_file_dim, chunks_to_keep)
# # Crop dataset variables file dims.
# for (picked_var in names(picked_vars[[i]])) {
# if (crossed_file_dim %in% names(dim(picked_vars[[i]][[picked_var]]))) {
# picked_vars[[i]][[picked_var]] <- Subset(picked_vars[[i]][[picked_var]], crossed_file_dim, chunks_to_keep)
# }
# }
}
#}
dat[[i]][['selectors']][[inner_dim]] <- list(fri = fri, sri = sri)
# Crop dataset variables inner dims.
# Crop common variables inner dims.
types_of_var_to_crop <- 'picked'
if (with_transform) {
types_of_var_to_crop <- c(types_of_var_to_crop, 'transformed')
}
if (!is.null(dim_reorder_params[[inner_dim]])) {
types_of_var_to_crop <- c(types_of_var_to_crop, 'reordered')
}
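# Besides the original (picked) variables, the transformed and/or reordered
# copies also have to be cropped along this inner dimension.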
for (type_of_var_to_crop in types_of_var_to_crop) {
if (type_of_var_to_crop == 'transformed') {
if (is.null(tvi)) {
if (!is.null(dim_reorder_params[[inner_dim]])) {
crop_indices <- unique(unlist(ordered_sri))
} else {
crop_indices <- unique(unlist(sri))
}
} else {
crop_indices <- unique(unlist(tvi))
}
vars_to_crop <- transformed_vars[[i]]
common_vars_to_crop <- transformed_common_vars
} else if (type_of_var_to_crop == 'reordered') {
crop_indices <- unique(unlist(ordered_fri))
vars_to_crop <- picked_vars_ordered[[i]]
common_vars_to_crop <- picked_common_vars_ordered
} else {
#TODO: If fri has different indices in each list, the crop_indices should be
# separated for each list. Otherwise, picked_common_vars later will be wrong.
crop_indices <- unique(unlist(fri))
vars_to_crop <- picked_vars[[i]]
common_vars_to_crop <- picked_common_vars
}
for (var_to_crop in names(vars_to_crop)) {
if (inner_dim %in% names(dim(vars_to_crop[[var_to_crop]]))) {
if (!is.null(crop_indices)) {
if (type_of_var_to_crop == 'transformed') {
if (!aiat) {
if (!(length(selector_array) == 1 &
all(selector_array %in% c('all', 'first', 'last')))) {
vars_to_crop[[var_to_crop]] <- Subset(transformed_subset_var, inner_dim, crop_indices)
} else {
vars_to_crop[[var_to_crop]] <-
Subset(transformed_var_with_selectors, inner_dim, crop_indices)
}
} else {
vars_to_crop[[var_to_crop]] <- Subset(vars_to_crop[[var_to_crop]], inner_dim, crop_indices)
}
} else {
vars_to_crop[[var_to_crop]] <- Subset(vars_to_crop[[var_to_crop]], inner_dim, crop_indices)
}
}
}
}
if (i == length(dat)) {
for (common_var_to_crop in names(common_vars_to_crop)) {
if (inner_dim %in% names(dim(common_vars_to_crop[[common_var_to_crop]]))) {
if (type_of_var_to_crop == 'transformed' & !aiat) {
if (!(length(selector_array) == 1 &
all(selector_array %in% c('all', 'first', 'last')))) {
common_vars_to_crop[[common_var_to_crop]] <-
Subset(transformed_subset_var, inner_dim, crop_indices)
} else {
common_vars_to_crop[[common_var_to_crop]] <-
Subset(transformed_var_with_selectors, inner_dim, crop_indices)
}
} else {
if (!is.null(crossed_file_dim)) { #merge_across_dims, crossed_file_dim is the depended file dim
#NOTE: When is not this case??? Maybe this condition is not needed
if (any(crossed_file_dim %in% names(dim(common_vars_to_crop[[common_var_to_crop]])))) {
tmp <- common_vars_to_crop[[common_var_to_crop]]
tmp_attributes <- attributes(common_vars_to_crop[[common_var_to_crop]])
dim_extra_ind <- which(!names(dim(tmp)) %in% c(crossed_file_dim, inner_dim))
if (!identical(dim_extra_ind, integer(0))) {
tmp_list <- asplit(tmp, dim_extra_ind)
dim_file_ind <- which(names(dim(tmp_list[[1]])) %in% crossed_file_dim)
tmp_list <- lapply(tmp_list, asplit, dim_file_ind)
} else { # only crossed_file_dim and inner_dim
dim_file_ind <- which(names(dim(tmp)) %in% crossed_file_dim)
tmp_list <- asplit(tmp, dim_file_ind)
# Add another layer to be consistent with the first case above
tmp_list <- list(tmp_list)
}
max_fri_length <- max(sapply(fri, length))
for (i_extra_dim in 1:length(tmp_list)) {
for (i_fri in 1:length(fri)) {
tmp_list[[i_extra_dim]][[i_fri]] <-
tmp_list[[i_extra_dim]][[i_fri]][fri[[i_fri]]]
if (length(tmp_list[[i_extra_dim]][[i_fri]]) != max_fri_length) {
tmp_list[[i_extra_dim]][[i_fri]] <-
c(tmp_list[[i_extra_dim]][[i_fri]], rep(NA, max_fri_length - length(tmp_list[[i_extra_dim]][[i_fri]])))
}
}
}
# Change list back to array
tmp_new_dim <- c(max_fri_length, dim(tmp)[crossed_file_dim], dim(tmp)[dim_extra_ind])
names(tmp_new_dim) <- c(inner_dim, crossed_file_dim, names(dim(tmp))[dim_extra_ind])
common_vars_to_crop[[common_var_to_crop]] <- array(unlist(tmp_list), dim = tmp_new_dim)
# Reorder back
common_vars_to_crop[[common_var_to_crop]] <-
aperm(common_vars_to_crop[[common_var_to_crop]], match(names(dim(tmp)), names(tmp_new_dim)))
# Put attributes back
tmp <- which(!names(tmp_attributes) %in% names(attributes(common_vars_to_crop[[common_var_to_crop]])))
attributes(common_vars_to_crop[[common_var_to_crop]]) <-
c(attributes(common_vars_to_crop[[common_var_to_crop]]),
tmp_attributes[tmp])
if ('time' %in% synonims[[common_var_to_crop]]) {
# Convert number back to time
common_vars_to_crop[[common_var_to_crop]] <-
as.POSIXct(common_vars_to_crop[[common_var_to_crop]],
origin = "1970-01-01", tz = 'UTC')
}
}
} else { # old code
common_vars_to_crop[[common_var_to_crop]] <- Subset(common_vars_to_crop[[common_var_to_crop]], inner_dim, crop_indices)
}
}
}
}
}
if (type_of_var_to_crop == 'transformed') {
if (!is.null(vars_to_crop)) {
transformed_vars[[i]] <- vars_to_crop
}
if (i == length(dat)) {
transformed_common_vars <- common_vars_to_crop
}
} else if (type_of_var_to_crop == 'reordered') {
if (!is.null(vars_to_crop)) {
picked_vars_ordered[[i]] <- vars_to_crop
}
if (i == length(dat)) {
picked_common_vars_ordered <- common_vars_to_crop
}
} else {
if (!is.null(vars_to_crop)) {
picked_vars[[i]] <- vars_to_crop
}
if (i == length(dat)) {
#NOTE: To avoid redundant run
if (inner_dim %in% names(common_vars_to_crop)) {
picked_common_vars <- common_vars_to_crop
}
}
}
}
#}
}
# After the selectors have been picked (using the original variables),
# the variables are transformed. At that point, the original selectors
# for the transformed variables are also kept in the variable original_selectors.
#print("L")
}
}
}
# if (!is.null(transformed_common_vars)) {
# picked_common_vars[names(transformed_common_vars)] <- transformed_common_vars
# }
# Remove the trailing chunks, if any.
for (file_dim in names(dims_to_crop)) {
# indices_to_keep <- min(sapply(dims_to_crop[[file_dim]], min)):max(sapply(dims_to_crop[[file_dim]], max))
## TODO: Merge indices in dims_to_crop with some advanced mechanism?
indices_to_keep <- unique(unlist(dims_to_crop[[file_dim]]))
array_of_files_to_load <- Subset(array_of_files_to_load, file_dim, indices_to_keep)
array_of_not_found_files <- Subset(array_of_not_found_files, file_dim, indices_to_keep)
for (i in 1:length(dat)) {
# Crop selectors
for (selector_dim in names(dat[[i]][['selectors']])) {
if (selector_dim == file_dim) {
for (j in 1:length(dat[[i]][['selectors']][[selector_dim]][['fri']])) {
dat[[i]][['selectors']][[selector_dim]][['fri']][[j]] <- dat[[i]][['selectors']][[selector_dim]][['fri']][[j]][indices_to_keep]
}
for (j in 1:length(dat[[i]][['selectors']][[selector_dim]][['sri']])) {
dat[[i]][['selectors']][[selector_dim]][['sri']][[j]] <- dat[[i]][['selectors']][[selector_dim]][['sri']][[j]][indices_to_keep]
}
}
if (file_dim %in% names(dim(dat[[i]][['selectors']][[selector_dim]][['fri']]))) {
dat[[i]][['selectors']][[selector_dim]][['fri']] <- Subset(dat[[i]][['selectors']][[selector_dim]][['fri']], file_dim, indices_to_keep)
dat[[i]][['selectors']][[selector_dim]][['sri']] <- Subset(dat[[i]][['selectors']][[selector_dim]][['sri']], file_dim, indices_to_keep)
}
}
# Crop dataset variables file dims.
for (picked_var in names(picked_vars[[i]])) {
if (file_dim %in% names(dim(picked_vars[[i]][[picked_var]]))) {
picked_vars[[i]][[picked_var]] <- Subset(picked_vars[[i]][[picked_var]], file_dim, indices_to_keep)
}
}
for (transformed_var in names(transformed_vars[[i]])) {
if (file_dim %in% names(dim(transformed_vars[[i]][[transformed_var]]))) {
transformed_vars[[i]][[transformed_var]] <- Subset(transformed_vars[[i]][[transformed_var]], file_dim, indices_to_keep)
}
}
}
# Crop common variables file dims.
for (picked_common_var in names(picked_common_vars)) {
if (file_dim %in% names(dim(picked_common_vars[[picked_common_var]]))) {
picked_common_vars[[picked_common_var]] <- Subset(picked_common_vars[[picked_common_var]], file_dim, indices_to_keep)
}
}
for (transformed_common_var in names(transformed_common_vars)) {
if (file_dim %in% names(dim(transformed_common_vars[[transformed_common_var]]))) {
transformed_common_vars[[transformed_common_var]] <- Subset(transformed_common_vars[[transformed_common_var]], file_dim, indices_to_keep)
}
}
}
# Calculate the size of the final array.
total_inner_dims <- NULL
for (i in 1:length(dat)) {
if (dataset_has_files[i]) {
inner_dims <- expected_inner_dims[[i]]
inner_dims <- sapply(inner_dims,
function(x) {
if (!all(sapply(dat[[i]][['selectors']][[x]][['sri']], is.null))) {
max(sapply(dat[[i]][['selectors']][[x]][['sri']], length))
} else {
if (length(var_params[[x]]) > 0) {
if (var_params[[x]] %in% names(transformed_vars[[i]])) {
length(transformed_vars[[i]][[var_params[[x]]]])
} else if (var_params[[x]] %in% names(transformed_common_vars)) {
length(transformed_common_vars[[var_params[[x]]]])
} else {
max(sapply(dat[[i]][['selectors']][[x]][['fri']], length))
}
} else {
max(sapply(dat[[i]][['selectors']][[x]][['fri']], length))
}
}
})
names(inner_dims) <- expected_inner_dims[[i]]
if (is.null(total_inner_dims)) {
total_inner_dims <- inner_dims
} else {
new_dims <- .MergeArrayDims(total_inner_dims, inner_dims)
total_inner_dims <- new_dims[[3]]
}
}
}
new_dims <- .MergeArrayDims(dim(array_of_files_to_load), total_inner_dims)
final_dims <- new_dims[[3]][dim_names]
# final_dims_fake is the vector of final dimensions after having merged the
# 'across' file dimensions with the respective 'across' inner dimensions, and
# after having broken into multiple dimensions those dimensions for which
# multidimensional selectors have been provided.
# final_dims will be used for collocation of data, whereas final_dims_fake
# will be used for shaping the final array to be returned to the user.
final_dims_fake <- final_dims
if (merge_across_dims) {
final_dims_fake <- dims_merge(inner_dims_across_files, final_dims_fake)
}
#=========================================================================
# Find the dimension to split if split_multiselected_dims = TRUE.
# If there is no dimension able to be split, change split_multiselected_dims to FALSE.
if (split_multiselected_dims) {
tmp <- dims_split(dim_params, final_dims_fake)
final_dims_fake <- tmp[[1]]
# all_split_dims is a list containing all the split dims
all_split_dims <- tmp[[2]]
if (is.null(all_split_dims)) {
split_multiselected_dims <- FALSE
.warning(paste0("Not found any dimensions able to be split. The parameter ",
"'split_multiselected_dims' is changed to FALSE."))
} else {
tmp_fun <- function (x, y) {
any(names(dim(x)) %in% y)
}
if (!is.null(picked_common_vars)) {
inner_dim_has_split_dim <- names(which(unlist(lapply(
picked_common_vars, tmp_fun, names(all_split_dims)))))
if (!identical(inner_dim_has_split_dim, character(0))) {
# If merge_across_dims also, it will be replaced later
saved_reshaped_attr <- attr(picked_common_vars[[inner_dim_has_split_dim]], 'variables')
}
}
}
}
#======================================================================
# If only merge_across_dims and merge_across_dims_narm and no split_multiselected_dims,
# the length of inner across dim (e.g., time) needs to be adjusted. Sum up the actual length
# without potential NAs.
if (merge_across_dims) {
across_inner_dim <- inner_dims_across_files[[1]] #TODO: more than one?
# Get the length of each inner_dim ('time') along each file_dim ('file_date')
length_inner_across_dim <- lapply(dat[[i]][['selectors']][[across_inner_dim]][['fri']], length)
dims_of_merge_dim <- dim(picked_common_vars[[across_inner_dim]])
# Save attributes for later use. If split_multiselected_dims, this variable has been created above but is replaced here
saved_reshaped_attr <- attr(picked_common_vars[[across_inner_dim]], 'variables')
if (merge_across_dims_narm & !split_multiselected_dims) {
final_dims_fake <- merge_narm_dims(final_dims_fake, across_inner_dim, length_inner_across_dim)
}
}
if (!silent) {
.message("Detected dimension sizes:")
longest_dim_len <- max(sapply(names(final_dims_fake), nchar))
longest_size_len <- max(sapply(paste0(final_dims_fake, ''), nchar))
sapply(names(final_dims_fake),
function(x) {
message(paste0("* ", paste(rep(' ', longest_dim_len - nchar(x)), collapse = ''),
x, ": ", paste(rep(' ', longest_size_len - nchar(paste0(final_dims_fake[x], ''))), collapse = ''),
final_dims_fake[x]))
})
bytes <- prod(c(final_dims_fake, 8))
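# The size estimate assumes 8 bytes per value (double precision).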
dim_sizes <- paste(final_dims_fake, collapse = ' x ')
if (retrieve) {
.message(paste("Total size of requested data:"))
} else {
.message(paste("Total size of involved data:"))
}
.message(paste(dim_sizes, " x 8 bytes =",
format(structure(bytes, class = "object_size"), units = "auto")),
indent = 2)
}
# NOTE: If split_multiselected_dims + merge_across_dims, the dim order may need to be changed.
# The inner_dim needs to be the first dim among split dims.
# TODO: Cannot control the rest dims are in the same order or not...
# Suppose users put the same order of across inner and file dims.
if (split_multiselected_dims & merge_across_dims) {
# TODO: More than one split?
inner_dim_pos_in_split_dims <- which(names(all_split_dims[[1]]) == inner_dims_across_files)
# if inner_dim is not the first, change!
if (inner_dim_pos_in_split_dims != 1) {
# Save the current final_dims_fake for reordering it back later
tmp <- reorder_split_dims(all_split_dims[[1]], inner_dim_pos_in_split_dims, final_dims_fake)
final_dims_fake <- tmp[[1]]
all_split_dims[[1]] <- tmp[[2]]
}
}
if (!merge_across_dims & split_multiselected_dims & identical(inner_dim_has_split_dim, character(0))) {
final_dims_fake_metadata <- NULL
} else {
if (!merge_across_dims & split_multiselected_dims & !is.null(picked_common_vars)) {
if (any(names(all_split_dims[[1]]) %in% names(dim(picked_common_vars[[inner_dim_has_split_dim]]))) &
names(all_split_dims)[1] != inner_dim_has_split_dim) {
if (inner_dim_has_split_dim %in% names(final_dims)) {
stop("Detect inner dimension in the split array, but merge_across_dims is not used. The output dimensions will be repeated. Check if the dimensions and parameters are correctly defined.")
} else {
# Only split no merge, time dim is not explicitly defined because the
# length is 1, the sdate dim to be split having 'time' as one dimension.
# --> Take 'time' dim off from picked_common_vars.
dim(picked_common_vars[[inner_dim_has_split_dim]]) <- dim(picked_common_vars[[inner_dim_has_split_dim]])[-which(names(dim(picked_common_vars[[inner_dim_has_split_dim]])) == inner_dim_has_split_dim)]
}
}
}
final_dims_fake_metadata <- find_final_dims_fake_metadata(
merge_across_dims, split_multiselected_dims, picked_common_vars = picked_common_vars[[inner_dim_has_split_dim]], across_inner_dim,
final_dims_fake, dims_of_merge_dim, all_split_dims)
}
# store warning messages from transform
warnings3 <- NULL
# The following several lines will only run if retrieve = TRUE
if (retrieve) {
########## CREATING THE SHARED MATRIX AND DISPATCHING WORK PIECES ###########
# TODO: try performance of storing all in cols instead of rows
# Create the shared memory array, and a pointer to it, to be sent
# to the work pieces.
if (is.null(ObjectBigmemory)) {
data_array <- bigmemory::big.matrix(nrow = prod(final_dims), ncol = 1)
} else {
data_array <- bigmemory::big.matrix(nrow = prod(final_dims), ncol = 1,
backingfile = ObjectBigmemory)
}
shared_matrix_pointer <- bigmemory::describe(data_array)
if (is.null(ObjectBigmemory)) {
name_bigmemory_obj <- attr(shared_matrix_pointer, 'description')$sharedName
} else {
name_bigmemory_obj <- attr(shared_matrix_pointer, 'description')$filename
}
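# The descriptor is what allows each worker process to attach to the same
# shared-memory matrix and write its piece of the data into it.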
#warning(paste("SharedName:", attr(shared_matrix_pointer, 'description')$sharedName))
#warning(paste("Filename:", attr(shared_matrix_pointer, 'description')$filename))
#if (!is.null(ObjectBigmemory)) {
# attr(shared_matrix_pointer, 'description')$sharedName <- ObjectBigmemory
#}
if (is.null(num_procs)) {
num_procs <- future::availableCores()
}
# Creating a shared tmp folder to store metadata from each chunk
array_of_metadata_flags <- array(FALSE, dim = dim(array_of_files_to_load))
if (!is.null(metadata_dims)) {
metadata_indices_to_load <- as.list(rep(1, length(dim(array_of_files_to_load))))
names(metadata_indices_to_load) <- names(dim(array_of_files_to_load))
metadata_indices_to_load[metadata_dims] <- as.list(rep(TRUE, length(metadata_dims)))
array_of_metadata_flags <- do.call('[<-', c(list(array_of_metadata_flags), metadata_indices_to_load,
list(value = rep(TRUE, prod(dim(array_of_files_to_load)[metadata_dims])))))
}
metadata_file_counter <- 0
metadata_folder <- tempfile('metadata')
dir.create(metadata_folder)
# Build the work pieces, each with:
# - file path
# - total size (dims) of store array
# - start position in store array
# - file selectors (to provide extra info. useful e.g. to select variable)
# - indices to take from file
work_pieces <- list()
for (i in 1:length(dat)) {
if (dataset_has_files[i]) {
# metadata_file_counter may be changed by the following function
work_pieces <- build_work_pieces(
work_pieces = work_pieces, i = i, selectors = dat[[i]][['selectors']],
file_dims = found_file_dims[[i]],
inner_dims = expected_inner_dims[[i]], final_dims = final_dims,
found_pattern_dim = found_pattern_dim,
inner_dims_across_files = inner_dims_across_files,
array_of_files_to_load = array_of_files_to_load,
array_of_not_found_files = array_of_not_found_files,
array_of_metadata_flags = array_of_metadata_flags,
metadata_file_counter = metadata_file_counter,
depending_file_dims = depending_file_dims, transform = transform,
transform_vars = transform_vars, picked_vars = picked_vars[[i]],
picked_vars_ordered = picked_vars_ordered[[i]],
picked_common_vars = picked_common_vars,
picked_common_vars_ordered = picked_common_vars_ordered,
metadata_folder = metadata_folder, debug = debug)
}
}
#print("N")
if (debug) {
print("-> WORK PIECES BUILT")
}
# Calculate the progress %s that will be displayed and assign them to
# the appropriate work pieces.
work_pieces <- retrieve_progress_message(work_pieces, num_procs, silent)
# NOTE: In .LoadDataFile(), metadata is saved in metadata_folder/1 or /2 etc. Before here,
# the path name is created in work_pieces but the path hasn't been built yet.
if (num_procs == 1) {
tmp <- .withWarnings(
lapply(work_pieces, .LoadDataFile,
shared_matrix_pointer = shared_matrix_pointer,
file_data_reader = file_data_reader,
synonims = synonims,
transform = transform,
transform_params = transform_params,
transform_crop_domain = transform_crop_domain,
silent = silent, debug = debug)
)
found_files <- tmp$value
warnings3 <- c(warnings3, tmp$warnings)
} else {
cluster <- parallel::makeCluster(num_procs, outfile = "")
# Send the heavy work to the workers
##NOTE: .withWarnings() can't catch warnings like it does above (num_procs == 1). The warnings
## show below when "bigmemory::as.matrix(data_array)" is called. Don't know how to fix it for now.
work_errors <- try({
found_files <- parallel::clusterApplyLB(cluster, work_pieces, .LoadDataFile,
shared_matrix_pointer = shared_matrix_pointer,
file_data_reader = file_data_reader,
synonims = synonims,
transform = transform,
transform_params = transform_params,
transform_crop_domain = transform_crop_domain,
silent = silent, debug = debug)
})
parallel::stopCluster(cluster)
}
if (!silent) {
# if (progress_message != '')
if (length(work_pieces) / num_procs >= 2 && !silent) {
.message("")
}
}
# If merge_across_dims = TRUE, there might be additional NAs due to unequal
# inner_dim ('time') length across file_dim ('file_date').
# If merge_across_dims_narm = TRUE, add additional lines to remove these NAs.
# TODO: Now it assumes that only one '_across'. Add a for loop for more-than-one case.
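# Worked example with hypothetical dimensions: if file_date 1 provides 31 'time'
# steps and file_date 2 provides only 30, the merged array is allocated with
# time = 31 for both files, so file_date 2 carries one trailing NA along 'time'.
# With merge_across_dims_narm = TRUE those padding NAs are removed before the
# final reshaping; with FALSE they are kept in the output.
#   dims_of_merge_dim <- c(time = 31, file_date = 2)   # illustrative values only
#   prod(dims_of_merge_dim)    # 62 slots allocated
#   31 + 30                    # 61 real values -> 1 NA to drop when narm = TRUE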
if (merge_across_dims & (split_multiselected_dims | merge_across_dims_narm)) {
if (!merge_across_dims_narm) {
data_array_tmp <- array(bigmemory::as.matrix(data_array), dim = final_dims)
tmp <- match(names(final_dims), names(dims_of_merge_dim))
if (any(diff(tmp[!is.na(tmp)]) < 0)) { #need to reorder
picked_common_vars[[across_inner_dim]] <- .aperm2(picked_common_vars[[across_inner_dim]], tmp[!is.na(tmp)])
}
metadata_tmp <- picked_common_vars[[across_inner_dim]]
} else {
tmp <- remove_additional_na_from_merge(
data_array = bigmemory::as.matrix(data_array),
merge_dim_metadata = picked_common_vars[[across_inner_dim]])
data_array_tmp <- tmp$data_array
metadata_tmp <- tmp$merge_dim_metadata
}
if (length(data_array_tmp) != prod(final_dims_fake)) {
stop(paste0("After reshaping, the data do not fit into the expected output dimension. ",
"Check if the reshaping parameters are used correctly."))
}
if (length(metadata_tmp) != prod(final_dims_fake_metadata)) {
stop(paste0("After reshaping, the metadata do not fit into the expected output dimension. ",
"Check if the reshaping parameters are used correctly or contact support."))
}
#NOTE: When one file contains values for discrete dimensions, rearranging the
# chunks (i.e., work_piece) is necessary.
if (split_multiselected_dims) {
tmp <- rebuild_array_merge_split(
data_array = data_array_tmp, metadata = metadata_tmp, indices_chunk,
all_split_dims, final_dims_fake, across_inner_dim, length_inner_across_dim)
data_array_tmp <- tmp$data_array
metadata_tmp <- tmp$metadata
}
data_array <- array(data_array_tmp, dim = final_dims_fake)
# If split_multiselected_dims + merge_across_dims, the dimension order may change above.
# To get the user-required dim order, we need to reorder the array again.
if (split_multiselected_dims) {
if (inner_dim_pos_in_split_dims != 1) {
correct_order <- match(names(final_dims_fake_output), names(final_dims_fake))
data_array <- .aperm2(data_array, correct_order)
correct_order_metadata <- match(names(final_dims_fake_output), names(all_split_dims[[across_inner_dim]]))
metadata_tmp <- .aperm2(metadata_tmp, correct_order_metadata[!is.na(correct_order_metadata)])
}
}
# Convert numeric back to dates
if ('time' %in% synonims[[across_inner_dim]]) {
metadata_tmp <- as.POSIXct(metadata_tmp, origin = "1970-01-01", tz = 'UTC')
}
picked_common_vars[[across_inner_dim]] <- metadata_tmp
attr(picked_common_vars[[across_inner_dim]], 'variables') <- saved_reshaped_attr
} else { # ! (merge_across_dims + split_multiselected_dims) (old version)
data_array <- array(bigmemory::as.matrix(data_array), dim = final_dims_fake)
if (merge_across_dims) {
# merge_across_dims = TRUE but (merge_across_dims_narm = F & split_multiselected_dims = F)
inner_dim_pos <- which(names(dims_of_merge_dim) == inner_dims_across_files)
file_dim_pos <- which(names(dims_of_merge_dim) == names(inner_dims_across_files))
if (file_dim_pos < inner_dim_pos) { #need to reorder
tmp <- seq(1, length(dims_of_merge_dim))
tmp[inner_dim_pos] <- file_dim_pos
tmp[file_dim_pos] <- inner_dim_pos
picked_common_vars[[across_inner_dim]] <- .aperm2(picked_common_vars[[across_inner_dim]], tmp)
}
metadata_tmp <- array(picked_common_vars[[across_inner_dim]], dim = final_dims_fake_metadata)
# Convert numeric back to dates
if ('time' %in% synonims[[across_inner_dim]]) {
metadata_tmp <- as.POSIXct(metadata_tmp, origin = "1970-01-01", tz = 'UTC')
}
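# Note: the dates were coerced to numeric (seconds since the epoch) during the
# array reshaping above, so they are converted back here. For instance:
#   as.POSIXct(0, origin = "1970-01-01", tz = 'UTC')   # "1970-01-01 00:00:00 UTC"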
picked_common_vars[[across_inner_dim]] <- metadata_tmp
attr(picked_common_vars[[across_inner_dim]], 'variables') <- saved_reshaped_attr
}
if (split_multiselected_dims & !is.null(picked_common_vars)) {
metadata_tmp <- array(picked_common_vars[[inner_dim_has_split_dim]], dim = final_dims_fake_metadata)
if (inherits(picked_common_vars[[inner_dim_has_split_dim]], 'POSIXct')) {
metadata_tmp <- as.POSIXct(metadata_tmp, origin = "1970-01-01", tz = 'UTC')
}
attr(picked_common_vars[[inner_dim_has_split_dim]], 'variables') <- saved_reshaped_attr
}
}
gc()
# Load metadata and remove the metadata folder
if (!is.null(metadata_dims)) {
loaded_metadata_files <- list.files(metadata_folder)
if (!identical(loaded_metadata_files, character(0))) { # old code
loaded_metadata <- lapply(paste0(metadata_folder, '/', loaded_metadata_files), readRDS)
} else {
loaded_metadata <- NULL
}
unlink(metadata_folder, recursive = TRUE)
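# Each work piece saves its metadata as an RDS file named after its position
# (metadata_folder/1, metadata_folder/2, ...). A minimal sketch of that round
# trip, with invented values:
#   folder <- tempfile('metadata'); dir.create(folder)
#   saveRDS(list(units = 'K'), file = file.path(folder, '1'))   # done by a work piece
#   readRDS(file.path(folder, '1'))                              # done here, after loading
#   unlink(folder, recursive = TRUE)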
# Create a list of metadata of the variable (e.g., tas)
return_metadata <- create_metadata_list(array_of_metadata_flags, metadata_dims, pattern_dims,
loaded_metadata_files, loaded_metadata, dat_names,
dataset_has_files)
# TODO: Try to infer data type from loaded_metadata
# as.integer(data_array)
}
failed_pieces <- work_pieces[which(unlist(found_files))]
for (failed_piece in failed_pieces) {
array_of_not_found_files <- do.call('[<-',
c(list(array_of_not_found_files),
as.list(failed_piece[['file_indices_in_array_of_files']]),
list(value = TRUE)))
}
if (any(array_of_not_found_files)) {
for (i in 1:prod(dim(array_of_files_to_load))) {
if (is.na(array_of_not_found_files[i])) {
array_of_files_to_load[i] <- NA
} else {
if (array_of_not_found_files[i]) {
array_of_not_found_files[i] <- array_of_files_to_load[i]
array_of_files_to_load[i] <- NA
} else {
array_of_not_found_files[i] <- NA
}
}
}
} else {
array_of_not_found_files <- NULL
}
} # End if (retrieve)
else { # if retrieve = FALSE, metadata still needs to reshape
if (merge_across_dims & (split_multiselected_dims | merge_across_dims_narm)) {
if (!merge_across_dims_narm) {
tmp <- match(names(final_dims), names(dims_of_merge_dim))
if (any(diff(tmp[!is.na(tmp)]) < 0)) { #need to reorder
picked_common_vars[[across_inner_dim]] <- .aperm2(picked_common_vars[[across_inner_dim]], tmp[!is.na(tmp)])
}
metadata_tmp <- picked_common_vars[[across_inner_dim]]
} else {
tmp <- remove_additional_na_from_merge(
data_array = NULL,
merge_dim_metadata = picked_common_vars[[across_inner_dim]])
metadata_tmp <- tmp$merge_dim_metadata
}
stop(paste0("After reshaping, the metadata do not fit into the expected output dimension. ",
"Check if the reshaping parameters are used correctly or contact support."))
}
#NOTE: When one file contains values for discrete dimensions, rearranging the
# chunks (i.e., work_piece) is necessary.
if (split_multiselected_dims) {
tmp <- rebuild_array_merge_split(
data_array = NULL, metadata = metadata_tmp, indices_chunk,
all_split_dims, final_dims_fake, across_inner_dim, length_inner_across_dim)
metadata_tmp <- tmp$metadata
}
# If split_multiselected_dims + merge_across_dims, the dimension order may change above.
# To get the user-required dim order, we need to reorder the array again.
if (split_multiselected_dims) {
if (inner_dim_pos_in_split_dims != 1) {
correct_order <- match(names(final_dims_fake_output), names(final_dims_fake))
# data_array <- .aperm2(data_array, correct_order)
correct_order_metadata <- match(names(final_dims_fake_output), names(all_split_dims[[across_inner_dim]]))
metadata_tmp <- .aperm2(metadata_tmp, correct_order_metadata[!is.na(correct_order_metadata)])
}
}
# Convert numeric back to dates
if ('time' %in% synonims[[across_inner_dim]]) {
metadata_tmp <- as.POSIXct(metadata_tmp, origin = "1970-01-01", tz = 'UTC')
}
picked_common_vars[[across_inner_dim]] <- metadata_tmp
attr(picked_common_vars[[across_inner_dim]], 'variables') <- saved_reshaped_attr
} else { # ! (merge_across_dims + split_multiselected_dims) (old version)
if (merge_across_dims) {
# merge_across_dims = TRUE but (merge_across_dims_narm = F & split_multiselected_dims = F)
inner_dim_pos <- which(names(dims_of_merge_dim) == inner_dims_across_files)
file_dim_pos <- which(names(dims_of_merge_dim) == names(inner_dims_across_files))
if (file_dim_pos < inner_dim_pos) { #need to reorder
tmp <- seq(1, length(dims_of_merge_dim))
tmp[inner_dim_pos] <- file_dim_pos
tmp[file_dim_pos] <- inner_dim_pos
picked_common_vars[[across_inner_dim]] <- .aperm2(picked_common_vars[[across_inner_dim]], tmp)
}
metadata_tmp <- array(picked_common_vars[[across_inner_dim]], dim = final_dims_fake_metadata)
# Convert numeric back to dates
if ('time' %in% synonims[[across_inner_dim]]) {
metadata_tmp <- as.POSIXct(metadata_tmp, origin = "1970-01-01", tz = 'UTC')
}
picked_common_vars[[across_inner_dim]] <- metadata_tmp
attr(picked_common_vars[[across_inner_dim]], 'variables') <- saved_reshaped_attr
}
if (split_multiselected_dims & !is.null(picked_common_vars)) {
metadata_tmp <- array(picked_common_vars[[inner_dim_has_split_dim]], dim = final_dims_fake_metadata)
if (inherits(picked_common_vars[[inner_dim_has_split_dim]], 'POSIXct')) {
metadata_tmp <- as.POSIXct(metadata_tmp, origin = "1970-01-01", tz = 'UTC')
}
attr(picked_common_vars[[inner_dim_has_split_dim]], 'variables') <- saved_reshaped_attr
}
}
}
# Collect and print the unique warning messages gathered during the load/transform steps
if (!is.null(c(warnings1, warnings2, warnings3))) {
transform_warnings_list <- lapply(c(warnings1, warnings2, warnings3), function(x) {
x$message
})
transform_warnings_list <- unique(transform_warnings_list)
for (i in 1:length(transform_warnings_list)) {
.warning(transform_warnings_list[[i]])
}
}
# Change final_dims_fake back because retrieve = FALSE will use it for attributes later
if (exists("final_dims_fake_output")) {
final_dims_fake <- final_dims_fake_output
}
# Replace the vars and common vars by the transformed vars and common vars
for (i in 1:length(dat)) {
if (length(names(transformed_vars[[i]])) > 0) {
picked_vars[[i]][names(transformed_vars[[i]])] <- transformed_vars[[i]]
} else if (length(names(picked_vars_ordered[[i]])) > 0) {
picked_vars[[i]][names(picked_vars_ordered[[i]])] <- picked_vars_ordered[[i]]
}
}
if (length(names(transformed_common_vars)) > 0) {
picked_common_vars[names(transformed_common_vars)] <- transformed_common_vars
} else if (length(names(picked_common_vars_ordered)) > 0) {
picked_common_vars[names(picked_common_vars_ordered)] <- picked_common_vars_ordered
}
if (debug) {
print("-> THE TRANSFORMED VARS:")
print(str(transformed_vars))
print("-> THE PICKED VARS:")
print(str(picked_vars))
}
file_selectors <- NULL
for (i in 1:length(dat)) {
file_selectors[[dat[[i]][['name']]]] <- dat[[i]][['selectors']][which(names(dat[[i]][['selectors']]) %in% found_file_dims[[i]])]
}
if (retrieve) {
if (!silent) {
.message("Successfully retrieved data.")
}
if (all(sapply(return_metadata, is.null))) {
# We don't have metadata of the variable (e.g., tas). The returned metadata list only
# contains the variables specified in the argument "return_vars".
Variables_list <- c(list(common = picked_common_vars), picked_vars)
.warning(paste0("Metadata cannot be retrieved. The reason may be the ",
"non-existence of the first file. Use parameter 'metadata_dims'",
" to assign to file dimensions along which to return metadata, ",
"or check the existence of the first file."))
} else {
# Add the metadata of the variable (e.g., tas) into the list of picked_vars or
# picked_common_vars.
Variables_list <- combine_metadata_picked_vars(
return_metadata, picked_vars, picked_common_vars,
metadata_dims, pattern_dims, length(dat))
}
# Attach the variables, file lists and selectors as attributes of the returned array
attributes(data_array) <- c(attributes(data_array),
list(Variables = Variables_list,
Files = array_of_files_to_load,
NotFoundFiles = array_of_not_found_files,
FileSelectors = file_selectors,
ObjectBigmemory = name_bigmemory_obj) #attr(shared_matrix_pointer, 'description')$sharedName)
)
attr(data_array, 'class') <- c('startR_array', attr(data_array, 'class'))
data_array
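# Illustrative access to the retrieved object (assuming 'res <- Start(..., retrieve = TRUE)'):
#   dim(res)                              # final dimensions of the loaded data
#   str(attr(res, 'Variables')$common)    # metadata returned via return_vars / metadata_dims
#   attr(res, 'Files')                    # array of files that were expected/loaded
#   attr(res, 'NotFoundFiles')            # files that could not be found, if any
#   class(res)                            # "startR_array", ...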
} else {
if (!silent) {
.message("Successfully discovered data dimensions.")
}
start_call <- match.call()
for (i in 2:length(start_call)) {
if (class(start_call[[i]]) %in% c('name', 'call')) {
start_call[[i]] <- eval.parent(start_call[[i]])
}
}
start_call[['retrieve']] <- TRUE
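# When retrieve = FALSE, the object returned below is the Start() call itself
# (with its arguments evaluated and 'retrieve' forced to TRUE), decorated with
# the expected dimensions and metadata. A hedged sketch of typical usage:
#   hdr <- Start(..., retrieve = FALSE)   # declares the dataset, loads nothing
#   attr(hdr, 'Dimensions')               # expected dimensions of the full array
#   # The header is later passed to Step()/AddStep()/Compute(), which evaluate
#   # the stored call on each chunk; eval(hdr) would retrieve everything at once.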
attributes(start_call) <- c(attributes(start_call),
list(Dimensions = final_dims_fake,
Variables = c(list(common = picked_common_vars), picked_vars),
ExpectedFiles = array_of_files_to_load,
FileSelectors = file_selectors,
PatternDim = found_pattern_dim,
MergedDims = if (merge_across_dims) {
inner_dims_across_files
} else {
NULL
},
SplitDims = if (split_multiselected_dims) {
all_split_dims
} else {
NULL
})
)
attr(start_call, 'class') <- c('startR_cube', attr(start_call, 'class'))
start_call
}
}
# This function is responsible for loading the data of each work piece.
.LoadDataFile <- function(work_piece, shared_matrix_pointer,
file_data_reader, synonims,
transform, transform_params, transform_crop_domain = NULL,
silent = FALSE, debug = FALSE) {
#warning(attr(shared_matrix_pointer, 'description')$sharedName)
# suppressPackageStartupMessages({library(bigmemory)})
### TODO: Specify dependencies as parameter
# suppressPackageStartupMessages({library(ncdf4)})
#print("1")
store_indices <- as.list(work_piece[['store_position']])
first_round_indices <- work_piece[['first_round_indices']]
second_round_indices <- work_piece[['second_round_indices']]
#print("2")
file_to_open <- work_piece[['file_path']]
sub_array <- file_data_reader(file_to_open, NULL,
work_piece[['file_selectors']],
first_round_indices, synonims)
if (debug) {
if (all(unlist(store_indices[1:6]) == 1)) {
print("-> LOADING A WORK PIECE")
print("-> STRUCTURE OF READ UNTRANSFORMED DATA:")
print(str(sub_array))
print("-> STRUCTURE OF VARIABLES TO TRANSFORM:")
print(str(work_piece[['vars_to_transform']]))
print("-> COMMON ARRAY DIMENSIONS:")
print(str(work_piece[['store_dims']]))
}
}
if (!is.null(sub_array)) {
# Apply data transformation once we have the data arrays.
if (!is.null(transform)) {
if (debug) {
if (all(unlist(store_indices[1:6]) == 1)) {
print("-> PROCEEDING TO TRANSFORM ARRAY")
print("-> DIMENSIONS OF ARRAY RIGHT BEFORE TRANSFORMING:")
print(dim(sub_array))
}
}
sub_array <- do.call(transform, c(list(data_array = sub_array,
variables = work_piece[['vars_to_transform']],
file_selectors = work_piece[['file_selectors']],
crop_domain = transform_crop_domain),
transform_params))
if (debug) {
if (all(unlist(store_indices[1:6]) == 1)) {
print("-> STRUCTURE OF ARRAY AND VARIABLES RIGHT AFTER TRANSFORMING:")
print(str(sub_array))
print("-> DIMENSIONS OF ARRAY RIGHT AFTER TRANSFORMING:")
print(dim(sub_array$data_array))
}
}
sub_array <- sub_array$data_array
# Subset with second round of indices
dims_to_crop <- which(!sapply(second_round_indices, is.null))
if (length(dims_to_crop) > 0) {
dimnames_to_crop <- names(second_round_indices)[dims_to_crop]
sub_array <- ClimProjDiags::Subset(sub_array, dimnames_to_crop,
second_round_indices[dimnames_to_crop])
}
if (debug) {
if (all(unlist(store_indices[1:6]) == 1)) {
print("-> STRUCTURE OF ARRAY AND VARIABLES RIGHT AFTER SUBSETTING WITH 2nd ROUND INDICES:")
print(str(sub_array))
}
}
}
metadata <- attr(sub_array, 'variables')
names_bk <- names(store_indices)
store_indices <- lapply(names(store_indices),
function (x) {
if (!(x %in% names(first_round_indices))) {
store_indices[[x]]
} else if (is.null(second_round_indices[[x]])) {
1:dim(sub_array)[x]
} else {
if (is.numeric(second_round_indices[[x]])) {
## TODO: Review carefully this line. Inner indices are all
## aligned to the left-most positions. If dataset A has longitudes
## 1, 2, 3, 4 but dataset B has only longitudes 3 and 4, then
## they will be stored as follows:
## 1, 2, 3, 4
## 3, 4, NA, NA
##x - min(x) + 1
1:length(second_round_indices[[x]])
} else {
1:length(second_round_indices[[x]])
}
}
})
names(store_indices) <- names_bk
if (debug) {
if (all(unlist(store_indices) == 1)) {
print("-> STRUCTURE OF FIRST ROUND INDICES FOR THIS WORK PIECE:")
print(str(first_round_indices))
print("-> STRUCTURE OF SECOND ROUND INDICES FOR THIS WORK PIECE:")
print(str(second_round_indices))
print("-> STRUCTURE OF STORE INDICES FOR THIS WORK PIECE:")
print(str(store_indices))
}
}
store_indices <- lapply(store_indices, as.integer)
store_dims <- work_piece[['store_dims']]
# split the storage work of the loaded subset in parts
largest_dim_name <- names(dim(sub_array))[which.max(dim(sub_array))]
max_parts <- length(store_indices[[largest_dim_name]])
# Indexing a data file of N MB with expand.grid takes 30*N MB.
# The peak RAM of Start() is, at minimum, 2 * the total data to load from all files,
# due to inefficiencies in other regions of the code.
# The more parts we split the indexing done below into, the lower
# the memory footprint of the indexing.
# But more than 10 indexing iterations (parts) for each MB processed
# makes the iteration slower (tested empirically on BSC workstations).
subset_size_in_mb <- prod(dim(sub_array)) * 8 / 1024 / 1024
best_n_parts <- ceiling(subset_size_in_mb * 10)
# We want to set n_parts to a greater value than the one that would
# result in a memory footprint (of the subset indexing code below) equal
# to 2 * total data to load from all files.
# s = subset size in MB
# p = number of parts to break it in
# T = total size of data to load
# then, s / p * 30 = 2 * T
# then, p = s * 15 / T
min_n_parts <- ceiling(prod(dim(sub_array)) * 15 / prod(store_dims))
# Make sure we pick n_parts much greater than the minimum calculated
n_parts <- min_n_parts * 10
if (n_parts > best_n_parts) {
n_parts <- best_n_parts
}
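# Worked example with invented sizes: for a 2 MB subset (s = 2) taken from 100 MB
# of total data to load (T = 100):
#   best_n_parts = ceiling(2 * 10)        # 20 (upper bound, keeps each iteration cheap)
#   min_n_parts  = ceiling(2 * 15 / 100)  # 1  (lower bound from the 2 * T memory target)
#   n_parts      = 1 * 10                 # 10, not capped since 10 <= best_n_parts,
#                                         # then clipped to the range [1, max_parts]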
# Boundary checks
if (n_parts < 1) {
n_parts <- 1
}
if (n_parts > max_parts) {
n_parts <- max_parts
}
if (n_parts > 1) {
make_parts <- function(length, n) {
clusters <- cut(1:length, n, labels = FALSE)
lapply(1:n, function(y) which(clusters == y))
}
part_indices <- make_parts(max_parts, n_parts)
parts <- lapply(part_indices,
function(x) {
store_indices[[largest_dim_name]][x]
})
} else {
part_indices <- list(1:max_parts)
parts <- store_indices[largest_dim_name]
}
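# Small illustration (not executed) of how the part indices are built with cut():
#   clusters <- cut(1:10, 3, labels = FALSE)   # 1 1 1 1 2 2 2 3 3 3
#   lapply(1:3, function(y) which(clusters == y))
#   # [[1]] 1 2 3 4    [[2]] 5 6 7    [[3]] 8 9 10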
# do the storage work
weights <- sapply(1:length(store_dims),
function(i) prod(c(1, store_dims)[1:i]))
part_indices_in_sub_array <- as.list(rep(TRUE, length(dim(sub_array))))
names(part_indices_in_sub_array) <- names(dim(sub_array))
data_array <- bigmemory::attach.big.matrix(shared_matrix_pointer)
for (i in 1:n_parts) {
store_indices[[largest_dim_name]] <- parts[[i]]
# Converting array indices to vector indices
matrix_indices <- do.call("expand.grid", store_indices)
# Given a matrix where each row is a set of array indices of an element
# the vector indices are computed
matrix_indices <- 1 + colSums(t(matrix_indices - 1) * weights)
part_indices_in_sub_array[[largest_dim_name]] <- part_indices[[i]]
data_array[matrix_indices] <- as.vector(do.call('[',
c(list(x = sub_array),
part_indices_in_sub_array)))
}
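# Worked example (invented 3 x 4 store array) of the array-to-vector index
# conversion performed in the loop above:
#   store_dims <- c(3, 4)
#   weights <- sapply(1:2, function(i) prod(c(1, store_dims)[1:i]))   # 1 3
#   idx <- expand.grid(list(2:3, 2))        # rows (2,2) and (3,2)
#   1 + colSums(t(idx - 1) * weights)       # 5 6, i.e. positions in as.vector() order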
rm(data_array)
gc()
if (!is.null(work_piece[['save_metadata_in']])) {
saveRDS(metadata, file = work_piece[['save_metadata_in']])
}
}
if (!is.null(work_piece[['progress_amount']]) && !silent) {
message(work_piece[['progress_amount']], appendLF = FALSE)
}
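# The value returned below is TRUE when the file could not be read (sub_array is
# NULL), so the caller can flag the corresponding entry in array_of_not_found_files.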
is.null(sub_array)
}