This document intends to be the first reference for any doubts that you may have regarding startR. If you do not find the information you need, please open an issue for your problem.
1. [Choose the number of chunks/jobs/cores in Compute()](#1-choose-the-number-of-chunksjobscores-in-compute)
2. [Indicate dependent dimension and use merge parameters in Start()](#2-indicate-dependent-dimension-and-use-merge-parameters-in-start)
3. [Use self-defined function in Compute()](#3-use-self-defined-function-in-compute)
4. [Use package function in Compute()](#4-use-package-function-in-compute)
5. [Do interpolation in Start() (using parameter 'transform')](#5-do-interpolation-in-start-using-parameter-transform)
6. [Get data attributes without retrieving data to workstation](#6-get-data-attributes-without-retrieving-data-to-workstation)
7. [Avoid or specify a node from cluster in Compute()](#7-avoid-or-specify-a-node-from-cluster-in-compute)
8. [Define a path with multiple dependencies](#8-define-a-path-with-multiple-dependencies)
9. [Use CDORemap() in function](#9-use-cdoremap-in-function)
10. [The number of members depends on the start date](#10-the-number-of-members-depends-on-the-start-date)
11. [Select the longitude/latitude region](#11-select-the-longitudelatitude-region)
12. [What will happen if reorder function is not used](#12-what-will-happen-if-reorder-function-is-not-used)
13. [Load specific grid points data](#13-load-specific-grid-points-data)
14. [Find the error log when jobs are launched on Power9](#14-find-the-error-log-when-jobs-are-launched-on-power9)
15. [Specify extra function arguments in the workflow](#15-specify-extra-function-arguments-in-the-workflow)
16. [Use parameter 'return_vars' in Start()](#16-use-parameter-return_vars-in-start)
17. [Use parameter 'split_multiselected_dims' in Start()](#17-use-parameter-split_multiselected_dims-in-start)
18. [Use glob expression '*' to define the file path](#18-use-glob-expression-to-define-the-file-path)
19. [Get metadata when the first file does not exist](#19-get-metadata-when-the-first-file-does-not-exist)
20. [Use 'metadata_dims' to retrieve variable metadata](#20-use-metadata_dims-to-retrieve-variable-metadata)
21. [Retrieve the complete data when the dimension length varies among files](#21-retrieve-the-complete-data-when-the-dimension-length-varies-among-files)
22. [Define the selector when the indices in the files are not aligned](#22-define-the-selector-when-the-indices-in-the-files-are-not-aligned)
23. [The best practice of using vector and list for selectors](#23-the-best-practice-of-using-vector-and-list-for-selectors)
24. [Do both interpolation and chunking on spatial dimensions](#24-do-both-interpolation-and-chunking-on-spatial-dimensions)
25. [What to do if your function has too many target dimensions](#25-what-to-do-if-your-function-has-too-many-target-dimensions)
26. [Use merge_across_dims_narm to remove NAs](#26-use-merge_across_dims_narm-to-remove-nas)
27. [Utilize chunk number in the function](#27-utilize-chunk-number-in-the-function)
28. [Run startR in the background](#28-run-startr-in-the-background)
29. [Collect result on HPCs](#29-collect-result-on-hpcs)
1. [No space left on device](#1-no-space-left-on-device)
2. [ecFlow UI remains blue and does not update status](#2-ecflow-ui-remains-blue-and-does-not-update-status)
3. [Compute() successfully but then killed on R session](#3-compute-successfully-but-then-killed-on-r-session)
4. [My jobs work well in workstation and fatnodes but not on Power9 (or vice versa)](#4-my-jobs-work-well-in-workstation-and-fatnodes-but-not-on-power9-or-vice-versa)
5. [Errors related to wrong file formatting](#5-errors-related-to-wrong-file-formatting)
6. [Errors using a new cluster (setting Nord3)](#6-errors-using-a-new-cluster-setting-nord3)
7. [Start() fails retrieving data](#7-start-fails-retrieving-data)
8. [Error 'caught segfault' when job submitted to HPCs](#8-error-caught-segfault-when-job-submitted-to-hpcs)
## 1. How to
### 1. Choose the number of chunks/jobs/cores in Compute()
Run the Start() call first to see the total size of the data you are reading in (remember to set `retrieve = FALSE`).
Divide the data into chunks according to the size of the machine's memory module (32 GB on Power9; 8 GB on MN4). The data size per chunk should be 1/3 to 1/2 of the memory module.
Find more details in practical_guide.md, [How to choose the number of chunks, jobs and cores](inst/doc/practical_guide.md#how-to-choose-the-number-of-chunks-jobs-and-cores).
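As a rough, illustrative sketch of this arithmetic (all the numbers below are made up for the example):
```r
# Total data size reported by Start(..., retrieve = FALSE), e.g., 96 GB
total_size_gb <- 96
memory_module_gb <- 32                    # e.g., one Power9 memory module
target_chunk_gb <- memory_module_gb / 3   # aim at 1/3 to 1/2 of the module
n_chunks <- ceiling(total_size_gb / target_chunk_gb)
n_chunks
#[1] 9
# Then distribute the chunks over the chunking dimensions in Compute(), e.g.:
# Compute(wf, chunks = list(latitude = 3, longitude = 3), ...)
```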
### 2. Indicate dependent dimension and use merge parameters in Start()
The parameter `'xxx_across = yyy'` indicates that the inner dimension 'xxx' is continuous along the file dimension 'yyy'.
A common example is 'time_across = chunk', when the experiment runs through many years
and the result is saved in several chunk files.
If you indicate this dependency, you can specify 'xxx' with indices running
throughout all the 'yyy' files, instead of only within one file. See Example 1 below:
'time = indices(1:24)' is available when 'time_across = chunk' is specified; otherwise, 'time' can be at most 12 (the length within one file).
One example that takes advantage of 'xxx_across' is extracting a climate event spanning several years, like El Niño.
If the event lasts from Nov 2014 to May 2016 (19 months in total), simply specify 'time = indices(11:29)' (Example 2).
The thing you should bear in mind when using this parameter is the returned data structure.
First, **the length of the returned xxx dimension is the length of the longest xxx among all files**.
Take the El Niño case above as an example. The first chunk has 2 months, the second chunk has 12 months,
and the third chunk has 5 months. Therefore, the length of the time dimension will be 12, and the length of the chunk dimension will be 3.
Second, Start() stores the data by **putting them at the left-most positions**.
Take the El Niño case (Example 2) as an example again. The first chunk has only 2 months,
so positions 1 and 2 have values (Nov and Dec 2014) while positions 3 to 12 are NA. The second chunk has 12 months,
so all positions have values (Jan to Dec 2015).
The third chunk has 5 months, so positions 1 to 5 have values (Jan to May 2016) while positions 6 to 12 are NA.
It may seem more reasonable to put the NAs at positions 1 to 10 in the first chunk (Jan to Oct 2014)
and at positions 6 to 12 in the third chunk (Jun to Dec 2016). But if the data are not continuous or are picked irregularly,
it is hard to judge the correct NA positions (see Example 3).
Since Start() is very flexible with all the possible ways to read in data, it is difficult to cover
all the possibilities and make the output data structure reasonable all the time.
Therefore, it is recommended to understand how Start() works first,
so you know what to expect from the output and will not get confused by what it returns.
Now that we understand the 'across' relationship between dimensions, we can talk about how to merge them, using the parameters `merge_across_dims` and `merge_across_dims_narm`.
See Example 1: if `merge_across_dims = TRUE`, the chunk dimension disappears.
`merge_across_dims` simply attaches the data one after another, so the NA values (if any) stay at the same positions as in the unmerged array (see Example 2).
If you want to remove those additional NAs, you can use `merge_across_dims_narm = TRUE`;
the NAs will then be removed when merging into one dimension (see Example 2). To know more about `merge_across_dims_narm`, check [How-to-26](#26-use-merge_across_dims_narm-to-remove-nas).
You can find more use cases at [ex1_2_exp_obs_attr.R](inst/doc/usecase/ex1_2_exp_obs_attr.R) and [ex1_3_attr_loadin.R](inst/doc/usecase/ex1_3_attr_loadin.R).
Example 1
```r
data <- Start(dat = repos,
var = 'tas',
time = indices(1:24), # each file has 12 months; read 24 months in total
chunk = indices(1:2), #two years, each with 12 months
lat = 'all',
lon = 'all',
time_across = 'chunk',
merge_across_dims = FALSE, #TRUE,
return_vars = list(lat = NULL, lon = NULL),
retrieve = TRUE)
#return dimension (merge_across_dims = FALSE)
dat var time chunk lat lon
1 1 12 2 256 512
#return dimension (merge_across_dims = TRUE)
dat var time lat lon
1 1 24 256 512
```
Example 2: El Niño event
```r
repos <- '/esarchive/exp/ecearth/a1tr/cmorfiles/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$memb$/Omon/$var$/gr/v20190312/$var$_Omon_EC-Earth3_historical_$memb$_gr_$chunk$.nc'
data <- Start(dat = repos,
var = 'tos',
memb = 'r24i1p1f1',
time = indices(4:27), # Apr 1957 to Mar 1959
chunk = c('195701-195712', '195801-195812', '195901-195912'),
lat = 'all',
lon = 'all',
time_across = 'chunk',
merge_across_dims = FALSE,
return_vars = list(lat = NULL, lon = NULL),
retrieve = TRUE)
> dim(data)
dat var memb time chunk lat lon
1 1 1 12 3 256 512
> data[1,1,1,,,100,100]
[,1] [,2] [,3]
[1,] 300.7398 300.7659 301.7128
[2,] 299.6569 301.8241 301.4781
[3,] 298.3954 301.6472 301.3807
[4,] 297.1931 301.0621 NA
[5,] 295.9608 299.1324 NA
[6,] 295.4735 297.4028 NA
[7,] 295.8538 296.1619 NA
[8,] 297.9998 295.2794 NA
[9,] 299.4571 295.0474 NA
[10,] NA 295.4571 NA
[11,] NA 296.8002 NA
[12,] NA 299.0254 NA
#To move the NAs in the first year to Jan to Mar
> asd <- Subset(data, c(5), list(1))
> qwe <- asd[, , , c(10:12, 1:9), , ,]
> data[, , , , 1, ,] <- qwe
> data[1, 1, 1, , , 100, 100]
[,1] [,2] [,3]
[1,] NA 300.7659 301.7128
[2,] NA 301.8241 301.4781
[3,] NA 301.6472 301.3807
[4,] 300.7398 301.0621 NA
[5,] 299.6569 299.1324 NA
[6,] 298.3954 297.4028 NA
[7,] 297.1931 296.1619 NA
[8,] 295.9608 295.2794 NA
[9,] 295.4735 295.0474 NA
[10,] 295.8538 295.4571 NA
[11,] 297.9998 296.8002 NA
[12,] 299.4571 299.0254 NA
# use merge parameters
data <- Start(dat = repos,
var = 'tos',
memb = 'r24i1p1f1',
time = indices(4:27), # Apr 1957 to Mar 1959
chunk = c('195701-195712', '195801-195812', '195901-195912'),
lat = 'all',
lon = 'all',
time_across = 'chunk',
merge_across_dims = TRUE,
merge_across_dims_narm = TRUE,
return_vars = list(lat = NULL, lon = NULL),
retrieve = TRUE)
data[1,1,1,,100,100]
[1] 300.7398 299.6569 298.3954 297.1931 295.9608 295.4735 295.8538 297.9998
[9] 299.4571 300.7659 301.8241 301.6472 301.0621 299.1324 297.4028 296.1619
[17] 295.2794 295.0474 295.4571 296.8002 299.0254 301.7128 301.4781 301.3807
```
Example 3: Read in three winters (DJF)
```r
repos <- '/esarchive/exp/ecearth/a1tr/cmorfiles/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$memb$/Omon/$var$/gr/v20190312/$var$_Omon_EC-Earth3_historical_$memb$_gr_$chunk$.nc'
data <- Start(dat = repos,
var = 'tos',
memb = 'r24i1p1f1',
time = c(12:14, 24:26, 36:38), # DJF, Dec 1999 to Feb 2002
chunk = c('199901-199912', '200001-200012', '200101-200112', '200201-200212'),
lat = 'all',
lon = 'all',
time_across = 'chunk',
merge_across_dims = TRUE,
return_vars = list(lat = NULL, lon = NULL),
retrieve = TRUE)
> dim(data)
dat var memb time lat lon
1 1 1 12 256 512
> data[1, 1, 1, , 100, 100]
[1] 300.0381 NA NA 301.3340 302.0320 300.3575 301.0930 301.4149
[9] 299.3486 300.7203 301.6695 NA
#Remove NAs and rearrange DJF
> qwe <- Subset(data, c(4), list(c(1, 4:11)))
> zxc <- InsertDim(InsertDim(qwe, 5, 3), 6, 3)
> zxc <- Subset(zxc, 'time', list(1), drop = 'selected')
> zxc[, , , 1:3, 1, ,] <- qwe[, , , 1:3, ,]
> zxc[, , , 1:3, 2, ,] <- qwe[, , , 4:6, ,]
> zxc[, , , 1:3, 3, ,] <- qwe[, , , 7:9, ,]
> names(dim(zxc))[4] <- c('month')
> names(dim(zxc))[5] <- c('year')
> dim(zxc)
dat var memb month year lat lon
1 1 1 3 3 256 512
> zxc[1, 1, 1, , , 100, 100]
[,1] [,2] [,3]
[1,] 300.0381 300.3575 299.3486
[2,] 301.3340 301.0930 300.7203
[3,] 302.0320 301.4149 301.6695
```
### 3. Use self-defined function in Compute()
The workflow of Compute() is: 'define the function' -> 'use Step() to assign the target/output dimensions' -> 'use AddStep() to build up the workflow' -> 'use Compute() to launch the jobs either on the local workstation or on fatnodes/Power9'.
There is no problem when you only have a simple function defined directly in your script (like the example in the [practical guide](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#step-and-addstep)). However, if the function is more complicated, you may want to save it as an independent file. In this case, the machines (Power9 or fatnodes) cannot recognize your function, so the jobs will fail (if you use Compute() on your own local workstation, this problem does not exist).
The solution is simple. First, put your function file somewhere on the machine; for example, on Power9, put own_func.R under `/esarchive/scratch/<your_user_name>`. Second, in the script, source the file inside the function definition (see the example below). This way, the machine can find your function.
```r
data <- Start(...,
retrieve = FALSE)
func <- function(x) {
source("/esarchive/scratch/aho/own_func.R") #the path in Power 9
y <- own_func(x, posdim = 'time')
return(y)
}
step <- Step(fun = func,
target_dims = c('time'),
output_dims = c('time'))
wf <- AddStep(data, step)
res <- Compute(wf, ...)
```
### 4. Use package function in Compute()
In the workflow for Compute(), the first step is to define the function. If you want to use a function from a certain R package, you need to check whether the package is included in the R module (`r_module`) or the library (`lib_dir`). Then, specify the package name before the function name (see the example below) so the machine can recognize which function you refer to.
```r
data <- Start(...,
retrieve = FALSE)
func <- function(x) {
y <- s2dv::Season(x, posdim = 'time') #specify package name
return(y)
}
step <- Step(fun = func,
target_dims = c('time'),
output_dims = c('time'))
wf <- AddStep(data, step)
res <- Compute(wf,
chunks = list(latitude = 2,
longitude = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = 'p1', #your alias for power9
queue_type = 'slurm',
temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/', #s2dv is involved here, so the machine can find Season()
r_module = 'startR/0.1.2-foss-2018b-R-3.5.0',
job_wallclock = '00:10:00',
cores_per_job = 4,
max_jobs = 4,
bidirectional = FALSE,
polling_period = 50
),
ecflow_suite_dir = '/home/Earth/aho/startR_local/',
wait = TRUE
)
```
### 5. Do interpolation in Start() (using parameter 'transform')
If you want to do the interpolation within Start(), you can use the following four parameters:
1. **`transform`**: Assign the interpolation function. It is recommended to use `startR::CDORemapper`, the wrapper function of s2dv::CDORemap().
2. **`transform_params`**: A list of the required inputs for `transform`. Take `transform = CDORemapper` as an example, the common items are:
- `grid`: A character string specifying either a name of a target grid (recognized by CDO, e.g., 'r256x128', 't106grid') or a path to another NetCDF file with the target grid (a single grid must be defined in such file).
- `method`: A character string specifying an interpolation method (recognized by CDO, e.g., 'con', 'bil', 'bic', 'dis'). The following long names are also supported: 'conservative', 'bilinear', 'bicubic', and 'distance-weighted'.
- `crop`: Whether to crop the data after interpolation with 'cdo sellonlatbox' (TRUE) or to extend interpolated data to the whole region as CDO does by default (FALSE).
If crop = TRUE, the longitude and latitude borders to be cropped at are taken as the limits of the cells at the borders ('lons' and 'lats' are perceived as cell centers), i.e., the resulting array will contain data that covers the same area as the input array. This is equivalent to specifying crop = 'preserve', i.e., preserving area.
If crop = 'tight', the borders to be cropped at are taken as the minimum and maximum cell centers in 'lons' and 'lats', i.e., the area covered by the resulting array may be smaller if interpolating from a coarse grid to a fine grid.
The parameter 'crop' also accepts a numeric vector of custom borders: c(western border, eastern border, southern border, northern border).
3. **`transform_vars`**: A character vector of the inner dimensions to be transformed. E.g., c('latitude', 'longitude').
4. **`transform_extra_cells`**: A numeric value indicating the number of grid cells to extend from the borders if the interpolation region is a subset of the whole region. The default is 2, which is consistent with the method in s2dv::Load().
You can find an example script here: [ex1_1_tranform.R](/inst/doc/usecase/ex1_1_tranform.R).
You can find more information in the s2dv::CDORemap documentation [here](https://earth.bsc.es/gitlab/es/s2dv/blob/master/man/CDORemap.Rd).
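As a minimal sketch putting these four parameters together (the path pattern `repos`, the variable, and the selector values are illustrative):
```r
data <- Start(dat = repos,    # a path pattern with $var$ and $sdate$ tags
              var = 'tas',
              sdate = '20170101',
              time = 'all',
              latitude = values(list(20, 40)),
              latitude_reorder = Sort(),
              longitude = values(list(0, 30)),
              longitude_reorder = CircularSort(0, 360),
              transform = CDORemapper,                          # wrapper of s2dv::CDORemap()
              transform_params = list(grid = 'r360x181',        # target grid recognized by CDO
                                      method = 'con',           # conservative interpolation
                                      crop = c(0, 30, 20, 40)), # W, E, S, N borders
              transform_vars = c('latitude', 'longitude'),
              transform_extra_cells = 2,
              return_vars = list(latitude = NULL, longitude = NULL),
              retrieve = FALSE)
```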
### 6. Get data attributes without retrieving data to workstation
One of the most useful functionalities of Start() is the parameter `retrieve = FALSE`. It creates a pointer to the data repository and gives you the data information without occupying your workstation's memory. Better still, even though the data are not actually retrieved, you can still use their attributes:
```r
header <- Start(dat = repos,
...,
retrieve = FALSE)
class(header)
#[1] "startR_cube"
# check attributes
str(attr(header, 'Variables'))
# Get longitude and latitude
lons <- attr(header, 'Variables')$common$lon
lats <- attr(header, 'Variables')$common$lat
# Get dimension
dim <- attr(header, 'Dimensions')
```
And if you want to retrieve the data to the workstation afterward, you can use `eval()`:
```r
data <- eval(header)
class(data)
#[1] "startR_array"
# Get dimension
dim(data)
```
Find examples at [usecase.md](/inst/doc/usecase.md), ex1_1 and ex1_3.
### 7. Avoid or specify a node from cluster in Compute()
When submitting a job to the fatnodes using Compute(), the parameter 'extra_queue_params' can be used to restrict the job to run on a specific node, as follows:
```
extra_queue_params = list('#SBATCH -w moore'),
```
or to exclude a specific node from the job:
```
extra_queue_params = list('#SBATCH -x moore'),
```
Look at the position of `extra_queue_params` parameter in a full call of Compute:
```
res <- Compute(wf1,
chunks = list(ensemble = 20,
sdate = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = queue_host,
queue_type = 'slurm',
extra_queue_params = list('#SBATCH -x moore'),
cores_per_job = 2,
temp_dir = temp_dir,
r_module = 'R/3.5.0-foss-2018b',
polling_period = 10,
job_wallclock = '01:00:00',
max_jobs = 40,
bidirectional = FALSE),
ecflow_suite_dir = ecflow_suite_dir,
wait = TRUE)
```
### 8. Define a path with multiple dependencies
The structure of the BSC Earth data repository 'esarchive' allows us to create a path pattern to the data by using different variables
(between dollar symbols), such as `$var$` for the variable name, or `$sdate$` for the start date of the simulation. We call these variables 'file dimensions'.
Here is an example for loading monthly simulations of system4_m1 data:
`path <- '/esarchive/exp/ecmwf/system4_m1/monthly_mean/$var$_f6h/$var$_$sdate$.nc'`
The function Start() will require the two parameters 'var' and 'sdate' to load the desired data.
In some cases, the file dimensions have a dependency relationship. Some researchers create their own EC-Earth experiments, which are identified by an experiment ID (`$expid$`) and have different members (`$member$`):
| expid | member |
|-------|----------|
| a1st | r7i1p1f1 |
| a1sx |r10i1p1f1 |
In this case, 'member' has a different value under each 'expid'. Therefore, the parameter `member_depends = 'expid'` needs to be used in Start().
However, in some other cases, the creation of the path can be more complicated. For example, the experiment ID (`$expid$`) can have different members (`$member$`) and even different model versions (`$version$`):
| expid | member | version |
|-------|----------|---------|
| a1st | r7i1p1f1 |v20190302|
| a1sx |r10i1p1f1 |v20190308|
In this case, the variables member and version have different values depending on the expid (the member r10i1p1f1 and the version v20190302 do not exist for expid a1st). The path will include these variables:
`path <- '/esarchive/exp/ecearth/$expid$/diags/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$member$/Omon/$var$/gn/$version$/$var$_Omon_EC-Earth3_historical_$member$_gn_$year$.nc'`
The current Start() cannot deal with multiple dependencies. However, for this case, there is a workaround: the following parameters can be added to Start():
```
member_depends = 'expid',
version_depends = 'expid',
member_depends = 'version',
version_depends = 'member',
```
The final Start() call will look like:
```r
yrh1 = 1960
yrh2 = 2014
years <- paste0(c(yrh1 : yrh2), '01-', c(yrh1 : yrh2), '12')
data <- Start(dat = repos,
var = 'tosmean',
expid = c('a1st','a1sx'),
member = 'all',
version = 'all',
member_depends = 'expid',
version_depends = 'expid',
member_depends = 'version',
version_depends = 'member',
return_vars = list(time = NULL, region = NULL),
retrieve = TRUE)
```
### 9. Use CDORemap() in function
If you want to interpolate data with s2dv::CDORemap inside your function, you need to tell the
machine which CDO module to use. Therefore, `CDO_module = 'CDO/1.9.5-foss-2018b'` should be
added to the cluster list in Compute(). See the example in use case [ex2_3](inst/doc/usecase/ex2_3_cdo.R).
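For reference, a sketch of a Compute() call with the CDO module added to the cluster list (host alias, paths, and module versions are illustrative):
```r
res <- Compute(wf,
               chunks = list(latitude = 2, longitude = 2),
               threads_load = 2,
               threads_compute = 4,
               cluster = list(queue_host = 'p1',          # your alias for Power9
                              queue_type = 'slurm',
                              temp_dir = '/gpfs/scratch/bsc32/<your_user>/startR_hpc/',
                              r_module = 'startR/0.1.2-foss-2018b-R-3.5.0',
                              CDO_module = 'CDO/1.9.5-foss-2018b',  # needed because the function calls s2dv::CDORemap()
                              job_wallclock = '00:10:00',
                              cores_per_job = 4,
                              max_jobs = 4,
                              bidirectional = FALSE,
                              polling_period = 50),
               ecflow_suite_dir = '/home/Earth/<your_user>/startR_local/',
               wait = TRUE)
```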
### 10. The number of members depends on the start date
In seasonal forecast, some start dates, such as November 1st, are more widely used than others. For those start dates extensively used, the number of members available is greater than for other start dates. This is the case of the seasonal forecast system ECMWF SEAS5 (system5_m1):
- for the start date November 1st, 1999, there are 51 members available, while
- for the start date September 1st, 1999, there are 25 members available.
When trying to load both start dates at once using Start(), the order in which the start dates are specified will impact the dimensions of the dataset if all members are loaded with `member = 'all'`:
- with `sdates = c('19991101', '19990901')`, the member dimension will have length 51, showing missing values for members 26 to 51 of the second start date;
- with `sdates = c('19990901', '19991101')`, the member dimension will have length 25, and members 26 to 51 of the second start date will be lost.
To ensure that all the members are retrieved, we can use parameter `largest_dims_length`. See [FAQ 21](https://earth.bsc.es/gitlab/es/startR/-/blob/master/inst/doc/faq.md#21-retrieve-the-complete-data-when-the-dimension-length-varies-among-files) for details.
The code to reproduce this behaviour could be found in the usecase [ex1_4](/inst/doc/usecase/ex1_4_variable_nmember.R).
### 11. Select the longitude/latitude region
There are three ways to specify the dimension selectors: special keywords ('all', 'first', 'last'), indices, or values (find more details in the [practical guide](inst/doc/practical_guide.md)).
For now, the parameter 'xxx_reorder' is only effective when using **values**.
There are two reorder functions in the startR package: **Sort()** for latitude and **CircularSort()** for longitude.
Sort() is a wrapper of the base function sort(), rearranging the values from low to high (decreasing = FALSE, the default) or
from high to low (decreasing = TRUE). For example, if you want to sort latitude from 90 to -90, use `latitude_reorder = Sort(decreasing = TRUE)`.
By this means, the result will always go from large to small values, no matter what the original order is.
On the other hand, the concept of CircularSort() is different. It is used for a circular region, putting the out-of-region values back into the region.
It requires two input numbers defining the borders of the whole region, usually [0, 360] or [-180, 180]. For example,
`longitude_reorder = CircularSort(0, 360)` means that the left border is 0 and the right border is 360, so 360 will be put back to 0, 361 will be put back to 1,
and -1 will become 359. After circulating the values, CircularSort() also sorts them from small to large. This may produce a discontinuous sub-region,
but the problem can be solved by assigning the borders correctly.
Note that the two border points in CircularSort() are regarded as the same point. Hence, if you want to load the global longitude, lonmin/lonmax should be slightly different, e.g., 0/359.9, 0.1/360, -179.9/180, -180/179.9, etc. Otherwise, only one point will be returned.
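For example, a sketch of a Start() call using both reorder functions (the path pattern and selector values are illustrative):
```r
data <- Start(dat = repos,
              var = 'tas',
              sdate = '20170101',
              time = 'all',
              latitude = values(list(-30, 30)),
              latitude_reorder = Sort(decreasing = TRUE),   # returned from 30 down to -30
              longitude = values(list(-60, 60)),
              longitude_reorder = CircularSort(-180, 180),  # consistent with the requested borders
              return_vars = list(latitude = NULL, longitude = NULL),
              retrieve = FALSE)
```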
The following chart helps you to decide how to use CircularSort() to get the desired region.
The first row represents the longitude border of the requested region, e.g., `values(list(lon.min, lon.max))`,
and the white part is the returned longitude range corresponding to each CircularSort() setting.
Here are some summaries:
- The original longitude range does not matter. Whether the original longitude is [0, 360] or [-180, 180], Start() will return the values shown in the chart according to the lonmin/lonmax you set.
- The lonmin/lonmax values should be consistent with CircularSort(), so the returned values are continuous. For example, if `lonmin/lonmax = -60/60`, `CircularSort(-180, 180)` should be used.
- Define the longitude range as the one you want to get, regardless of the original file. For example, if you want the data to be [-180, 180], define `lonmin/lonmax = -179.9/180` and `CircularSort(-180, 180)`, even if the original longitude in the netCDF file is [0, 360].

Note that this chart only illustrates the idea. The real numbers may differ slightly depending on the original/transformed values.
<img src="inst/doc/figures/lon-2.PNG" width="1000" />
Find the usecases here [ex1_5_latlon_reorder.R](inst/doc/usecase/ex1_5_latlon_reorder.R)
### 12. What will happen if reorder function is not used
It is always recommended to use the reorder functions (i.e., Sort() and CircularSort()) in Start(), so you can ensure the result is in line
with your expectation (find more details at [how-to-11](#11-select-the-longitudelatitude-region) above). If these functions are not used, the situation becomes more complicated and it is easier to
get unexpected results.
Without the reorder functions, the longitude and latitude selectors must be within the respective ranges in the original file, and the result keeps
the order in which you requested the values. If a transformation is performed at the same time, you also need to consider the latitude/longitude range of
the transformed grid: the requested region must fall within both the original and the transformed regions.
The following chart shows some examples.
<img src="inst/doc/figures/lon-3.PNG" width="800" />
### 13. Load specific grid points data
A single grid point, or a list of grid points defined by pairs of latitude and longitude values, can be loaded using **Start()**.
If the values do not match the grid points defined in the files, **Start()** will load the nearest grid point (the user can also consider the regridding option to fulfill his/her expectations).
An example of how to load several grid points and how to transform the data can be found in the Use Cases section, [example 1.6](/inst/doc/usecase/ex1_6_gridpoint_data.R).
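As a minimal sketch for a single grid point (the path pattern and coordinates are illustrative; values that do not match the file grid are mapped to the nearest grid point):
```r
point <- Start(dat = repos,
               var = 'tas',
               sdate = '20170101',
               time = 'all',
               latitude = values(41.39),    # e.g., near Barcelona
               longitude = values(2.17),
               synonims = list(latitude = c('lat', 'latitude'),
                               longitude = c('lon', 'longitude')),
               return_vars = list(latitude = NULL, longitude = NULL),
               retrieve = TRUE)
```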
### 14. Find the error log when jobs are launched on Power9
Due to a connection problem, when Compute() dispatches jobs to Power9, each job in the ecFlow UI shows a 'Z' (zombie) next to it, no matter whether the job completed or failed.
The zombie prevents the error log from being shown in the ecFlow UI output frame. Therefore, you need to log in to Power9, go to the 'temp_dir' listed in the cluster list of Compute(), and enter the job folder. You will find another folder with the same name as the previous layer; go down to the innermost folder and you will see 'Chunk.1.err'.
For example, the path can be: "/gpfs/scratch/bsc32/bsc32734/startR_hpc/STARTR_CHUNKING_1665710775/STARTR_CHUNKING_1665710775/computation/lead_year_CHUNK_1/lon_CHUNK_1/lat_CHUNK_1/sdate_CHUNK_1/var_CHUNK_1/dataset_CHUNK_1/Chunk.1.err".
### 15. Specify extra function arguments in the workflow
The input arguments of the function may not only be the data; sometimes extra information is required.
Such additional arguments should be specified in AddStep(). The following example shows how to assign 'na.rm' in mean().
```
func <- function(x, narm) { # add the additional argument 'narm'
a <- apply(x, 2, mean, na.rm = narm)
dim(a) <- c(sdate = length(a))
return(a)
}
step <- Step(func, target_dims = c('ensemble', 'sdate'),
output_dims = c('sdate'))
wf <- AddStep(data, step, narm = TRUE) # specify the additional argument 'narm'
```
### 16. Use parameter 'return_vars' in Start()
Apart from the data array, retrieving auxiliary variables inside the netCDF files may also be needed.
The parameter 'return_vars' is used to request such variables.
This parameter expects to receive a named variable list. The names are the variable names to be fetched in the netCDF files, and the corresponding value can be:
(1) NULL, if the variable is common along all the file dimensions (i.e., it will be retrieved only once, from the first involved file)
(2) a vector of the names of the file dimensions along which to retrieve the variable
(3) a vector that includes the file dimension used for the path pattern specification (i.e., 'dat' in the example below)
For the first and second options, the fetched variable values will be saved in *$Variables$common$<variable_name>*.
For the third option, the fetched variable values will be saved in *$Variables$<dataset_name>$<variable_name>*.
Notice that if the variable is specified by values(), it will be automatically added to return_vars and its value will be NULL.
Here is an example showing the above three ways.
```
repos <- "/esarchive/exp/ecmwf/system5_m1/monthly_mean/tas_f6h/$var$_$sdate$.nc"
var <- 'tas'
lon.min <- 10
lon.max <- 20
lat.min <- 20
lat.max <- 30
data <- Start(dat = repos, # file dimension for path pattern specification
var = var,
sdate = c('20170101', '20170401'), # file dimension; 'time' is dependent on 'sdate'
ensemble = indices(1:5),
time = indices(1:3), # inner dimension, also an auxiliary variable containing forecast time information
latitude = values(list(lat.min, lat.max)), # inner dimension, common along all files
longitude = values(list(lon.min, lon.max)), # inner dimension, common along all files
return_vars = list(time = 'sdate', # option (2)
longitude = NULL, # option (1)
latitude = NULL), # option (1)
retrieve = FALSE
)
```
In the return_vars list, we request information for three variables. The 'time' values differ for each sdate, while longitude and latitude are common variables among all the files.
You can use `str(data)` to see the information structure.
```
str(attr(data, 'Variables')$common)
List of 3
$ time : POSIXct[1:6], format: "2017-02-01 00:00:00" "2017-05-01 00:00:00" ...
$ longitude: num [1:37(1d)] 10 10.3 10.6 10.8 11.1 ...
$ latitude : num [1:36(1d)] 20.1 20.4 20.7 20.9 21.2 ...
dim((attr(data, 'Variables')$common$time))
sdate time
2 3
```
It is not necessary in this example, but you can try to replace the return_vars item longitude with `longitude = 'dat'` (option (3)).
You will find that longitude is moved from the $common list to the $dat1 list.
```
str(attr(data, 'Variables')$common)
List of 2
$ time : POSIXct[1:6], format: "2017-02-01 00:00:00" "2017-05-01 00:00:00" ...
$ latitude: num [1:36(1d)] 20.1 20.4 20.7 20.9 21.2 ...
str(attr(data, 'Variables')$dat1)
List of 1
$ longitude: num [1:37(1d)] 10 10.3 10.6 10.8 11.1 ...
```
### 17. Use parameter 'split_multiselected_dims' in Start()
The selectors can be not only vectors, but also multidimensional arrays. For instance, the 'time' dimension
can be assigned with a two-dimensional array `[sdate = 12, time = 31]`, i.e., 31 time steps for each of 12 start dates.
You may want to have both 'sdate' and 'time' in the output dimensions, even though 'sdate' is not explicitly specified in Start().
The parameter 'split_multiselected_dims' serves this goal. It can be used to reshape the
file dimensions, and it is also commonly used when experimental data attributes are
used to define the observational data inner dimensions, so we can get the corresponding observational data with the same dimension structure.
Here is a simple example. By defining the selector of the file dimension 'file_date' as a
two-dimensional array, we can reshape this dimension into 'month' and 'year'.
```r
obs.path <- "/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h-r1440x721cds/$var$_$file_date$.nc"
file_date <- c("201311","201312","201411","201412")
dim(file_date) <- c(month = 2, year = 2)
obs <- Start(dat = obs.path,
var = 'prlr',
file_date = file_date,
time = 'all',
lat = indices(1:10),
lon = indices(1:10),
return_vars = list(lat = NULL,
lon = NULL,
time = 'file_date'),
split_multiselected_dims = TRUE,
retrieve = TRUE)
```
The following script is part of the use case [ex1_2_exp_obs_attr.R](inst/doc/usecase/ex1_2_exp_obs_attr.R).
The time selector for the observational data comes from the experimental data above (not shown here). The selector has two dimensions.
Notice that the dimension name, 'time' here, must also be one of the dimension names of the selector.
The resulting dimensions include 'sdate' because it is split from 'time'. Meanwhile,
'date' disappears because `merge_across_dims = TRUE` (see more explanation at [How-to-2](#2-indicate-dependent-dimension-and-use-merge-parameters-in-start)).
```r
# use time attributes from experimental data
dates <- attr(exp, 'Variables')$common$time
dim(dates)
#sdate time
# 4 3
obs <- Start(dat = repos_obs,
var = 'tas',
date = unique(format(dates, '%Y%m')),
time = values(dates), #dim: [sdate = 4, time = 3]
time_across = 'date',
lat = 'all',
lon = 'all',
merge_across_dims = TRUE,
split_multiselected_dims = TRUE,
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'date'),
retrieve = FALSE)
print(attr(obs, 'Dimensions'))
# dat var sdate time lat lon
# 1 1 4 3 256 512
```
The selector to be split can have more than two dimensions.
The following example comes from the use case [ex1_7_split_merge.R](inst/doc/usecase/ex1_7_split_merge.R).
The 'time' selector has three dimensions: 'sdate', 'syear', and 'time'.
```r
dates <- attr(hcst, 'Variables')$common$time
dim(dates)
#sdate syear time
# 2 3 12
file_date <- sort(unique(gsub('-', '',
sapply(as.character(dates), substr, 1, 7))))
print(file_date)
#[1] "199607" "199612" "199701" "199707" "199712" "199801" "199807" "199812"
#[9] "199901"
obs <- Start(dat = path.obs,
var = var_name,
file_date = file_date, # a vector with the information of sdate and syear
latitude = indices(1:10),
longitude = indices(1:10),
time = values(dates), # a 3-dim array (sdate, syear, time)
time_across = 'file_date',
merge_across_dims = TRUE,
merge_across_dims_narm = TRUE,
split_multiselected_dims = TRUE,
synonims = list(latitude = c('lat','latitude'),
longitude = c('lon','longitude')),
return_vars = list(latitude = 'dat',
longitude = 'dat',
time = 'file_date'),
retrieve = T)
```
### 18. Use glob expression '*' to define the file path
The standard way to define the file path for Start() is using tags (i.e., $TAG_NAME$).
The glob expression, or wildcard, '*', can also be used in the path definition, but the rule differs from the common usage.
Please note that **'*' can only be used to replace the part that is common to all the files**. For example, if all the required files have the folder 'EC-Earth-Consortium/' in their path, then this part can be substituted with '*/'.
It can save some effort when defining a long and uncritical path, and it also makes the script cleaner.
However, if the part replaced by '\*' is not the same among all the files, Start() will use **the first pattern it finds in the first file to substitute '*'**.
As a result, the rest of the files may not be found due to the wrong path pattern.
For example, if the first file is under a folder named 'v20190302/' and the second file is under another one named 'v20190308/', and you define the path pattern as 'v*/', then Start() will use 'v20190302/' for both file paths.
This is different from the common definition of glob expression that tries to expand to match all the existing patterns, so please be careful when using it.
The parameter 'path_glob_permissive' in Start() can be used to preserve the
functionality of '*'. It can be FALSE/TRUE or an integer indicating for how many folder layers
in the path pattern, beginning from the end, the shell glob expressions should be preserved.
The default value is FALSE (equal to 0), which means no '\*' is preserved.
If it is set to TRUE (equal to 1), the '\*' in the file name will remain and represent the different possibilities of the file path pattern. See more details in the Start() parameter
'path\_glob\_permissive' and the use case [ex1_9](inst/doc/usecase/ex1_9_path_glob_permissive.R).
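As a sketch of the idea, assuming the version folder (v20190302, v20190308, ...) differs among the files (the path below, its folder layout, and the member names are illustrative only):
```r
path <- '/esarchive/exp/ecearth/a1st/cmorfiles/*/*/*/historical/$member$/Omon/$var$/gn/v*/$var$_*_$member$_gn_$year$.nc'
data <- Start(dat = path,
              var = 'tos',
              member = c('r7i1p1f1', 'r9i1p1f1'),
              year = '185001-185012',
              time = 'all',
              lat = 'all',
              lon = 'all',
              path_glob_permissive = 2,   # keep the '*' in the last two layers: 'v*/' and the file name
              return_vars = list(lat = NULL, lon = NULL),
              retrieve = FALSE)
```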
### 19. Get metadata when the first file does not exist
Start() can retrieve the data even if some of the files do not exist; the returned array will be filled with NA at the positions of the missing files.
However, Start() retrieves the metadata from the first file of each data set by default. When the first file does not exist, the metadata cannot be found, and the returned array will lack the metadata of the variable.
In this case, Start() shows a warning: `Metadata cannot be retrieved. The reason may be the non-existence of the first file. Use parameter 'metadata_dims' to assign to file dimensions along which to return metadata, or check the existence of the first file.`
To get the metadata, we can either ensure that the first file exists or use the parameter 'metadata_dims' in Start().
(1) Ensure the first file exists
If the first file exists, you have no problem with the metadata. You can check manually, or use a script like the one below:
```r
first.exists <- FALSE
n <- 1
while(!first.exists) {
if(!file.exists(dir[n])) {
n <- n + 1
} else {
first.exists <- TRUE
if (n > 1) {
init.year <- init.year + (n - 1)
all.years <- paste(strtoi(init.year):strtoi(end.year), sep = "")
warning(paste0("NEW INIT YEAR: ", init.year))
}
}
}
```
(2) Use parameter 'metadata_dims'
This parameter expects to receive a vector of character strings with the names of the file dimensions which to return metadata for.
Start() by default returns the auxiliary data read for only the first file of each data set in the pattern dimension.
However, it can be configured to return the metadata for all the files along any set of file dimensions. The following example uses `metadata_dims = 'file_date'`, so even if the first file is missing, Start() can find the metadata in the second file.
```r
file <- "/esarchive/exp/ncep/cfs-v2/weekly_mean/s2s/$var$_f24h/$var$_$file_date$.nc"
var <- 'tas'
sdates <- c("20130618", "20130611") #1st missing, 2nd exists
dat1 <- Start(dat = file,
var = var,
file_date = sdates,
time = indices(1:4),
latitude = values(list(20, 30)),
latitude_reorder = Sort(decreasing = TRUE),
longitude = values(list(-20, -10)),
longitude_reorder = CircularSort(-180, 180),
ensemble = indices(1),
synonims = list(latitude = c('lat', 'latitude'),
longitude = c('lon', 'longitude')),
return_vars = list(latitude = 'dat',
longitude = 'dat',
time = 'file_date'),
retrieve = T)
# Check the attributes. There is no 'tas' metadata
names(attr(dat1, 'Variables')$common)
[1] "time"
dat1 <- Start(dat = file,
var = var,
file_date = sdates,
time = indices(1:4),
latitude = values(list(20, 30)),
latitude_reorder = Sort(decreasing = TRUE),
longitude = values(list(-20, -10)),
longitude_reorder = CircularSort(-180, 180),
ensemble = indices(1),
metadata_dims = 'file_date',
synonims = list(latitude = c('lat', 'latitude'),
longitude = c('lon', 'longitude')),
return_vars = list(latitude = 'dat',
longitude = 'dat',
time = 'file_date'),
retrieve = T)
# Check the attributes. 'tas' metadata exists
names(attr(dat1, 'Variables')$common)
[1] "time" "tas"
```
### 20. Use 'metadata_dims' to retrieve variable metadata
In addition to retrieving the data values, Start() can retrieve the auxiliary data as well.
The parameter 'metadata_dims' controls the metadata of the variable whose values you request (e.g., 'tas'),
and the parameter 'return_vars' is for the other variables in the netCDF file (e.g., 'lat', 'lon', 'time').
The definition of 'metadata_dims' is:
> A vector of character strings with the names of the file dimensions which to return metadata for.
Start() by default returns the auxiliary data read for only the first file of each source (or data set) in the pattern dimension.
However, it can be configured to return the metadata for all the files along any set of file dimensions.
By default, 'metadata_dims' is equal to 'pattern_dims', which we usually assign as 'dat'.
By this means, the variable auxiliary data will be collected from the first file of each data set.
If you only have one variable to be retrieved, you have no problem with the default.
However, what if the data set number and/or the variable number is more than 1? You need to adjust this parameter to get the complete metadata.
Here are some common cases and the corresponding 'metadata_dims' to be used:
- One dat, one var: 'dat' (or default)
- One dat, two vars: 'var'
- Two dats, one var: 'dat' (or default)
- Two dats, two vars: c('dat', 'var')
If there are two variables to be retrieved but metadata_dims doesn't include "var", only the first
variable's metadata will be retrieved. If there are two data sets but metadata_dims doesn't include "dat",
only the first data set will have the variable's metadata.
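For example, with one data set and two variables, a sketch could look like this (the path pattern and selectors are illustrative):
```r
data <- Start(dat = repos,                 # one path pattern -> one data set
              var = c('tas', 'psl'),       # two variables
              sdate = '20170101',
              ensemble = 'all',
              time = 'all',
              latitude = 'all',
              longitude = 'all',
              metadata_dims = 'var',       # return metadata for every file along 'var'
              return_vars = list(latitude = NULL, longitude = NULL, time = 'sdate'),
              retrieve = FALSE)
# Both 'tas' and 'psl' metadata should now appear under attr(data, 'Variables')
```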
Please find the relevant use cases in [ex1_10](inst/doc/usecase/ex1_10_metadata_dims.R).
### 21. Retrieve the complete data when the dimension length varies among files
By default, Start() uses the first valid file of each data set to determine the dimensions
of the returned data array. However, the inner dimension lengths may not be the same among the
files. For example, the member number in one experiment is 25 in the early years but
increases to 51 later. If you assign `member = 'all'` in the Start() call, the returned member
dimension length will be 25 only.
The parameter `largest_dims_length` is for this case. Its default value is `FALSE`, meaning
that Start() only uses the first valid file to decide the dimensions. If it is changed to
`TRUE`, Start() will examine all the required files to find the largest length of each inner
dimension. This is time- and resource-consuming, but useful when you are not sure what the dimensions
in all the files look like.
If you know the expected dimension length, it is recommended to assign `largest_dims_length`
a named integer vector, for example, `largest_dims_length = c(member = 51)`. Start() will
adopt the provided lengths and use the first valid file to decide the rest of the dimensions.
By this means, the efficiency can be similar to `largest_dims_length = FALSE`.
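A minimal sketch of this usage (the path pattern, start dates, and dimension names are illustrative):
```r
data <- Start(dat = repos,
              var = 'tas',
              sdate = c('19990901', '19991101'),     # e.g., 25 and 51 members, respectively
              member = 'all',
              time = 'all',
              latitude = 'all',
              longitude = 'all',
              largest_dims_length = c(member = 51),  # skip the file scan; trust the provided length
              return_vars = list(latitude = NULL, longitude = NULL, time = 'sdate'),
              retrieve = FALSE)
```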
Find example in use case [ex1_4](/inst/doc/usecase/ex1_4_variable_nmember.R).
### 22. Define the selector when the indices in the files are not aligned
When the data structure is not identical among the requested files, we need to give a different
selector to each file. We can do this by using arrays as selectors together with a well-defined
'return_vars' parameter. There are two scenarios: (1) the difference is between data sets; (2) the difference is along a certain file dimension.
(1) Different between data sets
We don't need to (and can't) define the selectors with the pattern dim as a dimension. We can use
the value as the selector and specify `return_vars = list(<inner_dim> = 'dat')`. Through 'return_vars',
Start() knows that this inner dimension differs among the data sets, so it examines all the files to get
the correct values. See more details of 'return_vars' at [How-to-16](#16-use-parameter-return_vars-in-start).
For example, the two data sets, HadGEM3 and NorCPM1, have different initial dates: HadGEM3 initialises
in November while NorCPM1 in October. To retrieve them aligned, we can define the time selector
with the value "2000-11-16 UTC" and define 'return_vars' properly.
```r
# HadGEM3 (initialised in November)
# NorCPM1 (initialised in October)
data <- Start(dat = list(list(name = 'hadgem3', path = path_hadgem3),
list(name = 'norcpm1', path = path_norcpm1)),
var = 'tas',
sdate = '2000',
time = as.POSIXct("2000-11-16", tz = 'UTC'),
lat = 'all',
lon = 'all',
synonims = list(lon = c('lon', 'longitude'), lat = c('lat', 'latitude')),
return_vars = list(lat = 'dat', lon = 'dat',
time = 'dat'),
retrieve = TRUE)
```
(2) Different along a certain file dimension
If the difference of indices is among the files of the same data set, we can use an array with
named dimensions to define the selector, and define 'return_vars' with the file dimension along which the indices differ.
For example, the 'region' number in the earlier experiments (sdate < 2013) is smaller than in the later experiments (sdate = 2013),
so some regions have different indices between the experiments. The region selector array
should be two-dimensional, with one dimension 'sdate' and the other 'region'. The values of the
array can be either the character strings of the region names or the indices within each sdate.
Besides, the dependency should be specified by `return_vars = list(region = 'sdate')`.
```r
# 'Nino3' in 1st sdate file is index 9 while in 2nd sdate file is index 11
# Either define with 'Nino3' or the corresponding index works
region <- array('Nino3', dim = c(sdate = 2, region = 1))
region <- array(c(indices(9), indices(11)), dim = c(sdate = 2, region = 1))
data <- Start(dat = path,
var = 'tosmean',
sdate = c('1993', '2013'),
chunk = 'all',
chunk_depends = 'sdate',
region = region,
time = 'all',
time_across = 'chunk',
merge_across_dims = TRUE,
return_vars = list(time = c('sdate', 'chunk'),
region = 'sdate'),
retrieve = T)
```
The dependency can be on more than one file dimension. What you need to do is just create
an array with the depended file dimensions as the array dimension names. See more examples
in [use case ex1_13](inst/doc/usecase/ex1_13_implicit_dependency.R).
### 23. The best practice of using vector and list for selectors
There are three ways to define the selectors in Start(): `indices()`, `values()`, and character strings
like 'all', 'first', and 'last'. For `indices()` and `values()`, we can put either a vector or a list
in them (here we talk about the common cases, not including the dependency case mentioned in How-to-22 above).
For file dimensions, it is common to simply define the selectors with a vector of character strings
(which belongs to `values()`, but `values()` can be omitted), e.g., `sdate = c('200001', '200002')`; `var = 'tas'`.
You can also use a vector of indices, but you cannot guarantee that the files you get are the desired ones,
since the file order in the repository may change.
For inner dimensions, it is recommended to use a "list of 2 values" or a "vector of indices".
The main difference between a vector and a list is that a vector looks for the exact or closest
(possibly larger or smaller) values in the data, while a list looks for the data falling between the two numbers in the list.
You can assign all the indices needed by a vector, e.g., `time = indices(1:12)`, or give a range
that covers all the data needed by a list of 2, e.g., `lon = values(list(0, 30))`.
Note that `lon = values(list(0, 30))` means the data between 0 degE and 30 degE is taken; on the
other hand, `lon = indices(list(0, 30))` means that index 0 to index 30 of lon is taken (and it
will return an error in this case because there is no index 0.)
In conclusion, if you know the exact values or indices of the selector, you can use vector of values or indices; if not, usually for longitude and latitude, it is better to use list of 2 values instead.
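A small sketch contrasting the two forms (with startR loaded; the ranges are illustrative):
```r
lon_by_values  <- values(list(0, 30))  # take the data whose longitude value falls between 0 and 30 degE
lon_by_indices <- indices(1:31)        # take the first 31 longitude grid points, whatever their values
# Then, e.g.: Start(..., longitude = lon_by_values, ...)
```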
### 24. Do both interpolation and chunking on spatial dimensions
If all the other dimensions are used as target dimensions in the operation, it is necessary
to chunk the spatial dimensions. The chunking can be done even if regridding is also required in
Start() (see the transform arguments in [how-to-5](#5-do-interpolation-in-start-using-parameter-transform)), and the script is no different from chunking other dimensions.
However, there are some things you need to bear in mind when using startR in this way.
The regridding function provided by startR is CDORemapper(), a wrapper of s2dv::CDORemap();
and CDORemap() uses CDO inside. Therefore, the regridding in startR has the same performance as CDO.
The errors due to transformation at borders may increase by chunking because it produces more
borders. For example, if `longitude = indices(1:20)` is chunked by 2, the first chunk will be indices(1:10) and the second chunk will be indices(11:20). Therefore, we have borders at 0, 10, 11, and 20.
In most cases, the border errors can be eliminated by increasing the number of extra cells (parameter `transform_extra_cells` in Start()). With enough extra cells, the result will be identical to
global regridding.
However, there are many factors that may impact the results of regridding, like the `crop` option,
the way the longitude/latitude selectors are defined, etc. It is important to know how CDO works and
the usage of those parameters to avoid unnecessary errors.
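A sketch of the combination, assuming a global input grid and a workflow `wf` already built with Step()/AddStep() (grid, selectors, and chunk numbers are illustrative):
```r
data <- Start(dat = repos,
              var = 'tas',
              sdate = '20170101',
              time = 'all',
              latitude = values(list(-90, 90)),
              latitude_reorder = Sort(),
              longitude = values(list(0, 359.9)),
              longitude_reorder = CircularSort(0, 360),
              transform = CDORemapper,
              transform_params = list(grid = 'r360x181', method = 'con', crop = FALSE),
              transform_vars = c('latitude', 'longitude'),
              transform_extra_cells = 8,      # generous extra cells to reduce border errors
              return_vars = list(latitude = NULL, longitude = NULL),
              retrieve = FALSE)
# wf <- AddStep(data, step)   # step defined with Step() as usual
res <- Compute(wf, chunks = list(latitude = 2, longitude = 2))
```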
We provide some [use cases](inst/doc/usecase/ex2_12_transform_and_chunk.R) showing the secure ways of transformation + chunking.
### 25. What to do if your function has too many target dimensions
Ideally, the desired startR workflow uses the dimensions required for the key computations as target dimensions and the remaining dimensions to chunk the data into pieces.
If we have a complex analysis that requires all the dimensions in one single computation step, we don't have any free (i.e., margin) dimension left for chunking the data.
Unfortunately, we don't have a perfect solution for this until the multiple-steps feature is available.
You may check [How-to-27](#27-utilize-chunk-number-in-the-function) to see if that solution applies to your case. If not, talk to the maintainers to see how to find a workaround for your case.
### 26. Use merge_across_dims_narm to remove NAs
The Start() parameter `merge_across_dims_narm` can be useful when you want to merge two dimensions together (e.g., time across chunk.) If you're not familiar with the usage of `xxx_across = yyy` and `merge_across_dims` yet, check [How-to-2](#2-indicate-dependent-dimension-and-use-merge-parameters-in-start) first.
The first thing to notice is that `merge_across_dims_narm` can only remove **the NAs that are created by Start()** during the reshaping process.
It does not remove the NAs in the original data. For example, in Example 2 of How-to-2, the NAs are removed because they were added by Start().
Second, if the files don't share the same length of the merged dimension, you need to use `largest_dims_length = TRUE` along with it.
This parameter tells Start() to look into each file to find out the dimension length. By doing this, Start() knows which NAs in the files with the shorter dimension were added by it, so `merge_across_dims_narm = TRUE` can remove those NAs correctly.
A typical example is reading daily data and merging the time dimension. The 30-day months will have one NA at the end of the time dimension if `merge_across_dims_narm = TRUE` and `largest_dims_length = TRUE` are not used.
Check usecase [ex1_16](/inst/doc/usecase/ex1_16_files_different_time_dim_length.R) for the example script.
See [How-to-21](#21-retrieve-the-complete-data-when-the-dimension-length-varies-among-files) for more details of `largest_dims_length`.
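A sketch of the daily-data case (the path pattern and file dates are illustrative):
```r
data <- Start(dat = repos,                          # a daily-data path pattern with $var$ and $file_date$
              var = 'tas',
              file_date = c('201311', '201312'),    # monthly files with 30 and 31 days, respectively
              time = 'all',
              lat = 'all',
              lon = 'all',
              time_across = 'file_date',
              merge_across_dims = TRUE,
              merge_across_dims_narm = TRUE,        # drop the NA padded at the end of the 30-day month
              largest_dims_length = TRUE,           # check each file's time length (30 vs. 31)
              return_vars = list(lat = NULL, lon = NULL, time = 'file_date'),
              retrieve = TRUE)
```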