Start(): The usage and problems of parameter *_var
Hi @nperez
Sorry for my insistence, but since I keep coming across code related to the parameter *_var in Start(), I want to clarify its usage and its problematic parts. Here is a summary of what I've found so far.
1. The usage of *_var
The documentation says:
The name of the associated coordinate variable must be a character string with the name of an associated coordinate variable to be found in the data files (in all* of them). For this to work, a ’file_var_reader’ function must be specified when calling Start() (see parameter ’file_var_reader’). The coordinate variable must also be requested in the parameter ’return_vars’ (see its section for details). This feature only works for inner dimensions.
Take one file as an example: /esarchive/exp/ecmwf/system5_m1/monthly_mean/tas_f6h/tas_20080301.nc. Using ncdump, we can see the file looks like this:
netcdf tas_20080301 {
dimensions:
    ensemble = 25 ;
    latitude = 640 ;
    longitude = 1296 ;
    time = UNLIMITED ; // (7 currently)
variables:
    int realization(ensemble) ;
    double latitude(latitude) ;
        latitude:standard_name = "latitude" ;
        latitude:long_name = "latitude" ;
        latitude:units = "degrees_north" ;
        latitude:axis = "Y" ;
    double longitude(longitude) ;
        longitude:standard_name = "longitude" ;
        longitude:long_name = "longitude" ;
        longitude:units = "degrees_east" ;
        longitude:axis = "X" ;
    float tas(time, ensemble, latitude, longitude) ;
        tas:long_name = "2 metre temperature" ;
        tas:code = 167 ;
        tas:table = 128 ;
        tas:grid_type = "gaussian" ;
        tas:units = "K" ;
    double time(time) ;
        time:standard_name = "time" ;
        time:units = "hours since 2008-03-01 00:00:00" ;
        time:calendar = "proleptic_gregorian" ;
...
The coordinate variables are realization, latitude, longitude, and time. Most of the time we don't need to worry about this parameter, because the dimension name is the same as the name of the coordinate variable. In that case, Start() automatically adds time_var = 'time' or latitude_var = 'latitude' when needed and returns a warning like: Warning: Found specified values for dimension 'time' but no 'time_var' requested. "time_var = 'time'" has been automatically added to the Start call. (ex3)
However, the dimension name 'ensemble' differs from the name of its coordinate variable, 'realization'. If the ensemble selector is given as values, we need to specify ensemble_var = 'realization' (ex2).
So, what if the selector type is not 'values' but 'indices', or a character string like 'all'? Start() may or may not return an error (ex4,5), depending on how the parameter 'return_vars' is defined (see point 2 below). The takeaway is that *_var is not needed when the selector type is not 'values'.
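To make the distinction concrete, here is a minimal base-R sketch (not startR code) of why a coordinate variable is only needed for value selectors. The helper `resolve_selector()` and the contents of the `realization` vector are hypothetical and only serve as an illustration; the actual logic inside Start() is more involved.

```r
# Hypothetical stand-in for a coordinate variable read from a file.
# In the file above, 'realization' is defined along the dimension 'ensemble';
# the values 0:24 are made up for this illustration.
realization <- 0:24

# Hypothetical helper: turn a selector into positions along a dimension.
# - type "indices": the selector already holds positions, no coordinate needed.
# - type "values":  the selector must be matched against the coordinate variable.
resolve_selector <- function(selector, type = c("indices", "values"),
                             coord = NULL) {
  type <- match.arg(type)
  if (type == "indices") {
    return(as.integer(selector))
  }
  if (is.null(coord)) {
    stop("A coordinate variable is required to resolve value selectors.")
  }
  pos <- match(selector, coord)
  if (anyNA(pos)) {
    stop("Some requested values were not found in the coordinate variable.")
  }
  pos
}

# Index selector: no coordinate variable is involved.
idx_from_indices <- resolve_selector(c(1, 3), type = "indices")

# Value selector: the coordinate variable is needed to find the positions.
idx_from_values <- resolve_selector(c(1, 3), type = "values",
                                    coord = realization)
```

With this made-up coordinate, the index selector stays at positions 1 and 3, while the value selector resolves to positions 2 and 4. Only the second call needs to know which variable holds the coordinate, which is the role *_var plays.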
2. The interaction with parameter 'return_vars'
From the documentation above, we know that *_var is related to 'return_vars'. I list some points here:
- The variable named in *_var is automatically added to the return_vars list even if you don't do it yourself. Start() sets its value to NULL and returns a warning like:
Warning: All '*_var' params must associate a dimension to one of the requested variables in 'return_vars'. The following variables have been added to 'return_vars': 'time'
(ex2,3).
- The following combination leads to an error: the time selector type is not 'values' + time_var = 'time' + the value of time in return_vars is not NULL (e.g., time = 'sdate'). The error is:
Provided selectors for the dimension 'time' must have as many file dimensions as the variable the dimension is defined along, 'time', with the exceptions of the file pattern dimension ('dat') and any depended file dimension (if specified as depended dimension in parameter 'inner_dims_across_files' and the depending file dimension is present in the provided selector array).
(ex5). I don't quite understand this message.
- We can work around the above situation by adding dimension names to the time selector (ex6). This way, the selector type is still indices, but the selector has the dimension 'sdate', which is also the value given in return_vars.
However, as far as I understand, the above situation is not a legitimate usage. The third point is just a workaround without real meaning.
3. Problem and proposed solution
From the code, I conclude that *_var should only be used when the selector type is values. However, the current code doesn't return a warning or error when the selector type is indices and *_var is assigned. It either leads to an unrelated error (like the one in the second point of part 2) or runs fine because return_vars happens to be compatible.
Since I haven't fully understood the error message, I don't want to change it. To prevent confusion, we can (1) mention in the documentation that *_var is only used when the selector type is values, and (2) remove the assigned *_var if the selector type is indices and show a warning.
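The second proposal could look roughly like the base-R sketch below. It is only an illustration under stated assumptions: `as_indices()`, `as_values()`, and `drop_unneeded_vars()` are hypothetical names, the selector type is stored in a made-up attribute, and any selector not explicitly tagged as values (including character selectors such as 'all') is treated as not needing *_var. It does not reproduce how Start() detects selector types internally.

```r
# Hypothetical stand-ins for selector wrappers; they only tag the selector
# with a made-up 'selector_type' attribute for this illustration.
as_indices <- function(x) structure(x, selector_type = "indices")
as_values  <- function(x) structure(x, selector_type = "values")

# Hypothetical check: drop any '<dim>_var' argument whose dimension selector
# is not tagged as "values", and warn about each dropped argument.
drop_unneeded_vars <- function(args) {
  var_params <- grep("_var$", names(args), value = TRUE)
  for (param in var_params) {
    dim_name <- sub("_var$", "", param)
    selector <- args[[dim_name]]
    type <- attr(selector, "selector_type")
    if (!is.null(selector) && !identical(type, "values")) {
      warning("'", param, "' is ignored because the selector for '",
              dim_name, "' is not of type 'values'.", call. = FALSE)
      args[[param]] <- NULL  # remove the unneeded parameter from the call
    }
  }
  args
}

# ensemble is given as values, so ensemble_var is kept;
# time is given as indices, so time_var is dropped with a warning.
args <- list(ensemble = as_values(c(1, 3)), ensemble_var = "realization",
             time = as_indices(1:4), time_var = "time")
checked <- suppressWarnings(drop_unneeded_vars(args))
names(checked)
```

A check like this would run before the *_var and return_vars consistency logic, so the confusing error from ex5 would be replaced by an explicit warning about the ignored parameter.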
Examples
library(startR)
# Get time values for later use
repos <- '/esarchive/exp/ecmwf/system5_m1/monthly_mean/$var$_f6h/$var$_$sdate$.nc'
data <- Start(dat = repos,
              var = 'tas',
              sdate = c('20170101', '20180101'),
              ensemble = indices(1),
              time = indices(1:4),
              latitude = indices(1), longitude = indices(1),
              return_vars = list(time = 'sdate'),
              retrieve = FALSE)
time_val <- attr(data, 'Variables')$common$time
# The arguments which won't change in the tests
basic_list <- list(
  dat = '/esarchive/exp/ecmwf/system5_m1/monthly_mean/$var$_f6h/$var$_$sdate$.nc',
  var = 'tas',
  sdate = c('20170101', '20180101'),
  latitude = indices(1:3),
  longitude = indices(1:2),
  retrieve = FALSE
)
# The tests with different arguments
test_batteries <- list(
  # 1: ensemble and time are indices. no *_var assigned
  c(basic_list, list(ensemble = c(1, 3)), list(time = indices(1:4))),
  # 2: ensemble is values. ensemble_var assigned.
  c(basic_list, list(ensemble = values(c(1, 3))), list(time = indices(1:4)),
    ensemble_var = 'realization'),
  # 3: ensemble and time are values. ensemble_var assigned. time_var will be added automatically.
  c(basic_list, list(ensemble = values(c(1, 3))), list(time = time_val),
    ensemble_var = 'realization'),
  # 4: same as 2 but time_var is assigned, and return_vars = list(time = NULL).
  c(basic_list, list(ensemble = values(c(1, 3))), list(time = indices(1:4)),
    ensemble_var = 'realization', time_var = 'time',
    list(return_vars = list(time = NULL))),
  # 5: same as 4 but return_vars = list(time = 'sdate'). ERROR!!
  c(basic_list, list(ensemble = values(c(1, 3))), list(time = indices(1:4)),
    ensemble_var = 'realization', time_var = 'time',
    list(return_vars = list(time = 'sdate'))),
  # 6: same as 5 but time with dim.
  c(basic_list, list(ensemble = values(c(1, 3))),
    list(time = array(1:4, dim = c(time = 4, sdate = 2))),
    ensemble_var = 'realization', time_var = 'time',
    list(return_vars = list(time = 'sdate')))
)
# Run the tests
for (battery_ind in seq_along(test_batteries)) {
  battery <- test_batteries[[battery_ind]]
  cat(paste0("Test ", battery_ind, "...\n"))
  # Test 5 is expected to fail; try() lets the loop continue to test 6
  data <- try(do.call(Start, battery))
}
warnings()
Cheers,
An-Chi