Commit 89c4b48b authored by aho

Update to master

parents 685647ce d47326ce
Pipeline #5835 passed with stage
in 13 minutes and 41 seconds
Package: startR
Title: Automatically Retrieve Multidimensional Distributed Data Sets
-Version: 2.1.0-3
+Version: 2.1.0-4
Authors@R: c(
person("BSC-CNS", role = c("aut", "cph")),
person("Nicolau", "Manubens", , "nicolau.manubens@bsc.es", role = c("aut")),
......
......@@ -221,6 +221,13 @@ NcDataReader <- function(file_path = NULL, file_object = NULL,
# Origin year and month
ori_year <- as.numeric(substr(parts[2], 1, 4))
ori_month <- as.numeric(substr(parts[2], 6, 7))
if (is.na(ori_month)) {
ori_month <- as.numeric(substr(parts[2], 6, 6))
}
if (is.na(ori_year) | is.na(ori_month)) {  # as.numeric() yields NA on failure
stop(paste0("The time unit attribute format is not 'YYYY-MM-DD' or 'YYYY-M-D'. ",
"Check the file or contact the maintainer."))
}
if (calendar == 'gregorian') {
# Find how many years + months
......
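The month-parsing fallback in the hunk above can be sketched in isolation. `parse_month` below is a hypothetical helper (not part of startR) that mimics the two-step substr() logic for the 'YYYY-MM-DD' and 'YYYY-M-D' formats:

```r
# Hypothetical helper mimicking the fallback above: try a two-digit month at
# positions 6-7; if that yields NA (one-digit month), fall back to position 6.
parse_month <- function(date_str) {
  m <- suppressWarnings(as.numeric(substr(date_str, 6, 7)))
  if (is.na(m)) {
    m <- as.numeric(substr(date_str, 6, 6))
  }
  m
}

parse_month("1999-01-01")  # 'YYYY-MM-DD' -> 1
parse_month("1999-1-1")    # 'YYYY-M-D'   -> 1
```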
......@@ -475,11 +475,11 @@ The code to reproduce this behaviour could be found in the Use Cases section, [e
### 11. Select the longitude/latitude region
There are three ways to specify the dimension selectors: special keywords ('all', 'first', 'last'), indices, or values (find more details in the [practical guide](inst/doc/practical_guide.md)).
-The parameter 'xxx_reorder' is only effective when using **values**.
+For now, the parameter 'xxx_reorder' is only effective when using **values**.
There are two reorder functions in the startR package: **Sort()** for latitude and **CircularSort()** for longitude.
-Sort() is a wrapper function of base function sort(), rearranging the values from low to high (decreasing = TRUE, default) or
-from high to low (decreasing = FALSE). For example, if you want to sort latitude from 90 to -90, use `latitude_reorder = Sort(decreasing = TRUE)`.
+Sort() is a wrapper function of base function sort(), rearranging the values from low to high (decreasing = FALSE, default) or
+from high to low (decreasing = TRUE). For example, if you want to sort latitude from 90 to -90, use `latitude_reorder = Sort(decreasing = TRUE)`.
In this way, the result will always go from large to small values no matter what the original order is.
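Since Sort() wraps base sort(), its effect can be previewed with base R alone. This is only a sketch of the underlying behaviour; Sort() itself returns a reorder function that Start() applies internally:

```r
# Latitude values in arbitrary order:
lat <- c(-30, 60, 0, 90, -90)

sort(lat)                     # decreasing = FALSE (default): -90 -30 0 60 90
sort(lat, decreasing = TRUE)  # decreasing = TRUE:             90 60 0 -30 -90
```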
On the other hand, the concept of CircularSort() is different: it is used for a circular region, putting out-of-region values back into the region.
......@@ -487,9 +487,16 @@ It requires two input numbers defining the borders of the whole region, which ar
`longitude_reorder = CircularSort(0, 360)` means that the left border is 0 and the right border is 360, so 360 will be put back to 0, 361 will be put back to 1,
and -1 will become 359. After circulating the values, CircularSort() also sorts them from small to large. This may produce a discontinuous sub-region,
but the problem can be solved by assigning the borders correctly.
Note that the two points in CircularSort() are regarded as the same point. Hence, if you want to load the global longitude, lonmin/lonmax should be slightly different, e.g., 0/359.9, 0.1/360, -179.9/180, -180/179.9, etc. Otherwise, only one point will be returned.
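The remapping described above can be sketched with modular arithmetic. `circular_sort` below is a simplified stand-in for the idea behind CircularSort(0, 360), not the startR implementation:

```r
# Map values into [start, end) with modular arithmetic, then sort small to big.
circular_sort <- function(lon, start, end) {
  period <- end - start
  sort(((lon - start) %% period) + start)
}

circular_sort(c(-1, 361, 360, 180), 0, 360)
# 360 -> 0, 361 -> 1, -1 -> 359; sorted result: 0 1 180 359
```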
The following chart helps you to decide how to use CircularSort() to get the desired region.
-The first row represents the longitude border of the requested region, e.g., `values(list(lon.min, lon.max))`.
+The first row represents the longitude border of the requested region, e.g., `values(list(lon.min, lon.max))`,
and the white part is the returned longitude range corresponding to each CircularSort() setting.
Here are some summaries:
- The original longitude range does not matter. Whether the original longitude is [0, 360] or [-180, 180], Start() will return the values shown in the chart according to the lonmin/lonmax you set.
- The lonmin/lonmax values should be consistent with CircularSort(), so that the returned values are continuous. For example, if `lonmin/lonmax = -60/60`, `CircularSort(-180, 180)` should be used.
- Define the longitude range as the one you want to get, regardless of the original file. For example, if you want the data to be [-180, 180], define `lonmin/lonmax = -179.9/180` and `CircularSort(-180, 180)`, even if the original longitude in the netCDF file is [0, 360].
Note that this chart only illustrates the idea. The real numbers may slightly differ depending on the original/transformed values.
<img src="inst/doc/figures/lon-2.PNG" width="1000" />
......@@ -611,7 +618,30 @@ List of 1
The selectors can be not only vectors but also multidimensional arrays. For instance, the 'time' dimension
can be assigned a two-dimensional array `[sdate = 12, time = 31]`, i.e., 31 timesteps for each of 12 start dates.
You may want to have both 'sdate' and 'time' in the output dimension, even though 'sdate' is not explicitly specified in Start().
-The parameter 'split_multiselected_dims' is for this goal. It is common in the case that uses experimental data attributes to get the corresponding observational data.
+The parameter 'split_multiselected_dims' is for this goal. It can be used to reshape the
+file dimensions, and it is also common in the case that experimental data attributes are
+used to define observational data inner dimensions, so we can get the corresponding observational data in the same dimension structure.
Here is a simple example. By defining the selector of the file dimension 'file_date' as a
two-dimensional array, we can reshape this dimension into 'month' and 'year'.
```r
obs.path <- "/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h-r1440x721cds/$var$_$file_date$.nc"
file_date <- c("201311","201312","201411","201412")
dim(file_date) <- c(month = 2, year = 2)
obs <- Start(dat = obs.path,
var = 'prlr',
file_date = file_date,
time = 'all',
lat = indices(1:10),
lon = indices(1:10),
return_vars = list(lat = NULL,
lon = NULL,
time = 'file_date'),
split_multiselected_dims = TRUE,
retrieve = TRUE)
```
The following script is part of the use case [ex1_2_exp_obs_attr.R](inst/doc/usecase/ex1_2_exp_obs_attr.R).
The time selector for the observational data comes from the experimental data above (omitted here). The selector has two dimensions.
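The core trick behind this splitting is just base R's dim() assignment: giving the selector named dimensions tells Start() the shape to split into. A minimal sketch, independent of any data files:

```r
# A file-dimension selector as a plain vector:
file_date <- c("201311", "201312", "201411", "201412")

# Assign named dimensions to reshape it into [month = 2, year = 2]:
dim(file_date) <- c(month = 2, year = 2)
dim(file_date)
# month  year
#     2     2

# With split_multiselected_dims = TRUE, the 'file_date' dimension of the
# output array is split into 'month' and 'year' accordingly.
```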
......
......@@ -25,7 +25,7 @@ in a comparable structure. It also shows how to use parameters `xxx_tolerance`,
5. [Use reorder functions to get desired lat/lon region](inst/doc/usecase/ex1_5_latlon_reorder.R)
This script shows you how to use the reorder functions (`Sort()`, `CircularSort()`) to
-get the desired longitude and latitude region. See [FAQ How-to-#11] (/inst/doc/faq.md#11-read-latitude-and-longitude-with-the-usage-of-parameter-xxx_reorder)
+get the desired longitude and latitude region. See [FAQ How-to-#11](inst/doc/faq.md#11-select-the-longitudelatitude-region)
for more explanation.
6. [Loading gridpoint data](inst/doc/usecase/ex1_6_gridpoint_data.R)
......
......@@ -77,7 +77,7 @@
# NOTE: 'merge_across_dims_narm = TRUE' is necessary because the observational
# data have unequal time length of 30-day and 31-day months.
# If the NAs are not removed, unwanted NAs will exist and make the
-# values misplaced in the array. See 'bonus' below for more explanation.
+# values misplaced in the array.
#------- Check erai -----------
dim(erai)
......@@ -129,45 +129,3 @@ attr(erai, 'Variables')$common$time[2, ]
# //////////////////"BONUS"//////////////////////
# Here is something more to show the usage of parameter 'merge_across_dims_narm'.
# If the last day of 30-day months is NA instead of the first day of the following month,
# NAs are needed to exist in the array. In this case, 'merge_across_dims_narm'
# should be FALSE.
dates <- attr(system4, 'Variables')$common$time
dates[2, 31]
#[1] "1994-07-01 UTC"
dates[2, 31] <- NA # Jun
dates[5, 31] <- NA # Sep
dates[7, 31] <- NA # Nov
erai <- Start(dat = repos_obs,
var = 'tas',
file_date = dates_file,
time = values(dates),
latitude = indices(1:10),
longitude = indices(1:10),
time_var = 'time',
time_across = 'file_date',
merge_across_dims = TRUE,
#keep NAs of the last day in 30-day months
merge_across_dims_narm = FALSE,
split_multiselected_dims = TRUE,
return_vars = list(latitude = NULL,
longitude = NULL,
time = 'file_date'),
retrieve = TRUE)
#------- Check erai -----------
erai[1, 1, 2, , 1, 1] # June
# [1] 269.9410 269.6855 268.7380 268.5008 270.3236 271.5151 270.5046 270.1686
# [9] 270.5395 272.0379 272.5489 271.1494 270.7764 270.5678 272.0331 273.7856
#[17] 273.9849 274.5904 273.4369 273.8404 274.4068 274.2292 274.7375 275.5104
#[25] 275.4324 274.9408 274.8679 276.5602 275.0995 274.6409 NA
erai[1, 1, 5, , 1, 1] # Sep
# [1] 270.0656 270.7113 268.4678 271.6489 271.2354 269.7831 269.8045 268.7994
# [9] 266.3092 262.2734 265.0124 261.8378 265.3950 257.1690 255.8402 264.8826
#[17] 267.8663 266.6875 262.5502 258.5476 258.9617 263.6396 257.1111 264.8644
#[25] 261.0085 256.7690 256.5811 256.4331 256.1260 256.4716 NA
#------------------------------
......@@ -22,7 +22,7 @@ Start(
metadata_dims = NULL,
selector_checker = SelectorChecker,
merge_across_dims = FALSE,
-merge_across_dims_narm = FALSE,
+merge_across_dims_narm = TRUE,
split_multiselected_dims = FALSE,
path_glob_permissive = FALSE,
largest_dims_length = FALSE,
......@@ -607,7 +607,8 @@ It is helpful when the length of the to-be-merged dimension is different
across another dimension. For example, if the dimension 'time' extends
across dimension 'chunk', and the time length along the first chunk is 2
while along the second chunk is 10. Setting this parameter as TRUE can
-remove the additional 8 NAs at position 3 to 10. The default value is FALSE.}
+remove the additional 8 NAs at position 3 to 10. The default value is TRUE,
+but will be automatically turned to FALSE if 'merge_across_dims = FALSE'.}
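The padding-and-removal behaviour described in this item can be illustrated with plain vectors. This is a sketch of the concept only, not the internal implementation:

```r
# Two chunks of the 'time' dimension with unequal lengths (2 and 10),
# merged across the 'chunk' dimension:
chunk1 <- c(1, 2)
chunk2 <- 10:19

# Without removal, the shorter chunk is padded to the common length with
# NAs at positions 3 to 10:
padded <- c(chunk1, rep(NA, 8), chunk2)
length(padded)   # 20

# merge_across_dims_narm = TRUE drops those 8 padding NAs:
merged <- padded[!is.na(padded)]
length(merged)   # 12
```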
\item{split_multiselected_dims}{A logical value indicating whether to split a
dimension that has been selected with a multidimensional array of selectors
......
context("DCPP successfully retrieved for depends and across parameters.")
test_that("Chunks of DCPP files- Local execution", {
path <- '/esarchive/exp/CMIP6/dcppA-hindcast/hadgem3-gc31-mm/cmip6-dcppA-hindcast_i1p1/DCPP/MOHC/HadGEM3-GC31-MM/dcppA-hindcast/r1i1p1f2/Omon/tos/gn/v20200417/$var$_Omon_HadGEM3-GC31-MM_dcppA-hindcast_s$sdate$-r1i1p1f2_gn_$chunk$.nc'
sdates <- c('2017', '2018')
dat <- Start(dat = path,
var = 'tos',
sdate = sdates,
chunk = indices(3:5),
chunk_depends = 'sdate',
time = 'all',
i = indices(1:10),
j = indices(1:10),
time_across = 'chunk',
merge_across_dims = TRUE,
retrieve = TRUE,
return_vars = list(time = 'sdate'))
# [sdate = 2, chunk = 3]
dat_2018_chunk3 <- Start(dat = '/esarchive/exp/CMIP6/dcppA-hindcast/hadgem3-gc31-mm/cmip6-dcppA-hindcast_i1p1/DCPP/MOHC/HadGEM3-GC31-MM/dcppA-hindcast/r1i1p1f2/Omon/tos/gn/v20200417/$var$_Omon_HadGEM3-GC31-MM_dcppA-hindcast_s2018-r1i1p1f2_gn_202201-202212.nc',
var = 'tos', time = 'all', i = indices(1:10), j = indices(1:10),
retrieve = TRUE)
expect_equal(dat[1,1,2,25:36,,], dat_2018_chunk3[1,1,,,])
# [sdate = 1, chunk = 2]
dat_2017_chunk2 <- Start(dat = '/esarchive/exp/CMIP6/dcppA-hindcast/hadgem3-gc31-mm/cmip6-dcppA-hindcast_i1p1/DCPP/MOHC/HadGEM3-GC31-MM/dcppA-hindcast/r1i1p1f2/Omon/tos/gn/v20200417/$var$_Omon_HadGEM3-GC31-MM_dcppA-hindcast_s2017-r1i1p1f2_gn_202001-202012.nc',
var = 'tos', time = 'all', i = indices(1:10), j = indices(1:10),
retrieve = TRUE)
expect_equal(dat[1,1,1,13:24,,], dat_2017_chunk2[1,1,,,])
# [sdate = 2, chunk = 1]
dat_2018_chunk1 <- Start(dat = '/esarchive/exp/CMIP6/dcppA-hindcast/hadgem3-gc31-mm/cmip6-dcppA-hindcast_i1p1/DCPP/MOHC/HadGEM3-GC31-MM/dcppA-hindcast/r1i1p1f2/Omon/tos/gn/v20200417/$var$_Omon_HadGEM3-GC31-MM_dcppA-hindcast_s2018-r1i1p1f2_gn_202001-202012.nc',
var = 'tos', time = 'all', i = indices(1:10), j = indices(1:10),
retrieve = TRUE)
expect_equal(dat[1,1,2,1:12,,], dat_2018_chunk1[1,1,,,])
})
context("Start() implicit inner dimension")
-# The unit test is for the implicitly defined inner dimension. If a file dimension selector
-# is an array with named dimensions, and 'split_multiselected_dims' is used, then the file
-# dim can be split into multiple dimensions that may contain inner dimensions.
-# merge_across_dims + split_multiselected_dims + implicit inner dim????
+# The unit test is for the implicit inner dimension. If the inner dimension length is 1,
+# startR allows it not to be specified in the call. Users can still define it in
+# 'return_vars'.
#---------------------------------------------------------------
-test_that("1. Split into inner dimension", {
+test_that("1. time = 1", {
obs.path <- "/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h-r1440x721cds/$var$_$file_date$.nc"
variable <- "prlr"
-dates_file <- c("201311","201312","201411","201412")
-dim(dates_file) <- c(time = 2, syear = 2)
+dates_file <- c("201311","201312")
# (1)
suppressWarnings(
obs <- Start(dat = obs.path,
var = variable,
......@@ -25,8 +20,8 @@ obs <- Start(dat = obs.path,
longitude_reorder = CircularSort(-180, 180),
synonims = list(latitude = c('lat', 'latitude'),
longitude = c('lon', 'longitude')),
-return_vars = list(latitude = 'dat',
-longitude = 'dat',
+return_vars = list(latitude = NULL,
+longitude = NULL,
time = 'file_date'),
split_multiselected_dims = TRUE,
retrieve = FALSE)
......@@ -34,11 +29,11 @@ obs <- Start(dat = obs.path,
expect_equal(
attr(obs, 'Dimensions'),
-c(dat = 1, var = 1, time = 2, syear = 2, latitude = 18, longitude = 81)
+c(dat = 1, var = 1, file_date = 2, latitude = 18, longitude = 81)
)
expect_equal(
dim(attr(obs, 'Variables')$common$time),
-c(file_date = 4, time = 1)
+c(file_date = 2, time = 1)
)
expect_equal(
attr(obs, 'Variables')$common$time[1, 1],
......@@ -48,42 +43,3 @@ as.POSIXct('2013-11-15 23:30:00', tz = 'UTC')
})
test_that("2. Split into file dimension", {
obs.path <- "/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h-r1440x721cds/$var$_$file_date$.nc"
variable <- "prlr"
dates_file <- c("201311","201312","201411","201412")
dim(dates_file) <- c(smonth = 2, syear = 2)
suppressWarnings(
obs <- Start(dat = obs.path,
var = variable,
file_date = dates_file,
time = 'all',
latitude = values(list(35.6, 40)),
latitude_reorder = Sort(decreasing = TRUE),
longitude = values(list(-10, 10)),
longitude_reorder = CircularSort(-180, 180),
synonims = list(latitude = c('lat', 'latitude'),
longitude = c('lon', 'longitude')),
return_vars = list(latitude = 'dat',
longitude = 'dat',
time = 'file_date'),
split_multiselected_dims = TRUE,
retrieve = FALSE)
)
expect_equal(
attr(obs, 'Dimensions'),
c(dat = 1, var = 1, smonth = 2, syear = 2, time = 1, latitude = 18, longitude = 81)
)
expect_equal(
dim(attr(obs, 'Variables')$common$time),
c(file_date = 4, time = 1)
)
expect_equal(
attr(obs, 'Variables')$common$time[1, 1],
as.POSIXct('2013-11-15 23:30:00', tz = 'UTC')
)
})
context("Start() reshape parameters check")
# This one is more comprehensive than test-Start-split-merge.R
path_exp <- '/esarchive/exp/ecmwf/system5c3s/daily_mean/$var$_f6h/$var$_$sdate$.nc'
path_obs <- '/esarchive/recon/ecmwf/era5/daily_mean/$var$_f1h-r360x181/$var$_$date$.nc'
var <- 'tas'
sdate <- paste0(1993:1995, '1201')
suppressWarnings(
exp <- Start(dat = path_exp,
var = var,
sdate = sdate,
time = indices(1:90), #indices(1:91),
ensemble = indices(1),
lat = indices(1),
lon = indices(1),
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'sdate'),
retrieve = FALSE, silent = TRUE)
)
dates <- attr(exp, 'Variables')$common$time
# easyNCDF
library(easyNCDF)
# obs
easy_sdate <- c('199312', paste0(rep(1994:1995, each = 3), c('01', '02', '12')),
'199601', '199602')
easy_array <- c()
for (i in 1:length(easy_sdate)) {
easy_file <- NcOpen(paste0('/esarchive/recon/ecmwf/era5/daily_mean/tas_f1h-r360x181/tas_',
easy_sdate[i], '.nc'))
if (substr(easy_sdate[i], 5, 6) == '02') {
sub_time <- 1:28
} else {
sub_time <- 1:31
}
easy_obs <- NcToArray(easy_file, vars_to_read = 'tas',
dim_indices = list(lon = c(1), lat = c(1), time = sub_time))
NcClose(easy_file)
easy_array <- c(easy_array, as.vector(easy_obs))
}
dim(easy_array) <- c(time = 90, sdate = 3)
test_that("1. split + merge + narm", {
sorted_dates <- sort(unique(format(dates, '%Y%m')))
unsorted_dates <- unique(format(dates, '%Y%m'))
# unsorted dates
obs1 <- Start(dat = path_obs,
var = var,
date = unsorted_dates,
time = values(dates), #dim: [sdate = 3, time = 90]
lat = indices(1),
lon = indices(1),
time_across = 'date',
merge_across_dims = TRUE,
merge_across_dims_narm = TRUE,
split_multiselected_dims = TRUE,
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'date'),
retrieve = TRUE)
# sorted_dates
obs2 <- Start(dat = path_obs,
var = var,
date = sorted_dates,
time = values(dates), #dim: [sdate = 3, time = 90]
lat = indices(1),
lon = indices(1),
time_across = 'date',
merge_across_dims = TRUE,
merge_across_dims_narm = TRUE,
split_multiselected_dims = TRUE,
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'date'),
retrieve = TRUE)
expect_equal(
dim(obs1),
c(dat = 1, var = 1, sdate = 3, time = 90, lat = 1, lon = 1)
)
expect_equal(
dim(obs1),
dim(obs2)
)
expect_equal(
as.vector(obs1),
as.vector(obs2)
)
expect_equal(
as.vector(obs1[1, 1, 1, , 1, 1]),
as.vector(easy_array[, 1])
)
expect_equal(
as.vector(obs1[1, 1, 2, , 1, 1]),
as.vector(easy_array[, 2])
)
expect_equal(
as.vector(obs1[1, 1, 3, , 1, 1]),
as.vector(easy_array[, 3])
)
})
test_that("2. split + merge", {
exp <- Start(dat = path_exp,
var = var,
sdate = sdate,
time = indices(1:62),
ensemble = indices(1),
lat = indices(1),
lon = indices(1),
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'sdate'),
retrieve = FALSE)
dates <- attr(exp, 'Variables')$common$time
sorted_dates <- sort(unique(format(dates, '%Y%m')))
unsorted_dates <- unique(format(dates, '%Y%m'))
# unsorted dates
obs1 <- Start(dat = path_obs,
var = var,
date = unsorted_dates,
time = values(dates), #dim: [sdate = 3, time = 62]
lat = indices(1),
lon = indices(1),
time_across = 'date',
merge_across_dims = TRUE,
# merge_across_dims_narm = TRUE,
split_multiselected_dims = TRUE,
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'date'),
retrieve = TRUE)
# sorted_dates
obs2 <- Start(dat = path_obs,
var = var,
date = sorted_dates,
time = values(dates), #dim: [sdate = 3, time = 62]
lat = indices(1),
lon = indices(1),
time_across = 'date',
merge_across_dims = TRUE,
# merge_across_dims_narm = TRUE,
split_multiselected_dims = TRUE,
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'date'),
retrieve = TRUE)
expect_equal(
dim(obs1),
c(dat = 1, var = 1, sdate = 3, time = 62, lat = 1, lon = 1)
)
expect_equal(
as.vector(obs1[1, 1, 1, , 1, 1]),
as.vector(easy_array[1:62, 1])
)
expect_equal(
as.vector(obs1[1, 1, 2, , 1, 1]),
as.vector(easy_array[1:62, 2])
)
expect_equal(
as.vector(obs1[1, 1, 3, , 1, 1]),
as.vector(easy_array[1:62, 3])
)
expect_equal(
as.vector(obs1),
as.vector(obs2)
)
})
test_that("3. merge", {
# NOTE: The three files are all regarded as having time = 31, even though 199402
# only has 28. It happens when time = 'all' or time = indices(). It seems
# reasonable when 'merge_across_dims' is not used, but if it is used, it is
# common to expect 31+31+28. See the next test "4. merge + narm". 199402 is
# still regarded as 31, so the NAs are not removed.
suppressWarnings(
obs3 <- Start(dat = path_obs,
var = var,
date = c('199312', '199401', '199402'),
time = 'all',
lat = indices(1),
lon = indices(1),
time_across = 'date',
merge_across_dims = TRUE,
# merge_across_dims_narm = TRUE,
# split_multiselected_dims = TRUE,
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'date'),
retrieve = TRUE)
)
expect_equal(
dim(obs3),
c(dat = 1, var = 1, time = 93, lat = 1, lon = 1)
)
expect_equal(
as.vector(obs3),
c(as.vector(easy_array[, 1]), NA, NA, NA)
)
})
test_that("4. merge + narm", {
# (1) Notice that the NAs at the tail of 199402 won't be removed because Start()
# considers that all the files have the same length, i.e., 31.
# The NAs in 199402 are regarded as part of the original file.
obs3 <- Start(dat = path_obs,
var = var,
date = c('199312', '199401', '199402'),
time = 'all',
lat = indices(1),
lon = indices(1),
time_across = 'date',
merge_across_dims = TRUE,
merge_across_dims_narm = TRUE,
# split_multiselected_dims = TRUE,
synonims = list(lat = c('lat', 'latitude'),
lon = c('lon', 'longitude')),
return_vars = list(lon = NULL,
lat = NULL,
time = 'date'),
retrieve = TRUE)
expect_equal(
dim(obs3),
c(dat = 1, var = 1, time = 93, lat = 1, lon = 1)
)
expect_equal(
as.vector(obs3),
c(as.vector(easy_array[, 1]), NA, NA, NA)
)
# (2) It's tricky that 199402 is considered time = 31 because Start() considers
# that all the files have the same length. So it won't return an error when
# time = indices(93).