Commit 35491733 authored by aho

Merge branch 'master' into 'production'

Merge master to production

See merge request !201
parents 8862cc98 0bb0f2b6
Pipeline #8648 passed in 2 minutes and 17 seconds
......@@ -6,16 +6,16 @@
^README\.md$
#\..*\.RData$
#^vignettes$
#^tests$
^inst/doc$
#^inst/doc/*$
#^inst/doc/figures/$
#^inst/doc/usecase/$
#^inst/PlotProfiling\.R$
.gitlab-ci.yml
^\.gitlab-ci\.yml$
## unit tests should be ignored when building the package for CRAN
^tests$
^inst/PlotProfiling\.R$
# Suggested by http://r-pkgs.had.co.nz/package.html
^.*\.Rproj$ # Automatically added by RStudio,
^\.Rproj\.user$ # used for temporary files.
^README\.Rmd$ # An Rmarkdown file used to generate README.md
^cran-comments\.md$ # Comments for CRAN submission
^NEWS\.md$ # A news file written in Markdown
#^NEWS\.md$ # A news file written in Markdown
^\.gitlab-ci\.yml$
......@@ -3,7 +3,7 @@ stages:
build:
stage: build
script:
- module load R/3.6.1-foss-2015a-bare
- module load R/4.1.2-foss-2015a-bare
- module load CDO/1.9.8-foss-2015a
- R CMD build --resave-data .
- R CMD check --as-cran --no-manual --run-donttest startR_*.tar.gz
......
Package: startR
Title: Automatically Retrieve Multidimensional Distributed Data Sets
Version: 2.2.0-1
Version: 2.2.1
Authors@R: c(
person("Nicolau", "Manubens", , "nicolau.manubens@bsc.es", role = c("aut")),
person("An-Chi", "Ho", , "an.ho@bsc.es", role = c("aut", "cre")),
......@@ -8,6 +8,7 @@ Authors@R: c(
person("Javier", "Vegas", , "javier.vegas@bsc.es", role = c("ctb")),
person("Pierre-Antoine", "Bretonniere", , "pierre-antoine.bretonniere@bsc.es", role = c("ctb")),
person("Roberto", "Serrano", , "rsnotivoli@gmal.com", role = c("ctb")),
person("Eva", "Rifa", , "eva.rifarovira@bsc.es", role = "ctb"),
person("BSC-CNS", role = c("aut", "cph")))
Description: Tool to automatically fetch, transform and arrange subsets of
multi-dimensional data sets (collections of files) stored in local and/or
......@@ -39,4 +40,5 @@ License: Apache License 2.0
URL: https://earth.bsc.es/gitlab/es/startR/
BugReports: https://earth.bsc.es/gitlab/es/startR/-/issues
SystemRequirements: cdo ecFlow
RoxygenNote: 7.0.1
Encoding: UTF-8
RoxygenNote: 7.2.0
# startR v2.2.0-1 (Release date: 2022-04-19)
# startR v2.2.1 (Release date: 2022-11-17)
- Reduce warning messages from CDO.
- Reduce repetitive warning messages from CDORemapper() when a single core is used. When multiple
cores are used, there are still repetitive messages.
- Bugfix in Start() about ClimProjDiags::Subset inputs.
- Bugfix when the longitude selector range is very close to global but not global. The transform indices are correctly selected now.
# startR v2.2.0-2 (Release date: 2022-08-25; internally)
- Use the destination grid to decide which indices to take after interpolation.
- Bugfix when Start() parameter "return_vars" is not used.
- Allow netCDF files to not have calendar attributes (force it to be standard calendar)
# startR v2.2.0-1 (Release date: 2022-04-19; internally)
- Bugfix for the case that the variable has time-like units, e.g., "days".
- Development of metadata reshaping. The metadata now corresponds to the data when the data are reshaped by the parameters "merge_across_dims" and "split_multiselected_dims", as well as when the data selectors are not continuous indices.
- Development of multiple dependency by array selector. An inner dimension's indices can vary with multiple file dimensions.
......
......@@ -25,7 +25,7 @@
#' to use for the computation. The default value is 1.
#'@param cluster A list of components that define the configuration of the
#' machine to be run on. The components vary across different machines.
#' Check \href{https://earth.bsc.es/gitlab/es/startR/}{startR GitLab} for more
#' Check \href{https://earth.bsc.es/gitlab/es/startR/-/blob/master/inst/doc/practical_guide.md}{Practical guide on GitLab} for more
#' details and examples. Only needed when the computation is not run locally.
#' The default value is NULL.
#'@param ecflow_suite_dir A character string indicating the path to a folder in
......
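For context on the documentation fix above: a minimal sketch of a Compute() call with such a `cluster` list. The component names follow the practical guide linked in the new text; the host alias, directories, and module versions are placeholders, not a tested configuration.

```r
library(startR)

# 'wf' is assumed to be a workflow built beforehand with Step() + AddStep().
res <- Compute(wf,
               chunks = list(latitude = 2, longitude = 2),  # 4 chunks in total
               threads_load = 2,
               threads_compute = 4,
               cluster = list(                     # only needed for remote runs
                 queue_host = 'hpc_alias',         # placeholder machine alias
                 queue_type = 'slurm',
                 temp_dir = '/scratch/user/startR_hpc/',
                 r_module = 'R/4.1.2-foss-2015a-bare',  # matches the CI change above
                 CDO_module = 'CDO/1.9.8-foss-2015a',
                 cores_per_job = 4,
                 job_wallclock = '01:00:00',
                 max_jobs = 4,
                 polling_period = 10,
                 bidirectional = FALSE
               ),
               ecflow_suite_dir = '/home/user/startR_local/')
```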
......@@ -210,6 +210,17 @@ NcDataReader <- function(file_path = NULL, file_object = NULL,
} else if (grepl(' since ', units)) {
# Find the calendar
calendar <- attr(result, 'variables')[[var_name]]$calendar
# Calendar types recognized by as.PCICt()
cal.list <- c("365_day", "365", "noleap", "360_day", "360", "gregorian", "standard", "proleptic_gregorian")
if (is.null(calendar)) {
warning("Calendar is missing. Use the standard calendar to calculate time values.")
calendar <- 'gregorian'
} else if (!calendar %in% cal.list) {
# If the calendar is not recognized by as.PCICt(), force it to be standard
warning("The calendar type '", calendar, "' is not recognized by NcDataReader(). It is forced to be the standard type.")
calendar <- 'gregorian'
}
if (calendar == 'standard') calendar <- 'gregorian'
parts <- strsplit(units, ' since ')[[1]]
......@@ -291,6 +302,7 @@ NcDataReader <- function(file_path = NULL, file_object = NULL,
result <- result * 30 * 24 * 60 * 60 # day to sec
} else { #old code. The calendar is not in any of the above.
#NOTE: This branch should not be reached, because unrecognized calendar types are already forced to be standard above.
result <- result * 30.5
result <- result * 24 * 60 * 60 # day to sec
}
......
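Background for the calendar guard added above: `as.PCICt()` (from the PCICt package used here) only accepts the calendar strings collected in `cal.list`, and the date arithmetic differs per calendar, which is also why the month-to-second factor differs between branches. A small standalone illustration, independent of NcDataReader():

```r
library(PCICt)

# "days since" units are resolved by adding seconds to the origin in the
# file's calendar. In a 360_day calendar every month has 30 days, so the
# same offset lands on a different date than in the gregorian calendar.
origin_greg <- as.PCICt("1990-01-01", cal = "gregorian", format = "%Y-%m-%d")
origin_360  <- as.PCICt("1990-01-01", cal = "360_day",   format = "%Y-%m-%d")

offset <- 60 * 24 * 60 * 60   # 60 days, expressed in seconds
origin_greg + offset          # "1990-03-02" (31-day January, 28-day February)
origin_360  + offset          # "1990-03-01" (all months have 30 days)

# An unsupported calendar string would make as.PCICt() fail, which is why
# the new code falls back to 'gregorian' with a warning instead.
```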
......@@ -1221,7 +1221,7 @@ Start <- function(..., # dim = indices/selectors,
dims_to_check <- debug
debug <- TRUE
}
############################## READING FILE DIMS ############################
# Check that no unrecognized variables are present in the path patterns
# and also that no file dimensions are requested to THREDDs catalogs.
......@@ -1409,7 +1409,7 @@ Start <- function(..., # dim = indices/selectors,
# Chunk it only if it is a defined dim (i.e., a list of characters with names of the depended dim)
if (!(length(dat_selectors[[depending_dim_name]]) == 1 &&
dat_selectors[[depending_dim_name]] %in% c('all', 'first', 'last'))) {
if (sapply(dat_selectors[[depending_dim_name]], is.character)) {
if (any(sapply(dat_selectors[[depending_dim_name]], is.character))) {
dat_selectors[[depending_dim_name]] <-
dat_selectors[[depending_dim_name]][desired_chunk_indices]
}
......@@ -1947,6 +1947,10 @@ Start <- function(..., # dim = indices/selectors,
transformed_common_vars_unorder_indices <- NULL
transform_crop_domain <- NULL
# store warning messages from transform
warnings1 <- NULL
warnings2 <- NULL
for (i in 1:length(dat)) {
if (dataset_has_files[i]) {
indices <- indices_of_first_files_with_data[[i]]
......@@ -2101,11 +2105,16 @@ Start <- function(..., # dim = indices/selectors,
}
# Transform the variables
transformed_data <- do.call(transform, c(list(data_array = NULL,
variables = vars_to_transform,
file_selectors = selectors_of_first_files_with_data[[i]],
crop_domain = transform_crop_domain),
transform_params))
tmp <- .withWarnings(
do.call(transform, c(list(data_array = NULL,
variables = vars_to_transform,
file_selectors = selectors_of_first_files_with_data[[i]],
crop_domain = transform_crop_domain),
transform_params))
)
transformed_data <- tmp$value
warnings1 <- c(warnings1, tmp$warnings)
# Discard the common transformed variables if already transformed before
if (!is.null(transformed_common_vars)) {
common_ones <- which(names(picked_common_vars) %in% names(transformed_data$variables))
......@@ -2661,8 +2670,14 @@ Start <- function(..., # dim = indices/selectors,
selector_indices_to_take <- which(selector_file_dim_array == j, arr.ind = TRUE)[1, ]
names(selector_indices_to_take) <- names(selector_file_dims)
selector_store_position[names(selector_indices_to_take)] <- selector_indices_to_take
sub_array_of_selectors <- Subset(selector_array, names(selector_indices_to_take),
as.list(selector_indices_to_take), drop = 'selected')
# "selector_indices_to_take" is an array if "selector_file_dims" is not 1 (if
# selector is an array with a file_dim dimname, i.e., time = [sdate = 2, time = 4].
if (!is.null(names(selector_indices_to_take))) {
sub_array_of_selectors <- Subset(selector_array, names(selector_indices_to_take),
as.list(selector_indices_to_take), drop = 'selected')
} else {
sub_array_of_selectors <- selector_array
}
if (debug) {
if (inner_dim %in% dims_to_check) {
......@@ -2681,8 +2696,14 @@ Start <- function(..., # dim = indices/selectors,
} else {
if (length(names(var_file_dims)) > 0) {
var_indices_to_take <- selector_indices_to_take[which(names(selector_indices_to_take) %in% names(var_file_dims))]
sub_array_of_values <- Subset(var_with_selectors, names(var_indices_to_take),
as.list(var_indices_to_take), drop = 'selected')
if (!is.null(names(var_indices_to_take))) {
sub_array_of_values <- Subset(var_with_selectors, names(var_indices_to_take),
as.list(var_indices_to_take), drop = 'selected')
} else {
# The selector goes across some file dim (e.g., "file_date") but doesn't have
# this file dim as a dimension (e.g., time: [sdate, time])
sub_array_of_values <- var_with_selectors
}
} else {
sub_array_of_values <- var_with_selectors
}
......@@ -2967,12 +2988,16 @@ Start <- function(..., # dim = indices/selectors,
inner_dim, sub_array_of_fri)
}
}
transformed_subset_var <- do.call(transform, c(list(data_array = NULL,
variables = subset_vars_to_transform,
file_selectors = selectors_of_first_files_with_data[[i]],
crop_domain = transform_crop_domain),
transform_params))$variables[[var_with_selectors_name]]
tmp <- .withWarnings(
do.call(transform, c(list(data_array = NULL,
variables = subset_vars_to_transform,
file_selectors = selectors_of_first_files_with_data[[i]],
crop_domain = transform_crop_domain),
transform_params))$variables[[var_with_selectors_name]]
)
transformed_subset_var <- tmp$value
warnings2 <- c(warnings2, tmp$warnings)
# Sorting the transformed variable and working out the indices again after transform.
if (!is.null(dim_reorder_params[[inner_dim]])) {
transformed_subset_var_reorder <- dim_reorder_params[[inner_dim]](transformed_subset_var)
......@@ -3046,97 +3071,18 @@ Start <- function(..., # dim = indices/selectors,
sub_array_of_sri <- sub_array_of_sri[[1]]:sub_array_of_sri[[2]]
}
# Chunk sub_array_of_sri if this inner_dim needs to be chunked
#TODO: Potential problem: the transformed_subset_var value falls between
# the end of sub_sub_array_of_values of the 1st chunk and the beginning
# of sub_sub_array_of_values of the 2nd chunk. Then, one sub_array_of_sri
# will be missed. 'previous_sri' is checked and will be included if this
# situation happens, but don't know if the transformed result is
# correct or not.
# NOTE: The chunking criteria may not be 100% correct. The current way
# is to pick the sri that are larger than the minimal sub_sub_array_of_values
# and smaller than the maximal sub_sub_array_of_values; if it's
# the first chunk, make sure the 1st sri is included; if it's the
# last chunk, make sure the last sri is included.
if (chunks[[inner_dim]]["n_chunks"] > 1) {
sub_array_of_sri_complete <- sub_array_of_sri
if (is.list(sub_sub_array_of_values)) { # list
sub_array_of_sri <-
which(transformed_subset_var >= min(unlist(sub_sub_array_of_values)) &
transformed_subset_var <= max(unlist(sub_sub_array_of_values)))
# if it's 1st chunk & the first sri is not included, include it.
if (chunks[[inner_dim]]["chunk"] == 1 &
!(sub_array_of_sri_complete[1] %in% sub_array_of_sri)) {
sub_array_of_sri <- c(sub_array_of_sri_complete[1], sub_array_of_sri)
}
# if it's last chunk & the last sri is not included, include it.
if (chunks[[inner_dim]]["chunk"] == chunks[[inner_dim]]["n_chunks"] &
!(tail(sub_array_of_sri_complete, 1) %in% sub_array_of_sri)) {
sub_array_of_sri <- c(sub_array_of_sri, tail(sub_array_of_sri_complete, 1))
}
#========================================================
# Check if sub_array_of_sri perfectly connects to the previous sri.
# If not, include the previous sri.
#NOTE 1: don't know if the transform for the previous sri is
# correct or not.
#NOTE 2: If crop = T, sub_array_of_sri always starts from 1.
# Don't know if the cropping will miss some sri or not.
if (sub_array_of_sri[1] != 1) {
if (!is.null(previous_sub_sub_array_of_values)) {
# if decreasing = F
if (transformed_subset_var[1] < transformed_subset_var[2]) {
previous_sri <- max(which(transformed_subset_var <= previous_sub_sub_array_of_values))
} else {
# if decreasing = T
previous_sri <- max(which(transformed_subset_var >= previous_sub_sub_array_of_values))
}
if (previous_sri + 1 != sub_array_of_sri[1]) {
sub_array_of_sri <- (previous_sri + 1):sub_array_of_sri[length(sub_array_of_sri)]
}
}
}
} else { # is vector
tmp <- which(transformed_subset_var >= min(sub_sub_array_of_values) &
transformed_subset_var <= max(sub_sub_array_of_values))
# Ensure tmp and sub_array_of_sri are both ascending or descending
if (is.unsorted(tmp) != is.unsorted(sub_array_of_sri)) {
tmp <- rev(tmp)
}
# Include the first or last sri if tmp doesn't have it. It's only for
# "vectors" because vectors look for the closest value.
#NOTE: The condition here is not correct. The criteria should be
# 'vector' instead of indices.
if (chunks[[inner_dim]]["chunk"] == 1) {
sub_array_of_sri <- unique(c(sub_array_of_sri[1], tmp))
} else if (chunks[[inner_dim]]["chunk"] ==
chunks[[inner_dim]]["n_chunks"]) { # last chunk
sub_array_of_sri <- unique(c(tmp, sub_array_of_sri[length(sub_array_of_sri)]))
} else {
sub_array_of_sri <- tmp
}
# Check if sub_array_of_sri perfectly connects to the previous sri.
# If not, include the previous sri.
#NOTE 1: don't know if the transform for the previous sri is
# correct or not.
#NOTE 2: If crop = T, sub_array_of_sri always starts from 1.
# Don't know if the cropping will miss some sri or not.
if (sub_array_of_sri[1] != 1) {
if (!is.null(previous_sub_sub_array_of_values)) {
# if decreasing = F
if (transformed_subset_var[1] < transformed_subset_var[2]) {
previous_sri <- max(which(transformed_subset_var <= previous_sub_sub_array_of_values))
} else {
# if decreasing = T
previous_sri <- max(which(transformed_subset_var >= previous_sub_sub_array_of_values))
}
if (previous_sri + 1 != which(sub_array_of_sri[1] == sub_array_of_sri_complete)) {
sub_array_of_sri <- (previous_sri + 1):sub_array_of_sri[length(sub_array_of_sri)]
}
}
}
}
# Instead of using values to find sri, directly use the destination grid to count.
#NOTE: sub_array_of_sri seems to start at 1 always (because crop = c(lonmin, lonmax, latmin, latmax) already?)
if (chunks[[inner_dim]]["n_chunks"] > 1) {
sub_array_of_sri <- sub_array_of_sri[get_chunk_indices(
length(sub_array_of_sri),
chunks[[inner_dim]]["chunk"],
chunks[[inner_dim]]["n_chunks"],
inner_dim)]
}
#========================================================
ordered_sri <- sub_array_of_sri
sub_array_of_sri <- transformed_subset_var_unorder[sub_array_of_sri]
......@@ -3333,7 +3279,11 @@ Start <- function(..., # dim = indices/selectors,
selector_store_position <- chunk
}
sub_array_of_indices <- transformed_indices[which(indices_chunk == chunk)]
#NOTE: This 'with_transform' part is probably not tested because
# this part is for the inner dim that goes across a file dim, which
# is normally not lat and lon dimension. If in the future, we
# can interpolate time, this part needs to be examined.
if (with_transform) {
# If the provided selectors are expressed in the world
# before transformation
......@@ -3716,11 +3666,13 @@ Start <- function(..., # dim = indices/selectors,
tmp_fun <- function (x, y) {
any(names(dim(x)) %in% y)
}
inner_dim_has_split_dim <- names(which(unlist(lapply(
picked_common_vars, tmp_fun, names(all_split_dims)))))
if (!identical(inner_dim_has_split_dim, character(0))) {
# If merge_across_dims also, it will be replaced later
saved_reshaped_attr <- attr(picked_common_vars[[inner_dim_has_split_dim]], 'variables')
if (!is.null(picked_common_vars)) {
inner_dim_has_split_dim <- names(which(unlist(lapply(
picked_common_vars, tmp_fun, names(all_split_dims)))))
if (!identical(inner_dim_has_split_dim, character(0))) {
# If merge_across_dims also, it will be replaced later
saved_reshaped_attr <- attr(picked_common_vars[[inner_dim_has_split_dim]], 'variables')
}
}
}
}
......@@ -3785,7 +3737,7 @@ Start <- function(..., # dim = indices/selectors,
if (!merge_across_dims & split_multiselected_dims & identical(inner_dim_has_split_dim, character(0))) {
final_dims_fake_metadata <- NULL
} else {
if (!merge_across_dims & split_multiselected_dims) {
if (!merge_across_dims & split_multiselected_dims & !is.null(picked_common_vars)) {
if (any(names(all_split_dims[[1]]) %in% names(dim(picked_common_vars[[inner_dim_has_split_dim]]))) &
names(all_split_dims)[1] != inner_dim_has_split_dim) {
if (inner_dim_has_split_dim %in% names(final_dims)) {
......@@ -3803,7 +3755,10 @@ Start <- function(..., # dim = indices/selectors,
final_dims_fake, dims_of_merge_dim, all_split_dims)
}
}
# store warning messages from transform
warnings3 <- NULL
# The following several lines will only run if retrieve = TRUE
if (retrieve) {
......@@ -3882,10 +3837,12 @@ Start <- function(..., # dim = indices/selectors,
# the appropriate work pieces.
work_pieces <- retrieve_progress_message(work_pieces, num_procs, silent)
# NOTE: In .LoadDataFile(), metadata is saved in metadata_folder/1 or /2 etc. Before here,
# the path name is created in work_pieces but the path hasn't been built yet.
if (num_procs == 1) {
found_files <- lapply(work_pieces, .LoadDataFile,
tmp <- .withWarnings(
lapply(work_pieces, .LoadDataFile,
shared_matrix_pointer = shared_matrix_pointer,
file_data_reader = file_data_reader,
synonims = synonims,
......@@ -3893,9 +3850,15 @@ Start <- function(..., # dim = indices/selectors,
transform_params = transform_params,
transform_crop_domain = transform_crop_domain,
silent = silent, debug = debug)
)
found_files <- tmp$value
warnings3 <- c(warnings3, tmp$warnings)
} else {
cluster <- parallel::makeCluster(num_procs, outfile = "")
# Send the heavy work to the workers
##NOTE: .withWarnings() can't catch warnings here like it does above (num_procs == 1). The warnings
## show up later when "bigmemory::as.matrix(data_array)" is called. Don't know how to fix it for now.
work_errors <- try({
found_files <- parallel::clusterApplyLB(cluster, work_pieces, .LoadDataFile,
shared_matrix_pointer = shared_matrix_pointer,
......@@ -4000,7 +3963,7 @@ Start <- function(..., # dim = indices/selectors,
picked_common_vars[[across_inner_dim]] <- metadata_tmp
attr(picked_common_vars[[across_inner_dim]], 'variables') <- saved_reshaped_attr
}
if (split_multiselected_dims) {
if (split_multiselected_dims & !is.null(picked_common_vars)) {
if (!identical(inner_dim_has_split_dim, character(0))) {
metadata_tmp <- array(picked_common_vars[[inner_dim_has_split_dim]], dim = final_dims_fake_metadata)
# Convert numeric back to dates
......@@ -4129,7 +4092,7 @@ Start <- function(..., # dim = indices/selectors,
picked_common_vars[[across_inner_dim]] <- metadata_tmp
attr(picked_common_vars[[across_inner_dim]], 'variables') <- saved_reshaped_attr
}
if (split_multiselected_dims) {
if (split_multiselected_dims & !is.null(picked_common_vars)) {
if (!identical(inner_dim_has_split_dim, character(0))) {
metadata_tmp <- array(picked_common_vars[[inner_dim_has_split_dim]], dim = final_dims_fake_metadata)
# Convert numeric back to dates
......@@ -4143,6 +4106,16 @@ Start <- function(..., # dim = indices/selectors,
}
}
# Print the warnings from transform
if (!is.null(c(warnings1, warnings2, warnings3))) {
transform_warnings_list <- lapply(c(warnings1, warnings2, warnings3), function(x) {
return(x$message)
})
transform_warnings_list <- unique(transform_warnings_list)
for (i in 1:length(transform_warnings_list)) {
.warning(transform_warnings_list[[i]])
}
}
# Change final_dims_fake back because retrieve = FALSE will use it for attributes later
if (exists("final_dims_fake_output")) {
......
......@@ -859,3 +859,13 @@
}
return(unlist(new_list))
}
.withWarnings <- function(expr) {
myWarnings <- NULL
wHandler <- function(w) {
myWarnings <<- c(myWarnings, list(w))
invokeRestart("muffleWarning")
}
val <- withCallingHandlers(expr, warning = wHandler)
list(value = val, warnings = myWarnings)
}
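A quick usage sketch of the new `.withWarnings()` helper; the expression below is arbitrary, chosen only to raise a warning:

```r
# Evaluate an expression, muffling each warning as it is raised and
# collecting the warning objects next to the returned value.
tmp <- .withWarnings(sqrt(-1))
tmp$value                                # NaN
sapply(tmp$warnings, conditionMessage)   # "NaNs produced"
```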
......@@ -566,7 +566,7 @@ generate_vars_to_transform <- function(vars_to_transform, picked_vars, transform
# Turn indices to values for transform_crop_domain
generate_transform_crop_domain_values <- function(transform_crop_domain, picked_vars) {
if (transform_crop_domain == 'all') {
if (any(transform_crop_domain == 'all')) {
transform_crop_domain <- c(picked_vars[1], tail(picked_vars, 1))
} else { # indices()
if (is.list(transform_crop_domain)) {
......@@ -692,9 +692,11 @@ generate_sub_array_of_fri <- function(with_transform, goes_across_prime_meridian
} else if (start_padding < beta) {
# left side too close to border, need to go to right side
sub_array_of_fri <- c((first_index - start_padding):(last_index + end_padding), (n - (beta - start_padding - 1)):n)
sub_array_of_fri <- unique(sub_array_of_fri)
} else if (end_padding < beta) {
# right side too close to border, need to go to left side
sub_array_of_fri <- c(1: (beta - end_padding), (first_index - start_padding):(last_index + end_padding))
sub_array_of_fri <- unique(sub_array_of_fri)
}
}
......@@ -706,6 +708,7 @@ generate_sub_array_of_fri <- function(with_transform, goes_across_prime_meridian
}
}
}
if (print_warning) {
.warning(paste0("Adding parameter transform_extra_cells = ", beta,
" to the transformed index excesses ",
......
......@@ -28,6 +28,8 @@ This document intends to be the first reference for any doubts that you may have
22. [Define the selector when the indices in the files are not aligned](#22-define-the-selector-when-the-indices-in-the-files-are-not-aligned)
23. [The best practice of using vector and list for selectors](#23-the-best-practice-of-using-vector-and-list-for-selectors)
24. [Do both interpolation and chunking on spatial dimensions](#24-do-both-interpolation-and-chunking-on-spatial-dimensions)
25. [What to do if your function has too many target dimensions](#25-what-to-do-if-your-function-has-too-many-target-dimensions)
26. [Use merge_across_dims_narm to remove NAs](#26-use-merge_across_dims_narm-to-remove-nas)
</b>
2. **Something goes wrong...**
......@@ -82,12 +84,12 @@ all the possibilities and make the output data structure reasonable all the time
Therefore, it is recommended to understand the way Start() works first;
then you will know what to expect from the output and will not get confused by what it returns.
If you want to connect xxx along yyy, the parameters 'merge_across_dims' and 'merge_across_dims_narm' can help you achieve it.
See Example 1. If 'merge_across_dims = TRUE', the chunk dimension will disappear.
'merge_across_dims' simply attaches data one after another, so the NA values (if exist) will be the same places as the unmerged one (see Example 2).
Now that we understand the cross relationship between dimensions, we can talk about how to merge them: use the parameters `merge_across_dims` and `merge_across_dims_narm`.
See Example 1. If `merge_across_dims = TRUE`, the chunk dimension will disappear.
`merge_across_dims` simply attaches data one after another, so the NA values (if any) will be in the same places as in the unmerged data (see Example 2).
If you want to remove those additional NAs, you can use 'merge_across_dims_narm = TRUE',
then the NAs will be removed when merging into one dimension. (see Example 2).
If you want to remove those additional NAs, you can use `merge_across_dims_narm = TRUE`,
then the NAs will be removed when merging into one dimension (see Example 2). To know more about `merge_across_dims_narm`, check [How-to-26](#26-use-merge-across-dims-narm-to-remove-nas).
You can find more use cases at [ex1_2_exp_obs_attr.R](inst/doc/usecase/ex1_2_exp_obs_attr.R) and [ex1_3_attr_loadin.R](inst/doc/usecase/ex1_3_attr_loadin.R).
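A schematic Start() call wiring these parameters together; the path pattern, variable, and dimension names below are illustrative, not a real dataset:

```r
library(startR)

# 'time' is stored across several 'chunk' files; merge it into one dimension.
data <- Start(dat = '/path/to/exp/$var$_$chunk$.nc',  # hypothetical pattern
              var = 'tas',
              chunk = 'all',
              time = 'all',
              time_across = 'chunk',          # 'time' goes across the 'chunk' files
              merge_across_dims = TRUE,       # attach the pieces one after another
              merge_across_dims_narm = TRUE,  # drop the NAs added while merging
              lat = 'all',
              lon = 'all',
              retrieve = TRUE)
```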
......@@ -463,7 +465,7 @@ data <- Start(dat = repos,
If you want to interpolate data with s2dv::CDORemap inside the function, you need to tell the
machine which CDO module to use. Therefore, `CDO_module = 'CDO/1.9.5-foss-2018b'` should be
added in Compute() cluster list. See the example in usecase [ex2_3_cdo.R](inst/doc/usecase/ex2_3_cdo.R).
added in Compute() cluster list. See the example in usecase [ex2_3](inst/doc/usecase/ex2_3_cdo.R).
### 10. The number of members depends on the start date
......@@ -475,9 +477,9 @@ When trying to load both start dates at once using Start(), the order in which t
- `sdates = c('19991101', '19990901')`, the member dimension will be of length 51, showing missing values for the members 26 to 51 in the second start date;
- `sdates = c('19990901', '19991101')`, the member dimension will be of length 25, so members 26 to 51 of the second start date will be silently dropped.
To ensure that all the members are retrieved, we can use parameter 'largest_dims_length'. See [FAQ 21](https://earth.bsc.es/gitlab/es/startR/-/blob/master/inst/doc/faq.md#21-retrieve-the-complete-data-when-the-dimension-length-varies-among-files) for details.
To ensure that all the members are retrieved, we can use parameter `largest_dims_length`. See [FAQ 21](https://earth.bsc.es/gitlab/es/startR/-/blob/master/inst/doc/faq.md#21-retrieve-the-complete-data-when-the-dimension-length-varies-among-files) for details.
The code to reproduce this behaviour could be found in the Use Cases section, [example 1.4](/inst/doc/usecase/ex1_4_variable_nmember.R).
The code to reproduce this behaviour can be found in the usecase [ex1_4](/inst/doc/usecase/ex1_4_variable_nmember.R).
### 11. Select the longitude/latitude region
......@@ -850,18 +852,18 @@ same. For example, the member number in one experiment is 25 in the early years
increase to 51 later. If you assign `member = 'all'` in Start() call, the returned member
dimension length will be 25 only.
The parameter 'largest_dims_length' is for this case. Its default value is `FALSE`, meaning
The parameter `largest_dims_length` is for this case. Its default value is `FALSE`, meaning
that Start() can only use the first valid file to decide the dimensions. If it is changed to
`TRUE`, Start() will examine all the required files to find the largest length for all the inner
dimensions. It is time- and resource-consuming, but useful when you are not sure what the dimensions
in all the files look like.
If you know the expected dimension length, it is recommended to assign 'largest_dims_length'
If you know the expected dimension length, it is recommended to assign `largest_dims_length`
a named integer vector, for example, `largest_dims_length = c(member = 51)`. Start() will
adopt the provided ones and use the first valid file to decide the rest of the dimensions.
By this means, the efficiency can be similar to `largest_dims_length = FALSE`.
Find example in [use case ex1_4](/inst/doc/usecase/ex1_4_variable_nmember.R).
Find an example in use case [ex1_4](/inst/doc/usecase/ex1_4_variable_nmember.R).
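A minimal sketch of both usages, with a hypothetical path pattern and the start dates from How-to-10:

```r
library(startR)

# Expensive but safe: examine every required file for the largest lengths.
data <- Start(dat = '/path/to/exp/$sdate$/$var$.nc',   # hypothetical pattern
              var = 'tas',
              sdate = c('19990901', '19991101'),
              member = 'all', time = 'all', lat = 'all', lon = 'all',
              largest_dims_length = TRUE,
              retrieve = TRUE)

# Cheap and explicit: declare the known maximum for 'member' only; the other
# dimensions are still taken from the first valid file.
data <- Start(dat = '/path/to/exp/$sdate$/$var$.nc',
              var = 'tas',
              sdate = c('19990901', '19991101'),
              member = 'all', time = 'all', lat = 'all', lon = 'all',
              largest_dims_length = c(member = 51),
              retrieve = TRUE)
```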
### 22. Define the selector when the indices in the files are not aligned
......@@ -973,6 +975,24 @@ the usage of those parameters to avoid unnecessary errors.
We provide some [use cases](inst/doc/usecase/ex2_12_transform_and_chunk.R) showing safe ways of combining transformation and chunking.
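For orientation, the typical shape of such a call; the path, grid, and region values are illustrative, and the tested combinations live in the linked use case:

```r
library(startR)

# Interpolate while loading; spatial chunking is then done in Compute().
data <- Start(dat = '/path/to/exp/$var$.nc',           # hypothetical pattern
              var = 'tas',
              time = 'all',
              lat = values(list(20, 80)),              # region selected by values
              lon = values(list(-20, 40)),
              transform = CDORemapper,                 # startR's CDO-based transform
              transform_params = list(grid = 'r360x181',
                                      method = 'con'), # conservative remapping
              transform_vars = c('lat', 'lon'),
              transform_extra_cells = 8,               # padding, as discussed in the diff
              retrieve = FALSE)

# Later: Compute(wf, chunks = list(lat = 2, lon = 2), ...)
```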
### 25. What to do if your function has too many target dimensions
Unfortunately, we don't have a perfect solution for now, until the multiple-steps feature is developed. Talk to the maintainers to see how to find a workaround for your case.
### 26. Use merge_across_dims_narm to remove NAs
The Start() parameter `merge_across_dims_narm` can be useful when you want to merge two dimensions together (e.g., time across chunk). If you're not familiar with the usage of `xxx_across = yyy` and `merge_across_dims` yet, check [How-to-2](#2-indicate-dependent-dimension-and-use-merge-parameters-in-start) first.
The first thing to notice is that `merge_across_dims_narm` can only remove **the NAs that are created by Start()** during the reshaping process.
It doesn't remove the NAs in the original data. For example, in Example 2 of How-to-2, the NAs are removed because they were added by Start().
Second, if the files don't share the same length of the merged dimension, you need to use `largest_dims_length = T` as well.
This parameter tells Start() to look into each file to find out the dimension lengths. By doing this, Start() knows that the NAs in the files with a shorter dimension were added by itself, so `merge_across_dims_narm = T` can remove those NAs correctly.
A typical example is reading daily data and merging the time dimension. If `merge_across_dims_narm = T` and `largest_dims_length = T` are not used, the 30-day months will have one NA at the end of the time dimension.
Check usecase [ex1_16](/inst/doc/usecase/ex1_16_files_different_time_dim_length.R) for the example script.
See [How-to-21](#21-retrieve-the-complete-data-when-the-dimension-length-varies-among-files) for more details of `largest_dims_length`.
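Putting both parameters together for the daily-data case just described; the path pattern and file dates are illustrative (ex1_16 holds the tested script):

```r
library(startR)

# Monthly files of daily data: a 30-day month is one day shorter than a
# 31-day one. Scanning all files plus removing the reshaping NAs yields a
# clean merged time dimension.
data <- Start(dat = '/path/to/daily/$var$_$file_date$.nc',  # hypothetical pattern
              var = 'tas',
              file_date = c('201311', '201312'),  # a 30-day and a 31-day month
              time = 'all',
              time_across = 'file_date',
              merge_across_dims = TRUE,
              merge_across_dims_narm = TRUE,  # drop the NA padding the 30-day month
              largest_dims_length = TRUE,     # inspect every file for true lengths
              lat = 'all', lon = 'all',
              retrieve = TRUE)
```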
# Something goes wrong...
### 1. No space left on device
......
# Hands-on 3: Load data by startR
## Goal
Use startR to load the data used in the CSTools [RainFARM vignette](https://earth.bsc.es/gitlab/external/cstools/-/blob/master/vignettes/RainFARM_vignette.Rmd). Learn how to adjust data while loading it.