diff --git a/inst/doc/data_check.md b/inst/doc/data_check.md
new file mode 100644
index 0000000000000000000000000000000000000000..9d42c565355482e45dc12cb29f0b37ebbb8d9ec7
--- /dev/null
+++ b/inst/doc/data_check.md
@@ -0,0 +1,387 @@
# Retrieved Data Check
Since startR is still under active development and not yet fully stable,
it is important to check manually that the retrieved data array is what you expect
before going further with the analysis, especially when several parameters are used.

Here we list some tips worth paying attention to, and some tools for data comparison.

## Index
- Tips
  (a) [Small dataset](#a-small-dataset)
  (b) [Dimensions](#b-dimensions)
  (c) [Attributes](#c-attributes)
  (d) [Summary](#d-summary)
  (e) [Individual values](#e-individual-values)

- Data retrieval tools
  (1) [s2dv::Load](#1-s2dvload)
  (2) [ncdf4](#2-ncdf4)
  (3) [easyNCDF](#3-easyncdf)
  (4) [ncview](#4-ncview)

- Extra examination
  (5) [Compute()](#5-compute)


## Tips
Here is the general idea. See the [Tools](#data-retrieval-tools) section for example code.

### (a) Small dataset
It is recommended to use a small dataset as the testing sample. For example, reduce the
'ensemble' dimension, since it tends to cause the least trouble, and shorten the 'time'
dimension to a length that is lightweight but still representative.
Taking monthly files of daily data as an example, it is better to include both 30-day and 31-day months.
As for latitude and longitude, keep both negative and positive values,
and use the same method (i.e., values(), indices(), or 'all') to define the selectors.

### (b) Dimensions
Dimensions are the most basic check that the data structure is aligned with
your expectation. You can check them with or without retrieving the data
(a minimal sketch follows at the end of this section).

### (c) Attributes
Check longitude, latitude, time, etc. Even if the dimension lengths are correct,
the order may be reversed or the values out of range. You can check them with or
without retrieving the data. To learn how to get the proper attributes
from Start(), see [FAQ How-to-#16](faq.md#16-use-parameter-return_vars-in-start).

### (d) Summary
After checking the data structure, you can look into the values to see if the data
are retrieved correctly and filled in the right place. First of all, a summary
(mean, min, max, etc.) gives you a general view of the values. If the summaries are
consistent, you know Start() found the correct files and data positions.

### (e) Individual values
There is a chance that the dimensions are mixed up, or that a few data points are
incorrect due to interpolation or other reasons. These mistakes may not show up in
the overall summary, so checking a few individual data points is also recommended.
It is better to check not only the first element of each dimension but also some
points in the middle of the array.
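For tips (b) and (c), the structural checks do not require retrieving the data at all.
Below is a minimal sketch, reusing the same era5 monthly-mean sfcWind file as the tool
examples in the next section (the `hdr` object name is just illustrative): with
retrieve = FALSE, Start() reads only metadata, so dimensions and attributes can be
inspected before any heavy retrieval.

```r
library(startR)
obs_path <- '/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h/$var$_$sdate$.nc'

# retrieve = FALSE returns a header with metadata only; no values are read
hdr <- Start(dat = obs_path,
             var = 'sfcWind',
             sdate = '201811',
             time = 'all',
             latitude = indices(1:10),
             longitude = indices(1:10),
             synonims = list(longitude = c('lon', 'longitude'),
                             latitude = c('lat', 'latitude')),
             return_vars = list(time = 'sdate',
                                latitude = 'dat',
                                longitude = 'dat'),
             retrieve = FALSE)

# (b) dimensions: the expected length of each dimension
attr(hdr, 'Dimensions')
# (c) attributes: coordinate values, to check order and range
attr(hdr, 'Variables')$dat1$latitude
attr(hdr, 'Variables')$dat1$longitude
```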
## Data retrieval tools
### (1) s2dv::Load
If you have used Load() before, it is convenient to compare with its output directly.
This is especially useful when transformation is applied, since the other methods listed
here can only show the original data. It also helps check the retrieved data structure,
not only the values.

```r
# startR
library(startR)

obs_path <- '/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h/$var$_$sdate$.nc'
var_name <- 'sfcWind'
sdate <- '201811'
lons.min <- 10
lons.max <- 20
lats.min <- 0
lats.max <- 10

obs1 <- Start(dat = obs_path,
              var = var_name,
              sdate = sdate,
              time = 'all',
              latitude = values(list(lats.min, lats.max)),
              latitude_reorder = Sort(decreasing = TRUE),
              longitude = values(list(lons.min, lons.max)),
              longitude_reorder = CircularSort(0, 360),
              synonims = list(longitude = c('lon', 'longitude'),
                              latitude = c('lat', 'latitude')),
              transform = CDORemapper,
              transform_extra_cells = 2,
              transform_params = list(grid = 'r360x181',
                                      method = 'conservative',
                                      crop = c(lons.min, lons.max, lats.min, lats.max)),
              transform_vars = c('latitude', 'longitude'),
              return_vars = list(time = NULL,
                                 latitude = 'dat',
                                 longitude = 'dat'),
              retrieve = TRUE)

# Load()
library(s2dv)
pobs <- paste0('/esarchive/recon/ecmwf/era5/monthly_mean/',
               '$VAR_NAME$_f1h/$VAR_NAME$_$YEAR$$MONTH$.nc')

obs2 <- Load(var = var_name,
             obs = list(list(path = pobs)),
             sdates = '20181101',
             leadtimemin = 1,
             leadtimemax = 1,
             output = 'lonlat',
             grid = 'r360x181',
             method = 'conservative',
             storefreq = 'monthly',
             latmin = lats.min,
             latmax = lats.max,
             lonmin = lons.min,
             lonmax = lons.max)

# Check
## (b) dimensions
dim(obs1)
# dat var sdate time latitude longitude
#   1   1     1    1       11        11
dim(obs2$obs)
# dataset member sdate ftime lat lon
#       1      1     1     1  11  11

## (c) attributes
attr(obs1, 'Variables')$dat1$longitude
# [1] 10 11 12 13 14 15 16 17 18 19 20
as.vector(obs2$lon)
# [1] 10 11 12 13 14 15 16 17 18 19 20
attr(obs1, 'Variables')$dat1$latitude
# [1] 10 9 8 7 6 5 4 3 2 1 0
as.vector(obs2$lat)
# [1] 10 9 8 7 6 5 4 3 2 1 0

## (d) summary
diff <- drop(obs1) - drop(obs2$obs)
range(diff)
#[1] -4.965553e-05 4.929314e-05
summary(obs1)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.090 1.313 1.479 1.636 1.936 2.803
summary(obs2$obs)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.090 1.313 1.479 1.636 1.936 2.803

## (e) individual values
obs1[1, 1, 1, 1, 1:3, 1:2]
#          [,1]     [,2]
#[1,] 2.345519 2.365326
#[2,] 2.219121 1.958319
#[3,] 1.840537 1.386617
obs2$obs[1, 1, 1, 1, 1:3, 1:2]
#       [,1]   [,2]
#[1,] 2.3455 2.3653
#[2,] 2.2191 1.9583
#[3,] 1.8405 1.3866
```
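The time values deserve the same cross-check as longitude and latitude. A minimal
sketch, following the objects created above: with `return_vars = list(time = NULL, ...)`,
Start() stores the time values under the `$common` component, while Load() returns the
dates under `$Dates`.

```r
## (c) attributes: time
attr(obs1, 'Variables')$common$time
obs2$Dates$start
```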
### (2) ncdf4
ncdf4 shows the original data values in the netCDF files. You can get the same domain
as long as no interpolation is applied in Start(), though the data structure will be
different. Retrieving the whole domain through ncdf4 is resource-consuming; if you need
to do so, it is better to use a fatnodes/Power9/Nord3 interactive session for more memory.
Check the [ncdf4 documentation](https://cran.r-project.org/web/packages/ncdf4/index.html) to learn how to use it.

```r
# startR
library(startR)
obs_path <- '/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h/$var$_$sdate$.nc'
var_name <- 'sfcWind'
sdate <- '201811'

obs1 <- Start(dat = obs_path,
              var = var_name,
              sdate = sdate,
              time = 'all',
              latitude = indices(1:10),
              longitude = indices(1:10),
              synonims = list(longitude = c('lon', 'longitude'),
                              latitude = c('lat', 'latitude')),
              return_vars = list(time = 'sdate',
                                 latitude = 'dat',
                                 longitude = 'dat'),
              retrieve = TRUE)

# ncdf4
library(ncdf4)
file <- nc_open('/esarchive/recon/ecmwf/era5/monthly_mean/sfcWind_f1h/sfcWind_201811.nc')
obs2 <- ncvar_get(file, 'sfcWind', start = c(1, 1, 1), count = c(10, 10, 1))
nc_close(file)

# Check
## (b) dimensions
dim(obs1)
# dat var sdate time latitude longitude
#   1   1     1    1       10        10
dim(obs2)
#[1] 10 10

## (d) summary
summary(obs1)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.934 4.954 5.025 5.035 5.105 5.165
summary(as.vector(obs2))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.934 4.954 5.025 5.035 5.105 5.165

## (e) individual values
obs1[1, 1, 1, 1, , 2]
# [1] 4.960041 4.959622 4.956322 4.939547 4.991117 5.075977 5.143179 5.157578
# [9] 5.095344 5.051348
obs2[2, ]
# [1] 4.960041 4.959622 4.956322 4.939547 4.991117 5.075977 5.143179 5.157578
# [9] 5.095344 5.051348
```

### (3) easyNCDF
easyNCDF is a wrapper package around ncdf4. The idea is similar, but it provides some
friendlier features.
Check the [easyNCDF documentation](https://cran.r-project.org/web/packages/easyNCDF/index.html) to learn how to use it.


```r
# startR
# Use the same one as (2).

# easyNCDF
library(easyNCDF)
file <- NcOpen('/esarchive/recon/ecmwf/era5/monthly_mean/sfcWind_f1h/sfcWind_201811.nc')

obs2 <- NcToArray(file, vars_to_read = 'sfcWind',
                  dim_indices = list(longitude = 1:10,
                                     latitude = 1:10,
                                     time = 1))
NcClose(file)

# Check
## (b) dimensions
dim(obs1)
# dat var sdate time latitude longitude
#   1   1     1    1       10        10
dim(obs2)
# var longitude latitude time
#   1        10       10    1

## (d) summary
summary(obs1)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.934 4.954 5.025 5.035 5.105 5.165
summary(as.vector(obs2))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.934 4.954 5.025 5.035 5.105 5.165

## (e) individual values
obs1[1, 1, 1, 1, , 2]
# [1] 4.960041 4.959622 4.956322 4.939547 4.991117 5.075977 5.143179 5.157578
# [9] 5.095344 5.051348
obs2[1, 2, , 1]
# [1] 4.960041 4.959622 4.956322 4.939547 4.991117 5.075977 5.143179 5.157578
# [9] 5.095344 5.051348
```

### (4) ncview
Using ncview to check a netCDF file is the easiest way, although it may be slow.
It is convenient if you want to check the data on a map.

In the terminal, type `ncview <file_name>`, then check the values dynamically in the pop-up window.
You can also plot the map in R to compare with ncview. The comparison is clearer with a
bigger domain than the example here, which has only 10*10 points.

```r
s2dv::PlotEquiMap(obs1[1, 1, 1, 1, , ],
                  lon = attr(obs1, 'Variables')$dat1$longitude,
                  lat = attr(obs1, 'Variables')$dat1$latitude)
```


## Extra examination
### (5) Compute()
Some problems hide in Compute() rather than in Start(). If your final goal is to use
Compute() for the analysis, you can run some simple tests to ensure the data remain
the same after going through the workflow.
You can compare the result of Compute() with what you get from Start(retrieve = TRUE), or with any tool mentioned above.
```r
library(startR)
obs_path <- '/esarchive/recon/ecmwf/era5/monthly_mean/$var$_f1h/$var$_$sdate$.nc'
var_name <- 'sfcWind'
sdate <- '201811'
lons.min <- 10
lons.max <- 20
lats.min <- 0
lats.max <- 10

# startR (retrieve = TRUE)
obs1 <- Start(dat = obs_path,
              var = var_name,
              sdate = sdate,
              time = 'all',
              latitude = values(list(lats.min, lats.max)),
              latitude_reorder = Sort(decreasing = TRUE),
              longitude = values(list(lons.min, lons.max)),
              longitude_reorder = CircularSort(0, 360),
              synonims = list(longitude = c('lon', 'longitude'),
                              latitude = c('lat', 'latitude')),
              transform = CDORemapper,
              transform_extra_cells = 2,
              transform_params = list(grid = 'r360x181',
                                      method = 'conservative',
                                      crop = c(lons.min, lons.max, lats.min, lats.max)),
              transform_vars = c('latitude', 'longitude'),
              return_vars = list(time = NULL,
                                 latitude = 'dat',
                                 longitude = 'dat'),
              retrieve = TRUE)


# startR (retrieve = FALSE, use the whole workflow)
obs2 <- Start(dat = obs_path,
              var = var_name,
              sdate = sdate,
              time = 'all',
              latitude = values(list(lats.min, lats.max)),
              latitude_reorder = Sort(decreasing = TRUE),
              longitude = values(list(lons.min, lons.max)),
              longitude_reorder = CircularSort(0, 360),
              synonims = list(longitude = c('lon', 'longitude'),
                              latitude = c('lat', 'latitude')),
              transform = CDORemapper,
              transform_extra_cells = 2,
              transform_params = list(grid = 'r360x181',
                                      method = 'conservative',
                                      crop = c(lons.min, lons.max, lats.min, lats.max)),
              transform_vars = c('latitude', 'longitude'),
              return_vars = list(time = NULL,
                                 latitude = 'dat',
                                 longitude = 'dat'),
              retrieve = FALSE)

# An identity function: the workflow should return the data unchanged
func <- function(x) {
  return(x)
}

step <- Step(func,
             target_dims = 'latitude',
             output_dims = 'latitude')

wf <- AddStep(obs2, step)

res <- Compute(wf, chunks = list(time = 1))$output1

# Check
## (b) dimensions
dim(obs1)
# dat var sdate time latitude longitude
#   1   1     1    1       11        11
dim(res)
# latitude dat var sdate time longitude
#       11   1   1     1    1        11

## (d) summary
diff <- drop(obs1) - drop(res)
range(diff)
#[1] 0 0
summary(obs1)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.090 1.313 1.479 1.636 1.936 2.803
summary(res)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.090 1.313 1.479 1.636 1.936 2.803

## (e) individual values
obs1[1, 1, 1, 1, 1:3, 1:2]
#          [,1]     [,2]
#[1,] 2.345519 2.365326
#[2,] 2.219121 1.958319
#[3,] 1.840537 1.386617
res[1:3, 1, 1, 1, 1, 1:2]
#          [,1]     [,2]
#[1,] 2.345519 2.365326
#[2,] 2.219121 1.958319
#[3,] 1.840537 1.386617
```
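Beyond spot-checking a few points, you can let R compare every element at once. Below
is a minimal sketch using only base R and the objects above: aperm() puts the Compute()
output back into obs1's dimension order (the permutation moves 'latitude' from first
back to fifth position), and then the structure and values are compared in one go.

```r
# Put res back into obs1's dimension order:
# (latitude, dat, var, sdate, time, longitude) -> (dat, var, sdate, time, latitude, longitude)
perm <- c(2, 3, 4, 5, 1, 6)
res_aligned <- aperm(res, perm)
names(dim(res_aligned)) <- names(dim(res))[perm]  # keep the dimension names

# Compare the structure and every value at once, not just a few hand-picked points
identical(names(dim(res_aligned)), names(dim(obs1)))  # expected: TRUE
all.equal(as.vector(obs1), as.vector(res_aligned))    # expected: TRUE
```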