From 5fd107bd8fc483a556811e434b34b6a4fb971be2 Mon Sep 17 00:00:00 2001
From: aho
Date: Tue, 5 Nov 2019 11:20:57 +0100
Subject: [PATCH 1/4] Move FAQs from wiki to inst/doc

---
 inst/doc/faq.md | 277 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 277 insertions(+)
 create mode 100644 inst/doc/faq.md

diff --git a/inst/doc/faq.md b/inst/doc/faq.md
new file mode 100644
index 0000000..7027ed9
--- /dev/null
+++ b/inst/doc/faq.md
@@ -0,0 +1,277 @@
+# Usecase scripts
+
+This document intends to be the first reference for any doubts that you may have regarding startR. If you do not find the information you need, please open an issue for your problem.
+
+
+1. **How to**
+   1. [Choose the number of chunks/jobs/cores in Compute()]()
+   2. [Merge/Reorder dimension in Start() (using parameter 'xxx_across' and 'merge_across_dims')]()
+   3. [Use self-defined function in Compute()]()
+   4. [Use package function in Compute()]()
+
+2. **Something goes wrong...**
+   1. [No space left on device](#no-space-left-on-device)
+   2. [ecFlow UI remains blue and does not update status]()
+   3. [Compute() successfully but then killed on R session]()
+
+
+## 1. How to
+
+### 1. Choose the number of chunks/jobs/cores in Compute()
+Run the Start() call to see the total size of the data you will read in (remember to set `retrieve = FALSE`).
+
+Divide the data into chunks according to the size of the machine's memory module (Power9 is 32GB; MN4 is 8GB). The data size per chunk should be 1/3 to 1/2 of the memory module size.
+
+Find more details in practical_guide.md [How to choose the number of chunks, jobs and cores](inst/doc/practical_guide.md#how-to-choose-the-number-of-chunks-jobs-and-cores)
+
+### 2. Merge/Reorder dimension in Start() (using parameter 'xxx_across' and 'merge_across_dims')
+The parameter `'xxx_across = yyy'` indicates that the inner dimension 'xxx' is continuous along the file dimension 'yyy'. A common example is 'time_across = chunk', when the experiment runs through many years and the result is saved in several chunk files. Find more details in the startR documentation.
+
+If you define this parameter, you can specify 'xxx' with indices that run through all the 'yyy' files, not only within one file. See Example 1 below: 'time = indices(1:24)' is available when 'time_across = chunk' is specified. If not, 'time' can only be 12 at most.
+
+One example taking advantage of 'xxx_across' is extracting a climate event across years, like El Niño. If the event lasts from Nov 2014 to May 2016 (19 months in total), simply specify 'time = indices(11:29)' (Example 2).
+
+The thing you should bear in mind when using this parameter is the returned data structure. First, **the length of the returned xxx dimension is the length of the longest xxx among all files**. Take the El Niño above as an example. The first chunk has 2 months, the second chunk has 12 months, and the third chunk has 5 months. Therefore, the length of the time dimension will be 12, and the length of the chunk dimension will be 3.
+
+Second, Start() stores data by **putting data at the left-most position**. Take the El Niño (Example 2) above as an example again. The first chunk has only 2 months, so positions 1 and 2 have values (Nov and Dec 2014), while positions 3 to 12 will be NA. The second chunk has 12 months, so all positions have values (Jan to Dec 2015). The third chunk has 5 months, so positions 1 to 5 have values (Jan to May 2016), while positions 6 to 12 will be NA.
+
+It seems more reasonable to put NA at positions 1 to 10 in the first chunk (Jan to Oct 2014) and positions 6 to 12 in the third chunk (June to Dec 2016). But if the data is not continuous or is picked irregularly, it is hard to judge the correct NA position (see Example 3).
+
+Since Start() is very flexible and supports many ways of reading in data, it is difficult to cover all the possibilities and make the output data structure reasonable all the time. Therefore, it is recommended to understand how Start() works first, so you know what to expect from the output and will not be confused by what it returns.
+
+As for the parameter 'merge_across_dims', it decides whether to connect all 'xxx' together along 'yyy' or not. See Example 1. If 'merge_across_dims = TRUE', the chunk dimension will disappear. 'merge_across_dims' simply attaches data one after another, so the NA values (if any) will stay in the same places as in the unmerged array (see Example 2).
+
+Example 1
+
+```r
+data <- Start(dat = repos,
+              var = 'tas',
+              time = indices(1:24), # each file has 12 months; read 24 months in total
+              chunk = indices(1:2), #two years, each with 12 months
+              lat = 'all',
+              lon = 'all',
+              time_across = 'chunk',
+              merge_across_dims = FALSE, #TRUE,
+              return_vars = list(lat = NULL, lon = NULL),
+              retrieve = TRUE)
+
+#return dimension (merge_across_dims = FALSE)
+dat var time chunk lat lon
+  1   1   12     2 256 512
+
+#return dimension (merge_across_dims = TRUE)
+dat var time lat lon
+  1   1   24 256 512
+```
+
+Example 2: El Niño event
+
+```r
+repos <- '/esarchive/exp/ecearth/a1tr/cmorfiles/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$memb$/Omon/$var$/gr/v20190312/$var$_Omon_EC-Earth3_historical_$memb$_gr_$chunk$.nc'
+
+data <- Start(dat = repos,
+              var = 'tos',
+              memb = 'r24i1p1f1',
+              time = indices(4:27), # Apr 1957 to Mar 1959
+              chunk = c('195701-195712', '195801-195812', '195901-195912'),
+              lat = 'all',
+              lon = 'all',
+              time_across = 'chunk',
+              merge_across_dims = FALSE,
+              return_vars = list(lat = NULL, lon = NULL),
+              retrieve = TRUE)
+
+> dim(data)
+  dat   var  memb  time chunk   lat   lon
+    1     1     1    12     3   256   512
+
+> data[1,1,1,,,100,100]
+          [,1]     [,2]     [,3]
+ [1,] 300.7398 300.7659 301.7128
+ [2,] 299.6569 301.8241 301.4781
+ [3,] 298.3954 301.6472 301.3807
+ [4,] 297.1931 301.0621       NA
+ [5,] 295.9608 299.1324       NA
+ [6,] 295.4735 297.4028       NA
+ [7,] 295.8538 296.1619       NA
+ [8,] 297.9998 295.2794       NA
+ [9,] 299.4571 295.0474       NA
+[10,]       NA 295.4571       NA
+[11,]       NA 296.8002       NA
+[12,]       NA 299.0254       NA
+
+#To move the NAs in the first year to Jan to Mar
+> asd <- Subset(data, c(5), list(1))
+> qwe <- asd[, , , c(10:12, 1:9), , ,]
+> data[, , , , 1, ,] <- qwe
+
+> data[1, 1, 1, , , 100, 100]
+          [,1]     [,2]     [,3]
+ [1,]       NA 300.7659 301.7128
+ [2,]       NA 301.8241 301.4781
+ [3,]       NA 301.6472 301.3807
+ [4,] 300.7398 301.0621       NA
+ [5,] 299.6569 299.1324       NA
+ [6,] 298.3954 297.4028       NA
+ [7,] 297.1931 296.1619       NA
+ [8,] 295.9608 295.2794       NA
+ [9,] 295.4735 295.0474       NA
+[10,] 295.8538 295.4571       NA
+[11,] 297.9998 296.8002       NA
+[12,] 299.4571 299.0254       NA
+
+```
+
+Example 3: Read in three winters (DJF)
+
+```r
+repos <- '/esarchive/exp/ecearth/a1tr/cmorfiles/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$memb$/Omon/$var$/gr/v20190312/$var$_Omon_EC-Earth3_historical_$memb$_gr_$chunk$.nc'
+
+data <- Start(dat = repos,
+              var = 'tos',
+              memb = 'r24i1p1f1',
+              time = c(12:14, 24:26, 36:38), # DJF, Dec 1999 to Feb 2002
+              chunk = c('199901-199912', '200001-200012', '200101-200112', '200201-200212'),
+              lat = 'all',
+              lon = 'all',
+              time_across = 'chunk',
+              merge_across_dims = TRUE,
+              return_vars = list(lat = NULL, lon = NULL),
+              retrieve = TRUE)
+
+> dim(data)
+ dat  var memb time  lat  lon
+   1    1    1   12  256  512
+
+> data[1, 1, 1, , 100, 100]
+ [1] 300.0381       NA       NA 301.3340 302.0320 300.3575 301.0930 301.4149
+ [9] 299.3486 300.7203 301.6695       NA
+
+
+#Remove NAs and rearrange DJF
+> qwe <- Subset(data, c(4), list(c(1, 4:11)))
+> zxc <- InsertDim(InsertDim(qwe, 5, 3), 6, 3)
+> zxc <- Subset(zxc, 'time', list(1), drop = 'selected')
+> zxc[, , , 1:3, 1, ,] <- qwe[, , , 1:3, ,]
+> zxc[, , , 1:3, 2, ,] <- qwe[, , , 4:6, ,]
+> zxc[, , , 1:3, 3, ,] <- qwe[, , , 7:9, ,]
+> names(dim(zxc))[4] <- c('month')
+> names(dim(zxc))[5] <- c('year')
+
+> dim(zxc)
+  dat   var  memb month  year   lat   lon
+    1     1     1     3     3   256   512
+
+> zxc[1, 1, 1, , , 100, 100]
+         [,1]     [,2]     [,3]
+[1,] 300.0381 300.3575 299.3486
+[2,] 301.3340 301.0930 300.7203
+[3,] 302.0320 301.4149 301.6695
+
+```
+
+## 3. Use self-defined function in Compute()
+
+The workflow to use Compute() is: 'define the function' -> 'use Step() to assign the target/output dimension' -> 'use AddStep() to build up workflow' -> 'use Compute() to launch jobs on either local workstation or fatnodes/Power9'.
+
+There is no problem when you only have a simple function defined directly in your script (like the example in the [practical guide](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#step-and-addstep)). However, if the function is more complicated, you may want to save it as an independent file. In this case, the machines (Power 9 or fatnodes) cannot find your function, so the jobs will fail. (If you use Compute() on your own local workstation, the problem does not exist.)
+
+The solution is simple. First, put your function file somewhere on the machine. For example, on Power 9, put own_func.R at `/esarchive/scratch/`. Second, in the script, source the file inside the function definition (see the example below). This way, the machine can find your function.
+
+```r
+data <- Start(...,
+              retrieve = FALSE)
+
+func <- function(x) {
+  source("/esarchive/scratch/aho/own_func.R") #the path in Power 9
+  y <- own_func(x, posdim = 'time')
+  return(y)
+}
+
+step <- Step(fun = func,
+             target_dims = c('time'),
+             output_dims = c('time'))
+
+wf <- AddStep(data, step)
+
+res <- Compute(wf, ...)
+
+```
+
+## 4. Use package function in Compute()
+
+In the workflow for Compute(), first step is to define the function. If you want to use the function in certain R package, you need to check if the package is involved in the R module (`r_module`) or library (`lib_dir`). Then, specify the package name before the function name (see example below) so the machine can recognize which function you refer to.
+
+```r
+data <- Start(...,
+              retrieve = FALSE)
+
+func <- function(x) {
+  y <- s2dverification::Season(x, posdim = 'time') #specify package name
+  return(y)
+}
+
+step <- Step(fun = func,
+             target_dims = c('time'),
+             output_dims = c('time'))
+
+wf <- AddStep(data, step)
+
+  res <- Compute(wf,
+                 chunks = list(latitude = 2,
+                               longitude = 2),
+                 threads_load = 2,
+                 threads_compute = 4,
+                 cluster = list(queue_host = 'p1', #your alias for power9
+                                queue_type = 'slurm',
+                                temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
+                                lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/', #s2dverification is involved here, so the machine can find Season()
+                                r_module = 'startR/0.1.2-foss-2018b-R-3.5.0',
+                                job_wallclock = '00:10:00',
+                                cores_per_job = 4,
+                                max_jobs = 4,
+                                bidirectional = FALSE,
+                                polling_period = 50
+                                ),
+                 ecflow_suite_dir = '/home/Earth/aho/startR_local/',
+                 wait = TRUE
+                 )
+
+```
+
+## Something goes wrong...
+
+### 1. No space left on device
+
+A known issue with R is the accumulation of trash files, which fill up the machine memory and crash R. If the size of the data your R script deals with is reasonable but R crashes immediately after running and returns the error:
+>
+> No space left on device
+>
+Go to **/dev/shm/** and `rm `
+
+Find more discussion in this [issue](https://earth.bsc.es/gitlab/es/s2dverification/issues/221)
+
+### 2. ecFlow UI remains blue and does not update status
+
+This situation will occur if:
+1. The Compute() parameter `wait` is set to `FALSE`, and
+2. The jobs are launched on an HPC where the connection with its login node is unidirectional (e.g., Power 9).
+
+Under this condition, the ecFlow UI will remain blue and will not update the status.
+To solve this problem, run the following three lines in an R terminal after running Compute():
+
+```r
+  res <- Compute(wf,
+                 ...,
+                 wait = FALSE)
+
+  saveRDS(res, file = 'test_data.Rds')
+  res <- readRDS('test_data.Rds')
+  result <- Collect(res, wait = TRUE) #it will update ecflow_ui status
+```
+
+The last line will block the terminal, but meanwhile it updates the status just as you would see with `wait = TRUE`.
+
+### 3. Compute() successfully but then killed on R session
+
+When running Compute() on HPCs, the machines are able to process data which are much larger than what the local workstation can handle, so the computation works fine (i.e., on the ecFlow UI, the chunks show yellow in the end). However, after the computation, the output will be sent back to the local workstation. **If the returned data is larger than the available local memory space, your R session will be killed.** Therefore, always check beforehand whether the returned data will fit in your workstation's free memory. If not, subset the input data or reduce the output size through more computation.
-- 
GitLab

From f2d304b9275298b151f26839f4a2cad5c6f16743 Mon Sep 17 00:00:00 2001
From: aho
Date: Tue, 5 Nov 2019 11:28:16 +0100
Subject: [PATCH 2/4] Update faq.md

---
 inst/doc/faq.md | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/inst/doc/faq.md b/inst/doc/faq.md
index 7027ed9..84dc7e1 100644
--- a/inst/doc/faq.md
+++ b/inst/doc/faq.md
@@ -2,17 +2,18 @@
 
 This document intends to be the first reference for any doubts that you may have regarding startR. If you do not find the information you need, please open an issue for your problem.
 
-
+## Index
 1. **How to**
-   1. [Choose the number of chunks/jobs/cores in Compute()]()
-   2. [Merge/Reorder dimension in Start() (using parameter 'xxx_across' and 'merge_across_dims')]()
-   3. [Use self-defined function in Compute()]()
-   4. [Use package function in Compute()]()
+   1. [Choose the number of chunks/jobs/cores in Compute()](#1-choose-the-number-of-chunksjobscores-in-compute)
+   2. [Merge/Reorder dimension in Start() (using parameter 'xxx_across' and 'merge_across_dims')](#2-mergereorder-dimension-in-start-using-parameter-xxx_across-and-merge_across_dims)
+   3. [Use self-defined function in Compute()](#3-use-self-defined-function-in-compute)
+   4. [Use package function in Compute()](#4-use-package-function-in-compute)
 
+
 2. **Something goes wrong...**
-   1. [No space left on device](#no-space-left-on-device)
-   2. [ecFlow UI remains blue and does not update status]()
-   3. [Compute() successfully but then killed on R session]()
+   1. [No space left on device](#1-no-space-left-on-device)
+   2. [ecFlow UI remains blue and does not update status](#2-ecflow-ui-remains-blue-and-does-not-update-status)
+   3. [Compute() successfully but then killed on R session](#3-compute-successfully-but-then-killed-on-r-session)
 
 
 ## 1. How to
-- 
GitLab

From e57edfe91a9abdfe8e9cf1d79d8f7a5f10f5c533 Mon Sep 17 00:00:00 2001
From: aho
Date: Tue, 5 Nov 2019 11:28:35 +0100
Subject: [PATCH 3/4] Update faq.md

---
 inst/doc/faq.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/inst/doc/faq.md b/inst/doc/faq.md
index 84dc7e1..231dca5 100644
--- a/inst/doc/faq.md
+++ b/inst/doc/faq.md
@@ -1,4 +1,4 @@
-# Usecase scripts
+# FAQs
 
 This document intends to be the first reference for any doubts that you may have regarding startR. If you do not find the information you need, please open an issue for your problem.
 
-- 
GitLab

From a6bb1304a2888493665284bf7f841f6a61b57676 Mon Sep 17 00:00:00 2001
From: aho
Date: Tue, 5 Nov 2019 11:29:34 +0100
Subject: [PATCH 4/4] Update faq.md

---
 inst/doc/faq.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/inst/doc/faq.md b/inst/doc/faq.md
index 231dca5..11c8c82 100644
--- a/inst/doc/faq.md
+++ b/inst/doc/faq.md
@@ -171,7 +171,7 @@ data <- Start(dat = repos,
 
 ```
 
-## 3. Use self-defined function in Compute()
+### 3. Use self-defined function in Compute()
 
 The workflow to use Compute() is: 'define the function' -> 'use Step() to assign the target/output dimension' -> 'use AddStep() to build up workflow' -> 'use Compute() to launch jobs on either local workstation or fatnodes/Power9'.
 
@@ -199,7 +199,7 @@ res <- Compute(wf, ...)
 
 ```
 
-## 4. Use package function in Compute()
+### 4. Use package function in Compute()
 
 In the workflow for Compute(), first step is to define the function. If you want to use the function in certain R package, you need to check if the package is involved in the R module (`r_module`) or library (`lib_dir`). Then, specify the package name before the function name (see example below) so the machine can recognize which function you refer to.
 
-- 
GitLab