# FAQs

This document intends to be the first reference for any doubts that you may have regarding startR. If you do not find the information you need, please open an issue for your problem.

## Index
1. **How to**
   1. [Choose the number of chunks/jobs/cores in Compute()](#1-choose-the-number-of-chunksjobscores-in-compute)  
   2. [Indicate dependent dimension and use merge parameters in Start()](#2-indicate-dependent-dimension-and-use-merge-parameters-in-start)
   3. [Use self-defined function in Compute()](#3-use-self-defined-function-in-compute)
   4. [Use package function in Compute()](#4-use-package-function-in-compute)
   5. [Do interpolation in Start() (using parameter 'transform')](#5-do-interpolation-in-start-using-parameter-transform)
   6. [Get data attributes without retrieving data to workstation](#6-get-data-attributes-without-retrieving-data-to-workstation)
   7. [Avoid or specify a node from cluster in Compute()](#7-avoid-or-specify-a-node-from-cluster-in-compute)
   8. [Define a path with multiple dependencies](#8-define-a-path-with-multiple-dependencies)
   9. [Use CDORemap() in function](#9-use-cdoremap-in-function)
   10. [The number of members depends on the start date](#10-the-number-of-members-depends-on-the-start-date)
   11. [Select the longitude/latitude region](#11-select-the-longitudelatitude-region)
   12. [What will happen if reorder function is not used](#12-what-will-happen-if-reorder-function-is-not-used)
   13. [Load specific grid points data](#13-load-specific-grid-points-data)
   14. [Find the error log when jobs are launched on Power9](#14-find-the-error-log-when-jobs-are-launched-on-power9)
   15. [Specify extra function arguments in the workflow](#15-specify-extra-function-arguments-in-the-workflow)
   16. [Use parameter 'return_vars' in Start()](#16-use-parameter-return_vars-in-start)
   17. [Use parameter 'split_multiselected_dims' in Start()](#17-use-parameter-split_multiselected_dims-in-start)

2. **Something goes wrong...**
   1. [No space left on device](#1-no-space-left-on-device)
   2. [ecFlow UI remains blue and does not update status](#2-ecflow-ui-remains-blue-and-does-not-update-status)  
   3. [Compute() successfully but then killed on R session](#3-compute-successfully-but-then-killed-on-r-session)
   4. [My jobs work well in workstation and fatnodes but not on Power9 (or vice versa)](#4-my-jobs-work-well-in-workstation-and-fatnodes-but-not-on-power9-or-vice-versa)
   5. [Errors related to wrong file formatting](#5-errors-related-to-wrong-file-formatting) 
   6. [Errors using a new cluster (setting Nord3)](#6-errors-using-a-new-cluster-setting-nord3)

## 1. How to

### 1. Choose the number of chunks/jobs/cores in Compute()
Run a Start() call to see the total size of the data you are reading in (remember to set `retrieve = FALSE`).  

Divide the data into chunks according to the size of the machine's memory module (32 GB on Power9; 8 GB on MN4). The data size per chunk should be 1/3 to 1/2 of the total memory module.  

Find more details in practical_guide.md [How to choose the number of chunks, jobs and cores](inst/doc/practical_guide.md#how-to-choose-the-number-of-chunks-jobs-and-cores)
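
As described above, Start() reports the size of the requested data when `retrieve = FALSE`, and you can also estimate it manually from the dimensions attribute of the returned object. A minimal sketch, assuming a hypothetical path pattern `repos` and generic dimension names:

```r
header <- Start(dat = repos,   # hypothetical path pattern
                var = 'tas',
                sdate = 'all',
                time = 'all',
                lat = 'all',
                lon = 'all',
                retrieve = FALSE)

dims <- attr(header, 'Dimensions')
# Each value is stored in memory as an 8-byte double, so a rough estimate of the total size in GB is:
prod(dims) * 8 / 1024^3
```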

### 2. Indicate dependent dimension and use merge parameters in Start()
The parameter `'xxx_across = yyy'` indicates that the inner dimension 'xxx' is continuous along the file dimension 'yyy'. 
A common example is 'time_across = chunk', when the experiment runs through many years 
and the result is saved in several chunk files. 
If you indicate this dependency, you can specify 'xxx' with indices running throughout 
all the 'yyy' files, instead of only within one file. See Example 1 below: 
'time = indices(1:24)' is available when 'time_across = chunk' is specified; if not, 'time' can only be 12 at most.
One example taking advantage of 'xxx_across' is extracting a climate event across years, like El Niño. 
If the event lasts from Nov 2014 to May 2016 (19 months in total), simply specify 'time = indices(11:29)' (Example 2).

The thing you should bear in mind when using this parameter is the returned data structure. 
First, **the length of the returned xxx dimension is the length of the longest xxx among all files**. 
Take the El Niño case above as an example. The first chunk has 2 months, the second chunk has 12 months, 
and the third chunk has 5 months. Therefore, the length of the time dimension will be 12, and the length of the chunk dimension will be 3. 
Second, the way Start() stores data is to **put the data at the left-most positions**. 
Take the El Niño case (Example 2) as an example again. The first chunk has only 2 months, 
so positions 1 and 2 have values (Nov and Dec 2014) while positions 3 to 12 are NA. 
The second chunk has 12 months, so all positions have values (Jan to Dec 2015). 
The third chunk has 5 months, so positions 1 to 5 have values (Jan to May 2016) while positions 6 to 12 are NA.
It may seem more reasonable to put the NAs at positions 1 to 10 in the first chunk (Jan to Oct 2014) 
and at positions 6 to 12 in the third chunk (Jun to Dec 2016). But if the data are not continuous or are picked irregularly, 
it is hard to judge the correct NA positions (see Example 3).
Since Start() is very flexible regarding the possible ways to read in data, it is difficult to cover 
all the possibilities and make the output data structure reasonable all the time. 
Therefore, it is recommended to understand the way Start() works first, 
so you know what to expect from the output and will not get confused by what it returns.

If you want to connect xxx along yyy, the parameters 'merge_across_dims' and 'merge_across_dims_narm' can help you achieve it.
See Example 1: if 'merge_across_dims = TRUE', the chunk dimension disappears. 
'merge_across_dims' simply attaches the data one chunk after another, so the NA values (if any) stay in the same places as in the unmerged result (see Example 2).

If you want to remove those additional NAs, you can use 'merge_across_dims_narm = TRUE'; 
the NAs will then be removed when merging into one dimension (see Example 2).
You can find more use cases at [ex1_2_exp_obs_attr.R](inst/doc/usecase/ex1_2_exp_obs_attr.R) and [ex1_3_attr_loadin.R](inst/doc/usecase/ex1_3_attr_loadin.R).

Example 1

```r
data <- Start(dat = repos,
              var = 'tas',
              time = indices(1:24), # each file has 12 months; read 24 months in total
              chunk = indices(1:2),  #two years, each with 12 months
              lat = 'all',
              lon = 'all',
              time_across = 'chunk',
              merge_across_dims = FALSE,  # or TRUE; compare the returned dimensions below
              return_vars = list(lat = NULL, lon = NULL),
              retrieve = TRUE)
 
#return dimension (merge_across_dims = FALSE)
dat  var time chunk  lat  lon
   1    1   12    2  256  512 
 
#return dimension (merge_across_dims = TRUE)
dat  var time  lat  lon
   1    1   24  256  512 
```

Example 2: El Niño event

```r
repos <- '/esarchive/exp/ecearth/a1tr/cmorfiles/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$memb$/Omon/$var$/gr/v20190312/$var$_Omon_EC-Earth3_historical_$memb$_gr_$chunk$.nc'

data <- Start(dat = repos,
              var = 'tos',
              memb = 'r24i1p1f1',
              time = indices(4:27), # Apr 1957 to Mar 1959
              chunk = c('195701-195712', '195801-195812', '195901-195912'),  
              lat = 'all',
              lon = 'all',
              time_across = 'chunk',
              merge_across_dims = FALSE,
              return_vars = list(lat = NULL, lon = NULL),
              retrieve = TRUE)

> dim(data)
  dat   var  memb  time chunk   lat   lon 
    1     1     1    12     3   256   512 

> data[1,1,1,,,100,100]
          [,1]     [,2]     [,3]
 [1,] 300.7398 300.7659 301.7128
 [2,] 299.6569 301.8241 301.4781
 [3,] 298.3954 301.6472 301.3807
 [4,] 297.1931 301.0621       NA
 [5,] 295.9608 299.1324       NA
 [6,] 295.4735 297.4028       NA
 [7,] 295.8538 296.1619       NA
 [8,] 297.9998 295.2794       NA
 [9,] 299.4571 295.0474       NA
[10,]       NA 295.4571       NA
[11,]       NA 296.8002       NA
[12,]       NA 299.0254       NA

#To move the NAs in the first year to Jan to Mar
> asd <- Subset(data, c(5), list(1))
> qwe <- asd[, , , c(10:12, 1:9), , ,]
> data[, , , , 1, ,] <- qwe

> data[1, 1, 1, , , 100, 100]
          [,1]     [,2]     [,3]
 [1,]       NA 300.7659 301.7128
 [2,]       NA 301.8241 301.4781
 [3,]       NA 301.6472 301.3807
 [4,] 300.7398 301.0621       NA
 [5,] 299.6569 299.1324       NA
 [6,] 298.3954 297.4028       NA
 [7,] 297.1931 296.1619       NA
 [8,] 295.9608 295.2794       NA
 [9,] 295.4735 295.0474       NA
[10,] 295.8538 295.4571       NA
[11,] 297.9998 296.8002       NA
[12,] 299.4571 299.0254       NA

# use merge parameters 

data <- Start(dat = repos,
              var = 'tos',
              memb = 'r24i1p1f1',
              time = indices(4:27), # Apr 1957 to Mar 1959
              chunk = c('195701-195712', '195801-195812', '195901-195912'),  
              lat = 'all',
              lon = 'all',
              time_across = 'chunk',
              merge_across_dims = TRUE,
              merge_across_dims_narm = TRUE,
              return_vars = list(lat = NULL, lon = NULL),
              retrieve = TRUE)

data[1,1,1,,100,100]
 [1] 300.7398 299.6569 298.3954 297.1931 295.9608 295.4735 295.8538 297.9998
 [9] 299.4571 300.7659 301.8241 301.6472 301.0621 299.1324 297.4028 296.1619
[17] 295.2794 295.0474 295.4571 296.8002 299.0254 301.7128 301.4781 301.3807

```

Example 3: Read in three winters (DJF)

```r
repos <- '/esarchive/exp/ecearth/a1tr/cmorfiles/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$memb$/Omon/$var$/gr/v20190312/$var$_Omon_EC-Earth3_historical_$memb$_gr_$chunk$.nc'

data <- Start(dat = repos,
              var = 'tos',
              memb = 'r24i1p1f1',
              time = c(12:14, 24:26, 36:38), # DJF, Dec 1999 to Feb 2002
              chunk = c('199901-199912', '200001-200012', '200101-200112', '200201-200212'),  
              lat = 'all',
              lon = 'all',
              time_across = 'chunk',
              merge_across_dims = TRUE,
              return_vars = list(lat = NULL, lon = NULL),
              retrieve = TRUE)

> dim(data)
 dat  var memb time  lat  lon 
   1    1    1   12  256  512 

> data[1, 1, 1, , 100, 100]
 [1] 300.0381       NA       NA 301.3340 302.0320 300.3575 301.0930 301.4149
 [9] 299.3486 300.7203 301.6695       NA


#Remove NAs and rearrange DJF
> qwe <- Subset(data, c(4), list(c(1, 4:11)))
> zxc <- InsertDim(InsertDim(qwe, 5, 3), 6, 3)
> zxc <- Subset(zxc, 'time', list(1), drop = 'selected')
> zxc[, , , 1:3, 1, ,] <- qwe[, , , 1:3, ,]
> zxc[, , , 1:3, 2, ,] <- qwe[, , , 4:6, ,]
> zxc[, , , 1:3, 3, ,] <- qwe[, , , 7:9, ,]
> names(dim(zxc))[4] <- c('month')
> names(dim(zxc))[5] <- c('year')

> dim(zxc)
  dat   var  memb month  year   lat   lon 
    1     1     1     3     3   256   512 

> zxc[1, 1, 1, , , 100, 100]
         [,1]     [,2]     [,3]
[1,] 300.0381 300.3575 299.3486
[2,] 301.3340 301.0930 300.7203
[3,] 302.0320 301.4149 301.6695

```

### 3. Use self-defined function in Compute()

The workflow to use Compute() is: 'define the function' -> 'use Step() to assign the target/output dimensions' -> 'use AddStep() to build up the workflow' -> 'use Compute() to launch jobs on either the local workstation or fatnodes/Power9'.

It is no problem when you only have a simple function directly defined in your script (like the example in the [practical guide](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#step-and-addstep)). However, if the function is more complicated, you may want to save it as an independent file. In this case, the machines (Power 9 or fatnodes) cannot recognize your function, so the jobs will fail (if you run Compute() on your own local workstation, this problem does not exist).  

The solution is simple. First, put your function file somewhere on the machine. For example, on Power 9, put own_func.R at `/esarchive/scratch/<your_user_name>`. Second, in the script, source the file inside the function definition (see the example below), so the machine can find your function.  

```r
data <- Start(...,
              retrieve = FALSE)

func <- function(x) {
    source("/esarchive/scratch/aho/own_func.R")  #the path in Power 9
    y <- own_func(x, posdim = 'time')
    return(y)
}

step <- Step(fun = func,
             target_dims = c('time'),
             output_dims = c('time'))

wf <- AddStep(data, step)

res <- Compute(wf, ...)

```

### 4. Use package function in Compute()

In the workflow for Compute(), the first step is to define the function. If you want to use a function from a certain R package, you need to check whether the package is included in the R module (`r_module`) or library (`lib_dir`). Then, specify the package name before the function name (see the example below) so the machine can recognize which function you refer to.

```r
data <- Start(...,
              retrieve = FALSE)

func <- function(x) {
    y <- s2dverification::Season(x, posdim = 'time')  #specify package name
    return(y)
}

step <- Step(fun = func,
             target_dims = c('time'),
             output_dims = c('time'))

wf <- AddStep(data, step)

  res <- Compute(wf,
                 chunks = list(latitude = 2,
                               longitude = 2),
                 threads_load = 2,
                 threads_compute = 4,
                 cluster = list(queue_host = 'p1',  #your alias for power9
                                queue_type = 'slurm',
                                temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
                                lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/',  #s2dverification is involved here, so the machine can find Season()
                                r_module = 'startR/0.1.2-foss-2018b-R-3.5.0',
                                job_wallclock = '00:10:00',
                                cores_per_job = 4,
                                max_jobs = 4,
                                bidirectional = FALSE,
                                polling_period = 50
                               ),
                 ecflow_suite_dir = '/home/Earth/aho/startR_local/',
                 wait = TRUE
                 )

```

### 5. Do interpolation in Start() (using parameter 'transform')

If you want to do the interpolation within Start(), you can use the following four parameters:  

1.  **`transform`**: Assign the interpolation function. It is recommended to use `startR::CDORemapper`, the wrapper function of s2dverification::CDORemap().  
2.  **`transform_params`**: A list of the required inputs for `transform`. Taking `transform = CDORemapper` as an example, the common items are:  
   - `grid`: A character string specifying either a name of a target grid (recognized by CDO, e.g., 'r256x128', 't106grid') or a path to another NetCDF file with the target grid (a single grid must be defined in such a file).  
   - `method`: A character string specifying an interpolation method (recognized by CDO, e.g., 'con', 'bil', 'bic', 'dis'). The following long names are also supported: 'conservative', 'bilinear', 'bicubic', and 'distance-weighted'.  
   - `crop`: Whether to crop the data after interpolation with 'cdo sellonlatbox' (TRUE) or to extend the interpolated data to the whole region as CDO does by default (FALSE).  
             If crop = TRUE, the longitude and latitude borders to be cropped at are taken as the limits of the cells at the borders ('lons' and 'lats' are perceived as cell centers), i.e., the resulting array will contain data that cover the same area as the input array. This is equivalent to specifying crop = 'preserve', i.e., preserving the area.   
             If crop = 'tight', the borders to be cropped at are taken as the minimum and maximum cell centers in 'lons' and 'lats', i.e., the area covered by the resulting array may be smaller if interpolating from a coarse grid to a fine grid.  
             The parameter 'crop' also accepts a numeric vector of custom borders: c(western border, eastern border, southern border, northern border).  

3.  **`transform_vars`**: A character vector of the inner dimensions to be transformed, e.g., c('latitude', 'longitude').  
4.  **`transform_extra_cells`**: A numeric value indicating the number of grid cells to extend from the borders if the interpolated region is a subset of the whole region. The default is 2, which is consistent with the method in s2dverification::Load(). 
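
Below is a minimal sketch of how these four parameters fit together in a Start() call; the path pattern `repos`, the variable and the dimension names are hypothetical and should be adapted to your files.

```r
lon.min <- 0; lon.max <- 30
lat.min <- 30; lat.max <- 60

data <- Start(dat = repos,   # hypothetical path pattern
              var = 'tas',
              sdate = '20170101',
              time = 'all',
              latitude = values(list(lat.min, lat.max)),
              longitude = values(list(lon.min, lon.max)),
              transform = CDORemapper,
              transform_params = list(grid = 'r256x128',
                                      method = 'bilinear',
                                      crop = TRUE),
              transform_vars = c('latitude', 'longitude'),
              transform_extra_cells = 2,
              return_vars = list(latitude = NULL, longitude = NULL),
              retrieve = FALSE)
```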

You can find an example script here [ex1_1_tranform.R](/inst/doc/usecase/ex1_1_tranform.R)  
You can see more information in s2dverification::CDORemap documentation [here](https://earth.bsc.es/gitlab/es/s2dverification/blob/master/man/CDORemap.Rd).


### 6. Get data attributes without retrieving data to workstation

One of the most useful functionalities of Start() is the parameter `retrieve = FALSE`. It creates a pointer to the data repository and gives you the data information without occupying your workstation memory. Even better, although the data are not actually retrieved, you can still use their attributes:  
```r
  header <- Start(dat = repos,
                  ...,
                  retrieve = FALSE)
  class(header)
#[1] "startR_cube"

# check attributes
  str(attr(header, 'Variables'))
# Get longitude and latitude
  lons <- attr(header, 'Variables')$common$lon
  lats <- attr(header, 'Variables')$common$lat
# Get dimension
  dim <- attr(header, 'Dimensions')

```

And if you want to retrieve the data to the workstation afterward, you can use `eval()`:  
```r
  data <- eval(header)

  class(data)
#[1] "startR_array"

# Get dimension
  dim(data)
```
Find examples at [usecase.md](/inst/doc/usecase.md), ex1_1 and ex1_3.


### 7. Avoid or specify a node from cluster in Compute()

When submitting a job to Fatnodes using Compute(), the parameter 'extra_queue_params' can be used to restrict the job to run on a specific node as follows:

```
  extra_queue_params = list('#SBATCH -w moore'), 
```

or to exclude a specific node from the job with:

```
  extra_queue_params = list('#SBATCH -x moore'), 
```

Look at the position of the `extra_queue_params` parameter in a full call of Compute():

```
  res <- Compute(wf1,
                 chunks = list(ensemble = 20,
                               sdate = 2),
                 threads_load = 2,
                 threads_compute = 4,
                 cluster = list(queue_host = queue_host,  
                                queue_type = 'slurm',
                                extra_queue_params = list('#SBATCH -x moore'),
                                cores_per_job = 2,
                                temp_dir = temp_dir,
                                r_module = 'R/3.5.0-foss-2018b',
                                polling_period = 10,
                                job_wallclock = '01:00:00',
                                max_jobs = 40,
                                bidirectional = FALSE),
                 ecflow_suite_dir = ecflow_suite_dir,  
                 wait = TRUE)
```

### 8. Define a path with multiple dependencies

The structure of the BSC Earth data repository 'esarchive' allows us to create a path pattern to the data by using different variables (enclosed in dollar signs), such as `$var$` for the variable name, or `$sdate$` for the start date of the simulation. Here is an example for loading monthly simulations of system4_m1 data:

`path <- '/esarchive/exp/ecmwf/system4_m1/monthly_mean/$var$_f6h/$var$_$sdate$.nc'`

The function Start() will require two parameters 'var' and 'sdate' to load the desired data.

In some cases, the creation of the path can be a bit more complicated. Some researchers create their own EC-Earth experiments, which are identified by an experiment ID (`$expid$`), with different model versions (`$version$`) and even different members (`$member$`):

| expid | member    | version   |
|-------|-----------|-----------|
| a1st  | r7i1p1f1  | v20190302 |
| a1sx  | r10i1p1f1 | v20190308 |

In this case, the variables member and version have different values depending on the expid (the member r10i1p1f1 does not exist for expid a1st). The path will include these variables:

`path <- '/esarchive/exp/ecearth/$expid$/diags/CMIP/EC-Earth-Consortium/EC-Earth3/historical/$member$/Omon/$var$/gn/$version$/$var$_Omon_EC-Earth3_historical_$member$_gn_$year$.nc'`

However, the following parameters are mandatory to make Start() aware that they are not independent variables:

```
     member_depends = 'expid',
     version_depends = 'expid',
```

The final Start() call will look like:

```
yrh1 = 1960
yrh2 = 2014
years  <- paste0(c(yrh1 : yrh2), '01-', c(yrh1 : yrh2), '12')
data <- Start(dat = path,  # the path pattern defined above
              var = 'tosmean',
              expid =  c('a1st','a1sx'),
              member = 'all',
              version =  'all',
              member_depends = 'expid',
              version_depends = 'expid',
              year = years,
              time = 'all',
              region = indices(1 : 4),
              return_vars = list(time = NULL, region = NULL),
              retrieve = TRUE)
```

### 9. Use CDORemap() in function

If you want to interpolate data with s2dverification::CDORemap() inside your function, you need to tell the 
machine which CDO module to use. Therefore, `CDO_module = 'CDO/1.9.5-foss-2018b'` should be 
added to the cluster list in Compute(). See the example in the use case [ex2_3_cdo.R](inst/doc/usecase/ex2_3_cdo.R).
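
A minimal sketch of where `CDO_module` goes in the cluster list; `wf` is the workflow built with AddStep(), and the other values are taken from the Compute() example in how-to #4 above and should be adapted to your case:

```r
res <- Compute(wf,
               chunks = list(latitude = 2,
                             longitude = 2),
               threads_load = 2,
               threads_compute = 4,
               cluster = list(queue_host = 'p1',  # your alias for Power9
                              queue_type = 'slurm',
                              temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
                              lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/',
                              r_module = 'startR/0.1.2-foss-2018b-R-3.5.0',
                              CDO_module = 'CDO/1.9.5-foss-2018b',  # CDO module for CDORemap in the function
                              job_wallclock = '00:10:00',
                              cores_per_job = 4,
                              max_jobs = 4,
                              bidirectional = FALSE,
                              polling_period = 50),
               ecflow_suite_dir = '/home/Earth/aho/startR_local/',
               wait = TRUE)
```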

### 10. The number of members depends on the start date

In seasonal forecasting, some start dates, such as November 1st, are more widely used than others. For those extensively used start dates, the number of members available is greater than for other start dates. This is the case for the seasonal forecast system ECMWF SEAS5 (system5_m1):
  - for the start date November 1st, 1999, there are 51 members available, while
  - for the start date September 1st, 1999, there are 25 members available.

When trying to load both start dates at once using Start(), the order in which the start dates are specified will impact the dimensions of the dataset if all members are loaded with `member = 'all'`:
  - `sdates = c('19991101', '19990901')`: the member dimension will be of length 51, showing missing values for members 26 to 51 in the second start date;
  - `sdates = c('19990901', '19991101')`: the member dimension will be of length 25, and no member will be missing (no NA values appear).
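
A minimal sketch of such a call, following the first ordering above; the path pattern `repos` and the dimension names are hypothetical:

```r
data <- Start(dat = repos,   # hypothetical path pattern for system5_m1
              var = 'tas',
              sdate = c('19991101', '19990901'),
              member = 'all',
              time = 'all',
              latitude = 'all',
              longitude = 'all',
              retrieve = FALSE)

attr(data, 'Dimensions')
# The member dimension is 51; members 26 to 51 of '19990901' will be NA.
# Swapping the order of 'sdate' gives a member dimension of length 25 instead.
```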

The code to reproduce this behaviour can be found in the Use Cases section, [example 1.4](/inst/doc/usecase/ex1_4_variable_nmember.R).

### 11. Select the longitude/latitude region

There are three ways to specify the dimension selectors: special keywords ('all', 'first', 'last'), indices, or values (find more details in the [practical guide](inst/doc/practical_guide.md)). 
The parameter 'xxx_reorder' is only effective when using **values**.
There are two reorder functions in the startR package, **Sort()** for latitude and **CircularSort()** for longitude. 
Sort() is a wrapper of the base function sort(), rearranging the values from low to high (decreasing = FALSE, the default) or 
from high to low (decreasing = TRUE). For example, if you want to sort latitude from 90 to -90, use `latitude_reorder = Sort(decreasing = TRUE)`. 
By this means, the result will always go from large to small values, no matter what the original order in the file is.

On the other hand, the concept of CircularSort() is different. It is used for a circular region, putting out-of-region values back into the region. 
It requires two input numbers defining the borders of the whole region, which are usually [0, 360] or [-180, 180]. For example, 
`longitude_reorder = CircularSort(0, 360)` means that the left border is 0 and the right border is 360, so 360 will be mapped back to 0, 361 to 1, 
and -1 will become 359. After circulating the values, CircularSort() also sorts them from small to large. This may produce a discontinuous sub-region, 
but the problem can be solved by assigning the borders correctly.
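
A minimal sketch combining both reorder functions; the path pattern `repos`, the variable and the dimension names are hypothetical:

```r
data <- Start(dat = repos,   # hypothetical path pattern
              var = 'tas',
              sdate = '20170101',
              time = 'all',
              latitude = values(list(20, 40)),
              latitude_reorder = Sort(decreasing = TRUE),   # returned from 40 down to 20
              longitude = values(list(-10, 20)),            # region crossing the Greenwich meridian
              longitude_reorder = CircularSort(-180, 180),
              return_vars = list(latitude = NULL, longitude = NULL),
              retrieve = FALSE)
```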

The following chart helps you decide how to use CircularSort() to get the desired region. 
The first row represents the longitude borders of the requested region, e.g., `values(list(lon.min, lon.max))`.
Note that this chart only illustrates the idea; the real numbers may differ slightly depending on the original/transformed values.

<img src="inst/doc/figures/lon-2.PNG" width="1000" />
Find the use case here: [ex1_5_latlon_reorder.R](inst/doc/usecase/ex1_5_latlon_reorder.R)

### 12. What will happen if reorder function is not used

It is recommended to always adopt the reorder functions (i.e., Sort() and CircularSort()) in Start(), so you can ensure the result is in line 
with your expectations (find more details at [how-to-11](#11-select-the-longitudelatitude-region) above). If these functions are not used, the situation is more complicated and it is easier to 
get unexpected results. 

Without reorder functions, the longitude and latitude selectors must be within the respective ranges in the original file, and the result 
will keep the same order as requested. If a transformation is performed simultaneously, you also need to consider the latitude/longitude range of 
the transformed grid. The requested region values cannot fall outside either the original or the transformed region.

The following chart shows some examples. 

<img src="inst/doc/figures/lon-3.PNG" width="800" />

### 13. Load specific grid points data
A single grid point or a list of grid points, defined by pairs of latitude and longitude values, can be loaded using **Start()**.
If the values do not match the spatial points defined in the files, **Start()** will load the nearest grid point (the user can also consider the regridding option to fulfill his/her expectations).
An example of how to load several grid points and how to transform the data can be found in the Use Cases section, [example 1.6](/inst/doc/usecase/ex1_6_gridpoint_data.R).
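
Below is a minimal sketch for a single grid point; the path pattern `repos` is hypothetical, and, as stated above, the nearest grid point is loaded if the values do not match exactly:

```r
point <- Start(dat = repos,   # hypothetical path pattern
               var = 'tas',
               sdate = '20170101',
               time = 'all',
               latitude = values(41.4),    # nearest grid point is loaded
               longitude = values(2.2),
               return_vars = list(latitude = NULL, longitude = NULL),
               retrieve = TRUE)
```
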
### 14. Find the error log when jobs are launched on Power9

Due to a connection problem, when Compute() dispatches jobs to Power9, each job in the ecFlow UI shows a 'Z' (zombie) beside it, no matter whether the job completed or failed.
The zombie prevents the error log from being shown in the ecFlow UI output frame. Therefore, you need to log in to Power9, go to the 'temp_dir' listed in the cluster list in Compute() and enter the job folder. You will find another folder with the same name as the previous layer; go down to the innermost folder, where you will see 'Chunk.1.err'.
For example, the path can be: "/gpfs/scratch/bsc32/bsc32734/startR_hpc/STARTR_CHUNKING_1665710775/STARTR_CHUNKING_1665710775/computation/lead_year_CHUNK_1/lon_CHUNK_1/lat_CHUNK_1/sdate_CHUNK_1/var_CHUNK_1/dataset_CHUNK_1/Chunk.1.err".

### 15. Specify extra function arguments in the workflow

The input arguments of the function may not only be the data; sometimes extra information is required. 
The additional arguments should be specified in AddStep(). The following example shows how to assign 'na.rm' in mean().

```
  func <- function(x, narm) {  # add additional argument 'narm'
    a <- apply(x, 2, mean, na.rm = narm)
    dim(a) <- c(sdate = length(a))
    return(a)
  }
  step <- Step(func, target_dims = c('ensemble', 'sdate'),
                     output_dims = c('sdate'))
  wf <- AddStep(data, step, narm = TRUE)  # specify the additional argument 'narm'
```

### 16. Use parameter 'return_vars' in Start()

Apart from the data array, retrieving auxiliary variables inside the netCDF files may also be needed. 
The parameter 'return_vars' is used to request such variables.
This parameter expects to receive a named variable list. The names are the variable names to be fetched in the netCDF files, and the corresponding value can be:  

(1) NULL, if the variable is common along all the file dimensions (i.e., it will be retrieved only once from the first involved file)  
(2) a vector of the file dimension names for which to retrieve the variable (i.e., the file dimensions along which the variable values differ)  
(3) a vector which includes the file dimension used for path pattern specification (i.e., 'dat' in the example below)  

For the first and second options, the fetched variable values will be saved in *$Variables$common$<variable_name>*. 
For the third option, the fetched variable values will be saved in *$Variables$<dataset_name>$<variable_name>*. 

Notice that if the variable is specified by values(), it will be automatically added to return_vars and its value will be NULL.

Here is an example showing the above three ways.

```
  repos <- "/esarchive/exp/ecmwf/system5_m1/monthly_mean/tas_f6h/$var$_$sdate$.nc"
  var <- 'tas'
  lon.min <- 10
  lon.max <- 20
  lat.min <- 20
  lat.max <- 30
  data <- Start(dat = repos,  # file dimension for path pattern specification
                var = var,
                sdate = c('20170101', '20170401'),  # file dimension; 'time' is dependent on 'sdate'
                ensemble = indices(1:5),
                time = indices(1:3),  # inner dimension, also an auxiliary variable containing forecast time information
                latitude = values(list(lat.min, lat.max)),  # inner dimension, common along all files 
                longitude = values(list(lon.min, lon.max)),  # inner dimension, common along all files
                return_vars = list(time = 'sdate',  # option (2)
                                   longitude = NULL,  # option (1)
                                   latitude = NULL),  # option (1)
                retrieve = FALSE
               )

```

In the return_vars list, we request information about three variables. The 'time' values differ for each sdate, while longitude and latitude are common variables among all the files.
You can use `str(data)` to see the information structure.

```
str(attr(data, 'Variables')$common)
List of 3
 $ time     : POSIXct[1:6], format: "2017-02-01 00:00:00" "2017-05-01 00:00:00" ...
 $ longitude: num [1:37(1d)] 10 10.3 10.6 10.8 11.1 ...
 $ latitude : num [1:36(1d)] 20.1 20.4 20.7 20.9 21.2 ...
 
dim((attr(data, 'Variables')$common$time))
sdate  time
    2     3
```

It is not necessary in this example, but you can try to change the return_vars entry for longitude to `longitude = 'dat'` (option (3)). 
You will find that longitude is moved from the $common list to the $dat1 list.

```
str(attr(data, 'Variables')$common)
List of 2
 $ time    : POSIXct[1:6], format: "2017-02-01 00:00:00" "2017-05-01 00:00:00" ...
 $ latitude: num [1:36(1d)] 20.1 20.4 20.7 20.9 21.2 ...

str(attr(data, 'Variables')$dat1)
List of 1
 $ longitude: num [1:37(1d)] 10 10.3 10.6 10.8 11.1 ...
```

### 17. Use parameter 'split_multiselected_dims' in Start()
The selectors can be not only vectors, but also multidimensional arrays. For instance, the 'time' dimension 
can be assigned a two-dimensional array `[sdate = 12, time = 31]`, i.e., 31 time steps for each of 12 start dates. 
You may want to have both 'sdate' and 'time' in the output dimensions, even though 'sdate' is not explicitly specified in Start(). 
The parameter 'split_multiselected_dims' is used for this goal. It is commonly used when the attributes of experimental data are used to load the corresponding observational data. 

The following script is part of the use case [ex1_2_exp_obs_attr.R](inst/doc/usecase/ex1_2_exp_obs_attr.R). 
The time selector for the observational data comes from the experimental data above (omitted here). The selector has two dimensions. 
Notice that the dimension name, which is 'time' here, must also be one of the dimension names of the selector. 

The resulting dimensions include 'sdate' because it is split from 'time'. Meanwhile, 
'date' disappears because 'merge_across_dims = TRUE' (see more explanation at [How-to-#2](#2-indicate-dependent-dimension-and-use-merge-parameters-in-start)). 

```r
# use time attributes from experimental data
dates <- attr(exp, 'Variables')$common$time
dim(dates)
#sdate time 
#    4    3 

obs <- Start(dat = repos_obs,
             var = 'tas',
             date = unique(format(dates, '%Y%m')),
             time = values(dates),  #dim: [sdate = 4, time = 3]
             time_across = 'date',
             lat = 'all',
             lon = 'all',
             merge_across_dims = TRUE,
             split_multiselected_dims = TRUE,
             synonims = list(lat = c('lat', 'latitude'),
                             lon = c('lon', 'longitude')),
             return_vars = list(lon = NULL,
                                lat = NULL,
                                time = 'date'),
             retrieve = FALSE)

print(attr(obs, 'Dimensions'))
#  dat   var sdate  time   lat   lon 
#    1     1     4     3   256   512 
```

The split selector can have more than two dimensions. 
The following example comes from the use case [ex1_7_split_merge.R](inst/doc/usecase/ex1_7_split_merge.R). 
The 'time' selector has three dimensions: 'sdate', 'syear', and 'time'.

```r
dates <- attr(hcst, 'Variables')$common$time
dim(dates)
#sdate syear  time
#    2     3    12

file_date <- sort(unique(gsub('-', '',
                              sapply(as.character(dates), substr, 1, 7))))
print(file_date)
#[1] "199607" "199612" "199701" "199707" "199712" "199801" "199807" "199812"
#[9] "199901"

obs <- Start(dat = path.obs,
             var = var_name,
             file_date = file_date,   # a vector with the information of sdate and syear
             latitude = indices(1:10),
             longitude = indices(1:10),
             time = values(dates),    # a 3-dim array (sdate, syear, time)
             time_across = 'file_date',
             merge_across_dims = TRUE,
             merge_across_dims_narm = TRUE,
             split_multiselected_dims = TRUE,
             synonims = list(latitude = c('lat','latitude'),
                             longitude = c('lon','longitude')),
             return_vars = list(latitude = 'dat',
                                longitude = 'dat',
                                time = 'file_date'),
             retrieve = TRUE)
```

## 2. Something goes wrong...

### 1. No space left on device

An issue of R is the accumulated trash files, which occupy the machine memory and therefore crash R. If the size of the data your R script deals with is reasonable but R crashes immediately after running and returns the error:    
>  
> No space left on device  
>
Go to **/dev/shm/** and `rm <large_trash_file_name>` 

Find more discussion in this [issue](https://earth.bsc.es/gitlab/es/s2dverification/issues/221)

### 2. ecFlow UI remains blue and does not update status

This situation will occur if:  
1. The Compute() parameter `wait` is set to `FALSE`, and  
2. The jobs are launched on an HPC whose connection with its login node is unidirectional (e.g., Power 9).  

Under this condition, the ecFlow UI will remain blue and will not update the status. 
To solve this problem, use `Collect()` in the R terminal after running Compute():

```r
  res <- Compute(wf,
                 ...,
                 wait = FALSE)

  result <- Collect(res, wait = TRUE)   # updates the ecFlow UI status continuously, but blocks the R session
  result <- Collect(res, wait = FALSE)  # returns the ecFlow UI status once only, but does not block the R session
```


### 3. Compute() successfully but then killed on R session

When running Compute() on HPCs, the machines are able to process data much larger than the local workstation can handle, so the computation works fine (i.e., in the ecFlow UI, the chunks show yellow in the end). However, after the computation, the output is sent back to the local workstation. **If the returned data is larger than the available local memory space, your R session will be killed.** Therefore, always pre-check whether the returned data will fit in your workstation's free memory. If not, subset the input data or reduce the output size through more computation.  

Further explanation: though the complete output (i.e., all the chunks merged into one returned array) cannot be sent back to the workstation, the chunk results (.Rds files) are complete and saved in the directory '<ecflow_suite_dir>/STARTR_CHUNKING_<job_id>'. If you still want to use the chunk results, you can find them there.
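
For instance, a chunk result can be read back with readRDS(); the folder and file names below are placeholders and should be checked on your workstation:

```r
# Hypothetical placeholder path; check the actual names under
# <ecflow_suite_dir>/STARTR_CHUNKING_<job_id>/
chunk <- readRDS('<ecflow_suite_dir>/STARTR_CHUNKING_<job_id>/<chunk_file>.Rds')
dim(chunk)
```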


### 4. My jobs work well in workstation and fatnodes but not on Power9 (or vice versa)

There are several possible reasons for this situation. Here we list some of them, and please let us know if you find any other reason not listed here yet.
- **R module or package version difference.** Sometimes, the versions among these 
machines are not consistent, and this might cause the problem. Try to load 
a different module to see if it fixes the problem.  
- **The package is not known by the machine you use.** If the package you use 
in the function is not included in the R module, you have to assign the 
parameter `lib_dir` in the cluster list in Compute() (see more details in 
[practical_guide.md](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#compute-on-cte-power-9)). 
- **The function is not specified with the package name ahead.** The package name needs 
to be added in front of the function, connected with '::' (e.g., `s2dv::Clim`) or with
 ':::' if the function is internal (e.g., `CSTools:::.cal`).
- **The sourced or loaded file is not on the machine you use.** If you use a self-defined 
function or load data within the function, you need to put those files on the machine 
you run the computation on, so the machine can find them (e.g., when submitting jobs 
to Power9, you should put the files on Power9 instead of the local workstation).
- **Connection problem.** Test a script that used to work for you (if you do not 
have one, go to [usecase.md](https://earth.bsc.es/gitlab/es/startR/tree/develop-FAQcluster/inst/doc/usecase) to find one!). 
If it fails, it means that your connection to the machine or the ecFlow setting has 
some problem.

###  5. Errors related to wrong file formatting

Several errors can be returned when the files are not correctly formatted. If you see one of these errors, review the coordinates in your files:

```
Error in Rsx_nc4_put_vara_double: NetCDF: Numeric conversion not representable
Error in ncvar_put(ncdf_object, defined_vars[[var_counter]]$name, arrays[[i]], : 
 C function Rsx_nc4_put_vara_double returned error
```

```
Error in dim(x$x) <- dim_bk :
  dims [product 1280] do not match the length of object [1233]  <- this '1233' changes every time
```

```
Error in s2dverification::CDORemap(data_array, lons, lats, ...) : 
  Found invalid values in 'lons'.
```

```
ERROR: invalid cell
 
Aborting in file clipping.c, line 1295 ...
Error in s2dverification::CDORemap(data_array, lons, lats, ...) : 
  CDO remap failed.
```
###  6. Errors using a new cluster (setting Nord3)

When using a new cluster, some errors may happen. Here are some behaviours detected in issue #64.

- When running Compute(), a password is requested:

```
Password:
```

Check that the host name for the cluster has been included in `.ssh/config`. 
Check also that passwordless access has been properly set up. You can verify that you can access the cluster without providing the password by using the host name, e.g., `ssh nord3` (see more info in the [**Practical guide**](inst/doc/practical_guide.md)).
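
A minimal sketch of such an entry in `~/.ssh/config`; the login node address and user name are placeholders to be adapted:

```
# Placeholder values: replace with the actual login node address and your HPC user name
Host nord3
    HostName <nord3_login_node_address>
    User <your_hpc_username>
```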

- An alias may not be available, such as 'enas' for 'esarchive'.

In this case, the error `No data files found for any of the specified datasets.` will be returned.

- Repetitive prints of module loading:

```
load UDUNITS/2.1.24 (PATH)
load NETCDF/4.1.3 (PATH, LD_LIBRARY_PATH, NETCDF)
load R/2.15.2 (PATH, LD_LIBRARY_PATH)
```

The .bashrc in your Nord3 home directory must be edited with the information from the [BSC ES wiki](https://earth.bsc.es/wiki/doku.php?id=computing:nord3) to load the correct modules. However, if you add a line before those, the result will be the one above.

Check your .bashrc to avoid loading modules before defining the department ones.


- R versions: Workstation version versus remote cluster version

Some functions depend on the R version used, and the versions should be compatible between the workstation and the remote cluster. If you see the error:

```
cannot read workspace version 3 written by R 3.6.2; need R 3.5.0 or newer
```

change the R version used in your workstation to a newer one.