The main aim of the startR workflow is to avoid crashing the R session due to the size of the input data. The ideal workflow therefore uses the data dimensions required for the key computations to perform the analysis, and chunks the data into pieces along the remaining dimensions.

When a very complex analysis is carried out in a single step, it may use all the dimensions for the computation, while normally only a free dimension can be used to chunk the data into pieces.

In this latter case, and depending on the analysis to be performed, we may still be able to chunk along a dimension that is used in the analysis. You just need to define a parameter `nchunks = chunk_indices` in your function and use it inside that function, as shown in the sketch below.

You can find a working example in the use case [RainFARM precipitation downscaling](https://earth.bsc.es/gitlab/es/startR/-/blob/develop-RainFARMCase/inst/doc/usecase/ex2_5_rainFARM.R). In that example, the start date dimension is used for chunking, since the downscaling method only needs lon and lat, but the subsequent function requires sdate (the chunked dimension) to save the data in esarchive format. The result is independent of the start date, yet the function saveExp needs that dimension for another purpose, namely to create the file names; therefore we can use `chunk_indices` just to determine the names of the output files.
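A minimal sketch of this mechanism, based on the description above (the function name, the chunked dimension 'sdate' and the file name pattern are illustrative):

```r
# 'chunk_indices' is filled in by startR at run time with the index of the
# current chunk along each chunked dimension.
fun <- function(x, nchunks = chunk_indices) {
  # Use the chunk index only to build a per-chunk output file name.
  file_name <- paste0("output_sdate_", nchunks['sdate'], ".nc")
  # ... compute on 'x' and write the result to 'file_name' ...
  x
}
step <- Step(fun, target_dims = c('lat', 'lon'), output_dims = c('lat', 'lon'))
```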

There are many other possible applications of this parameter; please report any other use cases you may create.


# Something goes wrong...

### 1. No space left on device

An issue of R is the accumulation of trash files, which can fill up the machine's memory and therefore crash R. If the size of the data your R script deals with is reasonable, but R crashes immediately after running and returns the error:
>
> No space left on device
>
go to **/dev/shm/** and `rm <large_trash_file_name>`.
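If you want to check first what is occupying the partition, a quick inspection from within R (a sketch; remove only files you recognise as yours):

```r
# List the files in /dev/shm together with their sizes in MB.
files <- list.files("/dev/shm", full.names = TRUE)
print(data.frame(file = files, size_MB = round(file.info(files)$size / 1024^2, 1)))
# file.remove("/dev/shm/<large_trash_file_name>")
```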

Find more discussion in this [issue](https://earth.bsc.es/gitlab/es/s2dverification/issues/221).

### 2. ecFlow UI remains blue and does not update status

This situation will occur if:
1. The Compute() parameter `wait` is set to `FALSE`, and
2. The jobs are launched on an HPC whose connection with its login node is unidirectional (e.g., Power 9).

Under these conditions, the ecFlow UI will remain blue and will not update the status.
To solve this problem, use `Collect()` in the R terminal after running Compute():

```r
  res <- Compute(wf,
                 ...,
                 wait = FALSE)

  result <- Collect(res, wait = TRUE)   # updates the ecFlow UI status continuously, but blocks the R session
  result <- Collect(res, wait = FALSE)  # returns the ecFlow UI status once, without blocking the R session
```


### 3. Compute() runs successfully but the R session is killed afterwards

When running Compute() on HPCs, the machines can process much larger data than the local workstation, so the computation works fine (i.e., on the ecFlow UI, the chunks turn yellow in the end). However, after the computation, the output is sent back to the local workstation. **If the returned data is larger than the available local memory space, your R session will be killed.** Therefore, always pre-check whether the returned data will fit in your workstation's free memory. If not, subset the input data or reduce the output size through more computation.
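A rough pre-check of the returned data size (the dimension sizes below are illustrative): each double-precision value takes 8 bytes.

```r
# Multiply the expected output dimension sizes by 8 bytes per value.
output_dims <- c(lat = 256, lon = 512, sdate = 30, member = 25)
size_gb <- prod(output_dims) * 8 / 1024^3
size_gb  # compare this against the free memory on your workstation
```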

Further explanation: even though the complete output (i.e., all the chunks merged into one returned array) cannot be sent back to the workstation, the per-chunk results (.Rds files) are complete and saved in the directory '<ecflow_suite_dir>/STARTR_CHUNKING_<job_id>'. If you still want to use the chunked results, you can find them there.


### 4. My jobs work well on the workstation and fatnodes but not on Power9 (or vice versa)

There are several possible reasons for this situation. Here we list some of them; please let us know if you find any other reason not listed here yet.
- **R module or package version difference.** Sometimes, the versions on these machines are not consistent, and this might cause the problem. Try loading a different module to see if it fixes the problem.
- **The package is not available on the machine you use.** If a package you use in the function is not included in the R module, you have to assign the parameter `lib_dir` in the cluster list in Compute() (see more details in [practical_guide.md](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#compute-on-cte-power-9)).
- **The function is not prefixed with its package name.** The package name needs to be added in front of the function, connected with '::' (e.g., `s2dv::Clim`), or with ':::' if the function is internal (e.g., `CSTools:::.cal`); see the sketch after this list.
- **Sourced or loaded files are not on the machine you use.** If you use a self-defined function or load data inside the function, you need to put those files on the machine you run the computation on, so the machine can find them (e.g., when submitting jobs to Power9, you should put the files on Power9 instead of the local workstation).
- **Connection problem.** Test a script that used to run successfully (if you do not have one, go to [usecase.md](https://earth.bsc.es/gitlab/es/startR/tree/develop-FAQcluster/inst/doc/usecase) to find one!). If it fails, it means that your connection to the machine or the ecFlow setting has some problem.
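A minimal sketch of explicit namespacing inside a Step function, so the remote machine does not rely on the package being attached there (the function and dimension names are illustrative):

```r
fun <- function(data) {
  # Exported functions take '::'; internal ones take ':::' (e.g., CSTools:::.cal).
  s2dv::MeanDims(data, 'sdate')
}
step <- Step(fun, target_dims = c('sdate', 'member'), output_dims = 'member')
```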

### 5. Errors related to wrong file formatting

Several errors can be returned when the files are not correctly formatted. If you see one of these errors, review the coordinates in your files:

```
Error in Rsx_nc4_put_vara_double: NetCDF: Numeric conversion not representable
Error in ncvar_put(ncdf_object, defined_vars[[var_counter]]$name, arrays[[i]], : 
 C function Rsx_nc4_put_vara_double returned error
```

```
Error in dim(x$x) <- dim_bk :
  dims [product 1280] do not match the length of object [1233]  <- this '1233' changes every time
```

```
Error in s2dv::CDORemap(data_array, lons, lats, ...) : 
  Found invalid values in 'lons'.
```

```
ERROR: invalid cell
 
Aborting in file clipping.c, line 1295 ...
Error in s2dv::CDORemap(data_array, lons, lats, ...) : 
  CDO remap failed.
```
### 6. Errors when using a new cluster (setting up Nord3)

When using a new cluster, some errors may appear. Here are some behaviours detected in issue #64.

- When running Compute(), a password is requested:

```
Password:
```

Check that the host name of the cluster has been included in `.ssh/config`.
Check also that passwordless access has been properly set up. You can verify that you can access the cluster without providing a password by using the host name, e.g., `ssh nord3` (see more info in the [**Practical guide**](inst/doc/practical_guide.md)).
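An illustrative `.ssh/config` entry (the host name below matches the `ssh nord3` example; replace the login node and user with your own):

```
Host nord3
    HostName <cluster_login_node>
    User <your_hpc_username>
```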

- An alias may not be available, such as 'esnas' for 'esarchive'.

In this case, the error `No data files found for any of the specified datasets.` will be returned.

- Repetitive prints of module loading:

```
load UDUNITS/2.1.24 (PATH)
load NETCDF/4.1.3 (PATH, LD_LIBRARY_PATH, NETCDF)
load R/2.15.2 (PATH, LD_LIBRARY_PATH)
```

The .bashrc in your Nord3 home directory must be edited with the information from the [BSC ES wiki](https://earth.bsc.es/wiki/doku.php?id=computing:nord3) to load the correct modules. However, if you add a line before those, the result will be the one above.

Check your .bashrc to avoid loading modules before the department ones are defined.


- R versions: Workstation version versus remote cluster version

Some functions depend on the R version used, and the versions on the workstation and on the remote cluster should be compatible. If you see the error:

```
cannot read workspace version 3 written by R 3.6.2; need R 3.5.0 or newer
```

change the R version used on your workstation to a newer one.
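A quick check to run on both the workstation and the cluster to compare versions:

```r
# Print the running R version; the workstation R should not be older than
# the version that wrote the workspace files on the cluster.
R.version.string
```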


### 7. Start() fails retrieving data

If you get the following error message:
```
Exploring files... This will take a variable amount of time depending
*   on the issued request and the performance of the file server...
Error in R_nc4_open: No such file or directory
Error in file_var_reader(NULL, file_object, NULL, var_to_read, synonims) :
  Either 'file_path' or 'file_object' must be provided.
```

check whether your path contains the $var$ tag, where $var$ stands for the variable to retrieve from the files. If not, try adding it as part of the path or the file name.
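An illustrative Start() call with the $var$ tag in the path (the path and the dimension values are examples, not real files):

```r
# $var$ is replaced by the value of the 'var' parameter when exploring files.
data <- Start(dat = '/path/to/data/$var$/$var$_$sdate$.nc',
              var = 'tas',
              sdate = '19901101',
              time = 'all',
              latitude = 'all',
              longitude = 'all',
              retrieve = TRUE)
```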


### 8. Error 'caught segfault' when job submitted to HPCs  

This error message implies that the memory space is not enough for the computation.
Check whether the partition `/dev/shm/` is empty. If trash files occupy this partition,
the process you are running may fail. Remove the files and re-run your code.

If the error persists, check your code with a smaller data sample to rule out a problem in your code, since this error message indicates that you are requesting more memory than is available.


### 9. White lines on the figure of interpolated irregular data

To process irregular grid data, we can load the data with Start() and interpolate it to a regular grid with other tools (e.g., s2dv::CDORemap).
In some instances, when we plot the interpolated data, we see white lines on the map (see the figure below).
To solve this problem, we can try to exclude the first and last indices of latitude and longitude in the Start() call, as in the sketch after the figure.
Check [use case ex1_15](inst/doc/usecase/ex1_15_irregular_grid_CDORemap.R) for the example script (Case 2). The solution for each case may differ, so if this solution does not work for your case, please open an issue.

<img src="inst/doc/figures/faq_2_9_white_stripes.png" width="400" />
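A sketch of excluding the first and last grid points in the Start() call (the path, dimension names and sizes are illustrative; adapt them to your files):

```r
# indices() selects positions along a dimension; here we drop the first and
# last grid point of each spatial dimension before interpolating.
data <- Start(dat = '/path/to/data/$var$_$sdate$.nc',
              var = 'tos',
              sdate = '20000101',
              latitude = indices(2:291),   # e.g., full dimension has 292 points
              longitude = indices(2:359),  # e.g., full dimension has 360 points
              retrieve = TRUE)
```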