The final aim of the startR workflow is to avoid exhausting the R session memory due to the size of the input data. Therefore, the intended startR workflow uses the data dimensions required for the key computations to compute the analysis, and chunks the data into pieces along the remaining dimensions.
When a very complex analysis is carried out in a single step, it may use all the dimensions for computation, leaving no free dimension to chunk along.
In this latter case, and depending on the analysis to be performed, it may still be possible to chunk along a dimension that is used in the analysis. You just need to define the parameter `nchunks = chunk_indices` in your function and use it inside that function.
You can find a working example in the use case [RainFARM precipitation downscaling](https://earth.bsc.es/gitlab/es/startR/-/blob/develop-RainFARMCase/inst/doc/usecase/ex2_5_rainFARM.R). In that example, the start date dimension is used for chunking, since the downscaling method only needs the longitude and latitude dimensions, while the subsequent function requires sdate (the chunked dimension) to save the data in esarchive format. The result is independent of the start date, but the function saveExp needs this dimension for another purpose, namely to create the file names, so 'chunk_indices' can be used just to determine the name of each output file.
There are many other possible applications of this parameter; please report any other use cases you create.
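As a minimal sketch of this pattern (the function body, the chunked dimension name, and the file-name format below are illustrative assumptions, not taken from the use case):

```r
# Hypothetical sketch: a per-chunk function that receives the chunk indices.
# When the function declares a parameter 'nchunks = chunk_indices', startR
# fills it in at run time with the indices of the chunk being processed.
save_chunk <- function(data, nchunks = chunk_indices) {
  # Use the index along the chunked 'sdate' dimension (an assumed name)
  # to build a chunk-specific output file name
  file_name <- sprintf("output_sdate_chunk%02d.nc", nchunks['sdate'])
  # ... write 'data' to 'file_name' here ...
  file_name
}
```

Since the chunk index only affects the file name, the numerical result stays independent of how the data are chunked.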
# Something goes wrong...
### 1. No space left on device
A known issue is the accumulation of trash files, which fill the machine's shared memory and therefore crash R. If the data your R script deals with are of reasonable size but R crashes immediately after running and returns the error:
>
> No space left on device
>
Go to **/dev/shm/** and `rm <large_trash_file_name>`
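To locate the offending files first, you can inspect the partition before removing anything (a shell sketch; `<large_trash_file_name>` remains a placeholder for the file you identify):

```shell
# Check how full the shared-memory partition is
df -h /dev/shm
# List its largest entries to spot leftover temporary files
ls -lhS /dev/shm | head
# Then remove only the files you own, e.g.:
# rm /dev/shm/<large_trash_file_name>
```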
Find more discussion in this [issue](https://earth.bsc.es/gitlab/es/s2dverification/issues/221).
### 2. ecFlow UI remains blue and does not update status
This situation will occur if:
1. The Compute() parameter `wait` is set to `FALSE`, and
2. The jobs are launched on an HPC whose connection with its login node is unidirectional (e.g., Power 9).
Under these conditions, the ecFlow UI will remain blue and will not update the status.
To solve this problem, use `Collect()` in the R terminal after running Compute():
```r
res <- Compute(wf,
...,
wait = FALSE)
result <- Collect(res, wait = TRUE)  # updates the ecFlow UI status continuously, but blocks the R session
result <- Collect(res, wait = FALSE)  # returns the current status once, without blocking the R session
```
### 3. Compute() finishes successfully but the R session is killed afterwards
When running Compute() on HPCs, the machines can process data much larger than the local workstation can, so the computation works fine (i.e., on the ecFlow UI, the chunks turn yellow at the end). However, after the computation, the output is sent back to the local workstation. **If the returned data are larger than the available local memory, your R session will be killed.** Therefore, always check beforehand whether the returned data will fit in your workstation's free memory. If not, subset the input data or reduce the output size through further computation.
Further explanation: even though the complete output (i.e., all the chunks merged into one returned array) cannot be sent back to the workstation, the per-chunk results (.Rds files) are complete and saved in the directory '<ecflow_suite_dir>/STARTR_CHUNKING_<job_id>'. If you still want to use the chunked results, you can find them there.
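As a sketch of how to load those per-chunk results back into R (the directory pattern follows the description above; `ecflow_suite_dir` and `job_id` are placeholders you must fill in yourself):

```r
# Hypothetical sketch: read the per-chunk .Rds results saved by Compute()
chunk_dir <- file.path(ecflow_suite_dir, paste0("STARTR_CHUNKING_", job_id))
chunk_files <- list.files(chunk_dir, pattern = "\\.Rds$", full.names = TRUE)
chunks <- lapply(chunk_files, readRDS)  # a list with one result per chunk
```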
### 4. My jobs work well in workstation and fatnodes but not on Power9 (or vice versa)
There are several possible reasons for this situation. Some of them are listed here; please let us know if you find any other reason not listed yet.
- **R module or package version differences.** Sometimes the versions on these
machines are not consistent, which can cause problems. Try loading a
different module to see if that fixes the problem.
- **The package is not available on the machine you use.** If a package you use
in the function is not included in the R module, you have to set the
parameter `lib_dir` in the cluster list in Compute() (see more details in
[practical_guide.md](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#compute-on-cte-power-9)).
- **The function is not prefixed with its package name.** The package name needs
to be added in front of the function, connected with '::' (e.g., `s2dv::Clim`), or with
':::' if the function is internal (e.g., `CSTools:::.cal`).
- **Sourced or loaded files are not on the machine you use.** If you use a self-defined
function or load data inside the function, you need to put those files on the machine
that runs the computation, so the machine can find them (e.g., when submitting jobs
to Power 9, put the files on Power 9 rather than on your local workstation).
- **Connection problems.** Test a script that used to run successfully (if you do not
have one, go to [usecase.md](https://earth.bsc.es/gitlab/es/startR/tree/develop-FAQcluster/inst/doc/usecase) to find one!).
If it fails, your connection to the machine or your ecFlow setup has a
problem.
### 5. Errors related to wrong file formatting
Several errors can be returned when the files are not correctly formatted. If you see one of these errors, review the coordinates in your files:
```
Error in Rsx_nc4_put_vara_double: NetCDF: Numeric conversion not representable
Error in ncvar_put(ncdf_object, defined_vars[[var_counter]]$name, arrays[[i]], :
C function Rsx_nc4_put_vara_double returned error
```
```
Error in dim(x$x) <- dim_bk :
dims [product 1280] do not match the length of object [1233] <- this '1233' changes every time
```
```
Error in s2dv::CDORemap(data_array, lons, lats, ...) :
Found invalid values in 'lons'.
```
```
ERROR: invalid cell
Aborting in file clipping.c, line 1295 ...
Error in s2dv::CDORemap(data_array, lons, lats, ...) :
```
### 6. Errors when using a new cluster
When using a new cluster, some errors may occur. Here are some behaviours detected in issue #64.
- When running Compute(), a password is requested:
```
Password:
```
Check that the host name of the cluster has been included in `.ssh/config`.
Check also that passwordless access has been properly set up. You can verify that you can access the cluster without providing a password by using the host name, e.g. `ssh nord3` (see more info in the [**Practical guide**](inst/doc/practical_guide.md)).
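For reference, a minimal `.ssh/config` entry could look like the following sketch (the alias matches the `ssh nord3` example above; the server address and user name are placeholders):

```
Host nord3
    HostName <cluster_login_node_address>
    User <your_hpc_username>
```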
In this case, the error `No data files found for any of the specified datasets.` will be returned.
- Repetitive prints of modules being loaded:
```
load UDUNITS/2.1.24 (PATH)
load NETCDF/4.1.3 (PATH, LD_LIBRARY_PATH, NETCDF)
load R/2.15.2 (PATH, LD_LIBRARY_PATH)
```
The .bashrc in your Nord 3 home directory must be edited with the information from the [BSC ES wiki](https://earth.bsc.es/wiki/doku.php?id=computing:nord3) so that the correct modules are loaded. However, if you add a line before those, you will get the repeated output shown above.
Check your .bashrc and avoid loading modules before the department ones are defined.
- R versions: the workstation version versus the remote cluster version
Some functions depend on the R version used, and the versions should be compatible between the workstation and the remote cluster. If you get the error:
```
cannot read workspace version 3 written by R 3.6.2; need R 3.5.0 or newer
```
change the R version used on your workstation to a newer one.
### 7. Start() fails retrieving data
If you get the following error message:
```
Exploring files... This will take a variable amount of time depending
* on the issued request and the performance of the file server...
Error in R_nc4_open: No such file or directory
Error in file_var_reader(NULL, file_object, NULL, var_to_read, synonims) :
Either 'file_path' or 'file_object' must be provided.
```
check whether your path contains the wildcard $var$, where $var$ stands for the variable to retrieve from the files. If not, try adding it as part of the path or the file name.
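For instance, a Start() call whose path includes the $var$ wildcard could look like the following sketch (the path, variable, and date are illustrative placeholders, not a real dataset):

```r
# Hypothetical sketch: '$var$' in the path is replaced by the values of the
# 'var' dimension, so Start() can locate the per-variable files
data <- Start(dat = '/esarchive/exp/.../monthly_mean/$var$/$var$_$sdate$.nc',
              var = 'tas',
              sdate = '20000101',
              retrieve = TRUE)
```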
### 8. Error 'caught segfault' when job submitted to HPCs
This error message implies that there is not enough memory available for the
computation.
Check whether the partition `/dev/shm/` is full. If trash files occupy this partition,
the process you are running may fail. Remove the files and re-run your code.
If the error persists, test your code with a smaller data sample to rule out a problem in your code, since this error message indicates that you are requesting more memory than is available.
### 9. White lines on the figure of interpolated irregular data
To process irregular grid data, we can load the data with Start() and interpolate them to a regular grid with other tools (e.g., s2dv::CDORemap).
In some instances, when we plot the interpolated data, we see white lines on the map (see the figure below).
To solve this problem, we can try excluding the first and last indices of the latitude and longitude dimensions in the Start() call.
Check [use case ex1_15](inst/doc/usecase/ex1_15_irregular_grid_CDORemap.R) for the example script (Case 2). The solution of each case may differ, so if you find this solution does not work for your case, please open an issue.
<img src="inst/doc/figures/faq_2_9_white_stripes.png" width="400" />
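A sketch of this workaround (the dimension names 'latitude' and 'longitude' and the sizes `n_lat`/`n_lon` are assumptions; all other Start() arguments are elided):

```r
# Hypothetical sketch: exclude the first and last grid points along the
# spatial dimensions when loading irregular grid data
data <- Start(...,  # dataset and other dimension arguments as usual
              latitude = indices(2:(n_lat - 1)),    # drop first and last latitude
              longitude = indices(2:(n_lon - 1)))   # drop first and last longitude
```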