diff --git a/inst/doc/faq.md b/inst/doc/faq.md
index b6df41cedefa5cd487bb6dd5d5767bf0a74c3c28..0b3bc53e56e09a5e36e768f53738ef93e21c5100 100644
--- a/inst/doc/faq.md
+++ b/inst/doc/faq.md
@@ -10,12 +10,14 @@ This document intends to be the first reference for any doubts that you may have
    4. [Use package function in Compute()](#4-use-package-function-in-compute)
    5. [Do interpolation in Start() (using parameter 'transform')](#5-do-interpolation-in-start-using-parameter-transform)
    6. [Get data attributes without retrieving data to workstation](#6-get-data-attributes-without-retrieving-data-to-workstation)
+   7. [Avoid or specify a node from cluster in Compute()](#7-avoid-or-specify-a-node-from-cluster-in-compute)
 
 2. **Something goes wrong...**
 
    1. [No space left on device](#1-no-space-left-on-device)
    2. [ecFlow UI remains blue and does not update status](#2-ecflow-ui-remains-blue-and-does-not-update-status)
-   3. [Compute() successfully but then killed on R session](#3-compute-successfully-but-then-killed-on-r-session)
+   3. [Compute() successfully but then killed on R session](#3-compute-successfully-but-then-killed-on-r-session)
+   4. [My jobs work well in workstation and fatnodes but not on Power9 (or vice versa)](#4-my-jobs-work-well-in-workstation-and-fatnodes-but-not-on-power9-or-vice-versa)
 
 
 ## 1. How to
@@ -295,6 +297,42 @@ And if you want to retrieve the data to the workstation afterward, you can use `
 
 Find examples at [usecase.md](/inst/doc/usecase.md), ex1_1 and ex1_3.
 
+### 7. Avoid or specify a node from cluster in Compute()
+
+When submitting a job to the fatnodes with Compute(), the parameter `extra_queue_params` can be used to restrict the job to a specific node (`-w` is the short form of Slurm's `--nodelist` option), as follows:
+
+```
+  extra_queue_params = list('#SBATCH -w moore'),
+```
+
+or to exclude a specific node from the job (`-x` is the short form of Slurm's `--exclude` option):
+
+```
+  extra_queue_params = list('#SBATCH -x moore'),
+```
+
+Note where the `extra_queue_params` parameter sits in a full call to Compute():
+
+```
+  res <- Compute(wf1,
+                 chunks = list(ensemble = 20,
+                               sdate = 2),
+                 threads_load = 2,
+                 threads_compute = 4,
+                 cluster = list(queue_host = queue_host,
+                                queue_type = 'slurm',
+                                extra_queue_params = list('#SBATCH -x moore'),  # exclude node 'moore'
+                                cores_per_job = 2,
+                                temp_dir = temp_dir,
+                                r_module = 'R/3.5.0-foss-2018b',
+                                polling_period = 10,
+                                job_wallclock = '01:00:00',
+                                max_jobs = 40,
+                                bidirectional = FALSE),
+                 ecflow_suite_dir = ecflow_suite_dir,
+                 wait = TRUE)
+```
+
 ## Something goes wrong...
 
 ### 1. No space left on device
@@ -332,3 +370,26 @@ When Compute() on HPCs, the machines are able to process data which are much lar
 
 Further explanation: though the complete output (i.e., merging all the chunks into one returned array) cannot be sent back to workstation, but the chunking results (.Rds file) are completed and saved in the directory '/STARTR_CHUNKING_'. If you still want to use the chunking results, you can find them there.
 
+
+### 4. My jobs work well in workstation and fatnodes but not on Power9 (or vice versa)
+
+There are several possible reasons for this. We list some of them here; please let us know if you find another one that is not listed yet.
+- **R module or package version difference.** The versions installed on these
+machines are not always consistent, which can cause the problem. Try loading a
+different module to see if that fixes it.
+- **The package is not known by the machine you use.** If a package you use
+in the function is not included in the R module, you have to set the
+parameter `lib_dir` in the cluster list in Compute() (see more details in
+[practical_guide.md](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#compute-on-cte-power-9)).
+- **The function is not prefixed with its package name.** The package name needs
+to be added in front of the function, joined with '::' (e.g., `s2dv::Clim`) or with
+':::' if the function is internal (e.g., `CSTools:::.cal`).
+- **A sourced or loaded file is not on the machine you use.** If you use a self-defined
+function or load data in the function, you need to put those files on the machine
+you run the computation on, so the machine can find them (e.g., when submitting jobs
+to Power9, you should put the files on Power9 instead of the local workstation).
+- **Connection problem.** Test a script that used to run successfully (if you do not
+have one, go to [usecase.md](https://earth.bsc.es/gitlab/es/startR/tree/develop-FAQcluster/inst/doc/usecase) to find one!).
+If it fails, your connection to the machine or the ecFlow setting has
+a problem.
+