From fbd1c11b714d89b1920c936e1c48266a32d5b659 Mon Sep 17 00:00:00 2001 From: nperez Date: Wed, 12 Feb 2020 11:20:57 +0100 Subject: [PATCH 1/3] FAQ for specifying avoiding a node #43 --- inst/doc/faq.md | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/inst/doc/faq.md b/inst/doc/faq.md index b6df41c..b0b0493 100644 --- a/inst/doc/faq.md +++ b/inst/doc/faq.md @@ -10,6 +10,7 @@ This document intends to be the first reference for any doubts that you may have 4. [Use package function in Compute()](#4-use-package-function-in-compute) 5. [Do interpolation in Start() (using parameter 'transform')](#5-do-interpolation-in-start-using-parameter-transform) 6. [Get data attributes without retrieving data to workstation](#6-get-data-attributes-without-retrieving-data-to-workstation) + 7. [Avoid or specify a node from cluster in Compute()](#7-avoid-or-specify-a-node-from-cluster-in-Compute) 2. **Something goes wrong...** @@ -295,6 +296,42 @@ And if you want to retrieve the data to the workstation afterward, you can use ` Find examples at [usecase.md](/inst/doc/usecase.md), ex1_1 and ex1_3. +### 7. 
Avoid or specify a node from cluster in Compute() + +When submitting a job to Fatnodes using Compute(), the parameter 'extra_queue_params' can be used to restrict the job to run on a specific node as follows: + +``` + extra_queue_params = list('#SBATCH -w moore'), +``` + +or to exclude a specific node from the job: + +``` + extra_queue_params = list('#SBATCH -x moore'), +``` + +Look at the position of the `extra_queue_params` parameter in a full call of Compute(): + +``` + res <- Compute(wf1, + chunks = list(ensemble = 20, + sdate = 2), + threads_load = 2, + threads_compute = 4, + cluster = list(queue_host = queue_host, + queue_type = 'slurm', + extra_queue_params = list('#SBATCH -x moore'), + cores_per_job = 2, + temp_dir = temp_dir, + r_module = 'R/3.5.0-foss-2018b', + polling_period = 10, + job_wallclock = '01:00:00', + max_jobs = 40, + bidirectional = FALSE), + ecflow_suite_dir = ecflow_suite_dir, + wait = TRUE) +``` + ## Something goes wrong... ### 1. No space left on device -- GitLab From 0cfb4b725b177aba27beb59b4f47d11e809e1659 Mon Sep 17 00:00:00 2001 From: aho Date: Thu, 13 Feb 2020 14:05:35 +0100 Subject: [PATCH 2/3] Correct the internal link. --- inst/doc/faq.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/inst/doc/faq.md b/inst/doc/faq.md index b0b0493..58cfcb0 100644 --- a/inst/doc/faq.md +++ b/inst/doc/faq.md @@ -10,7 +10,7 @@ This document intends to be the first reference for any doubts that you may have 4. [Use package function in Compute()](#4-use-package-function-in-compute) 5. [Do interpolation in Start() (using parameter 'transform')](#5-do-interpolation-in-start-using-parameter-transform) 6. [Get data attributes without retrieving data to workstation](#6-get-data-attributes-without-retrieving-data-to-workstation) - 7. [Avoid or specify a node from cluster in Compute()](#7-avoid-or-specify-a-node-from-cluster-in-Compute) + 7. 
[Avoid or specify a node from cluster in Compute()](#7-avoid-or-specify-a-node-from-cluster-in-compute) 2. **Something goes wrong...** -- GitLab From 96edf7b33a9eff43c743908c89fada586db19116 Mon Sep 17 00:00:00 2001 From: aho Date: Thu, 13 Feb 2020 15:06:02 +0100 Subject: [PATCH 3/3] Add FAQ 2-4. --- inst/doc/faq.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/inst/doc/faq.md b/inst/doc/faq.md index 58cfcb0..0b3bc53 100644 --- a/inst/doc/faq.md +++ b/inst/doc/faq.md @@ -16,7 +16,8 @@ This document intends to be the first reference for any doubts that you may have 2. **Something goes wrong...** 1. [No space left on device](#1-no-space-left-on-device) 2. [ecFlow UI remains blue and does not update status](#2-ecflow-ui-remains-blue-and-does-not-update-status) - 3. [Compute() successfully but then killed on R session](#3-compute-successfully-but-then-killed-on-r-session) + 3. [Compute() successfully but then killed on R session](#3-compute-successfully-but-then-killed-on-r-session) + 4. [My jobs work well in workstation and fatnodes but not on Power9 (or vice versa)](#4-my-jobs-work-well-in-workstation-and-fatnodes-but-not-on-power9-or-vice-versa) ## 1. How to @@ -369,3 +370,26 @@ When Compute() on HPCs, the machines are able to process data which are much lar Further explanation: though the complete output (i.e., merging all the chunks into one returned array) cannot be sent back to workstation, but the chunking results (.Rds file) are completed and saved in the directory '/STARTR_CHUNKING_'. If you still want to use the chunking results, you can find them there. + +### 4. My jobs work well in workstation and fatnodes but not on Power9 (or vice versa) + +There are several possible reasons for this situation. Here we list some of them, and please let us know if you find any other reason not listed here yet. 
+- **R module or package version difference.** Sometimes the versions on these +machines are not consistent, which might cause the problem. Try loading a +different module to see if it fixes the problem. +- **The package is not known by the machine you use.** If the package you use +in the function is not included in the R module, you have to set the +parameter `lib_dir` in the cluster list in Compute() (see more details in +[practical_guide.md](https://earth.bsc.es/gitlab/es/startR/blob/master/inst/doc/practical_guide.md#compute-on-cte-power-9)). +- **The function is not prefixed with its package name.** The package name needs +to be added in front of the function, connected with '::' (e.g., `s2dv::Clim`) or with + ':::' if the function is internal (e.g., `CSTools:::.cal`). +- **A sourced or loaded file is not on the machine you use.** If you use a self-defined +function or load data in the function, you need to put those files on the machine +you run the computation on, so the machine can find them (e.g., when submitting jobs +to Power9, you should put the files on Power9 instead of the local workstation). +- **Connection problem.** Test a script that used to run successfully (if you do not +have one, go to [usecase.md](https://earth.bsc.es/gitlab/es/startR/tree/develop-FAQcluster/inst/doc/usecase) to find one!). +If it fails, your connection to the machine or the ecFlow setting has +a problem. + -- GitLab