diff --git a/inst/doc/faq.md b/inst/doc/faq.md index 83065d47e3ecef79af49e958188888d7eb45f4c7..3a065e0612b82b3c92b8de2d79e357067b22cc57 100644 --- a/inst/doc/faq.md +++ b/inst/doc/faq.md @@ -28,6 +28,7 @@ This document intends to be the first reference for any doubts that you may have 22. [Define the selector when the indices in the files are not aligned](#22-define-the-selector-when-the-indices-in-the-files-are-not-aligned) 23. [The best practice of using vector and list for selectors](#23-the-best-practice-of-using-vector-and-list-for-selectors) 24. [Do both interpolation and chunking on spatial dimensions](#24-do-both-interpolation-and-chunking-on-spatial-dimensions) + 25. [What to do if your function has too many target dimensions](#25-what-to-do-if-your-function-has-too-many-target-dimensions) 2. **Something goes wrong...** @@ -973,6 +974,10 @@ the usage of those parameters to avoid unecessary errors. We provide some [use cases](inst/doc/usecase/ex2_12_transform_and_chunk.R) showing the secure ways of transformation + chunking. +### 25. What to do if your function has too many target dimensions +Unfortunately, there is no perfect solution for now, until the multiple-steps-in-a-workflow feature becomes available. Please contact the maintainers to work out a workaround for your case. + + # Something goes wrong... ### 1. No space left on device diff --git a/inst/doc/practical_guide.md b/inst/doc/practical_guide.md index c56fc0b401797fd6b00a790606b4ca33c4c01b01..70b29a61b9d73cb7db017c3da9ed8ebf10c15be3 100644 --- a/inst/doc/practical_guide.md +++ b/inst/doc/practical_guide.md @@ -6,22 +6,25 @@ If you would like to start using startR rightaway on the BSC infrastructure, you ## Index -1. [**Motivation**](inst/doc/practical_guide.md#motivation) -2. [**Introduction**](inst/doc/practical_guide.md#introduction) -3. [**Configuring startR**](inst/doc/practical_guide.md#configuring-startr) -4. [**Using startR**](inst/doc/practical_guide.md#using-startr) - 1. [**Start()**](inst/doc/practical_guide.md#start) - 2. [**Step() and AddStep()**](inst/doc/practical_guide.md#step-and-addstep) - 3. [**Compute()**](inst/doc/practical_guide.md#compute) - 1. [**Compute() locally**](inst/doc/practical_guide.md#compute-locally) - 2. [**Compute() on CTE-Power 9**](inst/doc/practical_guide.md#compute-on-cte-power-9) - 3. [**Compute() on the fat nodes and other HPCs**](inst/doc/practical_guide.md#compute-on-the-fat-nodes-and-other-hpcs) - 4. [**Collect() and the EC-Flow GUI**](inst/doc/practical_guide.md#collect-and-the-ec-flow-gui) -5. [**Additional information**](inst/doc/practical_guide.md#additional-information) -6. [**Other examples**](inst/doc/practical_guide.md#other-examples) -7. [**Compute() cluster templates**](inst/doc/practical_guide.md#compute-cluster-templates) - -## Motivation +1. [**Motivation**](#1-motivation) +2. [**Introduction**](#2-introduction) +3. [**Configuring startR**](#3-configuring-startr) +4. [**Using startR**](#4-using-startr) + 1. [**Start()**](#4-1-start) + 2. [**Step() and AddStep()**](#4-2-step-and-addstep) + 3. [**Compute()**](#4-3-compute) + 1. [**Compute() locally**](#4-3-1-compute-locally) + 2. [**Compute() on HPCs**](#4-3-2-compute-on-hpcs) + 4. [**Collect() and the EC-Flow GUI**](#4-4-collect-and-the-ec-flow-gui) +5. [**Additional information**](#5-additional-information) + 1. [**How to choose the number of chunks, jobs and cores**](#5-1-how-to-choose-the-number-of-chunks-jobs-and-cores) + 2. [**How to clean a failed execution**](#5-2-how-to-clean-a-failed-execution) + 3. 
[**Visualizing the profiling of the execution**](#5-3-visualizing-the-profiling-of-the-execution) + 4. [**Pending features**](#5-4-pending-features) +6. [**Other examples**](#6-other-examples) +7. [**Compute() cluster templates**](#7-compute-cluster-templates) + +## 1. Motivation What would you do if you had to apply a custom statistical analysis procedure to a 10TB climate data set? Probably, you would need to use a scripting language to write a procedure which is able to retrieve a subset of data from the file system (it would rarely be possible to handle all of it at once on a single node), code the custom analysis procedure in that language, and apply it carefully and efficiently to the data. Afterwards, you would need to think of and develop a mechanism to dispatch the job mutiple times in parallel to an HPC of your choice, each of the jobs processing a different subset of the data set. You could do this by hand but, ideally, you would rather use EC-Flow or a similar general purpose workflow manager which would orchestrate the work for you. Also, it would allow you to visually monitor and control the progress, as well as keep an easy-to-understand record of what you did, in case you need to re-use it in the future. The mentioned solution, although it is the recommended way to go, is a demanding one and you could easily spend a few days until you get it running smoothly. Additionally, when developing the job script, you would be exposed to the difficulties of efficiently managing the data and applying the coded procedure to it. @@ -43,7 +46,7 @@ _**Note 1**_: The data files do not need to be migrated to a database system, no _**Note 2**_: The HPCs startR is designed to run on are understood as multi-core multi-node clusters. startR relies on a shared file system across all HPC nodes, and does not implement any kind of distributed storage system for now. -## Introduction +## 2. Introduction In order to use startR you will need to follow the configuration steps listed in the "Configuring startR" section of this guide to make sure startR works on your workstation with the HPC of your choice. @@ -90,7 +93,7 @@ wf <- AddStep(data, step) cores_per_job = 4, job_wallclock = '00:10:00', max_jobs = 4, - extra_queue_params = list('#SBATCH --mem-per-cpu=3000'), + extra_queue_params = list('#SBATCH --constraint=medmem'), bidirectional = FALSE, polling_period = 10 ), @@ -99,7 +102,7 @@ wf <- AddStep(data, step) ) ``` -## Configuring startR +## 3. Configuring startR At BSC, the only configuration step you need to follow is to set up passwordless connection with the HPC. You do not need to follow the complete list of deployment steps under [**Deployment**](inst/doc/deployment.md) since all dependencies are already installed for you to use. @@ -147,7 +150,7 @@ alias ctp='ssh -XY username@hostname_or_ip' alias start='module load R CDO ecFlow' ``` -## Using startR +## 4. Using startR If you have successfully gone through the configuration steps, you will just need to run the following commands in a terminal session and a fresh R session will pop up with the startR environment ready to use. @@ -162,7 +165,7 @@ The library can be loaded as follows: library(startR) ``` -### Start() +### 4-1. Start() In order for startR to recognize the data sets you want to process, you first need to declare them. The first step in the declaration of a data set is to build a special path string that encodes where all the involved NetCDF files to be processed are stored. 
It contains some wildcards in those parts of the path that vary across files. This special path string is also called "path pattern". @@ -323,7 +326,7 @@ _**Note 3**_ _on the order of dimensions_: neither the file dimensions nor the i _**Note 4**_ _on the size of the data in R_: if you check the size of the involved file in the example `Start()` call used above ('/esarchive/exp/ecmwf/system5_m1/6hourly/tas/tas_19930101.nc'), you will realize it only weighs 34GB. Why is the data reported to occupy 134GB then? This is due to two facts: by one side, NetCDF files are usually compressed, and their uncompressed size can be substantially greater. In this case, the uncompressed data would occupy about 72GB. Besides, the variable we are targetting in the example ('tas') is defined as a float variable inside the NetCDF file. This means each value is a 4-byte real number. However, R cannot represent 4-byte real numbers; it always takes 8 bytes to represent a real number. This is why, when float numbers are represented in R, the effective size of the data is doubled. In this case, the 72GB of uncompressed float numbers need to be represented using 132GB of RAM in R. -### Step() and AddStep() +### 4-2. Step() and AddStep() Once the data sources are declared, you can define the operation to be applied to them. The operation needs to be encapsulated in the form of an R function receiving one or more multidimensional arrays with named dimensions (plus additional helper parameters) and returning one or more multidimensional arrays, which should also have named dimensions. For example: @@ -364,7 +367,7 @@ If the step involved more than one data source, a list of data sources could be It is not possible for now to define workflows with more than one step, but this is not a crucial gap since a step function can contain more than one statistical analysis procedure. Furthermore, it is usually enough to perform only the first or two first steps of the analysis workflow on the HPCs because, after these steps, the volume of data involved is reduced substantially and the analysis can go on with conventional methods. -### Compute() +### 4-3. Compute() Once the data sources are declared and the workflow is defined, you can proceed to specify the execution parameters (including which platform to run on) and trigger the execution with the `Compute()` function. @@ -418,7 +421,7 @@ Warning messages: ! dimension with pattern specifications. ``` -#### Compute() locally +#### 4-3-1. Compute() locally When only your own workstation is available, startR can still be useful to process a very large dataset by chunks, thus avoiding a RAM memory overload and consequent crash of the workstation. startR will simply load and process the dataset by chunks, one after the other. The operations defined in the workflow will be applied to each chunk, and the results will be stored on a temporary file. `Compute()` will finally gather and merge the results of each chunk and return a single data object, including one or multiple multidimensional data arrays, and additional metadata. @@ -563,7 +566,7 @@ res <- Compute(wf, * max: 8.03660178184509 ``` -#### Compute() on HPCs +#### 4-3-2. Compute() on HPCs In order to run the computation on a HPC, you will need to make sure the passwordless connection with the login node of that HPC is configured, as shown at the beginning of this guide. If possible, in both directions. 
Also, you will need to know whether there is a shared file system between your workstation and that HPC, and will need information on the number of nodes, cores per node, threads per core, RAM memory per node, and type of workload used by that HPC (Slurm, PBS and LSF supported). @@ -589,7 +592,7 @@ The parameter `cluster` expects a list with a number of components that will hav cores_per_job = 4, job_wallclock = '00:10:00', max_jobs = 4, - extra_queue_params = list('#SBATCH --mem-per-cpu=3000'), + extra_queue_params = list('#SBATCH --constraint=medmem'), bidirectional = FALSE, polling_period = 10 ), @@ -607,7 +610,7 @@ The cluster components and options are explained next: - `cores_per_job`: number of computing cores to be requested when submitting the job for each chunk to the HPC queue. Each node may be capable of supporting more than one computing thread. - `job_wallclock`: amount of time to reserve the resources when submitting the job for each chunk. Must follow the specific format required by the specified `queue_type`. - `max_jobs`: maximum number of jobs (chunks) to be queued simultaneously onto the HPC queue. Submitting too many jobs could overload the bandwidth between the HPC nodes and the storage system, or could overload the queue system. -- `extra_queue_params`: list of character strings with additional queue headers for the jobs to be submitted to the HPC. Mainly used to specify the amount of memory to book for each job (e.g. '#SBATCH --mem-per-cpu=30000'), to request special queuing (e.g. '#SBATCH --qos=bsc_es'), or to request use of specific software (e.g. '#SBATCH --reservation=test-rhel-7.5'). +- `extra_queue_params`: list of character strings with additional queue headers for the jobs to be submitted to the HPC. Mainly used to specify the amount of memory to book for each job (e.g. '#SBATCH --mem-per-cpu=30000'; __NOTE: this line does not work on Nord3v2__), to request special queuing (e.g. '#SBATCH --qos=bsc_es'), or to request use of specific software (e.g. '#SBATCH --reservation=test-rhel-7.5'). - `bidirectional`: whether the connection between the R workstation and the HPC login node is bidirectional (TRUE) or unidirectional from the workstation to the login node (FALSE). - `polling_period`: when the connection is unidirectional, the workstation will ask the HPC login node for results each `polling_period` seconds. An excessively small value can overload the login node or result in temporary banning. - `special_setup`: name of the machine if the computation requires an special setup. Only Marenostrum 4 needs this parameter (e.g. special_setup = 'marenostrum4'). @@ -686,7 +689,7 @@ As mentioned above in the definition of the `cluster` parameters, it is strongly You can find the `cluster` configuration for other HPCs at the end of this guide [Compute() cluster templates](#compute-cluster-templates) -### Collect() and the EC-Flow GUI +### 4-4. Collect() and the EC-Flow GUI Usually, in use cases where large data inputs are involved, it is convenient to add the parameter `wait = FALSE` to your `Compute()` call. With this parameter, `Compute()` will immediately return an object with information about your startR execution. You will be able to store this object onto disk. After doing that, you will not need to worry in case your workstation turns off in the middle of the computation. You will be able to close your R session, and collect the results later on with the `Collect()` function. 
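For illustration, a minimal sketch of this pattern is shown below (the file name is a placeholder, and `res` stands for the object returned by a `Compute()` call with `wait = FALSE`, such as the one shown in this section of the guide):

```r
# res <- Compute(wf, ..., wait = FALSE)  # returns immediately with the execution description

# Store the execution description on disk; the R session can then be closed safely
saveRDS(res, file = 'startR_execution.Rds')

# Later, from a fresh R session, read the description back and collect the results
res <- readRDS('startR_execution.Rds')
result <- Collect(res, wait = TRUE)
```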
@@ -702,7 +705,7 @@ Usually, in use cases where large data inputs are involved, it is convenient to cores_per_job = 4, job_wallclock = '00:10:00', max_jobs = 4, - extra_queue_params = list('#SBATCH --mem-per-cpu=3000'), + extra_queue_params = list('#SBATCH --constraint=medmem'), bidirectional = FALSE, polling_period = 10 ), @@ -774,9 +777,10 @@ You can also run `Collect()` with `wait = FALSE`. This will crash with an error `Collect()` also has a parameter called `remove`, which by default is set to `TRUE` and triggers removal of all data results received from the HPC (and stored under `ecflow_suite_dir`). If you would like to preserve the data, you can set `remove = FALSE` and `Collect()` it as many times as desired. Alternatively, you can `Collect()` with `remove = TRUE` and store the merged array right after with `saveRDS()`. -## Additional information +## 5. Additional information +You can find more FAQs in [faq.md](inst/doc/faq.md). -### How to choose the number of chunks, jobs and cores +### 5-1. How to choose the number of chunks, jobs and cores #### Number of chunks and memory per job @@ -804,7 +808,7 @@ The number of cores per job should be as large as possible, with two limitations - If the HPC nodes have each "M" total amount of memory with "m" amount of memory per memory module, and have each "N" amount of cores, the requested amount of cores "n" should be such that "n" / "N" does not exceed "m" / "M". - Depending on the shape of the chunks, startR has a limited capability to exploit multiple cores. It is recommended to make small tests increasing the number of cores to work out a reasonable number of cores to be requested. -### How to clean a failed execution +### 5-2. How to clean a failed execution - Work out the startR execution ID, either by inspecting the execution description by `Compute()` when called with the parameter `wait = FALSE`, or by checking the `ecflow_suite_dir` with `ls -ltr`. - ssh to the HPC login node and cancel all jobs of your startR execution. @@ -814,7 +818,7 @@ The number of cores per job should be as large as possible, with two limitations - Optionally remove the data under `data_dir` on the HPC login node if the file system is not shared between the workstation and the HPC and you do not want to keep the data in the `data_dir`, used as caché for future startR executions. - Open the EC-Flow GUI and remove the workflow entry (a.k.a. suite) named with your startR execution ID with right click > "Remove". -### Visualizing the profiling of the execution +### 5-3. Visualizing the profiling of the execution As seen in previous sections, profiling measurements of the execution are provided together with the data output. These measurements can be visualized with the `PlotProfiling()` function made available in the source code of the startR package. @@ -832,16 +836,17 @@ A chart displays the timings for the different stages of the computation, as sho You can click on the image to expand it. -### What to do if your function has too many target dimensions -### Pending features -- Adding feature for `Copute()` to run on multiple HPCs or workstations. +### 5-4. Pending features +- Adding feature for `Compute()` to run on multiple HPCs or workstations. - Adding plug-in to read CSV files. - Supporting multiple steps in a workflow defined by `AddStep()`. - Adding feature in `Start()` to read sparse grid points. - Allow for chunking along "essential" (a.k.a. "target") dimensions. -## Other examples + +## 6. 
Other examples +You can find more use cases in [usecase.md](inst/doc/usecase.md). ### Using experimental and (date-corresponding) observational data @@ -1007,7 +1012,7 @@ r <- Compute(wf, ) ``` -## Compute() cluster templates +## 7. Compute() cluster templates ### Nord3-v2
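For reference, an illustrative `cluster` configuration for this machine could look like the sketch below (the host name, directories, module name and resource values are placeholders only; check the actual template and your own accounts and paths before using it):

```r
cluster = list(
  queue_host = 'nord3',                                              # placeholder: alias for the Nord3-v2 login node
  queue_type = 'slurm',
  temp_dir = '/gpfs/scratch/<your_group>/<your_user>/startR_hpc/',   # placeholder path on the HPC
  r_module = 'R',                                                    # placeholder module name
  cores_per_job = 4,
  job_wallclock = '00:10:00',
  max_jobs = 4,
  bidirectional = FALSE,
  polling_period = 10
)
```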