diff --git a/README.md b/README.md index ab9968e7938b1c75f6d79beda02b84bf0452bb46..adc9200041a9600c64960d49d20bfdb02649b2f8 100644 --- a/README.md +++ b/README.md @@ -1,53 +1,23 @@ ## startR - Retrieval and processing of multidimensional datasets -startR is an R package developed at the Barcelona Supercomputing Center which implements the MapReduce paradigm (a.k.a. domain decomposition) on HPCs in a way transparent to the user and specially oriented to complex multidimensional datasets. - -Following the startR framework, the user can represent in a one-page startR script all the information that defines a use case, including: -- the involved (multidimensional) data sources and the distribution of the data files -- the workflow of operations to be applied, over which data sources and dimensions -- the HPC platform properties and the configuration of the execution - -When run, the script triggers the execution of the defined workflow. Furthermore, the EC-Flow workflow manager is transparently used to dispatch tasks onto the HPC, and the user can employ its graphical interface for monitoring and control purposes. - -An extensive part of this package is devoted to the automatic retrieval (from disk or store to RAM) and arrangement of multi-dimensional distributed data sets. This functionality is encapsulated in a single funcion called `Start()`, which is explained in detail in the [**Start()**](inst/doc/start.md) documentation page and in `?Start`. - -startR is an open source project that is open to external collaboration and funding, and will continuously evolve to support as many data set formats as possible while maximizing its efficiency. - -### Installation - -See the [**Deployment**](inst/doc/deployment.md) documentation page for details on the set up steps. The most relevant system dependencies are listed next: -- netCDF-4 -- R with the startR, bigmemory and easyNCDF R packages -- For distributed computation: - - UNIX-based HPC (cluster of multi-processor nodes) - - a job scheduler (Slurm, PBS or LSF) - - EC-Flow >= 4.9.0 - -In order to install and load the latest published version of the package on CRAN, you can run the following lines in your R session: - -```r -install.packages('startR') -library(startR) -``` - -Also, you can install the latest stable version from the GitLab repository as follows: - -```r -devtools::install_git('https://earth.bsc.es/gitlab/es/startR.git') -``` +Welcome to startR! startR is an R package developed at the Barcelona Supercomputing Center which implements the MapReduce paradigm (a.k.a. domain decomposition) on HPCs in a way transparent to the users and specially tailored for complex multidimensional datasets. startR is an open source project which is open to external collaboration and funding, and will continuously evolve to support as many data set formats as possible while maximizing its efficiency. ### How it works +The startR framework can be summarized in the following steps: +1. [Declaration of data sources](inst/doc/practical_guide.md#start) +2. [Declaration of the workflow](inst/doc/practical_guide.md#step-and-addstep) +3. [Declaration of the HPC platform and execution](inst/doc/practical_guide.md#compute) +4. [Monitoring the execution](inst/doc/practical_guide.md#ec-flow-gui) -An overview example of how to process a large data set is shown in the following. 
You can see real use cases in the [**Practical guide for processing large data sets with startR**](inst/doc/practical_guide.md), and you can find more information on the use of the `Start()` function in the [**Start()**](inst/doc/start.md) documentation page, as well as in the documentation of the functions in the package. - -The purpose of the example in this section is simply to illustrate how the user is expected to use startR once the framework is deployed on the workstation and HPC. It shows how a simple addition and averaging operation is performed on BSC's CTE-Power HPC, over a multi-dimensional climate data set, which lives in the BSC-ES storage infrastructure. As mentioned in the introduction, the user will need to declare the involved data sources, the workflow of operations to carry out, and the computing environment and parameters. +To let the new users have a preliminary understanding of the script, an example for each step is shown below. It shows a simple addition and average operation performed on BSC's CTE-Power HPC (i.e., Power 9) over a multi-dimensional climate data set, which is stored in the BSC-ES storage infrastructure. The purpose of this example is simply to illustrate how the users can expect from using startR. You can find real use cases in [practical_guide.md](inst/doc/practical_guide.md). #### 1. Declaration of data sources +As mentioned above, the first step is to declare the data sources. With `retrieve = FALSE`, it creates a pointer indicating the desired dataset. ```r library(startR) -# A path pattern is built +# A path pattern is built with self-defined wildcards repos <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$sdate$.nc' # A Start() call is built with the indices to operate @@ -62,10 +32,11 @@ data <- Start(dat = repos, ``` #### 2. Declaration of the workflow +The second step is to define the function and build up the workflow using `Step()` and `AddStep()`. ```r -# The function to be applied is defined. -# It only operates on the essential 'target' dimensions. +# The function to be applied is defined +# It only operates on the essential 'target' dimensions fun <- function(x) { # Expected inputs: # x: array with dimensions ('ensemble', 'time') @@ -74,17 +45,18 @@ fun <- function(x) { # single array with dimensions ('time') } -# A startR Step is defined, specifying its expected input and -# output dimensions. -step <- Step(fun, - target_dims = c('ensemble', 'time'), +# A startR Step is defined, specifying its expected input and +# output dimensions +step <- Step(fun, + target_dims = c('ensemble', 'time'), output_dims = c('time')) -# The workflow of operations is cast before execution. +# The workflow of operations is cast before execution wf <- AddStep(data, step) ``` #### 3. Declaration of the HPC platform and execution +After data sources and workflow are prepared, we are ready to dispatch jobs to the wanted machines, which is Power 9 in this case. The computation configuration of Power 9 is defined in the `cluster` list. 
```r res <- Compute(wf, @@ -92,9 +64,9 @@ res <- Compute(wf, longitude = 2), threads_load = 1, threads_compute = 2, - cluster = list(queue_host = 'p9login1.bsc.es', + cluster = list(queue_host = 'p9login1.bsc.es', #user-defined queue_type = 'slurm', - temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_tests/', + temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_tests/', #user-defined lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/', r_module = 'R/3.5.0-foss-2018b', cores_per_job = 2, @@ -104,21 +76,19 @@ res <- Compute(wf, bidirectional = FALSE, polling_period = 10 ), - ecflow_suite_dir = '/home/Earth/nmanuben/test_startR/', + ecflow_suite_dir = '/home/Earth/nmanuben/test_startR/', #user-defined wait = TRUE) ``` #### 4. Monitoring the execution -During the execution of the workflow, which is orchestrated by EC-Flow and a job scheduler (either Slurm, LSF or PBS), the status can be monitored using the EC-Flow graphical user interface. Pending tasks are coloured in blue, ongoing in green, and finished in yellow. +During the execution of the workflow, which is orchestrated by EC-Flow and a job scheduler (either Slurm, LSF or PBS), the status can be monitored using the EC-Flow graphical user interface (see figure below). #### 5. Profiling of the execution -Additionally, profiling measurements of the execution are provided together with the output data. Such measurements can be visualized with the `PlotProfiling()` function made available in the source code of the startR package. - -This function has not been included as part of the official set of functions of the package because it requires a number of extense plotting libraries which take time to load and, since the startR package is loaded in each of the worker jobs on the HPC or cluster, this could imply a substantial amount of time spent in repeatedly loading unused visualization libraries during the computing stage. +Additionally, profiling measurements of the execution are provided together with the output data. The profiling will be printed on the R terminal when the execution ends, or you can visualize it with the `PlotProfiling()` function available in the source code of the startR package. ```r source('https://earth.bsc.es/gitlab/es/startR/raw/master/inst/PlotProfiling.R') @@ -128,3 +98,23 @@ PlotProfiling(attr(res, 'startR_compute_profiling')) You can click on the image to expand it. + + + +### Installation +Now you have a general picture of how startR works. Before you start using startR, you need to check that the required environment is installed. The most relevant system dependencies are listed below: +- netCDF-4 +- R, with the startR package and its dependencies (e.g., bigmemory, easyNCDF) +- For distributed computation: + - UNIX-based HPC (cluster of multi-processor nodes) + - a job scheduler (Slurm, PBS or LSF) + - EC-Flow >= 4.9.0 + +### Other useful documentation +If you are a new startR user, it is recommended to go through the following links one by one to get familiar with startR. + +1. See [**Deployment**](inst/doc/deployment.md) for details on the setup if outside of BSC. +2. Follow the [**Practical Guide**](inst/doc/practical_guide.md) with step-by-step use cases. +3. See [**Start**](inst/doc/start.md) to learn more applications of Start(). +4. Find [**FAQs**](https://earth.bsc.es/gitlab/es/startR/wikis/FAQ) and [**Examples**](https://earth.bsc.es/gitlab/es/startR/wikis/Example) on the GitLab wiki pages.
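The `Compute()` call in step 3 above is blocking (`wait = TRUE`). For long executions, the [practical guide](inst/doc/practical_guide.md) describes a non-blocking pattern based on `Collect()`. The sketch below is only an illustration: it assumes the same `wf` as above and a hypothetical `cluster_config` variable holding the cluster list shown in step 3; see the practical guide and `?Collect` for the exact arguments.

```r
# A minimal sketch of a non-blocking run. With wait = FALSE, Compute()
# returns immediately while the jobs keep running on the HPC.
job <- Compute(wf,
               chunks = list(latitude = 2,
                             longitude = 2),
               threads_load = 1,
               threads_compute = 2,
               cluster = cluster_config,  # hypothetical: the cluster list shown above
               ecflow_suite_dir = '/home/Earth/nmanuben/test_startR/',  # user-defined
               wait = FALSE)

# Later on, gather the results of all chunks.
res <- Collect(job, wait = TRUE)
```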
+ diff --git a/inst/PlotProfiling.R b/inst/PlotProfiling.R index 2cd7ed7c8837faf4872fb2eceb07ca3183b755cf..dd304dec108aa1f31d38bea023b0d34b20580e59 100644 --- a/inst/PlotProfiling.R +++ b/inst/PlotProfiling.R @@ -1,3 +1,26 @@ +#'Visualize profiling measurements of the Compute execution +#' +#'When executing the function 'Compute', the profiling measurements of the execution +#'are provided together with the data output. This function turns the measurements +#'into histograms and boxplots. +#' +#'@param configs +#'@param n_test +#'@param file_name +#'@param config_names +#'@param items +#'@param total_timings +#'@param crop +#'@param subtitle A character string to add +#' +#'@return +#' +#'@keywords datagen +#'@author History:\cr +#' +#'@examples +#' +#'@export PlotProfiling <- function(configs, n_test = 1, file_name = NULL, config_names = NULL, items = NULL, total_timings = TRUE, @@ -79,7 +102,7 @@ PlotProfiling <- function(configs, n_test = 1, file_name = NULL, config_long_names <- c(config_long_names, config_long_name) config_total_times <- c(config_total_times, timings[['total']]) timings <- as.data.frame(timings) - t_all_chunks <- timings[['total']] - timings[['bychunks_setup']] - timings[['merge']] - + t_all_chunks <- timings[['total']] - timings[['bychunks_setup']] - timings[['transfer']] - timings[['merge']] if (!is.na(timings[['transfer_back']])) { t_all_chunks <- t_all_chunks - timings[['transfer_back']] diff --git a/inst/doc/deployment.md b/inst/doc/deployment.md index 47325db52180433bd49da4920be06a53780cb43f..fdfadc3ce6535e9106c7d1cd7c65fcfe4e42ba4d 100644 --- a/inst/doc/deployment.md +++ b/inst/doc/deployment.md @@ -1,35 +1,49 @@ ## Deployment of startR -This section contains the information on system requirements and the steps to set up such requirements. Note that `startR` can be used for two different purposes, either only retrieving data locally, or retrieving plus processing data on a distributed HPC. The requirements for each purpose are detailed in separate sections. +This section contains the information on system requirements and the steps to set up such an environment. Note that if you are at BSC, you do not need to follow the complete list of deployment steps here since all dependencies are already installed for you to use. -### Deployment steps for retrieving data locally +This documentation is divided into two sections. The first one is for [retrieving data](inst/doc/deployment.md#deployment-steps-for-retrieving-data). If you also want to [process data on a distributed HPC](inst/doc/deployment.md#deployment-steps-for-processing-data-on-distributed-hpcs), more installations are needed. -1. Install netCDF-4 if retrieving data from NetCDF files (only file format supported by now): - - zlib (>= 1.2.3) and HDF5 (>= 1.8.0-beta1) are required by netCDF-4 - - Steps for installation of netCDF-4 are detailed in https://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html +### Deployment steps for retrieving data + +Install the following: -2. Install R (>= 2.14.1) +1. R (>= 2.14.1) +2. startR package +You can choose either of the following ways: +- Install the latest published version on CRAN. Run the following lines in your R session: -3. Install the required R packages - - Installing `startR` will trigger the installation of the other required packages ```r -devtools::install_git('https://earth.bsc.es/gitlab/es/startR') +install.packages('startR') +library(startR) ``` - - Among others, the bigmemory package will be installed.
- - If loading and processing NetCDF files (only file format supported by now), install the easyNCDF package. - - If planning to interpolate the data with CDO (either by using the `transform` parameter in `startR::Start`, or by using `s2dverification::CDORemap` in the workflow specified to `startR::Compute`), install s2dverification (>= 2.8.4) and CDO (version 1.6.3 tested). CDO is not available for Windows. -A local or remote file system or THREDDS/OPeNDAP server providing the data to be retrieved must be accessible. +- Install the latest stable version from the GitLab repository: +```r +devtools::install_git('https://earth.bsc.es/gitlab/es/startR.git') +``` -### Deployment steps for processing data on distributed HPCs +Installing `startR` will trigger the installation of the other required packages. +- The `bigmemory` package will be installed along with `startR`. +- The `easyNCDF` package will be installed if the `Start()` function is used to load NetCDF files (the only file format supported for now). -For processing the data on a distributed HPC (cluster of multi-processor, multi-core nodes), your workstation, an optional EC-Flow host node and a HPC login node with acces to the HPC nodes, should all be accessible in your network. +3. netCDF-4 +This is required if you plan to retrieve NetCDF files (the only file format supported for now): - zlib (>= 1.2.3) and HDF5 (>= 1.8.0-beta1) are required by netCDF-4 - Steps for installation of netCDF-4 are detailed in https://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html -All machines must be UNIX-based, with the "hostname", "date", "touch" and "sed" commands available. -1. Set up passwordless, userless ssh access - - at least from your workstation to the HPC login node - - if possible, also from the HPC login node to your workstation +4. s2dverification package (>= 2.8.4) and CDO (version 1.6.3 tested) +These are required if you plan to interpolate the data with CDO (either by using the `transform` parameter in `startR::Start`, or by using `s2dverification::CDORemap` in the workflow specified to `startR::Compute`). + +5. A local or remote file system or THREDDS/OPeNDAP server providing the data to be retrieved must be accessible from the workstation where you run the code. In BSC-ES, the data repository could be under `/esarchive/exp/`, `/esarchive/obs/` or `/esarchive/recon/`. + + +### Deployment steps for processing data on distributed HPCs + +If you want not only to retrieve data locally but also to process the data on a distributed HPC (i.e., a cluster of multi-processor, multi-core nodes), the following settings are required. + +1. All machines (your workstation, the HPC) must be UNIX-based, with the "hostname", "date", "touch" and "sed" commands available. 2. Install the following libraries on your workstation: - ssh @@ -37,20 +51,34 @@ All machines must be UNIX-based, with the "hostname", "date", "touch" and "sed" - rsync (>= 3.0.6) - EC-Flow (>= 4.9.0) -3. If you are using a separate EC-Flow host node to control your EC-Flow workflows (optional), install EC-Flow (>= 4.9.0) on the EC-Flow host node +3. Set up passwordless, userless ssh access + - at least from your workstation to the HPC login node + - if possible, also from the HPC login node to your workstation + +4. If you are using a separate EC-Flow host node to control your EC-Flow workflows (optional), install EC-Flow (>= 4.9.0) on the EC-Flow host node -4. Install the following libraries on the HPC login node: +5. 
Install the following libraries on the HPC login node: - ssh - scp - rsync (>= 3.0.6) - EC-Flow (>= 4.9.0), as a Linux Environment Module (optional) - Job scheduler (Slurm, PBS or LSF) to distribute the workload across HPC nodes -5. Make sure the following requirements are fulfilled by all HPC nodes: - - netCDF-4 is installed, if loading and processing NetCDF files (only supported format by now) +6. Make sure the following requirements are fulfilled by all HPC nodes: + - netCDF-4 is installed, if loading and processing NetCDF files (the only supported format by now) - R (>= 2.14.1) is installed as a Linux Environment Module - - the startR package is installed - - if using CDO interpolation, the s2dverification package and CDO 1.6.3 are installed - - any other R packages required by the `startR::Compute` workflow are installed - - any other Environment Modules used by the `startR::Compute` workflow are installed - - a shared file system (with a unified access point) or THREDDS/OPeNDAP server is accessible across HPC nodes and HPC login node, where the necessary data can be uploaded from your workstation. A file system shared between your workstation and the HPC is also supported and advantageous. Use of a data transfer service between the workstation and the HPC is also supported under specific configurations. + - The startR package is installed + - If using CDO interpolation, the s2dverification package and CDO v1.6.3 are installed + - any other R packages required by the `startR::Compute` workflow (e.g., the package including the used function) are installed + - Any other Environment Modules used by the `startR::Compute` workflow are installed + - A shared file system (with a unified access point) or THREDDS/OPeNDAP server is accessible across HPC nodes and HPC login node, where the necessary data can be uploaded from your workstation. + - A file system shared between your workstation and the HPC is also supported and advantageous. + - Use of a data transfer service between the workstation and the HPC is also supported under specific configurations. + +### Other useful documentation +If you are a new startR users and have not read [**README**](https://earth.bsc.es/gitlab/es/startR/blob/master/README.md), you are recommended to read it first to get a general picture of startR. Then, you can follow the order to know more: +1. See [**Deployment**](inst/doc/deployment.md) for details on the setup if outside of BSC. +2. Follow [**Practical Guide**](inst/doc/practical_guide.md) with use cases step by step. +3. See [**Start**](inst/doc/start.md) to learn more applications of Start(). +4. Find [**FAQs**](https://earth.bsc.es/gitlab/es/startR/wikis/FAQ) and [**Examples**](https://earth.bsc.es/gitlab/es/startR/wikis/Example) in GitLab wiki page. + diff --git a/inst/doc/practical_guide.md b/inst/doc/practical_guide.md index 0ae2806ddee86e09ba2525fa18dea0e0cb30fdd7..57239c101172aca96ebf2f77c927a16934229435 100644 --- a/inst/doc/practical_guide.md +++ b/inst/doc/practical_guide.md @@ -1,9 +1,5 @@ # Practical guide for processing large data sets with startR -This guide includes explanations and practical examples for you to learn how to use startR to efficiently process large data sets in parallel on the BSC's HPCs (CTE-Power 9, Marenostrum 4, ...). See the main page of the [**startR**](README.md) project for a general overview of the features of startR, without actual guidance on how to use it. 
- -If you would like to start using startR rightaway on the BSC infrastructure, you can directly go through the "Configuring startR" section, copy/paste the basic startR script example shown at the end of the "Introduction" section onto the text editor of your preference, adjust the paths and user names specified in the `Compute()` call, and run the code in an R session after loading the R and ecFlow modules. - ## Index 1. [**Motivation**](inst/doc/practical_guide.md#motivation) @@ -15,11 +11,14 @@ If you would like to start using startR rightaway on the BSC infrastructure, you 3. [**Compute()**](inst/doc/practical_guide.md#compute) 1. [**Compute() locally**](inst/doc/practical_guide.md#compute-locally) 2. [**Compute() on CTE-Power 9**](inst/doc/practical_guide.md#compute-on-cte-power-9) - 3. [**Compute() on the fat nodes and other HPCs**](inst/doc/practical_guide.md#compute-on-the-fat-nodes-and-other-hpcs) - 4. [**Collect() and the EC-Flow GUI**](inst/doc/practical_guide.md#collect-and-the-ec-flow-gui) -5. [**Additional information**](inst/doc/practical_guide.md#additional-information) -6. [**Other examples**](inst/doc/practical_guide.md#other-examples) -7. [**Compute() cluster templates**](inst/doc/practical_guide.md#compute-cluster-templates) + 3. [**Compute() on the fat nodes**](inst/doc/practical_guide.md#compute-on-the-fat-nodes) + 4. [**EC-Flow GUI**](inst/doc/practical_guide.md#ec-flow-gui) + 5. [**Collect()**](inst/doc/practical_guide.md#collect) +5. [**Other useful documentation**](inst/doc/practical_guide.md#other-useful-documentation) +6. [**Additional information**](inst/doc/practical_guide.md#additional-information) +7. [**Other examples**](inst/doc/practical_guide.md#other-examples) +8. [**Compute() cluster templates**](inst/doc/practical_guide.md#compute-cluster-templates) + ## Motivation @@ -43,18 +42,20 @@ _**Note 1**_: The data files do not need to be migrated to a database system, no _**Note 2**_: The HPCs startR is designed to run on are understood as multi-core multi-node clusters. startR relies on a shared file system across all HPC nodes, and does not implement any kind of distributed storage system for now. + ## Introduction -In order to use startR you will need to follow the configuration steps listed in the "Configuring startR" section of this guide to make sure startR works on your workstation with the HPC of your choice. +This guide includes comprehensive explanations and practical examples for you to learn step by step how to use startR to efficiently process large data sets in parallel on the BSC HPCs (CTE-Power 9, Marenostrum 4, ...). If you have not read the [**README.md**](README.md) and [**deployment.md**](inst/doc/deployment.md), you are recommended to go through them first. -Afterwards, you will need to understand and use five functions, all of them included in the startR package: +You are going to learn the configuration setup first, following the instruction at "Configuring startR" section. Afterwards, you will need to understand and use five functions, which are all included in the startR package: - **Start()**, for declaing the data sets to be processed - **Step()** and **AddStep()**, for specifying the operation to be applied to the data - **Compute()**, for specifying the HPC to be employed, the execution parameters (e.g. 
number of chunks and cores), and to trigger the computation - - **Collect()** and the **EC-Flow graphical user interface**, for monitoring of the progress and collection of results + - **Collect()**, for collecting the results -Next, you can see an example startR script performing the ensemble mean of a small data set on CTE-Power9, for you to get a broad picture of how the startR functions interact and the information that is represented in a startR script. Note that the `queue_host`, `temp_dir` and `ecflow_suite_dir` parameters in the `Compute()` call are user-specific. +You will also learn how to monitor and control the execution progress using **EC-Flow graphical user interface**. +Here is a full startR example script. In the next sections, you will learn step by step to get a clearer picture of how the startR functions work and the information represented in a startR script. ```r library(startR) @@ -63,9 +64,9 @@ data <- Start(dat = repos, var = 'tas', sdate = '20180101', ensemble = 'all', - time = 'all', - latitude = indices(1:40), - longitude = indices(1:40)) + time = 'all', #indices(1:2), + latitude = 'all', + longitude = 'all') fun <- function(x) { # Expected inputs: @@ -73,8 +74,8 @@ fun <- function(x) { apply(x + 1, 2, mean) } -step <- Step(fun, - target_dims = c('ensemble', 'time'), +step <- Step(fun, + target_dims = c('ensemble', 'time'), output_dims = c('time')) wf <- AddStep(data, step) @@ -87,7 +88,6 @@ res <- Compute(wf, cluster = list(queue_host = 'p9login1.bsc.es', queue_type = 'slurm', temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', - lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/', r_module = 'R/3.5.0-foss-2018b', job_wallclock = '00:10:00', cores_per_job = 4, @@ -98,9 +98,10 @@ res <- Compute(wf, ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/') ``` + ## Configuring startR -At BSC, the only configuration step you need to follow is to set up passwordless connection with the HPC. You do not need to follow the complete list of deployment steps under [**Deployment**](inst/doc/deployment.md) since all dependencies are already installed for you to use. +At BSC, the only configuration step you need to follow is to set up passwordless connection with the HPCs because all other dependencies are already installed for you to use. If you are outside of BSC, make sure you have completed the deployment steps in [**Deployment**](inst/doc/deployment.md). Specifically, you need to set up passwordless, userless access from your machine to the HPC login node, and from the HPC login node to your machine if at all possible. In order to establish the connection in one of the directions, you need to do the following: @@ -119,6 +120,14 @@ Specifically, you need to set up passwordless, userless access from your machine User username IdentityFile ~/.ssh/id_rsa ``` +For example: +``` +Host power9 + HostName p9login1.bsc.es + User bsc32xxx + IdentityFile ~/.ssh/id_rsa + ForwardX11 yes +``` After following these steps for the connections in both directions (although from the HPC to the workstation might not be possible), you are good to go. @@ -130,22 +139,24 @@ if [[ $BSC_MACHINE == "power" ]] ; then fi ``` -You can add the following lines in your .bashrc file on your workstation for convenience: +## Using startR + +After setting up the configuration steps, you are ready to use startR. 
Before opening an R session, you need to load the necessary modules: +``` +module load R CDO ecFlow +R +``` +For your convenience, you can create an alias in your .bashrc file: ``` -alias ctp='ssh -XY username@hostname_or_ip' alias start='module load R CDO ecFlow' ``` - -## Using startR - -If you have successfully gone through the configuration steps, you will just need to run the following commands in a terminal session and a fresh R session will pop up with the startR environment ready to use. - +With this alias, the previous commands are simplified to: ``` start R ``` -The library can be loaded as follows: +Either way, an R session is opened. The library can be loaded as follows: ```r library(startR) @@ -153,10 +164,10 @@ library(startR) ### Start() -In order for startR to recognize the data sets you want to process, you first need to declare them. The first step in the declaration of a data set is to build a special path string that encodes where all the involved NetCDF files to be processed are stored. It contains some wildcards in those parts of the path that vary across files. This special path string is also called "path pattern". - -Before defining an example path pattern, let's introduce some target NetCDF files. In the "esarchive" at BSC, we can find the following files: +#### Define path pattern +In order to let startR recognize the data sets you want to process, you first need to declare them. The first step is to build a special path string that encodes where all the involved files to be processed are stored. This special path string is called "path pattern", and it contains wildcards (the pieces wrapped between '$' symbols) indicating the changing parts across file names. +The data sets used in the example are stored in the "esarchive" at BSC. The file storage structure is like this: ``` /esarchive/exp/ecmwf/system5_m1/6hourly/ |--tas/ |--tas_19930101.nc |--tas_19930201.nc |--... |--tas_20171201.nc |--tos/ |--tos_19930101.nc |--tos_19930201.nc |--... |--tos_20171201.nc ``` -A path pattern that can be used to encode the location of these files in a compact way is the following: - +You can see that the path up to '6hourly' is common to all the files, and that it then diverges into the different variables (tas, tos, and so on). Under each variable, the dates are identical (all from 19930101 to 20171201). Therefore, a path pattern which can encode the location of these files in a compact way is: ```r repos <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$sdate$.nc' ``` +The wildcards are used for the variable names and start dates, so we can specify later which ones we want to load. The name of the wildcards can be any names you like. They do not necessarily need to be 'var' or 'sdate' or match any specific keyword (although in this case, as explained later, the 'var' name will trigger a special feature of `Start()`). -The wildcards used (the pieces wrapped between '$' symbols) can be given any names you like. They do not necessarily need to be 'var' or 'sdate' or match any specific key word (although in this case, as explained later, the 'var' name will trigger a special feature of `Start()`). - -Once the path pattern is specified, a `Start()` call can be built, in which you need to provide, as parameters, the specific values of interest for each of the dimensions defined by the wildcards (also called outer dimensions, or file dimensions), as well as for each of the dimensions inside the NetCDF files (inner dimensions). 
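As noted above, the wildcard names are free to choose, as long as the same names are used as arguments in the `Start()` call. The following sketch (not part of the original guide) declares the same files with a renamed, hypothetical 'start_date' wildcard; 'var' is kept because of the special feature mentioned above:

```r
# Same file distribution, but the start-date wildcard is renamed.
# Only the correspondence between the $...$ wildcards in the path
# pattern and the argument names passed to Start() matters.
repos2 <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$start_date$.nc'

data2 <- Start(dat = repos2,
               var = 'tas',
               start_date = '19930101',
               ensemble = 'all',
               time = 'all',
               latitude = 'all',
               longitude = 'all')
```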
- -You can check in advance which dimensions are inside the NetCDF files by using e.g. easyNCDF on one of the files: +#### Outer and inner dimensions +After defining the path pattern, you need to provide, as parameters in `Start()`, the specific values of interest for all the dimensions of the data. There are two types of dimensions: +1) Outer/file dimension: those are defined by the wildcards. In our example, the outer dimensions are 'var' and 'sdate'. +2) Inner dimension: those inside the NetCDF files. In our example, the inner dimensions are 'ensemble', 'time', 'latitude', and 'longitude'. +There are some methods to find out the inner dimensions: +- 'ncdump': +``` +ncdump -h /esarchive/exp/ecmwf/system5_m1/6hourly/tas/tas_19930101.nc |less +``` +You can find the dimension of the desired variable (here is 'tas') are `tas(time, ensemble, latitude, longitude)`. +- easyNCDF package: ```r easyNCDF::NcReadDims('/esarchive/exp/ecmwf/system5_m1/6hourly/tas/tas_19930101.nc', var_names = 'tas') @@ -197,31 +214,33 @@ This will show the names and lengths of the dimensions of the selected variable: _**Note**_: If you check the dimensions of that file with `ncdump -h`, you will realize the 'var' dimension is actually not defined inside. `NcReadDims()` and `Start()`, though, perceive the different variables inside a file as if stored along a virtual dimension called 'var'. You can ignore this for now and assume 'var' is simply a file dimension (since it appears as a wildcard in the path pattern). Read more on this in Note 1 at the end of this section. -Once we know the dimension names, we have all the information we need to put the `Start()` call together: + +#### Specify the dimension range of interest +For each of the dimensions, the values or indices of interest (i.e., selectors) can be specified in three possible ways: +- Using one or more numeric **indices** +For example: `time = indices(c(1, 3, 5))`, `sdate = indices(3:5)`. In the latter case, the third, fourth and fifth start dates appearing in the file system in alphabetical order would be selected ('19930301', '19930401' and '19930501'). It equals to `time = indices(list(3, 5))`. _(NOTE: `list` does not work for now)_ +- Using one or more actual **values** +For example: `sdate = values('19930101')`, `latitude = values(c(10, 10.5, 11))`, `ensemble = values(c('r1i1p1', 'r2i1p1', 'r3i1p1'))`. The last one equals to `ensemble = values(list('r1i1p1', 'r3i1p1'))`, which takes all the values placed between the two values specified (both ends included). The `values()` helper function can be omitted (as shown in the example). See _**Note 2**_ for details on how value selectors are handled when specified for inner dimensions. _(NOTE: `list` does not work for now)_ +- Using the special keywords 'all', 'first' or 'last'. + + +#### Putting together +Once we define the path pattern, know all the dimension names, and know the dimension values of interest, we have all the required information to put in the `Start()` call: ```r -data <- Start(dat = repos, +data <- Start(dat = repos, #path pattern # outer dimensions var = 'tas', sdate = '19930101', # inner dimensions ensemble = 'all', - time = 'all', + time = 'all', #indices(1:2), latitude = 'all', longitude = 'all') ``` - -For each of the dimensions, the values or indices of interest (a.k.a. selectors) can be specified in three possible ways: -- Using one or more numeric indices, for example `time = indices(c(1, 3, 5))`, or `sdate = indices(3:5)`. 
In the latter case, the third, fourht and fifth start dates appearing in the file system in alphabetical order would be selected ('19930301', '19930401' and '19930501'). -- Using one or more actual values, for example `sdate = values('19930101')`, or `ensemble = values(c('r1i1p1', 'r2i1p1'))`, or `latitude = values(c(10, 10.5, 11))`. The `values()` helper function can be omitted (as shown in the example). See Note 2 for details on how value selectors are handled when specified for inner dimensions. -- Using a list of two numeric values, for example `sdate = indices(list(5, 10))`. This will take all indices from the 5th to the 10th. -- Using a list of two actual values, for example `sdate = values(list('r1i1p1', 'r5i1p1'))` or `latitude = values(list(-45, 75))`. This will take all values, in order, placed between the two values specified (both ends included). -- Using the special keywords 'all', 'first' or 'last'. - -The dimensions specified in the `Start()` call do not need to follow any specific order, not even the actual order in the path pattern or inside the file. The order, though, can have an impact on the performance of `Start()` as explained in Note 3. +The dimensions specified in the `Start()` call do not need to follow any specific order, not even the actual order in the path pattern or inside the file. The resulting arrays returned by Start() will have the dimensions in the same order as requested in `Start()`. The order, though, can have an impact on the performance of `Start()` as explained in Note 3. Running the `Start()` call shown above will display some progress and information messages: - ```r * Exploring files... This will take a variable amount of time depending * on the issued request and the performance of the file server... @@ -246,7 +265,7 @@ Warning messages: The warnings shown are normal, and could be avoided with a more wordy specification of the parameters to the `Start()` function. -The dimensions of the selected data set and the total size are shown. As you have probably noticed, this `Start()` call is very fast, even though several GB of data are involved (see Note 4 on the size of the data in R). This is because `Start()` is simply discovering the location and dimension of the involved data. +The dimensions of the selected data set and the total size are shown. As you have probably noticed, this `Start()` call is very fast, even though several GB of data are involved (see Note 4 on the size of the data in R). This is because `Start()` is simply discovering the location and dimension of the involved data but not actually retrieving the data to local workstation. You can give a quick look to the collected metadata as follows: @@ -270,6 +289,7 @@ Class 'startR_header' length 9 Start(dat = "/esarchive/exp/ecmwf/system5_m1/6hou .. .. ..$ sdate:List of 1 .. .. .. ..$ : chr "19930101" ..- attr(*, "PatternDim")= chr "dat" + ``` The retrieved information can be accessed with the `attr()` function. For example: @@ -293,18 +313,19 @@ $sdate[[1]] [1] "19930101" ``` -If you are interested in actually loading the entire data set in your machine you can do so in two ways (_**be careful**_, loading the data involved in the `Start()` call in this example will most likely stall your machine. Try it with a smaller region or a subset of forecast time steps): -- adding the parameter `retrieve = TRUE` in your `Start()` call. 
-- evaluating the object returned by `Start()`: `data_load <- eval(data)` -See the section on "How to choose the number of chunks, jobs and cores" for indications on working out the maximum amount of data that can be retrieved with a `Start()` call on a specific machine. +#### Retrieve data locally +If you are interested in actually loading the entire data set in your machine, you can do so in two ways (_**Be careful!!!**_, Loading the data involved in the `Start()` call in this example will most likely stall your machine. Try it with a smaller region or a subset of forecast time steps). +- Add the parameter `retrieve = TRUE` in your `Start()` call. +- Evaluate the object returned by `Start()`: `data_load <- eval(data)` -You may realize that this functionality is similar to the `Load()` function in the s2dverification package. In fact, `Start()` is more advanced and flexible, although `Load()` is more mature and consistent for loading typical seasonal to decadal forecasting data. `Load()` will be adapted in the future to use `Start()` internally. +#### Advantages of Start() +You may realize that `Start()` is similar to the `Load()` function in the s2dverification package. In fact, `Start()` is more advanced and flexible, though `Load()` is more mature and consistent for loading typical seasonal to decadal forecasting data. `Load()` will be adapted in the future to use `Start()` internally. -There are no constrains for the number or names of the outer or inner dimensions used in a `Start()` call. In other words, `Start()` will handle NetCDF files with any number of dimensions with any name, as well as files distributed across folders in complex ways, since you can use customized wildcards in the path pattern. +There is no constraint for the number or names of the outer or inner dimensions used in a `Start()` call. In other words, `Start()` will handle the files with any number of dimensions with any name, as well as files distributed across folders in complex ways, since you can use customized wildcards in the path pattern. -There are a number of advanced parameters and features in `Start()` to handle heterogeneities across files involved in a `Start()` call, such as the `synonims` parameter, or to handle dimensions extending across multiple NetCDF files, such as the `*_across` parameter. See the documentation on the [**Start()**](inst/doc/start.md) function or `?Start` for more information. +There are a number of advanced parameters and features in `Start()` to handle heterogeneities across files involved in a `Start()` call, such as the `synonims` parameter, or to handle dimensions extending across multiple files, such as the `*_across` parameter. See the documentation on the [**Start()**](inst/doc/start.md) function or `?Start` for more information. -_**Note 1**_ _on the 'var' dimension_: as mentioned above in this section, `NcVarReader()` is showing as if a virtual dimension 'var' appeared inside the file. The existence of this dimension is justified by the fact that, many times, NetCDF files contain more than one variable. The 'var' dimension should hence be considered a "inner" dimension. But, in our example, the dimension 'var' is also defined as a file dimension in the path pattern. So, following the logic of `Start()`, there would be two 'var' dimensions, one of them outer and the other inner, and we should consequently specify selectors for each of them. 
However, as exception, they are automatically understood to be the same dimension, and the target variable name specified as selector for the outer 'var' dimension is also re-used to select the variable inside the file. This is a feature triggered only by the 'var' dimension name and, if other dimension names appeared more than once as inner or outer dimensions, `Start()` would crash throw an error. The feature described here is useful for the very common case where file paths contain the variable name and that variable is the only climate variable inside the file. If this feature was not available, one could still define the data set as shown in the code snippet below, where there would be some redundancy in the `Start()` call and in the dimensions of the resulting array. +_**Note 1**_ _on the 'var' dimension_: as mentioned above in this section, `NcVarReader()` is showing as if a virtual dimension 'var' appeared inside the file. The existence of this dimension is justified by the fact that, many times, NetCDF files contain more than one variable. The 'var' dimension should hence be considered an "inner" dimension. But, in our example, the dimension 'var' is also defined as a file dimension in the path pattern. So, following the logic of `Start()`, there would be two 'var' dimensions, one of them outer and the other inner, and we should consequently specify selectors for each of them. However, as exception, they are automatically understood to be the same dimension, and the target variable name specified as selector for the outer 'var' dimension is also re-used to select the variable inside the file. This is a feature triggered only by the 'var' dimension name and, if other dimension names appeared more than once as inner or outer dimensions, `Start()` would crash and throw an error. The feature described here is useful for the very common case where file paths contain the variable name and that variable is the only climate variable inside the file. If this feature was not available, one could still define the data set as shown in the code snippet below, where there would be some redundancy in the `Start()` call and in the dimensions of the resulting array. _**Note 2**_ _on providing values as selectors for inner dimensions_: when values are requested for a inner dimension, the corresponding numeric indices are automatically calculated by comparing the provided values with a variable inside the file with the same name as the dimension for which the values have been requested. In the last example, where specific values are requested for the latitudes, the variable 'latitude' is automatically retrieved from the files. If the name of the variable does not coincide with the name of the dimension, the parameter `*_var` can be specified in the `Start()` call, as detailed in `?Start`. @@ -312,36 +333,33 @@ _**Note 3**_ _on the order of dimensions_: neither the file dimensions nor the i _**Note 4**_ _on the size of the data in R_: if you check the size of the involved file in the example `Start()` call used above ('/esarchive/exp/ecmwf/system5_m1/6hourly/tas/tas_19930101.nc'), you will realize it only weighs 34GB. Why is the data reported to occupy 134GB then? This is due to two facts: by one side, NetCDF files are usually compressed, and their uncompressed size can be substantially greater. In this case, the uncompressed data would occupy about 72GB. Besides, the variable we are targetting in the example ('tas') is defined as a float variable inside the NetCDF file. 
This means each value is a 4-byte real number. However, R cannot represent 4-byte real numbers; it always takes 8 bytes to represent a real number. This is why, when float numbers are represented in R, the effective size of the data is doubled. In this case, the 72GB of uncompressed float numbers need to be represented using 132GB of RAM in R. + +### Step() and AddStep() -Once the data sources are declared, you can define the operation to be applied to them. The operation needs to be encapsulated in the form of an R function receiving one or more multidimensional arrays with named dimensions (plus additional helper parameters) and returning one or more multidimensional arrays, which should also have named dimensions. For example: +Once the data sources are declared, you can define the operation to be applied to them. The operation needs to be encapsulated in the form of an R function, receiving one or more multidimensional arrays with named dimensions (plus additional helper parameters) and returning one or more multidimensional arrays, which should also have named dimensions. In our example: ```r fun <- function(x) { - r <- sqrt(sum(x ^ 2) / length(x)) - for (i in 1:100) { - r <- sqrt(sum(x ^ 2) / length(x)) - } - dim(r) <- c(time = 1) - r + # Expected inputs: + # x: array with dimensions ('ensemble', 'time') + apply(x + 1, 2, mean) } ``` +As you see, the function only needs to operate on the essential dimensions (which are specified in `Step()`), not on the whole set of dimensions of the data set. -As you see, the function only needs to operate on the essential dimensions, not on the whole set of dimensions of the data set. This example function receives only one multidimensional array (with dimensions 'ensemble' and 'time', although not expressed in the code), and returns one multidimensional array (with a single dimension 'time' of length 1). startR will automatically feed the function with subsets of data with only the essential dinmensions, but first, a startR "step" for the function has to be built with with the `Step()` function. +Then, `Step()` and `AddStep()` are used to build up the workflow. The parameter 'target_dims' in `Step()` specifies which inner dimensions of the data are passed to the function, and the parameter 'output_dims' specifies which dimensions are expected in the output returned by the function. -The `Step()` function requires, as parameters, the names of the dimensions of the input arrays expected by the function (in this example, a single array with the dimensions 'ensemble' and 'time'), as well as the names of the dimensions of the arrays the function returns (in this example, a single array with the dimension 'time'): +In our example, we want to take the mean along 'ensemble' and get back data with a 'time' dimension. Therefore, we tell startR to subset the data over the two dimensions 'ensemble' and 'time' before passing it to the function, and to expect output with the 'time' dimension only. ```r -step <- Step(fun = fun, - target_dims = c('ensemble', 'time'), +step <- Step(fun, + target_dims = c('ensemble', 'time'), output_dims = c('time')) ``` -The step function should ideally expect arrays with the dimensions in the same order as requested in the `Start()` call, and consequently the dimension names specified in the `Step()` function should appear in the same order. If a different order was specified, startR would reorder the subsets for the step function to receive them in the expected dimension order. 
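Before wrapping the function in `Step()`, it can help to sanity-check it on a small toy array carrying only the two target dimensions. This is a quick sketch for illustration (the array sizes are arbitrary), not part of the original guide:

```r
# Toy array with the two target dimensions expected by the step function:
# 25 ensemble members x 2 time steps (arbitrary sizes).
x <- array(rnorm(25 * 2), dim = c(ensemble = 25, time = 2))

fun <- function(x) {
  # Expected inputs:
  # x: array with dimensions ('ensemble', 'time')
  apply(x + 1, 2, mean)
}

fun(x)  # one value per 'time' step, i.e. a vector of length 2
```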
- -Functions that receive or return multiple multidimensional arrays are also supported. In such cases, lists of vectors of dimension names should be provided as `target_dims` or `output_dims`. +Functions that receive or return multiple multidimensional arrays are also supported. In such cases, **lists** of vectors of dimension names should be provided as `target_dims` or `output_dims`. -Since functions wrapped with the `Step()` function will potentially be called thousands of times, it is recommended to keep them as light as possible by, for example, avoiding calls to the `library()` function to load other packages or interacting with files on disk. See the documentation on the parameter `use_libraries` of the `Step()` function, or consider adding additional parameters to the step function with extra information. +Since functions wrapped with the `Step()` function will potentially be called thousands of times, it is recommended to keep them as light as possible. For example, avoid calling the `library()` function to load other packages or interacting with files on disk. See the documentation on the parameter `use_libraries` of the `Step()` function, or consider adding additional parameters to the step function with extra information. Once the step is built, a workflow of steps can be assembled as follows: @@ -349,42 +367,30 @@ Once the step is built, a workflow of steps can be assembled as follows: wf <- AddStep(data, step) ``` -If the step involved more than one data source, a list of data sources could be provided as first parameter. You can find examples using more than one data source further in this guide. +If the step involved more than one data source, a list of data sources could be provided as the first parameter. You can find examples using more than one data source further in this guide. + +It is not possible for now to define workflows with more than one step, but this is not a crucial gap since a step function can contain more than one statistical analysis procedure. Furthermore, it is usually enough to perform only the first or first two steps of the analysis workflow on the HPCs because the volume of data involved will be reduced substantially after these steps, and the analysis can go on with conventional methods. -It is not possible for now to define workflows with more than one step, but this is not a crucial gap since a step function can contain more than one statistical analysis procedure. Furthermore, it is usually enough to perform only the first or two first steps of the analysis workflow on the HPCs because, after these steps, the volume of data involved is reduced substantially and the analysis can go on with conventional methods. ### Compute() -Once the data sources are declared and the workflow is defined, you can proceed to specify the execution parameters (including which platform to run on) and trigger the execution with the `Compute()` function. +Once the data sources and the workflow are declared, you can proceed to specify the execution parameters and trigger the execution with the `Compute()` function. -Next, a few examples are shown with `Compute()` calls to trigger the processing of a dataset locally (only on the machine where the R session is running) and on two different HPCs (the Earth Sciences fat nodes and CTE-Power9). However, let's first define a `Start()` call that involves a smaller subset of data in order not to make the examples too heavy. 
+Next, a few `Compute()` call examples are shown, each running on a different machine (the local workstation, Power 9, and the fat nodes). To make the execution lighter, please re-run the `Start()` call with a smaller subset of data (i.e., change 'time' to indices(1:2)), and also re-run `AddStep()` (see below). ```r -library(startR) - -repos <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$sdate$.nc' data <- Start(dat = repos, var = 'tas', sdate = '19930101', ensemble = 'all', - time = indices(1:2), + time = indices(1:2), #'all', latitude = 'all', longitude = 'all') -fun <- function(x) { - # Expected inputs: - # x: array with dimensions ('ensemble', 'time') - apply(x + 1, 2, mean) -} - -step <- Step(fun, - target_dims = c('ensemble', 'time'), - output_dims = c('time')) - -wf <- AddStep(data, step) +wf <- AddStep(data, step) #re-run since 'data' has been changed ``` -The output of the `Start()` call follows, where the size of the selected data is reported. +The output of the renewed `Start()` call should look as below. The total data size is reduced to only 316.4 Mb. ```r * Exploring files... This will take a variable amount of time depending * on the issued request and the performance of the file server... @@ -409,22 +415,20 @@ Warning messages: #### Compute() locally -When only your own workstation is available, startR can still be useful to process a very large dataset by chunks, thus avoiding a RAM memory overload and consequent crash of the workstation. startR will simply load and process the dataset by chunks, one after the other. The operations defined in the workflow will be applied to each chunk, and the results will be stored on a temporary file. `Compute()` will finally gather and merge the results of each chunk and return a single data object, including one or multiple multidimensional data arrays, and additional metadata. - -A list of the dimensions which to split the data along, and the number of slices (or "chunks") to make for each, is the only piece of information required for `Compute()` to run locally. It will only be possible to request chunks for those dimensions not required by any of the steps in the workflow built by `Step()` and `AddStep()`. - -Following the worklfow of steps defined in the example, where the step uses 'time' and 'ensemble' as target dimensions, the dimensions remaining for chunking would be 'dat', 'var', 'sdate', 'latitude' and 'longitude'. Note that defining a step which has many target dimensions should be avoided as it will reduce the chunking options. - -As an example, we could request for computation performing two chunks along the 'latitude' dimension, and two chunks along the 'longitude' dimension. This would result in the data being processed in 4 chunks of about 80MB (the size of the involved data, 316MB, divided by 4). - -Calculate the size of the chunks before executing the computation, and make sure they fit in the RAM of your machine. You should have as much free RAM as 2x or 3x the expected size of one chunk. Read more on adjusting the number of chunks in the section "How to choose the number of chunks, jobs and cores". - ```r res <- Compute(wf, chunks = list(latitude = 2, - longitude = 2)) + longitude = 2)) ``` +When only your own workstation is available, startR can still be useful to process a very large dataset by chunks, one after another, thus avoiding a RAM memory overload and consequent crash of the workstation. 
The operations defined in the workflow will be applied to each chunk, and the results will be stored on a temporary file. `Compute()` will finally gather and merge the results of each chunk and return a single data object. + +From the script above, you can see that the only thing we have to take care of is the chunks. A list is used to indicate which dimensions to split the data along and the number of chunks to make for each. Note that it is only possible to chunk along those dimensions not assigned as target dimensions in `Step()`. + +In our example, 'time' and 'ensemble' are the dimensions that cannot be chunked, so we can choose among 'dat', 'var', 'sdate', 'latitude' and 'longitude' for chunking. The script above splits the data along 'latitude' and 'longitude' into two chunks each, resulting in 4 chunks of around 80 MB each (the size of the involved data, 316 MB, divided by 4). + +Calculate the size of the chunks before executing the computation, and make sure they fit in the RAM of your machine. You should have as much free RAM as 2x or 3x the expected size of one chunk. Read more on adjusting the number of chunks in the section [How to choose the number of chunks, jobs and cores](inst/doc/practical_guide.md#how-to-choose-the-number-of-chunks-jobs-and-cores). + Once the `Compute()` call is executed, the R session will wait for it to return the result. Progress messages will be shown, with a remaining time estimate after the first chunk has been processed. ```r * Processing chunks... remaining time estimate soon... @@ -511,15 +515,16 @@ dim(res$output1) 2 1 1 1 640 1296 ``` -In addition to performing the computation in chunks, you can adjust the number of execution threads to use for the data retrieval stage (with `threads_load`) and for the computation (with `threads_compute`). Using more than 2 threads for the retrieval will usually be perjudicial, since two will already be able to make full use of the bandwidth between the workstation and the data repository. The optimal number of threads for the computation will depend on the number of processors in your machine, the number of cores they have, and the number of threads supported by each of them. +In addition to chunks, you can adjust the number of execution threads to use for the data retrieval stage (with `threads_load`) and for the computation (with `threads_compute`). Usually, using 2 threads for the retrieval is enough to make full use of the bandwidth between the workstation and the data repository. The optimal number of threads for the computation depends on the number of processors in your machine, the number of cores they have, and the number of threads supported by each of them. You can run the following script again and compare the timing with the previous `Compute()` call; you should find that this time the execution is faster. ```r res <- Compute(wf, chunks = list(latitude = 2, longitude = 2), threads_load = 2, - threads_compute = 4) + threads_compute = 4) ``` + ```r * Computation ended successfully. * Number of chunks: 4 @@ -554,24 +559,28 @@ res <- Compute(wf, #### Compute() on CTE-Power 9 -In order to run the computation on a HPC, such as the BSC CTE-Power 9, you will need to make sure the passwordless connection with the login node of that HPC is configured, as shown at the beginning of this guide. If possible, in both directions. 
Also, you will need to know whether there is a shared file system between your workstation and that HPC, and will need information on the number of nodes, cores per node, threads per core, RAM memory per node, and type of workload used by that HPC (Slurm, PBS and LSF supported).
+Before running the script below, make sure you have already set up the configuration described in the previous section. You also need to know whether there is a shared file system between your workstation and the HPC, as well as the machine specifications: the number of nodes, cores per node, threads per core, RAM per node, and the type of workload manager used by that HPC (Slurm, PBS and LSF are supported).
+
+Compared with the local run script above, two more parameters need to be added to the `Compute()` call: `cluster` and `ecflow_suite_dir`. The parameter `wait` is optional since it defaults to `TRUE`.

-You will need to add two parameters to your `Compute()` call: `cluster` and `ecflow_suite_dir`.
+- `ecflow_suite_dir`: a path to a folder on the local workstation where the temporary files generated for the automatic management of the workflow are stored. As you will see later, the EC-Flow workflow manager is used transparently for this purpose.
+- `cluster`: a list tailored to the HPC you want to run on. All the components in the cluster list refer to the HPC, not to the local workstation. See the explanation of the components below the script.
+- `wait`: if `wait = TRUE` (default), `Compute()` will be a blocking call. It is usually recommended to set `wait = FALSE` when processing large data; this way, you can close your R session and collect the results later on with the `Collect()` function. This will be further discussed in the [Collect()](inst/doc/practical_guide.md#collect-and-the-ec-flow-gui) section.

-The parameter `ecflow_suite_dir` expects a path to a folder in the workstation where to store temporary files generated for the automatic management of the workflow. As you will see later, the EC-Flow workflow manager is used transparently for this purpose.
+You can now run the following script. Note that the `queue_host`, `temp_dir` and `ecflow_suite_dir` parameters in the `Compute()` call are user-specific.

-The parameter `cluster` expects a list with a number of components that will have to be provided a bit differently depending on the HPC you want to run on. You can see next an example cluster configuration that will execute the previously defined workflow on CTE-Power 9.
```r res <- Compute(wf, chunks = list(latitude = 2, longitude = 2), threads_load = 2, threads_compute = 4, - cluster = list(queue_host = 'p9login1.bsc.es', + cluster = list(queue_host = 'p9login1.bsc.es', #user-specific, 'power9' if following the example above queue_type = 'slurm', - temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', - lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/', + temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', #user-specific + #lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/', r_module = 'R/3.5.0-foss-2018b', + #CDO_module = 'CDO/1.9.5-foss-2018b', cores_per_job = 4, job_wallclock = '00:10:00', max_jobs = 4, @@ -579,27 +588,28 @@ res <- Compute(wf, bidirectional = FALSE, polling_period = 10 ), - ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/' + ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/', #user-specific + wait = TRUE ) ``` -The cluster components and options are explained next: -- `queue_host`: must match the 'short_name_of_the_host' you associated to the login node of the selected HPC in your .ssh/config file. -- `queue_type`: one of 'slurm', 'pbs' or 'lsf'. -- `temp_dir`: directory on the HPC where to store temporary files. Must be accessible from the HPC login node and all HPC nodes. +**The cluster components and options:** +- `queue_host`: User-specific. Must match the 'short_name_of_the_host' you associated to the login node of the selected HPC in your .ssh/config file. +- `queue_type`: one of 'slurm', 'pbs' or 'lsf', depending on the characteristics of the machine. +- `temp_dir`: User-specific. Directory on the HPC where to store temporary files. Must be accessible from the HPC login node and all HPC nodes. You need to create your own. A warning like `Warning: rsync from remote server to collect results failed. Retrying soon.` will show if this parameter is not specified, and the job will remain blue (queuing) forever. - `lib_dir`: directory on the HPC where the startR R package and other required R packages are installed, accessible from all HPC nodes. These installed packages must be compatible with the R module specified in `r_module`. This parameter is optional; only required when the libraries are not installed in the R module. - `r_module`: name of the UNIX environment module to be used for R. If not specified, 'module load R' will be used. +- `CDO_module`: Specify the CDO module if it is used in the computation. Optional. - `cores_per_job`: number of computing cores to be requested when submitting the job for each chunk to the HPC queue. Each node may be capable of supporting more than one computing thread. - `job_wallclock`: amount of time to reserve the resources when submitting the job for each chunk. Must follow the specific format required by the specified `queue_type`. -- `max_jobs`: maximum number of jobs (chunks) to be queued simultaneously onto the HPC queue. Submitting too many jobs could overload the bandwidth between the HPC nodes and the storage system, or could overload the queue system. -- `extra_queue_params`: list of character strings with additional queue headers for the jobs to be submitted to the HPC. Mainly used to specify the amount of memory to book for each job (e.g. '#SBATCH --mem-per-cpu=30000'), to request special queuing (e.g. '#SBATCH --qos=bsc_es'), or to request use of specific software (e.g. '#SBATCH --reservation=test-rhel-7.5'). 
-- `bidirectional`: whether the connection between the R workstation and the HPC login node is bidirectional (TRUE) or unidirectional from the workstation to the login node (FALSE).
-- `polling_period`: when the connection is unidirectional, the workstation will ask the HPC login node for results each `polling_period` seconds. An excessively small value can overload the login node or result in temporary banning.
+- `max_jobs`: maximum number of jobs (chunks) to be queued simultaneously on the HPC queue. Submitting too many jobs may overload the bandwidth between the HPC nodes and the storage system, or could overload the queue system.
+- `extra_queue_params`: list of character strings with additional queue headers for the jobs to be submitted to the HPC. Mainly used to specify the amount of memory to book for each job (e.g. '#SBATCH --mem-per-cpu=30000'), to request special queuing (e.g. '#SBATCH --qos=bsc_es'), or to request use of specific software (e.g. '#SBATCH --reservation=test-rhel-7.5'). Optional.
+- `bidirectional`: whether the connection between the R workstation and the HPC login node is bidirectional (TRUE) or unidirectional from the workstation to the login node (FALSE). This depends on the characteristics of the machine.
+- `polling_period`: when the connection is unidirectional, the workstation will ask the HPC login node for results each `polling_period` seconds. An excessively small value may overload the login node and result in temporary banning.

-After the `Compute()` call is executed, an EC-Flow server is automatically started on your workstation, which will orchestrate the work and dispatch jobs onto the HPC. Thanks to the use of EC-Flow, you will also be able to monitor visually the progress of the execution. See the "Collect and the EC-Flow GUI" section.
+After running the `Compute()` call, an EC-Flow server is automatically started on your workstation, which will orchestrate the work and dispatch jobs onto the HPC. The following messages will be displayed upon execution:
-
```r
* ATTENTION: Dispatching chunks on a remote cluster. Make sure
* passwordless access is properly set in both directions.
@@ -609,7 +619,19 @@ server is already started
* Remaining time estimate soon...
```

-At this point, you may want to check the jobs are being dispatched and executed properly onto the HPC. For that, you can either use the EC-Flow GUI (covered in the next section), or you can `ssh` to the login node of the HPC and check the status of the queue with `squeue` or `qstat`, as shown below.
+At this point, you may want to check whether the jobs are being dispatched and executed properly on the HPC. You can either use the EC-Flow GUI or `ssh` into the HPC.
+- EC-Flow GUI:
+```
+# in your workstation
+ecflow_ui &
+```
+The EC-Flow user interface will pop up. See the [EC-Flow GUI](inst/doc/practical_guide.md#ec-flow-gui) section for more information.
+
+- On HPC: `ssh` to the login node of the HPC and check the status of the queue with `squeue` or `qstat`, as shown below.
+```
+# in your workstation
+ssh power9 # your 'queue_host'; 'power9' if you followed the example above
+```
```
[bsc32473@p9login1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
@@ -619,7 +641,7 @@ At this point, you may want to check the jobs are being dispatched and executed
1142421 main /STARTR_ bsc32473 R 0:12 1 p9r3n08
```

-Here the output of the execution on CTE-Power 9 after waiting for about a minute:
+Here is the output of the execution on CTE-Power 9, shown in your R session after waiting for about a minute:

```r
* Remaining time estimate (neglecting queue and merge time) (at
* 2019-01-28 01:16:59): 0 mins (46.22883 secs per chunk)
@@ -661,17 +683,14 @@ Warning messages:
!system. Store the result after Collect() ends if needed.
```

-As you have probably realized, this execution has been slower than the local execution, even if 4 simultaneous jobs have been executed on CTE-Power. This is due to the small size of the data being processed here. The overhead of queuing and starting jobs at CTE-Power is large compared to the required computation time for this amount of data. The benefit would be obvious in use cases with larger inputs.
-
-Usually, in use cases with larger data inputs, it will be preferrable to add the parameter `wait = FALSE` to your `Compute()` call. With this parameter, `Compute()` will return an object with all the information on your startR execution which you will be able to store in your disk. After doing that, you will be able to close your R session and collect the results later on with the `Collect()` function. This is discussed in the next section.
+_**Note**_: As you have probably realized, this execution has been slower than the local one, even though 4 simultaneous jobs have been executed on CTE-Power. This is due to the small size of the data being processed here: the overhead of queuing and starting jobs on CTE-Power is large compared to the computation time required for this amount of data. The benefit becomes obvious in use cases with larger inputs. See [How to choose the number of chunks, jobs and cores](inst/doc/practical_guide.md#how-to-choose-the-number-of-chunks-jobs-and-cores) for more information.

-As mentioned above in the definition of the `cluster` parameters, it is strongly recommended to check the section on "How to choose the number of chunks, jobs and cores".

-#### Compute() on the fat nodes and other HPCs
+#### Compute() on the fat nodes

The `Compute()` call with the parameters to run the example in this section on the BSC ES fat nodes is provided below (you will need to adjust some of the parameters before using it). As you can see, the only thing that needs to be changed to execute startR on a different HPC is the definition of the `cluster` parameters.

-The `cluster` configuration for the fat nodes, CTE-Power 9, Marenostrum 4, Nord III, Minotauro and ECMWF cca/ccb are all provided at the very end of this guide.
+The `cluster` configurations for the fat nodes, CTE-Power 9, Marenostrum 4, Nord III, Minotauro and ECMWF cca/ccb are all provided in the [Compute() cluster templates](inst/doc/practical_guide.md#compute-cluster-templates) section.
```r
res <- Compute(wf,
@@ -681,18 +700,35 @@ res <- Compute(wf,
               threads_compute = 4,
               cluster = list(queue_host = 'bsceslogin01.bsc.es',
                              queue_type = 'slurm',
-                              temp_dir = '/home/Earth/nmanuben/startR_hpc/',
+                              temp_dir = '/home/Earth/nmanuben/startR_hpc/', #user-specific
                              cores_per_job = 2,
                              job_wallclock = '00:10:00',
                              max_jobs = 4,
                              bidirectional = TRUE
                              ),
-               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/')
+               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/', #user-specific
+               wait = TRUE)
```

-### Collect() and the EC-Flow GUI
+### EC-Flow GUI

-Usually, in use cases where large data inputs are involved, it is convenient to add the parameter `wait = FALSE` to your `Compute()` call. With this parameter, `Compute()` will immediately return an object with information about your startR execution. You will be able to store this object onto disk. After doing that, you will not need to worry in case your workstation turns off in the middle of the computation. You will be able to close your R session, and collect the results later on with the `Collect()` function.
+After running the `Compute()` call, you may want to visually check the status of the execution. EC-Flow, the workflow manager we use, provides a graphical user interface for this purpose. You need to open a new terminal, load the EC-Flow module if you have not done so yet, and start the GUI:
+```
+module load ecFlow
+ecflow_ui &
+```
+A window will then pop up. However, if it is the first time you use the EC-Flow GUI with startR, you need to register the EC-Flow server that has been started automatically by `Compute()`. You can open the top menu "Manage servers" > "New server" > set host to 'localhost' > set port to '5678' > save.
+
+The host and port can be adjusted with the parameter `ecflow_server` in `Compute()`, which must be provided in the form `c(host = 'hostname', port = port_number)`. _(Note: this parameter is currently not functional.)_
+
+After registering, an expandable entry will appear, where you can see the list of jobs to be executed, one for each chunk, with their status represented by colors. Gray means pending, blue means queuing, green means in progress, and yellow means completed.
+
+Note that if `wait = FALSE` in `Compute()`, the status will not be updated periodically but will remain blue (queuing). See [Update EC-Flow GUI](inst/doc/practical_guide.md#update-ec-flow-gui) for how to work around this.
+
+### Collect()
+
+#### Collect the output
+Usually, in use cases where large data inputs are involved, it is convenient to add the parameter `wait = FALSE` to the `Compute()` call. With this parameter, `Compute()` will not block the R session and will immediately return an object with information about your startR execution. You can store this object on disk; by doing so, you do not need to worry if your workstation turns off in the middle of the computation. You can close your R session and collect the results later on with the `Collect()` function.
```r
res <- Compute(wf,
@@ -700,10 +736,9 @@ res <- Compute(wf,
                             longitude = 2),
               threads_load = 2,
               threads_compute = 4,
-               cluster = list(queue_host = 'p9login1.bsc.es',
+               cluster = list(queue_host = 'p9login1.bsc.es', #user-specific
                              queue_type = 'slurm',
-                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
-                              lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/',
+                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', #user-specific
                              r_module = 'R/3.5.0-foss-2018b',
                              cores_per_job = 4,
                              job_wallclock = '00:10:00',
@@ -712,41 +747,21 @@ res <- Compute(wf,
                              bidirectional = FALSE,
                              polling_period = 10
                              ),
-               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/',
+               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/', #user-specific
               wait = FALSE
               )

-saveRDS(res, file = 'test_collect.Rds')
+saveRDS(res, file = 'test_collect.Rds') #save the startR execution object
```

-At this point, after storing the descriptor of the execution and before calling `Collect()`, you may want to visually check the status of the execution. You can do that with the EC-Flow graphical user interface. You need to open a new terminal, load the EC-Flow module if needed, and start the GUI:
-
-```
-module load ecFlow
-ecflow_ui &
-```
-
-After doing that, a window will pop up. You will be able to monitor the status of your EC-Flow suites there. However, if it is the first time you are using the EC-Flow GUI with startR, you will need to register the EC-Flow server that has been started automatically by `Compute()`. You can open the top menu "Manage servers" > "New server" > set host to 'localhost' > set port to '5678' > save.
-
-Note that the host and port can be adjusted with the parameter `ecflow_server` in `Compute()`, which must be provided in the form `c(host = 'hostname', port = port_number)`.
-
-After registering the EC-Flow server, an expandable entry will appear, where you can see listed the jobs to be executed, one for each chunk, with their status represented by a colour. Gray means pending, blue means queuing, green means in progress, and yellow means completed.
-
-You will see that, if you are running on an HPC where the connection with its login node is unidirectional, the jobs remain blue (queuing). This is because the jobs, upon start or completion, cannot send the signals back. In order to retrieve this information, the `Collect()` function must be called from an R terminal.
-
+When the computation is done, you can run the following lines to collect the results:
```r
-library(startR)
-
res <- readRDS('test_collect.Rds')
-
result <- Collect(res, wait = TRUE)
```
+If the computation has not finished yet, this `Collect()` call (with `wait = TRUE`) will block the R session until all the chunks are done. With `wait = FALSE`, `Collect()` returns immediately instead, and throws an error if the results are not ready yet; a minimal polling sketch is shown below.

-In this example, `Collect()` has been run with the parameter `wait = TRUE`. This will be a blocking call, in which `Collect()` will retrieve information from the HPC, including signals and outputs, each `polling_period` seconds (as described above). `Collect()` will not return until the results of all chunks have been received. Meanwhile, the status of the EC-Flow workflow on the EC-Flow GUI will be updated periodically and you will be able to monitor the status, as shown in the image below (image taken from another use case).
-
-
-
-Upon completion, `Collect()` returns the merged data array, as seen in the "Compute locally" section.
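+
+If you prefer a non-blocking check, the lines below are a minimal sketch of the `wait = FALSE` behaviour described above (base R only; 'test_collect.Rds' is the execution object saved earlier):
+
+```r
+res <- readRDS('test_collect.Rds')
+# With wait = FALSE, Collect() throws an error while chunks are still running,
+# so wrap the call in tryCatch() and simply try again later.
+result <- tryCatch(Collect(res, wait = FALSE), error = function(e) NULL)
+if (is.null(result)) {
+  message('Results not ready yet; try collecting again later.')
+}
+```
+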
+Upon completion, `Collect()` will return the merged data array.

```r
str(result)
@@ -775,15 +790,33 @@ List of 1
 ..$ total : logi NA
```

-Note that, when the results are collected with `Collect()` instead of calling `Compute()` with the parameter `wait = TRUE`, it will not be possible to know the total time taken by the entire data processing workflow, but we will still be able to know the timings of most of the stages.
+_**Note 1**_: When the results are collected with `Collect()` instead of calling `Compute()` with the parameter `wait = TRUE`, it will not be possible to know the total time taken by the entire data processing workflow, but we will still be able to know the timings of most of the stages.
+
+_**Note 2**_: `Collect()` also has a parameter called `remove`, which by default is set to `TRUE` and triggers removal of all data results received from the HPC (and stored under `ecflow_suite_dir`). If you would like to preserve the data, you can set `remove = FALSE` and `Collect()` it as many times as desired. Alternatively, you can `Collect()` with `remove = TRUE` and store the merged array right after with `saveRDS()`.
+
+
+#### Update EC-Flow GUI
+If the computation runs on an HPC where the connection with its login node is unidirectional (e.g., Power 9) and `wait = FALSE` in `Compute()`, you will find that the jobs in the EC-Flow GUI are not updated periodically but remain blue (queuing). This is because the jobs, upon start or completion, cannot send the signals back. Here, `Collect()` serves an additional purpose: updating the execution progress.
+
+```r
+res <- readRDS('test_collect.Rds')
+result <- Collect(res, wait = TRUE)
+```
+
+In this example, `Collect()` has been run with the parameter `wait = TRUE`. This will be a blocking call, in which `Collect()` will retrieve information from the HPC, including signals and outputs, each `polling_period` seconds (as described above). `Collect()` will not return until the results of all chunks have been received. Meanwhile, the status of the EC-Flow workflow on the EC-Flow GUI will be updated periodically and you will be able to monitor the status, as shown in the image below (image taken from another use case).

-You can also run `Collect()` with `wait = FALSE`. This will crash with an error if the results are not yet available, or will return the merged array otherwise.
+

-`Collect()` also has a parameter called `remove`, which by default is set to `TRUE` and triggers removal of all data results received from the HPC (and stored under `ecflow_suite_dir`). If you would like to preserve the data, you can set `remove = FALSE` and `Collect()` it as many times as desired. Alternatively, you can `Collect()` with `remove = TRUE` and store the merged array right after with `saveRDS()`.

+## Other useful documentation
+If you are a new startR user and have not read the [**README**](https://earth.bsc.es/gitlab/es/startR/blob/master/README.md), it is recommended to read it first to get a general picture of startR. Then you can follow this order to learn more:
+1. See [**Deployment**](inst/doc/deployment.md) for details on the setup if you work outside of BSC.
+2. Follow the [**Practical Guide**](inst/doc/practical_guide.md), which walks through use cases step by step.
+3. See [**Start**](inst/doc/start.md) to learn more about the applications of `Start()`.
+4. Find [**FAQs**](https://earth.bsc.es/gitlab/es/startR/wikis/FAQ) and [**Examples**](https://earth.bsc.es/gitlab/es/startR/wikis/Example) on the GitLab wiki pages.

## Additional information

-### How to choose the number of chunks, jobs and cores
+### 1. 
How to choose the number of chunks, jobs and cores #### Number of chunks and memory per job @@ -811,7 +844,7 @@ The number of cores per job should be as large as possible, with two limitations - If the HPC nodes have each "M" total amount of memory with "m" amount of memory per memory module, and have each "N" amount of cores, the requested amount of cores "n" should be such that "n" / "N" does not exceed "m" / "M". - Depending on the shape of the chunks, startR has a limited capability to exploit multiple cores. It is recommended to make small tests increasing the number of cores to work out a reasonable number of cores to be requested. -### How to clean a failed execution +### 2. How to clean a failed execution - Work out the startR execution ID, either by inspecting the execution description by `Compute()` when called with the parameter `wait = FALSE`, or by checking the `ecflow_suite_dir` with `ls -ltr`. - ssh to the HPC login node and cancel all jobs of your startR execution. @@ -821,7 +854,7 @@ The number of cores per job should be as large as possible, with two limitations - Optionally remove the data under `data_dir` on the HPC login node if the file system is not shared between the workstation and the HPC and you do not want to keep the data in the `data_dir`, used as caché for future startR executions. - Open the EC-Flow GUI and remove the workflow entry (a.k.a. suite) named with your startR execution ID with right click > "Remove". -### Visualizing the profiling of the execution +### 3. Visualizing the profiling of the execution As seen in previous sections, profiling measurements of the execution are provided together with the data output. These measurements can be visualized with the `PlotProfiling()` function made available in the source code of the startR package. @@ -839,195 +872,24 @@ A chart displays the timings for the different stages of the computation, as sho You can click on the image to expand it. -### What to do if your function has too many target dimensions -### Pending features -- Adding feature for `Copute()` to run on multiple HPCs or workstations. +## Pending features +- What to do if your function has too many target dimensions. +- Adding feature for `Compute()` to run on multiple HPCs or workstations. - Adding plug-in to read CSV files. - Supporting multiple steps in a workflow defined by `AddStep()`. - Adding feature in `Start()` to read sparse grid points. - Allow for chunking along "essential" (a.k.a. "target") dimensions. 
-## Other examples - -### Using experimental and (date-corresponding) observational data - -```r -repos <- paste0('/esnas/exp/ecmwf/system4_m1/6hourly/', - '$var$/$var$_$sdate$.nc') - -system4 <- Start(dat = repos, - var = 'sfcWind', - #sdate = paste0(1981:2015, '1101'), - sdate = paste0(1981:1984, '1101'), - #time = indices((30*4+1):(120*4)), - time = indices((30*4+1):(30*4+4)), - ensemble = 'all', - #ensemble = indices(1:6), - #latitude = 'all', - latitude = indices(1:10), - #longitude = 'all', - longitude = indices(1:10), - return_vars = list(latitude = NULL, - longitude = NULL, - time = c('sdate'))) - -repos <- paste0('/esnas/recon/ecmwf/erainterim/6hourly/', - '$var$/$var$_$file_date$.nc') - -dates <- attr(system4, 'Variables')$common$time -dates_file <- sort(unique(gsub('-', '', sapply(as.character(dates), -substr, 1, 7)))) - -erai <- Start(dat = repos, - var = 'sfcWind', - file_date = dates_file, - time = values(dates), - #latitude = 'all', - latitude = indices(1:10), - #longitude = 'all', - longitude = indices(1:10), - time_var = 'time', - time_tolerance = as.difftime(1, units = 'hours'), - time_across = 'file_date', - return_vars = list(latitude = NULL, - longitude = NULL, - time = 'file_date'), - merge_across_dims = TRUE, - split_multiselected_dims = TRUE) - -step <- Step(eqmcv_atomic, - list(a = c('ensemble', 'sdate'), - b = c('sdate')), - list(c = c('ensemble', 'sdate'))) - -res <- Compute(step, list(system4, erai), - chunks = list(latitude = 5, - longitude = 5, - time = 2), - cluster = list(queue_host = 'bsceslogin01.bsc.es', - max_jobs = 4, - cores_per_job = 2), - shared_dir = '/esnas/scratch/nmanuben/test_bychunk', - wait = FALSE) -``` - -### Example of computation of weekly means - -### Example with data on an irregular grid with selection of a region - -### Example on MareNostrum 4 - -```r -library(startR) - -repos <- paste0('/esarchive/exp/ecmwf/system5_m1/6hourly/', - '$var$-longitudeS1latitudeS1all/$var$_$sdate$.nc') -# Slower alternative, using files with a less efficient NetCDF -# compression configuration -#repos <- '/esarchive/exp/ecmwf/system5_m1/6hourly/$var$/$var$_$sdate$.nc' - -data <- Start(dat = repos, - var = 'tas', - sdate = indices(1), - ensemble = 'all', - time = 'all', - latitude = indices(1:40), - longitude = indices(1:40), - retrieve = FALSE) -lons <- attr(data, 'Variables')$common$longitude -lats <- attr(data, 'Variables')$common$latitude - -fun <- function(x) apply(x + 1, 2, mean) -step <- Step(fun, c('ensemble', 'time'), c('time')) -wf <- AddStep(data, step) - -res <- Compute(wf, - chunks = list(latitude = 2, - longitude = 2), - threads_load = 1, - threads_compute = 2, - cluster = list(queue_host = 'mn2.bsc.es', - queue_type = 'slurm', - data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/', - temp_dir = '/gpfs/scratch/pr1efe00/pr1efe03/startR_tests/', - lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.4/', - r_module = 'R/3.4.0', - cores_per_job = 2, - job_wallclock = '00:10:00', - max_jobs = 4, - extra_queue_params = list('#SBATCH --qos=prace'), - bidirectional = FALSE, - polling_period = 10, - special_setup = 'marenostrum4' - ), - ecflow_suite_dir = '/home/Earth/nmanuben/test_remove/', - ecflow_server = NULL, - silent = FALSE, - debug = FALSE, - wait = TRUE) -``` - -### Example on CTE-Power using GPUs - -### Seasonal forecast verification example on cca - -```r -crps <- function(x, y) { - mean(SpecsVerification::EnsCrps(x, y, R.new = Inf)) -} - -library(startR) - -repos <- 
'/perm/ms/spesiccf/c3ah/qa4seas/data/seasonal/g1x1/ecmf-system4/msmm/atmos/seas/tprate/12/ecmf-system4_msmm_atmos_seas_sfc_$date$_tprate_g1x1_init12.nc' - -data <- Start(dat = repos, - var = 'tprate', - date = 'all', - time = 'all', - number = 'all', - latitude = 'all', - longitude = 'all', - return_vars = list(time = 'date')) - -dates <- attr(data, 'Variables')$common$time - -repos <- '/perm/ms/spesiccf/c3ah/qa4seas/data/ecmf-ei_msmm_atmos_seas_sfc_19910101-20161201_t2m_g1x1_init02.nc' - -obs <- Start(dat = repos, - var = 't2m', - time = values(dates), - latitude = 'all', - longitude = 'all', - split_multiselected_dims = TRUE) - -s <- Step(crps, target_dims = list(c('date', 'number'), c('date')), - output_dims = NULL) -wf <- AddStep(list(data, obs), s) - -r <- Compute(wf, - chunks = list(latitude = 10, - longitude = 3), - cluster = list(queue_host = 'cca', - queue_type = 'pbs', - max_jobs = 10, - init_commands = list('module load ecflow'), - r_module = 'R/3.3.1', - extra_queue_params = list('#PBS -l EC_billing_account=spesiccf')), - ecflow_output_dir = '/perm/ms/spesiccf/c3ah/startR_test/', - is_ecflow_output_dir_shared = FALSE - ) -``` ## Compute() cluster templates ### CTE-Power9 ```r -cluster = list(queue_host = 'p9login1.bsc.es', +cluster = list(queue_host = 'p9login1.bsc.es', #user-defined queue_type = 'slurm', - temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', - lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.5/', + temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', #user-defined r_module = 'R/3.5.0-foss-2018b', cores_per_job = 4, job_wallclock = '00:10:00', @@ -1052,11 +914,10 @@ cluster = list(queue_host = 'bsceslogin01.bsc.es', ### Marenostrum 4 ```r -cluster = list(queue_host = 'mn2.bsc.es', +cluster = list(queue_host = 'mn2.bsc.es', #user-defined queue_type = 'slurm', data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/', - temp_dir = '/gpfs/scratch/pr1efe00/pr1efe03/startR_hpc/', - lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.4/', + temp_dir = '/gpfs/scratch/pr1efe00/pr1efe03/startR_hpc/', #user-defined r_module = 'R/3.4.0', cores_per_job = 2, job_wallclock = '00:10:00', @@ -1071,11 +932,10 @@ cluster = list(queue_host = 'mn2.bsc.es', ### Nord III ```r -cluster = list(queue_host = 'nord1.bsc.es', +cluster = list(queue_host = 'nord1.bsc.es', #user-defined queue_type = 'lsf', data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/', - temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', - lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.3/', + temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', #user-defined init_commands = list('module load intel/16.0.1'), r_module = 'R/3.3.0', cores_per_job = 2, @@ -1091,11 +951,10 @@ cluster = list(queue_host = 'nord1.bsc.es', ### MinoTauro ```r -cluster = list(queue_host = 'mt1.bsc.es', +cluster = list(queue_host = 'mt1.bsc.es', #user-defined queue_type = 'slurm', data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/', - temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', - lib_dir = '/gpfs/projects/bsc32/share/R_libs/3.3/', + temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/', #user-defined r_module = 'R/3.3.3', cores_per_job = 2, job_wallclock = '00:10:00', diff --git a/inst/doc/start.md b/inst/doc/start.md index be096562cd65e82295cb12d1e7a60ecd3525bf02..d808aae8736fdbba7fd10c06f9aadb133fd51c7f 100644 --- a/inst/doc/start.md +++ b/inst/doc/start.md @@ -1,8 +1,8 @@ -## Documentation and examples on the Start() function +## Documentation and examples of the Start() function -Data retrieval and alignment is the 
first step in data analysis in any field and is often highly complex and time-consuming, especially nowadays in the era of Big Data, where large multidimensional data sets from diverse sources need to be combined and processed. Taking subsets of these datasets (Divide) to then be processed efficiently (and Conquer) becomes an indispensable technique.
+Data retrieval and alignment is the first step in data analysis in any field and is often highly complex and time-consuming, especially nowadays in the era of Big Data, where large multidimensional data sets from diverse sources need to be combined and processed. Taking subsets of these datasets and processing them efficiently have become indispensable techniques.

-`Start()` has been designed to automatically retrieve multidimensional distributed data sets in R. It provides an interface that allows to perceive and access one or a collection of data sets as if they all were part of a large multidimensional array. Indices or bounds can be specified for each of the dimensions in order to crop the whole array into a smaller sub-array. `Start()` will perform the required operations to fetch the corresponding regions of the corresponding files (potentially distributed over various remote servers) and arrange them into a local R multidimensional array. By default, as many cores as available locally are used in this procedure.
+`Start()` has been designed to automatically retrieve multidimensional distributed data sets in R. It provides a way to perceive and access one or a collection of data sets as if they were all part of a large multidimensional array. Indices or bounds can be specified for each of the dimensions in order to crop the whole array into a smaller sub-array. `Start()` will perform the required operations to fetch the corresponding regions of the corresponding files (potentially distributed over various remote servers) and arrange them into a local R multidimensional array. By default, as many cores as available locally are used in this procedure.

Usually, in retrieval processes previous to multidimensional data analysis, it is needed to apply a set of common transformations, pre-processes or reorderings to the data as it comes in. `Start()` accepts user-defined transformation or reordering functions to be applied for such purposes.

@@ -13,7 +13,7 @@ Metadata and auxilliary data is also preserved and arranged by `Start()` in the

### How to use

-`Start()` has a rather steep learning curve but it makes the retrieval process straightforward and highly customizable. The header looks as follows:
+`Start()` has a rather steep learning curve but it makes the retrieval process straightforward and highly customizable. The header looks as follows:

```R
Start(...,
@@ -39,8 +39,8 @@ Start(...,

Usually most of the required information will be provided through the `...` parameter and only a few of the parameters in the function header will be used. The parameters can be grouped as follows:

-- The parameters provided via `...`, with information on the structure of the datasets which to take data from, information on which dimensions they have, which indices to take from each of the dimensions, and how to reorder each of the dimensions if needed.
-- `synonims` with information to identify dimensions and variablse across the multiple files/datasets (useful if they don't follow a homogeneous naming convention).
+- The parameters provided via `...`, with information on the structure of the datasets to take data from, on which dimensions they have, which indices to take from each of the dimensions, and how to reorder each of the dimensions if needed. See the [practical guide](/inst/doc/practical_guide.md) for more information.
+- `synonims` with information to identify dimensions and variables across the multiple files/datasets (useful if they do not follow a homogeneous naming convention).
 - `return_vars` with information on auxilliary data to be retrieved.
 - `file_*` parameters, which allow to specify interface functions to the desired file format.
 - `transform_*` parameters, which allow to specify a transformation to be applied to the data as it comes in.
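+
+For illustration, a minimal declaration combining the first three groups of parameters might look like the sketch below. The path pattern and variable name are hypothetical placeholders; the `synonims` entry assumes files in which the latitude/longitude dimensions may appear under either a short or a long name.
+
+```r
+library(startR)
+
+# Hypothetical path pattern with file dimensions 'var' and 'sdate'
+repos <- '/path/to/experiment/$var$/$var$_$sdate$.nc'
+
+data <- Start(dat = repos,                 # structure and indices, all via '...'
+              var = 'tas',
+              sdate = '19930101',
+              time = 'all',
+              latitude = 'all',
+              longitude = 'all',
+              synonims = list(latitude = c('lat', 'latitude'),
+                              longitude = c('lon', 'longitude')),
+              return_vars = list(latitude = NULL,    # auxiliary coordinate variables
+                                 longitude = NULL,
+                                 time = 'sdate'),
+              retrieve = FALSE)
+```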