diff --git a/R/Compute.R b/R/Compute.R
index 0aa94245bd5d5eff3c08fc4d61cb911dac628187..1450b0157b4d5480a8c077c09742f7b02e0e4f12 100644
--- a/R/Compute.R
+++ b/R/Compute.R
@@ -25,7 +25,7 @@
 #' to use for the computation. The default value is 1.
 #'@param cluster A list of components that define the configuration of the
-#' machine to be run on. The comoponents vary from the different machines.
-#' Check \href{https://earth.bsc.es/gitlab/es/startR/}{startR GitLab} for more
+#' machine to be run on. The components vary among the different machines.
+#' Check \href{https://earth.bsc.es/gitlab/es/startR/-/blob/master/inst/doc/practical_guide.md}{Practical guide on GitLab} for more
 #' details and examples. Only needed when the computation is not run locally.
 #' The default value is NULL.
 #'@param ecflow_suite_dir A character string indicating the path to a folder in
diff --git a/inst/doc/practical_guide.md b/inst/doc/practical_guide.md
index 378f6486bd7a5cb60fadb35918a15bd056f55379..c56fc0b401797fd6b00a790606b4ca33c4c01b01 100644
--- a/inst/doc/practical_guide.md
+++ b/inst/doc/practical_guide.md
@@ -1,6 +1,6 @@
 # Practical guide for processing large data sets with startR
 
-This guide includes explanations and practical examples for you to learn how to use startR to efficiently process large data sets in parallel on the BSC's HPCs (CTE-Power 9, Marenostrum 4, ...). See the main page of the [**startR**](README.md) project for a general overview of the features of startR, without actual guidance on how to use it.
+This guide includes explanations and practical examples for you to learn how to use startR to efficiently process large data sets in parallel on the BSC's HPCs (Nord3-v2, CTE-Power 9, Marenostrum 4, ...). See the main page of the [**startR**](README.md) project for a general overview of the features of startR, without actual guidance on how to use it.
 
 If you would like to start using startR rightaway on the BSC infrastructure, you can directly go through the "Configuring startR" section, copy/paste the basic startR script example shown at the end of the "Introduction" section onto the text editor of your preference, adjust the paths and user names specified in the `Compute()` call, and run the code in an R session after loading the R and ecFlow modules.
 
@@ -53,7 +53,7 @@ Afterwards, you will need to understand and use five functions, all of them incl
 - **Compute()**, for specifying the HPC to be employed, the execution parameters (e.g. number of chunks and cores), and to trigger the computation
 - **Collect()** and the **EC-Flow graphical user interface**, for monitoring of the progress and collection of results
 
-Next, you can see an example startR script performing the ensemble mean of a small data set on CTE-Power9, for you to get a broad picture of how the startR functions interact and the information that is represented in a startR script. Note that the `queue_host`, `temp_dir` and `ecflow_suite_dir` parameters in the `Compute()` call are user-specific.
+Next, you can see an example startR script performing the ensemble mean of a small data set on an HPC cluster such as Nord3-v2 or CTE-Power9, for you to get a broad picture of how the startR functions interact and the information that is represented in a startR script. Note that the `queue_host`, `temp_dir` and `ecflow_suite_dir` parameters in the `Compute()` call are user-specific.
 
 ```r
 library(startR)
@@ -79,22 +79,24 @@ step <- Step(fun,
 
 wf <- AddStep(data, step)
 
-res <- Compute(wf,
-               chunks = list(latitude = 2,
-                             longitude = 2),
-               threads_load = 2,
-               threads_compute = 4,
-               cluster = list(queue_host = 'p9login1.bsc.es',
-                              queue_type = 'slurm',
-                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
-                              r_module = 'R/3.5.0-foss-2018b',
-                              job_wallclock = '00:10:00',
-                              cores_per_job = 4,
-                              max_jobs = 4,
-                              bidirectional = FALSE,
-                              polling_period = 10
-                             ),
-               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/')
+res <- Compute(wf,
+               chunks = list(latitude = 2,
+                             longitude = 2),
+               threads_load = 2,
+               threads_compute = 4,
+               cluster = list(queue_host = 'nord4.bsc.es',
+                              queue_type = 'slurm',
+                              temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
+                              cores_per_job = 4,
+                              job_wallclock = '00:10:00',
+                              max_jobs = 4,
+                              extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
+                              bidirectional = FALSE,
+                              polling_period = 10
+                             ),
+               ecflow_suite_dir = '/home/Earth/aho/startR_local/',
+               wait = TRUE
+              )
 ```
 
 ## Configuring startR
@@ -121,7 +123,17 @@ Specifically, you need to set up passwordless, userless access from your machine
 
 After following these steps for the connections in both directions (although from the HPC to the workstation might not be possible), you are good to go.
 
-Do not forget adding the following lines in your .bashrc on CTE-Power if you are planning to run on CTE-Power:
+Do not forget to add the following lines in your .bashrc on the HPC machine.
+
+If you are planning to run on Nord3-v2, you have to add:
+```
+if [[ $BSC_MACHINE == "nord3v2" ]]; then
+  module purge
+  module use /gpfs/projects/bsc32/software/suselinux/11/modules/all
+  module unuse /apps/modules/modulefiles/applications /apps/modules/modulefiles/compilers /apps/modules/modulefiles/tools /apps/modules/modulefiles/libraries /apps/modules/modulefiles/environment
+fi
+```
+If you are using CTE-Power:
 ```
 if [[ $BSC_MACHINE == "power" ]] ; then
   module unuse /apps/modules/modulefiles/applications
@@ -356,7 +368,7 @@ It is not possible for now to define workflows with more than one step, but this
 
 Once the data sources are declared and the workflow is defined, you can proceed to specify the execution parameters (including which platform to run on) and trigger the execution with the `Compute()` function.
 
-Next, a few examples are shown with `Compute()` calls to trigger the processing of a dataset locally (only on the machine where the R session is running) and on two different HPCs (the Earth Sciences fat nodes and CTE-Power9). However, let's first define a `Start()` call that involves a smaller subset of data in order not to make the examples too heavy.
+Next, a few examples are shown with `Compute()` calls to trigger the processing of a dataset locally (only on the machine where the R session is running) and on different HPCs (Nord3-v2, CTE-Power9, etc.). However, let's first define a `Start()` call that involves a smaller subset of data in order not to make the examples too heavy.
 
 ```r
 library(startR)
@@ -551,34 +563,39 @@ res <- Compute(wf,
 * max: 8.03660178184509
 ```
 
-#### Compute() on CTE-Power 9
+#### Compute() on HPCs
 
-In order to run the computation on a HPC, such as the BSC CTE-Power 9, you will need to make sure the passwordless connection with the login node of that HPC is configured, as shown at the beginning of this guide. If possible, in both directions. Also, you will need to know whether there is a shared file system between your workstation and that HPC, and will need information on the number of nodes, cores per node, threads per core, RAM memory per node, and type of workload used by that HPC (Slurm, PBS and LSF supported).
+In order to run the computation on an HPC, you will need to make sure the passwordless connection with the login node of that HPC is configured, as shown at the beginning of this guide. If possible, in both directions. Also, you will need to know whether there is a shared file system between your workstation and that HPC, and will need information on the number of nodes, cores per node, threads per core, RAM memory per node, and type of workload manager used by that HPC (Slurm, PBS and LSF supported).
 
 You will need to add two parameters to your `Compute()` call: `cluster` and `ecflow_suite_dir`.
 
 The parameter `ecflow_suite_dir` expects a path to a folder in the workstation where to store temporary files generated for the automatic management of the workflow. As you will see later, the EC-Flow workflow manager is used transparently for this purpose.
 
-The parameter `cluster` expects a list with a number of components that will have to be provided a bit differently depending on the HPC you want to run on. You can see next an example cluster configuration that will execute the previously defined workflow on CTE-Power 9.
+The parameter `cluster` expects a list with a number of components that will have to be provided a bit differently depending on the HPC you want to run on. You can see next an example cluster configuration that will execute the previously defined workflow on Nord3-v2.
 ```r
-res <- Compute(wf,
-               chunks = list(latitude = 2,
-                             longitude = 2),
-               threads_load = 2,
-               threads_compute = 4,
-               cluster = list(queue_host = 'p9login1.bsc.es',
-                              queue_type = 'slurm',
-                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
-                              r_module = 'R/3.5.0-foss-2018b',
-                              cores_per_job = 4,
-                              job_wallclock = '00:10:00',
-                              max_jobs = 4,
-                              extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
-                              bidirectional = FALSE,
-                              polling_period = 10
-                             ),
-               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/'
-              )
+# user-defined
+temp_dir <- '/gpfs/scratch/bsc32/bsc32734/startR_hpc/'
+ecflow_suite_dir <- '/home/Earth/aho/startR_local/'
+
+res <- Compute(wf,
+               chunks = list(latitude = 2,
+                             longitude = 2),
+               threads_load = 2,
+               threads_compute = 4,
+               cluster = list(queue_host = 'nord4.bsc.es',
+                              queue_type = 'slurm',
+                              temp_dir = temp_dir,
+                              r_module = 'R/4.1.2-foss-2019b',
+                              cores_per_job = 4,
+                              job_wallclock = '00:10:00',
+                              max_jobs = 4,
+                              extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
+                              bidirectional = FALSE,
+                              polling_period = 10
+                             ),
+               ecflow_suite_dir = ecflow_suite_dir,
+               wait = TRUE
+              )
 ```
 
 The cluster components and options are explained next:
@@ -593,6 +610,7 @@ The cluster components and options are explained next:
 - `extra_queue_params`: list of character strings with additional queue headers for the jobs to be submitted to the HPC. Mainly used to specify the amount of memory to book for each job (e.g. '#SBATCH --mem-per-cpu=30000'), to request special queuing (e.g. '#SBATCH --qos=bsc_es'), or to request use of specific software (e.g. '#SBATCH --reservation=test-rhel-7.5').
 - `bidirectional`: whether the connection between the R workstation and the HPC login node is bidirectional (TRUE) or unidirectional from the workstation to the login node (FALSE).
 - `polling_period`: when the connection is unidirectional, the workstation will ask the HPC login node for results each `polling_period` seconds. An excessively small value can overload the login node or result in temporary banning.
+- `special_setup`: name of the machine if the computation requires a special setup. Only Marenostrum 4 needs this parameter (e.g. special_setup = 'marenostrum4').
 
 After the `Compute()` call is executed, an EC-Flow server is automatically started on your workstation, which will orchestrate the work and dispatch jobs onto the HPC. Thanks to the use of EC-Flow, you will also be able to monitor visually the progress of the execution. See the "Collect and the EC-Flow GUI" section.
 
@@ -609,15 +627,15 @@ server is already started
 At this point, you may want to check the jobs are being dispatched and executed properly onto the HPC. For that, you can either use the EC-Flow GUI (covered in the next section), or you can `ssh` to the login node of the HPC and check the status of the queue with `squeue` or `qstat`, as shown below.
 
 ```
-[bsc32473@p9login1 ~]$ squeue
+[bsc32734@login4 ~]$ squeue
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
-           1142418      main /STARTR_ bsc32473  R       0:12      1 p9r3n08
-           1142419      main /STARTR_ bsc32473  R       0:12      1 p9r3n08
-           1142420      main /STARTR_ bsc32473  R       0:12      1 p9r3n08
-           1142421      main /STARTR_ bsc32473  R       0:12      1 p9r3n08
+            757026      main /STARTR_ bsc32734  R       0:46      1 s02r2b24
+            757027      main /STARTR_ bsc32734  R       0:46      1 s04r1b61
+            757028      main /STARTR_ bsc32734  R       0:46      1 s04r1b63
+            757029      main /STARTR_ bsc32734  R       0:46      1 s04r1b64
 ```
 
-Here the output of the execution on CTE-Power 9 after waiting for about a minute:
+Here is the output of the execution after waiting for about a minute:
 ```r
 * Remaining time estimate (neglecting queue and merge time) (at
 * 2019-01-28 01:16:59): 0 mins (46.22883 secs per chunk)
@@ -665,55 +683,33 @@ Usually, in use cases with larger data inputs, it will be preferrable to add the
 
 As mentioned above in the definition of the `cluster` parameters, it is strongly recommended to check the section on "How to choose the number of chunks, jobs and cores".
 
-#### Compute() on the fat nodes and other HPCs
+You can find the `cluster` configuration for other HPCs at the end of this guide, in the section [Compute() cluster templates](#compute-cluster-templates).
 
-The `Compute()` call with the parameters to run the example in this section on the BSC ES fat nodes is provided below (you will need to adjust some of the parameters before using it). As you can see, the only thing that needs to be changed to execute startR on a different HPC is the definition of the `cluster` parameters.
-
-The `cluster` configuration for the fat nodes, CTE-Power 9, Marenostrum 4, Nord III, Minotauro and ECMWF cca/ccb are all provided at the very end of this guide.
-
-```r
-res <- Compute(wf,
-               chunks = list(latitude = 2,
-                             longitude = 2),
-               threads_load = 2,
-               threads_compute = 4,
-               cluster = list(queue_host = 'bsceslogin01.bsc.es',
-                              queue_type = 'slurm',
-                              temp_dir = '/home/Earth/nmanuben/startR_hpc/',
-                              cores_per_job = 2,
-                              job_wallclock = '00:10:00',
-                              max_jobs = 4,
-                              bidirectional = TRUE
-                             ),
-               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/')
-```
 
 ### Collect() and the EC-Flow GUI
 
 Usually, in use cases where large data inputs are involved, it is convenient to add the parameter `wait = FALSE` to your `Compute()` call. With this parameter, `Compute()` will immediately return an object with information about your startR execution. You will be able to store this object onto disk. After doing that, you will not need to worry in case your workstation turns off in the middle of the computation. You will be able to close your R session, and collect the results later on with the `Collect()` function.
 ```r
-res <- Compute(wf,
-               chunks = list(latitude = 2,
-                             longitude = 2),
-               threads_load = 2,
-               threads_compute = 4,
-               cluster = list(queue_host = 'p9login1.bsc.es',
-                              queue_type = 'slurm',
-                              temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
-                              r_module = 'R/3.5.0-foss-2018b',
-                              cores_per_job = 4,
-                              job_wallclock = '00:10:00',
-                              max_jobs = 4,
-                              extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
-                              bidirectional = FALSE,
-                              polling_period = 10
-                             ),
-               ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/',
-               wait = FALSE
-              )
-
-saveRDS(res, file = 'test_collect.Rds')
+res <- Compute(wf,
+               chunks = list(latitude = 2,
+                             longitude = 2),
+               threads_load = 2,
+               threads_compute = 4,
+               cluster = list(queue_host = 'nord4.bsc.es',
+                              queue_type = 'slurm',
+                              temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
+                              cores_per_job = 4,
+                              job_wallclock = '00:10:00',
+                              max_jobs = 4,
+                              extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
+                              bidirectional = FALSE,
+                              polling_period = 10
+                             ),
+               ecflow_suite_dir = '/home/Earth/aho/startR_local/',
+               wait = FALSE
+              )
+
+saveRDS(res, file = 'test_collect.Rds')
 ```
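+The descriptor saved above can be read back in any later R session to retrieve the results. The lines below are only a minimal sketch of that final step, assuming the `test_collect.Rds` file written above (see `?Collect` for the full list of arguments):
+
+```r
+# Sketch: read the execution descriptor saved above and wait for the results
+res <- readRDS('test_collect.Rds')
+result <- Collect(res, wait = TRUE)
+```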
 
 At this point, after storing the descriptor of the execution and before calling `Collect()`, you may want to visually check the status of the execution. You can do that with the EC-Flow graphical user interface. You need to open a new terminal, load the EC-Flow module if needed, and start the GUI:
@@ -909,10 +905,6 @@ res <- Compute(step, list(system4, erai),
                wait = FALSE)
 ```
 
-### Example of computation of weekly means
-
-### Example with data on an irregular grid with selection of a region
-
 ### Example on MareNostrum 4
 
 ```r
@@ -1017,13 +1009,43 @@ r <- Compute(wf,
 
 ## Compute() cluster templates
 
+### Nord3-v2
+
+```r
+cluster = list(queue_host = 'nord4.bsc.es',
+               queue_type = 'slurm',
+               temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
+               cores_per_job = 2,
+               job_wallclock = '01:00:00',
+               max_jobs = 4,
+               bidirectional = FALSE,
+               polling_period = 10
+              )
+```
+
+### Nord3 (deprecated)
+
+```r
+cluster = list(queue_host = 'nord1.bsc.es',
+               queue_type = 'lsf',
+               data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/',
+               temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
+               init_commands = list('module load intel/16.0.1'),
+               cores_per_job = 2,
+               job_wallclock = '00:10',
+               max_jobs = 4,
+               extra_queue_params = list('#BSUB -q bsc_es'),
+               bidirectional = FALSE,
+               polling_period = 10
+              )
+```
+
 ### CTE-Power9
 
 ```r
 cluster = list(queue_host = 'p9login1.bsc.es',
                queue_type = 'slurm',
                temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
-               r_module = 'R/3.5.0-foss-2018b',
                cores_per_job = 4,
                job_wallclock = '00:10:00',
                max_jobs = 4,
@@ -1032,7 +1054,7 @@ cluster = list(queue_host = 'p9login1.bsc.es',
               )
 ```
 
-### BSC ES fat nodes
+### BSC ES fat nodes (deprecated)
 
 ```r
 cluster = list(queue_host = 'bsceslogin01.bsc.es',
@@ -1062,25 +1084,6 @@ cluster = list(queue_host = 'mn2.bsc.es',
               )
 ```
 
-### Nord III
-
-```r
-cluster = list(queue_host = 'nord1.bsc.es',
-               queue_type = 'lsf',
-               data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/',
-               temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
-               init_commands = list('module load intel/16.0.1'),
-               r_module = 'R/3.3.0',
-               cores_per_job = 2,
-               job_wallclock = '00:10',
-               max_jobs = 4,
-               extra_queue_params = list('#BSUB -q bsc_es'),
-               bidirectional = FALSE,
-               polling_period = 10,
-               special_setup = 'marenostrum4'
-              )
-```
-
 ### MinoTauro
 
 ```r
 cluster = list(queue_host =