Commit 372a08c3 authored by aho

Correct the cluster config of Nord3v2 and prioritize Nord3v2 in the guideline.

parent 111a22ff
Pipeline #7323 passed in 60 minutes and 33 seconds
......@@ -366,7 +366,7 @@ It is not possible for now to define workflows with more than one step, but this
Once the data sources are declared and the workflow is defined, you can proceed to specify the execution parameters (including which platform to run on) and trigger the execution with the `Compute()` function.
Next, a few examples are shown with `Compute()` calls to trigger the processing of a dataset locally (only on the machine where the R session is running) and on different HPCs (the Earth Sciences fat nodes, CTE-Power9 and others). However, let's first define a `Start()` call that involves a smaller subset of data in order not to make the examples too heavy.
Next, a few examples are shown with `Compute()` calls to trigger the processing of a dataset locally (only on the machine where the R session is running) and on different HPCs (Nord3-v2, CTE-Power9 and others). However, let's first define a `Start()` call that involves a smaller subset of data in order not to make the examples too heavy.
```r
library(startR)
......@@ -561,34 +561,38 @@ res <- Compute(wf,
* max: 8.03660178184509
```
#### Compute() on CTE-Power 9
#### Compute() on HPCs
In order to run the computation on an HPC, such as the BSC CTE-Power 9, you will need to make sure the passwordless connection with the login node of that HPC is configured, as shown at the beginning of this guide, if possible in both directions. You will also need to know whether there is a shared file system between your workstation and that HPC, and you will need information on the number of nodes, cores per node, threads per core, RAM per node, and the workload manager used by that HPC (Slurm, PBS and LSF are supported).
In order to run the computation on an HPC, you will need to make sure the passwordless connection with the login node of that HPC is configured, as shown at the beginning of this guide, if possible in both directions. You will also need to know whether there is a shared file system between your workstation and that HPC, and you will need information on the number of nodes, cores per node, threads per core, RAM per node, and the workload manager used by that HPC (Slurm, PBS and LSF are supported).
You will need to add two parameters to your `Compute()` call: `cluster` and `ecflow_suite_dir`.
The parameter `ecflow_suite_dir` expects a path to a folder on the workstation where the temporary files generated for the automatic management of the workflow will be stored. As you will see later, the EC-Flow workflow manager is used transparently for this purpose.
The parameter `cluster` expects a list with a number of components that have to be provided slightly differently depending on the HPC you want to run on. Below is an example cluster configuration that will execute the previously defined workflow on CTE-Power 9.
The parameter `cluster` expects a list with a number of components that have to be provided slightly differently depending on the HPC you want to run on. Below is an example cluster configuration that will execute the previously defined workflow on Nord3-v2.
```r
res <- Compute(wf,
chunks = list(latitude = 2,
longitude = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = 'p9login1.bsc.es',
queue_type = 'slurm',
temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
r_module = 'R/3.5.0-foss-2018b',
cores_per_job = 4,
job_wallclock = '00:10:00',
max_jobs = 4,
extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
bidirectional = FALSE,
polling_period = 10
),
ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/'
)
# user-defined
temp_dir <- '/gpfs/scratch/bsc32/bsc32734/startR_hpc/'
ecflow_suite_dir <- '/home/Earth/aho/startR_local/'
res <- Compute(wf,
chunks = list(latitude = 2,
longitude = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = 'nord4.bsc.es',
queue_type = 'slurm',
temp_dir = temp_dir,
cores_per_job = 4,
job_wallclock = '00:10:00',
max_jobs = 4,
extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
bidirectional = FALSE,
polling_period = 10
),
ecflow_suite_dir = ecflow_suite_dir,
wait = TRUE
)
```
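As a quick reference, here is a hedged, comment-annotated version of the `cluster` list from the example above (descriptions summarized from the startR documentation):

```r
# Hedged annotation of the cluster components used in the example above
# (descriptions summarized from the startR documentation):
cluster <- list(
  queue_host = 'nord4.bsc.es',   # hostname of the HPC login node to ssh to
  queue_type = 'slurm',          # workload manager: 'slurm', 'pbs' or 'lsf'
  temp_dir = temp_dir,           # scratch folder on the HPC for temporary chunk files
  cores_per_job = 4,             # cores requested by each chunk job
  job_wallclock = '00:10:00',    # wall-clock limit for each chunk job
  max_jobs = 4,                  # maximum number of chunk jobs in the queue at once
  extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),  # extra scheduler directives
  bidirectional = FALSE,         # whether the HPC can open connections back to the workstation
  polling_period = 10            # seconds between checks for finished jobs
)
```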
The cluster components and options are explained next:
......@@ -619,15 +623,15 @@ server is already started
At this point, you may want to check that the jobs are being dispatched and executed properly on the HPC. For that, you can either use the EC-Flow GUI (covered in the next section), or you can `ssh` to the login node of the HPC and check the status of the queue with `squeue` or `qstat`, as shown below.
```
[bsc32473@p9login1 ~]$ squeue
[bsc32734@login4 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1142418 main /STARTR_ bsc32473 R 0:12 1 p9r3n08
1142419 main /STARTR_ bsc32473 R 0:12 1 p9r3n08
1142420 main /STARTR_ bsc32473 R 0:12 1 p9r3n08
1142421 main /STARTR_ bsc32473 R 0:12 1 p9r3n08
757026 main /STARTR_ bsc32734 R 0:46 1 s02r2b24
757027 main /STARTR_ bsc32734 R 0:46 1 s04r1b61
757028 main /STARTR_ bsc32734 R 0:46 1 s04r1b63
757029 main /STARTR_ bsc32734 R 0:46 1 s04r1b64
```
Here is the output of the execution on CTE-Power 9 after waiting for about a minute:
Here is the output of the execution after waiting for about a minute:
```r
* Remaining time estimate (neglecting queue and merge time) (at
* 2019-01-28 01:16:59): 0 mins (46.22883 secs per chunk)
......@@ -675,55 +679,33 @@ Usually, in use cases with larger data inputs, it will be preferable to add the
As mentioned above in the definition of the `cluster` parameters, it is strongly recommended to check the section on "How to choose the number of chunks, jobs and cores".
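As a rough, hedged orientation (not a substitute for that section), the relationship between these quantities for the example above is sketched below:

```r
# Hedged back-of-the-envelope for the example above: the total number of
# chunks is the product of the requested chunks per dimension, and max_jobs
# bounds how many chunk jobs can be queued on the HPC at the same time.
chunks <- list(latitude = 2, longitude = 2)
total_chunks <- prod(unlist(chunks))                       # 2 x 2 = 4 chunks
max_jobs <- 4                                              # up to 4 chunk jobs at once
cores_per_job <- 4                                         # cores requested per chunk job
peak_cores <- min(total_chunks, max_jobs) * cores_per_job  # up to 16 cores in use
```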
#### Compute() on the fat nodes and other HPCs
The `Compute()` call with the parameters to run the example in this section on the BSC ES fat nodes is provided below (you will need to adjust some of the parameters before using it). As you can see, the only thing that needs to be changed to execute startR on a different HPC is the definition of the `cluster` parameters.
You can find the `cluster` configuration for other HPCs at the end of this guide, in the section [Compute() cluster templates](#compute-cluster-templates).
The `cluster` configurations for the fat nodes, CTE-Power 9, Marenostrum 4, Nord3-v2, Minotauro and ECMWF cca/ccb are all provided at the very end of this guide.
```r
res <- Compute(wf,
chunks = list(latitude = 2,
longitude = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = 'bsceslogin01.bsc.es',
queue_type = 'slurm',
temp_dir = '/home/Earth/nmanuben/startR_hpc/',
cores_per_job = 2,
job_wallclock = '00:10:00',
max_jobs = 4,
bidirectional = TRUE
),
ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/')
```
### Collect() and the EC-Flow GUI
Usually, in use cases where large data inputs are involved, it is convenient to add the parameter `wait = FALSE` to your `Compute()` call. With this parameter, `Compute()` will immediately return an object with information about your startR execution. You can store this object on disk, and after doing that you will not need to worry if your workstation turns off in the middle of the computation: you can close your R session and collect the results later on with the `Collect()` function.
```r
res <- Compute(wf,
chunks = list(latitude = 2,
longitude = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = 'p9login1.bsc.es',
queue_type = 'slurm',
temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
r_module = 'R/3.5.0-foss-2018b',
cores_per_job = 4,
job_wallclock = '00:10:00',
max_jobs = 4,
extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
bidirectional = FALSE,
polling_period = 10
),
ecflow_suite_dir = '/home/Earth/nmanuben/startR_local/',
wait = FALSE
)
saveRDS(res, file = 'test_collect.Rds')
res <- Compute(wf,
chunks = list(latitude = 2,
longitude = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = 'nord4.bsc.es',
queue_type = 'slurm',
temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
cores_per_job = 4,
job_wallclock = '00:10:00',
max_jobs = 4,
extra_queue_params = list('#SBATCH --mem-per-cpu=3000'),
bidirectional = FALSE,
polling_period = 10
),
ecflow_suite_dir = '/home/Earth/aho/startR_local/',
wait = FALSE
)
saveRDS(res, file = 'test_collect.Rds')
```
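Later on, possibly from a new R session, the results could be gathered along these lines (a hedged sketch using the saved descriptor from above):

```r
# Hedged sketch: read the saved execution descriptor back and gather the
# results with Collect(); wait = TRUE blocks until all chunks have finished.
library(startR)
collect_info <- readRDS('test_collect.Rds')
res <- Collect(collect_info, wait = TRUE)
```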
At this point, after storing the descriptor of the execution and before calling `Collect()`, you may want to visually check the status of the execution. You can do that with the EC-Flow graphical user interface. You need to open a new terminal, load the EC-Flow module if needed, and start the GUI:
......@@ -1027,13 +1009,52 @@ r <- Compute(wf,
## Compute() cluster templates
### Nord3-v2
```r
res <- Compute(wf,
chunks = list(latitude = 2,
longitude = 2),
threads_load = 2,
threads_compute = 4,
cluster = list(queue_host = 'nord4.bsc.es',
queue_type = 'slurm',
temp_dir = '/gpfs/scratch/bsc32/bsc32734/startR_hpc/',
cores_per_job = 2,
job_wallclock = '01:00:00',
max_jobs = 4,
bidirectional = FALSE,
polling_period = 10
),
ecflow_suite_dir = '/home/Earth/aho/startR_local/',
wait = TRUE
)
```
### Nord3 (deprecated)
```r
cluster = list(queue_host = 'nord1.bsc.es',
queue_type = 'lsf',
data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/',
temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
init_commands = list('module load intel/16.0.1'),
cores_per_job = 2,
job_wallclock = '00:10',
max_jobs = 4,
extra_queue_params = list('#BSUB -q bsc_es'),
bidirectional = FALSE,
polling_period = 10,
special_setup = 'marenostrum4'
)
```
### CTE-Power9
```r
cluster = list(queue_host = 'p9login1.bsc.es',
queue_type = 'slurm',
temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
r_module = 'R/3.5.0-foss-2018b',
cores_per_job = 4,
job_wallclock = '00:10:00',
max_jobs = 4,
......@@ -1042,7 +1063,7 @@ cluster = list(queue_host = 'p9login1.bsc.es',
)
```
### BSC ES fat nodes
### BSC ES fat nodes (deprecated)
```r
cluster = list(queue_host = 'bsceslogin01.bsc.es',
......@@ -1072,25 +1093,6 @@ cluster = list(queue_host = 'mn2.bsc.es',
)
```
### Nord3-v2
```r
cluster = list(queue_host = 'nord4.bsc.es',
queue_type = 'lsf',
data_dir = '/gpfs/projects/bsc32/share/startR_data_repos/',
temp_dir = '/gpfs/scratch/bsc32/bsc32473/startR_hpc/',
init_commands = list('module load intel/16.0.1'),
r_module = 'R/3.3.0',
cores_per_job = 2,
job_wallclock = '00:10',
max_jobs = 4,
extra_queue_params = list('#BSUB -q bsc_es'),
bidirectional = FALSE,
polling_period = 10,
special_setup = 'marenostrum4'
)
```
### MinoTauro
```r
......