Commits (10)
Package: startR
Title: Automatically Retrieve Multidimensional Distributed Data Sets
-Version: 0.1.1
+Version: 0.1.2
Authors@R: c(
person("BSC-CNS", role = c("aut", "cph")),
person("Nicolau", "Manubens", , "nicolau.manubens@bsc.es", role = c("aut", "cre")),
......
......@@ -157,6 +157,9 @@ NcDataReader <- function(file_path = NULL, file_object = NULL,
units <- 'mins'
} else if (units == 'day') {
units <- 'days'
+} else if (units %in% c('month', 'months')) {
+  result <- result * 30.5
+  units <- 'days'
}
new_array <- rep(as.POSIXct(parts[2]), length(result)) +
......
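The added branch above handles 'month'/'months' time units by approximating a month as 30.5 days before offsetting the reference date that follows 'since'. A standalone sketch of the same idea, with made-up inputs and the elided continuation of the `new_array` line reconstructed as an assumption:

```r
# Hedged sketch of the unit handling added in the hunk above; 'units_string'
# and 'offsets' are made-up example inputs, not startR internals, and the
# as.difftime() continuation is an assumption about the elided code.
units_string <- 'months since 1990-01-01'
offsets <- c(0, 1, 2, 3)                  # offsets in the original units
parts <- strsplit(units_string, ' since ')[[1]]
units <- parts[1]
result <- offsets
if (units %in% c('month', 'months')) {
  result <- result * 30.5                 # approximate one month as 30.5 days
  units <- 'days'
}
new_array <- rep(as.POSIXct(parts[2]), length(result)) +
  as.difftime(result, units = units)
new_array                                 # POSIXct dates roughly one month apart
```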
......@@ -33,7 +33,7 @@ library(startR)
Also, you can install the latest stable version from the GitLab repository as follows:
```r
-devtools::install_git('https://earth.bsc.es/gitlab/es/startR')
+devtools::install_git('https://earth.bsc.es/gitlab/es/startR.git')
```
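If `devtools` itself is not available yet, a minimal end-to-end variant (nothing beyond standard CRAN usage is assumed here):

```r
# Install devtools from CRAN first, then pull startR from its GitLab repository.
install.packages('devtools')
devtools::install_git('https://earth.bsc.es/gitlab/es/startR.git')
```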
### How it works
......
......@@ -23,7 +23,7 @@ If you would like to start using startR rightaway on the BSC infrastructure, you
## Motivation
-What would you do if you had to apply a custom statistical analysis procedure to a 10TB climate data set? Probably, you would need to use a scripting language to write a procedure which is able to retrieve a subset of data from the file system (it would rarely be possible to handle all of it at once on a single node), code the procedure in that language, and apply it carefully and efficiently to the data. Afterwards, you would need to think of and develop a mechanism to dispatch the job multiple times in parallel to an HPC of your choice, each of the jobs processing a different subset of the data set. You could do this by hand but, ideally, you would rather use EC-Flow or a similar general-purpose workflow manager which would orchestrate the work for you. Also, it would allow you to visually monitor and control the progress, as well as keep an easy-to-understand record of what you did, in case you need to re-use it in the future. The mentioned solution, although it is the recommended way to go, is a demanding one and you could easily spend a few days until you get it running smoothly. Additionally, when developing the job script, you would be exposed to the difficulties of efficiently managing the data and applying the coded procedure to it.
+What would you do if you had to apply a custom statistical analysis procedure to a 10TB climate data set? Probably, you would need to use a scripting language to write a procedure which is able to retrieve a subset of data from the file system (it would rarely be possible to handle all of it at once on a single node), code the custom analysis procedure in that language, and apply it carefully and efficiently to the data. Afterwards, you would need to think of and develop a mechanism to dispatch the job multiple times in parallel to an HPC of your choice, each of the jobs processing a different subset of the data set. You could do this by hand but, ideally, you would rather use EC-Flow or a similar general-purpose workflow manager which would orchestrate the work for you. Also, it would allow you to visually monitor and control the progress, as well as keep an easy-to-understand record of what you did, in case you need to re-use it in the future. The mentioned solution, although it is the recommended way to go, is a demanding one and you could easily spend a few days until you get it running smoothly. Additionally, when developing the job script, you would be exposed to the difficulties of efficiently managing the data and applying the coded procedure to it.
With the constant increase in resolution (in all possible dimensions) of weather and climate model output, and with the growing need for using computationally demanding analytical methodologies (e.g. bootstrapping with thousands of repetitions), this kind of divide-and-conquer approach becomes indispensable. While tools exist to simplify and automate this complex procedure, they usually require adapting your data to specific formats, migrating to specific database systems, or advanced knowledge of computer science or of specific programming languages or frameworks.
......@@ -39,7 +39,7 @@ Other things you can expect from startR:
Things that are not supposed to be done with startR:
- Curating/homogenizing model output files or generating files to be stored under /esarchive following the department/community conventions. Although metadata is understood and used by startR, its handling is not 100% consistent yet.
-_**Note 1**_: The data files do not need to be migrated to a database system, nor have to comply any specific convention for their file names, name, number or order of the dimensions and variables, or distribution of files in the file system. Although only files in the NetCDF format are supported for now, plug-in functions (of approximately 200 lines of code) can be programmed to add support for additional file formats.
+_**Note 1**_: The data files do not need to be migrated to a database system, nor have to comply with any specific convention for their file names, name, number or order of the dimensions and variables, or distribution of files in the file system. Although only files in the NetCDF format are supported for now, plug-in functions (of approximately 200 lines of code) can be programmed to add support for additional file formats.
_**Note 2**_: The HPCs startR is designed to run on are understood as multi-core multi-node clusters. startR relies on a shared file system across all HPC nodes, and does not implement any kind of distributed storage system for now.
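Note 1 above mentions that plug-in functions can add support for further file formats. Purely as an illustration of what such a reader might look like, here is a hypothetical sketch for plain CSV files; only the `file_path` and `file_object` arguments echo the partial `NcDataReader` signature visible further up, and everything else (the `inner_indices` argument, the return shape) is an assumption rather than the actual startR plug-in interface.

```r
# Hypothetical file-format plug-in sketch; not the real startR API.
CsvDataReader <- function(file_path = NULL, file_object = NULL,
                          inner_indices = NULL) {
  # Read the whole table; a real plug-in would read only the requested slice.
  tab <- if (is.null(file_object)) read.csv(file_path) else file_object
  data <- as.matrix(tab)
  # Keep only the requested rows/columns, if any were asked for (the structure
  # of 'inner_indices' is assumed here for illustration).
  if (!is.null(inner_indices)) {
    data <- data[inner_indices$row, inner_indices$col, drop = FALSE]
  }
  # startR works with arrays whose dimensions are named.
  dim(data) <- c(row = nrow(data), col = ncol(data))
  data
}
```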
......@@ -296,6 +296,7 @@ $sdate[[1]]
If you are interested in actually loading the entire data set in your machine you can do so in two ways (_**be careful**_, loading the data involved in the `Start()` call in this example will most likely stall your machine. Try it with a smaller region or a subset of forecast time steps):
- adding the parameter `retrieve = TRUE` in your `Start()` call.
- evaluating the object returned by `Start()`: `data_load <- eval(data)` (both options are sketched below)
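A minimal sketch of both options; the path pattern, variable name, start date and dimension selectors are placeholders, not the data set used in this example:

```r
library(startR)

# Option 1: retrieve the data into memory directly from the Start() call.
data_load <- Start(dat = '/path/to/files/$var$_$sdate$.nc',  # placeholder pattern
                   var = 'tas',
                   sdate = '19930101',
                   time = 'all',
                   latitude = 'all',
                   longitude = 'all',
                   retrieve = TRUE)

# Option 2: only describe the data set first, then evaluate the returned
# object when the data is actually needed.
data <- Start(dat = '/path/to/files/$var$_$sdate$.nc',
              var = 'tas',
              sdate = '19930101',
              time = 'all',
              latitude = 'all',
              longitude = 'all',
              retrieve = FALSE)
data_load <- eval(data)
```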
See the section on "How to choose the number of chunks, jobs and cores" for guidance on working out the maximum amount of data that can be retrieved with a `Start()` call on a specific machine.
You may realize that this functionality is similar to the `Load()` function in the s2dverification package. In fact, `Start()` is more advanced and flexible, although `Load()` is more mature and consistent for loading typical seasonal to decadal forecasting data. `Load()` will be adapted in the future to use `Start()` internally.
......@@ -1041,7 +1042,6 @@ cluster = list(queue_host = 'p9login1.bsc.es',
```r
cluster = list(queue_host = 'bsceslogin01.bsc.es',
queue_type = 'slurm',
temp_dir = '/home/Earth/nmanuben/startR_hpc/',
cores_per_job = 2,
job_wallclock = '00:10:00',
max_jobs = 4,
......
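# (Hedged sketch, not part of the original snippet: a cluster configuration
#  like the one above is typically supplied to Compute() together with a
#  chunking specification. Here 'wf' is assumed to be a startR workflow
#  defined beforehand, 'cluster' is assumed to hold the list shown above,
#  and the chunked dimensions and ecFlow suite directory are placeholders.)
res <- Compute(wf,
               chunks = list(latitude = 2, longitude = 2),
               cluster = cluster,
               ecflow_suite_dir = '/home/Earth/<user>/startR_local/')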