From 802ca2300b8da14dc1a09eb01ce5114dd36497f2 Mon Sep 17 00:00:00 2001
From: Nicolau Manubens
Date: Tue, 29 Jan 2019 19:03:44 +0100
Subject: [PATCH 1/6] Few fixes.

---
 inst/doc/practical_guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/inst/doc/practical_guide.md b/inst/doc/practical_guide.md
index 0aa31fe..b2ce45a 100644
--- a/inst/doc/practical_guide.md
+++ b/inst/doc/practical_guide.md
@@ -23,7 +23,7 @@ If you would like to start using startR rightaway on the BSC infrastructure, you
 
 ## Motivation
 
-What would you do if you had to apply a custom statistical analysis procedure to a 10TB climate data set? Probably, you would need to use a scripting language to write a procedure which is able to retrieve a subset of data from the file system (it would rarely be possible to handle all of it at once on a single node), code the procedure in that language, and apply it carefully and efficiently to the data. Afterwards, you would need to think of and develop a mechanism to dispatch the job mutiple times in parallel to an HPC of your choice, each of the jobs processing a different subset of the data set. You could do this by hand but, ideally, you would rather use EC-Flow or a similar general purpose workflow manager which would orchestrate the work for you. Also, it would allow you to visually monitor and control the progress, as well as keep an easy-to-understand record of what you did, in case you need to re-use it in the future. The mentioned solution, although it is the recommended way to go, is a demanding one and you could easily spend a few days until you get it running smoothly. Additionally, when developing the job script, you would be exposed to the difficulties of efficiently managing the data and applying the coded procedure to it.
+What would you do if you had to apply a custom statistical analysis procedure to a 10TB climate data set? Probably, you would need to use a scripting language to write a procedure which is able to retrieve a subset of data from the file system (it would rarely be possible to handle all of it at once on a single node), code the custom analysis procedure in that language, and apply it carefully and efficiently to the data. Afterwards, you would need to think of and develop a mechanism to dispatch the job mutiple times in parallel to an HPC of your choice, each of the jobs processing a different subset of the data set. You could do this by hand but, ideally, you would rather use EC-Flow or a similar general purpose workflow manager which would orchestrate the work for you. Also, it would allow you to visually monitor and control the progress, as well as keep an easy-to-understand record of what you did, in case you need to re-use it in the future. The mentioned solution, although it is the recommended way to go, is a demanding one and you could easily spend a few days until you get it running smoothly. Additionally, when developing the job script, you would be exposed to the difficulties of efficiently managing the data and applying the coded procedure to it.
 
 With the constant increase of resolution (in all possible dimensions) of weather and climate model output, and with the growing need for using computationally demanding analytical methodologies (e.g. bootstraping with thousands of repetitions), this kind of divide-and-conquer approach becomes indispensable.
 While tools exist to simplify and automate this complex procedure, they usually require adapting your data to specific formats, migrating to specific database systems, or an advanced knowledge of computer sciences or of specific programming languages or frameworks.
@@ -39,7 +39,7 @@ Other things you can expect from startR:
 Things that are not supposed to be done with startR:
 - Curating/homogenizing model output files or generating files to be stored under /esarchive following the department/community conventions. Although metadata is understood and used by startR, its handling is not 100% consistent yet.
 
-_**Note 1**_: The data files do not need to be migrated to a database system, nor have to comply any specific convention for their file names, name, number or order of the dimensions and variables, or distribution of files in the file system. Although only files in the NetCDF format are supported for now, plug-in functions (of approximately 200 lines of code) can be programmed to add support for additional file formats.
+_**Note 1**_: The data files do not need to be migrated to a database system, nor have to comply with any specific convention for their file names, name, number or order of the dimensions and variables, or distribution of files in the file system. Although only files in the NetCDF format are supported for now, plug-in functions (of approximately 200 lines of code) can be programmed to add support for additional file formats.
 
 _**Note 2**_: The HPCs startR is designed to run on are understood as multi-core multi-node clusters. startR relies on a shared file system across all HPC nodes, and does not implement any kind of distributed storage system for now.
-- GitLab

From fbff4167348aa0b4497cd7f3bc4edbf718bf905e Mon Sep 17 00:00:00 2001
From: Nicolau Manubens
Date: Fri, 1 Feb 2019 16:07:03 +0100
Subject: [PATCH 2/6] Small addition.

---
 inst/doc/practical_guide.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/inst/doc/practical_guide.md b/inst/doc/practical_guide.md
index b2ce45a..1513030 100644
--- a/inst/doc/practical_guide.md
+++ b/inst/doc/practical_guide.md
@@ -296,6 +296,7 @@ $sdate[[1]]
 If you are interested in actually loading the entire data set in your machine you can do so in two ways (_**be careful**_, loading the data involved in the `Start()` call in this example will most likely stall your machine. Try it with a smaller region or a subset of forecast time steps):
 - adding the parameter `retrieve = TRUE` in your `Start()` call.
 - evaluating the object returned by `Start()`: `data_load <- eval(data)`
+See the section on "How to choose the number of chunks, jobs and cores" for indications on working out the maximum amount of data that can be retrieved with a `Start()` call on a specific machine.
 
 You may realize that this functionality is similar to the `Load()` function in the s2dverification package. In fact, `Start()` is more advanced and flexible, although `Load()` is more mature and consistent for loading typical seasonal to decadal forecasting data. `Load()` will be adapted in the future to use `Start()` internally.
-- GitLab
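The note added in [PATCH 2/6] above refers to two ways of actually retrieving the declared data. The minimal sketch below is not part of the patch: the path pattern, variable name and dimension selectors are hypothetical placeholders, and only the `retrieve` parameter and the `eval()` step reflect the documented behaviour.

```r
library(startR)

# Hypothetical path pattern and dimension names; adapt them to your own files.
data <- Start(dat = '/path/to/exp/$var$/$var$_$sdate$.nc',
              var = 'tas',
              sdate = '19931101',
              time = 'all',
              latitude = indices(1:10),   # keep the selection small when testing
              longitude = indices(1:10),
              retrieve = FALSE)           # only a declaration; no data is read yet

# Option 1: add retrieve = TRUE to the Start() call above so the data is read
#           immediately.
# Option 2: evaluate the declaration returned by the retrieve = FALSE call:
data_load <- eval(data)
```

Either way, the warning in the guide applies: try a smaller region or a subset of forecast time steps before retrieving a large selection.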
From e985bb80ea25859d9d374133285a8ca01c6157f3 Mon Sep 17 00:00:00 2001
From: Nicolau Manubens
Date: Mon, 4 Mar 2019 11:48:32 +0100
Subject: [PATCH 3/6] Small fix in the fat nodes template.

---
 inst/doc/practical_guide.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/inst/doc/practical_guide.md b/inst/doc/practical_guide.md
index 1513030..0ae2806 100644
--- a/inst/doc/practical_guide.md
+++ b/inst/doc/practical_guide.md
@@ -1042,7 +1042,6 @@ cluster = list(queue_host = 'p9login1.bsc.es',
 ```r
 cluster = list(queue_host = 'bsceslogin01.bsc.es',
                queue_type = 'slurm',
-               temp_dir = '/home/Earth/nmanuben/startR_hpc/',
                cores_per_job = 2,
                job_wallclock = '00:10:00',
                max_jobs = 4,
-- GitLab

From 86fcb5a091aa4bb529610e3ec952a8590d16b9d3 Mon Sep 17 00:00:00 2001
From: Nicolau Manubens
Date: Tue, 23 Apr 2019 15:06:34 +0200
Subject: [PATCH 4/6] Added support for 'months since'.

---
 R/NcDataReader.R | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/R/NcDataReader.R b/R/NcDataReader.R
index d0455e5..8213128 100644
--- a/R/NcDataReader.R
+++ b/R/NcDataReader.R
@@ -157,6 +157,9 @@ NcDataReader <- function(file_path = NULL, file_object = NULL,
         units <- 'mins'
       } else if (units == 'day') {
         units <- 'days'
+      } else if (units %in% c('month', 'months')) {
+        result <- result * 30.5
+        units <- 'days'
       }
       new_array <- rep(as.POSIXct(parts[2]), length(result)) +
-- GitLab

From a9410dfbb5828cb5de337364b4b257848a21047b Mon Sep 17 00:00:00 2001
From: Nicolau Manubens
Date: Tue, 23 Apr 2019 15:21:13 +0200
Subject: [PATCH 5/6] Fix in installation steps.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 664e1c8..ab9968e 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ library(startR)
 Also, you can install the latest stable version from the GitLab repository as follows:
 
 ```r
-devtools::install_git('https://earth.bsc.es/gitlab/es/startR')
+devtools::install_git('https://earth.bsc.es/gitlab/es/startR.git')
 ```
 
 ### How it works
-- GitLab
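[PATCH 4/6] above adds support for NetCDF time axes whose units are declared as 'months since' an origin date, approximating each month as 30.5 days. The standalone sketch below only illustrates that approximation; the helper name is made up, since the package performs the conversion internally inside `NcDataReader()` rather than exposing a function.

```r
# Illustration of the 'months since <origin>' approximation: values are scaled
# to days (1 month ~ 30.5 days) and added to the origin as an offset in seconds.
months_since_to_posixct <- function(values, origin) {
  days <- values * 30.5
  as.POSIXct(origin, tz = 'UTC') + days * 24 * 60 * 60
}

months_since_to_posixct(0:3, '1993-11-01')
# Steps of roughly one calendar month; the fixed 30.5-day factor drifts slightly
# from true month boundaries over long time axes.
```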
From 79ab2b752ad0f45cd3e813a9bd508d9c5c15dcd4 Mon Sep 17 00:00:00 2001
From: Nicolau Manubens
Date: Tue, 23 Apr 2019 15:26:16 +0200
Subject: [PATCH 6/6] Bumped version to v0.1.2.

---
 DESCRIPTION       |   2 +-
 startR-manual.pdf | Bin 143986 -> 143983 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index c914a29..2a76526 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: startR
 Title: Automatically Retrieve Multidimensional Distributed Data Sets
-Version: 0.1.1
+Version: 0.1.2
 Authors@R: c(
     person("BSC-CNS", role = c("aut", "cph")),
     person("Nicolau", "Manubens", , "nicolau.manubens@bsc.es", role = c("aut", "cre")),
diff --git a/startR-manual.pdf b/startR-manual.pdf
index 37dadec8798bed8e656b3fb8714ae5a469ab8e5b..dac7899e68eedb61e1475b499a208d22950b4714 100644
GIT binary patch
delta 2629
[base85-encoded binary delta omitted]
delta 2610
[base85-encoded binary delta omitted]
-- GitLab