Strategy for chunking a large dataset?
Hi,
@aho, @nperez, @amanriqu, @llledo, @rfernand, @lbaltasa, @bsolaraj, @swild I thought you might be interested in this, or might even be able to give me some advice.
I am trying to load, remap and compute the annual/seasonal averages of two large datasets (NCAR DPLE and LENS, each more than 90 GB). I already tried using only Start(), without chunking, but my job fails because of memory problems.
So I am trying to use the chunking functionality of startR. The function I include inside my Compute() call computes the annual/seasonal averages and remaps the data to an observational grid:
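# seas, lons_data, lats_data and resgrid are not arguments of fun();
# they are taken from the calling script.
fun <- function(x) {
  # Annual or seasonal average along the time dimension (position 2 of x)
  if (seas == "ANNUAL") {
    y <- Season(x, posdim = 2, monini = 1, moninf = 1, monsup = 12)
  } else if (seas == "DJF") {
    y <- Season(x, posdim = 2, monini = 1, moninf = 12, monsup = 2)
  } else if (seas == "MAM") {
    y <- Season(x, posdim = 2, monini = 1, moninf = 3, monsup = 5)
  } else if (seas == "JJA") {
    y <- Season(x, posdim = 2, monini = 1, moninf = 6, monsup = 8)
  } else if (seas == "SON") {
    y <- Season(x, posdim = 2, monini = 1, moninf = 9, monsup = 11)
  }
  # Bilinear remapping of the averaged field to the observational grid
  r <- s2dverification::CDORemap(y, lons_data, lats_data, resgrid, 'bil',
                                 crop = FALSE, force_remap = TRUE)
  # Return only the remapped array
  r$data_array
}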
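To show what the averaging step does to the dimensions, here is a small base R sketch of the ANNUAL case only. It is a toy: annual_mean_toy is a hypothetical stand-in and not Season(), the array sizes are made up, it assumes time is the first dimension and holds whole years of monthly values, and the remapping is left out.

```r
# Hypothetical stand-in for the ANNUAL branch of fun(); not Season() itself.
# Assumes 'time' is the first dimension and holds whole years of monthly values.
annual_mean_toy <- function(x, months_per_year = 12) {
  other_dims <- dim(x)[-1]
  n_years <- dim(x)[[1]] %/% months_per_year
  # Split the time axis into (month, year); months vary fastest in memory.
  y <- array(x, dim = c(months_per_year, n_years, unname(other_dims)))
  # Average over the month axis, keeping year and the remaining dimensions.
  out <- apply(y, seq(2, length(dim(y))), mean)
  dim(out) <- c(time = n_years, other_dims)
  out
}

# Toy input: 24 months on a 3 x 2 grid, where every grid point holds 1..24.
x_toy <- array(rep(1:24, times = 6),
               dim = c(time = 24, longitude = 3, latitude = 2))
res_toy <- annual_mean_toy(x_toy)
print(dim(res_toy))
```

The time dimension shrinks from 24 months to 2 years, while longitude and latitude are untouched.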
So this function only needs to operate on the time, longitude and latitude dimensions. I thought I would then chunk along the other dimensions, the ensemble members (ensemble) and the start dates (sdate). However, in the end, I need the output of my Compute() call to have all the initial dimensions (except the dat dimension, which has length 1 anyway): var, ensemble, sdate, time, longitude and latitude. I tried defining my Step() like this:
step <- Step(fun = fun,
             target_dims = c('var', 'time', 'longitude', 'latitude'),
             output_dims = c('var', 'ensemble', 'sdate', 'time', 'longitude', 'latitude'))
But obviously, this doesn't work, because I'm requesting dimensions in the output (ensemble and sdate) that haven't been used as input. If I include ensemble and sdate in target_dims, I get an error saying that I can't chunk along target dimensions.
Is there a solution to this? A way of chunking along ensemble and sdate while still obtaining an output that has all the initial dimensions? I'm currently testing this on my workstation, but the objective is then to run the chunking on Power9.
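To make the goal concrete, here is a base R toy of the behaviour I am after. It is not startR code: the sizes are made up, the var dimension is left out, fun_toy is a hypothetical stand-in for fun() (annual mean only, no remapping), and the explicit loop only mimics what I would hope the chunking does for me.

```r
# Toy array with made-up sizes; the dimension names follow the ones above.
dims <- c(ensemble = 4, sdate = 3, time = 12, longitude = 5, latitude = 2)
data_toy <- array(seq_len(prod(dims)), dim = dims)

# Hypothetical stand-in for fun(): acts only on (time, longitude, latitude)
# and collapses the 12 months into a single annual mean.
fun_toy <- function(x) {
  apply(x, c(2, 3), mean)  # result: longitude x latitude
}

# The output keeps ensemble and sdate; time shrinks to length 1.
out_dims <- c(dims[c("ensemble", "sdate")], time = 1,
              dims[c("longitude", "latitude")])
out_toy <- array(NA_real_, dim = out_dims)

# Chunk along ensemble (2 chunks of 2 members); only one chunk is
# extracted and processed at a time.
ens_chunks <- split(seq_len(dims[["ensemble"]]), rep(1:2, each = 2))
for (members in ens_chunks) {
  chunk <- data_toy[members, , , , , drop = FALSE]
  for (m in seq_along(members)) {
    for (s in seq_len(dims[["sdate"]])) {
      out_toy[members[m], s, 1, , ] <- fun_toy(chunk[m, s, , , ])
    }
  }
}
print(dim(out_toy))
```

Here the function only ever sees time, longitude and latitude, yet the reassembled result still carries ensemble and sdate. This is the combination I would like to obtain from Step() and Compute().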
Thanks a lot for your help!
Cheers,
Deborah