Wrong output when submitting jobs containing missing files to Nord4
The bug was found in ex2_7, which involves a missing file, /esarchive/exp/ecmwf/system4_m1/monthly_mean/tas_f6h/tas_20121101.nc. When run locally (on WS or in a Nord4 interactive session), the output is correct. However, when the job is submitted to Nord4, the NAs from the missing file are replaced by 0.
The problem lies in ObjectBigmemory, a Start() parameter and startR object attribute tied to the bigmemory package. The definition in the Start() documentation is:
#'@param ObjectBigmemory a character string to be included as part of the
#' bigmemory object name. This parameter is thought to be used internally by the
#' chunking capabilities of startR.
The corresponding code is in Start(), lines 3678-3689, as well as in load_process_save_chunk.R, lines 73-108.
What I've found so far is that, when running locally, the startR object has an ObjectBigmemory name like PMFAFdyzxytpFoMvspVWrzdv, but when submitted to Nord4, the name is like _2416136301_1_1_1_1_1_1_. I tried assigning ObjectBigmemory = "_2416136301_1_1_1_1_1_1_" in a Start() call with retrieve = TRUE, and the output became wrong. But we cannot specify ObjectBigmemory with retrieve = FALSE and submit the job to Nord4, because ObjectBigmemory is created only when the data is retrieved.
I know little about bigmemory, and I don't know yet whether the problem comes from the package or can be solved by changing the startR code. The bigmemory versions on Nord4 (v4.5.36) and WS (v4.5.33) differ, but since running in a Nord4 interactive session works without problems, I don't think the version is the cause. I haven't tested other cases with missing files, and I don't know whether this happens only on Nord4 or also on Nord3/Power9 (but we should not/cannot test on them now anyway).
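One thing that might help rule the version question in or out is recording which package versions are actually loaded in each environment (WS, Nord4 interactive session, and inside the submitted job). Below is a minimal sketch; installed_version() is a hypothetical helper written for this issue, not part of startR or bigmemory, and it only uses base R calls.

```r
# Hypothetical helper (not part of startR or bigmemory): returns the installed
# version of a package as a string, or NA if the package is not available.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Collect the versions of the two packages involved in this issue
versions <- sapply(c("startR", "bigmemory"), installed_version)
print(versions)

# The R version itself may also differ between environments
print(R.version.string)
```

Running the same lines in all three environments would show whether the submitted job really sees the same bigmemory version as the interactive session.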
Please let me know if you have any insights into this issue or any knowledge of bigmemory. Thanks! I leave a minimal example below.
library(startR)

repos <- paste0('/esarchive/exp/ecmwf/system4_m1/monthly_mean/',
                '$var$_f6h/$var$_$sdate$.nc')
# 20121101 is missing
sdates <- sapply(2012:2014, function(x) paste0(x, sprintf('%02d', 1:12), '01'))

exp <- Start(dat = repos,
             var = 'tas',
             sdate = sdates,
             time = indices(1),
             ensemble = 1:2,
             latitude = indices(1:2),
             longitude = indices(1:2),
             synonims = list(longitude = c('lon', 'longitude'),
                             latitude = c('lat', 'latitude')),
             return_vars = list(longitude = NULL, latitude = NULL),
             retrieve = FALSE)

func <- function(x) {
  return(x)
}
step <- Step(func, target_dims = c('ensemble'),
             output_dims = c('ensemble'))
wf <- AddStep(exp, step)

# run locally; correct output
res_ws <- Compute(wf, chunks = list(sdate = 2))

# run on nord4; WRONG output, NAs are replaced by 0
temp_dir <- '/gpfs/scratch/bsc32/bsc32734/startR_hpc/'
ecflow_suite_dir <- '/home/Earth/aho/startR_local/'
res_n4 <- Compute(wf,
                  chunks = list(sdate = 1),
                  threads_load = 2,
                  threads_compute = 4,
                  cluster = list(queue_host = 'nord4',
                                 queue_type = 'slurm',
                                 temp_dir = temp_dir,
                                 cores_per_job = 2,
                                 job_wallclock = '01:00:00',
                                 max_jobs = 4,
                                 bidirectional = FALSE,
                                 polling_period = 10
                  ),
                  ecflow_suite_dir = ecflow_suite_dir,
                  wait = TRUE
)
summary(res_ws$output1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  244.6   251.4   256.9   259.9   272.3   274.0       8
summary(res_n4$output1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0   251.1   256.6   252.7   270.1   274.0
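The summaries only show that NAs disappear and zeros appear. To check that the zeros sit exactly where the local run has NAs, something like the sketch below could be used. find_na_to_zero() is a hypothetical helper written for this issue (not a startR function), and the two small arrays are synthetic stand-ins for res_ws$output1 and res_n4$output1, so the code runs without access to the data.

```r
# Hypothetical helper (not part of startR): compares a reference array with a
# test array of the same dimensions and reports where NAs became zeros.
find_na_to_zero <- function(ref, test) {
  stopifnot(length(dim(ref)) == length(dim(test)),
            all(dim(ref) == dim(test)))
  na_in_ref <- is.na(ref)
  # Positions that are NA in the reference but exactly 0 in the test array
  replaced <- na_in_ref & !is.na(test) & test == 0
  list(
    n_na_ref   = sum(na_in_ref),
    n_replaced = sum(replaced),
    indices    = which(replaced, arr.ind = TRUE)
  )
}

# Synthetic stand-ins for res_ws$output1 (with NAs) and res_n4$output1 (zeros)
ref <- array(c(250.1, NA, 260.3, NA, 270.5, 255.2),
             dim = c(ensemble = 2, sdate = 3))
test <- ref
test[is.na(test)] <- 0

check <- find_na_to_zero(ref, test)
print(check$n_na_ref)    # number of NAs in the reference
print(check$n_replaced)  # number of those that are 0 in the test array
print(check$indices)     # array indices of the affected values
```

With the real results, the call would be find_na_to_zero(res_ws$output1, res_n4$output1). If n_replaced equals n_na_ref and the indices all fall on the 20121101 start date, the zeros come only from the missing file.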