Start(): 'caught segfault' error when trying to load more than 16 GB of data on Nord3v2 highmem node
Hi @aho
@nmilders reported memory issues when loading experiment data (24 years x 6 time steps) and regridding it to ERA5 resolution (0.25° x 0.25°) using startR on Nord3v2, even on the high-memory nodes (128 GB available).
The size of the data in my own test was as follows:
* Detected dimension sizes:
* dat: 1
* var: 1
* sdate: 24
* time: 6
* latitude: 721
* longitude: 1440
* ensemble: 25
* Total size of involved data:
* 1 x 1 x 24 x 6 x 721 x 1440 x 25 x 8 bytes = 27.8 Gb
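As a quick sanity check of the figure above, the total can be recomputed from the reported dimension sizes. This is only a sketch: `dims`, `n_elements` and `size_bytes` are illustrative names, and it assumes 8 bytes per value, as in the Start() message. The "27.8 Gb" reported by Start() appears to be in binary units (GiB); in decimal units the same array is close to 29.9 GB.

```r
# Dimension sizes as reported by Start() above
dims <- c(dat = 1, var = 1, sdate = 24, time = 6,
          latitude = 721, longitude = 1440, ensemble = 25)

n_elements <- prod(dims)       # total number of values in the array
size_bytes <- n_elements * 8   # assuming 8 bytes per double-precision value

size_bytes / 1024^3            # size in binary gigabytes (GiB), about 27.85
size_bytes / 1e9               # size in decimal gigabytes (GB), about 29.9
```

Either way, the array is well below the 128 GB of the high-memory node, although the regridding step will need additional working memory on top of this.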
Here is a reproducible example:
library(startR)
# SEAS5 monthly data
hcst.path <- "/esarchive/exp/ecmwf/system5c3s/monthly_mean/$var$_f6h/$var$_$sdate$0301.nc"
# Regrid to ERA5 resolution
target.grid <- "/esarchive/recon/ecmwf/era5/monthly_mean/tas_f1h-r1440x721cds/tas_201805.nc"
# 24 years of hindcast
start.dates <- as.character(c(1993:2016))
# Region
lat.min <- -90
lat.max <- 90
lon.min <- 0
lon.max <- 359.9
hcst <- Start(dat = hcst.path,
              var = 'tas',
              sdate = start.dates,
              time = 1:6,
              latitude = values(list(lat.min, lat.max)),
              latitude_reorder = Sort(),
              longitude = values(list(lon.min, lon.max)),
              longitude_reorder = CircularSort(0, 360),
              transform = CDORemapper,
              transform_params = list(grid = target.grid,
                                      method = 'bilinear'),
              transform_vars = c('latitude', 'longitude'),
              synonims = list(latitude = c('lat', 'latitude'),
                              longitude = c('lon', 'longitude'),
                              ensemble = c('member', 'ensemble')),
              ensemble = 'all',
              metadata_dims = 'var',
              return_vars = list(latitude = 'dat',
                                 longitude = 'dat',
                                 time = 'sdate'),
              retrieve = TRUE)
Trying to load this results in the following error message:
* Progress: 0%[s01r2b17:3954448:0:3954448] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f47dff27ff8)
==== backtrace (tid:3954448) ====
0 0x00000000000534c9 ucs_debug_print_backtrace() ???:0
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000004b216 SetIndivVectorMatrixElements() ???:0
3 0x000000000004491f _bigmemory_SetIndivVectorMatrixElements() ???:0
4 0x00000000001015c0 R_doDotCall() initfini.c:0
5 0x0000000000142717 bcEval() eval.c:0
6 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
7 0x000000000014fe7f R_execClosure() eval.c:0
8 0x0000000000150df7 Rf_applyClosure() ???:0
9 0x000000000014409e bcEval() eval.c:0
10 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
11 0x000000000014fe7f R_execClosure() eval.c:0
12 0x0000000000150df7 Rf_applyClosure() ???:0
13 0x000000000014409e bcEval() eval.c:0
14 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
15 0x000000000014fe7f R_execClosure() eval.c:0
16 0x0000000000151295 R_execMethod() ???:0
17 0x00000000000045b9 R_dispatchGeneric() initfini.c:0
18 0x00000000001962ef do_standardGeneric() initfini.c:0
19 0x000000000013cbd8 bcEval() eval.c:0
20 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
21 0x000000000014fe7f R_execClosure() eval.c:0
22 0x0000000000150df7 Rf_applyClosure() ???:0
23 0x0000000000196e16 R_possible_dispatch() initfini.c:0
24 0x0000000000135e7d tryDispatch() eval.c:0
25 0x000000000013609f tryAssignDispatch() eval.c:0
26 0x000000000013a8a3 bcEval() eval.c:0
27 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
28 0x000000000014fe7f R_execClosure() eval.c:0
29 0x0000000000150df7 Rf_applyClosure() ???:0
30 0x000000000015378c R_forceAndCall() ???:0
31 0x0000000000084e92 do_lapply() initfini.c:0
32 0x0000000000191936 do_internal() initfini.c:0
33 0x000000000013af07 bcEval() eval.c:0
34 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
35 0x000000000014fe7f R_execClosure() eval.c:0
36 0x0000000000150df7 Rf_applyClosure() ???:0
37 0x000000000014409e bcEval() eval.c:0
38 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
39 0x000000000014ea22 forcePromise() eval.c:0
40 0x000000000014eee8 getvar() eval.c:0
41 0x0000000000142515 bcEval() eval.c:0
42 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
43 0x000000000014ea22 forcePromise() eval.c:0
44 0x000000000014eee8 getvar() eval.c:0
45 0x0000000000142515 bcEval() eval.c:0
46 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
47 0x000000000014fe7f R_execClosure() eval.c:0
48 0x0000000000150df7 Rf_applyClosure() ???:0
49 0x000000000014409e bcEval() eval.c:0
50 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
51 0x000000000014fe7f R_execClosure() eval.c:0
52 0x0000000000150df7 Rf_applyClosure() ???:0
53 0x000000000014409e bcEval() eval.c:0
54 0x000000000014e080 Rf_eval.localalias.34() eval.c:0
55 0x000000000014fe7f R_execClosure() eval.c:0
56 0x0000000000150df7 Rf_applyClosure() ???:0
57 0x000000000014e240 Rf_eval.localalias.34() eval.c:0
58 0x00000000001531ea do_set() initfini.c:0
=================================
*** caught segfault ***
address 0x1182003c5710, cause 'unknown'
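
The innermost non-R frames of the backtrace are in bigmemory (SetIndivVectorMatrixElements). One possible lead, which is an assumption and has not been checked against the bigmemory or startR source: the requested array has more elements than a 32-bit signed integer can index, and 2^31 double-precision values occupy exactly 16 GiB, which matches the threshold at which the problem appears. The sketch below only verifies that arithmetic; it does not prove that an index overflow is the cause.

```r
# Number of values in the requested array (dimension sizes from the Start() message)
n_elements <- prod(c(1, 1, 24, 6, 721, 1440, 25))

# Largest value representable as a 32-bit signed integer in R
.Machine$integer.max                  # 2147483647

# The array has more elements than a 32-bit index can address
n_elements > .Machine$integer.max     # TRUE

# Size of 2^31 doubles, in GiB
2^31 * 8 / 1024^3                     # 16
```

If this is related, a request just below 2^31 elements should load correctly and one just above it should fail, regardless of the memory available on the node.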