Start() error in Nord3: 'non-existent physical address'
This error was reported by @cdelgado (using Start() with num_procs = 1 and retrieve = T):
Loading required package: maps
* Exploring files... This will take a variable amount of time depending
* on the issued request and the performance of the file server...
* Detected dimension sizes:
* dataset: 1
* var: 1
* sdate: 59
* aux: 1
* lat: 128
* lon: 256
* fmonth: 122
* member: 10
* Total size of requested data:
* 1 x 1 x 59 x 1 x 128 x 256 x 122 x 10 x 8 bytes = 17.6 Gb
*** caught bus error ***
address 0x2b02fce87000, cause 'non-existent physical address'
Traceback:
1: CreateSharedMatrix(as.double(nrow), as.double(ncol), as.character(colnames), as.character(rownames), as.integer(typeVal), as.double(init), as.logical(separated))
2: bigmemory::big.matrix(nrow = prod(final_dims), ncol = 1)
3: Start(dataset = path_exp, var = variable, sdate = paste0(sdates), aux = "all", aux_depends = "sdate", lat = values(list(lat_min, lat_max)), lon = values(list(lon_min, lon_max)), fmonth = indices(fmonths), member = members, synonims = list(fmonth = c("fmonth", "time"), lon = c("lon", "longitude"), lat = c("lat", "latitude")), return_vars = list(lat = "dataset", lon = "dataset"), lat_reorder = Sort(decreasing = F), lon_reorder = CircularSort(0, 360), num_procs = num_procs_start_call, retrieve = retrieve)
An irrecoverable exception occurred. R is aborting now ...
/home/bsc32/bsc32924/.lsbatch/1603358172.1951216.shell: line 9: 19831 Bus error Rscript metrics/0_load_avg.R
He submitted the job to Nord3 with the bsub command. Here is the log:
------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -n 20
#BSUB -J miroc_pr
#BSUB -oo /esarchive/scratch/cdelgado/nord3_logs/%J.out
#BSUB -eo /esarchive/scratch/cdelgado/nord3_logs/%J.err
#BSUB -W 48:00
source ~/load_nord3_modules
Rscript metrics/0_load_avg.R
------------------------------------------------------------
Exited with exit code 135.
Resource usage summary:
CPU time : 16.87 sec.
Max Memory : 16498 MB
Average Memory : 57.00 MB
Total Requested Memory : 35000.00 MB
Delta Memory : 18502.00 MB
(Delta: the difference between Total Requested Memory and Max Memory.)
Max Processes : 4
Max Threads : 5
Job Energy Consumption : 0.000594 kWh
The output (if any) is above this job summary.
He also checked that the error does not occur when the requested data is smaller than 17 GB. With the 17 GB request, however, the code failed with 16 and 32 cores, and ran very slowly with -n 1. Could you confirm whether the job succeeded in that case, @cdelgado?
He also used a neat trick: he added a line to the job to make sure the temporary folder is empty, rm -r /dev/shm (this was safe because he had no other jobs submitted at the same time).
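Since the traceback ends in bigmemory::big.matrix and the workaround above involves /dev/shm, it may help to compare the size of the request with the free space in /dev/shm before calling Start(). The following is only a sketch under that assumption (that the shared matrix is backed by /dev/shm on the node); request_bytes is a hypothetical helper name, and the dimension sizes are copied from the "Detected dimension sizes" log above.

```shell
#!/bin/bash
# Sketch: compare the size of a Start() request with the free space in /dev/shm.
# Assumption: the shared big.matrix is allocated in /dev/shm on the compute node.

# Hypothetical helper: multiply all dimension sizes by 8 bytes (one double each).
request_bytes() {
  local total=8
  local d
  for d in "$@"; do
    total=$(( total * d ))
  done
  echo "$total"
}

# dataset, var, sdate, aux, lat, lon, fmonth, member (from the log above)
needed=$(request_bytes 1 1 59 1 128 256 122 10)
echo "Requested: $needed bytes"

# Free space in /dev/shm, reported by df in 1024-byte blocks (POSIX output format).
if [ -d /dev/shm ]; then
  avail_kb=$(df -Pk /dev/shm 2>/dev/null | awk 'NR==2 {print $4}')
  avail=$(( ${avail_kb:-0} * 1024 ))
  echo "Free in /dev/shm: $avail bytes"
  if [ "$avail" -lt "$needed" ]; then
    echo "WARNING: /dev/shm has less free space than the request needs" >&2
  fi
fi
```

For this request the helper gives 18869125120 bytes, which is the 17.6 Gb reported by Start(). If the warning is printed inside the job, reducing the request or splitting it into smaller Start() calls would be the first thing to try.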
Thanks for reporting this problem!
Núria