Earth Sciences / startR / Issues / #77
Closed
Open
Issue created Oct 23, 2020 by Nuria Pérez-Zanón (@nperez, Maintainer)

Start() error in Nord3: 'non-existent physical address'

This error was reported by @cdelgado (using Start with num_procs = 1 and retrieve = T):

Loading required package: maps
* Exploring files... This will take a variable amount of time depending
*   on the issued request and the performance of the file server...
* Detected dimension sizes:
*   dataset:   1
*       var:   1
*     sdate:  59
*       aux:   1
*       lat: 128
*       lon: 256
*    fmonth: 122
*    member:  10
* Total size of requested data:
*   1 x 1 x 59 x 1 x 128 x 256 x 122 x 10 x 8 bytes = 17.6 Gb
 *** caught bus error ***
address 0x2b02fce87000, cause 'non-existent physical address'
Traceback:
 1: CreateSharedMatrix(as.double(nrow), as.double(ncol), as.character(colnames),     as.character(rownames), as.integer(typeVal), as.double(init),     as.logical(separated))
 2: bigmemory::big.matrix(nrow = prod(final_dims), ncol = 1)
 3: Start(dataset = path_exp, var = variable, sdate = paste0(sdates),     aux = "all", aux_depends = "sdate", lat = values(list(lat_min,         lat_max)), lon = values(list(lon_min, lon_max)), fmonth = indices(fmonths),     member = members, synonims = list(fmonth = c("fmonth", "time"),         lon = c("lon", "longitude"), lat = c("lat", "latitude")),     return_vars = list(lat = "dataset", lon = "dataset"), lat_reorder = Sort(decreasing = F),     lon_reorder = CircularSort(0, 360), num_procs = num_procs_start_call,     retrieve = retrieve)
An irrecoverable exception occurred. R is aborting now ...
/home/bsc32/bsc32924/.lsbatch/1603358172.1951216.shell: line 9: 19831 Bus error               Rscript metrics/0_load_avg.R

He submitted the job to Nord3 with the bsub command. Here is the log:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -n 20
#BSUB -J miroc_pr
#BSUB -oo /esarchive/scratch/cdelgado/nord3_logs/%J.out
#BSUB -eo /esarchive/scratch/cdelgado/nord3_logs/%J.err
#BSUB -W 48:00
source ~/load_nord3_modules
Rscript metrics/0_load_avg.R
------------------------------------------------------------
Exited with exit code 135.
Resource usage summary:
    CPU time :               16.87 sec.
    Max Memory :             16498 MB
    Average Memory :         57.00 MB
    Total Requested Memory : 35000.00 MB
    Delta Memory :           18502.00 MB
    (Delta: the difference between Total Requested Memory and Max Memory.)
    Max Processes :          4
    Max Threads :            5
    Job Energy Consumption : 0.000594 kWh
The output (if any) is above this job summary.

He also checked that the error does not occur when the requested data is smaller than 17 GB. With the 17 GB request, however, the code failed with 16 and 32 cores, while it runs very slowly with -n 1. Could you confirm whether the jobs succeeded in that case, @cdelgado?
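As a rough aid for reproducing the size threshold, the sketch below recomputes the requested size from the detected dimensions and compares it with the free space in /dev/shm. It rests on an assumption not confirmed in this report: that the shared matrix created by bigmemory is backed by /dev/shm and that running out of space there is what triggers the bus error. The helper names `required_bytes` and `shm_free_bytes` are hypothetical.

```bash
#!/bin/bash
# Sketch only: estimate the bytes needed by the request and compare them
# with the free space in /dev/shm (assumed to back the shared matrix).

# Multiply all dimension lengths by 8 bytes (one double per element).
required_bytes() {
  local total=8
  local d
  for d in "$@"; do
    total=$(( total * d ))
  done
  echo "$total"
}

# Free bytes on the file system that holds the given directory.
shm_free_bytes() {
  local kb
  kb=$(df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $4 }')
  echo $(( kb * 1024 ))
}

# Dimension sizes taken from the Start() output in this issue.
need=$(required_bytes 1 1 59 1 128 256 122 10)
echo "required: $need bytes"

if [ -d /dev/shm ]; then
  free=$(shm_free_bytes /dev/shm)
  if [ "$need" -gt "$free" ]; then
    echo "warning: /dev/shm has only $free bytes free" >&2
  fi
fi
```

For the dimensions above the product is 18869125120 bytes, which is about 17.6 GiB and matches the size reported by Start().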

He also used a clever trick: he added a line to the job, rm -r /dev/shm, to make sure the temporary folder is empty (he could do this because he had no other jobs submitted at the same time).
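A more conservative variant of that cleanup step could look like the sketch below. It removes only the calling user's entries and leaves the directory itself in place. The helper name `clean_shm` is hypothetical, and whether this is appropriate on Nord3 compute nodes is an assumption; it is still only safe when no other job of the same user is using shared memory on that node.

```bash
#!/bin/bash
# Sketch only: empty a shared-memory directory without removing the
# directory itself and without touching other users' files.
clean_shm() {
  local dir="${1:-/dev/shm}"
  # -mindepth 1 keeps the directory itself;
  # -user restricts deletion to entries owned by the caller.
  find "$dir" -mindepth 1 -user "$(id -un)" -delete
}

# In the job script this would be called before the R script, e.g.:
#   clean_shm /dev/shm
#   Rscript metrics/0_load_avg.R
```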

Thanks for reporting this problem!

Núria
