Performance of CST_Anomaly() with large datasets
I had to run the Verification Suite with global data on Nord3v2 and noticed that CST_Anomaly() was very slow, particularly when the anomalies are computed in cross-validation: it took several hours to finish with 1.7 GB of hindcast data, even though I had requested multiple cores.
I then noticed that the parameter ncores is missing from the calls to s2dv::Ano_CrossValid() and s2dv::Clim() inside CST_Anomaly(). I have added it to both calls in the branch dev-CST_Anomaly-ncores.
I have run a simple test with this sample data and 12 cores on the medmem nodes in Nord3v2:
obs_array <- rnorm(9383040)
dim(obs_array) <- c(sdate = 24, ftime = 6, lat = 181, lon = 360, ensemble = 1)
exp_array <- rnorm(9383040*25)
dim(exp_array) <- c(sdate = 24, ftime = 6, lat = 181, lon = 360, ensemble = 25)
You can find the full script in my personal GitLab.
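For reference, the timings below are in the format R prints for a difftime object. A minimal sketch of how such a measurement can be taken follows. It is not the actual script: the arrays are shrunk so it runs in seconds, and `compute_anomaly()` is a hypothetical base-R stand-in for the CST_Anomaly() call.

```r
# Shrunk version of the sample data (lat/lon reduced so this runs quickly).
set.seed(1)
dims <- c(sdate = 24, ftime = 6, lat = 4, lon = 5, ensemble = 25)
exp_small <- array(rnorm(prod(dims)), dim = dims)

# Hypothetical stand-in for the CST_Anomaly() call (assumption, not the real
# function): subtract the climatology, taken here as the mean over sdate and
# ensemble, from every value.
compute_anomaly <- function(x) {
  clim <- apply(x, c(2, 3, 4), mean)  # one value per ftime, lat, lon
  sweep(x, c(2, 3, 4), clim)          # default FUN is "-", so this gives x - clim
}

# Wall-clock timing: the difference of two Sys.time() calls is a difftime,
# which prints as "Time difference of ... secs" (or mins, hours).
start <- Sys.time()
ano_small <- compute_anomaly(exp_small)
elapsed <- Sys.time() - start
print(elapsed)
```

In the real test, the timed expression would be the CST_Anomaly() call on the full-size arrays.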
For cross = F, we have:
- master branch: "Time difference of 3.49159 mins"
- new fix: "Time difference of 43.56318 secs"
For cross = T:
- master branch: at least 2 hours (the run did not finish before my session ended)
- new fix: "Time difference of 10.94001 mins"
While this already seems to improve the situation, I am still surprised by the huge difference between cross = T and cross = F. Do you think it might be worth investigating whether the performance of Ano_CrossValid() can be improved?
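One possible direction, offered as a sketch and not as a description of what Ano_CrossValid() does internally: a leave-one-out climatology does not have to be recomputed from scratch for every start date, because the mean without year i equals (sum - x_i) / (n - 1). The base-R example below compares a naive loop with that shortcut on a toy series. `loo_clim_loop()` and `loo_clim_fast()` are hypothetical helper names, and the sketch assumes there are no missing values.

```r
# Toy series: one grid point, one lead time, 24 start dates.
set.seed(42)
x <- rnorm(24)

# Naive cross-validation: recompute the mean once per start date,
# each time leaving that start date out.
loo_clim_loop <- function(x) {
  vapply(seq_along(x), function(i) mean(x[-i]), numeric(1))
}

# Shortcut: reuse the total sum, so each left-out mean costs one
# subtraction and one division instead of a full pass over the data.
# Assumes no NA values in x.
loo_clim_fast <- function(x) {
  (sum(x) - x) / (length(x) - 1)
}

ano_loop <- x - loo_clim_loop(x)
ano_fast <- x - loo_clim_fast(x)

# Both approaches give the same cross-validated anomalies.
stopifnot(isTRUE(all.equal(ano_loop, ano_fast)))
```

Whether this applies directly depends on how the function handles missing values and the ensemble dimension, so it would need to be checked against the current implementation.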
Thanks,
Victòria