This work on reproducibility is motivated by three scientific questions:
1/ Is it necessary to rerun a control run whenever a climate model is ported to a new machine?
2/ Can we distinguish a stream of simulations on machine B from a stream of simulations on machine A when only the hardware and the software environment are changed?
3/ Is the modelled climate sensitive to the platform used? How do we deal with the associated uncertainties?
Three experiments have been run:
'e00y' (ECMWF) [completed]
'm04y' (Mare Nostrum) [completed]
'i06c' (Ithaca) [completed]
All three experiments start on 1 January 1850, from spun-up oceanic and atmospheric initial conditions under pre-industrial forcing. The restarts come from CNR, http://sansone.to.isac.cnr.it/ecearth/init/year1850_tome/15010101/ (1850 pre-industrial ICs, after a 500-yr spin-up, for EC-Earth3.1; atmospheric ICs from the 2000s). The simulations are 20 years long and each has five ensemble members. Each member starts from slightly different initial conditions: white noise with a standard deviation of 1e-4 K was added to the SST to create replicas of the ocean restarts. Note that the same perturbation was applied to the two corresponding members on different machines, so that the restarts are strictly identical from machine to machine.
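For illustration, here is a minimal sketch of how such perturbed ocean restarts could be generated (the file and variable names are placeholders, not the actual EC-Earth restart names); seeding the generator with the member index keeps the perturbation the same, in principle, for corresponding members on every machine:

```python
# Sketch: add white noise (sigma = 1e-4 K) to the SST field of an ocean
# restart to create perturbed ensemble members. File and variable names
# ("restart_oce.nc", "sst") are placeholders, not the actual EC-Earth ones.
import shutil
import numpy as np
from netCDF4 import Dataset

def perturb_restart(src, dst, member, sigma=1e-4):
    """Copy the restart and add member-specific white noise to the SST."""
    shutil.copy(src, dst)
    rng = np.random.default_rng(seed=member)   # same seed -> same perturbation on every machine
    with Dataset(dst, "r+") as nc:
        sst = nc.variables["sst"][:]           # placeholder variable name
        sst += rng.normal(0.0, sigma, sst.shape)
        nc.variables["sst"][:] = sst

for m in range(5):                              # five ensemble members fc0..fc4
    perturb_restart("restart_oce.nc", f"restart_oce_fc{m}.nc", member=m)
```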
Note that a fourth simulation, 'e00x', exists, but it was erroneously run with interannually varying forcings. Since we know in advance that it is supposed to differ from the three others, it can also be used to check that our approach detects differences.
Component | Ithaca | ECMWF-CCA | MareNostrum3 |
---|---|---|---|
Motherboard | Sun Blade X6270 servers | Cray XC30 system | IBM dx360 M4 |
Processor | Dual Quad-Core Intel Xeon 5570 (2.93 GHz), 8 cores per node | Dual 12-core E5-2697 v2 (Ivy Bridge) series processors (2.7 GHz), 24 cores per node | Intel SandyBridge-EP E5-2670 (2.6 GHz), 16 cores per node |
Main memory | 8 GB per node | 64 GB per node | 32 GB per node |
Interconnect | Infiniband (IB) | Cray Aries interconnect links all compute nodes in a Dragonfly topology | Infiniband (IB) |
Operating system | SUSE Linux Enterprise Server 11 (x86_64) | Cray Linux Environment (CLE) | Linux - SuSe Distribution 11 SP2 |
Queue scheduler | OGS/GE 2011.11 | PBSPro_12.1.400.132424 | IBM Platform LSF 9.1.2.0 build 226830, Nov 15 2013 |
Compiler | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.080 Build 20130728 | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008 | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.1.117 Build 20121010 |
MPI | IntelMPI v4.0.3 | Cray mpich2 v6.2.0 | IntelMPI v4.1.0 |
LAPACK | IntelMPI v4.0.3 | Cray libsci v12.2.0 | IntelMPI v4.1.0 |
SZIP, HDF5, NetCDF | v2.1, v1.8.11, v4.1.3 | v2.1, v1.8.11, v4.3.0 | v2.1, v1.8.10, v4.1.3 |
GribAPI, GribEX | v1.9.9, v000370 | v1.13.0, v000395 | v1.9.9, v000370 |
F Flags | -O2 -g -traceback -vec-report0 -r8 | -O2 -g -traceback -vec-report0 -r8 | -O2 -g -traceback -vec-report0 -r8 |
C Flags | -O2 -g -traceback | -O2 -g -traceback | -O2 -g -traceback |
LD Flags | -O2 -g -traceback | -O2 -g -traceback | -O2 -g -traceback |
NPROCS: (IFS+NEMO+OASIS3) | 72: (32+16+22) | 598: (480+96+22) | 512: (384+96+22) |
Two members from the same machine will produce different output due to the small noise added to the initial conditions. The question is thus whether the "difference" between two simulations is "larger" between two members from different machines than between two members from the same machine. The use of quotes here reflects that designing a proper distance and statistical test is not straightforward.
A first approach is to define a common reference dataset against which we measure the distance of each simulation. This reference dataset is the one used by Reichler and Kim (2008), and its implementation for EC-Earth simulations is described in the wiki page about ECMean.
The outcome is, for each member of each experiment, a 'table' summarizing the performance of the simulation. The tables are recorded in text files located here:
/home/fmassonnet/EC-Earth_Performance/ECmean/${exp}/${memb}/PI2_RK08_${exp}_${yearb}_${yeare}.txt
Here, ${exp} is the name of the experiment: 'e00y', 'm04y', 'i06c' or 'e00x' (note that this last experiment is bugged, see above); ${memb} is 'fc0', 'fc1', 'fc2', 'fc3' or 'fc4'; ${yearb} and ${yeare} are the start and end years: 1850 and 1869, although other periods can be defined.
The table looks like this (example with 'e00x', member 'fc0', 1850-1869):
Performance Indices - Reichler and Kim 2008 - CDO version
NEW VERSION: windstress, land-sea masks and 100 hPa corrections
e00x 1850 1869
Field | PI | Domain | Dataset | CMIP3 | RatioCMIP3 |
---|---|---|---|---|---|
t2m | 45.2141 | land | CRU | 25.13 | 1.79 |
msl | 2.8880 | global | COADS | 11.69 | 0.24 |
qnet | 19.9575 | ocean | OAFLUX | 14.24 | 1.40 |
tp | 23.1720 | global | CMAP | 38.87 | 0.59 |
ewss | 9.2510 | ocean | DASILVA | 4.03 | 2.29 |
nsss | 4.7073 | ocean | DASILVA | 3.10 | 1.51 |
SST | 18.2279 | ocean | GISS | 17.21 | 1.05 |
SSS | 0.1101 | ocean | levitus | 0.22 | 0.50 |
SICE | 0.1004 | ocean | GISS | 0.34 | 0.29 |
T | 39.5616 | zonal | ERA40 | 38.89 | 1.01 |
U | 2.2818 | zonal | ERA40 | 12.07 | 0.18 |
V | 1.6940 | zonal | ERA40 | 8.25 | 0.20 |
Q | 41.3965 | zonal | ERA40 | 29.41 | 1.40 |
Total Performance Index is: 0.9576
Partial PI (atm only) is: 0.7728
The second column gives the performance index of each variable against reanalyses/observations. The second-to-last column is the same index for CMIP3 models (probably their average, but this needs to be checked), and the last column is the ratio between the second and the second-to-last columns. The "Total Performance Index" is the equally weighted average of the RatioCMIP3 column, while the partial PI only considers atmospheric variables.
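As a cross-check of this definition, the following sketch parses one of these text files (assuming the whitespace-separated layout shown above) and recomputes the Total Performance Index as the plain average of the RatioCMIP3 column:

```python
# Sketch: parse a PI2_RK08_*.txt table and recompute the Total Performance
# Index as the equally weighted mean of the RatioCMIP3 column. Assumes the
# whitespace-separated layout shown above (Field PI Domain Dataset CMIP3 Ratio).
def read_ratios(path):
    ratios = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 6:            # data rows have exactly six fields
                try:
                    ratios[parts[0]] = float(parts[5])
                except ValueError:
                    continue               # header or text line, skip
    return ratios

ratios = read_ratios("PI2_RK08_e00x_1850_1869.txt")
total_pi = sum(ratios.values()) / len(ratios)
print(f"Total Performance Index: {total_pi:.4f}")   # ~0.96 for the example table above
```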
The following command returns all t2m indices for experiment e00x:
cat ~fmassonnet/EC-Earth_Performance/ECmean/e00x/fc?/PI2_RK08_e00x_1850_1869.txt | grep t2m | awk '{print $2}'
It returns: 45.2141 46.9433 43.9849 44.6414 48.6909
The following command returns all t2m indices for experiment e00y:
cat ~fmassonnet/EC-Earth_Performance/ECmean/e00y/fc?/PI2_RK08_e00y_1850_1869.txt | grep t2m | awk '{print $2}'
It returns: 47.2227 48.5643 45.8102 46.8731 49.1348
The following command returns all t2m indices for experiment m04y:
cat ~fmassonnet/EC-Earth_Performance/ECmean/m04y/fc?/PI2_RK08_m04y_1850_1869.txt | grep t2m | awk '{print $2}'
It returns: 42.6386 43.9406 40.7176 44.0625 44.0779
Interestingly enough, my eye can see a difference. Will the stats confirm this?
A second, cleaner approach (still to be implemented) will directly measure the distance between two members.
As a first shot, we used a Kolmogorov-Smirnov test to determine whether the performance index is statistically different from one machine to another for 13 different variables. As can be seen on the plot, the simulations have the same distributions for all variables except surface temperature, surface heat flux and sea ice.
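For reference, here is a minimal sketch of the two-sample Kolmogorov-Smirnov test as it could be applied to the five t2m indices per machine listed above (using scipy's ks_2samp; with only five members per sample the power of the test is limited):

```python
# Sketch: two-sample Kolmogorov-Smirnov test on the t2m performance indices
# extracted above (five members per machine).
from scipy.stats import ks_2samp

t2m_e00y = [47.2227, 48.5643, 45.8102, 46.8731, 49.1348]   # ECMWF
t2m_m04y = [42.6386, 43.9406, 40.7176, 44.0625, 44.0779]   # MareNostrum

stat, pval = ks_2samp(t2m_e00y, t2m_m04y)
print(f"KS statistic = {stat:.3f}, p-value = {pval:.3f}")
# A small p-value suggests the two samples do not come from the same
# distribution; with n = 5 per sample the test has limited power.
```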
Regarding the maps of the differences between these simulations, the model simulates different Antarctic sea-ice conditions on the two platforms. This induces a difference in surface heat flux and surface temperature. Such differences can be explained either by the sea-ice model itself or by the coupling of the sea-ice model with the rest of the coupled climate model giving different results depending on the platform used.
Overall, apart from this sea-ice difference, all the other variables simulated on the different machines seem to be consistent on the two platforms, which is comforting… The next step will be to check the restart that was used for the simulation, and then to investigate the reasons for the differences that should not have existed.
NB: This is the old file with the differences of RK indices computed for only two platforms. In this plot, the simulations e00x and e00y have the same distribution for all variables. This is not the case for the simulations e00y and m04y, which differ in terms of surface temperature, surface heat flux and sea ice.
Hopefully, we can generalize the test above to measure whether two joint PDFs differ from each other based on a finite number of samples.
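One possible generalization (a rough sketch of the idea, not the method actually adopted) is a permutation test on a multivariate statistic such as the energy distance between the two sets of PI vectors:

```python
# Rough sketch of a multivariate two-sample test: permutation test on the
# energy distance between two sets of PI vectors (one row per member, one
# column per variable). Illustration only, not the test actually adopted.
import numpy as np

def energy_distance(x, y):
    """Energy distance between samples x (n, d) and y (m, d)."""
    def mean_dist(a, b):
        return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def permutation_pvalue(x, y, n_perm=9999, seed=0):
    """Permutation p-value for the hypothesis that x and y share one joint PDF."""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if energy_distance(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```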
'make sure to save the ice outputs correctly'.
'drift' in Antarctic sea ice. There is supposed to be no such drift, since the simulations are started from equilibrated conditions (the tome run from Paolo Davini). Given that the two machines show roughly the same drift, this could point to a problem in the code. We need to check whether Arctic sea ice drifts as well; this could help attribute the sources of the drift (a sketch for quantifying such a drift follows the list of plots below).
'Eleftheria': analysis of (1) ocean column stability, heat fluxes and convection in the Southern Ocean, (2) global-mean temperature time series, (3) Arctic sea ice.
Plots of February sea ice area in Antarctic plot_feb.pdf
Plots of September sea ice area in Antarctic plot_sep.pdf
Plots of February sea ice area in the Arctic plot_feb_NH.pdf
Plots of September sea ice area in the Arctic plot_sep_NH.pdf
Plots of SST, global-mean, and Northern/Southern hemisphere plot_tos.pdf
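As referenced in the drift item above, here is a minimal sketch of how the hemispheric sea-ice area and its linear drift could be quantified from monthly ice-concentration output (file and variable names are assumptions, not the actual EC-Earth/NEMO output names):

```python
# Sketch: Antarctic sea-ice area time series and linear drift from monthly
# ice-concentration output. Variable names ("siconc", "areacello", "latitude")
# and the file name are assumptions, not the actual model output names.
import numpy as np
from netCDF4 import Dataset

with Dataset("ice_monthly.nc") as nc:
    conc = nc.variables["siconc"][:]        # (time, y, x), fraction 0-1
    cell = nc.variables["areacello"][:]     # (y, x), m^2
    lat = nc.variables["latitude"][:]       # (y, x)

south = lat < 0                                          # Antarctic grid cells
area = np.sum(conc * cell * south, axis=(1, 2)) / 1e12   # 10^6 km^2 per month

years = np.arange(area.size) / 12.0
drift = np.polyfit(years, area, 1)[0]       # linear trend, 10^6 km^2 per year
print(f"Antarctic sea-ice area drift: {drift:.3f} 1e6 km2/yr")
```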
'Asif': diff our (IC3) EC-Earth3.1 code against the CNR code. In particular: are they using LIM3 with one category; what are their namelists; do they have the last 20 years of the tome simulation? Are we using the same forcing as they are? Also diff our ocean and ice restarts against theirs (they are supposed to be identical except for the noise in the SSTs).
'Martin': understand why the analysis with three machines gives contradictory results: Mare Nostrum is different for one variable, Ithaca is different for another variable.
'Omar': make a figure similar to Eleftheria's for his run started from the "tomf" restarts of Paolo Davini. Is the model drift also present?
'Paco': send an e-mail to Uwe to clarify his idea of an experiment for MPP reproducibility.
Same drift with the 'tomf' restarts as we had with the 'tome' restarts from Paolo Davini: significant negative model drift in Antarctic sea ice, although the restarts are supposed to come from an equilibrated run.
'We are clearly not running the same model'. Major differences:
'François': as soon as the last chunk of the last member of 'i06c' is done, process the PI scripts. Done.
'Martin': as soon as the last chunk of the last member of 'i06c' is done, update the figures of scores and maps to show the three models.
'Asif': extend fc0 of 'i06c' to 30-50 years (20 years are already done).
'Eleftheria': follow up Asif's simulation: when the model is equilibrated for sea ice and global SST, send a notification to all of us. If possible, also check temperatures in the deeper ocean; they will probably not be stabilized even after 100 years.
'François': when a new, equilibrated restart is available from Asif's run, add tiny perturbations and dispatch the restarts to the three machines to start a new cycle of experiments. Pending Asif's run.
'Asif and François': make new experiments and new repro restarts; recompile the code on MN and ECMWF with the same distribution and configuration of processors, possibly with new keys (fp_precise) and less aggressive optimization.
François made a summary of the current status of this research. File available here: 20151102_repro_fm.pdf. Salient messages are:
Asif and François launch the reproducibility experiment from stabilized i06c on Ithaca and MareNostrum
Stabilization of i06c. The plots here and here show the stabilization of one of the members of i06c that was extended to reach equilibrium. It was decided that the run is now in a sufficiently stable climate to perform the second stream of reproducibility experiments.
There was a meeting involving François, Kim and Oriol to discuss the meaning and use of compilation flags/options in these experiments.
As a reminder, here is the current status: we do not have reproducibility in EC-Earth3.1 (see the discussion above), but the setup was not 100% perfect. There were two problems: 1) problems linked to differences in domain decomposition (number and distribution of processors), and 2) problems linked to differences in compiler versions, the use of aggressive optimization levels, and the absence of certain keys like fp-model strict/precise.
To distinguish between the two problems and following Oriol and Kim's suggestion, here is the updated plan.
0) Talk to the NEMO and IFS teams, at least to inform them of our plans. François sends an e-mail to Sébastien Masson, and Kim to the IFS people at ECMWF. We already know that different domain decompositions give bit-wise different results for NEMO. Whether the results can differ climate-wise is what we want to test, but these teams may have obtained results we are unaware of.
1) Reproducibility on Ithaca. To isolate the effect of the processor decomposition, we will first run a reproducibility experiment on Ithaca, started from the equilibrated restart we obtained after 60 years of simulation. We will change only the domain decomposition (number and distribution of processors). Since everything else will be equal by construction, this will allow us to examine the sole effect of the processor layout on reproducibility (the floating-point sketch below illustrates why the layout can matter at all). Questions to elucidate at this stage are:
2) Reproducibility across machines. Looking at the table prepared by Asif (above), we can see that there are indeed differences in compiler versions. We will have to make sure all compiler versions are identical, at least as far as possible. As a reminder, the idea is to do everything in our power to make the experiments reproducible. For now, only simulations on Ithaca and MareNostrum3 can be conducted.
3) We'll use the same diagnostics as we did earlier this year. This part is ready, there is no reason why diagnostics should change.
More about compilation options can be found here. A description of the tradeoffs in floating-point operations is here and here.
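To illustrate why the processor layout and optimization level can matter at all: floating-point addition is not associative, so changing the order in which grid-point contributions are reduced (as a different domain decomposition or a vectorizing compiler does) can change the last bits of a global sum, which is enough to make a chaotic model diverge. A minimal sketch:

```python
# Sketch: floating-point addition is not associative, so summing the same
# grid-point values in a different order (different domain decomposition,
# different vectorization) can give bitwise-different results.
import numpy as np

rng = np.random.default_rng(0)
field = rng.normal(size=1_000_000)           # stand-in for a global model field

global_sum = np.sum(field)                                            # single-domain reduction
split_sum = sum(np.sum(part) for part in np.array_split(field, 24))   # 24 "subdomains"

print(global_sum == split_sum)        # typically False: the last bits differ
print(abs(global_sum - split_sum))    # tiny, but enough for a chaotic model to diverge
```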
We had a meeting with the usual people plus Klaus and Uwe (SMHI), who are also tracking this reproducibility issue and are interested in what we are doing. Please visit this page. They mention an interesting paper by Barker et al. In this paper, a software tool is presented to track differences in the CESM model. The paper is not strictly about GCMs; it focuses on the atmosphere and on short periods (1 yr).
The discussions were quite rich; here is a summary in a few bullet points:
The topic is becoming extremely complex and far-reaching, and the team looking into it is growing every month. On the other hand, it has been a long-standing issue (almost one year now), and we need insights before the next EC-Earth meeting and the upcoming CMIP6. Here is a suggestion for how to continue the work: it should be split into two tasks.
The User aspect experiments are launched. Ithaca's i077 is now under way.