This work on reproducibility is motivated by three scientific questions:
1/ Is it necessary to rerun a control run whenever a climate model is ported to a new machine?
2/ Can we distinguish a stream of simulations on machine B from a stream of simulations on machine A when only the hardware and the software environment are changed?
3/ Is the modelled climate sensitive to the platform used? How do we deal with the associated uncertainties?
Three experiments have been run:
'e00y' (ECMWF) [completed]
'm04y' (Mare Nostrum) [completed]
'i06c' (Ithaca) [completed]
All three experiments start on 1 January 1850, from spun-up oceanic and atmospheric initial conditions under pre-industrial forcing. The restarts come from CNR, http://sansone.to.isac.cnr.it/ecearth/init/year1850_tome/15010101/ (1850 pre-industrial ICs, after a 500-yr spin-up, for EC-Earth3.1; atmospheric ICs from the 2000s). The simulations are 20 years long and each has five ensemble members. Each member starts from slightly different initial conditions: white noise with a standard deviation of 1e-4 K was added to the SST to create replicas of the ocean restarts. Note that the same perturbation was applied to the two corresponding members on different machines, so that the restarts are strictly identical from machine to machine.
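For illustration, here is a minimal sketch of how such perturbed ocean restarts could be generated (the file and variable names are placeholders, not the actual EC-Earth restart names); seeding the generator with the member index keeps the perturbation the same, in principle, for corresponding members on every machine:

```python
# Sketch: add white noise (sigma = 1e-4 K) to the SST field of an ocean
# restart to create perturbed ensemble members. File and variable names
# ("restart_oce.nc", "sst") are placeholders, not the actual EC-Earth ones.
import shutil
import numpy as np
from netCDF4 import Dataset

def perturb_restart(src, dst, member, sigma=1e-4):
    """Copy the restart and add member-specific white noise to the SST."""
    shutil.copy(src, dst)
    rng = np.random.default_rng(seed=member)   # same seed -> same perturbation on every machine
    with Dataset(dst, "r+") as nc:
        sst = nc.variables["sst"][:]           # placeholder variable name
        sst += rng.normal(0.0, sigma, sst.shape)
        nc.variables["sst"][:] = sst

for m in range(5):                              # five ensemble members fc0..fc4
    perturb_restart("restart_oce.nc", f"restart_oce_fc{m}.nc", member=m)
```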
Note that a fourth simulation, 'e00x', exists, but it was erroneously run with interannually varying forcings. Since we know in advance that it is supposed to differ from the three others, it can also be used to check that our approach detects differences.
Component | Ithaca | ECMWF-CCA | MareNostrum3 |
---|---|---|---|
Motherboard | Sun Blade X6270 servers | Cray XC30 system | IBM dx360 M4 |
Processor | Dual Quad-Core Intel Xeon 5570 (2.93 GHz), 8 cores per node | Dual 12-core E5-2697 v2 (Ivy Bridge) series processors (2.7 GHz), 24 cores per node | Intel SandyBridge-EP E5-2670 (2.6 GHz), 16 cores per node |
Main memory | 8 GB per node | 64 GB per node | 32 GB per node |
Interconnect | Infiniband (IB) | Cray Aries interconnect links all compute nodes in a Dragonfly topology | Infiniband (IB) |
Operating system | SUSE Linux Enterprise Server 11 (x86_64) | Cray Linux Environment (CLE) | Linux - SuSe Distribution 11 SP2 |
Queue scheduler | OGS/GE 2011.11 | PBSPro_12.1.400.132424 | IBM Platform LSF 9.1.2.0 build 226830, Nov 15 2013 |
Compiler | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.080 Build 20130728 | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008 | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.1.117 Build 20121010 |
MPI | IntelMPI v4.0.3 | Cray mpich2 v6.2.0 | IntelMPI v4.1.0 |
LAPACK | IntelMPI v4.0.3 | Cray libsci v12.2.0 | IntelMPI v4.1.0 |
SZIP, HDF5, NetCDF | v2.1, v1.8.11, v4.1.3 | v2.1, v1.8.11, v4.3.0 | v2.1, v1.8.10, v4.1.3 |
GribAPI, GribEX | v1.9.9, v000370 | v1.13.0, v000395 | v1.9.9, v000370 |
F Flags | -O2 -g -traceback -vec-report0 -r8 | -O2 -g -traceback -vec-report0 -r8 | -O2 -g -traceback -vec-report0 -r8 |
C Flags | -O2 -g -traceback | -O2 -g -traceback | -O2 -g -traceback |
LD Flags | -O2 -g -traceback | -O2 -g -traceback | -O2 -g -traceback |
NPROCS: (IFS+NEMO+OASIS3) | 72: (32+16+22) | 598: (480+96+22) | 512: (384+96+22) |
Two members from the same machine will produce different output due to the small noise added to the initial conditions. The question is thus whether the "difference" between two simulations is "larger" between two members from different machines than between two members from the same machine. The use of quotes here reflects that designing a proper distance and statistical test is not straightforward.
A first approach is to define a common reference dataset against which we measure the distance of each simulation. This reference dataset is the one used by Reichler and Kim (2008), and its implementation for EC-Earth simulations is described in the wiki page about ECMean.
The outcome is, for each member of each experiment, a 'table' summarizing the performance of the simulation. The tables are recorded in text files located here:
/home/fmassonnet/EC-Earth_Performance/ECmean/${exp}/${memb}/PI2_RK08_${exp}_${yearb}_${yeare}.txt
Here, ${exp} is the name of the experiment: 'e00y', 'm04y', 'i06c' or 'e00x' (note that this last experiment is bugged, see above); ${memb} is 'fc0', 'fc1', 'fc2', 'fc3' or 'fc4'; ${yearb} and ${yeare} are the start and end years: 1850 and 1869, although other periods can be defined.
The table looks like this (example with 'e00x', member 'fc0', 1850-1869):
Performance Indices - Reichler and Kim 2008 - CDO version
NEW VERSION: windstress, land-sea masks and 100 hPa corrections
e00x 1850 1869
Field | PI | Domain | Dataset | CMIP3 | RatioCMIP3 |
---|---|---|---|---|---|
t2m | 45.2141 | land | CRU | 25.13 | 1.79 |
msl | 2.8880 | global | COADS | 11.69 | 0.24 |
qnet | 19.9575 | ocean | OAFLUX | 14.24 | 1.40 |
tp | 23.1720 | global | CMAP | 38.87 | 0.59 |
ewss | 9.2510 | ocean | DASILVA | 4.03 | 2.29 |
nsss | 4.7073 | ocean | DASILVA | 3.10 | 1.51 |
SST | 18.2279 | ocean | GISS | 17.21 | 1.05 |
SSS | 0.1101 | ocean | levitus | 0.22 | 0.50 |
SICE | 0.1004 | ocean | GISS | 0.34 | 0.29 |
T | 39.5616 | zonal | ERA40 | 38.89 | 1.01 |
U | 2.2818 | zonal | ERA40 | 12.07 | 0.18 |
V | 1.6940 | zonal | ERA40 | 8.25 | 0.20 |
Q | 41.3965 | zonal | ERA40 | 29.41 | 1.40 |
Total Performance Index is: 0.9576
Partial PI (atm only) is: 0.7728
The second column gives the performance index of each variable against reanalyses/observations. The second-to-last column is the same index for CMIP3 models (probably their average, but this needs to be checked), and the last column is the ratio between the second and the second-to-last columns. The "Total Performance Index" is the equally weighted average of the RatioCMIP3 column, while the partial PI only considers atmospheric variables.
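As a cross-check of this definition, the following sketch parses one of these text files (assuming the whitespace-separated layout shown above) and recomputes the Total Performance Index as the plain average of the RatioCMIP3 column:

```python
# Sketch: parse a PI2_RK08_*.txt table and recompute the Total Performance
# Index as the equally weighted mean of the RatioCMIP3 column. Assumes the
# whitespace-separated layout shown above (Field PI Domain Dataset CMIP3 Ratio).
def read_ratios(path):
    ratios = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 6:            # data rows have exactly six fields
                try:
                    ratios[parts[0]] = float(parts[5])
                except ValueError:
                    continue               # header or text line, skip
    return ratios

ratios = read_ratios("PI2_RK08_e00x_1850_1869.txt")
total_pi = sum(ratios.values()) / len(ratios)
print(f"Total Performance Index: {total_pi:.4f}")   # ~0.96 for the example table above
```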
The following command returns all t2m indices for experiment e00x:
cat ~fmassonnet/EC-Earth_Performance/ECmean/e00x/fc?/PI2_RK08_e00x_1850_1869.txt | grep t2m | awk '{print $2}'
It returns: 45.2141 46.9433 43.9849 44.6414 48.6909
The following command returns all t2m indices for experiment e00y:
cat ~fmassonnet/EC-Earth_Performance/ECmean/e00y/fc?/PI2_RK08_e00y_1850_1869.txt | grep t2m | awk '{print $2}'
It returns: 47.2227 48.5643 45.8102 46.8731 49.1348
The following command returns all t2m indices for experiment m04y:
cat ~fmassonnet/EC-Earth_Performance/ECmean/m04y/fc?/PI2_RK08_m04y_1850_1869.txt | grep t2m | awk '{print $2}'
It returns: 42.6386 43.9406 40.7176 44.0625 44.0779
Interestingly enough, my eye can see a difference. Will the stats confirm this?
A second, cleaner approach (still to be implemented) will directly measure the distance between two members.
As a first shot, we used a Kolmogorov-Smirnov test to determine whether the performance index is statistically different from one machine to another for 13 different variables. As can be seen on the plot, the simulations have the same distributions for all variables except surface temperature, surface heat flux and sea ice.
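For reference, here is a minimal sketch of the two-sample Kolmogorov-Smirnov test as it could be applied to the five t2m indices per machine listed above (using scipy's ks_2samp; with only five members per sample the power of the test is limited):

```python
# Sketch: two-sample Kolmogorov-Smirnov test on the t2m performance indices
# extracted above (five members per machine).
from scipy.stats import ks_2samp

t2m_e00y = [47.2227, 48.5643, 45.8102, 46.8731, 49.1348]   # ECMWF
t2m_m04y = [42.6386, 43.9406, 40.7176, 44.0625, 44.0779]   # MareNostrum

stat, pval = ks_2samp(t2m_e00y, t2m_m04y)
print(f"KS statistic = {stat:.3f}, p-value = {pval:.3f}")
# A small p-value suggests the two samples do not come from the same
# distribution; with n = 5 per sample the test has limited power.
```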
Regarding the maps of the differences between these simulations, the model simulates different Antarctic sea-ice conditions on the two platforms. This induces a difference in surface heat flux and surface temperature. Such differences can be explained either by the sea-ice model itself or by the coupling of the sea-ice model with the rest of the coupled climate model giving different results depending on the platform used.
Overall, apart from this sea-ice difference, all the other variables simulated on the different machines seem to be consistent on the two platforms, which is comforting… The next step will be to check the restart that was used for the simulation, and then to investigate the reasons for the differences that should not have existed.
NB: This is the old file with the differences of RK indices computed for only two platforms. In this plot, the simulations e00x and e00y have the same distribution for all variables. This is not the case for the simulations e00y and m04y, which differ in terms of surface temperature, surface heat flux and sea ice.
Hopefully, we can generalize the test above to measure whether two joint PDFs differ from each other based on a finite number of samples.
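One possible generalization (a rough sketch of the idea, not the method actually adopted) is a permutation test on a multivariate statistic such as the energy distance between the two sets of PI vectors:

```python
# Rough sketch of a multivariate two-sample test: permutation test on the
# energy distance between two sets of PI vectors (one row per member, one
# column per variable). Illustration only, not the test actually adopted.
import numpy as np

def energy_distance(x, y):
    """Energy distance between samples x (n, d) and y (m, d)."""
    def mean_dist(a, b):
        return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def permutation_pvalue(x, y, n_perm=9999, seed=0):
    """Permutation p-value for the hypothesis that x and y share one joint PDF."""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if energy_distance(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```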
'make sure to save the ice outputs correctly'.
'drift' in Antarctic sea ice. There is supposed to be no such drift, since the simulations are started from equilibrated conditions (the tome run from Paolo Davini). Given that the two machines show roughly the same drift, this could point to a problem in the code. We need to check whether Arctic sea ice drifts as well; this could help attribute the sources of the drift (a sketch for quantifying such a drift follows the list of plots below).
'Eleftheria': analysis of (1) ocean column stability, heat fluxes and convection in the Southern Ocean, (2) global-mean temperature time series, (3) Arctic sea ice.
Plots of February sea ice area in Antarctic plot_feb.pdf
Plots of September sea ice area in Antarctic plot_sep.pdf
Plots of February sea ice area in the Arctic plot_feb_NH.pdf
Plots of September sea ice area in the Arctic plot_sep_NH.pdf
Plots of SST, global-mean, and Northern/Southern hemisphere plot_tos.pdf
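As referenced in the drift item above, here is a minimal sketch of how the hemispheric sea-ice area and its linear drift could be quantified from monthly ice-concentration output (file and variable names are assumptions, not the actual EC-Earth/NEMO output names):

```python
# Sketch: Antarctic sea-ice area time series and linear drift from monthly
# ice-concentration output. Variable names ("siconc", "areacello", "latitude")
# and the file name are assumptions, not the actual model output names.
import numpy as np
from netCDF4 import Dataset

with Dataset("ice_monthly.nc") as nc:
    conc = nc.variables["siconc"][:]        # (time, y, x), fraction 0-1
    cell = nc.variables["areacello"][:]     # (y, x), m^2
    lat = nc.variables["latitude"][:]       # (y, x)

south = lat < 0                                          # Antarctic grid cells
area = np.sum(conc * cell * south, axis=(1, 2)) / 1e12   # 10^6 km^2 per month

years = np.arange(area.size) / 12.0
drift = np.polyfit(years, area, 1)[0]       # linear trend, 10^6 km^2 per year
print(f"Antarctic sea-ice area drift: {drift:.3f} 1e6 km2/yr")
```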
'Asif': diff our (IC3) EC-Earth3.1 code against the CNR code. In particular: are they using LIM3 with one category; what are their namelists; do they have the last 20 years of the tome simulation? Are we using the same forcing as they are? Also diff our ocean and ice restarts against theirs (they are supposed to be identical except for the noise in the SSTs).
'Martin': understand why the analysis with three machines gives contradictory results: Mare Nostrum is different for one variable, Ithaca is different for another variable.
'Omar': make a figure similar to Eleftheria's for his run started from the "tomf" restarts of Paolo Davini. Is the model drift also present?
'Paco': send an e-mail to Uwe to clarify his idea of an experiment for MPP reproducibility.
Same drift with the 'tomf' restarts as we had with the 'tome' restarts from Paolo Davini: significant negative model drift in Antarctic sea ice, although the restarts are supposed to come from an equilibrated run.
'We are clearly not running the same model'. Major differences:
'François': as soon as the last chunk of the last member of 'i06c' is done, process the PI scripts. Done.
'Martin': as soon as the last chunk of the last member of 'i06c' is done, update the figures of scores and maps to show the three models.
'Asif': extend fc0 of 'i06c' to 30-50 years (20 years are already done).
'Eleftheria': follow up Asif's simulation: when the model is equilibrated for sea ice and global SST, send a notification to all of us. If possible, also check temperatures in the deeper ocean; they will probably not be stabilized even after 100 years.
'François': when a new, equilibrated restart is available from Asif's run, add tiny perturbations and dispatch the restarts to the three machines to start a new cycle of experiments. Pending Asif's run.
'Asif and François': make new experiments and new repro restarts; recompile the code on MN and ECMWF with the same distribution and configuration of processors, possibly with new keys (fp_precise) and less aggressive optimization.
François made a summary of the current status of this research. File available here: 20151102_repro_fm.pdf. Salient messages are:
Asif and François launch the reproducibility experiment from stabilized i06c on Ithaca and MareNostrum
Stabilization of i06c. The plots here and here show the stabilization of one of the members of i06c that was extended to reach equilibrium. It was decided that the run is now in a sufficiently stable climate to perform the second stream of reproducibility experiments.
There was a meeting involving François, Kim and Oriol to discuss the meaning and use of compilation flags/options in these experiments.
As a reminder, here is the current status: we do not have reproducibility in EC-Earth3.1 (see the discussion above), but the setup was not 100% perfect. There were two problems: 1) problems linked to differences in domain decomposition (number and distribution of processors), and 2) problems linked to differences in compiler versions, the use of aggressive optimization levels, and the absence of certain keys like fp-model strict/precise.
To distinguish between the two problems and following Oriol and Kim's suggestion, here is the updated plan.
0) Talk to the NEMO and IFS teams, at least to inform them of our plans. François sends an e-mail to Sébastien Masson, and Kim to the IFS people at ECMWF. We already know that different domain decompositions give bit-wise different results for NEMO. Whether the results can differ climate-wise is what we want to test, but these teams may have obtained results we are unaware of.
1) Reproducibility on Ithaca. To isolate the effect of the processor decomposition, we will first run a reproducibility experiment on Ithaca, started from the equilibrated restart we obtained after 60 years of simulation. We will change only the domain decomposition (number and distribution of processors). Since everything else will be equal by construction, this will allow us to examine the sole effect of the processor layout on reproducibility (the floating-point sketch below illustrates why the layout can matter at all). Questions to elucidate at this stage are:
2) Reproducibility across machines. Looking at the table prepared by Asif (above), we can see that there are indeed differences in compiler versions. We will have to make sure all compiler versions are identical, at least as far as possible. As a reminder, the idea is to do everything in our power to make the experiments reproducible. For now, only simulations on Ithaca and MareNostrum3 can be conducted.
3) We'll use the same diagnostics as we did earlier this year. This part is ready, there is no reason why diagnostics should change.
More about compilation options can be found here. A description of the tradeoffs in floating-point operations is here and here.
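To illustrate why the processor layout and optimization level can matter at all: floating-point addition is not associative, so changing the order in which grid-point contributions are reduced (as a different domain decomposition or a vectorizing compiler does) can change the last bits of a global sum, which is enough to make a chaotic model diverge. A minimal sketch:

```python
# Sketch: floating-point addition is not associative, so summing the same
# grid-point values in a different order (different domain decomposition,
# different vectorization) can give bitwise-different results.
import numpy as np

rng = np.random.default_rng(0)
field = rng.normal(size=1_000_000)           # stand-in for a global model field

global_sum = np.sum(field)                                            # single-domain reduction
split_sum = sum(np.sum(part) for part in np.array_split(field, 24))   # 24 "subdomains"

print(global_sum == split_sum)        # typically False: the last bits differ
print(abs(global_sum - split_sum))    # tiny, but enough for a chaotic model to diverge
```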
We had a meeting with the usual people plus Klaus and Uwe (SMHI), who are also tracking this reproducibility issue and are interested in what we are doing. Please visit this page. They mention an interesting paper by Barker et al. In this paper, a software tool is presented to track differences in the CESM model. The paper is not strictly about GCMs; it focuses on the atmosphere and on short periods (1 yr).
The discussions were quite rich; here is a summary in a few bullet points:
The topic is becoming extremely complex and far-reaching, and the team looking into it is growing every month. On the other hand, it has been a long-standing issue (almost one year now), and we need insights before the next EC-Earth meeting and the upcoming CMIP6. Here is a suggestion for how to continue the work: it should be split into two tasks.
The User aspect experiments are launched. Ithaca's i077 is now under way.