reproducibility

Introduction

Rationale

This work on reproducibility is motivated by three scientific questions:

1/ Is it necessary to rerun a control run whenever a climate model is ported to a new machine?

2/ Can we distinguish a stream of simulations on machine B from a stream of simulations on machine A when only the hardware and the software environment are changed?

3/ Is the modelled climate sensitive to the platform used? How do we deal with the associated uncertainties?

Reproducibility experiments

Three experiments have been run:

'e00y' (ECMWF) [completed]

'm04y' (Mare Nostrum) [completed]

'i06c' (Ithaca) [completed]

All three experiments start on 1 January 1850, from spun-up oceanic and atmospheric initial conditions under pre-industrial forcing. The restarts come from CNR, http://sansone.to.isac.cnr.it/ecearth/init/year1850_tome/15010101/ (1850 pre-industrial ICs, after 500 yr spin up, for EC-Earth3.1; atm ICs from 2000s). The simulations are 20 years long and each has five ensemble members. Each member starts from slightly different initial conditions: white noise with a standard deviation of 1e-4 K has been added to the SST to create replicas of the ocean restarts. Note that the same perturbation was applied to the two corresponding members of different machines, so that the restarts are strictly identical from machine to machine.
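As an aside, here is a minimal sketch of how such a perturbation could be generated (this assumes a NetCDF ocean restart and a hypothetical surface temperature variable named 'sst'; the actual NEMO restart layout and the script used for these experiments may differ):

  import shutil
  import numpy as np
  from netCDF4 import Dataset

  def perturb_sst(restart_in, restart_out, member, sigma=1e-4):
      # Copy the unperturbed restart, then add Gaussian noise (std sigma, in K)
      # to the surface temperature field. Seeding with the member index means
      # the same perturbation can be regenerated for the corresponding member
      # on every machine, keeping the restarts identical across platforms.
      shutil.copy(restart_in, restart_out)
      rng = np.random.default_rng(member)
      with Dataset(restart_out, "r+") as nc:
          sst = nc.variables["sst"]                  # hypothetical variable name
          noise = rng.normal(0.0, sigma, size=sst[0, :].shape)
          sst[0, :] = sst[0, :] + noise              # perturb the first time record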

Note that a fourth simulation 'e00x' exists, but it was erroneously run with interannually varying forcings. Since we know in advance that it should differ from the other three, it can also serve as a positive control for detecting differences.

Differences between the machines

Component | Ithaca | ECMWF-CCA | MareNostrum3
Motherboard | Sun Blade X6270 servers | Cray XC30 system | IBM dx360 M4
Processor | Dual Quad-Core Intel Xeon 5570 (2.93 GHz), 8 cores per node | Dual 12-core E5-2697 v2 (Ivy Bridge) processors (2.7 GHz), 24 cores per node | Intel SandyBridge-EP E5-2670 (2.6 GHz), 16 cores per node
Main memory | 8 GB per node | 64 GB per node | 32 GB per node
Interconnect | Infiniband (IB) | Cray Aries interconnect linking all compute nodes in a Dragonfly topology | Infiniband (IB)
Operating system | SUSE Linux Enterprise Server 11 (x86_64) | Cray Linux Environment (CLE) | Linux - SuSe Distribution 11 SP2
Queue scheduler | OGS/GE 2011.11 | PBSPro_12.1.400.132424 | IBM Platform LSF 9.1.2.0 build 226830, Nov 15 2013
Compiler | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.080 Build 20130728 | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008 | Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.1.117 Build 20121010
MPI | IntelMPI v4.0.3 | Cray mpich2 v6.2.0 | IntelMPI v4.1.0
LAPACK | IntelMPI v4.0.3 | Cray libsci v12.2.0 | IntelMPI v4.1.0
SZIP, HDF5, NetCDF | v2.1, v1.8.11, v4.1.3 | v2.1, v1.8.11, v4.3.0 | v2.1, v1.8.10, v4.1.3
GribAPI, GribEX | v1.9.9, v000370 | v1.13.0, v000395 | v1.9.9, v000370
F flags | -O2 -g -traceback -vec-report0 -r8 | -O2 -g -traceback -vec-report0 -r8 | -O2 -g -traceback -vec-report0 -r8
C flags | -O2 -g -traceback | -O2 -g -traceback | -O2 -g -traceback
LD flags | -O2 -g -traceback | -O2 -g -traceback | -O2 -g -traceback
NPROCS (IFS+NEMO+OASIS3) | 72 (32+16+22) | 598 (480+96+22) | 512 (384+96+22)

Defining the "difference" between two simulations

Two members from the same machine will produce different output because of the small noise added to the initial conditions. The question is therefore whether the “difference” between two simulations is “larger” between two members from different machines than between two members from the same machine. The quotes here reflect that designing a proper distance and statistical test is not straightforward.

First approach: measure the distance to a reference data set

A first approach is to define a common reference data set to which we measure the distance of each simulation. This reference dataset is the one used by Reichler and Kim (2008), and its implementation for EC-Earth simulations is described in the wiki page about ECMean.

The outcome is, for each member of each experiment, a 'table' summarizing the performance of the simulation. The tables are recorded in text files located here:

/home/fmassonnet/EC-Earth_Performance/ECmean/${exp}/${memb}/PI2_RK08_${exp}_${yearb}_${yeare}.txt

Here, ${exp} is the name of the experiment: 'e00y', 'm04y', 'i06c' or 'e00x' (note that this last experiment is bugged, see above); ${memb} is 'fc0', 'fc1','fc2', 'fc3' or 'fc4'; ${yearb} and ${yeare} are the start and end years: 1850 and 1869, although other periods can be defined.

The table looks like this (example with 'e00x', member 'fc0', 1850-1869):

Performance Indices - Reichler and Kim 2008 - CDO version
NEW VERSION: windstress, land-sea masks and 100 hPa corrections
e00x 1850 1869

Field PI Domain Dataset CMIP3 RatioCMIP3
t2m 45.2141 land CRU 25.13 1.79
msl 2.8880 global COADS 11.69 0.24
qnet 19.9575 ocean OAFLUX 14.24 1.40
tp 23.1720 global CMAP 38.87 0.59
ewss 9.2510 ocean DASILVA 4.03 2.29
nsss 4.7073 ocean DASILVA 3.10 1.51
SST 18.2279 ocean GISS 17.21 1.05
SSS 0.1101 ocean levitus 0.22 0.50
SICE 0.1004 ocean GISS 0.34 0.29
T 39.5616 zonal ERA40 38.89 1.01
U 2.2818 zonal ERA40 12.07 0.18
V 1.6940 zonal ERA40 8.25 0.20
Q 41.3965 zonal ERA40 29.41 1.40

Total Performance Index is: 0.9576
Partial PI (atm only) is: 0.7728

The second column (PI) gives the performance of each variable against reanalyses/observations. The second-to-last column (CMIP3) gives this performance for CMIP3 models (probably their average, but this needs to be checked), and the last column (RatioCMIP3) is the ratio of the two. The “Total Performance Index” is the equally weighted average of the RatioCMIP3 column, while the partial PI only considers atmospheric variables.
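As an illustration, a minimal sketch of how the super-index could be recomputed from one of these text tables (the parsing assumes the layout shown above; the official computation is done by the ECMean scripts):

  def total_pi(path):
      # Field names as they appear in the first column of the table above.
      fields = {"t2m", "msl", "qnet", "tp", "ewss", "nsss", "SST", "SSS",
                "SICE", "T", "U", "V", "Q"}
      ratios = []
      with open(path) as f:
          for line in f:
              parts = line.split()
              if parts and parts[0] in fields:
                  ratios.append(float(parts[-1]))    # last column: RatioCMIP3
      return sum(ratios) / len(ratios)               # equal-weight average

Applied to the 'e00x'/'fc0' table above, this should roughly reproduce the quoted Total Performance Index of 0.9576 (up to rounding of the printed ratios).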

The following command returns all t2m indices for experiment e00x:

  cat ~fmassonnet/EC-Earth_Performance/ECmean/e00x/fc?/PI2_RK08_e00x_1850_1869.txt | grep t2m | awk '{print $2}'

Output: 45.2141 46.9433 43.9849 44.6414 48.6909

The same command for experiment e00y:

  cat ~fmassonnet/EC-Earth_Performance/ECmean/e00y/fc?/PI2_RK08_e00y_1850_1869.txt | grep t2m | awk '{print $2}'

Output: 47.2227 48.5643 45.8102 46.8731 49.1348

And for experiment m04y:

  cat ~fmassonnet/EC-Earth_Performance/ECmean/m04y/fc?/PI2_RK08_m04y_1850_1869.txt | grep t2m | awk '{print $2}'

Output: 42.6386 43.9406 40.7176 44.0625 44.0779

Interestingly enough, my eye can see a difference. Will the statistics confirm this?

Second approach: measure directly the distance between two simulations

A second, cleaner approach (still to be implemented) will directly measure the distance between two members.
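One possible building block for such a distance, sketched here only as an illustration (not the implementation referred to above), is an area-weighted RMSE between the same 2-D field taken from two members:

  import numpy as np

  def weighted_rmse(field_a, field_b, cell_area):
      # Root-mean-square difference between two 2-D fields, weighted by the
      # grid-cell area so that high-latitude points do not dominate the metric.
      diff2 = (np.asarray(field_a) - np.asarray(field_b)) ** 2
      return float(np.sqrt(np.sum(diff2 * cell_area) / np.sum(cell_area)))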

Statistical tests to assess if the difference is significant

Tests on univariate performance indices

As a first attempt, we used a Kolmogorov-Smirnov test to assess whether the performance index is statistically different from one machine to another, for 13 different variables. As can be seen on the plot, the simulations have the same distribution for all variables except surface temperature, surface heat flux and sea ice.
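For illustration, a minimal sketch of such a test applied to the per-member t2m indices listed in the previous section (e00y at ECMWF vs m04y at MareNostrum3), using the two-sample Kolmogorov-Smirnov test from scipy:

  from scipy.stats import ks_2samp

  t2m_e00y = [47.2227, 48.5643, 45.8102, 46.8731, 49.1348]  # ECMWF, members fc0-fc4
  t2m_m04y = [42.6386, 43.9406, 40.7176, 44.0625, 44.0779]  # MareNostrum3, members fc0-fc4

  stat, pvalue = ks_2samp(t2m_e00y, t2m_m04y)
  print(f"KS statistic = {stat:.2f}, p-value = {pvalue:.4f}")

With only five members per machine the test has little power, so p-values should be interpreted with caution.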

Regarding the maps of the differences between these simulations, the model simulates different Antarctic sea-ice conditions on the two platforms. This induces a difference in surface heat flux and surface temperature. Such differences can be explained either by the sea-ice model itself or by its coupling with the rest of the coupled climate model, giving different results depending on the platform used.

Overall, except for this sea-ice difference, all the other variables seem to be consistent between the two platforms, which is comforting… The next step will be to check the restart that has been used for the simulations, and then to investigate the reasons for differences that should not exist.

NB: This is the old file with the differences of RK indices computed for only two platforms. In this plot, the simulations e00x and e00y have the same distribution for all variables. This is not the case for the simulations e00y and m04y, which differ in terms of surface temperature, surface heat flux and sea ice.

Tests on multivariate performance indices

Ideally, we would generalize the test above to assess whether two joint PDFs are different from each other, based on a finite number of samples.
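One possible route (a sketch only, not the test eventually adopted) is a permutation test based on the energy statistic, which compares two multivariate samples without assuming any particular distribution:

  import numpy as np

  def energy_stat(x, y):
      # Energy statistic between two samples (rows = members, columns = indices).
      x = np.asarray(x, dtype=float)
      y = np.asarray(y, dtype=float)
      def mean_dist(a, b):
          return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
      return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

  def permutation_pvalue(x, y, n_perm=9999, seed=0):
      # p-value obtained by randomly relabelling the pooled members.
      rng = np.random.default_rng(seed)
      obs = energy_stat(x, y)
      pooled = np.vstack([np.asarray(x, float), np.asarray(y, float)])
      n = len(x)
      count = 0
      for _ in range(n_perm):
          idx = rng.permutation(len(pooled))
          count += energy_stat(pooled[idx[:n]], pooled[idx[n:]]) >= obs
      return (count + 1) / (n_perm + 1)

With five members per machine there are only 252 distinct ways of splitting the pooled ensemble, so the permutations could also be enumerated exhaustively.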

Summary of monthly meetings

13 May 2015

Agenda

  • Feedback on the EC-Earth presentation
  • Update of results with i06c (Ithaca)
  • Additional experiments to conduct: different # of procs on same machine
  • Increasing understanding: why does Antarctic sea ice diverge even in the first year (deep convection diagnostic?)

Points discussed

  • Most people don't use the Reichler and Kim (2008) indices for specific variables; they instead look at the super-index. We would not be surprised if this super-index turned out not to be machine-sensitive, even though differences were found to exist at the variable level. This also suggests that we have to be very cautious with numbers that claim to reflect the performance of an entire model.
  • The differences seen in the Southern Ocean sea ice between different simulations call for better understanding. Analysing the origins of sea ice differences (from the ocean, the atmosphere, the ice itself) would be a plus for the study.
  • While for a paper we will clearly need more members, the conclusions are already strong with five. If we run five more members, we have to 'make sure to save the ice outputs correctly'.
  • The MPP reproducibility (on the same machine) has also to be investigated. Uwe Fladrich has suggested some ideas at the EC-Earth meeting.
  • The strong 'drift' in Antarctic sea ice is striking. There is supposed to be no such drift, since the simulations are started from equilibrated conditions (tome run from Paolo Davini). Given that the two machines show roughly the same drift, this could be a problem in the code. We need to check whether Arctic sea ice drifts as well; this could help attribute the sources of the drift.

Tasks for next meeting

'Eleftheria': analysis of (1) ocean column stability, heat fluxes, convection in the Southern Ocean (2) global mean temperature time series (3) Arctic sea ice.

Plots of February sea ice area in the Antarctic: plot_feb.pdf
Plots of September sea ice area in the Antarctic: plot_sep.pdf
Plots of February sea ice area in the Arctic: plot_feb_NH.pdf
Plots of September sea ice area in the Arctic: plot_sep_NH.pdf
Plots of SST, global-mean and Northern/Southern Hemisphere: plot_tos.pdf

'Asif': compute a diff between our (IC3) EC-Earth3.1 code and the CNR code. In particular: are they using LIM3, with one category; what are their namelists; do they have the last 20 years of the tome simulation? Are we using the same forcing as they are? Diff our ocean and ice restarts against theirs (they are supposed to be identical except for the noise in the SSTs).

'Martin': understand why the analysis with three machines gives contradictory results: Mare Nostrum is different for one variable, Ithaca is different for another variable.

'Omar': Make a figure similar to Eleftheria's for his run started from the “tomf” restarts of Paolo Davini. Is the model drift also present?

'Paco': Send an e-mail to Uwe to clarify his idea of experiment for MPP reproducibility.

17 June 2015

Points Discussed

  • Omar has similar issues with the 'tomf' restarts as we had with the 'tome' restarts from Paolo Davini: a significant negative model drift in Antarctic sea ice, although the restarts are supposed to come from an equilibrated run.
  • Asif checked differences in the code and namelists between Paolo Davini and us. 'We are clearly not running the same model'. Major differences:
    • in codes: NEMO and IFS present significant differences.
    • in namelists. No difference was found in the sea-ice namelist, which supports the hypothesis that Southern Ocean warming could be the cause of the drift as well as of the differences between machines.
    • no difference in forcings
    • diff between our (IC3) EC-Earth 3.1 code and the CNR code: “We used an EC-Earth 3.1 (we “started” with the r1692) and then we add the correction for the Caspian sea (see https://dev.ec-earth.org/issues/152) since the model was crashing. Although minor change in the code can exist, I would say that it should share almost everything with the maintenance version of the 3.1 (https://dev.ec-earth.org/projects/ecearth3/repository/show/ecearth3/branches/maintenance/3.1).” Apart from what Paolo says, we also have the local changes introduced into our local branch v3.1b, like Isabel's NSSS patch, Uwe's northfold patch, Klaus's iceflux patch… plus key_diahth as an additional key.
    • their namelists: coupler contains changes related to Klaus's iceflux patch; ifs contains changes from Lauriane's SPPT work; nemo contains changes from Klaus's iceflux patch (cn_rcv_qsr, cn_rcv_qns), nn_chldta, ln_rnf_mouth, namsbc_ssr, Chloe's rn_ebb, SPECS's namptr and Uwe's northfold (ln_nnogather); ice has no change:
      • cn_rcv_qsr: under namelist section namsbc_cpl (coupled ocean/atmosphere model) (tome has value “oce and ice” and i06c has “conservative”)
      • cn_rcv_qns: namsbc_cpl (coupled ocean/atmosphere model) (tome has value “oce and ice” and i06c has “conservative”)
      • nn_chldta: namtra_qsr (Penetrative solar radiation) (tome has value 0 and i06c has 1)
      • ln_rnf_mouth: namsbc_rnf (Runoffs namelist surface boundary condition) (tome has value false and i06c has true)
      • namsbc_ssr: section modified in i06c w.r.t. nudging (surface boundary condition: sea surface restoring); apart from variable names, rn_deds = -27.7 in tome while -150.0 in i06c (rn_deds is the “magnitude of the damping on salinity [mm/day]”)
      • rn_ebb: namzdf_tke (Turbulent eddy kinetic dependent vertical diffusion) (tome has value 67.83 and i06c has 33.92)
      • namptr: section modified for the Poleward Transport Diagnostic needed for SPECS
      • ln_nnogather: nammpp (Massively Parallel Processing) (not set in tome; this change comes along with Uwe's northfold patch, used to improve the scalability of the ocean model)
    • In particular: are they using LIM3, with one category: “you are right, we used LIM3. Unfortunately we do not have any saved output from the tome spinup, thus we cannot help you with this.”
    • Are we using the same forcing as they are: “we set it to 1850. ifs_cmip5_fixyear is set to the year you want to use as a constant GHG/volcano/aerosol forcing. If it is zero, the simulation uses the rcp protocol.”
    • do they have the last 20 years of the tome simulation: No
    • Diff of our ocean and ice restarts against theirs (they are supposed to be identical except for the noise in the SSTs): yes, they are identical.
  • In light of the previous point, it does not make sense to exchange restarts with another group, even if this group has checked out the same version of the model. Home-made patches are continuously added in each group, and the consequence is that each group ends up with a different climate.

Tasks for next meeting

  • 'François': As soon as the last chunk of the last member of 'i06c' is done, process the PI scripts. Done
  • 'Martin': As soon as the last chunk of the last member of 'i06c' is done, update the figures of scores and maps to have the three machines on the figures.
  • 'Asif': extend fc0 of 'i06c' to 30-50 years (20 years are already done).
  • 'Eleftheria': follow up the simulation of Asif: when the model is equilibrated for sea ice and global SST, send a notification to all of us. If possible, also check temperatures in the deeper ocean. It is probable that they won't be stabilized even after 100 years.
  • 'François': When the new, equilibrated restart is available from Asif's run, add tiny perturbations and dispatch the restarts to the three machines to start a new cycle of experiments. Pending Asif's run
  • 'Asif and François': Make new experiments with new repro restarts; recompile the code on MN and ECMWF with the same distribution and configuration of processors, possibly new keys (fp_precise), and less aggressive optimization.

2 November 2015

Points Discussed

François made a summary of the current status of this research. File available here: 20151102_repro_fm.pdf. The salient messages are:

  • Under current standards, EC-Earth3.1 is not reproducible – and this is probably true for other models. This is because we tend to ignore compilation options and keys that are strict with respect to floating point operations and optimization. This is also because we often use a different number of processors on different machines.
  • A long, stable, run from i06c is now available for a second stream of reproducibility experiments. These experiments will run with the fp-model precise key and with the same distribution of processors.
  • SSS restoring and FW conservation flags are to be disabled in the future reproducibility runs. The problem of Antarctic sea ice loss is pervasive in all EC-Earth versions, and should be solved independently of this reproducibility exercise.

Tasks for next meeting

Asif and François launch the reproducibility experiments from the stabilized i06c on Ithaca and MareNostrum.

9 November 2015

Stabilization of i06c. The plots here and here show the stabilization of one of the members of i06c, which was extended to achieve equilibrium. It was decided that the run is now in a sufficiently stable climate to perform the second stream of reproducibility experiments.

12 November 2015

There was a meeting involving François, Kim and Oriol to discuss the meaning and use of compilation flags/options in these experiments.

As a reminder, here is the current status. We do not have reproducibility in EC-Earth3.1 (see the discussion above), but the setup was not 100% perfect. There were two problems: 1) problems linked to differences in domain decomposition (number and distribution of processors), and 2) problems linked to differences in compiler versions, the use of aggressive optimization levels, and the absence of certain keys like -fp-model strict/precise.

To distinguish between the two problems and following Oriol and Kim's suggestion, here is the updated plan.

0) Talk to the NEMO and IFS teams, at least to inform them of our plans. François sends an e-mail to Sébastien Masson, and Kim to the IFS people at ECMWF. We already know that a different domain decomposition gives different (bit-wise) results for NEMO. Whether the results can be different climate-wise is what we want to test, but these teams may have obtained results we are unaware of.

1) Reproducibility on Ithaca. To isolate the effect of the processor decomposition, we will first run a reproducibility experiment on Ithaca, started from the equilibrated restart obtained after 60 years of simulation. We will only change the domain decomposition (number and distribution of processors). Since everything else will be equal by construction, this will allow us to examine the sole effect of processors on reproducibility. Questions to elucidate at this stage are:

  • Can we risk this strategy, given that we do not know when we will lose access to Ithaca?
  • The reference decomposition is 72: (32+16+22). What could be another decomposition? I suggest 64: (32+12+20), but without any clue whether this makes sense.
  • Ideally, the compiler version, the MPI and LAPACK versions, and the SZIP-HDF5-NetCDF-GRIB versions should also be frozen now, if we then want to run other experiments on other platforms.
  • Compilation flags should include the -fp-model source option, which favours reproducibility and portability (see the reference below). Contrary to what we might all think, the -fp-model precise and -fp-model strict options favour accuracy, but not necessarily reproducibility. In fact, both characteristics cannot be achieved simultaneously (see the reference below). Thanks to Kim for raising that.
  • Optimization flags should be set to -O0. This will likely increase the execution time, but we do not know by how much yet. I would suggest starting the experiment; if we realize that it will take too long to finish, we can come back to this choice.

2) Reproducibility across machines. Looking at the table prepared by Asif (above), we can see that there are indeed differences in compiler versions. We will have to make sure that all compiler versions are identical, at least as far as we can. As a reminder, the idea is to do everything in our power to make the experiments reproducible. For now, only simulations on Ithaca and MareNostrum3 can be conducted.

3) We'll use the same diagnostics as we did earlier this year. This part is ready; there is no reason why the diagnostics should change.

More about compilation options can be found here. A description of the tradeoffs in floating-point operations is here and here.

9 December 2015

We had a meeting with the usual people plus Klaus and Uwe (SMHI), who are also tracking this reproducibility issue and are interested in what we are doing. Please visit this page. They mention an interesting paper by Barker et al., in which a software tool for tracking differences in the CESM model is presented. The paper is not strictly about coupled GCMs; it focuses on the atmosphere and on short (1-yr) periods.

The discussions were quite rich, and here is the summary in a few bullet points

  • We have to be extremely careful when saying things like “EC-Earth is not reproducible”. First, because “reproducibility” is a loosely defined concept: bit-for-bit reproducibility is different from climate-for-climate reproducibility. Defining the latter (i.e., are two ensembles statistically indistinguishable from each other?) is particularly challenging, both regarding the protocol to use and the statistical test to apply. The other reason why we need to be careful is that we, as users, might offend developers who strive to make their models reproducible, and this could be seen as a lack of respect.
  • SMHI is mostly interested in understanding the configurations under which EC-Earth is reproducible, while the initial question we (at IC3 and now BSC) ask is: can we run EC-Earth on different platforms if we follow our common standards?
  • Assessing the reproducibility of a whole system is different from assessing the reproducibility of one particular variable (e.g., Antarctic sea ice extent in winter). A good point of the Barker et al. paper referenced above is that their test is multivariate, since the set of 120 variables is first EOFed. By carrying out the study in the Principal Component Analysis (statistical) space rather than in physical space, they probably reach stronger conclusions than if they had looked at all (dependent) variables separately, as we do in our case.

The topic is becoming extremely complex and far-reaching, and the team looking into it is growing every month. On the other hand, it has been a long-standing issue (almost one year now) and we need insights for the next EC-Earth meeting and the upcoming CMIP6. Here is a suggestion for how to continue the work: it should be split into two tasks.

  • Developer aspect - Xavi Yepes is now looking into the bit-for-bit reproducibility issue with EC-Earth 3.2 and for short (3-month) runs. SMHI (Uwe, Klaus) and KNMI (Philippe Le Sager) are aware of this. He is making several tests:
    1. Changing the number of processors in NEMO, IFS, both.
    2. Setting optimization to -O2 or -O3
    3. Setting the -fp-model to precise, strict, source
  • User aspect - Asif and François will continue adopting the “user” point of view and extend i06c under the same conditions as before, in order to reach the same conclusions as before but without the massive drift that we had. When insights from the “developer” team are available, other tests will be performed to see whether we can achieve reproducibility or not.

15 December 2015

The User aspect experiments are launched. Ithaca's i077 is now under way.

17 December 2015

François and Xavier agreed that it is necessary to perform several runs varying technical aspects. Ideally, all of the following aspects should be evaluated, but this is not feasible because the number of combinations grows exponentially. So the parameters to try are:

  • Compulsory:
    • Code optimization: -O2 and -O3
    • Regarding floating-point calculations: -fp-model [precise | strict]
    • Usage of -xHost flag (best instructions according to host machine)
    • Two processor combinations:
      • IFS 320 and NEMO 288
      • IFS 128 and NEMO 64
  • Optional:
    • Without -xHost
    • Without -fp-model clause
    • Try -fp-model source
    • Explore more processor combinations

So, we should have 4 compulsory compilations:

  • -O2 -fp-model precise -xHost
  • -O2 -fp-model strict -xHost
  • -O3 -fp-model precise -xHost
  • -O3 -fp-model strict -xHost

And consequently, with the two processor combinations, 8 compulsory outputs.

Additional considerations:

  • Use last EC-Earth 3.2beta release
  • Enable key_mpp_rep
  • 1 month, writing every day
  • Use optimization to avoid mpi_allgather use at the northfold

4 February 2016

Javier García-Serrano and Mario Acosta showed some reproducibility results at the EC-Earth meeting 2016. The community recommends that we finish the reproducibility experiments and publish the results. Some issues should be addressed first:

Different combinations of flags for optimization and floating-point operations have been checked on MareNostrum3; bit-for-bit reproducibility has not been possible for EC-Earth 3.2beta. However, since bit-for-bit reproducibility could decrease performance, a combination of flags should be found that balances reproducibility, accuracy and performance. The following tasks should be discussed to achieve this:

  • Determine the best method to quantify differences between runs
    • Propose a reference against which we can compare the rest of the experiments. This reference could be used in the future to check runs on new platforms, the inclusion of new modules, etc.
    • Use a statistical method to quantify the differences between runs and propose a minimum threshold to achieve, instead of bitwise precision, in order to avoid critical restrictions on performance.
    • Propose a method to decide which of two simulations with valid results is the best. Some experiments using different compiler flags will obtain similarly valid results (maybe with differences of only 1%); it would be convenient to know which obtains the better results (quality of the simulation).
  • Determine a combination of flags (floating-point control and optimization) and additional optimization methods which achieves a balance between performance, accuracy and reproducibility.
    • Suggest a combination of flags and/or implement some specific optimizations to achieve the best possible performance while keeping the differences below X% on a particular platform and below Y% between two different platforms with a similar architecture (with Y > X); see the sketch after this list.
  • If bit-for-bit reproducibility was achieved with EC-Earth 3.1, study how to obtain it with EC-Earth 3.2beta, at least in a debug mode.
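A minimal sketch of what such a threshold check could look like (the diagnostics used and the values of X and Y are placeholders that remain to be agreed on):

  import numpy as np

  def within_tolerance(diag_ref, diag_test, max_rel_diff_percent):
      # diag_ref, diag_test: arrays of summary diagnostics (e.g. global means)
      # from a reference run and a test run. Returns True if every relative
      # difference stays below the agreed threshold (X% on the same platform,
      # Y% across platforms with a similar architecture, with Y > X).
      ref = np.asarray(diag_ref, dtype=float)
      test = np.asarray(diag_test, dtype=float)
      rel_diff = 100.0 * np.abs(test - ref) / np.abs(ref)
      return bool(np.all(rel_diff < max_rel_diff_percent))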

27 May 2016

See the summarizing presentations of François and Mario. A more general set of slides about climate reproducibility is available here and was also posted on the EC-Earth development portal as issue 207.

Actions:

  • Mario runs an experiment with -fpe0 activated, on ECMWF.
  • Mario/Oriol: tests are to be made with libraries (NetCDF, GRIB, etc.) compiled with the same options and the same version of the code.

10 November 2017

Martin and François have worked to make the scripts testing the reproducibility more universal. These can now be found in the following gitlab project:

https://earth.bsc.es/gitlab/fmassonnet/reproducibility.git

A draft of the paper has been created:

https://docs.google.com/document/d/1aMsdggygIGmbyiFmmEOEFIl6ZVe-EO7Jcd04B6ZP91A/edit
