Differences

This shows you the differences between two versions of the page.

--- reproducibility [2015/11/12 10:18]
fmassonn [12 November 2015]
+++ reproducibility [2017/11/10 14:03]
fmassonn
@@ Line 145: / Line 145: @@
 ===== Summary of monthly meetings =====
-==== 13 May 2015 ====
+===== 13 May 2015 =====
 === Agenda ===
@@ Line 180: / Line 180: @@
 '''Paco''': Send an e-mail to Uwe to clarify his idea for an MPP reproducibility experiment.
-==== 17 June 2015 ====
+===== 17 June 2015 =====
 === Points Discussed ===
@@ Line 217: / Line 217: @@
   * '''Asif and François''': Make new experiments, new repro restarts, recompile the code on MN and ECMWF with the same distribution and configuration of processors, possibly new keys (fp_precise), and less aggressive optimization.
-==== 2 November 2015 ====
+===== 2 November 2015 =====
 === Points Discussed ===
@@ Line 228: / Line 228: @@
 **Asif and François** launch the reproducibility experiment from stabilized **i06c** on Ithaca and MareNostrum
-==== 12 November 2015 ====
+===== 9 November 2015 =====
+Stabilization of **i06c**. The plots {{stabilization_GMST.pdf|here}} and {{stabilization_ice.pdf|here}} show the stabilization of one of the members of i06c, which was extended to reach equilibrium. It was decided that the run was now in a sufficiently stable climate to perform the second stream of reproducibility experiments.
+===== 12 November 2015 =====
 There was a meeting involving François, Kim and Oriol to discuss the meaning and use of compilation flags/options in these experiments.
-As a reminder: here's the current status. We don't have reproducibility (look at discussion above) but the setup was not 100% perfect. There were two problems: 1) Problems linked to the differences in domain decomposition (number and distribution of processors) and 2) Problems linked to the differences in versions of compilers, the use of aggressive optimization levels and the absence of certain keys like fp-model strict/precise.
+As a reminder: here's the current status. We don't have reproducibility in EC-Earth3.1 (look at discussion above) but the setup was not 100% perfect. There were two problems: 1) Problems linked to the differences in domain decomposition (number and distribution of processors) and 2) Problems linked to the differences in versions of compilers, the use of aggressive optimization levels and the absence of certain keys like fp-model strict/precise.
 To distinguish between the two problems and following Oriol and Kim's suggestion, here is the updated plan.
-** 0) Talk to NEMO and IFS teams** to at least inform them on our plans. **François** sends an e-mail to Sébastien Masson and **Kim** to IFS contacts at ECMWF. We know already that different domain decomposition gives different results (bit-wise) for NEMO. Whether the results can be different climate-wise is what we want to test, but these teams could have obtained results we are unaware of.
+** 0) Talk to NEMO and IFS teams** to at least inform them of our plans. **François** sends an e-mail to Sébastien Masson and **Kim** to IFS people at ECMWF. We already know that a different domain decomposition gives different results (bit-wise) for NEMO. Whether the results can be different climate-wise is what we want to test, but these teams could have obtained results we are unaware of.
+**1) Reproducibility on Ithaca**. To isolate the effect of the processor decomposition, we'll first run a reproducibility experiment on Ithaca, started from the equilibrated restart we have obtained after 60 years of simulation. We'll change only the domain decomposition (number and distribution of processors). Since all other things will be equal by construction, this will allow us to examine the sole effect of the processor layout on reproducibility. Questions to resolve at this stage are:
+  * Can we risk this strategy given that we don't know when we will lose access to Ithaca?
+  * The reference decomposition is 72: (32+16+22) . What could another decomposition be? I suggest 64: (32+12+20), though I have no idea whether this makes sense.
+  * Ideally, the compiler version, MPI and LAPACK versions, and SZIP-HDF5-NetCDF-GRIB versions should also be frozen now, if we want to then run other experiments on other platforms.
+  * Compilation flags should include the **-fp-model source** option, which favors reproducibility and portability (see the reference below). Contrary to what we might all think, the **-fp-model precise** or **-fp-model strict** options allow for accuracy, but not necessarily for reproducibility. In fact, the two characteristics cannot both be achieved simultaneously (see the reference below). Thanks to Kim for raising that.
+  * Optimization flags should be set to **-O0**. This will likely increase the execution time, but we don't know by how much yet. I would suggest starting the experiment; if we realize that it will take too long to finish, we might revisit this choice.
+**2) Reproducibility across machines**. The table prepared by Asif (above) shows that there are indeed differences in the versions of compilers. We'll have to make sure all versions of compilers are identical, at least as much as we can. As a reminder, the idea is to do everything in our power to make the experiments reproducible. For now only simulations on Ithaca and MareNostrum3 can be conducted.
+**3) We'll use the same diagnostics as we did earlier this year**. This part is ready; there is no reason why the diagnostics should change.
+More about compilation options can be found {{https://software.intel.com/sites/default/files/Compiler_QRG_2013.pdf|here}}. A description of the tradeoffs in floating-point operations is {{https://software.intel.com/es-es/node/582224|here}} and {{https://software.intel.com/es-es/node/582223|here}}
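One reason why a different domain decomposition can change results bit-wise is that floating-point addition is not associative: when a global sum is split differently across processors, the same numbers are added in a different grouping and the last bits can change. The sketch below only illustrates this effect in plain Python; it is not taken from NEMO or IFS, and the helper names (`naive_sum`, `decomposed_sum`) and the sample values are hypothetical.

```python
def naive_sum(values):
    """Left-to-right floating-point sum (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def decomposed_sum(values, split):
    """Sum `values` as two hypothetical subdomains, values[:split] and
    values[split:], then combine the two partial sums."""
    return naive_sum(values[:split]) + naive_sum(values[split:])


field = [0.1, 0.2, 0.3]

sum_a = decomposed_sum(field, 2)  # grouping (0.1 + 0.2) + 0.3
sum_b = decomposed_sum(field, 1)  # grouping 0.1 + (0.2 + 0.3)

print(repr(sum_a), repr(sum_b))
print("bit-wise identical:", sum_a == sum_b)
print("absolute difference:", abs(sum_a - sum_b))
```

The two groupings differ only in the last bit. Whether such round-off differences grow into a different climate is the separate question that the experiments above are meant to address.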
+===== 9 December 2015 =====
+We had a meeting with the usual people plus Klaus and Uwe (SMHI), who are also tracking this reproducibility issue and are interested in what we are doing. Please visit [[https://dev.ec-earth.org/boards/6/topics/375|this page]]. They mention an interesting paper by [[http://www.geosci-model-dev.net/8/2829/2015/gmd-8-2829-2015.pdf|Baker et al.]]. In this paper, a software tool is presented to track differences in the CESM model. The paper is not strictly about GCMs; it is more about the atmosphere and about short periods (1-yr).
+The discussions were quite rich, and here is the summary in a few bullet points:
+  * We have to be extremely **careful** when saying things like "EC-Earth is not reproducible". First, because "reproducibility" is a loosely defined concept: bit-for-bit reproducibility is different from climate-for-climate reproducibility. Defining the latter (i.e., are two ensembles statistically indistinguishable from each other) is particularly challenging, both regarding what protocol to use and the statistical test to apply. The other reason why we need to be careful is that we //users// might offend //developers// who strive to make their models reproducible, and this could be seen as a lack of respect.
+  * SMHI is mostly interested in understanding under which configurations EC-Earth is reproducible, while the initial question we (at IC3 and now BSC) ask is: can we run EC-Earth on different platforms if we follow our common standards?
+  * Assessing reproducibility of a whole system is different from assessing reproducibility of one particular variable (e.g., Antarctic sea ice extent in winter). A strength of the Baker et al. paper referenced above is that their test is multivariate: the set of 120 variables is first decomposed into EOFs. By working in Principal Component (statistical) space rather than physical space, they probably reach stronger conclusions than if they had looked at all the (dependent) variables separately, as we do in our case.
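To make the notion of "statistically indistinguishable ensembles" concrete for a single variable, here is a minimal sketch based on the two-sample Kolmogorov-Smirnov statistic. This is an assumption chosen for illustration: it is neither the multivariate test of the paper referenced above nor necessarily the diagnostic used in our scripts. The function names and the sample values are hypothetical, and the critical value uses the large-sample approximation, which is crude for small ensembles.

```python
import math
from bisect import bisect_right


def ks_statistic(ens_a, ens_b):
    """Largest gap between the empirical CDFs of two ensembles."""
    sa, sb = sorted(ens_a), sorted(ens_b)
    gap = 0.0
    for x in sa + sb:  # the largest gap is reached at a data point
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical_value(n, m, alpha=0.05):
    """Asymptotic critical value of the two-sample KS statistic."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))


def distinguishable(ens_a, ens_b, alpha=0.05):
    """True if the two ensembles differ at significance level alpha."""
    return ks_statistic(ens_a, ens_b) > ks_critical_value(len(ens_a), len(ens_b), alpha)


# Hypothetical ensembles of one scalar diagnostic (e.g. a 20-year mean per member).
platform_a = [14.0 + 0.01 * i for i in range(10)]
platform_b = [value + 1.0 for value in platform_a]  # deliberately shifted

print(distinguishable(platform_a, platform_a))  # same ensemble
print(distinguishable(platform_a, platform_b))  # shifted ensemble
```

A univariate test like this has to be repeated for each variable, which is exactly the limitation that a multivariate approach avoids.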
+The topic is becoming extremely complex and far-reaching, and the team looking into it is growing every month. On the other hand, it has been a long-standing issue (almost one year now) and we need insights for the next EC-Earth meeting and the upcoming CMIP6. Here is a suggestion on how to continue the work, which should be split into two tasks:
+  * **Developer aspect** - Xavi Yepes is now looking into the bit-for-bit reproducibility issue with EC-Earth 3.2, for short (3-month) runs. SMHI (Uwe, Klaus) and KNMI (Philippe Le Sager) are aware of this. He is running several tests:
+         - Changing the number of processors in NEMO, IFS, both.
+         - Setting optimization to -O2 or -O3
+         - Setting the -fp-model to precise, strict, source
+  * **User aspect** - Asif and François will continue adopting the "user" point of view and extend i06c under the same conditions as before, in order to reach the same conclusions as before but without the massive drift that we had. When insights from the "developer" team are available, other tests will be performed to see whether we can achieve reproducibility.
+===== 15 December 2015 =====
+The **User aspect** experiments are launched. Ithaca's **i077** is now under way.
+===== 17 December 2015 =====
+François and Xavier agreed that it is necessary to perform several executions changing technical aspects. Ideally, all of the following aspects would be evaluated, but that is not feasible because the number of combinations grows exponentially. So, the parameters to try are:
+  * Compulsory:
+    * Code optimization: -O2 and -O3
+    * Regarding floating-point calculations: -fp-model [precise | strict]
+    * Usage of -xHost flag (best instructions according to host machine)
+    * Two processor combinations:
+      * IFS 320 and NEMO 288
+      * IFS 128 and NEMO 64
+  * Optional:
+    * Without -xHost
+    * Without -fp-model clause
+    * Try -fp-model source
+    * Explore more processor combinations
+So, we should have 4 compulsory compilations:
+  * -O2 -fp-model precise -xHost
+  * -O2 -fp-model strict -xHost
+  * -O3 -fp-model precise -xHost
+  * -O3 -fp-model strict -xHost
+And consequently, 8 compulsory outputs.
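The counting above (4 compilations, 8 outputs) can be checked by enumerating the compulsory matrix. This is only a sketch of the bookkeeping: the variable and function names are hypothetical and nothing here launches a compilation or a run.

```python
from itertools import product

# Compulsory settings listed above.
OPT_LEVELS = ["-O2", "-O3"]
FP_MODELS = ["precise", "strict"]
DECOMPOSITIONS = [
    {"IFS": 320, "NEMO": 288},
    {"IFS": 128, "NEMO": 64},
]


def compulsory_compilations():
    """One flag string per (optimization level, fp-model) pair, always with -xHost."""
    return [
        f"{opt} -fp-model {fp_model} -xHost"
        for opt, fp_model in product(OPT_LEVELS, FP_MODELS)
    ]


def compulsory_runs():
    """Every compilation combined with every processor combination."""
    return list(product(compulsory_compilations(), DECOMPOSITIONS))


compilations = compulsory_compilations()
runs = compulsory_runs()

for flags, procs in runs:
    print(f"{flags:32s} IFS={procs['IFS']:3d} NEMO={procs['NEMO']:3d}")
print(len(compilations), "compilations,", len(runs), "outputs")
```

Adding any of the optional settings to these lists shows how quickly the matrix grows.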
+Additional considerations:
+  * Use the latest EC-Earth 3.2beta release
+  * Enable key_mpp_rep
+  * 1 month, writing every day
+  * Use the optimization that avoids mpi_allgather at the north fold
+===== 4 February 2016 =====
+Javier García-Serrano and Mario Acosta have shown some reproducibility results at the EC-Earth meeting 2016. The community recommends that we finish the reproducibility experiments and publish the results. Some issues should be addressed first:
+- Different combinations of optimization and floating-point flags have been checked on MareNostrum3; bit-for-bit reproducibility was not achieved for EC-Earth 3.2beta. However, since bit-for-bit reproducibility could decrease performance, a combination of flags should be found that balances reproducibility, accuracy and performance. The following tasks should be discussed to achieve this:
+  * Determine the best method to quantify differences between runs
+      * Propose a reference against which the rest of the experiments can be compared. This reference could be used in the future to check runs on new platforms, the inclusion of new modules, etc.
+      * Use a statistical method to quantify the differences between runs and propose a minimum level of agreement to achieve, instead of bitwise precision, in order to avoid critical restrictions on performance.
+      * Propose a method to determine which of two simulations with valid results is the best. Some experiments using different compiler flags will obtain similar valid results (maybe with differences of only 1%). It would be convenient to know which one obtains better results (quality of the simulation results).
+  * Determine a combination of flags (floating-point control and optimization) and additional optimization methods that achieves a balance between performance and accuracy & reproducibility.
+      * Suggest a combination of flags and/or implement some specific optimizations to achieve the best possible performance while keeping the differences below X% on a particular platform and below Y% across two different platforms with a similar architecture (with Y > X).
+  * If bit-for-bit reproducibility was achieved with EC-Earth 3.1, study how to obtain it with EC-Earth 3.2beta, at least in debug mode.
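A tolerance criterion of the "differences below X%" kind could be sketched as follows. The metric (largest pointwise relative difference with respect to a reference run) is only an assumption for illustration, the function names are hypothetical, and the tolerance stays a parameter because X and Y have not been chosen yet. The sample values are made up.

```python
def max_relative_difference_percent(reference, candidate):
    """Largest pointwise relative difference of `candidate` w.r.t. `reference`, in %.

    Assumes both series have the same length and that no reference value is zero.
    """
    if len(reference) != len(candidate):
        raise ValueError("series must have the same length")
    return max(abs(c - r) / abs(r) * 100.0 for r, c in zip(reference, candidate))


def within_tolerance(reference, candidate, tolerance_percent):
    """True if the candidate run stays within the given tolerance (in %)."""
    return max_relative_difference_percent(reference, candidate) <= tolerance_percent


# Made-up values of one diagnostic for a reference run and a candidate run.
reference_run = [8.0, 16.0, 32.0]
candidate_run = [8.5, 16.0, 32.0]

print(max_relative_difference_percent(reference_run, candidate_run))
print(within_tolerance(reference_run, candidate_run, tolerance_percent=10.0))
```

A pointwise criterion like this is very strict for chaotic trajectories, so in practice it would more likely be applied to climatological statistics than to raw time series.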
+===== 27 May 2016 =====
+See the summary presentations by {{20160526_groupmeeting.pdf | François }} and {{20160526_EC-Earth3.2_MarioAcosta.pdf | Mario }}. A more general set of slides about climate-reproducibility is available {{ 20160526_EC-Earth3.1_FrancoisMassonnet.pdf | here }} and was also posted on the EC-Earth development portal issue {{https://dev.ec-earth.org/issues/207 | 207}}.
+Actions:
+  * Mario runs an experiment with **-fpe0** activated, on ECMWF.
+  * Mario/Oriol: Tests are to be made with libraries (NetCDF, GRIB, etc.) compiled with the same options and the same version of the code.
+===== 10 November 2017 =====
+Martin and François have worked on making the reproducibility test scripts more general. They can now be found in the following GitLab project:
-**1) Reproducibility on Ithaca**. To isolate the effect of the decomposition of processors, we'll first run a reproducibility experiment on Ithaca. We'll just change the domain decomposition (number and distribution). Since all other things will be equal by construction, this will allow to examine the sole effect of processors on reproducibility. For this we'll make sure to enable the __fp-model strict__ compilation option.
+https://earth.bsc.es/gitlab/fmassonnet/reproducibility.git
-) **Reproducibility across machines**
+A draft of the paper has been created:
-More about compilation options can be found {{https://software.intel.com/sites/default/files/Compiler_QRG_2013.pdf|here}}
+https://docs.google.com/document/d/1aMsdggygIGmbyiFmmEOEFIl6ZVe-EO7Jcd04B6ZP91A/edit
