The most peer-reviewed information is published in papers, which often also include a detailed section on the methodology used to acquire the presented results. Here is a list of papers that we use internally.

### c.1 [Analysis of OpenMP 4.5 Offloading in Implementations: Correctness and Overhead](https://www.sciencedirect.com/science/article/abs/pii/S0167819119301371)

> Article: "Analysis of OpenMP 4.5 Offloading in Implementations:
> Correctness and Overhead"; conference paper for ParCo 2019, by people
> from the University of Delaware and Oak Ridge National Lab.

local: *j.parco.2019.102546.pdf* *accelerator-programming-using-directives-2021.pdf*

The article analyzes OpenMP 4.5-compliant offloading implementations to understand some details of how the OpenMP specification is implemented and the overhead this introduces. The authors present a set of 88 kernels (37 of them in Fortran) that they check for correctness; this set helped to locate errors and unexpected-behaviour cases in compilers (GCC, Clang, XLC/XLF). The biggest section of the article describes a methodology for measuring the OpenMP-induced overhead and analyses the overheads observed on Summit (and similar) hardware/software.
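
A minimal sketch of what such an overhead measurement can look like (our own assumption, not the paper's actual harness): time a near-empty offloaded construct many times and report the average cost per launch.

```c
/* Sketch of an offloading-overhead micro-benchmark (assumed example,
 * not the paper's harness): launch a near-empty target construct
 * repeatedly and average the per-launch cost. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int reps = 1000;
    int x = 0;

    /* Warm-up: the first offload pays one-time device-initialization
     * costs that should not be charged to the construct itself. */
    #pragma omp target map(tofrom: x)
    x += 1;

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; ++r) {
        #pragma omp target teams distribute parallel for map(tofrom: x)
        for (int i = 0; i < 1; ++i)
            x += 1;
    }
    double t1 = omp_get_wtime();

    printf("avg per-launch overhead: %.1f usec (x=%d)\n",
           (t1 - t0) / reps * 1e6, x);
    return 0;
}
```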

A follow-up to this article is "Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs" (April 2021). It focuses on porting some compute kernels with OpenMP using different compilers (local: accelerator-programming-using-directives-2021.pdf). They show that their technique of OpenMP overhead analysis helps find hidden performance gotchas while porting OpenMP code. Their recommendations (see the sketch after this list):

- use the combined OpenMP directive `omp teams distribute parallel for`; when this is impossible, sub-optimal thread/block settings in kernels can be expected (and sometimes compensated for by setting `num_teams`/`num_threads` manually);
- check whether there are kernels with runtimes under 50 usec and try to fuse them;
- minimize the use of OpenMP reductions, replace min/max reductions with sum reductions, and check at runtime whether code with reductions uses atomics.
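
A sketch of the first recommendation on a hypothetical saxpy kernel (our example, not the paper's code): prefer the combined directive, and when the loop structure forces a split form, pin the geometry explicitly.

```c
/* Preferred form: one combined directive; the compiler chooses the
 * teams/threads geometry. (Hypothetical saxpy, not from the paper.) */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Split form: defaults may be sub-optimal, so num_teams/num_threads
 * (the values below are placeholders) can compensate. */
void saxpy_split(int n, float a, const float *x, float *y) {
    #pragma omp target teams distribute num_teams(80) \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int blk = 0; blk < n; blk += 128) {
        int end = blk + 128 < n ? blk + 128 : n;
        #pragma omp parallel for num_threads(128)
        for (int i = blk; i < end; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```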

### c.2 [Performant Portable OpenMP](https://grypp.github.io/papers/PerformantPortableOpenMP.pdf)

> Article: "Performant Portable OpenMP" -- Conference paper CC'22, April 2022, by Güray Özen

local: *PerformantPortableOpenMP.pdf*

The paper overviews NVIDIA's implementation of OpenMP GPU offloading. First, it gives a high-level explanation of the motivation. In Section 3 they expand on two issues with explicit OpenMP constructs (the `distribute parallel do` style):

- portability, if CPU OpenMP execution of the same code is considered as well;
- performance: the necessity to support transitions from serial to parallel execution within a region, as required by the OpenMP specification; various implementations of this are discussed (see the sketch below).

Challenging scenarios for parallelism (`omp master`, `omp critical`, etc.) are also mentioned.
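
A hypothetical fragment illustrating the serial-to-parallel transition issue (our assumption, not the paper's example): inside one target region, team-sequential code alternates with parallel loops, and the implementation must support this switch efficiently on the GPU.

```c
#include <omp.h>

/* Hypothetical fragment: execution inside a single target region
 * alternates between a parallel loop and team-sequential code, as the
 * OpenMP specification permits; making this transition cheap on GPUs
 * is the implementation challenge the paper discusses. */
void step(int n, double *a, double *iters)
{
    #pragma omp target teams map(tofrom: a[0:n], iters[0:1])
    {
        /* Parallel phase: iterations spread over teams and threads. */
        #pragma omp distribute parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= 0.5;

        /* Sequential phase: plain code here runs on the initial
         * thread of every team, so guard by team number. */
        if (omp_get_team_num() == 0)
            iters[0] += 1.0;
    }
}
```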

Section 4 discusses the `omp teams loop` implementation: restrictions on constructs like `omp master` (though `atomic` is allowed) and a restriction on using the `parallel` directive.
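
A hypothetical illustration of those rules (our example): an `atomic` update inside `teams loop` is accepted (per the paper), whereas `omp master` or a nested `parallel` in the loop body is restricted.

```c
/* Hypothetical histogram kernel: the atomic update is permitted in a
 * `teams loop` body (per the paper), while constructs like
 * `omp master` are not. Keys are assumed to lie in [0, nbins). */
void histogram(int n, const int *key, int nbins, int *bins) {
    #pragma omp target teams loop map(to: key[0:n]) map(tofrom: bins[0:nbins])
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic update
        bins[key[i]] += 1;
    }
}
```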

Section 5 explains how `omp target teams loop` maps and/or collapses loops to exploit the parallelism, with simple examples.
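
For instance (our own minimal example in the spirit of Section 5, not code from the paper), a collapsed nest under `target teams loop` hands the implementation a single flattened iteration space to spread across all levels of GPU parallelism:

```c
/* Minimal example (not from the paper): collapse(2) flattens the
 * nest so the runtime can map all n*m iterations onto teams and
 * threads at once. */
void scale(int n, int m, double *a, double s) {
    #pragma omp target teams loop collapse(2) map(tofrom: a[0:n*m])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i * m + j] *= s;
}
```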

Then some illustrations of how GPU code generation works are given, finishing with a discussion of benchmarking and evaluation.

### c.3 [Outcomes of OpenMP Hackathon](https://www.osti.gov/servlets/purl/1818931)

> Outcomes of OpenMP Hackathon: OpenMP Application Experiences with the Offloading Model; 15-page document; by S. Pophale, D. Oryspayev; September 2021 (Brookhaven National Laboratory, U.S. Department of Energy)

There are two sources:

- [source 1](https://www.osti.gov/servlets/purl/1818931)
- [source 2](https://www.osti.gov/servlets/purl/1823332)

local: *1818931.pdf* *1823332.pdf*

A summary of the experience of porting various apps and mini-apps (BerkeleyGW, WDMApp/XGC, GAMESS, GESTS, and GridMini) using OpenMP. Two NVIDIA-based HPC platforms were used.

For the BerkeleyGW application, Section 3.1.4 gives an interesting example of a multi-level `omp loop` for the `GPP` kernel: inner loops are tied to the threads within a block and process their portions of work sequentially. The compiler option `-mp=noautopar` is mentioned. (The inner-loop optimisation is in fact a "tiling" technique for a better cache fit; they mention that tiling can be done automatically with the `tile` directive of OpenMP 5.1.)
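
A schematic of the two variants (hypothetical code, not the GPP kernel itself): a hand-written multi-level `omp loop` nest, and the same blocking idea expressed with the OpenMP 5.1 `tile` directive.

```c
/* Hypothetical multi-level nest, not the GPP kernel: the outer loop
 * is distributed across teams, while the inner `loop bind(parallel)`
 * is tied to the threads of one team, each thread walking its
 * portion sequentially. */
void multilevel(int n, int m, double *a) {
    #pragma omp target teams loop map(tofrom: a[0:n*m])
    for (int i = 0; i < n; ++i) {
        #pragma omp loop bind(parallel)
        for (int j = 0; j < m; ++j)
            a[i * m + j] += 1.0;
    }
}

/* The same cache-blocking written with the OpenMP 5.1 `tile`
 * transformation (compiler support is still uneven). */
void tiled(int n, int m, double *a) {
    #pragma omp tile sizes(8, 8)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i * m + j] += 1.0;
}
```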

Some useful notes are given in the "Lessons learned" section (3.1.5). One of the quotations:

> One of the keys to our success over the years has been the use of BerkeleyGW mini-apps. Having mini-apps that accurately capture the computational motifs without the various library dependencies and a small data footprint (data set fits in memory of single compute node) was helpful for sharing with vendors for independent assessment [6]. Another critical component of our GPU porting workflow has been the creation of OpenMP-target and OpenACC builds in our Buildbot continuous integration suite running on Amazon Web Services (AWS). These builds run all GPU-accelerated kernels through the full BerkeleyGW regression test suite for every pull request.

They recommend using the combination of `OMP_TARGET_OFFLOAD=MANDATORY` and `NVCOMPILER_ACC_NOTIFY=3`. They also recommend HPCToolkit for profiling via program-counter sampling.

For the WDMApp application, they mention an interesting compiler-related issue where the NVIDIA OpenMP runtime was launching the dominant GPU kernel with only 8 OpenMP teams. They emphasize the necessity of double-checking the runtime configuration of all OpenMP kernels not only at compile time with `-Minfo=mp`, but also at runtime.
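
A small check along those lines (our sketch, not code from the report): query the launch geometry from inside the kernel at run time; running with `OMP_TARGET_OFFLOAD=MANDATORY` additionally turns a silent host fallback into a hard error.

```c
/* Sketch of a runtime double-check (our assumption): ask the device
 * how many teams were actually launched instead of trusting the
 * compile-time -Minfo=mp report. Run with
 * OMP_TARGET_OFFLOAD=MANDATORY so a silent host fallback fails. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int teams = 0;
    #pragma omp target teams map(from: teams)
    {
        /* Initial thread of team 0 records the actual team count
         * (WDMApp unexpectedly saw only 8 here). */
        if (omp_get_team_num() == 0)
            teams = omp_get_num_teams();
    }
    printf("runtime launched %d teams\n", teams);
    return 0;
}
```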

(TO BE CONTINUED: GAMESS; GESTS; GridMini)