# 0.c. Papers
The most thoroughly peer-reviewed information is published in papers, which often also include a detailed section on the methodology used to obtain the presented results. Here is a list of papers that we use internally.
### c.1 [Analysis of OpenMP 4.5 Offloading in Implementations: Correctness and Overhead](https://www.sciencedirect.com/science/article/abs/pii/S0167819119301371)
> Article: "Analysis of OpenMP 4.5 Offloading in Implementations:
> Correctness and Overhead"; conference paper for ParCo 2019, by people
> from University of Delaware and Oak Ridge National Lab.
local: *j.parco.2019.102546.pdf* *accelerator-programming-using-directives-2021.pdf*
The article analyzes OpenMP 4.5-compliant offloading implementations to understand some details of how the OpenMP specification is implemented and the overhead this introduces. The authors present a set of 88 kernels that they check for correctness (37 of them in Fortran). This set helped to locate errors and unexpected-behaviour situations in compilers (GCC, Clang, XLC/XLF). The biggest section of the article describes the methodology for finding the OpenMP-induced overhead and analyses the observed overheads on Summit (and similar) hardware/software.
A follow-up to this article is "Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs" (April 2021). It focuses on porting some compute kernels using OpenMP and different compilers (local: *accelerator-programming-using-directives-2021.pdf*). They show that their technique of OpenMP overhead analysis helps to find hidden performance gotchas while porting OpenMP code. Their recommendations are:
- use the combined OpenMP directive `omp teams distribute parallel for`; when that is impossible, sub-optimal thread/block settings for kernels can be expected (and sometimes compensated for with manual `num_teams`/`num_threads` settings), see the sketch after this list;
- check whether there are kernels with runtimes below 50 µs and try to fuse them;
- minimize the use of OpenMP reductions, replace min/max reductions with sum reductions where possible, and check at runtime whether code with reductions uses atomics.
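
A minimal sketch of the first recommendation (not an example from the paper): the combined construct applied to a simple vector update. The `num_teams`/`num_threads` values here are illustrative assumptions, not tuned settings.

```c
#include <stdio.h>

#define N (1 << 20)

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Combined construct as recommended; the launch-geometry clauses are
       illustrative only and are normally best left to the compiler. */
    #pragma omp target teams distribute parallel for \
            map(to: x) map(tofrom: y) num_teams(256) num_threads(128)
    for (int i = 0; i < N; i++)
        y[i] += 2.0 * x[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```

Compile with an offloading-capable compiler, e.g. `nvc -mp=gpu` or `clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda`; flags vary by toolchain.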
### c.2 [Performant Portable OpenMP](https://grypp.github.io/papers/PerformantPortableOpenMP.pdf)
> Article: "Performant Portable OpenMP" -- Conference paper CC'22, April 2022, by Güray Özen
local: *PerformantPortableOpenMP.pdf*
The paper gives an overview of NVIDIA's OpenMP GPU-offloading implementation. First, it gives a high-level motivation. In section 3 the authors expand on two issues with explicit OpenMP constructs (the `distribute parallel do` style):
- a portability issue when the same code must also run with host (CPU) OpenMP;
- performance considerations: the need to support the switch from serial to parallel execution inside a target region, as required by the OpenMP specification; various implementations of this are discussed;
- challenging scenarios for parallelism are also mentioned (`omp master`, `omp critical`, etc.).
Section 4 discusses the `omp teams loop` implementation: restrictions on constructs like `omp master` (but `atomic` is allowed) and restrictions on the use of the `parallel` directive.
Section 5 explains, with simple examples, how `omp target teams loop` maps and/or collapses loops to exploit the available parallelism. Then some illustrations of how GPU code generation works are given, finishing with a discussion of benchmarking and evaluation.
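
A hedged sketch in the spirit of section 5 (not taken from the paper): the descriptive `loop` construct with `collapse(2)`, which leaves the mapping of the combined iteration space onto teams and threads to the compiler.

```c
#include <stdio.h>

#define N 512
#define M 512

int main(void) {
    static double a[N][M];

    /* Descriptive construct: the compiler decides how the collapsed N*M
       iteration space is mapped onto teams and threads. */
    #pragma omp target teams loop collapse(2) map(from: a)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = (double)(i + j);

    printf("a[1][2] = %f\n", a[1][2]);
    return 0;
}
```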
### c.3 [Outcomes of OpenMP Hackathon](https://www.osti.gov/servlets/purl/1818931)
> Outcomes of OpenMP Hackathon: OpenMP Application Experiences with the Offloading Model; 15-page document; by S. Pophale, D. Oryspayev; September 2021 (Brookhaven National Laboratory, U.S. Department of Energy)
There are two sources:
- [source 1](https://www.osti.gov/servlets/purl/1818931)
- [source 2](https://www.osti.gov/servlets/purl/1823332)
local: *1818931.pdf* *1823332.pdf*
A summary of experiences porting various apps and mini-apps (BerkeleyGW, WDMApp/XGC, GAMESS, GESTS, and GridMini) using OpenMP. Two NVIDIA-based HPC platforms were used.
For the BerkeleyGW application, section 3.1.4 on the `GPP` kernel gives an interesting example of a multi-level `omp loop`: the inner loops are tied to threads within a block, and each thread works through its portion sequentially. The compiler option `-mp=noautopar` is mentioned. (The inner-loop optimisation is in fact a "tiling" technique for a better cache fit; they mention that tiling can be done automatically with the `tile` directive of OpenMP 5.1.)
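
A hedged sketch of the multi-level idea (not the actual GPP kernel): the outer loop is distributed by the `loop` construct, while each thread walks its own small chunk sequentially, which is essentially manual tiling; where supported, the OpenMP 5.1 `tile` directive can generate such chunking automatically.

```c
#include <stdio.h>

#define N 4096
#define TILE 32

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = (double)i;

    /* Outer loop: distributed over teams/threads by the `loop` construct.
       Inner loop: each thread processes its TILE-sized chunk sequentially,
       a manual tiling for locality (sizes are illustrative only). */
    #pragma omp target teams loop map(to: b) map(from: a)
    for (int t = 0; t < N / TILE; t++) {
        for (int k = 0; k < TILE; k++) {
            int i = t * TILE + k;
            a[i] = 2.0 * b[i];
        }
    }

    printf("a[100] = %f\n", a[100]);
    return 0;
}
```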
Some useful notes are given in the "Lessons learned" section (3.1.5).
One of the citations:
> One of the keys to our success over the years has been the use of BerkeleyGW mini-apps. Having mini-apps that accurately capture the computational motifs without the various library dependencies and a small data footprint (data set fits in memory of single compute node) was helpful for sharing with vendors for independent assessment [6]. Another critical component of our GPU porting workflow has been the creation of OpenMP-target and OpenACC builds in our Buildbot continuous integration suite running on Amazon Web Services (AWS). These builds run all GPU-accelerated kernels through the full BerkeleyGW regression test suite for every pull request.
They recommend using the combination of `OMP_TARGET_OFFLOAD=MANDATORY` and `NVCOMPILER_ACC_NOTIFY=3`. They also recommend HPCToolkit and its program-counter sampling profiling technique.
For the WDMApp application, they mention an interesting compiler-related issue where the NVIDIA OpenMP runtime launched the dominant GPU kernel with only 8 OpenMP teams. They emphasize the necessity of double-checking the runtime configuration of OpenMP kernels not only at compile time with `-Minfo=mp`, but also at runtime.
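
A minimal sketch of such a runtime check (an assumed approach, not code from the report): query the launch geometry from inside the target region and compare it with what `-Minfo=mp` reported at compile time.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int teams = 0, threads = 0;

    /* Report the launch geometry the runtime actually chose for this
       target region. */
    #pragma omp target teams map(from: teams, threads)
    #pragma omp parallel
    {
        if (omp_get_team_num() == 0 && omp_get_thread_num() == 0) {
            teams = omp_get_num_teams();
            threads = omp_get_num_threads();
        }
    }

    printf("teams = %d, threads per team = %d\n", teams, threads);
    return 0;
}
```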
(TO BE CONTINUED: GAMESS; GESTS; GridMini)