Many conferences and workshops publish their slides after the event. Slides often contain information that is hard to find in the official documentation or manuals, and they convey practical user experience that helps to understand the software in more depth. Here is a list of slide decks that we use internally.

### b.1 [The Cray Compilation Environment by KAUST](https://www.hpc.kaust.edu.sa/sites/default/files/files/public/1.03a-AdditionalInformation_CrayCompilationEnvironment.pdf)

Presentation on compiler options related to debugging and optimization.
It is particularly detailed on the optimization flags: each possible value of every flag is explained.

### b.2 [The Cray Compilers and Libraries by ECMWF Wiki](https://confluence.ecmwf.int/download/attachments/46600240/09_Compilers_and_Libraries.pdf?api=v2)

Presentation on compiler options related to debugging and optimization.
Contains [b.1 The Cray Compilation Environment by KAUST](0.-Sources#21-the-cray-compilation-environment-by-kaust), and also expands on compilation and libraries.

### b.3 [Tools and Methods for ACC debugging by Jülich](https://juser.fz-juelich.de/record/902543/files/3-Debugging--TH.pdf)

> Tools and Methods for Debugging OpenACC - JSC OpenACC Course 2021;
> Presentation slides, 65 pages; by T. Hater (research center in Jülich)

local: *3-Debugging--TH.pdf*

Strategy notes on how to debug failing code on the NVIDIA hardware/software stack:

- `NV_ACC_NOTIFY`/`NV_ACC_TIME`;
- `compute-sanitizer`;
- `cuda-gdb` and the single-stepping approach;
- `cuda-gdb`: `set cuda launch_blocking on`;
- `cuda-gdb`: `set cuda api_failures stop`;
- `cuda-gdb`: `set cuda memcheck on`;
- `cuda-gdb`: `set cuda value_extrapolation on`;
- `cuda-gdb`: `autostep <start> for <count> lines`.

The slides then walk through an example kernel, but we do not have its source code, so we cannot reproduce it.

### b.4 [OpenMP and OpenACC for CCE by Cray](https://on-demand.gputechconf.com/gtc/2013/presentations/S3084-OpenACC-OpenMP-Directives-CCE.pdf)

> OpenMP and OpenACC for CCE (Cray Compilation Environment);
> Presentation, 48 slides; by Dr. James C. Beyer; 2013

local: *S3084-OpenACC-OpenMP-Directives-CCE.pdf*

Review of CCE 8 (2013). Brief comparison of OpenACC and OpenMP, including `async` in OpenACC versus `task` in OpenMP. Discussion of `-haccel_model=` and its possible values (asynchronous execution of kernels and memory copies). Outline of how the compiler handles OpenACC constructs in code. Basic ideas and hints for porting code to OpenACC. Detailed discussion of deep copy and structure shaping in OpenACC.
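
To make the `async` idea mentioned above more concrete, here is a minimal sketch in C. It is not taken from the slides: the function name `scale_two_arrays` and the choice of queue numbers are assumptions made for illustration. Two independent kernels are placed in separate asynchronous queues, and the host waits for both at the end. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scales two independent arrays in two async queues. */
void scale_two_arrays(double *a, double *b, int n, double factor)
{
    assert(a != NULL && b != NULL && n >= 0);

    /* Queue 1: copies and kernel for array a; the host does not wait here. */
    #pragma acc parallel loop copy(a[0:n]) async(1)
    for (int i = 0; i < n; ++i)
        a[i] *= factor;

    /* Queue 2: independent work that may overlap with queue 1. */
    #pragma acc parallel loop copy(b[0:n]) async(2)
    for (int i = 0; i < n; ++i)
        b[i] *= factor;

    /* Block until all queues (kernels and memory copies) have finished. */
    #pragma acc wait
}
```

Calling `scale_two_arrays(a, b, n, 2.0)` doubles both arrays; whether the two kernels actually overlap depends on the compiler and the device.
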

### b.5 [Cray Compiler and OpenMP Offloading by Cray](https://www.olcf.ornl.gov/wp-content/uploads/2019/11/Cray-Compiler-and-OpenMP.pdf)

> Frontier Application Readiness Kick-Off Workshop - CRAY COMPILER AND
> OPENMP OFFLOADING; Presentation, 33 slides; by Jeff Sandoval (CRAY);
> October 2019

local: *Cray-Compiler-and-OpenMP.pdf*

Review of CCE 9 (2019):

- Very basic overview of OpenMP-related compiler flags.
- A diagram of the libraries that hold the different parts of the runtime.
- OpenMP offloading strategy: basic principles, a quick GPU hardware overview, the meaning of `omp teams`/`omp distribute`, and the recommended way of parallelization with very short code examples.
- Explanation of what the `-hmsgs -hlist=m` output files and messages mean.
- Explanation of `omp target data map`.
- Debugging practice: `CRAY_ACC_DEBUG`; target=host.
- Example for MPI communication: `omp target data use_device_ptr`.
- Asynchronous execution: `nowait`, `depend`.
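
The directives named above can be sketched together in a few lines of C. This example is not from the slides; the function `axpy_then_sum` is a hypothetical helper chosen for illustration. It keeps two arrays on the device with `target data map`, launches the first kernel with `nowait`, and orders the second kernel after it with `depend`. If no device is available, or OpenMP is disabled, the loops simply run on the host.

```c
#include <assert.h>

/* Hypothetical helper: y = a*x + y, then returns sum(y),
   with both kernels sharing one device data region. */
double axpy_then_sum(double a, const double *x, double *y, int n)
{
    /* depend() does not accept zero-length array sections, so require n > 0. */
    assert(n > 0);
    double sum = 0.0;

    /* Keep x and y resident on the device for both kernels. */
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        /* teams: a league of thread teams; distribute: split iterations
           among the teams; parallel for: split each chunk among threads.
           nowait: the host may continue before this kernel finishes. */
        #pragma omp target teams distribute parallel for \
                nowait depend(out: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        /* depend(in) orders this kernel after the one writing y. */
        #pragma omp target teams distribute parallel for \
                reduction(+: sum) map(tofrom: sum) depend(in: y[0:n])
        for (int i = 0; i < n; ++i)
            sum += y[i];
    }
    return sum;
}
```

Because the second kernel is launched without `nowait`, the function returns only after both kernels have completed and `y` has been copied back.
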

### b.6 [Tools and strategies for debugging on HPE-Cray systems by Hewlett Packard Enterprise](https://www.olcf.ornl.gov/wp-content/uploads/2021/04/CrayToolsAndDebuggers_v1.0_pdfVersion.pdf)

> Presentation slides (CrayToolsAndDebuggers_v1.0_pdfVersion.pdf); 40
> slides; by Kostas Makrides (HPE Performance Engineer); May 2021

local: *CrayToolsAndDebuggers_v1.0_pdfVersion.pdf*

- Basic idea of using OpenMP on GPUs in the Cray environment (modules, compilers, basic environment options).
- Debugging with `CRAY_ACC_DEBUG`: difference in output for `CRAY_ACC_DEBUG=1|2|3`.
- Understanding the debug output for transfers with derived types.
- Enabling and analyzing the compiler listings (`-hlist=a`); understanding the listing output for derived types.
- Detailed look at the use case "Mapping Derived Types with Pointer components".
- Overview of the `CrayPat` and `Apprentice2` tools.
- Brief overview of `gdb4hpc`.
- Final debugging tips: `rocgdb`, `rocprof`, `printf` (the slides note that `printf` is not available in Fortran+OpenMP regions), `AMD_LOG_LEVEL`.
- Finally, a mention of the `Reveal` tool.
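
The derived-type use case above is a Fortran topic, but the underlying problem can be sketched in C with a struct that has a pointer member. The following is an illustrative analogue under that assumption, not code from the slides; the type `field_t` and the function `double_field` are hypothetical. The struct is mapped first and the array behind the pointer second, so that the device copy of the pointer is attached to the device copy of the data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: the C analogue of a Fortran derived type
   with a pointer component. */
typedef struct {
    int n;
    double *data;
} field_t;

/* Doubles every element of f.data. The struct is passed by value,
   so the caller's array is modified through the shared pointer. */
void double_field(field_t f)
{
    assert(f.data != NULL && f.n >= 0);

    /* Map the struct itself, then the payload behind its pointer member.
       Mapping only f would copy the host address, not the data. */
    #pragma omp target enter data map(to: f)
    #pragma omp target enter data map(to: f.data[0:f.n])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < f.n; ++i)
        f.data[i] *= 2.0;

    /* Copy the payload back, then release the struct without copying it. */
    #pragma omp target exit data map(from: f.data[0:f.n])
    #pragma omp target exit data map(release: f)
}
```

Running such a routine with `CRAY_ACC_DEBUG=2` is one way to see which of these transfers actually take place.
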

### b.7 [Introduction to the Cray Programming Environment by Cray](https://www.olcf.ornl.gov/wp-content/uploads/2019/11/20191010_Introduction2CrayPE.pdf)

> Introduction to the Cray Programming Environment; Presentation slides, 65 pages; by Heidi Poxon; October 10, 2019

local: *20191010_Introduction2CrayPE.pdf*

- A slide listing all compilers and tools included in the Cray Developer Environment, followed by a slide on accelerator-related Cray tools.
- Basic idea of the Cray Environment module system.
- Notes on `MPICH_GNI_HUGEPAGE_SIZE` and on the `cray-mpich-abi` module.
- A slide with examples of `PAT_record()` and `PAT_region_begin()`, how to load the PAT module, and where to find the profiling results.
- A note on `MPICH_RANK_ORDER.Grid` and on ways to understand communication locality in MPI.
- A couple of other PAT examples.
- A screenshot of `Apprentice2` showing GPU runtime diagrams.
- 11 slides on the `Reveal` tool.
- The `ATP` tool for signal handling.
- A few words on `valgrind4hpc` and `gdb4hpc`.

### b.8 [OpenMP Offloading Support for VASP Using Cray Compiler by AMD](https://www.openmp.org/wp-content/uploads/vasp_omp_offloading_with_cray_compiler.pdf)

> ORNL Application Readiness Workshop - AMD Tools Overview; Presentation slides, 28 pages; January 2023.

local: *vasp_omp_offloading_with_cray_compiler.pdf*

Summary of the experience of porting components of the `VASP` materials science application to GPUs using OpenMP directives:

- Slide on the difference between `CRAY_ACC_DEBUG=1|3` outputs.
- Slide on how OpenMP keywords correspond to diagnostic lines in the `-hlist=aimd` output listing.
- Slide on how OpenMP instructions correspond to a simple ROCm trace, and how to put ranges in the trace to visually separate various blocks.
- How they separate the CPU and GPU code paths (preprocessor and the OpenMP `if` clause).
- Simple example code on how to interface with C libraries such as ROCm BLAS.
- 4 slides on a pointer aliasing issue (this issue is probably fixed in CCE 15).
- 2 slides showing a solution for complex-number atomics and demonstrating the overhead of atomics.
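
Two of the items above can be illustrated with a short C sketch. It is not the VASP code: the macro `USE_GPU_PATH`, the threshold `MIN_OFFLOAD_SIZE`, and both function names are assumptions made for this example. The first function separates the CPU and GPU paths with the preprocessor and an `if` clause. The second shows one common workaround for atomic updates of complex numbers, namely updating the real and imaginary parts separately; this is not necessarily the solution presented in the slides.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical threshold below which offloading is assumed not to pay off. */
#define MIN_OFFLOAD_SIZE 4096

/* Scales v in place. Build with -DUSE_GPU_PATH (hypothetical switch)
   to compile the offload variant instead of the CPU variant. */
void scale(double *v, int n, double factor)
{
    assert(v != NULL && n >= 0);
#ifdef USE_GPU_PATH
    /* GPU build: the if() clause still keeps small problems on the host. */
    #pragma omp target teams distribute parallel for \
            map(tofrom: v[0:n]) if(target: n >= MIN_OFFLOAD_SIZE)
    for (int i = 0; i < n; ++i)
        v[i] *= factor;
#else
    /* CPU build: plain multithreaded loop, no target construct at all. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        v[i] *= factor;
#endif
}

/* Adds z to *acc using two real-valued atomics. C guarantees that a
   double complex is laid out like double[2] = {real, imaginary}. */
void atomic_add_complex(double complex *acc, double complex z)
{
    double *parts = (double *)acc;
    const double *zp = (const double *)&z;

    #pragma omp atomic update
    parts[0] += zp[0];
    #pragma omp atomic update
    parts[1] += zp[1];
}
```

Note that the two parts are not updated as one indivisible unit, so a concurrent reader could observe a half-updated value; for pure accumulation the final sum is still correct.
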

### b.9 [Best Practices for OpenMP by NERSC (workshop)](https://www.nersc.gov/assets/Uploads/nersc-best-practices-sep-1-2022.pdf)

> Introduction to OpenMP Offload Part 2: Optimization and Data Movement.
> Presentation slides by Chris Daley, NERSC; 25 slides; Sept 2022

local: *nersc-best-practices-sep-1-2022.pdf*

Discusses best practices for using OpenMP to offload to GPUs:

- Don't forget the `teams` keyword.
- Use as much parallelism as possible: an NVIDIA A100 executes up to 221184 "threads", an AMD MI250X up to 281600 "threads". A nice slide on memory bus saturation: about 55000 threads are required to reach 90% of the memory bandwidth on the A100 (similar on AMD), which means that about 25% of the maximum thread count is needed to saturate the memory subsystem.
- Use the `collapse()` clause to get more parallelism.
- Avoid small kernels.
- Try to use combined directives (`#pragma omp target teams distribute parallel for`), or minimize the separation between the `teams` and `parallel` directives.
- Use `map()` clauses to minimize data movement. A couple of examples show how to use `CRAY_ACC_DEBUG=2` / `NVCOMPILER_ACC_NOTIFY=3` for debugging data transfers.
- Don't copy scalars; pass them as `firstprivate(scalar)` so that they become kernel arguments.
- Avoid Fortran array operations and array sections `(:)` in target regions.
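
Several of the practices listed above fit into a single kernel. The following C sketch is our own illustration rather than code from the slides; the function `scale_matrix` and its flattened row-major layout are assumptions. It uses the combined directive with `teams`, `collapse(2)` for more parallelism, one-directional `map()` clauses, and `firstprivate` for the scalar.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: out[i][j] = alpha * in[i][j] for two
   rows x cols matrices stored as flat row-major arrays. */
void scale_matrix(const double *in, double *out,
                  int rows, int cols, double alpha)
{
    assert(in != NULL && out != NULL && rows >= 0 && cols >= 0);

    /* - combined directive, including "teams"
       - collapse(2) exposes rows*cols iterations instead of only rows
       - map(to)/map(from) move each array in one direction only
       - firstprivate(alpha) passes the scalar by value, with no mapping */
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: in[0:rows*cols]) map(from: out[0:rows*cols]) \
            firstprivate(alpha)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = alpha * in[i * cols + j];
}
```

For a realistic problem size, `rows * cols` should be well above the roughly 55000 threads quoted above, otherwise the kernel cannot saturate the memory subsystem.
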

### b.10 [Thinking OpenMP with NVIDIA compilers by NVIDIA](https://www.nas.nasa.gov/assets/nas/pdf/ams/2021/AMS_20210504_Ozen.pdf)

> Advanced Modeling & Simulation (AMS) Seminar Series; NASA Ames Research
> Center, May 2021; Presentation slides, 43 slides; by Dr. Güray Özen

local: *AMS_20210504_Ozen.pdf*

- The simplest compiler options for enabling OpenMP on GPUs.
- Basic advice on porting to OpenMP on GPUs: how to get more parallelism, remove allocations and I/O, and use the compiler output for analysis.
- Basics of the `omp target` construct; mentions the `num_teams(X)` and `thread_limit(Y)` clauses.
- Comparison of the combined directive `#pragma omp target teams distribute parallel for collapse(2)` with a pair of directives: `#pragma omp target teams distribute` on the outer loop and `#pragma omp parallel for` before the inner loop.
- Recommendation to use `omp target teams loop` (plus the `num_teams` and `thread_limit` clauses).
- Recommendation to use `omp target parallel loop` (without teams) for light kernels.
- Using the `bind()` clause with `loop`.
- Difference between the `omp distribute parallel for` and `omp loop` concepts (for example, `omp master` and `omp barrier` are not allowed in `omp loop`).
- Brief example of calling a subroutine with some additional parallelism.
- Basics of data transfer directives and managed memory.
- Small example of `enter data`/`exit data`.
- A note on interoperability (OpenACC/OpenMP/...).
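
The comparison between the combined and the split directives, and the `enter data`/`exit data` pattern, can be sketched in C as follows. This is a minimal illustration written for this page, not code from the slides; all three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Variant A: one combined directive, both loops collapsed. */
void add_one_combined(double *m, int rows, int cols)
{
    assert(m != NULL && rows >= 0 && cols >= 0);
    #pragma omp target teams distribute parallel for collapse(2) \
            map(tofrom: m[0:rows*cols])
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            m[i * cols + j] += 1.0;
}

/* Variant B: teams share the outer loop, threads share the inner loop. */
void add_one_split(double *m, int rows, int cols)
{
    assert(m != NULL && rows >= 0 && cols >= 0);
    #pragma omp target teams distribute map(tofrom: m[0:rows*cols])
    for (int i = 0; i < rows; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < cols; ++j)
            m[i * cols + j] += 1.0;
    }
}

/* enter data / exit data: keep m on the device across several kernels. */
void add_n_resident(double *m, int rows, int cols, int n_steps)
{
    assert(m != NULL && rows >= 0 && cols >= 0 && n_steps >= 0);
    #pragma omp target enter data map(to: m[0:rows*cols])
    for (int s = 0; s < n_steps; ++s)
        add_one_combined(m, rows, cols);  /* m is already present: no copy */
    #pragma omp target exit data map(from: m[0:rows*cols])
}
```

Both variants compute the same result; which one is faster depends on the loop sizes and the compiler, which is why the compiler output is worth checking for each.
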