Okke van Eck · 9a8046c0
--- a/1.d.-Profiling-&-Debugging.md
+++ b/1.d.-Profiling-&-Debugging.md
+There are many ways to profile and debug GPU code.
+We can use profilers to make traces, highlighting memory copies and kernel executions.
+Another option is to set compiler flags or environment variables, which provide us with compile-time or runtime information, respectively.
+Each tool has its own benefits and downsides, so it is best to know all of them.
+Moreover, each hardware vendor has its own tools that need to be used.
+The sections below are grouped by tool type, and each section describes the available tools per hardware vendor.
+## Profiler
+For AMD hardware we use `rocprof` and for NVIDIA hardware we use `nsys`.
+### Rocprof
+Rocprof is a CLI profiler, which allows for collecting hardware counters and making traces. 
+Traces are made with the `--hip-trace`, `--hsa-trace`, and/or `--sys-trace` flags. An example of generating a trace and collecting statistics is:
+```bash
+rocprof --hip-trace -o results.csv $BINARY
+```
+Making a trace generates several result files, including `results.json`. This file contains the timelines, which can be visualized with `Perfetto` at [ui.perfetto.dev](https://ui.perfetto.dev).
+You can get extra statistics on the execution of the kernels by specifying the `--stats` flag. This creates a `results.stats.csv` file that lists, for each action, the number of calls, the total duration (ns), the average time (ns), and the percentage of the full execution time spent in this action.
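+The trace and statistics flags can be combined in a single run. The following is a minimal sketch that assumes `$BINARY` points to your executable (the `./my_app` fallback is a hypothetical placeholder) and that skips profiling when `rocprof` is not in the `PATH`:
+```bash
+# Hypothetical fallback; set BINARY to your own executable.
+BINARY="${BINARY:-./my_app}"
+if command -v rocprof >/dev/null 2>&1; then
+    # Trace HIP calls and also write the results.stats.csv summary.
+    rocprof --hip-trace --stats -o results.csv "$BINARY"
+else
+    echo "rocprof not found in PATH, skipping profiling" >&2
+fi
+```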
+It is also possible to track hardware counters, which are specified in an input file passed with the `-i` option:
+```bash
+rocprof -i rocprof_counters.txt -o results.csv $BINARY
+```
+where `rocprof_counters.txt` contains the different perf counter groups. Here is an example:
+```yaml
+pmc: GPUBusy Wavefronts VALUInsts VFetchInsts VWriteInsts VALUUtilization VALUBusy WriteSize
+pmc: L2CacheHit MemUnitBusy MemUnitStalled LDSBankConflict
+range: 0:1
+gpu: 0
+kernel: wrapper$wrapper_mod_$ck_L236_1
+```
+Collecting hardware counters will generate output files with the statistics for each specified counter in the file.
+There are [many hardware counters](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/rocprof.html#publicly-available-counters-and-metrics) that can be tracked. Some of the most interesting to us are:
+|  Counter | Related hazard |     Description    |
+|:--------:|:------:|:------------------:|
+| GPUBusy | Device load and occupation | The percentage of time the GPU was busy. |
+| Wavefronts | Device load and occupation | Total wavefronts. |
+| VALUInsts | Device load and occupation | The average number of vector ALU instructions executed per work-item (affected by flow control). |
+| VFetchInsts | Global memory traffic & Device load and occupation | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
+| VWriteInsts | Global memory traffic | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
+| VALUUtilization | Device load and occupation | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
+| VALUBusy | Device load and occupation | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
+| WriteSize | Global memory traffic | The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
+| L2CacheHit | Global memory traffic |  The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
+| MemUnitBusy | Global memory traffic | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
+| MemUnitStalled | Global memory traffic | The percentage of GPUTime the memory unit is stalled. Value range: 0% (optimal) to 100% (bad). |
+| WriteUnitStalled | Global memory traffic | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
+| LDSBankConflict | Shared memory bank conflicts | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
+There are also some hardware-specific counters that might be interesting. You can find the ones for MI200 chips [here](https://rocm.docs.amd.com/en/docs-5.7.0/understand/gpu_arch/mi200_performance_counters.html).
+### Nsys
+Nsys (or NVIDIA Nsight Systems) is a performance analysis tool with many functions. It contains a profiler, analyzer, and a visualizer. We mainly use the profiler and visualizer.
+Traces can be made through the profiler, which is called through the CLI with `nsys profile -t cuda`. We add the `--stats=true` flag to generate summary statistics. So the full command would be:
+```bash
+nsys profile -t cuda --stats=true -o results $BINARY
+```
+This generates the `results.nsys-rep` and `results.sqlite` files with results of the trace. The `results.nsys-rep` file can be opened with the nsys visualizer to view the trace. We recommend running the visualizer using the GUI. Simply launch `NVIDIA Nsight Systems` and open the `results.nsys-rep` file. 
+More information on the different profiler options can be found [here](https://docs.nvidia.com/nsight-systems/2021.4/UserGuide/#cli-profile-command-switch-options).
+## Compiler flags & Environment variables
+Debugging is possible at compile time or at runtime.
+The options are compiler and hardware vendor specific, so that distinction will be made first.
+Then in the subsections, the different compiler and runtime options will be discussed.
+The hardware-independent environment variables are in a separate subsection.
+### AMD
+For the `ftn` compiler there are a couple of compiler flags that give extra information on the internal procedures:  
+|   Flag   | Possible Values | Debug Value |     Description    |
+|:--------:|:---------------:|:-----------:|:------------------:|
+| -G0 | N/A | N/A | Compile in debug mode. |
+| -O[N] | <ul><li>`-O0`: Disable optimization, including floating point optimizations. Low compile time, small compile size, no global scalar optimization. Vectorize most array syntax statements, but disable all other vectorizations. Implies -h fp0. Some informational messages may not be issued. </li><li>`-O1`: Conservative optimization, moderate compile time and size, global scalar optimizations, and loop nest restructuring. Results may differ from the results obtained when -O0 is specified because of operator reassociation. No optimizations will be performed that might create false exceptions. Only array syntax statements and inner loops are vectorized and the system does not perform some vector reductions. User tasking is enabled, so OpenMP directives are recognized.</li><li>`-O2`: Moderate optimization, moderate compile time and size, global scalar optimizations, pattern matching, and loop nest restructuring. Results may differ from results obtained when -O1 is specified because of vector reductions. The -O2 option enables automatic vectorization of array syntax and entire loop nests. This is the default level of optimization. (default)</li><li>`-O3`: Aggressive optimization, potentially larger compile time and size, global scalar optimizations, possible loop nest restructuring, and pattern matching. The optimizations performed might create false exceptions in rare instances. Results may differ from results obtained when -O1 is specified because of vector reductions.</li></ul> | -O0 | Optimization level |
+| -eo | N/A | N/A | Lists compiler optimizations currently enforced. |
+| -hmsgs | N/A | N/A | Writes compiler optimization messages to stderr. |
+| -hnegmsgs | N/A | N/A | Writes messages to stderr indicating why optimizations (e.g. vectorization, inlining, cloning) did not occur in a given instance.|
+| -m[N] | <ul><li>`1`: Comment</li><li>`2`: Note</li><li>`3`: Warning (default)</li><li>`4`: Error</li></ul> | -m3 | Specifies the lowest level of severity of messages to be issued. Messages at the specified level and above are issued. |
+| -Mmsg[N] | <ul><li>`1`: Comment</li><li>`2`: Note</li><li>`3`: Warning (default)</li><li>`4`: Error</li></ul> | -Mmsg3 | Suppress specific messages at the given level(s), where the list `N` is the numbers disabled. |
+| -hlist[=options] | <ul><li>`a`: vector atomic memory operation</li><li>`b`: blocked</li><li>`f`: fused</li><li>`i`: interchanged</li><li>`m`: streamed but not partitioned</li><li>`p`: conditional, partial and/or computed</li><li>`r`: unrolled</li><li>`s`: short loop</li><li>`t`: array syntax temp used</li><li>`w`: unwound</li></ul> | -hlist=amid | Output compiler listing files with performed (and not performed!) optimizations. |
+Compilation per project for debugging:  
+**nemo-build:** ```ftn -O1 -hipa3 -lm -fopenmp -hnoacc```
+---
+There are also environment variables that can be set for extra information during runtime:  
+|   Environment Variable | Possible Values | Debug Value |     Description    |
+|:----------------------:|:---------------:|:-----------:|:------------------:|
+| CRAY_ACC_DEBUG | <ul><li>`1`: Concise info on offloading regions</li><li>`2`: More in-depth runtime info, but user friendly.</li><li>`3`: Very verbose (i.e. memory addresses), not designed for everyday user.</li></ul> | 2 | Emits messages for all offloading operations. |
+| AMD_SERIALIZE_KERNEL | <ul><li>`1`: Synchronize before launching (i.e. make sure everything is done on GPU)</li><li>`2`: Synchronize after launching (i.e. wait for kernel to finish before moving on)</li><li>`3`: Do both 1 and 2</li></ul> | 3 | Serializes kernel launches, good for checking race conditions. |
+| AMD_SERIALIZE_COPY | <ul><li>`1`: Synchronize before copies (i.e. make sure everything is done on GPU)</li><li>`2`: Synchronize after copies (i.e. wait for the copy to finish before moving on)</li><li>`3`: Do both 1 and 2</li></ul> | 3 | Serializes data copies, good for checking race conditions. |
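+
+As a sketch of how the debug values from the table can be applied, the variables are exported in the shell (or job script) before launching the application. Only the variable names and values come from the table; where the messages end up depends on your setup.
+```bash
+# Debug values from the table above; unset them again for production runs.
+export CRAY_ACC_DEBUG=2        # user-friendly runtime info on offloading regions
+export AMD_SERIALIZE_KERNEL=3  # synchronize before and after every kernel launch
+export AMD_SERIALIZE_COPY=3    # synchronize before and after every copy
+# Launch the application as usual afterwards.
+```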
+### NVIDIA
+For the `nvfortran` compiler there are a couple of compiler flags that give extra information on the internal procedures:  
+|   Flag   | Possible Values | Debug Value |     Description    |
+|:--------:|:---------------:|:-----------:|:------------------:|
+| -g | N/A | N/A | Compile in CPU debug mode. |
+| -G | N/A | N/A | Compile in GPU debug mode (expensive). |
+| -O[N] | <ul><li>`-O`: All -O1 optimizations plus traditional global scalar optimizations performed. (default)</li><li>`-O0`: Creates a basic block for each statement. No scheduling or global optimizations performed.</li><li>`-O1`: Some scheduling and register allocation is enabled. No global optimizations performed.</li><li>`-O2`: All -O optimizations plus SIMD code generation (implies `-Mvect=simd`)</li><li>`-O3`: All -O2 optimizations plus more aggressive code hoisting and scalar replacement, that may or may not be profitable, performed (implies `-Mvect=simd`, `-Mflushz`, `-Mcache_align`)</li><li>`-O4`: All -O3 optimizations plus more aggressive hoisting of guarded expressions performed (implies `-Mvect=simd`)</li></ul> | -O0 | Optimization level. |
+| -C | N/A | N/A | Generates code to check array bounds. |
+| -Wall | N/A | N/A | Turns on 'all' warnings. |
+| -Wextra | N/A | N/A | Turns on 'extra' warnings. |
+| -Minfo[=options] | <ul><li>`all`: -Minfo=accel,inline,ipa,loop,lre,mp,opt,par,vect,stdpar</li><li>`accel`: Enable Accelerator information.</li><li>`ftn`: Enable Fortran-specific information.</li><li>`inline`: Enable inliner information.</li><li>`intensity`: Enable compute intensity information.</li><li>`ipa`: Enable IPA (Interprocedural Analysis) information.</li><li>`loop`: Enable loop optimization information.</li><li>`lre`: Enable LRE (Loop-carried Redundancy Elimination) information.</li><li>`mp`: Enable OpenMP information.</li><li>`opt`: Enable optimizer information.</li><li>`par`: Enable parallelizer information.</li><li>`pcast`: Enable PCAST (Parallel Compiler Assisted Software Testing) information.</li><li>`pfo`: Enable profile feedback information.</li><li>`stat`: Same as -Minfo=time.</li><li>`time`: Display time spent in compiler phases.</li><li>`vect`: Enable vectorizer information.</li><li>`stdpar`: Enable stdpar (set with -stdpar flag) information.</li></ul> | -Minfo=accel,ftn,inline,intensity,loop,mp,opt,par,pcast,pfo,vect | Display compile-time optimization listings. |
+| -Mneginfo[=options] | <ul><li>`all`: -Mneginfo=accel,inline,ipa,loop,lre,mp,opt,par,vect,stdpar</li><li>`accel`: Enable Accelerator information.</li><li>`ftn`: Enable Fortran-specific information.</li><li>`inline`: Enable inliner information.</li><li>`ipa`: Enable IPA (Interprocedural Analysis) information.</li><li>`loop`: Enable loop optimization information.</li><li>`lre`: Enable LRE (Loop-carried Redundancy Elimination) information.</li><li>`mp`: Enable OpenMP information.</li><li>`opt`: Enable optimizer information.</li><li>`pfo`: Enable profile feedback information.</li><li>`vect`: Enable vectorizer information.</li><li>`stdpar`: Enable stdpar (set with -stdpar flag) information.</li></ul> | -Mneginfo=accel,ftn,inline,loop,mp,opt,par,pfo,vect | Display inhibited compile-time optimization listings. |
+| -Mlist | N/A | N/A | Create compiler listing files. | 
+For the optimization flag `-O[N]`, it is possible to control the automatic vectorization as well when using `-O2` or higher. These options can be set with the `-M[no]vect=[options]` flag. These options are listed in the [1.b. Compilers](1.b.-Compilers) section.
+Then it is also possible to control some other behavior when using `-O3`:
+| Flag | Description | 
+|:----:|:-----------:|
+| -M[no]flushz | Set SSE to flush-to-zero mode. |
+| -Mcache_align | Align large objects on cache-line boundaries. |
+| -Mrecip-div | Rewrite x/y => x*1/y if profitable. |
+| -Mfactorize | Enable factorization. |
+---
+There are also environment variables that can be set for extra information during runtime:  
+|   Environment Variable | Possible Values | Debug Value |     Description    |
+|:----------------------:|:---------------:|:-----------:|:------------------:|
+| NCPUS | [N] | 1 | Sets the number of processes or threads used in parallel regions. *NOTE: Same as OMP_NUM_THREADS, kept for historical reasons.* |
+| NVCOMPILER_ACC_NOTIFY | <ul><li>`1`: Kernel launches only</li><li>`2`: Data transfers only</li><li>`3`: Kernel launches and data transfers</li><li>`4`: Region entry/exits only</li><li>`5`: Region entry/exits and kernel launches</li><li>`8`: Wait operations, synchronizations</li><li>`16`: (De)allocation of device memory</li></ul> | 3 | Print information for GPU-related events. |
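+
+The listed values suggest that `NVCOMPILER_ACC_NOTIFY` is a bitmask: `3` is `1 | 2` and `5` is `4 | 1`. Assuming that holds for the other bits as well, the value can be composed in the shell instead of adding the numbers by hand:
+```bash
+# Kernel launches (1) and data transfers (2): the debug value from the table.
+export NVCOMPILER_ACC_NOTIFY=$((1 | 2))
+echo "NVCOMPILER_ACC_NOTIFY=$NVCOMPILER_ACC_NOTIFY"  # prints NVCOMPILER_ACC_NOTIFY=3
+```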
+### Hardware independent
+There are also hardware-independent environment variables that give us runtime information.
+|   Environment Variable | Possible Values | Debug Value |     Description    |
+|:----------------------:|:---------------:|:-----------:|:------------------:|
+| LIBOMPTARGET_INFO | <ul><li>`0x01`: Print all data arguments upon entering an OpenMP device kernel</li><li>`0x02`: Indicate when a mapped address already exists in the device mapping table</li><li>`0x04`: Dump the contents of the device pointer map at kernel exit</li><li>`0x08`: Indicate when an entry is changed in the device mapping table</li><li>`0x10`: Print OpenMP kernel information from device plugins</li><li>`0x20`: Indicate when data is copied to and from the device</li><li>`-1`: Get all of the above</li></ul> | $((0x1 \| 0x10 \| 0x20)) | Controls what is printed during runtime, including data-mappings and kernel executions. Extra information can be found [here](https://openmp.llvm.org/design/Runtimes.html#libomptarget-info). Note that multiple values can be set through a bitwise operation. |
+| LIBOMPTARGET_DEBUG | `1`: print LLVM debug information. | 1 | Toggles printing of LLVM debug information. Extra information can be found [here](https://openmp.llvm.org/design/Runtimes.html#libomptarget-debug). Requires a `libomptarget` runtime that was built with `-DOMPTARGET_DEBUG`. |
+| MPICH_ENV_DISPLAY | `1`: logs exactly how MPI was configured. | 1 | Toggles exact logging of how MPI was configured. |
+| MPICH_VERSION_DISPLAY | `1`: prints MPI version on startup | 1 | Toggles printing of MPI version on startup. |
+| OMP_NUM_THREADS | [N] | 1 | Sets number of threads to use in parallel region. |
+| OMP_DISPLAY_ENV | <ul><li>`TRUE`: Display OpenMP version number and initial ICV values for the environment variables.</li><li>`FALSE`: Do not display anything.</li><li>`VERBOSE`: Also display the values of runtime variables that may be modified by vendor-specific environment variables.</li></ul> | VERBOSE | Display information on OpenMP and environment variables during runtime. |
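+
+As a small illustration of the bitwise combination for `LIBOMPTARGET_INFO`, the debug value from the table can be computed with shell arithmetic. The backslashes in the table entry only escape the pipe inside the Markdown table; in the shell a plain pipe is used:
+```bash
+# 0x01: data arguments, 0x10: kernel info from plugins, 0x20: data copies.
+export LIBOMPTARGET_INFO=$((0x1 | 0x10 | 0x20))
+export OMP_DISPLAY_ENV=VERBOSE
+echo "LIBOMPTARGET_INFO=$LIBOMPTARGET_INFO"  # prints LIBOMPTARGET_INFO=49
+```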
+**DO NOT USE IN PRODUCTION**  
+There are also some environment variables that should not be used in production, but might be helpful when debugging nasty bugs.
+|   Environment Variable | Possible Values | Debug Value |     Description    |
+|:----------------------:|:---------------:|:-----------:|:------------------:|
+| MPICH_SINGLE_HOST_ENABLED | `0`: force traffic through the NIC when you only have one node. | *unset* | Forces traffic through the NIC when you only have one node. |
+| MPICH_SMP_SINGLE_COPY_MODE | `NONE`: disables all intra-node optimizations and forces all in-node communication through slow inter-process communicator paths on CPU memory. | *unset* | No optimizations and all communication goes over inter-process communicator on CPU memory. |
+| MPICH_GPU_IPC_ENABLED | `0`: disables SMP operations and disables IPC. | *unset* | Disables SMP operations and disables IPC. |
+## Sources
+The sources used for this wiki page:
+- [Rocprof documentation](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/rocprof.html)
+- [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems#)
+- [NVIDIA Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/2021.4/UserGuide/#cli-profile-command-switch-options)
+- [NVIDIA HPC SDK Documentation](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#opt-gs)
+- [FTN compiler options from KAUST](https://www.hpc.kaust.edu.sa/sites/default/files/files/public/1.03a-AdditionalInformation_CrayCompilationEnvironment.pdf)
+- [FTN compiler options ECMWF Wiki](https://confluence.ecmwf.int/download/attachments/46600240)
+- [OpenMP Environment Variables](https://www.openmp.org/spec-html/5.0/openmpch6.html)
+- [Tools and Methods for ACC debugging](https://juser.fz-juelich.de/record/902543/files/3-Debugging--TH.pdf)
\ No newline at end of file