|
|
|
Many factors can limit the GPU performance we achieve.
|
|
|
|
Several debugging tools can help overcome these issues.
|
|
|
|
First, a general overview of different GPU performance bottlenecks is given.
|
|
|
|
Then the tools available to combat these bottlenecks are discussed.
|
|
|
|
|
|
|
|
## Potential problems
|
|
|
|
There are 6 general problems that can limit the performance of GPUs.
|
|
|
|
There are also some OpenMP-specific cases where the compiler introduces implicit behavior that should be avoided.
|
|
|
|
Both will be described below.
|
|
|
|
|
|
|
|
### The 6 performance bottlenecks
|
|
|
|
The bottlenecks are listed in order of severity, where the likelihood of occurrence also plays a role.
|
|
|
|
To indicate locality of the problems, we use:
|
|
|
|
- (*) Global issue (not a per-kernel issue)
|
|
|
|
- (**) May have global impact
|
|
|
|
|
|
|
|
**1. Host-device memory transfers (\*)**
|
|
|
|
Memory transfers are *very* slow.
|
|
|
|
When debugging this, think about:
|
|
|
|
- Redundant memory transfers
|
|
|
|
- Transfers in many small portions (per-transfer memcpy latency)
|
|
|
|
- Pinned/non-pinned memory (*)
|
|
|
|
- Unified memory
|
|
|
|
|
|
|
|
**2. Global memory traffic (device DRAM) (\*\*)**
|
|
|
|
This becomes an issue when threads have to load data from different regions of global memory.
|
|
|
|
When debugging this, think about:
|
|
|
|
- Coalesced memory reads or strided/random access
|
|
|
|
- Memory-bound vs. compute-bound kernels
|
|
|
|
|
|
|
|
**3. Device load and occupation (\*\*)**
|
|
|
|
Some kernels are too small to fully use all the available SMs.
|
|
|
|
But it can also be that the algorithm is not suited for GPUs, for example in numerical analysis the effect of [explicit vs implicit methods](https://en.wikipedia.org/wiki/Explicit_and_implicit_methods).
|
|
|
|
When debugging this, think about:
|
|
|
|
- How many threads we have and what the maximum is for the device
|
|
|
|
- The occupancy problem (register pressure, shared memory)
|
|
|
|
|
|
|
|
**4. Branching**
|
|
|
|
Branching on GPUs is very costly and should be avoided as much as possible: when threads of the same warp take different paths, both paths are executed serially (warp divergence).
|
|
|
|
When debugging this, take a closer look at if-else branches in the kernel.
|
|
|
|
|
|
|
|
**5. Shared memory bank conflicts**
|
|
|
|
The shared (local) memory is divided into memory banks.
|
|
|
|
Each bank can serve only one address at a time, so if threads of a half warp try to load/store different addresses in the same bank, the accesses have to be serialized (this is a bank conflict).
|
|
|
|
So if each thread in a half warp accesses successive 32-bit values there are no bank conflicts.
|
|
|
|
|
|
|
|
An exception to this rule (every thread must access its own bank) is broadcasting: if all threads access the same address, the value is only read once and broadcasted to all threads.
|
|
|
|
|
|
|
|
**6. Impact of atomic operations**
|
|
|
|
Atomic operations can cause contention on the related variables, significantly reducing the overall throughput.
|
|
|
|
|
|
|
|
### OpenMP specific problems
|
|
|
|
There is also some implicit behavior by OpenMP which can penalize performance.
|
|
|
|
The list below contains the cases we know of, but there might be more:
|
|
|
|
|
|
|
|
**Array copying**
|
|
|
|
If an array is copied (`a=b`, `a(:,:,:) = b(:,:,:)`, etc.) within an `!$omp target parallel do`, all threads will perform this copy.
|
|
|
|
|
|
|
|
## Debug tools
|
|
|
|
Many things discussed here are an extension to the previous section [1.d. Profiling](1.d-Profiling).
|
|
|
|
Please refer to that page for specifics on how to use the tools.
|
|
|
|
|
|
|
|
Debugging is possible at compile time or at runtime.
|
|
|
|
The options are compiler and hardware vendor specific, so that distinction will be made first.
|
|
|
|
Their subsections then discuss the different compile-time and runtime diagnostics.
|
|
|