|
|
|
Many factors can limit the GPU performance we achieve.
|
|
|
|
Several debugging tools can help overcome these issues.
|
|
|
|
First, a general overview of different GPU performance bottlenecks is given.
|
|
|
|
Then the tools available to combat these bottlenecks are discussed.
|
|
|
|
|
|
|
|
## Potential problems
|
|
|
|
There are 6 general problems that can limit the performance of GPUs.
|
|
|
|
There are also some OpenMP-specific cases where the compiler introduces implicit behavior that should be avoided.
|
|
|
|
Both will be described below.
|
|
|
|
|
|
|
|
### The 6 performance bottlenecks
|
|
|
|
The bottlenecks are listed in order of severity, where the likelihood of occurrence also plays a role.
|
|
|
|
To indicate locality of the problems, we use:
|
|
|
|
- (*) Global issue (not a per-kernel issue)
|
|
|
|
- (**) May have global impact
|
|
|
|
|
|
|
|
**1. Host-device memory transfers (\*)**
|
|
|
|
Memory transfers are *very* slow.
|
|
|
|
When debugging this, think about:
|
|
|
|
- Redundant memory transfers
|
|
|
|
- Transfers in many small portions (per-transfer memcpy latency)
|
|
|
|
- Pinned/non-pinned memory (*)
|
|
|
|
- Unified memory
|
|
|
|
|
|
|
|
**2. Global memory traffic (device DRAM) (\*\*)**
|
|
|
|
This becomes an issue when threads have to load data from different regions of global memory.
|
|
|
|
When debugging this, think about:
|
|
|
|
- Coalesced memory reads or strided/random access
|
|
|
|
- Memory-bound vs. compute-bound kernels
|
|
|
|
|
|
|
|
**3. Device load and occupation (\*\*)**
|
|
|
|
Some kernels are too small to fully use all the available SMs.
|
|
|
|
But it can also be that the algorithm is not suited for GPUs, for example in numerical analysis the effect of [explicit vs implicit methods](https://en.wikipedia.org/wiki/Explicit_and_implicit_methods).
|
|
|
|
When debugging this, think about:
|
|
|
|
- How many threads we have and what the maximum is for the device
|
|
|
|
- The occupancy problem (register pressure, shared memory)
|
|
|
|
|
|
|
|
**4. Branching**
|
|
|
|
Branching on GPUs is very costly and should be avoided as much as possible: when threads of the same warp take different paths, both paths are executed serially (warp divergence).
|
|
|
|
When debugging this, take a closer look at if-else branches in the kernel.
|
|
|
|
|
|
|
|
**5. Shared memory bank conflicts**
|
|
|
|
The shared (local) memory is divided into memory banks.
|
|
|
|
Each bank can serve only one address at a time, so if threads of a half warp try to load/store different addresses in the same bank, the accesses have to be serialized (this is a bank conflict).
|
|
|
|
So if each thread in a half warp accesses successive 32-bit values there are no bank conflicts.
|
|
|
|
|
|
|
|
An exception to this rule (every thread must access its own bank) is broadcasting: if all threads access the same address, the value is only read once and broadcasted to all threads.
|
|
|
|
|
|
|
|
**6. Impact of atomic operations**
|
|
|
|
Atomic operations can cause contention on the related variables, significantly reducing the overall throughput.
|
|
|
|
|
|
|
|
### OpenMP specific problems
|
|
|
|
There is also some implicit behavior by OpenMP which can penalize performance.
|
|
|
|
The list below contains the cases we know of, but there might be more:
|
|
|
|
|
|
|
|
**Array copying**
|
|
|
|
If an array is copied (`a=b`, `a(:,:,:) = b(:,:,:)`, etc.) within an `!$omp target parallel do`, all threads will perform this copy.
|
|
|
|
|
|
|
|
## Debug tools
|
|
|
|
Many things discussed here are an extension to the previous section [1.d. Profiling](1.d-Profiling).
|
|
|
|
Please refer to that page for specifics on how to use the tools.
|
|
|
|
|
|
|
|
Debugging is possible at compile time or at runtime.
|
|
|
|
The options are compiler and hardware vendor specific, so that distinction will be made first.
|
|
|
|
Their subsections then discuss the different compile-time and runtime diagnostics.
|
|
|