... | ... | @@ -188,62 +188,74 @@ This section covers all performance patterns for the GPU side of applications. |
|
|
|
|
|
---
|
|
|
|
|
|
#### Thread load-imbalance
|
|
|
#### Host-device memory operations
|
|
|
**Issue**
|
|
|
|
|
|
Slowdown due to memory transfers.
|
|
|
**Performance Behavior**
|
|
|
|
|
|
Many small copies from host to device.
|
|
|
Low bandwidth of memory transfers.
|
|
|
**Performance counters**
|
|
|
|
|
|
MemUnitBusy.
|
|
|
**Fix**
|
|
|
Fuse memory transfer operations as much as possible.
|
|
|
Remove redundant transfers.
|
|
|
Use pinned (page-locked) host memory instead of non-pinned memory (on some systems, transfers from non-pinned memory are several times slower than from pinned memory).
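
To see why fusing transfers helps, a minimal cost model can be useful. The sketch below assumes that each copy pays a fixed per-transfer latency plus a bandwidth term; the function name `transfer_time` and the default latency and bandwidth values are hypothetical placeholders, not measurements from any particular system.

```python
def transfer_time(num_copies: int, total_bytes: int,
                  latency_s: float = 10e-6,
                  bandwidth_bytes_per_s: float = 25e9) -> float:
    """Estimate the time to move total_bytes split evenly over num_copies transfers.

    Assumed model: every transfer pays a fixed latency, and the payload
    moves at a constant bandwidth. Both defaults are illustrative only.
    """
    if num_copies < 1:
        raise ValueError("num_copies must be at least 1")
    return num_copies * latency_s + total_bytes / bandwidth_bytes_per_s


total = 64 * 1024 * 1024  # 64 MiB payload in both scenarios

many_small = transfer_time(num_copies=16384, total_bytes=total)  # 4 KiB per copy
one_fused = transfer_time(num_copies=1, total_bytes=total)

print(f"16384 small copies: {many_small * 1e3:.2f} ms")
print(f"1 fused copy:       {one_fused * 1e3:.2f} ms")
```

Under these assumptions the latency term dominates the many-small-copies case, which is the behavior described above. Measured latency and bandwidth for the target system should replace the placeholder defaults.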
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Thread divergence
|
|
|
#### Device load and occupation
|
|
|
**Issue**
|
|
|
|
|
|
The accelerator is underloaded, resulting in poor memory bandwidth and arithmetic throughput.
|
|
|
**Performance Behavior**
|
|
|
|
|
|
The number of wavefronts is lower than the theoretical maximum.
|
|
|
**Performance counters**
|
|
|
|
|
|
Compare the number of wavefronts to the theoretical peak (Wavefronts).
|
|
|
See the arithmetic unit load naturally increase when the device load increases (VALUBusy).
|
|
|
If shared memory is not used, check whether the number of registers is a limiter (spi_vwc_csc_wr & spi_swc_csc_wr).
|
|
|
**Fix**
|
|
|
Increase workload for the device.
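
The register check mentioned above can be reasoned about with a simple occupancy estimate. This is a sketch under stated assumptions: the names `waves_per_simd` and `occupancy`, the register budget of 512 per lane and the limit of 8 wavefronts per SIMD are hypothetical placeholders, and register allocation granularity is ignored. The real limits depend on the GPU architecture and should be taken from the vendor documentation.

```python
def waves_per_simd(vgprs_per_thread: int,
                   vgpr_budget: int = 512,
                   max_waves: int = 8) -> int:
    """Wavefronts that fit on one SIMD when vector registers are the only limit.

    vgpr_budget and max_waves are assumed, architecture-dependent values.
    """
    if vgprs_per_thread < 1:
        raise ValueError("vgprs_per_thread must be at least 1")
    return min(max_waves, vgpr_budget // vgprs_per_thread)


def occupancy(vgprs_per_thread: int,
              vgpr_budget: int = 512,
              max_waves: int = 8) -> float:
    """Fraction of the assumed theoretical wavefront maximum that is reachable."""
    return waves_per_simd(vgprs_per_thread, vgpr_budget, max_waves) / max_waves


for regs in (32, 64, 128, 256):
    print(f"{regs:3d} VGPRs per thread -> occupancy {occupancy(regs):.2f}")
```

If the estimate is well below 1.0, register pressure is a plausible limiter and increasing the workload alone may not raise the wavefront count.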
|
|
|
|
|
|
---
|
|
|
|
|
|
#### High branching
|
|
|
#### Global memory traffic
|
|
|
**Issue**
|
|
|
|
|
|
The memory bandwidth is not fully utilized.
|
|
|
**Performance Behavior**
|
|
|
|
|
|
The average bandwidth of memory transfer operations drops by a factor of several.
|
|
|
**Performance counters**
|
|
|
|
|
|
A low arithmetic intensity (AI) indicates a memory-bound application (VALUBusy).
|
|
|
Saturation of memory bus (MemUnitStalled).
|
|
|
**Fix**
|
|
|
Coalesce memory access patterns.
|
|
|
Increase the workload or decrease the size of the offloaded code.
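
The effect of coalescing can be illustrated by counting how many distinct memory lines one wavefront touches for a given access stride. The sketch below is a simplified model: the function name `lines_touched`, the 64-thread wavefront, the 4-byte element size and the 64-byte line size are assumptions chosen for illustration.

```python
def lines_touched(stride_elems: int,
                  threads: int = 64,
                  elem_bytes: int = 4,
                  line_bytes: int = 64) -> int:
    """Count the distinct memory lines accessed when thread t reads element t * stride.

    All sizes are assumed values; real line sizes and wavefront widths vary by device.
    """
    if stride_elems < 1:
        raise ValueError("stride_elems must be at least 1")
    lines = {(t * stride_elems * elem_bytes) // line_bytes for t in range(threads)}
    return len(lines)


for stride in (1, 2, 16):
    print(f"stride {stride:2d}: {lines_touched(stride)} lines per wavefront")
```

With unit stride, neighbouring threads share lines and few memory requests are needed; with a large stride, every thread pulls in its own line and most of the transferred bytes are unused.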
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Non-cached memory access
|
|
|
#### Accelerator kernels granularity
|
|
|
**Issue**
|
|
|
|
|
|
The overhead of launching kernels outweighs the benefit of offloading them.
|
|
|
**Performance Behavior**
|
|
|
|
|
|
Low execution times of individual kernels, but many are launched.
|
|
|
**Performance counters**
|
|
|
|
|
|
?
|
|
|
**Fix**
|
|
|
Fuse kernels together. Sometimes this can be done automatically; sometimes it requires manual work.
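
A rough way to judge whether kernel granularity is a problem is to compare the launch overhead with the kernel run time. The following sketch assumes a fixed per-launch overhead; the names `overhead_fraction` and `total_time` and the 5 microsecond default are hypothetical, and the real overhead should be measured with a profiler.

```python
def overhead_fraction(kernel_time_s: float,
                      launch_overhead_s: float = 5e-6) -> float:
    """Share of wall time spent on launching rather than computing (assumed model)."""
    return launch_overhead_s / (launch_overhead_s + kernel_time_s)


def total_time(num_launches: int, total_work_s: float,
               launch_overhead_s: float = 5e-6) -> float:
    """Time for a fixed amount of work split over num_launches kernel launches."""
    return num_launches * launch_overhead_s + total_work_s


work = 2e-3  # 2 ms of compute in total, an illustrative value

print(f"overhead share for a 2 us kernel: {overhead_fraction(2e-6):.0%}")
print(f"1000 launches: {total_time(1000, work) * 1e3:.2f} ms")
print(f"1 fused launch: {total_time(1, work) * 1e3:.2f} ms")
```

In this model fusing only removes launch overhead; in practice fused kernels may also save global memory traffic by keeping intermediate results in registers.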
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Bandwidth saturation/limitation
|
|
|
#### Divergent branching overhead
|
|
|
**Issue**
|
|
|
|
|
|
Excessive branching reduces the utilization of the ALUs.
|
|
|
**Performance Behavior**
|
|
|
|
|
|
There is low utilization of the ALUs.
|
|
|
**Performance counters**
|
|
|
|
|
|
VALUUtilization.
|
|
|
**Fix**
|
|
|
Remove divergent branches from the code where possible.
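
The link between divergent branches and low ALU utilization can be sketched with a lane-counting model. It assumes that a wavefront whose lanes disagree on a branch executes both paths one after the other with the inactive lanes masked off; the function name `valu_utilization`, the 64-lane width and the cycle costs are illustrative assumptions.

```python
def valu_utilization(lanes_taking_if: int,
                     cost_if: int,
                     cost_else: int,
                     lanes: int = 64) -> float:
    """Fraction of lane-cycles doing useful work for one if/else in one wavefront.

    Assumed model: if the lanes diverge, both paths run serially with
    inactive lanes masked; if all lanes agree, only one path runs.
    """
    if not 0 <= lanes_taking_if <= lanes:
        raise ValueError("lanes_taking_if must be between 0 and lanes")
    lanes_taking_else = lanes - lanes_taking_if
    # Cycles the wavefront spends: only the paths that at least one lane takes.
    cycles = (cost_if if lanes_taking_if else 0) + (cost_else if lanes_taking_else else 0)
    useful = lanes_taking_if * cost_if + lanes_taking_else * cost_else
    return useful / (lanes * cycles)


for k in (64, 32, 1):
    print(f"{k:2d} lanes take the if-path -> utilization {valu_utilization(k, 10, 10):.2f}")
```

Grouping work so that the lanes of a wavefront take the same path pushes the estimate back toward 1.0, which should also show up in the VALUUtilization counter.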
|
|
|
|
|
|
---
|
|
|
|
|
|
#### SM load imbalance
|
|
|
#### Shared memory bank conflicts
|
|
|
**Issue**
|
|
|
|
|
|
**Performance Behavior**
|
... | ... | @@ -254,7 +266,7 @@ This section covers all performance patterns for the GPU side of applications. |
|
|
|
|
|
---
|
|
|
|
|
|
#### Insufficient workload
|
|
|
#### Impact of atomic operations
|
|
|
**Issue**
|
|
|
|
|
|
**Performance Behavior**
|
... | ... | @@ -265,15 +277,6 @@ This section covers all performance patterns for the GPU side of applications. |
|
|
|
|
|
---
|
|
|
|
|
|
#### Synchronization issues
|
|
|
**Issue**
|
|
|
|
|
|
**Performance Behavior**
|
|
|
|
|
|
**Performance counters**
|
|
|
|
|
|
**Fix**
|
|
|
|
|
|
|
|
|
# Subpages
|
|
|
- [Intel Advisor & Intel VTune](3.a.-Intel-Offload-Advisor-&-Intel-VTune) |
|
|
\ No newline at end of file |