# Overview profile & analysis goals

**Whole application vs functions**

For DEODE we will profile whole applications, while for DE340 we will mostly profile individual functions through the mocking framework.
# GPU Profiling Methodology

**Code setups**

For the code, we currently have four different setups:

- Sequential CPU code
- Parallel CPU code
- GPU + sequential CPU code
- GPU + parallel CPU code

**Performance metrics**

There are many performance metrics to measure. Below is a list of hardware counters and derived metrics that can be collected.

On CPU:

- Execution time
- Instructions retired per core / CPU clock unhalted per core (for load balancing)
- CPI (cycles per instruction)
- AI (arithmetic intensity)
- Branch rate & branch misprediction rate/ratio
- Memory: number of loads vs. stores
- Memory bandwidth saturation
- Memory NUMA accesses
- False sharing
- Cache misses (L1, L2, L3)
On GPU:

- GPUBusy (percentage of time the GPU is busy)
- Wavefronts (total wavefronts)
- VALUInsts (average number of vector ALU instructions executed per work-item)
- VFetchInsts (average number of vector fetch instructions from video memory per work-item)
- VWriteInsts (average number of vector write instructions to video memory per work-item)
- VALUUtilization (percentage of active vector ALU threads in a wave)
- VALUBusy (percentage of GPUTime vector ALU instructions are processed)
- WriteSize (total KB written to video memory)
- L2CacheHit (percentage of hits in the L2 data cache)
- MemUnitBusy (percentage of GPUTime the memory unit is active)
- MemUnitStalled (percentage of GPUTime the memory unit is stalled)
- WriteUnitStalled (percentage of GPUTime the write unit is stalled)
- LDSBankConflict (percentage of GPUTime LDS is stalled by bank conflicts)
## Application optimization workflow

The workflow for optimizing code can be summarized as:

1. Understand requirements
2. Understand current performance
3. Can it be done? (modeling)
4. How can it be done? (some options)
5. Tuning
   - Not there yet? Back to 2!
6. Analyze the result
This workflow can be adapted to analyze our projects via:

1. Optimize for latency or throughput?
2. Create a cache-aware roofline model to understand current performance and the theoretical limit (see the formula sketch after this list)
3. Take action based on findings
   1. If memory bound, check: load imbalance, cache misses, memory loads & stores, NUMA accesses
   2. If compute bound, check: load imbalance, branch rate & misprediction, performance options (intrinsics, vectorization, pipelining, superscalar execution, out-of-order execution, branch prediction, speculative execution), or GPU porting
4. Analyze changes with a new roofline model, and (potentially) repeat!
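
As a reference for step 2, the roofline model bounds attainable performance by the peak compute rate and by memory bandwidth times arithmetic intensity. A minimal sketch of the standard formulation (the symbols below are the usual roofline quantities, not values taken from this document):

```latex
% Attainable performance P (Flop/s) of a kernel with arithmetic
% intensity I (Flop/byte), given peak compute P_max and bandwidth b_s
% of the memory level under consideration (DRAM, L3, ...):
P(I) = \min\bigl(P_{\max},\; I \cdot b_s\bigr)
% The kernel is memory bound if I < P_max / b_s (left of the ridge
% point), and compute bound otherwise.
```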
## Performance Patterns & Signatures

Performance patterns (or anti-patterns) are specific behaviors/problems that form bottlenecks for performance. The subsections below discuss performance patterns to look out for. CPU and GPU patterns are discussed separately.

### CPU performance patterns

This section covers performance patterns for the CPU side of applications.

---
#### Load Imbalance

**Issue**

The workload is not equally distributed: several units stall while waiting for one unit to complete.

**Performance Behavior**

Saturating speedup (sooner than expected).

**Performance counters**

Different counts of instructions retired or floating-point operations among cores (FLOPS_DP, FLOPS_SP).

**Fix**

Reorganize work to improve load balancing, e.g. as sketched below.
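
A minimal sketch of one common rebalancing fix, assuming an OpenMP loop whose per-iteration cost varies strongly (the function `process_column` and the sizes are placeholders): switching from static to dynamic scheduling lets idle threads pick up remaining iterations.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical per-item work with highly variable cost (placeholder).
static double process_column(int i) {
    double s = 0.0;
    for (int k = 0; k < (i % 64) * 1000; ++k) s += std::sin(k);
    return s;
}

int main() {
    std::vector<double> out(4096);
    // schedule(dynamic, chunk): threads grab chunks of iterations at run
    // time, so expensive iterations no longer pin the whole loop to one
    // thread, evening out instructions retired per core.
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < (int)out.size(); ++i) {
        out[i] = process_column(i);
    }
    std::printf("%f\n", out[100]);
    return 0;
}
```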
---

#### Bandwidth saturation

**Issue**

The bandwidth of a shared data path is exhausted.

**Performance Behavior**

Saturating speedup across cores sharing a memory interface.

**Performance counters**

Compare the measured memory bandwidth to the peak bandwidth; measure the peak with a microbenchmark (MEM). This can be applied to L3 or main memory.

**Fix**

Reduce the number of loads/stores, e.g. by fusing loops as sketched below.
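
A minimal sketch of reducing memory traffic by loop fusion (the array and scalar names are placeholders): two sweeps over the same data become one, halving the number of times `a` is streamed through the memory interface.

```cpp
#include <vector>

// Before: two separate sweeps; 'a' is loaded and stored from memory twice.
void two_passes(std::vector<double>& a, double s, double t) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= s;
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += t;
}

// After: one fused sweep; 'a' is loaded and stored only once.
void fused(std::vector<double>& a, double s, double t) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = a[i] * s + t;
}
```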
---

#### Strided or erratic data access

**Issue**

Low data transfer efficiency (between caches and to/from memory), caused by inappropriate data structures or badly ordered loop nests.

**Performance Behavior**

Large discrepancy between a simple bandwidth-based model and actual performance.

**Performance counters**

Low bandwidth utilization despite LD/ST domination.

Low cache hit ratios, frequent evicts/replacements (CACHE, DATA, MEM).

**Fix**

Improve locality and strides, e.g. by reordering loop nests as sketched below.
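
A minimal sketch of a loop-nest reordering fix, assuming a row-major C/C++ array (the matrix dimension is a placeholder): interchanging the loops turns a stride-N access into a stride-1 stream.

```cpp
constexpr int N = 2048;

// Before: the inner loop walks down a column, so consecutive iterations
// touch elements N doubles apart (stride N) and miss the cache constantly.
void colsum_strided(const double (&a)[N][N], double (&colsum)[N]) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            colsum[j] += a[i][j];
}

// After: loops interchanged; the inner loop now streams each row with
// stride 1, so every cache line is fully used before eviction.
void colsum_streamed(const double (&a)[N][N], double (&colsum)[N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            colsum[j] += a[i][j];
}
```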
---

#### Bad Instruction Mix

**Issue**

Not enough parallelism, no vectorization, expensive operations, or inefficient compiler-generated code.

**Performance Behavior**

Performance insensitive to whether the problem size fits into different cache levels.

**Performance counters**

Large ratio of instructions retired to FP instructions if the useful work is FP.

Many cycles per instruction (CPI) if the problem is large-latency arithmetic.

Scalar instructions dominating in data-parallel loops.

(FLOPS_DP, FLOPS_SP.)

**Fix**

Improve the instruction mix (different operations, reordering, loop unrolling), e.g. as sketched below.
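
A minimal sketch of two such rewrites (the loop body is a placeholder): hoisting an expensive divide out of the loop and unrolling the accumulation into multiple partial sums to break the serial dependency chain.

```cpp
#include <cstddef>

// Before: a divide per iteration (expensive) and a single serial
// dependency chain through 'sum'.
double dot_scaled(const double* x, const double* y, std::size_t n, double s) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i] / s;
    return sum;
}

// After: the divide becomes one reciprocal multiply, and four partial
// sums give independent chains that can overlap in the FP pipeline.
double dot_scaled_fast(const double* x, const double* y, std::size_t n, double s) {
    const double inv = 1.0 / s;            // hoisted expensive operation
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {           // unrolled: 4 independent chains
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) s0 += x[i] * y[i];  // remainder loop
    return (s0 + s1 + s2 + s3) * inv;
}
```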
---

#### Limited instruction throughput

**Issue**

Fewer instructions per cycle than expected.

**Performance Behavior**

Large discrepancy between actual performance and simple predictions based on max Flop/s or LD/ST throughput.

**Performance counters**

Low CPI near the theoretical limit (if instruction throughput is the problem).

Static code analysis predicting large pressure on a single execution port.

High CPI due to bad pipelining.

(FLOPS_DP, FLOPS_SP, DATA.)

**Fix**

Break long dependency chains and spread pressure across execution ports, e.g. with the multiple-accumulator unrolling shown under "Bad Instruction Mix".
---

#### Synchronization overhead

**Issue**

Barriers at the end of parallel loops.

Locks protecting shared resources.

**Performance Behavior**

Speedup going down as more cores are added.

No speedup with small problem sizes.

Cores busy but low FP performance.

**Performance counters**

Large non-FP instruction count (growing with the number of cores used).

Low CPI.

(FLOPS_DP, FLOPS_SP.)

**Fix**

Remove unnecessary synchronization (especially the implicit ones!), e.g. as sketched below.
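
A minimal sketch of removing an implicit barrier, assuming two independent OpenMP loops (the array names are placeholders): `nowait` drops the barrier that each `for` construct otherwise implies at its end.

```cpp
#include <cstddef>

void update(double* a, double* b, std::size_t n) {
    #pragma omp parallel
    {
        // The two loops touch disjoint arrays, so a thread finishing the
        // first loop may start the second immediately: 'nowait' removes
        // the implicit barrier at the end of the first 'for' construct.
        #pragma omp for nowait
        for (std::size_t i = 0; i < n; ++i) a[i] *= 2.0;

        #pragma omp for
        for (std::size_t i = 0; i < n; ++i) b[i] += 1.0;
    } // one barrier remains here, at the end of the parallel region
}
```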
---

#### False cache line sharing

**Issue**

Different threads access the same cache line, and at least one of them modifies it.

**Performance Behavior**

Very low speedup or even slowdown, already at small core counts.

**Performance counters**

Frequent (remote) evicts (CACHE).

**Fix**

Revisit the working set per thread; replicate data per thread, e.g. as sketched below.
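
A minimal sketch of the data-replication fix (the counting task is a placeholder): per-thread counters packed into one plain array share cache lines, so each thread instead gets its own cache-line-sized slot, merged once at the end.

```cpp
#include <omp.h>
#include <vector>

// Without padding, counts for neighboring threads sit in the same cache
// line, so every increment invalidates the line in the other cores
// (false sharing). alignas(64) keeps one counter per 64-byte line.
struct alignas(64) PaddedCount {
    long value = 0;
};

long count_even(const std::vector<int>& data) {
    std::vector<PaddedCount> counts(omp_get_max_threads());
    #pragma omp parallel
    {
        PaddedCount& mine = counts[omp_get_thread_num()];
        #pragma omp for
        for (long i = 0; i < (long)data.size(); ++i)
            if (data[i] % 2 == 0) ++mine.value;   // no line ping-pong
    }
    long total = 0;
    for (const auto& c : counts) total += c.value; // serial merge at the end
    return total;
}
```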
---

#### Bad page placement on ccNUMA

**Issue**

Non-local data access.

Bandwidth contention.

**Performance Behavior**

Bad/no scaling across locality domains.

**Performance counters**

Unbalanced bandwidth on memory interfaces.

High remote traffic (MEM).

**Fix**

Reorganize memory accesses (and attempt a different page placement), e.g. via first-touch initialization as sketched below.
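
A minimal sketch of first-touch placement, assuming Linux's default first-touch NUMA policy (the array size and work are placeholders): initializing the array in parallel, with the same schedule as the compute loop, makes each page land in the locality domain of the thread that will use it.

```cpp
#include <cstddef>
#include <memory>

void numa_friendly(std::size_t n) {
    // Default-initialized: the allocation does not touch the pages yet.
    std::unique_ptr<double[]> a(new double[n]);

    // First touch: each page is physically placed in the NUMA domain of
    // the thread that writes it first, so initialize in parallel with
    // the same static schedule as the compute loop below.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] = 0.0;

    // Compute loop: each thread now accesses mostly local pages.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] = a[i] * 2.0 + 1.0;
}
```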
---

### GPU performance patterns

This section covers performance patterns for the GPU side of applications.

---
#### Host-device memory operations

**Issue**

Slowdown due to memory transfers.

**Performance Behavior**

Many small copies from host to device.

Low bandwidth of memory transfers.

**Performance counters**

MemUnitBusy.

**Fix**

Fuse memory transfer operations as much as possible.

Remove redundant transfers.

Use pinned instead of non-pinned memory (on some systems non-pinned transfers are several times slower than pinned ones); see the sketch below.
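
A minimal CUDA sketch of the pinned-memory and transfer-fusion fixes (the buffer size is a placeholder; on AMD hardware the same pattern applies with the HIP equivalents such as `hipHostMalloc`):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

void upload(std::size_t n) {
    double *h_buf, *d_buf;

    // Pinned (page-locked) host memory enables full-bandwidth async DMA,
    // unlike pageable memory allocated with malloc/new.
    cudaMallocHost(&h_buf, n * sizeof(double));
    cudaMalloc(&d_buf, n * sizeof(double));

    for (std::size_t i = 0; i < n; ++i) h_buf[i] = 1.0;

    // One fused copy instead of many small cudaMemcpy calls: each call
    // carries a fixed driver/launch overhead, so batch the data into a
    // single transfer where possible.
    cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}
```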
---

#### Device load and occupation

**Issue**

The accelerator is underloaded, resulting in poor memory bandwidth and arithmetic throughput.

**Performance Behavior**

The number of wavefronts is lower than the theoretical maximum.

**Performance counters**

Compare the number of wavefronts to the theoretical peak (Wavefronts).

The arithmetic unit load naturally increases as device load increases (VALUBusy).

If shared memory is not used, check whether the number of registers is a limiter (spi_vwc_csc_wr & spi_swc_csc_wr).

**Fix**

Increase the workload for the device, e.g. after checking occupancy as sketched below.
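
A minimal CUDA sketch of checking theoretical occupancy at run time (the kernel is a placeholder; HIP offers `hipOccupancyMaxActiveBlocksPerMultiprocessor` analogously):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int blockSize = 256;
    int numBlocks = 0;
    // How many resident blocks of 'saxpy' fit per multiprocessor, given
    // its register usage and 0 bytes of dynamic shared memory?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, saxpy,
                                                  blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = double(numBlocks * blockSize) /
                       prop.maxThreadsPerMultiProcessor;
    std::printf("theoretical occupancy: %.0f%%\n", occupancy * 100.0);
    return 0;
}
```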
---

#### Global memory traffic

**Issue**

The memory bandwidth is not fully utilized.

**Performance Behavior**

The average bandwidth of memory operations drops several-fold.

**Performance counters**

AI signifies a memory-bound application (VALUBusy).

Saturation of the memory bus (MemUnitStalled).

**Fix**

Coalesce memory access patterns, e.g. as sketched below.

Increase the workload or decrease the size of the offloaded code.
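
A minimal CUDA sketch of the coalescing fix (the struct layout is a placeholder): with an array-of-structs, neighboring work-items read addresses a full struct apart; a struct-of-arrays layout makes them read consecutive words, which the hardware coalesces into few memory transactions.

```cuda
// Array-of-structs: thread i reads p[i].x, so consecutive threads touch
// addresses sizeof(Particle) = 32 bytes apart -> poorly coalesced.
struct Particle { float x, y, z, vx, vy, vz, m, pad; };

__global__ void scale_aos(Particle* p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;
}

// Struct-of-arrays: thread i reads x[i], so a warp/wavefront touches one
// contiguous span -> fully coalesced loads and stores.
__global__ void scale_soa(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
```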
---

#### Accelerator kernels granularity

**Issue**

The overhead of launching kernels is bigger than the benefit.

**Performance Behavior**

Low execution times of individual kernels, but many of them are launched.

**Performance counters**

No dedicated counter; compare per-kernel execution times against the launch overhead in a trace.

**Fix**

Fuse kernels together (see the sketch below). Sometimes this can be done automatically, sometimes it requires manual work.
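
A minimal CUDA sketch of manual kernel fusion (the kernels are placeholders): two tiny elementwise kernels, each launched separately per step, become one kernel and thus one launch overhead, and `a[i]` is also read and written only once.

```cuda
// Before: two launches per step, each kernel doing very little work.
__global__ void scale(float* a, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}
__global__ void shift(float* a, int n, float t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += t;
}

// After: one fused kernel, one launch, one pass over the data.
__global__ void scale_shift(float* a, int n, float s, float t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * s + t;
}
```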
---

#### Divergent branching overhead

**Issue**

Excessive branching reduces vector ALU utilization: work-items of a wavefront that take different paths are masked out and the paths execute serially.

**Performance Behavior**

Low utilization of the ALUs.

**Performance counters**

VALUUtilization.

**Fix**

Remove branching from the code, e.g. as sketched below.
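
A minimal CUDA sketch of removing a divergent branch (the threshold logic is a placeholder): replacing an if/else with an arithmetic select keeps all work-items of the wavefront on one path.

```cuda
// Before: where the condition differs within one wavefront, both paths
// execute with parts of the wave masked off (divergence).
__global__ void clamp_branchy(float* a, int n, float lim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (a[i] > lim) a[i] = lim;
        else            a[i] = a[i] * 0.5f;
    }
}

// After: a branchless select; the whole wavefront executes one path.
__global__ void clamp_branchless(float* a, int n, float lim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = a[i];
        a[i] = (v > lim) ? lim : v * 0.5f; // typically a select, not a jump
    }
}
```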
---

#### Shared memory bank conflicts

**Issue**

Degrading performance due to increased memory access times.

**Performance Behavior**

Increased ALU stalls while waiting on LDS accesses.

**Performance counters**

LDSBankConflict indicates that a conflict appears.

ALUStalledByLDS and LDSInsts can be used to get a better picture.

**Fix**

Pad shared-memory arrays or change access strides so that the work-items of a wavefront hit different banks, e.g. as sketched below.
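
A minimal CUDA sketch of the padding fix for a shared-memory transpose tile (the tile size is a placeholder): with a 32-wide tile, column accesses hit the same bank for every work-item; one extra column shifts each row into a different bank.

```cuda
#define TILE 32

__global__ void transpose(const float* in, float* out, int n) {
    // +1 padding column: without it, tile[threadIdx.x][j] maps every
    // thread of a warp/wavefront to the same bank (stride of 32 words),
    // serializing the LDS/shared access; the extra column breaks that
    // stride so threads hit distinct banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n) out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}
```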
---

#### Impact of atomic operations

**Issue**

Atomic operations can cause contention on variables, reducing the overall throughput.

**Performance Behavior**

Reduced performance, which can differ between memory regions.

**Performance counters**

Metrics on memory transactions, cache utilization, and potential contention, if there are no alternative reasons for the worsening of their values.

These can be: MemUnitBusy, MemUnitStalled, L2CacheHit, VALUBusy.

**Fix**

Remove atomic operations from the code, e.g. as sketched below.
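
A minimal CUDA sketch of reducing atomic contention (the sum is a placeholder): instead of every work-item issuing `atomicAdd` on one global counter, each block first reduces in shared memory, leaving a single atomic per block.

```cuda
#define BLOCK 256

// Before: n atomics on a single global address -> heavy contention.
__global__ void sum_naive(const float* x, int n, float* total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, x[i]);
}

// After: block-local tree reduction in shared memory, then one
// atomicAdd per block -> contention drops by a factor of BLOCK.
__global__ void sum_reduced(const float* x, int n, float* total) {
    __shared__ float buf[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(total, buf[0]);
}
```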
---

# Subpages

- [Intel Advisor & Intel VTune](3.a.-Intel-Offload-Advisor-&-Intel-VTune)
- [CPU and GPU performance overview](3.b.-CPU-and-GPU-performance-overview)