# Overview profile & analysis goals
**Whole application vs functions**.
For DEODE we profile whole applications, while for DE340 we mostly profile individual functions through the mocking framework.
# GPU Profiling Methodology
**Code setups**
For the code, we currently have 4 different setups:
- Sequential CPU code
- Parallel CPU code
- GPU + sequential CPU code
- GPU + parallel CPU code
**Performance metrics**
There are many performance metrics to test for. The list below covers the metrics and hardware counters that can be collected.
- Execution time
- Instructions retired per core / CPU clock unhalted per core (for load balancing)
- CPI (cycles per instruction)
- AI (arithmetic intensity)
- Branch rate & Branch misprediction rate/ratio
- Memory number of loads vs stores
- Memory bandwidth saturation
- Memory NUMA accesses
- False sharing
- Cache misses (L1, L2, L3)
- GPU Busy (percentage of time GPU busy)
- Wavefronts (total wavefronts)
- VALUInsts (average number of vector ALU instructions executed per work-item)
- VFetchInsts (average number of vector fetch instructions from video memory per work-item)
- VWriteInsts (average number of vector write instructions to video memory per work-item)
- VALUUtilization (percentage of active vector ALU threads in a wave)
- VALUBusy (percentage of GPUTime vector ALU instructions are processed)
- WriteSize (total KBs written to video memory)
- L2CacheHit (percentage of hits in L2 data)
- MemUnitBusy (percentage of GPUTime the memory unit is active, including stall time; measured with extra fetches and writes and cache or memory effects taken into account)
- MemUnitStalled (percentage of GPUTime the memory unit is stalled)
- WriteUnitStalled (percentage GPUTime the WriteUnit is stalled)
- LDSBankConflict (Percentage of GPUTime LDS is stalled by bank conflict)
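
Several entries in the list above (CPI, AI, branch rate and misprediction ratio) are ratios of raw counts rather than counters themselves. The following is a minimal sketch of how they could be derived; the `CounterSample` type, its field names, and the sample values are illustrative assumptions, not the output format of any particular profiler.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counts for one measured region (hypothetical field names)."""
    cycles: int
    instructions: int
    flops: int
    bytes_moved: int
    branches: int
    branch_mispredictions: int


def derived_metrics(s: CounterSample) -> dict:
    """Turn raw counts into the ratio metrics from the list above."""
    return {
        "cpi": s.cycles / s.instructions,                 # cycles per instruction
        "arithmetic_intensity": s.flops / s.bytes_moved,  # flop per byte
        "branch_rate": s.branches / s.instructions,
        "branch_misprediction_ratio": s.branch_mispredictions / s.branches,
    }


# Made-up numbers, only to show the arithmetic.
sample = CounterSample(cycles=2_000_000, instructions=1_000_000, flops=500_000,
                       bytes_moved=4_000_000, branches=100_000,
                       branch_mispredictions=5_000)
metrics = derived_metrics(sample)
```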
## Application optimization workflow
The workflow for optimizing code can be summarized as:
1. Understand requirements
2. Understand current performance
3. Can it be done? (modeling)
4. How can it be done? (some options)
5. Tuning
   - Not there yet? Back to step 2!
6. Analyze the result
This workflow can be adapted to analyze our projects via:
1. Optimize for latency or throughput?
2. Create cache-aware roofline model to understand current performance and theoretical limit
3. Take action based on findings
   1. If memory bound, check: Load Imbalance, Cache Misses, Memory Loads & Stores, NUMA access
   2. If compute bound, check: Load Imbalance, Branch Rate & Misprediction, Performance Options (intrinsics, vectorization, pipelining, superscalar execution, out-of-order execution, branch prediction, speculative execution), or GPU porting
4. Analyze changes with new roofline model, and (potentially) repeat!
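
Step 2 of the adapted workflow relies on a cache-aware roofline model. As a rough sketch, the attainable performance for each memory level is `min(peak, AI * bandwidth)`, and a kernel is memory bound when its AI lies below the ridge point `peak / bandwidth`. The machine numbers below are invented for illustration; real values would come from microbenchmarks or a profiler.

```python
def roofline(ai, peak_gflops, bandwidths_gbs):
    """Attainable GFLOP/s per memory level: min(peak, AI * bandwidth)."""
    return {level: min(peak_gflops, ai * bw)
            for level, bw in bandwidths_gbs.items()}


def bound_type(ai, peak_gflops, bandwidth_gbs):
    """Classify a kernel against one roof by comparing AI to the ridge point."""
    ridge = peak_gflops / bandwidth_gbs  # AI at which both roofs meet
    return "memory bound" if ai < ridge else "compute bound"


# Hypothetical machine: peak in GFLOP/s, bandwidths in GB/s (assumed values).
PEAK = 1000.0
BANDWIDTHS = {"L1": 4000.0, "L2": 2000.0, "L3": 800.0, "DRAM": 200.0}

roofs = roofline(ai=0.5, peak_gflops=PEAK, bandwidths_gbs=BANDWIDTHS)
dram_bound = bound_type(0.5, PEAK, BANDWIDTHS["DRAM"])
```

Comparing the measured GFLOP/s of a kernel against `roofs` shows which memory level currently limits it and how far it is from the theoretical limit.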
## Performance Patterns & Signatures
Performance patterns (or anti-patterns) are specific behaviors/problems that form bottlenecks for performance. The subsections below will discuss performance patterns to look out for. CPU and GPU patterns will be discussed separately.
### CPU performance patterns
This section covers all performance patterns for the CPU side of applications.
#### Load Imbalance
The workload is not equally distributed.
Several units stall waiting for one unit to complete.
**Performance Behavior**
Saturating speedup (sooner than expected).
**Performance counters**
Different count of instructions retired or floating point operations among cores (FLOPS_DP, FLOPS_SP).
**Fix** Reorganize work to improve load balancing.
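
One way to quantify the imbalance from per-core counts (for example instructions retired) is the ratio of the busiest core to the average. This is a sketch with made-up counts; the function names are hypothetical.

```python
def imbalance_factor(per_core_counts):
    """Ratio of the busiest core to the mean; 1.0 means perfectly balanced."""
    mean = sum(per_core_counts) / len(per_core_counts)
    return max(per_core_counts) / mean


def efficiency_bound(per_core_counts):
    """Upper bound on parallel efficiency if all cores wait for the busiest one."""
    return 1.0 / imbalance_factor(per_core_counts)


# Made-up instructions-retired counts for four cores.
counts = [100, 100, 100, 300]
factor = imbalance_factor(counts)   # busiest core does twice the average work
bound = efficiency_bound(counts)
```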
#### Bandwidth saturation
Bandwidth of a shared data path is exhausted.
**Performance Behavior**
Saturating speedup across cores sharing a memory interface.
**Performance counters**
Compare the measured memory bandwidth to the peak bandwidth (MEM).
Measure the peak with a microbenchmark; this can be applied to L3 or main memory.
**Fix** Reduce the number of loads/stores.
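
The saturation behavior can be approximated with a simple model: speedup grows linearly until the cores together demand the peak bandwidth of the shared interface, then it flattens. This is a sketch under that assumption; the bandwidth figures are invented.

```python
def expected_speedup(cores, single_core_bw, peak_bw):
    """Linear scaling capped at the point where the shared interface saturates."""
    return min(cores, peak_bw / single_core_bw)


def bandwidth_utilization(measured_bw, peak_bw):
    """Fraction of the microbenchmark peak that the application reaches."""
    return measured_bw / peak_bw


# Assumed values in GB/s: one core draws 20, the interface peaks at 100.
speedups = [expected_speedup(n, 20.0, 100.0) for n in range(1, 9)]
utilization = bandwidth_utilization(90.0, 100.0)
```

With these numbers the speedup stops improving after five cores, which is the signature described above.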
#### Strided or erratic data access
Low data transfer efficiency (between caches and to/from memory).
Inappropriate data structures or badly ordered loop nests.
**Performance Behavior**
Large discrepancy between simple bandwidth-based model and actual performance.
**Performance counters**
Low bandwidth utilization despite LD/ST domination.
Low cache hit ratios, frequent evicts/replacements (CACHE, DATA, MEM).
**Fix** Improve locality and strides.
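
To see why strides hurt transfer efficiency, consider how much of each fetched cache line is actually used. The sketch below assumes a 64-byte cache line and a constant element stride; both are assumptions for illustration.

```python
def line_efficiency(stride_elems, elem_bytes=8, line_bytes=64):
    """Fraction of each fetched cache line that is useful for a constant stride.

    Once the stride spans a full line, every access fetches a new line and
    only one element of it is used.
    """
    bytes_fetched_per_elem = min(stride_elems * elem_bytes, line_bytes)
    return elem_bytes / bytes_fetched_per_elem


# 8-byte doubles: unit stride uses the whole line, stride 8 and above only 1/8.
efficiencies = {s: line_efficiency(s) for s in (1, 2, 8, 16)}
```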
#### Bad Instruction Mix
Not enough parallelism, no vectorization, expensive operations.
Inefficient compiler code.
**Performance Behavior**
Performance insensitive to problem size fitting into different cache levels.
**Performance counters**
Large ratio of instructions retired to FP instructions if the useful work is FP.
Many cycles per instruction (CPI) if the problem is large-latency arithmetic.
Scalar instructions dominating in data-parallel loops.
**Fix** Improve instruction mix (different operations, reordering, loop unrolling).
#### Limited instruction throughput
Fewer than expected instructions per cycle.
**Performance Behavior**
Large discrepancy between actual performance and simple predictions based on max Flop/s or LD/ST throughput.
**Performance counters**
Low CPI near theoretical limit (if instruction throughput is the problem).
Static code analysis predicting large pressure on single execution port.
High CPI due to bad pipelining.
#### Synchronization overhead
Barriers at the end of parallel loops.
Locks protecting shared resources.
**Performance Behavior**
Speedup going down as more cores are added.
No speedup with small problem sizes.
Cores busy but low FP performance.
**Performance counters**
Large non-FP instruction count (growing with number of cores used).
Low CPI.
**Fix** Remove unnecessary synchronization (especially the implicit ones!).
#### False cache line sharing
Different threads accessing a cache line, at least one of them modifying it.
**Performance Behavior**
Very low speedup or slowdown even with small core counts.
**Performance counters**
Frequent (remote) evicts (CACHE).
**Fix** Revisit the working set per thread.
Data replication.
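
Whether two per-thread variables can falsely share depends on whether their addresses map to the same cache line. The sketch below assumes 64-byte lines and shows how padding each thread's slot to a full line separates them; the helper names are hypothetical.

```python
LINE_BYTES = 64  # assumed cache line size


def same_cache_line(offset_a, offset_b, line_bytes=LINE_BYTES):
    """True if two byte offsets fall into the same cache line."""
    return offset_a // line_bytes == offset_b // line_bytes


def padded_size(elem_bytes, line_bytes=LINE_BYTES):
    """Round an element size up to a whole number of cache lines."""
    return -(-elem_bytes // line_bytes) * line_bytes


# Packed 8-byte counters: threads 0 and 1 write to the same line.
packed = same_cache_line(0 * 8, 1 * 8)

# Padded counters: each thread gets its own line.
slot = padded_size(8)
padded = same_cache_line(0 * slot, 1 * slot)
```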
#### Bad page placement on ccNUMA
Non-local data access.
Bandwidth contention.
**Performance Behavior**
Bad/no scaling across locality domains.
**Performance counters**
Unbalanced bandwidth on memory interfaces.
High remote traffic (MEM).
**Fix** Reorganize memory accesses.
(Attempt different page placement).
### GPU performance patterns
This section covers all performance patterns for the GPU side of applications.
#### Host-device memory operations
Slowdown due to memory transfers.
**Performance Behavior**
Many small copies from host to device.
Low bandwidth of memory transfers.
**Fix** Fuse memory transfer operations as much as possible.
Remove redundant transfers.
Use pinned memory instead of non-pinned memory (on some systems non-pinned memory transfers are several times slower than pinned).
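
The benefit of fusing transfers can be estimated with a latency-plus-bandwidth model, where each copy pays a fixed setup cost. The latency and bandwidth values below are assumptions chosen for illustration, not measurements.

```python
def transfer_time(num_copies, bytes_per_copy, latency_s, bandwidth_bps):
    """Total time for num_copies transfers, each paying a fixed latency."""
    return num_copies * (latency_s + bytes_per_copy / bandwidth_bps)


LATENCY = 10e-6      # assumed per-copy overhead: 10 microseconds
BANDWIDTH = 10e9     # assumed link bandwidth: 10 GB/s

# 1000 copies of 4 KiB versus one fused copy of the same total size.
many_small = transfer_time(1000, 4096, LATENCY, BANDWIDTH)
one_fused = transfer_time(1, 1000 * 4096, LATENCY, BANDWIDTH)
slowdown = many_small / one_fused
```

In this model the small copies are dominated by the fixed latency, which matches the low effective bandwidth described above.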
#### Device load and occupation
The accelerator is underloaded, resulting in poor memory bandwidth and arithmetic throughput.
**Performance Behavior**
Number of wavefronts is lower than theoretical maximum.
**Performance counters**
Compare the number of wavefronts to theoretical peak (Wavefronts).
Check that the arithmetic unit load rises as the device load increases (VALUBusy).
If shared memory is not used, check whether the number of registers is the limiter (spi_vwc_csc_wr & spi_swc_csc_wr).
**Fix** Increase the workload for the device.
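
A quick estimate of how well a launch fills the device compares the wavefronts it produces with the number the device can keep resident. The sketch assumes a wavefront size of 64 work-items; the compute unit count and the waves per compute unit are hypothetical and must be taken from the actual hardware.

```python
import math


def wavefronts_launched(work_items, wavefront_size=64):
    """Number of wavefronts needed to cover all work-items."""
    return math.ceil(work_items / wavefront_size)


def device_fill(work_items, compute_units, waves_per_cu, wavefront_size=64):
    """Launched wavefronts relative to the resident maximum, capped at 1.0."""
    max_resident = compute_units * waves_per_cu
    return min(1.0, wavefronts_launched(work_items, wavefront_size) / max_resident)


# Hypothetical device: 100 compute units with 32 resident wavefronts each.
waves = wavefronts_launched(10_000)
fill = device_fill(10_000, compute_units=100, waves_per_cu=32)
```

A fill far below 1.0, as in this example, suggests the offloaded workload is too small for the device.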
#### Global memory traffic
The memory bandwidth is not fully utilized.
**Performance Behavior**
Average bandwidth of memory transfer operations drops by a factor of several.
**Performance counters**
Low arithmetic intensity (AI) indicates a memory-bound application (VALUBusy).
Saturation of the memory bus (MemUnitStalled).
**Fix** Coalesce memory access patterns.
Increase workload or decrease size of offloaded code.
#### Accelerator kernels granularity
The overhead of launching kernels is bigger than benefit.
**Performance Behavior**
Low execution times of individual kernels, but many are launched.
**Fix** Fuse kernels together. Sometimes this can be done automatically, sometimes it requires manual work.
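
The effect of fusing can be sketched with a simple cost model in which every launch pays a fixed overhead. The launch and execution times are invented values; the model also ignores any overlap of launches with execution.

```python
def unfused_time(num_kernels, launch_s, exec_s):
    """Every kernel pays the launch overhead plus its own execution time."""
    return num_kernels * (launch_s + exec_s)


def fused_time(num_kernels, launch_s, exec_s):
    """One launch, with the same total amount of work inside it."""
    return launch_s + num_kernels * exec_s


LAUNCH = 5e-6   # assumed launch overhead: 5 microseconds
EXEC = 2e-6     # assumed execution time per small kernel: 2 microseconds

before = unfused_time(1000, LAUNCH, EXEC)
after = fused_time(1000, LAUNCH, EXEC)
overhead_fraction = LAUNCH / (LAUNCH + EXEC)  # share of time spent launching
```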
#### Divergent branching overhead
Excessive divergent branching reduces ALU utilization.
**Performance Behavior**
There is low utilization of the ALUs.
**Fix** Remove branching factors from code.
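
The low ALU utilization can be illustrated with a simplified execution model: when lanes of one wavefront disagree on a branch, both paths run one after the other and inactive lanes are masked off. The function below is a sketch of that model with hypothetical, strictly positive path costs in cycles; it is not how any profiler computes its counters.

```python
def wavefront_utilization(takes_branch, then_cost, else_cost):
    """Useful lane-cycles divided by total lane-cycles for one wavefront.

    takes_branch: one boolean per lane. Costs must be positive.
    """
    lanes = len(takes_branch)
    n_then = sum(takes_branch)
    n_else = lanes - n_then
    # A path is executed only if at least one lane takes it.
    cycles = (then_cost if n_then else 0) + (else_cost if n_else else 0)
    useful = n_then * then_cost + n_else * else_cost
    return useful / (lanes * cycles)


uniform = wavefront_utilization([True] * 64, then_cost=10, else_cost=10)
divergent = wavefront_utilization([i % 2 == 0 for i in range(64)], 10, 10)
```

With equal path costs, an evenly split wavefront keeps only half of its lanes busy in this model.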
#### Shared memory bank conflicts
Degrading performance due to increase in memory access times.
**Performance Behavior**
Increased ALU stall?
**Performance counters**
LDSBankConflict indicates that bank conflicts occur.
ALUStalledByLDS and LDSInsts can be used to get a better picture.
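
Whether an access pattern causes bank conflicts can be checked offline by mapping each lane's address to a bank. The sketch assumes 32 banks of 4-byte words and that lanes reading the same word are served by a broadcast; both are assumptions to verify against the target hardware.

```python
from collections import Counter


def conflict_degree(word_indices, num_banks=32):
    """Largest number of distinct words that map to the same bank.

    1 means conflict-free; N means accesses are serialized N ways.
    """
    per_bank = Counter(idx % num_banks for idx in set(word_indices))
    return max(per_bank.values())


def strided_access(stride_words, lanes=32):
    """Word index touched by each lane for a constant stride."""
    return [lane * stride_words for lane in range(lanes)]


degrees = {s: conflict_degree(strided_access(s)) for s in (1, 2, 32, 33)}
```

The stride of 33 shows the usual padding trick: adding one word per row makes a former stride of 32 conflict-free in this model.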
#### Impact of atomic operations
Atomic operations can cause contention on variables, reducing the overall throughput.
**Performance Behavior**
Reduced performance which can be different in different memory regions.
**Performance counters**
Metrics on memory transactions, cache utilization, and contention can point to atomics, provided nothing else explains why their values got worse.
Candidates are: MemUnitBusy, MemUnitStalled, L2CacheHit, VALUBusy.
**Fix** Remove atomic operations from code.
# Subpages
- [Intel Advisor & Intel VTune](3.a.-Intel-Offload-Advisor-&-Intel-VTune)
- [CPU and GPU performance overview](3.b.-CPU-and-GPU-performance-overview)