... | ... | @@ -67,53 +67,118 @@ Performance patterns (or anti-patterns) are specific behaviors/problems that for |
|
|
### CPU performance patterns
|
|
|
This section covers all performance patterns for the CPU side of applications.
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Load Imbalance
|
|
|
**Issue:** The workload is not equally distributed. Several units stall waiting for one unit to complete.
|
|
|
**Performance Behavior:** Saturating speed-up (sooner than expected)
|
|
|
**Performance counters:** Different count of instructions retired or floating point operations among cores (FLOPS_DP, FLOPS_SP)
|
|
|
**Fix:** Reorganize work to improve load balancing.
|
|
|
**Issue**
|
|
|
The workload is not equally distributed.
|
|
|
Several units stall waiting for one unit to complete.
|
|
|
**Performance Behavior**
|
|
|
Saturating speedup (sooner than expected).
|
|
|
**Performance counters**
|
|
|
Counts of instructions retired or floating-point operations differ among cores (FLOPS_DP, FLOPS_SP).
|
|
|
**Fix**
|
|
|
Reorganize work to improve load balancing.
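
The counter signature above can be reduced to a single number. The following is a minimal sketch, assuming the per-core counts have already been collected; the cost model, the sizes, and all function names are hypothetical and only illustrate how a block distribution of a triangular workload compares with a round-robin one.

```python
from statistics import mean


def imbalance_factor(per_core_counts):
    """Busiest core's count divided by the average count (1.0 means balanced)."""
    return max(per_core_counts) / mean(per_core_counts)


def block_partition(n_rows, n_workers):
    """Contiguous blocks of rows (assumes n_rows is divisible by n_workers)."""
    size = n_rows // n_workers
    return [range(w * size, (w + 1) * size) for w in range(n_workers)]


def cyclic_partition(n_rows, n_workers):
    """Rows dealt out round-robin, one possible way to reorganize the work."""
    return [range(w, n_rows, n_workers) for w in range(n_workers)]


def work(rows):
    # Hypothetical triangular cost model: row i costs i + 1 units.
    return sum(i + 1 for i in rows)


n_rows, n_workers = 1024, 4
block_counts = [work(r) for r in block_partition(n_rows, n_workers)]
cyclic_counts = [work(r) for r in cyclic_partition(n_rows, n_workers)]

print(f"block  imbalance: {imbalance_factor(block_counts):.2f}")
print(f"cyclic imbalance: {imbalance_factor(cyclic_counts):.2f}")
```

A factor well above 1.0 means the other cores wait for the busiest one; the round-robin variant brings the factor close to 1.0 for this particular cost model.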
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Bandwidth saturation
|
|
|
**Issue:**
|
|
|
**Performance Behavior:**
|
|
|
**Performance counters:**
|
|
|
**Fix:**
|
|
|
**Issue**
|
|
|
Bandwidth of a shared data path is exhausted.
|
|
|
**Performance Behavior**
|
|
|
Saturating speedup across cores sharing a memory interface.
|
|
|
**Performance counters**
|
|
|
Compare the measured memory bandwidth to the peak bandwidth. Measure the peak with a microbenchmark (MEM); the same comparison can be applied to L3 or main memory.
|
|
|
**Fix**
|
|
|
Reduce the number of loads and stores.
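
The saturating behavior can be illustrated with a very simple model. This is a sketch under stated assumptions: bandwidth grows linearly with the core count until it hits the peak, and both bandwidth values below are hypothetical numbers, not measurements.

```python
import math


def aggregate_bandwidth(n_cores, single_core_bw, peak_bw):
    """Linear-then-flat model: cores add bandwidth until the shared path is full."""
    return min(n_cores * single_core_bw, peak_bw)


def saturation_point(single_core_bw, peak_bw):
    """Smallest core count at which the model reaches the peak bandwidth."""
    return math.ceil(peak_bw / single_core_bw)


def utilization(measured_bw, peak_bw):
    """Measured bandwidth as a fraction of the microbenchmark peak."""
    return measured_bw / peak_bw


SINGLE_CORE_BW = 12.0  # GB/s, hypothetical single-core bandwidth
PEAK_BW = 40.0         # GB/s, hypothetical microbenchmark peak

for n in range(1, 9):
    bw = aggregate_bandwidth(n, SINGLE_CORE_BW, PEAK_BW)
    print(f"{n} cores: {bw:5.1f} GB/s, speedup {bw / SINGLE_CORE_BW:.2f}")
```

If the measured utilization is close to 1.0, adding cores cannot help and only reducing the data traffic will.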
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Strided or erratic data access
|
|
|
**Issue:**
|
|
|
**Performance Behavior:**
|
|
|
**Performance counters:**
|
|
|
**Fix:**
|
|
|
**Issue**
|
|
|
Low data transfer efficiency (between caches and to/from memory).
|
|
|
Inappropriate data structures or badly ordered loop nests.
|
|
|
**Performance Behavior**
|
|
|
Large discrepancy between simple bandwidth-based model and actual performance.
|
|
|
**Performance counters**
|
|
|
Low bandwidth utilization despite LD/ST domination.
|
|
|
Low cache hit ratios, frequent evictions/replacements (CACHE, DATA, MEM).
|
|
|
**Fix**
|
|
|
Improve locality and strides.
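
Why strides hurt transfer efficiency can be sketched with a small model. It assumes aligned elements, whole cache lines as the transfer unit, and no reuse of a line before it is evicted; the element size, line size, and matrix width are hypothetical defaults.

```python
def line_utilization(stride_elems, elem_bytes=8, line_bytes=64):
    """Fraction of each transferred cache line that the traversal actually uses."""
    bytes_moved_per_elem = min(stride_elems * elem_bytes, line_bytes)
    return elem_bytes / bytes_moved_per_elem


def inner_stride(n_cols, inner_index):
    """Stride (in elements) of the innermost loop over a row-major 2D array."""
    return 1 if inner_index == "col" else n_cols


N_COLS = 1000  # hypothetical matrix width
for inner in ("col", "row"):
    stride = inner_stride(N_COLS, inner)
    print(f"inner loop over {inner}: stride {stride}, "
          f"line utilization {line_utilization(stride):.3f}")
```

In this model, swapping the loop nest so that the innermost loop runs over the contiguous index raises the utilization from one element per line to the full line.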
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Bad Instruction Mix
|
|
|
**Issue:**
|
|
|
**Performance Behavior:**
|
|
|
**Performance counters:**
|
|
|
**Fix:**
|
|
|
**Issue**
|
|
|
Not enough parallelism, no vectorization, expensive operations.
|
|
|
Inefficient compiler-generated code.
|
|
|
**Performance Behavior**
|
|
|
Performance insensitive to problem size fitting into different cache levels.
|
|
|
**Performance counters**
|
|
|
Large ratio of instructions retired to FP instructions if the useful work is FP.
|
|
|
Many cycles per instruction (CPI) if the problem is large-latency arithmetic.
|
|
|
Scalar instructions dominating in data-parallel loops.
|
|
|
(FLOPS_DP, FLOPS_SP).
|
|
|
**Fix**
|
|
|
Improve instruction mix (different operations, reordering, loop unrolling).
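
The counter-based indicators listed above are simple ratios of raw event counts. A minimal sketch, assuming the counts are already available; the dictionary keys and the numbers are hypothetical and do not correspond to a specific tool's event names.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions


def instructions_per_fp(instructions, fp_instructions):
    """Instructions retired per floating-point instruction."""
    return instructions / fp_instructions


def scalar_fraction(scalar_fp, packed_fp):
    """Share of floating-point instructions that are scalar rather than packed."""
    return scalar_fp / (scalar_fp + packed_fp)


# Hypothetical raw counts for one loop.
counts = {
    "cycles": 8_000_000,
    "instructions": 4_000_000,
    "scalar_fp": 450_000,
    "packed_fp": 50_000,
}
fp_total = counts["scalar_fp"] + counts["packed_fp"]

print(f"CPI:                 {cpi(counts['cycles'], counts['instructions']):.2f}")
print(f"instructions per FP: {instructions_per_fp(counts['instructions'], fp_total):.2f}")
print(f"scalar FP fraction:  {scalar_fraction(counts['scalar_fp'], counts['packed_fp']):.2f}")
```

A high instructions-per-FP ratio together with a scalar fraction near 1.0 in a data-parallel loop points at missing vectorization.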
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Limited instruction throughput
|
|
|
**Issue:**
|
|
|
**Performance Behavior:**
|
|
|
**Performance counters:**
|
|
|
**Fix:**
|
|
|
**Issue**
|
|
|
Fewer than expected instructions per cycle.
|
|
|
**Performance Behavior**
|
|
|
Large discrepancy between actual performance and simple predictions based on max Flop/s or LD/ST throughput.
|
|
|
**Performance counters**
|
|
|
Low CPI near theoretical limit (if instruction throughput is the problem).
|
|
|
Static code analysis predicting large pressure on single execution port.
|
|
|
High CPI due to bad pipelining.
|
|
|
(FLOPS_DP, FLOPS_SP, DATA).
|
|
|
**Fix:** ?
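
The static-analysis indicator mentioned above (large pressure on a single execution port) can be sketched as follows. This is a toy model, assuming each port issues at most one instruction per cycle and scheduling is otherwise perfect; the port names and the per-port instruction counts are hypothetical.

```python
def min_cycles_per_iteration(port_load):
    """Throughput bound: the most heavily loaded port sets the pace."""
    return max(port_load.values())


def bottleneck_ports(port_load):
    """Ports that carry the maximum load."""
    worst = max(port_load.values())
    return sorted(port for port, n in port_load.items() if n == worst)


def best_case_cpi(port_load):
    """Lowest CPI this loop body can reach under the model."""
    return min_cycles_per_iteration(port_load) / sum(port_load.values())


# Hypothetical instruction counts per execution port for one loop iteration.
loop_body = {"p0": 2, "p1": 2, "p2": 1, "p3": 1, "p4": 4}

print("cycles per iteration:", min_cycles_per_iteration(loop_body))
print("bottleneck port(s):  ", bottleneck_ports(loop_body))
print(f"best-case CPI:        {best_case_cpi(loop_body):.2f}")
```

A measured CPI close to this bound suggests the single port is the limit; a much higher measured CPI suggests pipelining problems instead.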
|
|
|
|
|
|
---
|
|
|
|
|
|
#### Synchronization overhead
|
|
|
**Issue:**
|
|
|
**Performance Behavior:**
|
|
|
**Performance counters:**
|
|
|
**Fix:**
|
|
|
**Issue**
|
|
|
Barriers at the end of parallel loops.
|
|
|
Locks protecting shared resources.
|
|
|
**Performance Behavior**
|
|
|
Speedup going down as more cores are added.
|
|
|
No speedup with small problem sizes.
|
|
|
Cores busy but low FP performance.
|
|
|
**Performance counters**
|
|
|
Large non-FP instruction count (growing with number of cores used).
|
|
|
Low CPI.
|
|
|
FLOPS_DP, FLOPS_SP.
|
|
|
**Fix**
|
|
|
Remove unnecessary synchronization, especially implicit synchronization points.
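
The two scaling symptoms (speedup dropping with more cores, no speedup for small problems) follow from a simple runtime model. The sketch below assumes that the cost of one barrier grows linearly with the number of cores, which is a deliberate simplification; all parameter values are hypothetical.

```python
def runtime(n_cores, work_time, n_barriers, barrier_cost_per_core):
    """Parallel work plus barrier cost; a single core pays no barrier cost."""
    sync = 0.0 if n_cores == 1 else n_barriers * barrier_cost_per_core * n_cores
    return work_time / n_cores + sync


def speedup(n_cores, work_time, n_barriers, barrier_cost_per_core):
    """Speedup relative to the single-core runtime under the same model."""
    serial = runtime(1, work_time, n_barriers, barrier_cost_per_core)
    return serial / runtime(n_cores, work_time, n_barriers, barrier_cost_per_core)


N_BARRIERS = 1000    # hypothetical number of barriers in the run
BARRIER_COST = 2e-5  # seconds per barrier per core, hypothetical

for work_time in (1.0, 0.01):  # large and small problem, in seconds
    for n in (1, 2, 4, 8, 16):
        s = speedup(n, work_time, N_BARRIERS, BARRIER_COST)
        print(f"work {work_time:5.2f} s, {n:2d} cores: speedup {s:.2f}")
```

Reducing the number of barriers shifts the point at which the synchronization term starts to dominate.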
|
|
|
---
|
|
|
|
|
|
#### False cache line sharing
|
|
|
**Issue:**
|
|
|
**Performance Behavior:**
|
|
|
**Performance counters:**
|
|
|
**Fix:**
|
|
|
**Issue**
|
|
|
Different threads accessing a cache line, at least one of them modifying it.
|
|
|
**Performance Behavior**
|
|
|
Very low speedup or slowdown even with small core counts.
|
|
|
**Performance counters**
|
|
|
Frequent (remote) evictions (CACHE).
|
|
|
**Fix**
|
|
|
Revisit the working set per thread.
|
|
|
Data replication.
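
Whether per-thread data ends up in a shared cache line is a matter of layout arithmetic. The sketch below assumes a contiguous array that starts at a cache-line boundary and a 64-byte line; padding each slot to a full line is one common way to act on the fix above, not something the text prescribes, and the function names are hypothetical.

```python
def cache_lines(n_threads, slot_bytes, line_bytes=64):
    """Cache-line index of each thread's slot in a contiguous, line-aligned array."""
    return [(t * slot_bytes) // line_bytes for t in range(n_threads)]


def shares_line(lines):
    """True if at least two threads have their slot in the same cache line."""
    return len(set(lines)) < len(lines)


N_THREADS = 4
unpadded = cache_lines(N_THREADS, slot_bytes=8)   # one 8-byte counter per thread
padded = cache_lines(N_THREADS, slot_bytes=64)    # each counter padded to a full line

print("unpadded:", unpadded, "shared line:", shares_line(unpadded))
print("padded:  ", padded, "shared line:", shares_line(padded))
```

With 8-byte slots all four counters land in the same line, so every update by one thread invalidates the line for the others; with 64-byte slots each thread owns its line.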
|
|
|
---
|
|
|
|
|
|
#### Bad page placement on ccNUMA
|
|
|
**Issue:**
|
|
|
**Performance Behavior:**
|
|
|
**Performance counters:**
|
|
|
**Fix:**
|
|
|
**Issue**
|
|
|
Non-local data access.
|
|
|
Bandwidth contention.
|
|
|
**Performance Behavior**
|
|
|
Bad/no scaling across locality domains.
|
|
|
**Performance counters**
|
|
|
Unbalanced bandwidth on memory interfaces.
|
|
|
High remote traffic (MEM).
|
|
|
**Fix**
|
|
|
Reorganize memory accesses.
|
|
|
Try a different page placement.
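
How page placement produces remote traffic can be sketched with a small model. It assumes a first-touch policy (a page is placed in the domain of the thread that first writes it) and a block distribution of pages over threads during the compute phase; the sizes and function names are hypothetical.

```python
from collections import Counter

N_PAGES, N_DOMAINS = 1000, 4  # hypothetical sizes


def compute_domain(page):
    """Domain of the thread that works on this page (block distribution)."""
    return page * N_DOMAINS // N_PAGES


def first_touch(init_domain):
    """Home domain of every page, given which domain initializes it."""
    return [init_domain(page) for page in range(N_PAGES)]


def remote_fraction(placement):
    """Share of pages whose home domain differs from the domain that uses them."""
    remote = sum(1 for page, home in enumerate(placement)
                 if compute_domain(page) != home)
    return remote / len(placement)


serial_init = first_touch(lambda page: 0)   # one thread initializes everything
parallel_init = first_touch(compute_domain) # same distribution as the compute phase

for name, placement in (("serial init", serial_init), ("parallel init", parallel_init)):
    print(f"{name}: remote fraction {remote_fraction(placement):.2f}, "
          f"pages per domain {dict(Counter(placement))}")
```

In the serial case all pages sit in one domain, which shows up both as high remote traffic and as unbalanced load on the memory interfaces.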
|
|
|
|
|
|
---
|
|
|
|
|
|
### GPU performance patterns
|
|
|
This section covers all performance patterns for the GPU side of applications.
|
... | ... | |