Okke van Eck · 48aea4e9
--- a/3.-Tools.md
+++ b/3.-Tools.md
@@ -67,53 +67,118 @@ Performance patterns (or anti-patterns) are specific behaviors/problems that for
 ### CPU performance patterns
 This section covers all performance patterns for the CPU side of applications.

+---
+
 #### Load Imbalance
-**Issue:** The workload is not equally distributed. Several units stall waiting for one unit to complete.  
-**Performance Behavior:** Saturating speed-up (sooner than expected)  
-**Performance counters:** Different count of instructions retired or floating point operations among cores (FLOPS_DP, FLOPS_SP)  
-**Fix:** Reorganize work to improve load balancing.
+**Issue**  
+The workload is not equally distributed.  
+Several units stall waiting for one unit to complete.  
+**Performance Behavior**  
+Saturating speedup (sooner than expected).  
+**Performance counters**  
+Differing counts of retired instructions or floating-point operations across cores (FLOPS_DP, FLOPS_SP).  
+**Fix**  
+Reorganize work to improve load balancing.  
+
+---

 #### Bandwidth saturation
-**Issue:** 
-**Performance Behavior:** 
-**Performance counters:** 
-**Fix:** 
+**Issue**  
+Bandwidth of a shared data path is exhausted.  
+**Performance Behavior**  
+Saturating speedup across cores sharing a memory interface.  
+**Performance counters**  
+Compare the measured memory bandwidth to the peak bandwidth obtained with a microbenchmark (MEM); this applies to L3 as well as main memory.  
+**Fix**  
+Reduce the number of loads/stores.  
+
+---

 #### Strided or erratic data access
-**Issue:** 
-**Performance Behavior:** 
-**Performance counters:** 
-**Fix:** 
+**Issue**  
+Low data transfer efficiency (between caches and to/from memory).  
+Inappropriate data structures or badly ordered loop nests.  
+**Performance Behavior**  
+Large discrepancy between simple bandwidth-based model and actual performance.  
+**Performance counters**  
+Low bandwidth utilization despite LD/ST domination.  
+Low cache hit ratios, frequent evicts/replacements (CACHE, DATA, MEM).  
+**Fix** Improve data locality and shorten access strides (e.g., reorder loop nests or change the data layout).  
+
+---

 #### Bad Instruction Mix
-**Issue:** 
-**Performance Behavior:** 
-**Performance counters:** 
-**Fix:** 
+**Issue**  
+Not enough parallelism, no vectorization, expensive operations.  
+Inefficient compiler code.  
+**Performance Behavior**  
+Performance insensitive to problem size fitting into different cache levels.  
+**Performance counters**  
+Large ratio of instructions retired to FP instructions if the useful work is FP.  
+Many cycles per instruction (CPI) if the problem is large-latency arithmetic.  
+Scalar instructions dominating in data-parallel loops.  
+(FLOPS_DP, FLOPS_SP).  
+**Fix**  
+Improve instruction mix (different operations, reordering, loop unrolling).  
+
+---

 #### Limited instruction throughput
-**Issue:** 
-**Performance Behavior:** 
-**Performance counters:** 
-**Fix:** 
+**Issue**  
+Fewer than expected instructions per cycle.  
+**Performance Behavior**  
+Large discrepancy between actual performance and simple predictions based on max Flop/s or LD/ST throughput.  
+**Performance counters**  
+Low CPI near theoretical limit (if instruction throughput is the problem).  
+Static code analysis predicting large pressure on single execution port.  
+High CPI due to bad pipelining.  
+(FLOPS_DP, FLOPS_SP, DATA).  
+**Fix:** ?  
+
+---

 #### Synchronization overhead
-**Issue:** 
-**Performance Behavior:** 
-**Performance counters:** 
-**Fix:** 
+**Issue**  
+Barriers at the end of parallel loops.  
+Locks protecting shared resources.  
+**Performance Behavior**  
+Speedup going down as more cores are added.  
+No speedup with small problem sizes.  
+Cores busy but low FP performance.  
+**Performance counters**  
+Large non-FP instruction count (growing with number of cores used).  
+Low CPI (FLOPS_DP, FLOPS_SP).  
+**Fix**  
+Remove unnecessary synchronization, especially implicit barriers.  
+
+---

 #### False cache line sharing
-**Issue:** 
-**Performance Behavior:** 
-**Performance counters:** 
-**Fix:** 
+**Issue**  
+Different threads accessing different data on the same cache line, at least one of them modifying it.  
+**Performance Behavior**  
+Very low speedup or slowdown even with small core counts.  
+**Performance counters**  
+Frequent (remote) evicts (CACHE).  
+**Fix**  
+Revisit the per-thread working set or replicate data.  
+
+---

 #### Bad page placement on ccNUMA
-**Issue:** 
-**Performance Behavior:** 
-**Performance counters:** 
-**Fix:** 
+**Issue**  
+Non-local data access.  
+Bandwidth contention.  
+**Performance Behavior**  
+Bad/no scaling across locality domains.  
+**Performance counters**  
+Unbalanced bandwidth on memory interfaces.  
+High remote traffic (MEM).  
+**Fix**  
+Reorganize memory accesses.  
+(Attempt different page placement).  
+
+---

 ### GPU performance patterns
 This section covers all performance patterns for the GPU side of applications.