# Overview profile & analysis goals

**Whole application vs functions**

For DEODE we will profile whole applications, while for DE340 we will mostly profile individual functions through the mocking framework.
# GPU Profiling Methodology

**Code setups**

For the code, we currently have four different setups:

- Sequential CPU code
- Parallel CPU code
- GPU + sequential CPU code
- GPU + parallel CPU code

**Performance metrics**

There are many performance metrics to measure. Below is a list of hardware counters and derived metrics that can be collected.

On CPU:

- Execution time
- Instructions retired per core / CPU clock unhalted per core (for load balancing)
- CPI (cycles per instruction)
- AI (arithmetic intensity)
- Branch rate & branch misprediction rate/ratio
- Memory: number of loads vs. stores
- Memory bandwidth saturation
- Memory NUMA accesses
- False sharing
- Cache misses (L1, L2, L3)
On GPU:

- GPUBusy (percentage of time the GPU is busy)
- Wavefronts (total wavefronts)
- VALUInsts (average number of vector ALU instructions executed per work-item)
- VFetchInsts (average number of vector fetch instructions from video memory per work-item)
- VWriteInsts (average number of vector write instructions to video memory per work-item)
- VALUUtilization (percentage of active vector ALU threads in a wave)
- VALUBusy (percentage of GPUTime vector ALU instructions are processed)
- WriteSize (total KB written to video memory)
- L2CacheHit (percentage of hits in the L2 data cache)
- MemUnitBusy (percentage of GPUTime the memory unit is active)
- MemUnitStalled (percentage of GPUTime the memory unit is stalled)
- WriteUnitStalled (percentage of GPUTime the write unit is stalled)
- LDSBankConflict (percentage of GPUTime LDS is stalled by bank conflicts)
## Application optimization workflow

The workflow for optimizing code can be summarized as:

1. Understand requirements
2. Understand current performance
3. Can it be done? (modeling)
4. How can it be done? (some options)
5. Tuning
   - Not there yet? Back to 2!
6. Analyze the result
This workflow can be adapted to analyze our projects via:

1. Optimize for latency or throughput?
2. Create a cache-aware roofline model to understand current performance and the theoretical limit (see the formula sketch after this list)
3. Take action based on findings
   1. If memory bound, check: load imbalance, cache misses, memory loads & stores, NUMA accesses
   2. If compute bound, check: load imbalance, branch rate & misprediction, performance options (intrinsics, vectorization, pipelining, superscalar execution, out-of-order execution, branch prediction, speculative execution), or GPU porting
4. Analyze changes with a new roofline model, and (potentially) repeat!
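
As a reference for step 2, the roofline model bounds attainable performance by the peak compute rate and by memory bandwidth times arithmetic intensity. A minimal sketch of the standard formulation (the symbols below are the usual roofline quantities, not values taken from this document):

```latex
% Attainable performance P (Flop/s) of a kernel with arithmetic
% intensity I (Flop/byte), given peak compute P_max and bandwidth b_s
% of the memory level under consideration (DRAM, L3, ...):
P(I) = \min\bigl(P_{\max},\; I \cdot b_s\bigr)
% The kernel is memory bound if I < P_max / b_s (left of the ridge
% point), and compute bound otherwise.
```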
## Performance Patterns & Signatures

Performance patterns (or anti-patterns) are specific behaviors/problems that form bottlenecks for performance. The subsections below discuss performance patterns to look out for. CPU and GPU patterns are discussed separately.

### CPU performance patterns

This section covers performance patterns for the CPU side of applications.

---
#### Load Imbalance

**Issue**

The workload is not equally distributed: several units stall while waiting for one unit to complete.

**Performance Behavior**

Saturating speedup (sooner than expected).

**Performance counters**

Different counts of instructions retired or floating-point operations among cores (FLOPS_DP, FLOPS_SP).

**Fix**

Reorganize work to improve load balancing, e.g. as sketched below.
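
A minimal sketch of one common rebalancing fix, assuming an OpenMP loop whose per-iteration cost varies strongly (the function `process_column` and the sizes are placeholders): switching from static to dynamic scheduling lets idle threads pick up remaining iterations.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical per-item work with highly variable cost (placeholder).
static double process_column(int i) {
    double s = 0.0;
    for (int k = 0; k < (i % 64) * 1000; ++k) s += std::sin(k);
    return s;
}

int main() {
    std::vector<double> out(4096);
    // schedule(dynamic, chunk): threads grab chunks of iterations at run
    // time, so expensive iterations no longer pin the whole loop to one
    // thread, evening out instructions retired per core.
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < (int)out.size(); ++i) {
        out[i] = process_column(i);
    }
    std::printf("%f\n", out[100]);
    return 0;
}
```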
---

#### Bandwidth saturation

**Issue**

The bandwidth of a shared data path is exhausted.

**Performance Behavior**

Saturating speedup across cores sharing a memory interface.

**Performance counters**

Compare the measured memory bandwidth to the peak bandwidth; measure the peak with a microbenchmark (MEM). This can be applied to L3 or main memory.

**Fix**

Reduce the number of loads/stores, e.g. by fusing loops as sketched below.
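
A minimal sketch of reducing memory traffic by loop fusion (the array and scalar names are placeholders): two sweeps over the same data become one, halving the number of times `a` is streamed through the memory interface.

```cpp
#include <vector>

// Before: two separate sweeps; 'a' is loaded and stored from memory twice.
void two_passes(std::vector<double>& a, double s, double t) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= s;
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += t;
}

// After: one fused sweep; 'a' is loaded and stored only once.
void fused(std::vector<double>& a, double s, double t) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = a[i] * s + t;
}
```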
---

#### Strided or erratic data access

**Issue**

Low data transfer efficiency (between caches and to/from memory), caused by inappropriate data structures or badly ordered loop nests.

**Performance Behavior**

Large discrepancy between a simple bandwidth-based model and actual performance.

**Performance counters**

Low bandwidth utilization despite LD/ST domination.

Low cache hit ratios, frequent evicts/replacements (CACHE, DATA, MEM).

**Fix**

Improve locality and strides, e.g. by reordering loop nests as sketched below.
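
A minimal sketch of a loop-nest reordering fix, assuming a row-major C/C++ array (the matrix dimension is a placeholder): interchanging the loops turns a stride-N access into a stride-1 stream.

```cpp
constexpr int N = 2048;

// Before: the inner loop walks down a column, so consecutive iterations
// touch elements N doubles apart (stride N) and miss the cache constantly.
void colsum_strided(const double (&a)[N][N], double (&colsum)[N]) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            colsum[j] += a[i][j];
}

// After: loops interchanged; the inner loop now streams each row with
// stride 1, so every cache line is fully used before eviction.
void colsum_streamed(const double (&a)[N][N], double (&colsum)[N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            colsum[j] += a[i][j];
}
```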
---

#### Bad Instruction Mix

**Issue**

Not enough parallelism, no vectorization, expensive operations, or inefficient compiler-generated code.

**Performance Behavior**

Performance insensitive to whether the problem size fits into different cache levels.

**Performance counters**

Large ratio of instructions retired to FP instructions if the useful work is FP.

Many cycles per instruction (CPI) if the problem is large-latency arithmetic.

Scalar instructions dominating in data-parallel loops.

(FLOPS_DP, FLOPS_SP.)

**Fix**

Improve the instruction mix (different operations, reordering, loop unrolling), e.g. as sketched below.
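
A minimal sketch of two such rewrites (the loop body is a placeholder): hoisting an expensive divide out of the loop and unrolling the accumulation into multiple partial sums to break the serial dependency chain.

```cpp
#include <cstddef>

// Before: a divide per iteration (expensive) and a single serial
// dependency chain through 'sum'.
double dot_scaled(const double* x, const double* y, std::size_t n, double s) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i] / s;
    return sum;
}

// After: the divide becomes one reciprocal multiply, and four partial
// sums give independent chains that can overlap in the FP pipeline.
double dot_scaled_fast(const double* x, const double* y, std::size_t n, double s) {
    const double inv = 1.0 / s;            // hoisted expensive operation
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {           // unrolled: 4 independent chains
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) s0 += x[i] * y[i];  // remainder loop
    return (s0 + s1 + s2 + s3) * inv;
}
```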
---

#### Limited instruction throughput

**Issue**

Fewer instructions per cycle than expected.

**Performance Behavior**

Large discrepancy between actual performance and simple predictions based on max Flop/s or LD/ST throughput.

**Performance counters**

Low CPI near the theoretical limit (if instruction throughput is the problem).

Static code analysis predicting large pressure on a single execution port.

High CPI due to bad pipelining.

(FLOPS_DP, FLOPS_SP, DATA.)

**Fix**

Break long dependency chains and spread pressure across execution ports, e.g. with the multiple-accumulator unrolling shown under "Bad Instruction Mix".
---

#### Synchronization overhead

**Issue**

Barriers at the end of parallel loops.

Locks protecting shared resources.

**Performance Behavior**

Speedup going down as more cores are added.

No speedup with small problem sizes.

Cores busy but low FP performance.

**Performance counters**

Large non-FP instruction count (growing with the number of cores used).

Low CPI.

(FLOPS_DP, FLOPS_SP.)

**Fix**

Remove unnecessary synchronization (especially the implicit ones!), e.g. as sketched below.
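
A minimal sketch of removing an implicit barrier, assuming two independent OpenMP loops (the array names are placeholders): `nowait` drops the barrier that each `for` construct otherwise implies at its end.

```cpp
#include <cstddef>

void update(double* a, double* b, std::size_t n) {
    #pragma omp parallel
    {
        // The two loops touch disjoint arrays, so a thread finishing the
        // first loop may start the second immediately: 'nowait' removes
        // the implicit barrier at the end of the first 'for' construct.
        #pragma omp for nowait
        for (std::size_t i = 0; i < n; ++i) a[i] *= 2.0;

        #pragma omp for
        for (std::size_t i = 0; i < n; ++i) b[i] += 1.0;
    } // one barrier remains here, at the end of the parallel region
}
```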
---

#### False cache line sharing

**Issue**

Different threads access the same cache line, and at least one of them modifies it.

**Performance Behavior**

Very low speedup or even slowdown, already at small core counts.

**Performance counters**

Frequent (remote) evicts (CACHE).

**Fix**

Revisit the working set per thread; replicate data per thread, e.g. as sketched below.
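
A minimal sketch of the data-replication fix (the counting task is a placeholder): per-thread counters packed into one plain array share cache lines, so each thread instead gets its own cache-line-sized slot, merged once at the end.

```cpp
#include <omp.h>
#include <vector>

// Without padding, counts for neighboring threads sit in the same cache
// line, so every increment invalidates the line in the other cores
// (false sharing). alignas(64) keeps one counter per 64-byte line.
struct alignas(64) PaddedCount {
    long value = 0;
};

long count_even(const std::vector<int>& data) {
    std::vector<PaddedCount> counts(omp_get_max_threads());
    #pragma omp parallel
    {
        PaddedCount& mine = counts[omp_get_thread_num()];
        #pragma omp for
        for (long i = 0; i < (long)data.size(); ++i)
            if (data[i] % 2 == 0) ++mine.value;   // no line ping-pong
    }
    long total = 0;
    for (const auto& c : counts) total += c.value; // serial merge at the end
    return total;
}
```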
---

#### Bad page placement on ccNUMA

**Issue**

Non-local data access.

Bandwidth contention.

**Performance Behavior**

Bad/no scaling across locality domains.

**Performance counters**

Unbalanced bandwidth on memory interfaces.

High remote traffic (MEM).

**Fix**

Reorganize memory accesses (and attempt a different page placement), e.g. via first-touch initialization as sketched below.
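
A minimal sketch of first-touch placement, assuming Linux's default first-touch NUMA policy (the array size and work are placeholders): initializing the array in parallel, with the same schedule as the compute loop, makes each page land in the locality domain of the thread that will use it.

```cpp
#include <cstddef>
#include <memory>

void numa_friendly(std::size_t n) {
    // Default-initialized: the allocation does not touch the pages yet.
    std::unique_ptr<double[]> a(new double[n]);

    // First touch: each page is physically placed in the NUMA domain of
    // the thread that writes it first, so initialize in parallel with
    // the same static schedule as the compute loop below.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] = 0.0;

    // Compute loop: each thread now accesses mostly local pages.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] = a[i] * 2.0 + 1.0;
}
```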
---

### GPU performance patterns

This section covers performance patterns for the GPU side of applications.

---
#### Host-device memory operations

**Issue**

Slowdown due to memory transfers.

**Performance Behavior**

Many small copies from host to device.

Low bandwidth of memory transfers.

**Performance counters**

MemUnitBusy.

**Fix**

Fuse memory transfer operations as much as possible.

Remove redundant transfers.

Use pinned instead of non-pinned memory (on some systems non-pinned transfers are several times slower than pinned ones); see the sketch below.
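
A minimal CUDA sketch of the pinned-memory and transfer-fusion fixes (the buffer size is a placeholder; on AMD hardware the same pattern applies with the HIP equivalents such as `hipHostMalloc`):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

void upload(std::size_t n) {
    double *h_buf, *d_buf;

    // Pinned (page-locked) host memory enables full-bandwidth async DMA,
    // unlike pageable memory allocated with malloc/new.
    cudaMallocHost(&h_buf, n * sizeof(double));
    cudaMalloc(&d_buf, n * sizeof(double));

    for (std::size_t i = 0; i < n; ++i) h_buf[i] = 1.0;

    // One fused copy instead of many small cudaMemcpy calls: each call
    // carries a fixed driver/launch overhead, so batch the data into a
    // single transfer where possible.
    cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}
```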
---

#### Device load and occupation

**Issue**

The accelerator is underloaded, resulting in poor memory bandwidth and arithmetic throughput.

**Performance Behavior**

The number of wavefronts is lower than the theoretical maximum.

**Performance counters**

Compare the number of wavefronts to the theoretical peak (Wavefronts).

The arithmetic unit load naturally increases as device load increases (VALUBusy).

If shared memory is not used, check whether the number of registers is a limiter (spi_vwc_csc_wr & spi_swc_csc_wr).

**Fix**

Increase the workload for the device, e.g. after checking occupancy as sketched below.
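
A minimal CUDA sketch of checking theoretical occupancy at run time (the kernel is a placeholder; HIP offers `hipOccupancyMaxActiveBlocksPerMultiprocessor` analogously):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int blockSize = 256;
    int numBlocks = 0;
    // How many resident blocks of 'saxpy' fit per multiprocessor, given
    // its register usage and 0 bytes of dynamic shared memory?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, saxpy,
                                                  blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = double(numBlocks * blockSize) /
                       prop.maxThreadsPerMultiProcessor;
    std::printf("theoretical occupancy: %.0f%%\n", occupancy * 100.0);
    return 0;
}
```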
---

#### Global memory traffic

**Issue**

The memory bandwidth is not fully utilized.

**Performance Behavior**

The average bandwidth of memory operations drops several-fold.

**Performance counters**

AI signifies a memory-bound application (VALUBusy).

Saturation of the memory bus (MemUnitStalled).

**Fix**

Coalesce memory access patterns, e.g. as sketched below.

Increase the workload or decrease the size of the offloaded code.
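
A minimal CUDA sketch of the coalescing fix (the struct layout is a placeholder): with an array-of-structs, neighboring work-items read addresses a full struct apart; a struct-of-arrays layout makes them read consecutive words, which the hardware coalesces into few memory transactions.

```cuda
// Array-of-structs: thread i reads p[i].x, so consecutive threads touch
// addresses sizeof(Particle) = 32 bytes apart -> poorly coalesced.
struct Particle { float x, y, z, vx, vy, vz, m, pad; };

__global__ void scale_aos(Particle* p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;
}

// Struct-of-arrays: thread i reads x[i], so a warp/wavefront touches one
// contiguous span -> fully coalesced loads and stores.
__global__ void scale_soa(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
```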
---

#### Accelerator kernels granularity

**Issue**

The overhead of launching kernels is bigger than the benefit.

**Performance Behavior**

Low execution times of individual kernels, but many of them are launched.

**Performance counters**

No dedicated counter; compare per-kernel execution times against the launch overhead in a trace.

**Fix**

Fuse kernels together (see the sketch below). Sometimes this can be done automatically, sometimes it requires manual work.
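
A minimal CUDA sketch of manual kernel fusion (the kernels are placeholders): two tiny elementwise kernels, each launched separately per step, become one kernel and thus one launch overhead, and `a[i]` is also read and written only once.

```cuda
// Before: two launches per step, each kernel doing very little work.
__global__ void scale(float* a, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}
__global__ void shift(float* a, int n, float t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += t;
}

// After: one fused kernel, one launch, one pass over the data.
__global__ void scale_shift(float* a, int n, float s, float t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * s + t;
}
```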
---

#### Divergent branching overhead

**Issue**

Excessive branching reduces vector ALU utilization: work-items of a wavefront that take different paths are masked out and the paths execute serially.

**Performance Behavior**

Low utilization of the ALUs.

**Performance counters**

VALUUtilization.

**Fix**

Remove branching from the code, e.g. as sketched below.
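
A minimal CUDA sketch of removing a divergent branch (the threshold logic is a placeholder): replacing an if/else with an arithmetic select keeps all work-items of the wavefront on one path.

```cuda
// Before: where the condition differs within one wavefront, both paths
// execute with parts of the wave masked off (divergence).
__global__ void clamp_branchy(float* a, int n, float lim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (a[i] > lim) a[i] = lim;
        else            a[i] = a[i] * 0.5f;
    }
}

// After: a branchless select; the whole wavefront executes one path.
__global__ void clamp_branchless(float* a, int n, float lim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = a[i];
        a[i] = (v > lim) ? lim : v * 0.5f; // typically a select, not a jump
    }
}
```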
---

#### Shared memory bank conflicts

**Issue**

Degrading performance due to increased memory access times.

**Performance Behavior**

Increased ALU stalls while waiting on LDS accesses.

**Performance counters**

LDSBankConflict indicates that a conflict appears.

ALUStalledByLDS and LDSInsts can be used to get a better picture.

**Fix**

Pad shared-memory arrays or change access strides so that the work-items of a wavefront hit different banks, e.g. as sketched below.
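
A minimal CUDA sketch of the padding fix for a shared-memory transpose tile (the tile size is a placeholder): with a 32-wide tile, column accesses hit the same bank for every work-item; one extra column shifts each row into a different bank.

```cuda
#define TILE 32

__global__ void transpose(const float* in, float* out, int n) {
    // +1 padding column: without it, tile[threadIdx.x][j] maps every
    // thread of a warp/wavefront to the same bank (stride of 32 words),
    // serializing the LDS/shared access; the extra column breaks that
    // stride so threads hit distinct banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n) out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}
```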
---

#### Impact of atomic operations

**Issue**

Atomic operations can cause contention on variables, reducing the overall throughput.

**Performance Behavior**

Reduced performance, which can differ between memory regions.

**Performance counters**

Metrics on memory transactions, cache utilization, and potential contention, if there are no alternative reasons for the worsening of their values.

These can be: MemUnitBusy, MemUnitStalled, L2CacheHit, VALUBusy.

**Fix**

Remove atomic operations from the code, e.g. as sketched below.
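
A minimal CUDA sketch of reducing atomic contention (the sum is a placeholder): instead of every work-item issuing `atomicAdd` on one global counter, each block first reduces in shared memory, leaving a single atomic per block.

```cuda
#define BLOCK 256

// Before: n atomics on a single global address -> heavy contention.
__global__ void sum_naive(const float* x, int n, float* total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, x[i]);
}

// After: block-local tree reduction in shared memory, then one
// atomicAdd per block -> contention drops by a factor of BLOCK.
__global__ void sum_reduced(const float* x, int n, float* total) {
    __shared__ float buf[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(total, buf[0]);
}
```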
---

# Subpages

- [Intel Advisor & Intel VTune](3.a.-Intel-Offload-Advisor-&-Intel-VTune)
- [CPU and GPU performance overview](3.b.-CPU-and-GPU-performance-overview)