Okke van Eck · c5ae7164
--- a/3.-Tools.md
+++ b/3.-Tools.md
@@ -188,62 +188,74 @@ This section covers all performance patterns for the GPU side of applications.

 ---

-#### Thread load-imbalance
+#### Host-device memory operations
 **Issue**  
-
+Slowdown due to memory transfers.  
 **Performance Behavior**  
-
+Many small copies from host to device.  
+Low bandwidth of memory transfers.  
 **Performance counters**  
-
+MemUnitBusy.  
 **Fix**  
+Fuse memory transfer operations as much as possible.  
+Remove redundant transfers.  
+Use pinned memory instead of non-pinned memory (on some systems non-pinned memory transfers are several times slower than pinned).  

 ---

-#### Thread divergence
+#### Device load and occupation
 **Issue**  
-
+The accelerator is underloaded, resulting in poor memory bandwidth and arithmetic throughput.  
 **Performance Behavior**  
-
+The number of wavefronts is lower than the theoretical maximum.  
 **Performance counters**  
-
+Compare the number of wavefronts to the theoretical peak (Wavefronts).  
+Check that the arithmetic unit load increases as the device load increases (VALUBusy).  
+If shared memory is not used, check whether the number of registers is a limiter (spi_vwc_csc_wr & spi_swc_csc_wr).  
 **Fix**  
+Increase workload for the device.  

 ---

-#### High branching
+#### Global memory traffic
 **Issue**  
-
+The memory bandwidth is not fully utilized.  
 **Performance Behavior**  
-
+The average bandwidth of memory transfer operations drops by a factor of several.  
 **Performance counters**  
-
+Arithmetic intensity (AI) indicates a memory-bound application (VALUBusy).  
+Saturation of memory bus (MemUnitStalled).  
 **Fix**  
+Coalesce memory access patterns.  
+Increase the workload or decrease the size of the offloaded code.  

 ---

-#### Non-cached memory access
+#### Accelerator kernels granularity
 **Issue**  
-
+The overhead of launching kernels is bigger than the benefit.  
 **Performance Behavior**  
-
+Individual kernels have short execution times, but many are launched.  
 **Performance counters**  
-
+?  
 **Fix**  
+Fuse kernels together. This can sometimes be done automatically, but sometimes requires manual work.  

 ---

-#### Bandwidth saturation/limitation
+#### Divergent branching overhead
 **Issue**  
-
+Excessive branching reduces ALU utilization.  
 **Performance Behavior**  
-
+There is low utilization of the ALUs.  
 **Performance counters**  
-
+VALUUtilization.  
 **Fix**  
+Remove divergent branches from the code where possible.  

 ---

-#### SM load imbalance
+#### Shared memory bank conflicts
 **Issue**  

 **Performance Behavior**  
@@ -254,7 +266,7 @@ This section covers all performance patterns for the GPU side of applications.

 ---

-#### Insufficient workload
+#### Impact of atomic operations
 **Issue**  

 **Performance Behavior**  
@@ -265,15 +277,6 @@ This section covers all performance patterns for the GPU side of applications.

 ---

-#### Synchronization issues
-**Issue**  
-
-**Performance Behavior**  
-
-**Performance counters**  
-
-**Fix**  
-

 # Subpages
 - [Intel Advisor & Intel VTune](3.a.-Intel-Offload-Advisor-&-Intel-VTune)
\ No newline at end of file