For the Lumi-G hardware, a good data block size is .., and good bandwidth is ..

In this subsection, list the number of transfer operations and their sizes. Also list the bandwidth of the transfers, and compare everything with what is considered "good" for the hardware.
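A minimal way to make that comparison is to compute the achieved bandwidth of each transfer from its size and duration, as in the sketch below; the transfer list and the reference bandwidth are placeholder values, not measured Lumi-G numbers.

```python
# Sketch: compute achieved transfer bandwidth from profiler output and compare it
# against a reference value. All numbers below are placeholders.

transfers = [
    # (bytes moved, duration in seconds), e.g. taken from the profiler's
    # memory-copy statistics
    (256 * 1024 * 1024, 0.004),
    (64 * 1024 * 1024, 0.0012),
]

GOOD_BANDWIDTH_GB_S = 100.0  # placeholder; use the value quoted for the hardware

for nbytes, seconds in transfers:
    gb_s = nbytes / seconds / 1e9
    print(f"{nbytes / 2**20:8.1f} MiB in {seconds * 1e3:6.2f} ms -> "
          f"{gb_s:6.1f} GB/s ({gb_s / GOOD_BANDWIDTH_GB_S:5.1%} of reference)")
```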
#### 2. Device load and occupancy
Wavefronts are limited by either the shared memory or the number of registers used by one kernel. The register usage is reported as `vgpr` and `sgpr` values, acquired through the `spi_vwc_csc_wr` and `spi_swc_csc_wr` hardware counters. Another way of getting the register counts is the `CRAY_ACC_DEBUG=3` output.

In this section, first compare how many registers are used and what that translates to in terms of wavefronts. Compare this number to the theoretical maximum, and also state what the optimal number of registers per kernel would be.
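A rough sketch of that translation, assuming placeholder values for the per-SIMD register files and the wavefront cap (check the hardware documentation for the real limits), could look like this:

```python
# Sketch: translate per-kernel register usage (vgpr/sgpr) into a wavefront
# occupancy limit. The per-SIMD limits below are placeholders, not verified
# Lumi-G values, and real hardware allocates registers in blocks, so this is
# only an estimate.

VGPRS_PER_SIMD = 512      # assumed size of the vector register file per SIMD
SGPRS_PER_SIMD = 800      # assumed size of the scalar register file per SIMD
MAX_WAVES_PER_SIMD = 8    # assumed hardware cap on resident wavefronts per SIMD

def waves_limited_by_registers(vgpr: int, sgpr: int) -> int:
    """Upper bound on resident wavefronts per SIMD for a kernel."""
    by_vgpr = VGPRS_PER_SIMD // max(vgpr, 1)
    by_sgpr = SGPRS_PER_SIMD // max(sgpr, 1)
    return min(by_vgpr, by_sgpr, MAX_WAVES_PER_SIMD)

# Example values as they might be reported by the profiler or CRAY_ACC_DEBUG=3
for name, vgpr, sgpr in [("kernel_a", 64, 32), ("kernel_b", 140, 48)]:
    waves = waves_limited_by_registers(vgpr, sgpr)
    print(f"{name}: {vgpr} VGPRs, {sgpr} SGPRs -> at most {waves} wavefront(s) "
          f"per SIMD (hardware cap {MAX_WAVES_PER_SIMD})")
```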
Then you can compare the total number of wavefronts with the theoretical maximum, using the `Wavefront` hardware counter for each input size.
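One simple way to phrase that comparison is to relate the `Wavefront` counter to the number of wavefronts the device can keep resident at once; in the sketch below both the device parameters and the measured values are placeholders.

```python
# Sketch: compare the wavefronts a kernel launches (the `Wavefront` counter)
# with how many the device can keep resident. Parameters are placeholders.

COMPUTE_UNITS = 110        # assumed compute units per GPU die
SIMDS_PER_CU = 4           # assumed SIMDs per compute unit
MAX_WAVES_PER_SIMD = 8     # assumed cap on resident wavefronts per SIMD

device_capacity = COMPUTE_UNITS * SIMDS_PER_CU * MAX_WAVES_PER_SIMD

# `Wavefront` counter values for a few input sizes (hypothetical numbers)
measured = {"small": 256, "medium": 4096, "large": 65536}

for size, waves in measured.items():
    print(f"{size:>6}: {waves:6d} wavefronts launched, "
          f"{waves / device_capacity:6.2f}x the resident capacity ({device_capacity})")
```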
Lastly, you can compare the `VALUBusy` metric for different workload sizes to see if the arithmetic load scales with the input size.
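As a rough illustration, the per-size `VALUBusy` values (hypothetical numbers below) can be compared directly:

```python
# Sketch: check whether VALUBusy grows with the workload. The values are
# hypothetical stand-ins for numbers read from the profiler output (percent).

valu_busy = {1_000: 4.2, 10_000: 18.5, 100_000: 61.0, 1_000_000: 88.3}

sizes = sorted(valu_busy)
for small, large in zip(sizes, sizes[1:]):
    trend = "increases" if valu_busy[large] > valu_busy[small] else "does not increase"
    print(f"{small} -> {large}: VALUBusy {valu_busy[small]}% -> {valu_busy[large]}% ({trend})")
```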
#### 3. Global memory traffic (device DRAM)
The memory traffic can be a limiter, which is why a comparison of `MemUnitBusy` vs `VALUBusy` is required: the ratio of the two indicates whether memory operations make up a significant part of the kernel's runtime.
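A minimal sketch of that comparison, with hypothetical counter values standing in for the profiler output:

```python
# Sketch: compare memory-unit activity with vector-ALU activity for one kernel.
# The two values are hypothetical percentages taken from the profiler output.

mem_unit_busy = 78.0   # MemUnitBusy: percent of kernel time the memory unit is busy
valu_busy = 23.0       # VALUBusy: percent of kernel time the vector ALU is busy

ratio = mem_unit_busy / valu_busy
print(f"MemUnitBusy / VALUBusy = {ratio:.1f}")
if ratio > 1.0:
    print("Memory operations occupy more of the kernel than arithmetic -> likely memory bound.")
else:
    print("Arithmetic dominates -> memory traffic is probably not the main limiter.")
```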
Another metric for memory efficiency is the cache hit rate, reflected by the `L2CacheHit` metric; the closer to 100% the better. A low cache hit rate can be a sign of bad temporal locality. However, if it remains consistently low for every input size, temporal locality is unlikely to be the problem, since in that case we would have expected a steady decrease as the working set grows.

You can also check for memory bus saturation through the `MemUnitStalled` hardware counter. If it is low, this might be due to low occupancy and device load. If it goes up with bigger workloads, we can conclude that the device load does indeed play a part.
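Both checks boil down to looking at the trend of the two counters across input sizes; a small sketch with hypothetical values:

```python
# Sketch: look at how L2CacheHit and MemUnitStalled evolve with the input size.
# All values are hypothetical placeholders for numbers read from the profiler.

metrics = {
    #  size:    (L2CacheHit %, MemUnitStalled %)
    1_000:      (38.0, 2.0),
    100_000:    (37.5, 9.0),
    10_000_000: (36.9, 21.0),
}

sizes = sorted(metrics)
hit_rates = [metrics[s][0] for s in sizes]
stalls = [metrics[s][1] for s in sizes]

if max(hit_rates) - min(hit_rates) < 5.0:
    print("L2CacheHit is roughly constant across sizes -> no clear sign of "
          "worsening temporal locality.")
if stalls == sorted(stalls):
    print("MemUnitStalled grows with the workload -> the memory bus becomes "
          "more saturated as the device load increases.")
```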
#### 4. Accelerator kernel granularity
Kernel granularity refers to performance degradation caused by the API overhead of launching many small kernels. This should already be visible from the "GPU Kernels and Memory transfers" section above.
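To get a feeling for the magnitude of the effect, the launch overhead can be estimated from the number of launches and an assumed per-launch cost; all numbers in the sketch below are placeholders.

```python
# Sketch: estimate what fraction of GPU-related time is kernel-launch overhead.
# The per-launch cost and the measured times are placeholders; take the real
# numbers from the kernel/API statistics of the profiler.

num_launches = 50_000
total_kernel_time_s = 0.80          # sum of all kernel durations
launch_overhead_s = 5e-6            # assumed cost of one kernel launch (placeholder)

overhead_s = num_launches * launch_overhead_s
share = overhead_s / (overhead_s + total_kernel_time_s)
print(f"Estimated launch overhead: {overhead_s * 1e3:.1f} ms "
      f"({share:.1%} of total GPU-related time)")
```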
#### 5. Divergent branching overhead
Excessive branching reduces the arithmetic throughput and thus lowers the `VALUUtilization` metric. Reasonable numbers are 95% and up; the remaining percentage corresponds to vector ALU work wasted on masked-off lanes.
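The connection between divergence and `VALUUtilization` can be illustrated with a small model of one wavefront in which both sides of a branch are executed with part of the lanes masked off (all instruction counts below are made up):

```python
# Sketch: why divergent branches lower VALUUtilization. Within one 64-lane
# wavefront both sides of a divergent branch are executed one after the other,
# with the inactive lanes masked off. The instruction counts are made up.

WAVEFRONT_SIZE = 64

lanes_taking_if = 48          # lanes for which the condition is true
if_instructions = 20          # instructions on the "if" path
else_instructions = 20        # instructions on the "else" path

useful = (lanes_taking_if * if_instructions
          + (WAVEFRONT_SIZE - lanes_taking_if) * else_instructions)
issued = WAVEFRONT_SIZE * (if_instructions + else_instructions)

print(f"Expected VALUUtilization for this wavefront: {useful / issued:.0%}")
# With equal-length paths the result is 50% regardless of how the lanes split,
# which is far below the ~95% one would like to see.
```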
#### 6. Shared memory bank conflicts
Shared memory bank conflicts are tracked through the `LDSBankConflict` hardware counter. Generally, this value should be 0. You can check the runtime debug information to verify whether the application actually uses shared memory (LDS) at all.
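As an illustration of how strided LDS accesses map onto banks, the sketch below models the conflict factor of one wavefront; the bank count and word size are assumptions rather than confirmed hardware values.

```python
# Sketch: model of LDS bank conflicts for a strided access pattern. The bank
# count and word size are assumptions (32 banks of 4-byte words); check the
# hardware documentation for the actual values.

from collections import Counter

NUM_BANKS = 32
WAVEFRONT_SIZE = 64

def conflict_factor(stride_in_words: int) -> int:
    """Largest number of lanes that hit the same bank for lds[tid * stride]."""
    banks = Counter((lane * stride_in_words) % NUM_BANKS
                    for lane in range(WAVEFRONT_SIZE))
    return max(banks.values())

for stride in (1, 2, 16, 32, 33):
    print(f"stride {stride:2d} words -> up to {conflict_factor(stride):2d} "
          f"lanes per bank (low is ideal, higher means serialized accesses)")
```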
#### 7. Impact of atomic operations
Atomic operations are difficult to diagnose. You can scan the source code for these instructions, as in the sketch below, but there is no known automated way of detecting them through the profiler.
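A quick textual scan is easy to script; the sketch below greps a placeholder `src` directory for a guessed list of atomic-related tokens.

```python
# Sketch: a quick textual scan for atomic operations in a source tree.
# The token list is a guess covering HIP/CUDA intrinsics and OpenMP atomics;
# extend it for the programming model actually used.

import pathlib
import re

TOKENS = re.compile(r"atomicAdd|atomicCAS|atomic_|#pragma\s+omp\s+atomic|!\$omp\s+atomic",
                    re.IGNORECASE)
SUFFIXES = {".c", ".cc", ".cpp", ".h", ".hpp", ".hip", ".cu", ".f90"}

for path in pathlib.Path("src").rglob("*"):      # "src" is a placeholder source root
    if not (path.is_file() and path.suffix.lower() in SUFFIXES):
        continue
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        if TOKENS.search(line):
            print(f"{path}:{lineno}: {line.strip()}")
```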
# Subpages