... | ... | @@ -76,8 +76,17 @@ Speedup |
|
|
This section covers the seven performance hazards and uses the findings from the analysis section above to identify the performance bottlenecks. Each subsection discusses one hazard.
|
|
|
|
|
|
#### 1. Host-device memory transfers
|
|
|
Host-device memory transfers can degrade performance if they are:
|
|
|
- Many small transfers that could be fused into fewer, larger ones
|
|
|
- A single very large transfer that is too big to allow copy and compute to overlap
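Both cases can be illustrated with a simple cost model. The sketch below assumes that each transfer costs a fixed latency plus size divided by bandwidth, and that a large transfer split into equal chunks can overlap copy and compute perfectly except for the first copy and the last compute. The function names and the `LATENCY_S` and `BANDWIDTH` values are hypothetical placeholders, not measurements from LUMI-G.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time of one transfer: fixed latency plus size / bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def total_time(sizes, latency_s, bandwidth_bytes_per_s):
    """Estimated time of a sequence of transfers executed one after another."""
    return sum(transfer_time(s, latency_s, bandwidth_bytes_per_s) for s in sizes)


def pipelined_time(copy_s, compute_s, n_chunks):
    """Idealized time when copy and compute are split into n equal chunks.

    Only the first chunk's copy and the last chunk's compute are exposed;
    in between, the slower of the two stages sets the pace.
    Per-chunk latency is ignored (assumption).
    """
    exposed = (copy_s + compute_s) / n_chunks
    overlapped = (n_chunks - 1) / n_chunks * max(copy_s, compute_s)
    return exposed + overlapped


# Placeholder values, NOT measured on LUMI-G.
LATENCY_S = 10e-6   # assumed fixed cost per transfer, in seconds
BANDWIDTH = 20e9    # assumed sustained bandwidth, in bytes per second

small = [4096] * 1000   # 1000 transfers of 4 KiB each
fused = [sum(small)]    # the same data sent as a single transfer

t_small = total_time(small, LATENCY_S, BANDWIDTH)
t_fused = total_time(fused, LATENCY_S, BANDWIDTH)
print(f"1000 small transfers: {t_small * 1e3:.3f} ms")
print(f"1 fused transfer:     {t_fused * 1e3:.3f} ms")

# One monolithic transfer (n_chunks=1) versus the same work in 4 chunks.
print(f"no overlap: {pipelined_time(1.0, 1.0, 1):.2f} s")
print(f"4 chunks:   {pipelined_time(1.0, 1.0, 4):.2f} s")
```

Under this model the fused transfer pays the fixed latency once instead of a thousand times, while chunking a large transfer hides most of the copy time behind compute. Real hardware adds per-chunk overhead, so the best chunk size has to be found by benchmarking.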
|
|
|
|
|
|
For the LUMI-G hardware, a good data block size is .., and a good bandwidth is ..
|
|
|
! ~ TODO: Perform benchmarking tests and investigate optimal sizes and bandwidth.
|
|
|
|
|
|
In this subsection, list the number of transfer operations and their sizes. Also list the achieved bandwidth of the transfers, and compare these values with what is considered "good" for the hardware.
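One way to produce these numbers is to post-process a list of recorded transfers. The following is a minimal sketch that assumes each transfer is available as a `(num_bytes, duration_s)` pair, for example extracted from a profiler trace. The helper name `summarize_transfers` and the `reference_bandwidth` argument are hypothetical, and the reference value itself still has to come from the benchmarks.

```python
def summarize_transfers(records, reference_bandwidth):
    """Summarize host-device transfers.

    records: list of (num_bytes, duration_s) tuples, one per transfer.
    reference_bandwidth: bandwidth considered "good", in bytes per second.
    """
    sizes = [num_bytes for num_bytes, _ in records]
    total_bytes = sum(sizes)
    total_time_s = sum(duration for _, duration in records)
    # Achieved bandwidth over all transfers; 0.0 if nothing was recorded.
    achieved = total_bytes / total_time_s if total_time_s > 0 else 0.0
    return {
        "count": len(records),
        "total_bytes": total_bytes,
        "min_bytes": min(sizes, default=0),
        "max_bytes": max(sizes, default=0),
        "achieved_bandwidth": achieved,
        "fraction_of_reference": achieved / reference_bandwidth,
    }


# Made-up example records, not real measurements.
example = [(1000, 1e-3), (3000, 1e-3)]
summary = summarize_transfers(example, reference_bandwidth=4e6)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A low `fraction_of_reference` combined with a high `count` and a small `max_bytes` points to the "many small transfers" case.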
|
|
|
|
|
|
#### 2. Device load and occupancy
|
|
|
The number of wavefronts that can be resident on a compute unit at the same time is limited by either the shared memory (LDS) or the number of registers that a kernel uses. The register usage is reported as `sgpr` (scalar registers) and `vgpr` (vector registers), acquired through the `spi_vwc_csc_wr` and `spi_swc_csc_wr` hardware counters.
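To make the limit concrete, the sketch below estimates how many wavefronts fit on one SIMD unit from a kernel's vector register and shared memory usage. All hardware parameters (512 VGPRs per SIMD, allocation in blocks of 8, 64 KiB of LDS per compute unit, 4 SIMDs per compute unit, at most 8 wavefronts per SIMD) are assumed defaults that should be checked against the documentation of the target GPU. The function name `occupancy` is hypothetical, and scalar registers are left out for brevity.

```python
import math


def occupancy(vgprs_per_wave, lds_bytes_per_workgroup, waves_per_workgroup,
              vgprs_per_simd=512, vgpr_granularity=8,
              lds_bytes_per_cu=65536, simds_per_cu=4, max_waves_per_simd=8):
    """Return (wavefronts per SIMD, name of the limiting resource).

    All keyword defaults are assumptions about the hardware, not verified values.
    """
    # Registers are allocated in blocks, so round the request up.
    allocated = max(vgpr_granularity,
                    math.ceil(vgprs_per_wave / vgpr_granularity) * vgpr_granularity)
    limits = {
        "max_waves": max_waves_per_simd,
        "vgpr": vgprs_per_simd // allocated,
    }
    if lds_bytes_per_workgroup > 0:
        # Whole workgroups must fit in the shared memory of one compute unit.
        workgroups_per_cu = lds_bytes_per_cu // lds_bytes_per_workgroup
        limits["lds"] = (workgroups_per_cu * waves_per_workgroup) // simds_per_cu
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


# A kernel using 128 VGPRs, no shared memory, 4 wavefronts per workgroup.
print(occupancy(128, 0, 4))
# The same workgroup shape with 32 VGPRs and 32 KiB of shared memory.
print(occupancy(32, 32768, 4))
```

Comparing the returned limiter for each kernel shows whether reducing register pressure or shared memory usage is the more promising way to raise occupancy.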
|
|
|
|
|
|
#### 3. Global memory traffic (device DRAM)
|
|
|
|
... | ... | |