@@ -19,6 +19,44 @@ The system software part should provide enough detail to reproduce the results.
Lastly, describe the hardware that was used. This can be done node-wise, by describing one node of the partition that was used. Mention the GPU and CPU hardware and how many of each are available. Also describe the logical level, for example the NUMA topology with the number of available cores per NUMA node. In addition, list the capacity of the GPUs, such as \#compute_units, \#max_wavefronts_per_eu, \#EUs_per_CU, and \#total_max_wavefronts. All of these are required when discussing device occupancy and load.
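
The capacity figures are related to each other. As a minimal sketch, assuming the total is simply the product of the three per-level counts (this should be checked against the vendor documentation for the GPU at hand), the total can be derived as shown here. The class name `GpuCapacity` and the example numbers are illustrative placeholders, not values for any specific device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuCapacity:
    """Per-GPU capacity figures as listed in the hardware description."""
    compute_units: int          # #compute_units
    eus_per_cu: int             # #EUs_per_CU
    max_wavefronts_per_eu: int  # #max_wavefronts_per_eu

    @property
    def total_max_wavefronts(self) -> int:
        # Assumption: every EU can hold its maximum number of wavefronts at once.
        return self.compute_units * self.eus_per_cu * self.max_wavefronts_per_eu


# Placeholder numbers for illustration only.
example = GpuCapacity(compute_units=100, eus_per_cu=4, max_wavefronts_per_eu=8)
print(example.total_max_wavefronts)
```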
### Analysis of the GPU code

For the analysis of the code, several aspects need to be profiled. If the application accepts input size parameters, test different combinations to get a clear picture. Each subsection below corresponds to a subsection of the report.
#### GPU kernels and memory transfers

Here you can list the number of kernels and whether they involve memory transfers. For each of these transfers, it is useful to note the minimum and maximum transfer size. The clearest way to present the data is a table with rows (or a stacked bar chart for multiple kernels) for the following metrics, for each input size:
```
Kernel time (ns)
Call count HtoD
Call count DtoH
Transfer time HtoD (ns)
Transfer time DtoH (ns)
Transfer size HtoD (MB)
Transfer size DtoH (MB)
```
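
One way to assemble these metrics is to aggregate the per-event records exported by a profiler. The sketch below assumes each record has already been parsed into a small `Event` tuple; the field names, the `kind` labels, and the `summarize` helper are hypothetical and do not correspond to any particular profiler's output format. It also assumes 1 MB = 10^6 bytes.

```python
from typing import NamedTuple


class Event(NamedTuple):
    kind: str          # "kernel", "HtoD", or "DtoH" (assumed labels)
    duration_ns: int   # elapsed time of the event
    size_bytes: int    # bytes moved; 0 for kernels


def summarize(events):
    """Aggregate the events of one input size into the listed metrics."""
    summary = {
        "Kernel time (ns)": 0,
        "Call count HtoD": 0,
        "Call count DtoH": 0,
        "Transfer time HtoD (ns)": 0,
        "Transfer time DtoH (ns)": 0,
        "Transfer size HtoD (MB)": 0.0,
        "Transfer size DtoH (MB)": 0.0,
    }
    for ev in events:
        if ev.kind == "kernel":
            summary["Kernel time (ns)"] += ev.duration_ns
        elif ev.kind in ("HtoD", "DtoH"):
            summary[f"Call count {ev.kind}"] += 1
            summary[f"Transfer time {ev.kind} (ns)"] += ev.duration_ns
            # Assumption: 1 MB = 1e6 bytes.
            summary[f"Transfer size {ev.kind} (MB)"] += ev.size_bytes / 1e6
    return summary


# Made-up events for illustration.
events = [
    Event("HtoD", 1_000, 4_000_000),
    Event("kernel", 5_000, 0),
    Event("DtoH", 2_000, 2_000_000),
]
for metric, value in summarize(events).items():
    print(f"{metric}: {value}")
```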
In a second table, list the ratio of computation time to host-device memory transfer time for each input size. This is computed as:
```
Cumulative GPU kernel time / Cumulative transfer time
```
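
As a minimal sketch of the formula above (kernel time divided by transfer time), the ratio can be computed per input size from the cumulative times. A value above 1 then means more time is spent computing than transferring. The function name `compute_to_transfer_ratio` and the sample numbers are illustrative assumptions.

```python
def compute_to_transfer_ratio(kernel_time_ns, transfer_time_ns):
    """Return cumulative GPU kernel time divided by cumulative transfer time."""
    if transfer_time_ns == 0:
        # No transfers recorded: the ratio is undefined, so report infinity.
        return float("inf")
    return kernel_time_ns / transfer_time_ns


# Made-up cumulative times in ns, keyed by input size.
measurements = {1024: (5_000, 3_000), 2048: (20_000, 6_000)}
for size, (kernel_ns, transfer_ns) in measurements.items():
    ratio = compute_to_transfer_ratio(kernel_ns, transfer_ns)
    print(f"input size {size}: ratio = {ratio:.2f}")
```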
#### Kernel metrics

Next, add some kernel metrics to shed light on potential performance hazards. These are all values from hardware counters collected through a profiler. Interesting hardware counters are:
```
vgpr
sgpr
Wavefronts
grd
wgr
VALUUtilization
VALUBusy
MemUnitBusy
L2CacheHit
MemUnitStalled
LDSBankConflict
```
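
Profilers commonly export such counters as one CSV row per kernel launch. The sketch below averages selected counters per kernel. The column name `KernelName`, the sample data, and the assumption that the counter columns carry the names listed above are all hypothetical and need to be adapted to the actual profiler output.

```python
import csv
import io
from collections import defaultdict


def mean_counters(csv_text, counters, name_column="KernelName"):
    """Average the given counter columns per kernel name."""
    sums = defaultdict(lambda: defaultdict(float))
    launches = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        kernel = row[name_column]
        launches[kernel] += 1
        for counter in counters:
            sums[kernel][counter] += float(row[counter])
    return {
        kernel: {c: sums[kernel][c] / launches[kernel] for c in counters}
        for kernel in launches
    }


# Made-up sample data; the CSV layout is an assumption.
sample = """KernelName,VALUBusy,L2CacheHit
axpy,40.0,80.0
axpy,60.0,90.0
"""
print(mean_counters(sample, ["VALUBusy", "L2CacheHit"]))
```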
### Performance analysis discussion