These directives can be used for invoking GPU parallelization.

| Directive | Description |
|:---------:|:------------|
| target | Declares a portion of code to be executed on the GPU. |
| teams | Runs the code on the GPU using multiple "teams".<br><ul><li>Invokes the fork/join model</li><li>No collective synchronization (i.e. no barriers)!</li><li>Must be within "omp target"</li></ul> |
| distribute | Iterations of the loop below are partitioned across the "teams". Without it, only thread 0 of each team would be used.<br><ul><li>Loop partitioning (like omp do/for)</li><li>Iterations are partitioned across "teams"</li><li>NO implied barrier at the end of the loop!</li><li>Best practice is to combine it with teams (target teams distribute)</li></ul> |
| simd | Optimizes inner loops with SIMD instructions where possible. The directive is ignored when this is not possible, so it can always be used. Maps the work below to GPU threads within the "teams" blocks.<br><ul><li>Used for two-level GPU parallelism</li><li>"teams" maps to GPU threadblocks</li><li>"simd" maps to GPU threads within the teams</li><li>According to Oak Ridge, always use it in combination with parallel do.</li></ul><br>*NOTE: parallel and simd are inconsistent across implementations!*<br><ul><li>CCE-Classic maps "simd" to GPU threads and skips "parallel for"</li><li>Clang maps "parallel for" to GPU threads and skips "simd"</li><li>This will change in CCE16, where "parallel" will take over the function that "simd" has in CCE15.</li></ul> |
| parallel do | Maps the kernel to the available threads. Most useful in combination with "teams" blocks. Causes the work done in a loop inside a parallel region to be divided among the threads. |
| loop | Needs to be bound to the teams with `bind(teams)`; this happens implicitly when it is used inside a **teams** region. Essentially performs a *parallel do simd* internally, with some compiler optimizations. |
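
To see how these directives combine in practice, here is a minimal sketch of an offloaded loop (the subroutine and variable names are our own, not from this wiki):

```fortran
subroutine vec_add(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n), b(n)
  real, intent(out)   :: c(n)
  integer :: i

  ! target: offload to the GPU; teams distribute: partition iterations
  ! across thread blocks; parallel do simd: spread them over the GPU
  ! threads (subject to the implementation differences noted above).
  !$omp target teams distribute parallel do simd
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
end subroutine vec_add
```

The same kernel could instead be written with `!$omp target teams loop`, leaving the choice of the lower levels of parallelism to the compiler.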
## Memory directives
One can deallocate memory on the device (after the kernel finishes) with:

`!$omp target exit data map (delete:<var>)`

where the data is deleted. If you want to copy over the data, you can use:

`!$omp target exit data map (from:<var>)`
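
As a sketch of the full lifetime of a device array with these unstructured data mappings (the program and variable names here are our own): allocate on the device on entry, compute, and copy the result back on exit.

```fortran
program device_memory_demo
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)
  integer :: i

  ! Allocate a on the device without copying anything over.
  !$omp target enter data map(alloc: a)

  ! Fill the array on the GPU.
  !$omp target teams distribute parallel do simd
  do i = 1, n
     a(i) = 2.0 * real(i)
  end do

  ! Copy the result back to the host and free the device copy.
  !$omp target exit data map(from: a)

  print *, a(1), a(n)
end program device_memory_demo
```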
## Environment variables
TODO