**teams**

With teams, the code runs on the GPU using multiple "teams".
|
- Must be within "omp target"
|
|
**distribute**
|
|
Iterations of the loop below are partitioned across "teams".
|
|
Without it, only thread 0 of each team would be used.
|
|
- Loop partitioning (like omp do/for)
|
|
- Best practice to combine with teams (`target teams distribute`); see the sketch below
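
A minimal sketch of the combined construct, assuming a simple vector update (the subroutine and array names are made up for illustration):

```fortran
! Sketch: offload a loop and split its iterations across the teams.
subroutine vec_update(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i

  !$omp target teams distribute map(to: x) map(tofrom: y)
  do i = 1, n
     y(i) = y(i) + a * x(i)
  end do
  !$omp end target teams distribute
end subroutine vec_update
```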
|
|
**simd**
|
|
Optimizes the loop below with SIMD instructions if possible. This directive is ignored if that is not possible, so it can always be used.
|
|
Maps the work below to GPU threads within the "teams" blocks.
|
|
- Used for two-level GPU parallelism (see the sketch below)
|
|
- This will change for CCE16, where "parallel" will take over the role that "simd" has in CCE15.
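
A minimal sketch of that two-level pattern as described for CCE15, assuming a simple vector addition (subroutine and array names are illustrative):

```fortran
! Sketch: distribute iterations across teams, then map them to GPU threads with simd.
subroutine vec_add(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n), b(n)
  real, intent(out)   :: c(n)
  integer :: i

  !$omp target teams distribute simd map(to: a, b) map(from: c)
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute simd
end subroutine vec_add
```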
|
|
**parallel do**
|
|
Maps the kernel to the available threads.
|
|
Most useful in combination with "teams" blocks.
|
|
Causes the work done in a loop inside a parallel region to be divided among threads.
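
For example, a full offload pattern that uses all levels might look like the following sketch (the scaling loop and its names are made up for illustration):

```fortran
! Sketch: partition iterations across teams, then divide each team's chunk
! among that team's threads with "parallel do".
subroutine scale_vec(n, alpha, x)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: alpha
  real, intent(inout) :: x(n)
  integer :: i

  !$omp target teams distribute parallel do map(tofrom: x)
  do i = 1, n
     x(i) = alpha * x(i)
  end do
  !$omp end target teams distribute parallel do
end subroutine scale_vec
```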
|
|
**loop**
|
|
Needs to be bound to the teams with `bind(teams)`, or this is done implicitly when the directive is used inside a **teams** region.
|
|
Basically does a *parallel do simd* internally, with some compiler optimizations; it is the newer construct.
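
A minimal sketch of the **loop** construct bound to an enclosing **teams** region (the subroutine and array names are illustrative):

```fortran
! Sketch: "loop" inside a target teams region, explicitly bound to the teams.
subroutine copy_vec(n, src, dst)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: src(n)
  real, intent(out)   :: dst(n)
  integer :: i

  !$omp target teams map(to: src) map(from: dst)
  !$omp loop bind(teams)
  do i = 1, n
     dst(i) = src(i)
  end do
  !$omp end loop
  !$omp end target teams
end subroutine copy_vec
```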
|
|
|
|
|
|
## Memory directives.
|
|
|
|
|
|
|
|
|
The following map clauses can be used to control GPU memory operations for a single kernel.
|
|
|
|
|
|
|
|
**map([\<var\>])**
|
|
|
|
Copies the variable to the device when the kernel starts and back to the host when it finishes (equivalent to `tofrom`).
|
|
|
|
|
|
|
|
**map(from:[\<var\>])**
|
|
|
|
Copies the variable from the device back to the host after the kernel finishes (no copy to the device on entry).
|
|
|
|
|
|
|
|
**map(to:[\<var\>])**
|
|
|
|
Copies the variable to the device when the kernel starts (no copy back to the host when it finishes).
|
|
|
|
|
|
|
|
**map(tofrom:[\<var\>])**
|
|
|
|
Copies the variable to the device when the kernel starts and back to the host after the kernel finishes.
|
|
|
|
|
|
|
|
**map(alloc:[\<var\>])**
|
|
|
|
Allocates space for the variable on the device without copying any data.
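
Put together, a single kernel might combine several of these clauses; a small sketch with made-up names:

```fortran
! Sketch: a is only read on the device, b is only written there,
! and tmp is device-only scratch space that is never copied.
subroutine transform(n, a, b)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n)
  real, intent(out)   :: b(n)
  real :: tmp(n)
  integer :: i

  !$omp target teams distribute parallel do map(to: a) map(from: b) map(alloc: tmp)
  do i = 1, n
     tmp(i) = 2.0 * a(i)
     b(i)   = tmp(i) + 1.0
  end do
  !$omp end target teams distribute parallel do
end subroutine transform
```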
|
|
|
|
|
|
|
|
## Allocative memory operations.
|
|
|
|
|
|
|
|
There are also directives that you can use to allocate memory on the device that is accessible by the kernel.
|
|
|
|
|
|
|
|
One can allocate memory on the device with:
|
|
|
|
```!$OMP target enter data map(<specifier>)```
|
|
|
|
Where the specifier is one of the operations listed in the section above.
|
|
|
|
|
|
|
|
One can deallocate memory on the device (after the kernel finishes) with:
|
|
|
|
```!$OMP target exit data map(delete: <var>)```
|
|
|
|
where the data is deleted. If you want to copy the data back to the host instead, you can use:
|
|
|
|
```!$OMP target exit data map(from: <var>)```
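
A small sketch of this pattern with a made-up array name: the data stays resident on the device between the two kernels, and no extra copies happen in between.

```fortran
! Sketch: keep an array resident on the device across two kernels.
subroutine two_kernels(n, x)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: x(n)
  integer :: i

  ! Allocate the array on the device and copy it in once.
  !$omp target enter data map(to: x)

  !$omp target teams distribute parallel do
  do i = 1, n
     x(i) = x(i) + 1.0
  end do
  !$omp end target teams distribute parallel do

  !$omp target teams distribute parallel do
  do i = 1, n
     x(i) = 2.0 * x(i)
  end do
  !$omp end target teams distribute parallel do

  ! Copy the result back to the host and free the device copy.
  !$omp target exit data map(from: x)
end subroutine two_kernels
```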
|
|
|
|
|
|
## Environment variables.
|
|
|
|
|
|
|