# Issues with Fortran and Intel compilers

Some experiments have been done with simple kernels to understand how the Intel compiler generates assembly code.

With the following kernel:
```
do jj = 1, array_size
   ! Simple kernel
   a(jj) = pa*(b(jj)+pb)*(pc+c(jj))*(pd+d(jj))
end do
```
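For reference, a minimal self-contained harness around this kernel might look like the sketch below. The program name, parameter values, and the compile-time `array_size` are illustrative assumptions; `real(8)` is used to match the 8-byte elements visible in the generated assembly.

```
! Hypothetical harness for the kernel (names and values are assumptions).
program bench
   implicit none
   integer, parameter :: array_size = 1024   ! compile-time size: alignable by default
   real(8) :: a(array_size), b(array_size), c(array_size), d(array_size)
   real(8) :: pa, pb, pc, pd
   integer :: jj

   pa = 1.1d0; pb = 2.2d0; pc = 3.3d0; pd = 4.4d0
   b = 1.0d0;  c = 2.0d0;  d = 3.0d0

   do jj = 1, array_size
      a(jj) = pa*(b(jj)+pb)*(pc+c(jj))*(pd+d(jj))
   end do

   print *, a(1)   ! keep the result live so the loop is not optimized away
end program bench
```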

The assembly code generated by the compiler depends on several factors that are not always obvious.

With the compilation flags `-O3 -fp-model strict -xHost`, the compiler tries to apply some optimizations to the kernel. If the arrays are aligned and the compiler can prove it, it produces one version of the assembly code (case 1); if the compiler cannot prove that the arrays are aligned, it produces another (case 2).

When the arrays are aligned, the compiler can use vector instructions that load 4 elements of data from the cache per instruction; when they are not, each load operation is split in two. The net result, per 4-element vector, is that the aligned version needs 3 load instructions (plus one store) for 24 floating-point operations, while in the non-aligned version the number of loads doubles to 6.

Whether the compiler is able to align the arrays and optimize the kernel accordingly depends on several things; some cases follow.

If the size of the arrays (`array_size`) is known at compile time, the compiler aligns the arrays in memory by default and optimizes the loops accordingly. If the size is only known at run time, the compiler cannot align the arrays unless a compiler directive is placed before the allocation:

```!dir$ attributes align:64 :: A```
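
For run-time-sized arrays, the directive accompanies the declarations; a minimal sketch follows. The module and variable names are hypothetical, and `!dir$` is an Intel-specific directive.

```
! Hypothetical module: request 64-byte alignment for allocatable arrays,
! so that later allocate() calls return cache-line-aligned data.
module kernel_data
   implicit none
   integer :: array_size
   real(8), allocatable :: a(:), b(:), c(:), d(:)
   !dir$ attributes align:64 :: a, b, c, d
end module kernel_data
```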

Even with the align directive, in some cases the compiler does not compile the loop as aligned. This happens, for instance, when the arrays are accessed from routines outside the scope where they are declared. If we declare a routine that operates on the same set of arrays, the compiler generates the case 2 code where case 1 is expected, even if that routine is never called and the align directive is used:

```
subroutine kernel
   integer :: jj
   do jj = 1, array_size
      a(jj) = pa*(b(jj)+pb)*(pc+c(jj))*(pd+d(jj))
   end do
end subroutine
```

This does not happen if the arrays are passed as arguments to the routine with the proper declaration:

```
subroutine kernel(a,b,c,d,pa,pb,pc,pd)
   real, allocatable :: a(:), b(:), c(:), d(:)
   real :: pa, pb, pc, pd
   integer :: jj
   do jj = 1, array_size
      a(jj) = pa*(b(jj)+pb)*(pc+c(jj))*(pd+d(jj))
   end do
end subroutine
```

In this case, the loop outside the routine is compiled as if the arrays were aligned; but when the routine is called, its inner loop is still not compiled as aligned. To allow the kernel to be properly optimized, the `vector aligned` directive must be used:

```
subroutine kernel(a,b,c,d,pa,pb,pc,pd)
   real, allocatable :: a(:), b(:), c(:), d(:)
   real :: pa, pb, pc, pd
   integer :: jj
   !dir$ vector aligned
   do jj = 1, array_size
      a(jj) = pa*(b(jj)+pb)*(pc+c(jj))*(pd+d(jj))
   end do
end subroutine
```

### CASE 1:

```
vaddpd (%rbx,%rdi,8), %ymm3, %ymm4 #71.32
vaddpd 32(%rbx,%rdi,8), %ymm3, %ymm5 #71.32
vaddpd (%rcx,%rdi,8), %ymm2, %ymm7 #71.40
vaddpd 32(%rcx,%rdi,8), %ymm2, %ymm9 #71.40
vaddpd (%rdx,%rdi,8), %ymm0, %ymm11 #71.51
vaddpd 32(%rdx,%rdi,8), %ymm0, %ymm13 #71.51
vmulpd %ymm4, %ymm1, %ymm6 #71.25
vmulpd %ymm5, %ymm1, %ymm8 #71.25
vaddpd 64(%rbx,%rdi,8), %ymm3, %ymm4 #71.32
vaddpd 96(%rbx,%rdi,8), %ymm3, %ymm5 #71.32
vmulpd %ymm7, %ymm6, %ymm10 #71.36
vmulpd %ymm9, %ymm8, %ymm12 #71.36
vmulpd %ymm4, %ymm1, %ymm6 #71.25
vmulpd %ymm5, %ymm1, %ymm8 #71.25
vmulpd %ymm11, %ymm10, %ymm14 #71.14
vmulpd %ymm13, %ymm12, %ymm15 #71.14
vaddpd 64(%rcx,%rdi,8), %ymm2, %ymm7 #71.40
vaddpd 96(%rcx,%rdi,8), %ymm2, %ymm9 #71.40
vaddpd 64(%rdx,%rdi,8), %ymm0, %ymm11 #71.51
vaddpd 96(%rdx,%rdi,8), %ymm0, %ymm13 #71.51
vmulpd %ymm7, %ymm6, %ymm10 #71.36
vmulpd %ymm9, %ymm8, %ymm12 #71.36
vmovupd %ymm14, (%rsi,%rdi,8) #71.14
vmovupd %ymm15, 32(%rsi,%rdi,8) #71.14
vmulpd %ymm11, %ymm10, %ymm14 #71.14
vmulpd %ymm13, %ymm12, %ymm15 #71.14
vmovupd %ymm14, 64(%rsi,%rdi,8) #71.14
vmovupd %ymm15, 96(%rsi,%rdi,8) #71.14
```

#### Summary:

12 × vaddpd → 12 LOAD + 12 ADD

12 × vmulpd → 12 MUL

4 × vmovupd → 4 STORE

### CASE 2:

```
vmovupd (%rdi,%rdx,8), %xmm4 #71.27
vmovupd 32(%rdi,%rdx,8), %xmm5 #71.27
vinsertf128 $1, 16(%rdi,%rdx,8), %ymm4, %ymm6 #71.27
vinsertf128 $1, 48(%rdi,%rdx,8), %ymm5, %ymm7 #71.27
vaddpd %ymm13, %ymm6, %ymm8 #71.32
vaddpd %ymm7, %ymm13, %ymm9 #71.32
vmovupd (%r9,%rdx,8), %xmm4 #71.41
vmovupd 32(%r9,%rdx,8), %xmm5 #71.41
vmulpd %ymm8, %ymm14, %ymm10 #71.25
vmulpd %ymm9, %ymm14, %ymm11 #71.25
vinsertf128 $1, 16(%r9,%rdx,8), %ymm4, %ymm6 #71.41
vinsertf128 $1, 48(%r9,%rdx,8), %ymm5, %ymm7 #71.41
vaddpd %ymm6, %ymm12, %ymm8 #71.40
vaddpd %ymm7, %ymm12, %ymm4 #71.40
vmulpd %ymm8, %ymm10, %ymm9 #71.36
vmovupd (%r8,%rdx,8), %xmm10 #71.52
vmulpd %ymm4, %ymm11, %ymm8 #71.36
vmovupd 32(%r8,%rdx,8), %xmm11 #71.52
vinsertf128 $1, 16(%r8,%rdx,8), %ymm10, %ymm4 #71.52
vinsertf128 $1, 48(%r8,%rdx,8), %ymm11, %ymm5 #71.52
vaddpd %ymm4, %ymm15, %ymm6 #71.51
vaddpd %ymm5, %ymm15, %ymm7 #71.51
vmovupd 64(%rdi,%rdx,8), %xmm10 #71.27
vmulpd %ymm6, %ymm9, %ymm9 #71.14
vmulpd %ymm7, %ymm8, %ymm8 #71.14
vmovupd 96(%rdi,%rdx,8), %xmm4 #71.27
vmovupd 96(%r9,%rdx,8), %xmm11 #71.41
vmovupd %ymm9, (%rsi,%rdx,8) #71.14
vmovupd %ymm8, 32(%rsi,%rdx,8) #71.14
vmovupd 64(%r9,%rdx,8), %xmm9 #71.41
vinsertf128 $1, 80(%rdi,%rdx,8), %ymm10, %ymm5 #71.27
vaddpd %ymm13, %ymm5, %ymm7 #71.32
vinsertf128 $1, 112(%rdi,%rdx,8), %ymm4, %ymm6 #71.27
vaddpd %ymm6, %ymm13, %ymm8 #71.32
vmulpd %ymm7, %ymm14, %ymm6 #71.25
vmulpd %ymm8, %ymm14, %ymm7 #71.25
vinsertf128 $1, 80(%r9,%rdx,8), %ymm9, %ymm4 #71.41
vaddpd %ymm4, %ymm12, %ymm8 #71.40
vinsertf128 $1, 112(%r9,%rdx,8), %ymm11, %ymm5 #71.41
vaddpd %ymm5, %ymm12, %ymm9 #71.40
vmulpd %ymm8, %ymm6, %ymm5 #71.36
vmovupd 64(%r8,%rdx,8), %xmm6 #71.52
vmulpd %ymm9, %ymm7, %ymm4 #71.36
vmovupd 96(%r8,%rdx,8), %xmm7 #71.52
vinsertf128 $1, 80(%r8,%rdx,8), %ymm6, %ymm10 #71.52
vinsertf128 $1, 112(%r8,%rdx,8), %ymm7, %ymm6 #71.52
vaddpd %ymm10, %ymm15, %ymm7 #71.51
vaddpd %ymm6, %ymm15, %ymm8 #71.51
vmulpd %ymm7, %ymm5, %ymm5 #71.14
vmulpd %ymm8, %ymm4, %ymm4 #71.14
vmovupd %ymm5, 64(%rsi,%rdx,8) #71.14
vmovupd %ymm4, 96(%rsi,%rdx,8) #71.14
```

#### Summary:

12 × vaddpd → 12 ADD

12 × vmulpd → 12 MUL

16 × vmovupd → 4 STORE + 12 LOAD

12 × vinsertf128 → 12 LOAD

### Comparing the two cases:

**Case 1**: 6 FLOP / (24 + 8) bytes → 0.1875 FLOP/byte

**Case 2**: 6 FLOP / (48 + 8) bytes → 0.1071 FLOP/byte
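
These intensities follow from counting the traffic per array element: 6 FLOPs each, three 8-byte loads (b, c, d) plus one 8-byte store in the aligned case, with the load bytes doubling when every load is split:

```latex
\text{Case 1: } \frac{6~\text{FLOP}}{3 \cdot 8~\text{B} + 8~\text{B}} = \frac{6}{32} = 0.1875~\text{FLOP/B},
\qquad
\text{Case 2: } \frac{6~\text{FLOP}}{6 \cdot 8~\text{B} + 8~\text{B}} = \frac{6}{56} \approx 0.1071~\text{FLOP/B}
```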

Although a priori it seems that the aligned case should generally be faster, the measured difference between the two cases is tangible (1.5x faster) only when the arrays fit in the L1 cache (Figure 1). When the data comes from L2 or further away, the time to bring it to the L1 cache is much larger than the time to move it from L1 to the CPU, so the fact that the non-aligned case splits the final L1-to-CPU step into two loads has almost no impact.

Figure 1 - GFLOPs vs. array size for aligned and non-aligned arrays.

As a summary of the different cases tested:

1 - No routine declared nor called, align attribute → correct assembly code.

2 - No routine declared nor called, no align attribute → not correct assembly code.

3 - Routine declared but not called, align attribute → not correct assembly code.

4 - Routine declared but not called, no align attribute → not correct assembly code.

5 - Routine declared and called, no align attribute → not correct assembly code.

6 - Routine declared and called, align attribute → not correct assembly code.

7 - Routine declared with array arguments but not called, align attribute → correct assembly code.

8 - Routine declared with array arguments and called, align attribute → not correct assembly code.
\ No newline at end of file |