IFS Intel MPI issues

IFS @ MN4

Issue 1: IFS memory corrupted when activating AVX-512

Environment: IFS 36r4. This bug was observed using the following compilers and MPI libraries:

  • Intel 2017.4 & Intel MPI 2017.4

The problem was reported when using the -O2 and -O3 optimization flags together with AVX-512 (512-bit SIMD vector extensions). The code was free of this issue when using AVX or AVX2.

Problem: Running IFS compiled with -O3 (the highest optimization level) and -xCORE-AVX512 (which enables 512-bit SIMD instructions) makes the model abort. A check in a physics solver detects a singular matrix (a matrix that has no inverse), which cannot be used to solve the system of equations the solver needs.

 ABORT!  381 LUDCMP: Singular matrix
MPL_ABORT: CALLED FROM PROCESSOR    381 THRD     1
 MPL_ABORT: THRD           1   LUDCMP: Singular matrix
 SDL_TRACEBACK: Calling INTEL_TRBK, THRD =            1
Calling traceback from intel_trbk()
Image              PC                Routine            Line        Source
ifsmaster-ecconf   0000000003C806AD  Unknown               Unknown  Unknown
ifsmaster-ecconf   000000000396934D  intel_trbk_                10  gentrbk.F90
ifsmaster-ecconf   0000000003939117  sdl_mod_mp_sdl_tr          66  sdl_mod.F90
ifsmaster-ecconf   000000000390DA77  mpl_abort_mod_mp_          35  mpl_abort_mod.F90
ifsmaster-ecconf   000000000393BDBA  abor1_                     31  abor1.F90
ifsmaster-ecconf   0000000002D50CF6  ludcmp_                    63  ludcmp.F90
ifsmaster-ecconf   0000000002BE1466  cloudsc_                 2703  cloudsc.F90
ifsmaster-ecconf   0000000002AFACB7  callpar_                 3449  callpar.F90
ifsmaster-ecconf   000000000272334A  ec_phys_                  795  ec_phys.F90
ifsmaster-ecconf   0000000001631661  ec_phys_drv_              403  ec_phys_drv.F90
ifsmaster-ecconf   00000000015F20F5  gp_model_                 553  gp_model.F90
ifsmaster-ecconf   0000000000A8D351  scan2m_                   548  scan2m.F90
ifsmaster-ecconf   0000000000A89E6B  scan2h_                   107  scan2h.F90
ifsmaster-ecconf   0000000000450D4A  stepo_                    373  stepo.F90
ifsmaster-ecconf   00000000004168D5  cnt4_                    1087  cnt4.F90
ifsmaster-ecconf   000000000041328C  cnt3_                     324  cnt3.F90
ifsmaster-ecconf   00000000004113A4  cnt2_                      76  cnt2.F90
ifsmaster-ecconf   00000000004111D1  cnt1_                     116  cnt1.F90
ifsmaster-ecconf   0000000000410C94  cnt0_                     154  cnt0.F90
ifsmaster-ecconf   00000000004102B8  MAIN__                     33  master.F90
ifsmaster-ecconf   000000000041021E  Unknown               Unknown  Unknown
libc-2.22.so       00002B9828CB56E5  __libc_start_main     Unknown  Unknown
ifsmaster-ecconf   0000000000410129  Unknown               Unknown  Unknown

(The error occurred at different time steps or in different processes, depending on the run.)

However, if the model is compiled with -O2 (together with -xCORE-AVX512), the error is different, and it is triggered in a part of the program that runs before the one that fails with -O3.

forrtl: severe (154): array index out of bounds
Image              PC                Routine            Line        Source
ifsmaster-ecconf   0000000003C88D29  Unknown               Unknown  Unknown
libpthread-2.22.s  00002AC171F19B10  Unknown               Unknown  Unknown
ifsmaster-ecconf   00000000036FBD3D  surfexcdriver_ctl         479  surfexcdriver_ctl_mod.F90
ifsmaster-ecconf   00000000036E2096  surfexcdriver_            663  surfexcdriver.F90
ifsmaster-ecconf   000000000338CEA2  vdfmain_                  608  vdfmain.F90
ifsmaster-ecconf   0000000002DE21C4  vdfouter_                 618  vdfouter.F90
ifsmaster-ecconf   0000000002AF0ABC  callpar_                 2526  callpar.F90
ifsmaster-ecconf   000000000272334A  ec_phys_                  795  ec_phys.F90
ifsmaster-ecconf   0000000001631661  ec_phys_drv_              403  ec_phys_drv.F90
ifsmaster-ecconf   00000000015F20F5  gp_model_                 553  gp_model.F90
ifsmaster-ecconf   0000000000A8D351  scan2m_                   548  scan2m.F90
ifsmaster-ecconf   0000000000A89E6B  scan2h_                   107  scan2h.F90
ifsmaster-ecconf   0000000000450D4A  stepo_                    373  stepo.F90
ifsmaster-ecconf   00000000004168D5  cnt4_                    1087  cnt4.F90
ifsmaster-ecconf   000000000041328C  cnt3_                     324  cnt3.F90
ifsmaster-ecconf   00000000004113A4  cnt2_                      76  cnt2.F90
ifsmaster-ecconf   00000000004111D1  cnt1_                     116  cnt1.F90
ifsmaster-ecconf   0000000000410C94  cnt0_                     154  cnt0.F90
ifsmaster-ecconf   00000000004102B8  MAIN__                     33  master.F90
ifsmaster-ecconf   000000000041021E  Unknown               Unknown  Unknown
libc-2.22.so       00002AC1724436E5  __libc_start_main     Unknown  Unknown
ifsmaster-ecconf   0000000000410129  Unknown               Unknown  Unknown

Actions taken: We first debugged the regions of the code referenced in the error traces using Allinea DDT. Debugging with -O2 or -O3 is difficult, because variables are optimized away and their values hidden, and source lines do not always correspond to the instructions actually being executed, but we could still see where the problem was coming from.

The code that aborted in the -O3 build is in the ludcmp routine:

  DO JL=KIDIA,KFDIA
    IF (ZAAMAX(JL) <= 0.0_JPRB) THEN
      CALL ABOR1('LUDCMP: Singular matrix')
    ENDIF ! SINGULAR MATRIX 
    ZVV(JL,I) = 1.0_JPRB/ZAAMAX(JL) !SAVE THE SCALING. 
  ENDDO

This loop checks the maximum value of a given row of the matrix (stored in ZAAMAX) for every grid point. If any maximum is less than or equal to 0, the matrix is declared singular, because that row would be a zero vector, and the code aborts.
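To make the check concrete, here is a minimal stand-alone sketch of the same idea for a single small matrix. It is not the IFS routine: the program name, the matrix ZA, its size, and its values are made up for illustration, and it assumes the row maximum is taken over absolute values, as in the textbook LU decomposition with implicit scaling. The kind JPRB is also defined locally here as an assumption.

```fortran
PROGRAM row_scaling_check
  ! Hypothetical stand-alone sketch; this is not the IFS LUDCMP source.
  IMPLICIT NONE
  INTEGER, PARAMETER :: JPRB = SELECTED_REAL_KIND(13, 300)  ! assumed double precision
  INTEGER, PARAMETER :: N = 3
  REAL(KIND=JPRB) :: ZA(N, N)   ! one small matrix (IFS keeps one per grid point JL)
  REAL(KIND=JPRB) :: ZVV(N)     ! per-row scaling factors
  REAL(KIND=JPRB) :: ZAAMAX     ! largest magnitude found in the current row
  INTEGER :: I, J
  LOGICAL :: LLSINGULAR

  ! Made-up data: row 2 is all zeros, so the matrix has no inverse.
  ZA(1, :) = (/ 2.0_JPRB, -1.0_JPRB, 0.0_JPRB /)
  ZA(2, :) = (/ 0.0_JPRB,  0.0_JPRB, 0.0_JPRB /)
  ZA(3, :) = (/ 1.0_JPRB,  4.0_JPRB, 3.0_JPRB /)

  ZVV = 0.0_JPRB
  LLSINGULAR = .FALSE.
  DO I = 1, N
    ! Find the largest magnitude in row I (absolute value is an assumption).
    ZAAMAX = 0.0_JPRB
    DO J = 1, N
      ZAAMAX = MAX(ZAAMAX, ABS(ZA(I, J)))
    ENDDO
    IF (ZAAMAX <= 0.0_JPRB) THEN
      ! IFS calls ABOR1 here; the sketch only records the condition.
      LLSINGULAR = .TRUE.
      PRINT *, 'Row', I, 'is all zeros: singular matrix'
    ELSE
      ZVV(I) = 1.0_JPRB / ZAAMAX   ! save the scaling
    ENDIF
  ENDDO

  ! Self-check: the zero row must have been detected.
  IF (.NOT. LLSINGULAR) ERROR STOP 'expected a singular matrix'
END PROGRAM row_scaling_check
```

If an earlier routine overwrites a row with zeros, a check of this kind is where the damage first becomes visible, which is consistent with the -O3 abort appearing in ludcmp rather than at the point of corruption.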

The code that crashes in the -O2 build is in the surfexcdriver_ctl_mod module:

 DO JTILE=1,KTILES
  DO JL=KIDIA,KFDIA
    IF (LLHISSR(JL)) THEN
      PSSRFLTI(JL,JTILE)=PSSRFLTI(JL,JTILE)*PSSRFL(JL)/ZSSRFL1(JL)
    ENDIF
    ZSRFD(JL)=PSSRFLTI(JL,JTILE)/(1.0_JPRB-PALBTI(JL,JTILE))
  ENDDO

Here the code assigns a weighted value to PSSRFLTI, proportional to its contribution to the ZSSRFL1 variable in a previous calculation. LLHISSR(JL) flags the points whose PSSRFLTI value exceeded 700 in a previous loop and was therefore capped at 700.

The -O2 build failed with an “array index out of bounds” error, and the -O3 failure appeared in a routine that runs later. This made us think that the out-of-bounds access could be corrupting the matrix values and filling them with zeros, so that the matrix became singular. Therefore, we focused first on the -O2 problem.

To learn more about the vectorization applied by the compiler in this situation, we generated optimization and vectorization reports. The loop was indeed being vectorized, but there was not much difference between the -O2 and the -O3 case. We also generated the assembly code for both routines. Although there were differences between -O2 and -O3, the structure was similar and the conditional was not ignored.

Diagnosis: Leaving aside the validity of the scientific algorithm, using conditionals inside loops is bad practice in code that is meant to be vectorized. It is therefore likely that the Intel compiler has a bug that prevents it from handling this conditional correctly when vectorizing the loop. Evidence supporting this hypothesis is that the compiler does not automatically merge the two resulting loops when we split them (see the solution for more detail on this).

Solution: We developed two working solutions for this issue. Both rely on small modifications to the IFS source. The best solution would be for Intel to fix the compiler, but our approach works in the meantime.

The first fix is to use compiler directives to avoid the vectorization of the loop in surfexcdriver_ctl_mod:

!DIR$ NOVECTOR
DO JTILE=1,KTILES
  !DIR$ NOVECTOR
  DO JL=KIDIA,KFDIA
! Disaggregate solar flux but limit to 700 W/m2 (due to inconsistency
!  with albedo)
    PSSRFLTI(JL,JTILE)=((1.0_JPRB-PALBTI(JL,JTILE))/&
   & (1.0_JPRB-ZALB(JL)))*PSSRFL(JL)
    IF (PSSRFLTI(JL,JTILE) > 700._JPRB) THEN
      LLHISSR(JL)=.TRUE.
      PSSRFLTI(JL,JTILE)=700._JPRB
    ENDIF
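The excerpt above is truncated, so the following is a self-contained sketch of the same pattern that can be compiled on its own. The program name, array sizes, and input values are made up, and JPRB is defined locally as an assumption; only the loop body and the placement of the directives follow the excerpt. !DIR$ NOVECTOR is an Intel Fortran directive that applies to the DO loop immediately after it; compilers that do not recognize it treat the line as an ordinary comment.

```fortran
PROGRAM novector_sketch
  ! Hypothetical stand-alone sketch; sizes and values are made up.
  IMPLICIT NONE
  INTEGER, PARAMETER :: JPRB = SELECTED_REAL_KIND(13, 300)  ! assumed double precision
  INTEGER, PARAMETER :: KIDIA = 1, KFDIA = 4, KTILES = 2
  REAL(KIND=JPRB) :: PSSRFL(KFDIA), ZALB(KFDIA)
  REAL(KIND=JPRB) :: PALBTI(KFDIA, KTILES), PSSRFLTI(KFDIA, KTILES)
  LOGICAL :: LLHISSR(KFDIA)
  INTEGER :: JL, JTILE

  ! Made-up inputs: tile 1 is darker than the mean albedo, tile 2 brighter.
  PSSRFL = (/ 300.0_JPRB, 650.0_JPRB, 690.0_JPRB, 100.0_JPRB /)
  ZALB = 0.30_JPRB
  PALBTI(:, 1) = 0.10_JPRB
  PALBTI(:, 2) = 0.40_JPRB
  LLHISSR = .FALSE.

  !DIR$ NOVECTOR
  DO JTILE = 1, KTILES
    !DIR$ NOVECTOR
    DO JL = KIDIA, KFDIA
      ! Disaggregate solar flux but limit it to 700 W/m2
      PSSRFLTI(JL, JTILE) = ((1.0_JPRB - PALBTI(JL, JTILE)) / &
        & (1.0_JPRB - ZALB(JL))) * PSSRFL(JL)
      IF (PSSRFLTI(JL, JTILE) > 700._JPRB) THEN
        LLHISSR(JL) = .TRUE.
        PSSRFLTI(JL, JTILE) = 700._JPRB
      ENDIF
    ENDDO
  ENDDO

  ! Self-checks: no flux above the cap, and exactly points 2 and 3 flagged.
  IF (MAXVAL(PSSRFLTI) > 700._JPRB) ERROR STOP 'cap was not applied'
  IF (COUNT(LLHISSR) /= 2) ERROR STOP 'unexpected number of capped points'
  PRINT *, 'Capped points:', LLHISSR
END PROGRAM novector_sketch
```

With these inputs only tile 1 at points 2 and 3 exceeds the cap (0.9/0.7 times 650 and 690), so LLHISSR ends up true for those two points.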

The second is to separate the conditional IF from the rest of the loop body, making two independent loops instead:

DO JTILE=1,KTILES
  DO JL=KIDIA,KFDIA
    IF (LLHISSR(JL)) THEN
      PSSRFLTI(JL,JTILE)=PSSRFLTI(JL,JTILE)*PSSRFL(JL)/ZSSRFL1(JL)
    ENDIF
  ENDDO

  DO JL=KIDIA,KFDIA
    ZSRFD(JL)=PSSRFLTI(JL,JTILE)/(1.0_JPRB-PALBTI(JL,JTILE))
  ENDDO

The vectorization report showed that, although some of the other loops in the same routine were merged for optimization, these two loops were not merged again. This can happen when the compiler considers it risky or sub-optimal to do so.

Both fixes work with both -O2 and -O3, so the matrix is no longer detected as singular in ludcmp.
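Because the second fix changes the source, it is worth confirming that the split loops compute the same values as the original single loop. The sketch below does this on made-up data. The program name, array sizes, input values, and the ZFLX_A/ZFLX_B and ZSRFD_A/ZSRFD_B names are hypothetical, and ZSRFD is stored per tile here (unlike in the excerpts) so that every value can be compared.

```fortran
PROGRAM split_loop_check
  ! Hypothetical stand-alone sketch; only the loop bodies follow the text.
  IMPLICIT NONE
  INTEGER, PARAMETER :: JPRB = SELECTED_REAL_KIND(13, 300)  ! assumed double precision
  INTEGER, PARAMETER :: KIDIA = 1, KFDIA = 8, KTILES = 3
  REAL(KIND=JPRB), PARAMETER :: ZTOL = 1.0E-9_JPRB
  REAL(KIND=JPRB) :: PSSRFL(KFDIA), ZSSRFL1(KFDIA), PALBTI(KFDIA, KTILES)
  REAL(KIND=JPRB) :: ZFLX_A(KFDIA, KTILES), ZFLX_B(KFDIA, KTILES)    ! PSSRFLTI copies
  REAL(KIND=JPRB) :: ZSRFD_A(KFDIA, KTILES), ZSRFD_B(KFDIA, KTILES)  ! per-tile ZSRFD
  LOGICAL :: LLHISSR(KFDIA)
  INTEGER :: JL, JTILE

  ! Made-up test data; every second point is flagged as capped.
  DO JL = KIDIA, KFDIA
    PSSRFL(JL)  = 600.0_JPRB + 20.0_JPRB * REAL(JL, JPRB)
    ZSSRFL1(JL) = 590.0_JPRB + 15.0_JPRB * REAL(JL, JPRB)
    LLHISSR(JL) = (MOD(JL, 2) == 0)
    DO JTILE = 1, KTILES
      PALBTI(JL, JTILE) = 0.1_JPRB * REAL(JTILE, JPRB)
      ZFLX_A(JL, JTILE) = 500.0_JPRB + 25.0_JPRB * REAL(JL + JTILE, JPRB)
    ENDDO
  ENDDO
  ZFLX_B = ZFLX_A

  ! Variant A: original single loop with the conditional inside.
  DO JTILE = 1, KTILES
    DO JL = KIDIA, KFDIA
      IF (LLHISSR(JL)) THEN
        ZFLX_A(JL, JTILE) = ZFLX_A(JL, JTILE) * PSSRFL(JL) / ZSSRFL1(JL)
      ENDIF
      ZSRFD_A(JL, JTILE) = ZFLX_A(JL, JTILE) / (1.0_JPRB - PALBTI(JL, JTILE))
    ENDDO
  ENDDO

  ! Variant B: the conditional update and the division in separate loops.
  DO JTILE = 1, KTILES
    DO JL = KIDIA, KFDIA
      IF (LLHISSR(JL)) THEN
        ZFLX_B(JL, JTILE) = ZFLX_B(JL, JTILE) * PSSRFL(JL) / ZSSRFL1(JL)
      ENDIF
    ENDDO
    DO JL = KIDIA, KFDIA
      ZSRFD_B(JL, JTILE) = ZFLX_B(JL, JTILE) / (1.0_JPRB - PALBTI(JL, JTILE))
    ENDDO
  ENDDO

  ! Self-check: both variants must agree within a small tolerance.
  IF (MAXVAL(ABS(ZFLX_A - ZFLX_B)) > ZTOL) ERROR STOP 'flux values differ'
  IF (MAXVAL(ABS(ZSRFD_A - ZSRFD_B)) > ZTOL) ERROR STOP 'ZSRFD values differ'
  PRINT *, 'Single-loop and split-loop variants agree'
END PROGRAM split_loop_check
```

A tolerance is used instead of exact equality so that the check remains meaningful if the compiler vectorizes the two variants differently. Running the same comparison with and without -xCORE-AVX512 is one way to check a build before relying on it.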

More information:

Intel® AVX-512 Instructions introduction:

https://software.intel.com/en-us/blogs/2013/avx-512-instructions

Compiling for the Intel® Xeon Phi™ Processor and the Intel® Advanced Vector Extensions 512 ISA:

https://software.intel.com/en-us/articles/compiling-for-the-intel-xeon-phi-processor-and-the-intel-avx-512-isa

Quick Reference Guide to Optimization with Intel® C++ and Fortran Compilers v16:

https://software.intel.com/sites/default/files/managed/12/f1/Quick-Reference-Card-Intel-Compilers-v16.pdf

Vectorization and Optimization Reports:

https://software.intel.com/en-us/articles/vectorization-and-optimization-reports

Generating a Vectorization Report:

https://software.intel.com/en-us/node/590464

General Compiler Directives:

https://software.intel.com/en-us/node/692388

library/computing/ifs_impi_troubles.txt · Last modified: 2017/08/24 13:56 by mcastril