====== IFS Intel MPI issues ======

===== IFS @ MN4 =====

==== Issue 1: IFS memory corrupted when activating AVX512 ====

**Environment:** IFS 36r4. This bug was observed using the following compilers and MPI libraries:

  * Intel 2017.4 & Intel MPI 2017.4

The problem was reported when using the -O2 and -O3 optimization flags together with [[http://example.com|AVX-512]] (512-bit SIMD, i.e. 512-bit vector extensions). The code was free of this issue when using AVX or AVX2.

**Problem:** Running IFS compiled with __-O3__ (the highest optimization level) and -xCORE-AVX512 (which enables the 512-bit SIMD instructions) makes the model abort: a check flags a matrix as singular (a matrix that has no inverse), so it cannot be used to solve a system of equations needed in a physics solver.

  ABORT!  381 LUDCMP: Singular matrix
  MPL_ABORT: CALLED FROM PROCESSOR 381 THRD 1
  MPL_ABORT: THRD 1 LUDCMP: Singular matrix
  SDL_TRACEBACK: Calling INTEL_TRBK, THRD = 1
  Calling traceback from intel_trbk()
  Image              PC                Routine            Line     Source
  ifsmaster-ecconf   0000000003C806AD  Unknown            Unknown  Unknown
  ifsmaster-ecconf   000000000396934D  intel_trbk_        10       gentrbk.F90
  ifsmaster-ecconf   0000000003939117  sdl_mod_mp_sdl_tr  66       sdl_mod.F90
  ifsmaster-ecconf   000000000390DA77  mpl_abort_mod_mp_  35       mpl_abort_mod.F90
  ifsmaster-ecconf   000000000393BDBA  abor1_             31       abor1.F90
  ifsmaster-ecconf   0000000002D50CF6  ludcmp_            63       ludcmp.F90
  ifsmaster-ecconf   0000000002BE1466  cloudsc_           2703     cloudsc.F90
  ifsmaster-ecconf   0000000002AFACB7  callpar_           3449     callpar.F90
  ifsmaster-ecconf   000000000272334A  ec_phys_           795      ec_phys.F90
  ifsmaster-ecconf   0000000001631661  ec_phys_drv_       403      ec_phys_drv.F90
  ifsmaster-ecconf   00000000015F20F5  gp_model_          553      gp_model.F90
  ifsmaster-ecconf   0000000000A8D351  scan2m_            548      scan2m.F90
  ifsmaster-ecconf   0000000000A89E6B  scan2h_            107      scan2h.F90
  ifsmaster-ecconf   0000000000450D4A  stepo_             373      stepo.F90
  ifsmaster-ecconf   00000000004168D5  cnt4_              1087     cnt4.F90
  ifsmaster-ecconf   000000000041328C  cnt3_              324      cnt3.F90
  ifsmaster-ecconf   00000000004113A4  cnt2_              76       cnt2.F90
  ifsmaster-ecconf   00000000004111D1  cnt1_              116      cnt1.F90
  ifsmaster-ecconf   0000000000410C94  cnt0_              154      cnt0.F90
  ifsmaster-ecconf   00000000004102B8  MAIN__             33       master.F90
  ifsmaster-ecconf   000000000041021E  Unknown            Unknown  Unknown
  libc-2.22.so       00002B9828CB56E5  __libc_start_main  Unknown  Unknown
  ifsmaster-ecconf   0000000000410129  Unknown            Unknown  Unknown

(The error happened at different time steps or in different processes, depending on the run.)

However, if the model is compiled with __-O2__ (and also -xCORE-AVX512), the error is different, and it is triggered in a part of the program that is executed before the one that fails when the -O3 flag is used.

  forrtl: severe (154): array index out of bounds
  Image              PC                Routine            Line     Source
  ifsmaster-ecconf   0000000003C88D29  Unknown            Unknown  Unknown
  libpthread-2.22.s  00002AC171F19B10  Unknown            Unknown  Unknown
  ifsmaster-ecconf   00000000036FBD3D  surfexcdriver_ctl  479      surfexcdriver_ctl_mod.F90
  ifsmaster-ecconf   00000000036E2096  surfexcdriver_     663      surfexcdriver.F90
  ifsmaster-ecconf   000000000338CEA2  vdfmain_           608      vdfmain.F90
  ifsmaster-ecconf   0000000002DE21C4  vdfouter_          618      vdfouter.F90
  ifsmaster-ecconf   0000000002AF0ABC  callpar_           2526     callpar.F90
  ifsmaster-ecconf   000000000272334A  ec_phys_           795      ec_phys.F90
  ifsmaster-ecconf   0000000001631661  ec_phys_drv_       403      ec_phys_drv.F90
  ifsmaster-ecconf   00000000015F20F5  gp_model_          553      gp_model.F90
  ifsmaster-ecconf   0000000000A8D351  scan2m_            548      scan2m.F90
  ifsmaster-ecconf   0000000000A89E6B  scan2h_            107      scan2h.F90
  ifsmaster-ecconf   0000000000450D4A  stepo_             373      stepo.F90
  ifsmaster-ecconf   00000000004168D5  cnt4_              1087     cnt4.F90
  ifsmaster-ecconf   000000000041328C  cnt3_              324      cnt3.F90
  ifsmaster-ecconf   00000000004113A4  cnt2_              76       cnt2.F90
  ifsmaster-ecconf   00000000004111D1  cnt1_              116      cnt1.F90
  ifsmaster-ecconf   0000000000410C94  cnt0_              154      cnt0.F90
  ifsmaster-ecconf   00000000004102B8  MAIN__             33       master.F90
  ifsmaster-ecconf   000000000041021E  Unknown            Unknown  Unknown
  libc-2.22.so       00002AC1724436E5  __libc_start_main  Unknown  Unknown
  ifsmaster-ecconf   0000000000410129  Unknown            Unknown  Unknown

**Actions taken:** We first debugged the regions of the code referenced in the error traces using Allinea DDT. Debugging with -O2 or -O3 is difficult (variables are optimized away so their values are hidden, and source lines do not always correspond to the instructions actually being executed), but we could still see where the problem was coming from.

The part of the code that aborted in the -O3 build is in the //ludcmp// routine:

  DO JL=KIDIA,KFDIA
    IF (ZAAMAX(JL) <= 0.0_JPRB) THEN
      CALL ABOR1('LUDCMP: Singular matrix')
    ENDIF ! SINGULAR MATRIX
    ZVV(JL,I) = 1.0_JPRB/ZAAMAX(JL) !SAVE THE SCALING.
  ENDDO

Basically, it checks the maximum value of each row of the matrix (stored in ZAAMAX). If any maximum is less than or equal to 0, the matrix is declared singular, because it would contain a zero vector, and the code triggers an //Abort//.

The part of the code that crashes in the -O2 build is in the //surfexcdriver_ctl_mod// module:

  DO JTILE=1,KTILES
    DO JL=KIDIA,KFDIA
      IF (LLHISSR(JL)) THEN
        PSSRFLTI(JL,JTILE)=PSSRFLTI(JL,JTILE)*PSSRFL(JL)/ZSSRFL1(JL)
      ENDIF
      ZSRFD(JL)=PSSRFLTI(JL,JTILE)/(1.0_JPRB-PALBTI(JL,JTILE))
    ENDDO

Here the code assigns a weighted value to PSSRFLTI, proportional to its contribution to the ZSSRFL1 variable in a previous calculation. LLHISSR(JL) flags the points whose PSSRFLTI value exceeded 700 in a previous loop and was therefore capped at 700.

The -O2 build produced an "array index out of bounds" error in a routine that runs before the one that fails with -O3. This made us think that the invalid array access could be corrupting the matrix values and filling the matrix with zeroes, so that it became singular. Therefore, we focused first on the //-O2 problem//.
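To make the hypothesis more concrete, the following minimal standalone sketch imitates the kind of row-scaling check that //ludcmp// performs and shows how a row overwritten with zeroes would be flagged as singular. It is not IFS code: the program name, the variables (''a'', ''rowmax'', ''wp'') and the 3x3 test matrix are hypothetical, and it assumes that the per-row maximum is taken over absolute values, as in the textbook LU decomposition with implicit scaling.

```fortran
! Minimal sketch, NOT IFS code: every name below is hypothetical.
program singular_row_check
  implicit none
  integer, parameter :: wp = selected_real_kind(13, 300)
  integer, parameter :: n = 3
  real(wp) :: a(n, n), rowmax(n)
  integer :: i

  ! A well-behaved (symmetric) matrix ...
  a = reshape([2.0_wp, 1.0_wp, 0.0_wp, &
               1.0_wp, 3.0_wp, 1.0_wp, &
               0.0_wp, 1.0_wp, 4.0_wp], [n, n])

  ! ... whose second row is then zeroed, imitating memory that was
  ! overwritten by a stray out-of-bounds write elsewhere.
  a(2, :) = 0.0_wp

  ! Largest absolute value in each row (assumed analogue of ZAAMAX).
  rowmax = maxval(abs(a), dim=2)

  do i = 1, n
    if (rowmax(i) <= 0.0_wp) then
      ! The real routine calls ABOR1 here; the sketch only reports.
      print '(a,i0,a)', 'row ', i, ': singular (all zeroes)'
    else
      ! Analogue of saving the scaling factor in ZVV.
      print '(a,i0,a,f6.3)', 'row ', i, ': scale = ', 1.0_wp / rowmax(i)
    end if
  end do
end program singular_row_check
```

In this sketch only the zeroed row fails the check, which is consistent with the idea that a memory corruption upstream, rather than the solver itself, could be producing the singular matrix.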
To get more information about the vectorization applied by the compiler in this situation, we generated //optimization and vectorization reports//. The loop was indeed being vectorized, but there was not much difference between the -O2 and the -O3 case. We also generated the //assembly code// for both routines. Although there were differences between -O2 and -O3, the structure was similar and the conditional was not ignored.

**Diagnosis:** Leaving aside the validity of the scientific algorithm, using conditionals inside loop structures is bad practice when the loops are meant to be vectorized. It is therefore likely that the Intel compiler has a bug that prevents it from handling this conditional correctly when vectorizing the loop. Evidence supporting this hypothesis is that the compiler does not automatically merge the two resulting loops when we split them (see the solution for more detail on this).

**Solution:** We developed two working solutions for this issue. Both of them rely on small modifications to the IFS source. Obviously the best solution would be for Intel to fix their compiler, but our approach can work in the meantime.

The first fix is to use [[https://software.intel.com/en-us/node/692388|compiler directives]] to prevent the vectorization of the loop in //surfexcdriver_ctl_mod//:

  !DIR$ NOVECTOR
  DO JTILE=1,KTILES
  !DIR$ NOVECTOR
    DO JL=KIDIA,KFDIA
  ! Disaggregate solar flux but limit to 700 W/m2 (due to inconsistency
  ! with albedo)
      PSSRFLTI(JL,JTILE)=((1.0_JPRB-PALBTI(JL,JTILE))/&
       & (1.0_JPRB-ZALB(JL)))*PSSRFL(JL)
      IF (PSSRFLTI(JL,JTILE) > 700._JPRB) THEN
        LLHISSR(JL)=.TRUE.
        PSSRFLTI(JL,JTILE)=700._JPRB
      ENDIF

The second fix is to move the conditional IF into its own loop, so that there are two independent loops instead of one:

  DO JTILE=1,KTILES
    DO JL=KIDIA,KFDIA
      IF (LLHISSR(JL)) THEN
        PSSRFLTI(JL,JTILE)=PSSRFLTI(JL,JTILE)*PSSRFL(JL)/ZSSRFL1(JL)
      ENDIF
    ENDDO
    DO JL=KIDIA,KFDIA
      ZSRFD(JL)=PSSRFLTI(JL,JTILE)/(1.0_JPRB-PALBTI(JL,JTILE))
    ENDDO

The vectorization report showed that, although some of the other loops in the same routine were merged during optimization, these two loops were not merged again. This can happen when the compiler considers the merge risky or sub-optimal.

Both fixes work with both __-O2__ and __-O3__, so the matrix is no longer detected as singular in //ludcmp//.

**More information:**

  * Intel® AVX-512 Instructions introduction: [[https://software.intel.com/en-us/blogs/2013/avx-512-instructions|https://software.intel.com/en-us/blogs/2013/avx-512-instructions]]
  * Compiling for the Intel® Xeon Phi™ Processor and the Intel® Advanced Vector Extensions 512 ISA: [[https://software.intel.com/en-us/articles/compiling-for-the-intel-xeon-phi-processor-and-the-intel-avx-512-isa|https://software.intel.com/en-us/articles/compiling-for-the-intel-xeon-phi-processor-and-the-intel-avx-512-isa]]
  * Quick Reference Guide to Optimization with Intel® C++ and Fortran Compilers v16: [[https://software.intel.com/sites/default/files/managed/12/f1/Quick-Reference-Card-Intel-Compilers-v16.pdf|https://software.intel.com/sites/default/files/managed/12/f1/Quick-Reference-Card-Intel-Compilers-v16.pdf]]
  * Vectorization and Optimization Reports: [[https://software.intel.com/en-us/articles/vectorization-and-optimization-reports|https://software.intel.com/en-us/articles/vectorization-and-optimization-reports]]
  * Generating a Vectorization Report: [[https://software.intel.com/en-us/node/590464|https://software.intel.com/en-us/node/590464]]
  * General Compiler Directives: [[https://software.intel.com/en-us/node/692388|https://software.intel.com/en-us/node/692388]]