====== NEMO-XIOS Intel MPI issues ======

===== NEMO-XIOS @ MN3 =====

**Environment:** NEMO 3.6 stable, XIOS 1.0. This bug has been reported with the following compilers and MPI libraries:

  * Intel 13.0.1 & Intel MPI 4.1.3.049
  * Intel 16.0.1 & Intel MPI 5.1.2.150

The problem appears both with the default optimization flags and with the -O3 optimization flag.

**Problem:** When using more than 1,920 MPI processes (120 MN3 nodes), the simulation fell into a deadlock during the XIOS initialization. Some of the __NEMO clients__ remain stuck in client.cpp doing an MPI send:

<code cpp>
MPI_Send(buff, buffer.count(), MPI_CHAR, serverLeader, 1, CXios::globalComm);
</code>

The __XIOS master server__ (first XIOS process) remains in the CServer::listenContext(void) routine in server.cpp, trying to dispatch all the messages:

<code cpp>
MPI_Iprobe(MPI_ANY_SOURCE, 1, CXios::globalComm, &flag, &status);
</code>

**Actions taken:** Prints were placed in the code, before and after the calls shown above. They showed that some NEMO processes were waiting in the MPI barrier (synchronous send), while the XIOS master server was looping indefinitely, trying to collect all the messages (the total number of messages should equal the number of clients, i.e. of NEMO processes). The error could be reproduced with a [[/wiki/doku.php?id=library:computing:xios_impi_troubles_bench|small code]], in order to make debugging and fixing easier.

**Diagnosis:** It seems that some messages sent from the clients to the master server were lost, possibly because all of these messages were sent from all the nodes at the same time.

**Solution:** Our first workaround was to include a call to the __sleep function__ before the MPI_Send in the clients' code, to interleave the outgoing messages and avoid flooding the buffers and the network. This is obviously not the cleanest solution, because it introduces a delay of up to 15 seconds in the execution, but it is an affordable approach, given that this code is only executed during the initialization.

<code cpp>
char hostName[50];
gethostname(hostName, 50);

// Sleep fix: stagger the sends so that the clients do not all hit the server at once
sleep(rank % 16);

MPI_Comm_create_errhandler(eh, &newerr);
MPI_Comm_set_errhandler(CXios::globalComm, newerr);
error_code = MPI_Send(buff, buffer.count(), MPI_CHAR, serverLeader, 1, CXios::globalComm);
delete [] buff;
</code>

BSC Operations provided another solution: enabling the __User Datagram__ (DAPL UD) protocol through Intel MPI environment variables (more information below). This alternative works and needs no code modification, but it entails a performance penalty: simulations using this option were increasingly slower (5%-20%) as the number of cores was increased, compared with the reference runs.

<code>
I_MPI_DAPL_UD=on
</code>
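To make the failure mode easier to picture, the pattern that deadlocked can be reduced to a client/server registration loop like the hedged sketch below, in the spirit of the small reproducer linked above. Names, tags and buffer sizes are illustrative, not the actual XIOS code; the commented-out sleep line corresponds to the staggering workaround described above.

<code cpp>
// Illustrative sketch only: every "client" rank sends one registration message
// to rank 0, which polls with MPI_Iprobe until it has received one per client.
#include <mpi.h>
#include <unistd.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int tag = 1;
    const int serverLeader = 0;   // plays the role of the XIOS master server

    if (rank == serverLeader) {
        // Dispatch loop: expect exactly one registration message per client.
        int received = 0;
        std::vector<char> buff(256);
        while (received < size - 1) {
            int flag = 0;
            MPI_Status status;
            MPI_Iprobe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &flag, &status);
            if (flag) {
                int count = 0;
                MPI_Get_count(&status, MPI_CHAR, &count);
                MPI_Recv(buff.data(), count, MPI_CHAR, status.MPI_SOURCE, tag,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                ++received;
            }
        }
    } else {
        // Clients: everybody sends to the server leader at (almost) the same time.
        // sleep(rank % 16);   // the staggering workaround described above
        std::vector<char> buff(256, 0);
        MPI_Send(buff.data(), static_cast<int>(buff.size()), MPI_CHAR,
                 serverLeader, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
</code>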
**More information:**

This bug was reported in the XIOS portal: [[http://forge.ipsl.jussieu.fr/ioserver/ticket/90]]

About Intel Communication Fabrics control: [[https://software.intel.com/en-us/node/528821]]

DAPL UD-capable Network Fabrics Control: [[https://software.intel.com/en-us/node/528824]]

===== NEMO-XIOS @ MN4 =====

==== Issue 1: NEMO fails to read input files ====

**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.4.1 & 4.2
  * HDF5: 1.8.19
  * Flags: -O3 & -O0

**Problem:** NEMO crashes during the initialization when reading input files.

__ocean.output__:

<code>
 ===>>> : E R R O R
         ===========
 iom_nf90_check : NetCDF: Invalid argument
 iom_nf90_open ~~~
 iom_nf90_open ~~~ open existing file: ./weights_WOA13d1_2_orca1_bilinear.nc in READ mode

 ===>>> : E R R O R
         ===========
 iom_nf90_check : NetCDF: Invalid argument
 iom_nf90_open ~~~

 ===>>> : E R R O R
         ===========
 fld_weight : unable to read the file
</code>

__log.err:__

<code>
forrtl: severe (408): fort: (7): Attempt to use pointer FLY_DTA when it is not associated with a target

Image              PC                Routine            Line     Source
nemo.exe           00000000027994F6  Unknown            Unknown  Unknown
nemo.exe           0000000000B3F219  fldread_mp_fld_in  1375     fldread.f90
nemo.exe           0000000000B15D4E  fldread_mp_fld_ge  614      fldread.f90
nemo.exe           0000000000B13A6B  fldread_mp_fld_in  413      fldread.f90
nemo.exe           0000000000B0A69B  fldread_mp_fld_re  175      fldread.f90
nemo.exe           0000000000978301  dtatsd_mp_dta_tsd  224      dtatsd.f90
nemo.exe           0000000000C312DF  istate_mp_istate_  196      istate.f90
nemo.exe           000000000043C33F  nemogcm_mp_nemo_i  326      nemogcm.f90
nemo.exe           000000000043A64D  nemogcm_mp_nemo_g  120      nemogcm.f90
nemo.exe           000000000043A606  MAIN__             18       nemo.f90
nemo.exe           000000000043A5DE  Unknown            Unknown  Unknown
libc-2.22.so       00002B64D88596E5  __libc_start_main  Unknown  Unknown
nemo.exe           000000000043A4E9  Unknown            Unknown  Unknown
</code>

**Actions taken:** BSC Ops had previously observed this error using NEMO standalone and NetCDF 4.4.4.1, so they installed NetCDF 4.4.0. We could not reproduce the failure they had reported when running NEMO standalone with NetCDF 4.4.4.1, but we did get the same error when running EC-Earth. So we also moved to NetCDF 4.4.0 and the error no longer appeared. However, we got other XIOS errors when writing the output (described in the following issues), so we asked for the same NetCDF version we were using at MN3 to be installed. When Operations installed version 4.2, we surprisingly got the same error again.

After looking for differences between the NetCDF 4.4.0 and NetCDF 4.2 configurations (using the nc-config & nf-config commands), we found out that while NetCDF 4.4.0 was compiled with no support for NetCDF-4 (nc4) or P-NetCDF (a library that provides parallel I/O support for classic NetCDF files), NetCDF 4.2 was installed with support for these features. We reported this to Ops and they installed __NetCDF without linking to P-NetCDF__, and this seemed to fix the problem.

In order to learn more about the source of this bug, we __compared the behavior of two NEMO executables__: one compiled against a NetCDF with P-NetCDF support and one compiled against a NetCDF without it. Both executions were linked at runtime against a NetCDF without P-NetCDF support. The result is that the __NEMO compiled against P-NetCDF did not run__, even though the library used at runtime did not use it. The conclusion was that something was wrong in the NEMO binary itself.

We did a __comparison of the functions included in both binaries__ with the nm command and found that __they were identical__. Then we did a __more in-depth comparison of both binaries__ with objdump and found small differences, some of which were __pointing to an XIOS header file called netcdf.hpp__. This header is responsible for including some NetCDF function definitions, and its behavior depends on the environment (preprocessing macros). In order to know whether this file is responsible for the bug, we would have to compile NetCDF ourselves in debugging mode (with the -g flag).
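A quick way to test a given NetCDF installation independently of NEMO/EC-Earth is a minimal standalone nc_open check, built against the library under test. This is a hedged sketch, not part of our setup: the file name is taken from the ocean.output extract above, and the build command is only an example (any compiler plus the output of nc-config should do). Since nf90_open ends up in the C-level nc_open, a failure here would point at the library build rather than at NEMO or XIOS.

<code cpp>
// Minimal nc_open check, e.g.:
//   mpiicpc nc_open_check.cpp -o nc_open_check $(nc-config --cflags --libs)
#include <cstdio>
#include <netcdf.h>

int main(int argc, char** argv)
{
    // Path of a file that NEMO failed to open (taken from ocean.output above).
    const char* path = (argc > 1) ? argv[1] : "weights_WOA13d1_2_orca1_bilinear.nc";

    int ncid = -1;
    // Same library entry point that iom_nf90_open reaches through nf90_open.
    int status = nc_open(path, NC_NOWRITE, &ncid);
    if (status != NC_NOERR) {
        // With a problematic build, this is where an error such as
        // "NetCDF: Invalid argument" would be expected to show up.
        std::printf("nc_open(%s) failed: %s\n", path, nc_strerror(status));
        return 1;
    }

    std::printf("nc_open(%s) succeeded\n", path);
    nc_close(ncid);
    return 0;
}
</code>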
**Diagnosis:** What we know so far is that compiling NetCDF with P-NetCDF messes up the nc_open function so that it cannot be used by NEMO. We are not sure whether this problem is produced by NetCDF alone, or whether the netcdf.hpp header file included with XIOS 2.0 is causing the trouble. As stated above, compiling NetCDF ourselves would be needed to learn more about the problem.

**Solution:** Until we have more information, the best solution is to use a NetCDF version that does not have P-NetCDF support (4.2 and 4.4.0 on MN4). In any case XIOS uses NetCDF-4, which relies on HDF5 for parallel writes.

**More information:**

This bug had already been reported on the Unidata GitHub: [[https://github.com/Unidata/netcdf4-python/issues/170]]

==== Issue 2: XIOS crashes when writing model output ====

**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.0
  * HDF5: 1.8.19
  * Flags: -O3, with and without -fpe0

**Problem:** XIOS 2 breaks when writing output files. As a result, output files are incomplete: they contain the headers for each of the variables they are supposed to store, but only the values of nav_lat, nav_lon, depthu and depthu_bounds are actually saved.

__ocean.output__: ocean.output looks normal; the stop happens just after writing the restarts.

__log.err__:

<code>
forrtl: error (65): floating invalid

Image              PC                Routine            Line     Source
nemo.exe           0000000001B2B832  Unknown            Unknown  Unknown
libpthread-2.22.s  00002B2472B71B10  Unknown            Unknown  Unknown
nemo.exe           000000000189CE85  _ZN5blitz5ArrayId  182      fastiter.h
nemo.exe           000000000189BDE9  _ZN5blitz5ArrayId  89       methods.cc
nemo.exe           000000000189BA8C  _ZN4xios13COperat  175      operator_expr.hpp
nemo.exe           000000000188A3BA  _ZN4xios27CFieldF  61       binary_arithmetic_filter.cpp
nemo.exe           00000000018152E4  _ZN4xios7CFilter1  14       filter.cpp
nemo.exe           0000000001563FBE  _ZN4xios9CInputPi  37       input_pin.cpp
nemo.exe           000000000176284F  _ZN4xios10COutput  46       output_pin.cpp
nemo.exe           0000000001762A48  _ZN4xios10COutput  35       output_pin.cpp
nemo.exe           0000000001815364  _ZN4xios7CFilter1  16       filter.cpp
nemo.exe           0000000001563FBE  _ZN4xios9CInputPi  37       input_pin.cpp
nemo.exe           000000000176284F  _ZN4xios10COutput  46       output_pin.cpp
nemo.exe           0000000001762A48  _ZN4xios10COutput  35       output_pin.cpp
nemo.exe           0000000001815364  _ZN4xios7CFilter1  16       filter.cpp
nemo.exe           0000000001563FBE  _ZN4xios9CInputPi  37       input_pin.cpp
nemo.exe           000000000176284F  _ZN4xios10COutput  46       output_pin.cpp
nemo.exe           0000000001762A48  _ZN4xios10COutput  35       output_pin.cpp
nemo.exe           000000000178A260  _ZN4xios13CSource  32       source_filter.cpp
nemo.exe           0000000001812DC7  _ZN4xios6CField7s  21       field_impl.hpp
nemo.exe           00000000014B3586  cxios_write_data_  434      icdata.cpp
nemo.exe           0000000000FFE0FD  idata_mp_xios_sen  552      idata.F90
nemo.exe           0000000000738081  diawri_mp_dia_wri  252      diawri.f90
nemo.exe           0000000000485B67  step_mp_stp_       284      step.f90
nemo.exe           000000000043908B  nemogcm_mp_nemo_g  147      nemogcm.f90
nemo.exe           0000000000438FDD  MAIN__             18       nemo.f90
nemo.exe           0000000000438F9E  Unknown            Unknown  Unknown
libc-2.22.so       00002B247309B6E5  __libc_start_main  Unknown  Unknown
nemo.exe           0000000000438EA9  Unknown            Unknown  Unknown
</code>

**Actions taken:** Given that the error is a floating invalid, we disabled the -fpe0 flag, but we still had the same problem. Then we disabled compiler optimizations (using -O0) and the problem disappeared, but this obviously has an effect on performance.
As a second step we enabled -O2 optimizations and the model runs, so the performance loss is not as severe as it would be with -O0.

Update: Using the new NetCDF 4.2 installed by Ops and the netcdf_par key for XIOS 2, the model can run with -O3. This issue needs further investigation.

**Diagnosis:** For now it is difficult to know the exact source of this problem. Further debugging will be required.

**Solution:** Use the -O2 flag (instead of -O3). However, it now seems to work using NetCDF 4.2, -O3 & the netcdf_par key for XIOS 2.

==== Issue 3: MPI kills XIOS when writing model output ====

**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.0 & 4.2 (after removing P-NetCDF)
  * HDF5: 1.8.19
  * Flags: -O0

**Problem:** The model crashes at one of the steps at which it is supposed to write an output file, with what seems to be an MPI problem. The step of the crash is not always the same, and output files may have been written correctly at previous time steps.

__ocean.output__: The ocean.output file shows no anomalies.

__log.err__:

<code>
s11r1b56.58976Assertion failure at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-10.2.175/ptl_ips/ips_proto.c:1869: (scb->payload_size & 0x3) == 0

forrtl: error (76): Abort trap signal

Image              PC                Routine            Line     Source
nemo.exe           00000000027A1F9A  Unknown            Unknown  Unknown
libpthread-2.22.s  00002AD71847FB10  Unknown            Unknown  Unknown
libc-2.22.so       00002AD7189BD8D7  gsignal            Unknown  Unknown
libc-2.22.so       00002AD7189BECAA  abort              Unknown  Unknown
libpsm2.so.2.1     00002AD73FDD3E6E  Unknown            Unknown  Unknown
libpsm2.so.2.1     00002AD73FDE7D59  Unknown            Unknown  Unknown
libpsm2.so.2.1     00002AD73FDEA3F2  Unknown            Unknown  Unknown
libpsm2.so.2.1     00002AD73FDE419D  Unknown            Unknown  Unknown
libpsm2.so.2.1     00002AD73FDE10D3  Unknown            Unknown  Unknown
libpsm2.so.2.1     00002AD73FDDF93B  Unknown            Unknown  Unknown
libpsm2.so.2.1     00002AD73FDDABF0  psm2_mq_ipeek2     Unknown  Unknown
libtmip_psm2.so.1  00002AD73FBBE2FE  Unknown            Unknown  Unknown
libmpi.so.12.0     00002AD7174E05C1  Unknown            Unknown  Unknown
libmpi.so.12.0     00002AD717364020  Unknown            Unknown  Unknown
libmpi.so.12       00002AD7170E05F2  PMPIDI_CH3I_Progr  Unknown  Unknown
libmpi.so.12       00002AD7174D1BFF  PMPI_Test          Unknown  Unknown
nemo.exe           00000000022E5F14  _ZN4xios13CClient  79       buffer_client.cpp
nemo.exe           00000000018A348B  _ZN4xios14CContex  201      context_client.cpp
nemo.exe           000000000186A6E1  _ZN4xios8CContext  350      context.cpp
nemo.exe           0000000001D262F1  cxios_write_data_  412      icdata.cpp
nemo.exe           0000000001478A79  idata_mp_xios_sen  545      idata.F90
nemo.exe           0000000000BF1463  iom_mp_iom_p2d_    1136     iom.f90
nemo.exe           0000000000810534  diawri_mp_dia_wri  293      diawri.f90
nemo.exe           00000000004A9EFC  step_mp_stp_       284      step.f90
nemo.exe           000000000043A9E9  nemogcm_mp_nemo_g  147      nemogcm.f90
nemo.exe           000000000043A7C6  MAIN__             18       nemo.f90
nemo.exe           000000000043A79E  Unknown            Unknown  Unknown
libc-2.22.so       00002AD7189A96E5  __libc_start_main  Unknown  Unknown
nemo.exe           000000000043A6A9  Unknown            Unknown  Unknown
</code>

**Actions taken:** A similar error was observed with NEMO standalone v3.6r6499. In that case, Ops told us to use the //fabric// module, which selects //ofi// as the internode fabrics, similarly to the solution used on MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the side effect that jobs never ended. In coupled EC-Earth this module produced a deadlock (see Issue 4 below). We tried an alternative solution, which was to __increase the number of XIOS servers__ in order to reduce the number of messages sent to the same process, and for the moment it seems to be effective.
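For reference, the PSM2 assertion quoted in log.err above is simply a 4-byte alignment check on the payload size of each message fragment. The snippet below is a purely illustrative restatement of that check (it is not our code, and the example sizes are hypothetical); the diagnosis that follows builds on this.

<code cpp>
// The failing PSM2 assertion, (scb->payload_size & 0x3) == 0, requires the
// payload size of every message fragment to be a multiple of 4 bytes.
#include <cstdio>

static bool payload_is_word_aligned(unsigned int payload_size)
{
    return (payload_size & 0x3) == 0;   // same test as in ips_proto.c:1869
}

int main()
{
    // Example: a 4096-byte payload passes, a 4097-byte one would trip the assertion.
    std::printf("4096 -> %d, 4097 -> %d\n",
                (int)payload_is_word_aligned(4096),
                (int)payload_is_word_aligned(4097));
    return 0;
}
</code>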
**Diagnosis:** The problem is that at some point the buffer used for the data transfer is not 4-byte aligned, while the library assumes it is. This is a very low-level problem and we do not completely understand the relation between it and reducing the congestion (which is what adding servers achieves), but maybe in the future we can get more information.

**Solution:** For the moment the solution is to use a large enough number of XIOS servers (47 for SR).

**More information:**

About Intel Communication Fabrics control: [[https://software.intel.com/en-us/node/528821]]

ips_proto.c source code: [[https://github.com/01org/psm/blob/master/ptl_ips/ips_proto.c]]

==== Issue 4: EC-Earth enters a deadlock when using the fabric (OFA network fabrics) module ====

**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.0
  * HDF5: 1.8.19
  * Flags: -O0 & -O3

**Problem:** When loading the fabric module, created by Ops to work around the assertion failure of Issue 3 on NEMO standalone, EC-Earth enters a deadlock. Our NEMO standalone benchmark did run, but MPI_Finalize did not return and jobs never finished until the wall-clock time limit was reached.

**Actions taken:** We managed to solve Issues 1 to 3, so there is no need to solve this one for now. However, if we find the time we will debug this problem.

**Diagnosis:**

**Solution:** No solution yet, but the model can work without the fabric module.