NEMO-XIOS issues with Intel MPI

NEMO-XIOS @ MN3

Environment: NEMO 3.6 stable, XIOS 1.0. This bug was documented using the following compilers and MPI libraries:

  • Intel 13.0.1 & Intel MPI 4.1.3.049
  • Intel 16.0.1 & Intel MPI 5.1.2.150

The problem was reported both with the default optimization flags and with the -O3 optimization flag.

Problem: When using more than 1,920 MPI processes (120 MN3 nodes), the simulation fell into a deadlock during the XIOS initialization:

Some of the NEMO clients remain stuck in client.cpp doing an MPI send:

MPI_Send(buff,buffer.count(),MPI_CHAR,serverLeader,1,CXios::globalComm) ;

The XIOS master server (the first XIOS process) remains in the CServer::listenContext(void) routine in server.cpp, trying to dispatch all the messages:

MPI_Iprobe(MPI_ANY_SOURCE,1,CXios::globalComm, &flag, &status) ;

Actions taken: Print statements were placed in the code, before and after the calls mentioned above. They showed that some NEMO processes were blocked in the MPI send (which behaves as a synchronous send), while the XIOS master server was looping indefinitely, trying to collect all the messages (the total number of messages should equal the number of clients, i.e. NEMO processes).

The error could be reproduced with a small standalone code, in order to make debugging and fixing easier; a minimal sketch of such a code is shown below.
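The following is an illustrative sketch only, not the actual test code: it mimics the client-side MPI_Send and the server-side MPI_Iprobe dispatch loop, using MPI_COMM_WORLD instead of CXios::globalComm.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0)                      // plays the role of the XIOS master server
  {
    int received = 0;
    while (received < size - 1)       // one message expected per client
    {
      int flag = 0;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &flag, &status);
      if (flag)
      {
        int count;
        MPI_Get_count(&status, MPI_CHAR, &count);
        std::vector<char> buff(count);
        MPI_Recv(buff.data(), count, MPI_CHAR, status.MPI_SOURCE, 1,
                 MPI_COMM_WORLD, &status);
        ++received;
        printf("server: got message %d/%d from rank %d\n",
               received, size - 1, status.MPI_SOURCE);
      }
    }
  }
  else                                // plays the role of a NEMO client
  {
    char msg[128] = "context registration";
    MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 1, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}

With a small number of ranks this program finishes immediately; the hang described above only showed up at scale (more than 1,920 processes on MN3).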

Diagnosis: It seemed that some messages sent from the clients to the master server were lost, maybe because all of these messages were sent from all the nodes at the same time.

Solution: Our first workaround was to insert a call to the sleep function before the MPI_Send in the clients' code, to interleave the outgoing messages and avoid flooding the buffers and the network. This is obviously not the cleanest solution, because it introduces a delay of up to 15 seconds in the execution, but it is an affordable approach, given that this code is only executed during the initialization.

char hostName[50];
gethostname(hostName, 50);

// Sleep fix: stagger the client sends by up to 15 s, depending on the MPI rank
sleep(rank % 16);

// Attach an error handler to the global communicator so that a failing send
// is reported (the handler eh is defined elsewhere in client.cpp)
MPI_Comm_create_errhandler(eh, &newerr);
MPI_Comm_set_errhandler(CXios::globalComm, newerr);

error_code = MPI_Send(buff, buffer.count(), MPI_CHAR, serverLeader, 1, CXios::globalComm);

delete [] buff;

BSC operations provided another solution: enabling the DAPL User Datagram (UD) mode through Intel MPI environment variables (more information below). This alternative works and does not need any code modification, but it entails a performance penalty: we observed that simulations using this option became increasingly slower (5%-20%) as the number of cores increased, compared with the reference runs.

I_MPI_DAPL_UD=on 

More information:

This bug was reported in the XIOS portal:

http://forge.ipsl.jussieu.fr/ioserver/ticket/90

About Intel Communication Fabrics control:

https://software.intel.com/en-us/node/528821

DAPL UD-capable Network Fabrics Control:

https://software.intel.com/en-us/node/528824

NEMO-XIOS @ MN4

Issue 1:

Environment: Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  • Compiler: Intel 2017.4
  • MPI: Intel 2017.3.196
  • NetCDF: 4.4.4.1 & 4.2
  • HDF5: 1.8.19
  • Flags: -O3 & -O0

Problem: NEMO crashes in the initialization when reading input files:

ocean.output:

 ===>>> : E R R O R
         ===========

 iom_nf90_check : NetCDF: Invalid argument
                     iom_nf90_open ~~~
                     iom_nf90_open ~~~ open existing file: ./weights_WOA13d1_2_orca1_bilinear.nc in READ mode

 ===>>> : E R R O R
         ===========

 iom_nf90_check : NetCDF: Invalid argument
                     iom_nf90_open ~~~

 ===>>> : E R R O R
         ===========

     fld_weight : unable to read the file

log.err:

forrtl: severe (408): fort: (7): Attempt to use pointer FLY_DTA when it is not associated with a target

Image              PC                Routine            Line        Source
nemo.exe           00000000027994F6  Unknown               Unknown  Unknown
nemo.exe           0000000000B3F219  fldread_mp_fld_in        1375  fldread.f90
nemo.exe           0000000000B15D4E  fldread_mp_fld_ge         614  fldread.f90
nemo.exe           0000000000B13A6B  fldread_mp_fld_in         413  fldread.f90
nemo.exe           0000000000B0A69B  fldread_mp_fld_re         175  fldread.f90
nemo.exe           0000000000978301  dtatsd_mp_dta_tsd         224  dtatsd.f90
nemo.exe           0000000000C312DF  istate_mp_istate_         196  istate.f90
nemo.exe           000000000043C33F  nemogcm_mp_nemo_i         326  nemogcm.f90
nemo.exe           000000000043A64D  nemogcm_mp_nemo_g         120  nemogcm.f90
nemo.exe           000000000043A606  MAIN__                     18  nemo.f90
nemo.exe           000000000043A5DE  Unknown               Unknown  Unknown
libc-2.22.so       00002B64D88596E5  __libc_start_main     Unknown  Unknown
nemo.exe           000000000043A4E9  Unknown               Unknown  Unknown

Actions taken: Operations had observed this error using NEMO standalone and NetCDF 4.4.4.1, so they installed NetCDF 4.4.0. We could not reproduce the failure they reported with NEMO standalone and NetCDF 4.4.4.1, but we got the same error when running EC-Earth, so we also moved to NetCDF 4.4.0. However, we then got other XIOS errors when writing outputs (commented in the following issues), and we asked for the same NetCDF version we were using at MN3 to be installed. When operations installed version 4.2, we surprisingly got the same error again.

Diagnosis:

Solution:

More information:

This bug had already been reported on the Unidata GitHub:

https://github.com/Unidata/netcdf4-python/issues/170

Issue 2:

Environment: Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  • Compiler: Intel 2017.4
  • MPI: Intel 2017.3.196
  • NetCDF: 4.4.0
  • HDF5: 1.8.19
  • Flags: -O3, with and without -fpe0

Problem: XIOS 2 crashes when writing output files. As a result, the output files are incomplete: they contain the headers for each of the variables they are supposed to store, but only the values of nav_lat, nav_lon, depthu and depthu_bounds are actually saved.

ocean.output: The ocean.output file looks normal; the stop happens just after writing the restarts.

log.err:

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
nemo.exe           0000000001B2B832  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B2472B71B10  Unknown               Unknown  Unknown
nemo.exe           000000000189CE85  _ZN5blitz5ArrayId         182  fastiter.h
nemo.exe           000000000189BDE9  _ZN5blitz5ArrayId          89  methods.cc
nemo.exe           000000000189BA8C  _ZN4xios13COperat         175  operator_expr.hpp
nemo.exe           000000000188A3BA  _ZN4xios27CFieldF          61  binary_arithmetic_filter.cpp
nemo.exe           00000000018152E4  _ZN4xios7CFilter1          14  filter.cpp
nemo.exe           0000000001563FBE  _ZN4xios9CInputPi          37  input_pin.cpp
nemo.exe           000000000176284F  _ZN4xios10COutput          46  output_pin.cpp
nemo.exe           0000000001762A48  _ZN4xios10COutput          35  output_pin.cpp
nemo.exe           0000000001815364  _ZN4xios7CFilter1          16  filter.cpp
nemo.exe           0000000001563FBE  _ZN4xios9CInputPi          37  input_pin.cpp
nemo.exe           000000000176284F  _ZN4xios10COutput          46  output_pin.cpp
nemo.exe           0000000001762A48  _ZN4xios10COutput          35  output_pin.cpp
nemo.exe           0000000001815364  _ZN4xios7CFilter1          16  filter.cpp
nemo.exe           0000000001563FBE  _ZN4xios9CInputPi          37  input_pin.cpp
nemo.exe           000000000176284F  _ZN4xios10COutput          46  output_pin.cpp
nemo.exe           0000000001762A48  _ZN4xios10COutput          35  output_pin.cpp
nemo.exe           000000000178A260  _ZN4xios13CSource          32  source_filter.cpp
nemo.exe           0000000001812DC7  _ZN4xios6CField7s          21  field_impl.hpp
nemo.exe           00000000014B3586  cxios_write_data_         434  icdata.cpp
nemo.exe           0000000000FFE0FD  idata_mp_xios_sen         552  idata.F90
nemo.exe           0000000000738081  diawri_mp_dia_wri         252  diawri.f90
nemo.exe           0000000000485B67  step_mp_stp_              284  step.f90
nemo.exe           000000000043908B  nemogcm_mp_nemo_g         147  nemogcm.f90
nemo.exe           0000000000438FDD  MAIN__                     18  nemo.f90
nemo.exe           0000000000438F9E  Unknown               Unknown  Unknown
libc-2.22.so       00002B247309B6E5  __libc_start_main     Unknown  Unknown
nemo.exe           0000000000438EA9  Unknown               Unknown  Unknown

Actions taken: Given that the error is a floating-point invalid, we first disabled the -fpe0 flag, but we still had the same problem. Then we disabled compiler optimizations (using -O0) and the problem disappeared, although this obviously has an impact on performance.

Diagnosis:

Solution: Disable compiler optimizations (compile with -O0).

Issue 3:

Environment: Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  • Compiler: Intel 2017.4
  • MPI: Intel 2017.3.196
  • NetCDF: 4.4.0
  • HDF5: 1.8.19
  • Flags: -O0

Problem: The model crashes at one of the steps at which it is supposed to write an output file, with what appears to be an MPI problem. The step at which the crash occurs is not always the same, and output files are written successfully at previous timesteps.

ocean.output: The ocean.output file presents no anomalies.

log.err:

s11r1b56.58976Assertion failure at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-10.2.175/ptl_ips/ips_proto.c:1869: (scb->payload_size & 0x3) == 0
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
nemo.exe           00000000027A1F9A  Unknown               Unknown  Unknown
libpthread-2.22.s  00002AD71847FB10  Unknown               Unknown  Unknown
libc-2.22.so       00002AD7189BD8D7  gsignal               Unknown  Unknown
libc-2.22.so       00002AD7189BECAA  abort                 Unknown  Unknown
libpsm2.so.2.1     00002AD73FDD3E6E  Unknown               Unknown  Unknown
libpsm2.so.2.1     00002AD73FDE7D59  Unknown               Unknown  Unknown
libpsm2.so.2.1     00002AD73FDEA3F2  Unknown               Unknown  Unknown
libpsm2.so.2.1     00002AD73FDE419D  Unknown               Unknown  Unknown
libpsm2.so.2.1     00002AD73FDE10D3  Unknown               Unknown  Unknown
libpsm2.so.2.1     00002AD73FDDF93B  Unknown               Unknown  Unknown
libpsm2.so.2.1     00002AD73FDDABF0  psm2_mq_ipeek2        Unknown  Unknown
libtmip_psm2.so.1  00002AD73FBBE2FE  Unknown               Unknown  Unknown
libmpi.so.12.0     00002AD7174E05C1  Unknown               Unknown  Unknown
libmpi.so.12.0     00002AD717364020  Unknown               Unknown  Unknown
libmpi.so.12       00002AD7170E05F2  PMPIDI_CH3I_Progr     Unknown  Unknown
libmpi.so.12       00002AD7174D1BFF  PMPI_Test             Unknown  Unknown
nemo.exe           00000000022E5F14  _ZN4xios13CClient          79  buffer_client.cpp
nemo.exe           00000000018A348B  _ZN4xios14CContex         201  context_client.cpp
nemo.exe           000000000186A6E1  _ZN4xios8CContext         350  context.cpp
nemo.exe           0000000001D262F1  cxios_write_data_         412  icdata.cpp
nemo.exe           0000000001478A79  idata_mp_xios_sen         545  idata.F90
nemo.exe           0000000000BF1463  iom_mp_iom_p2d_          1136  iom.f90
nemo.exe           0000000000810534  diawri_mp_dia_wri         293  diawri.f90
nemo.exe           00000000004A9EFC  step_mp_stp_              284  step.f90
nemo.exe           000000000043A9E9  nemogcm_mp_nemo_g         147  nemogcm.f90
nemo.exe           000000000043A7C6  MAIN__                     18  nemo.f90
nemo.exe           000000000043A79E  Unknown               Unknown  Unknown
libc-2.22.so       00002AD7189A96E5  __libc_start_main     Unknown  Unknown
nemo.exe           000000000043A6A9  Unknown               Unknown  Unknown

Actions taken: A similar error was observed with NEMO standalone v3.6r6499. In that case, operations told us to use the fabric module, which selects ofi as the inter-node fabric, similar to the solution used on MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the side effect that jobs never finished. In coupled EC-Earth this module produced a deadlock, commented on below.

Diagnosis:

Solution:

About Intel Communication Fabrics control:

https://software.intel.com/en-us/node/528821

Issue 4:

Environment: Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).

  • Compiler: Intel 2017.4
  • MPI: Intel 2017.3.196
  • NetCDF: 4.4.0
  • HDF5: 1.8.19
  • Flags: -O0 & -O3

Problem:

Actions taken:

Diagnosis:

Solution:
