====== NEMO-XIOS Intel MPI issues ======

===== NEMO-XIOS @ MN3 =====

**Environment:**

  * Intel 13.0.1 & Intel MPI 4.1.3.049
  * Intel 16.0.1 & Intel MPI 5.1.2.150

The problem was reported both with the default optimization flags and with the -O3 optimization flag.

**Problem:**

Some of the __NEMO clients__ remain stuck in client.cpp doing an MPI send:

<code>
MPI_Send(buff, ...);
</code>

The __XIOS master server__ (the first XIOS process) remains inside CServer, looping on an MPI_Iprobe:

<code>
MPI_Iprobe(MPI_ANY_SOURCE, ...);
</code>

**Actions taken:** Prints were placed in the code before and after the calls above. They showed that some NEMO processes were waiting in the MPI barrier (synchronous send), while the XIOS master server was looping indefinitely, trying to collect all the messages (the total number of messages should equal the number of clients, i.e. NEMO processes).

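As an illustration of this kind of instrumentation, the sketch below shows a server-side probe loop with progress prints around the receive. It is not the actual XIOS code: the function name waitForClients, the nbClients counter and the message tag are assumptions.

<code>
#include <mpi.h>
#include <cstdio>

// Illustration only: wait for one registration message per NEMO client,
// printing progress so that a hang can be located (names are hypothetical).
void waitForClients(int nbClients, MPI_Comm comm)
{
  int received = 0;
  while (received < nbClients)
  {
    int flag = 0;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, 1, comm, &flag, &status);   // tag 1 assumed
    if (flag)
    {
      int count = 0;
      MPI_Get_count(&status, MPI_CHAR, &count);
      std::printf("server: receiving message %d/%d from rank %d (%d bytes)\n",
                  received + 1, nbClients, status.MPI_SOURCE, count);
      char* buff = new char[count];
      MPI_Recv(buff, count, MPI_CHAR, status.MPI_SOURCE, 1, comm, MPI_STATUS_IGNORE);
      delete[] buff;
      ++received;
      std::printf("server: message %d/%d processed\n", received, nbClients);
    }
  }
}
</code>

With prints like these it is immediately visible how many client messages the master server has actually processed when the run hangs.
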
The error could be reproduced in a [[/...]].

**Diagnosis:**

**Solution:** A workaround was applied in XIOS (client.cpp), staggering the clients with a short sleep before the send (the "sleep fix"):

<code>
char hostName[50];
gethostname(hostName, ...);

// Sleep Fix
sleep(rank%16);

MPI_Comm_create_errhandler(eh, ...);
MPI_Comm_set_errhandler(CXios::..., ...);

error_code = MPI_Send(buff, ...);

delete [] buff ;
</code>

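For reference, below is a minimal, self-contained sketch of the same workaround, not the exact XIOS code: each client sleeps rank % 16 seconds before its registration send, and the communicator error handler is set to MPI_ERRORS_RETURN (a simplification of the custom handler created above) so that a failed send can be logged instead of aborting. The function name sendRegistration and its arguments are assumptions.

<code>
#include <mpi.h>
#include <unistd.h>
#include <cstdio>

// Sketch of the "sleep fix": stagger the clients so the XIOS master server is
// not hit by all registration messages at once (names are hypothetical).
int sendRegistration(char* buff, int count, int serverLeader, MPI_Comm comm)
{
  int rank = 0;
  MPI_Comm_rank(comm, &rank);

  sleep(rank % 16);                                  // stagger the clients

  // Report errors instead of aborting (simplified error handler).
  MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

  int error_code = MPI_Send(buff, count, MPI_CHAR, serverLeader, 1, comm);
  if (error_code != MPI_SUCCESS)
  {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(error_code, msg, &len);
    std::fprintf(stderr, "rank %d: MPI_Send failed: %s\n", rank, msg);
  }

  delete[] buff;   // mirrors the original snippet, which frees the buffer after the send
  return error_code;
}
</code>

The idea is simply to spread the synchronous sends over time so that the master server's receive queue is not flooded by all clients at once.
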
BSC Operations provided another solution: enabling the __User Datagram__ protocol via Intel MPI's I_MPI_DAPL_UD environment variable:

<code>
I_MPI_DAPL_UD=on
</code>

**More information:**

This bug was reported in the XIOS portal:

[[http://...]]

About Intel Communication Fabrics control:

[[https://...]]

DAPL UD-capable Network Fabrics Control:

[[https://...]]

===== NEMO-XIOS @ MN4 =====

==== Issue 1: NEMO fails to read input files ====

**Environment:**

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.4.1 & 4.2
  * HDF5: 1.8.19
  * Flags: -O3 & -O0

**Problem:** NEMO fails while reading its input files.

__ocean.output__:

__log.err__:

<code>
forrtl: severe (408): fort: (7): Attempt to use pointer FLY_DTA when it is not associated with a target

Image              PC        Routine    Line    Source
nemo.exe           ...       ...        ...     ...     (repeated frames, addresses truncated)
libc-2.22.so       ...       ...        ...     ...
nemo.exe           ...       ...        ...     ...
</code>

**Actions taken:** BSC Ops had previously observed this error with standalone NEMO and NetCDF 4.4.4.1, so they installed NetCDF 4.4.0. We could not reproduce the failure they reported when running NEMO with NetCDF 4.4.4.1, but we did get the same error when running EC-Earth. We therefore also moved to NetCDF 4.4.0 and the error no longer appeared. However, we then hit other XIOS errors when writing the output (described in the following issues), so we asked for the NetCDF version we had been using at MN3 to be installed. Surprisingly, when Operations installed version 4.2 we got the same error again.

After comparing the NetCDF 4.4.0 and NetCDF 4.2 configurations (using the nc-config and nf-config commands), we found that NetCDF 4.4.0 had been built without support for netCDF-4 or P-NetCDF (a library that provides parallel I/O for classic NetCDF files), whereas NetCDF 4.2 had been installed with support for these features. We reported this to Ops, who installed __NetCDF without linking to P-NetCDF__, and this seemed to fix the problem.

In order to learn more about the source of this bug, we __compared the behavior of two NEMO executables__.

We first did a __comparison of the functions included in both binaries__ with the nm command and found that __they were identical__. We then did a __more in-depth comparison of the binaries__ with objdump and found only small differences.

**Diagnosis:**

**Solution:** Use a NetCDF installation that is not linked against P-NetCDF.

**More information:**

This bug had already been reported in the Unidata GitHub:

[[https://...]]

==== Issue 2: XIOS crashes when writing model output ====

**Environment:**

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.0
  * HDF5: 1.8.19
  * Flags: -O3, with and without -fpe0

**Problem:** XIOS crashes while writing the model output.

__ocean.output__:

__log.err__:

<code>
forrtl: error (65): floating invalid
Image              PC        Routine    Line    Source
nemo.exe           ...       ...        ...     ...
libpthread-2.22.so ...       ...        ...     ...
nemo.exe           ...       ...        ...     ...     (repeated frames, addresses truncated)
libc-2.22.so       ...       ...        ...     ...
nemo.exe           ...       ...        ...     ...
</code>

**Actions taken:** Since the error is a floating-point invalid, we first disabled the -fpe0 flag, but the problem persisted. We then disabled compiler optimizations (-O0) and the problem disappeared.

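For context on what //forrtl: error (65): floating invalid// means: when floating-point exception trapping is enabled (which is what -fpe0 requests from the Intel Fortran runtime), the first invalid operation, e.g. 0.0/0.0 or arithmetic on an uninitialised variable, aborts the run. The stand-alone snippet below is not NEMO/XIOS code; it just reproduces that behaviour with glibc's feenableexcept.

<code>
#include <fenv.h>    // feenableexcept (glibc extension); compile e.g. with: g++ fpe.cpp -lm
#include <cstdio>

int main()
{
  // Trap invalid floating-point operations, similar to what -fpe0 does for Fortran code.
  feenableexcept(FE_INVALID);

  volatile double zero = 0.0;
  double x = zero / zero;      // 0/0 is an invalid operation -> the program aborts here

  std::printf("never reached: %f\n", x);
  return 0;
}
</code>
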
Update: Using the new NetCDF 4.2 installed by Ops and the netcdf_par key for XIOS2, the model can run with -O3. This issue needs further investigation.

**Diagnosis:**

**Solution:**

==== Issue 3: MPI kills XIOS when writing model output ====

**Environment:**

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.0 & 4.2 (after P-NetCDF was removed)
  * HDF5: 1.8.19
  * Flags: -O0

**Problem:** MPI kills XIOS while the model output is being written.

__ocean.output__:

__log.err__:

<code>
s11r1b56.58976Assertion failure at /...
forrtl: error (76): Abort trap signal
Image              PC        Routine    Line    Source
nemo.exe           ...       ...        ...     ...
libpthread-2.22.so ...       ...        ...     ...
libc-2.22.so       ...       ...        ...     ...
libpsm2.so.2       ...       ...        ...     ...     (repeated frames, addresses truncated)
libtmip_psm2.so.1  ...       ...        ...     ...
libmpi.so.12       ...       ...        ...     ...
nemo.exe           ...       ...        ...     ...
libc-2.22.so       ...       ...        ...     ...
nemo.exe           ...       ...        ...     ...
</code>

**Actions taken:** A similar error was observed with standalone NEMO v3.6r6499. In that case, Ops told us to load the //fabric// module, which selects //ofi// as the internode fabric, similar to the solution used at MN3 (see above). Using this module solved the problem for standalone NEMO, although it had the side effect that jobs never finished. In coupled EC-Earth this module produced a deadlock, described below (Issue 4).

We tried an alternative solution: __increasing the number of XIOS servers__ in order to reduce the number of messages sent to the same process. For the moment this seems to be effective.

**Diagnosis:**

**Solution:**

About Intel Communication Fabrics control:

[[https://...]]

Ips_proto.c source code:

[[https://...]]

==== Issue 4: EC-Earth enters a deadlock when using the fabric (OFA network fabrics) module ====

**Environment:**

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.0
  * HDF5: 1.8.19
  * Flags: -O0 & -O3

**Problem:** EC-Earth hangs in a deadlock when the //fabric// module is used.

**Actions taken:** We managed to solve Issues 1 to 3, so there is no need to solve this one for now. However, if we find the time we will debug this problem.

**Diagnosis:**

**Solution:**