====== XIOS I/O server ======
===== NEMO-XIOS @ MN3 =====
**Environment:**
===== NEMO-XIOS @ MN4 =====
==== Issue 1: NEMO fails to read input files ====
**Environment:**
</code>
**Actions taken:**
After looking for differences between the NetCDF 4.4.0 and NetCDF 4.2 configurations (using the nc-config and nf-config commands), we found that NetCDF 4.4.0 had been compiled without support for nc4 or P-NetCDF (a library that provides parallel I/O support for classic NetCDF files), while NetCDF 4.2 had been installed with support for these features. We reported this to Ops, who installed __NetCDF without linking to P-NetCDF__, and this seemed to fix the problem.
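The configuration check described above can be sketched with the NetCDF helper commands (run once per loaded NetCDF module; the reported values depend on the installation):

```shell
# nc-config reports how the NetCDF C library was built:
#   --has-nc4     -> NetCDF-4/HDF5 support compiled in?
#   --has-pnetcdf -> linked against P-NetCDF for parallel classic I/O?
if command -v nc-config >/dev/null 2>&1; then
  echo "nc4:     $(nc-config --has-nc4)"
  echo "pnetcdf: $(nc-config --has-pnetcdf)"
  # nf-config gives the equivalent report for the Fortran library
  nf-config --all | head -n 20
else
  echo "nc-config not available in this environment"
fi
```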
In order to learn more about the source of this bug, we __compared the behavior of two NEMO executables__, one linked against each NetCDF installation.

We did a __comparison of the functions included in both binaries__ with the nm command and found that __they were identical__. We then did a __more in-depth comparison of both binaries__ with objdump and found small differences.
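The nm/objdump comparison can be sketched as follows (the binary names are hypothetical stand-ins for the two NEMO builds):

```shell
# Hypothetical paths to the two NEMO builds being compared.
A=./nemo_netcdf42.exe
B=./nemo_netcdf440.exe

if [ -f "$A" ] && [ -f "$B" ]; then
  # nm lists the symbols in each executable; identical sorted lists mean
  # the same set of functions ended up in both binaries.
  diff <(nm "$A" | awk '{print $NF}' | sort -u) \
       <(nm "$B" | awk '{print $NF}' | sort -u) \
    && echo "symbol tables identical"
  # objdump -d disassembles both binaries for an instruction-level diff.
  diff <(objdump -d "$A") <(objdump -d "$B") | head -n 40
else
  echo "binaries not found; run where both NEMO builds exist"
fi
```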

**Diagnosis:**

**Solution:**
**More information:**
This bug had already been reported in the Unidata GitHub repository:
[[https:// ]]
==== Issue 2: XIOS crashes when writing model output ====
**Environment:**
  * NetCDF: 4.4.0
  * HDF5: 1.8.19
  * Flags: -O3, with and without -fpe0
**Problem:**

__ocean.output__:

__log.err__:
<code>
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
nemo.exe           ...
libpthread-2.22.so ...
nemo.exe           ...
(remaining nemo.exe frames truncated)
libc-2.22.so       ...
nemo.exe           ...
</code>
**Actions taken:** Given that the error is a floating invalid, we disabled the -fpe0 flag, but we still had the same problem. We then disabled compiler optimizations (using -O0) and the problem disappeared.

**Diagnosis:**

**Solution:** Use the -O2 flag (instead of -O3).
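Applying the fix means lowering the optimization level in the compiler-flags file of the build configuration; a sketch, where the arch-file name is hypothetical:

```shell
# Replace -O3 with -O2 in the (hypothetical) NEMO/XIOS arch file.
f=arch-ifort_mn4.fcm
if [ -f "$f" ]; then
  sed -i 's/-O3/-O2/g' "$f"
  grep -n 'O2' "$f"
else
  echo "arch file not found; edit the flags line by hand: -O3 -> -O2"
fi
```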
==== Issue 3: MPI kills XIOS when writing model output ====
**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).
  * Flags: -O0
**Problem:** The model crashes at one of the steps at which it is supposed to write an output file, with what seems to be an MPI problem. The crash does not always happen at the same step, and output files may have been written successfully at previous timesteps.

__ocean.output__: The ocean.output file presents no anomalies.

__log.err__:
<code>
s11r1b56.58976 Assertion failure at / (message truncated)
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
nemo.exe           ...
libpthread-2.22.so ...
libc-2.22.so       ...
libpsm2.so.2.1     ...
(several libpsm2.so.2.1 frames truncated)
libtmip_psm2.so.1  ...
libmpi.so.12       ...
(several libmpi.so.12 frames truncated)
nemo.exe           ...
(remaining nemo.exe frames truncated)
libc-2.22.so       ...
nemo.exe           ...
</code>
**Actions taken:** A similar error was observed with NEMO standalone v3.6r6499. In that case, Ops told us to use the //fabric// module, which selects //ofi// as the internode fabric, similarly to the solution used in MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the side effect that jobs never ended. In coupled EC-Earth this module produced a deadlock, commented on below.
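For reference, the fabric selection that the //fabric// module performs is driven by Intel MPI environment variables; a minimal sketch (the exact values are an assumption based on the //ofi// choice mentioned above):

```shell
# Assumption: mimic the "fabric" module by selecting shared memory inside
# a node and OFI between nodes.
export I_MPI_FABRICS=shm:ofi
# Ask Intel MPI to print which fabric was actually selected at startup.
export I_MPI_DEBUG=5
echo "I_MPI_FABRICS=$I_MPI_FABRICS"
```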

We tried an alternative solution, which was to __increase the number of XIOS servers__ in order to reduce the number of messages sent to the same process, and for the moment it seems to be effective.
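Increasing the number of XIOS servers amounts to changing the MPMD launch line in the job script; a fragment with illustrative process counts (the executable names follow the usual NEMO/XIOS convention and may differ in the actual experiment):

```shell
# MPMD launch: NEMO client ranks first, then the XIOS server ranks.
# Raising the XIOS count (e.g. from 2 to 8) spreads the client messages
# over more server processes.
mpirun -np 512 ./nemo.exe : -np 8 ./xios_server.exe
```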

**Diagnosis:**

**Solution:**
About Intel Communication Fabrics control:

[[https:// ]]

ips_proto.c source code:

[[https:// ]]
==== Issue 4: ====
**Environment:**

  * Compiler: Intel 2017.4
  * MPI: Intel 2017.3.196
  * NetCDF: 4.4.0
  * HDF5: 1.8.19
  * Flags: -O0 & -O3
**Problem:**

**Actions taken:**