library:computing:xios_impi_troubles — revisions 2017/08/04 13:14 (84.88.184.232) and 2017/08/11 11:11 (mcastril)
====== NEMO-XIOS Intel MPI issues ======
===== NEMO-XIOS @ MN3 =====
**Environment:**
  * Intel 13.0.1 & Intel MPI 4.1.3.049
</code>
**Actions taken:**
After looking for differences between the NetCDF 4.4.0 and NetCDF 4.2 configurations (using the nc-config and nf-config commands), we found out that NetCDF 4.4.0 was compiled with no support for nc4 nor for P-NetCDF (a library that gives parallel I/O support for classic NetCDF files), whereas NetCDF 4.2 was.
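The configuration comparison described above can be reproduced with the query tools shipped with NetCDF; a minimal sketch (the exact set of supported flags varies between NetCDF versions):

```
# Report which optional features each NetCDF install was built with
nc-config --version --has-nc4 --has-pnetcdf
nf-config --version --has-nc4

# Full dump of one install's build configuration
nc-config --all
```

Running these against each installation's bin directory makes differences such as missing nc4 or P-NetCDF support immediately visible.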
In order to know more about the source of this bug, we __compared the two installations__. We also did a __comparison__ of their configurations.
**Diagnosis:**
**Solution:**
**More information:**
[[https://...]]
==== Issue 2: XIOS crashes when writing model output ====
**Environment:**
**Actions taken:** Given that the error is a floating-point invalid, we disabled the -fpe0 flag, but we were still having the same problem. Then we disabled compiler optimizations (using -O0) and the problem disappeared.
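These flag changes are made in the Fortran compiler-flag line of the build's arch file; a hypothetical excerpt (file and variable names follow the usual NEMO/XIOS FCM convention, not necessarily this machine's actual file):

```
# arch-X64_MN3.fcm (illustrative): FPE trapping removed, optimization lowered
# before: %FCFLAGS  -r8 -O3 -fpe0 -traceback
%FCFLAGS        -r8 -O0 -traceback
```

Recompiling after such a change is required for it to take effect.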
Update: using the new NetCDF 4.2 installed by Ops and the netcdf_par key for XIOS 2, the model can run with -O3. We have to investigate this issue further.
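Switching XIOS 2 to a parallel-capable NetCDF is selected at build time; a hedged sketch of the rebuild command (the arch name is a placeholder for the machine's actual arch file):

```
# Rebuild XIOS against a parallel-capable NetCDF build
./make_xios --arch YOUR_ARCH --netcdf_lib netcdf4_par --job 8
```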
+ | |||
**Diagnosis:**
**Solution:**
==== Issue 3: MPI kills XIOS when writing model output ====
**Environment:**
  * Flags: -O0
**Problem:**
__ocean.output__:
**Actions taken:** A similar error was observed with NEMO standalone v3.6r6499. In that case, Ops told us to use the //fabric// module, which selects //ofi// as the internode fabric, similar to the solution used on MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the collateral effect that jobs were never ending. In coupled EC-Earth this module produced a deadlock, commented on below.
We tried an alternative solution, which was to __increase the number of XIOS servers__ in order to reduce the number of messages sent to the same process; for the moment it seems to be effective.
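The number of XIOS servers is controlled by the MPMD launch line once detached-server mode is enabled in iodef.xml; a minimal sketch (rank counts are illustrative, not this experiment's actual configuration):

```
<!-- iodef.xml: run XIOS as detached server processes -->
<variable id="using_server" type="bool">true</variable>
```

The corresponding hypothetical launch line would then raise the server count, e.g. `mpirun -np 512 ./nemo.exe : -np 8 ./xios_server.exe` to go from 2 to 8 dedicated XIOS ranks, spreading the incoming client messages over more processes.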
**Diagnosis:**
+ | |||
**Solution:**
About Intel Communication Fabrics control:
[[https://...]]
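The fabric selection performed by the //fabric// module can also be done directly through Intel MPI's environment control; a sketch (the right value depends on the interconnect, and //ofa// vs //ofi// availability depends on the Intel MPI version):

```shell
# Explicitly select shared memory intra-node and OFA inter-node fabrics
export I_MPI_FABRICS=shm:ofa
# Fail instead of silently falling back to another fabric
export I_MPI_FALLBACK=0
```

Setting these in the job script makes the chosen fabric explicit and reproducible instead of relying on Intel MPI's automatic selection.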
Ips_proto.c source code:
+ | |||
[[https://...]]
==== Issue 4: EC-Earth enters a deadlock when using the fabric (OFA network fabrics) module ====
**Environment:**
  * Flags: -O0 & -O3
**Problem:**
**Actions taken:**
**Diagnosis:**
**Solution:**