library:computing:xios_impi_troubles

Differences

This shows you the differences between two versions of the page.

library:computing:xios_impi_troubles [2026/05/15 06:43]
84.88.52.107 old revision restored (2025/11/08 08:27)
library:computing:xios_impi_troubles [2026/06/15 22:20] (current)
84.88.52.107 old revision restored (2026/05/11 00:51)
Line 70: Line 70:
 ===== NEMO-XIOS @ MN4 ===== ===== NEMO-XIOS @ MN4 =====
  
-==== Issue 1: ====+==== Issue 1: NEMO fails to read input files ====
  
 **Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).  **Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification). 
Line 126: Line 126:
 </code> </code>
  
-**Actions taken:** Operations had observed this error using NEMO standalone and NeTCDF 4.4.4.1, so they installed NetCDF 4.4.0 version. We could not reproduce the failure they reported with NEMO and NetCDF 4.4.4.1 but we got the same error when running EC-Earth. So we also moved to NetCDF 4.4.0. However, we got other XIOS errors when writing outputs (commented in following issues) and we asked for the same version we were using at MN3 to be installed. When operations installed 4.2 version we surprisingly got again the same error.+**Actions taken:** Operations had observed this error using NEMO standalone with NetCDF 4.4.4.1, so they installed NetCDF 4.4.0. We could not reproduce the failure they had reported when running NEMO with NetCDF 4.4.4.1, but we got the same error when running EC-Earth. So we also moved to NetCDF 4.4.0, and the error no longer appeared. However, we got other XIOS errors when writing outputs (discussed in the following issues), so we asked for the same version we were using at MN3 to be installed. When operations installed version 4.2, we surprisingly got the same error again.
  
-**Diagnosis:**+After looking for differences between the NetCDF 4.4.0 and NetCDF 4.2 configurations (using the nc-config and nf-config commands), we found that NetCDF 4.4.0 was compiled without support for nc4 or P-NetCDF (a library that provides parallel I/O support for classic NetCDF files), while NetCDF 4.2 supported these features. Operations then recompiled __NetCDF without linking P-NetCDF__, and this seemed to fix the problem.
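
The configuration comparison described above can be sketched as a small script. This is only an illustration, not the procedure operations actually followed: it assumes each NetCDF installation ships its own ''nc-config'' and that ''nc-config --all'' prints its feature summary as ''--flag -> value'' lines. The function names are hypothetical.

```python
import subprocess


def parse_features(text):
    """Turn `--flag -> value` lines into a dict; other lines are ignored."""
    features = {}
    for line in text.splitlines():
        if "->" not in line:
            continue
        key, _, value = line.partition("->")
        key = key.strip()
        if key.startswith("--"):
            features[key] = value.strip()
    return features


def diff_features(a, b):
    """Return {flag: (value_a, value_b)} for every flag whose value differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


def read_nc_config(nc_config_path):
    """Run one installation's nc-config and parse its feature summary."""
    result = subprocess.run(
        [nc_config_path, "--all"], capture_output=True, text=True, check=True
    )
    return parse_features(result.stdout)
```

Calling ''diff_features(read_nc_config(path_a), read_nc_config(path_b))'' with the two installations' ''nc-config'' paths (placeholders here) lists only the settings that differ, so a difference in ''--has-pnetcdf'' or ''--has-nc4'' stands out immediately.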
  
-**Solution:**+To learn more about the source of this bug, we __compared the behavior of two NEMO executables__, one compiled against NetCDF with P-NetCDF support and one without. Both were run against a NetCDF library without P-NetCDF support. The result is that the __NEMO compiled with P-NetCDF did not run__, even though the library used at runtime did not include it, so something was wrong in the NEMO binary itself.
 + 
 +We did a __comparison of the functions included in both binaries__ with the nm command and found that __they were identical__. We then did a __more in-depth comparison of both binaries__ with objdump and found small differences, some of which __pointed to an XIOS header file called netcdf.hpp__. This header includes some NetCDF function definitions, and its behavior depends on the environment (preprocessor macros). To find out whether this file is responsible for the bug, we would have to compile NetCDF ourselves in debug mode (with the -g flag).
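
A minimal sketch of the symbol comparison, assuming the default ''nm'' output layout (''<address> <type> <name>'' for defined symbols and ''<type> <name>'' for undefined ones). The helper names are hypothetical and this is not the exact command sequence we ran.

```python
import subprocess


def parse_nm(text):
    """Collect (type, name) pairs from default `nm` output lines."""
    symbols = set()
    for line in text.splitlines():
        fields = line.split()
        # Defined symbols have three fields, undefined ones only two.
        if len(fields) in (2, 3):
            symbols.add((fields[-2], fields[-1]))
    return symbols


def symbol_diff(a, b):
    """Return the symbols found only in a and only in b, as sorted lists."""
    return sorted(a - b), sorted(b - a)


def read_symbols(binary_path):
    """Run nm on one binary and parse its symbol table."""
    result = subprocess.run(
        ["nm", binary_path], capture_output=True, text=True, check=True
    )
    return parse_nm(result.stdout)
```

Addresses are deliberately ignored, so two builds that differ only in layout compare as equal. Two empty lists mean the binaries expose the same symbols, which says nothing about the code behind them; that is why a disassembly comparison with objdump was still needed.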
 + 
 +**Diagnosis:** What we know so far is that compiling NetCDF with P-NetCDF breaks the nc_open function so that NEMO cannot use it. We are not sure whether this problem is caused by NetCDF alone or by the netcdf.hpp header files included with XIOS 2.0. As stated above, we would need to compile NetCDF ourselves to learn more about the problem.
 + 
 +**Solution:** Until we have more information, the best solution is to use a NetCDF build without P-NetCDF support. In any case, XIOS uses NC4, which relies on HDF5 for parallel writes.
  
 **More information:** **More information:**
Line 198: Line 204:
 ==== Issue 3: ==== ==== Issue 3: ====
  
-**Environment:**+**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification). 
  
-**Problem:**+  * Compiler: Intel 2017.4
 +  * MPI: Intel 2017.3.196
 +  * NetCDF: 4.4.0 & 4.2 (after removing P-NetCDF)
 +  * HDF5: 1.8.19
 +  * Flags: -O0
  
-**Actions taken:**+**Problem:** The model crashes at one of the steps in which it is supposed to write an output file, with what appears to be an MPI problem. The crash does not always occur at the same step, and the model can write output files at earlier timesteps.
 + 
 +__ocean.output__: The ocean.output file presents no anomalies. 
 + 
 + 
 +__log.err__:  
 + 
 +<code> 
 +s11r1b56.58976Assertion failure at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-10.2.175/ptl_ips/ips_proto.c:1869: (scb->payload_size & 0x3) == 0 
 +forrtl: error (76): Abort trap signal 
 +Image              PC                Routine            Line        Source 
 +nemo.exe           00000000027A1F9A  Unknown               Unknown  Unknown 
 +libpthread-2.22.s  00002AD71847FB10  Unknown               Unknown  Unknown 
 +libc-2.22.so       00002AD7189BD8D7  gsignal               Unknown  Unknown 
 +libc-2.22.so       00002AD7189BECAA  abort                 Unknown  Unknown 
 +libpsm2.so.2.1     00002AD73FDD3E6E  Unknown               Unknown  Unknown 
 +libpsm2.so.2.1     00002AD73FDE7D59  Unknown               Unknown  Unknown 
 +libpsm2.so.2.1     00002AD73FDEA3F2  Unknown               Unknown  Unknown 
 +libpsm2.so.2.1     00002AD73FDE419D  Unknown               Unknown  Unknown 
 +libpsm2.so.2.1     00002AD73FDE10D3  Unknown               Unknown  Unknown 
 +libpsm2.so.2.1     00002AD73FDDF93B  Unknown               Unknown  Unknown 
 +libpsm2.so.2.1     00002AD73FDDABF0  psm2_mq_ipeek2        Unknown  Unknown 
 +libtmip_psm2.so.1  00002AD73FBBE2FE  Unknown               Unknown  Unknown 
 +libmpi.so.12.0     00002AD7174E05C1  Unknown               Unknown  Unknown 
 +libmpi.so.12.0     00002AD717364020  Unknown               Unknown  Unknown 
 +libmpi.so.12       00002AD7170E05F2  PMPIDI_CH3I_Progr     Unknown  Unknown 
 +libmpi.so.12       00002AD7174D1BFF  PMPI_Test             Unknown  Unknown 
 +nemo.exe           00000000022E5F14  _ZN4xios13CClient          79  buffer_client.cpp 
 +nemo.exe           00000000018A348B  _ZN4xios14CContex         201  context_client.cpp 
 +nemo.exe           000000000186A6E1  _ZN4xios8CContext         350  context.cpp 
 +nemo.exe           0000000001D262F1  cxios_write_data_         412  icdata.cpp 
 +nemo.exe           0000000001478A79  idata_mp_xios_sen         545  idata.F90 
 +nemo.exe           0000000000BF1463  iom_mp_iom_p2d_          1136  iom.f90 
 +nemo.exe           0000000000810534  diawri_mp_dia_wri         293  diawri.f90 
 +nemo.exe           00000000004A9EFC  step_mp_stp_              284  step.f90 
 +nemo.exe           000000000043A9E9  nemogcm_mp_nemo_g         147  nemogcm.f90 
 +nemo.exe           000000000043A7C6  MAIN__                     18  nemo.f90 
 +nemo.exe           000000000043A79E  Unknown               Unknown  Unknown 
 +libc-2.22.so       00002AD7189A96E5  __libc_start_main     Unknown  Unknown 
 +nemo.exe           000000000043A6A9  Unknown               Unknown  Unknown 
 +</code> 
 + 
 + 
 +**Actions taken:** A similar error was observed with NEMO standalone v3.6r6499. In that case, operations told us to use the //fabric// module, which selects //ofi// as the inter-node fabric, similarly to the solution used at MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the side effect that jobs never finished. In coupled EC-Earth this module produced a deadlock, discussed below.
  
 **Diagnosis:** **Diagnosis:**
Line 208: Line 261:
 **Solution:** **Solution:**
  
 +About Intel Communication Fabrics control:
 +
 +[[https://software.intel.com/en-us/node/528821]]
 ==== Issue 4: ==== ==== Issue 4: ====
  
-**Environment:**+**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).  
 + 
 +  * Compiler: Intel 2017.4 
 +  * MPI: Intel 2017.3.196 
 +  * NetCDF: 4.4.0 
 +  * HDF5: 1.8.19 
 +  * Flags: -O0 & -O3
  
-**Problem:**+**Problem:** 
  
 **Actions taken:** **Actions taken:**
library/computing/xios_impi_troubles.1778827398.txt.gz · Last modified: 2026/05/15 06:43 by 84.88.52.107