====== NEMO-XIOS Intel MPI issues ======
  
===== NEMO-XIOS @ MN3 =====
  
**Environment:** NEMO 3.6 stable, XIOS 1.0. This bug has been reported with the following compilers and MPI libraries:
  
  * Intel 13.0.1 & Intel MPI 4.1.3.049
</code>
  
**Actions taken:** BSC Ops had previously observed this error with NEMO standalone and NetCDF 4.4.4.1, so they installed NetCDF 4.4.0. We could not reproduce the failure they had reported when running NEMO with NetCDF 4.4.4.1, but we did get the same error when running EC-Earth. We therefore also moved to NetCDF 4.4.0 and the error stopped appearing. However, we then got other XIOS errors when writing output (discussed in the following issues), so we asked for the same NetCDF version we were using at MN3 to be installed. Surprisingly, when Ops installed version 4.2 the same error appeared again.
  
After comparing the NetCDF 4.4.0 and NetCDF 4.2 configurations (using the nc-config and nf-config commands), we found that NetCDF 4.4.0 had been compiled without support for NC4 or P-NetCDF (a library that provides parallel I/O for classic NetCDF files), whereas NetCDF 4.2 had been installed with support for these features. We reported this to Ops and they installed __NetCDF without linking to P-NetCDF__, which seemed to fix the problem.
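
For reference, this is roughly how the two installations can be inspected; the installation prefixes below are placeholders for the MN4 modules, and the exact summary printed by nc-config differs between NetCDF versions:

<code>
# Dump the build summary of each NetCDF installation and look at the
# NC4 / P-NetCDF / parallel entries (paths are placeholders).
/apps/netcdf/4.4.0/bin/nc-config --all | grep -iE 'nc4|pnetcdf|parallel'
/apps/netcdf/4.2/bin/nc-config --all   | grep -iE 'nc4|pnetcdf|parallel'

# The Fortran wrapper has its own summary as well.
nf-config --all
</code>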
  
In order to learn more about the source of this bug, we __compared the behavior of two NEMO executables__: one compiled against NetCDF with P-NetCDF support and one compiled against NetCDF without it. Both were run against a NetCDF library without P-NetCDF support at runtime. The result is that the __NEMO compiled with P-NetCDF did not run__, even though the library used at runtime made no use of P-NetCDF. The conclusion was that something was wrong in the NEMO binary itself.
  
We did a __comparison of the functions included in both binaries__ with the nm command and found that __they were identical__. We then did a __more in-depth comparison of both binaries__ with objdump and found small differences, some of which were __pointing to a XIOS header file called netcdf.hpp__. This header is responsible for including some NetCDF function definitions, and its behavior depends on the environment (preprocessor macros). To find out whether this file is responsible for the bug we would have to compile NetCDF ourselves in debug mode (with the -g flag).
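
As a sketch of that comparison (the executable names are just placeholders for the two builds):

<code>
# Compare the symbol tables of the two binaries (they were identical).
nm nemo_with_pnetcdf.exe | sort > with_pnetcdf.sym
nm nemo_no_pnetcdf.exe   | sort > no_pnetcdf.sym
diff with_pnetcdf.sym no_pnetcdf.sym

# Disassemble and diff both binaries; with debug info (-g builds),
# objdump -l annotates the output with source files and lines, which is
# how differences can be traced back to headers such as netcdf.hpp.
objdump -d -l nemo_with_pnetcdf.exe > with_pnetcdf.asm
objdump -d -l nemo_no_pnetcdf.exe   > no_pnetcdf.asm
diff with_pnetcdf.asm no_pnetcdf.asm | less
</code>
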
**Diagnosis:** What we know so far is that compiling NetCDF with P-NetCDF support breaks the nc_open function so that it cannot be used by NEMO. We are not sure whether this problem is caused by NetCDF alone or whether the netcdf.hpp header files included with XIOS 2.0 are responsible. As stated above, we would need to compile NetCDF ourselves to learn more about the problem.
  
**Solution:** Until we have more information, the best solution is to use a NetCDF version that does not have P-NetCDF support (4.2 and 4.4.0 on MN4). In any case XIOS uses NC4, which relies on HDF5 for parallel writes.
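
A quick way to verify which NetCDF a given binary resolves at runtime and whether P-NetCDF is pulled in (binary and library paths are placeholders):

<code>
# Libraries resolved by the NEMO executable at runtime.
ldd nemo.exe | grep -i netcdf

# Check whether the NetCDF shared library itself links P-NetCDF.
ldd /apps/netcdf/4.4.0/lib/libnetcdf.so | grep -i pnetcdf \
  || echo "no P-NetCDF linked"
</code>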
  
**More information:**
[[https://github.com/Unidata/netcdf4-python/issues/170|https://github.com/Unidata/netcdf4-python/issues/170]]
  
==== Issue 2: XIOS crashes when writing model output ====
  
**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).
  
  
**Actions taken:** Given that the error is a floating-point invalid, we disabled the -fpe0 flag, but we still had the same problem. We then disabled compiler optimizations (-O0) and the problem disappeared, but this obviously has an impact on performance. As a second step we enabled -O2 optimizations and the model still runs, so the performance loss is not as large as it would be with -O0.
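
The change itself is just a matter of adjusting the Fortran compiler flags in the build configuration; a hypothetical excerpt of an FCM-style arch file (file name and remaining flags are illustrative only):

<code>
# before: aggressive optimization plus FPE trapping
#%FCFLAGS        -r8 -O3 -fpe0 -xHost -fp-model precise
# after: drop -fpe0 and lower the optimization level to -O2
%FCFLAGS         -r8 -O2 -xHost -fp-model precise
</code>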
  
Update: Using the new NetCDF 4.2 installed by Ops and the netcdf_par key for XIOS2, the model can run with -O3. We have to investigate this issue further.
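
Assuming "netcdf_par key" refers to building XIOS2 against parallel NetCDF-4 through the standard make_xios option, the rebuild looks roughly like this (the arch name is a placeholder for the MN4 configuration):

<code>
cd xios-2.0
./make_xios --prod --arch X64_MN4 --netcdf_lib netcdf4_par --job 8
</code>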

**Diagnosis:** For now it is difficult to know the exact source of this problem. Further debugging will be required.
  
**Solution:** Use the -O2 flag (instead of -O3). However, it now seems to work with NetCDF 4.2, -O3 and netcdf_par for XIOS2.
  
==== Issue 3: MPI kills XIOS when writing model output ====
  
**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).
  * Flags: -O0
  
**Problem:** The model crashes at one of the steps where it is supposed to write an output file, with what seems to be an MPI problem. The step at which the crash occurs is not always the same, and output files may have been written correctly at previous timesteps.
  
__ocean.output__: The ocean.output file shows no anomalies.
  
  
**Actions taken:** A similar error was observed with NEMO standalone v3.6r6499. In that case, Ops told us to use the //fabric// module, which selects //ofi// as the inter-node fabric, similar to the solution used on MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the side effect that jobs never finished. In coupled EC-Earth this module produced a deadlock, discussed below.
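
In essence the //fabric// module forces the Intel MPI fabric selection; a sketch of the equivalent environment setting (the module itself is site-specific and maintained by Ops):

<code>
module load fabric            # site-specific module provided by Ops
export I_MPI_FABRICS=shm:ofi  # shared memory intra-node, OFI between nodes
</code>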
  
We tried an alternative solution, which was to __increase the number of XIOS servers__ in order to reduce the number of messages sent to the same process, and for the moment it seems to be effective.
  
**Diagnosis:** The problem is that at some point the buffer for a data transfer is not 4-byte aligned, while the library assumes it is. This is a very low-level problem and we do not completely understand the relation between the alignment issue and reducing congestion (which is what adding servers achieves), but we may get more information in the future.

**Solution:** For the moment the solution is to run with a sufficient number of XIOS servers (47 for SR).
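
In practice this just means giving xios_server.exe more ranks in the MPMD launch (executable names and core counts below are illustrative, and XIOS must run in detached server mode, i.e. using_server set to true in iodef.xml):

<code>
# NEMO ranks first, then 47 dedicated XIOS server ranks.
mpirun -np 384 ./nemo.exe : -np 47 ./xios_server.exe
</code>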
  
About Intel Communication Fabrics control:
  
[[https://software.intel.com/en-us/node/528821]]

ips_proto.c source code:

[[https://github.com/01org/psm/blob/master/ptl_ips/ips_proto.c]]
==== Issue 4: EC-Earth enters a deadlock when using the fabric (OFA network fabrics) module ====
  
**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).
  * Flags: -O0 & -O3
  
**Problem:** When loading the fabric module, created by Ops to solve Issue 2 on NEMO (the invalid assertion), EC-Earth enters a deadlock. Our NEMO benchmark did run, but MPI_Finalize never returned and jobs did not finish until the wallclock time limit was reached.
  
**Actions taken:** We managed to solve Issues 1 to 3, so there is no need to solve this one for now. However, if we find the time we will debug this problem.
  
**Diagnosis:**
  
**Solution:** No solution yet, but the model can run without the fabric module.