After looking for differences between the NetCDF 4.4.0 and NetCDF 4.2 configurations (using the nc-config and nf-config commands), we found that while NetCDF 4.4.0 was compiled without support for nc4 or P-NetCDF (a library that provides parallel I/O support for classic NetCDF files), NetCDF 4.2 was installed with support for these features. We reported this to Ops and they installed __NetCDF without linking to P-NetCDF__, and this seemed to fix the problem.
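
For reference, a minimal sketch of how the two configurations can be compared; the installation paths are hypothetical and should be adapted to the modules available on the machine:

<code bash>
# Dump the full build configuration of each installation
# (paths are hypothetical; point them at the two NetCDF installs).
/apps/netcdf/4.4.0/bin/nc-config --all > nc-config-4.4.0.txt
/apps/netcdf/4.2/bin/nc-config   --all > nc-config-4.2.txt

# The Fortran interface has its own configuration tool.
/apps/netcdf/4.4.0/bin/nf-config --all > nf-config-4.4.0.txt
/apps/netcdf/4.2/bin/nf-config   --all > nf-config-4.2.txt

# Side-by-side diff; look at the nc4 and parallel I/O related entries.
diff -y nc-config-4.4.0.txt nc-config-4.2.txt
diff -y nf-config-4.4.0.txt nf-config-4.2.txt
</code>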
  
In order to learn more about the source of this bug, we __compared the behavior of two NEMO executables__: one compiled against NetCDF with P-NetCDF support and another one without it. Both executions were linked with NetCDF without P-NetCDF support at runtime. The result was that the __NEMO compiled with P-NetCDF did not run__, even though the library used at runtime did not use P-NetCDF. The conclusion was that something was wrong in the NEMO binary itself.
  
We did a __comparison of the functions included in both binaries__ with the nm command, and we found that __they were identical__. Then we did a __more in-depth comparison of both binaries__ with objdump and found small differences, some of them __pointing to a XIOS header file called netcdf.hpp__. This header is responsible for including some NetCDF function definitions, and its behavior depends on the environment (preprocessor macros). In order to know whether this file is responsible for the bug we would have to compile NetCDF ourselves in debug mode (with the -g flag).
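
A sketch of this comparison, assuming the two builds are called nemo_pnetcdf.exe and nemo_plain.exe (hypothetical names):

<code bash>
# Hypothetical names for the two NEMO builds described above.
NEMO_PNETCDF=nemo_pnetcdf.exe
NEMO_PLAIN=nemo_plain.exe

# Compare the symbol tables: if the sorted outputs are identical,
# both binaries contain the same set of functions.
nm "$NEMO_PNETCDF" | sort > symbols_pnetcdf.txt
nm "$NEMO_PLAIN"   | sort > symbols_plain.txt
diff symbols_pnetcdf.txt symbols_plain.txt

# Disassemble both binaries and diff the result to locate the small
# code differences (the ones that pointed to netcdf.hpp).
objdump -d "$NEMO_PNETCDF" > disasm_pnetcdf.txt
objdump -d "$NEMO_PLAIN"   > disasm_plain.txt
diff disasm_pnetcdf.txt disasm_plain.txt | less
</code>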
[[https://github.com/Unidata/netcdf4-python/issues/170]]
  
==== Issue 2: XIOS crashes when writing model output ====
  
**Environment:** Auto-EC-Earth 3.2.2_develop_MN4 (EC-Earth 3.2 r4063-runtime-unification).
  
  
**Actions taken:** A similar error was observed with NEMO standalone v3.6r6499. In that case, Ops told us to use the //fabric// module, which selects //ofi// as the internode fabric, similarly to the solution used in MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the side effect that jobs never ended. In coupled EC-Earth this module produced a deadlock, discussed below.
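
What the module does can roughly be expressed with Intel MPI environment variables; the exact fabric value is an assumption based on the description above:

<code bash>
# Site-provided module that switches the internode fabric (name from Ops).
module load fabric

# Assumed equivalent Intel MPI setting: shared memory within a node,
# OFI between nodes. Adjust if the module sets something different.
export I_MPI_FABRICS=shm:ofi

# Print fabric selection and other startup details to verify the choice.
export I_MPI_DEBUG=5
</code>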
  
We tried an alternative solution, which was to increase the number of XIOS servers in order to reduce the number of messages sent to the same process, and for the moment it seems to be effective.
  
**Diagnosis:** The problem is that at one point the buffer for the data transfer is not 4-byte aligned, while the library assumes it is. This is a very low-level problem and we do not completely understand how reducing the congestion relates to it, but maybe in the future we can get more information.

**Solution:** For the moment the solution is to run with a sufficient number of XIOS servers (47 for SR).
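
As an illustration, an MPMD launch line for such a configuration; the NEMO task count and binary names are hypothetical, only the 47 XIOS servers come from the tests above:

<code bash>
# MPMD launch with Intel MPI: NEMO client tasks plus 47 XIOS server tasks.
# XIOS must run in server mode (using_server set to true in iodef.xml).
# The NEMO task count (912) and binary names are hypothetical.
mpirun -np 912 ./nemo.exe : -np 47 ./xios_server.exe
</code>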
  
About Intel Communication Fabrics control:
  
[[https://software.intel.com/en-us/node/528821]]

ips_proto.c source code:

[[https://github.com/01org/psm/blob/master/ptl_ips/ips_proto.c]]
==== Issue 4: ====
  