  * Flags: -O0
  
**Problem:** The model crashes at one of the steps at which it is supposed to write an output file, with what appears to be an MPI problem. The crash does not always happen at the same step, and the model may have written output files successfully at earlier timesteps.
  
__ocean.output__: The ocean.output file presents no anomalies.
**Actions taken:** A similar error was observed with NEMO standalone v3.6r6499. In that case, Ops told us to use the //fabric// module, which selects //ofi// as the internode fabric, similar to the solution used on MN3 (see above). Using this module solved the problem for NEMO standalone, although it had the side effect that jobs never finished. In coupled EC-Earth this module produced a deadlock, discussed below.
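For reference, a minimal sketch of what loading the //fabric// module amounts to, assuming it only changes Intel MPI's standard ''I_MPI_FABRICS'' selection (we did not inspect the module contents, so the manual setting below is an assumption, not the verified module behaviour):

<code bash>
# Ops-provided module that switches the internode fabric to ofi
module load fabric

# Roughly equivalent manual setting, assuming the module only touches the
# Intel MPI fabric selection: shared memory intra-node, ofi inter-node
export I_MPI_FABRICS=shm:ofi
</code>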
  
We tried an alternative solution, which was to increase the number of XIOS servers in order to reduce the number of messages sent to the same process; for the moment this seems to be effective.
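As a sketch, increasing the number of XIOS servers means launching more ''xios_server.exe'' processes in the MPMD command while keeping detached-server mode enabled in ''iodef.xml''; the executable paths and process counts below are illustrative, not the ones used in our runs:

<code bash>
# iodef.xml must request detached XIOS servers:
#   <variable id="using_server" type="bool">true</variable>

# MPMD launch: NEMO clients followed by the XIOS server pool.
# Going e.g. from 4 to 16 servers spreads the client messages over more
# receiving processes and reduces per-process congestion.
mpirun -np 960 ./nemo.exe : -np 16 ./xios_server.exe
</code>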
  
**Diagnosis:** The problem is that at one point the buffer used for data transfer is not 4-byte aligned, while the library assumes it is. This is a very low-level problem and we do not completely understand the relation between this and reducing the congestion, but we may be able to get more information in the future.