library:computing:xios_impi_troubles

Differences

This shows you the differences between two versions of the page.

library:computing:xios_impi_troubles [2017/08/04 16:25]
mcastril [Issue 2: XIOS crashes when writing model output]
library:computing:xios_impi_troubles [2017/08/11 11:11] (current)
mcastril
Line 1:
-====== NEMO-XIOS issues with Intel MPI ======
+====== NEMO-XIOS Intel MPI issues ======
 
 ===== NEMO-XIOS @ MN3 =====
 
-**Environment:** NEMO 3.6 stable, XIOS 1.0. This bug was documented using the following compilers and MPI libraries:
+**Environment:** NEMO 3.6 stable, XIOS 1.0. This bug has been reported using the following compilers and MPI libraries:
 
   * Intel 13.0.1 & Intel MPI 4.1.3.049
Line 261:
 We tried an alternative solution, which was to __increment the number of XIOS servers__ in order to reduce the number of messages sent to the same process and by the moment it seems that it is effective.
 
-**Diagnosis:** The problem is that an one point the buffer for data transfer is not 4-byte aligned, and the library assumes it is. This is a really low level problem and we do not completely understand the relation between this and reduce the congestion, but maybe in the future we can get more information.
+**Diagnosis:** The problem is that an one point the buffer for data transfer is not 4-byte aligned, and the library assumes it is. This is a really low level problem and we do not completely understand the relation between this and reducing the congestion (which is achieved by adding servers), but maybe in the future we can get more information.
 
 **Solution:** By the moment the solution used is to use enough number of XIOS servers (47 for SR).
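
To illustrate the alignment assumption mentioned in the diagnosis, the sketch below checks whether a buffer address is a multiple of 4 bytes. It is not taken from XIOS or Intel MPI: `is_aligned` and the 64-byte buffer are hypothetical names chosen for this example, and `ctypes` is used only to obtain a real memory address.

```python
import ctypes


def is_aligned(address: int, alignment: int = 4) -> bool:
    """Return True when `address` is a multiple of `alignment` bytes."""
    return address % alignment == 0


# Hypothetical transfer buffer (assumption: 64 bytes is an arbitrary size).
buffer = ctypes.create_string_buffer(64)
base = ctypes.addressof(buffer)

# Round up to the next 4-byte boundary so the demo starts from an aligned address.
aligned_base = (base + 3) & ~3

# The aligned address satisfies the 4-byte assumption ...
print(is_aligned(aligned_base))
# ... but an offset of a single byte into the same buffer does not.
print(is_aligned(aligned_base + 1))
```

A library that assumes 4-byte alignment would accept the first address and could fail on the second, which is the kind of condition the assertion in Issue 2 appears to report.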
Line 284:
 **Problem:** When loading module fabric, created by Ops to solve Issue 2 on NEMO (assertion invalid), EC-Earth enters in a deadlock. Our NEMO benchmark was rather running, but MPI_Finalize was not working and jobs never finished until wallclock time limit was reached.
 
-**Actions taken:**
+**Actions taken:** We managed to solve the issues 1 to 3, so there is no need to solve this one by now. However, if we find the time we will debug this problem.
 
 **Diagnosis:**
 
-**Solution:**
+**Solution:** No solution yet, but model can work without fabric module.
library/computing/xios_impi_troubles.1501863945.txt.gz · Last modified: 2017/08/04 16:25 by mcastril