====== XIOS I/O server with Intel MPI ======

===== XIOS @ MN3 =====

**Environment:**

NEMO 3.6 stable, XIOS 1.0. This bug was documented using the following compilers and MPI libraries:

  * Intel 13.0.1 & Intel MPI 4.1.3.049
  * Intel 16.0.1 & Intel MPI 5.1.2.150

The problem was reproduced both with the default optimization flags and with the -O3 optimization flag.

**Situation:**

When using more than 1,920 MPI processes (120 MN3 nodes), the simulation fell into a deadlock during the XIOS initialization: some of the NEMO clients remained stuck in client.cpp doing an MPI send:

<code cpp>
MPI_Send(buff, buffer.count(), MPI_CHAR, serverLeader, 1, CXios::globalComm);
</code>

while the XIOS master server (the first rank in the XIOS MPI communicator) remained in the CServer::listenContext(void) routine in server.cpp, trying to dispatch all the messages:

<code cpp>
MPI_Iprobe(MPI_ANY_SOURCE, 1, CXios::globalComm, &flag, &status);
</code>

**Actions:**

The code was debugged by adding prints before and after the calls above. They showed that some NEMO processes were blocked in the MPI_Send (synchronous send), while the XIOS master server was looping forever in a while loop, trying to receive all the messages (the total number of messages has to be equal to the number of clients, i.e. NEMO processes).

**Diagnosis:**

It appears that some messages sent from the clients to the master server were lost, possibly because all of these messages were sent at the same time from all the nodes.

**Solution:**

Our first workaround was to add a sleep call (sleep(rank % 16)) before the MPI_Send in the clients' code, to interleave the outgoing messages and avoid flooding the buffers and the network. This is of course not the cleanest solution, because it introduces a delay of up to 15 seconds, but it is affordable given that this code is only executed during initialization (a standalone sketch of this staggering pattern is given at the end of this page):

<code cpp>
char hostName[50];
gethostname(hostName, 50);

// Sleep fix: stagger the sends by rank to avoid flooding the server
sleep(rank % 16);

MPI_Comm_create_errhandler(eh, &newerr);
MPI_Comm_set_errhandler(CXios::globalComm, newerr);
error_code = MPI_Send(buff, buffer.count(), MPI_CHAR, serverLeader, 1, CXios::globalComm);
delete [] buff;
</code>

BSC Operations provided another solution: activating the User Datagram protocol through Intel MPI environment variables. This alternative works and does not need any code modification, but it entails a performance penalty: simulations using this option were increasingly slower (5%-20%) the more cores were used, compared with the reference runs.

<code>
I_MPI_DAPL_UD=on
</code>

**More information:**

This bug was reported in the XIOS portal: [[http://forge.ipsl.jussieu.fr/ioserver/ticket/90]]

About Intel Communication Fabrics control: [[https://software.intel.com/en-us/node/528821]]

DAPL UD-capable Network Fabrics Control: [[https://software.intel.com/en-us/node/528824]]
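
**Standalone sketch of the staggering workaround:**

For reference, below is a minimal, standalone sketch of the rank-staggered send described in the Solution section, outside of the XIOS code base. It reproduces the registration pattern (every client sending one message to a server leader that collects them with MPI_Iprobe) and delays each send by sleep(rank % 16). The names serverLeader, REGISTRATION_TAG and the message payload are illustrative and are not taken from XIOS; in the real fix the only change is the single sleep(rank % 16) line shown in the patch above.

<code cpp>
// Minimal standalone sketch (not XIOS code): every rank except 0 sends one
// registration message to rank 0; rank 0 probes until it has received one
// message per client. sleep(rank % 16) staggers the sends so that they do
// not all hit the server at the same instant.
#include <mpi.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int serverLeader = 0;      // illustrative: stands in for the XIOS master server
  const int REGISTRATION_TAG = 1;  // the registration messages use tag 1

  if (rank == serverLeader)
  {
    int received = 0;
    // The server loops until it has dispatched one message per client,
    // mirroring the while loop in CServer::listenContext().
    while (received < size - 1)
    {
      int flag = 0;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, REGISTRATION_TAG, MPI_COMM_WORLD, &flag, &status);
      if (flag)
      {
        int count = 0;
        MPI_Get_count(&status, MPI_CHAR, &count);
        char buff[256];
        MPI_Recv(buff, count, MPI_CHAR, status.MPI_SOURCE, REGISTRATION_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        ++received;
      }
    }
    std::printf("server: received %d registration messages\n", received);
  }
  else
  {
    char buff[256];
    std::snprintf(buff, sizeof(buff), "hello from rank %d", rank);

    // Workaround: interleave the outgoing messages instead of sending
    // them all at the same time from every node.
    sleep(rank % 16);

    MPI_Send(buff, std::strlen(buff) + 1, MPI_CHAR, serverLeader,
             REGISTRATION_TAG, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}
</code>

Built with the Intel MPI wrappers (e.g. mpiicpc) and launched on a few ranks, this runs as-is and shows the intended effect: the sends arrive spread over up to 15 seconds instead of simultaneously.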