# EC-Earth3-scalability-analysis

Scalability curves for multiple EC-Earth3 components for stand-alone executions.

The results are always the average of 3 independent executions of the same experiment, to mitigate possible variability. All runs are 2-month simulations, except for the TM5-Aerchem configurations (1 month). The initialization and finalization phases are always omitted.
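
As a reference for how the numbers below can be derived, here is a minimal sketch of this averaging in Python, with hypothetical timings (the actual analysis scripts of this repository may differ):

```python
# Hypothetical wall-clock times (s) of the simulated period only, for 3
# independent executions of the same 2-month experiment (initialization
# and finalization already excluded).
run_times = [3055.0, 3050.0, 3070.0]

SECONDS_PER_DAY = 86400.0

# Average over the 3 runs to mitigate variability.
mean_time = sum(run_times) / len(run_times)

# Simulated Years Per Day (SYPD) for a 2-month simulation.
simulated_years = 2.0 / 12.0
sypd = simulated_years / (mean_time / SECONDS_PER_DAY)
print(f"Mean time: {mean_time:.1f} s -> {sypd:.2f} SYPD")
```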

Note that the efficiency plots start at 100% for every outclass, meaning that the efficiency is computed independently for each outclass. This is a good metric to see how each outclass scales, but the CHSY (core-hours per simulated year, labelled CHPSY in the plots) is better suited to compare how efficiently the resources are used across outclasses.
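
A minimal sketch of both metrics, assuming SYPD measurements at increasing core counts for a single outclass (the numbers below are made up; the efficiency is normalised to the smallest core count, which is why every curve starts at 100%):

```python
# Hypothetical SYPD measurements for one outclass at increasing core counts.
cores = [96, 192, 384, 768]
sypd = [4.0, 7.4, 12.8, 19.5]

# Parallel efficiency relative to the smallest core count of this outclass,
# so the curve starts at 100% by construction.
efficiency = [
    100.0 * (s / sypd[0]) / (c / cores[0])
    for c, s in zip(cores, sypd)
]

# CHSY / CHPSY: core-hours needed to simulate one year.
# Running n cores for 24 h simulates `sypd` years, hence 24 * n / sypd.
chsy = [24.0 * c / s for c, s in zip(cores, sypd)]

for c, e, ch in zip(cores, efficiency, chsy):
    print(f"{c:4d} cores: efficiency {e:5.1f} %, CHSY {ch:6.1f} core-hours/SY")
```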

## IFS

![IFS_scalability_per_outclass](./images/IFS_scalability_per_outclass.png#center)

The reduced outclass has very little impact on the execution time of IFS (3%). However, the average overhead when using the CMIP6-historical outclass is almost 20% compared to the execution without output, and this impact grows as more nodes are added.
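
The overhead percentages quoted here and in the NEMO section can be read as the relative increase in execution time with respect to the run without output; a short sketch with made-up timings:

```python
# Hypothetical execution times (s) of the same configuration.
t_no_output = 3000.0
t_cmip6_historical = 3590.0

# Relative output overhead with respect to the run without output.
overhead = 100.0 * (t_cmip6_historical - t_no_output) / t_no_output
print(f"Output overhead: {overhead:.1f} %")  # ~19.7 %
```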

![IFS_CHPSY_per_outclass](./images/IFS_CHPSY_per_outclass.png)

![IFS_efficiency](./images/IFS_efficiency.png)

To stay above 60% efficiency, we should use no more than 10 nodes (480 cores), or 8 nodes (384 cores) for the CMIP6 outclass.

## NEMO

All runs use 47 cores for XIOS and have ElPin activated, with the following numbers of NEMO processes:

[ 48 144 192 229 285 331 380 521 665 806 1008 1129 1275 ]

![NEMO_scalability_per_outclass](./images/NEMO_scalability_per_outclass.png)

Even though a dedicated I/O server (XIOS) is used, the output overhead is larger in NEMO than in IFS*. Compared to the execution without output, the average overhead is 8.14% with the reduced outclass and almost 37% with the CMIP6-historical outclass. Again, the impact grows as more nodes are added.

(*) Assuming that the increase in the amount of data to output is equal for IFS and NEMO when changing the outclass.

![NEMO_CHPSY_per_outclass](./images/NEMO_CHPSY_per_outclass.png)

![NEMO_efficiency](./images/NEMO_efficiency.png)

We can use up to 521 processes with more than 60% efficiency for all outclasses.

### PISCES

![NEMO+PISCES_ORCA1L75_scalability](./images/NEMO+PISCES_ORCA1L75_scalability.png)

![NEMO+PISCES_ORCA1L75_CHPSY](./images/NEMO+PISCES_ORCA1L75_CHPSY.png)

Beyond 1008 processes there is a significant drop in NEMO+PISCES performance, and the same happens when using 229 cores.

![PISCES_overhead](./images/PISCES_overhead.png)

## LPJG

LPJG scalability is divided into two parts: the initialization (time to read the initial state) and the computation. The problem with the initialization is that it can take several minutes to complete, and during this period every other component of the coupled system is waiting. This can represent an important waste of resources, especially at higher resolutions, where many more cores will be allocated and therefore left idle (a rough estimate of this cost is sketched below).

Since LPJG requires the memory of at least 3 nodes to work, the first experiments were executed changing only the number of cores dedicated to LPJG on these nodes. More precisely:

[ 1, 2, 3, 8, 16, 32, 48 ] cores on each of the 3 nodes, resulting in [ 3, 6, 9, 24, 48, 96, 144 ] cores in total.

Additionally, it has been tested with 4 nodes (192 cores) and 5 nodes (240 cores).
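
As a rough illustration of that waste, the sketch below estimates the core-hours lost while the rest of the coupled system waits for LPJG to read its initial state (all figures are illustrative, not measurements from these experiments):

```python
# Illustrative numbers: LPJG initialization time and the cores of the other
# components (e.g. IFS + NEMO + XIOS) that sit idle in the meantime.
init_minutes = 5.0
waiting_cores = 480 + 521 + 47

wasted_core_hours = waiting_cores * init_minutes / 60.0
print(f"~{wasted_core_hours:.0f} core-hours idle during LPJG initialization")
```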

![LPJG_init](./images/LPJG_init.png)

On the other hand, the computation part of LPJG, as shown in the next plot, does not need many resources to surpass IFS and NEMO in terms of speed (which reach at most 30 and 65 SYPD, respectively).

![LPJG_scalability_without_init](./images/LPJG_scalability_without_init.png)

![LPJG_scalability](./images/LPJG_scalability.png)

## TM5-CO2

![3d_tm5co2](./images/3d_tm5co2.png)

## TM5-Aerchem

![3d_tm5AerChem](./images/3d_tm5AerChem.png)