# EC-Earth3-scalability-analysis

Scalability curves for multiple EC-Earth3 components for stand-alone executions.

The results are always the average of 3 independent executions of the same experiment, to mitigate possible variability. All runs are 2-month simulations, except for the TM5-Aerchem configurations (1 month). The initialization and finalization phases are always omitted.
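
As a reference for how the numbers below can be derived, here is a minimal sketch of this averaging in Python, with hypothetical timings (the actual analysis scripts of this repository may differ):

```python
# Hypothetical wall-clock times (s) of the simulated period only, for 3
# independent executions of the same 2-month experiment (initialization
# and finalization already excluded).
run_times = [3055.0, 3050.0, 3070.0]

SECONDS_PER_DAY = 86400.0

# Average over the 3 runs to mitigate variability.
mean_time = sum(run_times) / len(run_times)

# Simulated Years Per Day (SYPD) for a 2-month simulation.
simulated_years = 2.0 / 12.0
sypd = simulated_years / (mean_time / SECONDS_PER_DAY)
print(f"Mean time: {mean_time:.1f} s -> {sypd:.2f} SYPD")
```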

Note that the efficiency plots start at 100% for every outclass, meaning that the efficiency is computed independently for each outclass. This is a good metric to see how each outclass scales, but the CHSY (core-hours per simulated year, labelled CHPSY in the plots) is better suited to compare how efficiently the resources are used across outclasses.
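
A minimal sketch of both metrics, assuming SYPD measurements at increasing core counts for a single outclass (the numbers below are made up; the efficiency is normalised to the smallest core count, which is why every curve starts at 100%):

```python
# Hypothetical SYPD measurements for one outclass at increasing core counts.
cores = [96, 192, 384, 768]
sypd = [4.0, 7.4, 12.8, 19.5]

# Parallel efficiency relative to the smallest core count of this outclass,
# so the curve starts at 100% by construction.
efficiency = [
    100.0 * (s / sypd[0]) / (c / cores[0])
    for c, s in zip(cores, sypd)
]

# CHSY / CHPSY: core-hours needed to simulate one year.
# Running n cores for 24 h simulates `sypd` years, hence 24 * n / sypd.
chsy = [24.0 * c / s for c, s in zip(cores, sypd)]

for c, e, ch in zip(cores, efficiency, chsy):
    print(f"{c:4d} cores: efficiency {e:5.1f} %, CHSY {ch:6.1f} core-hours/SY")
```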

## IFS

![IFS_scalability_per_outclass](./images/IFS_scalability_per_outclass.png#center)

The reduced outclass has very little impact on the execution time of IFS (3%). However, the average overhead when using the CMIP6-historical outclass is almost 20% compared to the execution without output, and this impact grows as more nodes are added.
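
The overhead percentages quoted here and in the NEMO section can be read as the relative increase in execution time with respect to the run without output; a short sketch with made-up timings:

```python
# Hypothetical execution times (s) of the same configuration.
t_no_output = 3000.0
t_cmip6_historical = 3590.0

# Relative output overhead with respect to the run without output.
overhead = 100.0 * (t_cmip6_historical - t_no_output) / t_no_output
print(f"Output overhead: {overhead:.1f} %")  # ~19.7 %
```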

![IFS_CHPSY_per_outclass](./images/IFS_CHPSY_per_outclass.png)

![IFS_efficiency](./images/IFS_efficiency.png)

To stay above 60% efficiency, we should use no more than 10 nodes (480 cores), or 8 nodes (384 cores) for the CMIP6 outclass.

## NEMO

All runs use 47 cores for XIOS and have ElPin activated, with the following numbers of NEMO processes:

[ 48 144 192 229 285 331 380 521 665 806 1008 1129 1275 ]

![NEMO_scalability_per_outclass](./images/NEMO_scalability_per_outclass.png)

Even though a dedicated I/O server (XIOS) is used, the output overhead is larger in NEMO than in IFS*. Compared to the execution without output, the average overhead is 8.14% with the reduced outclass and almost 37% with the CMIP6-historical outclass. Again, the impact grows as more nodes are added.

(*) Assuming that the increase in the amount of data to output is equal for IFS and NEMO when changing the outclass.

![NEMO_CHPSY_per_outclass](./images/NEMO_CHPSY_per_outclass.png)

![NEMO_efficiency](./images/NEMO_efficiency.png)

We can use up to 521 processes with more than 60% efficiency for all outclasses.

### PISCES

![NEMO+PISCES_ORCA1L75_scalability](./images/NEMO+PISCES_ORCA1L75_scalability.png)

![NEMO+PISCES_ORCA1L75_CHPSY](./images/NEMO+PISCES_ORCA1L75_CHPSY.png)

Beyond 1008 processes there is a significant drop in NEMO+PISCES performance, and the same happens when using 229 cores.

![PISCES_overhead](./images/PISCES_overhead.png)

## LPJG

LPJG scalability is divided into two parts: the initialization (time to read the initial state) and the computation. The problem with the initialization is that it can take several minutes to complete, and during this period every other component of the coupled system is waiting. This can represent an important waste of resources, especially at higher resolutions, where many more cores will be allocated and therefore left idle (a rough estimate of this cost is sketched below).

Since LPJG requires the memory of at least 3 nodes to work, the first experiments were executed changing only the number of cores dedicated to LPJG on these nodes. More precisely:

[ 1, 2, 3, 8, 16, 32, 48 ] cores on each of the 3 nodes, resulting in [ 3, 6, 9, 24, 48, 96, 144 ] cores in total.

Additionally, it has been tested with 4 nodes (192 cores) and 5 nodes (240 cores).
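
As a rough illustration of that waste, the sketch below estimates the core-hours lost while the rest of the coupled system waits for LPJG to read its initial state (all figures are illustrative, not measurements from these experiments):

```python
# Illustrative numbers: LPJG initialization time and the cores of the other
# components (e.g. IFS + NEMO + XIOS) that sit idle in the meantime.
init_minutes = 5.0
waiting_cores = 480 + 521 + 47

wasted_core_hours = waiting_cores * init_minutes / 60.0
print(f"~{wasted_core_hours:.0f} core-hours idle during LPJG initialization")
```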

![LPJG_init](./images/LPJG_init.png)

On the other hand, the computation part of LPJG, as shown in the next plot, does not need many resources to surpass IFS and NEMO in terms of speed (which reach at most 30 and 65 SYPD, respectively).

![LPJG_scalability_without_init](./images/LPJG_scalability_without_init.png)

![LPJG_scalability](./images/LPJG_scalability.png)

## TM5-CO2

![3d_tm5co2](./images/3d_tm5co2.png)

## TM5-Aerchem

![3d_tm5AerChem](./images/3d_tm5AerChem.png)