# Overview

A specific Python-based software tool has been developed for the verification of the different seasonal forecast systems available in the Climate Data Store (CDS). This tool provides a complete set of functionalities, ranging from direct downloading of selected datasets from the CDS to the computation and plotting of the specified metrics. All these functionalities are integrated into a workflow that automatically identifies which operations are required and which ones can be skipped. It runs from a configuration file, which allows a high degree of parametrization, including all required dataset download parameters, computation specifications (e.g. chunk sizes on selected dimensions, number of cores, maximum allowed memory usage), metric parameters and many plotting details. The tool has been developed in a modular fashion to allow easy debugging, modification and extension.
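
As an illustration of the computation parameters mentioned above, the following minimal sketch shows how settings such as the number of cores and the maximum allowed memory usage typically map onto a Dask cluster (the values are hypothetical, not the tool's defaults):

```python
# Minimal sketch (hypothetical values, not the tool's defaults) of how the
# computation settings in conf.yaml typically map onto a Dask cluster.
from dask.distributed import Client

# n_workers ~ "number of cores", memory_limit ~ "maximum allowed memory usage"
client = Client(n_workers=4, threads_per_worker=1, memory_limit="8GB")
print(client)
```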

The software uses the xarray package to facilitate multidimensional array manipulation and enables efficient parallel computing by integrating the Dask library. For some of the verification metrics (fRPSS and fCRPSS), the R package SpecsVerification is used, a library specific to seasonal forecast verification that is also used in the preoperational contract C3S_51_Lot3. This R package is wrapped using the rpy2 Python package.
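
As a toy illustration (not the tool's actual wrapper), an R score from SpecsVerification can be called from Python through rpy2 as follows; `EnsCrps` is one of the ensemble scores provided by that package:

```python
# Toy example (not the tool's wrapper) of calling SpecsVerification from
# Python through rpy2.
import numpy as np
from rpy2.robjects import numpy2ri
from rpy2.robjects.packages import importr

numpy2ri.activate()  # transparent numpy <-> R array conversion
sv = importr("SpecsVerification")

ens = np.random.randn(100, 25)  # 100 forecasts x 25 ensemble members
obs = np.random.randn(100)      # matching observations

crps = np.asarray(sv.EnsCrps(ens, obs))  # one CRPS value per forecast
print(crps.mean())
```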

Extended documentation, including a description of all the parameters that can be set in `conf.yaml` and other details, can be found [here](https://earth.bsc.es/gitlab/external/c3s512-seasonal-forecast-eqc/-/blob/s-prototype/C3S512_sf-prototype_documentation.pdf). A detailed description of each of the verification metrics can be found in [C3S512_Seasonal_Forecast_verification_metrics_general_description.pdf](https://earth.bsc.es/gitlab/external/c3s512-seasonal-forecast-eqc/-/blob/s-prototype/). It is recommended to have a look at these documents.


# Workflow and Modules

In order to perform the verification of a seasonal forecast system, a reference dataset (usually observations or a reanalysis) is required in addition to the seasonal forecast dataset to compute the corresponding verification metrics. Thus, two datasets covering the same variable, temporal period and spatial and temporal resolution are needed. Once the two datasets have been preprocessed so that they are completely equivalent (i.e. the previous conditions are met), the metrics can be computed and then the corresponding figures generated. All these tasks are performed by the seasonal forecast verification tool, which comprises four main modules, each responsible for a different task: 1. downloading, 2. transferring, 3. computing and 4. plotting. All the parameters that define these tasks are specified in the configuration file `conf.yaml`, which includes a specific section for each module. The main script, `eqc.py`, calls the 4 modules in consecutive order. However, the execution of any module can be manually enabled/disabled in the configuration file or automatically skipped when a given task is not necessary (for example, if the corresponding expected result already exists).

Each module expects specific inputs and provides specific outputs (figure 1). In this sense, the workflow is defined so that it tries to never stop, even in the case of missing inputs (i.e. missing files). If files required for a specific task are missing, the task is skipped, an error log is generated and the next task/module is launched. The same happens (without an error) when the expected results of a given task are already found (because of a previous run); by default, the workflow skips that task and moves on to the next module. The tool keeps running until all pending tasks are complete.

This procedure is very convenient when a set of different variables and/or systems is to be assessed. If data is missing in one case (e.g. files are missing for a specific variable), the tool proceeds with the following variable and/or system. In the same way, an incomplete task for a given variable/system can easily be resumed by launching a complete job including many tasks; only the missing ones will be executed.
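
Below is a minimal sketch (illustrative, not the tool's actual code) of the skip-or-run decision that each module applies to its tasks:

```python
# Illustrative sketch of the skip-or-run logic described above
# (not the tool's actual code).
import logging
from pathlib import Path

def run_task(task, inputs, outputs):
    """Run `task` only if its outputs are missing and its inputs exist."""
    if all(Path(p).exists() for p in outputs):
        logging.info("%s: expected results found, skipping", task.__name__)
        return
    missing = [p for p in inputs if not Path(p).exists()]
    if missing:
        logging.error("%s: missing inputs %s, skipping", task.__name__, missing)
        return
    task()
```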


![image](/uploads/98045b80dfdd78333e54093b96a11342/image.png)

A schematic description of this workflow is shown in figure 1 and briefly explained here:

* **Module download** is defined by the script `src/download.py`. It downloads the specified datasets from the CDS in the original grib file format. Before sending the download request to the CDS, the script checks what has already been downloaded and what is missing. The download task can be set to only download the missing files or, on the contrary, to re-download the dataset even if the corresponding files already exist (for example, if they have problems). A minimal request sketch appears after this list. Check section 3.2 for details.

* **Module transfer** is defined by the script `src/transfer.py`. It converts the originally downloaded grib files to the more convenient and efficient netcdf file format (also illustrated in the sketch after this list). It also interpolates the reference dataset to the seasonal forecast grid. The use of netcdf greatly improves performance, allowing different types of chunking and a more efficient use of memory. During the C3S_512 contract, the download module was rarely used, since the datachecker had already downloaded most of the datasets. In that case, the transfer module was in charge of extracting the required data from the much larger grib files downloaded by the datachecker (located at /shared; these files include more products than the ones required for the verification) and converting them into netcdf. The transfer module first checks that the expected grib files are present in the corresponding source folder and then whether the expected netcdf outputs have already been generated. The transfer can be set so that, if the corresponding netcdf files are found, those files are skipped. Check section 3.3 for more details.

* **Module compute** is defined by the script `src/compute.py`. It computes the specified metrics (mean bias, mean correlation, fRPSS and fCRPSS) between the seasonal forecast system and the selected reference (for the moment, the ERA5 reanalysis). Before computing the metrics, this module checks the consistency of NaNs throughout the time series and the matching of variable units between the seasonal forecast and the reference datasets (if they do not match, it allows a unit conversion), and finally prepares the two multidimensional arrays to be comparable (the seasonal forecast dataset has several dimensions not present in the reference: members, start months and leadtime months). For each variable and system, a single netcdf file is created that includes all the computed metrics (see the compute sketch further below). Analogously to all modules, the compute module checks whether input files in netcdf format are available and, if so, whether the metrics have already been computed by looking for the corresponding file. Check section 3.4 for more details.

* **Module plot** is defined by the script `src/plot.py`. It plots all the previously computed metrics for all the possible combinations of metrics, start dates and leadtime months. Again, it first checks whether the required file with the corresponding metrics is available and, if so, whether the expected plots already exist. It can be set so that only missing plots are generated, or so that all plots are regenerated regardless of whether they already exist. Check section 3.5 for more details.
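
As referenced in the download and transfer items above, the following sketch shows a CDS request followed by a grib-to-netcdf conversion. It is an illustration only: the dataset name, request keys and file names are assumptions for a typical seasonal dataset, not the tool's actual code.

```python
# Illustrative sketch (not src/download.py or src/transfer.py): request one
# field from the CDS and convert the resulting grib file to netcdf.
# Dataset name and request keys are assumptions for a typical seasonal dataset.
import cdsapi
import xarray as xr

c = cdsapi.Client()
c.retrieve(
    "seasonal-monthly-single-levels",
    {
        "originating_centre": "ecmwf",
        "system": "5",
        "variable": "2m_temperature",
        "product_type": "monthly_mean",
        "year": "1993",
        "month": "11",
        "leadtime_month": ["1", "2", "3"],
        "format": "grib",
    },
    "forecast.grib",
)

# Transfer step: read the grib file with the cfgrib engine and write netcdf.
ds = xr.open_dataset("forecast.grib", engine="cfgrib")
ds.to_netcdf("forecast.nc")
```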

A detailed description of all the parameters that can be set in the `conf.yaml` for each of these 4 modules can be found in the [extended documentation](https://earth.bsc.es/gitlab/external/c3s512-seasonal-forecast-eqc/-/blob/s-prototype/C3S512_sf-prototype_documentation.pdf).
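
As referenced in the compute item above, the sketch below illustrates the kind of alignment and metric computation the compute module performs. The variable and dimension names (`t2m`, `member`, `time`) are assumptions about the dataset layout, not the tool's actual code:

```python
# Illustrative sketch of the compute step (not src/compute.py).
import xarray as xr

chunks = {"latitude": 100, "longitude": 100}
fc = xr.open_dataset("forecast.nc", chunks=chunks)    # seasonal forecast
ref = xr.open_dataset("reference.nc", chunks=chunks)  # reference (e.g. ERA5)

# Average over the ensemble dimension so both arrays are comparable, then
# compute the mean bias over the common time axis.
bias = (fc["t2m"].mean("member") - ref["t2m"]).mean("time")

# All metrics for a given variable/system end up in a single netcdf file.
xr.Dataset({"mean_bias": bias}).to_netcdf("metrics.nc")
```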

# How to run the tool

The main script of the verification tool is `eqc.py`. This script can be made executable by running the following command:

`$ chmod +x eqc.py`

Once it is executable, the tool can be run simply with:

`$ ./eqc.py`

The main script will automatically load the configuration file `conf.yaml` and all the corresponding parameters. The execution of the tool provides a detailed log of the ongoing tasks. Also, if any unexpected error appears, an error log with a unique file name is generated and stored at the path specified in the `logs_path` parameter of the configuration file.
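
A minimal sketch of this configuration loading step (illustrative; `logs_path` is a documented parameter, but the rest of the schema is not shown here):

```python
# Sketch of how eqc.py might load its configuration (illustrative).
import yaml

with open("conf.yaml") as f:
    conf = yaml.safe_load(f)

logs_path = conf["logs_path"]  # where error logs are stored
```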

Additionally, as briefly explained in section 4, the file `varsys2assess.yaml` can be passed to `eqc.py` in order to compute multiple combinations of `variable` + `originating_centre` + `system`. Different values for `variable`, `originating_centre` and `system` can be provided as lists, so the execution of the tool loops over all the possible combinations. This is run by executing:

`$ ./eqc.py varsys2assess.yaml`
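
A sketch of how such a file could be expanded into individual runs (illustrative; the actual key names in `varsys2assess.yaml` may differ):

```python
# Illustrative expansion of varsys2assess.yaml into individual verification
# runs; the key names are assumptions about the file's schema.
import itertools
import yaml

with open("varsys2assess.yaml") as f:
    spec = yaml.safe_load(f)

for variable, centre, system in itertools.product(
        spec["variable"], spec["originating_centre"], spec["system"]):
    print(f"verifying {variable} / {centre} / {system}")  # one run per combo
```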

Finally, to run the tool in the background on a remote machine (allowing disconnection from the server), the `nohup` command can be used, redirecting the output to a log file. For example:

`$ nohup ./eqc.py &>/data/suso/LOGS/log_ecmwfs5_2m_temp.log &`

or 

`$ nohup ./eqc.py varsys2assess.yaml &>/data/suso/LOGS/log_ecmwfs5_dwd2_wind_vars.log &`


# Installation

The seasonal forecast verification tool can be installed by cloning the repository from the BSC GitLab:

`$ git clone https://earth.bsc.es/gitlab/external/c3s512-seasonal-forecast-eqc.git`

A complete list of the required packages can be found at `src/packages_list.txt`. To install all these Python packages using Anaconda, create a new Python 3 environment (called, for example, `eqc`) by running the following command:

`$ conda create --name eqc --file ./src/packages_list.txt`

Additional packages may require installation via pip, for example the cdsapi and the CDS downloader version developed for C3S_512 (which allows monitoring download request statistics):

`$ pip install cdsapi`

and

`$ pip install C3S512`

The R software also needs to be installed, together with two specific R packages: SpecsVerification and multiApply. To install these packages, open R in a terminal and simply run:

`> install.packages("SpecsVerification")`

and

`> install.packages("multiApply")`