# C3S-512 CDS Data Checker

The main purpose of this GitLab project is to bring together all the data-evaluation efforts for the **C**limate **D**ata **S**tore (**CDS**).
The software is designed to work with both **GRIB** and **NetCDF** files and to perform the following data checks:
* Standard compliance and file format
* Spatial/temporal completeness and consistency
* Observed/plausible data ranges
* GRIB to NetCDF experimental C3S conversion (checks for CF compliance)

## Dependencies and libraries

In order to run, the data checker requires the following binaries and libraries:

```bash
eccodes - version 2.19.0 https://confluence.ecmwf.int/display/ECC/ecCodes+installation
cdo - version 1.9.8 https://code.mpimet.mpg.de/projects/cdo/files
ecflow - version 5.4.0 https://confluence.ecmwf.int/display/ECFLOW
```

**Note**: These packages might exist in your apt/yum package repository, but it is preferred to use a conda environment.
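Before running the checker, it can help to verify that the required tools are actually visible on your PATH. Below is a minimal Python sketch; the command names `codes_info` (ecCodes), `cdo`, and `ecflow_client` (ecFlow) are assumed to be the CLI entry points shipped by these packages — adjust them if your installation differs.

```python
import shutil

# CLI entry points assumed to be installed by the three dependencies.
REQUIRED = ["codes_info", "cdo", "ecflow_client"]

def missing_binaries(names):
    """Return the subset of `names` not found on PATH."""
    return [n for n in names if shutil.which(n) is None]

missing = missing_binaries(REQUIRED)
if missing:
    print("Missing binaries:", ", ".join(missing))
else:
    print("All required binaries found")
```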
**Please use the versions specified.**

## Install & Run

When running on the C3S_512 VM, use /data/miniconda3-data to avoid installing anything in /home, which has limited space. TMPDIR should also point somewhere with more space than the default /tmp (export TMPDIR='/data/tmp/envs/'). To create your own environment, you can execute:

```bash
conda create -y python=3 --prefix /data/miniconda3-data/envs/
```

Then edit your .bashrc and set:

```bash
export TMPDIR='/data/tmp/envs/'
```

If you create a conda environment, you can easily install these dependencies:

```bash
# Create conda virtual environment
conda create -y -n dqc python=3
conda activate dqc
conda install -c conda-forge eccodes=2.19.0
conda install -c conda-forge cdo=1.9.8
conda install -c conda-forge ecflow=5.4.0

# Get code
git clone https://earth.bsc.es/gitlab/ces/c3s512-wp1-datachecker.git
cd c3s512-wp1-datachecker
pip install .

# Install requirements
pip install -r requirements.txt

# Run
cd dqc_checker
python checker.py
```

**Note**: In the following section you will find information on how to write your own **config_file**.

## Configure

- In order to run the checker you must write a simple config file (RawConfigParser ini format)
- There is a general section where general path options are specified
- There is a dataset section where dataset-dependent information shall be specified
- Each config section represents a check/test (e.g. file_format or temporal_completeness)
- Each config section might have specific parameters related to the specific check (see example below)

**Note 1**: Config examples for **ALL** available checks can be found in the **dqc_wrapper/conf** folder.
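Because the config uses the RawConfigParser ini format, it can also be generated programmatically instead of written by hand. Below is a minimal sketch; the file name `dqc_config.ini` is purely illustrative, and the values mirror the examples used elsewhere in this README.

```python
from configparser import RawConfigParser

config = RawConfigParser()
config["general"] = {
    "input": "/data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature",
    "fpattern": "ecmwf-5_fcmean*.grib",
    "log_dir": "/my/log/directory",
    "res_dir": "/my/output/directory",
    "forms_dir": "/data/cds-forms-c3s",
}
config["dataset"] = {
    "variable": "t2m",
    "datatype": "fcmean",
    "cds_dataset": "seasonal-monthly-single-levels",
    "cds_variable": "2m_temperature",
}
# Stack as many check sections as needed; each one is an independent test.
config["temporal_consistency"] = {"time_step": "1", "time_granularity": "month"}

# Write the config to disk (illustrative file name).
with open("dqc_config.ini", "w") as fh:
    config.write(fh)

# Sections other than [general] and [dataset] are the checks to run.
checks = [s for s in config.sections() if s not in ("general", "dataset")]
print(checks)  # → ['temporal_consistency']
```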

**Note 2**: The following config checks for temporal consistency. Multiple checks can be stacked in one file.

````
[general]
input = /data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature
fpattern = ecmwf-5_fcmean*.grib
log_dir = /my/log/directory
res_dir = /my/output/directory
forms_dir = /data/cds-forms-c3s

[dataset]
variable = t2m
datatype = fcmean
cds_dataset = seasonal-monthly-single-levels
cds_variable = 2m_temperature
````

## Config options (detailed)

The **config** is defined in the .ini format compatible with the Python RawConfigParser package.

Each section represents an independent data **check**. The following example is for **ALL** available tests:

````
[general]
# Directory or file to be checked.
input = /data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature
# If a directory is provided, the pattern can be used to filter the files. Can be empty; then every file is taken
fpattern = ecmwf-5*.grib
# Directory where DQC logs are stored
log_dir = /tmp/dqc_logs
# Directory where DQC test results are stored (will be created if it does not exist)
res_dir = /tmp/dqc_res
# Directory with constraints.json for every cds dataset (a.k.a. c3s-cds-forms)
forms_dir = /data/cds-forms-c3s

[dataset]
# Variable to analyze (if grib, see grib_dump command, look for cfVarName) **OPTIONAL**
variable = t2m
# Data type to analyze (if grib, see grib_ls command) **OPTIONAL**
datatype = fcmean
# Dataset (as available in c3s catalogue form)
cds_dataset = seasonal-monthly-single-levels
# Variable (form variable)
cds_variable = 2m_temperature
# Split dates or use grib_filter in order to reduce memory consumption **OPTIONAL**
split_dates = no

[file_format]
# No parameters required

[standard_compliance]
# No parameters required

[spatial_completeness]
# Land/sea mask if available
mask_file =
# Variable name within the mask grib file (default is lsm)
mask_var =

[temporal_completeness]
# Origin (for seasonal products, otherwise optional)
origin = ecmwf
# System (for seasonal products, otherwise optional)
system = 5
# Flag indicating if dataset is seasonal (monthly, daily)
is_seasonal =

[spatial_consistency]
# Resolution of the grid (positive value), typically xinc
grid_interval = 1
# Type of grid (gaussian, lonlat, ...)
grid_type = lonlat

[temporal_consistency]
# Time step, positive integer value
time_step = 1
# Time unit (Hour, Day, Month, Year) or (h, d, m, y)
time_granularity = month

[valid_ranges]
# In case the valid minimum for the data is known (otherwise, thresholds are set statistically)
valid_min =
# In case the valid maximum for the data is known (otherwise, thresholds are set statistically)
valid_max =

[netcdf_converter]
````

## Result

Each test run produces a result inside the **res_dir** specified in the **general** section.

The result zip file contains a PDF report for each of the tests launched.

For each check, a corresponding `_result` section contains a status (ok/err) indicating success or failure, a short message, and the log location.
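Since the result file uses the same ini format as the config, its `_result` sections can be read back with RawConfigParser. Below is a minimal sketch, assuming a result file shaped like the spatial-consistency example in this section:

```python
from configparser import RawConfigParser
from io import StringIO

# A result file shaped like the example in this section.
RESULT_TEXT = """\
[spatial_consistency]
grid_interval = 0.25
grid_type = lonlat

[spatial_consistency_result]
res = ok
msg = Files are spatially consistent
"""

result = RawConfigParser()
result.read_file(StringIO(RESULT_TEXT))

# Collect the outcome of every *_result section.
outcomes = {s[:-len("_result")]: result.get(s, "res")
            for s in result.sections() if s.endswith("_result")}
print(outcomes)  # → {'spatial_consistency': 'ok'}
```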

````
[spatial_consistency]
grid_interval = 0.25
grid_type = lonlat

[spatial_consistency_result]
res = ok
msg = Files are spatially consistent
````

## Recent updates

An update log tracking major modifications is available here:
* [UPDATE LOG](UPDATE_LOG.md)