# C3S-512 CDS Data Checker

The main purpose of this GitLab project is to bring together the data evaluation efforts for the **C**limate **D**ata **S**tore (**CDS**).<br/>
The software is designed to work with both **GRIB** and **NetCDF** files and performs the following data checks:<br/>

* Standard compliance and file format
* Spatial/Temporal completeness and consistency
* Observed/Plausible data ranges
* GRIB to NetCDF experimental C3S conversion (checks for CF compliance)

## Dependencies and libraries

In order to run, the data checker requires the following binaries and libraries:
```bash
eccodes - version 2.19.0
https://confluence.ecmwf.int/display/ECC/ecCodes+installation

cdo - version 1.9.8
https://code.mpimet.mpg.de/projects/cdo/files

ecflow - version 5.4.0
https://confluence.ecmwf.int/display/ECFLOW
```

**Note**: These packages might already be available in your apt/yum package repository, but it is preferred to use a conda environment.<br/>
**Please use the versions specified**
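
As a quick sanity check, the installed versions can be verified from the command line (a minimal sketch; output formats vary between installations):

```bash
# Print the installed versions to confirm they match the ones required above
codes_info               # ecCodes build and version information
cdo -V                   # CDO version banner
ecflow_client --version  # ecFlow client version
```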

## Install & Run
When running on the C3S_512 VM, use the /data/miniconda3-data prefix to avoid installing anything in /home, which has limited space. TMPDIR should also be set to a location with more space than the default /tmp (export TMPDIR='/data/tmp/envs/').
To create your own environment you can execute:
```bash
conda create -y python=3 --prefix /data/miniconda3-data/envs/<NAME>
```
Then edit your bashrc and set:

```bash
export TMPDIR='/data/tmp/envs/'
```
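
If you created the environment with the `--prefix` option as above, it can be activated by its path rather than by name (a sketch; `<NAME>` is whatever you chose when creating it):

```bash
# Activate the environment created under /data/miniconda3-data (replace <NAME>)
conda activate /data/miniconda3-data/envs/<NAME>
```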

If you create a conda environment, you can easily install these dependencies:

```bash
# Create conda virtual environment
conda create -y -n dqc python=3
conda activate dqc
conda install -c conda-forge eccodes=2.19.0
conda install -c conda-forge cdo=1.9.8
conda install -c conda-forge ecflow=5.4.0
git clone https://earth.bsc.es/gitlab/ces/c3s512-wp1-datachecker.git
cd c3s512-wp1-datachecker
pip install .
# Install requirements
pip install -r requirements.txt

# Run
cd dqc_checker
python checker.py <config_file>
```
**Note**: In the following section you will find information on how to write your own **config_file**.

## Configure

In order to run the checker you must write a simple config file (RawConfigParser .ini format):

* There is a general section where general path options are specified
* There is a dataset section where dataset-dependent information shall be specified
* Each remaining section represents a check/test (e.g. file_format or temporal_completeness)
* Each check section might have specific parameters related to that specific check (see example below)

**Note 1**: Config examples for **ALL** available checks can be found in the **dqc_wrapper/conf** folder.<br></br>
**Note 2**: The following config checks for temporal consistency. Multiple checks can be stacked in one file.

````
[general]
input = /data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature
fpattern = ecmwf-5_fcmean*.grib
log_dir = /my/log/directory
res_dir = /my/output/directory
forms_dir = /data/cds-forms-c3s

[dataset]
variable = t2m
datatype = fcmean
cds_dataset = seasonal-monthly-single-levels
cds_variable = 2m_temperature

[temporal_consistency]
time_step = 1
time_granularity = month
````
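
As a usage sketch, assuming the example above is saved under a hypothetical name such as `temporal_check.ini`, the checker would then be run as in the Install & Run section:

```bash
# temporal_check.ini is a placeholder name; any path to a valid config works
python checker.py temporal_check.ini
# Logs are written to log_dir and the zipped report to res_dir (as set in [general])
```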

## Config options (detailed)

The **config** is defined in the .ini format, compatible with the Python RawConfigParser package.<br></br>
Each section represents an independent data **check**. The following example includes **ALL** available tests:<br></br>

````
[general]
# Directory or file to be checked
input = /data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature
# If a directory is provided, the pattern can be used to filter the files. It can be empty; then every file is taken
fpattern = ecmwf-5*.grib
# Directory where DQC logs are stored
log_dir = /tmp/dqc_logs
# Directory where DQC test results are stored (will be created if it does not exist)
res_dir = /tmp/dqc_res
# Directory with constraints.json for every CDS dataset (a.k.a. c3s-cds-forms)
forms_dir = /data/cds-forms-c3s

[dataset]
# Variable to analyze (if grib, see grib_dump command, look for cfVarName) **OPTIONAL**
variable = t2m
# Data type to analyze (if grib, see grib_ls command) **OPTIONAL**
datatype = fcmean
# Dataset (as available in the C3S catalogue form)
cds_dataset = seasonal-monthly-single-levels
# Variable (form variable)
cds_variable = 2m_temperature
# Split dates or use grib_filter in order to reduce memory consumption **OPTIONAL**
split_dates = no

[file_format]
# No parameters required

[standard_compliance]
# No parameters required

[spatial_completeness]
# Land/sea mask if available
mask_file = 
# Variable name within the mask grib file (default is lsm)
mask_var = 

[temporal_completeness]
# Origin (for seasonal products, otherwise optional)
origin = ecmwf
# System (for seasonal products, otherwise optional)
system = 5
# Flag indicating if dataset is seasonal (monthly, daily)
is_seasonal = 

[spatial_consistency]
# Resolution of the grid (positive value), typically xinc
grid_interval = 1
# Type of grid (gaussian, lonlat, ...)
grid_type = lonlat

[temporal_consistency]
# Time step, positive integer value
time_step = 1
# Time unit (Hour, Day, Month, Year) or (h, d, m, y)
time_granularity = month

[valid_ranges]
# In case the valid minimum for the data is known (otherwise, thresholds are set statistically)
valid_min =
# In case the valid maximum for the data is known (otherwise, thresholds are set statistically)
valid_max =

[netcdf_converter]
````
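
Since each section is an independent check, a config only needs to contain the checks you actually want to run. The following is a minimal sketch (paths, dataset values and the file name `my_subset.ini` are placeholders) that would run only the file format and temporal consistency checks:

```bash
# Write a config with a subset of checks and run it
# (paths, dataset values and the file name are placeholders)
cat > my_subset.ini << 'EOF'
[general]
input = /path/to/data
fpattern = *.grib
log_dir = /tmp/dqc_logs
res_dir = /tmp/dqc_res
forms_dir = /data/cds-forms-c3s

[dataset]
cds_dataset = seasonal-monthly-single-levels
cds_variable = 2m_temperature

[file_format]

[temporal_consistency]
time_step = 1
time_granularity = month
EOF

python checker.py my_subset.ini
```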
## Result 

Each test run produces a result inside the **res_dir** specified in the **general** section.<br></br>
The result zip file contains a PDF report for each of the tests launched.<br></br>
For each check, a corresponding **_result** section indicates success (ok/err) together with a short message and the log location.<br></br>

````
[spatial_consistency]
grid_interval = 0.25
grid_type = lonlat

[spatial_consistency_result]
res = ok
msg = Files are spatially consistent
````
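
To inspect a result, the zip archive in **res_dir** can be extracted and the PDF reports listed (a sketch; the actual archive name depends on the dataset and the checks that were run):

```bash
# Extract the result archive and list the PDF reports
# (dqc_results.zip is a placeholder name)
cd /tmp/dqc_res
unzip dqc_results.zip -d dqc_results
ls dqc_results/*.pdf
```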

## Recent updates

You can find an update log tracking major modifications here:<br>
* [UPDATE LOG](UPDATE_LOG.md) 