# C3S-512 CDS Data Checker

This GitLab project brings together the data evaluation efforts for the **C**limate **D**ata **S**tore (**CDS**).<br/>
The software works with both **GRIB** and **NetCDF** files and performs the following data checks:<br/>

* Standard compliance and file format
* Spatial/Temporal completeness and consistency
* Observed/Plausible data ranges
* GRIB to NetCDF experimental C3S conversion (checks for CF compliance)

## Dependencies and libraries

In order to run, the data checker requires binaries and libraries:
```bash
eccodes - version 2.19.0
https://confluence.ecmwf.int/display/ECC/ecCodes+installation

cdo - version 1.9.8
https://code.mpimet.mpg.de/projects/cdo/files

ecflow - version 5.4.0
https://confluence.ecmwf.int/display/ECFLOW
```

**Note**: These packages might exist in your apt / yum package repository, but using a conda environment is preferred. <br/>
**Please use the versions specified.**
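
Before running the checker, it can help to confirm that the command-line tools are on the `PATH`. The following is a minimal sketch and not part of the project; it assumes that `grib_ls` (ecCodes), `cdo` and `ecflow_client` (ecFlow) are the relevant executables, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names: grib_ls ships with ecCodes, ecflow_client with ecFlow.
REQUIRED_TOOLS = ("grib_ls", "cdo", "ecflow_client")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

This only checks that the executables exist; it does not verify the versions listed above.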

## Install & Run
When running on the C3S_512 VM, use /data/miniconda3-data to avoid installing anything in /home, which has limited space. TMPDIR should also point to a location with more space than the default /tmp (`export TMPDIR='/data/tmp/envs/'`).
To create your own environment, you can execute:
```bash
conda create -y python=3 --prefix /data/miniconda3-data/envs/<NAME>
```
Then edit your bashrc and set:

```bash
export TMPDIR='/data/tmp/envs/'
```
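
To confirm that the setting took effect, Python's `tempfile` module can report which temporary directory it resolves and how much space is free there. This is a minimal sketch and not part of the checker; `tmp_dir_report` is a hypothetical helper name.

```python
import shutil
import tempfile


def tmp_dir_report(path=None):
    """Return (directory, free space in GiB) for the temp dir in use."""
    # tempfile.gettempdir() honours TMPDIR when it names a writable directory.
    directory = path or tempfile.gettempdir()
    free_gib = shutil.disk_usage(directory).free / 1024 ** 3
    return directory, free_gib


directory, free_gib = tmp_dir_report()
print(f"Temporary directory: {directory} ({free_gib:.1f} GiB free)")
```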

If you create a conda environment, you can easily install these dependencies:

```bash
# Create conda virtual environment
conda create -y -n dqc python=3
conda activate dqc
conda install -c conda-forge eccodes=2.19.0
conda install -c conda-forge cdo=1.9.8
conda install -c conda-forge ecflow=5.4.0

# Get code
git clone https://earth.bsc.es/gitlab/ces/c3s512-wp1-datachecker.git
cd c3s512-wp1-datachecker
pip install .
# Install requirements
pip install -r requirements.txt

# Run
cd dqc_checker
python checker.py <config_file>

```
**Note**: In the following section you will find information on how to write your own **config_file**.

## Configure

* To run the checker, write a simple config file (RawConfigParser ini format).
* A general section specifies the general path options.
* A dataset section specifies dataset-dependent information.
* Every other config section represents a check/test (e.g. file_format or temporal_completeness).
* A check section may take parameters specific to that check (see example below).

**Note 1**: Config examples for **ALL** available checks can be found in the **dqc_wrapper/conf** folder.<br/>
**Note 2**: The following config checks for temporal consistency. Multiple checks can be stacked in one file.

````
[general]
input = /data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature
fpattern = ecmwf-5_fcmean*.grib
log_dir = /my/log/directory
res_dir = /my/output/directory
forms_dir = /data/cds-forms-c3s

[dataset]
variable = t2m
datatype = fcmean
cds_dataset = seasonal-monthly-single-levels
cds_variable = 2m_temperature
````
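
The layout described above (a general section, a dataset section, and one section per check) can be inspected with Python's standard `configparser` module. The sketch below is illustrative only and does not reproduce how the checker itself loads its config; `list_checks` is a hypothetical helper, and the two check sections in the sample text are added here for illustration.

```python
from configparser import RawConfigParser

# Sections that hold paths and dataset metadata rather than a check.
NON_CHECK_SECTIONS = {"general", "dataset"}


def list_checks(config_text):
    """Return the names of the check sections found in an ini-style config."""
    parser = RawConfigParser()
    parser.read_string(config_text)
    return [s for s in parser.sections() if s not in NON_CHECK_SECTIONS]


example = """
[general]
input = /data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature
log_dir = /my/log/directory
res_dir = /my/output/directory

[dataset]
cds_dataset = seasonal-monthly-single-levels
cds_variable = 2m_temperature

[file_format]

[temporal_consistency]
time_step = 1
time_granularity = month
"""

print(list_checks(example))  # ['file_format', 'temporal_consistency']
```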

## Config options (detailed)

The **config** uses the .ini format read by Python's `configparser.RawConfigParser`.<br/>
Each section after **general** and **dataset** represents an independent data **check**. The following example covers **ALL** available tests:<br/>

````
[general]
# Directory or file to be checked.
input = /data/dqc_test_data/seasonal/seasonal-monthly-single-levels/2m_temperature
# If a directory is provided the pattern can be used to filter the files. Can be empty, then every file is taken
fpattern = ecmwf-5*.grib 
# Directory where DQC logs are stored
log_dir = /tmp/dqc_logs
# Directory where DQC test results are stored (will be created if it does not exist)
res_dir = /tmp/dqc_res
# Directory with a constraints.json for each CDS dataset (a.k.a. c3s-cds-forms)
forms_dir = /data/cds-forms-c3s

[dataset]
# Variable to analyze (if grib, see grib_dump command, look for cfVarName) **OPTIONAL**
variable = t2m
# Data type to analyze (if grib, see grib_ls command) **OPTIONAL**
datatype = fcmean
# Dataset (as available in c3s catalogue form)
cds_dataset = seasonal-monthly-single-levels
# Variable (form variable)
cds_variable = 2m_temperature
# Split dates or use grib_filter in order to reduce memory consumption **OPTIONAL**
split_dates = no

[file_format]
# No parameters required

[standard_compliance]
# No parameters required

[spatial_completeness]
# Land/Sea mask if available
mask_file = 
# Variable name within the mask grib file (default is lsm)
mask_var = 

[temporal_completeness]
# Origin (for seasonal products, otherwise optional)
origin = ecmwf
# System (for seasonal products, otherwise optional)
system = 5
# Flag indicating if dataset is seasonal (monthly, daily)
is_seasonal = 

[spatial_consistency]
# Resolution of the grid (positive value), typically xinc
grid_interval = 1
# Type of Grid (gaussian, lonlat, ...)
grid_type = lonlat

[temporal_consistency]
# Time step, positive integer value
time_step = 1
# Time unit (Hour, Day, Month, Year) or (h,d,m,y)
time_granularity = month

[valid_ranges]
# In case the valid minimum for the data is known (Otherwise, thresholds are set statistically)
valid_min =
# In case the valid maximum for the data is known (Otherwise, thresholds are set statistically)
valid_max =

[netcdf_converter]
````
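
Several options above may be left empty (for example `valid_min` and `valid_max`, in which case thresholds are set statistically). The sketch below shows one way to read such optional values with `RawConfigParser`, returning `None` for an empty or missing entry. It is an assumption about how a caller might handle this, not the checker's own code; `optional_float` is a hypothetical helper and the value `180` is purely illustrative.

```python
from configparser import RawConfigParser


def optional_float(parser, section, option):
    """Return the option as a float, or None if it is missing or left empty."""
    raw = parser.get(section, option, fallback="").strip()
    return float(raw) if raw else None


text = """
[valid_ranges]
valid_min = 180
valid_max =
"""

parser = RawConfigParser()
parser.read_string(text)

limits = (
    optional_float(parser, "valid_ranges", "valid_min"),
    optional_float(parser, "valid_ranges", "valid_max"),
)
print(limits)  # (180.0, None)
```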
## Result 

Each test run produces a result inside the **res_dir** specified in the **general** section.<br/>
The result zip file contains a PDF report for each of the tests launched.<br/>
For each check, a **_result** section reports the outcome (ok/err), a short message, and the log location.<br/>

````
[spatial_consistency]
grid_interval = 0.25
grid_type = lonlat

[spatial_consistency_result]
res = ok
msg = Files are spatially consistent
````
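
Assuming the result is stored in the same ini format as shown above, failed checks could be collected with a short script such as the following sketch. `failed_checks` is a hypothetical helper, and the failing `temporal_consistency_result` entry with its message is invented here for illustration.

```python
from configparser import RawConfigParser

RESULT_SUFFIX = "_result"


def failed_checks(result_text):
    """Map each check whose result is not 'ok' to its message."""
    parser = RawConfigParser()
    parser.read_string(result_text)
    failures = {}
    for section in parser.sections():
        if not section.endswith(RESULT_SUFFIX):
            continue  # skip the input sections such as [spatial_consistency]
        if parser.get(section, "res", fallback="").strip().lower() != "ok":
            check = section[: -len(RESULT_SUFFIX)]
            failures[check] = parser.get(section, "msg", fallback="")
    return failures


example_result = """
[spatial_consistency]
grid_interval = 0.25
grid_type = lonlat

[spatial_consistency_result]
res = ok
msg = Files are spatially consistent

[temporal_consistency_result]
res = err
msg = hypothetical failure message
"""

print(failed_checks(example_result))
```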

## Recent updates

Major modifications are tracked in the update log:<br>
* [UPDATE LOG](UPDATE_LOG.md) 

<br><br>