EIONET UTD (up-to-date) Air Quality data retrieval

Goal

This tool provides automated data collection from the EEA's Air Quality Portal.

The aim is to improve the gathering of air quality observations across Europe by using the LIVE Air Quality Data service.
This tool also provides a single, common way to gather air quality observations from stations, phasing out the several scripts used until December 2015 in the CALIOPE forecast evaluation (one for each Comunidad Autónoma (regional government) or Ayuntamiento (city council)).

Description

This tool retrieves (near real time) air quality observations from stations that belong to the EIONET network.

The tool connects to the EIONET servers, downloads the required data, parses it, checks the validity of the observations and stores them in the Air Quality Forecast Evaluation (eval_new) database. Please note that no CSV or any other output format is provided. All observations are centralized in the database for use by the different systems (like the CALIOPE Visor).
This tool also inserts a new station in the “STATIONS” table of the DB when one is found. The fields automatically inserted are station code, station name, latitude, longitude and height above sea level.

The validity checks performed on the retrieved data are the ones described in this PDF: Filtrado de observaciones CALIOPE (Filtering of CALIOPE observations)
This is a summary of the quality control performed on the downloaded data:

  • Flag IR is applied if observation is not in range for its pollutant.
  • Flag DS is applied if observation is constant for the past five hours (-5h..-1h range) and above a value of 10 for its pollutant.
  • Flag MP is applied if previous observation (past hour) is an outlier (has maximum slide change) for its pollutant.

Any observation flagged is not considered by the different evaluations and post processes in the CALIOPE Forecast (mainly the Kalman Filter).
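
As an illustration only, the IR and DS checks summarized above could be sketched as follows. The valid ranges, the function name and the reading of "constant" (the current value equals each of the five previous hourly values) are assumptions, not taken from the tool's source. The MP check is left out because its exact rule is not specified here.

```python
# Hypothetical valid ranges per pollutant (illustrative numbers only,
# not the ranges used by the actual tool).
VALID_RANGE = {"O3": (0.0, 500.0), "NO2": (0.0, 700.0)}

# "above a value of 10" comes from the summary of the DS flag.
CONSTANT_THRESHOLD = 10.0


def flag_observation(pollutant, value, previous_five):
    """Return the set of flags raised for one hourly observation.

    previous_five holds the values for hours -5..-1, oldest first.
    """
    flags = set()

    # IR: the value is outside the valid range for its pollutant.
    low, high = VALID_RANGE[pollutant]
    if not (low <= value <= high):
        flags.add("IR")

    # DS: the value has been constant over the past five hours
    # and is above the threshold (assumed interpretation).
    if (len(previous_five) == 5
            and all(v == value for v in previous_five)
            and value > CONSTANT_THRESHOLD):
        flags.add("DS")

    return flags
```

A flagged observation would then be skipped by any later evaluation step, for example by keeping only observations whose flag set is empty.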

The output of the process is stored into the table “OBS_AQ” of the MySQL database (eval_new) with the identifier DOMAINS_id=6, as this is the Domain for the values coming from Station observations.

Data coverage

This tool can provide observations for any pollutant and country available.
The full list of pollutants available at the EIONET download service can be found here: Pollutants.csv (original source)
However, only the following pollutants are available in the primary up-to-date assessment data - measurements (E2a):

CALIOPE acronym   EIONET notation
O3                O3
NO2               NO2
SO2               SO2
PM10T             PM10
PM2.5T            PM2.5
CO                CO
n/a?              C6H6
NO                NO

The EIONET (up-to-date) air quality data is available for the following geographic areas/countries:

Code  Name             Code  Name
AD    Andorra          IE    Ireland
AT    Austria          LT    Lithuania
BE    Belgium          LU    Luxembourg
DE    Germany          LV    Latvia
DK    Denmark          MK    Macedonia
ES    Spain            MT    Malta
FI    Finland          NL    Netherlands
FR    France           NO    Norway
GB    United Kingdom   PL    Poland
GI    Gibraltar        PT    Portugal
HR    Croatia          SE    Sweden
HU    Hungary          SI    Slovenia
                       SK    Slovakia

To see the current status on data delivery, which countries are delivering data and what they deliver, please see this E2a/UTD Air quality - primary pollutants delivery LIVE report. The former, more detailed report can be found here: live report (in use until Dec/2015).

For further information about stations definitions used by EIONET please refer to: EIONET definitions for AQ Stations

Please note that EIONET reports a network_timezone for each station, which is the time zone of the reported observations. This tool automatically converts all observations to UTC, so all observations stored in the eval_new database are in UTC.
Since 20/apr/2016 EIONET has been reporting the time zone metadata for observations in Spain, Lithuania, Macedonia and Slovenia, which was previously missing. Therefore, all observations since 20/apr/2016 should be as correct as the information provided by the Member States to EIONET. As of 05/may/2016 we are still troubleshooting (in contact with Generalitat de Catalunya and EIONET) a discrepancy of 1 hour in the observations for Catalunya.
Until 20/apr/2016, observations were assumed to be in UTC when no metadata was provided.
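
The conversion to UTC can be illustrated with a minimal sketch. It assumes that the time zone metadata of a station can be reduced to a fixed offset in hours from UTC. The function name and the offset parameter are hypothetical, and the real format of network_timezone is not shown here.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive, utc_offset_hours=0):
    """Convert a naive local timestamp to a naive UTC timestamp.

    utc_offset_hours is an assumed representation of the station time
    zone. The default of 0 mirrors the fallback described above:
    without metadata, the observation is taken to be in UTC already.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    aware = local_naive.replace(tzinfo=local_tz)
    # Drop tzinfo again so the value can be stored as a plain UTC datetime.
    return aware.astimezone(timezone.utc).replace(tzinfo=None)
```

For example, an observation stamped 12:00 at a station reporting UTC+1 would be stored as 11:00 UTC.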

Please refer to the “Usage” section for configuration options and to see details about pollutant acronyms/notations and time spans of observations.

Requirements

This tool requires the Python 3.4 interpreter.
The source code requires the packages “requests” (at least v2.8.1) and “pymysql” (at least v0.6.7) to be installed. They can be installed by calling: pip3 install --user requests pymysql

Usage

This tool is designed to be run as a cron job (cron is a time-based job scheduler). However, it can also be used as a one-off command-line application. Please see below for command-line usage.
It is recommended to run the process every 8 hours in order to avoid saturating EIONET. The process can be run at any time, but running it at intervals shorter than 2 hours is discouraged because the service will not have any new observations to serve. Running it at intervals longer than 24 hours is also discouraged if near real time evaluation of the air quality forecast is wanted.

Command-line usage/interface

The calls to this tool are as follows, using the common format EIONETretriever.py [command] configs [options]:

python3 EIONETretriever.py download ES

This is the standard usage. The tool will look for the “ES.conf” file under the “config” directory. The log will be written to the “logs/ES” directory. Similarly, to download observations for France and Italy the following syntax is needed (assuming the corresponding FRIT.conf file is under the config directory):

python3 EIONETretriever.py download FRIT

The three available commands are:

  • download: Normal operation. The automated data download uses the UpdatedSince and CreatedSince filters. In this mode, the retriever keeps track of the date of the last successful download for each pollutant and country, storing it in the DOWNLOAD_DATE table in the database.
  • download_no_filters: Manual operation, intended to troubleshoot missing data in the DB. Since the filters are not used in this mode, all observations from the specified time window will be downloaded (be aware of the 50k observations limit per request that the EIONET service has in place). The retriever does not keep track of the last successful downloads in this mode.
  • download_sliding_no_filters: Same as the “download_no_filters” mode, but it takes dates relative to today. Please see the example below.

In the “download_no_filters” mode the usage is as follows:

python3 EIONETretriever.py download_no_filters GER --fromDate 2016-01-01 --toDate 2016-01-03 > logs/GER/GER-20160101-20160102.log

In this example, all observations of the countries (de,at,pl) and pollutants defined in the “GER.conf” file from 01/jan/2016 to 03/jan/2016 (not included) will be downloaded and stored in the database.

Example for “download_sliding_no_filters”:
If today is 2016-01-27 and we want to download the observations of days 2016-01-12 and 2016-01-13 we can get them in two ways:

python3 EIONETretriever.py download_no_filters ES --fromDate 2016-01-12 --toDate 2016-01-14 > logs/ES/ES-20160112-20160113.log
python3 EIONETretriever.py download_sliding_no_filters ES --fromDaysAgo 15 --toDaysAgo 13 > logs/ES/ES-20160112-20160113.log

Please note that in this mode, if the --toDaysAgo option is not provided, the download will run up to the present (now).
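
The relative dates of the sliding mode can be checked with a short sketch. The function name is hypothetical and this is not the tool's own code. It only reproduces the arithmetic of the example, in which the end date is not included in the window.

```python
from datetime import date, timedelta


def sliding_window(today, from_days_ago, to_days_ago=None):
    """Translate relative day offsets into absolute dates.

    Returns (from_date, to_date). A to_date of None stands for "now",
    matching the default when the end offset is omitted.
    """
    from_date = today - timedelta(days=from_days_ago)
    if to_days_ago is None:
        return from_date, None
    return from_date, today - timedelta(days=to_days_ago)
```

With today set to 2016-01-27, offsets 15 and 13 give 2016-01-12 and 2016-01-14, which are the same dates passed to download_no_filters in the example.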

As usual, the --help (or -h) option displays the command-line manual/help. Please note that this article always refers to the 'download' command unless explicitly stated otherwise.

Configuration file

The amount of data to be downloaded can be easily configured in this tool by modifying the corresponding “*.conf” file in the config source directory. The available options are:

  • countrycodes: Comma-separated list of EEA/EIONET member states (e.g.: es,de,pt) whose station observations we want to download. ISO 3166-1 alpha-2 notation must be used.
  • [pollutants] section: Pollutants whose observations are to be downloaded. To add another pollutant, add a line with the CALIOPE acronym as key and the EIONET notation as value (acronym in CALIOPE DB / EIONET notation for the request). Please note that the EIONET notation must be written in URI-safe format (e.g.: PM2.5 should be PM2%2E5).
  • days_of_obs: Number of past days of observations (counting back from today) to be downloaded.
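
For illustration, a configuration with these options could be read with the standard configparser module, as sketched below. The “general” section name and the exact file layout are assumptions, since only the [pollutants] section is named above. Interpolation is disabled because a value such as PM2%2E5 contains a percent sign, and key case is preserved so that the CALIOPE acronyms are not lowercased.

```python
import configparser

# Hypothetical file content. Only the option names and the [pollutants]
# section come from the description above. "[general]" is an assumption.
SAMPLE_CONF = """
[general]
countrycodes = es,de,pt
days_of_obs = 8

[pollutants]
O3 = O3
PM10T = PM10
PM2.5T = PM2%2E5
"""


def load_config(text):
    """Parse the configuration text into (countries, days, pollutants)."""
    # interpolation=None: keep '%' in values such as PM2%2E5 untouched.
    parser = configparser.ConfigParser(interpolation=None)
    # Preserve key case (the default would turn PM10T into pm10t).
    parser.optionxform = str
    parser.read_string(text)

    general = parser["general"]
    countries = [code.strip() for code in general["countrycodes"].split(",")]
    days_of_obs = general.getint("days_of_obs")
    pollutants = dict(parser["pollutants"])  # CALIOPE acronym -> EIONET notation
    return countries, days_of_obs, pollutants
```

Calling load_config(SAMPLE_CONF) yields the country list, the number of days and a dictionary that maps each CALIOPE acronym to its EIONET notation.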

All configurations are automatically handled and the purpose of the following descriptions is just to document the functionality.

Pollutants retrieved

The pollutants whose observations are retrieved are defined in the 'pollutants' section of the configuration file. Please note that both the CALIOPE and EIONET notations must be provided. Please refer to the CSV in the “Data coverage” section to see all EIONET notations.
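
The URI-safe form required in the configuration file (PM2.5 written as PM2%2E5) can be produced with a small helper such as the hypothetical one below. Note that urllib.parse.quote leaves the dot untouched because it is an unreserved character, so the dot is replaced explicitly.

```python
from urllib.parse import quote


def eionet_uri_safe(notation):
    """Percent-encode an EIONET notation, including the dot."""
    # quote() does not encode '.', so it is replaced by hand.
    return quote(notation, safe="").replace(".", "%2E")
```

For notations without special characters, such as O3 or NO2, the helper returns the input unchanged.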

Time span of observations retrieved

The tool requests observations in the time span from 'days_of_obs' days ago to the current date and time (now).
Please note that when FromDate is used, ToDate is mandatory to obtain a response from the service. (When using the command-line options, ToDate defaults to 'now' if it is not defined.)
Because data providers have different upload and update patterns, at least a couple of days of observations must be requested through the FromDate field. At the same time, the UpdatedSinceDate and InsertedSinceDate filters (see “Filters on data download” below) should be used to avoid downloading too much duplicate data and hitting the maximum number of records per request of the EIONET service (set at 50k records).
The recommended value for 'days_of_obs' is 8, which allows data providers to recover from malfunctions at stations and upload the missing observations. This value can be safely increased when the filters on data download are used.

Filters on data download

The tool has two filters ('UPDATE_FILTER', 'INSERTED_FILTER') that can be used to request only the data that is new since the last download. Due to the unreliable update patterns of some data providers, this tool first launches the request to the EIONET service with the UpdatedSinceDate filter and then launches it again with the InsertedSinceDate filter. After some data curation and testing, it was found that some data providers first allocate space for observations (caught by the InsertedSinceDate filter) and update the value a few hours or days later (caught by the UpdatedSinceDate filter).

In summary, the time spans of observations that this tool requests are:

  1. FromDate: (n-days) -> ToDate: now (irrespective of the InsertedSinceDate and UpdatedSinceDate filters)
  2. UpdatedSinceDate: last execution (FromDate and ToDate must be used)
  3. InsertedSinceDate: last execution (FromDate and ToDate are mandatory)
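
The summary can be sketched as two sets of request parameters that share the same time window. The parameter names are the ones used in this article. The date format, the function name and the way the parameters are passed to the service are assumptions.

```python
from datetime import datetime, timedelta

# Assumed timestamp format. The format expected by the service is not
# documented in this article.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def build_requests(now, days_of_obs, last_execution):
    """Return the parameters of the two filtered requests, in call order."""
    # Both requests carry FromDate and ToDate, as the filters require them.
    window = {
        "FromDate": (now - timedelta(days=days_of_obs)).strftime(DATE_FORMAT),
        "ToDate": now.strftime(DATE_FORMAT),
    }
    stamp = last_execution.strftime(DATE_FORMAT)
    updated = dict(window, UpdatedSinceDate=stamp)    # launched first
    inserted = dict(window, InsertedSinceDate=stamp)  # launched second
    return [updated, inserted]
```

Each dictionary could then be handed to an HTTP client as query parameters, once the actual endpoint and date format are known.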

In case of duplicate observations due to overlapping time spans of the downloaded data, this tool prevents inserting duplicates (date, Station_id, Pollutant_id are unique).
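
The effect of the unique key can be demonstrated with a self-contained sketch. It uses an in-memory SQLite database as a stand-in for the MySQL database, and the table layout is reduced to hypothetical columns. Only the three key columns are taken from the text above. In MySQL a comparable effect could be obtained with INSERT IGNORE, but how the tool itself handles duplicates is not shown here.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the OBS_AQ table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OBS_AQ ("
        " date TEXT NOT NULL,"
        " Station_id INTEGER NOT NULL,"
        " Pollutant_id INTEGER NOT NULL,"
        " value REAL,"
        " UNIQUE (date, Station_id, Pollutant_id))"
    )
    return conn


def insert_observations(conn, rows):
    """Insert rows, silently skipping any that repeat the unique key."""
    conn.executemany(
        "INSERT OR IGNORE INTO OBS_AQ (date, Station_id, Pollutant_id, value)"
        " VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def count_rows(conn):
    """Return the number of observations currently stored."""
    return conn.execute("SELECT COUNT(*) FROM OBS_AQ").fetchone()[0]
```

Inserting two overlapping batches leaves a single row for each (date, Station_id, Pollutant_id) combination, and the value stored first is kept.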

Further documentation

Repository

The link to the Git repository is:

https://earth.bsc.es/gitlab/es/EIONET.git

Contact

The developer of this tool is Jordi Cuadrado Borbonés (jordi.cuadrado@bsc.es), under the guidance of Kim Serradell (kim.serradell@bsc.es).

Style Guide

This tool is coded in Python 3.
You can check the general style guide for Python development here

tools/eionet-utdretriever.txt · Last modified: 2016/12/29 09:09 by kserrade