Useful information

Here you can find some theoretical background on the data stored in the repository of the Earth Sciences Department, as well as some useful information and how-tos.


Types of Data

There are three main types: observations, reconstructions and experiments.

Observations

Data from in situ measurements, satellites and other sources where dynamical models are not used. It can be the raw data directly, or data that has been processed previously (for instance, to grid the observations).

Examples from the shared repository:

  • HadISST (version 1.1): /esnas/obs/nasa/hadisst_v1.1
  • CRU (version 3.0): /esnas/obs/uea/cru_v3.0
Reconstructions

Data that comes from the combination of model information and observations in different ways, usually involving data assimilation, to produce the best estimate of the atmosphere and other components of the climate system at a given time.

The term analysis refers to reconstructions of the contemporaneous state of the climate system, while reanalysis corresponds to an analysis for the past using all the observations available at the time the analysis is performed.

Examples from the shared repository:

  • ERA-Interim: /esnas/recon/ecmwf/erainterim
  • MERRA: /esnas/recon/nasa/merra
  • GLORYS 2 (version 3): /esnas/recon/mercator/glorys2_v3
Experiments

Data from model simulations. In the case of climate predictions, the simulations can use reconstructions as initial conditions; otherwise they are started from some sort of model restart.

The experiments can have different ensemble sizes, starting from one member. Each member is part of the same experiment but is run with slightly perturbed initial conditions.

Examples from the shared repository:

  • ECMWF System 4: /esnas/exp/ecmwf/system4_m1
  • EC-Earth: /esnas/exp/ecearth

Types of variables

There are several types of variables:

  • Instantaneous (or intensive): Variables with values taken at a precise instant. These are variables like temperature or pressure.
  • Extensive: Variables that depend on the system size, like mass or volume. Precipitation is an example of an extensive variable. It can be found as an accumulation of the volume since a given hour (for example in meters) or as a flux (in meters/second).
  • Spectral: Variables that are represented as spectral coefficients, like geopotential and pressure-level wind in IFS. The ECMWF model uses a spectral method (based on a spherical harmonics expansion) as one of its numerical representations of global fields, which is useful to save storage space.
Extensive Variables

Variables of this type, such as precipitation, usually come as accumulated values. This means that the value at a given time represents the sum of the volume since the starting time (in the case of precipitation this is normally expressed in meters).

Sometimes it is more useful to have that information as a flux: the value accumulated over a period of time divided by the number of seconds in that period (for precipitation it would be expressed in meters/second).

[Figure: Accumulation]

[Figure: Flux]

To calculate this flux we have different approaches. As an example, we will use the ERA-Interim dataset to demonstrate these calculations.

In this dataset, we have start dates every 12 hours and, for each start date, information for the following 10 days at these timesteps:

  • every 3 hours for the first day (beginning at 3 hours)
  • every 6 hours for the second day
  • every 12 hours for the rest of the days

We would like to have a file for every month, with a value of the flux for each day. To calculate that we have two options:

Flux calculations

The first option is to sum the accumulation over the first 12 hours (the 4th timestep) of the two start dates of each day (at 00 and 12), and then divide it by the number of seconds in a day (86400 seconds):

(C + E) / 86400 = day 01 (s0-12h)

The other way is to use the second 12 hours of each start date (timesteps number 4 and 8). In this case, we have to remove the accumulation of the first 12 hours from the value at the 8th timestep, to keep only the second 12 hours, and then divide by the number of seconds:

((B - A) + (D - C)) / 86400 = day 01 (s12-24h)

In our case, we use the second approach, as it was closer to the observed values. With this option, we have to take into account that to calculate the flux for one day we need data from the last start date of the day before. This means that for the first month of the first year of this dataset, the first day contains only the flux calculated from the second 12 hours of that day.

In other datasets, such as System 4 (m1) from ECMWF, it is easier to calculate the flux. In this case, we only have to subtract the accumulation at the previous timestep from the accumulation at the current timestep and then divide by the number of seconds.
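
As an illustration, here is a minimal Python/numpy sketch of this simple de-accumulation (the numbers and variable names are invented for the example; it assumes a series of daily accumulated values all referenced to the same start date):

import numpy as np

# Accumulated precipitation since the start date, one value per day (in meters).
# These numbers are made up for the example.
acc = np.array([0.004, 0.009, 0.011, 0.018])

# The difference between consecutive timesteps gives the amount accumulated in
# each 24-hour interval; the first value is already the first interval.
daily_amount = np.diff(acc, prepend=0.0)

# Divide by the number of seconds in the interval to obtain a flux (m/s).
flux = daily_amount / 86400.0
print(flux)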


Storing the data: scale factor and offset

There are several ways to store the data in NetCDF files. Depending on the type of the variables, the resulting file will occupy more or less space on disk. For integers up to 65535 (or ±32767 with signed values), we can use the type short, which uses 2 bytes per number. There are also other types, like float and double, that use 4 and 8 bytes respectively. The chosen type can also affect the speed when performing operations on the data.

To store a real number like 4.832320977, we can use the type float directly or, equivalently, we can store the value using a scale factor and an offset, following the formula:

Y = offset + X * scale_factor

With this formula, the file occupies less space, as the main data (the X in the formula) can be stored using the type short. The other values in the formula, the offset and the scale factor, are floats, but they are constant for all the data stored in the NetCDF file, so they can be kept as attributes of the variable.

If we take the previous example as Y = 4.832320977, we can see that with an offset of 17.7961916718735 and a scale factor of 0.00054312584083277, the number we have to keep stored is only X = -23869. This corresponds to the first value of the variable sfcWind in the file /esarchive/exp/ecmwf/s2s-monthly_ensforhc/daily/sfcWind/20161201/sfcWind_19961201.nc:

> ncdump -h /esarchive/exp/ecmwf/s2s-monthly_ensforhc/daily/sfcWind/20161201/sfcWind_19961201.nc
netcdf sfcWind_19961201 {
dimensions:
  longitude = 240 ;
  latitude = 121 ;
  time = 47 ;
  ensemble = 11 ;
variables:
  float longitude(longitude) ;
      longitude:units = "degrees_east" ;
      longitude:long_name = "longitude" ;
  float latitude(latitude) ;
      latitude:units = "degrees_north" ;
      latitude:long_name = "latitude" ;
  int time(time) ;
      time:units = "hours since 1900-01-01 00:00:0.0" ;
      time:long_name = "time" ;
      time:calendar = "gregorian" ;
  int realization(ensemble) ;
      realization:long_name = "ensemble_member" ;
  short sfcWind(time, ensemble, latitude, longitude) ;
      sfcWind:scale_factor = 0.00054312584083277 ;
      sfcWind:add_offset = 17.7961916718735 ;
      sfcWind:_FillValue = -32767s ;
      sfcWind:missing_value = -32767s ;
      sfcWind:units = "m s**-1" ;
      sfcWind:long_name = "10 metre wind speed" ;
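
The unpacking can be checked by hand with the numbers above. The following Python sketch reproduces the arithmetic; the min/max rule at the end is only the usual convention for choosing the two attributes, not necessarily how this particular file was produced:

scale_factor = 0.00054312584083277
add_offset = 17.7961916718735

# Packed value stored in the file as a short...
x = -23869
# ...unpacked back to the physical value (m/s).
y = add_offset + x * scale_factor
print(y)  # ~4.832321

# Usual way of choosing the attributes when packing a float field into shorts:
# spread the [vmin, vmax] range of the data over the usable short range.
# vmin, vmax = data.min(), data.max()            (data is a hypothetical array)
# scale_factor = (vmax - vmin) / (2**16 - 2)
# add_offset = (vmax + vmin) / 2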

Using the files

We will see no difference when using the file with department tools such as s2dverification, regardless of the way the data is stored. These tools, and others like ncview, convert the scaled values to the real ones automatically. The most common way to see the difference is to look at the values with ncdump.

You can compare the two ways of storing the data with these two files: the hindcast of the monthly prediction system from ECMWF and the same data in the version from the S2S project (see the Python sketch after the list):

  • with float data type: /esarchive/exp/ecmwf/monthly_ensforhc/weekly_mean/sfcWind_f6h/20160204/sfcWind_19960204.nc
  • with short data type: /esarchive/exp/ecmwf/s2s-monthly_ensforhc/weekly_mean/sfcWind_f24h/20160204/sfcWind_19960204.nc
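
When reading the files from Python with the netCDF4 package, the scaling is also applied automatically, but it can be switched off to inspect the packed shorts. A minimal sketch, assuming the short-typed file from the list and the (time, ensemble, latitude, longitude) layout shown earlier:

from netCDF4 import Dataset

path = "/esarchive/exp/ecmwf/s2s-monthly_ensforhc/weekly_mean/sfcWind_f24h/20160204/sfcWind_19960204.nc"

with Dataset(path) as nc:
    var = nc.variables["sfcWind"]

    # Default behaviour: scale_factor and add_offset are applied, values in m/s.
    print(var[0, 0, 0, 0])

    # Disable the automatic conversion to see the raw packed short.
    var.set_auto_maskandscale(False)
    print(var[0, 0, 0, 0])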

More information

More information on scale factor: Wikipedia.

Calculating means

When working with the data in the department, we often want it at a different frequency than the original one, and converting the files to the needed frequencies is a typical task for the data team.

Common means

For the daily, monthly and yearly means, we normally use the CDO tool:

  • for daily means: cdo daymean <ifile> <ofile>
  • for monthly means: cdo monmean <ifile> <ofile>
  • for yearly means: cdo yearmean <ifile> <ofile>

CDO also allows us to calculate other common statistics, like the minimum, maximum or sum, using similar operators: daymin, monmax, yearsum, etc.

You can find a reference card with a summary of the CDO operators here: Reference Card

Weekly mean

For the weekly mean there is no direct operator available, as there is for the daily or the monthly mean. Instead, we use a combination of CDO and NCO commands.

We need a file with at least daily data (it can also be 6-hourly, hourly, …). The weekly mean that we calculate in the department is not the mean of the natural (calendar) week of that particular year and month, but the mean of consecutive blocks of 7 days, skipping the first 4 days in the file and repeating the operation 4 times:

[Figure: weekly mean scheme]

So, the mean of the first week will carry the timestamp of day 8 and will be the mean of days 5 to 11. The second week is the same but for days 12 to 18, and so forth until the fourth week. The rest of the data is discarded.

With a file with daily data, we would use the following commands (a scripted version of these steps is sketched after the list):

  • Perform the mean every 7 timesteps, skipping the first 4 timesteps (for 6-hourly data we would select 28 timesteps for the mean and skip the first 16):
cdo timselmean,7,4 <ifile> <ofile>
  • From the resultant file, keep only the first 4 timesteps (the 4 weeks):
ncks -O -d time,0,3 <ofile> <ofile>
  • By default, version 1.9.0 of CDO sets the middle day of the averaging period as the time value of the timestep (day 8 for the first week, 15 for the second, 22 for the third and 29 for the last one). If we are using an older version of CDO, we have to change the time value of all the timesteps manually, subtracting 3 days (72 hours, if that is the time unit):
ncap2 -O -s "time=time-72" <ofile> <ofile>
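
A minimal Python sketch that chains these commands with subprocess (the file names are placeholders; the time correction is only applied when using a CDO version older than 1.9.0, as explained above):

import subprocess

def weekly_mean(ifile, ofile, fix_time=False):
    """Weekly mean as described above: mean every 7 timesteps skipping the
    first 4, keep only the first 4 weeks and, optionally, shift the time
    values back 3 days (72 hours) for old CDO versions."""
    subprocess.run(["cdo", "timselmean,7,4", ifile, ofile], check=True)
    subprocess.run(["ncks", "-O", "-d", "time,0,3", ofile, ofile], check=True)
    if fix_time:
        subprocess.run(["ncap2", "-O", "-s", "time=time-72", ofile, ofile], check=True)

weekly_mean("daily_input.nc", "weekly_mean.nc")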

Examples:

> cdo showtimestamp /esarchive/exp/ecmwf/s2s-monthly_ensforhc/weekly_mean/sfcWind_f24h/20150101/sfcWind_19950101.nc
1995-01-08T00:00:00  1995-01-15T00:00:00  1995-01-22T00:00:00  1995-01-29T00:00:00
> cdo showtimestamp /esarchive/exp/ecmwf/s2s-monthly_ensforhc/weekly_mean/sfcWind_f24h/20160107/sfcWind_19960107.nc
1996-01-14T00:00:00  1996-01-21T00:00:00  1996-01-28T00:00:00  1996-02-04T00:00:00

ecFlow: how-to and useful information

The official description of ecFlow from the ECMWF homepage is:

ecFlow is a client/server workflow package that enables users to run a large
number of programs (with dependencies on each other and on time) in a
controlled environment. It provides reasonable tolerance for hardware and
software failures, combined with good restart capabilities.

ecFlow submits tasks (jobs) and receives acknowledgements from tasks when
they change status and when they send events, using child commands embedded
in the scripts. Ecflow stores the relationship between tasks, and is able to
submit tasks dependent on triggers.

Here in the department, we (the data team) use ecFlow to manage most of our downloading and formatting requests. It allows us to keep track of the state of the different jobs and also to detect and correct more easily any errors that may appear.

On this page, we describe how to run your own server and launch your workflows from your client. We also explain how to use the ecFlow user interface to look at your own jobs, as well as how to follow the state of other users' jobs.

Configure the server and the client

The package is accessible through the module system on all the machines of the department. We only need to load it:

> module load ecFlow

Be aware that right now we have two versions of ecflow:

ecFlow/4.5.0-foss-2015a (D)    ecFlow/4.7.1-foss-2015a

The default one is 4.5.0. This is the one to load when using ecflow_ui, as the newer one has some bugs and some basic operations are not available through the user interface.

Server

The first thing we have to do after loading the module is to start the server. If we plan to run jobs that are operational or that have to be executed every month at a given time, it is convenient to have the server running on a machine that is not usually shut down. In our case, we use bscearth000.int.bsc.es to host our ecFlow servers. Then we can launch the client from any other machine.

To launch our server we have to execute (for example in the bscearth000 machine):

> ecflow_start.sh

ecFlow allows multiple users to run their servers on the same machine at the same time. ecflow_start.sh starts the server, assigning a unique port to the user that executes it. The host and the port will be used later to connect from the client to the server, so it is important to remember them. Example:

jginer@bscearth000:~> ecflow_start.sh
[15:11:21 8.1.2018] Request( --ping :jginer ), Failed to connect to
 bscearth000:2966. After 2 attempts. Is the server running ?

bscearth000 bscearth000 2966
lun ene  8 15:11:21 UTC 2018

User "1466" attempting to start ecf server on "bscearth000" using ECF_PORT
 "2966" and with:
ECF_HOME     : "/home/Earth/jginer/ecflow_server"
ECF_LOG      : "bscearth000.2966.ecf.log"
ECF_CHECK    : "bscearth000.2966.check"
ECF_CHECKOLD : "bscearth000.2966.check.b"
ECF_OUT      : "/dev/null"

client version is Ecflow version(4.5.0) boost(1.63.0) compiler(gcc 4.9.2)
 protocol(TEXT_ARCHIVE) Compiled on Mar 17 2017 15:19:39
Checking if the server is already running on bscearth000 and port 2966
[15:11:22 8.1.2018] Request( --ping :jginer ), Failed to connect to
 bscearth000:2966. After 2 attempts. Is the server running ?

Backing up check point and log files

OK starting ecFlow server...

Placing server into RESTART mode...

To view server on ecflow_ui - goto Servers/Manage Servers... and enter
Name        : <unique ecFlow server name>
Host        : bscearth000
Port Number : 2966

If we execute that command when the server is already running, it will say so and remind us of the port number. Example:

jginer@bscearth000:~> ecflow_start.sh
ping server(bscearth000:2966) succeeded in 00:00:00.190036  ~190 milliseconds
server is already started

In this case, for the user jginer, it assigned port number 2966. We can manually set the port number and other settings by passing arguments when starting a new server (see the official documentation for more information).

If we want to stop the server for some reason, we can do it by executing the following command:

> ecflow_stop.sh
Client

The workflow and the different tasks to perform are defined in a file that we can write in Python. Several files are needed to complete the workflow, along with our own scripts, but we only have to execute one of them to send the suite to the server. That file contains the information about the host and the port where we launched our server.

We have a repository in GitLab with the workflows we are currently using and some others that we used at some point. You can check them out and adapt them to your needs here: https://earth.bsc.es/gitlab/jginer/ecflow_workflows.git. Remember to follow the instructions from the README file to modify them with your own host and port number.

There is much more information about how to set up the workflow files, and all the specifications, on the ecFlow homepage.
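
As an orientation, a minimal suite definition in Python could look like the sketch below. It follows the ecFlow Python API from the official tutorial; the suite name, paths, host and port are placeholders to replace with your own values:

import os
import ecflow

# Define a suite with a single task; ECF_HOME is where the .ecf scripts live.
defs = ecflow.Defs()
suite = defs.add_suite("TEST")
suite.add_variable("ECF_HOME", os.path.abspath("."))
suite.add_task("Test")

# Check that the .ecf scripts can be turned into jobs, then save the definition.
print(defs.check_job_creation())
defs.save_as_defs("test.def")

# Load the definition into the server and start (begin) the suite.
client = ecflow.Client("bscearth000", "2966")
client.load(defs)
client.begin_suite("TEST")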

Taking the test workflow from the ecflow_workflows repository as an example, this is how to send the workflow to the server from the client on our machine:

jginer@bscearth320:/esarchive/scratch/jginer/ecflow_workflows/test> module load ecFlow
jginer@bscearth320:/esarchive/scratch/jginer/ecflow_workflows/test> python test-workflow.py
Creating suite definition
# 4.5.0
suite TEST
  edit ECF_HOME '/esarchive/scratch/jginer/ecflow_workflows/test'
  family t
    task Test
      edit HOST 'bsceslogin01.bsc.es'
      edit ECF_JOB_CMD 'ssh %HOST% 'sbatch %ECF_JOB% > %ECF_JOBOUT%.submit 2>&1''
    task Check
      trigger Test == complete
      edit HOST 'bsceslogin01.bsc.es'
      edit ECF_JOB_CMD 'ssh %HOST% 'sbatch %ECF_JOB% > %ECF_JOBOUT%.submit 2>&1''
  endfamily
endsuite

Checking job creation: .ecf -> .job0

Saving definition to file 'test.def'
Load the in memory definition(defs) into the server

And then we only need to start it, something that we can do through the user interface.

User interface

ecFlow provides a user interface that appears when we execute the command:

> module load ecFlow
> ecflow_ui

Here we can see the suites that we have on our own servers and on the servers of other users that we follow.

[Screenshot: ecflow_ui start view]

To follow a server we have to:

  • select the Servers tab
  • click Manage servers
  • click the Add server button
  • enter the Name of the server (for us, it can be anything we want), the Host (for example bscearth000.int.bsc.es) and the Port (2966 in the example)

[Screenshot: ecflow_ui Manage servers dialogue]

After adding several suites (workflows), with some tasks waiting to be executed according to their crons and others being executed, the user interface can look like this:

[Screenshot: ecflow_ui with several suites loaded]

More information

More official information here (take a look especially at the Tutorial): ecFlow homepage

Troubleshooting
