% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Load.R
\name{Load}
\alias{Load}
\title{Loads Experimental And Observational Data}
\usage{
Load(
  var,
  exp = NULL,
  obs = NULL,
  sdates,
  nmember = NULL,
  nmemberobs = NULL,
  nleadtime = NULL,
  leadtimemin = 1,
  leadtimemax = NULL,
  storefreq = "monthly",
  sampleperiod = 1,
  lonmin = 0,
  lonmax = 360,
  latmin = -90,
  latmax = 90,
  output = "areave",
  method = "conservative",
  grid = NULL,
  maskmod = vector("list", 15),
  maskobs = vector("list", 15),
  configfile = NULL,
  varmin = NULL,
  varmax = NULL,
  silent = FALSE,
  nprocs = NULL,
  dimnames = NULL,
  remapcells = 2,
  path_glob_permissive = "partial"
)
}
\arguments{
\item{var}{Short name of the variable to load. It should coincide with the 
variable name inside the data files.\cr
E.g.: \code{var = 'tos'}, \code{var = 'tas'}, \code{var = 'prlr'}.\cr
In some cases, though, the path to the files contains the short name of 
the variable two or more times while the actual name of the variable inside 
the data files is different. In these cases it may be convenient to provide 
\code{var} with the name that appears in the file paths (see details on 
parameters \code{exp} and \code{obs}).}
\item{exp}{Parameter to specify which experimental datasets to load data 
from.\cr
It can take two formats: a list of lists or a vector of character strings. 
Each format will trigger a different mechanism of locating the requested 
datasets.\cr
The first format is adequate for data you'll only load once or 
occasionally. The second format avoids providing repeatedly the 
information on a certain dataset but is more complex to use.\cr\cr
IMPORTANT: Place first the experiment with the largest number of members 
and, if possible, with the largest number of leadtimes. If not possible, 
the arguments 'nmember' and/or 'nleadtime' should be filled to not miss 
any member or leadtime.\cr
If 'exp' is not specified or set to NULL, observational data is loaded for 
each start-date as far as 'leadtimemax'. If 'leadtimemax' is not provided, 
\code{Load()} will retrieve data of a period of time as long as the time 
period between the first specified start date and the current date.\cr\cr
List of lists:\cr
A list of lists where each sub-list contains information on the location 
and format of the data files of the dataset to load.\cr
Each sub-list can have the following components:
 \itemize{
    \item{'name': A character string to identify the dataset. Optional.}
    \item{'path': A character string with the pattern of the path to the 
      files of the dataset. This pattern can be built up making use of some 
      special tags that \code{Load()} will replace with the appropriate 
      values to find the dataset files. The allowed tags are $START_DATE$, 
      $YEAR$, $MONTH$, $DAY$, $MEMBER_NUMBER$, $STORE_FREQ$, $VAR_NAME$, 
      $EXP_NAME$ (only for experimental datasets), $OBS_NAME$ (only for 
      observational datasets) and $SUFFIX$.\cr
      Example: /path/to/$EXP_NAME$/postprocessed/$VAR_NAME$/\cr
       $VAR_NAME$_$START_DATE$.nc\cr
      If 'path' is not specified and 'name' is specified, the dataset 
      information will be fetched with the same mechanism as when using 
      the vector of character strings (read below).}
    \item{'nc_var_name': Character string with the actual variable name 
      to look for inside the dataset files. Optional. Takes, by default, 
      the same value as the parameter 'var'.}
    \item{'suffix': Wildcard character string that can be used to build 
      the 'path' of the dataset. It can be accessed with the tag $SUFFIX$. 
      Optional. Takes '' by default.}
    \item{'var_min': Important: must be a character string. Values below 
      this minimum will be deactivated to NA. Optional. No deactivation 
      is performed by default.}
    \item{'var_max': Important: must be a character string. Values above 
      this maximum will be deactivated to NA. Optional. No deactivation 
      is performed by default.}
 }
The tag $START_DATE$ will be replaced with each of the starting dates 
specified in 'sdates'. $YEAR$, $MONTH$ and $DAY$ will take a value for each 
iteration over 'sdates'; they are simply the same as $START_DATE$ but 
split in parts.\cr
$MEMBER_NUMBER$ will be replaced by a character string with each member 
number, from 1 to the value specified in the parameter 'nmember' (in 
experimental datasets) or in 'nmemberobs' (in observational datasets). It 
will range from '01' to 'N' (or to '0N' if N < 10).\cr
$STORE_FREQ$ will take the value specified in the parameter 'storefreq' 
('monthly' or 'daily').\cr
$VAR_NAME$ will take the value specified in the parameter 'var'.\cr
$EXP_NAME$ will take the value specified in each component of the parameter 
'exp' in the sub-component 'name'.\cr
$OBS_NAME$ will take the value specified in each component of the parameter 
'obs' in the sub-component 'name'.\cr
$SUFFIX$ will take the value specified in each component of the parameters 
'exp' and 'obs' in the sub-component 'suffix'.\cr
Example:
\preformatted{
list(
  list(
    name = 'experimentA',
    path = file.path('/path/to/$EXP_NAME$/$STORE_FREQ$_mean',
                     '$VAR_NAME$$SUFFIX$',
                     '$VAR_NAME$_$START_DATE$.nc'),
    nc_var_name = '$VAR_NAME$',
    suffix = '_3hourly',
    var_min = '-1e19',
    var_max = '1e19'
  )
)
}
This will make \code{Load()} look for, for instance, the following paths, 
if 'sdates' is c('19901101', '19951101', '20001101'):\cr
  /path/to/experimentA/monthly_mean/tas_3hourly/tas_19901101.nc\cr
  /path/to/experimentA/monthly_mean/tas_3hourly/tas_19951101.nc\cr
  /path/to/experimentA/monthly_mean/tas_3hourly/tas_20001101.nc\cr\cr
Vector of character strings:
To avoid constantly specifying the same information to load the same 
datasets, a vector with only the names of the datasets to load can be 
specified.\cr
\code{Load()} will then look for the information in a configuration file 
whose path must be specified in the parameter 'configfile'.\cr
Check \code{?ConfigFileCreate}, \code{ConfigFileOpen}, 
\code{ConfigEditEntry} & co. to learn how to create a new configuration 
file and how to add the information there.\cr
Example: c('experimentA', 'experimentB')}
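For illustration, a minimal sketch of a \code{Load()} call with each of the 
two formats (the dataset name, path pattern and configuration file path 
are hypothetical):\preformatted{
# Format 1: list of lists with an explicit path pattern
exp <- list(list(name = 'experimentA',
                 path = file.path('/path/to/$EXP_NAME$/$STORE_FREQ$_mean',
                                  '$VAR_NAME$/$VAR_NAME$_$START_DATE$.nc')))
data <- Load('tas', exp = exp, obs = NULL, sdates = c('19901101'))

# Format 2: vector of dataset names, resolved through a configuration file
data <- Load('tas', exp = c('experimentA'), obs = NULL,
             sdates = c('19901101'),
             configfile = '/path/to/configuration.conf')
}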
\item{obs}{Argument with the same format as parameter 'exp'. See details on 
parameter 'exp'.\cr
If 'obs' is not specified or set to NULL, no observational data is loaded.\cr}

\item{sdates}{Vector of starting dates of the experimental runs to be loaded 
following the pattern 'YYYYMMDD'.\cr
E.g. c('19601101', '19651101', '19701101')}

\item{nmember}{Vector with the numbers of members to load from the specified 
experimental datasets in 'exp'.\cr
If not specified, the number of members of the first experimental dataset 
is detected automatically and used for all the experimental datasets.\cr
If a single value is specified, it is used for all the experimental 
datasets.\cr
Data for each member is fetched in the file system. If not found, it is 
filled with NA values.\cr
An NA value in the 'nmember' list is interpreted as "fetch as many members 
of each experimental dataset as the number of members of the first 
experimental dataset".\cr
Note: It is recommended to specify the number of members of the first 
experimental dataset if it is stored in file per member format because 
there are known issues in the automatic detection of members if the path 
to the dataset in the configuration file contains Shell Globbing wildcards 
such as '*'.\cr
E.g., c(4, 9)}

\item{nmemberobs}{Vector with the numbers of members to load from the 
specified observational datasets in 'obs'.\cr
If not specified, the number of members of the first observational dataset 
is detected automatically and used for all the observational datasets.\cr
If a single value is specified, it is used for all the observational 
datasets.\cr
Data for each member is fetched in the file system. If not found, it is 
filled with NA values.\cr
An NA value in the 'nmemberobs' list is interpreted as "fetch as many 
members of each observational dataset as the number of members of the 
first observational dataset".\cr
Note: It is recommended to specify the number of members of the first 
observational dataset if it is stored in file per member format because 
there are known issues in the automatic detection of members if the path 
to the dataset in the configuration file contains Shell Globbing wildcards 
such as '*'.\cr
E.g., c(1, 5)}

\item{nleadtime}{Deprecated. See parameter 'leadtimemax'.}

\item{leadtimemin}{Only lead-times greater than or equal to 'leadtimemin' 
are loaded. Takes by default the value 1.}

\item{leadtimemax}{Only lead-times less than or equal to 'leadtimemax' are loaded. 
Takes by default the number of lead-times of the first experimental 
dataset in 'exp'.\cr
If 'exp' is NULL this argument won't have any effect 
(see \code{?Load} description).}

\item{storefreq}{Frequency at which the data to be loaded is stored in the 
file system. Can take values 'monthly' or 'daily'.\cr
Note: Data stored in other frequencies with a period which is divisible by 
a month can be loaded with a proper use of 'storefreq' and 'sampleperiod' 
parameters. It can also be loaded if the period is divisible by a day and 
the observational datasets are stored in a file per dataset format or 
'obs' is empty.}

\item{sampleperiod}{Period of subsampling, to load only a subset of the 
lead-times between 'leadtimemin' and 'leadtimemax'.\cr
Takes by default value 1 (all lead-times are loaded).\cr
See 'storefreq' for more information.}
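For instance, a sketch (assuming an experiment list 'exp' defined as in the 
parameter 'exp' above) that loads every second monthly lead-time between 
lead-times 2 and 12:\preformatted{
data <- Load('tos', exp = exp, obs = NULL, sdates = c('19901101'),
             storefreq = 'monthly', leadtimemin = 2, leadtimemax = 12,
             sampleperiod = 2)
}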

\item{lonmin}{If a 2-dimensional variable is loaded, values at longitudes 
lower than 'lonmin' aren't loaded.\cr
Must take a value in the range [-360, 360] (if negative longitudes are 
found in the data files these are translated to this range).\cr
If 'lonmin' > 'lonmax', data across Greenwich is loaded.}

\item{lonmax}{If a 2-dimensional variable is loaded, values at longitudes 
higher than 'lonmax' aren't loaded.\cr
Must take a value in the range [-360, 360] (if negative longitudes are 
found in the data files these are translated to this range).\cr
If 'lonmin' > 'lonmax', data across Greenwich is loaded.}

\item{latmin}{If a 2-dimensional variable is loaded, values at latitudes 
lower than 'latmin' aren't loaded.\cr
Must take a value in the range [-90, 90].\cr
It is set to -90 if not specified.}

\item{latmax}{If a 2-dimensional variable is loaded, values at latitudes 
higher than 'latmax' aren't loaded.\cr
Must take a value in the range [-90, 90].\cr
It is set to 90 if not specified.}

\item{output}{This parameter determines the format in which the data is 
arranged in the output arrays.\cr
Can take values 'areave', 'lon', 'lat', 'lonlat'.\cr
  \itemize{
    \item{'areave': Time series of area-averaged variables over the specified domain.}
    \item{'lon': Time series of meridional averages as a function of longitudes.}
    \item{'lat': Time series of zonal averages as a function of latitudes.}
    \item{'lonlat': Time series of 2d fields.}
}
Takes by default the value 'areave'. If the variable specified in 'var' is 
a global mean, this parameter is forced to 'areave'.\cr
All the loaded data is interpolated into the grid of the first experimental 
dataset except if 'areave' is selected. In that case the area averages are 
computed on each dataset's original grid. A common grid different from the 
first experiment's can be specified through the parameter 'grid'. If 'grid' 
is specified when selecting 'areave' output type, all the loaded data is 
interpolated into the specified grid before calculating the area averages.}
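A sketch contrasting two output types (assuming 'exp', 'obs' and 'sdates' 
defined as in the first example below):\preformatted{
# Time series of area averages over the requested domain
ts <- Load('tos', list(exp), list(obs), sdates, output = 'areave',
           lonmin = -12, lonmax = 40, latmin = 27, latmax = 48)

# Full 2-dimensional fields interpolated onto a common grid
maps <- Load('tos', list(exp), list(obs), sdates, output = 'lonlat',
             lonmin = -12, lonmax = 40, latmin = 27, latmax = 48)
}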

\item{method}{This parameter determines the interpolation method to be used 
when regridding data (see 'output'). Can take values 'bilinear', 'bicubic', 
'conservative', 'distance-weighted'.\cr
See \code{remapcells} for advanced adjustments.\cr
Takes by default the value 'conservative'.}

\item{grid}{A common grid can be specified through the parameter 'grid' when 
loading 2-dimensional data. Data is then interpolated onto this grid 
whichever 'output' type is specified. If the selected output type is 
'areave' and a 'grid' is specified, the area averages are calculated after 
interpolating to the specified grid.\cr
If not specified and the selected output type is 'lon', 'lat' or 'lonlat', 
this parameter takes as default value the grid of the first experimental 
dataset, which is read automatically from the source files.\cr
The grid must be supported by 'cdo' tools. Currently only the patterns 
rNXxNY or tRESgrid are supported.\cr
Both rNXxNY and tRESgrid yield rectangular regular grids. rNXxNY yields 
grids that are evenly spaced in longitudes and latitudes (in degrees). 
tRESgrid refers to a grid generated with series of spherical harmonics 
truncated at the RESth harmonic. However, these spectral grids are usually 
associated with a Gaussian grid, the latitudes of which are spaced with a 
Gaussian quadrature (not evenly spaced in degrees). The pattern tRESgrid 
will yield a Gaussian grid.\cr
E.g., 'r96x72'.\cr
Advanced: If the output type is 'lon', 'lat' or 'lonlat' and no common 
grid is specified, the grid of the first experimental or observational 
dataset is detected and all data is then interpolated onto this grid. 
If the first experimental or observational dataset's data is found shifted 
along the longitudes (i.e., there's no value at the longitude 0 but at a 
longitude close to it), the data is re-interpolated to suppress the shift. 
This has to be done in order to make sure all the data from all the 
datasets is properly aligned along longitudes, as there's no option so far 
in \code{Load} to specify grids starting at longitudes other than 0. 
This issue does not affect loading in 'areave' mode without a common 
grid; the data is not re-interpolated in that case.}
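For instance, a sketch (with 'exp' and 'obs' as in the first example below) 
requesting bilinear interpolation onto a common regular grid of 96 
longitudes by 72 latitudes:\preformatted{
maps <- Load('tos', list(exp), list(obs), startDates, output = 'lonlat',
             grid = 'r96x72', method = 'bilinear')
}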

\item{maskmod}{List of masks to be applied to the data of each experimental 
dataset respectively, if a 2-dimensional variable is specified in 'var'.\cr
Each mask can be defined in 2 formats:\cr
a) a matrix with dimensions c(longitudes, latitudes).\cr
b) a list with the components 'path' and, optionally, 'nc_var_name'.\cr
In the format a), the matrix must have the same size as the common grid 
or with the same size as the grid of the corresponding experimental dataset 
if 'areave' output type is specified and no common 'grid' is specified.\cr
In the format b), the component 'path' must be a character string with the 
path to a NetCDF mask file, also in the common grid or in the grid of the 
corresponding dataset if 'areave' output type is specified and no common 
'grid' is specified. If the mask file contains only a single variable, 
there's no need to specify the component 'nc_var_name'. Otherwise it must 
be a character string with the name of the variable inside the mask file 
that contains the mask values. This variable must be defined only over 2 
dimensions with lengths greater than or equal to 1.\cr
Whichever the mask format, a value of 1 at a point of the mask keeps the 
original value at that point whereas a value of 0 disables it (replaces 
by a NA value).\cr
By default all values are kept (all ones).\cr
The longitudes and latitudes in the matrix must be in the same order as in 
the common grid or as in the original grid of the corresponding dataset 
when loading in 'areave' mode. You can find out the order of the longitudes 
and latitudes of a file with 'cdo griddes'.\cr
Note that in a common CDO grid defined with the patterns 't<RES>grid' or 
'r<NX>x<NY>' the latitudes and longitudes are ordered, by definition, from 
-90 to 90 and from 0 to 360, respectively.\cr
If you are loading maps ('lonlat', 'lon' or 'lat' output types) all the 
data will be interpolated onto the common 'grid'. If you want to specify 
a mask, you will have to provide it already interpolated onto the common 
grid (you may use 'cdo' libraries for this purpose). It is not usual to 
apply different masks on experimental datasets on the same grid, so all 
the experiment masks are expected to be the same.\cr
Warning: When loading maps, any masks defined for the observational data 
will be ignored to make sure the same mask is applied to the experimental 
and observational data.\cr
Warning: list() compulsory even if loading 1 experimental dataset only!\cr
E.g., list(array(1, dim = c(num_lons, num_lats)))}
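A sketch of the two mask formats (grid sizes, file path and variable name 
are hypothetical; 'exp' and 'startDates' as in the first example 
below):\preformatted{
# Format a): a matrix of 1s (keep) and 0s (disable) on the common
# 'r96x72' grid, here disabling the first 10 longitudes
mask <- array(1, dim = c(96, 72))
mask[1:10, ] <- 0
data <- Load('tos', list(exp), NULL, startDates, output = 'lonlat',
             grid = 'r96x72', maskmod = list(mask))

# Format b): the same mask stored in a NetCDF file on the same grid
maskA <- list(path = '/path/to/mask_A.nc', nc_var_name = 'land_sea_mask')
data <- Load('tos', list(exp), NULL, startDates, output = 'lonlat',
             grid = 'r96x72', maskmod = list(maskA))
}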

\item{maskobs}{See help on parameter 'maskmod'.}

\item{configfile}{Path to the s2dverification configuration file from which 
to retrieve information on location in file system (and other) of datasets.\cr
If not specified, the configuration file used at BSC-ES will be used 
(it is included in the package).\cr
Check the BSC's configuration file or a template of configuration file in 
the folder 'inst/config' in the package.\cr
Check further information on the configuration file mechanism in 
\code{ConfigFileOpen()}.}

\item{varmin}{Loaded experimental and observational data values smaller 
than 'varmin' will be disabled (replaced by NA values).\cr
By default no deactivation is performed.}

\item{varmax}{Loaded experimental and observational data values greater 
than 'varmax' will be disabled (replaced by NA values).\cr
By default no deactivation is performed.}

\item{silent}{Parameter to show (FALSE) or hide (TRUE) information messages.\cr
Warnings will be displayed even if 'silent' is set to TRUE.\cr
Takes by default the value 'FALSE'.}

\item{nprocs}{Number of parallel processes created to perform the fetch 
and computation of data.\cr
These processes will use shared memory in the processor in which Load() 
is launched.\cr
By default the number of logical cores in the machine will be detected 
and as many processes as logical cores there are will be created.\cr
A value of 1 won't create parallel processes.\cr
When running in multiple processes, if an error occurs in any of the 
processes, a crash message appears in the R session of the original 
process but no detail is given about the error. A value of 1 will display 
all error messages in the original and only R session.\cr
Note: the parallel processes create other blocking processes each time they 
need to compute an interpolation via 'cdo'.}

\item{dimnames}{Named list where the name of each element is a generic 
name of the expected dimensions inside the NetCDF files. These generic 
names are 'lon', 'lat' and 'member'. 'time' is not needed because it is 
detected automatically by elimination.\cr
The value associated to each name is the actual dimension name in the 
NetCDF file.\cr
The variables in the file that contain the longitudes and latitudes of 
the data (if the data is a 2-dimensional variable) must have the same 
name as the longitude and latitude dimensions.\cr
By default, these names are 'longitude', 'latitude' and 'ensemble'. If any 
of those is defined in the 'dimnames' parameter, it takes priority and 
overwrites the default value.\cr
E.g., list(lon = 'x', lat = 'y')\cr
In that example, the dimension 'member' will take the default value 'ensemble'.}

\item{remapcells}{When loading a 2-dimensional variable, spatial subsets can 
be requested via \code{lonmin}, \code{lonmax}, \code{latmin} and 
\code{latmax}. When \code{Load()} obtains the subset it is then 
interpolated if needed with the method specified in \code{method}.\cr
The result of this interpolation can vary if the values surrounding the 
spatial subset are not present. To better control this process, the width 
in number of grid cells of the surrounding area to be taken into account 
can be specified with \code{remapcells}. A value of 0 will take into 
account no additional cells but will generate less traffic between the 
storage and the R processes that load data.\cr
A value beyond the limits in the data files will be automatically truncated 
to the actual limit.\cr
The default value is 2.}

\item{path_glob_permissive}{In some cases, when specifying a path pattern 
(either in the parameters 'exp'/'obs' or in a configuration file) one can 
specify path patterns that contain shell globbing expressions. Too much 
freedom in putting globbing expressions in the path patterns can be 
dangerous and make \code{Load()} find a file in the file system for a 
start date for a dataset that really does not belong to that dataset. 
For example, if the file system contains two directories for two different 
experiments that share a part of their path and the path pattern contains 
globbing expressions:
  /experiments/model1/expA/monthly_mean/tos/tos_19901101.nc
  /experiments/model2/expA/monthly_mean/tos/tos_19951101.nc
And the path pattern is used as in the example right below to load data of 
only the experiment 'expA' of the model 'model1' for the starting dates 
'19901101' and '19951101', \code{Load()} will undesirably yield data for 
both starting dates, even if in fact there is data only for the 
first one:\cr
    \code{
expA <- list(path = file.path('/experiments/*/expA/monthly_mean/$VAR_NAME$',
                              '$VAR_NAME$_$START_DATE$.nc'))
data <- Load('tos', list(expA), NULL, c('19901101', '19951101'))
    }
To avoid these situations, the parameter \code{path_glob_permissive} is 
set by default to \code{'partial'}, which forces \code{Load()} to replace 
all the globbing expressions of a path pattern of a data set by fixed 
values taken from the path of the first found file for each data set, up 
to the folder right before the final files (globbing expressions in the 
file name will not be replaced, only those in the path to the file). 
Replacement of globbing expressions in the file name can also be triggered 
by setting \code{path_glob_permissive} to \code{FALSE} or \code{'no'}. If 
needed to keep all globbing expressions, \code{path_glob_permissive} can 
be set to \code{TRUE} or \code{'yes'}.}
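Conversely, if each start date lies under a directory that only a globbing 
expression can match (a hypothetical layout with one version folder per 
start date), the globbing expressions must be kept for every start 
date:\preformatted{
expA <- list(path = file.path('/experiments/version-*/expA/monthly_mean',
                              '$VAR_NAME$/$VAR_NAME$_$START_DATE$.nc'))
data <- Load('tos', list(expA), NULL, c('19901101', '19951101'),
             path_glob_permissive = TRUE)
}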
}

\value{
\code{Load()} returns a named list following a structure similar to the 
one used in the package 'downscaleR'.\cr
The components are the following:
 \itemize{
   \item{
     'mod' is the array that contains the experimental data. It has the 
     attribute 'dimensions' associated to a vector of strings with the 
     labels of each dimension of the array, in order. The order of the 
     latitudes is always forced to be from 90 to -90 whereas the order of 
     the longitudes is kept as in the original files (if possible). The 
     longitude values provided in \code{lon} lower than 0 are added 360 
     (but still kept in the original order). In some cases, however, if 
     multiple data sets are loaded in longitude-latitude mode, the 
     longitudes (and also the data arrays in \code{mod} and \code{obs}) are 
     re-ordered afterwards by \code{Load()} to range from 0 to 360; a 
     warning is given in such cases. The longitude and latitude of the 
     center of the grid cell that corresponds to the value [j, i] in 'mod' 
     (along the dimensions latitude and longitude, respectively) can be 
     found in the outputs \code{lon}[i] and \code{lat}[j].
   }
   \item{'obs' is the array that contains the observational data. The 
     same documentation of parameter 'mod' applies to this parameter.}
   \item{'lat' and 'lon' are the latitudes and longitudes of the centers of 
     the cells of the grid the data is interpolated into (0 if the loaded 
     variable is a global mean or the output is an area average).\cr
     Both have the attribute 'cdo_grid_des' associated with a character 
     string with the name of the common grid of the data, following the CDO 
     naming conventions for grids.\cr
     'lon' has the attributes 'first_lon' and 'last_lon', with the first 
     and last longitude values found in the region defined by 'lonmin' and 
     'lonmax'. 'lat' has also the equivalent attributes 'first_lat' and 
     'last_lat'.\cr
     'lon' has also the attribute 'data_across_gw' which tells whether the 
     requested region via 'lonmin', 'lonmax', 'latmin', 'latmax' goes across 
     the Greenwich meridian. As explained in the documentation of the 
     parameter 'mod', the loaded data array is kept in the same order as in 
     the original files when possible: this means that, in some cases, even 
     if the data goes across the Greenwich, the data array may not go 
     across the Greenwich. The attribute 'array_across_gw' tells whether 
     the array actually goes across the Greenwich. E.g: The longitudes in 
     the data files are defined to be from 0 to 360. The requested 
     longitudes are from -80 to 40. The original order is kept, hence the 
     longitudes in the array will be ordered as follows: 
     0, ..., 40, 280, ..., 360. In that case, 'data_across_gw' will be TRUE 
     and 'array_across_gw' will be FALSE.\cr
     The attribute 'projection' is kept for compatibility with 'downscaleR'.
   }
   \item{'Variable' has the following components:
     \itemize{
       \item{'varName', with the short name of the loaded variable as 
         specified in the parameter 'var'.
       }
       \item{'level', with information on the pressure level of the 
         variable. Is kept to NULL by now.
       }
     }
   And the following attributes:
     \itemize{
       \item{'is_standard', kept for compatibility with 'downscaleR', 
         tells if a dataset has been homogenized to standards with 
         'downscaleR' catalogs.
       }
       \item{'units', a character string with the units of measure of the 
         variable, as found in the source files.
       }
       \item{'longname', a character string with the long name of the 
         variable, as found in the source files.
       }
       \item{'daily_agg_cellfun', 'monthly_agg_cellfun', 
         'verification_time', kept for compatibility with 'downscaleR'.
       }
     }
   }
   \item{'Datasets' has the following components:
     \itemize{
       \item{'exp', a named list where the names are the identifying 
         character strings of each experiment in 'exp', each associated to 
         a list with the following components:
         \itemize{
           \item{'members', a list with the names of the members of the dataset.}
           \item{'source', a path or URL to the source of the dataset.}
         }
       }
       \item{'obs', similar to 'exp' but for observational datasets.}
     }
   }
   \item{'Dates', with the following components:
     \itemize{
       \item{'start', an array of dimensions (sdate, time) with the POSIX 
         initial date of each forecast time of each starting date.
       } 
       \item{'end', an array of dimensions (sdate, time) with the POSIX 
         final date of each forecast time of each starting date.
       }
     }
   }
   \item{'InitializationDates', a vector of starting dates as specified in 
     'sdates', in POSIX format.
   }
   \item{'when', a time stamp of the date the \code{Load()} call to obtain 
     the data was issued.
   }
   \item{'source_files', a vector of character strings with complete paths 
     to all the found files involved in the \code{Load()} call.
   }
   \item{'not_found_files', a vector of character strings with complete 
     paths to not found files involved in the \code{Load()} call.
   }
 }
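For instance, a sketch of how these components can be accessed, with 
'sampleData' as returned by the calls in the examples below:\preformatted{
attr(sampleData$mod, 'dimensions')  # labels of the array dimensions, in order
dim(sampleData$mod)                 # e.g. dataset x member x sdate x ftime
sampleData$Variable$varName         # 'tos'
sampleData$Dates$start[1]           # start of the first lead-time
}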
}
\description{
This function loads monthly or daily data from a set of specified 
experimental datasets together with date-corresponding data from a set 
of specified observational datasets. See parameters 'storefreq', 
'sampleperiod', 'exp' and 'obs'.\cr\cr
A set of starting dates is specified through the parameter 'sdates'. Data of 
each starting date is loaded for each model.
\code{Load()} arranges the data in two arrays with a similar format both 
with the following dimensions:
 \enumerate{
   \item{The number of experimental datasets determined by the user through 
   the argument 'exp' (for the experimental data array) or the number of 
   observational datasets available for validation (for the observational 
   array) determined as well by the user through the argument 'obs'.}
   \item{The greatest number of members across all experiments (in the 
   experimental data array) or across all observational datasets (in the 
   observational data array).}
   \item{The number of starting dates determined by the user through the 
   'sdates' argument.}
   \item{The greatest number of lead-times.}
   \item{The number of latitudes of the selected zone.}
   \item{The number of longitudes of the selected zone.}
 }
Dimensions 5 and 6 are optional and their presence depends on the type of 
the specified variable (global mean or 2-dimensional) and on the selected 
output type (area averaged time series, latitude averaged time series, 
longitude averaged time series or 2-dimensional time series).\cr
In the case of loading an area average the dimensions of the arrays will be 
only the first 4.\cr\cr
Only a specified variable is loaded from each experiment at each starting 
date. See parameter 'var'.\cr
Afterwards, observational data that matches every starting date and lead-time 
of every experimental dataset is fetched in the file system (so, if two 
predictions at two different start dates overlap, some observational values 
will be loaded and kept in memory more than once).\cr
If no data is found in the file system for an experimental or observational 
array point it is filled with an NA value.\cr\cr
If the specified output is 2-dimensional or latitude- or longitude-averaged 
time series all the data is interpolated into a common grid. If the 
specified output type is area averaged time series the data is averaged on 
the individual grid of each dataset but can also be averaged after 
interpolating into a common grid. See parameters 'grid' and 'method'.\cr
Once the two arrays are filled by calling this function, other functions in 
the s2dverification package that receive as inputs data formatted in this 
data structure can be executed (e.g: \code{Clim()} to compute climatologies, 
\code{Ano()} to compute anomalies, ...).\cr\cr
Load() has many additional parameters to disable values and trim dimensions 
of the selected variable; masks can even be applied to 2-dimensional variables. 
See parameters 'nmember', 'nmemberobs', 'nleadtime', 'leadtimemin', 
'leadtimemax', 'sampleperiod', 'lonmin', 'lonmax', 'latmin', 'latmax', 
'maskmod', 'maskobs', 'varmin', 'varmax'.\cr\cr
The parameters 'exp' and 'obs' can take various forms. The most direct form 
is a list of lists, where each sub-list has the component 'path' associated 
to a character string with a pattern of the path to the files of a dataset 
to be loaded. These patterns can contain wildcards and tags that will be 
replaced automatically by \code{Load()} with the specified starting dates, 
member numbers, variable name, etc.\cr
See parameter 'exp' or 'obs' for details.\cr\cr
Only NetCDF files are supported. OPeNDAP URLs to NetCDF files are also 
supported.\cr
\code{Load()} can load 2-dimensional or global mean variables in any of the 
following formats:
 \itemize{
   \item{experiments:
     \itemize{
       \item{file per ensemble per starting date 
       (YYYY, MM and DD somewhere in the path)}
       \item{file per member per starting date 
       (YYYY, MM, DD and MemberNumber somewhere in the path. Ensemble 
       experiments with different numbers of members can be loaded in 
       a single \code{Load()} call.)}
     }
   (YYYY, MM and DD specify the starting dates of the predictions)
   }
   \item{observations:
     \itemize{
       \item{file per ensemble per month 
       (YYYY and MM somewhere in the path)}
       \item{file per member per month 
       (YYYY, MM and MemberNumber somewhere in the path, obs with different 
       numbers of members supported)}
       \item{file per dataset (No constraints in the path but the time axes 
       in the file have to be properly defined)}
     }
   (YYYY and MM correspond to the actual month data in the file)
   }
 }
In all the formats the data can be stored in a daily or monthly frequency, 
or a multiple of these (see parameters 'storefreq' and 'sampleperiod').\cr
All the data files must contain the target variable defined over time and 
potentially over members, latitude and longitude dimensions in any order, 
time being the record dimension.\cr
In the case of a two-dimensional variable, the variables longitude and 
latitude must be defined inside the data file too and must have the same 
names as the dimensions for longitudes and latitudes respectively.\cr
The names of these dimensions (and longitude and latitude variables) and the 
name for the members dimension are expected to be 'longitude', 'latitude' 
and 'ensemble' respectively. However, these names can be adjusted with the 
parameter 'dimnames' or can be configured in the configuration file (read 
below in parameters 'exp', 'obs' or see \code{?ConfigFileOpen} 
for more information).\cr
All the data files are expected to have numeric values representable with 
32 bits. Be aware when choosing the fill values or infinite values in the 
datasets to load.\cr\cr
The Load() function returns a named list following a structure similar to 
the one used in the package 'downscaleR'.\cr
The components are the following:
 \itemize{
   \item{'mod' is the array that contains the experimental data. It has the 
   attribute 'dimensions' associated to a vector of strings with the labels 
   of each dimension of the array, in order.}
   \item{'obs' is the array that contains the observational data. It has 
   the attribute 'dimensions' associated to a vector of strings with the 
   labels of each dimension of the array, in order.}
   \item{'lat' and 'lon' are the latitudes and longitudes of the grid into 
   which the data is interpolated (0 if the loaded variable is a global 
   mean or the output is an area average).\cr
   Both have the attribute 'cdo_grid_des' associated with a character
   string with the name of the common grid of the data, following the CDO 
   naming conventions for grids.\cr
   The attribute 'projection' is kept for compatibility with 'downscaleR'.
   }
   \item{'Variable' has the following components:
     \itemize{
       \item{'varName', with the short name of the loaded variable as 
       specified in the parameter 'var'.}
       \item{'level', with information on the pressure level of the variable. 
       Is kept to NULL by now.}
     }
   And the following attributes:
     \itemize{
       \item{'is_standard', kept for compatibility with 'downscaleR', 
       tells if a dataset has been homogenized to standards with 
       'downscaleR' catalogs.}
       \item{'units', a character string with the units of measure of the 
       variable, as found in the source files.}
       \item{'longname', a character string with the long name of the 
       variable, as found in the source files.}
       \item{'daily_agg_cellfun', 'monthly_agg_cellfun', 'verification_time', 
       kept for compatibility with 'downscaleR'.}
     }
   }
   \item{'Datasets' has the following components:
     \itemize{
       \item{'exp', a named list where the names are the identifying 
       character strings of each experiment in 'exp', each associated to a 
       list with the following components:
         \itemize{
           \item{'members', a list with the names of the members of the 
           dataset.}
           \item{'source', a path or URL to the source of the dataset.}
         }
       }
       \item{'obs', similar to 'exp' but for observational datasets.}
     }
   }
   \item{'Dates', with the following components:
     \itemize{
       \item{'start', an array of dimensions (sdate, time) with the POSIX 
       initial date of each forecast time of each starting date.} 
       \item{'end', an array of dimensions (sdate, time) with the POSIX 
       final date of each forecast time of each starting date.}
     }
   }
   \item{'InitializationDates', a vector of starting dates as specified in 
   'sdates', in POSIX format.}
   \item{'when', a time stamp of the date the \code{Load()} call to obtain 
   the data was issued.}
   \item{'source_files', a vector of character strings with complete paths 
   to all the found files involved in the \code{Load()} call.}
   \item{'not_found_files', a vector of character strings with complete 
   paths to not found files involved in the \code{Load()} call.}
 }
\details{
The two output arrays have between 2 and 6 dimensions:\cr
 \enumerate{
   \item{Number of experimental/observational datasets.}
   \item{Number of members.}
   \item{Number of startdates.}
   \item{Number of leadtimes.}
   \item{Number of latitudes (optional).}
   \item{Number of longitudes (optional).}
 }
but the two arrays have the same number of dimensions and only the first 
two dimensions can have different lengths depending on the input arguments.    
For a detailed explanation of the process, read the documentation attached 
to the package or check the comments in the code.
}
\examples{
# Let's assume we want to perform verification with data of a variable
# called 'tos' from a model called 'model' and observed data coming from 
# an observational dataset called 'observation'.
#
# The model was run in the context of an experiment named 'experiment'. 
# It simulated from 1st November in 1985, 1990, 1995, 2000 and 2005 for a 
# period of 5 years time from each starting date. 5 different sets of 
# initial conditions were used so an ensemble of 5 members was generated 
# for each starting date.
# The model generated values for the variables 'tos' and 'tas' in a 
# 3-hourly frequency but, after some initial post-processing, it was 
# averaged over every month.
# The resulting monthly average series were stored in a file for each 
# starting date for each variable with the data of the 5 ensemble members.
# The resulting directory tree was the following:
#   model
#    |--> experiment
#          |--> monthly_mean
#                |--> tos_3hourly
#                |     |--> tos_19851101.nc
#                |     |--> tos_19901101.nc
#                |               .
#                |               .
#                |     |--> tos_20051101.nc 
#                |--> tas_3hourly
#                      |--> tas_19851101.nc
#                      |--> tas_19901101.nc
#                                .
#                                .
#                      |--> tas_20051101.nc
# 
# The observation recorded values of 'tos' and 'tas' at each day of the 
# month over that period but was also averaged over months and stored in 
# a file per month. The directory tree was the following:
#   observation
#    |--> monthly_mean
#          |--> tos
#          |     |--> tos_198511.nc
#          |     |--> tos_198512.nc
#          |     |--> tos_198601.nc
#          |               .
#          |               .
#          |     |--> tos_201010.nc
#          |--> tas
#                |--> tas_198511.nc
#                |--> tas_198512.nc
#                |--> tas_198601.nc
#                          .
#                          .
#                |--> tas_201010.nc
#
# The model data is stored in a file-per-startdate fashion and the
# observational data is stored in a file-per-month, and both are stored in 
# a monthly frequency. The file format is NetCDF.
# Hence all the data is supported by Load() (see details and other supported 
# conventions in ?Load) but first we need to configure it properly.
#
# These data files are included in the package (in the 'sample_data' folder),
# only for the variable 'tos'. They have been interpolated to a very low 
# resolution grid so as to keep the package size acceptable for CRAN.
# The original grid names (following CDO conventions) for experimental and 
# observational data were 't106grid' and 'r180x89' respectively. The final
# resolutions are 'r20x10' and 'r16x8' respectively. 
# The experimental data comes from the decadal climate prediction experiment 
# run at IC3 in the context of the CMIP5 project. Its name within IC3 local 
# database is 'i00k'. 
# The observational dataset used for verification is the 'ERSST' 
# observational dataset.
#
# The next two examples are equivalent and show how to load the variable 
# 'tos' from these sample datasets, the first providing lists of lists to 
# the parameters 'exp' and 'obs' (see documentation on these parameters) and 
# the second providing vectors of character strings, hence using a 
# configuration file.
#
# The code is not run because it dispatches system calls to 'cdo' which is 
# not allowed in the examples as per CRAN policies. You can run it on your 
# system though. 
# Instead, the code in 'dontshow' is run, which loads the equivalent
# already processed data in R.
#
# Example 1: Providing lists of lists to 'exp' and 'obs':
#
 \dontrun{
data_path <- system.file('sample_data', package = 's2dverification')
exp <- list(
        name = 'experiment',
        path = file.path(data_path, 'model/$EXP_NAME$/monthly_mean',
                         '$VAR_NAME$_3hourly/$VAR_NAME$_$START_DATE$.nc')
      )
obs <- list(
        name = 'observation',
        path = file.path(data_path, 'observation/$OBS_NAME$/monthly_mean',
                         '$VAR_NAME$/$VAR_NAME$_$YEAR$$MONTH$.nc')
      )
# Now we are ready to use Load().
startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- Load('tos', list(exp), list(obs), startDates,
                  output = 'areave', latmin = 27, latmax = 48, 
                  lonmin = -12, lonmax = 40)
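# A quick optional check of what was loaded (a sketch; it relies on the
# call above having succeeded and uses the component names described in
# the value section):
attr(sampleData$mod, 'dimensions')   # labels of the dimensions, in order
attr(sampleData$Variable, 'units')   # units as read from the source files
head(sampleData$source_files)        # files actually found and read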
 }
#
# Example 2: Providing vectors of character strings to 'exp' and 'obs'
#            and using a configuration file.
#
# The configuration file 'sample.conf' that we will create in the example 
# has the proper entries to load these (see ?ConfigFileOpen for details on 
# writing a configuration file). 
 \dontrun{
configfile <- paste0(tempdir(), '/sample.conf')
ConfigFileCreate(configfile, confirm = FALSE)
c <- ConfigFileOpen(configfile)
c <- ConfigEditDefinition(c, 'DEFAULT_VAR_MIN', '-1e19', confirm = FALSE)
c <- ConfigEditDefinition(c, 'DEFAULT_VAR_MAX', '1e19', confirm = FALSE)
data_path <- system.file('sample_data', package = 's2dverification')
exp_data_path <- paste0(data_path, '/model/$EXP_NAME$/')
obs_data_path <- paste0(data_path, '/$OBS_NAME$/')
c <- ConfigAddEntry(c, 'experiments', dataset_name = 'experiment', 
    var_name = 'tos', main_path = exp_data_path,
    file_path = '$STORE_FREQ$_mean/$VAR_NAME$_3hourly/$VAR_NAME$_$START_DATE$.nc')
c <- ConfigAddEntry(c, 'observations', dataset_name = 'observation', 
    var_name = 'tos', main_path = obs_data_path,
    file_path = '$STORE_FREQ$_mean/$VAR_NAME$/$VAR_NAME$_$YEAR$$MONTH$.nc')
ConfigFileSave(c, configfile, confirm = FALSE)

# Now we are ready to use Load().
startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- Load('tos', c('experiment'), c('observation'), startDates, 
                  output = 'areave', latmin = 27, latmax = 48, 
                  lonmin = -12, lonmax = 40, configfile = configfile)
 }
 \dontshow{
startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- s2dverification:::.LoadSampleData('tos', c('experiment'), 
                                               c('observation'), startDates,
                                               output = 'areave', 
                                               latmin = 27, latmax = 48, 
                                               lonmin = -12, lonmax = 40) 
 } 
}
\author{
History:\cr
0.1  -  2011-03  (V. Guemas)  -  Original code\cr
1.0  -  2013-09  (N. Manubens)  -  Formatting to CRAN\cr
1.2  -  2015-02  (N. Manubens)  -  Generalisation + parallelisation\cr
1.3  -  2015-07  (N. Manubens)  -  Improvements related to configuration file mechanism\cr
1.4  -  2016-01  (N. Manubens)  -  Added subsetting capabilities
}
\keyword{datagen}