% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Cluster.R
\name{Cluster}
\alias{Cluster}
\title{K-means Clustering}
\usage{
Cluster(
  data,
  weights,
  time_dim = "sdate",
  nclusters = NULL,
  index = "sdindex",
  ncores = NULL
)
}
\arguments{
\item{data}{A numeric array with named dimensions that include at least 
'time_dim', corresponding to time, and the dimensions of 'weights', 
corresponding to either area-averages over a series of domains or the grid 
points of any spatial grid structure.}

\item{weights}{A numeric array with named dimensions containing multiplicative 
weights based on the areas covering each domain/region or grid-cell of 'data'. 
Its dimensions must also be dimensions of 'data'.}

\item{time_dim}{A character string indicating the name of time dimension in 
'data'. The default value is 'sdate'.}

\item{nclusters}{A positive integer K greater than 1 indicating the number of 
clusters to be computed, or K initial cluster centers to be used in the 
method. The default is NULL, in which case the optimal number of clusters for 
K-means clustering of 'data' is selected with the NbClust validity index, and 
its associated criteria, given in 'index'.}

\item{index}{A character string naming the validity index from the NbClust 
package used to determine the optimal K when K is not specified with 
'nclusters'. The default value is 'sdindex' (Halkidi et al. 2001, JIIS). 
Other indices available in NbClust are "kl", "ch", "hartigan", "ccc", 
"scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db",
"silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", 
"ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", 
"hubert", "sdindex", and "sdbw".
One can also use all of them with the option 'alllong', or almost all indices
(all except "gap", "gamma", "gplus" and "tau") with the option 'all'. In both 
cases the optimal number of clusters K is determined by the majority rule 
(the maximum of the histogram of the results of all indices with finite 
solutions). On a big and/or unstructured dataset, some indices can be 
computationally intensive and/or lead to numerical singularity.}

\item{ncores}{An integer indicating the number of cores to use for parallel 
computation. The default value is NULL.}
}
\value{
A list containing:
\item{$cluster}{
 An integer vector of the occurrence of each cluster along time, i.e., the
 cluster to which each time step of the data is allocated.
}
\item{$centers}{
 A matrix of cluster centres or centroids (e.g. [1:K, 1:spatial degrees of freedom]).
}
\item{$totss}{
 The total sum of squares (a single number).
}
\item{$withinss}{
 A vector of within-cluster sum of squares, one component per cluster.
}
\item{$tot.withinss}{
 The total within-cluster sum of squares (a single number), i.e., sum(withinss).
}
\item{$betweenss}{
 The between-cluster sum of squares (a single number), i.e., totss - tot.withinss.
}
\item{$size}{
 A vector of the number of points in each cluster.
}
\item{$iter}{
 An integer giving the number of (outer) iterations.
}
\item{$ifault}{
 An integer as an indicator of a possible algorithm problem.
}
}
\description{
Compute cluster centers and their time series of occurrences with the 
K-means clustering method, using Euclidean distance, for an input array
with any number of dimensions that includes at least 'time_dim'. 
Specifically, it partitions the array along the time axis into K groups or 
clusters, in which each space vector/array belongs to (i.e., is a member of) 
the cluster with the nearest center or centroid. This function relies on the 
NbClust package (Charrad et al., 2014 JSS).
}
\examples{
# Generating synthetic data
a1 <- array(dim = c(200, 4))
mean1 <- 0
sd1 <- 0.3 

c0 <- seq(1, 200)
c1 <- sort(sample(x = 1:200, size = sample(x = 50:150, size = 1), replace = FALSE))
x1 <- c(1, 1, 1, 1)
for (i1 in c1) {
 a1[i1, ] <- x1 + rnorm(4, mean = mean1, sd = sd1)
}

c1p5 <- c0[!(c0 \%in\% c1)]
c2 <- c1p5[seq(1, length(c1p5), 2)] 
x2 <- c(2, 2, 4, 4)
for (i2 in c2) {
 a1[i2, ] <- x2 + rnorm(4, mean = mean1, sd = sd1)
}

c3 <- c1p5[seq(2, length(c1p5), 2)]
x3 <- c(3, 3, 1, 1)
for (i3 in c3) {
 a1[i3, ] <- x3 + rnorm(4, mean = mean1, sd = sd1)
}

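# Computing the clusters
# Cluster() expects named dimensions; the name "space" is chosen here for
# illustration, while "sdate" matches the default of time_dim
names(dim(a1)) <- c("sdate", "space")
res1 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]), nclusters = 3)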
print(res1$cluster)
print(res1$centers)
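
# A minimal, self-contained sketch (not part of Cluster itself) of how the
# sum-of-squares entries of the value list relate to each other. It calls
# stats::kmeans directly on a small hypothetical matrix demo_x, on the
# assumption that the components returned by Cluster carry the same meaning.
set.seed(1)
demo_x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 4),
                matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 4))
demo_km <- kmeans(demo_x, centers = 2)
# The total sum of squares splits into within-cluster and between-cluster parts
print(all.equal(demo_km$totss, demo_km$tot.withinss + demo_km$betweenss))
# The per-cluster values add up to the total within-cluster sum of squares
print(all.equal(sum(demo_km$withinss), demo_km$tot.withinss))
# Every row (time step) is allocated to exactly one cluster
print(sum(demo_km$size) == nrow(demo_x))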

res2 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]))
print(res2$cluster)
print(res2$centers)
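
# A minimal sketch of the majority rule described for the index argument,
# under stated assumptions: k_by_index is a hypothetical vector holding the K
# proposed by each validity index (made-up numbers, not NbClust output), and
# ties are broken in favour of the smallest K, which may differ from what
# Cluster does internally.
k_by_index <- c(3, 3, 2, 3, 4, NA, 3, 2)
# Keep only the indices with a finite solution, then count votes per K
k_votes <- table(k_by_index[is.finite(k_by_index)])
# The K with the most votes is the maximum of the histogram
k_best <- as.integer(names(k_votes)[which.max(k_votes)])
print(k_best)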

}
\references{
Wilks, 2011, Statistical Methods in the Atmospheric Sciences, 3rd ed., Elsevier, 676 pp.
}