% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Cluster.R
\name{Cluster}
\alias{Cluster}
\title{K-means Clustering}
\usage{
Cluster(
  data,
  weights = NULL,
  time_dim = "sdate",
  space_dim = NULL,
  nclusters = NULL,
  index = "sdindex",
  ncores = NULL
)
}
\arguments{
\item{data}{A numeric array with named dimensions that contains at least 
'time_dim', corresponding to time, and optionally 'space_dim', corresponding 
to either area-averages over a series of domains or the grid points of any
spatial grid structure.}

\item{weights}{A numeric array with named dimensions of multiplicative weights
based on the areas covering each domain/region or grid-cell of 'data'. Its 
dimensions must match the 'space_dim' dimensions of 'data'. The default value 
is NULL, which means no weighting is applied.}

\item{time_dim}{A character string indicating the name of the time dimension 
in 'data'. The default value is 'sdate'.}

\item{space_dim}{A character vector indicating the names of spatial dimensions
in 'data'. The default value is NULL.}

\item{nclusters}{A positive integer K, greater than 1, indicating the number 
of clusters to be computed, or K initial cluster centers to be 
used in the method. The default value is NULL, which means that the number
of clusters will be determined by NbClust(). The parameter 'index' 
therefore needs to be specified for NbClust() to find the optimal number of 
clusters to be used for K-means clustering calculation.}

\item{index}{A character string naming the validity index from the NbClust 
package used to determine the optimal K when K is not specified with 
'nclusters'. The default value is 'sdindex' (Halkidi et al. 2001, JIIS). 
Other indices available in NbClust are "kl", "ch", "hartigan", "ccc", 
"scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db",
"silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", 
"ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", 
"hubert", "sdindex", and "sdbw".
One can also use all of them with the option 'alllong', or almost all indices
(all except "gap", "gamma", "gplus" and "tau") with the option 'all', in which 
case the optimal number of clusters K is determined by the majority rule (the 
maximum of the histogram of the results of all indices with finite 
solutions). Using some indices on a large and/or unstructured dataset can be 
computationally intensive and/or could lead to numerical singularity.}

\item{ncores}{An integer indicating the number of cores to use for parallel 
computation. The default value is NULL.}
}
\value{
A list containing:
\item{$cluster}{
 An integer array of the occurrence of the clusters along time, i.e., the
 cluster to which each data member in time is allocated. The dimensions
 are the same as 'data' without 'space_dim'.
}
\item{$centers}{
 A numeric array of cluster centres or centroids (e.g. [1:K, 1:spatial degrees 
 of freedom]). The remaining dimensions are the same as 'data' except 
 'time_dim' and 'space_dim'.
}
\item{$totss}{
 A numeric array of the total sum of squares. The dimensions are the same as 
 'data' except 'time_dim' and 'space_dim'.
}
\item{$withinss}{
 A numeric array of within-cluster sum of squares, one component per cluster. 
 The first dimension is the number of clusters, and the remaining dimensions 
 are the same as 'data' except 'time_dim' and 'space_dim'.
}
\item{$tot.withinss}{
 A numeric array of the total within-cluster sum of squares, i.e., 
 sum(withinss). The dimensions are the same as 'data' except 'time_dim' and 
 'space_dim'.
}
\item{$betweenss}{
 A numeric array of the between-cluster sum of squares, i.e., 
 totss - tot.withinss. The dimensions are the same as 'data' except 
 'time_dim' and 'space_dim'.
}
\item{$size}{
 A numeric array of the number of points in each cluster. The first dimension 
 is the number of clusters, and the remaining dimensions are the same as 
 'data' except 'time_dim' and 'space_dim'.
}
\item{$iter}{
 A numeric array of the number of (outer) iterations. The dimensions are 
 the same as 'data' except 'time_dim' and 'space_dim'.
}
\item{$ifault}{
 A numeric array indicating a possible algorithm problem. The 
 dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
}
}
\description{
Compute cluster centers and their time series of occurrences, using the 
K-means clustering method with Euclidean distance, for an input array with 
any number of dimensions that contains at least 'time_dim'. 
Specifically, it partitions the array along the time axis into K groups or 
clusters, in which each space vector/array belongs to (i.e., is a member of) the 
cluster with the nearest center or centroid. This function is a wrapper of 
kmeans() and relies on the NbClust package (Charrad et al., 2014 JSS) to 
determine the optimal number of clusters used for K-means clustering if it is
not provided by users.
}
\examples{
# Generating synthetic data
a1 <- array(dim = c(200, 4))
mean1 <- 0
sd1 <- 0.3 

c0 <- seq(1, 200)
c1 <- sort(sample(x = 1:200, size = sample(x = 50:150, size = 1), replace = FALSE))
x1 <- c(1, 1, 1, 1)
for (i1 in c1) {
 a1[i1, ] <- x1 + rnorm(4, mean = mean1, sd = sd1)
}

c1p5 <- c0[!(c0 \%in\% c1)]
c2 <- c1p5[seq(1, length(c1p5), 2)] 
x2 <- c(2, 2, 4, 4)
for (i2 in c2) {
 a1[i2, ] <- x2 + rnorm(4, mean = mean1, sd = sd1)
}

c3 <- c1p5[seq(2, length(c1p5), 2)]
x3 <- c(3, 3, 1, 1)
for (i3 in c3) {
 a1[i3, ] <- x3 + rnorm(4, mean = mean1, sd = sd1)
}

# Computing the clusters
names(dim(a1)) <- c('sdate', 'space')
res1 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]), nclusters = 3)
res2 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]))
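
# A minimal base-R sketch of the sum-of-squares quantities listed in the
# Value section, using stats::kmeans() directly on a small toy matrix.
# Assumption: Cluster() returns the corresponding kmeans() components; any
# weighting or reshaping done inside Cluster() is not reproduced here, and
# 'toy' and 'km' are illustrative names only.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
km <- kmeans(toy, centers = 2)
# betweenss is totss - tot.withinss, and tot.withinss is sum(withinss)
stopifnot(isTRUE(all.equal(km$betweenss, km$totss - km$tot.withinss)))
stopifnot(isTRUE(all.equal(km$tot.withinss, sum(km$withinss))))
# The cluster sizes add up to the number of rows (time steps)
stopifnot(sum(km$size) == nrow(toy))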

}
\references{
Wilks, 2011, Statistical Methods in the Atmospheric Sciences, 3rd ed., Elsevier, pp 676.
}