% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Cluster.R
\name{Cluster}
\alias{Cluster}
\title{K-means Clustering}
\usage{
Cluster(
  data,
  weights = NULL,
  time_dim = "sdate",
  space_dim = NULL,
  nclusters = NULL,
  index = "sdindex",
  ncores = NULL
)
}
\arguments{
\item{data}{A numeric array with named dimensions that include at least
'time_dim', corresponding to time, and optionally 'space_dim', corresponding
to either area-averages over a series of domains or the grid points of any
spatial grid structure.}
\item{weights}{A numeric array with named dimension of multiplicative weights
based on the areas covering each domain/region or grid-cell of 'data'. The
dimensions must be equal to the 'space_dim' in 'data'. The default value is
NULL which means no weighting is applied.}
\item{time_dim}{A character string indicating the name of time dimension in
'data'. The default value is 'sdate'.}
\item{space_dim}{A character vector indicating the names of spatial dimensions
in 'data'. The default value is NULL.}
\item{nclusters}{A positive integer K that must be bigger than 1 indicating
the number of clusters to be computed, or K initial cluster centers to be
used in the method. The default value is NULL, which means that the number
of clusters will be determined by NbClust(). The parameter 'index'
therefore needs to be specified for NbClust() to find the optimal number of
clusters to be used for K-means clustering calculation.}
\item{index}{A character string of the validity index from NbClust package
that can be used to determine optimal K if K is not specified with
'nclusters'. The default value is 'sdindex' (Halkidi et al. 2001, JIIS).
Other indices available in NbClust are "kl", "ch", "hartigan", "ccc",
"scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db",
"silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball",
"ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn",
"hubert", "dindex", and "sdbw".
One can also use all of them with the option 'alllong', or almost all indices
(all except gap, gamma, gplus and tau) with the option 'all', in which case
the optimal number of clusters K is determined by the majority rule (the
maximum of the histogram of the results of all indices with finite
solutions). Use of some indices on a big and/or unstructured dataset can be
computationally intense and/or could lead to numerical singularity.}
\item{ncores}{An integer indicating the number of cores to use for parallel
computation. The default value is NULL.}
}
\value{
A list containing:
\item{$cluster}{
An integer array of the occurrence of a cluster along time, i.e., when a
certain data member in time is allocated to a specific cluster. The dimensions
are the same as 'data' without 'space_dim'.
}
\item{$centers}{
A numeric array of cluster centres or centroids (e.g. [1:K, 1:spatial degrees
of freedom]). The remaining dimensions are the same as 'data' except
'time_dim' and 'space_dim'.
}
\item{$totss}{
A numeric array of the total sum of squares. The dimensions are the same as
'data' except 'time_dim' and 'space_dim'.
}
\item{$withinss}{
A numeric array of within-cluster sum of squares, one component per cluster.
The first dimension is the number of clusters, and the remaining dimensions
are the same as 'data' except 'time_dim' and 'space_dim'.
}
\item{$tot.withinss}{
A numeric array of the total within-cluster sum of squares, i.e.,
sum(withinss). The dimensions are the same as 'data' except 'time_dim' and
'space_dim'.
}
\item{$betweenss}{
A numeric array of the between-cluster sum of squares, i.e.,
totss - tot.withinss. The dimensions are the same as 'data' except 'time_dim'
and 'space_dim'.
}
\item{$size}{
A numeric array of the number of points in each cluster. The first dimension
is the number of clusters, and the remaining dimensions are the same as
'data' except 'time_dim' and 'space_dim'.
}
\item{$iter}{
A numeric array of the number of (outer) iterations. The dimensions are the
same as 'data' except 'time_dim' and 'space_dim'.
}
\item{$ifault}{
A numeric array of an indicator of a possible algorithm problem. The
dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
}
}
\description{
Compute cluster centers and their time series of occurrences for an array of
input data, using the K-means clustering method with Euclidean distance. The
input array can have any number of dimensions as long as it contains
'time_dim'. Specifically, the function partitions the array along the time
axis into K groups or clusters, in which each space vector/array belongs to
(i.e., is a member of) the cluster with the nearest center or centroid. This
function is a wrapper of kmeans() and relies on the NbClust package (Charrad
et al., 2014 JSS) to determine the optimal number of clusters used for
K-means clustering if it is not provided by the user.
}
\examples{
# Generating synthetic data
a1 <- array(dim = c(200, 4))
mean1 <- 0
sd1 <- 0.3
c0 <- seq(1, 200)
c1 <- sort(sample(x = 1:200, size = sample(x = 50:150, size = 1), replace = FALSE))
x1 <- c(1, 1, 1, 1)
for (i1 in c1) {
  a1[i1, ] <- x1 + rnorm(4, mean = mean1, sd = sd1)
}
c1p5 <- c0[!(c0 \%in\% c1)]
c2 <- c1p5[seq(1, length(c1p5), 2)]
x2 <- c(2, 2, 4, 4)
for (i2 in c2) {
  a1[i2, ] <- x2 + rnorm(4, mean = mean1, sd = sd1)
}
c3 <- c1p5[seq(2, length(c1p5), 2)]
x3 <- c(3, 3, 1, 1)
for (i3 in c3) {
  a1[i3, ] <- x3 + rnorm(4, mean = mean1, sd = sd1)
}
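# A minimal sketch, not part of Cluster() itself: since the description says
# this function wraps kmeans(), the components listed under Value can be
# inspected on a plain stats::kmeans() fit. The names toy and km are
# hypothetical and used only here; how Cluster() passes data and weights to
# kmeans() internally is not shown and may differ.
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 1, sd = 0.3), ncol = 4),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 4)
)
km <- kmeans(toy, centers = 2, nstart = 5)
km$cluster  # cluster membership of each of the 20 rows
km$centers  # one centroid per cluster, a 2 x 4 matrix
# tot.withinss is sum(withinss), and betweenss is totss - tot.withinss
all.equal(km$tot.withinss, sum(km$withinss))
all.equal(km$betweenss, km$totss - km$tot.withinss)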
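# Computing the clusters
# data must have named dimensions; sdate is the default time_dim
names(dim(a1)) <- c('sdate', 'space')
res1 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]),
                space_dim = 'space', nclusters = 3)
res2 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]),
                space_dim = 'space')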
}