Cluster Analysis
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.[1]
The purpose of cluster analysis is to place objects into groups, or clusters, suggested by the data rather than defined a priori, such that objects in a given cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. You can also use cluster analysis to summarize data rather than to find "natural" or "real" clusters; this use of clustering is sometimes called dissection.
As an analyst, one often faces the question of how to organize the data being observed into some kind of meaningful structure or pattern. This is where clustering comes in handy. In simpler words, clustering allows you to break a population into smaller groups such that each observation within a group is more similar to the other observations in that group than it is to observations in other groups. In short, the idea is to group similar objects together into smaller groups.
How Clustering Works
Let us take an example to understand how clustering works. Imagine that we own a chain of ice cream shops across the country, say eight of them, and that we sell two flavors of ice cream: chocolate and vanilla. Consider the table below, which shows the sales of both chocolate and vanilla ice cream across our eight stores. The units and time frame are not important in this example; just imagine that this is the data you are looking at. Now, there are many different ways to make sense of this data. We could compute some statistics such as the mean, median, and dispersion. But one very intuitive way is to plot the data on a graph.
Below is the graph we get once we plot the above data:
Here we have plotted the sales of vanilla and chocolate ice cream for each of the eight stores. Each dot represents a store: chocolate sales are on the Y-axis and vanilla sales are on the X-axis. Looking at the graph, it is clear that the data falls into two distinct groups. One group (A) has five stores and the other (B) has three stores, and the obvious difference between the two groups is the magnitude of sales. This is exactly how clustering works; it is a simple two-dimensional example.
This was an example with two flavors of ice cream and only eight stores. Now imagine that we have expanded our chain and, instead of two flavors, we are selling thirty different flavors. How can this information be plotted on a graph now? It is impossible to draw a thirty-dimensional graph. What if we grow to five hundred stores? Instead of eight points on the graph we would have five hundred different points. And if some other data set has a million records, it will have a million different points. There is a mathematical way to deal with these problems, and that is what cluster analysis does.
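Since the original sales table is not reproduced here, the short Python sketch below uses made-up numbers for the eight stores (five low-selling, three high-selling); it simply draws the kind of scatter plot described above, assuming matplotlib is available.

    import matplotlib.pyplot as plt

    # Hypothetical chocolate and vanilla sales for eight stores (placeholder values only).
    vanilla   = [10, 12, 11, 13, 12, 45, 50, 47]
    chocolate = [ 8, 11,  9, 12, 10, 40, 46, 44]

    # The five low-sales stores and the three high-sales stores form two visible groups.
    plt.scatter(vanilla, chocolate)
    plt.xlabel("Vanilla sales")
    plt.ylabel("Chocolate sales")
    plt.title("Ice cream sales per store")
    plt.show()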
What is Good Clustering?
A good clustering method will produce high-quality clusters where:
- the intra-class similarity (that is, within a cluster) is high;
- the inter-class similarity (that is, between clusters) is low.
The second thing that describes a good clustering is its ability to discover some or all of the hidden patterns in the data.
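This intra-cluster versus inter-cluster trade-off can be quantified; one common measure is the silhouette coefficient. Below is a minimal sketch, assuming scikit-learn is available and using made-up points, that scores a good and a bad grouping of the same data.

    import numpy as np
    from sklearn.metrics import silhouette_score

    # Six illustrative points forming two obvious groups (placeholder values).
    X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                  [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

    good_labels = [0, 0, 0, 1, 1, 1]   # tight groups, far apart
    bad_labels  = [0, 1, 0, 1, 0, 1]   # the two groups are mixed together

    print(silhouette_score(X, good_labels))   # close to 1: high intra-, low inter-class similarity
    print(silhouette_score(X, bad_labels))    # near or below 0: a poor clustering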
There are broadly three types of clustering algorithms:
- K-means and its variants
- Hierarchical clustering
- Density-based clustering
K-means clustering:
K-means is a rather simple but well-known algorithm for grouping objects, i.e., clustering. Again, all objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) he or she wishes to identify.
Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to be clustered. The algorithm then randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. Afterwards, every object is assigned to the center it is closest to. The distance measure is usually chosen by the user and determined by the learning task.
After that, a new center is computed for each cluster by averaging the feature vectors of all objects assigned to it. The process of assigning objects and re-computing centers is repeated until it converges. The algorithm can be proven to converge after a finite number of iterations.[2]
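The loop described above is short enough to write out directly. Here is a minimal NumPy sketch of the procedure (random initial centers, nearest-center assignment with Euclidean distance, center re-computation); it is meant to illustrate the steps, not to replace a production implementation such as scikit-learn's KMeans.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        # Randomly choose k of the input points as the initial cluster centers.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assign each object to the center it is closest to (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Re-compute each center as the average of the objects assigned to it.
            # (Empty clusters are not handled in this sketch.)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break  # converged: the centers no longer move
            centers = new_centers
        return labels, centers

Running kmeans with k=2 on the store-sales example from earlier would recover the two groups of stores seen in the plot.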
Limitations of K-Means
K-means has problems when clusters differ in size or density, or have non-globular shapes. K-means also has a problem when the data contains outliers: the algorithm is sensitive to them because each centroid is the average of its cluster's members, and an outlier can dominate that average. However, these limitations can be overcome by deploying various techniques.
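A small arithmetic check (made-up numbers, NumPy only) shows how a single outlier drags a centroid:

    import numpy as np

    cluster = np.array([1.0, 2.0, 3.0])
    print(cluster.mean())                      # 2.0
    print(np.append(cluster, 100.0).mean())    # 26.5 -- one outlier dominates the average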
Hierarchical Clustering:
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree and visualized as a dendrogram. In hierarchical clustering one does not have to assume any particular number of clusters; any number of clusters can be obtained by cutting the dendrogram at the proper level. There are two main types of hierarchical clustering: [3]
Agglomerative: Start with each point as an individual cluster and, at each step, merge the closest pair of clusters until only one cluster (or K clusters) is left.
Divisive: Start with one all-inclusive cluster and, at each step, split a cluster until each cluster contains a single object (or there are K clusters).
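As a minimal sketch of the agglomerative approach, assuming SciPy is available: linkage builds the merge hierarchy and fcluster cuts the dendrogram at a chosen number of clusters after the fact.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Small illustrative data set (placeholder values).
    X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.2],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

    # Agglomerative: repeatedly merge the closest pair of clusters (Ward linkage here).
    Z = linkage(X, method="ward")

    # Cut the tree to obtain, say, 2 clusters; no cluster count was fixed up front.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)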
Limitations and Problems with Hierarchical Clustering
The biggest problem is that, once a decision is made to combine two clusters, it cannot be undone. Also, no objective function is directly minimized. Depending on the scheme being deployed, one or more of the following problems may arise:
- Sensitivity to noise and outliers
- Difficulty handling clusters of different sizes and complex shapes
- Breaking of large clusters
Density-Based Clustering
Density-based clustering algorithms play a vital role in finding non-linearly shaped structures based on density. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most widely used density-based algorithm. It uses the concepts of density reachability and density connectivity.[4]
The advantages of using DBSCAN are:
- It does not require a prior specification of the number of clusters.
- It is able to identify noise points while clustering.
- It is able to find clusters of arbitrary size and arbitrary shape.
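A minimal usage sketch, assuming scikit-learn's DBSCAN implementation: eps is the neighborhood radius used for density reachability and min_samples is the minimum number of points needed to form a dense region; points labeled -1 are treated as noise.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense blobs plus one isolated point (placeholder values).
    X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                  [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
                  [50.0, 50.0]])

    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
    print(labels)   # the isolated point gets label -1 (noise); no cluster count was specified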
Limitations of Density-Based Clustering
Basically, there are two problems associated with density-based clustering:
- The DBSCAN algorithm fails in the case of clusters of varying density.
- It fails in the case of neck-type data sets.
Possible Applications of Clustering
Clustering algorithms can be applied in many fields,
for instance:
- Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
- Biology: classification of plants and animals given their features;
- Libraries: book ordering;
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying fraud;
- City-planning: identifying groups of houses according to their house type, value and geographical location;
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
- WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Requirements
The main requirements that a
clustering algorithm should satisfy are:
- scalability;
- dealing with different types of attributes;
- discovering clusters with arbitrary shape;
- minimal requirements for domain knowledge to determine input parameters;
- ability to deal with noise and outliers;
- insensitivity to order of input records;
- ability to handle high dimensionality;
- interpretability and usability.
Common Problems with Clustering
There are a number of problems with clustering. Some of them are:
- current clustering techniques do not address all the requirements adequately (and concurrently);
- dealing with large number of dimensions and large number of data items can be problematic because of time complexity;
- the effectiveness of the method depends on the definition of “distance” (for distance-based clustering), as the sketch after this list illustrates;
- if an obvious distance measure doesn’t exist we must “define” it, which is not always easy, especially in multi-dimensional spaces;
- the result of the clustering algorithm (which in many cases can be arbitrary itself) can be interpreted in different ways. [5]
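The sketch below (NumPy only, with made-up vectors) compares Euclidean and cosine distance for the same pair of feature vectors: one metric calls them far apart, the other calls them identical in direction.

    import numpy as np

    a = np.array([1.0, 2.0])
    b = np.array([10.0, 20.0])   # same direction as a, ten times the magnitude

    euclidean = np.linalg.norm(a - b)
    cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(euclidean)   # about 20.1 -- far apart by Euclidean distance
    print(cosine)      # 0.0 -- identical direction, so cosine distance is zero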
References: