A cluster is a group of data objects that are similar to one another and dissimilar to the objects in other clusters.
Cluster analysis is a technique for grouping data objects into related groups called clusters.
Clustering is an unsupervised learning approach in which there are no predefined classes.
The basic aim of clustering is to group related entities so that the entities within a group are similar to each other while the groups themselves differ from one another.
In K-means clustering, “K” is the number of clusters. K-means clustering, hierarchical clustering, and density-based spatial clustering (DBSCAN) are among the more popular clustering algorithms.
Examples of Clustering Applications:
- In marketing, cluster analysis is used to segment customers based on the benefits they obtain from purchasing merchandise and to identify homogeneous groups of consumers.
- Cluster analysis is used in earthquake studies.
- In city planning, cluster analysis is used to group houses according to their type, value and geographical location.
Major Clustering Approaches:
The major clustering approaches are described below.
Partitioning Clustering
In this technique, a dataset is subdivided into k groups, where k is the number of groups and is specified in advance by the analyst.
K-means is the best-known partitioning technique; each cluster is represented by the center (mean) of the data points belonging to it.
K-medoids clustering is an alternative to K-means that is less sensitive to outliers.
K-means is also known as hard clustering because it produces partitions in which each observation belongs to exactly one cluster.
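To see why a medoid is less sensitive to outliers than a mean, here is a minimal sketch on a single one-dimensional cluster. The values are made up for illustration, and `total_distance` is a hypothetical helper, not part of any library.

```python
# Illustrative one-dimensional cluster with a single outlier (made-up values).
points = [1.0, 2.0, 3.0, 4.0, 100.0]

# K-means represents the cluster by its mean, which the outlier drags upward.
centroid = sum(points) / len(points)


def total_distance(candidate, members):
    """Sum of absolute distances from one candidate point to all members."""
    return sum(abs(candidate - m) for m in members)


# K-medoids represents the cluster by the member point whose total distance
# to all the other members is smallest.
medoid = min(points, key=lambda p: total_distance(p, points))

print(centroid)  # 22.0
print(medoid)    # 3.0
```

The mean lands far from every typical point, while the medoid stays inside the bulk of the data.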
Hierarchical Clustering
Hierarchical clustering identifies groups in a dataset without requiring the analyst to specify the number of clusters in advance.
The result is a tree-based representation of the objects known as a dendrogram. Observations can then be divided into groups by cutting the dendrogram at the desired similarity level.
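As a rough sketch of cutting a dendrogram, the snippet below uses SciPy's `linkage` and `fcluster` on made-up points. The Ward linkage method and the threshold of 3.0 are assumptions chosen for this toy data, not values from the article.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up 2-D points forming two well-separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# Build the merge tree (the data behind a dendrogram) using Ward linkage.
Z = linkage(X, method="ward")

# Cut the tree at a distance threshold: points stay together only if they
# were merged below a linkage distance of 3.0.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```

Raising or lowering `t` changes how many groups come out, which is the "slicing" described above. `scipy.cluster.hierarchy.dendrogram(Z)` can draw the tree itself.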
Fuzzy Clustering
Fuzzy clustering, also known as soft clustering, permits one data point to belong to more than one cluster.
It is frequently used in pattern recognition, and fuzzy C-means is a commonly used algorithm of this kind.
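The idea of partial membership can be sketched with the standard fuzzy C-means membership formula, applied to one point and two fixed centers. The function name `fuzzy_memberships`, the centers and the fuzzifier `m=2.0` are illustrative assumptions; a full algorithm would also update the centers iteratively.

```python
def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy C-means membership of one 1-D point in each cluster center."""
    distances = [abs(point - c) for c in centers]
    # A point sitting exactly on a center belongs fully to that center.
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    # u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1))
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


centers = [0.0, 10.0]  # hypothetical cluster centers
memberships = fuzzy_memberships(2.0, centers)
print(memberships)  # roughly [0.94, 0.06]
```

The point belongs mostly to the nearer cluster but keeps a small membership in the other, and the memberships sum to 1.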
Density-based Clustering (DBSCAN)
DBSCAN stands for density-based spatial clustering of applications with noise. Introduced by Ester et al. in 1996, it can find clusters of any shape in a dataset that contains noise and outliers.
The main advantage of DBSCAN is that the user does not need to specify the number of clusters to be generated.
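A small sketch with scikit-learn's `DBSCAN` shows this behaviour on made-up points. Note that although no cluster count is given, two other parameters must be chosen; the values `eps=0.3` and `min_samples=3` are assumptions that suit this toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points: two dense groups plus one isolated point.
X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [0.9, 1.0],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [4.9, 5.0],
              [9.0, 0.0]])

# eps is the neighbourhood radius; min_samples is how many points (including
# the point itself) must fall within eps for a region to count as dense.
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X)
print(labels)  # [ 0  0  0  0  1  1  1  1 -1]
```

The label -1 marks the isolated point as noise rather than forcing it into a cluster.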
Grid-based Clustering
This approach uses a multi-resolution grid data structure, which gives it high processing speed with low memory consumption.
Model-based Clustering
This approach assumes that the data come from a distribution that is a mixture of two or more clusters.
Model-based clustering is used to address issues that can arise with the K-means or fuzzy K-means algorithms.
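One common form of model-based clustering is a Gaussian mixture model. The sketch below, which assumes scikit-learn's `GaussianMixture` and synthetic data drawn from two normal distributions, fits a two-component mixture and reads off both hard labels and soft membership probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: 50 points around (0, 0) and 50 points around (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Fit a mixture of two Gaussians to the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)        # most likely component for each point
probs = gmm.predict_proba(X)   # probability of each component for each point
print(probs[:3].round(3))
```

Unlike K-means, the fitted model gives each point a probability for every cluster, so uncertain points can be identified.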
Difference between Classification and Clustering
|Classification|Clustering|
|---|---|
|Classification is widely used in data mining to classify datasets where the output variable is a category, such as black or white, or plus or minus.|A cluster is a group of data objects that are similar to one another and dissimilar to the objects in other clusters. Cluster analysis is a technique for grouping data objects into related groups called clusters.|
|Naïve Bayes, Support Vector Machine and Decision Tree are among the most popular supervised machine learning algorithms.|Clustering is unsupervised learning in which there are no predefined classes.|
Process of applying K-means Clustering
- Choose the number of clusters
- Specify the cluster seeds
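- Assign each point to its nearest centroid
- Move each centroid to the mean of its assigned points, and repeat the last two steps until the assignments stop changing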
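The steps above can be sketched from scratch with NumPy. This is a minimal illustration, not a production implementation: the function name `k_means`, the random choice of seeds and the toy data are all assumptions.

```python
import numpy as np


def k_means(X, k, n_iter=100, seed=0):
    """Minimal K-means: returns (labels, centroids) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Specify the cluster seeds: pick k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Adjust each centroid to the mean of its assigned points; an empty
        # cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Made-up data with two obvious groups; the number of clusters is chosen as 2.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random seeds, so only the grouping itself is meaningful, not the label numbers.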
Pros and Cons of Clustering
- K-means pros: It is simple to understand, works well on both small and large datasets, and is fast and efficient.
- K-means cons: The number of clusters must be chosen in advance.
- Hierarchical clustering pros: The number of clusters can be determined from the model itself.
- Hierarchical clustering cons: It is not suitable for large datasets.
K-Means Clustering Example (Python)
These are the steps to perform the example.
Import the relevant libraries.
Load the data
Now we load the data, a .csv file saved in the same folder as the clustering.ipynb notebook, and check what is inside the file, as shown in this figure.
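Since the original file is not reproduced here, the sketch below loads a small stand-in table from an in-memory string with pandas. The column names (`country`, `Longitude`, `Latitude`, `continent`) and the rough coordinates are assumptions for illustration, not the article's data.

```python
import io

import pandas as pd

# Stand-in for the article's CSV file, whose real name and contents are not
# shown. The coordinates are rough, illustrative values.
csv_text = """country,Longitude,Latitude,continent
Angola,17.5,-12.3,Africa
Afghanistan,66.0,33.8,Asia
Albania,20.0,41.1,Europe
Aruba,-70.0,12.5,North America
Argentina,-65.2,-35.4,South America
"""

# With a real file saved next to the notebook, the argument would be the
# file's name instead of the StringIO object.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```

Printing `data.head()` is a quick way to check what is inside the file before going further.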
To map the data, we create a new variable data_mapped equal to data.copy(), then set data_mapped['continent'] to data_mapped['continent'].map(...), mapping Africa to 0, Asia to 1, Europe to 2, North America to 3 and South America to 4, as shown in this figure.
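The mapping step could look roughly like this. The `data` frame below is a hypothetical stand-in with assumed column names and illustrative coordinates; only the variable name `data_mapped` and the continent codes come from the text.

```python
import pandas as pd

# Hypothetical stand-in rows; the article's real data are not reproduced here.
data = pd.DataFrame({
    "country": ["Angola", "Afghanistan", "Albania", "Aruba", "Argentina"],
    "Longitude": [17.5, 66.0, 20.0, -70.0, -65.2],
    "Latitude": [-12.3, 33.8, 41.1, 12.5, -35.4],
    "continent": ["Africa", "Asia", "Europe", "North America", "South America"],
})

# Copy first so the original frame keeps its text labels.
data_mapped = data.copy()

# Replace each continent name with its integer code.
data_mapped["continent"] = data_mapped["continent"].map(
    {"Africa": 0, "Asia": 1, "Europe": 2, "North America": 3, "South America": 4}
)
print(data_mapped)
```

Any continent name missing from the dictionary would become NaN, so it is worth checking the result for missing values.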
Next, we select the features that we intend to use for clustering, as shown below.
In the above picture, we select three columns and leave out only one, the country column.
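A sketch of the feature selection, assuming the country name is the first column and the three numeric columns follow it. The frame and the variable name `x` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical mapped table: country name first, then three numeric columns.
data_mapped = pd.DataFrame({
    "country": ["Angola", "Afghanistan", "Albania"],
    "Longitude": [17.5, 66.0, 20.0],
    "Latitude": [-12.3, 33.8, 41.1],
    "continent": [0, 1, 2],
})

# Keep columns 1 to 3 by position, leaving out column 0 (the country name).
x = data_mapped.iloc[:, 1:4]
print(x.columns.tolist())  # ['Longitude', 'Latitude', 'continent']
```

Selecting by name, for example `data_mapped[["Longitude", "Latitude", "continent"]]`, gives the same result and does not depend on column order.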
Perform K-Means Clustering
In the cell above, we perform K-means clustering with 5 clusters; the results are shown in the figure below.
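With scikit-learn, this step might be written as follows. The feature table is made up (two illustrative rows per continent code), and `n_init=10` and `random_state=0` are assumptions added for reproducibility; only the 5 clusters and the name `identified_clusters` come from the text.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical feature table: two made-up rows per continent code.
x = pd.DataFrame({
    "Longitude": [17.5, 30.0, 66.0, 50.0, 20.0, 14.0, -70.0, -62.0, -65.0, -60.0],
    "Latitude": [-12.0, -3.0, 34.0, 24.0, 41.0, 47.0, 12.0, 17.0, -35.0, -20.0],
    "continent": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
})

# Fit K-means with 5 clusters and get one cluster label per row.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
identified_clusters = kmeans.fit_predict(x)
print(identified_clusters)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.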
Now we create a data frame, data_with_clusters, which is equal to data, and add an extra column, Cluster, which is equal to identified_clusters, as shown in the figure.
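A minimal sketch of attaching the labels. Here `data` is a hypothetical stand-in frame, and `identified_clusters` is hard-coded to the cluster numbers the article reports for these four countries rather than computed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows with assumed column names.
data = pd.DataFrame({
    "country": ["Angola", "Afghanistan", "Albania", "Aruba"],
    "Longitude": [17.5, 66.0, 20.0, -70.0],
    "Latitude": [-12.3, 33.8, 41.1, 12.5],
})

# Stand-in for the K-means output: the cluster numbers reported in the
# article for these countries.
identified_clusters = np.array([0, 3, 2, 1])

# Copy the frame so the original stays unchanged, then add the labels.
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = identified_clusters
print(data_with_clusters)
```

Using `data.copy()` rather than plain assignment is a design choice here: it keeps the new column out of the original frame.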
The above picture shows that Angola, Burundi & Benin are in cluster 0; Aruba, Anguilla and Antigua & Barb are in cluster 1; Albania, Aland, Andorra, Austria & Belgium are in cluster 2; and Afghanistan, United Arab Emirates & Azerbaijan are in cluster 3.
Finally, we draw a scatter plot to obtain a map of the real world, with Longitude along the x-axis and Latitude along the y-axis.
These clusters are based on geographical location; the result is shown in this figure.
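A plotting sketch with matplotlib, assuming longitude on the horizontal axis and latitude on the vertical axis, which is the usual map orientation. The rows, the `rainbow` colour map and the axis limits are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical clustered table with rough, illustrative coordinates.
data_with_clusters = pd.DataFrame({
    "Longitude": [17.5, 66.0, 20.0, -70.0],
    "Latitude": [-12.3, 33.8, 41.1, 12.5],
    "Cluster": [0, 3, 2, 1],
})

fig, ax = plt.subplots()

# Colour each point by its cluster label.
ax.scatter(
    data_with_clusters["Longitude"],
    data_with_clusters["Latitude"],
    c=data_with_clusters["Cluster"],
    cmap="rainbow",
)

# Fix the axes to the full longitude and latitude ranges so the plot is
# laid out like a world map.
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

In a notebook the figure renders at the end of the cell; in a script, `plt.show()` would display it.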