Dinesh Asanka

Clustering in Azure Machine Learning

January 13, 2021 by Dinesh Asanka

Introduction

In this article, we will discuss Clustering in Azure Machine Learning, which is another machine learning technique alongside Regression analysis and Classification analysis. So far in this article series, we have discussed basic cleaning techniques, feature selection techniques, Principal Component Analysis, Comparing Models and Cross-Validation. This article also introduces a few techniques that were not discussed before. Previously, we discussed how to perform clustering in SQL Server during the SQL Server Data Mining series.

What is Clustering?

Clustering is an unsupervised technique that is used to perform natural grouping on your data set. For example, if you are the head of customer relations, you would prefer to group your customers according to the number of officers you have, so that you can allocate a separate officer to each cluster. Which attributes should be used to group your customers, given that there are many candidates such as age, salary, designation and educational qualification? Since there are a lot of attributes, you need machine learning techniques to group them. Clustering in Azure Machine Learning provides you with techniques to cluster your data set.

There are different clustering techniques, as shown in the figure below.

Different types of Clustering techniques.

Though you will see a large number of clustering techniques, K-Means is the only technique that is supported for Clustering in Azure Machine Learning.

Simple Clustering

Let us look at how to create a simple clustering model in Azure Machine Learning. This time, we will not be using the same controls that were used for Classification and Regression in the previous articles.

Let us start by creating an empty experiment after logging in to Azure Machine Learning Studio. Then drag and drop the adventure dataset that we have been using in the previous articles. Since we do not need all the columns, let us select the MaritalStatus, Gender, YearlyIncome, TotalChildren, NumberChildrenAtHome, EnglishEducation, EnglishOccupation, HouseOwnerFlag, NumberCarsOwned, CommuteDistance, Region, Age and BikeBuyer attributes by using Select Columns in Dataset, as shown in the below screen.

Choosing only required columns in Azure Machine Learning.

Though we can run the clustering straightaway, the clusters will not be in good shape because YearlyIncome is not normalized and its large values dominate the other columns. To bring the columns to a uniform scale, we can use the Normalize Data control with ZScore as the transformation method. With that, you are ready to perform clustering.
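To see what the ZScore transformation does to a column, here is a minimal sketch in plain Python. The sample values are hypothetical, and the use of the population standard deviation is an assumption rather than something confirmed by this article. Each value is replaced by its distance from the column mean, measured in standard deviations, so YearlyIncome and TotalChildren end up on a comparable scale.

```python
from statistics import mean, pstdev


def zscore(values):
    """Return (x - mean) / std for each value; a constant column maps to 0.0."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample values, not taken from the dataset used in the article.
yearly_income = [30000, 60000, 90000, 120000]
total_children = [0, 1, 2, 5]

income_z = zscore(yearly_income)
children_z = zscore(total_children)

# After the transformation both columns have mean 0 and standard deviation 1,
# so neither dominates a distance calculation.
print([round(z, 3) for z in income_z])
print([round(z, 3) for z in children_z])
```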

Including Cluster controls in Azure Machine Learning.

In Clustering in Azure Machine Learning, we are introducing two new controls that were not used before: K-Means Clustering and Train Clustering Model. They are connected to the Normalize Data control as shown in the above figure.

Let us see how we can configure these controls. First, we will start with K-Means Clustering.

K-Means Clustering

As indicated before, Azure Machine Learning supports only K-Means clustering, and there are a few configurations to be done as shown below.

Configuring K-Means Clustering in Azure Machine Learning.

In K-Means Clustering, there are two trainer modes, Single Parameter and Parameter Range. The Single Parameter mode is used if you know your parameters. The Parameter Range mode is used along with the Sweep Clustering control, which will be discussed later. The number of centroids is an important parameter that defines how many clusters you will get. This value is set to two by default, and setting it to a large value does not guarantee that you will get that many clusters. The two available distance measures are Euclidean and Cosine.
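The roles of the number of centroids and the distance measure can be illustrated with a small K-Means sketch in plain Python. This is an illustration under stated assumptions, not the implementation used by Azure Machine Learning: the function names, the random seeding and the sample points are all hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: compares direction and ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def kmeans(points, k, distance=euclidean, iterations=100, seed=0):
    """Plain K-Means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until nothing changes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical, already normalized points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Passing `distance=cosine_distance` switches the measure, which groups points by direction rather than by absolute position.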

Train Clustering Model

In Train Clustering Model, you can choose the columns that are required for the clustering. This means that you do not have to use the Select Columns in Dataset control, though we have used it as a good practice. Following is the cluster graph, which shows that there is little separation between these clusters.

Cluster graph

With these two configurations, you are done with the modeling of K-Means Clustering in Azure Machine Learning. Since this is unsupervised training, the clustering cannot be evaluated in the same way as classification or regression. The next task is to allocate data points to the derived clusters.

Assign Data to Clusters

In this exercise, we need a data source. It can be the same data source that we used before or a different data source. To simulate the real-world scenario, let us use a different dataset.

Complete configuration of Clustering in Azure Machine Learning.

The Assign Data to Clusters control does not have any configurations. This control simply joins the data set with the trained model. To visualize the cluster data, you need to use the Convert to Dataset control. From the Convert to Dataset control, you can view the cluster assignments and the distance to each cluster as shown in the below figure.

Output of the Cluster assignment.

If you analyze the above output further, you will see that each customer record is assigned to the closest cluster.
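The assignment rule can be sketched in a few lines of Python. The centroids and rows below are hypothetical values, and the output keys are illustrative names rather than the exact column names produced by the control. For every row, the distance to each cluster centre is computed and the row is given the index of the smallest one.

```python
import math


def assign_to_clusters(rows, centroids):
    """For each row, compute the distance to every centroid and pick the closest."""
    results = []
    for row in rows:
        distances = [math.dist(row, c) for c in centroids]
        results.append(
            {"assignment": distances.index(min(distances)), "distances": distances}
        )
    return results


# Hypothetical centroids from an already trained model, and two new records.
centroids = [(0.0, 0.0), (3.0, 4.0)]
new_rows = [(0.0, 1.0), (3.0, 3.0)]

assigned = assign_to_clusters(new_rows, centroids)
for row, result in zip(new_rows, assigned):
    print(row, result["assignment"], [round(d, 3) for d in result["distances"]])
```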

In a practical situation, we would like to see the output as the contact details of customers, such as name, address, email address and phone number, rather than the model parameters. It is much neater to concatenate names and addresses into a single column rather than displaying them in separate columns. Since there is no built-in technique to concatenate columns in Azure Machine Learning, you can write a Python script. Please note that third-party controls are also available, and you can find them with a simple Google search.

Following is the Python code that can be used in the Execute Python Script control.

Then, after performing a simple column selection, the clustered data is available as shown in the below figure.
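As a minimal sketch of such a script, the concatenation could look like the following. The column names FirstName, LastName, AddressLine1 and City are hypothetical and depend on your dataset, as are the output names FullName and FullAddress. The `azureml_main` function is the entry point that the Execute Python Script control calls with up to two input data frames, and it must return a sequence of data frames.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point called by Execute Python Script with the connected inputs."""
    df = dataframe1.copy()

    # Hypothetical column names; replace them with the ones in your dataset.
    first = df["FirstName"].fillna("").astype(str)
    last = df["LastName"].fillna("").astype(str)
    street = df["AddressLine1"].fillna("").astype(str)
    city = df["City"].fillna("").astype(str)

    # Concatenate into single, easier to read columns.
    df["FullName"] = (first + " " + last).str.strip()
    df["FullAddress"] = (street + ", " + city).str.strip(", ")

    # The control expects a sequence of data frames, hence the trailing comma.
    return df,


# Local usage example with made-up records.
sample = pd.DataFrame(
    {
        "FirstName": ["Ann", "Raj"],
        "LastName": ["Lee", "Perera"],
        "AddressLine1": ["1 Main St", "22 Lake Rd"],
        "City": ["Seattle", "Colombo"],
    }
)
(result,) = azureml_main(sample)
print(result[["FullName", "FullAddress"]])
```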

Output of the Cluster assignment after data manipulation for better usage.

Now, you can deploy this model as a web service and use it for prediction as we did in a previous article.

If you want to distribute this data to different data streams, you can do this by using a Split Data control, configured as below.

Configuring Split Data control

If you have more than two clusters and you need to separate them into different data streams, you need to use multiple Split Data controls, as Split Data supports only two outputs.
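The idea of chaining two-way splits can be sketched as follows. This is an illustration in plain Python rather than the Split Data control itself. The `Assignments` key and the sample records are assumptions. Each split peels one cluster off the remaining rows, so n clusters need n - 1 splits.

```python
def split_data(rows, predicate):
    """Two-way split: rows matching the predicate, and all the rest."""
    matched = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return matched, rest


def split_by_cluster(rows, cluster_ids):
    """Chain two-way splits so that each cluster ends up in its own stream."""
    streams = {}
    remaining = rows
    for cid in cluster_ids[:-1]:
        streams[cid], remaining = split_data(
            remaining, lambda r, cid=cid: r["Assignments"] == cid
        )
    # Whatever is left after n - 1 splits is treated as the last cluster.
    streams[cluster_ids[-1]] = remaining
    return streams


# Hypothetical clustered records.
rows = [
    {"Customer": "A", "Assignments": 0},
    {"Customer": "B", "Assignments": 2},
    {"Customer": "C", "Assignments": 1},
    {"Customer": "D", "Assignments": 0},
]
streams = split_by_cluster(rows, [0, 1, 2])
for cluster_id, members in streams.items():
    print(cluster_id, [m["Customer"] for m in members])
```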

This is the final model for Clustering in Azure Machine Learning.

Final model for Clustering in Azure Machine Learning.

Choosing the Optimum Number of Clusters

As we discussed briefly, rather than defining the number of clusters ourselves, we can use the Sweep Clustering control to find the optimum number of clusters as shown below.

Defining the optimum number of clusters.

Let us see how to configure Sweep Clustering. To retrieve the optimum number of clusters, we need to modify the K-Means Clustering control as follows.

Configuring K-Means clustering for Sweep Clustering.

In the above configuration, we are specifying the number of clusters as a range. Following are the outputs of Sweep Clustering. In the Sweep Clustering control, the first output is the best model, the second output provides the data with cluster assignments, and the last output contains the results of the multiple clustering runs.

Finding the best cluster with Highest Cluster Metric.

As you can see, the optimum number of clusters is four as it has the highest cluster metric.
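To illustrate what such a sweep does, here is a self-contained sketch in plain Python. It runs K-Means for each candidate number of clusters and keeps the one with the highest score. The metric used here is a simplified silhouette and the seeding is a deterministic farthest-first rule. Both are assumptions chosen for the illustration, not necessarily what the experiment in this article used, and the sample points are synthetic.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then keep adding the
    point that is farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=100):
    """Plain K-Means with Euclidean distance."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, labels


def simplified_silhouette(points, centroids, labels):
    """Average of (b - a) / max(a, b), where a is the distance to the point's
    own centroid and b is the distance to the nearest other centroid."""
    scores = []
    for p, label in zip(points, labels):
        a = math.dist(p, centroids[label])
        b = min(math.dist(p, c) for i, c in enumerate(centroids) if i != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def sweep(points, k_values):
    """Try every k and return the one with the highest metric."""
    scores = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        scores[k] = simplified_silhouette(points, centroids, labels)
    return max(scores, key=scores.get), scores


# Synthetic data: four tight groups at the corners of a square.
corners = [(0, 0), (0, 10), (10, 0), (10, 10)]
offsets = [(0, 0), (0.5, 0), (0, 0.5)]
points = [(x + dx, y + dy) for x, y in corners for dx, dy in offsets]

best_k, scores = sweep(points, range(2, 6))
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

With too few clusters, distinct groups are merged and points sit far from their centroid. With too many, a natural group is split and its points sit close to a neighbouring centroid. Both lower the score.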

Clustering and Classification

As we have discussed classification before, we can now integrate Clustering and Classification. Rather than applying classification to the entire data set, we can configure classification for separate clusters so that we can choose the optimum classification technique for each. You can have different classification techniques for different clusters.

Usage of Classification with Clustering technique.

However, it is important to make sure that each classification has an adequate number of records, because performing classification on divided clusters reduces the number of records available to each model.

Conclusion

This article discussed one of the main unsupervised techniques, Clustering in Azure Machine Learning. Though there are a lot of clustering techniques, K-Means is the only technique that is supported in Azure Machine Learning. By using clustering, we can assign the data set to the defined clusters. Similarly, we can use Sweep Clustering to find the optimum number of clusters. Further, we looked at the option of combining Clustering and Classification in order to achieve better results.

Further References

Table of contents

Introduction to Azure Machine Learning using Azure ML Studio
Data Cleansing in Azure Machine Learning
Prediction in Azure Machine Learning
Feature Selection in Azure Machine Learning
Data Reduction Technique: Principal Component Analysis in Azure Machine Learning
Prediction with Regression in Azure Machine Learning
Prediction with Classification in Azure Machine Learning
Comparing models in Azure Machine Learning
Cross Validation in Azure Machine Learning
Clustering in Azure Machine Learning
Tune Model Hyperparameters for Azure Machine Learning models
Time Series Anomaly Detection in Azure Machine Learning
Designing Recommender Systems in Azure Machine Learning
Language Detection in Azure Machine Learning with basic Text Analytics Techniques
Azure Machine Learning: Named Entity Recognition in Text Analytics
Filter based Feature Selection in Text Analytics
Latent Dirichlet Allocation in Text Analytics
Recommender Systems for Customer Reviews
AutoML in Azure Machine Learning
AutoML in Azure Machine Learning for Regression and Time Series
Building Ensemble Classifiers in Azure Machine Learning
Text Classification in Azure Machine Learning using Word Vectors