What is k-Means Cluster Analysis?

Cluster analysis is a method for automatically grouping data into a smaller number of subsets or clusters so that the records grouped are most statistically similar to each other based on the attributes of the data compared.

In statistics and data miningk-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Given a set of observations (x1x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ nS = {S1S2, …, Sk} so as to minimize the within-cluster sum of squares.

\underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x_j \in S_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2

where μi is the mean of points in Si.Wikipedia 

Scatter Chart Clusters and Centroids

Scatter chart with coloured clusters and their centroids displayed in red

k-Means cluster analysis achieves this by partitioning the data into the required number of clusters by grouping records so that the euclidean distance between the record’s dimensions and the clusters centroid (point with the average dimensions of the points in the cluster) are as small as possible.

The following is a macro I wrote in VBA for Microsoft Excel that performs k-Means Cluster Analysis on the table selected.

The k-Means Algorithm

The k-Means algorithm is an iteration of the following steps until stability is achieved i.e. the cluster assignments of individual records are no longer changing.

  1. Determine the coordinates of the centroids. (initially the centroids are random, unique points, thereafter the mean coordinates of the members of the cluster are assigned to the centroids).
  2. Determine the euclidean distance of each record to each centroid.
  3. Group records with their nearest centroid.

The Code

Firstly I have created a private type to represent our records and centroids and created two class level arrays to hold them as well as a class level variable to hold the table on which the analysis is being performed.

Private Type Records
Dimension() As Double
Distance() As Double
Cluster As Integer
End Type

Dim Table As Range
Dim Record() As Records
Dim Centroid() As Records

User Interface

The following method,Run() can be used as a starting point and hooked to buttons etc.

Sub Run()
'Run k-Means
If Not kMeansSelection Then
Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error")
End If
End Sub

Next, a method is created that prompts the user to select the table to be analysed and to input the desired number of clusters that the data should be grouped into. The function does not require any arguments and returns a boolean indicating whether or not any errors have been encountered.

Function kMeansSelection() As Boolean

'Get user table selection
On Error Resume Next
Set Table = Application.InputBox(Prompt:= _
"Please select the range to analyse.", _
title:="Specify Range", Type:=8)

If Table Is Nothing Then Exit Function 'Cancelled

'Check table dimensions
If Table.Rows.Count < 4 Or Table.columns.Count < 2 Then Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns." End If 'Get number of clusters Dim numClusters As Integer numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1) If Not numClusters > 0 Or numClusters = False Then
Exit Function 'Cancelled
End If
If Err.Number = 0 Then
If kMeans(Table, numClusters) Then
End If
End If

kMeansSelection = (Err.Number = 0)
End Function

If a table has been selected and a number of clusters defined appropriately, the kMeans (Table, numClusters) method is invoked with the Table and number of clusters as parameters.

If the kMeans (Table, numClusters) method executes without errors, a final method, outputClusters() is invoked which creates a new worksheet in the active workbook and outputs the results of the analysis.

Assigning Records to Clusters

This is where the actual analysis of the records takes place and cluster assignments are made.
First and foremost, the method is declared with Function kMeans(Table As Range, Clusters As Integer) As Boolean. the Function takes two parameters, the table being analysed as an Excel Range object and Clusters, an integer denoting the number of clusters to be created.

Function kMeans(Table As Range, Clusters As Integer) As Boolean
'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns)
'Clusters - Number of clusters to reduce records into.

On Error Resume Next

'Script Performance Variables
Dim PassCounter As Integer

'Initialize Data Arrays
ReDim Record(2 To Table.Rows.Count)
Dim r As Integer 'record
Dim d As Integer 'dimension index
Dim d2 As Integer 'dimension index
Dim c As Integer 'centroid index
Dim c2 As Integer 'centroid index
Dim di As Integer 'distance

Dim x As Double 'Variable Distance Placeholder
Dim y As Double 'Variable Distance Placeholder

On error Resume Next is used to pass errors up to to the calling method, and a number of array index variables are declared. x and y are declared for later use in mathematical operations.

The first step is to size the Record() array to the number of rows in the table. (2 to Table.Rows.Count) is used as it is assumed (and required) that the first row of the table holds the column titles.

Then, for every record in the Record() array, the Record type’s Dimension() array is sized to the number of columns (again assuming that the first column holds the row titles) and the Distance() array is sized to the number of clusters. An internal loop then assigns the values of the columns in the row to the Dimension() array.

For r = LBound(Record) To UBound(Record)
'Initialize Dimension Value Arrays
ReDim Record(r).Dimension(2 To Table.columns.Count)
'Initialize Distance Arrays
ReDim Record(r).Distance(1 To Clusters)
For d = LBound(Record(r).Dimension) To UBound(Record(r).Dimension)
Record(r).Dimension(d) = Table.Rows(r).Cells(d).Value
Next d
Next r

In much the same way, the initial centroids must be initialized. I have assigned the coordinates of the first few records as the initial centroids coordinates checking that each new centroid has unique coordinates. If not, the script simply moves on to the next record until a unique set of coordinates is found for the centroid.

Euclidean DistanceThe method used to calculate centroid uniqueness here is almost exactly the same as the method used later on to calculate the distance between individual records and the centroids. Here the centroids are checked for uniqueness by measuring their dimensions’ distance from 0.

'Initialize Initial Centroid Arrays
ReDim Centroid(1 To Clusters)
Dim uniqueCentroid As Boolean

For c = LBound(Centroid) To UBound(Centroid)
'Initialize Centroid Dimension Depth
ReDim Centroid(c).Dimension(2 To Table.columns.Count)

'Initialize record index to next record
r = LBound(Record) + c - 2

Do ' Loop to ensure new centroid is unique
r = r + 1 'Increment record index throughout loop to find unique record to use as a centroid

'Assign record dimensions to centroid
For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
Centroid(c).Dimension(d) = Record(r).Dimension(d)
Next d

uniqueCentroid = True

For c2 = LBound(Centroid) To c - 1

'Loop Through Record Dimensions and check if all are the same
x = 0
y = 0
For d2 = LBound(Centroid(c).Dimension) To _
x = x + Centroid(c).Dimension(d2) ^ 2
y = y + Centroid(c2).Dimension(d2) ^ 2
Next d2

uniqueCentroid = Not Sqr(x) = Sqr(y)
If Not uniqueCentroid Then Exit For
Next c2

Loop Until uniqueCentroid

Next c

The next step is to calculate each records distance from each centroid and assign the record to the nearest cluster.

  • Dim lowestDistance As Double – The lowestDistance variable holds the shortest distance measured between a record and centroid thus far for evaluation against subsequent measurements.
  • Dim lastCluster As Integer – The lastCluster variable holds the cluster a record is assigned to before any new assignments are made and is used to evaluate whether or not stability has been achieved.
  • Dim ClustersStable As Boolean – The cluster assignment and centroid re-calculation phases are repeated until ClustersStable = true.
Dim lowestDistance As Double
Dim lastCluster As Integer
Dim ClustersStable As Boolean

Do 'While Clusters are not Stable

PassCounter = PassCounter + 1
ClustersStable = True 'Until Proved otherwise

'Loop Through Records
For r = LBound(Record) To UBound(Record)

lastCluster = Record(r).Cluster
lowestDistance = 0 'Reset lowest distance

'Loop through record distances to centroids
For c = LBound(Centroid) To UBound(Centroid)

' Calculate Euclidean Distance
' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2)
' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2
' d(p,q) = X

x = 0
y = 0
'Loop Through Record Dimensions
For d = LBound(Record(r).Dimension) To _
y = Record(r).Dimension(d) - Centroid(c).Dimension(d)
y = y ^ 2
x = x + y
Next d

x = Sqr(x) 'Get square root

'If distance to centroid is lowest (or first pass) assign record to centroid cluster.
If c = LBound(Centroid) Or x < lowestDistance Then
lowestDistance = x
'Assign distance to centroid to record
Record(r).Distance(c) = lowestDistance
'Assign record to centroid
Record(r).Cluster = c
End If
Next c

'Only change if true
If ClustersStable Then ClustersStable = Record(r).Cluster = lastCluster

Next r

Once each record is assigned to a cluster, the centroids of the clusters are re-positioned to the mean coordinates of the cluster. After the centroids have moved, each records closest centroid is re-evaluated and the process is iterated until stability is achieved (i.e. cluster assignments are no longer changing).

'Move Centroids to calculated cluster average
For c = LBound(Centroid) To UBound(Centroid) 'For every cluster

'Loop through cluster dimensions
For d = LBound(Centroid(c).Dimension) To _

Centroid(c).Cluster = 0 'Reset nunber of records in cluster
Centroid(c).Dimension(d) = 0 'Reset centroid dimensions

'Loop Through Records
For r = LBound(Record) To UBound(Record)

'If Record is in Cluster then
If Record(r).Cluster = c Then
'Use to calculate avg dimension for records in cluster

'Add to number of records in cluster
Centroid(c).Cluster = Centroid(c).Cluster + 1
'Add record dimension to cluster dimension for later division
Centroid(c).Dimension(d) = Centroid(c).Dimension(d) + _

End If

Next r

'Assign Average Dimension Distance
Centroid(c).Dimension(d) = Centroid(c).Dimension(d) / _
Next d
Next c

Loop Until ClustersStable

kMeans = (Err.Number = 0)
End Function

Displaying the Results

the outputClusters() method outputs the results in two tables. The first table contains each record name and the assigned cluster number, and the second contains the centroid coordinates.

Function outputClusters() As Boolean

Dim c As Integer 'Centroid Index
Dim r As Integer 'Row Index
Dim d As Integer 'Dimension Index

Dim oSheet As Worksheet
On Error Resume Next

Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook)

'Loop Through Records
Dim rowNumber As Integer
rowNumber = 1

'Output Headings
With oSheet.Rows(rowNumber)
With .Cells(1)
.Value = "Row Title"
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
With .Cells(2)
.Value = "Centroid"
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
End With

'Print by Row
rowNumber = rowNumber + 1 'Blank Row
For r = LBound(Record) To UBound(Record)
oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows(r).Cells(1).Value
oSheet.Rows(rowNumber).Cells(2).Value = Record(r).Cluster
rowNumber = rowNumber + 1
Next r

'Print Centroids - Headings
rowNumber = rowNumber + 1
For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension)
With oSheet.Rows(rowNumber).Cells(d)
.Value = Table.Rows(1).Cells(d).Value
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
Next d

'Print Centroids
rowNumber = rowNumber + 1
For c = LBound(Centroid) To UBound(Centroid)
With oSheet.Rows(rowNumber).Cells(1)
.Value = "Centroid " & c
.Font.Bold = True
End With
'Loop through cluster dimensions
For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d)
Next d
rowNumber = rowNumber + 1
Next c

oSheet.columns.AutoFit '//AutoFit columns to contents

outputClusters = (Err.Number = 0)
End Function

It’s unlikely that this type of output will be of much use, but it serves to demonstrate the way in which the record cluster assignments or cluster records can be accessed in your own solutions.

The outputClusters() function makes use of another custom method: addWorksheet() which adds a worksheet to the specified/active workbook with the specified name. If a worksheet with the same name already exists, the outputClusters() function adds/increments a number appended to the worksheet name. The WorksheetExists() Function is also included in the following:

Function addWorksheet(Name As String, Optional Workbook As Workbook) As Worksheet
On Error Resume Next
'// If a Workbook wasn't specified, use the active workbook
If Workbook Is Nothing Then Set Workbook = ActiveWorkbook

Dim Num As Integer
'// If a worksheet(s) exist with the same name, add/increment a number after the name
While WorksheetExists(Name, Workbook)
Num = Num + 1
If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " ("))
Name = Name & " (" & Num & ")"

'//Add a sheet to the workbook
Set addWorksheet = Workbook.Worksheets.Add

'//Name the sheet
addWorksheet.Name = Name
End Function

Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean
On Error Resume Next
WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "")
On Error GoTo 0
End Function


Download the entire module (10.2 KB)

Download an example workbook (39.3 KB)