# What is k-Means Cluster Analysis?

Cluster analysis is a method for automatically grouping data into a smaller number of subsets or clusters so that the records grouped are most statistically similar to each other based on the attributes of the data compared.

“In statistics and data mining,is a method of cluster analysis which aims to partitionk-means clusteringnobservations intokclusters in which each observation belongs to the cluster with the nearest mean.Given a set of observations (

x_{1},x_{2}, …,x_{n}), where each observation is ad-dimensional real vector,k-means clustering aims to partition thenobservations intoksets (k≤n)S= {S_{1},S_{2}, …,S_{k}} so as to minimize the within-cluster sum of squares.

where

μ_{i}is the mean of points inS_{i}.“–Wikipedia

k-Means cluster analysis achieves this by partitioning the data into the required number of clusters by grouping records so that the euclidean distance between the record’s dimensions and the clusters centroid (point with the average dimensions of the points in the cluster) are as small as possible.

The following is a macro I wrote in VBA for Microsoft Excel that performs k-Means Cluster Analysis on the table selected.

# The k-Means Algorithm

The k-Means algorithm is an iteration of the following steps until stability is achieved i.e. the cluster assignments of individual records are no longer changing.

- Determine the coordinates of the centroids. (initially the centroids are random, unique points, thereafter the mean coordinates of the members of the cluster are assigned to the centroids).
- Determine the euclidean distance of each record to each centroid.
- Group records with their nearest centroid.

# The Code

Firstly I have created a private type to represent our records and centroids and created two class level arrays to hold them as well as a class level variable to hold the table on which the analysis is being performed.

Dimension() As Double

Distance() As Double

Cluster As Integer

End Type

Dim Table As Range

Dim Record() As Records

Dim Centroid() As Records

User Interface

The following method,`Run()`

can be used as a starting point and hooked to buttons etc.

'Run k-Means

If Not kMeansSelection Then

Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error")

End If

End Sub

Next, a method is created that prompts the user to select the table to be analysed and to input the desired number of clusters that the data should be grouped into. The function does not require any arguments and returns a boolean indicating whether or not any errors have been encountered.

'Get user table selection

On Error Resume Next

Set Table = Application.InputBox(Prompt:= _

"Please select the range to analyse.", _

title:="Specify Range", Type:=8)

If Table Is Nothing Then Exit Function 'Cancelled

'Check table dimensions

If Table.Rows.Count < 4 Or Table.columns.Count < 2 Then

Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns."

End If

'Get number of clusters

Dim numClusters As Integer

numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1)

If Not numClusters > 0 Or numClusters = False Then

Exit Function 'Cancelled

End If

If Err.Number = 0 Then

If kMeans(Table, numClusters) Then

outputClusters

End If

End If

kMeansSelection_Error:

kMeansSelection = (Err.Number = 0)

End Function

If a table has been selected and a number of clusters defined appropriately, the `kMeans (Table, numClusters)`

method is invoked with the Table and number of clusters as parameters.

If the `kMeans (Table, numClusters)`

method executes without errors, a final method, `outputClusters()`

is invoked which creates a new worksheet in the active workbook and outputs the results of the analysis.

## Assigning Records to Clusters

This is where the actual analysis of the records takes place and cluster assignments are made.

First and foremost, the method is declared with `Function kMeans(Table As Range, Clusters As Integer) As Boolean`

. the Function takes two parameters, the table being analysed as an Excel Range object and Clusters, an integer denoting the number of clusters to be created.

'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns)

'Clusters - Number of clusters to reduce records into.

On Error Resume Next

'Script Performance Variables

Dim PassCounter As Integer

'Initialize Data Arrays

ReDim Record(2 To Table.Rows.Count)

Dim r As Integer 'record

Dim d As Integer 'dimension index

Dim d2 As Integer 'dimension index

Dim c As Integer 'centroid index

Dim c2 As Integer 'centroid index

Dim di As Integer 'distance

Dim x As Double 'Variable Distance Placeholder

Dim y As Double 'Variable Distance Placeholder

`On error Resume Next`

is used to pass errors up to to the calling method, and a number of array index variables are declared. `x`

and `y`

are declared for later use in mathematical operations.

The first step is to size the `Record()`

array to the number of rows in the table. `(2 to Table.Rows.Count)`

is used as it is assumed (and required) that the first row of the table holds the column titles.

Then, for every record in the `Record()`

array, the Record type’s `Dimension()`

array is sized to the number of columns (again assuming that the first column holds the row titles) and the `Distance()`

array is sized to the number of clusters. An internal loop then assigns the values of the columns in the row to the `Dimension()`

array.

'Initialize Dimension Value Arrays

ReDim Record(r).Dimension(2 To Table.columns.Count)

'Initialize Distance Arrays

ReDim Record(r).Distance(1 To Clusters)

For d = LBound(Record(r).Dimension) To UBound(Record(r).Dimension)

Record(r).Dimension(d) = Table.Rows(r).Cells(d).Value

Next d

Next r

In much the same way, the initial centroids must be initialized. I have assigned the coordinates of the first few records as the initial centroids coordinates checking that each new centroid has unique coordinates. If not, the script simply moves on to the next record until a unique set of coordinates is found for the centroid.

The method used to calculate centroid uniqueness here is almost exactly the same as the method used later on to calculate the distance between individual records and the centroids. Here the centroids are checked for uniqueness by measuring their dimensions’ distance from 0.

ReDim Centroid(1 To Clusters)

Dim uniqueCentroid As Boolean

For c = LBound(Centroid) To UBound(Centroid)

'Initialize Centroid Dimension Depth

ReDim Centroid(c).Dimension(2 To Table.columns.Count)

'Initialize record index to next record

r = LBound(Record) + c - 2

Do ' Loop to ensure new centroid is unique

r = r + 1 'Increment record index throughout loop to find unique record to use as a centroid

'Assign record dimensions to centroid

For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)

Centroid(c).Dimension(d) = Record(r).Dimension(d)

Next d

uniqueCentroid = True

For c2 = LBound(Centroid) To c - 1

'Loop Through Record Dimensions and check if all are the same

x = 0

y = 0

For d2 = LBound(Centroid(c).Dimension) To _

UBound(Centroid(c).Dimension)

x = x + Centroid(c).Dimension(d2) ^ 2

y = y + Centroid(c2).Dimension(d2) ^ 2

Next d2

uniqueCentroid = Not Sqr(x) = Sqr(y)

If Not uniqueCentroid Then Exit For

Next c2

Loop Until uniqueCentroid

Next c

The next step is to calculate each records distance from each centroid and assign the record to the nearest cluster.

`Dim lowestDistance As Double`

– The lowestDistance variable holds the shortest distance measured between a record and centroid thus far for evaluation against subsequent measurements.`Dim lastCluster As Integer`

– The lastCluster variable holds the cluster a record is assigned to before any new assignments are made and is used to evaluate whether or not stability has been achieved.`Dim ClustersStable As Boolean`

– The cluster assignment and centroid re-calculation phases are repeated until`ClustersStable = true`

.

Dim lastCluster As Integer

Dim ClustersStable As Boolean

Do 'While Clusters are not Stable

PassCounter = PassCounter + 1

ClustersStable = True 'Until Proved otherwise

'Loop Through Records

For r = LBound(Record) To UBound(Record)

lastCluster = Record(r).Cluster

lowestDistance = 0 'Reset lowest distance

'Loop through record distances to centroids

For c = LBound(Centroid) To UBound(Centroid)

'======================================================

' Calculate Euclidean Distance

'======================================================

' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2)

'------------------------------------------------------

' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2

' d(p,q) = X

x = 0

y = 0

'Loop Through Record Dimensions

For d = LBound(Record(r).Dimension) To _

UBound(Record(r).Dimension)

y = Record(r).Dimension(d) - Centroid(c).Dimension(d)

y = y ^ 2

x = x + y

Next d

x = Sqr(x) 'Get square root

'If distance to centroid is lowest (or first pass) assign record to centroid cluster.

If c = LBound(Centroid) Or x < lowestDistance Then

lowestDistance = x

'Assign distance to centroid to record

Record(r).Distance(c) = lowestDistance

'Assign record to centroid

Record(r).Cluster = c

End If

Next c

'Only change if true

If ClustersStable Then ClustersStable = Record(r).Cluster = lastCluster

Next r

Once each record is assigned to a cluster, the centroids of the clusters are re-positioned to the mean coordinates of the cluster. After the centroids have moved, each records closest centroid is re-evaluated and the process is iterated until stability is achieved (i.e. cluster assignments are no longer changing).

For c = LBound(Centroid) To UBound(Centroid) 'For every cluster

'Loop through cluster dimensions

For d = LBound(Centroid(c).Dimension) To _

UBound(Centroid(c).Dimension)

Centroid(c).Cluster = 0 'Reset nunber of records in cluster

Centroid(c).Dimension(d) = 0 'Reset centroid dimensions

'Loop Through Records

For r = LBound(Record) To UBound(Record)

'If Record is in Cluster then

If Record(r).Cluster = c Then

'Use to calculate avg dimension for records in cluster

'Add to number of records in cluster

Centroid(c).Cluster = Centroid(c).Cluster + 1

'Add record dimension to cluster dimension for later division

Centroid(c).Dimension(d) = Centroid(c).Dimension(d) + _

Record(r).Dimension(d)

End If

Next r

'Assign Average Dimension Distance

Centroid(c).Dimension(d) = Centroid(c).Dimension(d) / _

Centroid(c).Cluster

Next d

Next c

Loop Until ClustersStable

kMeans = (Err.Number = 0)

End Function

# Displaying the Results

the `outputClusters()`

method outputs the results in two tables. The first table contains each record name and the assigned cluster number, and the second contains the centroid coordinates.

Dim c As Integer 'Centroid Index

Dim r As Integer 'Row Index

Dim d As Integer 'Dimension Index

Dim oSheet As Worksheet

On Error Resume Next

Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook)

'Loop Through Records

Dim rowNumber As Integer

rowNumber = 1

'Output Headings

With oSheet.Rows(rowNumber)

With .Cells(1)

.Value = "Row Title"

.Font.Bold = True

.HorizontalAlignment = xlCenter

End With

With .Cells(2)

.Value = "Centroid"

.Font.Bold = True

.HorizontalAlignment = xlCenter

End With

End With

'Print by Row

rowNumber = rowNumber + 1 'Blank Row

For r = LBound(Record) To UBound(Record)

oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows(r).Cells(1).Value

oSheet.Rows(rowNumber).Cells(2).Value = Record(r).Cluster

rowNumber = rowNumber + 1

Next r

'Print Centroids - Headings

rowNumber = rowNumber + 1

For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension)

With oSheet.Rows(rowNumber).Cells(d)

.Value = Table.Rows(1).Cells(d).Value

.Font.Bold = True

.HorizontalAlignment = xlCenter

End With

Next d

'Print Centroids

rowNumber = rowNumber + 1

For c = LBound(Centroid) To UBound(Centroid)

With oSheet.Rows(rowNumber).Cells(1)

.Value = "Centroid " & c

.Font.Bold = True

End With

'Loop through cluster dimensions

For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)

oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d)

Next d

rowNumber = rowNumber + 1

Next c

oSheet.columns.AutoFit '//AutoFit columns to contents

outputClusters_Error:

outputClusters = (Err.Number = 0)

End Function

It’s unlikely that this type of output will be of much use, but it serves to demonstrate the way in which the record cluster assignments or cluster records can be accessed in your own solutions.

The `outputClusters()`

function makes use of another custom method: `addWorksheet()`

which adds a worksheet to the specified/active workbook with the specified name. If a worksheet with the same name already exists, the `outputClusters()`

function adds/increments a number appended to the worksheet name. The `WorksheetExists()`

Function is also included in the following:

On Error Resume Next

'// If a Workbook wasn't specified, use the active workbook

If Workbook Is Nothing Then Set Workbook = ActiveWorkbook

Dim Num As Integer

'// If a worksheet(s) exist with the same name, add/increment a number after the name

While WorksheetExists(Name, Workbook)

Num = Num + 1

If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " ("))

Name = Name & " (" & Num & ")"

Wend

'//Add a sheet to the workbook

Set addWorksheet = Workbook.Worksheets.Add

'//Name the sheet

addWorksheet.Name = Name

End Function

Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean

On Error Resume Next

WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "")

On Error GoTo 0

End Function

# Downloads

Download the entire module (10.2 KB)

Download an example workbook (39.3 KB)

Pingback: clustering in excel()

Pingback: maillot Marseille()