# What is k-Means Cluster Analysis?

Cluster analysis is a method for automatically grouping data into a smaller number of subsets or clusters so that the records grouped are most statistically similar to each other based on the attributes of the data compared.

In statistics and data miningk-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Given a set of observations (x1x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ nS = {S1S2, …, Sk} so as to minimize the within-cluster sum of squares.

where μi is the mean of points in Si.Wikipedia

Scatter chart with coloured clusters and their centroids displayed in red

k-Means cluster analysis achieves this by partitioning the data into the required number of clusters by grouping records so that the euclidean distance between the record’s dimensions and the clusters centroid (point with the average dimensions of the points in the cluster) are as small as possible.

The following is a macro I wrote in VBA for Microsoft Excel that performs k-Means Cluster Analysis on the table selected.

# The k-Means Algorithm

The k-Means algorithm is an iteration of the following steps until stability is achieved i.e. the cluster assignments of individual records are no longer changing.

1. Determine the coordinates of the centroids. (initially the centroids are random, unique points, thereafter the mean coordinates of the members of the cluster are assigned to the centroids).
2. Determine the euclidean distance of each record to each centroid.
3. Group records with their nearest centroid.

# The Code

Firstly I have created a private type to represent our records and centroids and created two class level arrays to hold them as well as a class level variable to hold the table on which the analysis is being performed.

Private Type Records
Dimension() As Double
Distance() As Double
Cluster As Integer
End Type

Dim Table As Range
Dim Record() As Records
Dim Centroid() As Records

User Interface

The following method,Run() can be used as a starting point and hooked to buttons etc.

Sub Run()
'Run k-Means
If Not kMeansSelection Then
Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error")
End If
End Sub

Next, a method is created that prompts the user to select the table to be analysed and to input the desired number of clusters that the data should be grouped into. The function does not require any arguments and returns a boolean indicating whether or not any errors have been encountered.

Function kMeansSelection() As Boolean

'Get user table selection
On Error Resume Next
Set Table = Application.InputBox(Prompt:= _
"Please select the range to analyse.", _
title:="Specify Range", Type:=8)

If Table Is Nothing Then Exit Function        'Cancelled

'Check table dimensions
If Table.Rows.Count < 4 Or Table.columns.Count < 2 Then
Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns."
End If

'Get number of clusters
Dim numClusters As Integer
numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1)

If Not numClusters > 0 Or numClusters = False Then
Exit Function        'Cancelled
End If
If Err.Number = 0 Then
If kMeans(Table, numClusters) Then
outputClusters
End If
End If

kMeansSelection_Error:
kMeansSelection = (Err.Number = 0)
End Function

If a table has been selected and a number of clusters defined appropriately, the kMeans (Table, numClusters) method is invoked with the Table and number of clusters as parameters.

If the kMeans (Table, numClusters) method executes without errors, a final method, outputClusters() is invoked which creates a new worksheet in the active workbook and outputs the results of the analysis.

## Assigning Records to Clusters

This is where the actual analysis of the records takes place and cluster assignments are made.
First and foremost, the method is declared with Function kMeans(Table As Range, Clusters As Integer) As Boolean. the Function takes two parameters, the table being analysed as an Excel Range object and Clusters, an integer denoting the number of clusters to be created.

Function kMeans(Table As Range, Clusters As Integer) As Boolean
'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns)
'Clusters - Number of clusters to reduce records into.

On Error Resume Next

'Script Performance Variables
Dim PassCounter As Integer

'Initialize Data Arrays
ReDim Record(2 To Table.Rows.Count)
Dim r As Integer        'record
Dim d As Integer        'dimension index
Dim d2 As Integer        'dimension index
Dim c As Integer        'centroid index
Dim c2 As Integer        'centroid index
Dim di As Integer        'distance

Dim x As Double        'Variable Distance Placeholder
Dim y As Double        'Variable Distance Placeholder

On error Resume Next is used to pass errors up to to the calling method, and a number of array index variables are declared. x and y are declared for later use in mathematical operations.

The first step is to size the Record() array to the number of rows in the table. (2 to Table.Rows.Count) is used as it is assumed (and required) that the first row of the table holds the column titles.

Then, for every record in the Record() array, the Record type’s Dimension() array is sized to the number of columns (again assuming that the first column holds the row titles) and the Distance() array is sized to the number of clusters. An internal loop then assigns the values of the columns in the row to the Dimension() array.

For r = LBound(Record) To UBound(Record)
'Initialize Dimension Value Arrays
ReDim Record(r).Dimension(2 To Table.columns.Count)
'Initialize Distance Arrays
ReDim Record(r).Distance(1 To Clusters)
For d = LBound(Record(r).Dimension) To UBound(Record(r).Dimension)
Record(r).Dimension(d) = Table.Rows(r).Cells(d).Value
Next d
Next r

In much the same way, the initial centroids must be initialized. I have assigned the coordinates of the first few records as the initial centroids coordinates checking that each new centroid has unique coordinates. If not, the script simply moves on to the next record until a unique set of coordinates is found for the centroid.

The method used to calculate centroid uniqueness here is almost exactly the same as the method used later on to calculate the distance between individual records and the centroids. Here the centroids are checked for uniqueness by measuring their dimensions’ distance from 0.

'Initialize Initial Centroid Arrays
ReDim Centroid(1 To Clusters)
Dim uniqueCentroid As Boolean

For c = LBound(Centroid) To UBound(Centroid)
'Initialize Centroid Dimension Depth
ReDim Centroid(c).Dimension(2 To Table.columns.Count)

'Initialize record index to next record
r = LBound(Record) + c - 2

Do        ' Loop to ensure new centroid is unique
r = r + 1        'Increment record index throughout loop to find unique record to use as a centroid

'Assign record dimensions to centroid
For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
Centroid(c).Dimension(d) = Record(r).Dimension(d)
Next d

uniqueCentroid = True

For c2 = LBound(Centroid) To c - 1

'Loop Through Record Dimensions and check if all are the same
x = 0
y = 0
For d2 = LBound(Centroid(c).Dimension) To _
UBound(Centroid(c).Dimension)
x = x + Centroid(c).Dimension(d2) ^ 2
y = y + Centroid(c2).Dimension(d2) ^ 2
Next d2

uniqueCentroid = Not Sqr(x) = Sqr(y)
If Not uniqueCentroid Then Exit For
Next c2

Loop Until uniqueCentroid

Next c

The next step is to calculate each records distance from each centroid and assign the record to the nearest cluster.

• Dim lowestDistance As Double – The lowestDistance variable holds the shortest distance measured between a record and centroid thus far for evaluation against subsequent measurements.
• Dim lastCluster As Integer – The lastCluster variable holds the cluster a record is assigned to before any new assignments are made and is used to evaluate whether or not stability has been achieved.
• Dim ClustersStable As Boolean – The cluster assignment and centroid re-calculation phases are repeated until ClustersStable = true.
Dim lowestDistance As Double
Dim lastCluster As Integer

Do        'While Clusters are not Stable

PassCounter = PassCounter + 1
ClustersStable = True        'Until Proved otherwise

'Loop Through Records
For r = LBound(Record) To UBound(Record)

lastCluster = Record(r).Cluster
lowestDistance = 0        'Reset lowest distance

'Loop through record distances to centroids
For c = LBound(Centroid) To UBound(Centroid)

'======================================================
'           Calculate Euclidean Distance
'======================================================
' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2)
'------------------------------------------------------
' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2
' d(p,q) = X

x = 0
y = 0
'Loop Through Record Dimensions
For d = LBound(Record(r).Dimension) To _
UBound(Record(r).Dimension)
y = Record(r).Dimension(d) - Centroid(c).Dimension(d)
y = y ^ 2
x = x + y
Next d

x = Sqr(x)        'Get square root

'If distance to centroid is lowest (or first pass) assign record to centroid cluster.
If c = LBound(Centroid) Or x < lowestDistance Then
lowestDistance = x
'Assign distance to centroid to record
Record(r).Distance(c) = lowestDistance
'Assign record to centroid
Record(r).Cluster = c
End If
Next c

'Only change if true

Next r

Once each record is assigned to a cluster, the centroids of the clusters are re-positioned to the mean coordinates of the cluster. After the centroids have moved, each records closest centroid is re-evaluated and the process is iterated until stability is achieved (i.e. cluster assignments are no longer changing).

'Move Centroids to calculated cluster average
For c = LBound(Centroid) To UBound(Centroid)        'For every cluster

'Loop through cluster dimensions
For d = LBound(Centroid(c).Dimension) To _
UBound(Centroid(c).Dimension)

Centroid(c).Cluster = 0        'Reset nunber of records in cluster
Centroid(c).Dimension(d) = 0        'Reset centroid dimensions

'Loop Through Records
For r = LBound(Record) To UBound(Record)

'If Record is in Cluster then
If Record(r).Cluster = c Then
'Use to calculate avg dimension for records in cluster

'Add to number of records in cluster
Centroid(c).Cluster = Centroid(c).Cluster + 1
'Add record dimension to cluster dimension for later division
Centroid(c).Dimension(d) = Centroid(c).Dimension(d) + _
Record(r).Dimension(d)

End If

Next r

'Assign Average Dimension Distance
Centroid(c).Dimension(d) = Centroid(c).Dimension(d) / _
Centroid(c).Cluster
Next d
Next c

kMeans = (Err.Number = 0)
End Function

# Displaying the Results

the outputClusters() method outputs the results in two tables. The first table contains each record name and the assigned cluster number, and the second contains the centroid coordinates.

Function outputClusters() As Boolean

Dim c As Integer        'Centroid Index
Dim r As Integer        'Row Index
Dim d As Integer        'Dimension Index

Dim oSheet As Worksheet
On Error Resume Next

Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook)

'Loop Through Records
Dim rowNumber As Integer
rowNumber = 1

With oSheet.Rows(rowNumber)
With .Cells(1)
.Value = "Row Title"
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
With .Cells(2)
.Value = "Centroid"
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
End With

'Print by Row
rowNumber = rowNumber + 1        'Blank Row
For r = LBound(Record) To UBound(Record)
oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows(r).Cells(1).Value
oSheet.Rows(rowNumber).Cells(2).Value = Record(r).Cluster
rowNumber = rowNumber + 1
Next r

rowNumber = rowNumber + 1
For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension)
With oSheet.Rows(rowNumber).Cells(d)
.Value = Table.Rows(1).Cells(d).Value
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
Next d

'Print Centroids
rowNumber = rowNumber + 1
For c = LBound(Centroid) To UBound(Centroid)
With oSheet.Rows(rowNumber).Cells(1)
.Value = "Centroid " & c
.Font.Bold = True
End With
'Loop through cluster dimensions
For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d)
Next d
rowNumber = rowNumber + 1
Next c

oSheet.columns.AutoFit        '//AutoFit columns to contents

outputClusters_Error:
outputClusters = (Err.Number = 0)
End Function

It’s unlikely that this type of output will be of much use, but it serves to demonstrate the way in which the record cluster assignments or cluster records can be accessed in your own solutions.

The outputClusters() function makes use of another custom method: addWorksheet() which adds a worksheet to the specified/active workbook with the specified name. If a worksheet with the same name already exists, the outputClusters() function adds/increments a number appended to the worksheet name. The WorksheetExists() Function is also included in the following:

Function addWorksheet(Name As String, Optional Workbook As Workbook) As Worksheet
On Error Resume Next
'// If a Workbook wasn't specified, use the active workbook
If Workbook Is Nothing Then Set Workbook = ActiveWorkbook

Dim Num As Integer
'// If a worksheet(s) exist with the same name, add/increment a number after the name
While WorksheetExists(Name, Workbook)
Num = Num + 1
If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " ("))
Name = Name & " (" & Num & ")"
Wend

'//Add a sheet to the workbook

'//Name the sheet
End Function

Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean
On Error Resume Next
WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "")
On Error GoTo 0
End Function

• Mark Murphy

I have been looking for a k-means application in VBA and after some searching could not find a drop in working script. I was about to start working on my own version when I ran across this site. After a little confusion about how to select the data the implementation ran perfectly!

Thank you!

Note:
The confusion with the table selection was easy to address. Three columns need to be selected. The k-means as implemented ignores the first row assuming column headings. And it ignores the first column assuming the rows may have names.

• Mogjibs

Hello,

It might sound very ‘basic’ question but I would love to know the following:
once I have downloaded this macro, how could I use this it (launch it) in Excel?

Any comment will be highly appreciated!

Thanks,
M

• Hi Mogjibs.

You’ll need to import it into an excel VBA project. First open the Excel workbook that you are working with then press Alt + F11 to open the VBA IDE.

Press Ctrl + R to ensure that the Project Explorer window is displayed on the left. In the Project Explorer you’ll see your Workbook listed as a VBA project node.

Expand the node to reveal the three sub-level categories, “MS Excel Objects”, “Forms” and “Modules”.

Right click on the “Modules” node and select “Import File”. Then select the downloaded module to import it.

Open the new Module named “Cluster Analysis and find the kMeansSelection function near the beginning.

Change the line that reads “Function kMeansSelection() As Boolean” to “Sub kMeansSelection()”. Ensure that the last line of the function is also changed to read “End Sub” instead of “End Function” and any lines within the function that read “Exit function” are changed to read “Exit Sub”.

Return to your Excel workbook and press Alt + F8 to view a list of available macros, then select the “kMeansSelection” macro and run it.

If this does not work for you let me know and I’ll email you an example in an Excel workbook.

• Anonymous

Hi Mogjibs.

You’ll need to import it into an excel VBA project. First open the Excel workbook that you are working with then press Alt + F11 to open the VBA IDE.

Press Ctrl + R to ensure that the Project Explorer window is displayed on the left. In the Project Explorer you’ll see your Workbook listed as a VBA project node.

Expand the node to reveal the three sub-level categories, “MS Excel Objects”, “Forms” and “Modules”.

Right click on the “Modules” node and select “Import File”. Then select the downloaded module to import it.

Open the new Module named “Cluster Analysis and find the kMeansSelection function near the beginning.

Change the line that reads “Function kMeansSelection() As Boolean” to “Sub kMeansSelection()”. Ensure that the last line of the function is also changed to read “End Sub” instead of “End Function” and any lines within the function that read “Exit function” are changed to read “Exit Sub”.

Return to your Excel workbook and press Alt + F8 to view a list of available macros, then select the “kMeansSelection” macro and run it.

If this does not work for you let me know and I’ll email you an example in an Excel workbook.

• Akashdeep Khera

Hello Sheldon, I came across your code while looking for clustering in excel. Can you send me the example workbook at akashdeep.khera@gmail.com? I am having issues with running this code.

Thanks,
Akash

• Anonymous

It’s on its way. I’ve posted a link above also:)

• Elddy hosea

Sheldon , I need your help. The link had the files in XML form which I have no idea how to understand them.. I also have one dimension data. Can you possibly send me a basic Excel Workbook example.. Please Help email me elderkm2012@gmail.com

• SheldonNeilson

Hi. It sounds like the file has been uncompressed. Modern Excel files are essentially “zip” files containing a number of xml files. Which version of Excel are you using? You’ll need 2007 or newer for that particular file but I can make one for 2003 if you like?

• Elddyhosea

First I’d like to say Thank you. I have 2007, I got it to work for me.. How can I make a small donation for your work

• SheldonNeilson

That won’t be necessary. Glad it helped you:)

• jai

K means clustering worked very well Thanks a lot for creating it.
Is there any other clustering technique available in excel created by you?

• Anupam

Der Neilson,

Could you please email an example in excel workbook on shrivastava4u@hotmail.com

regards
anupam

• antonio

Hello
SheldonNeilson, can you send me a worksheet with code, all the macro and sheets and graphs? I’m not able to use your files…

• U-Aftab

Hi Neilson, I really like your solution. I carried out some basic tests and it ran very robustly. However, when I test it for 14+ clusters, the Excel application seems to hang indefinitely :S .. Is there some limitation I need to look out for or add some condition?

• Anonymous

Hi there. Without having much to go on it sounds like a memory issue.. how big is your data set?

• U-Aftab

I am testing for 19 rows and 5 columns for cluster size = 13. Anything 13 works perfect and in no time :).

• U-Aftab

Anything below**

• SheldonNeilson

Hey U-Aftab. Sorry for the delayed reply.
I just tried this with 30 records with 10 randomly generated attributes, requesting 20 clusters and it worked perfectly.. I’d be interested to see the spreadsheet you’re working with?

• Mbonillacr

Nice Job!!!

• Alex

Hi Nelson, thank you very much for creating this. What if I only have 1 dimension of data? For example I want to divide salaries into 5 clusters? Do I make the y dimension column of the table to be “1” for all items so the euclidean distance is limited to the distance on a line?

• SheldonNeilson

Hi Alex. You can go ahead and select just the one dimension.

• Mrscottsunderland

This is outstanding! Very, Very good work! Thank you very much…. now just figuring out how best to display my results!

• miles

This is impressive… did they add cluster analysis to data analysis to the latest excel? Sheldon i recall spss had weightings as an option? do you think these is something you could take a look at?

gift horse in the mouth..I know! thank sheldon

• Patrick

Hi Sheldon – great tool, thanks. After playing around with it for a bit I have found that it doesn’t seem to work properly if you request, say, 300 clusters (I have 325 rows). Is there a limit to the number of clusters that should be requested?

• Peter

Hi Sheldon – Great tool. Works fine on Mac OS 10.5.8 with Excel for Mac 2011. Had memory problems initially but they resolved when I updated my Excel software.

• mansoorsa

Hi Sheldon. Great work you have done. I have a situation, its sales database 200K rows, 47 columns. It has information such as branch name, branch region, customer occupation, customer family size, customer income, customer city, customer region, brand, financial terms. Unique key is Invoice Number. I need to do customer segmentation. How can I do cluster analysis using XL you have prepared?

• SheldonNeilson

Hi there. As this method uses numeric properties, I’d suggest trying to attach numeric values to your categorical properties. e.g. City A = 1 & City B = 2 (try to tie these values to something concrete like geographical distance). This probably isn’t the best method for segmenting your customers, but having come from a market research background I know that the best scientific methods will inevitably be confounded by somebody’s opinion anyway;)

• Elke

Hi Sheldon,

I’m looking for a way to assign new respondents to existing clusters but I have no expertise in programming in Excel.

Do you think your module is a good starting point for doing this (the predefined clusters are made up by 13 variables) and maybe needs only a few alterations?

Thanx!

Elke

• SheldonNeilson

Hi Elke.

I’m assuming, by your use of the word “respondents” that you’re working with survey data. If the original segmentation was done for you by a research company, they should have provided you with the segmentation criteria that they used to assign respondents to particular segments?

e.g. if a respondent is Male, between the ages of 30 and 50 and buys low-fat milk he belongs to Segment 1.

If you attempt to use this method on your new, larger data set the likelihood is that the segments will change and a few of your other existing records could even change segments.

If you’re talking about customer segmentation: I’ve used cluster analysis as a preliminary method of generating some high-level market segments, but then usually only as a basis for what eventually becomes a more qualitative profile of a market segment.

If this in connection with a research project: One could determine the location of any existing cluster centroids by running the function on the existing data set and then force the use of those centroid locations when running the new data. This would require some modification of the code.

• Elke

Hi,

The data I’m working with is indeed survey data within a research project. I’ve executed the original K-means clustering in SPSS. The coming months we will get new data sets that will have to be assigned to the original clusters. We work with a third party that has no (experience with) SPSS software, so I’m looking for a way to let them conduct the assignment of new respondents in Excel. Do you think this is a realistic plan?

• Simon Maizels

Hi Sheldon,

Thank you for your excellent work. Firstly I am not a statistician so i’m not sure if my question is valid. When I run this for many variables, say 7, is it still possible to get an XY value for each record’s final distance from its centroid? I’m not sure how to reasonably scatterchart the results otherwise.

Thanks again,
Simon.

• SheldonNeilson

Hi Simon. Yes, it would be possible to get a records distance from it’s centroid even if it has seven variables but that would be a single number. For a scatter chart, you would need both an X and a Y coordinate. Since, with 7 variables you essentially have 7 dimensions, a 2 dimensional scatter chart is not really a rational way to display the data?

Maybe you could try something like a grid with nested headings?

• Yves

Hi Sheldon,
Thank you for your Code. It run well under Excel 2003.
I am also trying to plot the results (data points, centroids) on a sheet (with later the bounding area of each cluster)
I have found a vba code which could be of some help.
But When runing it under Excel 2003 I have an error message for this variable msoElementPrimaryCategoryGridLinesMajor

Did you see how I could solve this bug? Or may be did you have some code to realize what I mean?
Best wishes
Yves

• SheldonNeilson

Hi, My guess is that charts in 2003 don’t have that property. Unfortunately I don’t have access to 2003 to check but you could just delete or that line “cht.SetElement (msoElementPrimaryCategoryGridLinesMajor)”. All it does is create those grey vertical lines on the plot area

• Pingback: clustering in excel()

• MVT

Hi Sheldon, great job.
I have data that cover a wide range of values​​, much of which 0 (voting percentages for one group per “departments”). I tried your program and it works very well… However, to obtain valid results (homogeneous data classes, I mean for example class [min value, max value[ population, eg. [0 – 0[ 30 , [0.1-10[ 5 [10 – 15[ 3 [15 – 33[ 2, etc.), I need to sort my data in ascending order ​​(column B), run the program and then to re-sort ascending column A.

Is it possible to integrate this kind of sort? I tried but I have a problem with “ranges” and “arrays” !

Thanks

• SheldonNeilson

Hi, it definitely is possible to sort your data at different stages of the script. It would look something like this:

Dim SortRange As Range
‘Set SortRange = THE TABLE TO SORT

Dim SortColumn As Range
‘Set SortColumn = THE COLUMN WITHIN THE TABLE ABOVE TO SORT ON

With ActiveSheet.Sort.SortFields
.Clear
End With

With ActiveSheet.Sort
.SetRange SortRange
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With

To set the ranges use something like:

With ActiveSheet
Set SortRange = .Range(.Cells(FIRST_ROW_NUMBER, FIRST_COLUMN_NUMBER), _
.Cells(LAST_ROW_NUMBER, LAST_COLUMN_NUMBER))
End With

I can help you with a more specific script if this doesn’t help.

• MVT

Thanks for your quick reply. I’ll try this and I’ll let you know.

I’ll try something with “no with” … may be too complicate !

• astha pareek

hi i am doing my reseach …i am using SPSS software…i have taken educational data nd i have applied this data on SPSS using k-clustering technique…but i am unable to understand the results …plz guide me how to understand the results

• Marie Joubert Wallis
• Sherry

Hi. I have 6 groups of data, I calculated distances between each two of these data, but how could I transform them into coordinates? Without coordinates, I don’tthink I can run this model.

• SheldonNeilson

Hi Sherry. I’m not sure I understand your question. Each record only needs one or more numerical attribute (2 if you are trying to get an x and y coordinate to plot on a scatter chart). Can you explain what you are trying to achieve and with what sort of data?

• Alex

This code is amazing!!!
I never knew how much power excel actually has!
Is there a method to display the maximum/minimum of each centroid? along with the original data set sorted rather than just displaying the cluster number?

• Pingback: maillot Marseille()

• SEM

Hey Sheldon,

Excellent Script. I tried running it with this dataset, using 3 clusters but notice that cluster 1 and cluster 3 are not what you would expect it to be. Items associated with cluster 1 should be with cluster 3. Any help with this appreciated, thanks.

x y
Item 1 10 41
Item 2 36 38
Item 3 6 20
Item 4 58 48
Item 5 17 12
Item 6 6 24
Item 7 18 25
Item 8 40 13
Item 9 36 33
Item 10 48 59
Item 11 51 60
Item 12 15 63
Item 13 45 32
Item 14 14 39
Item 15 41 61
Item 16 42 56
Item 17 34 22
Item 18 38 24
Item 19 7 32
Item 20 25 62
Item 21 55 27
Item 22 15 10
Item 23 52 53
Item 24 51 15
Item 25 6 62
Item 26 57 55
Item 27 56 47
Item 28 28 53
Item 29 27 10
Item 30 8 53

• SEM

Sorry please ignore, had to reset the Excel chart to pick the correct centroid values. Works perfectly. Thanks.

• Shmelky

Do you have k-means clustering using Pearson distance measure? Pearson distance measure is different than Euclidean in that it groups data that have similar data patterns as opposed to similar data scores.

• SheldonNeilson

Hi. Unfortunately this isn’t something that I have done as the need simply never arose. I’m 100% sure it’s possible though.

• Andrea

Hi, well done man. I have an answer: if i would start with not random centroids but determined by some heuristics how could i do? thank you very much

• Ram

Hi Neilson,

thank you for this excellent tool.

This is my setup – have 6 columns timeseries of data – 1 stock & 5 other indices
Date Stock return INDEX1 return INDEX2 return
12/1/2008 3% 2% 1%
12/2/2008 1% 2% 3%
…………….
………..etc , total 1250 rows

I want to do clustering to find out which index is most closest to my stock.

I am wondering if this k-means method is the right way to approach this? Basically I want a result that clusters say Index 1 & the stock together while the rest of the indices may be in other clusters.

Rgds,
Ram

• Hi Neilson,
Would it be possible to send me a excel workbook? I download the module into Excel 2010 and I’m getting some buggy errors. Also, lets say if I have a 50 stores with 7 attributes (size, sales, employees,etc) would this tool be able to handel it

• /u/inmateAle

This is really outstanding. Not only is the code robust and efficient, it’s well documented! Thanks so much for developing it, and then sharing it with the world. I just used your code to help a guy on /r/excel (that’s a Reddit Excel forum if you’re not familiar), and I was amazed how easy it was for me to adapt for my purposes. I’m definitely going to get a lot more leverage out of this in the future. Take care!

• Manika Tiwari

Hey, What I see here is two dimensional data macro. I need to run a macro for one dimension data. Can you please help me with that. Its very urgent.

• SheldonNeilson

Hi Manika,
This can be used for a one dimensional data set also. You can try it by excluding the second column of data when prompted to select your data in the example.

• Manika

Thank you Neilson ! I will try doing it.

• Manika

Hi Neilson,
Sorry for bothering you. Its working perfectly.Kudos to you!

• PUNIT MANGAL

Hi Neil,

This Macros seems to be very useful for my work. but i work for around 1MN database and it is everytime crashing.

can u pls help.
Punit

• SheldonNeilson

Hi Punit. Do you mean 1 million records? The memory usage would be extreme. I don’t think Excel would be the best platform for a dataset of that size.. I’m sure I could build something for you outside of Microsoft Office that could accommodate datasets of that size much more efficiently. Contact me privately if you’re interested.

Sheldon

• PUNIT MANGAL

hi Sheldon,
at a time I use around 500000 records. what would be the best solution to do this cluster hierarchical analysis. I have only excel and access as tools as of now

Regards
Punit

• SheldonNeilson

I was suggesting building some custom software to do the job but I’m testing it now and it’s really only hogging the one CPU. Memory usage is at 5.5GB but I have several other things going on at the same time.

Are you seeing any error messages? Is the structure of your file OK and are all the values numeric?

It’s painfully slow on my side but my machine’s managing it. It would be entirely possible to create a program to do this much more efficiently than in VBA, but if you want to stick to the excel solution it seems to work.. just very slowly.

• PUNIT MANGAL

Hi Sheldon,
The structure of my file is fine and having all numeric values. the problem is, if i run on 30K records then it given report in 10 seconds but if i uses it for 40K records it is just running for eternity. my current database is for 290K on which i want to run this macro. request your help to do so.

Also pls let me know if i split the file in 30K chunks and run this macro, will i get the different result or the same.

• Xuhua Wei

Hi Sheldon ,
Your work is great. It is very useful to me. I have a problem with these codes. I found it cannot work when k=2 or 3 in the following situation:

ID F1 F2 F3
M1 1 2 0
M2 1 2 0
M3 0 1 2
M4 0 1 2
M5 2 0 1
M6 2 0 1

I don’t know why. Could help me to check the reason? Thanks!

• SheldonNeilson

Hi. This particular algorithm will fail because all of your records add up to 3. The data is not unique enough for this method so you will either have to modify the code (it will fail near the beginning after the comment, “Initialize Initial Centroid Arrays”) or find another more appropriate method.

• Riaan Pienaar

Hi Sheldon,

I am a newbie when it comes to this type of modelling.
I want to try and use your code for clustering different customer GPS coordinates.

1.) Do you think it will be accurate?
2.) Is it possible to add a volume weighting to the clustering”?

Thank you very much for sharing your knowledge!

• SheldonNeilson

Hi Riaan. This is a perfect application for this algorithm. I’ve just tested it and it worked well. The resulting scatter chart even resembled the same coordinates plotted on a map.

Be sure to use decimal notation so that your coordinates are picked up as numbers. The algorithm handles negative values fine.

This code only does what it says on the box, but all things are possible. How exactly are you trying to weight your coordinates?

• Riaan Pienaar

Hi Sheldon.

The coordinates that I am currently inputting are in the following format:

-26.312727,28.13566

I have 360 customers with different volume going to every one of them.
I now have to decide where to put 7 depots to optimally service them.
According to my knowledge, you have to put the weighting in before you cluster them, as this will influence where the centroid will be.

Will it be possible to add a weight factor into the code to influence the position of the centroid to take volume into account?

• SheldonNeilson

Sure.. the weighting would be added before the clustering takes place. You can try it by adding a ‘weight’ column and simply selecting all 3 columns latitude, longitude & weight when running the clustering.

However what this will try to do is keep similar customers in the same group, so if two customers have similar sales volumes they’ll likely end up in the same group with the same centroid(depot), even if another depot is closer geographically but is for a higher/lower sales volume group.

If you want to “pull” your centroids towards high volume customers, another idea for doing this is could be to use something I did for triangulating mobile devices:

Customer = cellphone tower
Customer volume = cellphone tower signal strength
Depot = cellphone

If we pretended each customer is a cellphone tower and the depot is the device that we are trying to triangulate, we could use each customer’s sales volume as the cellphone tower’s signal strength. The triangulation method would naturally find the device nearer to towers with greater signal strength. http://www.neilson.co.za/mobile-network-geolocation-obtaining-the-cell-ids-the-signal-strength-of-surrounding-towers-from-a-gsm-modem/

This can only really be one step in the process of determining where to build your depots. This will give you a good idea of the required spread, or even cluster your depots more tightly in high density areas but sometimes the best answer wouldn’t be to build 2 depots in high density areas but rather to build one larger one. Or perhaps you’ll find you centroid placed in an inappropriate/unrealistic area when you consider other factors about the area in the real world like proximity to road/rail infrastructure, land availability/cost, crime etc…

• carl

Hi Sheldon…. great web site and info. I have a similar issue with Weight. I want to look at 900 districts , each with population stats, and want to understand how to cluster distribution points. I loved the downloadable Excel, cool, but it considers every district to be equal which its not ( i have Pop/Sq.km for districts) . Thanks !

• Serge

Hello Sheldon,

It works perfect! Thank you!

In my case the maximum number of records in cluster is limited by a maximum value.
I have 200 records, number of clusters is 20 and the maximum number of records in cluster is 40.

Thank you again,
Serge

This code is outstanding, very helpful and runs fast! I’m no expert on coding and I need your help, Neilson. I don’t have an x and y data, I have only one long column of data. Is there a way of altering the code to analyze the first 144 rows and cluster it into 2 groups then continuo to the second 144 cells, gives its centroids and so on, until the end of the column. I really need ur help for my internship project. Thank you

• HJP3

Sheldon,
This code is fantastic, and I see that it works for 1 and 2 variable data sets, but will it work for more than 2? The biggest problem I can see is the chart, as I am unsure if excel can plot in 3 dimensions much less more.

• SheldonNeilson

Hi, I just used this reply above: “Sure, you can use this on records with any number of attributes (rows with any number of columns). It becomes difficult to plot the results visually but the cluster assignments are still valid.”

• suriri

Hi Sheldon,

Can this be used for three dimensional data (x.y and z)?

• SheldonNeilson

Sure, you can use this on records with any number of attributes (rows with any number of columns). It becomes difficult to plot the results visually but the cluster assignments are still valid.

• João Almeida

Super cool stuff

• Pierre

Can (and how) the algorithm modified to K-medoid instead of K-mean?

• P

This has to be one of the best documented pieces of code I’ve ever seen. Great work and well done, sir.

• Sherman

Great code, Sheldon! Thumbs up. Works brilliantly for my challenge.

• tdlemmon

Thank you very much for remarkable work you proivded to excel community.
It’s a must for everyone needs to implement k-means in his worksheets.
Well documented code, great article, ease of use and quickness.
Really thankful for that.

• Prerit Kohli

Excellent! Thank you.

• Caterina P.

Hi Sheldon, thank you for your work. It has been so useful to me! I need some further informations. May you help me? How many columns of the input table can the code read? I have to analyze d-dimensional vectors, i.e. each observation has more than 3 attributes (like for example length, weight, height, energy etc.). Does the code and expecially the Euclidean distance calculation improved in the code read and consider all my “coordinates”, or is the code valid only for table with n-rows and 2 columns?

Thank you very much for your support!

Caterina

• Caterina P.

I have found your answer to another person in the forum who had the same doubts. Thank you very much!

• SheldonNeilson

🙂

• rahul

how can i cluster the students academic marks and applying euclidean distance by k means… plzzz help

• Carlos Eduardo Brantes

Great code!!! Thanks from Brazil!

can it be used to cluster a binary data? example i have data like :

Item A : 1 0 0 1
item B : 0 0 0 1
item C : 1 1 0 1
item D : 1 1 1 0
item E : 1 0 0 0

• SheldonNeilson

Hi. Not if you attempt to put each bit in a column. This particular algorithm is actually quite a basic one and will not work correctly due to a lack in the the uniqueness of the sum of the columns in that kind of data. You can convert your numbers to decimal first (excel function BIN2DEC) and run this on the base 10 result of that function per row. Each base 10 number representing a specific combination of bits would be unique

• SheldonNeilson

Though that would probably not give you the result you’re looking for. It will group them based on the numerical vicinity of the base 10 numbers to each other, not some sort of semantic relationship between “features”?

• Sahil Singhal

Hi Sheldon, the macro gives different results based on how i sort my data. If I sort by data based on the first attribute, it creates clusters almost the same way this column is sorted. If I sort the data based on a different attribute my clusters change,. Is this how it is supposed to work?

• brand46

Thanks Sheldon! Would this VBA work with Excel on Mac? Unfortunately, us Apple users get short-changed when it comes to most MS Office plug-ins so we rely on intrepid folks like yourself for our solutions. Also, is there a limit to the number of data points it can process? I tried on an old Windows machine running Windows 7 and Office 2010 and it couldn’t handle my 67 x 8890 matrix.

• brand46

A little light troubleshooting answered both questions. Yes, the macros work on my Mac running Office 2016. And the ‘type mismatch’ error was a result of null values in my matrix which I re-coded as ‘0’. Still waiting to see if the entire dataset will process but my machine is churning 6.34gb of RAM right now.

• Chinnu Gombey

can u tell me what to specify in specify number of clusters please??? output screen

• Max

Hi Sheldon ,
Your work is wonderful. There is no error with other cluster numbers but for “k=6”! My data range is 5 by 47 (46 records and 4 variables) and when I select 6 clusters there is a “Overflow” error. I need to run it for different K’s and then find the best number of clusters by Elbow Method, for example. Do you have any idea to resolve it?

• Samantha Smith

Hi! Thank you so much for this robust description and code. I am trying to set the initial centroids on my own. I am a little rusty on my VBA. Does anyone have any suggestions on how modify how it chooses the initial centroids? Thank you!

• Igor Udovenko

I second the positive comments. This is really excellent. Thanks for sharing

• Emilio Muñoz

Hi Sheldon, first thank you for sharing. Your code is amazing and really powerful. I’m trying to modify it adding a constraint. I need the same number of points in each cluster. Is there any simple way to do this starting from your solution?. Excuse my english and best regards from Spain.

• Daniel McCracken

This is great, though a couple of changes would make it substantially faster/more stable.

One is changing all of the “Integer” variables to “Long” to avoid overflow issues.

Another is not writing to/reading from the worksheet using loops like this:

For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d)
Next d

Each of those read/write operations has a performance cost. It’s a better practice to read the data into an array all at once, then loop through the array to populate your “Record” objects. You can do the same thing when you’re writing the results back to the worksheet: fill an array with the values from your “Record” objects, then paste the array all at once.

After making those changes, the procedure processed 1,000,000 records (using two attributes and three clusters) in ~90 seconds. For comparison, it takes ~200 seconds to process the same number of records if you only make the first change (switching the “Integer” declarations to “Long”).

• SheldonNeilson

Excellent feedback. Thanks Daniel!

• Florence

Hi

When I have tried to select more than 240 items to analyse, the following message appears ‘Error: overflow’. How can I increase the number of items to be analysed?

• SheldonNeilson

Hi Florence. Take a look at Daniel McCracken’s comments from a few days ago. Sounds like converting to larger variable types might help.

• Norbert Mirani

Outstanding VBA tool. What should I change to perform a cluster analysis with 50 items instead of 30?