k-Means Cluster Analysis in Microsoft Excel

What is k-Means Cluster Analysis?

Cluster analysis is a method for automatically grouping data into a smaller number of subsets or clusters so that the records grouped are most statistically similar to each other based on the attributes of the data compared.

In statistics and data miningk-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Given a set of observations (x1x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ nS = {S1S2, …, Sk} so as to minimize the within-cluster sum of squares.

\underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x_j \in S_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2

where μi is the mean of points in Si. -Wikipedia 

Scatter Chart Clusters and Centroids

Scatter chart with coloured clusters and their centroids displayed in red

k-Means cluster analysis achieves this by partitioning the data into the required number of clusters by grouping records so that the elucidean distance between the record’s dimensions and the clusters centroid (point with the average dimensions of the points in the cluster) are as small as possible.

The following is a macro I wrote in VBA for Microsoft Excel that performs k-Means Cluster Analysis on the table selected.

The k-Means Algorithm

The k-Means algorithm is an iteration of the following steps until stability is achieved i.e. the cluster assignments of individual records are no longer changing.

  1. Determine the coordinates of the centroids. (initially the centroids are random, unique points, thereafter the mean coordinates of the members of the cluster are assigned to the centroids).
  2. Determine the elucidean distance of each record to each centroid.
  3. Group records with their nearest centroid.

The Code

Firstly I have created a private type to represent our records and centroids and created two class level arrays to hold them as well as a class level variable to hold the table on which the analysis is being performed.

Private Type Records
    Dimension() As Double
    Distance() As Double
    Cluster As Integer
End Type

Dim Table As Range
Dim Record() As Records
Dim Centroid() As Records

User Interface

The following method,Run() can be used as a starting point and hooked to buttons etc.

Sub Run()
'Run k-Means
   If Not kMeansSelection Then
        Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error")
    End If
End Sub

Next, a method is created that prompts the user to select the table to be analysed and to input the desired number of clusters that the data should be grouped into. The function does not require any arguments and returns a boolean indicating whether or not any errors have been encountered.

Function kMeansSelection() As Boolean

'Get user table selection
   On Error Resume Next
    Set Table = Application.InputBox(Prompt:= _
                                     "Please select the range to analyse.", _
                                     title:="Specify Range", Type:=8)

    If Table Is Nothing Then Exit Function        'Cancelled

    'Check table dimensions
   If Table.Rows.Count < 4 Or Table.columns.Count < 2 Then
        Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns."
    End If

    'Get number of clusters
   Dim numClusters As Integer
    numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1)

    If Not numClusters > 0 Or numClusters = False Then
        Exit Function        'Cancelled
   End If
    If Err.Number = 0 Then
        If kMeans(Table, numClusters) Then
            outputClusters
        End If
    End If

kMeansSelection_Error:
    kMeansSelection = (Err.Number = 0)
End Function

If a table has been selected and a number of clusters defined appropriately, the kMeans (Table, numClusters) method is invoked with the Table and number of clusters as parameters.

If the kMeans (Table, numClusters) method executes without errors, a final method, outputClusters() is invoked which creates a new worksheet in the active workbook and outputs the results of the analysis.

Assigning Records to Clusters

This is where the actual analysis of the records takes place and cluster assignments are made.
First and foremost, the method is declared with Function kMeans(Table As Range, Clusters As Integer) As Boolean. the Function takes two parameters, the table being analysed as an Excel Range object and Clusters, an integer denoting the number of clusters to be created.

Function kMeans(Table As Range, Clusters As Integer) As Boolean
'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns)
'Clusters - Number of clusters to reduce records into.

    On Error Resume Next

    'Script Performance Variables
   Dim PassCounter As Integer

    'Initialize Data Arrays
   ReDim Record(2 To Table.Rows.Count)
    Dim r As Integer        'record
   Dim d As Integer        'dimension index
   Dim d2 As Integer        'dimension index
   Dim c As Integer        'centroid index
   Dim c2 As Integer        'centroid index
   Dim di As Integer        'distance

    Dim x As Double        'Variable Distance Placeholder
   Dim y As Double        'Variable Distance Placeholder

On error Resume Next is used to pass errors up to to the calling method, and a number of array index variables are declared. x and y are declared for later use in mathematical operations.

The first step is to size the Record() array to the number of rows in the table. (2 to Table.Rows.Count) is used as it is assumed (and required) that the first row of the table holds the column titles.

Then, for every record in the Record() array, the Record type’s Dimension() array is sized to the number of columns (again assuming that the first column holds the row titles) and the Distance() array is sized to the number of clusters. An internal loop then assigns the values of the columns in the row to the Dimension() array.

    For r = LBound(Record) To UBound(Record)
        'Initialize Dimension Value Arrays
       ReDim Record(r).Dimension(2 To Table.columns.Count)
        'Initialize Distance Arrays
       ReDim Record(r).Distance(1 To Clusters)
        For d = LBound(Record(r).Dimension) To UBound(Record(r).Dimension)
            Record(r).Dimension(d) = Table.Rows(r).Cells(d).Value
        Next d
    Next r

In much the same way, the initial centroids must be initialized. I have assigned the coordinates of the first few records as the initial centroids coordinates checking that each new centroid has unique coordinates. If not, the script simply moves on to the next record until a unique set of coordinates is found for the centroid.

Eludicean DistanceThe method used to calculate centroid uniqueness here is almost exactly the same as the method used later on to calculate the distance between individual records and the centroids. Here the centroids are checked for uniqueness by measuring their dimensions’ distance from 0.

    'Initialize Initial Centroid Arrays
   ReDim Centroid(1 To Clusters)
    Dim uniqueCentroid As Boolean

    For c = LBound(Centroid) To UBound(Centroid)
        'Initialize Centroid Dimension Depth
       ReDim Centroid(c).Dimension(2 To Table.columns.Count)

        'Initialize record index to next record
       r = LBound(Record) + c - 2

        Do        ' Loop to ensure new centroid is unique
           r = r + 1        'Increment record index throughout loop to find unique record to use as a centroid

            'Assign record dimensions to centroid
           For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
                Centroid(c).Dimension(d) = Record(r).Dimension(d)
            Next d

            uniqueCentroid = True

            For c2 = LBound(Centroid) To c - 1

                'Loop Through Record Dimensions and check if all are the same
               x = 0
                y = 0
                For d2 = LBound(Centroid(c).Dimension) To _
                    UBound(Centroid(c).Dimension)
                    x = x + Centroid(c).Dimension(d2) ^ 2
                    y = y + Centroid(c2).Dimension(d2) ^ 2
                Next d2

                uniqueCentroid = Not Sqr(x) = Sqr(y)
                If Not uniqueCentroid Then Exit For
            Next c2

        Loop Until uniqueCentroid

    Next c

The next step is to calculate each records distance from each centroid and assign the record to the nearest cluster.

  • Dim lowestDistance As Double – The lowestDistance variable holds the shortest distance measured between a record and centroid thus far for evaluation against subsequent measurements.
  • Dim lastCluster As Integer – The lastCluster variable holds the cluster a record is assigned to before any new assignments are made and is used to evaluate whether or not stability has been achieved.
  • Dim ClustersStable As Boolean – The cluster assignment and centroid re-calculation phases are repeated until ClustersStable = true.
    Dim lowestDistance As Double
    Dim lastCluster As Integer
    Dim ClustersStable As Boolean

    Do        'While Clusters are not Stable

        PassCounter = PassCounter + 1
        ClustersStable = True        'Until Proved otherwise

        'Loop Through Records
       For r = LBound(Record) To UBound(Record)

            lastCluster = Record(r).Cluster
            lowestDistance = 0        'Reset lowest distance

            'Loop through record distances to centroids
           For c = LBound(Centroid) To UBound(Centroid)

                '======================================================
               '           Calculate Elucidean Distance
               '======================================================
               ' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2)
               '------------------------------------------------------
               ' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2
               ' d(p,q) = X

                x = 0
                y = 0
                'Loop Through Record Dimensions
               For d = LBound(Record(r).Dimension) To _
                    UBound(Record(r).Dimension)
                    y = Record(r).Dimension(d) - Centroid(c).Dimension(d)
                    y = y ^ 2
                    x = x + y
                Next d

                x = Sqr(x)        'Get square root

                'If distance to centroid is lowest (or first pass) assign record to centroid cluster.
               If c = LBound(Centroid) Or x < lowestDistance Then
                    lowestDistance = x
                    'Assign distance to centroid to record
                   Record(r).Distance(c) = lowestDistance
                    'Assign record to centroid
                   Record(r).Cluster = c
                End If
            Next c

            'Only change if true
           If ClustersStable Then ClustersStable = Record(r).Cluster = lastCluster

        Next r

Once each record is assigned to a cluster, the centroids of the clusters are re-positioned to the mean coordinates of the cluster. After the centroids have moved, each records closest centroid is re-evaluated and the process is iterated until stability is achieved (i.e. cluster assignments are no longer changing).

 'Move Centroids to calculated cluster average
       For c = LBound(Centroid) To UBound(Centroid)        'For every cluster

            'Loop through cluster dimensions
           For d = LBound(Centroid(c).Dimension) To _
                UBound(Centroid(c).Dimension)

                Centroid(c).Cluster = 0        'Reset nunber of records in cluster
               Centroid(c).Dimension(d) = 0        'Reset centroid dimensions

                'Loop Through Records
               For r = LBound(Record) To UBound(Record)

                    'If Record is in Cluster then
                   If Record(r).Cluster = c Then
                        'Use to calculate avg dimension for records in cluster

                        'Add to number of records in cluster
                       Centroid(c).Cluster = Centroid(c).Cluster + 1
                        'Add record dimension to cluster dimension for later division
                       Centroid(c).Dimension(d) = Centroid(c).Dimension(d) + _
                                                   Record(r).Dimension(d)

                    End If

                Next r

                'Assign Average Dimension Distance
               Centroid(c).Dimension(d) = Centroid(c).Dimension(d) / _
                                           Centroid(c).Cluster
            Next d
        Next c

    Loop Until ClustersStable

    kMeans = (Err.Number = 0)
End Function

Displaying the Results

the outputClusters() method outputs the results in two tables. The first table contains each record name and the assigned cluster number, and the second contains the centroid coordinates.

Function outputClusters() As Boolean

    Dim c As Integer        'Centroid Index
   Dim r As Integer        'Row Index
   Dim d As Integer        'Dimension Index

    Dim oSheet As Worksheet
    On Error Resume Next

    Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook)

    'Loop Through Records
   Dim rowNumber As Integer
    rowNumber = 1

    'Output Headings
   With oSheet.Rows(rowNumber)
        With .Cells(1)
            .Value = "Row Title"
            .Font.Bold = True
            .HorizontalAlignment = xlCenter
        End With
        With .Cells(2)
            .Value = "Centroid"
            .Font.Bold = True
            .HorizontalAlignment = xlCenter
        End With
    End With

    'Print by Row
   rowNumber = rowNumber + 1        'Blank Row
   For r = LBound(Record) To UBound(Record)
        oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows(r).Cells(1).Value
        oSheet.Rows(rowNumber).Cells(2).Value = Record(r).Cluster
        rowNumber = rowNumber + 1
    Next r

    'Print Centroids - Headings
   rowNumber = rowNumber + 1
    For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension)
        With oSheet.Rows(rowNumber).Cells(d)
            .Value = Table.Rows(1).Cells(d).Value
            .Font.Bold = True
            .HorizontalAlignment = xlCenter
        End With
    Next d

    'Print Centroids
   rowNumber = rowNumber + 1
    For c = LBound(Centroid) To UBound(Centroid)
        With oSheet.Rows(rowNumber).Cells(1)
            .Value = "Centroid " & c
            .Font.Bold = True
        End With
        'Loop through cluster dimensions
       For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
            oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d)
        Next d
        rowNumber = rowNumber + 1
    Next c

    oSheet.columns.AutoFit        '//AutoFit columns to contents

outputClusters_Error:
    outputClusters = (Err.Number = 0)
End Function

It’s unlikely that this type of output will be of much use, but it serves to demonstrate the way in which the record cluster assignments or cluster records can be accessed in your own solutions.

The outputClusters() function makes use of another custom method: addWorksheet() which adds a worksheet to the specified/active workbook with the specified name. If a worksheet with the same name already exists, the outputClusters() function adds/increments a number appended to the worksheet name. The WorksheetExists() Function is also included in the following:

Function addWorksheet(Name As String, Optional Workbook As Workbook) As Worksheet
    On Error Resume Next
    '// If a Workbook wasn't specified, use the active workbook
   If Workbook Is Nothing Then Set Workbook = ActiveWorkbook
   
    Dim Num As Integer
    '// If a worksheet(s) exist with the same name, add/increment a number after the name
   While WorksheetExists(Name, Workbook)
        Num = Num + 1
        If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " ("))
        Name = Name & " (" & Num & ")"
    Wend
   
    '//Add a sheet to the workbook
   Set addWorksheet = Workbook.Worksheets.Add
   
    '//Name the sheet
   addWorksheet.Name = Name
End Function

Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean
    On Error Resume Next
    WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "")
    On Error GoTo 0
End Function

Downloads

Download the entire module (10.2 KB)

Download an example workbook (39.3 KB)

  • Mark Murphy

    I have been looking for a k-means application in VBA and after some searching could not find a drop in working script. I was about to start working on my own version when I ran across this site. After a little confusion about how to select the data the implementation ran perfectly!

    Thank you!

    Note:
    The confusion with the table selection was easy to address. Three columns need to be selected. The k-means as implemented ignores the first row assuming column headings. And it ignores the first column assuming the rows may have names.

  • Mogjibs

    Hello,

    It might sound very ‘basic’ question but I would love to know the following:
    once I have downloaded this macro, how could I use this it (launch it) in Excel?

    Any comment will be highly appreciated!

    Thanks,
    M

    • http://www.neilson.co.za Sheldon Neilson

      Hi Mogjibs.

      You’ll need to import it into an excel VBA project. First open the Excel workbook that you are working with then press Alt + F11 to open the VBA IDE.

      Press Ctrl + R to ensure that the Project Explorer window is displayed on the left. In the Project Explorer you’ll see your Workbook listed as a VBA project node.

      Expand the node to reveal the three sub-level categories, “MS Excel Objects”, “Forms” and “Modules”.

      Right click on the “Modules” node and select “Import File”. Then select the downloaded module to import it.

      Open the new Module named “Cluster Analysis and find the kMeansSelection function near the beginning.

      Change the line that reads “Function kMeansSelection() As Boolean” to “Sub kMeansSelection()”. Ensure that the last line of the function is also changed to read “End Sub” instead of “End Function” and any lines within the function that read “Exit function” are changed to read “Exit Sub”.

      Return to your Excel workbook and press Alt + F8 to view a list of available macros, then select the “kMeansSelection” macro and run it.

      If this does not work for you let me know and I’ll email you an example in an Excel workbook.

    • Anonymous

      Hi Mogjibs.

      You’ll need to import it into an excel VBA project. First open the Excel workbook that you are working with then press Alt + F11 to open the VBA IDE.

      Press Ctrl + R to ensure that the Project Explorer window is displayed on the left. In the Project Explorer you’ll see your Workbook listed as a VBA project node.

      Expand the node to reveal the three sub-level categories, “MS Excel Objects”, “Forms” and “Modules”.

      Right click on the “Modules” node and select “Import File”. Then select the downloaded module to import it.

      Open the new Module named “Cluster Analysis and find the kMeansSelection function near the beginning.

      Change the line that reads “Function kMeansSelection() As Boolean” to “Sub kMeansSelection()”. Ensure that the last line of the function is also changed to read “End Sub” instead of “End Function” and any lines within the function that read “Exit function” are changed to read “Exit Sub”.

      Return to your Excel workbook and press Alt + F8 to view a list of available macros, then select the “kMeansSelection” macro and run it.

      If this does not work for you let me know and I’ll email you an example in an Excel workbook.

      • Akashdeep Khera

        Hello Sheldon, I came across your code while looking for clustering in excel. Can you send me the example workbook at akashdeep.khera@gmail.com? I am having issues with running this code.

        Thanks,
        Akash

        • Anonymous

          It’s on its way. I’ve posted a link above also:)

          • Elddy hosea

            Sheldon , I need your help. The link had the files in XML form which I have no idea how to understand them.. I also have one dimension data. Can you possibly send me a basic Excel Workbook example.. Please Help email me elderkm2012@gmail.com

          • SheldonNeilson

            Hi. It sounds like the file has been uncompressed. Modern Excel files are essentially “zip” files containing a number of xml files. Which version of Excel are you using? You’ll need 2007 or newer for that particular file but I can make one for 2003 if you like?

          • Elddyhosea

            First I’d like to say Thank you. I have 2007, I got it to work for me.. How can I make a small donation for your work

          • SheldonNeilson

            That won’t be necessary. Glad it helped you:)

          • jai

            K means clustering worked very well Thanks a lot for creating it.
            Is there any other clustering technique available in excel created by you?

      • Anupam

        Der Neilson,

        Could you please email an example in excel workbook on shrivastava4u@hotmail.com

        regards
        anupam

      • antonio

        Hello
        SheldonNeilson, can you send me a worksheet with code, all the macro and sheets and graphs? I’m not able to use your files…

  • U-Aftab

    Hi Neilson, I really like your solution. I carried out some basic tests and it ran very robustly. However, when I test it for 14+ clusters, the Excel application seems to hang indefinitely :S .. Is there some limitation I need to look out for or add some condition?

    • Anonymous

      Hi there. Without having much to go on it sounds like a memory issue.. how big is your data set?

      • U-Aftab

        I am testing for 19 rows and 5 columns for cluster size = 13. Anything 13 works perfect and in no time :).

        • U-Aftab

          Anything below**

          • SheldonNeilson

            Hey U-Aftab. Sorry for the delayed reply.
            I just tried this with 30 records with 10 randomly generated attributes, requesting 20 clusters and it worked perfectly.. I’d be interested to see the spreadsheet you’re working with?

  • Mbonillacr

    Nice Job!!!

  • Alex

    Hi Nelson, thank you very much for creating this. What if I only have 1 dimension of data? For example I want to divide salaries into 5 clusters? Do I make the y dimension column of the table to be “1″ for all items so the euclidean distance is limited to the distance on a line?

    • SheldonNeilson

      Hi Alex. You can go ahead and select just the one dimension.

  • Mrscottsunderland

    This is outstanding! Very, Very good work! Thank you very much…. now just figuring out how best to display my results!

  • miles

    This is impressive… did they add cluster analysis to data analysis to the latest excel? Sheldon i recall spss had weightings as an option? do you think these is something you could take a look at?

    gift horse in the mouth..I know! thank sheldon

  • Patrick

    Hi Sheldon – great tool, thanks. After playing around with it for a bit I have found that it doesn’t seem to work properly if you request, say, 300 clusters (I have 325 rows). Is there a limit to the number of clusters that should be requested?

  • Peter

    Hi Sheldon – Great tool. Works fine on Mac OS 10.5.8 with Excel for Mac 2011. Had memory problems initially but they resolved when I updated my Excel software.

  • mansoorsa

    Hi Sheldon. Great work you have done. I have a situation, its sales database 200K rows, 47 columns. It has information such as branch name, branch region, customer occupation, customer family size, customer income, customer city, customer region, brand, financial terms. Unique key is Invoice Number. I need to do customer segmentation. How can I do cluster analysis using XL you have prepared?

    • SheldonNeilson

      Hi there. As this method uses numeric properties, I’d suggest trying to attach numeric values to your categorical properties. e.g. City A = 1 & City B = 2 (try to tie these values to something concrete like geographical distance). This probably isn’t the best method for segmenting your customers, but having come from a market research background I know that the best scientific methods will inevitably be confounded by somebody’s opinion anyway;)

  • Elke

    Hi Sheldon,

    I’m looking for a way to assign new respondents to existing clusters but I have no expertise in programming in Excel.

    Do you think your module is a good starting point for doing this (the predefined clusters are made up by 13 variables) and maybe needs only a few alterations?

    Thanx!

    Elke

    • SheldonNeilson

      Hi Elke.

      I’m assuming, by your use of the word “respondents” that you’re working with survey data. If the original segmentation was done for you by a research company, they should have provided you with the segmentation criteria that they used to assign respondents to particular segments?

      e.g. if a respondent is Male, between the ages of 30 and 50 and buys low-fat milk he belongs to Segment 1.

      If you attempt to use this method on your new, larger data set the likelihood is that the segments will change and a few of your other existing records could even change segments.

      If you’re talking about customer segmentation: I’ve used cluster analysis as a preliminary method of generating some high-level market segments, but then usually only as a basis for what eventually becomes a more qualitative profile of a market segment.

      If this in connection with a research project: One could determine the location of any existing cluster centroids by running the function on the existing data set and then force the use of those centroid locations when running the new data. This would require some modification of the code.

      • Elke

        Hi,

        The data I’m working with is indeed survey data within a research project. I’ve executed the original K-means clustering in SPSS. The coming months we will get new data sets that will have to be assigned to the original clusters. We work with a third party that has no (experience with) SPSS software, so I’m looking for a way to let them conduct the assignment of new respondents in Excel. Do you think this is a realistic plan?

  • Simon Maizels

    Hi Sheldon,

    Thank you for your excellent work. Firstly I am not a statistician so i’m not sure if my question is valid. When I run this for many variables, say 7, is it still possible to get an XY value for each record’s final distance from its centroid? I’m not sure how to reasonably scatterchart the results otherwise.

    Thanks again,
    Simon.

    • SheldonNeilson

      Hi Simon. Yes, it would be possible to get a records distance from it’s centroid even if it has seven variables but that would be a single number. For a scatter chart, you would need both an X and a Y coordinate. Since, with 7 variables you essentially have 7 dimensions, a 2 dimensional scatter chart is not really a rational way to display the data?

      Maybe you could try something like a grid with nested headings?

  • Yves

    Hi Sheldon,
    Thank you for your Code. It run well under Excel 2003.
    I am also trying to plot the results (data points, centroids) on a sheet (with later the bounding area of each cluster)
    On this link : http://social.msdn.microsoft.com/Forums/pl/exceldev/thread/0da876c2-ec3a-4750-8ebe-d88d7de124ef
    I have found a vba code which could be of some help.
    But When runing it under Excel 2003 I have an error message for this variable msoElementPrimaryCategoryGridLinesMajor

    Did you see how I could solve this bug? Or may be did you have some code to realize what I mean?
    Best wishes
    Yves

    • SheldonNeilson

      Hi, My guess is that charts in 2003 don’t have that property. Unfortunately I don’t have access to 2003 to check but you could just delete or that line “cht.SetElement (msoElementPrimaryCategoryGridLinesMajor)”. All it does is create those grey vertical lines on the plot area

  • Pingback: clustering in excel

  • MVT

    Hi Sheldon, great job.
    I have data that cover a wide range of values​​, much of which 0 (voting percentages for one group per “departments”). I tried your program and it works very well… However, to obtain valid results (homogeneous data classes, I mean for example class [min value, max value[ population, eg. [0 – 0[ 30 , [0.1-10[ 5 [10 – 15[ 3 [15 – 33[ 2, etc.), I need to sort my data in ascending order ​​(column B), run the program and then to re-sort ascending column A.

    Is it possible to integrate this kind of sort? I tried but I have a problem with “ranges” and “arrays” !

    Thanks

    • SheldonNeilson

      Hi, it definitely is possible to sort your data at different stages of the script. It would look something like this:

      Dim SortRange As Range
      ‘Set SortRange = THE TABLE TO SORT

      Dim SortColumn As Range
      ‘Set SortColumn = THE COLUMN WITHIN THE TABLE ABOVE TO SORT ON

      With ActiveSheet.Sort.SortFields
      .Clear
      .Add Key:=SortColumn, SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
      End With

      With ActiveSheet.Sort
      .SetRange SortRange
      .Header = xlYes
      .MatchCase = False
      .Orientation = xlTopToBottom
      .SortMethod = xlPinYin
      .Apply
      End With

      To set the ranges use something like:

      With ActiveSheet
      Set SortRange = .Range(.Cells(FIRST_ROW_NUMBER, FIRST_COLUMN_NUMBER), _
      .Cells(LAST_ROW_NUMBER, LAST_COLUMN_NUMBER))
      End With

      I can help you with a more specific script if this doesn’t help.

      • MVT

        Thanks for your quick reply. I’ll try this and I’ll let you know.

        I’ll try something with “no with” … may be too complicate !

  • astha pareek

    hi i am doing my reseach …i am using SPSS software…i have taken educational data nd i have applied this data on SPSS using k-clustering technique…but i am unable to understand the results …plz guide me how to understand the results

  • Sherry

    Hi. I have 6 groups of data, I calculated distances between each two of these data, but how could I transform them into coordinates? Without coordinates, I don’tthink I can run this model.

    • SheldonNeilson

      Hi Sherry. I’m not sure I understand your question. Each record only needs one or more numerical attribute (2 if you are trying to get an x and y coordinate to plot on a scatter chart). Can you explain what you are trying to achieve and with what sort of data?

  • Alex

    This code is amazing!!!
    I never knew how much power excel actually has!
    Is there a method to display the maximum/minimum of each centroid? along with the original data set sorted rather than just displaying the cluster number?

  • Pingback: maillot Marseille

  • SEM

    Hey Sheldon,

    Excellent Script. I tried running it with this dataset, using 3 clusters but notice that cluster 1 and cluster 3 are not what you would expect it to be. Items associated with cluster 1 should be with cluster 3. Any help with this appreciated, thanks.

    x y
    Item 1 10 41
    Item 2 36 38
    Item 3 6 20
    Item 4 58 48
    Item 5 17 12
    Item 6 6 24
    Item 7 18 25
    Item 8 40 13
    Item 9 36 33
    Item 10 48 59
    Item 11 51 60
    Item 12 15 63
    Item 13 45 32
    Item 14 14 39
    Item 15 41 61
    Item 16 42 56
    Item 17 34 22
    Item 18 38 24
    Item 19 7 32
    Item 20 25 62
    Item 21 55 27
    Item 22 15 10
    Item 23 52 53
    Item 24 51 15
    Item 25 6 62
    Item 26 57 55
    Item 27 56 47
    Item 28 28 53
    Item 29 27 10
    Item 30 8 53

    • SEM

      Sorry please ignore, had to reset the Excel chart to pick the correct centroid values. Works perfectly. Thanks.

  • Shmelky

    Do you have k-means clustering using Pearson distance measure? Pearson distance measure is different than Euclidean in that it groups data that have similar data patterns as opposed to similar data scores.

    • SheldonNeilson

      Hi. Unfortunately this isn’t something that I have done as the need simply never arose. I’m 100% sure it’s possible though.

  • Andrea

    Hi, well done man. I have an answer: if i would start with not random centroids but determined by some heuristics how could i do? thank you very much

  • Ram

    Hi Neilson,

    thank you for this excellent tool.

    This is my setup – have 6 columns timeseries of data – 1 stock & 5 other indices
    Date Stock return INDEX1 return INDEX2 return
    12/1/2008 3% 2% 1%
    12/2/2008 1% 2% 3%
    …………….
    ………..etc , total 1250 rows

    I want to do clustering to find out which index is most closest to my stock.

    I am wondering if this k-means method is the right way to approach this? Basically I want a result that clusters say Index 1 & the stock together while the rest of the indices may be in other clusters.

    Your advice is much appreciated!

    Rgds,
    Ram

  • http://batman-news.com Mike T Weiss

    Hi Neilson,
    Would it be possible to send me a excel workbook? I download the module into Excel 2010 and I’m getting some buggy errors. Also, lets say if I have a 50 stores with 7 attributes (size, sales, employees,etc) would this tool be able to handel it

  • /u/inmateAle

    This is really outstanding. Not only is the code robust and efficient, it’s well documented! Thanks so much for developing it, and then sharing it with the world. I just used your code to help a guy on /r/excel (that’s a Reddit Excel forum if you’re not familiar), and I was amazed how easy it was for me to adapt for my purposes. I’m definitely going to get a lot more leverage out of this in the future. Take care!

  • Manika Tiwari

    Hey, What I see here is two dimensional data macro. I need to run a macro for one dimension data. Can you please help me with that. Its very urgent.

    • SheldonNeilson

      Hi Manika,
      This can be used for a one dimensional data set also. You can try it by excluding the second column of data when prompted to select your data in the example.

      • Manika

        Thank you Neilson ! I will try doing it.

      • Manika

        Hi Neilson,
        Sorry for bothering you. Its working perfectly.Kudos to you!

  • PUNIT MANGAL

    Hi Neil,

    This Macros seems to be very useful for my work. but i work for around 1MN database and it is everytime crashing.

    can u pls help.
    Punit

    • SheldonNeilson

      Hi Punit. Do you mean 1 million records? The memory usage would be extreme. I don’t think Excel would be the best platform for a dataset of that size.. I’m sure I could build something for you outside of Microsoft Office that could accommodate datasets of that size much more efficiently. Contact me privately if you’re interested.

      Sheldon

      • PUNIT MANGAL

        hi Sheldon,
        at a time I use around 500000 records. what would be the best solution to do this cluster hierarchical analysis. I have only excel and access as tools as of now

        Regards
        Punit

        • SheldonNeilson

          I was suggesting building some custom software to do the job but I’m testing it now and it’s really only hogging the one CPU. Memory usage is at 5.5GB but I have several other things going on at the same time.

          Are you seeing any error messages? Is the structure of your file OK and are all the values numeric?

          It’s painfully slow on my side but my machine’s managing it. It would be entirely possible to create a program to do this much more efficiently than in VBA, but if you want to stick to the excel solution it seems to work.. just very slowly.

          • PUNIT MANGAL

            Hi Sheldon,
            The structure of my file is fine and having all numeric values. the problem is, if i run on 30K records then it given report in 10 seconds but if i uses it for 40K records it is just running for eternity. my current database is for 290K on which i want to run this macro. request your help to do so.

            Also pls let me know if i split the file in 30K chunks and run this macro, will i get the different result or the same.

  • Xuhua Wei

    Hi Sheldon ,
    Your work is great. It is very useful to me. I have a problem with these codes. I found it cannot work when k=2 or 3 in the following situation:

    ID F1 F2 F3
    M1 1 2 0
    M2 1 2 0
    M3 0 1 2
    M4 0 1 2
    M5 2 0 1
    M6 2 0 1

    I don’t know why. Could help me to check the reason? Thanks!

    • SheldonNeilson

      Hi. This particular algorithm will fail because all of your records add up to 3. The data is not unique enough for this method so you will either have to modify the code (it will fail near the beginning after the comment, “Initialize Initial Centroid Arrays”) or find another more appropriate method.

  • Riaan Pienaar

    Hi Sheldon,

    I am a newbie when it comes to this type of modelling.
    I want to try and use your code for clustering different customer GPS coordinates.

    1.) Do you think it will be accurate?
    2.) Is it possible to add a volume weighting to the clustering”?

    Thank you very much for sharing your knowledge!

    • SheldonNeilson

      Hi Riaan. This is a perfect application for this algorithm. I’ve just tested it and it worked well. The resulting scatter chart even resembled the same coordinates plotted on a map.

      Be sure to use decimal notation so that your coordinates are picked up as numbers. The algorithm handles negative values fine.

      This code only does what it says on the box, but all things are possible. How exactly are you trying to weight your coordinates?

      • Riaan Pienaar

        Hi Sheldon.

        Thank you for your response.

        The coordinates that I am currently inputting are in the following format:

        -26.312727,28.13566

        I have 360 customers with different volume going to every one of them.
        I now have to decide where to put 7 depots to optimally service them.
        According to my knowledge, you have to put the weighting in before you cluster them, as this will influence where the centroid will be.

        Will it be possible to add a weight factor into the code to influence the position of the centroid to take volume into account?

        • SheldonNeilson

          Sure.. the weighting would be added before the clustering takes place. You can try it by adding a ‘weight’ column and simply selecting all 3 columns latitude, longitude & weight when running the clustering.

          However what this will try to do is keep similar customers in the same group, so if two customers have similar sales volumes they’ll likely end up in the same group with the same centroid(depot), even if another depot is closer geographically but is for a higher/lower sales volume group.

          If you want to “pull” your centroids towards high volume customers, another idea for doing this is could be to use something I did for triangulating mobile devices:

          Customer = cellphone tower
          Customer volume = cellphone tower signal strength
          Depot = cellphone

          If we pretended each customer is a cellphone tower and the depot is the device that we are trying to triangulate, we could use each customer’s sales volume as the cellphone tower’s signal strength. The triangulation method would naturally find the device nearer to towers with greater signal strength. http://www.neilson.co.za/mobile-network-geolocation-obtaining-the-cell-ids-the-signal-strength-of-surrounding-towers-from-a-gsm-modem/

          This can only really be one step in the process of determining where to build your depots. This will give you a good idea of the required spread, or even cluster your depots more tightly in high density areas but sometimes the best answer wouldn’t be to build 2 depots in high density areas but rather to build one larger one. Or perhaps you’ll find you centroid placed in an inappropriate/unrealistic area when you consider other factors about the area in the real world like proximity to road/rail infrastructure, land availability/cost, crime etc…

  • Serge

    Hello Sheldon,

    It works perfect! Thank you!

    In my case the maximum number of records in cluster is limited by a maximum value.
    I have 200 records, number of clusters is 20 and the maximum number of records in cluster is 40.

    Could you please advice how it can be done?

    Thank you again,
    Serge

  • Fahad Almotairi

    This code is outstanding, very helpful and runs fast! I’m no expert on coding and I need your help, Neilson. I don’t have an x and y data, I have only one long column of data. Is there a way of altering the code to analyze the first 144 rows and cluster it into 2 groups then continuo to the second 144 cells, gives its centroids and so on, until the end of the column. I really need ur help for my internship project. Thank you