Welcome

This is part 2 of the 2 part course from CDRC on the Internet User Classification (IUC) and K-Means Clustering. The video in this part introduces K-Means clustering, and the practical session shows you how create your own K-Means cluster in R.

After completing this material, you will:



Part 2: K-Means Clustering

We are going to replicate the geodemographic classification for internet access and use.

Downloading Classification Source Data

To create an index, we are going to use some source files I have put together. These are for each MSOA, and consist of:

Variable Definition
GEO_CODE MSOA Code
age_total, age0to4pc, age5to14pc, age16to24pc, age25to44pc, age45to64pc, age75pluspc percentage of people in various age groups
NSSEC.Total, HigherManagerial1pc, `LowerManagerial2pc, Intermediate3pm, SmallEmployers4pc, SupervisoryTechnical5pc, SemiRoutine6pc, Routine7pc, LongTermUnemployed8pc percentage of people in different NS-SEC groups
Average.download.speed..Mbps., Superfast.availability, Unable.to.receive.decent.broadband, Receiving.under.2.Mbps, Receiving.under.10.Mbps, Receiving.over.30.Mbps measures of broadband use and availability

These data have all been processed to be at MSOA level.

  • Download the file internet-input-data.csv from the website.

  • Have a look at the file in Excel.

Loading Source data in R

  • Start a new Script in RStudio.

  • Set your working directory.

  • Use this code to read in the file:

data <- read.csv("internet-input-data.csv")
  • Use head() to check what the data are:
head(data)
   GEO_CODE age_total  age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc
1 E02000001      7375 0.03200000 0.03877966  0.09613559   0.4069153   0.2725424
2 E02000002      6775 0.09269373 0.12162362  0.11321033   0.2801476   0.1856827
3 E02000003     10045 0.08292683 0.10154306  0.11796914   0.3060229   0.2248880
4 E02000004      6182 0.05904238 0.10239405  0.13879004   0.2544484   0.2495956
5 E02000005      8562 0.09296893 0.11936463  0.11948143   0.2994627   0.2136183
6 E02000005      8562 0.09296893 0.11936463  0.11948143   0.2994627   0.2136183
  age75pluspc NSSEC.Total HigherManagerial1pc LowerManagerial2pc
1  0.06074576        5816          0.38462861          0.3390646
2  0.09800738        3926          0.05680082          0.1711666
3  0.06460926        6483          0.08175228          0.2076199
4  0.08864445        4041          0.06780500          0.2063846
5  0.05010512        5368          0.04936662          0.1663562
6  0.05010512        5368          0.04936662          0.1663562
  Intermediate3pm SmallEmployers4pc SupervisoryTechnical5pc SemiRoutine6pc
1      0.07806052        0.06464924              0.02596286     0.04401651
2      0.14314824        0.11462048              0.06444218     0.20529801
3      0.18247725        0.11059695              0.07064631     0.15502082
4      0.19079436        0.10541945              0.08364266     0.15070527
5      0.15648286        0.10413562              0.08736960     0.19243666
6      0.15648286        0.10413562              0.08736960     0.19243666
  Routine7pc LongTermUnemployed8pc Average.download.speed..Mbps.
1 0.02665062            0.03696699                          24.4
2 0.13015792            0.11436577                          77.7
3 0.09656023            0.09532624                          74.3
4 0.12150458            0.07374412                          81.1
5 0.14456036            0.09929210                          70.7
6 0.14456036            0.09929210                          70.7
  Superfast.availability Unable.to.receive.decent.broadband
1                  0.544                              0.002
2                  0.989                              0.010
3                  0.990                              0.002
4                  0.992                              0.008
5                  0.970                              0.030
6                  0.970                              0.030
  Receiving.under.2.Mbps Receiving.under.10.Mbps Receiving.over.30.Mbps
1                  0.013                   0.249                  0.203
2                  0.020                   0.159                  0.798
3                  0.008                   0.164                  0.784
4                  0.026                   0.162                  0.799
5                  0.021                   0.223                  0.737
6                  0.021                   0.223                  0.737
  • Is this the data we expect to see?

  • Use View() to look at the data.

  • Is this what you expect?

There will be character chr, integer int and numeric num values in this data frame. Make sure you can identify which is which, and that you know what the differences are.

  • How are the data distributed? (try hist()).

Setup

We need some libraries for this work. Remember to install them if you haven’t got them installed already.

#load libraries (and install if needed)
  library(reshape)
  library(ggplot2)

Extract data for Clustering

The K-means cluster function will cluster all of the data in the table, so we need to remove everything we don’t want clusters. Use colnames(data) to get the column index numbers, and then extract the columns we want to use in the K-Means Clustering Process

#separate out data to be clustered
  colnames(data)
 [1] "GEO_CODE"                           "age_total"                         
 [3] "age0to4pc"                          "age5to14pc"                        
 [5] "age16to24pc"                        "age25to44pc"                       
 [7] "age45to64pc"                        "age75pluspc"                       
 [9] "NSSEC.Total"                        "HigherManagerial1pc"               
[11] "LowerManagerial2pc"                 "Intermediate3pm"                   
[13] "SmallEmployers4pc"                  "SupervisoryTechnical5pc"           
[15] "SemiRoutine6pc"                     "Routine7pc"                        
[17] "LongTermUnemployed8pc"              "Average.download.speed..Mbps."     
[19] "Superfast.availability"             "Unable.to.receive.decent.broadband"
[21] "Receiving.under.2.Mbps"             "Receiving.under.10.Mbps"           
[23] "Receiving.over.30.Mbps"            
  cluster.data <- data[, c(3:8,10:17,18:20)]
  head(cluster.data)
   age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
1 0.03200000 0.03877966  0.09613559   0.4069153   0.2725424  0.06074576
2 0.09269373 0.12162362  0.11321033   0.2801476   0.1856827  0.09800738
3 0.08292683 0.10154306  0.11796914   0.3060229   0.2248880  0.06460926
4 0.05904238 0.10239405  0.13879004   0.2544484   0.2495956  0.08864445
5 0.09296893 0.11936463  0.11948143   0.2994627   0.2136183  0.05010512
6 0.09296893 0.11936463  0.11948143   0.2994627   0.2136183  0.05010512
  HigherManagerial1pc LowerManagerial2pc Intermediate3pm SmallEmployers4pc
1          0.38462861          0.3390646      0.07806052        0.06464924
2          0.05680082          0.1711666      0.14314824        0.11462048
3          0.08175228          0.2076199      0.18247725        0.11059695
4          0.06780500          0.2063846      0.19079436        0.10541945
5          0.04936662          0.1663562      0.15648286        0.10413562
6          0.04936662          0.1663562      0.15648286        0.10413562
  SupervisoryTechnical5pc SemiRoutine6pc Routine7pc LongTermUnemployed8pc
1              0.02596286     0.04401651 0.02665062            0.03696699
2              0.06444218     0.20529801 0.13015792            0.11436577
3              0.07064631     0.15502082 0.09656023            0.09532624
4              0.08364266     0.15070527 0.12150458            0.07374412
5              0.08736960     0.19243666 0.14456036            0.09929210
6              0.08736960     0.19243666 0.14456036            0.09929210
  Average.download.speed..Mbps. Superfast.availability
1                          24.4                  0.544
2                          77.7                  0.989
3                          74.3                  0.990
4                          81.1                  0.992
5                          70.7                  0.970
6                          70.7                  0.970
  Unable.to.receive.decent.broadband
1                              0.002
2                              0.010
3                              0.002
4                              0.008
5                              0.030
6                              0.030

We also need to rescale the data so everything is on a similar scale, as the speed data (Mbps) is very different to the other data.

 cluster.data <- scale(cluster.data) 

Elbow Plot

Our first step in the analysis is to create an elbow plot, to see how many clusters we want to specify for the K-Means analysis.

We run the K-Means analysis for 1, 2, 3 and so on up to 15 clusters, and the calculate and plot the Within Sum of Squares (wss) for each one.

#Determine number of clusters required by elbow plot
  wss <- (nrow(cluster.data)-1)*sum(apply(cluster.data,2,var))
  for (i in 2:15) wss[i] <- sum(kmeans(cluster.data,
                                       centers=i)$withinss)
Warning: did not converge in 10 iterations
  plot(1:15, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares", main = "K-means: distance from mean by cluster amount", cex.axis = 0.90, lwd = 2)
  abline(v=c(1:15), lty=1.5, lwd=0.5, col=336) 

Based on the elbow plot, we need to choose the number of clusters. Based on this example above, we are going to use 7 clusters.

The interpretation of an elbow plot can be quite subjective, and in this example, 8 clusters could be easily justified. You would need to try with 8 as well as 7, and see what output you get.

K-Means Cluster Analysis

We are going to run our cluster analysis 10 times, because there is an element of randomness within the clustering, and we want to make sure we get the best clustering output. This will run the clustering code 10 times.

#K-Means (7 clusters with 10 iterations)
  clusters <- list()
  fit <- NA
  for (i in 1:10){
    print(paste("starting run", i, sep=" "))
    class.7 <- kmeans(x=cluster.data, centers=7, iter.max=1000000, nstart=1)
    fit[i] <- class.7$tot.withinss
    if (fit[i] < min(fit[1:(i-1)])){
      clusters <- class.7}
  }
[1] "starting run 1"
[1] "starting run 2"
[1] "starting run 3"
[1] "starting run 4"
[1] "starting run 5"
[1] "starting run 6"
[1] "starting run 7"
[1] "starting run 8"
[1] "starting run 9"
[1] "starting run 10"

We then have a bit of post processing to extract some useful summary data for each cluster:

#process and show cluster size
  kfit5 <- clusters
  kfit5$size
[1]  662  849 2666  228 1905  997 2120
#extract m values
  k5_mvalues<- aggregate(cluster.data,by=list(kfit5$cluster),FUN=mean)
#process cluster data to extract
  k5_mvalues_N <- as.data.frame(k5_mvalues[,])
  View(k5_mvalues_N)
  write.csv(k5_mvalues_N, "cluster-means.csv")

This (k5_mvalues_N) is the mean values for each of the variables for each cluster. We have also saved this as a CSV file which we can open in Excel. I have opened this in Excel, split up the data into the three groups (age, internet and NS-SEC) and then applied Conditional Formatting to each group, to pick out the highest and lowest values. Remember, your values may be different

Here we can see that only cluster 1 shows a clear pattern with Internet usage / infrastructure, and clusters 3 and 5 are driven primarily by age and clusters 1, 3 and 6 are driven by NS-SEC.

We could develop our analysis by looking at this summary, and removing certain variables, and re-running the cluster analysis.

We can also plot the mean variation within clusters within R:

  library(ggplot2)
  #melt data for graph
  k5df <- melt(k5_mvalues_N ,  id = "Group.1", variable_name = "Variables")
  View(k5df)
  #graph data  
  ggmeans <- ggplot(k5df, aes(Group.1,value)) + geom_line(aes(colour = Variables)) + 
  scale_x_continuous(breaks=seq(1, 7, by=1)) + theme(text = element_text(size=20))
  ggmeans + ggtitle("Mean variation within clusters")

There is a lot of data there, so it might be easier just to look at one of the three groups. You can do this by dropping the excess rows from k5_mvalues_N and rerunning the graph code above.

Exporting the Data

Finally, we need to append the cluster allocation to our original MSOA data:

  #Append cluster-MSOAs
  MSOA_data_centriods <- data
  MSOA_data_centriods$cluster <- kfit5$cluster
  write.csv(MSOA_data_centriods, file = "20201013-MSOA.csv")

When interpreting the clusters, often mapping them can be really helpful. This is what the data looked like when I joined my version in QGIS: