This is part 2 of the two-part CDRC course on the Internet User Classification (IUC) and K-Means clustering. The video in this part introduces K-Means clustering, and the practical session shows you how to create your own K-Means clustering in R.
After completing this material, you will:
We are going to replicate the geodemographic classification for internet access and use.
To create the classification, we are going to use a source file I have put together, with one row for each MSOA. It contains:
Variable | Definition |
---|---|
`GEO_CODE` | MSOA code |
`age_total`, `age0to4pc`, `age5to14pc`, `age16to24pc`, `age25to44pc`, `age45to64pc`, `age75pluspc` | percentage of people in various age groups |
`NSSEC.Total`, `HigherManagerial1pc`, `LowerManagerial2pc`, `Intermediate3pm`, `SmallEmployers4pc`, `SupervisoryTechnical5pc`, `SemiRoutine6pc`, `Routine7pc`, `LongTermUnemployed8pc` | percentage of people in different NS-SEC groups |
`Average.download.speed..Mbps.`, `Superfast.availability`, `Unable.to.receive.decent.broadband`, `Receiving.under.2.Mbps`, `Receiving.under.10.Mbps`, `Receiving.over.30.Mbps` | measures of broadband use and availability |
These data have all been processed to be at MSOA level.
Download the file internet-input-data.csv from the website.
Have a look at the file in Excel.
Start a new script in RStudio.
Set your working directory.
Use this code to read in the file:
data <- read.csv("internet-input-data.csv")
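At this point it is worth a quick sanity check (a minimal sketch, assuming the CSV read in above) to confirm the object is a data frame with the 23 columns we list later with colnames():

```r
# Quick sanity check on the loaded file: read.csv() should give a
# data.frame; dim() reports rows (MSOAs) and columns (23 variables)
class(data)
dim(data)
```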
Use head() to check what the data are:

head(data)
GEO_CODE age_total age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc
1 E02000001 7375 0.03200000 0.03877966 0.09613559 0.4069153 0.2725424
2 E02000002 6775 0.09269373 0.12162362 0.11321033 0.2801476 0.1856827
3 E02000003 10045 0.08292683 0.10154306 0.11796914 0.3060229 0.2248880
4 E02000004 6182 0.05904238 0.10239405 0.13879004 0.2544484 0.2495956
5 E02000005 8562 0.09296893 0.11936463 0.11948143 0.2994627 0.2136183
6 E02000005 8562 0.09296893 0.11936463 0.11948143 0.2994627 0.2136183
age75pluspc NSSEC.Total HigherManagerial1pc LowerManagerial2pc
1 0.06074576 5816 0.38462861 0.3390646
2 0.09800738 3926 0.05680082 0.1711666
3 0.06460926 6483 0.08175228 0.2076199
4 0.08864445 4041 0.06780500 0.2063846
5 0.05010512 5368 0.04936662 0.1663562
6 0.05010512 5368 0.04936662 0.1663562
Intermediate3pm SmallEmployers4pc SupervisoryTechnical5pc SemiRoutine6pc
1 0.07806052 0.06464924 0.02596286 0.04401651
2 0.14314824 0.11462048 0.06444218 0.20529801
3 0.18247725 0.11059695 0.07064631 0.15502082
4 0.19079436 0.10541945 0.08364266 0.15070527
5 0.15648286 0.10413562 0.08736960 0.19243666
6 0.15648286 0.10413562 0.08736960 0.19243666
Routine7pc LongTermUnemployed8pc Average.download.speed..Mbps.
1 0.02665062 0.03696699 24.4
2 0.13015792 0.11436577 77.7
3 0.09656023 0.09532624 74.3
4 0.12150458 0.07374412 81.1
5 0.14456036 0.09929210 70.7
6 0.14456036 0.09929210 70.7
Superfast.availability Unable.to.receive.decent.broadband
1 0.544 0.002
2 0.989 0.010
3 0.990 0.002
4 0.992 0.008
5 0.970 0.030
6 0.970 0.030
Receiving.under.2.Mbps Receiving.under.10.Mbps Receiving.over.30.Mbps
1 0.013 0.249 0.203
2 0.020 0.159 0.798
3 0.008 0.164 0.784
4 0.026 0.162 0.799
5 0.021 0.223 0.737
6 0.021 0.223 0.737
Is this the data we expect to see?
Use View() to look at the data. Is this what you expect?
There will be character (chr), integer (int) and numeric (num) values in this data frame. Make sure you can identify which is which, and that you know what the differences are.
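A quick way to check the type of each column is str() (assuming the data frame read in above; the interpretation comments describe this particular file):

```r
# str() lists every column with its type: GEO_CODE is character (chr),
# the count columns are integer (int) and the percentage and broadband
# columns are numeric (num)
str(data)
```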
You could also explore the distributions of some of the variables (for example, using hist()).
We need some libraries for this work. Remember to install them if you haven’t got them installed already.
#load libraries (and install if needed)
library(reshape)
library(ggplot2)
The K-Means cluster function will cluster all of the data in the table, so we need to remove everything we don’t want to cluster. Use colnames(data) to get the column index numbers, and then extract the columns we want to use in the K-Means clustering process:
#separate out data to be clustered
colnames(data)
[1] "GEO_CODE" "age_total"
[3] "age0to4pc" "age5to14pc"
[5] "age16to24pc" "age25to44pc"
[7] "age45to64pc" "age75pluspc"
[9] "NSSEC.Total" "HigherManagerial1pc"
[11] "LowerManagerial2pc" "Intermediate3pm"
[13] "SmallEmployers4pc" "SupervisoryTechnical5pc"
[15] "SemiRoutine6pc" "Routine7pc"
[17] "LongTermUnemployed8pc" "Average.download.speed..Mbps."
[19] "Superfast.availability" "Unable.to.receive.decent.broadband"
[21] "Receiving.under.2.Mbps" "Receiving.under.10.Mbps"
[23] "Receiving.over.30.Mbps"
cluster.data <- data[, c(3:8,10:17,18:20)]
head(cluster.data)
age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
1 0.03200000 0.03877966 0.09613559 0.4069153 0.2725424 0.06074576
2 0.09269373 0.12162362 0.11321033 0.2801476 0.1856827 0.09800738
3 0.08292683 0.10154306 0.11796914 0.3060229 0.2248880 0.06460926
4 0.05904238 0.10239405 0.13879004 0.2544484 0.2495956 0.08864445
5 0.09296893 0.11936463 0.11948143 0.2994627 0.2136183 0.05010512
6 0.09296893 0.11936463 0.11948143 0.2994627 0.2136183 0.05010512
HigherManagerial1pc LowerManagerial2pc Intermediate3pm SmallEmployers4pc
1 0.38462861 0.3390646 0.07806052 0.06464924
2 0.05680082 0.1711666 0.14314824 0.11462048
3 0.08175228 0.2076199 0.18247725 0.11059695
4 0.06780500 0.2063846 0.19079436 0.10541945
5 0.04936662 0.1663562 0.15648286 0.10413562
6 0.04936662 0.1663562 0.15648286 0.10413562
SupervisoryTechnical5pc SemiRoutine6pc Routine7pc LongTermUnemployed8pc
1 0.02596286 0.04401651 0.02665062 0.03696699
2 0.06444218 0.20529801 0.13015792 0.11436577
3 0.07064631 0.15502082 0.09656023 0.09532624
4 0.08364266 0.15070527 0.12150458 0.07374412
5 0.08736960 0.19243666 0.14456036 0.09929210
6 0.08736960 0.19243666 0.14456036 0.09929210
Average.download.speed..Mbps. Superfast.availability
1 24.4 0.544
2 77.7 0.989
3 74.3 0.990
4 81.1 0.992
5 70.7 0.970
6 70.7 0.970
Unable.to.receive.decent.broadband
1 0.002
2 0.010
3 0.002
4 0.008
5 0.030
6 0.030
We also need to rescale the data so everything is on a similar scale, as the speed data (Mbps) is very different from the other variables, which are proportions.
cluster.data <- scale(cluster.data)
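To see what scale() is doing, here is a minimal sketch on a made-up toy data frame (the column names and values are purely illustrative): each column is standardised to mean 0 and standard deviation 1, so variables measured in Mbps and as proportions become directly comparable.

```r
# Toy illustration of scale(): two columns on very different scales
toy <- data.frame(pc = c(0.1, 0.2, 0.3), mbps = c(24.4, 77.7, 81.1))
toy.scaled <- scale(toy)
colMeans(toy.scaled)        # both means are (effectively) 0
apply(toy.scaled, 2, sd)    # both standard deviations are 1
```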
Our first step in the analysis is to create an elbow plot, to see how many clusters we should specify for the K-Means analysis.
We run the K-Means analysis for 1, 2, 3 and so on up to 15 clusters, and then calculate and plot the within-groups sum of squares (wss) for each one.
#Determine number of clusters required by elbow plot
wss <- (nrow(cluster.data)-1)*sum(apply(cluster.data,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(cluster.data,
centers=i)$withinss)
Warning: did not converge in 10 iterations
(You may see this warning for some cluster counts; it is safe to ignore here, or you can pass a larger iter.max to kmeans().)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = "K-means: distance from mean by cluster amount", cex.axis = 0.90, lwd = 2)
abline(v=c(1:15), lty=1.5, lwd=0.5, col=336)
Based on the elbow plot, we need to choose the number of clusters; in this example we are going to use 7.
The interpretation of an elbow plot can be quite subjective, and here 8 clusters could easily be justified too. You would need to try 8 as well as 7 and compare the outputs.
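One way to make the elbow reading a little less subjective (a sketch, assuming the wss vector computed above) is to look at the percentage reduction in wss gained by each additional cluster; where these gains flatten out is your elbow:

```r
# Percentage reduction in the within-groups sum of squares for each
# extra cluster; small values mean an extra cluster buys very little
round(100 * -diff(wss) / wss[-length(wss)], 1)
```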
We are going to run the cluster analysis 10 times, because there is an element of randomness in the initialisation, and we want to keep the best clustering output.
#K-Means (7 clusters, best of 10 runs)
clusters <- list()
fit <- numeric(10)
for (i in 1:10){
  print(paste("starting run", i))
  class.7 <- kmeans(x=cluster.data, centers=7, iter.max=1000000, nstart=1)
  fit[i] <- class.7$tot.withinss  #total within-cluster sum of squares
  if (i == 1 || fit[i] < min(fit[1:(i-1)])){  #keep the best run so far
    clusters <- class.7
  }
}
[1] "starting run 1"
[1] "starting run 2"
[1] "starting run 3"
[1] "starting run 4"
[1] "starting run 5"
[1] "starting run 6"
[1] "starting run 7"
[1] "starting run 8"
[1] "starting run 9"
[1] "starting run 10"
We then have a bit of post processing to extract some useful summary data for each cluster:
#process and show cluster size
kfit5 <- clusters  #the best of the 10 runs above
kfit5$size
[1] 662 849 2666 228 1905 997 2120
#extract the mean value of each variable for each cluster
k5_mvalues <- aggregate(cluster.data, by=list(kfit5$cluster), FUN=mean)
#convert to a data frame
k5_mvalues_N <- as.data.frame(k5_mvalues)
View(k5_mvalues_N)
write.csv(k5_mvalues_N, "cluster-means.csv")
This (k5_mvalues_N) contains the mean value of each variable for each cluster. We have also saved it as a CSV file which we can open in Excel. I have opened it in Excel, split the data into the three groups (age, internet and NS-SEC) and then applied Conditional Formatting to each group, to pick out the highest and lowest values. Remember, your values may be different.
Here we can see that only cluster 1 shows a clear pattern with internet usage and infrastructure; clusters 3 and 5 are driven primarily by age, and clusters 1, 3 and 6 by NS-SEC.
We could develop the analysis by looking at this summary, removing certain variables, and re-running the cluster analysis.
We can also plot the mean variation within clusters within R:
library(ggplot2)
#melt data for graph
k5df <- melt(k5_mvalues_N, id = "Group.1", variable_name = "Variables")
View(k5df)
#graph data
ggmeans <- ggplot(k5df, aes(Group.1,value)) + geom_line(aes(colour = Variables)) +
scale_x_continuous(breaks=seq(1, 7, by=1)) + theme(text = element_text(size=20))
ggmeans + ggtitle("Mean variation within clusters")
There is a lot of data there, so it might be easier to look at just one of the three groups. You can do this by dropping the columns you don't need from k5_mvalues_N
and rerunning the melt and graph code above.
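Alternatively, you could filter the already-melted data frame (a sketch using the broadband variable names from the source data; k5df and the ggplot code are as above):

```r
# Keep only the broadband-related variables before re-plotting
internet.vars <- c("Average.download.speed..Mbps.", "Superfast.availability",
                   "Unable.to.receive.decent.broadband")
k5df.internet <- k5df[k5df$Variables %in% internet.vars, ]
```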
Finally, we need to append the cluster allocation to our original MSOA data:
#Append cluster-MSOAs
MSOA_data_centriods <- data
MSOA_data_centriods$cluster <- kfit5$cluster
write.csv(MSOA_data_centriods, file = "20201013-MSOA.csv")
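Before moving to mapping, a quick check on the appended column (assuming the objects created above): tabulating it should reproduce the cluster sizes we saw earlier.

```r
# The counts here should match kfit5$size from earlier
table(MSOA_data_centriods$cluster)
```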
When interpreting the clusters, mapping them can often be really helpful. This is what the data looked like when I joined my version to MSOA boundaries in QGIS: