Welcome

This is part 1 of the 2 part course from CDRC on the Internet User Classification (IUC) and K-Means Clustering. The video in this part introduces the IUC data set, and the practical session shows you how to work with the IUC dataset in R. If you are completely new to RStudio, please check out our Short Course on Using R as a GIS.

After completing this material, you will:



Part 1: Internet User Classification (IUC)

Downloading IUC Data

Our first step is to download the IUC dataset:

  • Open a web browser and go to https://data.cdrc.ac.uk

  • Register if you need to, or if you are already registered, make sure you are logged in.

  • Search for IUC

  • Open the Internet User Classification page.

  • Scroll down and choose the download option for the IUC 2018 (CSV)

  • Save the iuc2018.csv file to your working directory.

You can download the shapefile with the data already joined to the LSOA boundaries, but this is the national data set and is quite large (75MB). R will work with this, but might be a bit slow. The steps below will only get the shapefile for Liverpool, which will be a much smaller file.

  • Open the file in Excel - what data do we have?

  • Check out the User Guide if you want to.

Loading IUC data in R

  • Start a new Script in RStudio.

  • Set your working directory.

  • Use this code to read in the file:

iuc <- read.csv("iuc2018.csv")
  • Use head() to check what the data are:
head(iuc)
  SHP_ID LSOA11_CD       LSOA11_NM GRP_CD                     GRP_LABEL
1      1 E01020179 South Hams 012C      5       e-Rational Utilitarians
2      2 E01033289   Cornwall 007E      9   Settled Offline Communities
3      3 W01000189      Conwy 015F      5       e-Rational Utilitarians
4      4 W01001022   Bridgend 014B      7 Passive and Uncommitted Users
5      5 W01000532 Ceredigion 007B      9   Settled Offline Communities
6      6 E01018888   Cornwall 071G      9   Settled Offline Communities
  • Is this the data we expect to see?

  • Use View() to look at the data.

  • Use str() to see whether they are character or numeric variables.

str(iuc)
'data.frame':   41729 obs. of  5 variables:
 $ SHP_ID   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ LSOA11_CD: chr  "E01020179" "E01033289" "W01000189" "W01001022" ...
 $ LSOA11_NM: chr  "South Hams 012C" "Cornwall 007E" "Conwy 015F" "Bridgend 014B" ...
 $ GRP_CD   : int  5 9 5 7 9 9 9 9 5 6 ...
 $ GRP_LABEL: chr  "e-Rational Utilitarians" "Settled Offline Communities" "e-Rational Utilitarians" "Passive and Uncommitted Users" ...
  • Is this what you expect?

There will be character chr, integer int and numeric num values in this data frame. Make sure you can identify which is which, and that you know what the differences are.

  • How are the data distributed? (try hist()).

Loading Spatial Data

To create any maps, we need some spatial data.

  • Go to http://census.edina.ac.uk/ and select Boundary Data Selector.
  • Then set Country to England, Geography to Statistical Building Block, dates to 2011 and later, and click Find.
  • Select English Lower Layer Super Output Areas, 2011 and click List Areas.
  • Select Liverpool from the list and click Extract Boundary Data.
  • After a 5 to 20 second wait, click BoundaryData.zip to download the files.

Extract the files, and move all the files starting with the name england_lsoa_2011 to your working folder.

We will also need some spatial libraries:

#load libraries
library(sf)
library(tmap)

Read in the spatial data we downloaded from Edina:

#read in shapefile
LSOA <- st_read("england_lsoa_2011.shp")
  • Has the data been read in correctly? Try head(), class() and str().

Let’s do a quick map:

qtm(LSOA)

  • Is the data correct?

Joining Attribute and Spatial Data

Next step is to join the attribute data (iuc) to the spatial data (LSOA).

Use head() to check which columns we are using for the join:

#check which columns we are joining
head(iuc)
head(LSOA)
#join attribute data to LSOA
LSOA <- merge(LSOA, iuc, by.x="code", by.y="LSOA11_CD")

#check output
head(LSOA)
Simple feature collection with 6 features and 7 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 334715 ymin: 385417 xmax: 339020.8 ymax: 390548
projected CRS:  OSGB 1936 / British National Grid
       code                       label           name SHP_ID      LSOA11_NM
1 E01006512 E08000012E02001377E01006512 Liverpool 031A  26586 Liverpool 031A
2 E01006513 E08000012E02006932E01006513 Liverpool 060A  24660 Liverpool 060A
3 E01006514 E08000012E02001383E01006514 Liverpool 037A  27675 Liverpool 037A
4 E01006515 E08000012E02001383E01006515 Liverpool 037B  26856 Liverpool 037B
5 E01006518 E08000012E02001390E01006518 Liverpool 044A  28180 Liverpool 044A
6 E01006519 E08000012E02001402E01006519 Liverpool 056A  27474 Liverpool 056A
  GRP_CD           GRP_LABEL                       geometry
1      1 e-Cultural Creators POLYGON ((336203 390010, 33...
2      1 e-Cultural Creators POLYGON ((335402.8 390317.5...
3      1 e-Cultural Creators POLYGON ((335651.3 389926.8...
4     10         e-Withdrawn POLYGON ((335186 389604, 33...
5     10         e-Withdrawn POLYGON ((335537.2 389034.5...
6      3          e-Veterans POLYGON ((338014.6 386447.2...

Finally, we can plot the maps quickly using qtm() from the tmap library:

#ahah index
qtm(LSOA, "GRP_CD")

This works well. However we don’t get many options with this. We can use a different function tm_shape(), which will give us more options.

tm_shape(LSOA) +
tm_polygons("GRP_CD")
tm_shape(LSOA) +
  tm_polygons("GRP_CD", palette = "Set3", n = 10) +
tm_layout(legend.title.size = 0.8)

This allows us to change the title, colours and legend title size. We are now using a qualitative palette as these data have no inherent order (i.e. group 1 is not more or less than group 2).

Customising Colours

We can customise the colours. Try running this code to see the different palette options:

library(RColorBrewer)
display.brewer.all()