CDRC Modelled Ethnicity Proportions (LSOA Geography)

Population & Mobility

These data combine historical electoral roll and linked consumer register data (on surnames, forenames and locations) from 1997 onwards with an aggregated metric derived from ONS data, which lists the most frequently selected second-level ethnicity category for the most common forenames and surnames. The data are aggregated to Lower Layer Super Output Area (LSOA11CD) or equivalent scale.

Users of these data should be mindful that they concern ethnicity categories, and not migration, citizenship, nationality, or country of origin. The electoral roll and consumer registers have been linked together for data inference and cleaning, to provide population continuity and to produce a smoother, higher-quality temporal output.

These data were derived as part of the ESRC-funded project 'Ethnicity Estimator' (Virtual Microdata Laboratory project number: 0000013) and comprise a diagnostic table resulting from the application of a bespoke algorithm. The aggregate data were provided by the ONS within the Virtual Microdata Laboratory (VML).

The data are available as CSV files, one for each of the ethnicity groups. Each row contains the LSOA11CD, followed by the proportion of the population that is believed to be of that ethnicity (based on surname analysis), rounded to the nearest 0.5%.
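As a rough illustration of working with this layout, the sketch below reads a small in-memory CSV and checks that each proportion sits on a 0.5% step. The column name `proportion`, the sample rows and both helper functions are assumptions made for illustration; the actual column names are listed in the Variable Dictionary.

```python
import csv
import io

# Hypothetical sample rows; real column names are given in the Variable Dictionary.
SAMPLE = """LSOA11CD,proportion
E01000001,0.015
E01000002,0.000
E01000003,0.120
"""

def read_proportions(handle):
    """Return {LSOA code: proportion} from a CSV with the assumed two-column layout."""
    reader = csv.DictReader(handle)
    return {row["LSOA11CD"]: float(row["proportion"]) for row in reader}

def is_half_percent_step(value, tol=1e-9):
    """True if value is a multiple of 0.005 (0.5%), within floating-point tolerance."""
    steps = value / 0.005
    return abs(steps - round(steps)) < tol

proportions = read_proportions(io.StringIO(SAMPLE))
off_grid = [code for code, p in proportions.items() if not is_half_percent_step(p)]
```

For a downloaded file, the same function could be called on an open file handle in place of the `io.StringIO` sample, after adjusting the column names to match the Variable Dictionary.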

Methodology
To create the data, the second-level ethnicity categories are first slightly aggregated, and counts are then aggregated by LSOA11CD. Category populations of fewer than 5 are set to 0. The results are then divided by the total population and rounded to the nearest 0.005 (i.e. 0.5%). A value of 0 indicates there is no measurable total population for this LSOA and year combination. These values generally occur only in the first few years and in only a small number of LSOAs. Totals may not add up to 1.000 (100%) because of rounding, and also because a small proportion of names are assigned to an Unknown ethnicity category.
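The suppression and rounding steps can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, the group labels, the choice to compute the total before suppression, and the half-up rule for ties are all assumptions, since the description above does not specify them.

```python
from decimal import Decimal, ROUND_HALF_UP

def modelled_proportions(counts, threshold=5, step=Decimal("0.005")):
    """Suppress small counts, divide by the total, round to the nearest step.

    Assumptions: the total is the sum of all category counts before
    suppression, and ties are rounded half-up.
    """
    total = sum(counts.values())
    if total == 0:
        # No measurable population for this LSOA and year: every value is 0.
        return {group: Decimal("0") for group in counts}
    result = {}
    for group, n in counts.items():
        kept = n if n >= threshold else 0  # counts below 5 are set to 0
        ratio = Decimal(kept) / Decimal(total)
        steps = (ratio / step).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
        result[group] = steps * step
    return result

# Hypothetical counts for a single LSOA and year.
example = modelled_proportions(
    {"group_a": 1200, "group_b": 46, "group_c": 3, "unknown": 11}
)
published_total = sum(v for g, v in example.items() if g != "unknown")
```

In this made-up example the named groups sum to less than 1.000, which illustrates how rounding, suppression and the Unknown category can each keep totals below 100%.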

Details of the model method can be found in this paper: https://doi.org/10.1371/journal.pone.0201774 - the model used is EE-A6, on a deterministic (not probabilistic) basis.

For a detailed description of the columns contained within the data, see the Variable Dictionary; for an overview of the characteristics of the data, see the Data Summary. These files can be downloaded from the bottom of this page.

Quality, Representation and Bias

The data are synthetically modelled, based on the most common ethnicities stated for particular surnames (regardless of location) in England/Wales. They are not actual measured data for the populations. Because this particular source covers only England/Wales, we would expect marginally less accurate results for Scotland.

The underpinning data are the Linked Consumer Registers (LCRs), the provenance of which is set out in two papers in the Journal of the Royal Statistical Society Series A (Lansley et al. 2019; van Dijk et al. 2021). Consumer and administrative data were acquired directly or indirectly from multiple data providers without warranties about accuracy or coverage, consistent with industry practice. Extensive internal and external validation procedures were developed in order to render the diverse data formats consistent and to establish the provenance of the consolidated registers. Known shortcomings in the data and an overall assessment of quality are set out in the peer-reviewed research papers.

In addition to establishing consistency of address referencing, the research papers document the completeness of the data. In terms of coverage, the LCRs tend to underestimate LSOA adult population sizes relative to UK mid-year population estimates for 2003-2020. The research papers describe procedures developed by the CDRC to fill in known gaps where possible.

The counts of individuals in the original LCR data fluctuate according to data supplier, in addition to actual changes in population size. For this reason, metadata describing the annual distribution of population counts across the LSOAs used in calculating Modelled Ethnicity Proportions are made available. Modelled Ethnicity Proportions may be out of line with census counts, and users should consult census statistics if they have concerns.

Data Sources

ONS Census 2011 - most common ethnicities by forename, most common ethnicities by surname. Single composition result for all of England/Wales. Used for more common forenames and surnames.
Onomap (various sources, e.g. phonebooks) - forename/surname pairs used in the Onomap model, from which some ethnicity-based groupings are identified. Global coverage. Data typically from 2000-12, with some more recent data. Used for less common forenames and surnames only.
CDRC Linked Consumer Register - names and addresses of individual people in households, every year from 1997 to 2023.

Safeguarded

Ethnicity

Population

Controller:

University College London (UCL)

Additional Info:

Field	Value
Source	ONS; CDRC Linked Consumer Register
Attribution	Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1

Data and Resources

Technical Report: Metadata and Validation for Modelled Ethnicity Proportions (pdf)
Data Summary (csv): The MEP 2023 is used as an example. The other year files are similar.
Variable Dictionary (csv)

Dataset Info:

Field	Value
Modified	2024-11-20
Release Date	2019-11-20
Spatial / Geographical Coverage Location	Great Britain
Temporal Coverage	January 1997 to January 2023
Granularity	LSOA11CD, DZ11CD
Author	Todd, James; van Dijk, Justin; Li, Wen; Lansley, Guy; Kandt, Jens
Contact Name	Oliver O'Brien
Contact Email	data@cdrc.ac.uk

POLYGON ((-8.9948498999 49.688302644, 2.0867431164 49.688302644, 2.0867431164 61.0684288668, -8.9948498999 61.0684288668, -8.9948498999 49.688302644))
