CDRC Modelled Ethnicity Proportions (LA Geography)

These data combine historical electoral roll and linked consumer register data (on surnames, forenames and locations) from 1997 onwards, with an aggregated metric derived from ONS data which lists the most frequently selected second-level ethnicity category for the most common forenames and surnames.

The data are aggregated to Local Authority Districts (LAD) as defined in 2021 and, separately, as defined in 2023, because some local authorities were created or eliminated between these years. They are supplied as Open data and can be downloaded at the bottom of this page, where there is a zip file containing the data for each Local Authority year. More spatially detailed data (at Lower Super Output Area level) are available on application; please see the Related Content link below for further details.

Users of this dataset should be mindful that these data concern ethnicity categories; they do not show migration, citizenship, nationality or country of origin.

These data were derived as part of an ESRC-funded project 'Ethnicity Estimator' - Virtual Microdata Laboratory project number: 0000013; and comprise a diagnostic table resulting from the application of a bespoke algorithm. The aggregate data were provided by the ONS within the Virtual Microdata Laboratory (VML).

The data are available as CSV files, one for each of the ethnicity groups. Each row contains the LAD ID, followed by the proportion of the population that is believed to be of that ethnicity (based on surname analysis), rounded to the nearest 0.5%.
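As a rough illustration of working with one of these files, the sketch below reads a CSV whose first column is the LAD ID and whose remaining columns hold proportions. The column layout (an `LAD21CD` ID column followed by one column per year) and the sample values are assumptions made up for this example, and `read_proportions` is a hypothetical helper; the Variable Dictionary describes the actual columns.

```python
import csv
import io


def read_proportions(handle, id_column):
    """Read rows of 'LAD ID, proportion, ...' into {lad_id: {column: float}}.

    `id_column` is the header of the LAD ID column (assumed, check the
    Variable Dictionary for the real name).
    """
    table = {}
    for row in csv.DictReader(handle):
        lad_id = row.pop(id_column)
        # Skip empty cells rather than guessing a value for them.
        table[lad_id] = {
            col: float(val) for col, val in row.items() if val not in ("", None)
        }
    return table


# Invented sample in the assumed layout; these are not real values.
sample = io.StringIO("LAD21CD,2021,2022\nE09000001,0.015,0.020\n")
table = read_proportions(sample, "LAD21CD")
print(table["E09000001"]["2021"])
```

With a downloaded file, the same helper could be called on `open(path, newline="")` in place of the in-memory sample.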

To create the data, a slight aggregation of the second-level ethnicity categories is carried out. We then aggregate by 2021 LAD and separately by 2023 LAD. Category populations of less than 5 are set to 0. The results are then divided by the total population and rounded to the nearest 0.005 (i.e. 0.5%). A value of 0 indicates there is no measurable total population for this LAD and year combination. These values generally only occur for the first few years and in only a small number of LAD areas. Totals may not add up to 1.000 (100%) because of rounding, and also because of an Unknown ethnicity category to which a small proportion of names is assigned.
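The suppression and rounding steps described above can be sketched as follows. This is a minimal illustration and not the CDRC's actual processing code: the function name, the category labels and the counts are hypothetical, the total population is assumed to be supplied separately, and tie-breaking at exact half-steps follows Python's `round`, which the description above does not specify.

```python
def modelled_proportions(counts, total, min_count=5, step=0.005):
    """Turn per-category counts for one LAD and year into rounded proportions.

    counts: {category: count}; total: total population for the LAD and year.
    Counts below `min_count` are set to 0, then each count is divided by
    `total` and rounded to the nearest `step` (0.005, i.e. 0.5%).
    """
    if total <= 0:
        # No measurable total population: report 0 for every category.
        return {category: 0.0 for category in counts}
    steps_per_unit = round(1 / step)  # 200 steps of 0.005 in 1.0
    result = {}
    for category, n in counts.items():
        kept = n if n >= min_count else 0
        # Round on the integer grid of steps to avoid float drift.
        result[category] = round(kept * steps_per_unit / total) / steps_per_unit
    return result


# Hypothetical counts for a single LAD and year.
example = modelled_proportions(
    {"group_a": 612, "group_b": 3, "group_c": 385}, total=1000
)
print(example)
```

In this made-up case the small category is suppressed to 0 and the remaining proportions sum to slightly less than 1, which is the behaviour the paragraph above warns about.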

Details of the model method can be found in this paper: https://doi.org/10.1371/journal.pone.0201774 - the model used is EE-A6, on a deterministic (not probabilistic) basis.

For a detailed description of the columns contained within the data, see the Variable Dictionary; for an overview of the characteristics of the data, see the Data Summary. These files can be downloaded from the bottom of this page.

Quality, Representation and Bias

The data are synthetically modelled, based on the most common ethnicities stated for particular surnames (regardless of location) in England/Wales. They are not actual measured data for the populations. Because this particular source covers only England/Wales, we would expect marginally less accurate results for Scotland/Northern Ireland.

The underpinning data are the Linked Consumer Registers (LCRs), the provenance of which is set out in two papers in the Journal of the Royal Statistical Society Series A (Lansley et al. 2019; van Dijk et al. 2021). Consumer and administrative data were acquired directly or indirectly from multiple data providers without warranties about accuracy or coverage, consistent with industry practice. Extensive internal and external validation procedures were developed in order to render the diverse data formats consistent and to establish the provenance of the consolidated registers. Known shortcomings in the data and an overall assessment of quality are set out in the peer-reviewed research papers.

In addition to establishing consistency of address referencing, the research papers document the completeness of the data. In terms of coverage, the LCRs tend to underestimate LSOA adult population sizes relative to UK mid-year population estimates for 2003-2020. The research papers describe procedures developed by the CDRC to fill in known gaps where possible.

Data for Northern Ireland are estimated to be less complete because of specific administrative procedures and legislative requirements. Additional UK-wide issues are created by second-home owners and students.

The counts of individuals in the original LCR data fluctuate according to data supplier as well as with actual changes in population size. For this reason, metadata describing the annual distribution of population counts across the LADs used in calculating the Modelled Ethnicity Proportions are made available. Modelled Ethnicity Proportions may be out of line with census counts, and users should consult census statistics if they have concerns.

Data Sources

  • ONS Census 2011 - most common ethnicities by forename, most common ethnicities by surname. Single composition result for all of England/Wales. Used for more common forenames and surnames.
  • Onomap (various sources, e.g. phonebooks) - forename/surname pairs used in the Onomap model, from which some ethnicity-based groupings are identified. Global coverage. Data typically from 2000-12, with some more recent data. Used for less common forenames and surnames only.
  • CDRC Linked Consumer Register - names and addresses of individual people in households, every year from 1997 to 2023.
Controller: University College London (UCL)

Additional Info:

Attribution: ESRC Consumer Data Research Centre. Funder: Economic and Social Research Council (ES/L011840/1).
Source: ONS; CDRC Linked Consumer Register

Data and Resources

Modified: 2024-11-20
Release Date: 2021-12-06
Frequency: Annually
Spatial / Geographical Coverage Location: United Kingdom
Temporal Coverage: January 1997 to December 2023
Granularity: LAD21CD, LAD23CD
Author: Chi, Bin; Todd, James; van Dijk, Justin; Li, Wen; Lansley, Guy; Kandt, Jens
Contact Name: Bin Chi
Contact Email:
Public Access Level: Public
POLYGON ((-8.9731907844543 49.685033124181, -8.9731907844543 61.345632152747, 2.7291369438171 61.345632152747, 2.7291369438171 49.685033124181))