Data News

CDRC Supporting Development of Sktime - 18th Feb 2021

Markus Löning is a PhD student at UCL with the CDRC, and is one of the lead developers of sktime - a Python library for time series machine learning. Time series analysis is a challenging area and many existing tools do not work well with time series data.

Solving data science problems with time series data in Python is challenging.

Why? Existing tools are not well-suited to time series tasks and do not easily integrate together. Methods in the scikit-learn package assume that data is structured in a tabular format and each row is i.i.d. — assumptions that do not hold for time series data. Packages containing time series learning modules, such as statsmodels, do not integrate well together. Further, many essential time series operations, such as splitting data into train and test sets across time, are not available in existing Python packages.
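As a minimal sketch of what splitting "across time" means, the snippet below holds out the most recent observations as the test set instead of sampling rows at random. The helper name `split_by_time` and the example values are hypothetical illustrations, not part of sktime or any other library.

```python
def split_by_time(series, test_size):
    """Split an ordered series so every test point is later than every training point.

    Hypothetical helper for illustration only; it is not sktime's API.
    """
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    cutoff = len(series) - test_size
    # No shuffling: the order of observations carries the time information.
    return series[:cutoff], series[cutoff:]


# Made-up values, ordered from oldest to newest.
values = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
train, test = split_by_time(values, test_size=3)
print(train)  # the seven oldest observations
print(test)   # the three most recent observations
```

A random split, as is common for tabular data, would let the model train on observations that come after the ones it is tested on, which is why the cutoff is placed along the time axis.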

To address these challenges, sktime was created.

Logo of the sktime library (GitHub: https://github.com/alan-turing-institute/sktime)

sktime is an open-source Python toolbox for machine learning with time series. It is a community-driven project funded by the UK Economic and Social Research Council, the Consumer Data Research Centre, and The Alan Turing Institute.

sktime extends the scikit-learn API to time series tasks. It provides the necessary algorithms and transformation tools to efficiently solve time series regression, forecasting, and classification tasks. The library includes dedicated time series learning algorithms and transformation methods not readily available in other common libraries.

sktime was designed to interoperate with scikit-learn, easily adapt algorithms for interrelated time series tasks, and build composite models. How? Many time series tasks are related. An algorithm that can solve one task can often be re-used to help solve a related one. This idea is called reduction. For example, a model for time series regression (use a series to predict an output value) can be re-used for a time series forecasting task (the predicted output value is a future value).
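The reduction idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not sktime's implementation: `make_windows`, `forecast` and the toy `NearestNeighbourRegressor` are hypothetical names, and the regressor stands in for any tabular model with `fit` and `predict` methods.

```python
def make_windows(series, window_length):
    """Turn a series into tabular data: each row of X holds `window_length`
    consecutive values and y holds the value that follows them."""
    X, y = [], []
    for start in range(len(series) - window_length):
        X.append(series[start:start + window_length])
        y.append(series[start + window_length])
    return X, y


class NearestNeighbourRegressor:
    """Toy stand-in for any tabular regressor with fit/predict (hypothetical)."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            # Squared Euclidean distance from this row to every training row.
            distances = [
                sum((a - b) ** 2 for a, b in zip(row, seen)) for seen in self.X_
            ]
            predictions.append(self.y_[distances.index(min(distances))])
        return predictions


def forecast(series, regressor, window_length, horizon):
    """Reduce forecasting to regression: fit on sliding windows, then predict
    step by step, feeding each prediction back in as an input."""
    X, y = make_windows(series, window_length)
    regressor.fit(X, y)
    history = list(series)
    for _ in range(horizon):
        next_value = regressor.predict([history[-window_length:]])[0]
        history.append(next_value)
    return history[len(series):]


# Made-up repeating series, used only to show the mechanics.
series = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2]
predicted = forecast(series, NearestNeighbourRegressor(), window_length=3, horizon=4)
print(predicted)  # → [3, 1, 2, 3]
```

Because the forecaster only relies on `fit` and `predict`, the toy regressor could be swapped for another tabular model without changing `forecast`, which is the kind of composition the reduction approach is meant to allow.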

  • Mission statement: “sktime enables understandable and composable machine learning with time series. It provides scikit-learn compatible algorithms and model composition tools, supported by a clear taxonomy of learning tasks, with instructive documentation and a friendly community.”

sktime is a great example of the user community coming together to produce an understandable, compatible, standards-based, open-source tool to solve a specific problem. CDRC is proud to support the project through Markus's involvement and aims to provide similar support to many other projects in the future.

For more details, please check out this blog post by Alexandra Amidon.

Written by Dr Nick Bearman, Project Delivery Manager

===========================================================================================================================

Understanding and Comparing Mobility Data - 4th Feb 2021

Through the ABC (Accelerating Business Collaboration) Research Programme, funded by ESRC & UBEL, PhD candidate James Todd worked with Geolytix to validate the representativeness of mobile mobility data from Unacast. Geolytix were interested in gaining a deeper understanding of how comparable their (Unacast) data is to alternative mobility data sources as well as insights into the factors that influence the number of devices that are found within small geographical areas.

Overall, the analysis within this project finds that Unacast mobility data is comparable to many alternative mobility data sources, observing a 70-100% decline in activity by the start of April 2020 across the vast majority of mobility data sources.

This research project comprised two main methods. First, mobility trends in London were assessed descriptively by comparing Unacast mobility data to a large number of open mobility data sources (Google, Apple, Purple, Open Table, Transport for London, City Mapper, Santander Bike Sharing). Using this method, it was possible to visually compare multiple mobility data sources within the context of Covid-19 lockdown restrictions.

Dataset                | Description                 | Source (link)
Unacast                | Mobile mobility data        | Geolytix (private)
Google                 | Categorised mobility data   | Google (open source)
Apple                  | Categorised mobility data   | Apple (open source)
SSS                    | Wifi footfall data          | CDRC (private)
Purple                 | Wifi footfall data          | Purple (open source)
Open Table             | Restaurant reservation data | Open Table (open source)
TfL                    | Transport use data          | TfL (open source)
City Mapper            | Mobility index data         | City Mapper (open source)
Santander Bike Sharing | Bikeshare activity data     | CDRC (open source)
Open Street Map        | Geographical features data  | OSM (open source)

Table 1. Sources of Mobility Data used in this analysis

To enable a deeper understanding of the representativeness of Unacast data, statistical regression analysis was conducted. A fixed-effect regression was conducted to find the representativeness of Unacast mobile devices in relation to the Local Data Company’s (LDC) Smart Street Sensor (SSS) footfall data. In addition to this, a linear regression was conducted to find the relationship between Unacast mobility data and local geographic features taken from Open Street Map (OSM).
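For readers unfamiliar with fixed-effect regression, the sketch below shows one common way to estimate it, the "within" transformation, in which each location's own averages are subtracted before fitting a slope. It is not the project's actual model or data: the function name `fixed_effects_slope`, the choice of footfall as the outcome and device count as the predictor, and all the numbers are assumptions made purely for illustration.

```python
from collections import defaultdict


def fixed_effects_slope(rows):
    """Estimate one slope shared across locations, allowing each location
    its own intercept (the within estimator).

    rows: iterable of (location, x, y) tuples. Hypothetical helper.
    """
    groups = defaultdict(list)
    for location, x, y in rows:
        groups[location].append((x, y))

    numerator = 0.0
    denominator = 0.0
    for observations in groups.values():
        mean_x = sum(x for x, _ in observations) / len(observations)
        mean_y = sum(y for _, y in observations) / len(observations)
        # Demeaning within each location removes its fixed level.
        for x, y in observations:
            numerator += (x - mean_x) * (y - mean_y)
            denominator += (x - mean_x) ** 2

    if denominator == 0:
        raise ValueError("x does not vary within any location")
    return numerator / denominator


# Synthetic data (not from the study): footfall = 2 * devices + a location offset.
rows = [
    ("A", 10, 120), ("A", 20, 140), ("A", 30, 160),  # offset 100
    ("B", 40, 80), ("B", 50, 100), ("B", 60, 120),   # offset 0
]
slope = fixed_effects_slope(rows)
print(slope)  # → 2.0
```

In this synthetic example the two locations sit at different baseline levels, so a single regression line through all six points would misstate the relationship, while the within estimator recovers the slope of 2 that was built into the data.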

Geolytix were very happy with the project. Blair Freebairn (CEO Geolytix Ltd), said "The work is valuable to us in and of itself, but also as it has sparked additional areas of interest. In particular the comparisons to other broad brush indicators of human movement has provided context and reassurance as to the high-level appropriateness of mobility data. The micro correlations at site level are well elucidated and have shed new light on the nature of mobility data."

James Todd, PhD candidate, said "This experience has been extremely valuable as it has given me insights into the private sector’s area of interest in the context of mobility data, which I have been working on within my PhD. This has given me many ideas on how I would like to adapt my PhD to include similar analysis as part of an empirical chapter."

Written by Dr Nick Bearman, Project Delivery Manager

===========================================================================================================================

DUG Conference: Data Analysis in a Crisis, plus CDRC Masters Dissertation Scheme - 11th Nov 2020

On Tuesday 10th November, the retail industry DUG (Data Analysts User Group) hosted its annual conference on the theme of Data Analysis in a Crisis. Consonant with this theme, the usual industry-led event could not take place at the usual Royal Society venue this year, but nonetheless attracted an audience of 70 participants online using WebEx.

Full details of the programme and activities are available on the DUG website, including videos of the presentations. DUG Director Tim Drye opened the proceedings with an overview of the science and art underpinning the Data Analyst role, illustrating that foundations from each are essential to understanding data and presenting them to an audience in an intelligible manner.

Mark Stern, Eoin Gleeson and Fraser Gray from Ladbrokes Coral then addressed the organisational setting for high-performance data analytics through effective team-building, drawing upon their many varied experiences.

Prof. Paul Longley then introduced the CDRC Masters Dissertation Scheme, noting the upcoming launch of the 2021 scheme and the opportunities that it offers for career-enhancing interactions between business, academia and student-centred problem-solving. (The website has more information and can be used to make enquiries or submit projects.) Four selected students who took part in the 2020 Masters Dissertation Scheme then presented their collaborative work:

  • Lucy (Ludmila) Sabelnikova, City University worked with Movement Strategies, in an evaluation of the ways in which footfall and mobile network data can be used to predict consumer behaviour at events – view Lucy’s presentation and project overview.

  • Samuel Li, UCL also worked with Movement Strategies, on an assessment of the impact of weather upon shipping movements, as evidenced using AIS data and weather APIs – view Samuel’s presentation and project overview.

  • Nombuyiselo Murage, University of Liverpool worked with Tamoco UK Ltd., to derive spatio-temporal geographies of activity patterns from mobile GPS data – view Nombuyiselo’s presentation and project overview.

  • Taeyang Jung, Imperial College worked with the Phoenix Partnership to identify and evaluate barriers to use of electronic health records in applied settings – view Taeyang’s presentation and project overview.

All of this year’s Masters Dissertations were submitted to the annual national CDRC competition, judged this year by Sarah Hitchcock (Geolytix) and Martin Squires (Pets at Home and UCL Visiting Industrial Professor). This year’s £500 cash prize was awarded to Lucy (Ludmila) Sabelnikova, and the two runner-up prizes were awarded to Nombuyiselo Murage and Samuel Li. Nombuyiselo also won the Presentation Prize for her contribution to the DUG conference, with honourable mentions also going to Samuel and Ludmila.

Congratulations to all the prize winners, and thank you to Lucy, Sam, Nombuyiselo and Taeyang for the excellent presentations, which will be made available as part of the conference proceedings.

The student presentations were followed by a talk from Dr Andrew Larner that took stock of how local councils are adapting to the Coronavirus, bringing together a range of experiences from across the globe. The contribution of the National Statistician, Professor Sir Ian Diamond, was unfortunately cancelled because of technical issues.

Gary Cole highlighted the benefits of DUG membership and outlined how DUG is now moving forward, and Tim Drye wrapped things up sharing his reflections from the meeting.

It was a great opportunity to hear from industry, and see how the CDRC Masters Dissertation Students completed their projects over the summer. If you are interested in submitting projects for next year’s Scheme, please have a look at our website. If you have any questions, please email projects@cdrc.ac.uk.

Written by Dr Nick Bearman, Project Delivery Manager

===========================================================================================================================

CDRC Masters Dissertation Scheme at Registry Trust: skills, experience and employability - 9th Nov 2020

Millie Corless completed her masters dissertation through the CDRC Masters Dissertation Scheme (MDS) with the Registry Trust. I spoke with her after she completed her degree to find out what it was like. The scheme really appealed to her, and it was one of the parts of the MSc Geospatial Data Science (GDS) degree at University of Liverpool that convinced her to apply for that Masters programme. Our industry collaborations allow us to provide projects that have real world impact, giving students the experience of working with a real world dataset, and feeding into the industry partner’s work.

The Registry Trust is a small company (~30 staff) and had one person in their data analytics team. The scheme gave them extra capacity: an MSc GDS student, with skills and experience in GDS and coding, who could take one of their data sets on which little work had been done and spend a significant amount of time analysing it. The Registry Trust were fairly sure the data set had a good story within it, about the scale and patterns of county court judgements for indebtedness, but they didn’t have the time or expertise to dig into this and find out the details. Offering the project through the CDRC MDS allowed them to get someone with the skills and time to do this.

Throughout the scheme, the projects vary but they always have some degree of flexibility for the student to focus the project on their areas of interest. For example, Millie is very interested in health and she looked at the county court judgement (CCJ) data and explored its relationship with health. She met with a number of different people to develop and refine the project proposal, including her industry supervisor (the current data analyst) and others from Registry Trust, including the CEO and Chair of the company. One of the benefits of working with a small company is that she was able to work closely with a range of staff members and they gave her some great insights into working in industry. She felt like she was working within a bigger team for her dissertation; there was a group of people she could go to for advice, including her academic supervisor, her industry supervisor, and others within and outside the Registry Trust.

This project also had a great real world impact; the analysis Millie completed has fed into a blog post by the Registry Trust, and future projects and their policy recommendations. The real world impact was one of the elements that Millie really liked about the Masters Dissertation Scheme, and the Registry Trust project in particular. The fact this project already existed was a great help for Millie – “it allowed me to concentrate on the project rather than needing to come up with a topic. I also knew I was interested in health, so the flexibility within the project allowed me to include that in my research questions and analysis which was great.”

More details on her research are available in the Registry Trust blog post and in her dissertation.

After the scheme, the position of Data Analyst within the Registry Trust became available; Millie applied and is now working full time for the Registry Trust. “I applied and was interviewed with a number of other candidates, but having taken part in the Masters Dissertation Scheme, I already knew the data sets they were working with and the type of analyses they were interested in which gave me an advantage.” Her role as Data Analyst is developing the in-house skill set that the Registry Trust can utilise and will feed into a number of projects and outputs over the coming year.

Her recommendation to anyone considering applying for the Masters Dissertation Scheme is to “go for it”. The scheme gave her great experience and looks good on her CV, and going forward into any career (in industry or academia or elsewhere) the Masters Dissertation Scheme shows you are interested in the application of the skills you have learnt and gives you experience of working with others in industry.

The Scheme will be open soon (November) for businesses proposing projects, and then available in the new year for masters students to apply. Please have a look at https://www.cdrc.ac.uk/education-and-training/masters-dissertation-scheme/ for more details.

Written by Dr Nick Bearman, Data Services Manager.

===========================================================================================================================

CDRC Open Data Survey & Prize Draw - 29th Oct 2020

The CDRC is currently conducting a review of past and ongoing applications of our data sets.

Users of our open data services are invited to participate in a short survey. Completing the survey will automatically enter you into a prize draw, with a chance of winning one of four Amazon gift vouchers:

  • 1 x £200
  • 1 x £100
  • 2 x £50

We will contact the winning participants with details of how to claim their prize shortly after the survey closes on November 13th 2020.

We are gathering information to track the applications of our data services and to better develop our services with our users’ needs in mind. As an open and accessible data service provider, user feedback is crucial to improve the service CDRC provides and to maintain CDRC as a user-centric platform.

All of those users who have registered to access our open datasets should have received an email for the survey. If you have not, and would like to contribute, the survey is available online at https://liverpool.onlinesurveys.ac.uk/cdrc-open-data-survey.

Please contact james.brookes@liverpool.ac.uk or info@cdrc.ac.uk if you have questions about completing the survey.

Written by Dr Nick Bearman, Data Services Manager.

===========================================================================================================================

Secure Labs Reopening and Remote Data Service - 7th Sept 2020

It has been a very interesting few months, with many of our working practices changing, both for better and for worse. I am very happy to announce that our secure labs in London and Liverpool have now reopened, with Covid-safe rules to allow users safe access to the labs. We will be in touch with lab users; please do contact us if you have any questions.

As one of the Economic and Social Research Council’s data infrastructure investments, CDRC was asked to join a recent meeting discussing how we have been able to respond to COVID-19, both in terms of what our research has been used for, and how we have pivoted to provide more services online. We have had to adapt to and change how we work, often on a relatively short timescale, but hopefully for a better experience overall.

New Remote Secure Data Facilities

One of our main developments which is being rolled out is secure, remote access to some of our Secure data sets. Making data available through UCL’s Data Safe Haven allows us to provide access to some of our secure data sets which were previously only available within our secure labs, requiring a physical visit to London or Liverpool. We have had to renegotiate our data licensing agreements with our data partners to enable this, so currently only some secure datasets are available using this method.

Data Safe Haven is an ISO 27001 accredited facility, with two-factor authentication ensuring that only those who are allowed to access the data can do so. We have also implemented our standard secure data output checking, ensuring that any outputs from the lab are secure and non-disclosing.

Remote Working

All of our staff are now working from home, which has required us to update our working practices. Both UCL and University of Liverpool have now adopted Microsoft Teams, and working within the Teams framework has allowed us to simplify and rationalise our collaboration, scheduling and document management processes. We must always remember the variety of people’s opinions, with some of our staff very keen on home working, and some very keen to get back to the office as soon as possible. With many of our staff in London, space at home is often at a premium, particularly for a full time home office, which many of us never envisaged before.

One change this has precipitated is a move to electronic signatures for signing user agreements. Spearheaded by UCL Legal, we are now able to accept electronic signatures (using Adobe’s DocuSign process) on our user agreements, removing the need to print, physically sign and scan documents.

The last six months have brought new ways of working, and new approaches to all of our lives – who would have thought that everyone wearing face coverings would become accepted in everyday life? We will continue to keep you up to date with developments with our Secure Labs and new remote secure data technologies.

Remote Training

We have also been able to move all of our training provision online, with a number of courses run through Zoom recently. We are also in the process of developing two new courses (Advanced GIS Methods Training: AHAH and Multi-Dimensional Indices and Advanced GIS Training Methods: IUC and K-means Clustering) which will be delivered online in the autumn. Check out the links for more details. Whilst online training is not the same as in person training, it does have the advantage of not requiring travel and overnight stays, which is a big positive for many people.

We will continue to provide updates on how our service changes and develops. If you would like to use our data, or if you have any questions, please do get in touch at nick.bearman@ucl.ac.uk or info@cdrc.ac.uk.

Written by Dr Nick Bearman, Data Services Manager.

===========================================================================================================================

Home Working and Horizon Scanning - 2nd April 2020

Work has been transformed by the coronavirus crisis with remote working now the norm for millions of workers. But distance from the office is also providing some opportunities to take a wider perspective of the data landscape and to scan business horizons using data sources that we might have overlooked or never investigated in detail.

The CDRC Data Store remains open for business, and our Open and Safeguarded data products are available as normal. Our Secure labs are closed for the duration of the crisis, but we are still accepting Secure data applications for access when things return to normal.

For students, our Masters Dissertation Scheme is still running with a record number of projects for students to complete in the coming months using business and CDRC data. The scheme gives Masters students registered at any UK university a unique opportunity to engage with horizon scanning or other business problems using novel datasets and interesting business perspectives on applied problem-solving. In the past, many participating students have carried out work at the business’s office, but this year students are being offered opportunities to work with businesses through homeworking for the duration of the crisis. The Scheme still brings together the best of academic and business perspectives upon applied problem-solving. Academic supervisors similarly gain the opportunity to collaborate on potentially high impact research with the business community.

So… if you are a Master’s student interested in collaborating with business, but can no longer do this through fieldwork or primary data collection, why not click here to see if any of the CDRC projects interest you? A number of the organisations that we work with are very keen to use part of their homeworking to coach students in the workings of business, especially if you have relevant skills and ways of working to offer!

We also have the CDRC Data Store which has a wide range of data sets available, some of which may be very useful in your dissertation or current research.

Written by Dr Nick Bearman, Data Services Manager.