Big Data and Official Statistics – Experiences at Statistics Netherlands
Transcription
Big Data and Official Statistics – Experiences at Statistics Netherlands
Big Data and Official Statistics – Experiences at Statistics Netherlands Peter Struijs Poznań, Poland, 10 September 2015 Outline Big Data and official statistics Experiences at Statistics Netherlands with: - use of road sensor data - use of mobile phone location data - use of public social media messages Issues and solutions Strategic, policy and organisational challenges Cooperation and collaboration 2 What is Big Data? 3 4 Data sources and approaches Surveys / questionnaires sampling theory Administrative data sources Where does Big Data fit in? New methods may be needed, e.g. modeling for nowcasting and other methods not based on sampling theory 5 Potential Opportunities New statistics More detailed statistics More timely statistics Nowcasts and early indicators Quality improvement Response burden reduction Cost reduction and higher efficiency 6 Examples of possible Big Data sources Road sensor data Mobile phone location data Public social media messages Websites Google Trends Satellite information Etc… 7 Data scientists involved in the research shown Piet Daas (photo) May Offermans Marco Puts Martijn Tennekes 8 Statistics based on road sensor data Aim: Statistics on traffic intensities Characteristics of the data source Research on the usability of the data Process of using the data for statistics Issues when using traffic loop data 9 Road sensor data Source: National Data Warehouse for Traffic Information (NDW) There are 20.000 traffic loops on Dutch motorways, and 40.000 on provincial roads Each minute (24/7) the number of passing vehicles is counted, and their average speed Three different length classes are distinguished No identification of vehicles Around 230 million records a day used 10 Locations The main roads 11 A special dike 12 Road sensors in the dike 13 Minute data of one sensor for 196 days 14 Researching the data Cross correlation between sensor pairs - Used to validate metadata Trajectory speed vs. point speed - Average speed is 98 Km/h Small, medium-sized & large vehicles 22 Sensors in a road segment 17 Process of making traffic intensities statistics Select sensors on Dutch highways Preprocessing -‐ -‐ -‐ -‐ Remove non-informative variables Remove bad records Exclude bad sensors Quality indicators for daily data per sensor Processing -‐ -‐ -‐ -‐ 18 Reduce dimensions on same road and region Obtain number of vehicles for each road and region For each road and region, calculate monthly traffic intensity Use of R-Hadoop Validation and publication Data options Historical database -‐ Request data via web interface -‐ Minute data for all highways (48 variables, Jan 2010April 2014: around 2.5 TB) Data stream -‐ Every minute, all data for all active sensors -‐ Continuously collected 19 Road sensor data: Issues and non-issues Non-is s ues : Privacy Data acquisition Is s ues : Methodology - Selectivity - Quality Infrastructural needs Other issues - Skills needed - Transition from research to regular statistics 20 References: statistical use of road sensor data Publication of statistical results (in Dutch): http://www.cbs.nl/nl-NL/menu/themas/verkeervervoer/publicaties/artikelen/archief/2015/a13-drukste-rijksweg.htm Explanation in English: http://www.cbs.nl/NR/rdonlyres/25CE3592-A756-42B7-BABFC3E4C4E9375B/0/a13busiestnationalmotorwayinthenetherlands.pdf Research reference: Puts, M., Tennekes, M. and Daas, P. (2014) Using Road Sensor Data for Official Statistics: Towards a Big Data Methodology. Paper for Strata + Hadoop World, Barcelona, Spain. 21 Statistics based on mobile phone location data Why use mobile phone data for official statistics? Characteristics of the data source Research on the usability of the data Issues when using mobile phone location data Solutions to the issues 22 Possible uses of mobile phone data Daytime population statistics Mobility statistics Tourism statistics Other uses 23 Mobile phone activity as a data source Nearly every person in the Netherlands has a mobile phone -‐ Usually on them -‐ Almost always switched on -‐ Many people are very active during the day There is a grid of antennas with good coverage Data of a single mobile company was used -‐ Hourly aggregates per area -‐ Threshold of 15 events 24 Daytime population based on mobile phone data Issues when using mobile phone data Privacy Data acquisition Methodology - Representativeness - Selectivity - Quality Other issues - Infrastructure - Skills needed 26 Solutions Agreement with data provider to provide only aggregates and apply a threshold Data provider performed analysis of mobile phone ownership characteristics A large number of analyses were made, with the regular population registration data as a reference A number of assumptions had to be made 27 References: statistical use of mobile phone data Research references: Daas, P.J.H., Puts, M.J., Buelens, B. and van den Hurk, P.A.M. (2015) Big Data as a Source for Official Statistics. Journal of Official Statistics 31(2), pp. 249-262. Daas, P. and Burger, J. (2014) Profiling big data sources to assess their selectivity. Paper for the 2015 New Techniques and Technologies for Statistics conference, Brussels, Belgium. 28 Mobile phone data versus road sensor data 29 Statistics based on social media data Why use social media data for official statistics? Characteristics of the data source Research on the usability of the data Issues when using social media data 30 Possible uses of social media data Sentiment indicators - e.g. consumer confidence index Social indicators - e.g. social coherence indices Other uses 31 Social media – Dutch are very active on social media! -‐ Around 60% according to a surveyna altijd bij zich en staat vrijwel altijd aan • Steeds meer mensen hebben een smartphone! – Mogelijke informatiebron voor: -‐ Welke onderwerpen zijn actueel: • Aantal berichten en sentiment hierover -‐ Als meetinstrument te gebruiken voor: • . 32 Map by Eric Fischer (via Fast Company) The data All social media messages: - that are written in Dutch and are public These messages are systematically and instantly collected by the Dutch firm Coosto Dataset of more than 3.5 billion messages: - covering June 2010 till the present between 3-4 million new messages added per day 33 Research question C an we replicate the consumer confidence index by only using social media data, while reducing production time? 34 Sentiment determination ‘Bag of words’ approach - list of Dutch words with their associated sentiment added social media specific words (‘FAIL’, ‘LOL’, ‘OMG’ etc.) Use overall score to determine sentiment - is either positive, negative or neutral Average sentiment per period (day / week / month) - (#positive - #negative)/#total * 100% 35 Sentiment per platform (~10%) (~80%) Build a model Idea: Fitting characteristics derived from social media messages to consumer confidence Success: If correlation can be found that is high and remains high, that is, has predictive power 37 Figure 1. Development of daily, weekly and monthly aggregates of social media sentiment from June 2010 until November 2013, in green, red and black, respectively. In the insert the development of consumer confidence is shown for the identical period. 38 Results High correlation achieved (0.9) Changes in consumer confidence preceed changes in sentiment by one week Short processing time, so time-to-market may be reduced. Sentiment index can be produced on a weekly basis To be considered: - Use model-based figures as early indicators - Reduce sampling of consumer confidence index 39 General sentiment indicator (draft version) 40 Issues when using social media data Lesser issues: Privacy Data acquisition Main issues: Methodology - Selectivity - Meaning of the data - Validity of methods used Other issues - Skills needed 41 Questions on the validity of methods used Is it acceptable, under certain conditions, to base official statistics on correlations? If so, what are the conditions? What to do if there is a shock? 42 Reference: statistical use of social media data Research reference: Daas, P.J.H. and Puts, M.J.H. (2014) Social Media Sentiment and Consumer Confidence. European Central Bank Statistics Paper Series No. 5, Frankfurt, Germany. 43 Big Data Characteristics Definition: Volume Velocity Variety Data characteristics: Unstructured data Selectivity Population dynamics Event data Organic data Distributed data Data use: Other ways of processing Fundamentally new applications 44 Overview of Issues Getting access to the data Usability of the data - Meaning of the data, stability of the source, reproducability Methodologal issues - Selectivity, representativeness, unknown population, quality and validity Privacy, confidentiality and reputation IT-infrastructure and security Knowledge and skills Transition from research to production Strategic challenges 45 Possible responses to the issues Invest in good relations with the data provider Invest in methodological research and play with the data to get a grip on quality Use only aggregate data if possible Explore alternatives to population-based estimation methods Keep an open mindset Take the strategic challenges seriously 46 Strategic aspects Others start producing statistics - there may be quality issues - but they are extremely rapid - and there is obviously demand Need for good, impartial information (benchmark information) will remain - without a monopoly for NSIs There is a need for validation of information produced by others 47 Billion Prices Project MIT 48 49 The Roadmap Approach Awareness that Big Data is a strategic issue Position paper for Board of Directors Roadmap Big Data External validation of the Roadmap Roadmap updated twice a year for Board of Directors Roadmap monitor Deputy Director General responsible at strategic level Coordination group for Big Data 50 The Scope of the Roadmap Identification of outputs to be based on Big Data For each output, definition of time target and ownership Identification by owner of conditions to be fulfilled Commitment by supporting services for fulfilling the conditions (IT, data collection, methodological support, …) Supporting programmes 51 The Roadmap Projects Focus projects Road sensor data for traffic intensities statistics Mobile phone data for daytime population statistics Other projects Internet data for price statistics Financial transactions data for statistics Social media data for detecting trends in social cohesion Internet data for encoding enterprise purchases and sales 52 Supporting Programmes Big Data features in: Innovation programme Methodological research programme 53 Cooperation and Collaboration on Big Data Statistics Netherlands works together with: Other NSIs UN, UNECE, EU, WorldBank ESSnet on Big Data (to be confirmed) Government organisations Universities and research organisations Data providers IT providers Big Data firms Research consortia (e.g. H2020) 54 UNECE Big Data Activities – Classification of Big Data sources – Big Data project in 2014, with three Task Teams: -‐ Partnerships -‐ Privacy -‐ Quality – Sandbox in 2014, 2015 and possibly beyond – Big Data survey, together with UNSD Results: http://www1.unece.org/stat/platform/display/bigdata/2014+Project 55 UN Big Data Activities Global Working Group on Big Data for Official Statistics with eight Task Teams: -‐ -‐ -‐ -‐ -‐ -‐ -‐ -‐ Mobile phone data Satellite imagery Social media data Access / partnerships Advocacy / communication Big Data and SDGs Training / skills / capacity building Cross-cutting issues UNSD survey on Big Data for official statistics 56 Draft Big Data Access Principles (UN) Social responsibility Level playing field Equal treatment Confidentiality and security Transparency Respect for business interest Proportionality 57 Conclusion: The Way Forward Get to know Big Data Use Big Data for efficiency and response burden reduction Use Big Data for early indicators Use Big Data for filling gaps and new demands Use new professional methods where needed Create the right environment Don’t do it alone! 58 General references Glasson, M., Trepanier, J., Patruno, V., Daas, P., Skaliotis, M. and Khan, A. (2013) What does "Big Data" mean for Official Statistics? Paper for the High-Level Group for the Modernization of Statistical Production and Services, March 10. Struijs, P., Braaksma, B. and Daas, P. (2014) Official Statistics and Big Data. Big Data & Society, April–June, pp. 1–6. Struijs, P. and Daas, P.J.H. (2013) Big Data, Big Impact? Paper for the Seminar on Statistical Data Collection, Geneva, Switzerland. Struijs, P. and Daas, P. (2014) Quality Approaches to Big Data in Official Statistics. Paper for the European Conference on Quality in Official Statistics 2014, Vienna, Austria. 59 The Future 60 Questions? Thank you for your attention! [email protected] 61