Big Data and Official Statistics – Experiences at Statistics Netherlands

Transcription

Big Data and Official Statistics – Experiences at Statistics Netherlands
Big Data and Official Statistics – Experiences
at Statistics Netherlands
Peter Struijs
Poznań, Poland, 10 September 2015
Outline
Big Data and official statistics
Experiences at Statistics Netherlands with:
- use of road sensor data
- use of mobile phone location data
- use of public social media messages
Issues and solutions
Strategic, policy and organisational challenges
Cooperation and collaboration
2
What is Big Data?
3
4
Data sources and approaches
Surveys / questionnaires
sampling theory
Administrative data sources
Where does Big Data fit in?
New methods may be needed, e.g. modeling for
nowcasting and other methods not based on sampling
theory
5
Potential Opportunities
New statistics
More detailed statistics
More timely statistics
Nowcasts and early indicators
Quality improvement
Response burden reduction
Cost reduction and higher efficiency
6
Examples of possible Big Data sources
Road sensor data
Mobile phone location data
Public social media messages
Websites
Google Trends
Satellite information
Etc…
7
Data scientists involved in the research shown
Piet Daas (photo)
May Offermans
Marco Puts
Martijn Tennekes
8
Statistics based on road sensor data
Aim: Statistics on traffic intensities
Characteristics of the data source
Research on the usability of the data
Process of using the data for statistics
Issues when using traffic loop data
9
Road sensor data
Source: National Data Warehouse for Traffic
Information (NDW)
There are 20.000 traffic loops on Dutch
motorways, and 40.000 on provincial roads
Each minute (24/7) the number of passing
vehicles is counted, and their average speed
Three different length classes are
distinguished
No identification of vehicles
Around 230 million records a day used
10
Locations
The main roads
11
A special dike
12
Road sensors in the dike
13
Minute data of one sensor for 196 days
14
Researching the data
Cross correlation between sensor pairs
- Used to validate metadata
Trajectory speed vs. point speed
- Average speed is 98 Km/h
Small, medium-sized & large vehicles
22
Sensors in a road segment
17
Process of making traffic intensities statistics
Select sensors on Dutch highways
Preprocessing
-­‐
-­‐
-­‐
-­‐
Remove non-informative variables
Remove bad records
Exclude bad sensors
Quality indicators for daily data per sensor
Processing
-­‐
-­‐
-­‐
-­‐
18
Reduce dimensions on same road and region
Obtain number of vehicles for each road and region
For each road and region, calculate monthly traffic intensity
Use of R-Hadoop
Validation and publication
Data options
Historical database
-­‐ Request data via web interface
-­‐ Minute data for all highways (48 variables, Jan 2010April 2014: around 2.5 TB)
Data stream
-­‐ Every minute, all data for all active sensors
-­‐ Continuously collected
19
Road sensor data: Issues and non-issues
Non-is s ues :
Privacy
Data acquisition
Is s ues :
Methodology
- Selectivity
- Quality
Infrastructural needs
Other issues
- Skills needed
- Transition from research to
regular statistics
20
References: statistical use of road sensor data
Publication of statistical results (in Dutch):
http://www.cbs.nl/nl-NL/menu/themas/verkeervervoer/publicaties/artikelen/archief/2015/a13-drukste-rijksweg.htm
Explanation in English:
http://www.cbs.nl/NR/rdonlyres/25CE3592-A756-42B7-BABFC3E4C4E9375B/0/a13busiestnationalmotorwayinthenetherlands.pdf
Research reference:
Puts, M., Tennekes, M. and Daas, P. (2014) Using Road Sensor Data for
Official Statistics: Towards a Big Data Methodology. Paper for Strata +
Hadoop World, Barcelona, Spain.
21
Statistics based on mobile phone location data
Why use mobile phone data for official statistics?
Characteristics of the data source
Research on the usability of the data
Issues when using mobile phone location data
Solutions to the issues
22
Possible uses of mobile phone data
Daytime population statistics
Mobility statistics
Tourism statistics
Other uses
23
Mobile phone activity as a data source
Nearly every person in the Netherlands has a mobile
phone
-­‐ Usually on them
-­‐ Almost always switched on
-­‐ Many people are very active during the day
There is a grid of antennas with good coverage
Data of a single mobile company was used
-­‐ Hourly aggregates per area
-­‐ Threshold of 15 events
24
Daytime population based on mobile phone data
Issues when using mobile phone data
Privacy
Data acquisition
Methodology
- Representativeness
- Selectivity
- Quality
Other issues
- Infrastructure
- Skills needed
26
Solutions
Agreement with data provider to provide only
aggregates and apply a threshold
Data provider performed analysis of mobile phone
ownership characteristics
A large number of analyses were made, with the regular
population registration data as a reference
A number of assumptions had to be made
27
References: statistical use of mobile phone data
Research references:
Daas, P.J.H., Puts, M.J., Buelens, B. and van den Hurk, P.A.M. (2015)
Big Data as a Source for Official Statistics. Journal of Official Statistics
31(2), pp. 249-262.
Daas, P. and Burger, J. (2014) Profiling big data sources to assess their
selectivity. Paper for the 2015 New Techniques and Technologies for
Statistics conference, Brussels, Belgium.
28
Mobile phone data versus road sensor data
29
Statistics based on social media data
Why use social media data for official statistics?
Characteristics of the data source
Research on the usability of the data
Issues when using social media data
30
Possible uses of social media data
Sentiment indicators
- e.g. consumer confidence index
Social indicators
- e.g. social coherence indices
Other uses
31
Social media
– Dutch are very active on social media!
-­‐ Around 60% according to a surveyna altijd bij zich en staat vrijwel altijd aan
• Steeds meer mensen hebben een smartphone!
– Mogelijke informatiebron voor:
-­‐ Welke onderwerpen zijn actueel:
• Aantal berichten en sentiment hierover
-­‐ Als meetinstrument te gebruiken voor:
• .
32
Map by Eric Fischer (via Fast Company)
The data
All social media messages:
-
that are written in Dutch
and are public
These messages are systematically and
instantly collected by the Dutch firm
Coosto
Dataset of more than 3.5 billion messages:
-
covering June 2010 till the present
between 3-4 million new messages added per day
33
Research question
C an we replicate the consumer confidence index
by only using social media data,
while reducing production time?
34
Sentiment determination
‘Bag of words’ approach
-
list of Dutch words with their associated sentiment
added social media specific words (‘FAIL’, ‘LOL’, ‘OMG’ etc.)
Use overall score to determine sentiment
-
is either positive, negative or neutral
Average sentiment per period (day / week / month)
-
(#positive - #negative)/#total * 100%
35
Sentiment per platform
(~10%)
(~80%)
Build a model
Idea: Fitting characteristics derived from social media
messages to consumer confidence
Success: If correlation can be found that is high and
remains high, that is, has predictive power
37
Figure 1. Development of daily, weekly and monthly aggregates of social media sentiment from June 2010 until November 2013, in green, red
and black, respectively. In the insert the development of consumer confidence is shown for the identical period.
38
Results
High correlation achieved (0.9)
Changes in consumer confidence preceed changes in
sentiment by one week
Short processing time, so time-to-market may be
reduced.
Sentiment index can be produced on a weekly basis
To be considered:
- Use model-based figures as early indicators
- Reduce sampling of consumer confidence index
39
General sentiment indicator (draft version)
40
Issues when using social media data
Lesser issues:
Privacy
Data acquisition
Main issues:
Methodology
- Selectivity
- Meaning of the data
- Validity of methods used
Other issues
- Skills needed
41
Questions on the validity of methods used
Is it acceptable, under certain conditions, to base official
statistics on correlations?
If so, what are the conditions?
What to do if there is a shock?
42
Reference: statistical use of social media data
Research reference:
Daas, P.J.H. and Puts, M.J.H. (2014) Social Media Sentiment and
Consumer Confidence. European Central Bank Statistics Paper Series
No. 5, Frankfurt, Germany.
43
Big Data Characteristics
Definition:
Volume
Velocity
Variety
Data characteristics:
Unstructured data
Selectivity
Population dynamics
Event data
Organic data
Distributed data
Data use:
Other ways of processing
Fundamentally new applications
44
Overview of Issues
Getting access to the data
Usability of the data
- Meaning of the data, stability of the source, reproducability
Methodologal issues
- Selectivity, representativeness, unknown population, quality and validity
Privacy, confidentiality and reputation
IT-infrastructure and security
Knowledge and skills
Transition from research to production
Strategic challenges
45
Possible responses to the issues
Invest in good relations with the data provider
Invest in methodological research and play with the data
to get a grip on quality
Use only aggregate data if possible
Explore alternatives to population-based estimation
methods
Keep an open mindset
Take the strategic challenges seriously
46
Strategic aspects
Others start producing statistics
- there may be quality issues
- but they are extremely rapid
- and there is obviously demand
Need for good, impartial information
(benchmark information) will remain
- without a monopoly for NSIs
There is a need for validation of
information produced by others
47
Billion Prices Project MIT
48
49
The Roadmap Approach
Awareness that Big Data is a strategic issue
Position paper for Board of Directors
Roadmap Big Data
External validation of the Roadmap
Roadmap updated twice a year for Board of Directors
Roadmap monitor
Deputy Director General responsible at strategic level
Coordination group for Big Data
50
The Scope of the Roadmap
Identification of outputs to be based on Big Data
For each output, definition of time target and ownership
Identification by owner of conditions to be fulfilled
Commitment by supporting services for fulfilling the
conditions (IT, data collection, methodological support, …)
Supporting programmes
51
The Roadmap Projects
Focus projects
Road sensor data for traffic intensities statistics
Mobile phone data for daytime population statistics
Other projects
Internet data for price statistics
Financial transactions data for statistics
Social media data for detecting trends in social cohesion
Internet data for encoding enterprise purchases and sales
52
Supporting Programmes
Big Data features in:
Innovation programme
Methodological research
programme
53
Cooperation and Collaboration on Big Data
Statistics Netherlands works together with:
Other NSIs
UN, UNECE, EU, WorldBank
ESSnet on Big Data (to be confirmed)
Government organisations
Universities and research organisations
Data providers
IT providers
Big Data firms
Research consortia (e.g. H2020)
54
UNECE Big Data Activities
– Classification of Big Data sources
– Big Data project in 2014, with three Task Teams:
-­‐ Partnerships
-­‐ Privacy
-­‐ Quality
– Sandbox in 2014, 2015 and possibly beyond
– Big Data survey, together with UNSD
Results:
http://www1.unece.org/stat/platform/display/bigdata/2014+Project
55
UN Big Data Activities
Global Working Group on Big Data for Official Statistics
with eight Task Teams:
-­‐
-­‐
-­‐
-­‐
-­‐
-­‐
-­‐
-­‐
Mobile phone data
Satellite imagery
Social media data
Access / partnerships
Advocacy / communication
Big Data and SDGs
Training / skills / capacity building
Cross-cutting issues
UNSD survey on Big Data for official statistics
56
Draft Big Data Access Principles (UN)
Social responsibility
Level playing field
Equal treatment
Confidentiality and security
Transparency
Respect for business interest
Proportionality
57
Conclusion: The Way Forward
Get to know Big Data
Use Big Data for efficiency and
response burden reduction
Use Big Data for early indicators
Use Big Data for filling gaps and
new demands
Use new professional methods
where needed
Create the right environment
Don’t do it alone!
58
General references
Glasson, M., Trepanier, J., Patruno, V., Daas, P., Skaliotis, M. and Khan,
A. (2013) What does "Big Data" mean for Official Statistics? Paper for
the High-Level Group for the Modernization of Statistical Production
and Services, March 10.
Struijs, P., Braaksma, B. and Daas, P. (2014) Official Statistics and Big
Data. Big Data & Society, April–June, pp. 1–6.
Struijs, P. and Daas, P.J.H. (2013) Big Data, Big Impact? Paper for the
Seminar on Statistical Data Collection, Geneva, Switzerland.
Struijs, P. and Daas, P. (2014) Quality Approaches to Big Data in Official
Statistics. Paper for the European Conference on Quality in Official
Statistics 2014, Vienna, Austria.
59
The Future
60
Questions?
Thank you for your attention!
[email protected]
61