Synthetic Data Generation for Firm Links - NSF

Synthetic Data Generation for Firm Links
The Synthetic Longitudinal Business Database
Saki Kinney
6th January 2016
A portion of this work was conducted by Special Sworn Status researchers of the U.S. Census
Bureau at the Triangle Federal Statistical Research Data Center. Research results and conclusions
expressed are those of the authors and do not necessarily reflect the views of the Census Bureau.
Results have been screened to ensure that no confidential data are revealed. This work has been
supported by the US Census Bureau; Phase 1 by NSF Grant ITR-0427889.
RTI International is a registered trademark and a trade name of Research Triangle Institute.
www.rti.org
Longitudinal Business Database (LBD)

Longitudinal economic census covering all private non-farm business establishments with paid employees
– Developed by U.S. Census Bureau Center for Economic Studies
– Starts with 1976, updated annually
– >30 million establishments
– Low depth, high coverage

Unique research dataset used for looking at business formation and growth, job flows, market volatility, business cycles, international comparisons…
– Linkable to hundreds of datasets (within secure computing environment)

Confidential data protected by US law (Title 13 and Title 26)
Research Access to the LBD

Data can only be accessed in a secure Federal Statistical Research Data Center (RDC)
– May require travel
– All outputs require disclosure review

Project-specific applications required
– Straightforward but very time consuming. Additional time for background checks and special data requests
– Involves substantial user fee or institutional membership
SynLBD project

Goal is to create public-use file for LBD using synthetic data methods
– Test case for generating public-use business microdata
– Part of larger goal to expand research access

Provide users with disclosure-proofed microdata that permits users to draw valid inferences for subset of uses
– Reduce the number of requests for special tabulations.
– No need to utilize the RDC network for some researchers.
– Aids users requiring RDC access.

Phase 1: Initial SynLBD released
– First public-use establishment-level microdata.

Phase 2 (underway): Adding firm links, geography, other improvements
SynLBD Access

80 researchers from ~40 institutions and 6 countries have requested access
7 researchers requested validation
Frequent requests include firm structure, geographic detail, NAICS codes, longer time series
Small proposal required – quick turnaround, reviewed for feasibility
– Access via remote desktop emulating RDC environment (Cornell Virtual RDC)
– All SynLBD analyses can be released without disclosure review.
– Or submitted for validation on restricted data (subject to disclosure review)
Why (partially) synthetic data?

Concerns about confidentiality protection for longitudinal census of establishments
– Data are more disclosive than cross-sectional samples of people.
– No actual values of confidential variables may be released (i.e., swapping, etc. would provide insufficient protection)
Variables Used (Phase 2)
Table 1: Synthetic LBD Variable Names

Variable  Name        Type         Description                        Synthesized
y1        Firstyear   Categorical  First year establishment exists    Yes
y2        Lastyear    Categorical  Last year establishment exists     Yes
y3 (t)    Inactive    Binary       Inactive in year t                 Yes
y4 (t)    Multiunit   Binary       Part of multiunit firm in year t   Yes
y5 (t)    Employment  Continuous   March 12th employment year t       Yes
y6 (t)    Payroll     Continuous   Total payroll in year t            Yes
y7 (t)    Firm ID     Categorical  Firm ID in year t                  Yes
x1        Geography   Categorical  State                              No
x2        NAICS       Categorical  3 digit Industry Code              No

Notes:
– There is also a randomly generated estab ID number, LBDnum.
– Published SynLBD contains one implicate, excludes Geography, Inactive, and FirmID, uses SIC instead of NAICS
Synthesis: General Approach

Generate joint posterior predictive distribution of Y|X
– $f(y_1, y_2, y_3, \ldots \mid X) = f(y_1 \mid X)\, f(y_2 \mid y_1, X)\, f(y_3 \mid y_1, y_2, X) \cdots$

To draw from a posterior predictive distribution $f(y_k \mid X, y_1, \ldots, y_{k-1})$
– Fit model using observed data
– Draw new values of model parameters from their posterior distributions
– Use new parameters to predict $y_k$ from $X$ and synthetic values of $y_1, \ldots, y_{k-1}$
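
The three-step draw can be illustrated with a minimal sketch. This is not the SynLBD code: it assumes every synthesized variable is binary, uses a Beta-Bernoulli model with a uniform Beta(1, 1) prior within each cell of the conditioning variables, and the function names `synthesize_binary` and `synthesize_chain` are hypothetical. The point is only the ordering: each y_k is fit on observed data but predicted from X and the already-synthesized y_1, …, y_{k-1}.

```python
import random
from collections import defaultdict


def synthesize_binary(observed, synthetic, target, parents, rng):
    """Synthesize one binary variable given its conditioning variables."""
    # Step 1: fit the model on observed data ([failures, successes] per cell).
    counts = defaultdict(lambda: [0, 0])
    for row in observed:
        cell = tuple(row[p] for p in parents)
        counts[cell][row[target]] += 1

    drawn = {}
    for row in synthetic:
        cell = tuple(row[p] for p in parents)
        if cell not in drawn:
            # Step 2: draw a new parameter from its posterior
            # (assumed Beta(1, 1) prior, so posterior is Beta(s + 1, f + 1)).
            failures, successes = counts[cell]
            drawn[cell] = rng.betavariate(successes + 1, failures + 1)
        # Step 3: predict y_k from X and the synthetic y_1, ..., y_{k-1}.
        row[target] = int(rng.random() < drawn[cell])


def synthesize_chain(observed, x_vars, y_vars, seed=0):
    """Apply f(y1|X) f(y2|y1,X) ... in order; X itself is not synthesized."""
    rng = random.Random(seed)
    synthetic = [{x: row[x] for x in x_vars} for row in observed]
    for k, y in enumerate(y_vars):
        synthesize_binary(observed, synthetic, y, x_vars + y_vars[:k], rng)
    return synthetic
```

Running `synthesize_chain` several times with different seeds would give multiple implicates of the same observed file.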
First Year

Year establishment enters LBD
Impute Firstyear | NAICS, State using variant of Dirichlet-Multinomial model
– Informative “confidentiality prior” used to protect small cells
– Synthetic values obtained from sampling from posterior multinomial distribution
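
A minimal sketch of a Dirichlet-multinomial draw for one NAICS-by-State cell follows. The form of the confidentiality prior is not given on the slide, so the symmetric pseudo-count `alpha` used here is an assumption, and `synthesize_first_year` is a hypothetical name. Larger `alpha` pulls the drawn probabilities toward uniform, which matters most in cells with few establishments.

```python
import random


def synthesize_first_year(observed_counts, n_synthetic, alpha, rng):
    """Draw synthetic first years for one cell.

    observed_counts: dict mapping year -> observed establishment count.
    alpha: assumed symmetric prior pseudo-count added to every year.
    """
    years = sorted(observed_counts)
    # A Dirichlet draw is a vector of independent Gamma draws, normalized.
    gammas = [rng.gammavariate(observed_counts[y] + alpha, 1.0) for y in years]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Sample synthetic values from the multinomial with the drawn probabilities.
    return rng.choices(years, weights=probs, k=n_synthetic)
```

For example, `synthesize_first_year({1976: 50, 1977: 3, 1978: 0}, 200, alpha=1.0, rng=random.Random(0))` returns 200 synthetic first years; because of the prior, a year with zero observed entries can still appear.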
Last Year

Year establishment exits LBD
Impute Lastyear | Firstyear, State, SIC
Dirichlet-multinomial with flat prior (“simple multinomial”)
– Multinomial cell probabilities for synthetic data obtained from matching cells in observed data
Inactive and multiunit status

Longitudinal binary indicators
– Inactive = I(estab. exited but will return)
– Multiunit = I(estab. part of multiunit firm)

Modeled with simplified version of Dirichlet-multinomial (Beta-binomial) for each year.
Status does not change for most establishments
Employment and Payroll

Highly skewed continuous variables
Imputed year by year for employment, then year by year for payroll
– Impute emp(t)|emp(t-1), other predictors
– Impute pay(t)|pay(t-1), emp(t), other predictors

CART models with Bayesian bootstrap and optional density smoothing (Reiter 2005)
– Employment: For births predict employment; thereafter predict changes
CART synthesis method

Goal: Synthesize Y | X.
Grow maximum tree. Prune for confidentiality.
– Partition X space so that subsets of units formed by partitions have relatively homogeneous Y.

Drawing Y:
– For any X, trace down the tree until reaching the appropriate leaf.
– Draw Y from leaf using Bayesian bootstrap, optionally with smoothed density estimate (with agency-specified bandwidth).
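
The tree-then-draw procedure can be sketched in plain Python. This is an illustration under stated assumptions, not the production method: the tree is a basic squared-error regression tree, a minimum leaf size (`min_leaf`) stands in for pruning for confidentiality, and the optional smoothing is a Gaussian kernel with standard deviation `bandwidth`. The names `grow_tree` and `draw_y` are hypothetical.

```python
import random


def grow_tree(X, y, min_leaf):
    """Split recursively on the (feature, threshold) minimizing squared error."""
    def sse(vals):
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # leaf would be too small to release draws from
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return {"leaf": list(y)}
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": grow_tree([X[i] for i in left], [y[i] for i in left], min_leaf),
        "right": grow_tree([X[i] for i in right], [y[i] for i in right], min_leaf),
    }


def draw_y(tree, x, rng, bandwidth=0.0):
    """Trace x down to its leaf, then draw Y with a Bayesian bootstrap."""
    while "leaf" not in tree:
        go_left = x[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    values = tree["leaf"]
    # Bayesian bootstrap: Dirichlet(1, ..., 1) weights over the leaf values
    # (unnormalized Gamma(1, 1) draws are enough for weighted sampling).
    weights = [rng.gammavariate(1.0, 1.0) for _ in values]
    y = rng.choices(values, weights=weights)[0]
    # Optional smoothing: a draw from a Gaussian kernel centered on y.
    return y + rng.gauss(0.0, bandwidth) if bandwidth > 0 else y
```

With `bandwidth=0.0` every synthetic value is an observed value from the same leaf, which is why smoothing or a larger minimum leaf size may be wanted for skewed variables such as employment.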
Employment and Payroll
– CART synthesis was found to be too good at predicting outliers, so multiplicative noise was applied to outliers prior to synthesis (Evans, Slanta, & Zayatz).
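
As a rough sketch of multiplicative noise applied to outliers, the function below multiplies each value above a cutoff by a factor that is bounded away from 1. The cutoff rule, the noise range `[a, b]`, and the name `perturb_outliers` are all assumptions made for illustration; the slide does not give the parameters actually used.

```python
import random


def perturb_outliers(values, threshold, a=0.05, b=0.15, seed=0):
    """Multiply values above `threshold` by 1 +/- u, with u uniform on [a, b]."""
    rng = random.Random(seed)
    out = []
    for v in values:
        if v > threshold:
            magnitude = rng.uniform(a, b)  # never smaller than a, so v always moves
            sign = rng.choice([-1, 1])     # inflate or deflate with equal chance
            v = v * (1 + sign * magnitude)
        out.append(v)
    return out
```

Values at or below the cutoff pass through unchanged, so only the tail that CART would otherwise reproduce too closely is altered before the tree is fit.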
Bias observed in Phase 1
[Figure: Job Creation Rates: LBD and Implicates by Year. Horizontal axis: Year (1977 to 2000); vertical axis ticks: 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]
[Figure: Job Destruction Rates: LBD and Implicates by Year. Horizontal axis: Year (1977 to 2000); vertical axis ticks: 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]
Firm ID

Firm ID: Identification variable linking establishments that belong to the same firm.
Can vary year to year due to mergers, acquisitions, sales, etc.
For most establishments it does not change
Feature most requested by SynLBD users and potential users
Synthesis of Firm IDs
1. Impute, year by year, binary variables indicating whether establishments keep or switch firm ID (small % of establishments switch)
– Longitudinal binary variables with continuous predictors, modeled using CART approach
2. For new and switching estabs, predict firm characteristics (employment, # estabs, age, etc.) simultaneously using multivariate recursive partition trees (mvpart)
3. Link real and synthetic establishments using propensity score matching on firm characteristics.
Firm Links
4. Assign Synthetic Firm ID = real Firm ID of linked establishment
5. Discard synthetic firm characteristics – not needed and not safe to release.
– Firm characteristics computed using Firm IDs of synthetic establishments, ensuring logical consistency
6. Replace synthetic firm IDs with pseudo-identifiers
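
Steps 3 to 6 can be sketched as follows. This is a simplified stand-in, not the method used for SynLBD: instead of a fitted propensity score it assumes each establishment already carries a scalar `score` summarizing its firm characteristics, and it links by nearest score. The field names `lbdnum`, `firm_id`, `score` and the function `link_and_mask` are hypothetical.

```python
import random


def link_and_mask(synthetic_estabs, real_estabs, seed=0):
    """Return {synthetic lbdnum: pseudo firm ID}.

    synthetic_estabs: dicts with 'lbdnum' and 'score'.
    real_estabs: dicts with 'firm_id' and 'score'.
    """
    rng = random.Random(seed)

    # Steps 3-4: link each synthetic establishment to the real establishment
    # with the closest score, and take over that establishment's firm ID.
    linked = {}
    for s in synthetic_estabs:
        donor = min(real_estabs, key=lambda r: abs(r["score"] - s["score"]))
        linked[s["lbdnum"]] = donor["firm_id"]

    # Step 5: the scores are not carried into the output.

    # Step 6: replace real firm IDs with randomly assigned pseudo-identifiers,
    # keeping establishments of the same firm together.
    real_ids = sorted(set(linked.values()))
    pseudo = list(range(1, len(real_ids) + 1))
    rng.shuffle(pseudo)
    mapping = dict(zip(real_ids, pseudo))
    return {lbdnum: mapping[firm_id] for lbdnum, firm_id in linked.items()}
```

Firm-level characteristics such as the number of establishments can then be recomputed from the returned IDs, for example with `collections.Counter(ids.values())`, so they stay consistent with the synthetic establishments.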
Disclosure Risk Assessment

Fundamental trade-off between disclosure risk and analytic utility.
– Disclosure risk = probability of re-identification of establishment and/or attributes

Phase 1 overall approach was ‘Maximum knowledge’ intruder scenario
– With both observed data and synthetic data in hand, can mapping be obtained?

In Phase 1, satisfactory differential privacy assessment done on a subset.
– Phase 1 analyses can be replicated in Phase 2, adapted for firms
– Additional work needed for Phase 2 changes and additions
Current Status

Phase 2 synthesis is being finalized
Next steps
– Evaluate analytical validity
– Disclosure risk analysis

Future
– Seek IRS and Census DRB disclosure approval
– Updating SynLBD as LBD updates, multiple implicates (?)
Thank you

For more info or to access Phase 1 SynLBD:
– http://www.census.gov/ces/dataproducts/synlbd/
Phase 2 coauthors: Jerry Reiter, Javier Miranda
Additional Phase 1 coauthors: John Abowd, Ron Jarmin, Arnold Reznek
Other support: Lars Vilhuber, Kevin McKinney, many others
