Synthetic Data Generation for Firm Links - NSF
Synthetic Data Generation for
The Synthetic Longitudinal Business Database
6th January 2016
A portion of this work was conducted by Special Sworn Status researchers of the U.S. Census
Bureau at the Triangle Federal Statistical Research Data Center. Research results and conclusions
expressed are those of the authors and do not necessarily reflect the views of the Census Bureau.
Results have been screened to ensure that no confidential data are revealed. This work has been
supported by the US Census Bureau; Phase 1 by NSF Grant ITR-0427889.
RTI International is a registered trademark and a trade name of Research Triangle Institute.
Longitudinal Business Database (LBD)
Longitudinal economic census covering all private non-farm business
establishments with paid employees
Developed by U.S. Census Bureau Center for Economic Studies
– Starts with 1976, updated annually
– >30 million establishments
– Low depth, high coverage
Unique research dataset used for looking at business formation and
growth, job flows, market volatility, business cycles, international
Linkable to hundreds of datasets (within secure computing environment)
Confidential data protected by US law (Title 13 and Title 26)
Research Access to the LBD
Data can only be accessed in a secure Federal Statistical Research
Data Center (RDC)
May require travel
– All outputs require disclosure review
Project-specific applications required
Straightforward but very time-consuming. Additional time needed for background
checks and special data requests
– Involves substantial user fee or institutional membership
Goal is to create public-use file for LBD using synthetic data methods
Test case for generating public-use business microdata
– Part of larger goal to expand research access
Provide users with disclosure-proofed microdata that permits users
to draw valid inferences for subset of uses
Reduce the number of requests for special tabulations.
– No need to utilize the RDC network for some researchers.
– Aids users requiring RDC access.
Phase 1: Initial SynLBD released
First public-use establishment-level microdata.
Phase 2 (underway): Adding firm links, geography, other
80 researchers from ~40 institutions and 6 countries have requested
7 researchers requested validation
Frequent requests include firm structure, geographic detail, NAICS
codes, longer time series
Small proposal required – quick turnaround, reviewed for feasibility
Access via remote desktop emulating RDC environment (Cornell Virtual
– All SynLBD analyses can be released w/o disclosure review.
– Or submitted for validation on restricted data (subject to disclosure review)
Why (partially) synthetic data?
Concerns about confidentiality protection for longitudinal census of
Data are more disclosive than cross-sectional samples of people.
– No actual values of confidential variables may be released (i.e., swapping,
etc. would provide insufficient protection)
Variables Used (Phase 2)
Table 1: Synthetic LBD Variable Names
Name        Type         Description
Firstyear   Categorical  First year establishment exists
Lastyear    Categorical  Last year establishment exists
Inactive    Binary       Inactive in year t
Multiunit   Binary       Part of multiunit firm in year t
Employment  Continuous   March 12th employment in year t
Payroll     Continuous   Total payroll in year t
FirmID      Categorical  Firm ID in year t
NAICS       Categorical  3-digit industry code
There is also a randomly generated estab ID number,
– The published SynLBD contains one implicate; excludes Geography, Inactive,
and FirmID; and uses SIC instead of NAICS
Synthesis: General Approach
Generate joint posterior predictive distribution of Y|X
$f(y_1, y_2, y_3, \ldots \mid X) = f(y_1 \mid X)\, f(y_2 \mid y_1, X)\, f(y_3 \mid y_1, y_2, X) \cdots$
To draw from a posterior predictive distribution $f(y_k \mid X, y_1, \ldots, y_{k-1})$:
– Fit model using observed data
– Draw new values of model parameters from their posterior distributions
– Use new parameters to predict $y_k$ from $X$ and synthetic values of $y_1, \ldots, y_{k-1}$
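The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the SynLBD models: each conditional is a one-predictor normal linear regression with a flat prior, and the second variable is conditioned on the first only (a simplification of conditioning on X and all earlier y). The function names `fit_and_draw` and `synthesize_chain` are hypothetical.

```python
import random

def fit_and_draw(x_obs, y_obs, x_new, rng):
    """One synthesis step: fit on observed data, draw parameters from
    their posterior, then predict y for x_new with the drawn parameters.
    Assumed model: y = a + b * (x - xbar) + e, e ~ N(0, sigma2), flat prior."""
    n = len(x_obs)
    xbar = sum(x_obs) / n
    ybar = sum(y_obs) / n
    sxx = sum((x - xbar) ** 2 for x in x_obs)
    # Step 1: fit the model using observed data (least squares).
    b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(x_obs, y_obs)) / sxx
    sse = sum((y - ybar - b_hat * (x - xbar)) ** 2 for x, y in zip(x_obs, y_obs))
    # Step 2: draw new parameter values from their posterior distributions.
    # sigma2 ~ SSE / chi-square(n - 2); chi-square(v) is Gamma(v / 2, scale 2).
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2.0)
    # With centered x, intercept and slope are independent given sigma2.
    a = rng.gauss(ybar, (sigma2 / n) ** 0.5)
    b = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)
    # Step 3: predict from the drawn parameters, adding residual noise.
    sd = sigma2 ** 0.5
    return [rng.gauss(a + b * (x - xbar), sd) for x in x_new]

def synthesize_chain(x, y1, y2, seed=0):
    """Synthesize y1 | X, then y2 | y1 using the synthetic y1 as predictor."""
    rng = random.Random(seed)
    y1_syn = fit_and_draw(x, y1, x, rng)
    y2_syn = fit_and_draw(y1, y2, y1_syn, rng)  # model fit on observed y1, y2
    return y1_syn, y2_syn
```

Note that the second model is fit on observed values but predicts from synthetic values of the earlier variable, which is what keeps the synthetic record internally consistent.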
Year establishment enters LBD
Impute Firstyear | NAICS, State using variant of Dirichlet-Multinomial
Informative “confidentiality prior” used to protect small cells
– Synthetic values obtained from sampling from posterior multinomial
Year establishment exits LBD
Impute Last Year | First Year, State, SIC
Dirichlet-multinomial with flat prior (“simple multinomial”)
Multinomial cell probabilities for synthetic data obtained from matching
cells in observed data
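A minimal sketch of Dirichlet-multinomial synthesis within one cell (for example, one industry-by-state cell) may make the two imputation steps above concrete. The prior values are an assumption: the slides do not specify the "confidentiality prior", so `prior` here is just a dictionary of hypothetical positive pseudo-counts, with all ones standing in for a flat prior. The function name is also hypothetical.

```python
import random

def synthesize_categories(observed, categories, prior, n_syn, rng):
    """Sample n_syn synthetic categorical values for one cell.

    observed   -- observed category values in the cell
    categories -- list of all possible categories (e.g. years)
    prior      -- dict of positive Dirichlet pseudo-counts (assumed values)
    """
    counts = {c: 0 for c in categories}
    for value in observed:
        counts[value] += 1
    # The posterior is Dirichlet(prior + counts); a Dirichlet draw is a
    # vector of independent Gamma(alpha_c, 1) draws divided by their sum.
    gammas = [rng.gammavariate(prior[c] + counts[c], 1.0) for c in categories]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Synthetic values are multinomial draws with the sampled probabilities.
    return rng.choices(categories, weights=probs, k=n_syn)

years = [1976, 1977, 1978]
flat_prior = {y: 1.0 for y in years}  # hypothetical flat prior
synthetic_years = synthesize_categories(
    [1976] * 5 + [1978] * 3, years, flat_prior, 8, random.Random(0)
)
```

A more informative prior (larger pseudo-counts) pulls small cells toward the prior, which is the mechanism by which a prior can protect sparsely populated cells.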
Inactive and multiunit status
Longitudinal binary indicators
– Inactive = I(estab. exited but will return)
– Multiunit = I(estab. part of multiunit firm)
Modeled with simplified version of Dirichlet-multinomial (Beta-binomial) for each year.
Status does not change for most establishments
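As a hedged illustration of the Beta-binomial idea for a yearly binary indicator, the sketch below conditions year t status on year t-1 status only. The conditioning set, the Beta(1, 1) prior, and the function name are assumptions made for the example; the slides do not list the actual predictors.

```python
import random

def synthesize_status(prev_obs, curr_obs, prev_syn, rng, a=1.0, b=1.0):
    """Synthesize a 0/1 indicator for year t given status in year t-1.

    prev_obs, curr_obs -- observed status in years t-1 and t
    prev_syn           -- synthetic status in year t-1
    a, b               -- Beta prior parameters (assumed, not from the source)
    """
    p = {}
    for state in (0, 1):
        # Observed year-t outcomes among units with this year t-1 status.
        cell = [c for pr, c in zip(prev_obs, curr_obs) if pr == state]
        ones = sum(cell)
        # Posterior for P(status_t = 1 | status_{t-1} = state).
        p[state] = rng.betavariate(a + ones, b + len(cell) - ones)
    # Bernoulli draw for each synthetic unit using its own cell's probability.
    return [1 if rng.random() < p[s] else 0 for s in prev_syn]
```

Because most establishments keep their status, the two posterior probabilities sit near 0 and 1, so synthetic status histories also change rarely.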
Employment and Payroll
Highly skewed continuous variables
Imputed year by year for employment, then year by year for payroll
Impute emp(t)|emp(t-1), other predictors
– Impute pay(t)|pay(t-1), emp(t), other predictors
CART models with Bayesian bootstrap and optional density
smoothing (Reiter 2005)
Employment: For births predict employment; thereafter predict changes
CART synthesis method
Goal: Synthesize Y | X.
Grow maximum tree. Prune
Partition X space so that subsets of units formed by partitions have relatively homogeneous Y.
For any X, trace down tree until reaching the appropriate leaf.
– Draw Y from leaf using Bayesian bootstrap, optionally with smoothed density
estimate (with agency-specified bandwidth).
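The procedure above can be sketched with a toy regression tree. This is a simplified illustration, not the production code: it splits on a single numeric predictor, grows the tree until leaves reach a minimum size, and omits both pruning and the optional density smoothing. All function names and the `min_leaf` parameter are hypothetical.

```python
import random

def grow_tree(xs, ys, min_leaf=5):
    """Recursively split on x to minimize within-node sum of squares of y."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        cost = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:  # too few units to split further: make a leaf
        return {"leaf": [y for _, y in pairs]}
    i = best[1]
    left, right = pairs[:i], pairs[i:]
    return {
        "cut": (pairs[i - 1][0] + pairs[i][0]) / 2,
        "left": grow_tree([x for x, _ in left], [y for _, y in left], min_leaf),
        "right": grow_tree([x for x, _ in right], [y for _, y in right], min_leaf),
    }

def find_leaf(tree, x):
    """Trace x down the tree and return the observed y values in its leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["cut"] else tree["right"]
    return tree["leaf"]

def synthesize_from_tree(tree, x_new, rng):
    """Draw synthetic y for each x via a Bayesian bootstrap within its leaf."""
    weights = {}  # one Dirichlet(1, ..., 1) weight vector per leaf
    out = []
    for x in x_new:
        leaf = find_leaf(tree, x)
        if id(leaf) not in weights:
            # Normalized Exp(1) draws are Dirichlet(1, ..., 1) weights;
            # rng.choices normalizes the weights internally.
            weights[id(leaf)] = [rng.expovariate(1.0) for _ in leaf]
        out.append(rng.choices(leaf, weights=weights[id(leaf)], k=1)[0])
    return out
```

Without smoothing, every synthetic value is an observed value from the same leaf, which is one reason a smoothing step is offered as an option.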
Employment and Payroll
CART synthesis was found to be too good at predicting outliers, so
multiplicative noise was applied to outliers prior to synthesis (Evans, Slanta, &
Bias observed in Phase 1
Job Creation Rates: LBD and Implicates by Year
Job Destruction Rates: LBD and Implicates by Year
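The multiplicative noise applied to outliers before synthesis could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the slides give neither the outlier rule nor the noise distribution, so the fixed `threshold` and the factor range (10 to 20 percent up or down) are hypothetical.

```python
import random

def perturb_outliers(values, threshold, rng, low=0.1, high=0.2):
    """Multiply values above `threshold` by a random factor away from 1.

    The factor lies in [1 - high, 1 - low] or [1 + low, 1 + high], so a
    perturbed value always differs from the original by at least `low`.
    All parameter values are illustrative assumptions.
    """
    out = []
    for v in values:
        if v > threshold:
            shift = rng.uniform(low, high)
            factor = 1 + shift if rng.random() < 0.5 else 1 - shift
            out.append(v * factor)
        else:
            out.append(v)  # non-outliers pass through unchanged
    return out
```

Perturbing before synthesis means the tree leaves no longer contain the exact extreme values, so a synthetic draw cannot reproduce them.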
Firm ID: Identification variable linking establishments that belong to
the same firm.
Can vary year to year due to mergers, acquisitions, sales, etc.
For most establishments it does not change
Feature most requested by SynLBD users and potential users
Synthesis of Firm IDs
Impute, year by year, binary variables indicating whether
establishments keep or switch firm ID (small % of establishments
Longitudinal binary variables with continuous predictors, modeled using
For new and switching estabs, predict firm characteristics
(employment, # estabs, age, etc.) simultaneously using multivariate
recursive partition trees (mvpart)
Link real and synthetic establishments using propensity score
matching on firm characteristics.
Assign Synthetic Firm ID = real Firm ID of linked establishment
Discard synthetic firm characteristics – not needed and not safe to
Firm characteristics computed using Firm IDs of synthetic establishments,
ensuring logical consistency
Replace synthetic firm IDs with pseudo-identifiers
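The linking and relabeling steps above might be sketched as follows. This assumes a scalar score has already been computed for every real and synthetic establishment (the score model itself is not shown), uses nearest-neighbor matching with replacement as one possible matching rule, and relabels firms with shuffled integers. The function name and data layout are hypothetical.

```python
import random

def assign_firm_ids(real, synthetic_scores, rng):
    """Assign pseudo firm IDs to synthetic establishments.

    real             -- list of (score, firm_id) for real establishments
    synthetic_scores -- one precomputed score per synthetic establishment
    """
    # Link each synthetic establishment to the real one with the closest
    # score and take over that establishment's firm ID.
    linked = [min(real, key=lambda r: abs(r[0] - s))[1] for s in synthetic_scores]
    # Replace the real firm IDs with randomly ordered pseudo-identifiers so
    # that no real identifier appears in the output.
    firm_ids = sorted(set(linked))
    labels = list(range(1, len(firm_ids) + 1))
    rng.shuffle(labels)
    pseudo = dict(zip(firm_ids, labels))
    return [pseudo[f] for f in linked]

ids = assign_firm_ids(
    [(0.1, "A"), (0.5, "B"), (0.9, "C")], [0.12, 0.11, 0.88], random.Random(0)
)
```

Synthetic establishments linked to the same real firm share a pseudo-identifier, so firm-level characteristics can then be recomputed from the synthetic establishments themselves.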
Disclosure Risk Assessment
Fundamental trade-off between disclosure risk and analytic utility.
Phase 1 overall approach was ‘Maximum knowledge’ intruder
Disclosure risk = probability of re-identification of establishment and/or
With both observed data and synthetic data in hand, can mapping be
In Phase 1, satisfactory differential privacy assessment done on a
Phase 1 analyses can be replicated in Phase 2, adapted for firms
– Additional work needed for Phase 2 changes and additions
Phase 2 synthesis is being finalized
Evaluate analytical validity
– Disclosure risk analysis
Seek IRS and Census DRB disclosure approval
Updating SynLBD as LBD updates, multiple implicates (?)
For more info or to access Phase 1 SynLBD:
Phase 2 coauthors: Jerry Reiter, Javier Miranda
Additional Phase 1 coauthors: John Abowd, Ron Jarmin, Arnold
Other support: Lars Vilhuber, Kevin McKinney, many others