Linear Sketches
with Applications to Data Streams

Andrew McGregor
University of Massachusetts
Linear Sketches

•
Random linear projection M: ℝⁿ → ℝᵏ (where k ≪ n) that
preserves properties of any v ∈ ℝⁿ with high probability.

  [ M ] [ v ] = [ Mv ] → answer
•
Many Results: Estimating norms, entropy, support size,
quantiles, heavy hitters, fitting histograms and polynomials, ...
•
Rich Theory: Related to compressed sensing and sparse
recovery, dimensionality reduction and metric embeddings, ...
Why? Distributed Processing

Input: v = v1 + v2 + v3 + v4, split across four machines.
Each machine computes its local sketch: Mv1, Mv2, Mv3, Mv4.
Output: Mv = Mv1 + Mv2 + Mv3 + Mv4
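Linearity is exactly what makes the distributed merge work: M(v1 + v2 + ...) = Mv1 + Mv2 + .... A tiny Python illustration (the random ±1 projection and all names are mine, not a specific sketch from the talk):

```python
import random

# Illustration: any linear map M commutes with addition, so per-site
# sketches can simply be summed to obtain the sketch of the full input.
random.seed(0)
n, k = 8, 3
M = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def sketch(v):
    return [sum(M[r][i] * v[i] for i in range(n)) for r in range(k)]

v1 = [1, 0, 2, 0, 0, 0, 0, 0]   # data held at site 1
v2 = [0, 3, 0, 0, 1, 0, 0, 0]   # data held at site 2
v = [a + b for a, b in zip(v1, v2)]

# Sketch of the sum equals the sum of the sketches.
assert sketch(v) == [a + b for a, b in zip(sketch(v1), sketch(v2))]
```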
Why? Data Streams

•
Stream: m elements from some universe of size n,
e.g., 3, 5, 3, 7, 5, 4, 8, 5, 3, 7, 5, 4, 8, 6, 3, 2, 6, 4, 7, ...
•
Goal: Estimate properties of the stream, e.g., median, number
of distinct elements, longest increasing sequence.
•
The Catch:
i) Limited working memory, e.g., polylog(n, m)
ii) Access data sequentially & process elements quickly
•
Rich theory with links to communication complexity, pseudorandomness, ...
•
Very applicable to network monitoring, sensor network fusion,
I/O efficiency in external memory, ...
•
Sketches: Can maintain Mf where fi is the frequency of i. On seeing j:

  Mf ← Mf + Mej
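The update rule Mf ← Mf + Mej just adds column j of M, so the sketch is maintainable in one pass over the stream. A minimal Python illustration (the random ±1 matrix is an arbitrary stand-in, not a particular sketch from the talk):

```python
import random

# Maintaining Mf under stream updates: seeing element j adds column j
# of M to the sketch, since M*e_j is exactly that column.
random.seed(1)
n, k = 10, 4
M = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

stream = [3, 5, 3, 7, 5, 4, 8, 5, 3]   # prefix of the slide's example
sketch = [0] * k
for j in stream:
    for r in range(k):
        sketch[r] += M[r][j]            # Mf <- Mf + M e_j

# Same result as applying M to the frequency vector f directly.
f = [stream.count(i) for i in range(n)]
assert sketch == [sum(M[r][i] * f[i] for i in range(n)) for r in range(k)]
```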
I: Classic Sketches II: Graph Sketches
III: Other Things
Classics: Count-Min

•
Count-Min: In each column, place a “1” in a random row.

  [ 0 0 1 1 0 0 0 0 ]                    [ v3 + v4      ]
  [ 1 0 0 0 1 0 1 0 ] [ v1 ... v8 ]ᵀ  =  [ v1 + v5 + v7 ]
  [ 0 1 0 0 0 1 0 1 ]                    [ v2 + v6 + v8 ]

•
Point Queries: For example, use v2 + v6 + v8 as an estimate for v2.
•
Analysis: Over-estimate by ≤ 2F1/k with probability 1/2. Setting
k = O(ε⁻¹) and repeating O(log n) times yields an estimate ṽ of v
such that with high probability:

  ∀i ∈ [n]:  vi ≤ ṽi ≤ vi + εF1   where F1 = Σi |vi|
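The following is a minimal hash-based Count-Min sketch in Python, a sketch of the idea rather than a tuned implementation; the class and parameter names are mine. Each of d rows hashes every item to one of k counters, and a point query takes the minimum over rows, which never underestimates for non-negative updates:

```python
import random

# Minimal hash-based Count-Min sketch (names are illustrative).
# d rows of k counters; each row hashes an item to one counter.
class CountMin:
    def __init__(self, k, d, seed=0):
        rnd = random.Random(seed)
        self.salts = [rnd.random() for _ in range(d)]   # one hash per row
        self.rows = [[0] * k for _ in range(d)]
        self.k = k

    def _bucket(self, r, i):
        return hash((self.salts[r], i)) % self.k

    def update(self, i, delta=1):
        for r, row in enumerate(self.rows):
            row[self._bucket(r, i)] += delta

    def estimate(self, i):
        # Minimum over rows: an overestimate, never an underestimate.
        return min(row[self._bucket(r, i)]
                   for r, row in enumerate(self.rows))

cm = CountMin(k=64, d=5)
for x in [3, 5, 3, 7, 5, 4, 8, 5, 3]:   # stream from the earlier slide
    cm.update(x)
assert 3 <= cm.estimate(5) <= 9          # true count is 3
```

Setting k = O(ε⁻¹) and d = O(log n) matches the parameters in the analysis above.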
Classics: Count-Sketch

•
Count-Sketch: Like Count-Min but non-zero entries ∈ {-1, 1}.

  [ 0 0 1 -1 0  0  0 0 ]                    [ v3 − v4      ]
  [ 1 0 0  0 1  0 -1 0 ] [ v1 ... v8 ]ᵀ  =  [ v1 + v5 − v7 ]
  [ 0 1 0  0 0 -1  0 1 ]                    [ v2 − v6 + v8 ]

•
Point Queries: For example, use v2 − v6 + v8 as an estimate for v2.
•
Analysis: Correct in expectation with variance F2/k. Setting
k = O(ε⁻²) and repeating O(log n) times yields an estimate ṽ of v
such that with high probability:

  ∀i ∈ [n]:  ṽi = vi ± ε√F2   where F2 = Σi |vi|²
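A matching hash-based Count-Sketch in Python (names are mine): like Count-Min, but each row also attaches a random sign to every item, making each row estimate unbiased; the median over rows tames the variance F2/k:

```python
import random
import statistics

# Minimal hash-based Count-Sketch (names are illustrative).
class CountSketch:
    def __init__(self, k, d, seed=0):
        rnd = random.Random(seed)
        self.salts = [rnd.random() for _ in range(d)]
        self.rows = [[0] * k for _ in range(d)]
        self.k = k

    def _bucket(self, r, i):
        return hash((self.salts[r], 0, i)) % self.k

    def _sign(self, r, i):
        return 1 if hash((self.salts[r], 1, i)) % 2 else -1

    def update(self, i, delta=1):
        for r, row in enumerate(self.rows):
            row[self._bucket(r, i)] += self._sign(r, i) * delta

    def estimate(self, i):
        # Each row estimate is unbiased; the median controls variance.
        return statistics.median(self._sign(r, i) * row[self._bucket(r, i)]
                                 for r, row in enumerate(self.rows))

cs = CountSketch(k=64, d=7)
for x in [3, 5, 3, 7, 5, 4, 8, 5, 3]:
    cs.update(x)
assert -3 <= cs.estimate(5) <= 9   # can err in either direction, unbiased
```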
New Classic: Lp-Sampling

•
Goal: Return (i, vi) where i is chosen with probability (1±ε)|vi|ᵖ/Fp.
•
Sketch: Count-Sketch of u where ui = λi^(−1/2) vi and λi ∈R [0,1]:

  [ Count-Sketch ] [ diag(λ1^(−1/2), ..., λn^(−1/2)) ] [ v1 ... vn ]ᵀ

•
Post-Processing: Estimate ũi and return (i, vi) if ũi² > F2/ε, since

  Pr[ui² > F2/ε] = Pr[λi < εvi²/F2] = εvi²/F2

•
Repeat O(1/ε) times to find a sample.
•
Analysis: An O(ε⁻¹ log² n)-dimensional Count-Sketch suffices.
L2 Sampling: Proof Sketch

Recall: An O(w log n)-dimensional Count-Sketch ensures

  ũi = ui ± √(F2(u)/w)

Exercise: F2(u) = O(F2(v) log n) with probability .99

Set w = O(ε⁻¹ log n). If ũi² > F2(v)/ε then

  √(F2(u)/w) ≤ √(εF2(v)) ≤ εũi

Hence we get a (1±ε) approximation if the value is above the threshold.
Lp Sampling: Applications

•
For p=2: Estimating Fk = Σ|vi|ᵏ for k > 2.
Let (i, vi) be an L2 sample and T = F2 |vi|^(k−2). Then

  E[T] = F2 Σi (vi²/F2) |vi|^(k−2) = Fk

Repeat O(ε⁻² n^(1−2/k)) times and return the mean.
•
For p=1: Finding duplicates and estimating entropy.
•
For p=0: Corresponds to sampling from the support of v.
Use for graph sketching...
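The expectation calculation for p=2 can be checked numerically. The snippet below samples i exactly with probability vi²/F2 (standing in for the L2-sampling sketch; the vector v and trial count are arbitrary choices of mine) and verifies that the empirical mean of T = F2·|vi|^(k−2) is close to Fk:

```python
import random

# Numerical check: draw i with probability v_i^2/F2, set T = F2*|v_i|^(k-2);
# then E[T] = sum_i (v_i^2/F2) * F2 * |v_i|^(k-2) = F_k.
random.seed(0)
v = [4.0, 1.0, 3.0, 2.0]
k = 3
F2 = sum(x * x for x in v)            # 30.0
Fk = sum(abs(x) ** k for x in v)      # 100.0

def l2_sample():
    # Exact L2 sampling stands in for the sketch-based sampler.
    return random.choices(range(len(v)), weights=[x * x for x in v])[0]

trials = 100_000
mean_T = sum(F2 * abs(v[l2_sample()]) ** (k - 2)
             for _ in range(trials)) / trials
assert abs(mean_T - Fk) / Fk < 0.05   # empirical mean is close to F_k
```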
Algorithm Template

•
Template: Combine a classic sketch with some transformation:

  [ “Classic” Sketch ] [ Transform Matrix ] [ v1 ... v8 ]ᵀ

•
Examples:
a) Histograms: Count-Sketch + Transform to Haar Basis
b) Periodicity: AMS Sketch + Transform to Fourier Basis
c) Quantiles: Count-Min + Dyadic Interval Dictionary
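To make example (c) concrete, here is the dyadic-interval decomposition in Python (a standalone illustration; the per-level Count-Min bookkeeping is omitted): any interval over a power-of-two universe splits into O(log n) dyadic pieces, so range counts, and hence quantile queries, reduce to O(log n) point-query-style sums:

```python
# Dyadic-interval dictionary behind example (c): split [a, b) into
# maximal dyadic sub-intervals of the universe [lo, hi).
def dyadic_decompose(a, b, lo=0, hi=16):
    if a >= b or hi <= lo:
        return []
    if a <= lo and hi <= b:
        return [(lo, hi)]          # [lo, hi) is fully covered
    mid = (lo + hi) // 2
    return (dyadic_decompose(a, min(b, mid), lo, mid) +
            dyadic_decompose(max(a, mid), b, mid, hi))

# [3, 11) over universe [0, 16) splits into 4 dyadic pieces.
pieces = dyadic_decompose(3, 11)
assert pieces == [(3, 4), (4, 8), (8, 10), (10, 11)]
assert sum(h - l for l, h in pieces) == 8   # pieces tile [3, 11)
```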
I: Classic Sketches II: Graph Sketches
III: Other Things
Mark and Erica are now friends.
  Like · Add Friend
Mark and Erica are no longer friends.
  Like · Add Friend
Eduardo and Mark are now friends.
  Like · Add Friend
...
Lawyers are now friends with everyone.
  Like · Add Friend
Problem #1: Small-Space Dynamic Graph Connectivity

•
Input: Observe a stream of edge inserts/deletes on n nodes.
•
Goal: Using Õ(n) space, maintain the connected components.
•
Note: Easy if there are no deletes; just use Union-Find.
Problem #2: Communication Complexity

•
Input: Each player knows the neighborhood Γ(v) of a node v.
•
Goal: Simultaneously, each player sends O(polylog n) bits to
a central player who then determines if the graph is connected.
•
Note: May assume players have access to public random bits.
It can’t be done!?

•
Suppose there’s a bridge (u,v) in the graph, i.e., a special
friendship that is essential to ensuring the graph is connected.
•
Claim: At least one of the players needs to send Ω(n) bits.
a) The central player needs to know about the special friendship.
b) Participants don’t know which of their friendships are special.
c) Participants may have Ω(n) friends.
How to do it...

•
Players send “appropriate” sketches of their address books.
•
Main Idea: a) Sketch, b) Run the algorithm in sketch space.

  Original Graph → [Sketch] → Sketch Space → [Algorithm] → ANSWER

•
Catch: The sketch must be homomorphic for the algorithm's operations.
Ingredient 1: Basic Connectivity Algorithm

Basic Algorithm (Spanning Forest):
1. For each node: pick an incident edge.
2. For each connected component: pick an incident edge.
3. Repeat until there are no edges between connected components.

Lemma: Takes O(log n) steps and the selected edges
include a spanning forest.
Ingredient 2: Graph Representation

For node i, let ai be a vector indexed by node pairs.
Non-zero entries (for each edge {i,j}): ai[i,j] = 1 if j > i and ai[i,j] = −1 if j < i.

Example:
       {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}
a1 =     1     1     0     0     0     0     0     0     0     0
a2 =    −1     0     0     0     1     0     0     0     0     0

Lemma: For any subset of nodes S ⊂ V,

  support( Σ_{i∈S} ai ) = E(S, V \ S)
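The lemma is easy to verify on a toy graph. In the snippet below (the graph and set S are arbitrary choices of mine), the entries of ai and aj for an edge {i,j} internal to S cancel when summed, leaving exactly the cut edges in the support:

```python
from itertools import combinations

# Check of the lemma: support(sum_{i in S} a_i) = E(S, V \ S).
nodes = [1, 2, 3, 4, 5]
edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)}   # stored with u < w
pairs = list(combinations(nodes, 2))               # index set of the vectors

def a(i):
    # a_i[{u,w}] = +1 if edge {i,w} with w > i, -1 if edge {u,i} with u < i.
    vec = {}
    for (u, w) in pairs:
        if (u, w) in edges:
            if u == i:
                vec[(u, w)] = 1
            elif w == i:
                vec[(u, w)] = -1
    return vec

S = {1, 2}
total = {p: sum(a(i).get(p, 0) for i in S) for p in pairs}
support = {p for p, val in total.items() if val != 0}
cut = {(u, w) for (u, w) in edges if (u in S) != (w in S)}
assert support == cut    # internal edge {1,2} cancelled; cut edges remain
```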
Ingredient 3: L0-Sampling

Recall: There exists a random M: ℝᴺ → ℝᵏ with k = O(log² N)
such that for any a ∈ ℝᴺ,

  Ma → e ∈ support(a)

with probability 9/10.
Recipe: Sketch & Compute on Sketches

Sketch: Apply the L0-sketch matrix M to each aj.

Run the Algorithm in Sketch Space:
  Use Maj to get an incident edge on each node j.
  For i = 2 to t:
    To get an incident edge on component S ⊂ V, use

      Σ_{j∈S} Maj = M( Σ_{j∈S} aj ) → e ∈ support( Σ_{j∈S} aj ) = E(S, V \ S)
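The whole recipe can be simulated on a toy graph, with exact sampling from support(Σ_{j∈S} aj) standing in for the L0 sketch (a real implementation would sample from Σ Maj instead; graph and names here are mine). Borůvka-style merging then recovers the connected components:

```python
import random

# Simulate the recipe: node vectors a_j as on the slides, exact support
# sampling in place of the L0 sketch, Boruvka-style component merging.
random.seed(0)
nodes = [1, 2, 3, 4, 5]
edges = {(1, 2), (2, 3), (4, 5)}        # two components: {1,2,3} and {4,5}

def a(j):
    vec = {}
    for (u, w) in edges:
        if u == j:
            vec[(u, w)] = 1
        elif w == j:
            vec[(u, w)] = -1
    return vec

def sample_cut_edge(S):
    # Internal edges of S cancel in the sum; what survives is the cut.
    total = {}
    for j in S:
        for p, val in a(j).items():
            total[p] = total.get(p, 0) + val
    supp = [p for p, val in total.items() if val != 0]
    return random.choice(supp) if supp else None

components = [{j} for j in nodes]
changed = True
while changed:
    changed = False
    for S in list(components):
        e = sample_cut_edge(S)
        if e is not None:
            other = next(C for C in components
                         if C is not S and (e[0] in C or e[1] in C))
            components.remove(other)
            S |= other                   # merge the two components
            changed = True
            break

assert sorted(sorted(C) for C in components) == [[1, 2, 3], [4, 5]]
```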
Connectivity Results

•
Thm: Can determine connectivity:
a) Of a dynamic graph stream using O(n polylog n) memory.
b) Using simultaneous messages of length O(polylog n).
k-Connectivity

•
A graph is k-connected if every cut has size ≥ k.
•
Thm: Can determine k-connectivity:
a) Of a dynamic graph stream using O(nk polylog n) space.
b) Using simultaneous messages of length O(k polylog n).
•
Extension: A weighted subgraph is a cut-sparsifier if it
preserves all cuts up to a factor (1+ε). Can construct with
O(n ε⁻² polylog n) space or O(ε⁻² polylog n)-length messages.
Ingredient 1: Basic Algorithm

Algorithm (k-Connectivity):
1. Let F1 be a spanning forest of G(V, E).
2. For i = 2 to k:
   2.1. Let Fi be a spanning forest of G(V, E − F1 − ... − Fi−1).

Lemma: G(V, F1 + ... + Fk) is k-connected iff G(V, E) is.
Ingredient 2: Connectivity Sketches

Sketch: Simultaneously construct k independent
sketches {M1G, M2G, ..., MkG} for connectivity.

Run the Algorithm in Sketch Space:
  Use M1G to find a spanning forest F1 of G.
  Use M2G − M2F1 = M2(G − F1) to find F2.
  Use M3G − M3F1 − M3F2 = M3(G − F1 − F2) to find F3.
  etc.
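The peeling idea can be illustrated with plain edge sets standing in for the sketches (the union-find code below is a generic implementation of mine, not from the talk): each round removes one spanning forest, and the union of the k forests is a certificate with at most k(n−1) edges:

```python
# Illustration of the k-connectivity certificate: peel off k spanning
# forests; their union preserves k-connectivity (Lemma above).
def spanning_forest(nodes, edges):
    parent = {v: v for v in nodes}
    def find(v):                         # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    forest = []
    for (u, w) in edges:
        ru, rw = find(u), find(w)
        if ru != rw:
            parent[ru] = rw
            forest.append((u, w))
    return forest

def certificate(nodes, edges, k):
    remaining, cert = list(edges), []
    for _ in range(k):
        f = spanning_forest(nodes, remaining)
        cert += f
        remaining = [e for e in remaining if e not in f]
    return cert

nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3), (2, 4)]   # K4
cert = certificate(nodes, edges, k=2)
assert len(cert) <= 2 * (len(nodes) - 1)   # at most k(n-1) edges kept
```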
[Figure: Original Graph and Sparsifier Graph]
I: Classic Sketches II: Graph Sketches
III: Other Things
Geometric Sketches

•
Input: Set of points p1, p2, ..., pn ∈ {1, ..., Δ}ᵈ
•
Goal: Estimate geometric properties, e.g., diameter, width,
clustering cost, Steiner tree weight, min-cost matchings, ...
•
Basic Idea: Consider points at different resolutions and relate
numerical properties of quantizations to the geometric problem.
Other Stream Algorithms

•
Order-Dependent Functions: Longest increasing subsequence,
time-series data, well-balanced parentheses, ...
•
Multi-Pass Streams: Space complexity vs. p passes
a) Length-k increasing subsequence: space = Θ(k^(1 + 1/(2ᵖ − 1)))
b) Median of a length-m stream: space = Θ(m^(1/p))
•
Stochastic Streams: Space complexity vs. sample complexity
a) The stream is a sequence of i.i.d. samples from some unknown
distribution. Goal: Infer parameters of the distribution.
b) The stream is formed by randomly subsampling an original
stream. Goal: Infer properties of the original stream.
Summary

•
Sketches: Linear projections that (approximately) preserve
relevant properties of vectors, point-sets, graphs, etc.
Embarrassingly parallelizable and applicable to data streams.
•
Further References:
  Courses:
    http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11
    http://people.cs.umass.edu/~mcgregor/courses/CS711S12/
    http://stellar.mit.edu/S/course/6/fa07/6.895/
  Book: Data Stream Algorithms. McGregor and Muthukrishnan (forthcoming)
  Blog: http://polylogblog.wordpress.com/