Linear Sketches
Transcription
Linear Sketches with Applications to Data Streams
Andrew McGregor, University of Massachusetts

Linear Sketches
• Random linear projection M: ℝn→ℝk (where k≪n) that preserves properties of any v∈ℝn with high probability.
  [Figure: applying M to v gives the sketch Mv, from which the answer is computed.]
• Many Results: Estimating norms, entropy, support size, quantiles, heavy hitters, fitting histograms and polynomials, ...
• Rich Theory: Related to compressed sensing and sparse recovery, dimensionality reduction and metric embeddings, ...

Why? Distributed Processing
• Input: v = v1 + v2 + v3 + v4, split across machines.
• Output: Mv = Mv1 + Mv2 + Mv3 + Mv4, so each machine sketches its own share and the sketches are simply summed.

Why? Data Streams
• Stream: m elements from some universe of size n, e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...
• Goal: Estimate properties of the stream, e.g., median, number of distinct elements, longest increasing sequence.
• The Catch: i) Limited working memory, e.g., polylog(n,m). ii) Access data sequentially & process elements quickly.
• Rich theory with links to communication complexity, pseudorandomness, ... Very applicable to network monitoring, sensor network fusion, I/O efficiency in external memory, ...
• Sketches: Can maintain Mf where fi is the frequency of i. On seeing j: Mf ← Mf + Mej.

I: Classic Sketches II: Graph Sketches III: Other Things

Classics: Count-Min
• Count-Min: In each column, place a “1” in a random row. For example:

      [ 0 0 1 1 0 0 0 0 ]   [ v1 ]   [ v3 + v4      ]
      [ 1 0 0 0 1 0 1 0 ] · [ ⋮  ] = [ v1 + v5 + v7 ]
      [ 0 1 0 0 0 1 0 1 ]   [ v8 ]   [ v2 + v6 + v8 ]

• Point Queries: For example, use v2 + v6 + v8 as an estimate for v2.
• Analysis: Over-estimate by ≤ 2F1/k with probability 1/2. Setting k = O(1/ε) and repeating O(log n) times yields an estimate ṽ of v such that with high probability:

      ∀i ∈ [n]:  vi ≤ ṽi ≤ vi + εF1,  where F1 = Σi |vi|
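In code, a Count-Min row is just an array of counters plus a hash function. The following minimal Python sketch is illustrative only: the class name is invented here, and plain random lookup tables stand in for the pairwise-independent hash functions the analysis assumes. It shows the linear update Mf ← Mf + Mej and the min-of-rows point query.

```python
import random

class CountMin:
    """Toy Count-Min sketch: d rows of k counters; in each row every
    coordinate i is assigned one random counter (the random "1" per column)."""
    def __init__(self, k, d, n, seed=0):
        rng = random.Random(seed)
        # hashes[r][i] = column of the single 1 for coordinate i in row r
        self.hashes = [[rng.randrange(k) for _ in range(n)] for _ in range(d)]
        self.counters = [[0] * k for _ in range(d)]

    def update(self, i, delta=1):
        # linearity: this turns the sketch of v into the sketch of v + delta*e_i
        for r, h in enumerate(self.hashes):
            self.counters[r][h[i]] += delta

    def query(self, i):
        # for non-negative streams every row over-estimates v_i,
        # so take the minimum over the d rows
        return min(self.counters[r][self.hashes[r][i]]
                   for r in range(len(self.counters)))
```

With k = O(1/ε) and d = O(log n) this matches the guarantee above. Deletions (delta = −1) are also supported since the sketch is linear, although the min-estimator analysis assumes the final frequencies are non-negative.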
Classics: Count-Sketch
• Count-Sketch: Like Count-Min, but the non-zero entries are random signs in {−1, +1}. For example:

      [ 0 0 1 −1 0  0  0 0 ]   [ v1 ]   [ v3 − v4      ]
      [ 1 0 0  0 1  0 −1 0 ] · [ ⋮  ] = [ v1 + v5 − v7 ]
      [ 0 1 0  0 0 −1  0 1 ]   [ v8 ]   [ v2 − v6 + v8 ]

• Point Queries: For example, use v2 − v6 + v8 as an estimate for v2.
• Analysis: Correct in expectation with variance F2/k. Setting k = O(1/ε²) and repeating O(log n) times yields an estimate ṽ of v such that with high probability:

      ∀i ∈ [n]:  ṽi = vi ± ε√F2,  where F2 = Σi |vi|²

New Classic: Lp-Sampling
• Goal: Return (i, vi) where i is chosen with probability proportional to (1±ε)|vi|^p.
• Sketch (shown for p = 2): Apply Count-Sketch to the rescaled vector u, where ui = λi^(−1/2) · vi and each λi is chosen uniformly from [0,1], i.e.

      [ Count-Sketch ] · diag(λ1^(−1/2), ..., λn^(−1/2)) · v

• Post-Processing: Estimate ũi and return (i, vi) if ũi² > F2/ε.

      Pr[ui² > F2/ε] = Pr[λi < εvi²/F2] = εvi²/F2

• Repeat O(1/ε) times to find a sample.
• Analysis: An O(ε⁻¹ log² n)-dimensional Count-Sketch suffices.

L2 Sampling: Proof Sketch
• Recall: An O(w log n)-dimensional Count-Sketch ensures ũi = ui ± √(F2(u)/w).
• Exercise: F2(u) = O(F2(v) log n) with probability 0.99.
• Set w = O(ε⁻¹ log n). If ũi² > F2(v)/ε then

      √(F2(u)/w) ≤ √(εF2(v)) ≤ εũi

  Hence we get a (1±ε) approximation whenever the value is above the threshold.

Lp Sampling: Applications
• For p=2: Estimating Fk = Σ|vi|^k for k>2. Let (i, vi) be an L2 sample and T = F2 · |vi|^(k−2). Then

      E[T] = F2 · Σi (vi²/F2) · |vi|^(k−2) = Fk

  Repeat O(ε⁻² n^(1−2/k)) times and return the mean.
• For p=1: Finding duplicates and estimating entropy.
• For p=0: Corresponds to sampling from the support of v. Used for graph sketching...
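The scaling trick behind L2 sampling can be checked in isolation. The toy Python sketch below is my own simplification: it uses the exact values ui where the real algorithm would use Count-Sketch estimates ũi, the function names are invented, and an attempt is declared successful only when a single coordinate passes the threshold.

```python
import random

def l2_sample_attempt(v, eps, rng):
    """One attempt of the scaling trick: u_i = v_i / sqrt(lambda_i) with
    lambda_i uniform in [0,1]; coordinate i passes if u_i^2 > F2/eps,
    which happens with probability eps * v_i^2 / F2.
    Simplification: exact u_i in place of a Count-Sketch estimate."""
    f2 = sum(x * x for x in v)
    winners = []
    for i, x in enumerate(v):
        lam = rng.random()
        if lam > 0 and (x * x) / lam > f2 / eps:
            winners.append(i)
    # declare success only if exactly one coordinate passed the threshold
    return winners[0] if len(winners) == 1 else None

def l2_sample(v, eps, seed=0):
    """Repeat attempts (expected O(1/eps) of them) until one succeeds."""
    rng = random.Random(seed)
    while True:
        i = l2_sample_attempt(v, eps, rng)
        if i is not None:
            return i
```

Over many runs the returned index i appears with frequency roughly proportional to vi², matching the Pr[λi < εvi²/F2] calculation above.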
Algorithm Template
• Template: Combine a classic sketch with some transformation:

      [ “Classic” Sketch ] · [ “Transform Matrix” ] · v

• Examples:
  a) Histograms: Count-Sketch + Transform to Haar Basis
  b) Periodicity: AMS Sketch + Transform to Fourier Basis
  c) Quantiles: Count-Min + Dyadic Interval Dictionary

I: Classic Sketches II: Graph Sketches III: Other Things

[Motivating slide: a social-network news feed]
  Mark and Erica are now friends.
  Mark and Erica are no longer friends.
  Eduardo and Mark are now friends.
  ...
  Lawyers are now friends with everyone.

Problem #1: Small-Space Dynamic Graph Connectivity
• Input: Observe a stream of edge inserts/deletes on n nodes.
• Goal: Using Õ(n) space, maintain the connected components.
• Note: Easy if there are no deletes; just use Union-Find.

Problem #2: Communication Complexity
• Input: Each player knows the neighborhood Γ(v) of a node v.
• Goal: Simultaneously, each player sends O(polylog n) bits to a central player who then determines if the graph is connected.
• Note: May assume the players have access to public random bits.

It can’t be done!?
• Suppose there’s a bridge (u,v) in the graph, i.e., a special friendship that is essential to ensuring the graph is connected.
• Claim(?): At least one of the players needs to send Ω(n) bits, because:
  a) The central player needs to know about the special friendship.
  b) Participants don’t know which of their friendships are special.
  c) Participants may have Ω(n) friends.

How to do it...
• Players send “appropriate” sketches of their address books.
• Main Idea: a) Sketch the original graph. b) Run the algorithm in sketch space to get the ANSWER.
• Catch: The sketch must be homomorphic with respect to the algorithm’s operations.

Ingredient 1: Basic Connectivity Algorithm
Basic Algorithm (Spanning Forest):
1. For each node: pick an incident edge.
2. For each connected component: pick an incident edge.
3. Repeat until there are no edges between connected components.
Lemma: This takes O(log n) rounds, and the selected edges include a spanning forest.

Ingredient 2: Graph Representation
For node i, let ai be a vector indexed by node pairs. Non-zero entries: ai[{i,j}] = 1 if {i,j} ∈ E and j > i; ai[{i,j}] = −1 if {i,j} ∈ E and j < i.
Example (graph on nodes {1,...,5} whose edges include {1,2}, {1,3}, {2,3}, ...):

           {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}
    a1 = (   1     1     0     0     0     0     0     0     0     0  )
    a2 = (  −1     0     0     0     1     0     0     0     0     0  )

Note a1 + a2 is supported on exactly {1,3} and {2,3}: the entry for the internal edge {1,2} cancels. In general:
Lemma: For any subset of nodes S ⊂ V,

      support( Σ_{i∈S} ai ) = E(S, V \ S)

since the entries for edges with both endpoints in S cancel.

Ingredient 3: L0-Sampling
Recall: There exists a random M: ℝN→ℝk with k = O(log² N) such that for any a ∈ ℝN, from Ma we can recover a random e ∈ support(a) with probability 9/10.

Recipe: Sketch & Compute on Sketches
• Sketch: Apply the L0-sketch matrix M to each aj.
• Run Algorithm in Sketch Space: Use Maj to get an incident edge on each node j. Then, for rounds i = 2 to t, to get an incident edge on a component S ⊂ V use:

      Σ_{j∈S} Maj = M( Σ_{j∈S} aj ) → e ∈ support( Σ_{j∈S} aj ) = E(S, V \ S)

Connectivity Results
• Thm: Can determine connectivity:
  a) Of a dynamic graph stream using O(n polylog n) memory.
  b) Using simultaneous messages of length O(polylog n).

k-Connectivity
• A graph is k-connected if every cut has size ≥ k.
• Thm: Can determine k-connectivity:
  a) Of a dynamic graph stream using O(nk polylog n) space.
  b) Using simultaneous messages of length O(k polylog n).
• Extension: A weighted subgraph is a cut-sparsifier if it preserves all cuts up to a factor (1+ε). Can construct with O(nε⁻² polylog n) space or O(ε⁻² polylog n)-length messages.

Ingredient 1: Basic Algorithm
Algorithm (k-Connectivity):
1. Let F1 be a spanning forest of G(V,E).
2. For i = 2 to k: let Fi be a spanning forest of G(V, E − F1 − ... − F(i−1)).
Lemma: G(V, F1 + ... + Fk) is k-connected iff G(V,E) is.

Ingredient 2: Connectivity Sketches
• Sketch: Simultaneously construct k independent connectivity sketches {M1G, M2G, ..., MkG}.
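The cancellation behind the graph-representation lemma is easy to check directly. The toy Python check below uses uncompressed vectors and hypothetical helper names; a real sketch would apply an L0-sampling matrix M to each ai rather than store the vectors explicitly.

```python
from itertools import combinations

def node_vectors(n, edges):
    """a_i is indexed by node pairs {i,j}: entry +1 if {i,j} in E and j > i,
    entry -1 if {i,j} in E and j < i."""
    pairs = list(combinations(range(1, n + 1), 2))
    index = {p: t for t, p in enumerate(pairs)}
    a = {i: [0] * len(pairs) for i in range(1, n + 1)}
    for (i, j) in edges:                     # edges given with i < j
        a[i][index[(i, j)]] = +1
        a[j][index[(i, j)]] = -1
    return pairs, a

def cut_support(n, edges, S):
    """support(sum_{i in S} a_i): edges with both endpoints in S contribute
    +1 and -1 and cancel, so exactly the cut edges E(S, V \\ S) survive."""
    pairs, a = node_vectors(n, edges)
    total = [sum(a[i][t] for i in S) for t in range(len(pairs))]
    return {pairs[t] for t, x in enumerate(total) if x != 0}
```

For edges {1,2}, {1,3}, {2,3}, {3,4}, {4,5} and S = {1,2}, the internal edge {1,2} cancels and the surviving support is {(1,3), (2,3)}, exactly the cut edges.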
• Run Algorithm in Sketch Space:
  Use M1G to find a spanning forest F1 of G.
  Use M2G − M2F1 = M2(G − F1) to find F2.
  Use M3G − M3F1 − M3F2 = M3(G − F1 − F2) to find F3, etc.
  [Figure: original graph and its cut-sparsifier.]

I: Classic Sketches II: Graph Sketches III: Other Things

Geometric Sketches
• Input: Set of points p1, p2, ..., pn ∈ {1, ..., Δ}^d.
• Goal: Estimate geometric properties, e.g., diameter, width, clustering cost, Steiner tree weight, min-cost matchings, ...
• Basic Idea: Consider the points at different resolutions and relate numerical properties of the quantizations to the geometric problem.

Other Stream Algorithms
• Order Dependent Functions: Longest increasing subsequence, time-series data, well-balanced parentheses, ...
• Multi-Pass Streams: Space complexity vs. p passes:
  a) Length-k increasing subsequence: space = Θ(k^(1 + 1/(2^p − 1)))
  b) Median of a length-m stream: space = Θ(m^(1/p))
• Stochastic Streams: Space complexity vs. sample complexity:
  a) Stream is a sequence of iid samples from some unknown distribution. Goal: Infer parameters of the distribution.
  b) Stream is formed by randomly subsampling an original stream. Goal: Infer properties of the original stream.

Summary
• Sketches: Linear projections that (approximately) preserve relevant properties of vectors, point-sets, graphs, etc. Embarrassingly parallelizable and applicable to data streams.
• Further References:
  Courses:
    http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11
    http://people.cs.umass.edu/~mcgregor/courses/CS711S12/
    http://stellar.mit.edu/S/course/6/fa07/6.895/
  Book: Data Stream Algorithms.
    McGregor and Muthukrishnan (forthcoming).
  Blog: http://polylogblog.wordpress.com/