M390C: Probabilistic Combinatorics
Last updated: February 26, 2015

1 Lecture 1

We introduce the basic idea of the probabilistic method, along with some running examples. These objects will reappear in later lectures, as we refine our tools.

Typical problem: prove that there exists an object with a desired property $T$.

The probabilistic method: take a random object. Show that with positive probability it has property $T$ (and therefore such an object must exist). This method is particularly useful when explicit construction is difficult. We illustrate this method with three examples.

1.1 Example 1: Tournament

A tournament is a complete directed graph, where $u \to v$ if $u$ beats $v$. Say that a tournament has property $T(k)$ if for every set of $k$ teams, there is one team who beat them all. Show that if $k < n$ and
$$\binom{n}{k}\,(1 - 2^{-k})^{n-k} < 1,$$
then there exists a tournament with property $T(k)$.

Tournament analysis: For example, property $T(1)$ means there exists a set of cycles that span the graph. For given $n$ and $k$, constructing an explicit tournament with property $T(k)$ seems difficult. Here is the probabilistic solution.

Take a tournament uniformly at random: orient each of the $\binom{n}{2}$ edges in one of its two directions, each with probability $1/2$, independently. Take a set $K$ of $k$ teams. Let $A_K$ be the bad event that there does NOT exist a team who beat them all. For each of the $n - k$ remaining teams, the probability that it beats all of the chosen $k$ teams is $2^{-k}$, and these events are independent across the remaining teams. Thus
$$P(A_K) = (1 - 2^{-k})^{n-k}.$$
There are $\binom{n}{k}$ sets of $k$ teams. By the union bound (footnote 1),
$$P\Big(\bigcup_{K \subset V,\ |K| = k} A_K\Big) \le \sum_{K} P(A_K) = \binom{n}{k}\,(1 - 2^{-k})^{n-k} < 1$$
by our assumption. Thus the complement event $\bigcap_K A_K^c$, which is property $T(k)$, occurs with positive probability. Thus, there exists a tournament with property $T(k)$.

1.2 Example 2: Ramsey numbers

The complete graph on $n$ vertices, denoted $K_n$, is the graph where each vertex is connected to all other vertices. A $k$-coloring of a graph is an assignment of a color to each vertex, out of $k$ possible color choices.
One can also assign colors to the edges instead. In this case, it is called an edge coloring. A $k$-clique in a graph $G$ is a complete subgraph $K_k$ in $G$. In a colored graph, a monochromatic clique is a clique where all vertices (or edges) have the same color.

Definition 1 The Ramsey number $R(k, \ell)$ is the smallest integer $n$ such that in any two-coloring of the edges of the complete graph $K_n$ on $n$ vertices by red and blue, either there is a red $K_k$, or there is a blue $K_\ell$.

Hidden in this definition is the claim that such a number exists (i.e., is finite) for any pair of positive integers $(k, \ell)$. That is, given integers $k, \ell \ge 1$, the definition claims that one can always find a complete graph large enough that every two-coloring of its edges contains a red $K_k$ or a blue $K_\ell$. This claim itself is called Ramsey's theorem.

Ramsey's theorem. $R(k, \ell)$ exists (i.e., is finite) for any pair of integers $k, \ell \ge 1$.

Problem 1. For $k \ge 1$, what is $R(k, 1)$? What is $R(1, k)$?

Problem 2. Prove Ramsey's theorem by showing that $R(k, \ell) \le R(k - 1, \ell) + R(k, \ell - 1)$.

Now that we have convinced ourselves that $R(k, \ell)$ exists, the next question is, what is it? Unfortunately, a formula for arbitrary $k, \ell$ is an open problem. A fast algorithm for computing $R(k, \ell)$ is also open. In fact, we only know this number exactly for $k \le 5$, $\ell \le 4$. For other numbers we have bounds: for example, $R(5, 5) \in [43, 49]$ at the time of writing.

Footnote 1: For events $\{A_1, \ldots, A_m\}$, the union bound says $P(A_1 \cup \ldots \cup A_m) \le P(A_1) + \ldots + P(A_m)$. (That is, the probability that any of these events occurs is at most the sum of their probabilities.)

The situation is summarized by Joel Spencer, who attributed the story to Erdős. Imagine a bunch of aliens, much more powerful than us, land on Earth and demand $R(5, 5)$. Then we should gather all of our computers and mathematicians, and attempt to find the value. But suppose, instead, that they ask for $R(6, 6)$. Then we should attempt to destroy the aliens.

Problem 3.
Consider the naïve algorithm of finding $R(5, 5)$ by checking all possible two-colorings of $K_{43}, K_{44}, \ldots, K_{49}$. How many such colorings are there on $K_{43}$?

Ramsey's theorem is an instance of Ramsey theory, a family of theorems that roughly say: 'if a structure is big enough, order emerges'. Wikipedia states:

Problems in Ramsey theory typically ask: 'how many elements of some structure must there be to guarantee that a particular property will hold?'

Results of this kind have numerous applications, some of which we will derive in this class. The probabilistic method is a very good tool for tackling these questions. Our goal will be to bound the Ramsey number using the probabilistic method.

1.3 First bound on the Ramsey number

Proposition 2 If
$$\binom{n}{k}\, 2^{1 - \binom{k}{2}} < 1,$$
then $R(k, k) > n$. Thus, $R(k, k) > \lfloor 2^{k/2} \rfloor$ for all $k \ge 3$.

Proof: First, let us prove the 'thus'. Suppose $k \ge 3$. Choose $n = \lfloor 2^{k/2} \rfloor$. We shall use the bound
$$\binom{n}{k} = \frac{n!}{(n-k)!\,k!} < \frac{n^k}{k!}.$$
Applying this bound to the first term, and expanding out the second term, we get
$$\binom{n}{k}\, 2^{1 - \binom{k}{2}} < \frac{n^k}{k!} \cdot \frac{2}{2^{(k^2 - k)/2}} = \frac{n^k}{2^{k^2/2}} \cdot \frac{2^{1 + k/2}}{k!}.$$
But $n^k = \lfloor 2^{k/2} \rfloor^k \le 2^{k^2/2}$, and for $k \ge 3$, $2^{1 + k/2} < k!$, so
$$\frac{n^k}{2^{k^2/2}} \cdot \frac{2^{1 + k/2}}{k!} < 1.$$
Thus by the first claim, $R(k, k) > n$.

Now, let us prove the first claim. Fix an $n$ satisfying the hypothesis. We need to show that there exists a coloring of $K_n$ with no monochromatic $k$-clique. The idea is to find one at random. For each edge of $K_n$, color the edge red or blue, each with probability $1/2$, independently. For a set $K$ of $k$ vertices, let $A_K$ be the event that they form a monochromatic $k$-clique. Then
$$P(A_K) = 2^{1 - \binom{k}{2}}.$$
Indeed, imagine that we are coloring the edges sequentially. After fixing the color of the first edge, there are $\binom{k}{2} - 1$ uncolored edges remaining. The clique is monochromatic if and only if all of these edges have the same color as our first edge, and this happens with probability $2^{1 - \binom{k}{2}}$.
By the union bound,
$$P\Big(\bigcup_K A_K\Big) \le \sum_K P(A_K) = \binom{n}{k}\, 2^{1 - \binom{k}{2}} < 1$$
by our assumption. So, with positive probability, our random coloring of $K_n$ contains no monochromatic $k$-clique. Thus, such a coloring exists, and so $R(k, k) > n$.

1.4 Take-away messages

The probabilistic method is the following: to find an object, take a random one (and show that we succeed with positive probability). In Proposition 2, we needed to find a two-coloring of $K_n$ with no monochromatic $k$-cliques. The 'object' is a two-coloring of $K_n$. 'Property $T$' is 'contains no monochromatic $k$-cliques'. The random construction is the simplest, most natural one: color each edge independently at random with a uniformly chosen color.

There are three other take-away points in this proof.

• It is a counting argument. This is the 'combinatorics' part of the method.

• We applied a bound to simplify quantities that appeared. In this case, we used the union bound. We will encounter 'fancier' bounds in the class, and in practice part of the art is to apply the correct bound at the correct step.

• We could have applied more complicated random constructions: for example, we could have chosen red with probability $p$, blue with probability $1 - p$, and hoped to tune $p$ to obtain an optimal lower bound. Sometimes this yields substantially better results. See Lecture 3.

2 Lecture 2: Existence using expectation

Lemma 3 Let $X$ be a random variable with $E(X) = c < \infty$. Then either $X = c$ with probability 1, or each of the events $\{X > c\}$ and $\{X < c\}$ happens with positive probability.

Example application:

Theorem 4 (Szele 1943) There is a tournament with $n$ players and at least $n!\,2^{-(n-1)}$ Hamiltonian paths.

(Recall that a Hamiltonian path in a directed graph is a directed path that visits each vertex exactly once. A Hamiltonian path in a tournament is a ranking of all the players $a > b > c > \cdots$ in which each player beat the next one.) It may seem surprising that a tournament can have exponentially many different Hamiltonian paths.
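As a sanity check of Theorem 4 for small $n$, the following sketch enumerates every tournament on $n = 4$ players by brute force and counts Hamiltonian paths in each. The function name `hamiltonian_path_counts` is made up for this illustration; the enumeration is only feasible for very small $n$.

```python
from fractions import Fraction
from itertools import combinations, permutations, product
from math import factorial


def hamiltonian_path_counts(n):
    """Return the number of Hamiltonian paths of every tournament on n players."""
    pairs = list(combinations(range(n), 2))
    counts = []
    # Each tournament is one choice of winner for every unordered pair.
    for orientation in product([True, False], repeat=len(pairs)):
        beats = {}
        for (u, v), u_wins in zip(pairs, orientation):
            beats[(u, v)] = u_wins
            beats[(v, u)] = not u_wins
        # A permutation is a Hamiltonian path if each player beats the next one.
        count = sum(
            all(beats[(order[i], order[i + 1])] for i in range(n - 1))
            for order in permutations(range(n))
        )
        counts.append(count)
    return counts


n = 4
counts = hamiltonian_path_counts(n)
average = Fraction(sum(counts), len(counts))
print("average:", average, "bound:", Fraction(factorial(n), 2 ** (n - 1)))
print("best tournament has", max(counts), "Hamiltonian paths")
```

The average over all tournaments matches $n!\,2^{-(n-1)}$, so at least one tournament must reach that value, which is exactly the existence statement of the theorem.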
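Proof: Take a random tournament as before. Let $X$ denote the number of Hamiltonian paths. For each permutation $\sigma$, let $X_\sigma$ be the indicator of the event that $\sigma$ is a Hamiltonian path. Such a path has $n - 1$ edges, so
$$E(X_\sigma) = P(\sigma \text{ is Hamiltonian}) = 2^{-(n-1)}.$$
There are $n!$ permutations, so
$$E(X) = E\Big(\sum_\sigma X_\sigma\Big) = \sum_\sigma E(X_\sigma) = n!\,2^{-(n-1)}.$$
By Lemma 3, $X \ge E(X)$ with positive probability, so some tournament has at least $n!\,2^{-(n-1)}$ Hamiltonian paths.

Example application 2: Let $\mathcal{F}$ be a family of subsets of $[n]$. Say that $\mathcal{F}$ is an antichain if there are no two distinct sets $A, B \in \mathcal{F}$ such that $A \subset B$. For example, if $\mathcal{F} = \mathcal{F}_k$ is the family of subsets of size $k$ of $[n]$, then $\mathcal{F}_k$ is an antichain with $|\mathcal{F}_k| = \binom{n}{k}$. We can represent $\mathcal{F}$ as a graph on a subset of vertices of the $n$-hypercube, where two vertices $A$ and $B$ are adjacent if either $A \subset B$ or $B \subset A$. Then $\mathcal{F}$ is an antichain if the graph is totally disconnected. Call $\mathcal{F}_k$ the $k$-th belt of the cube. The largest belt $\mathcal{F}_k$ is for $k = \lfloor n/2 \rfloor$. Can one construct a larger antichain? Sperner's theorem says no.

Theorem 5 (Sperner's theorem) Let $\mathcal{F}$ be a family of subsets of $[n]$. If $\mathcal{F}$ is an antichain, then
$$|\mathcal{F}| \le \binom{n}{\lfloor n/2 \rfloor}.$$

Proof: View $\mathcal{F}$ as a graph, and let $f_k$ denote the number of vertices of $\mathcal{F}$ in the $k$-th belt of the $n$-hypercube. We shall prove that
$$\sum_{k=0}^{n} \frac{f_k}{\binom{n}{k}} \le 1. \qquad (1)$$
This then proves Sperner's theorem, since
$$\sum_{k=0}^{n} \frac{f_k}{\binom{n}{k}} \ge \sum_{k=0}^{n} \frac{f_k}{\binom{n}{\lfloor n/2 \rfloor}} \ \Rightarrow\ \sum_{k=0}^{n} \frac{f_k}{\binom{n}{\lfloor n/2 \rfloor}} \le 1 \ \Rightarrow\ \sum_{k=0}^{n} f_k = |\mathcal{F}| \le \binom{n}{\lfloor n/2 \rfloor}.$$
We shall prove the contrapositive of (1). That is, assume that
$$\sum_{k=0}^{n} \frac{f_k}{\binom{n}{k}} > 1.$$
We want to show that $\mathcal{F}$ is not an antichain. Let $\pi$ be a uniformly random permutation of $[n]$. Consider the sequence of sets
$$\emptyset,\ \{\pi(1)\},\ \{\pi(1), \pi(2)\},\ \ldots,\ \{\pi(1), \pi(2), \ldots, \pi(n)\}.$$
Let $U$ be the number of sets in this sequence that belong to $\mathcal{F}$. Let $U_k$ denote the indicator of the event that the $k$-th set $\{\pi(1), \ldots, \pi(k)\}$ is in $\mathcal{F}$. Note that the $k$-th set in the sequence is a uniformly random element of the $k$-th belt. Thus
$$E(U_k) = \frac{f_k}{\binom{n}{k}}.$$
Now
$$E(U) = E\Big(\sum_k U_k\Big) = \sum_k E(U_k) = \sum_{k=0}^{n} \frac{f_k}{\binom{n}{k}} > 1.$$
By Lemma 3, with positive probability $U \ge E(U) > 1$, and since $U$ is integer valued, $U \ge 2$.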
That is, there exists a permutation $\pi$ such that at least two sets of the sequence lie in $\mathcal{F}$. But one of these sets must contain the other, and thus $\mathcal{F}$ cannot be an antichain.

3 Lecture 3: Sample and Modify

Sometimes the first random object we construct does not quite have the desired property. In this case, we may need to make small modifications. In some communities, this is called sample-and-modify.

3.1 Independent set

Let $G = (V, E)$ be a graph. A set $I \subseteq V$ is called independent if $i, j \in I$ implies $(i, j) \notin E$. The size $\alpha(G)$ of the largest independent set is called the independence number of the graph. (Recall: for example, an antichain is an independent set in the $n$-hypercube.) We want a lower bound on $\alpha(G)$.

Theorem 6 Suppose $G = (V, E)$ has $n$ vertices and $nd/2$ edges, $d \ge 1$. Then
$$\alpha(G) \ge \frac{n}{2d}.$$

For example, suppose $n$ is a multiple of $d + 1$, so $n = (d + 1)k$. Consider a graph with $k$ disjoint $(d + 1)$-cliques. Clearly $\alpha(G) = \frac{n}{d+1}$ in this case, so our bound is only off by a constant factor of about 2. In fact, the clique construction is tight in this case. This is an example of Turán's theorem, which has a probabilistic proof. (See future lectures.)

Proof: We shall randomly find a large independent set. Here is a recipe for an independent set: start with some initial set of vertices $S$. For each edge of the subgraph induced on $S$, delete one of its incident vertices. After this, all the induced edges have been destroyed, and thus the remaining vertices $S^*$ form an independent set.

For this construction to work, we need to start with a big enough set $S$ with few enough edges, so that after deletion, $S^*$ is large. Specifically, let $X$ be the number of vertices in $S$, and $Y$ the number of edges in the induced subgraph on $S$. Then $S^*$ has at least $X - Y$ vertices. So we need to find $S$ such that $X - Y$ is large.

Choose $S$ at random: for each vertex $v \in V$, include $v$ in $S$ independently with probability $p$. Then $E(X) = np$ and
$$E(Y) = \sum_{e \in E} P(e \in G|_S) = \frac{nd}{2}\, p^2.$$
So
$$E(X - Y) = np - \frac{nd}{2}\, p^2.$$
Choosing $p$ to maximize the above quantity, we find that the optimal $p$ is $p = 1/d$ (a valid probability since $d \ge 1$). Thus,
$$E(X - Y) = \frac{n}{2d}.$$
By Lemma 3, there exists a set $S$ where the modified set $S^*$ is an independent set with at least $X - Y \ge \frac{n}{2d}$ vertices.

3.2 Packing

Let $B(x)$ denote the cube $[0, x]^d$ of side length $x$. Let $C$ be a compact measurable set with (Lebesgue) measure $\mu(C)$. A packing of $C$ into $B(x)$ is a family of mutually disjoint copies (translates) of $C$, all lying inside $B(x)$. Let $f(x)$ denote the largest size of such a family. The packing constant $\delta(C)$ is
$$\delta(C) := \mu(C) \lim_{x \to \infty} f(x)\, x^{-d}.$$
Note that $\mu(C) f(x)$ is the volume of space covered by copies of $C$ inside $B(x)$, and $x^{d}$ is the volume of $B(x)$. So $\delta(C)$ is the maximal proportion of space that may be packed by copies of $C$. One can show that this limit exists. Our goal is to lower bound $\delta(C)$.

Theorem 7 Let $C$ be bounded, convex and centrally symmetric around the origin. Then
$$\delta(C) \ge 2^{-d-1}.$$

Proof: This example is similar to independent set, since putting copies of $C$ i.i.d. inside $B(x)$ does not yield a packing, so after sampling one needs to modify. Given $C$, we need to find a dense packing. Here is one construction of a packing: sample $n$ points $s_1, \ldots, s_n$ i.i.d. uniformly from $B(x)$. At each point $s_i$, put a copy of $C$ with this center, that is, $C + s_i$. Now for each pair of intersecting copies, remove one of the two copies. This gives a packing, except that some copies of $C$ may stick out of the box. We fix this by enlarging the box until it includes all copies, and call this a packing of the larger box.

Let $s$ and $t$ be two i.i.d. points from $B(x)$. First we compute the probability that $C + s$ intersects $C + t$. By central symmetry, $C = -C$, and by convexity, $C - C = C + C = 2C$. Thus the two sets intersect iff $s \in t + 2C$. For each given $t$, this event happens with probability at most
$$\mu(2C)\, x^{-d} = 2^d \mu(C)\, x^{-d}.$$
So
$$P\big((C + s) \cap (C + t) \ne \emptyset\big) \le 2^d \mu(C)\, x^{-d}.$$
Let $s_1, \ldots, s_n$ be $n$ i.i.d. points from $B(x)$.
Let $X_{ij}$ be the indicator of the event that $C + s_i$ intersects $C + s_j$. Let $X$ be the total number of pairwise intersections. Then
$$E(X) = \sum_{i<j} E(X_{ij}) \le \binom{n}{2}\, 2^d \mu(C)\, x^{-d} \le \frac{n^2}{2}\, 2^d \mu(C)\, x^{-d}.$$
So there exists a specific choice of $n$ points with at most $\frac{n^2}{2}\, 2^d \mu(C)\, x^{-d}$ pairwise intersections. After our removals, there are at least
$$n - \frac{n^2}{2}\, 2^d \mu(C)\, x^{-d}$$
nonintersecting copies of $C$. Now, we choose $n$ to optimize this quantity. This gives
$$n = \frac{x^d}{2^d \mu(C)},$$
for which the quantity above equals $n/2$. Finally we enlarge the box: let $2w$ be the width of the smallest cube centered at 0 that contains $C$. Then our constructed set is (up to translation) a packing of $B(x + 2w)$. Hence
$$f(x + 2w) \ge \frac{x^d}{2^{d+1} \mu(C)},$$
so
$$\delta(C) \ge \lim_{x \to \infty} \mu(C)\, f(x + 2w)\, (x + 2w)^{-d} \ge 2^{-d-1}.$$

4 Lecture 3 (cont): Markov's inequality (first moment method)

Theorem 8 (Markov's inequality) Consider a random variable $X \ge 0$. Let $t > 0$. Then
$$P(X \ge t) \le \frac{E(X)}{t}.$$

Proof: Consider the function $g : x \mapsto t \cdot 1_{\{x \ge t\}}$. Note that $g(x) \le x$ for $x \ge 0$. Then
$$E(X) \ge E(g(X)) = t\, P(X \ge t).$$
Rearranging gives
$$P(X \ge t) \le \frac{E(X)}{t}.$$

Corollary 9 (First moment inequality) If $X$ is a nonnegative, integer valued random variable, then
$$P(X > 0) \le E(X).$$

Corollary 10 Let $X_n$ be a sequence of nonnegative, integer valued random variables. Suppose $E(X_n) = o(1)$, that is, $\lim_{n \to \infty} E(X_n) = 0$. Then $\lim_{n \to \infty} P(X_n > 0) = 0$. That is, $X_n = 0$ asymptotically almost surely.

4.1 Ramsey number revisited

Recall the definition of the Ramsey number.

Definition 11 The Ramsey number $R(k, \ell)$ is the smallest integer $n$ such that in any two-coloring of the edges of the complete graph $K_n$ on $n$ vertices by red and blue, either there is a red $K_k$, or there is a blue $K_\ell$.

Recall that we proved the following.

Proposition 12 If
$$\binom{n}{k}\, 2^{1 - \binom{k}{2}} < 1,$$
then $R(k, k) > n$.

Let us rewrite the proof as an application of the first moment method.

Proof: Take a random coloring: color each edge red or blue with probability $1/2$ independently. Let $X$ be the number of monochromatic cliques of size $k$.
As argued before, the probability that a given set of $k$ vertices forms a monochromatic clique is $2^{1 - \binom{k}{2}}$. There are $\binom{n}{k}$ sets of vertices of size $k$, so
$$E(X) = \binom{n}{k}\, 2^{1 - \binom{k}{2}}.$$
Thus
$$P(X > 0) \le E(X) < 1,$$
so $P(X = 0) > 0$. That is, if $n$ satisfies $\binom{n}{k}\, 2^{1 - \binom{k}{2}} < 1$, there exists a coloring with no monochromatic $k$-clique, thus $R(k, k) > n$.

This proof is shorter, and cleaner to generalize.

Proposition 13 (Exercise) If there exists $p \in [0, 1]$ with
$$\binom{n}{k}\, p^{\binom{k}{2}} + \binom{n}{\ell}\, (1 - p)^{\binom{\ell}{2}} < 1$$
then $R(k, \ell) > n$.

Proof: Color an edge red with probability $p$, blue with probability $1 - p$, independently at random. Let $X$ denote the number of $k$-cliques colored red plus the number of $\ell$-cliques colored blue. Then
$$E(X) = \binom{n}{k}\, p^{\binom{k}{2}} + \binom{n}{\ell}\, (1 - p)^{\binom{\ell}{2}}.$$
The proof ends in the same way as that of Proposition 12.

As an aside, with sample-and-modify, we can generalize the above result. This does not use the first moment method, but is an illustration of the previous lecture.

Proposition 14 (Exercise) For all $p \in [0, 1]$ and for all integers $n$,
$$R(k, \ell) > n - \binom{n}{k}\, p^{\binom{k}{2}} - \binom{n}{\ell}\, (1 - p)^{\binom{\ell}{2}}.$$

Proof: As before, color an edge red with probability $p$, blue with probability $1 - p$, independently at random. Let $X$ denote the number of $k$-cliques colored red plus the number of $\ell$-cliques colored blue. Then
$$E(X) = \binom{n}{k}\, p^{\binom{k}{2}} + \binom{n}{\ell}\, (1 - p)^{\binom{\ell}{2}}.$$
Recall that $X$ is the number of 'bad' sets. By the Expectation Lemma (Lemma 3, Lecture 2), there exists a coloring with at most $E(X)$ such bad sets. For each bad set, remove a vertex. This procedure removes at most $\binom{n}{k}\, p^{\binom{k}{2}} + \binom{n}{\ell}\, (1 - p)^{\binom{\ell}{2}}$ vertices. The coloring on the remaining vertices, of which there are at least
$$n - \binom{n}{k}\, p^{\binom{k}{2}} - \binom{n}{\ell}\, (1 - p)^{\binom{\ell}{2}},$$
has no 'bad' sets (that is, no red $k$-clique or blue $\ell$-clique), thus $R(k, \ell)$ is greater than this quantity.

4.2 Graph threshold, K4 in a G(n, p)

4.2.1 Threshold function

The Erdős-Rényi random graph $G(n, p)$ is a random undirected graph on $n$ vertices, generated by including each edge independently with probability $p$.
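The definition of $G(n, p)$ translates directly into a sampler. The sketch below is one minimal way to write it, assuming vertices are labelled $0, \ldots, n-1$ and the graph is represented by its edge set; the name `sample_gnp` is chosen here for illustration only.

```python
import random
from itertools import combinations


def sample_gnp(n, p, rng=None):
    """Sample the edge set of G(n, p) on vertices 0..n-1.

    Each of the n-choose-2 possible edges is included independently
    with probability p. Edges are returned as pairs (u, v) with u < v.
    """
    rng = rng or random.Random()
    return {pair for pair in combinations(range(n), 2) if rng.random() < p}


# Example: one sample on 6 vertices with a fixed seed for reproducibility.
edges = sample_gnp(6, 0.5, random.Random(0))
print(len(edges), "edges:", sorted(edges))
```

Passing an explicit `random.Random` instance keeps experiments reproducible, which is convenient when estimating probabilities such as $P(G(n, p) \supset K_4)$ by repeated sampling.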
A property $\mathcal{P}$ of a graph $G$ is called monotone increasing if $G \in \mathcal{P}$, $G' \supset G \Rightarrow G' \in \mathcal{P}$. Examples of such properties include connectedness, existence of a clique, existence of a Hamiltonian cycle, etc. A threshold for a monotone increasing property $\mathcal{P}$ is a function $p^* = p^*(n)$ such that
$$\lim_{n \to \infty} P(G(n, p) \in \mathcal{P}) = \begin{cases} 0 & \text{if } p \ll p^* \ (p = o(p^*)), \\ 1 & \text{if } p \gg p^* \ (p^* = o(p)). \end{cases}$$
Every non-trivial monotone increasing property has a threshold function. (We omit the proof of this theorem.) Finding the threshold function for a given property is a central problem in random graph theory. Threshold proofs often consist of two parts, an upper bound and a matching lower bound for $p^*$. In many cases, the first moment method gives one side.

4.2.2 Existence of K4

Proposition 15 If $p \ll n^{-2/3}$, then
$$\lim_{n \to \infty} P(G(n, p) \supset K_4) = 0.$$

Proof: Let $X$ be the number of copies of $K_4$ in $G(n, p)$. Then
$$E(X) = \binom{n}{4}\, p^6.$$
Suppose $p \ll n^{-2/3}$, that is, $p = n^{-2/3}\, o(1)$. Then
$$E(X) = \binom{n}{4}\, p^6 = O(n^4\, n^{-12/3}\, o(1)) = o(1).$$
That is, $E(X) \to 0$ as $n \to \infty$. By the first moment method, $P(X > 0) \le E(X) \to 0$, thus
$$\lim_{n \to \infty} P(G(n, p) \supset K_4) = 0.$$

5 Lecture 4: More examples of the first moment method

5.1 Background: Big-O and little-o

Big-O. For sequences $a_n$ and $b_n$, we write $a_n = O(b_n)$ to mean that there exists some finite constant $C$ and some large integer $N$ such that
$$|a_n| \le C |b_n|$$
for all $n \ge N$. We say that $a_n$ is big-O of $b_n$. If the quantities involved are non-zero, then $a_n = O(b_n)$ if and only if $\limsup |a_n| / |b_n| < \infty$. Recall that the lim sup (limit superior) of a sequence $\{x_n\}$ is
$$\lim_{n \to \infty} \sup_{m \ge n} x_m.$$
By monotone convergence, the lim sup of a bounded sequence always exists, even if the sequence does not converge.

Little-o. We write $a_n = o(b_n)$ if $a_n$ grows slower than $b_n$, that is, if for every positive constant $\epsilon$, there exists some constant $N(\epsilon)$ such that $|a_n| \le \epsilon |b_n|$ for all $n \ge N(\epsilon)$. If the quantities involved are non-zero, then $a_n = o(b_n)$ if and only if $\frac{a_n}{b_n} \to 0$ as $n \to \infty$.

The notation $\sim$. We write $a_n \sim b_n$ if $\frac{a_n}{b_n} \to 1$ as $n \to \infty$.

5.2 Background: Stirling's approximation

The term $n!$ often appears in various formulae (for example, in the binomial coefficient $\binom{n}{k}$). Often we want to bound this quantity for large $n$. We now collect and prove some useful bounds.

Theorem 16 (Stirling's formula)
$$n! \sim \frac{n^n}{e^n} \sqrt{2 \pi n}.$$

It is common to express large quantities by their (natural) logarithm. So another way to write the above is
$$\log(n!) = n \log(n) - n + \frac{1}{2} \log(n) + \frac{1}{2} \log(2\pi) + o(1).$$
First we prove that the dominant term $n \log(n)$ is the right order. In fact, often we only need this bound.

Theorem 17 For $n \ge 2$,
$$n \log n - n < \log(n!) < n \log(n).$$
Thus $\log(n!) \sim n \log(n)$.

Proof: Since $n! < n^n$, $\log(n!) < n \log(n)$. Now we prove the lower bound. One method is to use the Riemann sum with right endpoints for the integral $\int_1^n \log(x)\, dx$. Since log is a strictly increasing function, the right endpoint sum upper bounds the integral. So
$$\log(2) + \ldots + \log(n) = \log(n!) > \int_1^n \log(x)\, dx = n \log(n) - n + 1 > n \log(n) - n.$$

A sketch of the proof of Stirling's formula can be found on Wikipedia, for example. Useful asymptotic bounds on binomial coefficients come from Stirling's formula. The most commonly quoted bound is for $\binom{2n}{n}$, but for any large $k$ (i.e., going to $\infty$ as $n \to \infty$), it is worth writing $\binom{n}{k}$ as $\frac{n!}{(n-k)!\,k!}$, and using Stirling's approximation to analyze the asymptotics of this coefficient.

Example 1 Write $\binom{2n}{n} = \frac{(2n)!}{n!\,n!}$. Applying Stirling's formula to $(2n)!$ and $n!$, we get
$$\binom{2n}{n} \sim \frac{2^{2n}}{\sqrt{\pi n}}.$$

Here are some other useful bounds.

Proposition 18 For $1 \le k \le n$,
$$\Big(\frac{n}{k}\Big)^k \le \binom{n}{k} \le \frac{n^k}{k!} \le \Big(\frac{n \cdot e}{k}\Big)^k.$$

Proof: First inequality: note that
$$\frac{n}{k} \le \frac{n-1}{k-1} \le \frac{n-2}{k-2} \le \ldots,$$
so
$$\binom{n}{k} = \frac{n}{k} \cdot \frac{n-1}{k-1} \cdots \frac{n-k+1}{1} \ge \Big(\frac{n}{k}\Big)^k.$$
Second inequality: note that
$$\binom{n}{k} = \frac{n(n-1) \cdots (n-k+1)}{k!} \le \frac{n^k}{k!}.$$
Third inequality: cancel the $n^k$, take the log of both sides, and rearrange; this is the inequality $k \log(k) - k < \log(k!)$, which we already proved above.

5.3 Longest increasing subsequence

Let $\sigma_n$ be a uniformly chosen random permutation of $[n]$.
An increasing subsequence of $\sigma_n$ is a sequence of indices $i_1 < i_2 < \ldots < i_k$ such that $\sigma_n(i_1) < \sigma_n(i_2) < \ldots < \sigma_n(i_k)$. Let $L_n$ be the length of a longest increasing subsequence of $\sigma_n$.

Lemma 19 $P(L_n > 2e\sqrt{n}) = o(n^{-1/2})$ as $n \to \infty$. In particular, this implies
$$\limsup_{n \to \infty} \frac{E(L_n)}{\sqrt{n}} \le 2e.$$

Technical comment: Here we use the lim sup instead of lim since we do not know if the sequence $\frac{E(L_n)}{\sqrt{n}}$ converges as $n \to \infty$.

Intuition: The lemma states that for large $n$, for a random permutation of $[n]$, the longest increasing subsequence has length at most $O(\sqrt{n})$.

Proof: First we deal with the implication. Let $\ell$ be some value (think $\ell$ for 'length'). Borrowing the idea from the proof of Markov's inequality, we use the bound
$$E(L_n) \le \ell\, P(L_n < \ell) + n\, P(L_n \ge \ell) \le \ell + n\, P(L_n \ge \ell).$$
Thus if we can show that for $\ell = 2e\sqrt{n}$, $P(L_n \ge \ell) \to 0$ faster than $n^{-1/2}$, this would imply
$$\frac{E(L_n)}{\sqrt{n}} \le 2e + o(1),$$
and thus
$$\limsup_{n \to \infty} \frac{E(L_n)}{\sqrt{n}} \le 2e.$$
Now consider the problem of bounding $P(L_n \ge \ell)$. This is difficult to bound directly because the definition of $L_n$ involves a maximum. Let $X_{n,\ell}$ be the number of increasing subsequences of $\sigma_n$ with length $\ell$. Then
$$P(L_n \ge \ell) = P(X_{n,\ell} > 0).$$
Now we use the first moment method to upper bound the second quantity. Note that
$$X_{n,\ell} = \sum_{\text{subsequence } s \text{ of length } \ell} 1_{\{s \text{ is increasing in } \sigma_n\}}.$$
There are $\binom{n}{\ell}$ subsequences, and the probability of a specific subsequence $s$ being increasing in $\sigma_n$ is $1/\ell!$. Indeed, $\sigma_n$ restricted to $s$ is just a uniformly random ordering of these $\ell$ indices. There are $\ell!$ ways that $\sigma_n$ can order $s$, only one of which is increasing. Thus
$$P(X_{n,\ell} > 0) \le E(X_{n,\ell}) = \frac{1}{\ell!} \binom{n}{\ell}.$$
Assuming that $\ell \to \infty$, we use the bounds
$$\ell! \ge \Big(\frac{\ell}{e}\Big)^\ell \quad \text{and} \quad \binom{n}{\ell} \le \Big(\frac{en}{\ell}\Big)^\ell$$
to obtain
$$\frac{1}{\ell!} \binom{n}{\ell} \le \Big(\frac{e\sqrt{n}}{\ell}\Big)^{2\ell}.$$
For $\ell = 2e\sqrt{n}$, the above bound becomes
$$2^{-4e\sqrt{n}},$$
which $\to 0$ much faster than $1/\sqrt{n}$. In summary, we have shown that
$$P(L_n \ge \ell) = P(X_{n,\ell} > 0) \le E(X_{n,\ell}) \le 2^{-4e\sqrt{n}} \ll n^{-1/2},$$
so $P(L_n \ge \ell) = o(n^{-1/2})$.

Remark. It has been shown that
$$E(L_n) = 2\sqrt{n} + c\, n^{1/6} + o(n^{1/6})$$
for $c = -1.77\ldots$.
So our bound is loose only by a constant factor.

5.4 Connectivity threshold

Say a graph is connected if for every pair of vertices $u, v$, there exists a path that connects $u$ and $v$. Our goal is to prove an upper bound for the threshold function of connectivity of $G(n, p)$.

Proposition 20 If $p \gg \frac{\log(n)}{n}$, then $\lim_{n \to \infty} P(G(n, p) \text{ is connected}) = 1$.

A graph is disconnected if either it has isolated vertices, or it has a component of size $k$ with $2 \le k \le n/2$ that is disconnected from the rest of the graph. We shall prove that both events have asymptotically zero probability, and the conclusion follows.

Lemma 21 [Exercise] Let $X_1$ be the number of isolated vertices of $G(n, p)$. If $p = \frac{c \log(n)}{n}$ for $c \gg 1$, then $P(X_1 > 0) = o(1)$.

Proof: Note that $X_1 = Z_1 + \ldots + Z_n$, where $Z_i$ is 1 if the $i$-th vertex is isolated, 0 else. Then
$$E(X_1) = n(1 - p)^{n-1} \approx n\Big(1 - \frac{c \log n}{n}\Big)^{n} \approx n e^{-c \log(n)} = n^{1-c}.$$
So if $c \gg 1$, $E(X_1) \to 0$, hence $P(X_1 > 0) \to 0$.

Proposition 22 Let $X_k$ be the number of components of size $k$ which are disconnected from the rest of the graph in $G(n, p)$. If $p = \frac{c \log(n)}{n}$ for $c \gg 1$, then
$$\sum_{k=2}^{n/2} P(X_k > 0) = o(1).$$

Proof: Our goal will be to show that
$$\sum_{k=2}^{n/2} E(X_k) = o(1).$$
There are $\binom{n}{k}$ subsets of $k$ vertices. For $k$ fixed vertices, there are $k(n - k)$ edges to the rest of the graph that cannot appear, so
$$E(X_k) \le \binom{n}{k} (1 - p)^{k(n-k)}.$$
Now we use $k \le n/2$ so $n - k \ge n/2$, $\binom{n}{k} \le \frac{n^k}{k!}$, and $k! \ge (k/e)^k$ to get
$$E(X_k) \le \frac{n^k}{k!} (1 - p)^{kn/2} \le \big(en(1 - p)^{n/2}\big)^k.$$
Now,
$$n(1 - p)^{n/2} = n\Big(1 - \frac{c \log(n)}{n}\Big)^{n/2} \approx n e^{-c \log(n)/2} = n^{1 - c/2} \to 0$$
for $c \gg 1$. Thus $en(1 - p)^{n/2} = o(1)$ for $c \gg 1$. By the first moment method,
$$\sum_{k=2}^{n/2} P(X_k > 0) \le \sum_{k=2}^{\infty} \big(en(1 - p)^{n/2}\big)^k = O\big(n(1 - p)^{n/2}\big) = o(1).$$

6 Lecture 6: The second moment method

Corollary 23 Suppose $\phi : [0, \infty) \to [0, \infty)$ is a strictly monotone increasing function, that is, $x > y$ implies $\phi(x) > \phi(y)$. Let $X$ be a nonnegative random variable. Then for any $t$ with $\phi(t) > 0$,
$$P(X \ge t) = P(\phi(X) \ge \phi(t)) \le \frac{E(\phi(X))}{\phi(t)}.$$

Theorem 24 (Chebyshev's inequality) Let $X$ be a random variable and $t > 0$. Then
$$P(|X - E(X)| \ge t) \le \frac{\mathrm{Var}(X)}{t^2}.$$

Proof: Apply the previous corollary to the random variable $|X - E(X)|$, with function $\phi : x \mapsto x^2$.

The use of Chebyshev's inequality is called the second moment method. The following often-used corollary is sometimes called the second moment inequality.

Corollary 25 (Second moment inequality) Let $X$ be a nonnegative integer valued random variable. Then
$$P(X = 0) \le \frac{\mathrm{Var}(X)}{E(X)^2}.$$

Proof:
$$P(X = 0) \le P(|X - E(X)| \ge E(X)) \le \frac{\mathrm{Var}(X)}{(E(X))^2}.$$

In comparison, the first moment inequality is $P(X > 0) \le E(X)$. A typical use of the first moment is to show that $E(X) \to 0$ implies $X = 0$ a.a.s. in the limit, or to show that $E(X) < 1$ and thereby get $P(X = 0) > 0$. That is, if we can show that $E(X)$ is 'very small', then the event $\{X = 0\}$ happens 'very often'. If $E(X)$ is large, say, $E(X) \to \infty$, we cannot conclude that the event $\{X > 0\}$ happens 'very often' based on the first moment. This is where the second moment is handy.

Corollary 26 Let $X_n$ be a sequence of nonnegative integer valued random variables. Suppose $E(X_n) \to \infty$. If $\mathrm{Var}(X_n) = o(E(X_n)^2)$ as $n \to \infty$, then $P(X_n > 0) \to 1$ as $n \to \infty$.

Corollary 27 Let $X_n$ be a sequence of nonnegative integer valued random variables. Suppose $E(X_n) \to \infty$. If $\mathrm{Var}(X_n) = o(E(X_n)^2)$ as $n \to \infty$, then $X_n \sim E(X_n)$ as $n \to \infty$ a.a.s.

Proof: From Chebyshev's inequality, for any $\epsilon > 0$,
$$P(|X - E(X)| \ge \epsilon E(X)) \le \frac{\mathrm{Var}(X)}{\epsilon^2 (E(X))^2} \to 0.$$

Intuitively, $\mathrm{Var}(X)$ measures the concentration of $X$ around $E(X)$. If $\mathrm{Var}(X) = o(E(X)^2)$, the above corollary says that eventually $X$ concentrates around $E(X)$.

Typically, $X = \sum_i X_i$, where $X_i$ is the indicator of some event $A_i$. Write $i \sim j$ if $i \ne j$ and the events $A_i$ and $A_j$ are not independent. Then
$$\mathrm{Var}(X) = \sum_i \big(E(X_i) - E(X_i)^2\big) + \sum_{i \sim j} \big(E(X_i X_j) - E(X_i) E(X_j)\big) \le E(X) + \sum_{i \sim j} P(A_i \cap A_j).$$
A typical application of the second moment method in this case reduces to showing that $\Delta := \sum_{i \sim j} P(A_i \cap A_j) = o(E(X)^2)$.

Corollary 28 Suppose that $X = X_n$ is the sum of indicators of events $\{A_i\}$. Suppose $E(X) \to \infty$ as $n \to \infty$. If $\sum_{i \sim j} P(A_i \cap A_j) = o(E(X)^2)$, then $X \sim E(X)$ a.a.s.
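For completeness, here is the one-line computation behind Corollary 28, written out under the hypotheses stated there ($E(X) \to \infty$ and $\Delta = o(E(X)^2)$). It combines the variance bound for sums of indicators with Corollary 27; no new notation is introduced.

```latex
\[
\frac{\operatorname{Var}(X)}{E(X)^2}
\;\le\; \frac{E(X) + \Delta}{E(X)^2}
\;=\; \frac{1}{E(X)} + \frac{\Delta}{E(X)^2}
\;\longrightarrow\; 0
\qquad (n \to \infty).
\]
```

The first term vanishes because $E(X) \to \infty$ and the second because $\Delta = o(E(X)^2)$, so $\mathrm{Var}(X) = o(E(X)^2)$ and Corollary 27 applies.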
Here is another 'trick' for sums of indicators. Suppose the events $A_i$ are symmetric: for any $i \ne j$, there is a measure preserving map of the underlying probability space that sends $A_i$ to $A_j$. Then
$$\sum_{i \sim j} P(A_i \cap A_j) = \sum_i P(A_i) \sum_{j \sim i} P(A_j \mid A_i).$$
By symmetry, the sum $\sum_{j \sim i} P(A_j \mid A_i)$ is independent of $i$. Call this sum $\Delta^*$. Then
$$\Delta = \sum_i P(A_i)\, \Delta^* = \Delta^* E(X).$$

Corollary 29 Suppose $E(X) \to \infty$. If $\Delta^* = o(E(X))$, then $X \sim E(X)$ a.a.s.

6.1 K4 threshold

Recall that two lectures ago we showed that if $p \ll n^{-2/3}$, then $G(n, p)$ a.a.s. does NOT have a copy of $K_4$. Now we shall prove that if $p \gg n^{-2/3}$, then $G(n, p)$ a.a.s. has a copy of $K_4$.

Theorem 30 $n^{-2/3}$ is the threshold function for the existence of $K_4$.

Proof: The lower bound was established in Proposition 15. Now we shall prove the upper bound. Suppose $p \gg n^{-2/3}$. Let $X$ be the number of copies of $K_4$ in $G(n, p)$. As we computed previously,
$$E(X) = \binom{n}{4} p^6 = \Theta(n^4 p^6) \to \infty.$$
We want to show that $X > 0$ a.a.s. Note that $X$ is a sum of symmetric indicators. Fix a $K_4$ graph $i$. The indicator for the $K_4$ graph $j$ is not independent from that of $i$ if $i$ and $j$ share either two or three vertices. There are $O(n^2)$ $j$'s that share precisely two vertices with $i$ (pick the two other vertices), and for these $P(A_j \mid A_i) = p^5$. There are $O(n)$ $j$'s that share precisely three vertices, and for these $P(A_j \mid A_i) = p^3$. Using Corollary 29, we have
$$\Delta^* = \sum_{j \sim i} P(A_j \mid A_i) = O(n^2 p^5) + O(n p^3) = o(n^4 p^6) = o(E(X)).$$
Thus $X > 0$ a.a.s.

6.2 Connectivity threshold

Theorem 31 The threshold for the connectivity of $G(n, p)$ is $p^* = \frac{\log n}{n}$.

First we prove that this is the threshold for the existence of isolated vertices.

Proposition 32 (exercise) The threshold for the existence of isolated vertices is $p^* = \frac{\log n}{n}$.

Proof: Let $X_1$ be the number of isolated vertices of $G(n, p)$. By the first moment method (see Lemma 21 in the previous lecture), $p \gg p^*$ implies $P(X_1 > 0) = o(1)$. We need to prove that $p \ll p^*$ implies $X_1 > 0$ with high probability. Let $p = c p^* = c\, \frac{\log n}{n}$. Note that $X_1 = Z_1 + \ldots + Z_n$, where $Z_i$ is 1 if the $i$-th vertex is isolated, 0 else. Then
$$E(X_1) = n(1 - p)^{n-1} \approx n\Big(1 - \frac{c \log n}{n}\Big)^{n} \approx n e^{-c \log(n)} = n^{1-c}.$$
So if $c < 1$, $E(X_1) \to \infty$. We want to show that $c < 1$ implies $X_1 > 0$ w.h.p. Use the second moment method. In general, it is good to bound $\frac{\mathrm{Var}(X_1)}{E(X_1)^2}$ directly. Note that
$$E(X_1^2) = E(X_1) + \sum_{i \ne j} E(Z_i Z_j).$$
All the pairs $(i, j)$, $i \ne j$, are correlated. For a given pair $i \ne j$,
$$E(Z_i Z_j) = P(Z_i = Z_j = 1) = (1 - p)^{2n-3}.$$
So
$$\sum_{i \ne j} E(Z_i Z_j) = n(n - 1)(1 - p)^{2n-3}.$$
Now
$$\frac{E(X_1^2)}{E(X_1)^2} = \frac{n(1 - p)^{n-1} + n(n - 1)(1 - p)^{2n-3}}{n^2 (1 - p)^{2n-2}} = \frac{1}{n(1 - p)^{n-1}} + \frac{1}{1 - p} - \frac{1}{n(1 - p)}.$$
For $p = \frac{c \log(n)}{n}$, $c < 1$, we have
$$\lim_{n \to \infty} \frac{E(X_1^2)}{E(X_1)^2} = \lim_{n \to \infty} \Big( \frac{1}{n^{1-c}} + \frac{1}{1 - \frac{c \log(n)}{n}} - \frac{1}{n - c \log(n)} \Big) = 1.$$
So in particular,
$$\frac{\mathrm{Var}(X_1)}{E(X_1)^2} = \frac{E(X_1^2) - E(X_1)^2}{E(X_1)^2} \to 0,$$
thus $P(X_1 = 0) \to 0$ as $n \to \infty$. So for $p = \frac{c \log(n)}{n} < p^*$, a.a.s. there are isolated vertices.

Proof: [Proof of the connectivity threshold] We have shown the following, each statement holding a.a.s.

• If $p \ll p^*$, then the graph has isolated vertices and hence is disconnected.

• If $p \gg p^*$, then the graph has no isolated vertices.

• If $p \gg p^*$, then the graph has no isolated components of size at least 2 (and at most $n/2$).

It follows that for $p \gg p^*$, the graph a.a.s. has no isolated components, and hence is connected.

6.3 Turán's proof of the Hardy-Ramanujan theorem

This theorem states that for most integers $x \le n$, with $n$ large, the number of prime factors of $x$ is about $\log \log n$.

Theorem 33 (Hardy and Ramanujan 1920; Turán 1934) Let $\omega(n) \to \infty$ arbitrarily slowly. For $x = 1, 2, \ldots, n$, let $\nu(x)$ be the number of distinct prime factors of $x$. Then the number of $x \in \{1, \ldots, n\}$ such that
$$|\nu(x) - \log \log n| > \omega(n) \sqrt{\log \log n}$$
is $o(n)$.

For this proof, reserve $p$ to denote a prime number. We need the following result from number theory, known as Mertens' formula.

Lemma 34 (Mertens' formula)
$$\sum_{p \le x} \frac{1}{p} = \log \log x + O(1).$$

Proof: Here is a proof sketch to show that $\log \log x$ is the correct order.
For an integer $x$, the highest power of a prime $p$ which divides $x!$ is
$$\Big\lfloor \frac{x}{p} \Big\rfloor + \Big\lfloor \frac{x}{p^2} \Big\rfloor + \Big\lfloor \frac{x}{p^3} \Big\rfloor + \ldots.$$
(Indeed, if $p \le x < p^2$, then this power is $\lfloor \frac{x}{p} \rfloor$. If $p^2 \le x < p^3$, then this power is the sum of the first two terms, and so on.) Thus we have
$$x! = \prod_{p \le x} p^{\lfloor x/p \rfloor + \lfloor x/p^2 \rfloor + \lfloor x/p^3 \rfloor + \ldots}.$$
Take the log of both sides, and use $\log(x!) \sim x \log x$. Since squares, cubes, etc. of primes are quite rare, and $\lfloor \frac{x}{p} \rfloor$ is almost the same as $x/p$, we get
$$x \sum_{p \le x} \frac{\log p}{p} \sim x \log x.$$
Write
$$S(n) = \sum_{p \le n} \frac{\log p}{p}.$$
We have
$$\sum_{p \le x} \frac{1}{p} = \sum_{2 \le n \le x} \frac{1}{\log n} \big(S(n) - S(n - 1)\big).$$
We now use summation by parts (also called Abel summation), which, with $A(n) = \sum_{m \le n} a(m)$ and the integral replaced by a sum, reads
$$\sum_{n \le x} a(n) f(n) \approx A(x) f(x) - \sum_{n \le x} A(n) f'(n).$$
Applying this for $f(x) = \frac{1}{\log(x)}$ and $a(n) = S(n) - S(n - 1)$, so that $A(n) = S(n)$, we get
$$\sum_{p \le x} \frac{1}{p} \approx \frac{S(x)}{\log(x)} + \sum_{2 \le n \le x} \frac{S(n)}{n \log(n)^2}.$$
Now $S(x) \sim \log(x)$, so the first term is $O(1)$. For the second term,
$$\sum_{2 \le n \le x} \frac{S(n)}{n \log(n)^2} \sim \sum_{2 \le n \le x} \frac{1}{n \log n} \sim \log \log x.$$

Proof: [Proof of Hardy-Ramanujan] Choose $x$ uniformly at random from $\{1, 2, \ldots, n\}$. For a prime $p$, set $X_p = 1$ if $p$ divides $x$, zero otherwise. Our goal will be to apply Chebyshev's inequality to bound
$$\nu(x) = \sum_{p \le n} X_p.$$
But bounding the covariance of this quantity requires one to upper bound the number of primes below $n$. This is $\sim \frac{n}{\log n}$ by the prime number theorem, but we can get around introducing extra results by the following trick. Let $M = n^{1/10}$. Define
$$X := \sum_{p \le M} X_p.$$
Since each $x \le n$ cannot have more than 10 prime factors larger than $M$, we have
$$\nu(x) - 10 \le X \le \nu(x).$$
Thus to prove the statement for $\nu(x)$, it is sufficient to prove the same statement for $X$.

There are $\lfloor n/p \rfloor$ many $x$ that are divisible by $p$. So
$$E(X_p) = \frac{\lfloor n/p \rfloor}{n} = \frac{1}{p} + O(n^{-1}).$$
So by Mertens' formula
$$E(X) = \sum_{p \le M} \Big( \frac{1}{p} + O(n^{-1}) \Big) = \log \log(M) + O(1) = \log \log(n) + O(1).$$
Now we bound the variance.
$$\mathrm{Var}(X) = \sum_{p \le M} \mathrm{Var}(X_p) + \sum_{p \ne q \le M} \mathrm{Cov}(X_p, X_q).$$
Since $X_p$ is an indicator with expectation $\frac{1}{p} + O(n^{-1})$,
$$\mathrm{Var}(X_p) = \frac{1}{p}\Big(1 - \frac{1}{p}\Big) + O(n^{-1}).$$
So the first term is $\log \log n + O(1)$. For $p$ and $q$ distinct primes, $X_p X_q = 1$ if and only if $pq$ divides $x$. Thus
$$\mathrm{Cov}(X_p, X_q) = E(X_p X_q) - E(X_p) E(X_q) = \frac{\lfloor n/(pq) \rfloor}{n} - \frac{\lfloor n/p \rfloor}{n} \cdot \frac{\lfloor n/q \rfloor}{n} \le \frac{1}{pq} - \Big(\frac{1}{p} - \frac{1}{n}\Big)\Big(\frac{1}{q} - \frac{1}{n}\Big) \le \frac{1}{n}\Big(\frac{1}{p} + \frac{1}{q}\Big).$$
So
$$\sum_{p \ne q \le M} \mathrm{Cov}(X_p, X_q) \le \frac{1}{n} \sum_{p \ne q \le M} \Big(\frac{1}{p} + \frac{1}{q}\Big) \le \frac{2M}{n} \sum_{p \le M} \frac{1}{p} = O(n^{-9/10} \log \log n) = o(1).$$
Similarly, by lower bounding $\mathrm{Cov}(X_p, X_q)$, one can show that $\sum_{p \ne q \le M} \mathrm{Cov}(X_p, X_q) \ge -o(1)$. So
$$\mathrm{Var}(X) = \log \log n + O(1).$$
Applying Chebyshev's inequality gives, for any $\lambda > 0$,
$$P\big(|X - \log \log n| > \lambda \sqrt{\log \log n}\big) < \lambda^{-2} + o(1).$$
Taking $\lambda = \omega(n) \to \infty$, the right-hand side tends to 0, which proves the statement for $X$ and hence for $\nu(x)$.

7 Lecture 7: Variations on the second moment method

7.1 A slight improvement

Lemma 35 Let $X$ be a non-negative random variable, $X \not\equiv 0$. Then
$$P(X = 0) \le \frac{\mathrm{Var}(X)}{(EX)^2 + \mathrm{Var}(X)}.$$

Compared to the second moment inequality, this has an extra $\mathrm{Var}(X)$ at the bottom. We will see below an example application where this extra $\mathrm{Var}(X)$ makes a difference. For most other applications this does not matter; the original inequality is often sufficient.

Proof: We will prove the equivalent statement, which is
$$P(X > 0) \ge \frac{(EX)^2}{E(X^2)}.$$
By the Cauchy-Schwarz inequality,
$$(EX)^2 = \big(E(X 1_{X > 0})\big)^2 \le E(X^2)\, P(X > 0).$$
Rearranging gives the above.

7.1.1 Percolation on regular tree

Percolation theory models the movement of liquids through a porous material, consisting of 'sites' (vertices) connected by 'bonds' (edges). An edge (or vertex) is open if the liquid flows through, otherwise it is closed. In bond percolation, each edge is open independently with probability $p$. In site percolation, each vertex is open independently with probability $p$. The main question is the existence of an infinite component: on an infinite graph, for what values of $p$ does there exist an infinite subgraph connected by open paths?

Let $T_d$ be the infinite regular tree of degree $d$. Designate 0 to be the root. Consider bond percolation on $T_d$, and let $C_0$ denote the open cluster containing the root. Define
$$\theta(p) = P_p(|C_0| = +\infty).$$
Define $p_c(T_d) = \sup\{p \in [0,1] : \theta(p) = 0\}$. That is, $p_c(T_d)$ is the critical probability for the existence of an infinite component in the regular infinite tree.

Theorem 36 $p_c(T_d) = \frac{1}{d-1}$.

Proof: Let $\partial_n$ be the set of vertices of $T_d$ at distance $n$ from 0. Let $X_n$ be the number of vertices in $\partial_n \cap C_0$. $\partial_n$ is called a cutset, since for $C_0$ to be infinite, one must have $X_n > 0$. By the first moment,
\[ \theta(p) \le P_p(X_n > 0) \le E_p(X_n) = d(d-1)^{n-1} p^n. \]
(There are $d$ vertices in the first level, each vertex gives $d-1$ children at the next level, thus there are $d(d-1)^{n-1}$ vertices at level $n$. For each of them, there is a unique path of length $n$ to the origin.) We have $E_p(X_n) \to 0$ for $p < \frac{1}{d-1}$. Thus $p_c(T_d) \ge \frac{1}{d-1}$.

Now we use the second moment. We shall prove that for $p > \frac{1}{d-1}$, $\lim_{n\to\infty} P_p(X_n > 0) > 0$, and hence $p_c \le \frac{1}{d-1}$.

Note that nodes $x \sim y$ for $x, y \in \partial_n$ iff they have a common ancestor that is not 0. Furthermore, their paths are independent starting from the most recent common ancestor. Let $x \wedge y$ denote the most recent common ancestor of $x$ and $y$. We sum over the pairs by the level of $x \wedge y$. Define $\mu_n = E_p(X_n) = d(d-1)^{n-1} p^n$. Then
\[ E_p(X_n^2) = \sum_{x, y \in \partial_n} P_p(x, y \in C_0) = \mu_n + \sum_{x \in \partial_n} \sum_{m=0}^{n-1} \sum_{y \in \partial_n} \mathbf{1}_{\{x \wedge y \in \partial_m\}}\, p^m p^{2(n-m)}. \]
For a fixed $x$ and $m \ge 1$, the set of $y$ with $x \wedge y \in \partial_m$ has $(d-2)(d-1)^{n-m-1}$ elements. (For $m = 0$ the count is $(d-1)^n$, since the root has $d$ children rather than $d-1$; this only affects the constant $C$ below, and we use the same expression for all $m$.) All the vertices at level $n$ are equivalent, and there are $d(d-1)^{n-1}$ such vertices. Let $r = ((d-1)p)^{-1}$. Since $p > \frac{1}{d-1}$, $r < 1$. So the above equals
\[ \mu_n + d(d-1)^{n-1}(d-2) \sum_{m=0}^{n-1} (d-1)^{n-m-1} p^{2n-m} = \mu_n + d(d-2)(d-1)^{2n-2} p^{2n} \sum_{m=0}^{n-1} ((d-1)p)^{-m} = \mu_n + \mu_n^2\, \frac{d-2}{d} \cdot \frac{1 - r^n}{1 - r}. \]
Dividing by $\mu_n^2$, and using that $\mu_n \to \infty$, we get
\[ \frac{E(X_n^2)}{E(X_n)^2} = \frac{1}{\mu_n} + \frac{d-2}{d} \cdot \frac{1 - r^n}{1 - r} \le \frac{1}{\mu_n} + \frac{d-2}{d} \cdot \frac{1}{1 - ((d-1)p)^{-1}} \le 1 + \frac{d-2}{d} \cdot \frac{1}{1 - ((d-1)p)^{-1}} =: C. \]
This bound holds for all $n$ large enough. Thus by the variant of the second moment inequality (Lemma 35),
\[ \theta(p) = P_p(\text{for all } n,\ X_n > 0) = \lim_{n\to\infty} P_p(X_n > 0) \ge C^{-1} > 0. \]
This concludes the proof.

Note that the original second moment inequality does not work here.
It gives the bound
\[ P(X_n > 0) \ge 1 - \frac{\mathrm{Var}(X_n)}{\mu_n^2} = \frac{2\mu_n^2 - E(X_n^2)}{\mu_n^2} = 2 - \frac{1}{\mu_n} - \frac{d-2}{d} \cdot \frac{1 - r^n}{1 - r}. \]
Take $p$ very close to $\frac{1}{d-1}$, so that $r \approx 1$; then for large $n$ the bound becomes negative.

8 Lecture 8: The Local Lemma

Often one can phrase the existence of a desired random object as the absence of bad events. If the bad events $A_1, \ldots, A_n$ are independent, and $A_i$ happens with probability at most $x_i$, then
\[ P\Big(\bigwedge_{i=1}^n \bar{A}_i\Big) = \prod_{i=1}^n P(\bar{A}_i) \ge \prod_{i=1}^n (1 - x_i). \]
The Local Lemma generalizes this to the case where the bad events $A_i$ have limited dependencies.

Definition 37 Say that an event $A$ is mutually independent of a set of events $\{B_i\}$ if for any subset $\beta$ of events contained in $\{B_i\}$, $P(A \mid \bigwedge_{B \in \beta} B) = P(A)$.

Note that mutual independence is not symmetric (except for the case of two events). That is, for events $A$, $B_1$ and $B_2$,
\[ P(A \mid B_1) = P(A) \iff P(B_1 \mid A) = P(B_1), \]
and if $A$ is mutually independent of $\{B_1, B_2\}$ then $P(A \mid B_1) = P(A)$ and $P(A \mid B_2) = P(A)$, but
\[ P(A \mid B_1) = P(A),\ P(A \mid B_2) = P(A) \ \not\Rightarrow\ P(A \mid B_1, B_2) = P(A). \]

Example 2 Let $B_1$ and $B_2$ be two independent Ber(1/2) random variables, and let $A$ be the event that $B_1 = B_2$. Then $A$ is independent of $B_1$, $A$ is independent of $B_2$, but $A$ is not independent of $\{B_1, B_2\}$, and thus it is not mutually independent of $\{B_1, B_2\}$.

The following proposition is useful to establish mutual independence.

Proposition 38 (Mutual independence principle) Suppose that $Z_1, \ldots, Z_m$ is an underlying sequence of independent events, and suppose that each event $A_i$ is completely determined by some subset $S_i \subset \{Z_1, \ldots, Z_m\}$. For a given $i$, if $S_i \cap S_j = \emptyset$ for all $j = j_1, \ldots, j_k$, then $A_i$ is mutually independent of $\{A_{j_1}, \ldots, A_{j_k}\}$.

Proof: Homework.

Definition 39 (Dependency digraph) The dependency digraph $D = (V, E)$ of events $A_1, \ldots, A_n$ is a directed graph on $n$ nodes, where for $i = 1, \ldots, n$, $A_i$ is mutually independent of all events $\{A_j : (i, j) \notin E\}$.

Note that the dependency digraph is directed since mutual independence is not symmetric.

Lemma 40 (The Local Lemma) For events $A_1$, . . .
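, $A_n$ with dependency digraph $D = (V, E)$, suppose that there are real numbers $x_1, \ldots, x_n \in [0, 1)$ such that for all $i = 1, \ldots, n$,
\[ P(A_i) \le x_i \prod_{j : i \to j \in E} (1 - x_j). \]
Then
\[ P\Big(\bigwedge_{i=1}^n \bar{A}_i\Big) \ge \prod_{i=1}^n (1 - x_i). \]
In particular, with positive probability none of the events $A_i$ hold.

Proof: (Footnote 2: Thanks to class discussion for pointing out issues with the earlier version of the proof.) Fix a subset $S \subset \{1, \ldots, n\}$, $|S| = s \le n$. We will show by induction on $s$ that
\[ P\Big(\bigwedge_{i \in S} \bar{A}_i\Big) \ge \prod_{i \in S} (1 - x_i), \tag{2} \]
and for $i \notin S$,
\[ P\Big(A_i \,\Big|\, \bigwedge_{j \in S} \bar{A}_j\Big) \le x_i. \tag{3} \]
The case $s = 0$ is trivial. Suppose (2) and (3) are true for all $S$ such that $|S| = s' < s$. If (3) is true for sets of size $s - 1$, then (2) holds for sets of cardinality $s$. Indeed, such a set is of the form $S \cup \{i\}$ with $|S| = s - 1$ and $i \notin S$, so by conditional probability
\[ P\Big(\bigwedge_{j \in S \cup \{i\}} \bar{A}_j\Big) = P\Big(\bigwedge_{j \in S} \bar{A}_j\Big) \cdot \Big(1 - P\Big(A_i \,\Big|\, \bigwedge_{j \in S} \bar{A}_j\Big)\Big) \ge \prod_{j \in S \cup \{i\}} (1 - x_j). \]
Thus one can use (3) to prove the induction step in (2). It remains to prove (3) by induction. Split the set into $S_1 = \{j \in S : i \to j\}$, and its complement $S_2 = S \setminus S_1$. If $S_1 = \emptyset$, then $A_i$ is mutually independent of the events indexed by $S$ and (3) follows directly from $P(A_i) \le x_i$, so assume $S_1 \neq \emptyset$. For any subset $T \subseteq [n]$ of indices, define
\[ \bar{A}_T := \bigwedge_{j \in T} \bar{A}_j. \]
Then
\[ P(A_i \mid \bar{A}_S) = P(A_i \mid \bar{A}_{S_1} \wedge \bar{A}_{S_2}) = \frac{P(A_i \wedge \bar{A}_{S_1} \mid \bar{A}_{S_2})}{P(\bar{A}_{S_1} \mid \bar{A}_{S_2})}. \tag{4} \]
For the numerator, since $A_i$ is independent of $\bar{A}_{S_2}$,
\[ P(A_i \wedge \bar{A}_{S_1} \mid \bar{A}_{S_2}) \le P(A_i \mid \bar{A}_{S_2}) = P(A_i) \le x_i \prod_{j : i \to j \in E} (1 - x_j). \tag{5} \]
For the denominator, use conditional probability and apply the induction hypothesis in (3); each conditioning set below has fewer than $s$ elements. Suppose $S_1 = \{j_1, j_2, \ldots, j_r\}$. Then
\[ P(\bar{A}_{S_1} \mid \bar{A}_{S_2}) = P(\bar{A}_{j_1} \wedge \bar{A}_{j_2} \wedge \ldots \wedge \bar{A}_{j_r} \mid \bar{A}_{S_2}) = \big(1 - P(A_{j_1} \mid \bar{A}_{S_2})\big) \cdot \big(1 - P(A_{j_2} \mid \bar{A}_{j_1} \wedge \bar{A}_{S_2})\big) \cdots \big(1 - P(A_{j_r} \mid \bar{A}_{S \setminus \{j_r\}})\big) \]
\[ \ge (1 - x_{j_1})(1 - x_{j_2}) \cdots (1 - x_{j_r}) \ge \prod_{j : i \to j \in E} (1 - x_j). \]
Substituting (5) and this bound in (4) completes the induction step in (3).

8.1 Special cases

Here is a special case of the Local Lemma. This result is most powerful when the events $A_i$ have roughly the same probabilities, and dependencies between events are rare.

Corollary 41 (The Local Lemma, symmetric version) For events $A_1$, . . .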
, $A_n$, suppose that each $A_i$ is mutually independent of all but at most $d$ other events $A_j$, and that $P(A_i) \le p$ for all $i \in [n]$. If
\[ ep(d + 1) \le 1, \]
then $P(\bigwedge_{i=1}^n \bar{A}_i) > 0$.

Proof: For $d = 0$ this is trivial. Suppose $d > 0$. By assumption, there exists a dependency digraph where $|\{j : i \to j \in E\}| \le d$ for all $i$. Define $x_i = \frac{1}{d+1}$; we get
\[ x_i \prod_{j : i \to j} (1 - x_j) \ge \frac{1}{d+1}\left(1 - \frac{1}{d+1}\right)^d > \frac{1}{e(d+1)}. \]
Here we used the fact that for all $d \ge 1$,
\[ \left(1 - \frac{1}{d+1}\right)^d > \frac{1}{e}. \]
So if $ep(d+1) \le 1$, then $P(A_i) \le p \le \frac{1}{e(d+1)} < x_i \prod_{j : i \to j} (1 - x_j)$, and the Local Lemma applies.

8.1.1 Example: Two-coloring of hypergraphs

A hypergraph $H = (V, E)$ is two-colorable if there is a coloring of $V$ by two colors so that no edge $f \in E$ is monochromatic.

Theorem 42 (Erdős-Lovász 1975. (Exercise)) Let $H = (V, E)$ be a hypergraph in which every edge has at least $k$ elements. Suppose that each edge of $H$ intersects at most $d$ other edges. If $e(d + 1) \le 2^{k-1}$, then $H$ is two-colorable.

Proof: Color each vertex of $H$ independently with either color with probability 1/2. For an edge $f \in E$, let $A_f$ be the event that it is monochromatic. Then
\[ P(A_f) = 2^{1 - |f|} \le 2^{1-k}. \]
Note that $A_f$ is mutually independent of $\{A_g : f \cap g = \emptyset\}$ by the mutual independence principle. So the symmetric version of the Local Lemma applies with $p = 2^{1-k}$, and with positive probability no edge is monochromatic.

The following is another special case, useful when the probabilities of the events $A_i$ can differ a lot.

Corollary 43 (The Local Lemma, summation version) For events $A_1, \ldots, A_n$, suppose that for all $i$,
\[ \sum_{j : i \to j} P(A_j) \le \frac{1}{4}. \]
Then $P(\bigwedge_{i=1}^n \bar{A}_i) > 0$.

Proof: Take $x_i = 2P(A_i)$ (we assume $P(A_i) < \frac{1}{2}$ for every $i$, so that $x_i < 1$). For all $i$, given that $\sum_{j : i \to j} x_j \le \frac{1}{2}$, we need to show that
\[ \prod_{j : i \to j} (1 - x_j) \ge \frac{1}{2}. \]
Let $r$ be the cardinality of the set $\{j : i \to j\}$. Since $x_j \in [0, 1]$, we shall prove that the solution to the problem
\[ \text{minimize } \prod_{j=1}^r (1 - x_j) \quad \text{subject to } \sum_{j=1}^r x_j = \frac{1}{2},\ x_j \ge 0 \]
is attained at the boundary where all but one $x_j$ are zero, with minimum value $\frac{1}{2}$. Let $f$ be the objective function. Then
\[ h(x) := \log(f(x)) = \sum_{j=1}^r \log(1 - x_j). \]
Note that $\nabla h(x) = \left( \frac{-1}{1 - x_1}, \ldots, \frac{-1}{1 - x_r} \right)$. By the Lagrange multiplier method, the only critical point of $h$ (and hence of $f$) in the interior is where $x_1 = \ldots = x_r = \frac{1}{2r}$, which is clearly a maximizer. Thus the minimum of the function has to be attained at the boundary of the domain, which is where at least one $x_j$ is zero. Recursing the argument, we are done.

If we apply the summation version under the hypothesis of the symmetric case, we get a bound of $4pd \le 1$, which is slightly worse than $ep(d + 1) \le 1$. However, the $\frac{1}{4}$ in this proof is tight, since the optimization problem we solved has a tight minimum of $\frac{1}{2}$.

8.1.2 Example: frugal graph coloring

Definition 44 A vertex-coloring of a graph $G$ is proper if no edge has both end-points of the same color. A proper vertex-coloring of a graph $G$ is called $\beta$-frugal if no color appears more than $\beta$ times in the neighborhood of any vertex of $G$.

If $\Delta$ is the maximum degree of a vertex of $G$, then a $\beta$-frugal coloring requires at least $\lceil \Delta/\beta \rceil + 1$ colors.

Theorem 45 (1997) For $\beta \ge 2$, if a graph $G$ has maximum degree $\Delta \ge \beta^\beta$, then $G$ has a $\beta$-frugal coloring with $16\Delta^{1 + 1/\beta}$ colors.

Proof: We will pick a random coloring of $G$ with $C = 16\Delta^{1 + 1/\beta}$ colors. Then we will show that with positive probability, the coloring is proper and $\beta$-frugal. There are two types of bad events: those that prevent our coloring from being proper (type A), and those that prevent our coloring from being $\beta$-frugal (type B).

Type A events: for each edge $(u, v)$ of $G$, let $A_{uv}$ be the event that $u$ and $v$ have the same color.

Type B events: for each set $U$ of $\beta + 1$ neighbors of some vertex, let $B_U$ be the event that they all have the same color.

We have $P(A_{uv}) = 1/C$ and $P(B_U) = 1/C^\beta$.

Each type A event is mutually independent of all but at most $2\Delta$ other A events and $2\Delta\binom{\Delta}{\beta}$ B events. Each type B event is mutually independent of all but at most $(\beta + 1)\Delta$ type A events and $(\beta + 1)\Delta\binom{\Delta}{\beta}$ type B events.

Let $E_i$ denote an event of either type A or B.
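To apply the summation version of the Local Lemma, we compute
\[ \sum_{j : B_U \to j} P(E_j) \le (\beta+1)\Delta \cdot \frac{1}{C} + (\beta+1)\Delta\binom{\Delta}{\beta} \cdot \frac{1}{C^\beta} \]
\[ \le \frac{(\beta+1)\Delta}{C} + \frac{(\beta+1)\Delta^{\beta+1}}{\beta!\, C^\beta} \qquad \Big(\text{use } \binom{\Delta}{\beta} \le \frac{\Delta^\beta}{\beta!}\Big) \]
\[ = \frac{\beta+1}{16\Delta^{1/\beta}} + \frac{\beta+1}{\beta!\, 16^\beta} \le \frac{1}{4} \qquad (\text{use } \Delta \ge \beta^\beta,\ \beta \ge 2). \]
Similarly,
\[ \sum_{j : A_{uv} \to j} P(E_j) \le 2\Delta \cdot \frac{1}{C} + 2\Delta\binom{\Delta}{\beta} \cdot \frac{1}{C^\beta}, \]
which is at most the bound just obtained for $\sum_{j : B_U \to j} P(E_j)$, and hence at most $\frac{1}{4}$. Thus by the summation version of the Local Lemma, we conclude that with positive probability, none of the bad events occur, so the coloring is proper and $\beta$-frugal.

Note that if we used the symmetric version, then we would need the bound $P(E_i) \le p = 1/C$, and a dependency set size of at least $d \ge (\beta+1)\Delta\binom{\Delta}{\beta} \ge (\beta+1)\Delta\left(\frac{\Delta}{\beta}\right)^\beta$. But for $\Delta = \beta^\beta$,
\[ pd > \frac{1}{16}\,\beta^{-\beta-1}(\beta+1)\,\beta^\beta\,\beta^{\beta^2 - \beta} > \frac{1}{16}\,\beta^{\beta^2 - \beta} > 1 \]
for $\beta > 3$. So the symmetric Local Lemma does not work. The reason the other version works is that most events have very small probability and only a few have large probability.

9 Lecture 9: More examples of the Local Lemma

9.1 The Lopsided Local Lemma

Let $D_i = \{j : i \to j \in E\}$. The proof of the Local Lemma would still go through if we replace the condition that each $A_i$ is mutually independent of $\{A_j : j \notin D_i\}$, which implies
\[ P\Big(A_i \,\Big|\, \bigwedge_{j \notin D_i} \bar{A}_j\Big) = P(A_i), \]
by the weaker assumption that for each $i$,
\[ P\Big(A_i \,\Big|\, \bigwedge_{j \notin D_i} \bar{A}_j\Big) \le P(A_i). \tag{6} \]
(More precisely, the same inequality is needed with the conjunction taken over any subset of $\{j \notin D_i,\ j \neq i\}$, since the proof conditions on such subsets.) Indeed, in the proof, we had $S_1 \subseteq D_i$ and $S_2 \subseteq \{j \notin D_i\}$. The inequality in line (5) would still hold when we have $P(A_i \mid \bar{A}_{S_2}) \le P(A_i)$. This generalization is useful when we do not have mutual independence between events.

Definition 46 (Negative dependency digraph) A negative dependency digraph $(V, E)$ of events $A_1, \ldots, A_n$ is a directed graph on $n$ nodes such that for $i = 1, \ldots, n$, with $D_i$ the set of children of node $i$, (6) holds.

A dependency digraph is a negative dependency digraph, but the converse is not true. Thus, the Local Lemma for negative dependency digraphs generalizes the Local Lemma. This is also called the Lopsided Local Lemma. It first appeared in the paper of Erdős and Spencer titled 'Lopsided Lovász Local Lemma and Latin transversals'.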
Lemma 47 (The Lopsided Local Lemma) For events $A_1, \ldots, A_n$ with negative dependency digraph $(V, E)$, under the same hypothesis as the Local Lemma, the same conclusion holds.

Proof: Same as the proof of the Local Lemma, with the equality in (5) replaced by an inequality.

What does (6) mean? In words, it says that the event $A_i$ is less likely to occur if its non-neighbors do not occur. One can rewrite (6) in the form of a correlation inequality
\[ P(A_i)\,P\Big(\bigvee_{j \notin D_i} A_j\Big) \le P\Big(A_i \wedge \bigvee_{j \notin D_i} A_j\Big), \]
that is, it says that the events $A_i$ and $\bigvee_{j \notin D_i} A_j$ are positively correlated (or more precisely, have non-negative correlation).

9.1.1 Example: counting derangements

A derangement is a permutation $\pi$ with no fixed points, that is, there is no $i \in [n]$ such that $\pi(i) = i$. Let $D_n$ be the number of derangements of $[n]$. While there are exact counts for $D_n$, we can use the Lopsided Local Lemma to get a lower bound on its asymptotics. While the result is weak, this example shows how the Lopsided version can succeed while the Local Lemma fails.

Lemma 48
\[ \lim_{n \to \infty} \frac{D_n}{n!} \ge \frac{1}{e}. \]

Proof: Let $\pi$ be a permutation chosen uniformly at random. Let $A_i$ denote the event that $i$ is a fixed point of $\pi$. A derangement is thus the event $\bigwedge_{i=1}^n \bar{A}_i$. Unfortunately the Local Lemma fails here, since no pair of events $A_i$, $A_j$ is independent:
\[ P(A_i \wedge A_j) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)} \neq P(A_i)P(A_j) = \frac{1}{n^2}. \]
Thus there is no mutual independence between events. On the other hand, we claim that the graph with $n$ vertices and no edges is a negative dependency graph for this case. That is, for all subsets $S_k$ of $k$ elements of $[n]$, and for all $i \in [n]$, $i \notin S_k$,
\[ P(A_i \mid \bar{A}_{S_k}) \le P(A_i). \]
One can establish this by counting. (Homework. Hint: start with the correlation inequality form.) The intuitive idea is simple: if $k$ other elements are not fixed, then one of them has a positive probability of being mapped to $i$, and in that event, $i$ cannot be a fixed point. Thus the conditional probability is no larger.
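The negative dependency claim $P(A_i \mid \bar{A}_{S_k}) \le P(A_i) = 1/n$ can be sanity-checked by brute force for small $n$. The following Python sketch is only an illustration and not part of the proof; the function names `conditional_fixed_point_prob` and `is_negative_dependency` are made up for this example, and elements are labelled $0, \ldots, n-1$ instead of $1, \ldots, n$.

```python
from fractions import Fraction
from itertools import combinations, permutations


def conditional_fixed_point_prob(n, i, S):
    """P(i is a fixed point | no j in S is a fixed point), uniform permutation of range(n)."""
    good = 0
    total = 0
    for perm in permutations(range(n)):
        if any(perm[j] == j for j in S):
            continue  # the conditioning event fails for this permutation
        total += 1
        if perm[i] == i:
            good += 1
    return Fraction(good, total)


def is_negative_dependency(n):
    """Check P(A_i | no fixed point in S) <= 1/n for every i and every S avoiding i."""
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                if conditional_fixed_point_prob(n, i, S) > Fraction(1, n):
                    return False
    return True


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, is_negative_dependency(n))
```

Exact rational arithmetic is used so that cases of equality (which do occur, for instance $n = 3$ with $S$ consisting of both other elements) are not misjudged by rounding.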
Applying the Lopsided Local Lemma with $x_i = \frac{1}{n}$ (the smallest possible values), we have
\[ P\Big(\bigwedge_{i=1}^n \bar{A}_i\Big) \ge \prod_{i=1}^n \left(1 - \frac{1}{n}\right) = \left(1 - \frac{1}{n}\right)^n \to e^{-1} \]
as $n \to \infty$.

In comparison, the correct value for finite $n$ is
\[ P\Big(\bigwedge_{i=1}^n \bar{A}_i\Big) = \frac{D_n}{n!} = \sum_{k=0}^n \frac{(-1)^k}{k!}, \qquad \text{where } D_n = \left\lfloor \frac{n!}{e} + \frac{1}{2} \right\rfloor. \]

9.2 Lower bound for Ramsey number

In our examples so far we have used 'ready-made' versions of the Local Lemma; in particular, we did not have to choose the $x_i$'s. In general one can formulate this as an optimization problem, and choose the optimal $x_i$'s this way. (The special cases of the Local Lemma come with choices such as $x_i = 2P(A_i)$ for the summation version, or $x_i = \frac{1}{d+1}$ for the symmetric version.)

9.2.1 Example: lower bound for R(k, 3)

Proposition 49 There exists a constant $C \approx \frac{1}{27}$ such that
\[ R(k, 3) > Ck^2/\log^2 k. \]
This is not far from the best current bound, which is $R(k, 3) > C'k^2/\log k$.

Proof: As before, color the edges of $K_n$ blue with probability $p$ and red with probability $1 - p$. There are two types of bad events: let $A_T$ be the event that a triangle $T$ is blue, and $B_S$ be the event that a clique $S$ on $k$ vertices is red. Then
\[ P(A_T) = p^3, \qquad P(B_S) = (1-p)^{\binom{k}{2}}. \]
By the mutual independence principle, each event is mutually independent of all events except those that share an edge with it. So an event $A_T$ is adjacent to $3(n-3) < 3n$ other A events, and at most $\binom{n}{k}$ B events. An event $B_S$ is adjacent to at most $\binom{k}{2}(n-2) < k^2 n/2$ A events, and at most $\binom{n}{k}$ other B events.

Since the events of each type are symmetric (in the sense that there exists a relabeling of the graph that takes one A-event to another), we try to find $x_i$'s such that all A events have the same $x_i = x$, and all B events have the same $x_i = y$. Thus, we need to find $p \in (0, 1)$, and real numbers $x$ and $y$ such that
\[ p^3 \le x(1-x)^{3n}(1-y)^{\binom{n}{k}}, \quad \text{and} \quad (1-p)^{\binom{k}{2}} \le y(1-x)^{k^2 n/2}(1-y)^{\binom{n}{k}}. \]
If there exist such $p$, $x$ and $y$, then $R(k, 3) > n$ by the Local Lemma.
It turns out (after optimizing for the largest possible $n$) that the optimum is reached with
\[ p = c_1 n^{-1/2}, \qquad x = c_3/n^{3/2}, \qquad y = c_4 \Big/ \binom{n}{k}, \]
which gives $R(k, 3) > Ck^2/\log^2 k$.

10 Lecture 10: Correlation inequalities

The condition of a negative dependency digraph can be rewritten as
\[ P(A_i)\,P\Big(\bigvee_{j \notin D_i} A_j\Big) \le P\Big(A_i \wedge \bigvee_{j \notin D_i} A_j\Big), \]
which states that two events are positively correlated. To apply the Lopsided Local Lemma, one needs to establish this inequality. In the example on derangements, we obtained this by counting. More generally, there are conditions which imply that two events are positively correlated. Perhaps the most famous is the FKG inequality. In the next two lectures we state and prove this inequality and its extension, the Four Functions theorem, their applications in percolation, and the XYZ theorem.

10.1 Order inequalities

Theorem 50 (Chebychev's order inequality) Let $f, g : \mathbb{R} \to \mathbb{R}$ be non-decreasing functions. Let $X$ be a random variable distributed according to a probability measure $\mu$. Then
\[ E f(X)\, E g(X) \le E(f(X) g(X)). \tag{7} \]
Equality occurs when either $f$ or $g$ is a constant.

The inequality is intuitive: if $f(X)$ and $g(X)$ are increasing functions of a common variable $X$, then they are positively correlated. A case of particular interest is when $\mu$ is a discrete measure with support on finitely many values $x_1 \le x_2 \le \ldots \le x_n$, with finite total mass $\sum_{i=1}^n \mu(x_i)$. Then Chebychev's inequality (after normalization) states that
\[ \sum_{i=1}^n f(x_i)\mu(x_i) \sum_{i=1}^n g(x_i)\mu(x_i) \le \sum_{i=1}^n f(x_i)g(x_i)\mu(x_i) \sum_{i=1}^n \mu(x_i). \tag{8} \]
The FKG inequality extends the above to the case where the underlying index set is only partially ordered, as opposed to totally ordered. In particular, the index set is a finite distributive lattice.
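The discrete form (8) of Chebychev's order inequality is easy to test numerically. The short Python sketch below is an illustration only; the helper name `chebyshev_order_gap` is hypothetical. It evaluates the right-hand side minus the left-hand side of (8) for values of $f$ and $g$ listed in the order $x_1 \le \ldots \le x_n$, so a non-negative result means (8) holds for that input.

```python
def chebyshev_order_gap(f_vals, g_vals, weights):
    """Right-hand side minus left-hand side of (8).

    f_vals[i], g_vals[i] are f(x_i), g(x_i) and weights[i] is mu(x_i) >= 0,
    with the points listed in increasing order x_1 <= ... <= x_n.
    """
    sum_f = sum(f * w for f, w in zip(f_vals, weights))
    sum_g = sum(g * w for g, w in zip(g_vals, weights))
    sum_fg = sum(f * g * w for f, g, w in zip(f_vals, g_vals, weights))
    total_mass = sum(weights)
    return sum_fg * total_mass - sum_f * sum_g


if __name__ == "__main__":
    # Both sequences non-decreasing: the gap is non-negative.
    print(chebyshev_order_gap([1, 2, 3], [0, 1, 5], [2, 1, 1]))
    # g constant: equality in (8).
    print(chebyshev_order_gap([1, 2, 3], [3, 3, 3], [2, 1, 1]))
    # f increasing, g decreasing: the inequality can fail.
    print(chebyshev_order_gap([1, 2, 3], [5, 1, 0], [1, 1, 1]))
```

The third call shows that monotonicity in the same direction is really needed: with $f$ increasing and $g$ decreasing the gap is negative.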
Definition 51 (Finite distributive lattice) A finite distributive lattice $(L, <)$ consists of a finite set $L$, and a partial order $<$, for which the two functions $\wedge$ (meet) and $\vee$ (join) defined by
\[ x \wedge y := \max\{z \in L : z \le x,\ z \le y\}, \qquad x \vee y := \min\{z \in L : z \ge x,\ z \ge y\} \]
are well-defined and satisfy the distributive laws
\[ x \wedge (y \vee z) = (x \wedge y) \vee (x \wedge z), \qquad x \vee (y \wedge z) = (x \vee y) \wedge (x \vee z). \]

Example 3 Let $L = 2^{[n]}$ be the power set (the set of all subsets of $[n]$). Order the sets by inclusion (i.e. $< \,:=\, \subset$), and define $x \wedge y = x \cap y$, $x \vee y = x \cup y$. Then this is a finite distributive lattice.

A finite distributive lattice can be represented as a graph (its Hasse diagram). A totally ordered set, for example, is a line. Any finite distributive lattice is isomorphic to a sublattice of $(2^{[n]}, \subset)$.

Definition 52 Let $(L, <)$ be a finite distributive lattice. Suppose $\mu : L \to \mathbb{R}_{\ge 0}$, and
\[ \mu(x)\mu(y) \le \mu(x \wedge y)\mu(x \vee y) \quad \text{for all } x, y \in L. \]
Then $\mu$ is called a log-supermodular function.

Theorem 53 (The FKG inequality) If $f, g : L \to \mathbb{R}$ are both increasing (or both decreasing) functions, and $\mu : L \to \mathbb{R}_{\ge 0}$ is log-supermodular, then
\[ \Big(\sum_{x \in L} f(x)\mu(x)\Big)\Big(\sum_{x \in L} g(x)\mu(x)\Big) \le \Big(\sum_{x \in L} f(x)g(x)\mu(x)\Big)\Big(\sum_{x \in L} \mu(x)\Big). \tag{9} \]

Remark: we shall prove the FKG inequality as a corollary of the Four Functions theorem. We shall prove this theorem by induction. Thus, the FKG inequality also holds for a countably infinite distributive lattice.

10.2 Example application

In some applications, the sample space $\Omega$ comes with a natural partial order. For example, in models of random graphs, for two graphs $G, H \in \Omega$, one can define $G \le H$ if every edge in $G$ is present in $H$. Similarly, in bond percolation, for realizations $\omega_1, \omega_2$, we define $\omega_1 \le \omega_2$ if any edge open in $\omega_1$ is also open in $\omega_2$.

Consider a probability space $(\Omega, \mathcal{F}, P)$ where $\Omega$ has a partial order $<$. Say that a random variable $N$ on $(\Omega, \mathcal{F}, P)$ is increasing if $N(\omega) \le N(\omega')$ whenever $\omega \le \omega'$. Say that an event $A$ is increasing if its indicator function is increasing.

Suppose $\Omega$ is countable.
(Such is the case for the Erdős-Rényi random graph, or bond percolation on a graph with countably many edges.) The FKG inequality applied to countable distributive lattices states that if $X$ and $Y$ are increasing random variables with $E(X^2) < \infty$, $E(Y^2) < \infty$, then
\[ E(X)E(Y) \le E(XY). \]
In particular, if $A$ and $B$ are increasing events, then
\[ P(A)P(B) \le P(A \cap B). \]
That is, increasing events are positively correlated.

A typical use of this property is as follows. Consider bond percolation on a graph $G$ with countably many edges. Let $p$ be the probability of an edge being open. Let $\Pi_1, \ldots, \Pi_k$ be families of paths in $G$, and let $A_i$ be the event that some path in $\Pi_i$ is open. The $A_i$'s are clearly increasing events, so
\[ P_p\Big(\bigcap_{i=1}^k A_i\Big) \ge P_p(A_1)\,P_p\Big(\bigcap_{i=2}^k A_i\Big). \]
Iterate this to obtain
\[ P_p\Big(\bigcap_{i=1}^k A_i\Big) \ge \prod_{i=1}^k P_p(A_i). \]
Note the resemblance to the conclusion of the Local Lemma. (This is indeed no surprise, because the condition for the more general Lopsided Local Lemma is a correlation inequality.)

Here is a concrete application. Let $G = (V, E)$ be an infinite connected graph with countably many edges. Consider bond percolation on $G$. For a vertex $x$, write $\theta(p, x)$ for the probability that $x$ lies in an infinite open cluster. For fixed $x$, $\theta(p, x)$ is an increasing function of $p$. Define
\[ p_c(x) = \sup\{p : \theta(p, x) = 0\}. \]
If the graph is symmetric, then one may argue that $p_c(x) = p_c(y)$ for all sites $x, y \in V$, and thus $p_c$ is independent of the choice of $x$. For a general graph this is not intuitively clear.

Theorem 54 The value $p_c(x)$ is independent of the choice of $x$.

Proof: Let $x, y \in V$. Let $\{x \leftrightarrow y\}$ be the event that there is an open path from $x$ to $y$, and let $\{y \leftrightarrow \infty\}$ be the event that $y$ lies in an infinite open cluster. These are both increasing events. So by the FKG inequality,
\[ \theta(p, x) \ge P_p(\{x \leftrightarrow y\} \cap \{y \leftrightarrow \infty\}) \ge P_p(x \leftrightarrow y)\,\theta(p, y). \]
Since $G$ is connected, there is a finite path from $x$ to $y$, so $P_p(x \leftrightarrow y) > 0$ for every $p > 0$. Hence $\theta(p, y) > 0$ implies $\theta(p, x) > 0$, and thus $p_c(x) \le p_c(y)$. The same argument with $x$ and $y$ interchanged gives $p_c(y) \le p_c(x)$. Thus $p_c(x) = p_c(y)$.
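The positive correlation of increasing events can be checked exhaustively on a small product space. The Python sketch below is a toy illustration under assumptions of our own: four edges with arbitrary opening probabilities, two made-up families of "paths" given as tuples of edge indices, and hypothetical helper names `prob` and `some_path_open`. It computes $P(A)$, $P(B)$ and $P(A \cap B)$ exactly for the events "some path in the family is open".

```python
from fractions import Fraction
from itertools import product


def prob(event, probs):
    """Exact P(event) when edge e is open independently with probability probs[e]."""
    total = Fraction(0)
    for omega in product((0, 1), repeat=len(probs)):
        if event(omega):
            weight = Fraction(1)
            for bit, p in zip(omega, probs):
                weight *= p if bit else 1 - p
            total += weight
    return total


def some_path_open(paths):
    """Increasing event: at least one path (a tuple of edge indices) is fully open."""
    return lambda omega: any(all(omega[e] for e in path) for path in paths)


# Toy data (assumed for illustration): four edges and two families of paths.
EDGE_PROBS = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4), Fraction(2, 3)]
EVENT_A = some_path_open([(0, 1), (2,)])
EVENT_B = some_path_open([(1, 3), (2, 3)])

P_A = prob(EVENT_A, EDGE_PROBS)
P_B = prob(EVENT_B, EDGE_PROBS)
P_AB = prob(lambda omega: EVENT_A(omega) and EVENT_B(omega), EDGE_PROBS)

if __name__ == "__main__":
    print(P_A, P_B, P_AB, P_AB >= P_A * P_B)
```

Opening more edges can only help a path become open, which is why both events are increasing; the enumeration confirms $P(A \cap B) \ge P(A)P(B)$ for this instance.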
We will prove the FKG inequality as a special case of the Four Functions theorem.

10.3 The Four Functions theorem

Let $(L, <)$ be a finite distributive lattice. For $X, Y \subset L$, define
\[ X \wedge Y = \{x \wedge y : x \in X,\ y \in Y\}, \qquad X \vee Y = \{x \vee y : x \in X,\ y \in Y\}. \]
For a function $\phi : L \to \mathbb{R}_{\ge 0}$ and $X \subset L$, define
\[ \phi(X) = \sum_{x \in X} \phi(x). \]

Theorem 55 (The Four Functions theorem) Let $L$ be a finite distributive lattice. Consider $\alpha, \beta, \gamma, \delta : L \to \mathbb{R}_{\ge 0}$. If for every $x, y \in L$,
\[ \alpha(x)\beta(y) \le \gamma(x \vee y)\delta(x \wedge y), \tag{10} \]
then for every $X, Y \subset L$,
\[ \alpha(X)\beta(Y) \le \gamma(X \vee Y)\delta(X \wedge Y). \tag{11} \]

First we show how this theorem implies the FKG inequality.

Proof: [Proof of the FKG inequality] Adding a constant to $f$ or to $g$ changes both sides of (9) by the same amount, so we may assume $f, g \ge 0$. We treat the case where $f$ and $g$ are both increasing; if both are decreasing, apply the result to $-f$ and $-g$. For $x \in L$, define
\[ \alpha(x) = \mu(x)f(x), \quad \beta(x) = \mu(x)g(x), \quad \gamma(x) = \mu(x)f(x)g(x), \quad \delta(x) = \mu(x). \]
We claim that these four functions satisfy the hypothesis of the Four Functions theorem. Indeed,
\[ \alpha(x)\beta(y) = f(x)g(y)\mu(x)\mu(y) \]
\[ \le f(x)g(y)\mu(x \wedge y)\mu(x \vee y) \qquad \text{as } \mu \text{ is log-supermodular} \]
\[ \le f(x \vee y)g(x \vee y)\mu(x \vee y)\mu(x \wedge y) \qquad \text{as } f \text{ and } g \text{ are increasing} \]
\[ = \gamma(x \vee y)\delta(x \wedge y). \]
The Four Functions theorem applies. Taking $X = Y = L$, for which $X \vee Y = X \wedge Y = L$, inequality (11) is exactly the FKG inequality (9).

11 Lecture 11: Proof of the Four Functions Theorem and Applications of the FKG inequality

11.1 Proof of the Four Functions theorem

(Notes forthcoming)

11.2 Some applications of the Four Functions theorem

The FKG inequality (and more generally, the Four Functions theorem) is an important and elegant tool to prove correlation between events. Without it, proofs of 'intuitive' statements may involve very complicated counting. The last problem of Homework 4 is an example. We now consider more examples.

11.2.1 Increasing events

As mentioned in the previous lecture, the FKG inequality can be used to show that increasing events are positively correlated. We state some special cases here, since we will need them in the next lecture to prove Janson's inequality.

Consider an $n$-element set $[n]$. Choose the $i$-th element of this set independently with probability $p_i \in [0, 1]$.
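Let $P_p(A)$ be the probability that the chosen set is exactly $A \subset [n]$. For a family $\mathcal{A} \subset 2^{[n]}$, let $P_p(\mathcal{A})$ be the probability that the chosen set is some element $A \in \mathcal{A}$. A family of subsets $\mathcal{A} \subseteq 2^{[n]}$ is called monotone increasing if $A \in \mathcal{A}$, $A \subseteq A'$ implies $A' \in \mathcal{A}$. Define $\mu : 2^{[n]} \to \mathbb{R}_{\ge 0}$ by
\[ \mu(A) = \prod_{i \in A} p_i \prod_{j \notin A} (1 - p_j) = P_p(A). \]
One can check that $\mu$ is log-supermodular. With this $\mu$ and indicator functions $f$ and $g$ (that is, $f(A) = 0$ if $A \notin \mathcal{A}$, and $f(A) = 1$ otherwise, and similarly for $g$ and $\mathcal{B}$), the FKG inequality translates to the following.

Theorem 56 Let $\mathcal{A}, \mathcal{B} \subseteq 2^{[n]}$ be two monotone increasing families of subsets of $[n]$. Then for any $p = (p_1, \ldots, p_n) \in [0, 1]^n$,
\[ P_p(\mathcal{A} \wedge \mathcal{B}) \ge P_p(\mathcal{A})\,P_p(\mathcal{B}), \]
where $\mathcal{A} \wedge \mathcal{B}$ denotes the event that the chosen set belongs to both families.

Here is a simple but non-trivial illustration of the above theorem.

Corollary 57 Suppose $A_1, \ldots, A_k$ are arbitrary subsets of $[n]$. Choose a random subset $A$ of $[n]$ according to $p$ as above. Then
\[ P_p(A \text{ intersects each } A_i) \ge \prod_{i=1}^k P_p(A \text{ intersects } A_i). \]

Proof: [Exercise] Let $\mathcal{A}_i = \{C : C \cap A_i \neq \emptyset\}$. Then each $\mathcal{A}_i$ is a monotone increasing family. Write
\[ P_p(A \text{ intersects each } A_i) = P_p\Big(\bigwedge_i \mathcal{A}_i\Big), \]
and recursively apply the previous theorem.

Corollary 58 Consider the Erdős-Rényi random graph $G = G(n, p)$. Then
\[ P(G \text{ is planar and Hamiltonian}) \le P(G \text{ is planar})\,P(G \text{ is Hamiltonian}). \]

Proof: [Exercise] Being Hamiltonian is an increasing event, while being planar is a decreasing event (adding edges can only destroy planarity). Apply the FKG inequality to the indicator of Hamiltonicity and to minus the indicator of planarity, which are both increasing; this reverses the inequality.

11.2.2 Marica-Schönheim inequality and number theory

For two families of sets $\mathcal{A}$ and $\mathcal{B}$, define
\[ \mathcal{A} \setminus \mathcal{B} = \{A \setminus B : A \in \mathcal{A},\ B \in \mathcal{B}\}. \]
As usual, let $|\mathcal{A} \setminus \mathcal{B}|$ denote the number of distinct elements of $\mathcal{A} \setminus \mathcal{B}$.

Theorem 59 (The MS inequality) For all $\mathcal{A} \subseteq 2^{[n]}$,
\[ |\mathcal{A}| \le |\mathcal{A} \setminus \mathcal{A}|. \]
This is a special case of the Four Functions theorem.

Proof: Consider the set lattice $2^{[n]}$. Choose $\alpha = \beta = \gamma = \delta = 1$, so that $\alpha(\mathcal{T}) = |\mathcal{T}|$ for any family $\mathcal{T} \subseteq 2^{[n]}$. Then the Four Functions theorem states that for all $\mathcal{A}, \mathcal{B} \subseteq 2^{[n]}$,
\[ |\mathcal{A}||\mathcal{B}| \le |\mathcal{A} \wedge \mathcal{B}||\mathcal{A} \vee \mathcal{B}|. \]
For a family $\mathcal{B}$, let $\bar{\mathcal{B}} = \{[n] \setminus b : b \in \mathcal{B}\}$ be the family of complements, so that $|\bar{\mathcal{B}}| = |\mathcal{B}|$. Then
\[ |\mathcal{A}||\mathcal{B}| = |\mathcal{A}||\bar{\mathcal{B}}| \le |\mathcal{A} \vee \bar{\mathcal{B}}||\mathcal{A} \wedge \bar{\mathcal{B}}| = |\overline{\mathcal{A} \vee \bar{\mathcal{B}}}||\mathcal{A} \wedge \bar{\mathcal{B}}| = |\bar{\mathcal{A}} \wedge \mathcal{B}||\mathcal{A} \wedge \bar{\mathcal{B}}| = |\mathcal{B} \setminus \mathcal{A}||\mathcal{A} \setminus \mathcal{B}|. \]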
Now choose $\mathcal{B} = \mathcal{A}$, and we get the conclusion of the MS inequality.

The MS inequality, discovered in 1969, is a non-trivial statement. It arose in connection with the following result in number theory. Recall that an integer is squarefree if it is not divisible by any perfect square other than 1.

Proposition 60 If $0 < a_1 < a_2 < \ldots < a_n$ are all squarefree integers, then
\[ \max_{i,j} \frac{a_i}{\gcd(a_i, a_j)} \ge n. \]

Proof: Let $p_k$ be the $k$-th prime. Since each $a_i$ is squarefree, there is a finite set $S_i \subset \mathbb{N}$ such that
\[ a_i = \prod_{k \in S_i} p_k. \]
Therefore,
\[ \frac{a_i}{\gcd(a_i, a_j)} = \prod_{k \in S_i \setminus S_j} p_k. \]
Let $\mathcal{A} = \{S_1, \ldots, S_n\}$. By the MS inequality,
\[ |\mathcal{A} \setminus \mathcal{A}| = |\{S_i \setminus S_j\}| \ge |\mathcal{A}| = n. \]
That is, there are at least $n$ different sets of the form $S_i \setminus S_j$, and hence there must be at least $n$ different integers of the form $\frac{a_i}{\gcd(a_i, a_j)}$. Thus, the largest value must be at least as large as $n$. This is the desired inequality.

11.2.3 The XYZ Theorem

In the early days of the FKG inequality, an important application was the analysis of partially ordered sets and sorting algorithms. Many algorithms for sorting numbers $\{a_1, \ldots, a_n\}$ perform binary comparisons $(a_i, a_j)$ to successively construct partial orders $P$ until one gets a linear ordering. Thus, a fundamental quantity is $P(a_i > a_j \mid P)$, that is, the probability that $a_i > a_j$ given the current partial order $P$, assuming that all linear extensions of $P$ are equally likely.

The XYZ conjecture (1980, Rival and Sands) states that for any partially ordered set $P$, and any three elements $x, y, z \in P$,
\[ P(x \le y \wedge x \le z) \ge P(x \le y)\,P(x \le z). \]
This seems intuitive: if $x \le y$, then $x$ is small, so it is even more likely to be smaller than $z$. Thus the events $\{x \le y\}$ and $\{x \le z\}$ are likely to be positively correlated.

This type of reasoning may be misleading. For example, consider the statement
\[ P(x_1 < x_2 < x_4 \wedge x_1 < x_3 < x_4) \ge P(x_1 < x_2 < x_4)\,P(x_1 < x_3 < x_4). \]
By the same reasoning, one expects $x_1$ to be small and $x_4$ to be large, and thus the statement seems true. But in fact it is false.
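That this second statement can fail is easy to confirm by enumerating linear extensions of a small poset. The Python sketch below does this for the six-element poset with relations $x_2 < x_5 < x_6 < x_3$ and $x_1 < x_4$ (the counterexample of Mallows described next). It is a brute-force illustration only, and the helper names `linear_extensions` and `prob` are made up for this example.

```python
from fractions import Fraction
from itertools import permutations

# Each pair (a, b) encodes the relation x_a < x_b of the poset.
RELATIONS = [(2, 5), (5, 6), (6, 3), (1, 4)]
ELEMENTS = range(1, 7)


def linear_extensions(elements, relations):
    """All linear orders compatible with the relations, as {element: rank} dicts."""
    extensions = []
    for order in permutations(elements):
        rank = {element: position for position, element in enumerate(order)}
        if all(rank[a] < rank[b] for a, b in relations):
            extensions.append(rank)
    return extensions


def prob(extensions, event):
    """Probability of an event under the uniform measure on linear extensions."""
    hits = sum(1 for rank in extensions if event(rank))
    return Fraction(hits, len(extensions))


def event_e(rank):
    return rank[1] < rank[2] < rank[4]  # x1 < x2 < x4


def event_f(rank):
    return rank[1] < rank[3] < rank[4]  # x1 < x3 < x4


EXTENSIONS = linear_extensions(ELEMENTS, RELATIONS)
P_E = prob(EXTENSIONS, event_e)
P_F = prob(EXTENSIONS, event_f)
P_EF = prob(EXTENSIONS, lambda rank: event_e(rank) and event_f(rank))

if __name__ == "__main__":
    print(len(EXTENSIONS), P_E, P_F, P_EF, P_EF < P_E * P_F)
```

The enumeration runs over all $6! = 720$ orderings, which is instant; for larger posets one would want a smarter enumeration of linear extensions.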
Here is a counterexample by Mallows: for $n = 6$, let $P = \{x_2 < x_5 < x_6 < x_3,\ x_1 < x_4\}$. By computation, one finds that
\[ P(x_1 < x_2 < x_4) = \frac{4}{15}, \]
but
\[ P(x_1 < x_2 < x_4 \mid x_1 < x_3 < x_4) = \frac{1}{4}. \]
So $P(x_1 < x_2 < x_4) > P(x_1 < x_2 < x_4 \mid x_1 < x_3 < x_4)$, and rearranging gives
\[ P(x_1 < x_2 < x_4 \wedge x_1 < x_3 < x_4) < P(x_1 < x_2 < x_4)\,P(x_1 < x_3 < x_4). \]

The XYZ conjecture was first proved by Graham, Yao and Yao (1980) using a complicated combinatorial argument, which used the MS inequality. The second proof by Shepp (1982) uses the FKG inequality. We now present this proof.

Theorem 61 (The XYZ theorem) Let $P$ be a partially ordered set with $n$ elements $a_1, \ldots, a_n$. Consider the probability space of all linear extensions of $P$, each extension being equally likely. Then
\[ P(a_1 \le a_2 \wedge a_1 \le a_3) \ge P(a_1 \le a_2)\,P(a_1 \le a_3). \]

Proof: Fix a large integer $m$. Let $L$ be the set of elements of the form $x = (x_1, \ldots, x_n)$, where $x_i \in [m]$. Define an order relation $\le$ on $L$ as follows: for $x, y \in L$, say $x \le y$ if $x_1 \ge y_1$ and $x_i - x_1 \le y_i - y_1$ for all $2 \le i \le n$. One can show that this makes $(L, \le)$ a lattice with
\[ (x \vee y)_i = \max(x_i - x_1,\ y_i - y_1) + \min(x_1, y_1), \quad \text{for all } i \in [n], \]
\[ (x \wedge y)_i = \min(x_i - x_1,\ y_i - y_1) + \max(x_1, y_1), \quad \text{for all } i \in [n]. \]
One also needs to show that $L$ is a distributive lattice, that is,
\[ x \vee (y \wedge z) = (x \vee y) \wedge (x \vee z). \]
This follows from the following identities for three integers $a, b, c$:
\[ \min(a, \max(b, c)) = \max(\min(a, b), \min(a, c)), \qquad \max(a, \min(b, c)) = \min(\max(a, b), \max(a, c)). \]
Let us show that this implies $L$ is distributive. The $i$-th component of $x \vee (y \wedge z)$ is
\[ \max(x_i - x_1,\ \min(y_i - y_1,\ z_i - z_1)) + \min(x_1,\ \max(y_1, z_1)). \]
The $i$-th component of $(x \vee y) \wedge (x \vee z)$ is
\[ \min(\max(x_i - x_1,\ y_i - y_1),\ \max(x_i - x_1,\ z_i - z_1)) + \max(\min(x_1, y_1),\ \min(x_1, z_1)). \]
Applying the previous two identities, we see that these two quantities are equal. The other distributive law is checked in the same way. So indeed $L$ is a finite distributive lattice.

Now we connect $L$ with the partial order $P$ through the functions $\mu, f, g$.
Define $\mu$ by $\mu(x) = 1$ if $x_i \le x_j$ whenever $a_i \le a_j$ in $P$, and $\mu(x) = 0$ otherwise. Note that $\mu$ puts positive mass only on tuples $x$ that satisfy the inequalities in $P$. To show that $\mu$ is log-supermodular, it suffices to check that if $\mu(x) = \mu(y) = 1$, then $\mu(x \vee y) = \mu(x \wedge y) = 1$.

Suppose $\mu(x) = \mu(y) = 1$. Let $a_i \le a_j$ in $P$. Then $x_i \le x_j$, $y_i \le y_j$, so
\[ (x \vee y)_i = \max(x_i - x_1,\ y_i - y_1) + \min(x_1, y_1) \le \max(x_j - x_1,\ y_j - y_1) + \min(x_1, y_1) = (x \vee y)_j. \]
So $\mu(x \vee y) = 1$. By a similar proof, $\mu(x \wedge y) = 1$.

Define $f(x) = 1$ if $x_1 \le x_2$ and $f(x) = 0$ otherwise. Define $g(x) = 1$ if $x_1 \le x_3$ and $g(x) = 0$ otherwise. These are increasing functions by our definition of the order on $L$: if $x \le y$ and $x_2 - x_1 \ge 0$, then $y_2 - y_1 \ge x_2 - x_1 \ge 0$, and similarly for $g$. Choose $x$ uniformly at random from $L$, and write $\{P \text{ in } L\}$ for the event $\{\mu(x) = 1\}$ that $x$ respects the order $P$. The FKG inequality then states
\[ P(x_1 \le x_2 \wedge x_1 \le x_3 \wedge P \text{ in } L)\,P(P \text{ in } L) \ge P(x_1 \le x_2 \wedge P \text{ in } L)\,P(x_1 \le x_3 \wedge P \text{ in } L), \]
which is almost what we want, except that an element of $L$ is not necessarily a tuple of $n$ distinct integers, and hence may not correspond to a linear extension of $P$. However, as $m \to \infty$, the fraction of $x \in L$ such that $x_i = x_j$ for some $i \neq j$ tends to 0, and conditioned on having distinct entries and respecting $P$, the relative order of the entries of $x$ is a uniformly random linear extension of $P$. Dividing the inequality by $P(P \text{ in } L)^2$ and letting $m \to \infty$ gives the desired inequality. This proves the theorem.

12 Lecture 12: Janson's inequality

12.1 Motivations and setup

Let $\Omega$ be a finite set, and let $\{A_i : i \in I\}$ be subsets of $\Omega$. Choose a random subset $R$ of $\Omega$ by including each element $r \in \Omega$ with probability $p_r$ independently. Define $B_i$ as the event that all elements of $A_i$ were picked, that is, $B_i = \{A_i \subseteq R\}$. Let $X_i$ be the indicator of the event $B_i$, and let $X = \sum_{i \in I} X_i$. We want upper and lower bounds on $P(X = 0)$, or equivalently, on $P(\bigwedge_{i \in I} \bar{B}_i)$.

This is a common setup in many combinatorial problems, many of which we have seen. (Recall: mutual independence principle, Local Lemma examples, random graphs, Chebychev's inequality examples, etc.)

Consider the lower bound. Suppose the $A_i$'s are all disjoint. Then the $B_i$ are independent events, so
\[ P\Big(\bigwedge_{i \in I} \bar{B}_i\Big) = \prod_{i \in I} P(\bar{B}_i). \]
Define $M = \prod_{i \in I} P(\bar{B}_i)$. Now suppose some $A_i$'s overlap.
Recall Corollary 57 in the last lecture, which shows, using the FKG inequality, that the events $\{R \cap A_i \ne \emptyset\}$ are positively correlated. By the same argument, one can show that the events $\overline{B_i}$ are also positively correlated, and thus
\[ M \le \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big). \]

Now consider the upper bound. If $A_i \cap A_j = \emptyset$, then $X_i$ and $X_j$ are independent. For $i \ne j$, write $i \sim j$ if $A_i \cap A_j \ne \emptyset$, and define
\[ \Delta = \sum_{i \sim j} \mathbb{P}(B_i \wedge B_j), \]
where the sum is over ordered pairs, and
\[ \mu = E(X) = \sum_{i \in I} \mathbb{P}(B_i). \]
Then, by Chebychev's inequality,
\[ \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big) = \mathbb{P}(X = 0) \le \frac{\mathrm{Var}(X)}{(EX)^2} \le \frac{\mu + \Delta}{\mu^2}. \tag{12} \]
Janson's inequality is an improvement on this upper bound in this setup, when the $\mathbb{P}(B_i)$ are all small.

Theorem 62 (Janson's inequality) Let $\{B_i\}$, $\Delta$, $M$, $\mu$ be defined as above. Then
\[ \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big) \le e^{-\mu + \Delta/2}. \tag{13} \]
If in addition $\mathbb{P}(B_i) \le \epsilon$ for all $i$, then
\[ M \le \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big) \le M e^{\frac{1}{1-\epsilon}\frac{\Delta}{2}}. \tag{14} \]

The two upper bounds are quite similar in practice. The form (13) is more often used for convenience. Note that for each $i$, $\mathbb{P}(\overline{B_i}) = 1 - \mathbb{P}(B_i) \le e^{-\mathbb{P}(B_i)}$, so $M \le e^{-\mu}$. So if $\Delta$ is small, then the upper bound is very close to the lower bound. (This makes sense: small $\Delta$ means the covariances among the $X_i$ are small, that is, the $X_i$'s are close to being independent.) In many $G(n, p)$ applications, $\epsilon = o(1)$, $\Delta = o(1)$, and $\mu \to k$ as $n \to \infty$. Both bounds then give $\mathbb{P}(X = 0) \to e^{-k}$. In particular, one can often show (by other methods) that $X \xrightarrow{d} \mathrm{Poisson}(k)$. The intuition is that $X$ is a sum of almost independent rare events.

This is no longer the case for large $\Delta$ (that is, far away from independence). If $\Delta \ge 2\mu$, for example, then the bound (13) is useless, since its right-hand side is at least 1. For slightly smaller $\Delta$, the following inequality offers an improvement.

Theorem 63 (The extended Janson's inequality) Under the assumptions of the previous theorem, suppose further that $\Delta \ge \mu$. Then
\[ \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big) \le e^{-\mu^2 / 2\Delta}. \tag{15} \]

Compare (12) and (15). Suppose $\mu \to \infty$ as $|\Omega| \to \infty$ (say, in applications like random graphs, where we send $n$, the number of nodes in the graph, to infinity). Suppose $\mu \ll \Delta$, and $\gamma = \mu^2 / \Delta \to \infty$.
Then Chebychev's upper bound scales as $\gamma^{-1}$, while the extended Janson's inequality scales as $e^{-\gamma/2}$, a vast improvement. We will revisit this point next week, when we view Janson's and Chebychev's inequalities as tail bounds for random variables, and in particular, for sums of indicators.

12.2 Proofs and generalizations

Janson's proof of Janson's inequality bounds the Laplace transform $E(e^{-tX})$ of $X$, a technique common in proving tail bounds. (If time permits, we will consider this proof next week.) Here we follow the proof of Boppana and Spencer, which uses the FKG inequality and resembles the proof of the Local Lemma. First we need an inequality.

Lemma 64 For arbitrary events $A, B, C$,
\[ \mathbb{P}(A \mid B \wedge C) \ge \mathbb{P}(A \wedge B \mid C). \]

Proof: We have
\[ \mathbb{P}(A \wedge B \wedge C) = \mathbb{P}(A \mid B \wedge C)\,\mathbb{P}(B \wedge C) = \mathbb{P}(A \wedge B \mid C)\,\mathbb{P}(C). \tag{16} \]
Since $\mathbb{P}(B \wedge C) \le \mathbb{P}(C)$, the inequality follows by rearranging.

Proof: [Proof of Janson's inequality] As mentioned, the lower bound follows from Theorem 56 (Exercise). We now prove the two upper bounds. The initial steps are the same. Let $|I| = m$. By 'peeling' the intersection from the back, we get
\[ \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big) = \prod_{i=1}^{m} \mathbb{P}\Big(\overline{B_i} \;\Big|\; \bigwedge_{1 \le j < i} \overline{B_j}\Big). \tag{17} \]
We shall upper bound each term in this product, or equivalently, lower bound
\[ \mathbb{P}\Big(B_i \;\Big|\; \bigwedge_{1 \le j < i} \overline{B_j}\Big). \]
Fix an $i$. By renumbering the events $B_1, \ldots, B_{i-1}$, let $d_i \in \{0, 1, \ldots, i-1\}$ be an integer such that, among the indices $1 \le j < i$, we have $i \sim j$ for all $1 \le j \le d_i$, and $i \not\sim j$ for $d_i + 1 \le j < i$. If $d_i = i - 1$, then the latter set is empty. Note that this does not say anything about dependence between $i$ and $j$ for $j > i$.

Apply Lemma 64 with
\[ A = B_i, \qquad B = \bigwedge_{j=1}^{d_i} \overline{B_j}, \qquad C = \bigwedge_{j=d_i+1}^{i-1} \overline{B_j}. \]
Suppose $d_i < i - 1$, so that $C$ is a nonempty conjunction, which has positive probability (we deal with the other case later). Then
\[ \mathbb{P}\Big(B_i \;\Big|\; \bigwedge_{1 \le j < i} \overline{B_j}\Big) = \mathbb{P}(A \mid B \wedge C) \ge \mathbb{P}(A \wedge B \mid C) = \mathbb{P}(A \mid C)\,\mathbb{P}(B \mid A \wedge C). \tag{18} \]
By the mutual independence principle, we have $\mathbb{P}(A \mid C) = \mathbb{P}(A)$. For the remaining term, we bound
\[ \mathbb{P}(B \mid A \wedge C) = 1 - \mathbb{P}\Big(\bigvee_{j=1}^{d_i} B_j \;\Big|\; B_i \wedge C\Big) \ge 1 - \sum_{j=1}^{d_i} \mathbb{P}(B_j \mid B_i \wedge C) \ge 1 - \sum_{j=1}^{d_i} \mathbb{P}(B_j \mid B_i), \]
where the first inequality is the union bound and the second is by FKG.
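If $d_i = i - 1$, then $C$ is an empty conjunction and the conditioning is on $B$ alone. Since $\mathbb{P}(B) > 0$, the analogue of (18) is
\[ \mathbb{P}(A \mid B) = \frac{\mathbb{P}(A)\,\mathbb{P}(B \mid A)}{\mathbb{P}(B)} \ge \mathbb{P}(A)\,\mathbb{P}(B \mid A). \]
By the union bound as argued above,
\[ \mathbb{P}(B \mid A) \ge 1 - \sum_{j=1}^{d_i} \mathbb{P}(B_j \mid B_i). \]
(Similarly, if $d_i = 0$, so that $B$ is an empty conjunction, then one can check that the proof still goes through.) So far we have
\[ \mathbb{P}\Big(B_i \;\Big|\; \bigwedge_{1 \le j < i} \overline{B_j}\Big) \ge \mathbb{P}(B_i) - \sum_{j=1}^{d_i} \mathbb{P}(B_j \mid B_i)\,\mathbb{P}(B_i) = \mathbb{P}(B_i) - \sum_{j=1}^{d_i} \mathbb{P}(B_j \wedge B_i). \]
Subtracting both sides from 1, we get
\[ \mathbb{P}\Big(\overline{B_i} \;\Big|\; \bigwedge_{1 \le j < i} \overline{B_j}\Big) \le \mathbb{P}(\overline{B_i}) + \sum_{j=1}^{d_i} \mathbb{P}(B_j \wedge B_i). \]

For the upper bound (13), use $1 - x \le e^{-x}$ to get
\[ \mathbb{P}\Big(\overline{B_i} \;\Big|\; \bigwedge_{1 \le j < i} \overline{B_j}\Big) \le 1 - \mathbb{P}(B_i) + \sum_{j=1}^{d_i} \mathbb{P}(B_j \wedge B_i) \le \exp\Big(-\mathbb{P}(B_i) + \sum_{j=1}^{d_i} \mathbb{P}(B_j \wedge B_i)\Big). \]
Now taking the product over $i$, and noting that the double sum counts each unordered pair $i \sim j$ exactly once, we get
\[ \mathbb{P}\Big(\bigwedge_{i=1}^{m} \overline{B_i}\Big) \le \exp\Big(-\sum_{i} \mathbb{P}(B_i) + \sum_{i} \sum_{j=1}^{d_i} \mathbb{P}(B_j \wedge B_i)\Big) = \exp(-\mu + \Delta/2). \]

For the upper bound (14), we also use $1 + x \le e^x$, but at another location. Since $\mathbb{P}(\overline{B_i}) \ge 1 - \epsilon$,
\[ \mathbb{P}\Big(\overline{B_i} \;\Big|\; \bigwedge_{1 \le j < i} \overline{B_j}\Big) \le \mathbb{P}(\overline{B_i}) \Big(1 + \frac{1}{1-\epsilon} \sum_{j=1}^{d_i} \mathbb{P}(B_j \wedge B_i)\Big) \le \mathbb{P}(\overline{B_i}) \exp\Big(\frac{1}{1-\epsilon} \sum_{j=1}^{d_i} \mathbb{P}(B_j \wedge B_i)\Big). \]
Again, take the product over $i$ on both sides: $\prod_i \mathbb{P}(\overline{B_i}) = M$, and the inner sum becomes $\Delta/2$.

When we re-examine the proof, in fact we only need to use the following two properties of the $B_j$'s (both come from the FKG inequality):
\[ \mathbb{P}\Big(B_i \;\Big|\; \bigwedge_{j \in J} \overline{B_j}\Big) \le \mathbb{P}(B_i), \tag{19} \]
valid for all index sets $J \subset I$ with $i \notin J$, and
\[ \mathbb{P}\Big(B_i \;\Big|\; B_k \wedge \bigwedge_{j \in J} \overline{B_j}\Big) \le \mathbb{P}(B_i \mid B_k), \tag{20} \]
valid for all index sets $J \subset I$ with $i, k \notin J$. So if the $B_i$ are arbitrary events with a dependency digraph $G$ that satisfy (19) and (20), then Janson's inequality applies. (Admittedly, I only know of examples where Janson's inequality applies directly, so this observation is not hugely useful.)

12.2.1 Proof of the Extended Janson's inequality

Proof: [Proof of the Extended Janson's inequality, Exercise] The Extended Janson's inequality has a probabilistic(!) proof. Here is the idea. For any subset $S \subset I$,
\[ \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big) \le \mathbb{P}\Big(\bigwedge_{i \in S} \overline{B_i}\Big). \]
Thus we can choose a random $S$, start with (13) applied to $S$, and optimize over $S$ to get a new upper bound for the entire collection $I$.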
(Without hindsight, it does seem strange that this argument even works.) Apply (13) to the subcollection indexed by $S$, substitute in the definitions of $\mu$ and $\Delta$ restricted to $S$, and take logarithms to get
\[ -\log \mathbb{P}\Big(\bigwedge_{i \in S} \overline{B_i}\Big) \ge \sum_{i \in S} \mathbb{P}(B_i) - \frac{1}{2} \sum_{i \sim j,\ i, j \in S} \mathbb{P}(B_i \wedge B_j). \]
This inequality holds for every subset $S \subset I$. Let $S$ be a random subset in which each $i \in I$ is included independently with probability $p$, where we will optimize over $p$ later. Then each term $\mathbb{P}(B_i)$ appears with probability $p$, and each term $\mathbb{P}(B_i \wedge B_j)$ appears with probability $p^2$. So
\[ E\Big(-\log \mathbb{P}\Big(\bigwedge_{i \in S} \overline{B_i}\Big)\Big) \ge p\mu - p^2 \Delta/2. \]
The right-hand side is maximized at $p = \mu/\Delta$. By the assumption of the theorem, $p \le 1$. Thus we get
\[ E\Big(-\log \mathbb{P}\Big(\bigwedge_{i \in S} \overline{B_i}\Big)\Big) \ge \frac{\mu^2}{2\Delta}. \]
So there must be a subset $S \subset I$ for which
\[ -\log \mathbb{P}\Big(\bigwedge_{i \in S} \overline{B_i}\Big) \ge \frac{\mu^2}{2\Delta}. \]
So
\[ \mathbb{P}\Big(\bigwedge_{i \in I} \overline{B_i}\Big) \le \mathbb{P}\Big(\bigwedge_{i \in S} \overline{B_i}\Big) \le \exp\Big(-\frac{\mu^2}{2\Delta}\Big). \]
This completes the proof.

13 Lecture 13: Brun's sieve, and some examples

13.1 Example: Threshold of Golomb rulers

Recall that we used first and second moment methods to prove thresholds for random graphs. This really boils down to upper and lower bounding $\mathbb{P}(X = 0)$ for some counting random variable $X$. Thus, we can also use Janson's inequality to prove thresholds. Here is an example.

A set $S \subset \mathbb{N}$ is called a Golomb ruler (or Sidon set) if all pairwise differences $a_i - a_j$, $i \ne j$, of its elements are distinct. If we think of $S$ as marks on a line, then this condition states that no two pairs of points are the same distance apart. The initial motivation for Golomb rulers was to select Fourier frequencies (for radio applications) so as to minimize interference.

Suppose we choose a random $S \subset [n]$ by picking elements of $[n]$ independently with probability $p$. Heuristics suggest that a smaller $p$ makes $S$ more likely to be a Golomb ruler. How large can $p$ be so that $S$ is still very likely to be a Golomb ruler?

Proposition 65 Assume $p = o(n^{-2/3})$. If $p \gg n^{-3/4}$ (that is, $p n^{3/4} \to \infty$), then $\mathbb{P}(S \text{ is a Golomb ruler}) \to 0$. If $p \ll n^{-3/4}$ (that is, $p n^{3/4} \to 0$), then $\mathbb{P}(S \text{ is a Golomb ruler}) \to 1$.

Proof: For a given quadruple $x = (a, b, c, d)$ of elements of $[n]$ with $a - b = c - d$, let $A_x$ be the event that $x \subset S$.
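Let $N$ be the number of such quadruples. For large $n$, one can show that $N \sim \alpha n^3$ for some constant $\alpha$. (The idea is that once we have picked three integers, the last one in the quadruple is determined. One needs to check that this last element is in $[n]$, and is not one of the integers already picked. But the number of invalid choices is of lower order, thus $N$ scales as $\alpha n^3$.)

Let $X = \sum_x \mathbf{1}_{A_x}$ be the number of bad quadruples contained in $S$. So, with $\mu = p^4 N$, by Janson's inequality, we have
\[ N \log(1 - p^4) \le \log \mathbb{P}(X = 0) \le -\mu + \Delta/2, \]
where $N \log(1 - p^4) = -\mu(1 + o(1))$ since $p \to 0$.

Consider $\Delta$. Two distinct quadruples $x \sim y$ share between 1 and 3 integers. If they share $j$ integers, then $\mathbb{P}(A_x \cap A_y) = p^{8-j}$. By the same heuristic used for counting $N$, the number of pairs of quadruples which share $j$ integers is $O(n^{8-j-2})$. Let $i = 8 - j$. Thus, $\Delta$ is bounded by a polynomial of the form
\[ \sum_{i=5}^{7} \alpha_i p^i n^{i-2} \]
for some constants $\alpha_i$. Dividing by $\mu \sim \alpha p^4 n^3$, the three terms are of order $p$, $p^2 n$ and $p^3 n^2$. By our assumption $p n^{2/3} \to 0$, all three tend to 0, so $\Delta \ll \mu$. So by Janson's inequality,
\[ \log \mathbb{P}(X = 0) = \log \mathbb{P}(S \text{ is a Golomb ruler}) \sim -\alpha p^4 n^3. \]
The conclusion follows, since $\alpha p^4 n^3 = \alpha (p n^{3/4})^4$ tends to $\infty$ in the first case and to 0 in the second.

13.2 Example: triangles in G(n, p)

Let $n \ge 3$. Let $X$ be the number of triangles in $G(n, p)$.

Lemma 66 (Exercise) If $p = c n^{-1}$, then as $n \to \infty$, the probability that the graph does not contain any triangle tends to $e^{-c^3/6}$.

Proof: Let $A_i$ be the event that the $i$-th triple of vertices forms a triangle. Then $\mathbb{P}(A_i) = p^3$ and $\mu = E(X) = \binom{n}{3} p^3$. Two distinct triangles $i, j$ share at most one edge, and $i \sim j$ exactly when they share one. For such triangles, $\mathbb{P}(A_i \cap A_j) = p^5$. Choosing the four vertices involved and then the shared edge, and counting each pair in both orders, we get
\[ \Delta = \sum_{i \sim j} \mathbb{P}(A_i \cap A_j) = 2 \binom{n}{4} \binom{4}{2} p^5 = \Theta(n^4 p^5). \]
By Janson's inequality,
\[ \prod_i (1 - p^3) \le \mathbb{P}(X = 0) \le e^{-\mu + \Delta/2}. \]
So if $p = o(n^{-1/2})$, that is, $p n^{1/2} \to 0$, then $\Delta / \mu = \Theta(n p^2) \to 0$ and $\log \mathbb{P}(X = 0) \sim -\mu$. Plugging in $p = c n^{-1}$ gives $\mu \to c^3/6$ and $\Delta \to 0$, which is the desired statement.

13.3 Brun's sieve

Poisson approximation was the original motivation for Janson's inequality. In general, there are two methods to prove that a sequence of random variables $X_n$ converges to a Poisson distribution: convergence of characteristic functions (equivalently, of moment generating functions), and the Chen-Stein method.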
Since the moment generating function of a random variable $X$ is
\[ E(e^{tX}) = \sum_{r=0}^{\infty} \frac{t^r E(X^r)}{r!}, \]
one would hope that if one can show $E(X_n^r) \to$ the $r$-th moment of $\mathrm{Poisson}(\mu)$, then this implies $X_n \xrightarrow{d} \mathrm{Poisson}(\mu)$. Brun's sieve uses this idea, with the small twist that one should consider the limit of $E\binom{X_n}{r}$ instead, since the $r$-th moment of a Poisson random variable is ugly, but $E\binom{\mathrm{Poi}(\mu)}{r} = \frac{\mu^r}{r!}$. The use of Brun's sieve to prove Poisson convergence is also called the moment method.

Theorem 67 Suppose $\{X_n\}$ is a sequence of nonnegative integer-valued random variables, each of which is a sum of indicators. Suppose that there exists a constant $\mu$ such that, for every fixed $r$,
\[ E_r(X_n) := E\big(X_n (X_n - 1)(X_n - 2) \cdots (X_n - r + 1)\big) \to \mu^r. \]
Then for every $t$, as $n \to \infty$,
\[ \mathbb{P}(X_n = t) \to \frac{\mu^t}{t!} e^{-\mu}. \]

There are several ways to prove this statement. One is by inclusion-exclusion (hence the name sieve). Another is by moment generating functions, invoking general theorems in probability such as the Lévy Convergence Theorem; see, for example, Kallenberg [Chapter 4]. We skip the proof and consider an application.

13.3.1 Example: EPIT

In $G(n, p)$, let EPIT stand for the property that every vertex is in a triangle.

Theorem 68 For fixed $c > 0$, $p = p(n)$, $\mu = \mu(n)$, assume that
\[ \binom{n-1}{2} p^3 = \mu, \qquad e^{-\mu} = \frac{c}{n}. \]
Let $X = X(n)$ be the number of vertices of $G(n, p)$ not lying in a triangle. Then
\[ \lim_{n \to \infty} \mathbb{P}(G(n, p) \text{ is EPIT}) = e^{-c}. \]

Proof: Our goal is to apply Brun's sieve. Let $X_x$ be the indicator of the event that $x$ is not in any triangle. Then $X = \sum_x X_x$. We need to find the limits of $E(X)$ and $E\binom{X}{r}$. We will do this by Janson's inequality.

Fix a vertex $x \in V$. For each pair $y, z \in V \setminus \{x\}$, let $B_{xyz}$ be the event that the triangle on $\{x, y, z\}$ is present. Let $C_x$ be the event that $x$ is not in any triangle, that is,
\[ C_x = \bigwedge_{y, z} \overline{B_{xyz}}. \]
Then $X_x$ is the indicator of $C_x$. Let us use Janson's inequality to compute $E(X_x)$. Here we have the events $B_{xyz}$ over pairs $y, z \in V$. By definition of $\mu$, we have
\[ \sum_{y, z} \mathbb{P}(B_{xyz}) = \mu. \]
Let us compute $\Delta$ for these events.
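For distinct triangles, $xyz \sim xuv$ if and only if $\{y, z\} \cap \{u, v\} \ne \emptyset$. So
\[ \Delta = \sum_{y, z, z'} \mathbb{P}(B_{xyz} \wedge B_{xyz'}) = O(n^3 p^5) = o(1), \]
since $p = n^{-2/3 + o(1)}$. So by Janson's inequality,
\[ E(X_x) \sim e^{-\mu} = \frac{c}{n}. \]
Thus
\[ E(X) = \sum_x E(X_x) \to c. \]

Fix $r$. Then
\[ E\binom{X}{r} = \sum \mathbb{P}(C_{x_1} \wedge \cdots \wedge C_{x_r}), \]
where the sum is over all sets of vertices $\{x_1, \ldots, x_r\}$. These events are symmetric, so
\[ E\binom{X}{r} = \binom{n}{r} \mathbb{P}(C_1 \wedge \cdots \wedge C_r) \sim \frac{n^r}{r!} \mathbb{P}(C_1 \wedge \cdots \wedge C_r). \]
But
\[ C_1 \wedge \cdots \wedge C_r = \bigwedge_{i, y, z} \overline{B_{iyz}}, \]
where the conjunction is over $1 \le i \le r$ and over all $y$ and $z$. Again, apply Janson's inequality to this set of events. The number of triangles $\{i, y, z\}$ is $r \binom{n-1}{2} - O(n)$; the overcount comes from those triangles in which $y$ or $z$ is also one of the vertices $1, \ldots, r$. So
\[ \sum_{i, y, z} \mathbb{P}(B_{iyz}) = p^3 \Big(r \binom{n-1}{2} - O(n)\Big) = r\mu + O(n^{-1 + o(1)}). \]
As before, $\Delta$ is $p^5$ times the number of pairs $iyz \sim jyz'$. There are $O(r n^3) = O(n^3)$ terms with $i = j$ and $O(r^2 n^2) = O(n^2)$ terms with $i \ne j$, so $\Delta = o(1)$. Therefore,
\[ \mathbb{P}(C_1 \wedge \cdots \wedge C_r) \sim e^{-r\mu} = \frac{c^r}{n^r}, \]
and
\[ E\binom{X}{r} \sim \frac{c^r}{r!}, \]
as needed.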
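To get a feel for the orders of magnitude used in this proof, one can solve the two defining equations of Theorem 68 for $p$ and evaluate the quantity $n^3 p^5$ that controls $\Delta$. The following is a minimal sketch, not part of the original notes: the function name `epit_parameters` is hypothetical, and $n^3 p^5$ is used only as a proxy for $\Delta$ with the implied constant ignored.

```python
import math


def epit_parameters(n, c):
    """Solve C(n-1, 2) * p^3 = mu and exp(-mu) = c / n for (mu, p).

    Also returns n^3 * p^5, the order of magnitude of Delta in the proof
    (the hidden constant in the O(.) bound is ignored here).
    """
    mu = math.log(n / c)                      # from exp(-mu) = c / n
    pairs = math.comb(n - 1, 2)               # number of pairs {y, z} for a fixed x
    p = (mu / pairs) ** (1.0 / 3.0)           # from C(n-1, 2) * p^3 = mu
    delta_scale = (n ** 3) * (p ** 5)         # proxy for Delta = O(n^3 p^5)
    return mu, p, delta_scale


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        mu, p, delta_scale = epit_parameters(n, c=1.0)
        print(f"n={n:.0e}  mu={mu:.3f}  p={p:.3e}  n^3 p^5={delta_scale:.4f}")
```

Since $n^3 p^5$ behaves like $n^{-1/3}$ times a power of $\log n$, it does tend to 0, but slowly: under these assumptions it is still above 1 at $n = 10^6$, which suggests the asymptotic statement only becomes accurate for very large graphs.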