Statistics and Probability Letters 79 (2009) 270–274

Complete monotonicity of the entropy in the central limit theorem for gamma and inverse Gaussian distributions

Yaming Yu
Department of Statistics, University of California, Irvine, CA 92697, USA

Abstract

Let $H_g(\alpha)$ be the differential entropy of the gamma distribution $\mathrm{Gam}(\alpha, \sqrt{\alpha})$. It is shown that $(1/2)\log(2\pi e) - H_g(\alpha)$ is a completely monotone function of $\alpha$. This refines the monotonicity of the entropy in the central limit theorem for gamma random variables. A similar result holds for the inverse Gaussian family. How generally this complete monotonicity holds is left as an open problem.

Article history: Received 16 June 2008; received in revised form 11 August 2008; accepted 14 August 2008; available online 22 August 2008.

1. Introduction

The classical central limit theorem (CLT) states that for a sequence of independent and identically distributed (i.i.d.) random variables $X_i$, $i = 1, 2, \ldots$, with a finite mean $\mu = E(X_1)$ and a finite variance $\sigma^2 = \mathrm{Var}(X_1)$, the normalized partial sum $Z_n = (\sum_{i=1}^n X_i - n\mu)/\sqrt{n\sigma^2}$ tends to $N(0, 1)$ in distribution as $n \to \infty$. There exists an information theoretic version of the CLT. The differential entropy of a continuous random variable $S$ with density $f(s)$ is defined as

$$H(S) = -\int_{-\infty}^{\infty} f(s) \log f(s)\, ds.$$

Barron (1986) proved that if the (differential) entropy of $Z_n$ is eventually finite, then it tends to $(1/2)\log(2\pi e)$, the entropy of $N(0, 1)$, as $n \to \infty$. As a consequence of Pinsker's inequality (Cover and Thomas, 1991), convergence in entropy implies convergence in distribution in the normal CLT case. An interesting feature of the information theoretic CLT is that $H(Z_n)$ increases monotonically in $n$ (which reminds us of the second law of thermodynamics); since the entropy is invariant under a location shift, equivalently, for a sequence of i.i.d. random variables $X_i$ with a finite second moment, we have

$$H\!\left(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i\right) \le H\!\left(\frac{1}{\sqrt{n+1}} \sum_{i=1}^{n+1} X_i\right), \quad n \ge 1. \tag{1}$$

Although the special case $n = 1$ is known to follow from Shannon's classical entropy power inequality (Stam, 1959; Blachman, 1965), it was not until 2004 that the above inequality was finally proved by Artstein et al. (2004); see also Madiman and Barron (2007) for generalizations.

This paper is concerned with a possible refinement of this monotonicity, at least when the distribution of $X_i$ belongs to certain special families. Recall that a function $f(x)$ on $x \in (0, \infty)$ is called completely monotonic if its derivatives of all orders exist and satisfy

$$(-1)^k f^{(k)}(x) \ge 0, \quad x > 0,$$

for all $k \ge 0$. In particular, a completely monotone function is both monotonically decreasing and convex. A discrete sequence $f_n$, $n = 1, 2, \ldots$, is completely monotonic if

$$(-1)^k \Delta^k f_n \ge 0, \quad n = 1, 2, \ldots,$$

for all $k \ge 0$, where $\Delta$ denotes the first difference operator, i.e., $\Delta f_n = f_{n+1} - f_n$. Some basic properties of completely monotone functions can be found in Feller (1966).
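As a quick numerical illustration of the discrete definition (not part of the original paper), the following minimal Python sketch checks the sign condition $(-1)^k \Delta^k f_n \ge 0$ by iterated finite differences. The helper name `is_completely_monotone` and the test sequences are illustrative choices; $f_n = 1/n$ is a classical completely monotone sequence, while $\log(n+1)$ is increasing and hence not.

```python
# A minimal sketch: test the sign pattern (-1)^k Delta^k f_n >= 0 up to a finite order.
import numpy as np

def is_completely_monotone(seq, max_order=6, tol=1e-12):
    """Check (-1)^k Delta^k f_n >= 0 for k = 0, ..., max_order on the given values."""
    f = np.asarray(seq, dtype=float)
    for k in range(max_order + 1):
        diffs = np.diff(f, n=k)                 # k-th forward difference Delta^k f_n
        if np.any((-1) ** k * diffs < -tol):
            return False
    return True

n = np.arange(1, 60)
print(is_completely_monotone(1.0 / n))          # True: 1/n is completely monotone
print(is_completely_monotone(np.log(n + 1)))    # False: an increasing sequence fails at k = 1
```

Of course, such a finite-order, finite-range check can only suggest, not prove, complete monotonicity; the results below are analytic.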
Theorem 1. Let $X_i$, $i = 1, 2, \ldots$, be i.i.d. random variables with distribution $F$, mean $\mu$, and variance $\sigma^2 \in (0, \infty)$. Denote

$$h(n) = \frac{1}{2}\log(2\pi e) - H\!\left(\frac{1}{\sqrt{n\sigma^2}} \sum_{i=1}^{n} (X_i - \mu)\right), \quad n = 1, 2, \ldots.$$

Then $h(n)$ is a completely monotonic function of $n$ if $F$ is either a gamma distribution or an inverse Gaussian distribution.

Note that in Theorem 1, $h(n)$ is also the relative entropy, or Kullback–Leibler divergence, between $(1/\sqrt{n\sigma^2}) \sum_{i=1}^{n} (X_i - \mu)$ and the standard normal. Theorem 1 indicates that the behavior of the relative entropy in the classical CLT is (at least for the gamma and inverse Gaussian families) highly regular in a sense. In Sections 2 and 3, we derive Theorem 1 for the gamma and inverse Gaussian families respectively. Part of the reason these two families are chosen is that they are analytically tractable. It seems intuitive that Theorem 1 should hold for a wide class of distributions, but the proofs presented here depend heavily on properties of the gamma and inverse Gaussian distributions, and are unlikely to work in the general situation. How Theorem 1 can be extended is an open problem.

2. The gamma case

Let $H_g(\alpha, \beta)$, $\alpha > 0$, $\beta > 0$, be the entropy of a $\mathrm{Gam}(\alpha, \beta)$ random variable whose density is specified by

$$p(x; \alpha, \beta) = \frac{1}{\Gamma(\alpha)} \beta^{\alpha} x^{\alpha - 1} e^{-\beta x}, \quad x > 0,$$

where $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\, dx$ is Euler's gamma function. If the $X_i$'s are i.i.d. with a $\mathrm{Gam}(\alpha, \beta)$ distribution, then the entropy of the standardized sum is

$$H\!\left(\frac{\beta}{\sqrt{n\alpha}} \sum_{i=1}^{n} \left(X_i - \frac{\alpha}{\beta}\right)\right) = H_g\!\left(n\alpha, \sqrt{n\alpha}\right).$$

With a slight abuse of notation let $H_g(\alpha) = H_g(\alpha, \sqrt{\alpha})$. It is easy to see that Theorem 2 below implies Theorem 1 when the common distribution of the $X_i$ belongs to the gamma family.

Theorem 2. The function $(1/2)\log(2\pi e) - H_g(\alpha)$ is completely monotonic on $\alpha \in (0, \infty)$.

Proof. Direct calculation gives

$$H_g(\alpha) = \log \Gamma(\alpha) + \alpha - \frac{1}{2}\log(\alpha) + (1 - \alpha)\psi(\alpha) \tag{2}$$

where $\psi(\alpha)$ is the digamma function defined by $\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$. It is clear that $(1/2)\log(2\pi e) > H_g(\alpha)$, because the standard normal achieves maximum entropy among continuous distributions with unit variance. Moreover,

$$H_g'(\alpha) = (1 - \alpha)\psi'(\alpha) + 1 - \frac{1}{2\alpha}. \tag{3}$$

By Leibniz's rule, for $k \ge 2$,

$$(-1)^k H_g^{(k)}(\alpha) = (-1)^k\left[(1 - \alpha)\psi^{(k)}(\alpha) - (k - 1)\psi^{(k-1)}(\alpha)\right] + \frac{(k-1)!}{2\alpha^k}. \tag{4}$$

The function $\psi^{(k)}(\alpha)$, $k \ge 1$, admits an integral representation (Abramowitz and Stegun, 1964, p. 260)

$$\psi^{(k)}(\alpha) = (-1)^{k+1} \int_0^{\infty} \frac{t^k e^{-\alpha t}}{1 - e^{-t}}\, dt. \tag{5}$$

Lemma 1. If $k \ge 2$ then

$$\alpha\psi^{(k)}(\alpha) + k\psi^{(k-1)}(\alpha) = (-1)^k \int_0^{\infty} \frac{t^k e^{-\alpha t - t}}{(1 - e^{-t})^2}\, dt. \tag{6}$$

For $k = 1$ we have

$$\alpha\psi'(\alpha) = 1 + \int_0^{\infty} \frac{1 - e^{-t} - t e^{-t}}{(1 - e^{-t})^2}\, e^{-\alpha t}\, dt. \tag{7}$$

Proof. Using (5) and integration by parts, we get

$$\begin{aligned}
k\psi^{(k-1)}(\alpha) &= (-1)^k \int_0^{\infty} \frac{k t^{k-1} e^{-\alpha t}}{1 - e^{-t}}\, dt \\
&= (-1)^{k+1} \int_0^{\infty} t^k \left[\frac{-\alpha e^{-\alpha t}}{1 - e^{-t}} - \frac{e^{-\alpha t - t}}{(1 - e^{-t})^2}\right] dt \\
&= -\alpha\psi^{(k)}(\alpha) + (-1)^k \int_0^{\infty} \frac{t^k e^{-\alpha t - t}}{(1 - e^{-t})^2}\, dt,
\end{aligned}$$

which proves (6). The proof of (7) is similar.

In view of (4)–(6), we have, for $k \ge 2$,

$$\begin{aligned}
(-1)^k H_g^{(k)}(\alpha) &= (-1)^k\left[\psi^{(k)}(\alpha) + \psi^{(k-1)}(\alpha)\right] - \int_0^{\infty} \frac{t^k e^{-\alpha t - t}}{(1 - e^{-t})^2}\, dt + \frac{(k-1)!}{2\alpha^k} \\
&= \int_0^{\infty} \left[\frac{1 - t}{1 - e^{-t}} - \frac{t e^{-t}}{(1 - e^{-t})^2} + \frac{1}{2}\right] t^{k-1} e^{-\alpha t}\, dt \\
&= \int_0^{\infty} \left[\frac{1}{1 - e^{-t}} - \frac{t}{(1 - e^{-t})^2} + \frac{1}{2}\right] t^{k-1} e^{-\alpha t}\, dt.
\end{aligned} \tag{8}$$

A parallel calculation using (7) reveals that the expression (8) is also valid for $k = 1$.

Lemma 2. The function

$$u(t) \equiv \frac{1}{1 - e^{-t}} - \frac{t}{(1 - e^{-t})^2} + \frac{1}{2}$$

is negative on $t \in (0, \infty)$.

Proof. Put $s = 1 - e^{-t}$. Then

$$u(t) = \frac{s + s^2/2 + \log(1 - s)}{s^2} < 0,$$

given the elementary inequality $\log(1 - s) < -s - s^2/2$ for $0 < s < 1$.

Lemma 2 immediately yields $(-1)^k H_g^{(k)}(\alpha) < 0$, $k \ge 1$, thus completing the proof of Theorem 2.

As an illustration, Fig. 1 displays the functions $H_g^{(k)}(\alpha)$ on $(1, 10)$ for $k = 0, 1, 2, 3$. The calculations are based on (2)–(4). The monotone behavior is in clear agreement with Theorem 2.

[Fig. 1. The functions $H_g^{(k)}(\alpha)$, $k = 0, 1, 2, 3$, on $\alpha \in (1, 10)$.]
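The following short Python sketch (not from the paper) reproduces the computation behind Fig. 1 using formulas (2)–(4) and the polygamma functions in scipy.special, and numerically confirms the sign pattern $(-1)^k H_g^{(k)}(\alpha) < 0$ of Theorem 2 on a grid; the function names `Hg` and `Hg_deriv` are illustrative.

```python
import numpy as np
from math import factorial
from scipy.special import gammaln, psi, polygamma

def Hg(alpha):
    # Formula (2): entropy of Gam(alpha, sqrt(alpha))
    return gammaln(alpha) + alpha - 0.5 * np.log(alpha) + (1 - alpha) * psi(alpha)

def Hg_deriv(alpha, k):
    # Formula (3) for k = 1 and formula (4) for k >= 2; polygamma(k, x) = psi^(k)(x)
    if k == 1:
        return (1 - alpha) * polygamma(1, alpha) + 1 - 1 / (2 * alpha)
    return ((1 - alpha) * polygamma(k, alpha)
            - (k - 1) * polygamma(k - 1, alpha)
            + (-1) ** k * factorial(k - 1) / (2 * alpha ** k))

alphas = np.linspace(1.0, 10.0, 91)
print(np.all(Hg(alphas) < 0.5 * np.log(2 * np.pi * np.e)))   # True: the normal maximizes entropy
for k in (1, 2, 3):
    print(k, np.all((-1) ** k * Hg_deriv(alphas, k) < 0))     # True for each k, as in Theorem 2
```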
3. The inverse Gaussian case

The inverse Gaussian distribution $\mathrm{IG}(\mu, \lambda)$, $\mu > 0$, $\lambda > 0$, has density

$$p(x; \mu, \lambda) = \left(\frac{\lambda}{2\pi x^3}\right)^{1/2} \exp\!\left(-\frac{\lambda(x - \mu)^2}{2\mu^2 x}\right), \quad x > 0. \tag{9}$$

The mean is $\mu$ and the variance is $\mu^3/\lambda$. Some well-known properties are listed below; further details can be found in Seshadri (1993).

Lemma 3. The inverse Gaussian family is closed under scaling and (i.i.d.) convolution. Specifically,
• if $X \sim \mathrm{IG}(\mu, \lambda)$ and $a > 0$ is a constant, then $aX \sim \mathrm{IG}(a\mu, a\lambda)$;
• if $X_1, \ldots, X_n$ are i.i.d. random variables distributed as $\mathrm{IG}(\mu, \lambda)$, then $\sum_{i=1}^{n} X_i \sim \mathrm{IG}(n\mu, n^2\lambda)$.

Lemma 4. If $X \sim \mathrm{IG}(\mu, \lambda)$ then $\lambda(X - \mu)^2/(\mu^2 X)$ has a $\chi^2$ distribution with one degree of freedom.

Let $H_{ig}(\mu, \lambda)$ denote the entropy of an $\mathrm{IG}(\mu, \lambda)$ random variable. By Lemma 3, for an i.i.d. sequence $X_i \sim \mathrm{IG}(\mu, \lambda)$, the entropy of the standardized sum is

$$H\!\left((n\mu^3/\lambda)^{-1/2} \sum_{i=1}^{n} (X_i - \mu)\right) = H_{ig}\!\left((n\lambda/\mu)^{1/2}, (n\lambda/\mu)^{3/2}\right).$$

It is then clear that Theorem 3 implies Theorem 1 in the inverse Gaussian case.

Theorem 3. The function $(1/2)\log(2\pi e) - H_{ig}(\theta^{1/2}, \theta^{3/2})$, $\theta \in (0, \infty)$, is a completely monotone function of $\theta$.

Proof. Let $J(\theta) = (1/2)\log(2\pi e) - H_{ig}(\theta^{1/2}, \theta^{3/2})$. As in the proof of Theorem 2, we know $J(\theta) > 0$ and only need to show $(-1)^k J^{(k)}(\theta) \ge 0$ for all $k \ge 1$. Write $\mu = \theta^{1/2}$ for notational convenience and let $X \sim \mathrm{IG}(\mu, \mu^3)$. Direct calculation using (9) gives

$$\begin{aligned}
J(\theta) &= \frac{1}{2}\log(2\pi e) + E\log p(X; \mu, \mu^3) \\
&= \frac{1}{2} - \frac{3}{2}\, E\log(X/\mu) - E\!\left[\frac{\mu(X - \mu)^2}{2X}\right] \\
&= -\frac{3}{2}\, E\log(X/\mu),
\end{aligned}$$

where Lemma 4 is used in the last equality. To calculate $E\log(X/\mu)$, let us recall the moment generating function of $X$:

$$E e^{tX} = \exp\!\left[\mu^2\left(1 - \sqrt{1 - 2t/\mu}\right)\right].$$

Using this and the integral identity

$$\log(a) = \int_0^{\infty} \frac{e^{-t} - e^{-at}}{t}\, dt, \quad a > 0,$$

we may express $E\log(X/\mu)$ as

$$\begin{aligned}
E\log(\mu X/\mu^2) &= -2\log(\mu) + E\int_0^{\infty} \frac{e^{-t} - e^{-\mu X t}}{t}\, dt \\
&= -2\log(\mu) + \int_0^{\infty} \frac{e^{-t} - e^{\mu^2(1 - \sqrt{1 + 2t})}}{t}\, dt \\
&= -\log(\theta) + \int_0^{\infty} \frac{e^{-t} - e^{\theta(1 - \sqrt{1 + 2t})}}{t}\, dt.
\end{aligned}$$

Thus

$$\frac{\mathrm{d}\, E\log(X/\mu)}{\mathrm{d}\theta} = -\frac{1}{\theta} + \int_0^{\infty} \frac{\sqrt{1 + 2t} - 1}{t}\, e^{\theta(1 - \sqrt{1 + 2t})}\, dt.$$

By a change of variables $s = \sqrt{1 + 2t} - 1$ in the above integral, we obtain

$$\begin{aligned}
\frac{\mathrm{d}\, E\log(X/\mu)}{\mathrm{d}\theta} &= -\frac{1}{\theta} + \int_0^{\infty} \frac{s(1 + s)}{s^2/2 + s}\, e^{-\theta s}\, ds \\
&= -\int_0^{\infty} e^{-\theta s}\, ds + \int_0^{\infty} \frac{2(1 + s)}{s + 2}\, e^{-\theta s}\, ds \\
&= \int_0^{\infty} \frac{s}{s + 2}\, e^{-\theta s}\, ds.
\end{aligned}$$

It is then clear that

$$(-1)^k \frac{\mathrm{d}^k E\log(X/\mu)}{\mathrm{d}\theta^k} = -\int_0^{\infty} \frac{s^k}{s + 2}\, e^{-\theta s}\, ds < 0$$

for all $k \ge 1$; in other words, $(-1)^k J^{(k)}(\theta) > 0$, $k \ge 1$.
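As a numerical sanity check of Theorem 3 (not part of the paper), the sketch below evaluates $J(\theta)$ with scipy, verifies that its finite differences alternate in sign on a grid, and compares $J'(\theta)$ with the integral formula $-(3/2)\int_0^{\infty} \frac{s}{s+2} e^{-\theta s}\, ds$ obtained above. It assumes the usual mapping of $\mathrm{IG}(m, \lambda)$ to scipy's `invgauss(mu=m/lambda, scale=lambda)`; the helper name `J` and the grid values are illustrative.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def J(theta):
    # J(theta) = (1/2) log(2*pi*e) - H_ig(theta^(1/2), theta^(3/2))
    m, lam = np.sqrt(theta), theta ** 1.5
    h_ig = stats.invgauss(m / lam, scale=lam).entropy()   # IG(m, lam) in scipy's parameterization
    return 0.5 * np.log(2 * np.pi * np.e) - h_ig

grid = np.linspace(1.0, 6.0, 26)
vals = np.array([J(t) for t in grid])
for k in range(4):
    # Complete monotonicity suggests (-1)^k Delta^k J >= 0 along the grid
    print(k, np.all((-1) ** k * np.diff(vals, n=k) >= -1e-8))

theta = 2.0
dJ_integral = -1.5 * quad(lambda s: s / (s + 2) * np.exp(-theta * s), 0, np.inf)[0]
dJ_numeric = (J(theta + 1e-4) - J(theta - 1e-4)) / 2e-4
print(dJ_integral, dJ_numeric)   # the two values should be close, and negative
```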
Acknowledgment

The author would like to thank an anonymous referee for his/her valuable comments.

References

Abramowitz, M., Stegun, I.A. (Eds.), 1964. Handbook of Mathematical Functions. Dover, New York.
Artstein, S., Ball, K.M., Barthe, F., Naor, A., 2004. Solution of Shannon's problem on the monotonicity of entropy. J. Amer. Math. Soc. 17, 975–982.
Barron, A.R., 1986. Entropy and the central limit theorem. Ann. Probab. 14, 336–342.
Blachman, N.M., 1965. The convolution inequality for entropy powers. IEEE Trans. Inform. Theory 11, 267–271.
Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley, New York.
Feller, W., 1966. An Introduction to Probability Theory and its Applications, vol. 2. Wiley, New York.
Madiman, M., Barron, A.R., 2007. Generalized entropy power inequalities and monotonicity properties of information. IEEE Trans. Inform. Theory 53, 2317–2329.
Seshadri, V., 1993. The Inverse Gaussian Distribution. Oxford Univ. Press, Oxford.
Stam, A.J., 1959. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inform. Control 2, 101–112.