Parametric estimation. Finite sample theory

Vladimir Spokoiny*
Weierstrass Institute, Mohrenstr. 39, 10117 Berlin, Germany
Humboldt University Berlin, Moscow Institute of Physics and Technology
[email protected]

Abstract

The paper aims at reconsidering the famous Le Cam LAN theory. The main features of the approach which make it different from the classical one are: (1) the study is non-asymptotic, that is, the sample size is fixed and does not tend to infinity; (2) the parametric assumption is possibly misspecified and the underlying data distribution can lie beyond the given parametric family. These two features make it possible to bridge the gap between parametric and nonparametric theory and to build a unified framework for statistical estimation. The main results include a large deviation bound for the (quasi) maximum likelihood estimate and a local quadratic bracketing result for the log-likelihood process. The latter yields a number of important corollaries for statistical inference: concentration, confidence and risk bounds, an expansion of the maximum likelihood estimate, etc. All these corollaries are stated in a non-classical way admitting a model misspecification and finite samples. However, the classical asymptotic results, including the efficiency bounds, can be easily derived as corollaries of the obtained non-asymptotic statements. At the same time, the new bracketing device works well in situations with large or growing parameter dimension in which the classical parametric theory fails. The general results are illustrated for the i.i.d. set-up as well as for generalized linear modeling and median estimation. The results apply for any dimension of the parameter space and provide a quantitative lower bound on the sample size yielding the root-n accuracy.

* The author is supported by the Predictive Modeling Laboratory, MIPT, RF government grant, ag. 11.G34.31.0073. Financial support by the German Research Foundation (DFG) through the Collaborative Research Center 649 "Economic Risk" is gratefully acknowledged. Critics and suggestions of two anonymous referees helped a lot in improving the paper.

AMS 2000 Subject Classification: Primary 62F10. Secondary 62J12, 62F25, 62H12.
Keywords: maximum likelihood, local quadratic bracketing, deficiency, concentration

1 Introduction

One of the most popular approaches in statistics is based on the parametric assumption (PA) that the distribution IP of the observed data Y belongs to a given parametric family (IP_θ, θ ∈ Θ ⊆ IR^p), where p stands for the number of parameters. This assumption allows one to reduce the problem of statistical inference about IP to recovering the parameter θ. The theory of parameter estimation and inference is nicely developed in a quite general set-up. There is a vast literature on this issue. We only mention the book by Ibragimov and Khas'minskij (1981), which provides a comprehensive study of asymptotic properties of maximum likelihood and Bayesian estimators. The theory is essentially based on two major assumptions: (1) the underlying data distribution follows the PA; (2) the sample size or the amount of available information is large relative to the number of parameters. In many practical applications, both assumptions can be very restrictive and limit the scope of applicability of the whole approach. Indeed, the PA is usually only an approximation of the real data distribution, and in most statistical problems it is too restrictive to assume that the PA is exactly fulfilled.
Many modern statistical problems deal with very complex high dimensional data where a huge number of parameters are involved. In such situations, the applicability of large sample asymptotics is questionable. These two issues partially explain why the parametric and nonparametric theories are almost isolated from each other. Relaxing these restrictive assumptions can be viewed as an important challenge of modern statistical theory. The present paper attempts to develop a unified approach which does not require the restrictive parametric assumptions but still enjoys the main benefits of the parametric theory. The main steps of the approach are similar to the classical local asymptotic normality (LAN) theory; see e.g. Chapters 1–3 in the monograph Ibragimov and Khas'minskij (1981): first one localizes the problem to a neighborhood of the target parameter. Then one uses a local quadratic expansion of the log-likelihood to solve the corresponding estimation problem. There is, however, one feature of the proposed approach which makes it essentially different from the classical scheme. Namely, the use of the bracketing device instead of the classical Taylor expansion allows one to consider much larger local neighborhoods than in the LAN theory. More specifically, the classical LAN theory effectively requires a strict localization to a root-n vicinity of the true point. At this point, the LAN theory fails in extending to the nonparametric situation. Our approach works for any local vicinity of the true point. This opens the door to building a unified theory including most of the classical parametric and nonparametric results.

Let Y stand for the available data. Everywhere below we assume that the observed data Y follow the distribution IP on a metric space Y. We do not specify any particular structure of Y. In particular, no assumption like independence or weak dependence of individual observations is imposed. The basic parametric assumption is that IP can be approximated by a parametric distribution IP_θ from a given parametric family (IP_θ, θ ∈ Θ ⊆ IR^p). Our approach allows that the PA can be misspecified, that is, in general, IP ∉ (IP_θ).

Let L(Y, θ) be the log-likelihood for the considered parametric model: L(Y, θ) = log (dIP_θ/dµ_0)(Y), where µ_0 is any dominating measure for the family (IP_θ). We focus on the properties of the process L(Y, θ) as a function of the parameter θ. Therefore, we suppress the argument Y there and write L(θ) instead of L(Y, θ). One has to keep in mind that L(θ) is random and depends on the observed data Y. By L(θ, θ^*) = L(θ) − L(θ^*) we denote the log-likelihood ratio. The classical likelihood principle suggests estimating θ by maximizing the corresponding log-likelihood function L(θ):

    θ̃ = argmax_{θ∈Θ} L(θ).    (1.1)

Our ultimate goal is to study the properties of the quasi maximum likelihood estimator (MLE) θ̃. It turns out that such properties can be naturally described in terms of the maximum of the process L(θ) rather than the point of maximum θ̃. To avoid technical burdens it is assumed that the maximum is attained, leading to the identity max_{θ∈Θ} L(θ) = L(θ̃). However, the point of maximum does not have to be unique. If there are many such points we take θ̃ as any of them. Basically, the notation θ̃ is used for the identity L(θ̃) = sup_{θ∈Θ} L(θ).
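Since θ̃ in (1.1) is defined purely through maximization of L(θ), it can be computed numerically even when no closed form exists. The following minimal sketch is only an illustration: the Gaussian working model, the gamma data and all constants are assumptions chosen here for concreteness, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Data Y: drawn from a skewed law, so the Gaussian working model is misspecified.
Y = rng.gamma(shape=2.0, scale=1.5, size=200)

def log_lik(theta, y):
    """Quasi log-likelihood L(theta) of the working model N(mu, sigma^2)."""
    mu, log_sigma = theta                      # sigma parametrized on the log scale for stability
    sigma = np.exp(log_sigma)
    return np.sum(-0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((y - mu) / sigma) ** 2)

# theta_tilde = argmax_theta L(theta): minimize the negative quasi log-likelihood.
res = minimize(lambda th: -log_lik(th, Y), x0=np.array([0.0, 0.0]), method="BFGS")
theta_tilde = res.x
print("quasi-MLE (mu, log sigma):", theta_tilde, "  max L =", -res.fun)
```

Even though the data do not belong to the Gaussian family, the maximizer is well defined; it targets the best parametric fit θ^* introduced in (1.2) below.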
If IP ∉ (IP_θ), then the (quasi) MLE θ̃ from (1.1) is still meaningful and it appears to be an estimator of the value θ^* defined by maximizing the expected value of L(θ):

    θ^* = argmax_{θ∈Θ} IEL(θ),    (1.2)

which is the true value in the parametric situation and can be viewed as the parameter of the best parametric fit in the general case.

The results below show that the main properties of the quasi MLE θ̃ like concentration or coverage probability can be described in terms of the excess, which is the difference between the maximum of the process L(θ) and its value at the "true" point θ^*:

    L(θ̃, θ^*) = L(θ̃) − L(θ^*) = max_{θ∈Θ} L(θ) − L(θ^*).

The established results can be split into two big groups. A large deviation bound states some concentration properties of the estimator θ̃. For specific local sets Θ_0(r) with elliptic shape, the deviation probability IP(θ̃ ∉ Θ_0(r)) is exponentially small in r. This concentration bound allows one to restrict the parameter space to a properly selected vicinity Θ_0(r). Our main results concern the local properties of the process L(θ) within Θ_0(r) including a bracketing bound and its corollaries.

The paper is organized as follows. Section 2 presents the list of conditions which are systematically used in the text. The conditions only concern the properties of the quasi log-likelihood process L(θ). Section 3 appears to be central in the whole approach and it focuses on local properties of the process L(θ) within Θ_0(r). The idea is to sandwich the underlying (quasi) log-likelihood process L(θ) for θ ∈ Θ_0(r) between two quadratic (in parameter) expressions. Then the maximum of L(θ) over Θ_0(r) will be sandwiched as well by the maxima of the lower and upper processes. The quadratic structure of these processes helps to compute these maxima explicitly, yielding the bounds for the value of the original problem. This approximation result is used to derive a number of corollaries including the concentration and coverage probability, an expansion of the estimator θ̃, polynomial risk bounds, etc. In contrast to the classical theory, all the results are non-asymptotic and do not involve any small values of the form o(1); all the terms are specified explicitly. Also the results are stated under possible model misspecification. Section 4 complements the local results with the concentration property which bounds the probability that θ̃ deviates from the local set Θ_0(r). In the modern statistical literature there is a number of studies considering maximum likelihood or, more generally, minimum contrast estimators in a general i.i.d. situation, when the parameter set Θ is a subset of some functional space. We mention the papers Van de Geer (1993), Birgé and Massart (1993), Birgé and Massart (1998), Birgé (2006) and references therein. The established results are based on deep probabilistic facts from empirical process theory; see e.g. Talagrand (1996, 2001, 2005), van der Vaart and Wellner (1996), Boucheron et al. (2003). The general result presented in Section A follows the generic chaining idea due to Talagrand (2005); cf. Bednorz (2006). However, we do not assume any specific structure of the model. In particular, we do not assume independent observations and thus cannot apply the most developed concentration bounds from the empirical process theory. Section 5 illustrates the applicability of the general results to the classical case of an i.i.d. sample.
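To make the target (1.2) and the excess concrete, here is a toy Monte Carlo sketch; the Gaussian working model and the exponential true law are arbitrary illustrative assumptions, not taken from the paper. IEL(θ) is approximated by averaging the log-likelihood over a very large sample from the true law, and the excess L(θ̃, θ^*) is then evaluated for one data set of moderate size.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def log_lik(theta, y):
    """Gaussian working model; the true law below is exponential, so the PA is misspecified."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return np.sum(-0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((y - mu) / sigma) ** 2)

# Approximate theta* = argmax IE L(theta) by maximizing the average log-likelihood of a huge sample.
big = rng.exponential(scale=2.0, size=200_000)
theta_star = minimize(lambda th: -log_lik(th, big) / big.size, x0=[1.0, 0.0]).x

# One data set of moderate size and its quasi-MLE.
Y = rng.exponential(scale=2.0, size=300)
res = minimize(lambda th: -log_lik(th, Y), x0=[1.0, 0.0])
theta_tilde = res.x

# Excess L(theta_tilde, theta*) = max_theta L(theta) - L(theta*); nonnegative by construction.
excess = -res.fun - log_lik(theta_star, Y)
print("theta* ~", theta_star, "  theta~ =", theta_tilde, "  excess =", excess)
```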
The previously established general results apply under rather mild conditions. Basically we assume some smoothness of the log-likelihood process and some minimal number of observations per parameter: the sample size should be at least of order of the dimensionality p of the parameter space. We also consider the examples of generalized linear modeling and of median regression. It is important to mention that the non-asymptotic character of our study yields an almost complete change of the mathematical tools: the notions of convergence and tightness become meaningless, the arguments based on compactness of the parameter space do not apply, etc. Instead we utilize the tools of empirical process theory based on the ideas of concentration of measure and non-asymptotic entropy bounds. Section ?? in the Appendix presents an exponential bound for a general quadratic form which is very important for getting the sharp risk bounds for the quasi MLE. This bound is an important step in the concentration results for the quasi MLE. Section A explains how the generic chaining and majorizing measure device by Talagrand (2005), refined in Bednorz (2006), can be used for obtaining a general exponential bound for the log-likelihood process. The proposed approach can be useful in many further research directions including penalized maximum likelihood and semiparametric estimation, Andresen and Spokoiny (2013), contraction rate and asymptotic normality of the posterior within the Bayes approach, ?, local adaptive quantile estimation, Spokoiny et al. (2013).

2 Conditions

Below we collect the list of conditions which are systematically used in the text. It seems to be an advantage of the whole approach that all the results are stated in a unified way under the same conditions. Once checked, one obtains automatically all the established results. We do not try to formulate the conditions and the results in the most general form. In some cases we sacrifice generality in favor of readability and ease of presentation. It is important to stress that all the conditions only concern the properties of the quasi likelihood process L(θ). Even if the process L(·) is not a sufficient statistic, the whole analysis is entirely based on its geometric structure and probabilistic properties. The conditions are not restrictive and can be effectively checked in many particular situations. Some examples are given in Section 5 for the i.i.d. setup, generalized linear models, and for median regression.

The imposed conditions can be classified into the following groups by their meaning:
• smoothness conditions on L(θ) allowing the second order Taylor expansion;
• exponential moment conditions;
• identifiability and regularity conditions.

We also distinguish between local and global conditions. The global conditions concern the global behavior of the process L(θ) while the local conditions focus on its behavior in the vicinity of the central point θ^*. Below we suppose that the degree of locality is described by a number r. The local zone corresponds to r ≤ r_0 for a fixed r_0. The global conditions concern r > 0.

2.1 Local conditions

Local conditions describe the properties of L(θ) in a vicinity of the central point θ^* from (1.2). To bound local fluctuations of the process L(θ), we introduce an exponential moment condition on the stochastic component ζ(θ):

    ζ(θ) = L(θ) − IEL(θ).
Below we suppose that the random function ζ(θ) is differentiable in θ and its gradient ∇ζ(θ) = ∂ζ(θ)/∂θ ∈ IR^p has some exponential moments. Our first condition describes the property of the gradient ∇ζ(θ^*) at the central point θ^*.

(ED_0) There exist a positive symmetric matrix V_0^2 and constants g > 0, ν_0 ≥ 1 such that Var(∇ζ(θ^*)) ≤ V_0^2 and for all |λ| ≤ g

    sup_{γ∈IR^p} log IE exp{ λ γ^⊤∇ζ(θ^*) / ||V_0 γ|| } ≤ ν_0^2 λ^2 / 2.

In a typical situation, the matrix V_0^2 can be defined as the covariance matrix of the gradient vector ∇ζ(θ^*): V_0^2 = Var(∇ζ(θ^*)) = Var(∇L(θ^*)). If L(θ) is the log-likelihood for a correctly specified model, then θ^* is the true parameter value and V_0^2 coincides with the corresponding Fisher information matrix. The matrix V_0 shown in this condition determines the local geometry in the vicinity of θ^*. In particular, define the local elliptic neighborhoods of θ^* as

    Θ_0(r) = {θ ∈ Θ : ||V_0(θ − θ^*)|| ≤ r}.    (2.1)

The further conditions are restricted to such defined neighborhoods Θ_0(r).

(ED_1) For each r ≤ r_0, there exists a constant ω(r) ≤ 1/2 such that it holds for all θ ∈ Θ_0(r) and |λ| ≤ g

    sup_{γ∈IR^p} log IE exp{ λ γ^⊤{∇ζ(θ) − ∇ζ(θ^*)} / (ω(r) ||V_0 γ||) } ≤ ν_0^2 λ^2 / 2.

Here the constant g is the same as in (ED_0).

The main bracketing result also requires second order smoothness of the expected log-likelihood IEL(θ). By definition, L(θ^*, θ^*) ≡ 0 and ∇IEL(θ^*) = 0 because θ^* is the extreme point of IEL(θ). Therefore, −IEL(θ, θ^*) can be approximated by a quadratic function of θ − θ^* in a neighborhood of θ^*. The local identifiability condition quantifies this quadratic approximation from above and from below on the set Θ_0(r) from (2.1).

(L_0) There are a symmetric strictly positive-definite matrix D_0^2 and, for each r ≤ r_0, a constant δ(r) ≤ 1/2 such that it holds on the set Θ_0(r)

    | −2 IEL(θ, θ^*) / ||D_0(θ − θ^*)||^2 − 1 | ≤ δ(r).

Usually D_0^2 is defined as the negative Hessian of IEL(θ) at θ^*: D_0^2 = −∇^2 IEL(θ^*). If L(θ, θ^*) is the log-likelihood ratio and IP = IP_{θ^*} then −IEL(θ, θ^*) = IE_{θ^*} log(dIP_{θ^*}/dIP_θ) = K(IP_{θ^*}, IP_θ), the Kullback-Leibler divergence between IP_{θ^*} and IP_θ. Then condition (L_0) with D_0 = V_0 follows from the usual regularity conditions on the family (IP_θ); cf. Ibragimov and Khas'minskij (1981).

If the log-likelihood process L(θ) is sufficiently smooth in θ, e.g. three times stochastically differentiable, then the quantities ω(r) and δ(r) can be taken proportional to the value ϱ(r) defined as

    ϱ(r) = max_{θ∈Θ_0(r)} ||θ − θ^*||.

In the important special case of an i.i.d. model one can take ω(r) = ω^* r / n^{1/2} and δ(r) = δ^* r / n^{1/2} for some constants ω^*, δ^*; see Section 5.1.

The identifiability condition relates the matrices D_0^2 and V_0^2.

(I) There is a constant a > 0 such that a^2 D_0^2 ≥ V_0^2.

2.2 Global conditions

The global conditions have to be fulfilled for all θ lying beyond Θ_0(r_0). We only impose one condition on the smoothness of the stochastic component of the process L(θ) in terms of its gradient, and one identifiability condition in terms of the expectation IEL(θ, θ^*). The first condition is similar to the local condition (ED_0) and it requires some exponential moment of the gradient ∇ζ(θ) for all θ ∈ Θ. However, the constant g may depend on the radius r = ||V_0(θ − θ^*)||.

(Er) For any r, there exists a value g(r) > 0 such that for all |λ| ≤ g(r)

    sup_{θ∈Θ_0(r)} sup_{γ∈IR^p} log IE exp{ λ γ^⊤∇ζ(θ) / ||V_0 γ|| } ≤ ν_0^2 λ^2 / 2.
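The quantities entering these conditions can be computed or simulated explicitly in simple models. The sketch below is a toy illustration under an assumed logistic working model with a misspecified response mean; it evaluates V_0^2 = Var ∇ζ(θ^*), D_0^2 = −∇^2 IEL(θ^*), the smallest constant a with a^2 D_0^2 ≥ V_0^2 from condition (I), and membership in Θ_0(r). Under correct specification the two matrices would coincide and a = 1.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 500, 3
Psi = rng.normal(size=(n, p))                                    # fixed design, rows Psi_i
f = 1 / (1 + np.exp(-(0.8 * Psi[:, 0] - 0.5 * Psi[:, 1] ** 2)))  # true IE Y_i, NOT of logistic form
sigmoid = lambda t: 1 / (1 + np.exp(-t))

def neg_expected_loglik(theta):
    """-IE L(theta) for the (misspecified) logistic working model, up to a theta-free constant."""
    t = Psi @ theta
    return -np.sum(f * t - np.log1p(np.exp(t)))

theta_star = minimize(neg_expected_loglik, x0=np.zeros(p)).x     # target of estimation (1.2)

# V0^2 = Var grad zeta(theta*) = sum_i f_i (1 - f_i) Psi_i Psi_i^T  (independent Bernoulli responses)
V0sq = (Psi * (f * (1 - f))[:, None]).T @ Psi
# D0^2 = -Hessian of IE L at theta* = sum_i s_i (1 - s_i) Psi_i Psi_i^T with s_i = sigmoid(Psi_i' theta*)
s = sigmoid(Psi @ theta_star)
D0sq = (Psi * (s * (1 - s))[:, None]).T @ Psi

# Condition (I): smallest a with a^2 D0^2 >= V0^2, i.e. a^2 = lambda_max(L^-1 V0^2 L^-T), D0^2 = L L^T.
L = np.linalg.cholesky(D0sq)                  # any factorization works: congruence preserves the order
B = np.linalg.solve(L, np.linalg.solve(L, V0sq).T)
a = np.sqrt(np.linalg.eigvalsh(B).max())
print("a from condition (I):", a)

# Membership in the local set Theta_0(r) = {theta : ||V0 (theta - theta*)|| <= r}.
in_Theta0 = lambda theta, r: np.sqrt((theta - theta_star) @ V0sq @ (theta - theta_star)) <= r
print("theta* + 0.1*e_1 in Theta_0(3)?", in_Theta0(theta_star + 0.1 * np.eye(p)[0], 3.0))
```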
Finite sample theory The global identification property means that the deterministic component IEL(θ, θ ∗ ) of the log-likelihood is competitive with its variance Var L(θ, θ ∗ ) . (Lr) There is a function b(r) such that rb(r) monotonously increases in r and for each r ≥ r0 inf IEL(θ, θ ∗ ) ≥ b(r)r2 . θ: kV0 (θ−θ ∗ )k=r 3 Local inference The Local Asymptotic Normality (LAN) condition since introduced in Le Cam (1960) became one of the central notions in the statistical theory. It postulates a kind of local approximation of the log-likelihood of the original model by the log-likelihood of a Gaussian shift experiment. The LAN property being once checked yields a number of important corollaries for statistical inference. In words, if you can solve a statistical problem for the Gaussian shift model, the result can be translated under the LAN condition to the original setup. We refer to Ibragimov and Khas’minskij (1981) for a nice presentation of the LAN theory including asymptotic efficiency of MLE and Bayes estimators. The LAN property was extended to mixed LAN or Local Asymptotic Quadraticity (LAQ); see e.g. Le Cam and Yang (2000). All these notions are very much asymptotic and very much local. The LAN theory also requires that L(θ) is the correctly specified log-likelihood. The strict localization does not allow for considering a growing or infinite parameter dimension and limits applications of the LAN theory to nonparametric estimation. Our approach tries to avoid asymptotic constructions and attempts to include a possible model misspecification and a large dimension of the parameter space. The presentation below shows that such an extension of the LAN theory can be made essentially for free: all the major asymptotic results like Fisher and Cram´er-Rao information bounds, as well as the Wilks phenomenon can be derived as corollaries of the obtained nonasymptotic statements simply by letting the sample size to infinity. At the same time, it applies to a high dimensional parameter space. The LAN property states that the considered process L(θ) can be approximated by a quadratic in θ expression in a vicinity of the central point θ ∗ . This property is usually checked using the second order Taylor expansion. The main problem arising here is that the error of the approximation grows too fast with the local size of the neighborhood. Section 3.1 presents the non-asymptotic version of the LAN property in which the local quadratic approximation of L(θ) is replaced by bounding this process from above and from below by two different quadratic in θ processes. More precisely, we apply the 9 spokoiny, v. bracketing idea: the difference L(θ, θ ∗ ) = L(θ) − L(θ ∗ ) is put between two quadratic processes L (θ, θ ∗ ) and L (θ, θ ∗ ) : L (θ, θ ∗ ) − ♦ ≤ L(θ, θ ∗ ) ≤ L (θ, θ ∗ ) + ♦ , θ ∈ Θ0 (r), (3.1) where is a numerical parameter, = − , and ♦ and ♦ are stochastic errors which only depend on the selected vicinity Θ0 (r) . The upper process L (θ, θ ∗ ) and the lower process L (θ, θ ∗ ) can deviate substantially from each other, however, the errors ♦ , ♦ remain small even if the value r describing the size of the local neighborhood Θ0 (r) is large. The sandwiching result (3.1) naturally leads to two important notions: the value of the problem and the spread. It turns out that most of the statements like confidence and concentration probability rely upon the maximum of L(θ, θ ∗ ) over θ which we call the excess. Its expectation will be referred to as the value of the problem. 
Due to (3.1) the excess can be bounded from above and from below using the similar quantities maxθ L (θ, θ ∗ ) and maxθ L (θ, θ ∗ ) which can be called the lower and upper excess, while their expectations are the values of the lower and upper problems. Note that maxθ {L (θ, θ ∗ ) − L (θ, θ ∗ )} can be very large or even infinite. However, this is not crucial. What really matters is the difference between the upper and the lower excess. The spread ∆ can be defined as the width of the interval bounding the excess due to (3.1), that is, as the sum of the approximation errors and of the difference between the upper and the lower excess: def ∆ = ♦ + ♦ + max L (θ, θ ∗ ) − max L (θ, θ ∗ ) . θ θ The range of applicability of this approach can be described by the following mnemonic rule: “The value of the upper problem is larger in order than the spread.” The further sections explain in details the meaning and content of this rule. Section 3.1 presents the key bound (3.1) and derives it from the general results on empirical processes. Section 3.2 presents some straightforward corollaries of the bound (3.1) including the coverage and concentration probabilities, expansion of the MLE and the risk bounds. It also indicates how the classical results on asymptotic efficiency of the MLE follow from the obtained non-asymptotic bounds. 3.1 Local quadratic bracketing This section presents the key result about local quadratic approximation of the quasi log-likelihood process given by Theorem 3.1 below. Let the radius r of the local neighborhood Θ0 (r) be fixed in a way that the deviation e 6∈ Θ0 (r) is sufficiently small. Precise results about the choice of r probability IP θ 10 Parametric estimation. Finite sample theory which ensures this property are postponed until Section 4. In this neighborhood Θ0 (r) we aim at building some quadratic lower and upper bounds for the process L(θ) . The first step is the usual decomposition of this process into deterministic and stochastic components: L(θ) = IEL(θ) + ζ(θ), where ζ(θ) = L(θ) − IEL(θ) . Condition (L0 ) allows to approximate the smooth deterministic function IEL(θ) − IEL(θ ∗ ) around the point of maximum θ ∗ by the quadratic form −kD0 (θ − θ ∗ )k2 /2 . The smoothness properties of the stochastic component ζ(θ) given by conditions (ED0 ) and (ED1 ) leads to linear approximation ζ(θ) − ζ(θ ∗ ) ≈ (θ − θ ∗ )> ∇ζ(θ ∗ ) . Putting these two approximations together yields the following approximation of the process L(θ) on Θ0 (r) : def L(θ, θ ∗ ) ≈ L(θ, θ ∗ ) = (θ − θ ∗ )> ∇ζ(θ ∗ ) − kD0 (θ − θ ∗ )k2 /2. (3.2) This expansion is used in most of statistical calculus. However, it does not suit our purposes because the error of approximation grows quadratically with the radius r and starts to dominate at some critical value of r . We slightly modify the construction by introducing two different approximating processes. They only differ in the deterministic quadratic term which is either shrunk or stretched relative to the term kD0 (θ − θ ∗ )k2 /2 in L(θ, θ ∗ ) . Let δ, % be nonnegative constants. Introduce for a vector = (δ, %) the following notation: def L (θ, θ ∗ ) = (θ − θ ∗ )> ∇L(θ ∗ ) − kD (θ − θ ∗ )k2 /2 ∗ ∗ 2 = ξ> D (θ − θ ) − kD (θ − θ )k /2, (3.3) where ∇L(θ ∗ ) = ∇ζ(θ ∗ ) by ∇IEL(θ ∗ ) = 0 and D2 = D02 (1 − δ) − %V02 , def ξ = D−1 ∇L(θ ∗ ). Here we implicitly assume that with the proposed choice of the constants δ and % , the matrix D2 is non-negative: D2 ≥ 0 . 
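The algebra behind (3.3) can be checked mechanically. The sketch below uses made-up inputs D_0^2, V_0^2 and a gradient vector (all assumptions for illustration, with the subscript of (3.3) written as "eps"): it builds the shrunk matrix D_eps^2 = (1 − δ)D_0^2 − ϱV_0^2, the vector ξ_eps = D_eps^{-1}∇L(θ^*), and confirms numerically that the quadratic process is maximized at θ^* + D_eps^{-1}ξ_eps with maximal value ||ξ_eps||^2/2, as used in Lemma 3.8 below.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(M)
    return (U * np.sqrt(w)) @ U.T

# Made-up inputs: D0^2, V0^2 positive definite, a gradient grad = nabla L(theta*),
# and constants delta, rho with (1 - delta) D0^2 - rho V0^2 > 0.
A = rng.normal(size=(p, p)); D0sq = A @ A.T + p * np.eye(p)
B = rng.normal(size=(p, p)); V0sq = 0.5 * (B @ B.T) + np.eye(p)
grad = rng.normal(size=p)
delta, rho = 0.05, 0.02

Deps_sq = (1 - delta) * D0sq - rho * V0sq       # shrunk matrix D_eps^2 from (3.3)
Deps = sym_sqrt(Deps_sq)
xi_eps = np.linalg.solve(Deps, grad)            # xi_eps = D_eps^{-1} grad L(theta*)

def L_eps(u):
    """Quadratic process L_eps(theta, theta*) as a function of u = theta - theta*."""
    return grad @ u - 0.5 * u @ Deps_sq @ u

u_hat = np.linalg.solve(Deps, xi_eps)           # maximizer: theta - theta* = D_eps^{-1} xi_eps
print("max value  :", L_eps(u_hat))
print("||xi||^2/2 :", 0.5 * xi_eps @ xi_eps)    # the two numbers coincide
print("perturbed  :", L_eps(u_hat + 0.1 * rng.normal(size=p)))   # any perturbation decreases the value
```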
The representation (3.3) indicates that the process L (θ, θ ∗ ) has the geometric structure of log-likelihood of a linear Gaussian model. We do not require that the vector ξ is Gaussian and hence, it is not the Gaussian log-likelihood. However, the geometric structure of this process appears to be more important than its distributional properties. One can see that if δ, % are positive, the quadratic drift component of the process L (θ, θ ∗ ) is shrunk relative to L(θ, θ ∗ ) in (3.2) for positive and it is stretched if δ, % 11 spokoiny, v. are negative. Now, given r , fix some δ ≥ δ(r) and % ≥ 3ν0 ω(r) with the value δ(r) from condition (L0 ) and ω(r) from condition (ED1 ) . Finally set = − , so that D2 = D02 (1 + δ) + %V02 . Theorem 3.1. Assume (ED1 ) and (L0 ) . Let for some r , the values % ≥ 3ν0 ω(r) and δ ≥ δ(r) be such that D02 (1 − δ) − %V02 ≥ 0 . Then L (θ, θ ∗ ) − ♦ (r) ≤ L(θ, θ ∗ ) ≤ L (θ, θ ∗ ) + ♦ (r), θ ∈ Θ0 (r), (3.4) with L (θ, θ ∗ ), L (θ, θ ∗ ) defined by (3.3). The error terms ♦ (r) and ♦ (r) satisfy the bound (3.11) from Proposition 3.7. The proof of this theorem is given in Proposition 3.7. Remark 3.1. This bracketing bound (3.4) describes some properties of the log-likelihood e is not shown there. However, it directly implies most of process and the estimator θ our inference results. We therefore formulate (3.4) as a separate statement. Section 3.3 below presents some exponential bounds on the error terms ♦ (r) and ♦ (r) . The main message is that under rather broad conditions, these errors are small and have only e. minor impact on the inference for the quasi MLE θ 3.2 Local inference This section presents a list of corollaries from the basic approximation bounds of Theorem 3.1. The idea is to replace the original problem by a similar one for the approximating upper and lower models. It is important to stress once again that all the corollaries only rely on the bracketing result (3.4) and the geometric structure of the processes L and L . Define the spread ∆ (r) by def ∆ (r) = ♦ (r) + ♦ (r) + kξ k2 − kξ k2 /2. (3.5) Here ξ = D−1 ∇L(θ ∗ ) and ξ = D−1 ∇L(θ ∗ ) . The quantity ∆ (r) appears to be the price induced by our bracketing device. Section 3.3 below presents some probabilistic bounds on the spread showing that it is small relative to the other terms. All our corollaries below are stated under conditions of Theorem 3.1 and implicitly assume that the spread can be nearly ignored. 3.2.1 Local coverage probability Our first result describes the probability of covering θ ∗ by the random set e θ) ≤ z}. E(z) = {θ : 2L(θ, (3.6) 12 Parametric estimation. Finite sample theory Corollary 3.2. For any z > 0 e ∈ Θ0 (r) ≤ IP kξ k2 ≥ z − 2♦ (r) . IP E(z) 63 θ ∗ , θ (3.7) Proof. The bound (3.7) follows from the upper bound of Theorem 3.1 and the statement (3.12) of Lemma 3.8 below. Below, see (3.14), we also present an exponential bound which helps to answer a very important question about a proper choice of the critical value z ensuring a prescribed covering probability. 3.2.2 Local expansion, Wilks theorem, and local concentration Now we show how the bound (3.4) can be used for obtaining a local expansion of the e . All our results will be conditioned to the random set C (r) defined as quasi MLE θ def C (r) = e ∈ Θ0 (r), kV0 D−1 ξ k ≤ r . θ (3.8) The second inequality in the definition of C (r) is related to the solution of the upper e 6∈ Θ0 (r) , where θ e = and lower problem; cf. Lemma 3.8: kV0 D−1 ξ k ≤ r means θ argminθ L (θ, θ ∗ ) . 
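Corollary 3.2 controls the probability that the likelihood-based confidence set E(z) from (3.6) misses θ^*. A simple Monte Carlo sketch estimates this coverage probability; the Gaussian shift model is an illustrative assumption chosen because there 2L(θ̃, θ) has an exact χ²_p distribution, so the natural critical value z is the χ²_p quantile.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, p, alpha = 50, 3, 0.05
theta_star = np.arange(1.0, p + 1.0)
z = chi2.ppf(1 - alpha, df=p)          # critical value for E(z) = {theta : 2 L(theta~, theta) <= z}

def twice_excess_at(theta, Y):
    """2 L(theta~, theta) for the Gaussian shift model N(theta, I_p): equals n * ||Ybar - theta||^2."""
    Ybar = Y.mean(axis=0)
    return n * np.sum((Ybar - theta) ** 2)

miss, n_rep = 0, 5000
for _ in range(n_rep):
    Y = theta_star + rng.normal(size=(n, p))
    if twice_excess_at(theta_star, Y) > z:     # theta* not covered by E(z)
        miss += 1
print("empirical non-coverage:", miss / n_rep, "  nominal:", alpha)
```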
Below in Section 3.3 we present some upper bounds on the value r ensuring a dominating probability of this random set. The first result can be viewed as a finite sample version of the famous Wilks theorem. Corollary 3.3. On the random set C (r) from (3.8), it holds e θ ∗ ) ≤ kξ k2 /2 + ♦ (r). kξ k2 /2 − ♦ (r) ≤ L(θ, (3.9) The next result is an extension of another prominent asymptotic result, namely, the Fisher expansion of the MLE. Corollary 3.4. On the random set C (r) from (3.8), it holds e − θ ∗ − ξ 2 ≤ 2∆ (r). D θ (3.10) The proof of Corollaries 3.3 and 3.4 relies on the solution of the upper and lower problems and it is given below at the end of this section. e assuming that θ e is restricted to Now we describe concentration properties of θ e − θ ∗ )k > z for a given Θ0 (r) . More precisely, we bound the probability that kD (θ z > 0. spokoiny, v. 13 Corollary 3.5. For any z > 0 , it holds p e − θ ∗ )k > z, C (r) ≤ IP kξ k > z − 2∆ (r) . IP kD (θ An interesting and important question is for which z in (3.6) the coverage probability of the event E(z) 3 θ ∗ or for which z , the concentration probability of the event e − θ ∗ )k ≤ z} becomes close to one. It will be addressed in Section 3.3. {kD (θ 3.2.3 A local risk bound e θ ∗ ) and of the normalized loss Below we also bound the moments of the excess L(θ, e ∗ ) when θ e is restricted to Θ0 (r) . The result follows directly from Corollaries 3.3 D (θ−θ and 3.4. Corollary 3.6. For u > 0 e θ ∗ ) 1I θ e ∈ Θ0 (r) ≤ IE kξ k2 /2 + ♦ (r) u . IE Lu (θ, Moreover, it holds p e − θ ∗ )ku 1I C (r) ≤ IE kξ k + 2∆ (r) u . IE kD (θ 3.2.4 Comparing with the asymptotic theory This section briefly discusses the relation between the established non-asymptotic bounds and the classical asymptotic results in parametric estimation. This comparison is not straightforward because the asymptotic theory involves the sample size or noise level as the asymptotic parameter, while our setup is very general and works even for a “single” observation. Here we simply treat = (δ, %) as a small parameter. This is well justified p by the i.i.d. case with n observations, where it holds δ = δ(r) r/n and similarly for % ; see Section 5 for more details. The bounds below in Section 3.3 show that the spread ∆ (r) from (3.5) is small and can be ignored in the asymptotic calculations. The results of Corollary 3.2 through 3.6 represent the desired bounds in terms of deviation bounds for the quadratic form kξ k2 . For better understanding the essence of the presented results, consider first the “true” parametric model with the correctly specified log-likelihood L(θ) . Then D02 = V02 is the total Fisher information matrix. In the i.i.d. case it becomes nF0 where F0 is the usual Fisher information matrix of the considered parametric family at θ ∗ . In particular, Var ∇L(θ ∗ ) = nF0 . So, if D is close to D0 , then ξ can be treated as the normalized def score. Under usual assumptions, ξ = D0−1 ∇L(θ ∗ ) is asymptotically standard normal p -vector. The same applies to ξ . Now one can observe that Corollary 3.2 through 3.6 14 Parametric estimation. Finite sample theory directly imply most of classical asymptotic statements. In particular, Corollary 3.3 shows e θ ∗ ) is nearly kξ k2 and thus nearly χ2 (Wilks Theorem). that the twice excess 2L(θ, p e − θ ∗ ≈ ξ (the Fisher expansion) and hence, Corollary 3.4 yields the expansion D θ ∗ e − θ ∗ is e D θ − θ is asymptotically standard normal. Asymptotic variance of D θ e achieves the Cram´er-Rao efficiency bound in the asymptotic set-up. 
nearly one, so θ 3.3 Spread This section presents some bounds on the spread ∆ (r) from (3.5). This quantity is random but it can be easily evaluated under the conditions made. We present two different results: one bounds the errors ♦ (r), ♦ (r) , while the other presents a deviation bound on quadratic forms like kξ k2 . The results are stated under conditions (ED0 ) and (ED1 ) in a non-asymptotic way, so the formulation is quite technical. An informal discussion at the end of this section explains the typical behavior of the spread. The first result accomplishes the bracketing bound (3.4). Proposition 3.7. Assume (ED1 ) . The error ♦ (r) in (3.4) fulfills IP %−1 ♦ (r) ≥ z0 (x, Q) ≤ exp −x (3.11) with z0 (x, Q) given for g0 = gν0 ≥ 3 by 1 + √x + Q2 def z0 (x, Q) = 1 + 2g−1 (x + Q) + g 2 0 0 if 1 + √ x + Q ≤ g0 , otherwise, where Q = c1 p with c1 = 2 for p ≥ 2 and c1 = 2.7 for p = 1 . Similarly for ♦ (r) . Remark 3.2. The bound (3.11) essentially depends on the value g from condition (ED1 ) . The result requires that gν0 ≥ 3 . However, this constant can usually be taken of order n1/2 ; see Section 5 for examples. If g2 is larger in order than p + x , then z0 (x, Q) ≈ c1 p + x . Proof. Consider for fixed r and = (δ, %) the quantity def ♦ (r) = % L(θ, θ ∗ ) − IEL(θ, θ ∗ ) − (θ − θ ∗ )> ∇L(θ ∗ ) − kV0 (θ − θ ∗ )k2 . 2 θ∈Θ0 (r) sup As δ ≥ δ(r) , it holds −IEL(θ, θ ∗ ) ≥ (1 − δ)D02 and L(θ, θ ∗ ) − L (θ, θ ∗ ) ≤ ♦ (r) . Moreover, in view of ∇IEL(θ ∗ ) = 0 , the definition of ♦ (r) can be rewritten as def ♦ (r) = sup θ∈Θ0 (r) % ζ(θ, θ ∗ ) − (θ − θ ∗ )> ∇ζ(θ ∗ ) − kV0 (θ − θ ∗ )k2 . 2 15 spokoiny, v. Now the claim of the theorem can be easily reduced to an exponential bound for the quantity ♦ (r) . We apply Theorem A.12 to the process U(θ, θ ∗ ) = 1 ζ(θ, θ ∗ ) − (θ − θ ∗ )> ∇ζ(θ ∗ ) , ω(r) θ ∈ Θ0 (r), and H0 = V0 . Condition (ED) follows from (ED1 ) with the same ν0 and g in view of ∇U(θ, θ ∗ ) = ∇ζ(θ) − ∇ζ(θ ∗ ) /ω(r) . So, the conditions of Theorem A.12 are fulfilled yielding (3.11) in view of % ≥ 3ν0 ω(r) . Due to the main bracketing result, the local excess supθ∈Θ0 (r) L(θ, θ ∗ ) can be put between similar quantities for the upper and lower approximating processes up to the error terms ♦ (r), ♦ (r) . The random quantity supθ∈IRp L (θ, θ ∗ ) can be called the upper excess while supθ∈Θ0 (r0 ) L (θ, θ ∗ ) is the lower excess. The quadratic (in θ ) structure of the functions L (θ, θ ∗ ) and L (θ, θ ∗ ) enables us to explicitly solve the problem of maximizing the corresponding function w.r.t. θ . Lemma 3.8. It holds sup L (θ, θ ∗ ) = kξ k2 /2. (3.12) θ∈IRp On the random set {kV0 D−1 ξ k ≤ r} , it also holds sup L (θ, θ) = kξ k2 /2. θ∈Θ0 (r) Proof. The unconstrained maximum of the quadratic form L (θ, θ ∗ ) w.r.t. θ is attained e = D−1 ξ = D−2 ∇L(θ ∗ ) yielding the expression (3.12). The lower excess is at θ computed similarly. Our next step is in bounding the difference kξ k2 − kξ k2 . It can be decomposed as kξ k2 − kξ k2 = kξ k2 − kξk2 + kξk2 − kξ k2 with ξ = D0−1 ∇L(θ ∗ ) . If the values δ, % are small then the difference kξ k2 − kξ k2 is automatically smaller than kξk2 . def Lemma 3.9. Suppose (I) and let τ = δ + %a2 < 1 . Then D2 ≥ (1 − τ )D02 , D2 ≤ (1 + τ )D02 , def kIIp − D D−2 D k∞ ≤ α = 2τ . 1 − τ2 (3.13) 16 Parametric estimation. Finite sample theory Moreover, kξ k2 − kξk2 ≤ τ kξk2 , 1 − τ kξk2 − kξ k2 ≤ τ kξk2 , 1 + τ kξ k2 − kξ k2 ≤ α kξk2 . 
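The matrix and norm inequalities of Lemma 3.9 are pure linear algebra and can be verified mechanically. The sketch below uses random made-up matrices D_0^2, V_0^2 and a gradient vector (all assumptions for illustration), with the shrunk and stretched matrices written as Deps_sq and Dmeps_sq; it checks (1 − τ)D_0^2 ≤ D_eps^2, D_{−eps}^2 ≤ (1 + τ)D_0^2 and the three bounds on the squared norms with τ = δ + ϱa^2.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4

def sym_sqrt(M):
    w, U = np.linalg.eigh(M)
    return (U * np.sqrt(w)) @ U.T

A = rng.normal(size=(p, p)); D0sq = A @ A.T + p * np.eye(p)
B = rng.normal(size=(p, p)); V0sq = 0.3 * (B @ B.T) + 0.5 * np.eye(p)
grad = rng.normal(size=p)

D0 = sym_sqrt(D0sq)
a2 = np.linalg.eigvalsh(np.linalg.solve(D0, np.linalg.solve(D0, V0sq).T)).max()  # a^2 from (I)
delta, rho = 0.05, 0.03
tau = delta + rho * a2
assert tau < 1

Deps_sq  = (1 - delta) * D0sq - rho * V0sq      # shrunk
Dmeps_sq = (1 + delta) * D0sq + rho * V0sq      # stretched

# Matrix bounds of Lemma 3.9 (all prints give True up to rounding).
print(np.linalg.eigvalsh(Deps_sq - (1 - tau) * D0sq).min() >= -1e-9)
print(np.linalg.eigvalsh((1 + tau) * D0sq - Dmeps_sq).min() >= -1e-9)

# Norm bounds for xi_eps, xi, xi_{-eps}.
xi      = np.linalg.solve(D0, grad)
xi_eps  = np.linalg.solve(sym_sqrt(Deps_sq), grad)
xi_meps = np.linalg.solve(sym_sqrt(Dmeps_sq), grad)
n2 = lambda v: v @ v
print(n2(xi_eps) - n2(xi)  <= tau / (1 - tau) * n2(xi) + 1e-9)
print(n2(xi) - n2(xi_meps) <= tau / (1 + tau) * n2(xi) + 1e-9)
print(n2(xi_eps) - n2(xi_meps) <= 2 * tau / (1 - tau ** 2) * n2(xi) + 1e-9)
```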
Our final step is in showing that under (ED0 ) , the norm kξk behaves essentially as a def norm of a Gaussian vector with the same covariance matrix. Define for IB = D0−1 V02 D0−1 def p = tr IB , def λ0 = kIBk∞ = λmax IB . def v2 = 2 tr(IB 2 ), Under the identifiability condition (I) , one can bound IB 2 ≤ a2 Ip , p ≤ a2 p, v2 ≤ 2a4 p, λ0 ≤ a2 . Similarly to the previous result, we assume that the constant g from condition (ED0 ) is sufficiently large, namely g2 ≥ 2p . Define µc = 2/3 and def y2c = g2 /µ2c − p/µc , p def gc = µc yc = g2 − µc p, def 2xc = µc y2c + log det Ip − µc IB 2 /λ0 . It is easy to see that y2c ≥ 3g2 /2 and gc ≥ p 2/3 g . Theorem 3.10. Let (ED0 ) hold with ν0 = 1 and g2 ≥ 2p . Then IEkξk2 ≤ p , and for each x ≤ xc IP kξk2 /λ0 ≥ z(x, IB) ≤ 2e−x + 8.4e−xc , where z(x, IB) is defined by p + 2vx1/2 , def z(x, IB) = p + 6x x ≤ v/18, v/18 < x ≤ xc , 2 Moreover, for x > xc , it holds with z(x, IB) = yc + 2(x − xc )/gc IP kξk2 /λ0 ≥ z(x, IB) ≤ 8.4e−x . Proof. It follows from condition (ED0 ) that IEkξk2 = IE tr ξξ > = tr D0−1 IE∇L(θ ∗ ){∇L(θ ∗ )}> D0−1 = tr D0−2 Var ∇L(θ ∗ ) (3.14) spokoiny, v. 17 and (ED0 ) implies γ > Var ∇L(θ ∗ ) γ ≤ γ > V02 γ and thus, IEkξk2 ≤ p . The deviation bound (3.14) is proved in Corollary ??. Remark 3.3. This small remark concerns the term 8.4e−xc in the probability bound (3.14). As already mentioned, this bound implicitly assumes that the constant g is large (usually g n1/2 ). Then xc g2 n is large as well. So, e−xc is very small and asymptotically negligible. Below we often ignore this term. For x ≤ xc , we can use z(x, IB) = p + 6x . Remark 3.4. The exponential bound of Theorem 3.10 helps to describe the critical value of z ensuring a prescribed deviation probability IP kξk2 ≥ z . Namely, this probability starts to gradually decrease when z grows over λ0 p . In particular, this helps to answer a very important question about a proper choice of the critical value z providing the prescribed covering probability, or of the value z ensuring the dominating concentration e − θ ∗ )k ≤ z . probability IP kD (θ The definition of the set C (r) from (3.8) involves the event {kV0 D−1 ξ k > r} . Under (I) , it is included in the set {kξ k > (1 + α )−1 a−1 r} , see (3.13), and its probability is of order e−x for r2 ≥ C(x + p) with a fixed C > 0 . By Theorem 3.7, one can use max ♦ (r), ♦ (r) ≤ % z0 (x, Q) on a set of probability at least 1 − e−x . Further, kξk2 /λ0 ≤ z(x, IB) with a probability of order 1 − e−x ; see (3.14). Putting together the obtained bounds yields for the spread ∆ (r) with a probability about 1 − 4e−x ∆ (r) ≤ 2% z0 (x, Q) + α λ0 z(x, IB). The results obtained in Section 3.2 are sharp and meaningful if the spread ∆ (r) is smaller in order than the value IEkξk2 . Theorem 3.10 states that kξk2 does not signifdef icantly deviate over its expected value p = IEkξk2 which is our leading term. We know that z0 (x, Q) ≈ Q + x = c1 p + x if x is not too large. Also z(x, IB) ≤ p + 6x , where p is of order p due to (I) . Summarizing the above discussion yields that the local results apply if the regularity condition (I) holds and the values % and α , or equivalently, p ω(r), δ(r) are small. In Section 5 we show for the i.i.d. example that ω(r) r2 /n and similarly for δ(r) . 18 Parametric estimation. Finite sample theory 3.4 Proof of Corollaries 3.3 and 3.4 The bound (3.4) together with Lemma 3.8 yield on C (r) e θ∗ ) = L(θ, sup L(θ, θ ∗ ) θ∈Θ0 (r) ≥ sup L (θ, θ ∗ ) − ♦ (r) = kξ k2 /2 − ♦ (r). 
(3.15) θ∈Θ0 (r) Similarly e θ∗ ) ≤ L(θ, sup L (θ, θ ∗ ) + ♦ (r) ≤ kξ k2 /2 + ♦ (r) θ∈Θ0 (r) yielding (3.9). For getting (3.10), we again apply the inequality L(θ, θ ∗ ) ≤ L (θ, θ ∗ ) + e . With ξ = D−1 ∇L(θ ∗ ) and u def e− ♦ (r) from Theorem 3.1 for θ equal to θ = D (θ θ ∗ ) , this gives e θ ∗ ) − ξ > u + ku k2 /2 ≤ ♦ (r). L(θ, Therefore, by (3.15) 2 kξ k2 /2 − ♦ (r) − ξ > u + ku k /2 ≤ ♦ (r) or, equivalently 2 2 2 kξ k2 /2 − ξ > u + ku k /2 ≤ ♦ (r) + ♦ (r) + kξ k − kξ k /2 2 and the definition of ∆ (r) implies u − ξ ≤ 2∆ (r) . 4 Upper function approach and concentration of the qMLE e is localization. This property means A very important step in the analysis of the qMLE θ e concentrates in a small vicinity of the central point θ ∗ . This section states such that θ a concentration bound under the global conditions of Section 2. Given r0 , the deviation e 6∈ Θ0 (r0 ) that θ e does not belong to the local bound describes the probability IP θ vicinity Θ0 (r0 ) of Θ . The question of interest is to check a possibility of selecting r0 in a way that the local bracketing result and the deviation bound apply simultaneously; see the discussion at the end of the section. Below we suppose that a sufficiently large constant x is fixed to specify the accepted level be of order e−x for this deviation probability. All the constructions below depend upon this constant. We do not indicate it explicitly for ease of notation. 19 spokoiny, v. The key step in this large deviation bound is made in terms of an upper function for def the process L(θ, θ ∗ ) = L(θ) − L(θ ∗ ) . Namely, u(θ) is a deterministic upper function if it holds with a high probability: sup L(θ, θ ∗ ) + u(θ) ≤ 0 (4.1) θ∈Θ Such bounds are usually called for in the analysis of the posterior measure in the Bayes approach. Below we present sufficient conditions ensuring (4.1). Now we explain how e. (4.1) can be used for describing concentration sets for θ Lemma 4.1. Let u(θ) be an upper function in the sense IP sup L(θ, θ ∗ ) + u(θ) ≥ 0 ≤ e−x (4.2) θ∈Θ for x > 0 . Given a subset Θ0 ⊂ Θ with θ ∗ ∈ Θ0 , the condition u(θ) ≥ 0 for θ 6∈ Θ0 ensures e∈ IP θ 6 Θ0 ≤ e−x . e ∈ Θ◦ is only possible Proof. If Θ◦ is a subset of Θ not containing θ ∗ , then the event θ if supθ∈Θ◦ L(θ, θ ∗ ) ≥ 0 , because L(θ ∗ , θ ∗ ) ≡ 0 . A possible way of checking the condition (4.2) is based on a lower quadratic bound for the negative expectation −IEL(θ, θ ∗ ) ≥ b(r)kV0 (θ − θ ∗ )k2 /2 in the sense of condition (Lr) from Section 2.2. We present two different results. The first one assumes that the values b(r) can be fixed universally for all r ≥ r0 . Theorem 4.2. Suppose (Er) and (Lr) with b(r) ≡ b . Let, for r ≥ r0 , 1+ p x + Q ≤ 3ν02 g(r)/b, (4.3) 6ν0 p x + Q ≤ rb, (4.4) with x + Q ≥ 2.5 and Q = c1 p . Then e 6∈ Θ0 (r0 ) ≤ e−x . IP θ Proof. The result follows from Theorem A.8 with µ = ∗ ∗ IEL(θ) and M (θ, θ ) = −IEL(θ, θ ) ≥ b 2 kV0 (θ −θ ∗ (4.5) b 3ν0 , t(µ) ≡ 0 , U(θ) = L(θ) − )k2 . Remark 4.1. The bound (4.5) requires only two conditions. Condition (4.3) means that the value g(r) from condition (Er) fulfills g2 (r) ≥ C(x + p) , that is, we need a qualified rate in the exponential moment conditions. This is similar to requiring finite polynomial 20 Parametric estimation. Finite sample theory moments for the score function. Condition (4.4) requires that r exceeds some fixed value, namely, r2 ≥ C(x + p) . This bound is helpful for fixing the value r0 providing a sensible deviation probability bound. If b(r) decreases with r , the result is a bit more involved. 
The key requirement is that b(r) decreases not too fast, so that the product rb(r) grows to infinity with r . The idea is to include the complement of the central set Θ0 in Θ in the union of the growing sets Θ0 (rk ) with b(rk ) ≥ b(r0 )2−k , and then apply Theorem 4.2 for each Θ0 (rk ) . Theorem 4.3. Suppose (Er) and (Lr) . Let rk be such that b(rk ) ≥ b(r0 )2−k for k ≥ 1 . If the conditions p x + Q + ck ≤ 3ν02 g(rk )/b(rk ), p 6ν0 x + Q + ck ≤ rk b(rk ), 1+ are fulfilled for c = log(2) , then it holds e∈ IP θ 6 Θ0 (r0 ) ≤ e−x . Proof. The result (4.5) is applied to each set Θ0 (rk ) and xk = x + ck . This yields X X −x−ck e∈ e∈ IP θ 6 Θ0 (r0 ) ≤ IP θ 6 Θ0 (rk ) ≤ e = e−x k≥1 k≥1 as required. Remark 4.2. Here we briefly discuss the very important question: how one can fix the value r0 ensuring the bracketing result in the local set Θ0 (r0 ) and a small probability of the related set C (r) from (3.8)? The event {kV0 D−1 ξ k > r} requires r2 ≥ C(x + p) . Further we inspect the deviation bound for the complement Θ \ Θ0 (r0 ) . For simplicity, assume (Lr) with b(r) ≡ b . Then the condition (4.4) of Theorem 4.2 requires that r20 ≥ Cb−2 (x + p). In words, the squared radius r20 should be at least of order p . The other condition (4.3) of Theorem 4.2 is technical and only requires that g(r) is sufficiently large while the local results only require that δ(r) and %(r) are small for such r . In the asymptotic setup one can typically bring these conditions together. Section 5 provides further discussion for the i.i.d. setup. spokoiny, v. 5 21 Examples The model with independent identically distributed (i.i.d.) observations is one of the most popular setups in statistical literature and in statistical applications. The essential and the most developed part of the statistical theory is designed for the i.i.d. modeling. Especially, the classical asymptotic parametric theory is almost complete including asymptotic root-n normality and efficiency of the MLE and Bayes estimators under rather mild assumptions; see e.g. Chapter 2 and 3 in Ibragimov and Khas’minskij (1981). So, the i.i.d. model can naturally serve as a benchmark for any extension of the statistical theory: being applied to the i.i.d. setup, the new approach should lead to essentially the same conclusions as in the classical theory. Similar reasons apply to the regression model and its extensions. Below we try demonstrate that the proposed non-asymptotic viewpoint is able to reproduce the existing brilliant and well established results of the classical parametric theory. Surprisingly, the majority of classical efficiency results can be easily derived from the obtained general non-asymptotic bounds. The next question is whether there is any added value or benefits of the new approach being restricted to the i.i.d. situation relative to the classical one. Two important issues have been already mentioned: the new approach applies to the situation with finite samples and survives under model misspecification. One more important question is whether the obtained results remain applicable and informative if the dimension of the parameter space is high – this is one of the main challenges in the modern statistics. We show that the dimensionality p naturally appears in the risk bounds and the results apply as long as the sample size exceeds in order this value p . All these questions are addressed in Section 5.1 for the i.i.d. setup, Section 5.2 focuses on generalized linear modeling, while Section 5.3 discusses linear median regression. 
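Before turning to the examples, a toy simulation makes the rule r_0^2 ≳ x + p of Remark 4.2 concrete. The Gaussian shift model is an illustrative assumption chosen because there the deviation probability is exactly a chi-squared tail; the simulation shows IP(θ̃ ∉ Θ_0(r_0)) becoming small once r_0^2 exceeds a moderate multiple of x + p.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 5
n_rep = 10000

# Gaussian shift model: theta~ = mean(Y), V0^2 = n I_p, so ||V0 (theta~ - theta*)||^2 ~ chi^2_p.
dev2 = np.array([n * np.sum(rng.normal(size=(n, p)).mean(axis=0) ** 2) for _ in range(n_rep)])

for x in (1.0, 2.0, 4.0, 6.0):
    r0_sq = p + 6 * x                          # a radius of the form r0^2 ~ C (x + p), cf. Remark 4.2
    frac = np.mean(dev2 > r0_sq)
    print(f"x={x}: P(theta~ outside Theta_0(r0)) ~ {frac:.4f}   e^-x = {np.exp(-x):.4f}")
```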
5.1 Quasi MLE in an i.i.d. model An i.i.d. parametric model means that the observations Y = (Y1 , . . . , Yn ) are independent identically distributed from a distribution P which belongs to a given parametric family (Pθ , θ ∈ Θ) on the observation space Y1 . Each θ ∈ Θ clearly yields the product data distribution IPθ = Pθ⊗n on the product space Y = Yn1 . This section illustrates how the obtained general results can be applied to this type of modeling under possible model misspecification. Different types of misspecification can be considered. Each of the assumptions, namely, data independence, identical distribution, parametric form of the marginal distribution can be violated. To be specific, we assume the observations Yi independent and identically distributed. However, we admit that the distribution of each Yi does not necessarily belong to the parametric family (Pθ ) . The case of non- 22 Parametric estimation. Finite sample theory identically distributed observations can be done similarly at cost of more complicated notation. In what follows the parametric family (Pθ ) is supposed to be dominated by a measure µ0 , and each density p(y, θ) = dPθ /dµ0 (y) is two times continuously differentiable in θ for all y . Denote `(y, θ) = log p(y, θ) . The parametric assumption Yi ∼ Pθ∗ ∈ (Pθ ) leads to the log-likelihood L(θ) = X `(Yi , θ), e maximizes this sum where the summation is taken over i = 1, . . . , n . The quasi MLE θ over θ ∈ Θ : e def θ = argmax L(θ) = argmax θ∈Θ X `(Yi , θ). θ∈Θ The target of estimation θ ∗ maximizes the expectation of L(θ) : def θ ∗ = argmax IEL(θ) = argmax IE`(Y1 , θ). θ∈Θ θ∈Θ def Let ζi (θ) = `(Yi , θ) − IE`(Yi , θ) . Then ζ(θ) = P ζi (θ) . The equation ∇IEL(θ ∗ ) = 0 implies ∇ζ(θ ∗ ) = X ∇ζi (θ ∗ ) = X ∇`i (θ ∗ ). (5.1) I.i.d. structure of the Yi ’s allows to rewrite the local conditions (Er) , (ED0 ) , (ED1 ) , and (L0 ) , and (I) in terms of the marginal distribution. (ed0 ) There exists a positively definite symmetric matrix v0 , such that for all |λ| ≤ g1 > γ ∇ζ1 (θ ∗ ) ≤ ν02 λ2 /2. sup log IE exp λ kv0 γk γ∈IRp A natural candidate on v02 is given by the variance of the gradient ∇`(Y1 , θ ∗ ) , that is, v02 = Var ∇`(Y1 , θ ∗ ) = Var ∇ζ1 (θ ∗ ) . Next consider the local sets Θloc (u) = {θ : kv0 (θ − θ ∗ )k ≤ u}. In view of V02 = nv02 , it holds Θ0 (r) = Θloc (u) with r2 = nu2 . Below we distinguish between local conditions for u ≤ u0 and the global conditions for all u > 0 , where u0 is some fixed value. The local smoothness conditions (ED1 ) and (L0 ) require to specify the functions δ(r) and %(r) for r ≤ r0 where r20 = nu20 . If the log-likelihood function `(y, θ) is sufficiently smooth in θ , these functions can be selected proportional to u = r/n1/2 . spokoiny, v. 23 (ed1 ) There are constants ω ∗ > 0 and g1 > 0 such that for each u ≤ u0 and |λ| ≤ g1 > γ ∇ζ1 (θ) − ∇ζ1 (θ ∗ ) sup sup log IE exp λ ≤ ν02 λ2 /2. ω ∗ u kv0 γk γ∈IRp θ∈Θloc (u) Further we restate the local identifiability condition (L0 ) in terms of the expected def value k(θ, θ ∗ ) = −IE `(Yi , θ) − `(Yi , θ ∗ ) for each i . We suppose that k(θ, θ ∗ ) is two times differentiable w.r.t. θ . The definition of θ ∗ implies ∇IE`(Yi , θ ∗ ) = 0 . Define also the matrix F0 = −∇2 IE`(Yi , θ ∗ ) . In the parametric case P = Pθ∗ , k(θ, θ ∗ ) is the Kullback-Leibler divergence between Pθ∗ and Pθ while the matrices v02 = F0 are equal to each other and coincide with the Fisher information matrix of the family (Pθ ) at θ ∗ . 
(`0 ) There is a constant δ ∗ such that it holds for each u ≤ u0 2k(θ, θ ∗ ) ∗ sup ∗ > ∗ − 1 ≤ δ u. θ∈Θloc (u) (θ − θ ) F0 (θ − θ ) (ι) There is a constant a > 0 such that a2 F20 ≥ v02 . (eu) For each u > 0 , there exists g1 (u) > 0 , such that for all |λ| ≤ g1 (u) > γ ∇ζ1 (θ) ≤ ν02 λ2 /2. log IE exp λ p kv γk 0 γ∈IR θ∈Θloc (u) sup sup (`u) For each u > 0 , there exists b(u) > 0 such that k(θ, θ ∗ ) ∗ 2 ≥ b(u), θ∈Θ: kv0 (θ−θ ∗ )k=u kv0 (θ − θ )k sup Lemma 5.1. Let Y1 , . . . , Yn be i.i.d. Then (eu) , (ed0 ) , (ed1 ) , (ι) , and (`0 ) imply (Er) , (ED0 ) , (ED1 ) , (I) , and (L0 ) with V02 = nv02 , D02 = nF0 , ω(r) = ω ∗ r/n1/2 , √ δ(r) = δ ∗ r/n1/2 , and g = g1 n . Proof. The identities V02 = nv02 , D02 = nF0 follow from the i.i.d. structure of the observations Yi . We briefly comment on condition (Er) . The use of the i.i.d. structure once again yields by (5.1) in view of V02 = nv02 n γ > ∇ζ(θ) o n λ γ > ∇ζ (θ) o 1 log IE exp λ = nIE exp 1/2 ≤ ν02 λ2 /2 kV0 γk kv0 γk n as long as λ ≤ n1/2 g1 (u) ≤ g(r) . Similarly for (ED0 ) and (ED1 ) . Remark 5.1. This remark discusses how the presented conditions relate to what is usually assumed in statistical literature. One general remarks concern the choice of the 24 Parametric estimation. Finite sample theory parametric family (Pθ ) . The point of the classical theory is that the true measure is in this family, so the conditions should be as weak as possible. The viewpoint of this paper is slightly different: whatever family (Pθ ) is taken, the true measure is never included, any model is only an approximation of reality. From the other side, the choice of the parametric model (Pθ ) is always done by a statistician. Sometime some special stylized features of the model force to include an irregularity in this family. Otherwise any smoothness condition on the density `(y, θ) can be secured by a proper choice of the family (Pθ ) . The presented list also includes the exponential moment conditions (ed0 ) and (ed1 ) on the gradient ∇`(Y1 , θ) . We need exponential moments for establishing some nonasymptotic risk bounds, the classical concentration bounds require even stronger conditions that the considered random variables are bounded. The identifiability condition (`u) is very easy to check in the usual asymptotic setup. Indeed, if the parameter set Θ is compact, the Kullback-Leibler divergence k(θ, θ ∗ ) is continuous and positive for all θ 6= θ ∗ , then (`u) is fulfilled automatically with a universal constant b . If Θ is not compact, the condition is still fulfilled but the function b(u) may depend on u . Below we specify the general results of Section 3 and 4 to the i.i.d. setup. 5.1.1 A large deviation bound This section presents some sufficient conditions ensuring a small deviation probability e 6∈ Θloc (u0 )} for a fixed u0 . Below Q = c1 p . We only discuss the case for the event {θ b(u) ≡ b . The general case only requires more complicated notations. The next result follows from Theorem 4.2 with the obvious changes. Theorem 5.2. Suppose (eu) and (`u) with b(u) ≡ b . If, for u0 > 0 , p n1/2 u0 b ≥ 6ν0 x + Q, p 1 + x + Q ≤ 3b−1 ν02 g1 (u0 ) n1/2 , (5.2) then e∈ e − θ ∗ )k > u0 ≤ e−x . IP θ 6 Θloc (u0 ) = IP kv0 (θ Remark 5.2. The presented result helps to qualify two important values u0 and n providing a sensible deviation probability bound. For simplicity suppose that g1 (u) ≡ g1 > 0 . Then the condition (5.2) can be written as nu20 x + Q . 
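Before specializing the theorems, here is a toy computation of the marginal quantities v_0^2, F_0 and k(θ, θ^*) that the conditions above involve. The working family (Poisson) and the over-dispersed true law are assumptions made only for illustration; under correct specification v_0^2 and F_0 would coincide.

```python
import numpy as np

rng = np.random.default_rng(7)

# Working family: Poisson(theta). True law: gamma-Poisson mixture (over-dispersed), so the PA fails.
N = 200_000
Y = rng.poisson(rng.gamma(shape=2.0, scale=2.0, size=N))   # IE Y = 4, Var Y = 12 > IE Y

# Target theta* = argmax IE l(Y1, theta); for l(y, theta) = y log(theta) - theta - log(y!) it is IE Y.
theta_star = Y.mean()

# v0^2 = Var( d/dtheta l(Y1, theta*) ) = Var(Y1 / theta* - 1) = Var(Y1) / theta*^2
v0_sq = np.var(Y / theta_star - 1.0)
# F0 = -d^2/dtheta^2 IE l(Y1, theta*) = IE Y1 / theta*^2 = 1 / theta*
F0 = 1.0 / theta_star
print("theta* ~", theta_star, "  v0^2 ~", v0_sq, "  F0 ~", F0)   # v0^2 > F0 under over-dispersion

# k(theta, theta*) = -IE[ l(Y1, theta) - l(Y1, theta*) ]; the log(y!) terms cancel.
k = lambda th: -(np.mean(Y) * np.log(th / theta_star) - th + theta_star)
print("k(theta* + 1, theta*) ~", k(theta_star + 1.0), "  k(theta*, theta*) =", k(theta_star))
```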
In other words, the result of the theorem claims a large deviation bound for the vicinity Θloc (u0 ) with u20 25 spokoiny, v. of order p/n . In classical asymptotic statistics this result is usually referred to as root-n consistency. Our approach yields this result in a very strong form and for finite samples. 5.1.2 Local inference Now we restate the general local bounds of Section 3 for the i.i.d. case. First we describe the approximating linear models. The matrices v02 and F0 from conditions (ed0 ) , (ed1 ) , and (`0 ) determine their drift and variance components. Define def F = F0 (1 − δ) − %v02 . def If τ = δ + a2 % < 1 , then F ≥ (1 − τ )F0 > 0. Further, D2 = nF and def ξ = D−1 ∇ζ(θ ∗ ) = nF −1/2 X ∇`(Yi , θ ∗ ). The upper bracketing process reads as L (θ, θ ∗ ) = (θ − θ ∗ )> D ξ − kD (θ − θ ∗ )k2 /2. This expression can be viewed as log-likelihood for the linear model ξ = D θ + ε for a e for this model is of the form θ e = D−1 ξ . standard normal error ε . The (quasi) MLE θ Theorem 5.3. Suppose (ed0 ) . Given u0 , assume (ed1 ) , (`0 ) , and (ι) on Θloc (u0 ) , def and let % = 3ν0 ω ∗ u0 , δ = δ ∗ u0 , and τ = δ + a2 % < 1 . Then the results of Theorem 3.1 and all its corollaries apply to the case of i.i.d. modeling with r20 = nu20 . In particular, e ∈ Θloc (u0 ), kξ k ≤ r0 , it holds on the random set C (r0 ) = θ e θ ∗ ) ≤ kξ k2 /2 + ♦ (r0 ), kξ k2 /2 − ♦ (r0 ) ≤ L(θ, p e − θ ∗ − ξ 2 ≤ 2∆ (r0 ). nF θ The random quantities ♦ (r0 ) , ♦ (r0 ) , and ∆ (r0 ) follow the probability bounds of Theorem 3.7 and 3.10. Now we briefly discuss the implications of Theorem 5.2 and 5.3 to the classical asymptotic setup with n → ∞ . We fix u20 = Cp/n for a constant C ensuring the deviation bound of Theorem 5.2. Then δ is of order u0 and the same for % . For a sufficiently large n , both quantities are small and thus, the spread ∆ (r0 ) is small as well; see Section 3.3. 26 Parametric estimation. Finite sample theory Further, under (ed0 ) condition, the normalized score def ξ = nF0 −1/2 X ∇`(Yi , θ ∗ ) is zero mean asymptotically normal by the central limit theorem. Moreover, if F0 = v02 , then ξ is asymptotically standard normal. The same holds for ξ . This immediately yields all classical asymptotic results like Wilks theorem or the Fisher expansion for MLE in the i.i.d. setup as well as the asymptotic efficiency of the MLE. Moreover, our results bounds yield the asymptotic result for the case when the parameter dimension p = pn grows linearly with n . Below un = on (pn ) means that un /pn → 0 as n → ∞ . Theorem 5.4. Let Y1 , . . . , Yn be i.i.d. IPθ∗ and let (ed0 ) , (ed1 ) , (`0 ) , (ι) , (eu) , and (`u) with b(u) ≡ b hold. If n > Cpn for a fixed constant C depending on constants in the above conditions only, then p e − θ ∗ − ξk2 = on (pn ), k nF0 θ This result particularly yields that e θ ∗ ) is nearly χ2 . 2L(θ, √ e θ ∗ ) − kξk2 = on (pn ). 2L(θ, e − θ∗ nF0 θ is nearly standard normal and p 5.2 Generalized linear modeling Now we consider a generalized linear modeling (GLM) which is often used for describing some categorical data. Let P = (Pw , w ∈ Υ ) be an exponential family with a canonical parametrization; see e.g. McCullagh and Nelder (1989). The corresponding log-density can be represented as `(y, w) = yw − d(w) for a convex function d(w) . The popular examples are given by the binomial (binary response, logistic) model with d(w) = log ew + 1 , the Poisson model with d(w) = ew , the exponential model with d(w) = − log(w) . 
Note that linear Gaussian regression is a special case with d(w) = w2 /2 . A GLM specification means that every observation Yi has a distribution from the family P with the parameter wi which linearly depends on the regressor Ψi ∈ IRp : Yi ∼ PΨ > θ∗ . i (5.3) The corresponding log-density of a GLM reads as L(θ) = X Yi Ψi> θ − d(Ψi> θ) . Under IPθ∗ each observation Yi follows (5.3), in particular, IEYi = d0 (Ψi> θ ∗ ) . However, similarly to the previous sections, it is accepted that the parametric model (5.3) 27 spokoiny, v. def is misspecified. Response misspecification means that the vector f = IEY cannot be represented in the form d0 (Ψ > θ) whatever θ is. The other sort of misspecification concerns the data distribution. The model (5.3) assumes that the Yi ’s are independent and the marginal distribution belongs to the given parametric family P . In what follows, we only assume independent data having certain exponential moments. The target of estimation θ ∗ is defined by def θ ∗ = argmax IEL(θ). θ e is defined by maximization of L(θ) : The quasi MLE θ e = argmax L(θ) = argmax θ θ X θ Yi Ψi> θ − d(Ψi> θ) . Convexity of d(·) implies that L(θ) is a concave function of θ , so that the optimization problem has a unique solution and can be effectively solved. However, a closed form solution is only available for the constant regression or for the linear Gaussian regression. The corresponding target θ ∗ is the maximizer of the expected log-likelihood: θ ∗ = argmax IEL(θ) = argmax θ θ X fi Ψi> θ − d(Ψi> θ) with fi = IEYi . The function IEL(θ) is concave as well and the vector θ ∗ is also well defined. Define the individual errors (residuals) εi = Yi − IEYi . Below we assume that these errors fulfill some exponential moment conditions. (e1 ) There exist some constants ν0 and g1 > 0 , and for every i a constant si such 2 that IE εi /si ≤ 1 and log IE exp λεi /si ≤ ν02 λ2 /2, |λ| ≤ g1 . (5.4) A natural candidate for si is σi where σi2 = IEε2i is the variance of εi ; see Lemma A.17. Under (5.4), introduce a p × p matrix V0 defined by def V02 = X s2i Ψi Ψi> . (5.5) Condition (e1 ) effectively means that each error term εi = Yi − IEYi has some bounded def exponential moments: for |λ| ≤ g1 , it holds f (λ) = log IE exp λεi /si < ∞ . This implies the quadratic upper bound for the function f (λ) for |λ| ≤ g1 ; see Lemma A.17. In words, condition (e1 ) requires light (exponentially decreasing) tail for the marginal distribution of each εi . 28 Parametric estimation. Finite sample theory Define also def N −1/2 = max sup i γ∈IRp si |Ψi> γ| . kV0 γk (5.6) Lemma 5.5. Assume (e1 ) and let V0 be defined by (5.5) and N by (5.6). Then conditions (ED0 ) and (Er) follow from (e1 ) with the matrix V0 due to (5.5) and g = g1 N 1/2 . Moreover, the stochastic component ζ(θ) is linear in θ and the condition (ED1 ) is fulfilled with ω(r) ≡ 0 . Proof. The gradient of the stochastic component ζ(θ) of L(θ) does not depend on θ , P namely, ∇ζ(θ) = Ψi εi with εi = Yi − IEYi . Now, for any unit vector γ ∈ IRp and λ ≤ g , independence of the εi ’s implies that log IE exp n o X n λs Ψ > γ o X λ i i Ψ i εi = log IE exp γ> εi /si . kV0 γk kV0 γk By definition si |Ψi> γ|/kV0 γk ≤ N −1/2 and therefore, λsi |Ψi> γ|/kV0 γk ≤ g1 . Hence, (5.4) implies log IE exp n o X ν02 λ2 X 2 > 2 ν02 λ2 λ Ψi εi ≤ si |Ψi γ| = γ> , kV0 γk 2kV0 γk2 2 (5.7) and (ED0 ) follows. It only remains to bound the quality of quadratic approximation for the mean of the process L(θ, θ ∗ ) in a vicinity of θ ∗ . 
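Since L(θ) = Σ { Y_i Ψ_i^⊤θ − d(Ψ_i^⊤θ) } is concave in θ, the quasi MLE can be computed by a few Newton steps even without a closed form. The sketch below is an illustration for the Poisson case d(w) = e^w; the design, the coefficients and the over-dispersed true response law are made-up assumptions, and the final plug-in matrices are only a crude empirical stand-in for D_0^2 and V_0^2, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 400, 3
Psi = rng.normal(size=(n, p))
w_true = Psi @ np.array([0.5, -0.3, 0.2])
# Over-dispersed counts (gamma-Poisson mixture with the right mean): the Poisson GLM is a quasi-model.
Y = rng.poisson(rng.gamma(shape=2.0, scale=np.exp(w_true) / 2.0))

def newton_glm(Psi, Y, n_iter=25):
    """Maximize L(theta) = sum_i [Y_i Psi_i'theta - d(Psi_i'theta)] with d(w) = exp(w)."""
    theta = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        mu = np.exp(Psi @ theta)              # d'(Psi_i'theta)
        grad = Psi.T @ (Y - mu)               # gradient of L
        hess = -(Psi * mu[:, None]).T @ Psi   # Hessian of L (negative definite: L is concave)
        theta = theta - np.linalg.solve(hess, grad)
    return theta

theta_tilde = newton_glm(Psi, Y)
print("quasi-MLE:", theta_tilde)

# Crude plug-in versions of the matrices of this section at theta~:
mu = np.exp(Psi @ theta_tilde)
D0_sq = (Psi * mu[:, None]).T @ Psi                 # sum_i d''(Psi_i'theta) Psi_i Psi_i'
V0_sq = (Psi * ((Y - mu) ** 2)[:, None]).T @ Psi    # sandwich-type plug-in for sum_i s_i^2 Psi_i Psi_i'
print("D0^2[0,0] ~", round(D0_sq[0, 0], 1), "  V0^2[0,0] ~", round(V0_sq[0, 0], 1))
# Under over-dispersion the V0^2 entry is typically the larger of the two.
```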
An interesting feature of the GLM is that the effect of model misspecification disappears in the expectation of $L(\theta,\theta^*)$.

Lemma 5.6. It holds
\[
-\mathbb{E} L(\theta,\theta^*) = \sum_i \bigl\{ d(\Psi_i^\top\theta) - d(\Psi_i^\top\theta^*) - d'(\Psi_i^\top\theta^*)\,\Psi_i^\top(\theta-\theta^*) \bigr\}
= \mathcal{K}\bigl(\mathbb{P}_{\theta^*},\mathbb{P}_{\theta}\bigr), \tag{5.8}
\]
where $\mathcal{K}(\mathbb{P}_{\theta^*},\mathbb{P}_{\theta})$ is the Kullback-Leibler divergence between the measures $\mathbb{P}_{\theta^*}$ and $\mathbb{P}_{\theta}$. Moreover,
\[
-\mathbb{E} L(\theta,\theta^*) = \|D(\theta^\circ)(\theta-\theta^*)\|^2/2,
\quad\text{where } \theta^\circ\in[\theta^*,\theta] \text{ and } D^2(\theta^\circ) = \sum_i d''(\Psi_i^\top\theta^\circ)\,\Psi_i\Psi_i^\top. \tag{5.9}
\]
Proof. The definition implies
\[
\mathbb{E} L(\theta,\theta^*) = \sum_i \bigl\{ f_i\,\Psi_i^\top(\theta-\theta^*) - d(\Psi_i^\top\theta) + d(\Psi_i^\top\theta^*) \bigr\}.
\]
As $\theta^*$ is the extreme point of $\mathbb{E} L(\theta)$, it holds $\nabla \mathbb{E} L(\theta^*) = \sum_i \{f_i - d'(\Psi_i^\top\theta^*)\}\Psi_i = 0$ and (5.8) follows. The second-order Taylor expansion around $\theta^*$ yields the expansion (5.9).

Define now the matrix $D_0$ by
\[
D_0^2 \stackrel{\mathrm{def}}{=} D^2(\theta^*) = \sum_i d''(\Psi_i^\top\theta^*)\,\Psi_i\Psi_i^\top.
\]
Let also $V_0$ be defined by (5.5). Note that the matrices $D_0$ and $V_0$ coincide if the model $Y_i \sim P_{\Psi_i^\top\theta^*}$ is correctly specified and $s_i^2 = d''(\Psi_i^\top\theta^*)$. The matrix $V_0$ describes a local elliptic neighborhood of the central point $\theta^*$ of the form $\Theta_0(r) = \{\theta: \|V_0(\theta-\theta^*)\| \le r\}$. If the matrix function $D^2(\theta)$ is continuous in this vicinity $\Theta_0(r)$, then the value $\delta(r)$ measuring the quality of approximation of $-\mathbb{E} L(\theta,\theta^*)$ by the quadratic function $\|D_0(\theta-\theta^*)\|^2/2$ is small and the identifiability condition $(L_0)$ is fulfilled on $\Theta_0(r)$.

Lemma 5.7. Suppose that
\[
\bigl\| I_p - D_0^{-1} D^2(\theta) D_0^{-1} \bigr\|_\infty \le \delta(r), \qquad \theta\in\Theta_0(r). \tag{5.10}
\]
Then $(L_0)$ holds with this $\delta(r)$. Moreover, as the quantities $\omega(r)$, $\overline{\diamondsuit}(r)$, $\underline{\diamondsuit}(r)$ vanish, one can take $\varrho = 0$, leading to the following representation for the upper and lower quantities $D$ and $\xi$:
\[
\overline{D}^2 = (1-\delta)D_0^2, \quad \overline{\xi} = (1-\delta)^{-1/2}\xi,
\qquad
\underline{D}^2 = (1+\delta)D_0^2, \quad \underline{\xi} = (1+\delta)^{-1/2}\xi,
\]
with
\[
\xi \stackrel{\mathrm{def}}{=} D_0^{-1}\nabla\zeta = D_0^{-1}\sum_i \Psi_i(Y_i - \mathbb{E} Y_i).
\]
Linearity of the stochastic component $\zeta(\theta)$ of the considered GLM implies the important fact that the quantities $\overline{\diamondsuit}(r)$, $\underline{\diamondsuit}(r)$ in the general bracketing bound (3.4) vanish for any $r$. Therefore, in the GLM case, the deficiency can be defined as the difference between the upper and the lower excess, and it can be easily evaluated:
\[
\Delta(r) = \|\overline{\xi}\|^2/2 - \|\underline{\xi}\|^2/2 = \frac{\delta}{1-\delta^2}\,\|\xi\|^2 .
\]
Our result assumes some concentration properties of the squared norm $\|\xi\|^2$ of the vector $\xi$. These properties can be established by the general results of Section ?? under the regularity condition: for some $a$,
\[
V_0 \le a D_0. \tag{5.11}
\]
Now we are prepared to state the local results for the GLM estimation.

Theorem 5.8. Let $(e_1)$ hold. Then for $\delta \ge \delta(r)$ and any $z > 0$, it holds
\[
\mathbb{P}\bigl( \|D_0(\widetilde{\theta}-\theta^*)\| > z,\ \|V_0(\widetilde{\theta}-\theta^*)\| \le r \bigr) \le \mathbb{P}\bigl( \|\xi\|^2 > (1-\delta)z^2 \bigr),
\]
\[
\mathbb{P}\bigl( L(\widetilde{\theta},\theta^*) > z,\ \|V_0(\widetilde{\theta}-\theta^*)\| \le r \bigr) \le \mathbb{P}\bigl( \|\xi\|^2/2 > (1-\delta)z \bigr).
\]
Moreover, on the set $C(r) = \{\|V_0(\widetilde{\theta}-\theta^*)\| \le r,\ \|\xi\|\le r\}$, it holds
\[
\|D_0(\widetilde{\theta}-\theta^*) - \xi\|^2 \le \frac{2\delta}{1-\delta^2}\,\|\xi\|^2. \tag{5.12}
\]
If the function $d(w)$ is quadratic, then the approximation error $\delta$ vanishes as well and the expansion (5.12) becomes an equality which also holds globally; a localization step is not required. However, if $d(w)$ is not quadratic, the result applies only locally and has to be accompanied by a large deviation bound. The GLM structure is helpful in the large deviation zone as well. Indeed, the gradient $\nabla\zeta(\theta)$ does not depend on $\theta$ and hence the most delicate condition $(Er)$ is fulfilled automatically with $\mathtt{g} = \mathtt{g}_1 N^{1/2}$ for all local sets $\Theta_0(r)$. Further, the identifiability condition $(Lr)$ easily follows from Lemma 5.6: it suffices to bound the matrix $D(\theta)$ from below for $\theta\in\Theta_0(r)$:
\[
D(\theta) \ge \mathtt{b}(r)\, V_0, \qquad \theta\in\Theta_0(r).
\]
An interesting question, similarly to the i.i.d.
case, is the minimal radius $r_0$ of the local vicinity $\Theta_0(r_0)$ ensuring the desirable concentration property. Suppose for the moment that the constants $\mathtt{b}(r)$ are all the same for different $r$: $\mathtt{b}(r)\equiv \mathtt{b}$. Under the regularity condition (5.11), a sufficient lower bound on $r_0$ can be based on Corollary 4.3. The required condition can be restated as
\[
1 + \sqrt{x + Q} \le 3\nu_0^2\,\mathtt{g}/\mathtt{b},
\qquad
6\nu_0 \sqrt{x + Q} \le r\,\mathtt{b}.
\]
It remains to note that $Q = c_1 p$ and $\mathtt{g} = \mathtt{g}_1 N^{1/2}$. So the required conditions are fulfilled for $r^2 \ge r_0^2 = C(x + p)$, where $C$ only depends on $\nu_0$, $\mathtt{b}$, and $\mathtt{g}$.

5.3 Linear median estimation

This section illustrates how the proposed approach applies to robust estimation in linear models. The target of analysis is the linear dependence of the observed data $Y = (Y_1,\dots,Y_n)$ on the set of features $\Psi_i\in\mathbb{R}^p$:
\[
Y_i = \Psi_i^\top\theta + \varepsilon_i, \tag{5.13}
\]
where $\varepsilon_i$ is the $i$th individual error. As usual, the true data distribution can deviate from the linear model. In addition, we admit contaminated data, which naturally leads to the idea of robust estimation. This section offers a qMLE view of the robust estimation problem. Our parametric family assumes the linear dependence (5.13) with i.i.d. errors $\varepsilon_i$ which follow the double exponential (Laplace) distribution with density $(1/2)e^{-|y|}$. The corresponding log-likelihood reads as
\[
L(\theta) = -\frac{1}{2}\sum_i \bigl| Y_i - \Psi_i^\top\theta \bigr|,
\]
and $\widetilde{\theta} \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_\theta L(\theta)$ is called the least absolute deviation (LAD) estimate. In the context of linear regression, it is also called the linear median estimate. The target of estimation $\theta^*$ is defined as usual by the equation $\theta^* = \operatorname*{argmax}_\theta \mathbb{E} L(\theta)$. It is useful to define the residuals $\widetilde{\varepsilon}_i = Y_i - \Psi_i^\top\theta^*$ and their distributions $P_i(A) = \mathbb{P}(\widetilde{\varepsilon}_i\in A) = \mathbb{P}(Y_i - \Psi_i^\top\theta^*\in A)$ for any Borel set $A$ on the real line. If $Y_i = \Psi_i^\top\theta^* + \varepsilon_i$ is the true model, then $P_i$ coincides with the distribution of each $\varepsilon_i$. Below we suppose that each $P_i$ has a positive density $f_i(y)$. Note that the difference $L(\theta) - L(\theta^*)$ is bounded by $\frac{1}{2}\sum_i |\Psi_i^\top(\theta-\theta^*)|$.

Next we check conditions $(ED_0)$ and $(ED_1)$. Denote $\xi_i(\theta) = \mathbb{1}(Y_i - \Psi_i^\top\theta \le 0) - q_i(\theta)$ for $q_i(\theta) = \mathbb{P}(Y_i - \Psi_i^\top\theta \le 0)$. This is a centered Bernoulli random variable, and it is easy to check that
\[
\nabla\zeta(\theta) = -\sum_i \xi_i(\theta)\,\Psi_i. \tag{5.14}
\]
This expression differs from the similar ones for linear and generalized linear regression because the stochastic terms $\xi_i$ now depend on $\theta$. First we check the global condition $(Er)$. Fix any $\mathtt{g}_1 < 1$. Then it holds for a Bernoulli r.v. $Z$ with $\mathbb{P}(Z=1) = q$, $\xi = Z - q$, and $|\lambda|\le \mathtt{g}_1$,
\[
\log \mathbb{E} \exp(\lambda\xi) = \log\bigl[ q\exp\{\lambda(1-q)\} + (1-q)\exp(-\lambda q) \bigr] \le \nu_0^2\, q(1-q)\,\lambda^2/2, \tag{5.15}
\]
where $\nu_0 \ge 1$ depends on $\mathtt{g}_1$ only. Let now a vector $\gamma\in\mathbb{R}^p$ and $\rho > 0$ be such that $\rho|\Psi_i^\top\gamma| \le \mathtt{g}_1$ for all $i=1,\dots,n$. Then
\[
\log \mathbb{E} \exp\bigl\{\rho\gamma^\top\nabla\zeta(\theta)\bigr\}
\le \frac{\nu_0^2\rho^2}{2}\sum_i q_i(\theta)\bigl\{1-q_i(\theta)\bigr\}\,|\Psi_i^\top\gamma|^2
\le \frac{\nu_0^2\rho^2}{2}\,\|V(\theta)\gamma\|^2, \tag{5.16}
\]
where $V^2(\theta) = \sum_i q_i(\theta)\{1-q_i(\theta)\}\,\Psi_i\Psi_i^\top$. Denote also
\[
V_0^2 = \frac{1}{4}\sum_i \Psi_i\Psi_i^\top. \tag{5.17}
\]
Clearly $V(\theta)\le V_0$ for all $\theta$, and condition $(Er)$ is fulfilled with the matrix $V_0$ and $\mathtt{g}(r)\equiv \mathtt{g} = \mathtt{g}_1 N^{1/2}$ for $N$ defined by
\[
N^{-1/2} \stackrel{\mathrm{def}}{=} \max_i \sup_{\gamma\in\mathbb{R}^p} \frac{|\Psi_i^\top\gamma|}{2\|V_0\gamma\|}; \tag{5.18}
\]
cf. (5.7). Let some $r_0 > 0$ be fixed; we will specify this choice later. Now we check the local conditions within the elliptic vicinity $\Theta_0(r_0) = \{\theta: \|V_0(\theta-\theta^*)\| \le r_0\}$ of the central point $\theta^*$ for $V_0$ from (5.17). Condition $(ED_0)$ with the matrix $V_0$ and $\mathtt{g} = N^{1/2}\mathtt{g}_1$ is fulfilled on $\Theta_0(r_0)$ due to (5.16). Next, in view of (5.18), it holds $|\Psi_i^\top\gamma| \le 2N^{-1/2}\|V_0\gamma\|$ for any vector $\gamma\in\mathbb{R}^p$.
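The elementary Bernoulli bound (5.15) used above is also easy to verify numerically. The following small sketch is an illustration only; the concrete values $\mathtt{g}_1 = 0.5$ and $\nu_0 = 1.3$ are one admissible choice and are not claimed in the text, which only asserts existence of some $\nu_0\ge 1$ depending on $\mathtt{g}_1$.

```python
import numpy as np

# Check (5.15): log E exp(lambda*xi) <= nu0^2 q(1-q) lambda^2 / 2
# for xi = Z - q, Z ~ Bernoulli(q), and |lambda| <= g1.
g1, nu0 = 0.5, 1.3            # illustrative constants
qs = np.linspace(0.01, 0.99, 99)
lams = np.linspace(-g1, g1, 201)
worst = -np.inf
for q in qs:
    lhs = np.log(q * np.exp(lams * (1 - q)) + (1 - q) * np.exp(-lams * q))
    rhs = nu0**2 * q * (1 - q) * lams**2 / 2
    worst = max(worst, float(np.max(lhs - rhs)))
print("max violation:", worst)   # a non-positive value confirms (5.15) for these constants
```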
By (5.14),
\[
\nabla\zeta(\theta) - \nabla\zeta(\theta^*) = -\sum_i \Psi_i \bigl\{ \xi_i(\theta) - \xi_i(\theta^*) \bigr\}.
\]
If $\Psi_i^\top\theta \ge \Psi_i^\top\theta^*$, then
\[
\xi_i(\theta) - \xi_i(\theta^*) = \mathbb{1}\bigl(\Psi_i^\top\theta^* \le Y_i < \Psi_i^\top\theta\bigr) - \mathbb{P}\bigl(\Psi_i^\top\theta^* \le Y_i < \Psi_i^\top\theta\bigr).
\]
Similarly, for $\Psi_i^\top\theta < \Psi_i^\top\theta^*$,
\[
\xi_i(\theta) - \xi_i(\theta^*) = -\mathbb{1}\bigl(\Psi_i^\top\theta \le Y_i < \Psi_i^\top\theta^*\bigr) + \mathbb{P}\bigl(\Psi_i^\top\theta \le Y_i < \Psi_i^\top\theta^*\bigr).
\]
Define $q_i(\theta,\theta^*) \stackrel{\mathrm{def}}{=} |q_i(\theta) - q_i(\theta^*)|$. Now (5.15) yields similarly to (5.16)
\[
\log \mathbb{E} \exp\bigl\{\rho\gamma^\top\bigl(\nabla\zeta(\theta) - \nabla\zeta(\theta^*)\bigr)\bigr\}
\le \frac{\nu_0^2\rho^2}{2}\sum_i q_i(\theta,\theta^*)\,|\Psi_i^\top\gamma|^2
\le 2\nu_0^2\rho^2 \max_{i\le n} q_i(\theta,\theta^*)\,\|V_0\gamma\|^2
\le \omega(r)\,\nu_0^2\rho^2\,\|V_0\gamma\|^2/2,
\]
with
\[
\omega(r) \stackrel{\mathrm{def}}{=} 4\max_{i\le n}\ \sup_{\theta\in\Theta_0(r)} q_i(\theta,\theta^*).
\]
If each density function $f_i$ is uniformly bounded by a constant $C$, then
\[
|q_i(\theta) - q_i(\theta^*)| \le C\,\bigl|\Psi_i^\top(\theta-\theta^*)\bigr| \le C N^{-1/2}\|V_0(\theta-\theta^*)\| \le C N^{-1/2} r.
\]
Next we check the local identifiability condition. We use the following technical lemma.

Lemma 5.9. It holds for any $\theta$
\[
-\frac{\partial^2}{\partial\theta^2}\, \mathbb{E} L(\theta) \stackrel{\mathrm{def}}{=} D^2(\theta) = \sum_i f_i\bigl(\Psi_i^\top(\theta-\theta^*)\bigr)\,\Psi_i\Psi_i^\top, \tag{5.19}
\]
where $f_i(\cdot)$ is the density of $\widetilde{\varepsilon}_i = Y_i - \Psi_i^\top\theta^*$. Moreover, there is $\theta^\circ\in[\theta,\theta^*]$ such that
\[
-\mathbb{E} L(\theta,\theta^*) = \frac{1}{2}\sum_i \bigl|\Psi_i^\top(\theta-\theta^*)\bigr|^2\, f_i\bigl(\Psi_i^\top(\theta^\circ-\theta^*)\bigr)
= (\theta-\theta^*)^\top D^2(\theta^\circ)(\theta-\theta^*)/2. \tag{5.20}
\]
Proof. Obviously
\[
\frac{\partial \mathbb{E} L(\theta)}{\partial\theta} = -\sum_i \bigl\{ \mathbb{P}(Y_i\le\Psi_i^\top\theta) - 1/2 \bigr\}\Psi_i.
\]
The identity (5.19) is obtained by one more differentiation. By definition, $\theta^*$ is the extreme point of $\mathbb{E} L(\theta)$. The equality $\nabla \mathbb{E} L(\theta^*) = 0$ yields
\[
\sum_i \bigl\{ \mathbb{P}(Y_i\le\Psi_i^\top\theta^*) - 1/2 \bigr\}\Psi_i = 0.
\]
Now (5.20) follows by the second-order Taylor expansion at $\theta^*$.

Define
\[
D_0^2 \stackrel{\mathrm{def}}{=} D^2(\theta^*) = \sum_i f_i(0)\,\Psi_i\Psi_i^\top. \tag{5.21}
\]
Due to this lemma, condition $(L_0)$ is fulfilled on $\Theta_0(r)$ with this choice of $D_0$ and with $\delta(r)$ from (5.10); see Lemma 5.7. Moreover, if $f_i(0) \ge 1/(4a^2)$ for some $a > 0$, then the identifiability condition $(\iota)$ is also satisfied. Now all the local conditions are fulfilled, yielding the general bracketing bound of Theorem 3.1 and all its corollaries. It only remains to accompany them with a large deviation bound, that is, to specify the local vicinity $\Theta_0(r_0)$ providing the prescribed deviation bound. A sufficient condition for the concentration property is that the expectation $\mathbb{E} L(\theta,\theta^*)$ grows in absolute value with the distance $\|V_0(\theta-\theta^*)\|$. We use the representation (5.20). Suppose that for some fixed $\delta < 1/2$ and $\rho > 0$
\[
\bigl| f_i(u)/f_i(0) - 1 \bigr| \le \delta, \qquad |u|\le\rho. \tag{5.22}
\]
For any $\theta$ with $\|V_0(\theta-\theta^*)\| = r$ and any $i=1,\dots,n$, it holds $|\Psi_i^\top(\theta-\theta^*)| \le N^{-1/2}\|V_0(\theta-\theta^*)\| = N^{-1/2} r$. Therefore, for $r \le \rho N^{1/2}$ and any $\theta\in\Theta_0(r)$ with $\|V_0(\theta-\theta^*)\| = r$, it holds $f_i\bigl(\Psi_i^\top(\theta^\circ-\theta^*)\bigr) \ge (1-\delta)f_i(0)$. Now Lemma 5.9 implies
\[
-\mathbb{E} L(\theta,\theta^*) \ge \frac{1-\delta}{2}\,\|D_0(\theta-\theta^*)\|^2 \ge \frac{1-\delta}{2a^2}\,\|V_0(\theta-\theta^*)\|^2 = \frac{1-\delta}{2a^2}\, r^2.
\]
By Lemma 5.9 the function $-\mathbb{E} L(\theta,\theta^*)$ is convex. This easily yields
\[
-\mathbb{E} L(\theta,\theta^*) \ge \frac{1-\delta}{2a^2}\,\rho N^{1/2}\, r \qquad \text{for all } r\ge\rho N^{1/2}.
\]
Thus
\[
r\,\mathtt{b}(r) \ge
\begin{cases}
(1-\delta)(2a^2)^{-1}\, r, & \text{if } r\le\rho N^{1/2},\\
(1-\delta)(2a^2)^{-1}\,\rho N^{1/2}, & \text{if } r>\rho N^{1/2}.
\end{cases}
\]
So the global identifiability condition $(Lr)$ is fulfilled if $r_0^2 \ge C_1 a^2(x+Q)$ and if $\rho^2 N \ge C_2 a^2(x+Q)$ for some fixed constants $C_1$ and $C_2$. Putting all together yields the following result.

Theorem 5.10. Let the $Y_i$ be independent, $\theta^* = \operatorname*{argmax}_\theta \mathbb{E} L(\theta)$, $D_0^2$ be given by (5.21), and $V_0^2$ by (5.17). Let also the densities $f_i(\cdot)$ of $Y_i - \Psi_i^\top\theta^*$ be uniformly bounded by a constant $C$, fulfill (5.22) for some $\rho > 0$ and $\delta > 0$, and satisfy $f_i(0) \ge 1/(4a^2)$ for all $i$. Finally, let $N \ge C_2\rho^{-2}a^2(x+p)$ for some fixed $x > 0$ and $C_2$. Then on a random set of probability at least $1 - e^{-x}$, one obtains for $\xi \stackrel{\mathrm{def}}{=} D_0^{-1}\nabla L(\theta^*)$ the bounds
\[
\bigl\| D_0(\widetilde{\theta}-\theta^*) - \xi \bigr\|^2 = o(p),
\qquad
2L(\widetilde{\theta},\theta^*) - \|\xi\|^2 = o(p).
\]
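Since $\sum_i|Y_i - \Psi_i^\top\theta|$ is piecewise linear in $\theta$, the LAD estimate of Section 5.3 can be computed exactly by linear programming. The following sketch is an illustration only (the routine name, the auxiliary variables $t_i \ge |Y_i - \Psi_i^\top\theta|$, and the toy data are ours and not part of the development); it demonstrates the robustness of the linear median estimate under heavy-tailed contamination.

```python
import numpy as np
from scipy.optimize import linprog

def lad_estimate(Y, Psi):
    """Least absolute deviation (linear median) estimate:
    argmin_theta sum_i |Y_i - Psi_i^T theta|, written as an LP in (theta, t)
    with constraints  Y_i - Psi_i^T theta <= t_i  and  Psi_i^T theta - Y_i <= t_i."""
    n, p = Psi.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])        # minimize sum_i t_i
    A_ub = np.block([[-Psi, -np.eye(n)],
                     [ Psi, -np.eye(n)]])
    b_ub = np.concatenate([-Y, Y])
    bounds = [(None, None)] * p + [(0, None)] * n        # theta free, t_i >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

# toy usage: linear model with heavy-tailed (Cauchy) errors
rng = np.random.default_rng(0)
n, p = 400, 3
Psi = rng.normal(size=(n, p))
theta_star = np.array([1.0, -2.0, 0.5])
Y = Psi @ theta_star + 0.5 * rng.standard_cauchy(size=n)
print(lad_estimate(Y, Psi))       # close to theta_star despite the heavy tails
```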
A Some results for empirical processes

This chapter presents some general results from the theory of empirical processes. We assume exponential moment conditions on the increments of the process, which allows applying the well developed chaining arguments in Orlicz spaces; see e.g. van der Vaart and Wellner (1996), Chapter 2.2. We, however, follow the more recent approach inspired by the notions of generic chaining and majorizing measures due to M. Talagrand; see e.g. Talagrand (1996, 2001, 2005). The results are close to those of Bednorz (2006). We state the results in a slightly different form and present an independent and self-contained proof.

The first result states a bound for local fluctuations of a process $U(\upsilon)$ given on a metric space $\Upsilon$. This result is then used for bounding the maximum of the negatively drifted process $U(\upsilon) - U(\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)$ over a vicinity $\Upsilon_\circ(r)$ of the central point $\upsilon_0$. The behavior of $U(\upsilon)$ outside of the local central set $\Upsilon_\circ(r)$ is described using the upper function method. Namely, we construct a multiscale deterministic function $u(\mu,\upsilon)$ ensuring that with probability at least $1 - e^{-x}$ it holds $\mu U(\upsilon) - u(\mu,\upsilon) \le z(x)$ for all $\upsilon\notin\Upsilon_\circ(r)$ and $\mu\in M$, where $z(x)$ grows linearly in $x$.

A.1 A bound for local fluctuations

An important step in the whole construction is an exponential bound on the maximum of a random process $U(\upsilon)$ under exponential moment conditions on its increments. Let $d(\upsilon,\upsilon')$ be a semi-distance on $\Upsilon$. We suppose the following condition to hold:

$(Ed)$ There exist $\mathtt{g} > 0$, $r_0 > 0$, $\nu_0 \ge 1$, such that for any $\lambda\le\mathtt{g}$ and any $\upsilon,\upsilon'\in\Upsilon$ with $d(\upsilon,\upsilon')\le r_0$
\[
\log \mathbb{E} \exp\Bigl\{ \lambda\,\frac{U(\upsilon) - U(\upsilon')}{d(\upsilon,\upsilon')} \Bigr\} \le \nu_0^2\lambda^2/2. \tag{A.1}
\]
The formulation of the result involves a sigma-finite measure $\pi$ on the space $\Upsilon$, which is often called the majorizing measure and is used in the generic chaining device; see Talagrand (2005). A typical choice of $\pi$ is the Lebesgue measure on $\mathbb{R}^p$. Let $\Upsilon^\circ$ be a subset of $\Upsilon$, and let a sequence $r_k$ be fixed with $r_0 = \mathrm{diam}(\Upsilon^\circ)$ and $r_k = r_0 2^{-k}$. Let also $B_k(\upsilon) \stackrel{\mathrm{def}}{=} \{\upsilon'\in\Upsilon^\circ: d(\upsilon,\upsilon')\le r_k\}$ be the $d$-ball centered at $\upsilon$ of radius $r_k$, and let $\pi_k(\upsilon)$ denote its $\pi$-measure:
\[
\pi_k(\upsilon) \stackrel{\mathrm{def}}{=} \int_{B_k(\upsilon)} \pi(d\upsilon') = \int_{\Upsilon^\circ} \mathbb{1}\bigl(d(\upsilon,\upsilon')\le r_k\bigr)\,\pi(d\upsilon').
\]
Denote also
\[
M_k \stackrel{\mathrm{def}}{=} \max_{\upsilon\in\Upsilon^\circ} \frac{\pi(\Upsilon^\circ)}{\pi_k(\upsilon)}, \qquad k\ge 1. \tag{A.2}
\]
Finally set $c_1 = 1/3$, $c_k = 2^{-k+2}/3$ for $k\ge 2$, and define the value $Q(\Upsilon^\circ)$ by
\[
Q(\Upsilon^\circ) \stackrel{\mathrm{def}}{=} \sum_{k=1}^{\infty} c_k \log(2M_k) = \frac{1}{3}\log(2M_1) + \frac{4}{3}\sum_{k=2}^{\infty} 2^{-k}\log(2M_k).
\]
Theorem A.1. Let $U$ be a separable process following $(Ed)$. If $\Upsilon^\circ$ is a $d$-ball in $\Upsilon$ with center $\upsilon^\circ$ and radius $r_0$, i.e. $d(\upsilon,\upsilon^\circ)\le r_0$ for all $\upsilon\in\Upsilon^\circ$, then for $\lambda\le\mathtt{g}_0 \stackrel{\mathrm{def}}{=} \nu_0\mathtt{g}$
\[
\log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{3\nu_0 r_0}\,\sup_{\upsilon\in\Upsilon^\circ} \bigl(U(\upsilon) - U(\upsilon^\circ)\bigr) \Bigr\} \le \lambda^2/2 + Q(\Upsilon^\circ). \tag{A.3}
\]
Proof. A simple change of $U(\cdot)$ to $\nu_0^{-1}U(\cdot)$ and of $\mathtt{g}$ to $\mathtt{g}_0 = \nu_0\mathtt{g}$ allows reducing the result to the case with $\nu_0 = 1$, which we assume below. Consider for $k\ge 1$ the smoothing operator $S_k$ defined as
\[
S_k f(\upsilon^\circ) = \frac{1}{\pi_k(\upsilon^\circ)} \int_{B_k(\upsilon^\circ)} f(\upsilon)\,\pi(d\upsilon).
\]
Further, define $S_0 U(\upsilon) \equiv U(\upsilon^\circ)$, so that $S_0 U$ is a constant function and the same holds for $S_k S_{k-1}\dots S_0 U$ for any $k\ge 1$. If $f(\cdot)\le g(\cdot)$ for two non-negative functions $f$ and $g$, then $S_k f(\cdot)\le S_k g(\cdot)$. Separability of the process $U$ implies that $\lim_k S_k U(\upsilon) = U(\upsilon)$. We conclude that for each $\upsilon\in\Upsilon^\circ$
\[
U(\upsilon) - U(\upsilon^\circ) = \lim_{k\to\infty}\bigl( S_k U(\upsilon) - S_k\dots S_0 U(\upsilon) \bigr)
\le \lim_{k\to\infty} \sum_{i=1}^{k} \bigl| S_k\dots S_i (I - S_{i-1}) U(\upsilon) \bigr|
\le \sum_{i=1}^{\infty} \xi_i^* .
\]
Here $\xi_k^* \stackrel{\mathrm{def}}{=} \sup_{\upsilon\in\Upsilon^\circ}\xi_k(\upsilon)$ for $k\ge 1$ with
\[
\xi_1(\upsilon) \equiv \bigl| S_1 U(\upsilon) - U(\upsilon^\circ) \bigr|,
\qquad
\xi_k(\upsilon) \stackrel{\mathrm{def}}{=} \bigl| S_k(I - S_{k-1})U(\upsilon) \bigr|, \quad k\ge 2.
\]
For a fixed point $\upsilon^\sharp$, it holds
\[
\xi_k(\upsilon^\sharp) \le \frac{1}{\pi_k(\upsilon^\sharp)} \int_{B_k(\upsilon^\sharp)} \frac{1}{\pi_{k-1}(\upsilon)} \int_{B_{k-1}(\upsilon)} \bigl| U(\upsilon) - U(\upsilon') \bigr| \,\pi(d\upsilon')\,\pi(d\upsilon).
\]
For each $\upsilon'\in B_{k-1}(\upsilon)$, it holds $d(\upsilon,\upsilon')\le r_{k-1} = 2r_k$ and
\[
\bigl| U(\upsilon) - U(\upsilon') \bigr| \le r_{k-1}\, \frac{\bigl| U(\upsilon) - U(\upsilon') \bigr|}{d(\upsilon,\upsilon')} .
\]
This implies for each $\upsilon^\sharp\in\Upsilon^\circ$ and $k\ge 2$, by the Jensen inequality and (A.2),
\[
\exp\Bigl\{\frac{\lambda}{r_{k-1}}\,\xi_k(\upsilon^\sharp)\Bigr\}
\le \int_{B_k(\upsilon^\sharp)} \int_{B_{k-1}(\upsilon)} \exp\Bigl\{ \lambda\,\frac{|U(\upsilon) - U(\upsilon')|}{d(\upsilon,\upsilon')} \Bigr\} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)}\,\frac{\pi(d\upsilon)}{\pi_k(\upsilon^\sharp)}
\le M_k \int_{\Upsilon^\circ} \int_{B_{k-1}(\upsilon)} \exp\Bigl\{ \lambda\,\frac{|U(\upsilon) - U(\upsilon')|}{d(\upsilon,\upsilon')} \Bigr\} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)}\,\frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)} .
\]
As the right hand side does not depend on $\upsilon^\sharp$, this yields for $\xi_k^* \stackrel{\mathrm{def}}{=} \sup_{\upsilon\in\Upsilon^\circ}\xi_k(\upsilon)$, by condition $(Ed)$ and in view of $e^{|x|}\le e^{x} + e^{-x}$,
\[
\mathbb{E} \exp\Bigl\{\frac{\lambda}{r_{k-1}}\,\xi_k^*\Bigr\}
\le M_k \int_{\Upsilon^\circ} \int_{B_{k-1}(\upsilon)} \mathbb{E}\exp\Bigl\{ \lambda\,\frac{|U(\upsilon) - U(\upsilon')|}{d(\upsilon,\upsilon')} \Bigr\} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)}\,\frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)}
\le 2M_k \exp(\lambda^2/2) \int_{\Upsilon^\circ} \int_{B_{k-1}(\upsilon)} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)}\,\frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)}
= 2M_k \exp(\lambda^2/2).
\]
Further, the use of $d(\upsilon,\upsilon^\circ)\le r_0$ for all $\upsilon\in\Upsilon^\circ$ yields by $(Ed)$
\[
\mathbb{E} \exp\Bigl\{\frac{\lambda}{r_0}\,\bigl|U(\upsilon) - U(\upsilon^\circ)\bigr|\Bigr\} \le 2\exp\bigl(\lambda^2/2\bigr) \tag{A.4}
\]
and thus
\[
\mathbb{E} \exp\Bigl\{\frac{\lambda}{r_0}\,\bigl|S_1 U(\upsilon) - U(\upsilon^\circ)\bigr|\Bigr\}
\le \frac{1}{\pi_1(\upsilon)}\int_{B_1(\upsilon)} \mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\,\bigl|U(\upsilon') - U(\upsilon^\circ)\bigr|\Bigr\}\,\pi(d\upsilon')
\le \frac{M_1}{\pi(\Upsilon^\circ)}\int_{\Upsilon^\circ} \mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\,\bigl|U(\upsilon') - U(\upsilon^\circ)\bigr|\Bigr\}\,\pi(d\upsilon').
\]
This implies by (A.4) for $\xi_1^* \equiv \sup_{\upsilon\in\Upsilon^\circ}\bigl|S_1 U(\upsilon) - U(\upsilon^\circ)\bigr|$
\[
\mathbb{E} \exp\Bigl\{\frac{\lambda}{r_0}\,\xi_1^*\Bigr\} \le 2M_1\exp\bigl(\lambda^2/2\bigr).
\]
Denote $c_1 = 1/3$ and $c_k = r_{k-1}/(3r_0) = 2^{-k+2}/3$ for $k\ge 2$. Then $\sum_{k=1}^\infty c_k = 1$ and it holds by the Hölder inequality (see Lemma A.16 below)
\[
\log \mathbb{E} \exp\Bigl\{\frac{\lambda}{3r_0}\sum_{k=1}^\infty \xi_k^*\Bigr\}
\le c_1 \log \mathbb{E}\exp\Bigl\{\frac{\lambda}{r_0}\xi_1^*\Bigr\} + \sum_{k=2}^\infty c_k \log \mathbb{E}\exp\Bigl\{\frac{\lambda}{r_{k-1}}\xi_k^*\Bigr\}
\le \lambda^2/2 + c_1\log(2M_1) + \sum_{k=2}^\infty c_k\log(2M_k)
\le \lambda^2/2 + Q(\Upsilon^\circ).
\]
This implies the result.

The exponential bound of Theorem A.1 can be used for obtaining a probability bound on the maximum of the increments $U(\upsilon) - U(\upsilon^\circ)$ over $\Upsilon^\circ$.

Corollary A.2. Suppose $(Ed)$. If $\Upsilon^\circ$ is a central set with center $\upsilon^\circ$ and radius $r_0$, then it holds for any $x > 0$
\[
\mathbb{P}\Bigl( \frac{1}{3\nu_0 r_0}\sup_{\upsilon\in\Upsilon^\circ} U(\upsilon,\upsilon^\circ) > z(x,Q) \Bigr) \le \exp(-x), \tag{A.5}
\]
where, with $\mathtt{g}_0 = \nu_0\mathtt{g}$ and $Q = Q(\Upsilon^\circ)$,
\[
z(x,Q) \stackrel{\mathrm{def}}{=}
\begin{cases}
\sqrt{2(x+Q)}, & \text{if } \sqrt{2(x+Q)} \le \mathtt{g}_0,\\
\mathtt{g}_0^{-1}(x+Q) + \mathtt{g}_0/2, & \text{otherwise}.
\end{cases} \tag{A.6}
\]
Proof. By the Chebyshev inequality, it holds for the r.v. $\xi \stackrel{\mathrm{def}}{=} \sup_{\upsilon\in\Upsilon^\circ} U(\upsilon,\upsilon^\circ)/(3\nu_0 r_0)$ and any $\lambda\le\mathtt{g}_0$, by (A.3),
\[
\log \mathbb{P}(\xi > z) \le -\lambda z + \log \mathbb{E}\exp(\lambda\xi) \le -\lambda z + \lambda^2/2 + Q.
\]
Now, given $x > 0$, we choose $\lambda = \sqrt{2(x+Q)}$ if this value is not larger than $\mathtt{g}_0$, and $\lambda = \mathtt{g}_0$ otherwise. It is straightforward to check that $\lambda z - \lambda^2/2 - Q \ge x$ in both cases, and the choice of $z$ by (A.6) yields the bound (A.5).

A.2 Application to a two-norms case

As an application of the local bound of Theorem A.1 we consider the result of Baraud (2010), Theorem 3. For convenience of comparison we utilize the notation of that paper. Let $T$ be a subset of a linear space $S$ of dimension $D$, endowed with two norms denoted by $d(s,t)$ and $\delta(s,t)$ for $s,t\in T$. Let also $(X_t)_{t\in T}$ be a random process on $T$. The basic assumption of Baraud (2010) is a kind of Bernstein bound: for some fixed $c > 0$,
\[
\log \mathbb{E} \exp\bigl\{\lambda(X_t - X_s)\bigr\} \le \frac{\lambda^2 d(s,t)^2/2}{1 - \lambda c\,\delta(s,t)}, \qquad \text{if } \lambda c\,\delta(s,t) < 1. \tag{A.7}
\]
The aim is to bound the maximum of the process $X_t$ over a bounded subset $T_{v,b}$ defined for $v,b > 0$ and a specific point $t_0$ as
\[
T_{v,b} \stackrel{\mathrm{def}}{=} \bigl\{ t:\ d(t,t_0)\le v,\ c\,\delta(t,t_0)\le b \bigr\}.
\]
Let $Q = c_1 D$ with $c_1 = 2$ for $D\ge 2$ and $c_1 = 2.7$ for $D = 1$.

Theorem A.3. Suppose that $(X_t)_{t\in S}$ fulfills (A.7), where $S$ is a $D$-dimensional linear space.
For any $\rho < 1$, it holds
\[
\log \mathbb{P}\Bigl( \frac{\sqrt{1-\rho}}{3v}\,\sup_{t\in T_{v,b}} (X_t - X_{t_0}) > z(x,Q) \Bigr) \le -x, \tag{A.8}
\]
where $z(x,Q)$ is from (A.6) with $\mathtt{g}_0 = \rho(1-\rho)^{-1/2} b^{-1} v$.

Proof. Define the new semi-distance $d^*(s,t)$ by
\[
d^*(s,t) \stackrel{\mathrm{def}}{=} \max\bigl\{ d(s,t),\ b^{-1}v\,c\,\delta(s,t) \bigr\}.
\]
The set $T_{v,b}$ can be represented as $T_{v,b} = \{t: d^*(t,t_0)\le v\}$. Moreover, Lemma A.10 applied to the semi-distance $d^*(t,s)$ yields $Q(T_{v,b}) \le c_1 D$, where $c_1 = 2$ for $D\ge 2$ and $c_1 = 2.7$ for $D = 1$. Fix some $\rho < 1$ and define $\mathtt{g} = \rho b^{-1} v$. Then for $|\lambda|\le\mathtt{g}$, it holds
\[
\frac{\lambda}{d^*(s,t)}\, c\,\delta(s,t) \le \frac{\lambda}{b^{-1}v\,c\,\delta(s,t)}\, c\,\delta(s,t) \le \rho,
\]
and by (A.7) it follows with $\nu_0^2 = (1-\rho)^{-1}$
\[
\log \mathbb{E} \exp\Bigl\{ \lambda\,\frac{X_t - X_s}{d^*(s,t)} \Bigr\}
\le \log \mathbb{E} \exp\Bigl\{ \lambda\,\frac{X_t - X_s}{d(s,t)} \Bigr\}
\le \frac{\lambda^2/2}{1-\rho} \le \frac{\nu_0^2\lambda^2}{2}.
\]
So condition $(Ed)$ is fulfilled. Now the result follows from Corollary A.2.

If $v$ is large relative to $b$, then $\mathtt{g} = \rho v/b$ is large as well. For moderate values of $x$, this allows applying the bound (A.8) with $z(x,Q) = \sqrt{2(x+Q)}$. In other words, the value $z\approx z(x,Q)$ ensures that the maximum of $X_t - X_{t_0}$ over $t\in T_{v,b}$ exceeds $3vz$ only with the exponentially small probability $e^{-x}$.

A.3 A local central bound

Due to the result of Theorem A.1, the bound for the maximum of $U(\upsilon,\upsilon_0)$ over $\upsilon\in B_r(\upsilon_0)$ grows quadratically in $r$. So its applications to situations with $r^2 \gg Q(\Upsilon^\circ)$ are limited. The next result shows that introducing a negative quadratic drift helps to state a local probability bound which is uniform in $r$. Namely, the bound for the process $U(\upsilon,\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)/2$ with some positive $\rho$ over a ball $B_r(\upsilon_0)$ around the point $\upsilon_0$ only depends on the drift coefficient $\rho$ but not on $r$. Here the generic chaining arguments are combined with the slicing technique. The idea is, for a given $r^* > 1$, to split the ball $B_{r^*}(\upsilon_0)$ into the slices $B_{r+1}(\upsilon_0)\setminus B_r(\upsilon_0)$ and to apply Theorem A.1 to each slice separately with a proper choice of the parameter $\lambda$.

Theorem A.4. Let $r^*$ be such that $(Ed)$ holds on $B_{r^*}(\upsilon_0)$. Let also $Q(\Upsilon^\circ)\le Q$ for $\Upsilon^\circ = B_r(\upsilon_0)$ with $r\le r^*$. If $\rho > 0$ and $z$ are fixed to ensure $\sqrt{2\rho z}\le\mathtt{g}_0 = \nu_0\mathtt{g}$ and $\rho(z-1)\ge 2$, then it holds
\[
\log \mathbb{P}\Bigl( \sup_{\upsilon\in B_{r^*}(\upsilon_0)} \Bigl\{ \frac{1}{3\nu_0} U(\upsilon,\upsilon_0) - \frac{\rho}{2} d^2(\upsilon,\upsilon_0) \Bigr\} > z \Bigr)
\le -\rho(z-1) + \log(4z) + Q. \tag{A.9}
\]
Moreover, if $\sqrt{2\rho z} > \mathtt{g}_0$, then
\[
\log \mathbb{P}\Bigl( \sup_{\upsilon\in B_{r^*}(\upsilon_0)} \Bigl\{ \frac{1}{3\nu_0} U(\upsilon,\upsilon_0) - \frac{\rho}{2} d^2(\upsilon,\upsilon_0) \Bigr\} > z \Bigr)
\le -\mathtt{g}_0\sqrt{\rho(z-1)} + \mathtt{g}_0^2/2 + \log(4z) + Q. \tag{A.10}
\]
Remark A.1. Formally the bound applies even with $r^* = \infty$ provided that $(Ed)$ is fulfilled on the whole set $\Upsilon^\circ$.

Proof. Denote
\[
u(r) \stackrel{\mathrm{def}}{=} \frac{1}{3\nu_0 r}\,\sup_{\upsilon\in B_r(\upsilon_0)} \bigl( U(\upsilon) - U(\upsilon_0) \bigr).
\]
Then we have to bound the probability
\[
\mathbb{P}\Bigl( \sup_{r\le r^*} \bigl\{ r\,u(r) - \rho r^2/2 \bigr\} > z \Bigr).
\]
For each $r\le r^*$ and $\lambda\le\mathtt{g}_0$, it follows from (A.3) that $\log \mathbb{E}\exp\{\lambda u(r)\} \le \lambda^2/2 + Q$. The choice $\lambda = \sqrt{2\rho z}$ is admissible in view of $\sqrt{2\rho z}\le\mathtt{g}_0$. This implies by the exponential Chebyshev inequality
\[
\log \mathbb{P}\bigl( r\,u(r) - \rho r^2/2 \ge z \bigr) \le -\lambda(z/r + \rho r/2) + \lambda^2/2 + Q
= -\rho z\,(x + x^{-1} - 1) + Q, \tag{A.11}
\]
where $x = \sqrt{\rho/(2z)}\; r$. By definition, $r\,u(r)$ increases in $r$. We use that for any growing function $f(\cdot)$, any $t\le t^*$, and any $s$ with $t\le s\le t+1$, it holds $f(t) - t \le f(s) - s + 1$, yielding
\[
\mathbb{1}\bigl( f(t) - t \ge z \bigr) \le \int_t^{t+1} \mathbb{1}\bigl( f(s) - s + 1 > z \bigr)\,ds \le \int_0^{t^*+1} \mathbb{1}\bigl( f(s) - s + 1 > z \bigr)\,ds.
\]
As this inequality is satisfied for any $t$, it also applies to the maximum over $t\le t^*$:
\[
\mathbb{1}\Bigl( \sup_{0\le t\le t^*} \bigl\{ f(t) - t \bigr\} \ge z \Bigr) \le \int_0^{t^*+1} \mathbb{1}\bigl( f(s) - s + 1 > z \bigr)\,ds.
\]
If a function $f(t)$ is random and growing in $t$, it follows that
\[
\mathbb{P}\Bigl( \sup_{t\le t^*} \bigl\{ f(t) - t \bigr\} \ge z \Bigr) \le \int_0^{t^*+1} \mathbb{P}\bigl( f(s) - s + 1 > z \bigr)\,ds. \tag{A.12}
\]
Now we consider $t = \rho r^2/2 = z x^2$, so that $dt = 2z\, x\, dx$.
It holds by (A.11) and (A.12)
\[
\mathbb{P}\Bigl( \sup_{r\le r^*} \bigl\{ r\,u(r) - \rho r^2/2 \bigr\} > z \Bigr)
\le \int_0^{t^*+1} \mathbb{P}\bigl( r\,u(r) - t > z - 1 \bigr)\,dt
\le \int_0^{t^*+1} 2z \exp\bigl\{ -\rho(z-1)(x + x^{-1} - 1) + Q \bigr\}\, x\, dx
\le 2z e^{-b+Q} \int_0^\infty \exp\bigl\{ -b(x + x^{-1} - 2) \bigr\}\, x\, dx
\]
with $b = \rho(z-1)$ and $t^* = \rho r^{*2}/2$. This implies for $b\ge 2$
\[
\mathbb{P}\Bigl( \sup_{r\le r^*} \bigl\{ r\,u(r) - \rho r^2/2 \bigr\} > z \Bigr)
\le 2z e^{-b+Q} \int_0^\infty \exp\bigl\{ -2(x + x^{-1} - 2) \bigr\}\, x\, dx
\le 4z \exp\bigl\{ -\rho(z-1) + Q \bigr\},
\]
and (A.9) follows. If $\sqrt{2\rho z} > \mathtt{g}_0$, then select $\lambda = \mathtt{g}_0$. For $r\le r^*$
\[
\log \mathbb{P}\bigl( r\,u(r) - \rho r^2/2 \ge z \bigr)
= \log \mathbb{P}\bigl( u(r) > z/r + \rho r/2 \bigr)
\le -\lambda(z/r + \rho r/2) + \lambda^2/2 + Q
\le -\lambda\sqrt{\rho z}\,(x + x^{-1} - 2)/2 - \lambda\sqrt{\rho z} + \lambda^2/2 + Q,
\]
where $x = \sqrt{\rho/z}\; r$. This allows bounding in the same way as above
\[
\mathbb{P}\Bigl( \sup_{r\le r^*} \bigl\{ r\,u(r) - \rho r^2/2 \bigr\} > z \Bigr)
\le 4z \exp\bigl\{ -\lambda\sqrt{\rho(z-1)} + \lambda^2/2 + Q \bigr\},
\]
yielding (A.10).

This result can be used for describing the concentration bound for the maximum of $(3\nu_0)^{-1}U(\upsilon,\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)/2$. Namely, it suffices to find $z$ ensuring the prescribed deviation probability. We state the result for the special case with $\rho = 1$ and $\mathtt{g}_0\ge 3$, which simplifies the notation.

Corollary A.5. Under the conditions of Theorem A.4, for any $x\ge 0$ with $x + Q \ge 4$,
\[
\mathbb{P}\Bigl( \sup_{\upsilon\in B_{r^*}(\upsilon_0)} \Bigl\{ \frac{1}{3\nu_0} U(\upsilon,\upsilon_0) - \frac{1}{2} d^2(\upsilon,\upsilon_0) \Bigr\} > z_0(x,Q) \Bigr) \le \exp(-x),
\]
where with $\mathtt{g}_0 = \nu_0\mathtt{g}$
\[
z_0(x,Q) \stackrel{\mathrm{def}}{=}
\begin{cases}
\bigl(1 + \sqrt{x+Q}\bigr)^2, & \text{if } 1 + \sqrt{x+Q} \le \mathtt{g}_0,\\
1 + \bigl( 2\mathtt{g}_0^{-1}(x+Q) + \mathtt{g}_0 \bigr)^2, & \text{otherwise}.
\end{cases} \tag{A.13}
\]
Proof. First consider the case $1 + \sqrt{x+Q} \le \mathtt{g}_0$. In view of (A.9), it suffices to check that $z = \bigl(1+\sqrt{x+Q}\bigr)^2$ ensures
\[
z - 1 - \log(4z) - Q \ge x.
\]
This follows from the inequality $(1+y)^2 - 1 - 2\log(2+2y) \ge y^2$ with $y = \sqrt{x+Q} \ge 2$. If $1+\sqrt{x+Q} > \mathtt{g}_0$, define $z = 1 + y^2$ with $y = 2\mathtt{g}_0^{-1}(x+Q) + \mathtt{g}_0$. Then
\[
\mathtt{g}_0\sqrt{z-1} - \log(4z) - \mathtt{g}_0^2/2 - Q - x = \mathtt{g}_0 y/2 - \log\bigl(4(1+y^2)\bigr) \ge 0,
\]
because $\mathtt{g}_0\ge 3$ and $3y/2 - \log(1+y^2) \ge \log(4)$ for $y\ge 2$.

If $\mathtt{g} \gg \sqrt{Q}$ and $x$ is not too big, then $z_0(x,Q)$ is of order $x + Q$. So the main message of this result is that with a high probability the maximum of $(3\nu_0)^{-1}U(\upsilon,\upsilon_0) - d^2(\upsilon,\upsilon_0)/2$ does not significantly exceed the level $Q$.

A.4 A multiscale upper function and hitting probability

The result of the previous section can be interpreted as a local upper function for the process $U(\cdot)$. Indeed, in a vicinity $B_{r^*}(\upsilon_0)$ of the central point $\upsilon_0$, it holds $(3\nu_0)^{-1}U(\upsilon,\upsilon_0) \le d^2(\upsilon,\upsilon_0)/2 + z$ with a probability exponentially small in $z$. This section aims at extending this local result to the whole set $\Upsilon$ using multiscaling arguments. To simplify the notation, assume that $U(\upsilon_0)\equiv 0$; then $U(\upsilon,\upsilon_0) = U(\upsilon)$. We say that $u(\mu,\upsilon)$ is a multiscale upper function for $\mu U(\cdot)$ on a subset $\Upsilon^\circ$ of $\Upsilon$ if
\[
\mathbb{P}\Bigl( \sup_{\mu\in M}\ \sup_{\upsilon\in\Upsilon^\circ} \bigl\{ \mu U(\upsilon) - u(\mu,\upsilon) \bigr\} \ge z(x) \Bigr) \le e^{-x} \tag{A.14}
\]
for some fixed function $z(x)$. An upper function can be used for describing the concentration sets of the point of maximum $\widetilde{\upsilon} = \operatorname{argmax}_{\upsilon\in\Upsilon^\circ} U(\upsilon)$; see Theorem A.8 below.

The desired global bound requires an extension of the local exponential moment condition $(Ed)$. Below we suppose that the pseudo-metric $d(\upsilon,\upsilon_0)$ is given on the whole set $\Upsilon$. For each $r$ this metric defines the ball $\Upsilon_\circ(r)$ by the constraint $d(\upsilon,\upsilon_0)\le r$. Below the condition $(Ed)$ is assumed to be fulfilled for any $r$; however, the constant $\mathtt{g}$ may depend on the radius $r$.

$(Er)$ For any $r$, there exists $\mathtt{g}(r) > 0$ such that (A.1) holds for all $\upsilon,\upsilon'\in\Upsilon_\circ(r)$ and all $\lambda\le\mathtt{g}(r)$.

Condition $(Er)$ implies a similar condition for the scaled process $\mu U(\upsilon)$ with $\mathtt{g} = \mu^{-1}\mathtt{g}(r)$ and $d(\upsilon,\upsilon_0)$ replaced by $\mu\, d(\upsilon,\upsilon_0)$.
Corollary A.5 implies for any $x$ with $1 + \sqrt{x+Q} \le \mathtt{g}_0(r) \stackrel{\mathrm{def}}{=} \nu_0\mathtt{g}(r)/\mu$
\[
\mathbb{P}\Bigl( \sup_{\upsilon\in B_r(\upsilon_0)} \Bigl\{ \frac{\mu}{3\nu_0} U(\upsilon) - \frac{1}{2}\mu^2 r^2 \Bigr\} > z_0(x,Q) \Bigr) \le \exp(-x). \tag{A.15}
\]
Let now a finite or separable set $M$ and a function $\mathtt{t}(\mu)\ge 1$ be fixed such that
\[
\sum_{\mu\in M} e^{-\mathtt{t}(\mu)} \le 2. \tag{A.16}
\]
One possible choice of the set $M$ and the function $\mathtt{t}(\mu)$ is to take a geometric sequence $\mu_k = \mu_0 2^{-k}$ with any fixed $\mu_0$ and define $\mathtt{t}(\mu_k) = k = -\log_2(\mu_k/\mu_0)$ for $k\ge 0$. Putting together the bounds (A.15) for different $\mu\in M$ yields the following result.

Theorem A.6. Suppose $(Er)$ and (A.16). Then for any $x\ge 2$, there exists a random set $A(x)$ of total probability at least $1 - 2e^{-x}$ such that it holds on $A(x)$, for any $r$,
\[
\sup_{\mu\in M(r,x)}\ \sup_{\upsilon\in B_r(\upsilon_0)} \Bigl\{ \frac{\mu}{3\nu_0} U(\upsilon) - \frac{1}{2}\mu^2 r^2 - \bigl(1 + \sqrt{x + Q + \mathtt{t}(\mu)}\bigr)^2 \Bigr\} < 0,
\]
where
\[
M(r,x) \stackrel{\mathrm{def}}{=} \Bigl\{ \mu\in M:\ 1 + \sqrt{x + Q + \mathtt{t}(\mu)} \le \nu_0\mathtt{g}(r)/\mu \Bigr\}.
\]
Proof. For each $\mu\in M(r,x)$, Corollary A.5 implies
\[
\mathbb{P}\Bigl( \sup_{\upsilon\in B_r(\upsilon_0)} \Bigl\{ \frac{\mu}{3\nu_0} U(\upsilon) - \frac{1}{2}\mu^2 r^2 \Bigr\} \ge \bigl(1 + \sqrt{x + Q + \mathtt{t}(\mu)}\bigr)^2 \Bigr) \le e^{-x-\mathtt{t}(\mu)}.
\]
The desired assertion is obtained by summing over $\mu\in M$ due to (A.16).

Moreover, the inequality $x + Q \ge 2.5$ yields
\[
\bigl(1 + \sqrt{x + Q + \mathtt{t}(\mu)}\bigr)^2 \le 2\bigl( x + Q + \mathtt{t}(\mu) \bigr).
\]
This allows taking in (A.14) $u(\mu,\upsilon) = 3\nu_0\mu^2 r^2/2 + 2\mathtt{t}(\mu)$ and $z(x) = 2(x+Q)$.

Corollary A.7. Suppose $(Er)$ and (A.16). Then for any $x$ with $x + Q \ge 2.5$, there exists a random set $\Omega(x)$ of total probability at least $1 - 2e^{-x}$ such that it holds on $\Omega(x)$, for any $r$,
\[
\sup_{\mu\in M(r,x)}\ \sup_{\upsilon\in B_r(\upsilon_0)} \Bigl\{ \frac{\mu}{3\nu_0} U(\upsilon) - \frac{1}{2}\mu^2 r^2 - 2\mathtt{t}(\mu) \Bigr\} < 2(x+Q).
\]
Now we briefly discuss the hitting problem. Let $M(\upsilon)$ be a deterministic boundary function. We aim at bounding the probability that the process $U(\upsilon)$ hits this boundary on the set $\Upsilon$, that is, the probability that $\sup_{\upsilon\in\Upsilon}\{U(\upsilon) - M(\upsilon)\} \ge 0$. An important observation here is that multiplication by any positive factor $\mu$ does not change this relation. This allows applying the multiscale result of Theorem A.6. For any fixed $x$ and any $\upsilon\in B_r(\upsilon_0)$, define
\[
M^*(\upsilon) \stackrel{\mathrm{def}}{=} \sup_{\mu\in M(r,x)} \Bigl\{ \frac{1}{3\nu_0}\,\mu M(\upsilon) - \frac{1}{2}\mu^2 r^2 - 2\mathtt{t}(\mu) \Bigr\}.
\]
Theorem A.8. Suppose $(Er)$, (A.16), and $x + Q \ge 2.5$. Let, for the given $x$, it hold
\[
M^*(\upsilon) \ge 2(x+Q), \qquad \upsilon\in\Upsilon. \tag{A.17}
\]
Then
\[
\mathbb{P}\Bigl( \sup_{\upsilon\in\Upsilon} \bigl\{ U(\upsilon) - M(\upsilon) \bigr\} \ge 0 \Bigr) \le 2e^{-x}.
\]
Maximizing the expression $(3\nu_0)^{-1}\mu M(\upsilon) - \mu^2 r^2/2$ with respect to $\mu$ suggests the choice $\mu = M(\upsilon)/(3\nu_0 r^2)$, yielding $M^*(\upsilon) \ge M^2(\upsilon)/(6\nu_0^2 r^2) - 2\mathtt{t}(\mu)$. In particular, condition (A.17) requires that $M(\upsilon)$ grows with $r$ a bit faster than a linear function.

A.5 Finite-dimensional smooth case

Here we discuss the special case when $\Upsilon$ is an open subset of $\mathbb{R}^p$, the stochastic process $U(\upsilon)$ is absolutely continuous, and its gradient $\nabla U(\upsilon) \stackrel{\mathrm{def}}{=} dU(\upsilon)/d\upsilon$ has bounded exponential moments.

$(ED)$ There exist $\mathtt{g} > 0$, $\nu_0\ge 1$, and for each $\upsilon\in\Upsilon$ a symmetric non-negative matrix $H(\upsilon)$ such that for any $\lambda\le\mathtt{g}$ and any unit vector $\gamma\in\mathbb{R}^p$ it holds
\[
\log \mathbb{E} \exp\Bigl\{ \lambda\,\frac{\gamma^\top\nabla U(\upsilon)}{\|H(\upsilon)\gamma\|} \Bigr\} \le \nu_0^2\lambda^2/2.
\]
A natural candidate for $H^2(\upsilon)$ is the covariance matrix $\operatorname{Var}\bigl(\nabla U(\upsilon)\bigr)$, provided that this matrix is well posed. Then the constant $\nu_0$ can be taken close to one by reducing the value $\mathtt{g}$; see Lemma A.17 below.

In what follows we fix a subset $\Upsilon^\circ$ of $\Upsilon$ and establish a bound for the maximum of the process $U(\upsilon,\upsilon^\circ) = U(\upsilon) - U(\upsilon^\circ)$ on $\Upsilon^\circ$ for a fixed point $\upsilon^\circ$. We assume the existence of a dominating matrix $H^* = H^*(\Upsilon^\circ)$ such that $H(\upsilon)\le H^*$ for all $\upsilon\in\Upsilon^\circ$. We also assume that $\pi$ is the Lebesgue measure on $\Upsilon$. First we show that the differentiability condition $(ED)$ implies $(Ed)$.

Lemma A.9. Assume that $(ED)$ holds with some $\mathtt{g}$ and $H(\upsilon)\le H^*$ for $\upsilon\in\Upsilon^\circ$. Consider any $\upsilon,\upsilon^\circ\in\Upsilon^\circ$.
Then it holds for $|\lambda|\le\mathtt{g}$
\[
\log \mathbb{E} \exp\Bigl\{ \lambda\,\frac{U(\upsilon,\upsilon^\circ)}{\|H^*(\upsilon-\upsilon^\circ)\|} \Bigr\} \le \frac{\nu_0^2\lambda^2}{2}.
\]
Proof. Denote $\delta = \|\upsilon-\upsilon^\circ\|$ and $\gamma = (\upsilon-\upsilon^\circ)/\delta$. Then
\[
U(\upsilon,\upsilon^\circ) = \delta\,\gamma^\top \int_0^1 \nabla U(\upsilon^\circ + t\delta\gamma)\,dt
\]
and $\|H^*(\upsilon-\upsilon^\circ)\| = \delta\|H^*\gamma\|$. Now the Hölder inequality and $(ED)$ yield
\[
\mathbb{E} \exp\Bigl\{ \lambda\,\frac{U(\upsilon,\upsilon^\circ)}{\|H^*(\upsilon-\upsilon^\circ)\|} - \frac{\nu_0^2\lambda^2}{2} \Bigr\}
= \mathbb{E} \exp\Bigl\{ \int_0^1 \Bigl( \lambda\,\frac{\gamma^\top\nabla U(\upsilon^\circ + t\delta\gamma)}{\|H^*\gamma\|} - \frac{\nu_0^2\lambda^2}{2} \Bigr) dt \Bigr\}
\le \int_0^1 \mathbb{E} \exp\Bigl\{ \lambda\,\frac{\gamma^\top\nabla U(\upsilon^\circ + t\delta\gamma)}{\|H^*\gamma\|} - \frac{\nu_0^2\lambda^2}{2} \Bigr\} dt \le 1,
\]
as required.

The result of Lemma A.9 enables us to define $d(\upsilon,\upsilon') = \|H^*(\upsilon-\upsilon')\|$, so that the corresponding ball coincides with the ellipsoid $B(r,\upsilon^\circ)$. Now we bound the value $Q(\Upsilon^\circ)$ for $\Upsilon^\circ = B(r_0,\upsilon^\circ)$.

Lemma A.10. Let $\Upsilon^\circ = B(r_0,\upsilon^\circ)$. Under the conditions of Lemma A.9, it holds $Q(\Upsilon^\circ)\le c_1 p$, where $c_1 = 2$ for $p\ge 2$ and $c_1 = 2.7$ for $p = 1$.

Proof. The set $\Upsilon^\circ$ coincides with the ellipsoid $B(r_0,\upsilon^\circ)$, while the $d$-ball $B_k(\upsilon)$ coincides with the ellipsoid $B(r_k,\upsilon)$ for each $k\ge 2$. By a change of variables, the study can be reduced to the case with $\upsilon^\circ = 0$, $H^* \equiv I_p$, $r_0 = 1$, so that $B(r,\upsilon)$ is the usual Euclidean ball in $\mathbb{R}^p$ of radius $r$. It is obvious that the measure of the overlap of the two balls $B(1,0)$ and $B(2^{-k+1},\upsilon)$ for $\|\upsilon\|\le 1$ is minimized when $\|\upsilon\| = 1$, and this value is the same for all such $\upsilon$. Now we use the following observation. Fix $\upsilon^\sharp$ with $\|\upsilon^\sharp\| = 1$. Let $r\le 1$, $\upsilon^\flat = (1-r^2/2)\upsilon^\sharp$ and $r^\flat = r - r^2/2$. If $\upsilon\in B(r^\flat,\upsilon^\flat)$, then $\upsilon\in B(r,\upsilon^\sharp)$ because
\[
\|\upsilon^\sharp - \upsilon\| \le \|\upsilon^\sharp - \upsilon^\flat\| + \|\upsilon - \upsilon^\flat\| \le r^2/2 + r - r^2/2 \le r.
\]
Moreover, for each $\upsilon\in B(r^\flat,\upsilon^\flat)$, it holds with $u = \upsilon - \upsilon^\flat$
\[
\|\upsilon\|^2 = \|\upsilon^\flat\|^2 + \|u\|^2 + 2u^\top\upsilon^\flat \le (1-r^2/2)^2 + |r^\flat|^2 + 2u^\top\upsilon^\flat \le 1 + 2u^\top\upsilon^\flat.
\]
This means that either $\upsilon = \upsilon^\flat + u$ or $\upsilon^\flat - u$ belongs to the ball $B(r_0,\upsilon^\circ)$, and thus $\pi\bigl(B(1,0)\cap B(r,\upsilon)\bigr) \ge \pi\bigl(B(r^\flat,\upsilon^\flat)\bigr)/2$. We conclude that
\[
\frac{\pi\bigl(B(1,0)\bigr)}{\pi\bigl(B(1,0)\cap B(r,\upsilon^\sharp)\bigr)} \le \frac{2\pi\bigl(B(1,0)\bigr)}{\pi\bigl(B(r^\flat,0)\bigr)} = 2(r - r^2/2)^{-p}.
\]
This implies for $k\ge 1$ and $r_k = 2^{-k+1}$ that $2M_{k+1}\le 2^{2+kp}(1-2^{-k-1})^{-p}$. The quantity $Q(\Upsilon^\circ)$ can now be evaluated as
\[
Q(\Upsilon^\circ) \le \frac{1}{3}\log(2^{2+p}) + \frac{2}{3}\sum_{k=1}^\infty 2^{-k}\log(2^{2+kp}) - \frac{2p}{3}\sum_{k=1}^\infty 2^{-k}\log(1-2^{-k-1})
= \frac{\log 2}{3}\Bigl[ 2+p + 2\sum_{k=1}^\infty (2+kp)2^{-k} \Bigr] - \frac{2p}{3}\sum_{k=1}^\infty 2^{-k}\log(1-2^{-k-1}) \le c_1 p,
\]
where $c_1 = 2$ for $p\ge 2$ and $c_1 = 2.7$ for $p = 1$, and the result follows.

Now we specify the local bound of Theorem A.1 and the central result of Corollary A.5 to the smooth case.

Theorem A.11. Suppose $(ED)$. For any $\lambda\le\nu_0\mathtt{g}$, $r_0 > 0$, and $\upsilon^\circ\in\Upsilon$
\[
\log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{3\nu_0 r_0}\,\sup_{\upsilon\in B(r_0,\upsilon^\circ)} \bigl( U(\upsilon) - U(\upsilon^\circ) \bigr) \Bigr\} \le \lambda^2/2 + Q,
\]
where $Q = c_1 p$.

We consider local sets of the elliptic form $\Upsilon_\circ(r) \stackrel{\mathrm{def}}{=} \{\upsilon: \|H_0(\upsilon-\upsilon_0)\|\le r\}$, where $H_0$ dominates $H(\upsilon)$ on this set: $H(\upsilon)\le H_0$.

Theorem A.12. Let $(ED)$ hold with some $\mathtt{g}$ and a matrix $H(\upsilon)$. Suppose that $H(\upsilon)\le H_0$ for all $\upsilon\in\Upsilon_\circ(r)$. Then
\[
\mathbb{P}\Bigl( \sup_{\upsilon\in\Upsilon_\circ(r)} \Bigl\{ \frac{1}{3\nu_0} U(\upsilon,\upsilon_0) - \frac{1}{2}\|H_0(\upsilon-\upsilon_0)\|^2 \Bigr\} \ge z_0(x,p) \Bigr) \le \exp(-x), \tag{A.18}
\]
where $z_0(x,p)$ coincides with $z_0(x,Q)$ from (A.13) for $Q = c_1 p$.

Remark A.2. An important feature of the established result is that the bound on the right hand side of (A.18) does not depend on the value $r$ describing the radius of the local vicinity around the central point $\upsilon_0$. In the ideal case one can apply this result with $r = \infty$, provided that the condition $H(\upsilon)\le H_0$ is fulfilled uniformly over $\Upsilon$.

Proof. Lemma A.9 implies $(Ed)$ with $d(\upsilon,\upsilon_0) = \|H_0(\upsilon-\upsilon_0)\|$, and Lemma A.10 yields $Q \le c_1 p$. Now the result follows from Corollary A.5.

The global result of Theorem A.6 applies to the smooth case without changes.
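To give a feeling for the magnitude of the bound (A.18), one can look at the simplest smooth case of a linear Gaussian process $U(\upsilon) = \upsilon^\top\zeta$ with $\zeta\sim\mathcal{N}(0,H_0^2)$: there the supremum over $r=\infty$ equals $\|H_0^{-1}\zeta\|^2/2\sim\chi^2_p/2$, which is of order $p$, in line with $Q = c_1 p$. The Monte-Carlo sketch below is an illustration only; the specific $H_0$, the value $x$, and the use of the first branch of (A.13) with $c_1 = 2$ are toy choices made for this example.

```python
import numpy as np

# Toy check of the message of Theorem A.12 for U(upsilon) = upsilon^T zeta,
# zeta ~ N(0, H0^2): sup_upsilon { U(upsilon,0) - ||H0 upsilon||^2/2 }
# equals ||H0^{-1} zeta||^2 / 2, i.e. chi^2_p / 2.
rng = np.random.default_rng(1)
p, n_mc = 10, 20000
H0 = np.diag(np.linspace(1.0, 3.0, p))        # an arbitrary positive definite H0
zeta = rng.normal(size=(n_mc, p)) @ H0        # rows distributed as N(0, H0^2)
sup_vals = 0.5 * np.sum((zeta @ np.linalg.inv(H0))**2, axis=1)
x = 2.0
z0 = (1.0 + np.sqrt(x + 2 * p))**2            # first branch of (A.13) with Q = 2p
print("mean of the supremum:", sup_vals.mean())            # about p/2
print("P(sup >= z0):", np.mean(sup_vals >= z0), "  vs  exp(-x) =", np.exp(-x))
```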
A.6 Roughness constraints for dimension reduction

The local bounds of Theorems A.1 and A.4 can be extended in several directions. Here we briefly discuss one extension related to the use of a smoothness condition on the parameter $\upsilon$. Let $\mathtt{t}(\upsilon)$ be a non-negative penalty function on $\Upsilon$. A particular example of such a penalty function is the roughness penalty $\mathtt{t}(\upsilon) = \|G\upsilon\|^2$ for a given $p\times p$ matrix $G$. Let $t_0\ge 1$ be fixed. Redefine the sets $B_r(\upsilon^\circ)$ by adding the constraint $\mathtt{t}(\upsilon)\le t_0$:
\[
B_r(\upsilon^\circ) = \bigl\{ \upsilon\in\Upsilon:\ d(\upsilon,\upsilon^\circ)\le r,\ \mathtt{t}(\upsilon)\le t_0 \bigr\},
\]
and consider $\Upsilon^\circ = B_{r_\circ}(\upsilon^\circ)$ for a fixed central point $\upsilon^\circ$ and radius $r_\circ$. One can easily check that the results of Theorems A.1 and A.4 and their corollaries extend to this situation without any change. The only difference is in the definition of the values $Q(\Upsilon^\circ)$ and $Q$. Each value $Q(\Upsilon^\circ)$ is defined via the quantities $\pi_k(\upsilon) = \pi\bigl(B_{r_k}(\upsilon)\bigr)$, which obviously change when each ball $B_r(\upsilon)$ is redefined. The examples below show that the use of the penalization can substantially reduce the value $Q$.

Now we specify the results to the case of a smooth process $U$ given on a local ball $\Upsilon^\circ = B_{r_\circ}(\upsilon^\circ)$ defined by the condition $\{\|H_0(\upsilon-\upsilon^\circ)\|\le r_\circ\}$ and a smoothness constraint $\|G\upsilon\|^2\le t_0 = r_\circ^2$. The local sets $B_r(\upsilon)$ are of the form
\[
B_r(\upsilon) \stackrel{\mathrm{def}}{=} \bigl\{ \upsilon':\ \|H_0(\upsilon-\upsilon')\|\le r,\ \|G\upsilon'\|\le r_\circ \bigr\}. \tag{A.19}
\]
The effective dimension $p_e = p_e(S)$ can be defined as the dimension of the subspace on which $H_0\ge G$. The formal definition uses the spectral decomposition of the matrix $S = H_0^{-1}G^2H_0^{-1}$. Let $g_1\le g_2\le\dots\le g_p$ be the eigenvalues of $S$. Define $p_e(S)$ as the largest index $j$ for which $g_j < 1$:
\[
p_e(S) \stackrel{\mathrm{def}}{=} \max\{ j\ge 1:\ g_j < 1 \}. \tag{A.20}
\]
In the non-penalized case, the entropy term $Q$ is proportional to the dimension $p$. The roughness penalty enables reducing $p$ to the effective dimension $p_e(S)$, which can be much smaller than $p$ depending on the relation between the matrices $H_0$ and $G$. More precisely, if the eigenvalues $g_j$ of $S$ grow sufficiently fast, the entropy calculus effectively reduces to the coordinates with $g_j\le 1$.

Lemma A.13. Let $g_1 = 0$. For each $r_\circ\ge 1$, it holds
\[
Q\bigl(\Upsilon_\circ(r_\circ)\bigr) \le e_1 p_s \tag{A.21}
\]
with $p_s = p_s(S)$ defined by
\[
p_s(S) \stackrel{\mathrm{def}}{=} p_e(S) + \sum_{j=1}^p g_j^{-1}\log_+(g_j). \tag{A.22}
\]
If the sum $\sum_{j\ge 1} g_j^{-1}\log_+(g_j)$ is bounded by a fixed constant, then the value $p_s$ is close to the effective dimension $p_e(S)$ from (A.20).

Proof. We follow the proof of Lemma A.10. By a change of variables one can reduce the problem to the case when $H_0$ is the identity matrix and $r_\circ \equiv 1$. Moreover, one can easily see that $\upsilon^\circ = 0$ is the hardest case; the case $\upsilon^\circ\ne 0$ can be treated similarly. By a further change of variables, the matrix $S = G^2$ can be represented in diagonal form, $S = \operatorname{diag}\{g_1,\dots,g_p\}$. Let $\upsilon^\sharp$ be any point with $\|\upsilon^\sharp\|\le 1$ and $\|G\upsilon^\sharp\|\le 1$, and let $r\le 1$. Simple arguments show that the measure of the set $B_r(\upsilon^\sharp)$ over all such vectors $\upsilon^\sharp$ is minimized at $\upsilon^\sharp = (1,0,\dots,0)^\top$. Define $r^\flat = r - r^2/2$ and $\upsilon^\flat = (1-r^2/2)\upsilon^\sharp$. Fix any $\upsilon$ such that $u = \upsilon - \upsilon^\flat$ fulfills $\|u\|\le r^\flat$, $\|Gu\|\le r$, and $u_1 < 0$. As $G\upsilon^\flat = 0$, it holds
\[
\|G\upsilon\| = \|Gu\| \le r \le 1,
\qquad
\|\upsilon - \upsilon^\sharp\| \le \|\upsilon - \upsilon^\flat\| + \|\upsilon^\flat - \upsilon^\sharp\| \le r^\flat + r^2/2 = r,
\]
\[
\|\upsilon\|^2 = \|\upsilon^\flat + u\|^2 = \|\upsilon^\flat\|^2 + \|u\|^2 + 2u^\top\upsilon^\flat \le (1-r^2/2)^2 + |r^\flat|^2 < 1.
\]
This yields $\pi\bigl(B_1(0)\cap B_r(\upsilon^\sharp)\bigr) \ge \pi\bigl(B_{r^\flat}(0)\bigr)/2$. Moreover, let the index $p_e(r)$ be defined as the largest $j$ with $r g_j < 1$. Consider any $\upsilon\in B_1(0)$ and construct another point $\upsilon(r)$ by multiplying by $r$ every coordinate $\upsilon_j$ with $j\le p_e(r)$. The construction ensures that $\upsilon(r)\in B_r(0)$.
This implies
\[
\frac{\pi\bigl(B_1(0)\bigr)}{\pi\bigl(B_1(0)\cap B_r(\upsilon^\sharp)\bigr)} \le \frac{2\pi\bigl(B_1(0)\bigr)}{\pi\bigl(B_{r^\flat}(0)\bigr)} \le 2|r^\flat|^{-p_e(r)}.
\]
Application of this bound for $k\ge 1$, $r_{k+1} = 2^{-k}$, and $p_k = p_e(r_{k+1})$ yields that $2M_{k+1}\le 2^{2+kp_k}(1-2^{-k-1})^{-p_k}$. The quantity $Q(\Upsilon^\circ)$ can now be evaluated as
\[
Q(\Upsilon^\circ) \le \frac{1}{3}\log(2^{2+p_e}) + \frac{2}{3}\sum_{k=1}^\infty 2^{-k}\log(2^{2+kp_k}) - \frac{2}{3}\sum_{k=1}^\infty 2^{-k} p_k\log(1-2^{-k-1}).
\]
Further, for each $g > 1$, it holds with $k(g) = \max\{k: g < 2^k\}$
\[
s(g) \stackrel{\mathrm{def}}{=} \sum_{k=1}^\infty 2^{-k+1} k\, \mathbb{1}\bigl(2^{-k} g < 1\bigr) \le 2k(g)2^{-k(g)} \le 2k(g)/g \le 2g^{-1}\log_2(2g).
\]
Thus
\[
\sum_{k=1}^\infty 2^{-k} k\, p_k \le \sum_{k=1}^\infty 2^{-k} k \sum_{j=1}^p \mathbb{1}\bigl(2^{-k} g_j < 1\bigr) \le \frac{1}{2}\sum_{j=1}^p s(g_j).
\]
This easily implies the result (A.21); cf. the proof of Lemma A.10.

The first result adjusts Theorem A.11 to the penalized case. The maximum of the process $U$ is taken over a ball $B_r(\upsilon)$ from (A.19), which is smaller than the similar ball in the non-penalized case. This explains the gain in the entropy term $Q$.

Theorem A.14. Let $(ED)$ hold with some $\mathtt{g}$ and a matrix $H(\upsilon)\le H_0$ for all $\upsilon\in\Upsilon$. Then for any $\lambda\le\nu_0\mathtt{g}$, $r_\circ\ge 1$, and $\upsilon^\circ\in\Upsilon$
\[
\log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{3\nu_0 r_\circ}\,\sup_{\upsilon\in B_{r_\circ}(\upsilon^\circ)} \bigl( U(\upsilon) - U(\upsilon^\circ) \bigr) \Bigr\} \le \lambda^2/2 + Q,
\]
where $B_{r_\circ}(\upsilon^\circ)$ is given by (A.19), $Q = e_1 p_s$, and $p_s$ is the effective dimension from (A.22).

Proof. The result follows from Corollary A.5. It is only required to evaluate the local entropy $Q(\Upsilon_\circ(r_\circ))$. This is done in Lemma A.13.

The magnitude of the process $U$ over $B_{r_\circ}(\upsilon^\circ)$ is of order $r_\circ$ and grows with $r_\circ$. The use of the negative drift allows establishing a unified result.

Theorem A.15. Let $r_\circ\ge 1$ be fixed and let $(ED)$ hold with some $\mathtt{g}$ and a matrix $H(\upsilon)\le H_0$ for all $\upsilon\in B_{r_\circ}(\upsilon_0)$. Then
\[
\mathbb{P}\Bigl( \sup_{\upsilon\in B_{r_\circ}(\upsilon_0)} \Bigl\{ \frac{1}{3\nu_0} U(\upsilon,\upsilon_0) - \frac{1}{2}\bigl\| H_0(\upsilon-\upsilon_0) \bigr\|^2 \Bigr\} \ge z(x,Q) \Bigr) \le \exp(-x),
\]
where $z(x,Q)$ is given by (A.13) with $Q = e_1 p_s$.

The result of Theorem A.6 for the non-penalized case applies without big changes to the penalized case.

A.7 Auxiliary facts

Lemma A.16. For any r.v.'s $\xi_k$ and any $\lambda_k\ge 0$ such that $\Lambda = \sum_k \lambda_k \le 1$,
\[
\log \mathbb{E} \exp\Bigl\{ \sum_k \lambda_k\xi_k \Bigr\} \le \sum_k \lambda_k \log \mathbb{E} e^{\xi_k}.
\]
Proof. Convexity of $e^x$ and concavity of $x^\Lambda$ imply
\[
\mathbb{E} \exp\Bigl\{ \sum_k \lambda_k\bigl(\xi_k - \log \mathbb{E} e^{\xi_k}\bigr) \Bigr\}
\le \Bigl[ \mathbb{E} \exp\Bigl\{ \frac{1}{\Lambda}\sum_k \lambda_k\bigl(\xi_k - \log \mathbb{E} e^{\xi_k}\bigr) \Bigr\} \Bigr]^{\Lambda}
\le \Bigl[ \frac{1}{\Lambda}\sum_k \lambda_k\, \mathbb{E} \exp\bigl(\xi_k - \log \mathbb{E} e^{\xi_k}\bigr) \Bigr]^{\Lambda} = 1.
\]
Lemma A.17. Let a r.v. $\xi$ fulfill $\mathbb{E}\xi = 0$, $\mathbb{E}\xi^2 = 1$, and $\mathbb{E}\exp(\lambda_1|\xi|) = \varkappa < \infty$ for some $\lambda_1 > 0$. Then for any $\varrho < 1$ there is a constant $C_1$ depending on $\varkappa$, $\lambda_1$, and $\varrho$ only such that for $\lambda < \varrho\lambda_1$
\[
\log \mathbb{E} e^{\lambda\xi} \le C_1\lambda^2/2.
\]
Moreover, there is a constant $\lambda_2 > 0$ such that for all $\lambda\le\lambda_2$
\[
\log \mathbb{E} e^{\lambda\xi} \ge \varrho\lambda^2/2.
\]
Proof. Define $h(x) = (\lambda - \lambda_1)x + m\log(x)$ for $m\ge 0$ and $\lambda < \lambda_1$. It is easy to see by simple algebra that
\[
\max_{x\ge 0} h(x) = -m + m\log\frac{m}{\lambda_1-\lambda}.
\]
Therefore, for any $x\ge 0$
\[
\lambda x + m\log(x) \le \lambda_1 x + \log\Bigl( \frac{m}{e(\lambda_1-\lambda)} \Bigr)^{m} .
\]
This implies for all $\lambda < \lambda_1$
\[
\mathbb{E} |\xi|^m \exp(\lambda|\xi|) \le \Bigl( \frac{m}{e(\lambda_1-\lambda)} \Bigr)^{m} \mathbb{E} \exp(\lambda_1|\xi|).
\]
Suppose now that for some $\lambda_1 > 0$ it holds $\mathbb{E}\exp(\lambda_1|\xi|) = \varkappa(\lambda_1) < \infty$. Then the function $h_0(\lambda) = \mathbb{E}\exp(\lambda\xi)$ fulfills $h_0(0) = 1$, $h_0'(0) = \mathbb{E}\xi = 0$, $h_0''(0) = 1$, and for $\lambda < \lambda_1$
\[
h_0''(\lambda) = \mathbb{E}\xi^2 e^{\lambda\xi} \le \mathbb{E}\xi^2 e^{\lambda|\xi|} \le \frac{1}{(\lambda_1-\lambda)^2}\, \mathbb{E}\exp(\lambda_1|\xi|).
\]
This implies by the Taylor expansion, for $\lambda < \varrho\lambda_1$, that $h_0(\lambda) \le 1 + C_1\lambda^2/2$ with $C_1 = \varkappa(\lambda_1)/\{\lambda_1^2(1-\varrho)^2\}$, and hence $\log h_0(\lambda) \le C_1\lambda^2/2$.

References

Andresen, A. and Spokoiny, V. (2013). Critical dimension in profile semiparametric estimation. Manuscript. arXiv:1303.4640.

Baraud, Y. (2010). A Bernstein-type inequality for suprema of random processes with applications to model selection in non-Gaussian regression. Bernoulli, 16(4):1064-1085.

Bednorz, W. (2006).
A theorem on majorizing measures. Ann. Probab., 34(5):1771-1781.

Birgé, L. (2006). Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Annales de l'Institut Henri Poincaré (B) Probability and Statistics.

Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Relat. Fields, 97(1-2):113-150.

Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329-375.

Boucheron, S., Lugosi, G., and Massart, P. (2003). Concentration inequalities using the entropy method. Ann. Probab., 31(3):1583-1614.

Ibragimov, I. and Khas'minskij, R. (1981). Statistical estimation. Asymptotic theory. Transl. from the Russian by Samuel Kotz. New York - Heidelberg - Berlin: Springer-Verlag.

Le Cam, L. (1960). Locally asymptotically normal families of distributions. Certain approximations to families of distributions and their use in the theory of estimation and testing hypotheses. Univ. California Publ. Stat., 3:37-98.

Le Cam, L. and Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts. Springer-Verlag, New York.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. 2nd ed. Monographs on Statistics and Applied Probability, 37. London: Chapman and Hall.

Spokoiny, V., Wang, W., and Härdle, W. (2013). Local quantile regression (with rejoinder). J. of Statistical Planning and Inference. To appear. ArXiv:1208.5384.

Talagrand, M. (1996). Majorizing measures: the generic chaining. Ann. Probab., 24(3):1049-1103.

Talagrand, M. (2001). Majorizing measures without measures. Ann. Probab., 29(1):411-417.

Talagrand, M. (2005). The Generic Chaining. Upper and Lower Bounds of Stochastic Processes. Springer Monographs in Mathematics. Springer-Verlag, Berlin.

Van de Geer, S. (1993). Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Stat., 21(1):14-44.

van der Vaart, A. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. With Applications to Statistics. Springer Series in Statistics. New York: Springer.