Parametric estimation. Finite sample theory

Vladimir Spokoiny∗
Weierstrass-Institute, Mohrenstr. 39, 10117 Berlin, Germany
Humboldt University Berlin,
Moscow Institute of Physics and Technology
[email protected]
Abstract
The paper aims at reconsidering the famous Le Cam LAN theory. The main features of the approach which make it different from the classical one are: (1) the study is non-asymptotic, that is, the sample size is fixed and does not tend to infinity; (2) the parametric assumption is possibly misspecified and the underlying data distribution can lie beyond the given parametric family. These two features enable us to bridge the gap between parametric and nonparametric theory and to build a unified framework for statistical estimation. The main results include a large deviation bound for the (quasi) maximum likelihood and a local quadratic bracketing result for the log-likelihood process. The latter yields a number of important corollaries for statistical inference: concentration, confidence and risk bounds, an expansion of the maximum likelihood estimate, etc. All these corollaries are stated in a non-classical way admitting model misspecification and finite samples. However, the classical asymptotic results including the efficiency bounds can be easily derived as corollaries of the obtained non-asymptotic statements. At the same time, the new bracketing device works well in situations with large or growing parameter dimension in which the classical parametric theory fails. The general results are illustrated for the i.i.d. set-up as well as for generalized linear modeling and median estimation. The results apply for any dimension of the parameter space and provide a quantitative lower bound on the sample size yielding the root-n accuracy.
∗ The author is supported by the Predictive Modeling Laboratory, MIPT, RF government grant ag. 11.G34.31.0073. Financial support by the German Research Foundation (DFG) through the Collaborative Research Center 649 “Economic Risk” is gratefully acknowledged. Criticism and suggestions of two anonymous referees helped a lot in improving the paper.
AMS 2000 Subject Classification: Primary 62F10. Secondary 62J12, 62F25, 62H12.
Keywords: maximum likelihood, local quadratic bracketing, deficiency, concentration
1 Introduction
One of the most popular approaches in statistics is based on the parametric assumption (PA) that the distribution IP of the observed data Y belongs to a given parametric family (IPθ, θ ∈ Θ ⊆ IR^p), where p stands for the number of parameters. This assumption allows one to reduce the problem of statistical inference about IP to recovering the parameter θ. The theory of parameter estimation and inference is nicely developed in a quite general set-up, and there is a vast literature on this issue. We only mention the book by Ibragimov and Khas’minskij (1981), which provides a comprehensive study of the asymptotic properties of maximum likelihood and Bayesian estimators. The theory is essentially based on two major assumptions: (1) the underlying data distribution follows the PA; (2) the sample size or the amount of available information is large relative to the number of parameters.
In many practical applications, both assumptions can be very restrictive and limit the scope of applicability of the whole approach. Indeed, the PA is usually only an approximation of the real data distribution, and in most statistical problems it is too restrictive to assume that the PA is exactly fulfilled. Many modern statistical problems deal with very complex high dimensional data where a huge number of parameters is involved. In such situations, the applicability of large sample asymptotics is questionable. These two issues partially explain why the parametric and nonparametric theories are almost isolated from each other. Relaxing these restrictive assumptions can be viewed as an important challenge of modern statistical theory. The present paper attempts to develop a unified approach which does not require the restrictive parametric assumptions but still enjoys the main benefits of the parametric theory. The main steps of the approach are similar to the classical local asymptotic normality (LAN) theory; see e.g. Chapters 1–3 in the monograph Ibragimov and Khas’minskij (1981): first one localizes the problem to a neighborhood of the target parameter; then one uses a local quadratic expansion of the log-likelihood to solve the corresponding estimation problem. There is, however, one feature of the proposed approach which makes it essentially different from the classical scheme. Namely, the use of a bracketing device instead of the classical Taylor expansion allows one to consider much larger local neighborhoods than in the LAN theory. More specifically, the classical LAN theory effectively requires a strict localization to a root-n vicinity of the true point. At this point, the LAN theory fails in extending to the nonparametric situation. Our approach works for any local vicinity of the true point. This opens the door to building a unified theory including most of the classical parametric and nonparametric results.
Let Y stand for the available data. Everywhere below we assume that the observed data Y follow a distribution IP on a metric space Y. We do not specify any particular structure of Y; in particular, no assumption like independence or weak dependence of individual observations is imposed. The basic parametric assumption is that IP can be approximated by a parametric distribution IPθ from a given parametric family (IPθ, θ ∈ Θ ⊆ IR^p). Our approach allows the PA to be misspecified, that is, in general, IP ∉ (IPθ).
Let L(Y, θ) be the log-likelihood for the considered parametric model: L(Y, θ) = log (dIPθ/dµ₀)(Y), where µ₀ is any dominating measure for the family (IPθ). We focus on the properties of the process L(Y, θ) as a function of the parameter θ. Therefore, we suppress the argument Y there and write L(θ) instead of L(Y, θ). One has to keep in mind that L(θ) is random and depends on the observed data Y. By L(θ, θ*) ≝ L(θ) − L(θ*) we denote the log-likelihood ratio. The classical likelihood principle suggests estimating θ by maximizing the corresponding log-likelihood function L(θ):

    θ̃ ≝ argmax_{θ∈Θ} L(θ).   (1.1)
Our ultimate goal is to study the properties of the quasi maximum likelihood estimator (MLE) θ̃. It turns out that such properties can be naturally described in terms of the maximum of the process L(θ) rather than the point of maximum θ̃. To avoid technical burdens, it is assumed that the maximum is attained, leading to the identity max_{θ∈Θ} L(θ) = L(θ̃). However, the point of maximum does not have to be unique; if there are many such points, we take θ̃ as any of them. Basically, the notation θ̃ is used for the identity L(θ̃) = sup_{θ∈Θ} L(θ).
If IP ∉ (IPθ), then the (quasi) MLE θ̃ from (1.1) is still meaningful, and it appears to be an estimator of the value θ* defined by maximizing the expected value of L(θ):

    θ* ≝ argmax_{θ∈Θ} IEL(θ),   (1.2)

which is the true value in the parametric situation and can be viewed as the parameter of the best parametric fit in the general case.
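To make the distinction between θ̃ and θ* concrete, here is a minimal numerical sketch (an illustration added to this transcription, not part of the original paper; the lognormal/exponential pair and all settings are assumptions chosen for the example). An exponential model is fitted to lognormal data: θ̃ from (1.1) maximizes the empirical log-likelihood, while θ* from (1.2) maximizes its expectation and acts as the best parametric fit.

    import numpy as np

    # Misspecified setup: the data are lognormal, while the fitted family is
    # Exp(theta) with density p(y, theta) = theta * exp(-theta * y), theta > 0.
    rng = np.random.default_rng(0)
    n = 2000
    Y = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # the true P is NOT in the family

    # Quasi MLE (1.1): L(theta) = sum_i (log theta - theta * Y_i),
    # whose argmax is explicit: theta_tilde = 1 / mean(Y).
    theta_tilde = 1.0 / Y.mean()

    # Target (1.2): IE L(theta) = n * (log theta - theta * IE Y1), so
    # theta* = 1 / IE Y1 = exp(-(mu + sigma^2 / 2)) for the lognormal P.
    theta_star = np.exp(-(0.0 + 0.5 * 1.0**2))

    print(f"quasi MLE theta_tilde = {theta_tilde:.4f}")
    print(f"best fit  theta*      = {theta_star:.4f}")
    # theta_tilde estimates theta*, not a "true" parameter of the family.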
The results below show that the main properties of the quasi MLE θ̃, like concentration or coverage probability, can be described in terms of the excess, which is the difference between the maximum of the process L(θ) and its value at the “true” point θ*:

    L(θ̃, θ*) ≝ L(θ̃) − L(θ*) = max_{θ∈Θ} L(θ) − L(θ*).
The established results can be split into two big groups. A large deviation bound states some concentration properties of the estimator θ̃. For specific local sets Θ₀(r) with elliptic shape, the deviation probability IP(θ̃ ∉ Θ₀(r)) is exponentially small in r. This concentration bound allows us to restrict the parameter space to a properly selected vicinity Θ₀(r). Our main results concern the local properties of the process L(θ) within Θ₀(r), including a bracketing bound and its corollaries.
The paper is organized as follows. Section 2 presents the list of conditions which are systematically used in the text. The conditions only concern the properties of the quasi log-likelihood process L(θ). Section 3 is central to the whole approach and focuses on the local properties of the process L(θ) within Θ₀(r). The idea is to sandwich the underlying (quasi) log-likelihood process L(θ) for θ ∈ Θ₀(r) between two quadratic (in the parameter) expressions. Then the maximum of L(θ) over Θ₀(r) is sandwiched as well by the maxima of the lower and upper processes. The quadratic structure of these processes helps to compute these maxima explicitly, yielding bounds for the value of the original problem. This approximation result is used to derive a number of corollaries including the concentration and coverage probability, an expansion of the estimator θ̃, polynomial risk bounds, etc. In contrast to the classical theory, all the results are non-asymptotic and do not involve any small terms of the form o(1); all the terms are specified explicitly. Also, the results are stated under possible model misspecification.
Section 4 complements the local results with the concentration property which bounds the probability that θ̃ deviates from the local set Θ₀(r). In the modern statistical literature there are a number of studies considering maximum likelihood or, more generally, minimum contrast estimators in a general i.i.d. situation, when the parameter set Θ is a subset of some functional space. We mention the papers van de Geer (1993), Birgé and Massart (1993), Birgé and Massart (1998), Birgé (2006) and references therein. The established results are based on deep probabilistic facts from empirical process theory; see e.g. Talagrand (1996, 2001, 2005), van der Vaart and Wellner (1996), Boucheron et al. (2003). The general result presented in Section A follows the generic chaining idea due to Talagrand (2005); cf. Bednorz (2006). However, we do not assume any specific structure of the model. In particular, we do not assume independent observations and thus cannot apply the most developed concentration bounds from the empirical process theory.
Section 5 illustrates the applicability of the general results in the classical case of an i.i.d. sample. The previously established general results apply under rather mild conditions. Basically we assume some smoothness of the log-likelihood process and some minimal number of observations per parameter: the sample size should be at least of the order of the dimensionality p of the parameter space. We also consider the examples of generalized linear modeling and of median regression.
It is important to mention that the non-asymptotic character of our study yields an almost complete change of the mathematical tools: the notions of convergence and tightness become meaningless, the arguments based on compactness of the parameter space do not apply, etc. Instead we utilize the tools of empirical process theory based on the ideas of concentration of measure and non-asymptotic entropy bounds. Section ?? in the Appendix presents an exponential bound for a general quadratic form which is very important for getting the sharp risk bounds for the quasi MLE. This bound is an important step in the concentration results for the quasi MLE. Section A explains how the generic chaining and majorizing measure device by Talagrand (2005), refined in Bednorz (2006), can be used for obtaining a general exponential bound for the log-likelihood process.
The proposed approach can be useful in many further research directions including penalized maximum likelihood and semiparametric estimation (Andresen and Spokoiny, 2013), contraction rates and asymptotic normality of the posterior within the Bayes approach (?), and local adaptive quantile estimation (Spokoiny et al., 2013).
2 Conditions
Below we collect the list of conditions which are systematically used in the text. It seems to be an advantage of the whole approach that all the results are stated in a unified way under the same conditions. Once these are checked, one automatically obtains all the established results. We do not try to formulate the conditions and the results in the most general form. In some cases we sacrifice generality in favor of readability and ease of presentation. It is important to stress that all the conditions only concern the properties of the quasi likelihood process L(θ). Even if the process L(·) is not a sufficient statistic, the whole analysis is entirely based on its geometric structure and probabilistic properties. The conditions are not restrictive and can be effectively checked in many particular situations. Some examples are given in Section 5 for the i.i.d. setup, generalized linear models, and median regression.
The imposed conditions can be classified into the following groups by their meaning:
• smoothness conditions on L(θ) allowing a second order Taylor expansion;
• exponential moment conditions;
• identifiability and regularity conditions.
We also distinguish between local and global conditions. The global conditions concern the global behavior of the process L(θ), while the local conditions focus on its behavior in the vicinity of the central point θ*. Below we suppose that the degree of locality is described by a number r. The local zone corresponds to r ≤ r₀ for a fixed r₀; the global conditions concern all r > 0.
2.1 Local conditions

Local conditions describe the properties of L(θ) in a vicinity of the central point θ* from (1.2).
To bound the local fluctuations of the process L(θ), we introduce an exponential moment condition on its stochastic component

    ζ(θ) ≝ L(θ) − IEL(θ).

Below we suppose that the random function ζ(θ) is differentiable in θ and its gradient ∇ζ(θ) = ∂ζ(θ)/∂θ ∈ IR^p has some exponential moments. Our first condition describes the property of the gradient ∇ζ(θ*) at the central point θ*.
(ED0) There exist a positive symmetric matrix V₀² and constants g > 0, ν₀ ≥ 1 such that Var{∇ζ(θ*)} ≤ V₀² and for all |λ| ≤ g

    sup_{γ∈IR^p} log IE exp{ λ γᵀ∇ζ(θ*) / ‖V₀γ‖ } ≤ ν₀²λ²/2.
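As a quick numerical sanity check of what (ED0) asserts (a sketch under assumptions made here, with a one-dimensional standard normal score, for which log IE exp(λZ) = λ²/2 exactly, so the bound holds with ν₀ = 1 for every g):

    import numpy as np

    # Monte Carlo check of a sub-Gaussian moment bound of the (ED0) type:
    # log IE exp(lambda * Z) <= nu0^2 * lambda^2 / 2 for a scalar score Z,
    # where Z stands in for gamma' grad zeta(theta*) / ||V0 gamma||.
    rng = np.random.default_rng(1)
    Z = rng.standard_normal(10**6)
    nu0 = 1.0

    for lam in [0.5, 1.0, 2.0]:
        lhs = np.log(np.mean(np.exp(lam * Z)))  # empirical log moment generating function
        rhs = nu0**2 * lam**2 / 2
        print(f"lambda={lam}: log IE exp = {lhs:.4f} <= bound {rhs:.4f}")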
In a typical situation, the matrix V₀² can be defined as the covariance matrix of the gradient vector ∇ζ(θ*): V₀² = Var{∇ζ(θ*)} = Var{∇L(θ*)}. If L(θ) is the log-likelihood for a correctly specified model, then θ* is the true parameter value and V₀² coincides with the corresponding Fisher information matrix. The matrix V₀ appearing in this condition determines the local geometry in the vicinity of θ*. In particular, define the local elliptic neighborhoods of θ* as

    Θ₀(r) ≝ {θ ∈ Θ : ‖V₀(θ − θ*)‖ ≤ r}.   (2.1)

The further conditions are restricted to the so-defined neighborhoods Θ₀(r).
(ED1) For each r ≤ r₀, there exists a constant ω(r) ≤ 1/2 such that for all θ ∈ Θ₀(r) and |λ| ≤ g

    sup_{γ∈IR^p} log IE exp{ λ γᵀ{∇ζ(θ) − ∇ζ(θ*)} / (ω(r)‖V₀γ‖) } ≤ ν₀²λ²/2.

Here the constant g is the same as in (ED0).
The main bracketing result also requires second order smoothness of the expected log-likelihood IEL(θ). By definition, L(θ*, θ*) ≡ 0 and ∇IEL(θ*) = 0 because θ* is the extreme point of IEL(θ). Therefore, −IEL(θ, θ*) can be approximated by a quadratic function of θ − θ* in a neighborhood of θ*. The local identifiability condition quantifies this quadratic approximation from above and from below on the set Θ₀(r) from (2.1).

(L0) There is a symmetric strictly positive-definite matrix D₀² such that for each r ≤ r₀ there is a constant δ(r) ≤ 1/2 for which it holds on the set Θ₀(r)

    | −2IEL(θ, θ*) / ‖D₀(θ − θ*)‖² − 1 | ≤ δ(r).
Usually D₀² is defined as the negative Hessian of IEL(θ) at θ*: D₀² = −∇²IEL(θ*). If L(θ, θ*) is the log-likelihood ratio and IP = IPθ*, then −IEL(θ, θ*) = IEθ* log(dIPθ*/dIPθ) = K(IPθ*, IPθ), the Kullback–Leibler divergence between IPθ* and IPθ. Then condition (L0) with D₀ = V₀ follows from the usual regularity conditions on the family (IPθ); cf. Ibragimov and Khas’minskij (1981). If the log-likelihood process L(θ) is sufficiently smooth in θ, e.g. three times stochastically differentiable, then the quantities ω(r) and δ(r) can be taken proportional to the value ϱ(r) defined as

    ϱ(r) ≝ max_{θ∈Θ₀(r)} ‖θ − θ*‖.

In the important special case of an i.i.d. model one can take ω(r) = ω*r/√n and δ(r) = δ*r/√n for some constants ω*, δ*; see Section 5.1.
The identifiability condition relates the matrices D₀² and V₀².

(I) There is a constant a > 0 such that a²D₀² ≥ V₀².
2.2 Global conditions
The global conditions have to be fulfilled for all θ lying beyond Θ₀(r₀). We only impose one condition on the smoothness of the stochastic component of the process L(θ) in terms of its gradient, and one identifiability condition in terms of the expectation IEL(θ, θ*). The first condition is similar to the local condition (ED0); it requires some exponential moment of the gradient ∇ζ(θ) for all θ ∈ Θ. However, the constant g may depend on the radius r = ‖V₀(θ − θ*)‖.

(Er) For any r, there exists a value g(r) > 0 such that for all |λ| ≤ g(r)

    sup_{θ∈Θ₀(r)} sup_{γ∈IR^p} log IE exp{ λ γᵀ∇ζ(θ) / ‖V₀γ‖ } ≤ ν₀²λ²/2.
The global identification property means that the deterministic component IEL(θ, θ*) of the log-likelihood is competitive with its variance Var L(θ, θ*).

(Lr) There is a function b(r) such that rb(r) monotonously increases in r and for each r ≥ r₀

    inf_{θ: ‖V₀(θ−θ*)‖=r} { −IEL(θ, θ*) } ≥ b(r)r².
3 Local inference
The Local Asymptotic Normality (LAN) condition, introduced in Le Cam (1960), has become one of the central notions in statistical theory. It postulates a kind of local approximation of the log-likelihood of the original model by the log-likelihood of a Gaussian shift experiment. Once checked, the LAN property yields a number of important corollaries for statistical inference. In words, if you can solve a statistical problem for the Gaussian shift model, the result can be translated under the LAN condition to the original setup. We refer to Ibragimov and Khas’minskij (1981) for a nice presentation of the LAN theory including the asymptotic efficiency of the MLE and of Bayes estimators. The LAN property was extended to mixed LAN and to Local Asymptotic Quadraticity (LAQ); see e.g. Le Cam and Yang (2000). All these notions are very much asymptotic and very much local. The LAN theory also requires that L(θ) is the correctly specified log-likelihood. The strict localization does not allow for considering a growing or infinite parameter dimension and limits the applications of the LAN theory to nonparametric estimation.
Our approach tries to avoid asymptotic constructions and attempts to include a possible model misspecification and a large dimension of the parameter space. The presentation below shows that such an extension of the LAN theory can be obtained essentially for free: all the major asymptotic results like the Fisher and Cramér–Rao information bounds, as well as the Wilks phenomenon, can be derived as corollaries of the obtained non-asymptotic statements simply by letting the sample size tend to infinity. At the same time, the approach applies to a high dimensional parameter space.
The LAN property states that the considered process L(θ) can be approximated by a quadratic in θ expression in a vicinity of the central point θ*. This property is usually checked using a second order Taylor expansion. The main problem arising here is that the error of the approximation grows too fast with the size of the local neighborhood. Section 3.1 presents a non-asymptotic version of the LAN property in which the local quadratic approximation of L(θ) is replaced by bounding this process from above and from below by two different quadratic in θ processes. More precisely, we apply the bracketing idea: the difference L(θ, θ*) = L(θ) − L(θ*) is put between two quadratic processes L_ϵ(θ, θ*) and L_ϵ̄(θ, θ*):

    L_ϵ̄(θ, θ*) − ♦_ϵ̄ ≤ L(θ, θ*) ≤ L_ϵ(θ, θ*) + ♦_ϵ,   θ ∈ Θ₀(r),   (3.1)

where ϵ is a numerical parameter, ϵ̄ = −ϵ, and ♦_ϵ, ♦_ϵ̄ are stochastic errors which only depend on the selected vicinity Θ₀(r). The upper process L_ϵ(θ, θ*) and the lower process L_ϵ̄(θ, θ*) can deviate substantially from each other; however, the errors ♦_ϵ, ♦_ϵ̄ remain small even if the value r describing the size of the local neighborhood Θ₀(r) is large.
The sandwiching result (3.1) naturally leads to two important notions: the value of the problem and the spread. It turns out that most of the statements like confidence and concentration probabilities rely upon the maximum of L(θ, θ*) over θ, which we call the excess. Its expectation will be referred to as the value of the problem. Due to (3.1), the excess can be bounded from above and from below using the similar quantities max_θ L_ϵ(θ, θ*) and max_θ L_ϵ̄(θ, θ*), which can be called the upper and lower excess, while their expectations are the values of the upper and lower problems. Note that max_θ {L_ϵ(θ, θ*) − L_ϵ̄(θ, θ*)} can be very large or even infinite. However, this is not crucial. What really matters is the difference between the upper and the lower excess. The spread ∆_ϵ can be defined as the width of the interval bounding the excess due to (3.1), that is, as the sum of the approximation errors and of the difference between the upper and the lower excess:

    ∆_ϵ ≝ ♦_ϵ + ♦_ϵ̄ + max_θ L_ϵ(θ, θ*) − max_θ L_ϵ̄(θ, θ*).
The range of applicability of this approach can be described by the following mnemonic rule: “The value of the upper problem is larger in order than the spread.” The further sections explain in detail the meaning and content of this rule. Section 3.1 presents the key bound (3.1) and derives it from general results on empirical processes. Section 3.2 presents some straightforward corollaries of the bound (3.1) including the coverage and concentration probabilities, an expansion of the MLE, and risk bounds. It also indicates how the classical results on the asymptotic efficiency of the MLE follow from the obtained non-asymptotic bounds.
3.1 Local quadratic bracketing
This section presents the key result about the local quadratic approximation of the quasi log-likelihood process, given by Theorem 3.1 below.
Let the radius r of the local neighborhood Θ₀(r) be fixed in a way that the deviation probability IP(θ̃ ∉ Θ₀(r)) is sufficiently small. Precise results about the choice of r which ensures this property are postponed until Section 4. In this neighborhood Θ₀(r) we aim at building quadratic lower and upper bounds for the process L(θ). The first step is the usual decomposition of this process into deterministic and stochastic components:

    L(θ) = IEL(θ) + ζ(θ),

where ζ(θ) = L(θ) − IEL(θ). Condition (L0) allows us to approximate the smooth deterministic function IEL(θ) − IEL(θ*) around the point of maximum θ* by the quadratic form −‖D₀(θ − θ*)‖²/2. The smoothness properties of the stochastic component ζ(θ) given by conditions (ED0) and (ED1) lead to the linear approximation ζ(θ) − ζ(θ*) ≈ (θ − θ*)ᵀ∇ζ(θ*). Putting these two approximations together yields the following approximation of the process L(θ) on Θ₀(r):

    L(θ, θ*) ≈ 𝕃(θ, θ*) ≝ (θ − θ*)ᵀ∇ζ(θ*) − ‖D₀(θ − θ*)‖²/2.   (3.2)
This expansion is used in most statistical calculus. However, it does not suit our purposes because the error of the approximation grows quadratically with the radius r and starts to dominate at some critical value of r. We slightly modify the construction by introducing two different approximating processes. They only differ in the deterministic quadratic term, which is either shrunk or stretched relative to the term ‖D₀(θ − θ*)‖²/2 in 𝕃(θ, θ*).
Let δ, ϱ be nonnegative constants. Introduce for a vector ϵ = (δ, ϱ) the following notation:

    L_ϵ(θ, θ*) ≝ (θ − θ*)ᵀ∇L(θ*) − ‖D_ϵ(θ − θ*)‖²/2
              = ξ_ϵᵀD_ϵ(θ − θ*) − ‖D_ϵ(θ − θ*)‖²/2,   (3.3)

where ∇L(θ*) = ∇ζ(θ*) by ∇IEL(θ*) = 0 and

    D_ϵ² ≝ D₀²(1 − δ) − ϱV₀²,   ξ_ϵ ≝ D_ϵ⁻¹∇L(θ*).

Here we implicitly assume that with the proposed choice of the constants δ and ϱ the matrix D_ϵ² is non-negative: D_ϵ² ≥ 0. The representation (3.3) indicates that the process L_ϵ(θ, θ*) has the geometric structure of the log-likelihood of a linear Gaussian model. We do not require that the vector ξ_ϵ is Gaussian, and hence it is not a Gaussian log-likelihood. However, the geometric structure of this process appears to be more important than its distributional properties.
One can see that if δ, ϱ are positive, the quadratic drift component of the process L_ϵ(θ, θ*) is shrunk relative to 𝕃(θ, θ*) in (3.2), and it is stretched if δ, ϱ are negative. Now, given r, fix some δ ≥ δ(r) and ϱ ≥ 3ν₀ω(r) with the value δ(r) from condition (L0) and ω(r) from condition (ED1). Finally, set ϵ̄ = −ϵ, so that

    D_ϵ̄² = D₀²(1 + δ) + ϱV₀².
Theorem 3.1. Assume (ED1) and (L0). Let, for some r, the values ϱ ≥ 3ν₀ω(r) and δ ≥ δ(r) be such that D₀²(1 − δ) − ϱV₀² ≥ 0. Then

    L_ϵ̄(θ, θ*) − ♦_ϵ̄(r) ≤ L(θ, θ*) ≤ L_ϵ(θ, θ*) + ♦_ϵ(r),   θ ∈ Θ₀(r),   (3.4)

with L_ϵ(θ, θ*), L_ϵ̄(θ, θ*) defined by (3.3). The error terms ♦_ϵ(r) and ♦_ϵ̄(r) satisfy the bound (3.11) from Proposition 3.7.
The proof of this theorem is given with Proposition 3.7.
Remark 3.1. The bracketing bound (3.4) describes some properties of the log-likelihood process; the estimator θ̃ does not show up there. However, it directly implies most of our inference results. We therefore formulate (3.4) as a separate statement. Section 3.3 below presents some exponential bounds on the error terms ♦_ϵ(r) and ♦_ϵ̄(r). The main message is that under rather broad conditions, these errors are small and have only a minor impact on the inference for the quasi MLE θ̃.
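The following numerical sketch illustrates the bracketing (3.4) in a concrete one-parameter case (an illustration with ad hoc choices of δ, ϱ, and r, added to this transcription; the error terms ♦(r) are not computed, we only report how far the quadratic processes deviate from L(θ, θ*)). The model is i.i.d. Poisson with canonical parameter θ, so D₀² = V₀² = n e^{θ*} under correct specification.

    import numpy as np

    # Bracketing (3.4) for i.i.d. Poisson data with canonical parameter theta:
    # l(y, theta) = y * theta - exp(theta), so L(theta, theta*) is explicit.
    rng = np.random.default_rng(2)
    n, theta_star = 500, 0.0
    Y = rng.poisson(np.exp(theta_star), size=n)

    S = Y.sum()
    grad = S - n * np.exp(theta_star)          # nabla L(theta*) = nabla zeta(theta*)
    D0sq = V0sq = n * np.exp(theta_star)       # correct specification: D0^2 = V0^2

    delta, rho = 0.1, 0.1                      # ad hoc epsilon = (delta, rho)
    D_up_sq = D0sq * (1 - delta) - rho * V0sq  # shrunk: upper process L_eps
    D_lo_sq = D0sq * (1 + delta) + rho * V0sq  # stretched: lower process L_eps_bar

    r = 3.0                                    # local neighborhood Theta_0(r)
    u = np.linspace(-1, 1, 201) * r / np.sqrt(V0sq)   # grid of theta - theta*
    exact = S * u - n * (np.exp(theta_star + u) - np.exp(theta_star))
    upper = grad * u - D_up_sq * u**2 / 2
    lower = grad * u - D_lo_sq * u**2 / 2

    print("max(lower - exact):", (lower - exact).max())   # <= diamond_eps_bar(r)
    print("max(exact - upper):", (exact - upper).max())   # <= diamond_eps(r)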
3.2 Local inference
This section presents a list of corollaries from the basic approximation bound of Theorem 3.1. The idea is to replace the original problem by a similar one for the approximating upper and lower models. It is important to stress once again that all the corollaries only rely on the bracketing result (3.4) and on the geometric structure of the processes L_ϵ and L_ϵ̄. Define the spread ∆_ϵ(r) by

    ∆_ϵ(r) ≝ ♦_ϵ(r) + ♦_ϵ̄(r) + ( ‖ξ_ϵ‖² − ‖ξ_ϵ̄‖² )/2.   (3.5)

Here ξ_ϵ = D_ϵ⁻¹∇L(θ*) and ξ_ϵ̄ = D_ϵ̄⁻¹∇L(θ*). The quantity ∆_ϵ(r) appears to be the price induced by our bracketing device. Section 3.3 below presents some probabilistic bounds on the spread showing that it is small relative to the other terms. All our corollaries below are stated under the conditions of Theorem 3.1 and implicitly assume that the spread can be nearly ignored.
3.2.1 Local coverage probability
Our first result describes the probability of covering θ* by the random set

    E(z) = {θ : 2L(θ̃, θ) ≤ z}.   (3.6)

Corollary 3.2. For any z > 0,

    IP( E(z) ∌ θ*, θ̃ ∈ Θ₀(r) ) ≤ IP( ‖ξ_ϵ‖² ≥ z − 2♦_ϵ(r) ).   (3.7)

Proof. The bound (3.7) follows from the upper bound of Theorem 3.1 and the statement (3.12) of Lemma 3.8 below.
Below, see (3.14), we also present an exponential bound which helps to answer a very important question about the proper choice of the critical value z ensuring a prescribed coverage probability.
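Since ‖ξ_ϵ‖² behaves essentially like a χ²-type quadratic form (see Theorem 3.10 below), a first-order recipe for z can be sketched as follows (an illustration assuming ξ_ϵ ≈ N(0, I_p), which corresponds to the correctly specified asymptotic setup; the argument diamond stands for the correction 2♦_ϵ(r) and is set to zero in the idealized case):

    from scipy.stats import chi2

    def critical_value(p, alpha=0.05, diamond=0.0):
        """z with IP(||xi||^2 >= z - diamond) <= alpha for xi ~ N(0, I_p)."""
        return chi2.ppf(1.0 - alpha, df=p) + diamond

    for p in [1, 5, 20]:
        print(p, critical_value(p))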
3.2.2 Local expansion, Wilks theorem, and local concentration
Now we show how the bound (3.4) can be used for obtaining a local expansion of the quasi MLE θ̃. All our results will be conditioned on the random set C_ϵ(r) defined as

    C_ϵ(r) ≝ { θ̃ ∈ Θ₀(r), ‖V₀D_ϵ̄⁻¹ξ_ϵ̄‖ ≤ r }.   (3.8)

The second inequality in the definition of C_ϵ(r) is related to the solution of the upper and lower problems; cf. Lemma 3.8: ‖V₀D_ϵ̄⁻¹ξ_ϵ̄‖ ≤ r means θ̃_ϵ̄ ∈ Θ₀(r), where θ̃_ϵ̄ = argmax_θ L_ϵ̄(θ, θ*).
Below, in Section 3.3, we present some upper bounds on the value r ensuring a dominating probability of this random set. The first result can be viewed as a finite sample version of the famous Wilks theorem.
Corollary 3.3. On the random set C_ϵ(r) from (3.8), it holds

    ‖ξ_ϵ̄‖²/2 − ♦_ϵ̄(r) ≤ L(θ̃, θ*) ≤ ‖ξ_ϵ‖²/2 + ♦_ϵ(r).   (3.9)
The next result is an extension of another prominent asymptotic result, namely the Fisher expansion of the MLE.

Corollary 3.4. On the random set C_ϵ(r) from (3.8), it holds

    ‖D_ϵ(θ̃ − θ*) − ξ_ϵ‖² ≤ 2∆_ϵ(r).   (3.10)
The proof of Corollaries 3.3 and 3.4 relies on the solution of the upper and lower problems and is given below at the end of this section.
Now we describe concentration properties of θ̃ assuming that θ̃ is restricted to Θ₀(r). More precisely, we bound the probability that ‖D_ϵ(θ̃ − θ*)‖ > z for a given z > 0.

Corollary 3.5. For any z > 0, it holds

    IP( ‖D_ϵ(θ̃ − θ*)‖ > z, C_ϵ(r) ) ≤ IP( ‖ξ_ϵ‖ > z − √(2∆_ϵ(r)) ).
An interesting and important question is for which z in (3.6) the coverage probability of the event E(z) ∋ θ*, or for which z the concentration probability of the event {‖D_ϵ(θ̃ − θ*)‖ ≤ z}, becomes close to one. It will be addressed in Section 3.3.
3.2.3 A local risk bound
Below we also bound the moments of the excess L(θ̃, θ*) and of the normalized loss D_ϵ(θ̃ − θ*) when θ̃ is restricted to Θ₀(r). The result follows directly from Corollaries 3.3 and 3.4.

Corollary 3.6. For u > 0,

    IE[ L^u(θ̃, θ*) 1I{θ̃ ∈ Θ₀(r)} ] ≤ IE( ‖ξ_ϵ‖²/2 + ♦_ϵ(r) )^u.

Moreover, it holds

    IE[ ‖D_ϵ(θ̃ − θ*)‖^u 1I{C_ϵ(r)} ] ≤ IE( ‖ξ_ϵ‖ + √(2∆_ϵ(r)) )^u.
3.2.4 Comparing with the asymptotic theory
This section briefly discusses the relation between the established non-asymptotic bounds and the classical asymptotic results in parametric estimation. This comparison is not straightforward because the asymptotic theory involves the sample size or noise level as the asymptotic parameter, while our setup is very general and works even for a “single” observation. Here we simply treat ϵ = (δ, ϱ) as a small parameter. This is well justified by the i.i.d. case with n observations, where δ = δ(r) ≍ r/√n and similarly for ϱ; see Section 5 for more details. The bounds below in Section 3.3 show that the spread ∆_ϵ(r) from (3.5) is small and can be ignored in the asymptotic calculations. The results of Corollaries 3.2 through 3.6 represent the desired bounds in terms of deviation bounds for the quadratic form ‖ξ_ϵ‖².
For a better understanding of the essence of the presented results, consider first the “true” parametric model with the correctly specified log-likelihood L(θ). Then D₀² = V₀² is the total Fisher information matrix. In the i.i.d. case it becomes nF₀, where F₀ is the usual Fisher information matrix of the considered parametric family at θ*. In particular, Var∇L(θ*) = nF₀. So, if D_ϵ is close to D₀, then ξ_ϵ can be treated as the normalized score. Under the usual assumptions, ξ ≝ D₀⁻¹∇L(θ*) is an asymptotically standard normal p-vector. The same applies to ξ_ϵ. Now one can observe that Corollaries 3.2 through 3.6 directly imply most of the classical asymptotic statements. In particular, Corollary 3.3 shows that twice the excess, 2L(θ̃, θ*), is nearly ‖ξ‖² and thus nearly χ²_p (Wilks theorem). Corollary 3.4 yields the expansion D₀(θ̃ − θ*) ≈ ξ (the Fisher expansion); hence, D₀(θ̃ − θ*) is asymptotically standard normal. The asymptotic variance of D₀(θ̃ − θ*) is nearly the identity, so θ̃ achieves the Cramér–Rao efficiency bound in the asymptotic set-up.
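The following Monte Carlo sketch (an illustration with assumed settings, added to this transcription) makes both statements visible for the correctly specified Gaussian mean model Yi ∼ N(θ*, I_p), where D₀ = √n I_p, ξ = n^{−1/2}Σ(Yi − θ*), and twice the excess equals ‖ξ‖² exactly:

    import numpy as np
    from scipy.stats import chi2

    # Wilks phenomenon and Fisher expansion for the Gaussian mean model.
    rng = np.random.default_rng(3)
    p, n, n_mc = 5, 100, 20000
    theta_star = np.zeros(p)

    excess2 = np.empty(n_mc)
    for m in range(n_mc):
        Y = rng.standard_normal((n, p)) + theta_star
        theta_tilde = Y.mean(axis=0)                   # MLE
        xi = np.sqrt(n) * (theta_tilde - theta_star)   # normalized score
        excess2[m] = n * np.sum((theta_tilde - theta_star) ** 2)  # = 2 L(tilde, *)
        assert np.allclose(excess2[m], np.sum(xi**2))  # Fisher expansion, exact here

    # Twice the excess should follow the chi-squared_p law.
    print("mean:", excess2.mean(), "vs chi2_p mean:", p)
    print("95% quantile:", np.quantile(excess2, 0.95), "vs", chi2.ppf(0.95, df=p))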
3.3 Spread
This section presents some bounds on the spread ∆_ϵ(r) from (3.5). This quantity is random, but it can be easily evaluated under the imposed conditions. We present two different results: one bounds the errors ♦_ϵ(r), ♦_ϵ̄(r), while the other presents a deviation bound on quadratic forms like ‖ξ_ϵ‖². The results are stated under conditions (ED0) and (ED1) in a non-asymptotic way, so the formulation is quite technical. An informal discussion at the end of this section explains the typical behavior of the spread. The first result complements the bracketing bound (3.4).
Proposition 3.7. Assume (ED1). The error ♦_ϵ(r) in (3.4) fulfills

    IP( ϱ⁻¹♦_ϵ(r) ≥ z₀(x, Q) ) ≤ exp(−x),   (3.11)

with z₀(x, Q) given for g₀ = gν₀ ≥ 3 by

    z₀(x, Q) ≝ ( 1 + √(x + Q) )²           if 1 + √(x + Q) ≤ g₀,
    z₀(x, Q) ≝ 1 + 2g₀⁻¹(x + Q) + g₀²      otherwise,

where Q = c₁p with c₁ = 2 for p ≥ 2 and c₁ = 2.7 for p = 1. Similarly for ♦_ϵ̄(r).
Remark 3.2. The bound (3.11) essentially depends on the value g from condition (ED1). The result requires that gν₀ ≥ 3. However, this constant can usually be taken of order √n; see Section 5 for examples. If g² is larger in order than p + x, then z₀(x, Q) ≈ c₁p + x.
Proof. Consider for fixed r and ϵ = (δ, ϱ) the quantity

    ♦_ϵ(r) ≝ sup_{θ∈Θ₀(r)} { L(θ, θ*) − IEL(θ, θ*) − (θ − θ*)ᵀ∇L(θ*) − (ϱ/2)‖V₀(θ − θ*)‖² }.

As δ ≥ δ(r), it holds −IEL(θ, θ*) ≥ (1 − δ)‖D₀(θ − θ*)‖²/2, and hence L(θ, θ*) − L_ϵ(θ, θ*) ≤ ♦_ϵ(r). Moreover, in view of ∇IEL(θ*) = 0, the definition of ♦_ϵ(r) can be rewritten as

    ♦_ϵ(r) = sup_{θ∈Θ₀(r)} { ζ(θ, θ*) − (θ − θ*)ᵀ∇ζ(θ*) − (ϱ/2)‖V₀(θ − θ*)‖² }.
Now the claim of the theorem can be easily reduced to an exponential bound for the quantity ♦_ϵ(r). We apply Theorem A.12 to the process

    U(θ, θ*) = (1/ω(r)) { ζ(θ, θ*) − (θ − θ*)ᵀ∇ζ(θ*) },   θ ∈ Θ₀(r),

and H₀ = V₀. Condition (ED) follows from (ED1) with the same ν₀ and g in view of ∇U(θ, θ*) = {∇ζ(θ) − ∇ζ(θ*)}/ω(r). So, the conditions of Theorem A.12 are fulfilled, yielding (3.11) in view of ϱ ≥ 3ν₀ω(r).
Due to the main bracketing result, the local excess sup_{θ∈Θ₀(r)} L(θ, θ*) can be put between similar quantities for the upper and lower approximating processes up to the error terms ♦_ϵ(r), ♦_ϵ̄(r). The random quantity sup_{θ∈IR^p} L_ϵ(θ, θ*) can be called the upper excess, while sup_{θ∈Θ₀(r)} L_ϵ̄(θ, θ*) is the lower excess. The quadratic (in θ) structure of the functions L_ϵ(θ, θ*) and L_ϵ̄(θ, θ*) enables us to explicitly solve the problem of maximizing the corresponding function w.r.t. θ.
Lemma 3.8. It holds

    sup_{θ∈IR^p} L_ϵ(θ, θ*) = ‖ξ_ϵ‖²/2.   (3.12)

On the random set {‖V₀D_ϵ̄⁻¹ξ_ϵ̄‖ ≤ r}, it also holds

    sup_{θ∈Θ₀(r)} L_ϵ̄(θ, θ*) = ‖ξ_ϵ̄‖²/2.

Proof. The unconstrained maximum of the quadratic form L_ϵ(θ, θ*) w.r.t. θ is attained at θ̃_ϵ = θ* + D_ϵ⁻¹ξ_ϵ = θ* + D_ϵ⁻²∇L(θ*), yielding the expression (3.12). The lower excess is computed similarly.
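For the reader's convenience, the computation behind Lemma 3.8 is a one-line completion of the square in the notation of (3.3) (spelled out here in LaTeX; it is the standard quadratic identity, not an additional result of the paper):

    \mathbb{L}_{\epsilon}(\theta,\theta^*)
      = \xi_{\epsilon}^{\top} D_{\epsilon}(\theta-\theta^*)
        - \tfrac{1}{2}\,\bigl\|D_{\epsilon}(\theta-\theta^*)\bigr\|^2
      = \tfrac{1}{2}\,\|\xi_{\epsilon}\|^2
        - \tfrac{1}{2}\,\bigl\|D_{\epsilon}(\theta-\theta^*) - \xi_{\epsilon}\bigr\|^2 ,

so the supremum over θ ∈ IR^p equals ‖ξ_ϵ‖²/2 and is attained at D_ϵ(θ − θ*) = ξ_ϵ; the same algebra with D_ϵ̄, ξ_ϵ̄ yields the constrained lower excess.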
Our next step is in bounding the difference ‖ξ_ϵ‖² − ‖ξ_ϵ̄‖². It can be decomposed as

    ‖ξ_ϵ‖² − ‖ξ_ϵ̄‖² = ( ‖ξ_ϵ‖² − ‖ξ‖² ) + ( ‖ξ‖² − ‖ξ_ϵ̄‖² )

with ξ = D₀⁻¹∇L(θ*). If the values δ, ϱ are small, then the difference ‖ξ_ϵ‖² − ‖ξ_ϵ̄‖² is automatically smaller than ‖ξ‖².

Lemma 3.9. Suppose (I) and let τ ≝ δ + ϱa² < 1. Then

    D_ϵ² ≥ (1 − τ)D₀²,   D_ϵ̄² ≤ (1 + τ)D₀²,   ‖I_p − D_ϵD_ϵ̄⁻²D_ϵ‖_∞ ≤ α ≝ 2τ/(1 − τ²).   (3.13)
Moreover,

    ‖ξ_ϵ‖² − ‖ξ‖² ≤ τ/(1 − τ) ‖ξ‖²,   ‖ξ‖² − ‖ξ_ϵ̄‖² ≤ τ/(1 + τ) ‖ξ‖²,   ‖ξ_ϵ‖² − ‖ξ_ϵ̄‖² ≤ α‖ξ‖².
Our final step is in showing that under (ED0) the norm ‖ξ‖ behaves essentially as the norm of a Gaussian vector with the same covariance matrix. Define for IB ≝ D₀⁻¹V₀²D₀⁻¹ (to avoid confusion with the dimension p, we write p_B for its trace)

    p_B ≝ tr IB,   v² ≝ 2 tr(IB²),   λ₀ ≝ ‖IB‖_∞ = λ_max(IB).

Under the identifiability condition (I), one can bound

    IB ≤ a²I_p,   p_B ≤ a²p,   v² ≤ 2a⁴p,   λ₀ ≤ a².
Similarly to the previous result, we assume that the constant g from condition (ED0) is sufficiently large, namely g² ≥ 2p_B. Define µ_c = 2/3 and

    y_c² ≝ g²/µ_c² − p_B/µ_c,   g_c ≝ µ_c y_c = √(g² − µ_c p_B),   2x_c ≝ µ_c y_c² + log det( I_p − µ_c IB/λ₀ ).

It is easy to see that y_c² ≥ 3g²/2 and g_c ≥ √(2/3) g.
Theorem 3.10. Let (ED0) hold with ν₀ = 1 and g² ≥ 2p_B. Then IE‖ξ‖² ≤ p_B, and for each x ≤ x_c

    IP( ‖ξ‖²/λ₀ ≥ z(x, IB) ) ≤ 2exp(−x) + 8.4exp(−x_c),   (3.14)

where z(x, IB) is defined by

    z(x, IB) ≝ p_B + 2v√x    for x ≤ v/18,
    z(x, IB) ≝ p_B + 6x      for v/18 < x ≤ x_c.

Moreover, for x > x_c, it holds with z(x, IB) = y_c² + 2(x − x_c)/g_c

    IP( ‖ξ‖²/λ₀ ≥ z(x, IB) ) ≤ 8.4exp(−x).

Proof. It follows from condition (ED0) that

    IE‖ξ‖² = IE tr(ξξᵀ) = tr( D₀⁻¹ IE[∇L(θ*){∇L(θ*)}ᵀ] D₀⁻¹ ) = tr( D₀⁻² Var∇L(θ*) ),

and (ED0) implies γᵀVar{∇L(θ*)}γ ≤ γᵀV₀²γ; thus IE‖ξ‖² ≤ tr IB = p_B. The deviation bound (3.14) is proved in Corollary ??.
Remark 3.3. This small remark concerns the term 8.4exp(−x_c) in the probability bound (3.14). As already mentioned, this bound implicitly assumes that the constant g is large (usually g ≍ √n). Then x_c ≍ g² ≍ n is large as well. So exp(−x_c) is very small and asymptotically negligible. Below we often ignore this term. For x ≤ x_c, we can use z(x, IB) = p_B + 6x.
Remark 3.4. The exponential bound of Theorem 3.10 helps to describe the critical value of z ensuring a prescribed deviation probability IP(‖ξ‖² ≥ z). Namely, this probability starts to gradually decrease when z grows over λ₀p_B. In particular, this helps to answer the very important question about a proper choice of the critical value z providing a prescribed coverage probability, or of the value z ensuring a dominating concentration probability IP(‖D_ϵ(θ̃ − θ*)‖ ≤ z).
The definition of the set C_ϵ(r) from (3.8) involves the event {‖V₀D_ϵ̄⁻¹ξ_ϵ̄‖ > r}. Under (I), it is included in the set {‖ξ_ϵ̄‖ > (1 + α)⁻¹a⁻¹r}, see (3.13), and its probability is of order exp(−x) for r² ≥ C(x + p) with a fixed C > 0.
By Proposition 3.7, one can use max{♦_ϵ(r), ♦_ϵ̄(r)} ≤ ϱz₀(x, Q) on a set of probability at least 1 − exp(−x). Further, ‖ξ‖²/λ₀ ≤ z(x, IB) with a probability of order 1 − exp(−x); see (3.14). Putting the obtained bounds together yields for the spread ∆_ϵ(r), with a probability about 1 − 4exp(−x),

    ∆_ϵ(r) ≤ 2ϱz₀(x, Q) + αλ₀z(x, IB).
The results obtained in Section 3.2 are sharp and meaningful if the spread ∆_ϵ(r) is smaller in order than the value IE‖ξ‖². Theorem 3.10 states that ‖ξ‖² does not significantly deviate over p_B, which bounds its expected value IE‖ξ‖² and is our leading term. We know that z₀(x, Q) ≈ Q + x = c₁p + x if x is not too large. Also z(x, IB) ≤ p_B + 6x, where p_B is of order p due to (I). Summarizing the above discussion yields that the local results apply if the regularity condition (I) holds and the values ϱ and α, or equivalently ω(r) and δ(r), are small. In Section 5 we show for the i.i.d. example that ω(r) ≍ √(r²/n) and similarly for δ(r).
3.4 Proof of Corollaries 3.3 and 3.4
The bound (3.4) together with Lemma 3.8 yields on C_ϵ(r)

    L(θ̃, θ*) = sup_{θ∈Θ₀(r)} L(θ, θ*) ≥ sup_{θ∈Θ₀(r)} L_ϵ̄(θ, θ*) − ♦_ϵ̄(r) = ‖ξ_ϵ̄‖²/2 − ♦_ϵ̄(r).   (3.15)

Similarly,

    L(θ̃, θ*) ≤ sup_{θ∈Θ₀(r)} L_ϵ(θ, θ*) + ♦_ϵ(r) ≤ ‖ξ_ϵ‖²/2 + ♦_ϵ(r),

yielding (3.9). For getting (3.10), we again apply the inequality L(θ, θ*) ≤ L_ϵ(θ, θ*) + ♦_ϵ(r) from Theorem 3.1 with θ equal to θ̃. With ξ_ϵ = D_ϵ⁻¹∇L(θ*) and u ≝ D_ϵ(θ̃ − θ*), this gives

    L(θ̃, θ*) − ξ_ϵᵀu + ‖u‖²/2 ≤ ♦_ϵ(r).

Therefore, by (3.15),

    ‖ξ_ϵ̄‖²/2 − ♦_ϵ̄(r) − ξ_ϵᵀu + ‖u‖²/2 ≤ ♦_ϵ(r),

or, equivalently,

    ‖ξ_ϵ‖²/2 − ξ_ϵᵀu + ‖u‖²/2 ≤ ♦_ϵ(r) + ♦_ϵ̄(r) + ( ‖ξ_ϵ‖² − ‖ξ_ϵ̄‖² )/2,

and the definition of ∆_ϵ(r) implies ‖u − ξ_ϵ‖² ≤ 2∆_ϵ(r).
4 Upper function approach and concentration of the qMLE
A very important step in the analysis of the qMLE θ̃ is localization. This property means that θ̃ concentrates in a small vicinity of the central point θ*. This section states such a concentration bound under the global conditions of Section 2. Given r₀, the deviation bound describes the probability IP(θ̃ ∉ Θ₀(r₀)) that θ̃ does not belong to the local vicinity Θ₀(r₀) of θ*. The question of interest is to check the possibility of selecting r₀ in a way that the local bracketing result and the deviation bound apply simultaneously; see the discussion at the end of the section.
Below we suppose that a sufficiently large constant x is fixed to specify the accepted level, of order exp(−x), for this deviation probability. All the constructions below depend upon this constant. We do not indicate it explicitly for ease of notation.
The key step in this large deviation bound is made in terms of an upper function for the process L(θ, θ*) ≝ L(θ) − L(θ*). Namely, u(θ) is a deterministic upper function if it holds with a high probability

    sup_{θ∈Θ} { L(θ, θ*) + u(θ) } ≤ 0.   (4.1)

Such bounds are usually called for in the analysis of the posterior measure in the Bayes approach. Below we present sufficient conditions ensuring (4.1). Now we explain how (4.1) can be used for describing concentration sets for θ̃.
Lemma 4.1. Let u(θ) be an upper function in the sense that

    IP( sup_{θ∈Θ} { L(θ, θ*) + u(θ) } ≥ 0 ) ≤ exp(−x)   (4.2)

for x > 0. Given a subset Θ₀ ⊂ Θ with θ* ∈ Θ₀, the condition u(θ) ≥ 0 for θ ∉ Θ₀ ensures

    IP( θ̃ ∉ Θ₀ ) ≤ exp(−x).

Proof. If Θ° is a subset of Θ not containing θ*, then the event θ̃ ∈ Θ° is only possible if sup_{θ∈Θ°} L(θ, θ*) ≥ 0, because L(θ*, θ*) ≡ 0.
A possible way of checking the condition (4.2) is based on a lower quadratic bound for the negative expectation, −IEL(θ, θ*) ≥ b(r)‖V₀(θ − θ*)‖²/2, in the sense of condition (Lr) from Section 2.2. We present two different results. The first one assumes that the value b(r) can be fixed universally for all r ≥ r₀.
Theorem 4.2. Suppose (Er) and (Lr) with b(r) ≡ b. Let, for r ≥ r₀,

    1 + √(x + Q) ≤ 3ν₀²g(r)/b,   (4.3)
    6ν₀√(x + Q) ≤ rb,   (4.4)

with x + Q ≥ 2.5 and Q = c₁p. Then

    IP( θ̃ ∉ Θ₀(r₀) ) ≤ exp(−x).   (4.5)

Proof. The result follows from Theorem A.8 with µ = b/(3ν₀), t(µ) ≡ 0, U(θ) = L(θ) − IEL(θ), and M(θ, θ*) = −IEL(θ, θ*) ≥ (b/2)‖V₀(θ − θ*)‖².
Remark 4.1. The bound (4.5) requires only two conditions. Condition (4.3) means that the value g(r) from condition (Er) fulfills g²(r) ≥ C(x + p), that is, we need a qualified rate in the exponential moment conditions. This is similar to requiring finite polynomial moments for the score function. Condition (4.4) requires that r exceeds some fixed value, namely r² ≥ C(x + p). This bound is helpful for fixing the value r₀ providing a sensible deviation probability bound.
If b(r) decreases with r, the result is a bit more involved. The key requirement is that b(r) decreases not too fast, so that the product rb(r) grows to infinity with r. The idea is to include the complement of the central set Θ₀ in Θ in the union of the growing sets Θ₀(r_k) with b(r_k) ≥ b(r₀)2^{−k}, and then to apply Theorem 4.2 for each Θ₀(r_k).
Theorem 4.3. Suppose (Er) and (Lr). Let r_k be such that b(r_k) ≥ b(r₀)2^{−k} for k ≥ 1. If the conditions

    1 + √(x + Q + ck) ≤ 3ν₀²g(r_k)/b(r_k),
    6ν₀√(x + Q + ck) ≤ r_k b(r_k)

are fulfilled for c = log 2, then it holds

    IP( θ̃ ∉ Θ₀(r₀) ) ≤ exp(−x).

Proof. The result (4.5) is applied to each set Θ₀(r_k) with x_k = x + ck. This yields

    IP( θ̃ ∉ Θ₀(r₀) ) ≤ Σ_{k≥1} IP( θ̃ ∉ Θ₀(r_k) ) ≤ Σ_{k≥1} exp(−x − ck) = exp(−x)

as required.
Remark 4.2. Here we briefly discuss a very important question: how can one fix the value r₀ ensuring the bracketing result in the local set Θ₀(r₀) and a dominating probability of the related set C_ϵ(r) from (3.8)? Keeping the event {‖V₀D_ϵ̄⁻¹ξ_ϵ̄‖ > r} unlikely requires r² ≥ C(x + p). Further we inspect the deviation bound for the complement Θ \ Θ₀(r₀). For simplicity, assume (Lr) with b(r) ≡ b. Then the condition (4.4) of Theorem 4.2 requires that

    r₀² ≥ Cb⁻²(x + p).

In words, the squared radius r₀² should be at least of order p. The other condition (4.3) of Theorem 4.2 is technical and only requires that g(r) is sufficiently large, while the local results only require that δ(r) and ϱ(r) are small for such r. In the asymptotic setup one can typically bring these conditions together. Section 5 provides further discussion for the i.i.d. setup.
5 Examples
The model with independent identically distributed (i.i.d.) observations is one of the most popular setups in the statistical literature and in statistical applications. The essential and most developed part of statistical theory is designed for i.i.d. modeling. In particular, the classical asymptotic parametric theory is almost complete, including asymptotic root-n normality and efficiency of the MLE and Bayes estimators under rather mild assumptions; see e.g. Chapters 2 and 3 in Ibragimov and Khas’minskij (1981). So, the i.i.d. model can naturally serve as a benchmark for any extension of the statistical theory: applied to the i.i.d. setup, the new approach should lead to essentially the same conclusions as the classical theory. Similar reasons apply to the regression model and its extensions. Below we try to demonstrate that the proposed non-asymptotic viewpoint is able to reproduce the existing brilliant and well established results of the classical parametric theory. Surprisingly, the majority of the classical efficiency results can be easily derived from the obtained general non-asymptotic bounds.
The next question is whether there is any added value or benefit of the new approach, restricted to the i.i.d. situation, relative to the classical one. Two important issues have already been mentioned: the new approach applies to the situation with finite samples and survives under model misspecification. One more important question is whether the obtained results remain applicable and informative if the dimension of the parameter space is high; this is one of the main challenges in modern statistics. We show that the dimensionality p naturally appears in the risk bounds and that the results apply as long as the sample size exceeds this value p in order. All these questions are addressed in Section 5.1 for the i.i.d. setup; Section 5.2 focuses on generalized linear modeling, while Section 5.3 discusses linear median regression.
5.1 Quasi MLE in an i.i.d. model
An i.i.d. parametric model means that the observations Y = (Y₁, ..., Yₙ) are independent identically distributed from a distribution P which belongs to a given parametric family (Pθ, θ ∈ Θ) on the observation space Y₁. Each θ ∈ Θ clearly yields the product data distribution IPθ = Pθ^{⊗n} on the product space Y = Y₁ⁿ. This section illustrates how the obtained general results can be applied to this type of modeling under possible model misspecification. Different types of misspecification can be considered: each of the assumptions, namely, data independence, identical distribution, and the parametric form of the marginal distribution, can be violated. To be specific, we assume the observations Yi to be independent and identically distributed. However, we admit that the distribution of each Yi does not necessarily belong to the parametric family (Pθ). The case of non-identically distributed observations can be treated similarly at the cost of more complicated notation.
In what follows the parametric family (Pθ) is supposed to be dominated by a measure µ₀, and each density p(y, θ) = dPθ/dµ₀(y) is two times continuously differentiable in θ for all y. Denote ℓ(y, θ) = log p(y, θ). The parametric assumption Yi ∼ Pθ* ∈ (Pθ) leads to the log-likelihood

    L(θ) = Σ ℓ(Yi, θ),

where the summation is taken over i = 1, ..., n. The quasi MLE θ̃ maximizes this sum over θ ∈ Θ:

    θ̃ ≝ argmax_{θ∈Θ} L(θ) = argmax_{θ∈Θ} Σ ℓ(Yi, θ).

The target of estimation θ* maximizes the expectation of L(θ):

    θ* ≝ argmax_{θ∈Θ} IEL(θ) = argmax_{θ∈Θ} IEℓ(Y₁, θ).
Let ζi(θ) ≝ ℓ(Yi, θ) − IEℓ(Yi, θ). Then ζ(θ) = Σ ζi(θ). The equation ∇IEL(θ*) = 0 implies

    ∇ζ(θ*) = Σ ∇ζi(θ*) = Σ ∇ℓ(Yi, θ*).   (5.1)
The i.i.d. structure of the Yi's allows us to rewrite the conditions (Er), (ED0), (ED1), (L0), and (I) in terms of the marginal distribution.

(ed0) There exists a positive definite symmetric matrix v₀ such that for all |λ| ≤ g₁

    sup_{γ∈IR^p} log IE exp{ λ γᵀ∇ζ₁(θ*) / ‖v₀γ‖ } ≤ ν₀²λ²/2.

A natural candidate for v₀² is given by the variance of the gradient ∇ℓ(Y₁, θ*), that is, v₀² = Var{∇ℓ(Y₁, θ*)} = Var{∇ζ₁(θ*)}. Next consider the local sets

    Θ_loc(u) = {θ : ‖v₀(θ − θ*)‖ ≤ u}.

In view of V₀² = nv₀², it holds Θ₀(r) = Θ_loc(u) with r² = nu². Below we distinguish between the local conditions for u ≤ u₀ and the global conditions for all u > 0, where u₀ is some fixed value.
The local smoothness conditions (ED1) and (L0) require specifying the functions ω(r) and δ(r) for r ≤ r₀, where r₀² = nu₀². If the log-likelihood function ℓ(y, θ) is sufficiently smooth in θ, these functions can be selected proportional to u = r/√n.

(ed1) There are constants ω* > 0 and g₁ > 0 such that for each u ≤ u₀ and |λ| ≤ g₁

    sup_{γ∈IR^p} sup_{θ∈Θ_loc(u)} log IE exp{ λ γᵀ{∇ζ₁(θ) − ∇ζ₁(θ*)} / (ω*u ‖v₀γ‖) } ≤ ν₀²λ²/2.
Further we restate the local identifiability condition (L0) in terms of the expected value k(θ, θ*) ≝ −IE{ℓ(Yi, θ) − ℓ(Yi, θ*)} for each i. We suppose that k(θ, θ*) is two times differentiable w.r.t. θ. The definition of θ* implies ∇IEℓ(Yi, θ*) = 0. Define also the matrix F₀ = −∇²IEℓ(Yi, θ*). In the parametric case P = Pθ*, k(θ, θ*) is the Kullback–Leibler divergence between Pθ* and Pθ, while the matrices v₀² and F₀ are equal to each other and coincide with the Fisher information matrix of the family (Pθ) at θ*.

(ℓ0) There is a constant δ* such that it holds for each u ≤ u₀

    sup_{θ∈Θ_loc(u)} | 2k(θ, θ*) / {(θ − θ*)ᵀF₀(θ − θ*)} − 1 | ≤ δ*u.
(ι) There is a constant a > 0 such that a²F₀ ≥ v₀².
(eu) For each u > 0, there exists g₁(u) > 0 such that for all |λ| ≤ g₁(u)

    sup_{γ∈IR^p} sup_{θ∈Θ_loc(u)} log IE exp{ λ γᵀ∇ζ₁(θ) / ‖v₀γ‖ } ≤ ν₀²λ²/2.
(ℓu) For each u > 0, there exists b(u) > 0 such that

    inf_{θ∈Θ: ‖v₀(θ−θ*)‖=u} k(θ, θ*) / ‖v₀(θ − θ*)‖² ≥ b(u).
Lemma 5.1. Let Y₁, ..., Yₙ be i.i.d. Then (eu), (ed0), (ed1), (ι), and (ℓ0) imply (Er), (ED0), (ED1), (I), and (L0) with V₀² = nv₀², D₀² = nF₀, ω(r) = ω*r/√n, δ(r) = δ*r/√n, and g = g₁√n.

Proof. The identities V₀² = nv₀², D₀² = nF₀ follow from the i.i.d. structure of the observations Yi. We briefly comment on condition (Er). The use of the i.i.d. structure once again yields by (5.1), in view of V₀² = nv₀²,

    log IE exp{ λ γᵀ∇ζ(θ) / ‖V₀γ‖ } = n log IE exp{ (λ/√n) γᵀ∇ζ₁(θ) / ‖v₀γ‖ } ≤ ν₀²λ²/2

as long as λ ≤ √n g₁(u) ≝ g(r). Similarly for (ED0) and (ED1).
Remark 5.1. This remark discusses how the presented conditions relate to what is usually assumed in the statistical literature. One general remark concerns the choice of the parametric family (Pθ). The point of the classical theory is that the true measure is in this family, so the conditions should be as weak as possible. The viewpoint of this paper is slightly different: whatever family (Pθ) is taken, the true measure is never included; any model is only an approximation of reality. On the other side, the choice of the parametric model (Pθ) is always done by a statistician. Sometimes special stylized features of the model force one to include an irregularity in this family. Otherwise any smoothness condition on the density ℓ(y, θ) can be secured by a proper choice of the family (Pθ).
The presented list also includes the exponential moment conditions (ed0) and (ed1) on the gradient ∇ℓ(Y₁, θ). We need exponential moments for establishing non-asymptotic risk bounds; the classical concentration bounds require even stronger conditions, namely that the considered random variables are bounded.
The identifiability condition (ℓu) is very easy to check in the usual asymptotic setup. Indeed, if the parameter set Θ is compact and the Kullback–Leibler divergence k(θ, θ*) is continuous and positive for all θ ≠ θ*, then (ℓu) is fulfilled automatically with a universal constant b. If Θ is not compact, the condition is still fulfilled but the function b(u) may depend on u.
Below we specify the general results of Sections 3 and 4 to the i.i.d. setup.
5.1.1 A large deviation bound
This section presents some sufficient conditions ensuring a small deviation probability for the event {θ̃ ∉ Θ_loc(u₀)} for a fixed u₀. Below Q = c₁p. We only discuss the case b(u) ≡ b; the general case only requires more complicated notation. The next result follows from Theorem 4.2 with the obvious changes.

Theorem 5.2. Suppose (eu) and (ℓu) with b(u) ≡ b. If, for u₀ > 0,

    √n u₀ b ≥ 6ν₀√(x + Q),   1 + √(x + Q) ≤ 3b⁻¹ν₀²g₁(u₀)√n,   (5.2)

then

    IP( θ̃ ∉ Θ_loc(u₀) ) = IP( ‖v₀(θ̃ − θ*)‖ > u₀ ) ≤ exp(−x).
Remark 5.2. The presented result helps to qualify the two important values u₀ and n providing a sensible deviation probability bound. For simplicity suppose that g₁(u) ≡ g₁ > 0. Then the condition (5.2) can be written as nu₀² ≳ x + Q. In other words, the result of the theorem claims a large deviation bound for the vicinity Θ_loc(u₀) with u₀² of order p/n. In classical asymptotic statistics this result is usually referred to as root-n consistency. Our approach yields this result in a very strong form and for finite samples.
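A quick simulation (a sketch with assumed settings, added to this transcription) illustrates the claimed u₀² ≍ p/n scaling: for the Gaussian mean model one has v₀ = I_p and n‖θ̃ − θ*‖² ∼ χ²_p, so the deviation probability beyond u₀² = Cp/n decays quickly in C:

    import numpy as np

    # Root-n concentration: Y_i ~ N(theta*, I_p), theta_tilde = mean(Y).
    rng = np.random.default_rng(4)
    p, n, n_mc = 10, 400, 5000
    dev2 = np.array([
        np.sum(rng.standard_normal((n, p)).mean(axis=0) ** 2)
        for _ in range(n_mc)
    ])                                   # ||theta_tilde - theta*||^2 ~ chi2_p / n
    for C in [1.0, 2.0, 4.0]:
        u0_sq = C * p / n
        print(f"C={C}: IP(deviation^2 > C*p/n) ~= {(dev2 > u0_sq).mean():.4f}")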
5.1.2 Local inference
Now we restate the general local bounds of Section 3 for the i.i.d. case. First we describe the approximating linear models. The matrices v₀² and F₀ from conditions (ed0), (ed1), and (ℓ0) determine their drift and variance components. Define

    F_ϵ ≝ F₀(1 − δ) − ϱv₀².

If τ ≝ δ + a²ϱ < 1, then

    F_ϵ ≥ (1 − τ)F₀ > 0.

Further, D_ϵ² = nF_ϵ and

    ξ_ϵ ≝ D_ϵ⁻¹∇ζ(θ*) = (nF_ϵ)^{−1/2} Σ ∇ℓ(Yi, θ*).

The upper bracketing process reads as

    L_ϵ(θ, θ*) = (θ − θ*)ᵀD_ϵξ_ϵ − ‖D_ϵ(θ − θ*)‖²/2.

This expression can be viewed as the log-likelihood of a linear model ξ = D_ϵθ + ε with a standard normal error ε. The (quasi) MLE θ̃_ϵ for this model is of the form θ̃_ϵ = D_ϵ⁻¹ξ_ϵ.
Theorem 5.3. Suppose (ed0). Given u₀, assume (ed1), (ℓ0), and (ι) on Θ_loc(u₀), and let ϱ = 3ν₀ω*u₀, δ = δ*u₀, and τ ≝ δ + a²ϱ < 1. Then the results of Theorem 3.1 and all its corollaries apply to the case of i.i.d. modeling with r₀² = nu₀². In particular, on the random set C_ϵ(r₀) = {θ̃ ∈ Θ_loc(u₀), ‖ξ_ϵ̄‖ ≤ r₀}, it holds

    ‖ξ_ϵ̄‖²/2 − ♦_ϵ̄(r₀) ≤ L(θ̃, θ*) ≤ ‖ξ_ϵ‖²/2 + ♦_ϵ(r₀),
    ‖(nF_ϵ)^{1/2}(θ̃ − θ*) − ξ_ϵ‖² ≤ 2∆_ϵ(r₀).

The random quantities ♦_ϵ(r₀), ♦_ϵ̄(r₀), and ∆_ϵ(r₀) follow the probability bounds of Proposition 3.7 and Theorem 3.10.
Now we briefly discuss the implications of Theorems 5.2 and 5.3 for the classical asymptotic setup with n → ∞. We fix u₀² = Cp/n for a constant C ensuring the deviation bound of Theorem 5.2. Then δ is of order u₀, and the same holds for ϱ. For a sufficiently large n, both quantities are small and thus the spread ∆_ϵ(r₀) is small as well; see Section 3.3.
Further, under condition (ed0), the normalized score

    ξ ≝ (nF₀)^{−1/2} Σ ∇ℓ(Yi, θ*)

is zero-mean and asymptotically normal by the central limit theorem. Moreover, if F₀ = v₀², then ξ is asymptotically standard normal. The same holds for ξ_ϵ. This immediately yields all the classical asymptotic results like the Wilks theorem or the Fisher expansion for the MLE in the i.i.d. setup, as well as the asymptotic efficiency of the MLE. Moreover, our bounds yield the asymptotic result for the case when the parameter dimension p = pₙ grows linearly with n. Below uₙ = oₙ(pₙ) means that uₙ/pₙ → 0 as n → ∞.
Theorem 5.4. Let Y₁, ..., Yₙ be i.i.d. IPθ* and let (ed0), (ed1), (ℓ0), (ι), (eu), and (ℓu) with b(u) ≡ b hold. If n > Cpₙ for a fixed constant C depending only on the constants in the above conditions, then

    ‖(nF₀)^{1/2}(θ̃ − θ*) − ξ‖² = oₙ(pₙ),   2L(θ̃, θ*) − ‖ξ‖² = oₙ(pₙ).

This result particularly yields that (nF₀)^{1/2}(θ̃ − θ*) is nearly standard normal and 2L(θ̃, θ*) is nearly χ²_p.
5.2 Generalized linear modeling
Now we consider generalized linear modeling (GLM), which is often used for describing categorical data. Let P = (Pw, w ∈ Υ) be an exponential family with canonical parametrization; see e.g. McCullagh and Nelder (1989). The corresponding log-density can be represented as ℓ(y, w) = yw − d(w) for a convex function d(w). Popular examples are given by the binomial (binary response, logistic) model with d(w) = log(e^w + 1), the Poisson model with d(w) = e^w, and the exponential model with d(w) = −log(w). Note that linear Gaussian regression is a special case with d(w) = w²/2.
A GLM specification means that every observation Yi has a distribution from the family P with the parameter wi which linearly depends on the regressor Ψi ∈ IR^p:

    Yi ∼ P_{Ψᵢᵀθ*}.   (5.3)

The corresponding log-density of a GLM reads as

    L(θ) = Σ { YiΨᵢᵀθ − d(Ψᵢᵀθ) }.

Under IPθ*, each observation Yi follows (5.3); in particular, IEYi = d′(Ψᵢᵀθ*). However, similarly to the previous sections, it is accepted that the parametric model (5.3) is misspecified. Response misspecification means that the vector f ≝ IEY cannot be represented in the form d′(Ψᵀθ) whatever θ is. The other sort of misspecification concerns the data distribution: the model (5.3) assumes that the Yi's are independent and the marginal distributions belong to the given parametric family P. In what follows, we only assume independent data having certain exponential moments. The target of estimation θ* is defined by

    θ* ≝ argmax_θ IEL(θ).
e is defined by maximization of L(θ) :
The quasi MLE θ
e = argmax L(θ) = argmax
θ
θ
X
θ
Yi Ψi> θ − d(Ψi> θ) .
Convexity of d(·) implies that L(θ) is a concave function of θ , so that the optimization
problem has a unique solution and can be effectively solved. However, a closed form
solution is only available for the constant regression or for the linear Gaussian regression.
The corresponding target θ ∗ is the maximizer of the expected log-likelihood:
θ ∗ = argmax IEL(θ) = argmax
θ
θ
X
fi Ψi> θ − d(Ψi> θ)
with fi = IEYi . The function IEL(θ) is concave as well and the vector θ ∗ is also well
defined.
Define the individual errors (residuals) εi = Yi − IEYi . Below we assume that these
errors fulfill some exponential moment conditions.
(e1 )
There exist some constants ν0 and g1 > 0 , and for every i a constant si such
2
that IE εi /si ≤ 1 and
log IE exp λεi /si ≤ ν02 λ2 /2,
|λ| ≤ g1 .
(5.4)
A natural candidate for si is σi where σi2 = IEε2i is the variance of εi ; see
Lemma A.17. Under (5.4), introduce a p × p matrix V0 defined by
def
V02 =
X
s2i Ψi Ψi> .
(5.5)
Condition (e1 ) effectively means that each error term εi = Yi − IEYi has some bounded
def
exponential moments: for |λ| ≤ g1 , it holds f (λ) = log IE exp λεi /si < ∞ . This
implies the quadratic upper bound for the function f (λ) for |λ| ≤ g1 ; see Lemma A.17.
In words, condition (e1 ) requires light (exponentially decreasing) tail for the marginal
distribution of each εi .
28
Parametric estimation. Finite sample theory
Define also
def
N −1/2 = max sup
i
γ∈IRp
si |Ψi> γ|
.
kV0 γk
(5.6)
Lemma 5.5. Assume (e1 ) and let V0 be defined by (5.5) and N by (5.6). Then conditions (ED0 ) and (Er) follow from (e1 ) with the matrix V0 due to (5.5) and g = g1 N 1/2 .
Moreover, the stochastic component ζ(θ) is linear in θ and the condition (ED1 ) is fulfilled with ω(r) ≡ 0 .
Proof. The gradient of the stochastic component ζ(θ) of L(θ) does not depend on θ ,
P
namely, ∇ζ(θ) =
Ψi εi with εi = Yi − IEYi . Now, for any unit vector γ ∈ IRp and
λ ≤ g , independence of the εi ’s implies that
log IE exp
n
o X
n λs Ψ > γ
o
X
λ
i i
Ψ i εi =
log IE exp
γ>
εi /si .
kV0 γk
kV0 γk
By definition si |Ψi> γ|/kV0 γk ≤ N −1/2 and therefore, λsi |Ψi> γ|/kV0 γk ≤ g1 . Hence,
(5.4) implies
log IE exp
n
o
X
ν02 λ2 X 2 > 2 ν02 λ2
λ
Ψi εi ≤
si |Ψi γ| =
γ>
,
kV0 γk
2kV0 γk2
2
(5.7)
and (ED0 ) follows.
It only remains to bound the quality of quadratic approximation for the mean of the
process L(θ, θ ∗ ) in a vicinity of θ ∗ . An interesting feature of the GLM is that the effect
of model misspecification disappears in the expectation of L(θ, θ ∗ ) .
Lemma 5.6. It holds
−IEL(θ, θ ∗ ) =
X
d(Ψi> θ) − d(Ψi> θ ∗ ) − d0 (Ψi> θ ∗ )Ψi> (θ − θ ∗ )
= K IPθ∗ , IPθ ,
where K IPθ∗ , IPθ
(5.8)
is the Kullback-Leibler divergence between measures IPθ∗ and IPθ .
Moreover,
−IEL(θ, θ ∗ ) = kD(θ ◦ )(θ − θ ∗ )k2 /2,
where θ ◦ ∈ [θ ∗ , θ] and
D2 (θ ◦ ) =
X
d00 (Ψi> θ ◦ )Ψi Ψi> .
(5.9)
29
spokoiny, v.
Proof. The definition implies
IEL(θ, θ ∗ ) =
X
fi Ψi> (θ − θ ∗ ) − d(Ψi> θ) + d(Ψi> θ ∗ ) .
As θ ∗ is the extreme point of IEL(θ) , it holds ∇IEL(θ ∗ ) =
P
fi −d0 (Ψi> θ ∗ ) Ψi = 0 and
(5.8) follows. The Taylor expansion of the second order around θ ∗ yields the expansion
(5.9).
Define now the matrix D0 by
def
D02 = D2 (θ ∗ ) =
X
d00 (Ψi> θ ∗ )Ψi Ψi> .
Let also V0 be defined by (5.5). Note that the matrices D0 and V0 coincide if the model
Yi ∼ PΨ > θ∗ is correctly specified and s2i = d00 (Ψi> θ ∗ ) . The matrix V0 describes a local
i
elliptic neighborhood of the central point θ ∗ in the form Θ0 (r) = {θ : kV0 (θ − θ ∗ )k ≤
r} . If the matrix function D2 (θ) is continuous in this vicinity Θ0 (r) then the value
δ(r) measuring the approximation quality of −IEL(θ, θ ∗ ) by the quadratic function
kD0 (θ − θ ∗ )k2 /2 is small and the identifiability condition (L0 ) is fulfilled on Θ0 (r) .
Lemma 5.7. Suppose that
kIIp − D0−1 D2 (θ)D0−1 k∞ ≤ δ(r),
θ ∈ Θ0 (r).
(5.10)
Then (L0 ) holds with this δ(r) . Moreover, as the quantities ω(r), ♦ (r), ♦ (r) vanish,
one can take % = 0 leading to the following representation for D and ξ :
D2 = (1 − δ)D02 ,
ξ = (1 + δ)1/2 ξ
D2 = (1 + δ)D02 ,
ξ = (1 − δ)1/2 ξ
with
def
ξ = D0−1 ∇ζ = D0−1
X
Ψi (Yi − IEYi ).
Linearity of the stochastic component ζ(θ) in the considered GLM implies the important fact that the quantities ♦ (r), ♦ (r) in the general bracketing bound (3.4) vanish
for any r . Therefore, in the GLM case, the deficiency can be defined as the difference
between upper and lower excess and it can be easily evaluated:
∆(r) = kξ k2 /2 − kξ k2 /2 = δkξk2 .
Our result assumes some concentration properties of the squared norm kξk2 of the
vector ξ . These properties can be established by general results of Section ?? under the
30
Parametric estimation. Finite sample theory
regularity condition: for some a
V0 ≤ aD0 .
(5.11)
Now we are prepared to state the local results for the GLM estimation.
Theorem 5.8. Let (e1 ) hold. Then for δ ≥ δ(r) any z > 0 and z > 0 , it holds
e − θ ∗ k > z, kV0 θ
e − θ ∗ k ≤ r ≤ IP kξk2 > (1 − δ)z 2
IP kD0 θ
e θ ∗ ) > z, kV0 θ
e − θ ∗ k ≤ r ≤ IP kξk2 /2 > (1 − δ)z .
IP L(θ,
e − θ ∗ k ≤ r, kξ k ≤ r} , it holds
Moreover, on the set C (r) = {kV0 θ
e − θ ∗ − ξk2 ≤
kD0 θ
2δ
kξk2 .
1 − δ2
(5.12)
If the function d(w) is quadratic then the approximation error δ vanishes as well and
the expansion (5.12) becomes equality which is also fulfilled globally, a localization step
in not required. However, if d(w) is not quadratic, the result applies only locally and
it has to be accomplished with a large deviation bound. The GLM structure is helpful
in the large deviation zone as well. Indeed, the gradient ∇ζ(θ) does not depend on θ
and hence, the most delicate condition (Er) is fulfilled automatically with g = g1 N 1/2
for all local sets Θ0 (r) . Further, the identifiability condition (Lr) easily follows from
Lemma 5.6: it suffices to bound from below the matrix D(θ) for θ ∈ Θ0 (r) :
D(θ) ≥ b(r)V0 ,
θ ∈ Θ0 (r).
An interesting question, similarly to the i.i.d. case, is the minimal radius r0 of the
local vicinity Θ0 (r0 ) ensuring the desirable concentration property. Suppose for the
moment that the constants b(r) are all the same for different r : b(r) ≡ b . Under the
regularity condition (5.11), a sufficient lower bound for r0 can be based on Corollary 4.3.
The required condition can be restated as
1+
p
x + Q ≤ 3ν02 g/b,
6ν0
p
x + Q ≤ rb.
It remains to note that Q = c1 p and g = g1 N 1/2 . So, the required conditions are
fulfilled for r2 ≥ r20 = C(x + p) , where C only depends on ν0 , b , and g .
5.3
Linear median estimation
This section illustrates how the proposed approach applies to robust estimation in linear
models. The target of analysis is the linear dependence of the observed data Y =
31
spokoiny, v.
(Y1 , . . . , Yn ) on the set of features Ψi ∈ IRp :
Yi = Ψi> θ + εi ,
(5.13)
where εi means the i th individual error. As usual, the true data distribution can
deviate from the linear model. In addition, we admit contaminated data which naturally
leads to the idea of robust estimation. This section offers a qMLE view on the robust
estimation problem. Our parametric family assumes the linear dependence (5.13) with
i.i.d. errors εi which follow the double exponential (Laplace) distribution with the
density (1/2)e−|y| . Then the corresponding log-likelihood reads as
L(θ) = −
1X
|Yi − Ψi> θ|
2
e def
and θ
= argmaxθ L(θ) is called the least absolute deviation (LAD) estimate. In the
context of linear regression, it is also called the linear median estimate. The target of
estimation θ ∗ is defined as usually by the equation θ ∗ = argmaxθ IEL(θ) .
It is useful to define the residuals εei = Yi − Ψi> θ ∗ and their distributions
Pi (A) = IP εei ∈ A = IP Yi − Ψi> θ ∗ ∈ A
for any Borel set A on the real line. If Yi = Ψi> θ ∗ + εi is the true model then Pi
coincides with the distribution of each εi . Below we suppose that each Pi has a positive
density fi (y) .
Note that the difference L(θ) − L(θ ∗ ) is bounded by
1
2
P
|Ψi> (θ − θ ∗ )| . Next we
check conditions (ED0 ) and (ED1 ) . Denote ξi (θ) = 1I(Yi − Ψi> θ ≤ 0) − qi (θ) for
qi (θ) = IP (Yi − Ψi> θ ≤ 0) . This is a centered Bernoulli random variable, and it is easy
to check that
∇ζ(θ) = −
X
ξi (θ)Ψi .
(5.14)
This expression differs from the similar ones from the linear and generalized linear regression because the stochastic terms ξi now depend on θ . First we check the global condition (Er) . Fix any g1 < 1 . Then it holds for a Bernoulli r.v. Z with IP (Z = 1) = q ,
ξ = Z − q , and |λ| ≤ g1
log IE exp(λξ) = log q exp{λ(1 − q)} + (1 − q) exp(−λq)
≤ ν02 q(1 − q)λ2 /2,
(5.15)
where ν0 ≥ 1 depends on g1 only. Let now a vector γ ∈ IRp and ρ > 0 be such that
32
Parametric estimation. Finite sample theory
ρ|Ψi> γ| ≤ g1 for all i = 1, . . . , n . Then
log IE exp{ργ > ∇ζ(θ)} ≤
≤
ν02 ρ2 X
qi (θ) 1 − qi (θ) |Ψi> γ|2
2
ν02 ρ2
kV (θ)γk2 ,
2
(5.16)
where
V 2 (θ) =
X
qi (θ) 1 − qi (θ) Ψi Ψi> .
Denote also
V02 =
1X
Ψi Ψi> .
4
(5.17)
Clearly V (θ) ≤ V0 for all θ and condition (Er) is fulfilled with the matrix V0 and
g(r) ≡ g = g1 N 1/2 for N defined by
def
N −1/2 = max sup
i
γ∈IRp
Ψi> γ
;
2kV0 γk
(5.18)
cf. (5.7).
Let some r0 > 0 be fixed. We will specify this choice later. Now we check the
local conditions within the elliptic vicinity Θ0 (r0 ) = {θ : kV0 (θ − θ ∗ )k ≤ r0 } of the
central point θ ∗ for V0 from (5.17). Then condition (ED0 ) with the matrix V0 and
g = N 1/2 g1 is fulfilled on Θ0 (r0 ) due to (5.16). Next, in view of (5.18), it holds
|Ψi> γ| ≤ 2N −1/2 kV0 γk for any vector γ ∈ IRp . By (5.14)
∇ζ(θ) − ∇ζ(θ ∗ ) =
X
Ψi ξi (θ) − ξi (θ ∗ ) .
If Ψi> θ ≥ Ψi> θ ∗ , then
ξi (θ) − ξi (θ ∗ ) = 1I(Ψi> θ ∗ ≤ Yi < Ψi> θ) − IP Ψi> θ ∗ ≤ Yi < Ψi> θ .
Similarly for Ψi> θ < Ψi> θ ∗
ξi (θ) − ξi (θ ∗ ) = − 1I(Ψi> θ ≤ Yi < Ψi> θ ∗ ) + IP Ψi> θ ≤ Yi < Ψi> θ ∗ .
def Define qi (θ, θ ∗ ) = qi (θ) − qi (θ ∗ ) . Now (5.15) yields similarly to (5.16)
ν 2 ρ2 X
qi (θ, θ ∗ )|Ψi> γ|2
log IE exp ργ > ∇ζ(θ) − ∇ζ(θ ∗ ) ≤ 0
2
≤ 2ν02 ρ2 max qi (θ, θ ∗ ) kV0 γk2 ≤ ω(r)ν02 ρ2 kV0 γk2 /2,
i≤n
spokoiny, v.
33
with
def
ω(r) = 4 max sup qi (θ, θ ∗ ).
i≤n θ∈Θ0 (r)
If each density function pi is uniformly bounded by a constant C then
|qi (θ) − qi (θ ∗ )| ≤ C Ψi> (θ − θ ∗ ) ≤ CN −1/2 kV0 (θ − θ ∗ )k ≤ CN −1/2 r.
Next we check the local identifiability condition. We use the following technical
lemma.
Lemma 5.9. It holds for any θ
−
X
∂2
def
2
I
EL(θ)
=
D
(θ)
=
pi Ψi> (θ − θ ∗ ) Ψi Ψi> ,
2
∂ θ
(5.19)
where fi (·) is the density of εei = Yi − Ψi> θ ∗ . Moreover, there is θ ◦ ∈ [θ, θ ∗ ] such that
−IEL(θ, θ ∗ ) =
1X >
|Ψi (θ − θ ∗ )|2 fi (Ψi> (θ ◦ − θ ∗ ))
2
= (θ − θ ∗ )> D2 (θ ◦ )(θ − θ ∗ )/2.
(5.20)
Proof. Obviously
∂IEL(θ) X
=
IP (Yi ≤ Ψi> θ) − 1/2 Ψi .
∂θ
The identity (5.19) is obtained by one more differentiation. By definition, θ ∗ is the
extreme point of IEL(θ) . The equality ∇IEL(θ ∗ ) = 0 yields
X
IP (Yi ≤ Ψi> θ ∗ ) − 1/2 Ψi = 0.
Now (5.20) follows by the Taylor expansion of the second order at θ ∗ .
Define
def
D02 =
X
|Ψi> (θ − θ ∗ )|2 fi (0).
(5.21)
Due to this lemma, condition (L0 ) is fulfilled in Θ0 (r) with this choice D0 for δ(r)
from (5.10); see Lemma 5.7. Moreover, if fi (0) ≥ a2 /4 for a > 0 , then the identifiability
condition (I) is also satisfied. Now all the local conditions are fulfilled yielding the
general bracketing bound of Theorem 3.1 and all its corollaries.
It only remains to accomplish them by a large deviation bound, that is, to specify the
local vicinity Θ0 (r0 ) providing the prescribed deviation bound. A sufficient condition
for the concentration property is that the expectation IEL(θ, θ ∗ ) grows in absolute value
34
Parametric estimation. Finite sample theory
with the distance kV0 (θ −θ ∗ )k . We use the representation (5.19). Suppose that for some
fixed δ < 1/2 and ρ > 0
fi (u)/fi (0) − 1 ≤ δ,
|u| ≤ ρ.
(5.22)
For any θ with kV0 (θ − θ ∗ )k = r ≥ r0 , and for any i = 1, . . . , n , it holds
|Ψi> (θ − θ ∗ )| ≤ N −1/2 kV0 (θ − θ ∗ )k = N −1/2 r.
Therefore, for r ≤ ρN 1/2 and any θ ∈ Θ0 (r) with kV0 (θ−θ ∗ )k = r , it holds fi Ψi> (θ ◦ −
θ ∗ ) ≥ (1 − δ)fi (0) . Now Lemma 5.9 implies
−IEL(θ, θ ∗ ) ≥
1−δ
1−δ
1−δ 2
kD0 (θ − θ ∗ )k2 ≥
kV0 (θ − θ ∗ )k2 =
r .
2
2
2a
2a2
By Lemma 5.9 the function −IEL(θ, θ ∗ ) is convex. This easily yields
−IEL(θ, θ ∗ ) ≥
1−δ
ρN 1/2 r
2a2
for all r ≥ ρN 1/2 . Thus,

(1 − δ)(2a2 )−1 r
rb(r) ≥
(1 − δ)(2a2 )−1 ρN 1/2
if r ≤ ρN 1/2 ,
if r > ρN 1/2 .
So, the global identifiability condition (L1 ) is fulfilled if r20 ≥ C1 a2 (x + Q) and if
ρ2 N ≥ C2 a2 (x + Q) for some fixed constants C1 and C2 .
Putting all together yields the following result.
Theorem 5.10. Let Yi be independent, θ ∗ = argmaxθ IEL(θ) , D02 be given by (5.21),
and V02 by (5.17). Let also the densities fi (·) of Yi − Ψi> θ ∗ be uniformly bounded by
a constant C , fulfill (5.22) for some ρ > 0 and δ > 0 , and fi (0) ≥ a2 /4 for all i .
Finally, let N ≥ C2 ρ−2 a2 (x + p) for some fixed x > 0 and C2 . Then on the random
def
set of probability at least 1 − e−x , one obtains for ξ = D0−1 ∇L(θ ∗ ) the bounds
k
p
e − θ ∗ − ξk2 = o(p),
D0 θ
e θ ∗ ) − kξk2 = o(p).
2L(θ,
35
spokoiny, v.
A
Some results for empirical processes
This chapter presents some general results of the theory of empirical processes. We
assume some exponential moment conditions on the increments of the process which
allows to apply the well developed chaining arguments in Orlicz spaces; see e.g. van der
Vaart and Wellner (1996), Chapter 2.2. We, however, follow the more recent approach
inspired by the notions of generic chaining and majorizing measures due to M. Talagrand;
see e.g. Talagrand (1996, 2001, 2005). The results are close to that of Bednorz (2006). We
state the results in a slightly different form and present an independent and self-contained
proof.
The first result states a bound for local fluctuations of the process U(υ) given on
a metric space Υ . Then this result will be used for bounding the maximum of the
negatively drifted process U(υ) − U(υ 0 ) − ρd2 (υ, υ 0 ) over a vicinity Υ◦ (r) of the central
point υ 0 . The behavior of U(υ) outside of the local central set Υ◦ (r) is described using
the upper function method. Namely, we construct a multiscale deterministic function
u(µ, υ) ensuring that with probability at least 1 − e−x it holds µU(υ) + u(µ, υ) ≤ z(x)
for all υ 6∈ Υ◦ (r) and µ ∈ M , where z(x) grows linearly in x .
A.1
A bound for local fluctuations
An important step in the whole construction is an exponential bound on the maximum
of a random process U(υ) under the exponential moment conditions on its increments.
Let d(υ, υ 0 ) be a semi-distance on Υ . We suppose the following condition to hold:
(Ed) There exist g > 0 , r0 > 0 , ν0 ≥ 1 , such that for any λ ≤ g and υ, υ 0 ∈ Υ with
d(υ, υ 0 ) ≤ r0
U(υ) − U(υ 0 )
log IE exp λ
d(υ, υ 0 )
≤ ν02 λ2 /2.
(A.1)
Formulation of the result involves a sigma-finite measure π on the space Υ which
is often called the majorizing measure and used in the generic chaining device; see
Talagrand (2005). A typical example of choosing π is the Lebesgue measure on IRp .
Let Υ ◦ be a subset of Υ , a sequence rk be fixed with r0 = diam(Υ ◦ ) and rk = r0 2−k .
def
Let also Bk (υ) = {υ 0 ∈ Υ ◦ : d(υ, υ 0 ) ≤ rk } be the d -ball centered at υ of radius rk
and πk (υ) denote its π -measure:
def
πk (υ) =
Z
0
Z
π(dυ ) =
Bk (υ)
Υ◦
1I d(υ, υ 0 ) ≤ rk π(dυ 0 ).
36
Parametric estimation. Finite sample theory
Denote also
def
Mk = max◦
υ∈Υ
π(Υ ◦ )
πk (υ)
k ≥ 1.
(A.2)
Finally set c1 = 1/3 , ck = 2−k+2 /3 for k ≥ 2 , and define the value Q(Υ ◦ ) by
def
Q(Υ ◦ ) =
∞
X
∞
ck log(2Mk ) =
4 X −k
1
log(2M1 ) +
2 log(2Mk ).
3
3
k=2
k=1
Theorem A.1. Let U be a separable process following to (Ed) . If Υ ◦ is a d -ball in
Υ with the center υ ◦ and the radius r0 , i.e. d(υ, υ ◦ ) ≤ r0 for all υ ∈ Υ ◦ , then for
def
λ ≤ g0 = ν0 g
log IE exp
n
o
λ
sup U(υ) − U(υ ◦ ) ≤ λ2 /2 + Q(Υ ◦ ).
3ν0 r0 υ∈Υ ◦
(A.3)
Proof. A simple change U(·) with ν0−1 U(·) and g with g0 = ν0 g allows to reduce the
result to the case with ν0 = 1 which we assume below. Consider for k ≥ 1 the smoothing
operator Sk defined as
1
Sk f (υ ) =
πk (υ ◦ )
◦
Z
f (υ)π(dυ).
Bk (υ ◦ )
Further, define
S0 U(υ) ≡ U(υ ◦ )
so that S0 U is a constant function and the same holds for Sk Sk−1 . . . S0 U with any
k ≥ 1 . If f (·) ≤ g(·) for two non-negative functions f and g , then Sk f (·) ≤ Sk g(·) .
Separability of the process U implies that limk Sk U(υ) = U(υ) . We conclude that for
each υ ∈ Υ ◦
U(υ) − U(υ ◦ ) = lim Sk U(υ) − Sk . . . S0 U(υ)
k→∞
k
∞
X
X
≤ lim
Sk . . . Si (I − Si−1 )U(υ) ≤
ξi∗ .
k→∞
i=1
i=1
def
Here ξk∗ = supυ∈Υ ◦ ξk (υ) for k ≥ 1 with
ξ1 (υ) ≡ |S1 U(υ) − U(υ ◦ )|,
def
ξk (υ) = |Sk (I − Sk−1 )U(υ)|,
k≥2
For a fixed point υ ] , it holds
Z
Z
1
1
]
U(υ) − U(υ 0 )π(dυ 0 )π(dυ).
ξk (υ ) ≤
]
πk (υ ) Bk (υ] ) πk−1 (υ) Bk−1 (υ)
37
spokoiny, v.
For each υ 0 ∈ Bk−1 (υ) , it holds d(υ, υ 0 ) ≤ rk−1 = 2rk and
U(υ) − U(υ 0 )
0
U(υ) − U(υ ) ≤ rk−1
.
d(υ, υ 0 )
This implies for each υ ] ∈ Υ ◦ and k ≥ 2 by the Jensen inequality and (A.2)
Z
Z
n λ
o
λU(υ) − U(υ 0 ) π(dυ 0 ) π(dυ)
]
exp
exp
ξk (υ ) ≤
rk−1
d(υ, υ 0 )
πk−1 (υ) πk (υ ] )
Bk (υ ] )
Bk−1 (υ)
Z Z
λU(υ) − U(υ 0 ) π(dυ 0 ) π(dυ)
≤ Mk
exp
.
d(υ, υ 0 )
πk−1 (υ) π(Υ ◦ )
Υ◦
Bk−1 (υ)
def
As the right hand-side does not depend on υ ] , this yields for ξk∗ = supυ∈Υ ◦ ξk (υ) by
condition (Ed) in view of e|x| ≤ ex + e−x
Z Z
λ
λU(υ) − U(υ 0 ) π(dυ 0 ) π(dυ)
∗
ξ ≤ Mk
IE exp
IE exp
rk−1 k
d(υ, υ 0 )
πk−1 (υ) π(Υ ◦ )
Υ◦
Bk−1 (υ)
Z Z
π(dυ 0 ) π(dυ)
2
≤ 2Mk exp(λ /2)
◦
Υ◦
Bk−1 (υ) πk−1 (υ) π(Υ )
= 2Mk exp(λ2 /2).
Further, the use of d(υ, υ ◦ ) ≤ r0 for all υ ∈ Υ ◦ yields by (Ed)
IE exp
nλ
r0
|U(υ) − U(υ ◦ )|
o
≤ 2 exp λ2 /2
(A.4)
and thus
Z
nλ
nλ
o
o
1
IE exp
|S1 U(υ) − U(υ ◦ )| ≤
IE exp
|U(υ 0 ) − U(υ ◦ )| π(dυ 0 )
r0
π1 (υ) B1 (υ)
r0
Z
nλ
o
M1
0
◦
I
E
exp
|U(υ
)
−
U(υ
)|
π(dυ 0 ).
≤
π(Υ ◦ ) Υ ◦
r0
This implies by (A.4) for ξ1∗ ≡ supυ∈Υ ◦ |S1 U(υ) − U(υ ◦ )|
λ IE exp
ξ1∗ ≤ 2M1 exp λ2 /2 .
r0
Denote c1 = 1/3 and ck = rk−1 /(3r0 ) = 2−k+2 /3 for k ≥ 2 . Then
P∞
k=1 ck
= 1 and it
holds by the H¨
older inequality; see Lemma A.16 below:
X
∞
∞
λ X ∗
λ ∗
λ ∗
log IE exp
ξk ≤ c1 log IE exp
ξ +
ck log IE exp
ξ
3r0
r0 1
rk−1 k
k=1
k=2
≤ λ2 /2 + c1 log(2M1 ) +
∞
X
k=2
< λ2 /2 + Q(Υ ◦ ).
ck log(2Mk )
38
Parametric estimation. Finite sample theory
This implies the result.
The exponential bound of Theorem A.1 can be used for obtaining a probability bound
on the maximum of the increments U(υ) − U(υ 0 ) over Υ ◦ .
Corollary A.2. Suppose (Ed) . If Υ ◦ is a central set with the center υ ◦ and the radius
r0 , then it holds for any x > 0
1
IP
sup U(υ, υ 0 ) > z(x, Q) ≤ exp −x ,
3ν0 r0 υ∈Υ ◦
(A.5)
where with g0 = ν0 g and Q = Q(Υ ◦ )
p
 2(x + Q)
def
z(x, Q) =
g−1 (x + Q) + g /2
0
0
if
p
2(x + Q) ≤ g0 ,
(A.6)
otherwise.
def
Proof. By the Chebyshev inequality, it holds for the r.v. ξ = supυ∈Υ ◦ U(υ, υ 0 )/(3ν0 r0 )
for any λ ≤ g0 by (A.3)
log IP ξ > z ≤ −λz + log IE exp λξ ≤ −λz + λ2 /2 + Q.
Now, given x > 0 , we choose λ =
p
2(x + Q) if this value is not larger than g0 , and
λ = g0 otherwise. It is straightforward to check that λz − λ2 /2 − Q ≥ x in both cases,
and the choice of z by (A.6) yields the bound (A.5).
A.2
Application to a two-norms case
As an application of the local bound from Theorem A.1 we consider the result from
Baraud (2010), Theorem 3. For convenience of comparison we utilize the notation from
that paper. Let T be a subset of a linear space S of dimension D , endowed with two
norms denoted by d(s, t) and δ(s, t) for s, t ∈ T . Let also (Xt )t∈T be a random process
on T . The basic assumption of Baraud (2010) is a kind of a Bernstein bound: for some
fixed c > 0
λ2 d(s, t)2 /2
,
log IE exp λ(Xt − Xs ) ≤
1 − λcδ(s, t)
if λcδ(s, t) < 1.
(A.7)
The aim is to bound the maximum of the process Xt over a bounded subset Tv,b defined
for v, b > 0 and a specific point t0 as
def
Tv,b =
t : d(t, t0 ) ≤ v, cδ(t, t0 ) ≤ b .
Let Q = c1 D with c1 = 2 for D ≥ 2 and c1 = 2.7 for D = 1 .
39
spokoiny, v.
Theorem A.3. Suppose that (Xt )t∈S fulfills (A.7), where S is a D -dimensional linear
space. For any ρ < 1 , it holds
√
1−ρ
log IP
sup (Xt − Xt0 ) > z(x, Q) ≤ −x
3v t∈Tv,b
(A.8)
where z(x, Q) from (A.6) with g0 = ρ(1 − ρ)−1/2 b−1 v .
Proof. Define the new semi-distance d∗ (s, t) by
def
d∗ (s, t) = max d(s, t), b−1 vcδ(s, t) .
The set Tv,b can be represented as
Tv,b = t : d∗ (t, t0 ) ≤ v
Moreover, Lemma A.10 applied for the semi-distance d∗ (t, s) yields Q(Tv,b ) ≤ c1 D ,
where c1 = 2 for D ≥ 2 , and c1 = 2.4 for D = 1 .
Fix some ρ < 1 and define g = ρb−1 v . Then for |λ| ≤ g , it holds
λ
d∗ (s, t)
cδ(s, t) ≤
λ
b−1 vcδ(s, t)
cδ(s, t) ≤ ρ
and by (A.7), it follows with ν02 = (1 − ρ)−1
n X −X o
n X −X o
λ2 /2
ν 2 λ2
t
s
t
s
≤ log IE exp λ
≤
≤ 0 .
log IE exp λ ∗
d (s, t)
d(s, t)
1−ρ
2
So, condition (Ed) is fulfilled. Now the result follows from Corollary A.2.
If v is large relative to b , then g = ρv/b is large as well. With moderate values of
p
x , this allows for applying the bound (A.8) with z(x, Q) = 2(x + Q) . In other words,
the value z ≈ z(x, Q) ensures that the maximum of Xt − Xt0 over t ∈ Tv,b deviates over
3vz with the exponentially small probability e−x .
A.3
A local central bound
Due to the result of Theorem A.1, the bound for the maximum of U(υ, υ 0 ) over υ ∈
Br (υ 0 ) grows quadratically in r . So, its applications to situations with r2 Q(Υ ◦ )
are limited. The next result shows that introducing a negative quadratic drift helps
to state a uniform in r local probability bound. Namely, the bound for the process
U(υ, υ 0 ) − ρd2 (υ, υ 0 )/2 with some positive ρ over a ball Br (υ 0 ) around the point
υ 0 only depends on the drift coefficient ρ but not on r . Here the generic chaining
arguments are accomplished with the slicing technique. The idea is for a given r∗ > 1
to split the ball Br∗ (υ 0 ) into the slices Br+1 (υ 0 ) \ Br (υ 0 ) and to apply Theorem A.1
to each slice separately with a proper choice of the parameter λ .
40
Parametric estimation. Finite sample theory
Theorem A.4. Let r∗ be such that (Ed) holds on Br∗ (υ 0 ) . Let also Q(Υ ◦ ) ≤ Q for
√
Υ ◦ = Br (υ 0 ) with r ≤ r∗ . If ρ > 0 and z are fixed to ensure 2ρz ≤ g0 = ν0 g and
ρ(z − 1) ≥ 2 , then it holds
log IP
sup
υ∈Br∗ (υ 0 )
ρ
1
U(υ, υ 0 ) − d2 (υ, υ 0 )
3ν0
2
>z
≤ −ρ(z − 1) + log(4z) + Q.
Moreover, if
√
(A.9)
2ρz > g0 , then
log IP
sup
υ∈Br∗ (υ 0 )
≤ −g0
1
ρ
U(υ, υ 0 ) − d2 (υ, υ 0 )
3ν0
2
>z
p
ρ(z − 1) + g20 /2 + log(4z) + Q.
(A.10)
Remark A.1. Formally the bound applies even with r∗ = ∞ provided that (Ed) is
fulfilled on the whole set Υ ◦ .
Proof. Denote
def
u(r) =
1
U(υ) − U(υ 0 ) .
sup
3ν0 r υ∈Br (υ0 )
Then we have to bound the probability
IP sup r u(r) − ρr2 /2 > z .
r≤r∗
For each r ≤ r∗ and λ ≤ g0 , it follows from (A.3) that
log IE exp λu(r) ≤ λ2 /2 + Q.
The choice λ =
√
2ρz is admissible in view of
√
2ρz ≤ g0 . This implies by the exponential
Chebyshev inequality
log IP r u(r) − ρr2 /2 ≥ z ≤ −λ(z/r + ρr/2) + λ2 /2 + Q
= −ρz(x + x−1 − 1) + Q,
where x =
p
(A.11)
ρ/(2z) r . By definition, ru(r) increases in r . We use that for any growing
function f (·) , any t ≤ t∗ , and any s with t ≤ s ≤ t + 1 it holds f (t) − t ≤ f (s) − s + 1
yielding
1I f (t) − t ≥ z ≤
Z
t+1
Z
1I f (s) − s + 1 > z ds ≤
t
0
t∗ +1
1I f (s) − s + 1 > z ds.
41
spokoiny, v.
As this inequality is satisfied for any t , it also applies to the maximum over t ≤ t∗ :
1I sup
f (t) − t ≥ z ≤
t∗ +1
Z
0≤t≤t∗
1I f (s) − s + 1 > z ds.
0
If a function f (t) is random and growing in t , it follows
t∗ +1
Z
IP sup f (t) − t ≥ z ≤
IP f (s) − s + 1 > z ds
t≤t∗
(A.12)
0
Now we consider t = ρr2 /2 = zx2 , so that dt = 2z x dx . It holds by (A.11) and (A.12)
IP
sup r u(r) − ρr2 /2 > z
r≤r∗
t∗ +1
Z
IP r u(r) − t > z − 1 dt
≤
0
t∗ +1
Z
≤ 2z
exp −ρ(z − 1)(x + x−1 − 1) + Q x dx
0
≤ 2ze−b+Q
Z
∞
exp −b(x + x−1 − 2) x dx
0
with b = ρ(z − 1) and t∗ = ρr∗ 2 /2 . This implies for b ≥ 2
IP
Z
2
−b+Q
sup r u(r) − ρr /2 > z ≤ 2ze
r≤r∗
∞
exp −2(x + x−1 − 2) x dx
0
≤ 4z exp{−ρ(z − 1) + Q}
and (A.9) follows.
√
If 2ρz > g0 , then select λ = g0 . For r ≤ r∗
log IP r u(r) − ρr2 /2 ≥ z = log IP u(r) > z/r + ρr/2
≤ −λ(z/r + ρr/2) + λ2 /2 + Q
√
√
≤ −λ ρz(x + x−1 − 2)/2 − λ ρz + λ2 /2 + Q,
where x =
p
ρ/z r . This allows to bound in the same way as above
IP
2
sup r u(r) − ρr /2 > z
r≤r∗
p
≤ 4z exp −λ ρ(z − 1) + λ2 /2 + Q
yielding (A.10).
This result can be used for describing the concentration bound for the maximum of
(3ν0 )−1 U(υ, υ 0 ) − ρd2 (υ, υ 0 )/2 . Namely, it suffices to find z ensuring the prescribed
deviation probability. We state the result for a special case with ρ = 1 and g0 ≥ 3
which simplifies the notation.
42
Parametric estimation. Finite sample theory
Corollary A.5. Under the conditions of Theorem A.4, for any x ≥ 0 with x + Q ≥ 4
o
n 1
1 2
U(υ, υ 0 ) − d (υ, υ 0 ) > z0 (x, Q) ≤ exp −x ,
IP
sup
2
υ∈Br∗ (υ 0 ) 3ν0
where with g0 = ν0 g ≥ 2

 1 + √x + Q2
def
z0 (x, Q) =
1 + 2g−1 (x + Q) + g 2
0
Proof. First consider the case 1 +
2
√
that z = 1 + x + Q ensures
√
0
if 1 +
√
x + Q ≤ g0 ,
(A.13)
otherwise.
x + Q ≤ g0 . In view of (A.9), it suffices to check
z − 1 − log(4z) − Q ≥ x.
This follows from the inequality
(1 + y)2 − 1 − 2 log(2 + 2y) ≥ y2
√
x + Q ≥ 2.
√
If 1 + x + Q > g0 , define z = 1 + y2 with y = 2g−1
0 (x + Q) + g0 . Then
with y =
g0
p
z − 1 − log(4z) − g20 /2 − Q − x = g0 y/2 − log 4(1 + y2 ) ≥ 0
because g0 ≥ 3 and 3y/2 − log(1 + y2 ) ≥ log(4) for y ≥ 2 .
If g √
Q and x is not too big then z0 (x, Q) is of order x + Q . So, the main
message of this result is that with a high probability the maximum of (3ν0 )−1 U(υ, υ 0 ) −
d2 (υ, υ 0 )/2 does not significantly exceed the level Q .
A.4
A multiscale upper function and hitting probability
The result of the previous section can be explained as a local upper function for the process U(·) . Indeed, in a vicinity Br∗ (υ 0 ) of the central point υ 0 , it holds (3ν0 )−1 U(υ, υ 0 ) ≤
d2 (υ, υ 0 )/2 + z with a probability exponentially small in z . This section aims at extending this local result to the whole set Υ using multiscaling arguments. For simplifying
the notations assume that U(υ 0 ) ≡ 0 . Then U(υ, υ 0 ) = U(υ) . We say that u(µ, υ) is
a multiscale upper function for µU(·) on a subset Υ ◦ of Υ if
IP sup sup µU(υ) − u(µ, υ) ≥ z(x) ≤ e−x ,
(A.14)
µ∈M υ∈Υ ◦
for some fixed function z(x) . An upper function can be used for describing the concene = argmaxυ∈Υ ◦ U(υ) ; see Theorem A.8 below.
tration sets of the point of maximum υ
43
spokoiny, v.
The desired global bound requires an extension of the local exponential moment
condition (Ed) . Below we suppose that the pseudo-metric d(υ, υ 0 ) is given on the whole
set Υ . For each r this metric defines the ball Υ◦ (r) by the constraint d(υ, υ 0 ) ≤ r .
Below the condition (Ed) is assumed to be fulfilled for any r , however the constant g
may be dependent of the radius r .
(Er)
For any r , there exists g(r) > 0 such that (A.1) holds for all υ, υ 0 ∈ Υ◦ (r) and
all λ ≤ g(r) .
Condition (Er) implies a similar condition for the scaled process µU(υ) with g =
µ−1 g(r) and d(υ, υ 0 ) replaced by µd(υ, υ 0 ) . Corollary A.5 implies for any x with
√
def
1 + x + Q ≤ g0 (r) = ν0 g(r)/µ
n µ
1 2 2o
IP
sup
(A.15)
U(υ) − µ r > z0 (x, Q) ≤ exp −x .
2
υ∈Br (υ 0 ) 3ν0
Let now a finite or separable set M and a function t(µ) ≥ 1 be fixed such that
X
e−t(µ) ≤ 2.
(A.16)
µ∈M
One possible choice of the set M and the function t(µ) is to take a geometric sequence
µk = µ0 2−k with any fixed µ0 and define t(µk ) = k = − log2 (µk /µ0 ) for k ≥ 0 .
Putting together the bounds (A.15) for different µ ∈ M yields the following result.
Theorem A.6. Suppose (Er) and (A.16). Then for any x ≥ 2 , there exists a random
set A(x) of a total probability at least 1 − 2e−x , such that it holds on A(x) for any r
h µ
p
2 i
1
U(υ) − µ2 r2 − 1 + x + Q + t(µ)
< 0,
2
υ∈Br (υ 0 ) µ∈M(r,x) 3ν0
sup
sup
where
def
M(r, x) =
µ∈M: 1+
p
x + Q + t(µ) ≤ ν0 g(r)/µ .
Proof. For each µ ∈ M(r, x) , Corollary A.5 implies
IP
p
2 1
µ
U(υ) − µ2 r2 ≥ 1 + x + Q + t(µ)
≤ e−x−t(µ) .
2
υ∈Br (υ 0 ) 3ν0
sup
The desired assertion is obtained by summing over µ ∈ M due to (A.16).
Moreover, the inequality x + Q ≥ 2.5 yields
p
2
1 + x + Q + t(µ) ≤ 2 x + Q + t(µ) .
This allows to take in (A.14) u(µ, υ) = 3ν0 µ2 r2 /2 + 2t(µ) and z(x) = 2(x + Q) .
44
Parametric estimation. Finite sample theory
Corollary A.7. Suppose (Er) and (A.16). Then for any x with x + Q ≥ 2.5 , there
exists a random set Ω(x) of a total probability at least 1 − 2e−x , such that it holds on
Ω(x) for any r
o
n µ
1
U(υ) − µ2 r2 − 2t(µ) < 2(x + Q).
2
υ∈Br (υ 0 ) µ∈M(r,x) 3ν0
sup
sup
Now we briefly discuss the hitting problem. Let M (υ) be a deterministic boundary
function. We aim at bounding the probability that a process U(υ) hits this boundary
on the set Υ . This precisely means the probability that supυ∈Υ U(υ) − M (υ) ≥ 0 .
An important observation here is that multiplication by any positive factor µ does not
change the relation. This allows to apply the multiscale result from Theorem A.6. For
any fixed x and any υ ∈ Br (υ 0 ) , define
def
M∗ (υ) =
o
n 1
1
µM (υ) − µ2 r2 − 2t(µ) .
2
µ∈M(r,x) 3ν0
sup
Theorem A.8. Suppose (Er) , (A.16), and x + Q ≥ 2.5 . Let, given x , it hold
M∗ (υ) ≥ 2(x + Q),
υ ∈ Υ.
(A.17)
Then
IP sup U(υ) − M (υ) ≥ 0 ≤ 2e−x .
υ∈Υ
Maximizing the expression (3ν0 )−1 µM (υ)−µ2 r2 /2 suggests the choice µ = M (υ)/(3ν0 r2 )
yielding M∗ (υ) ≥ M 2 (υ)/(6ν02 r2 ) − 2t(µ) . In particular, the condition (A.17) requires
that M (υ) grows with r a bit faster than a linear function.
A.5
Finite-dimensional smooth case
Here we discuss the special case when Υ is an open subset in IRp , the stochastic prodef
cess U(υ) is absolutely continuous and its gradient ∇U(υ) = dU(υ)/dυ has bounded
exponential moments.
(ED)
There exist g > 0 , ν0 ≥ 1 , and for each υ ∈ Υ , a symmetric non-negative
matrix H(υ) such that for any λ ≤ g and any unit vector γ ∈ IRp , it holds
n γ > ∇U(υ) o
≤ ν02 λ2 /2.
log IE exp λ
kH(υ)γk
A natural candidate for H 2 (υ) is the covariance matrix Var ∇U(υ) provided that
this matrix is well posed. Then the constant ν0 can be taken close to one by reducing
the value g ; see Lemma A.17 below.
spokoiny, v.
45
In what follows we fix a subset Υ ◦ of Υ and establish a bound for the maximum of the
process U(υ, υ ◦ ) = U(υ) − U(υ ◦ ) on Υ ◦ for a fixed point υ ◦ . We will assume existence
of a dominating matrix H ∗ = H ∗ (Υ ◦ ) such that H(υ) H ∗ for all υ ∈ Υ ◦ . We also
assume that π is the Lebesgue measure on Υ . First we show that the differentiability
condition (ED) implies (Ed) .
Lemma A.9. Assume that (ED) holds with some g and H(υ) H ∗ for υ ∈ Υ ◦ .
Consider any υ, υ ◦ ∈ Υ ◦ . Then it holds for |λ| ≤ g
ν02 λ2
U(υ, υ ◦ )
≤
.
log IE exp λ
kH ∗ (υ − υ ◦ )k
2
Proof. Denote δ = kυ − υ ◦ k , γ = (υ − υ ◦ )/δ . Then
Z 1
∇U(υ ◦ + tδγ)dt
U(υ, υ ◦ ) = δγ >
0
and kH ∗ (υ − υ ◦ )k = δkH ∗ γk . Now the H¨older inequality and (ED) yield
ν02 λ2
U(υ, υ ◦ )
−
IE exp λ
kH ∗ (υ − υ ◦ )k
2
Z 1 >
γ ∇U(υ ◦ + tδγ) ν02 λ2
dt
= IE exp
−
λ
kH ∗ γk
2
0
>
Z 1
γ ∇U(υ ◦ + tδγ) ν02 λ2
≤
IE exp λ
−
dt ≤ 1
kH ∗ γk
2
0
as required.
The result of Lemma A.9 enables us to define d(υ, υ 0 ) = kH ∗ (υ − υ ◦ )k so that the
corresponding ball coincides with the ellipsoid B(r, υ ◦ ) . Now we bound the value Q(Υ ◦ )
for Υ ◦ = B(r0 , υ ◦ ) .
Lemma A.10. Let Υ ◦ = B(r0 , υ ◦ ) . Under the conditions of Lemma A.9, it holds
Q(Υ ◦ ) ≤ c1 p , where c1 = 2 for p ≥ 2 , and c1 = 2.7 for p = 1 .
Proof. The set Υ ◦ coincides with the ellipsoid B(r0 , υ ◦ ) while the d -ball Bk (υ) coincides with the ellipsoid B(rk , υ) for each k ≥ 2 . By change of variables, the study can
be reduced to the case with υ ◦ = 0 , H ∗ ≡ Ip , r0 = 1 , so that B(r, υ) is the usual
Euclidean ball in IRp of radius r . It is obvious that the measure of the overlap of two
balls B(1, 0) and B(2−k+1 , υ) for kυk ≤ 1 is minimized when kυk = 1 , and this value
is the same for all such υ .
Now we use the following observation. Fix υ ] with kυ ] k = 1 . Let r ≤ 1 , υ [ =
(1 − r2 /2)υ ] and r[ = r − r2 /2 . If υ ∈ B(r[ , υ [ ) , then υ ∈ B(r, υ ] ) because
kυ ] − υk ≤ kυ ] − υ [ k + kυ − υ [ k ≤ r2 /2 + r − r2 /2 ≤ r.
46
Parametric estimation. Finite sample theory
Moreover, for each υ ∈ B(r[ , υ [ ) , it holds with u = υ − υ [
kυk2 = kυ [ k2 + kuk2 + 2u> υ [ ≤ (1 − r2 /2)2 + |r[ |2 + 2u> υ [ ≤ 1 + 2u> υ [ .
This means that either υ = υ [ + u or υ [ − u belongs to the ball B(r0 , υ ◦ ) and thus,
π B(1, 0) ∩ B(r, υ) ≥ π B(r[ , υ [ ) /2 . We conclude that
2π B(1, 0)
π B(1, 0)
≤
= 2(r − r2 /2)−p .
π B(1, 0) ∩ B(r, υ ] )
π B(r[ , 0)
This implies for k ≥ 1 and rk = 2−k+1 that 2Mk+1 ≤ 22+kp (1−2−k−1 )−p . The quantity
Q(Υ ◦ ) can now be evaluated as
∞
∞
2 X −k
2p X −k
1
log(22+p ) +
2 log(22+kp ) −
2 log(1 − 2−k−1 )
3
3
3
Q(Υ ◦ ) ≤
k=1
log 2 h
2+p+2
3
=
∞
X
k=1
k=1
i 2p
(2 + kp)2−k −
3
∞
X
2−k log(1 − 2−k−1 ) ≤ c1 p,
k=1
where c1 = 2 for p ≥ 2 , and c1 = 2.7 for p = 1 , and the result follows.
Now we specify the local bounds of Theorem A.1 and the central result of Corollary A.5 to the smooth case.
Theorem A.11. Suppose (Ed) . For any λ ≤ ν0 g , r0 > 0 , and υ ◦ ∈ Υ
log IE exp
n
λ
3ν0 r0
sup
o
U(υ) − U(υ ◦ ) ≤ λ2 /2 + Q,
υ∈B(r0 ,υ ◦ )
where Q = c1 p .
def
We consider the local sets of the elliptic form Υ◦ (r) = {υ : kH0 (υ − υ 0 )k ≤ r} ,
where H0 dominates H(υ) on this set: H(υ) H0 .
Theorem A.12. Let (ED) hold with some g and a matrix H(υ) . Suppose that H(υ) H0 for all υ ∈ Υ◦ (r) . Then
n 1
o
1
2
IP
sup
U(υ, υ 0 ) − kH0 (υ − υ 0 )k ≥ z0 (x, p) ≤ exp(−x),
2
υ∈Υ◦ (r) 3ν0
(A.18)
where z0 (x, p) coincides with z0 (x, Q) from (A.13) for Q = c1 p .
Remark A.2. An important feature of the established result is that the bound in the
right hand-side of (A.18) does not depend on the value r describing the radius of the
local vicinity around the central point υ 0 . In the ideal case one would apply this result
with r = ∞ provided that the conditions H(υ) ≤ H0 is fulfilled uniformly over Υ .
47
spokoiny, v.
Proof. Lemma A.10 implies (Ed) with d(υ, υ 0 ) = kH0 (υ − υ 0 )k2 /2 . Now the result
follows from Corollary A.5.
The global result of Theorem A.6 applies without changes to the smooth case.
A.6
Roughness constraints for dimension reduction
The local bounds of Theorems A.1 and A.4 can be extended in several directions. Here
we briefly discuss one extension related to the use of a smoothness condition on the
parameter υ . Let let t(υ) be a non-negative penalty function on Υ . A particular
example of such penalty function is the roughness penalty t(υ) = kGυk2 for a given
matrix IRp . Let t0 ≥ 1 be fixed. Redefine the sets Br (υ ◦ ) by the constraint t(υ) ≤ t0 :
Br (υ ◦ ) = υ ∈ Υ : d(υ, υ ◦ ) ≤ r; t(υ) ≤ t0 ,
and consider Υ ◦ = Br◦ (υ ◦ ) for a fixed central point υ ◦ and the radius r◦ . One can
easily check that the results of Theorems A.1 and A.4 and their corollaries extend to
this situation without any change. The only difference is in the definition of the value
Q(Υ ◦ ) and Q . Each value Q(Υ ◦ ) is defined via the quantities πk (υ) = π Brk (υ)
which obviously change when each ball Br (υ) is redefined. Examples below show that
the use of the penalization can substantially reduce the value Q .
Now we specify the results to the case of a smooth process U given on a local
ball Υ ◦ = Br◦ (υ ◦ ) defined by the condition {kH0 (υ − υ ◦ )k ≤ r◦ } and a smoothness
constraint kGυk2 ≤ t0 = r2◦ . The local set Br (υ) are of the form:
def
Br (υ) =
υ 0 : H0 (υ − υ 0 ) ≤ r, kGυ 0 k ≤ r◦ .
(A.19)
The effective dimension pe = pe (S) can be defined as the dimension of the subspace
on which H0 ≥ G . The formal definition uses the spectral decomposition of the matrix
S = H0−1 G2 H0−1 . Let g1 ≤ g2 ≤ . . . ≤ gp be the eigenvalue of S . Define pe (S) as the
largest index j for which gj < 1 :
def
pe (S) = max j ≥ 1 : gj < 1 .
(A.20)
In the non-penalized case, the entropy term Q is proportional to the dimension p .
The roughness penalty enables to reduce p to the effective dimension pe (S) which can
be much smaller than p depending on the relation between matrices H0 and G . More
precisely, if the eigenvalues gj of S grow sufficiently fast, the entropy calculus effectively
reduces to the coordinates with gj ≤ 1 .
48
Parametric estimation. Finite sample theory
Lemma A.13. Let g1 = 0 . For each r◦ ≥ 1 , it holds
Q(Υ◦ (r◦ )) ≤ e1 ps
(A.21)
with ps = ps (S) defined by
def
ps (S) = pe (S) +
p
X
gj−1 log+ (gj ).
(A.22)
j=1
If the sum
−1
j≥1 gj log+ (gj )
P
is bounded by a fixed constant, then the value ps is
close to the effective dimension pe (S) from (A.20).
Proof. We follow the proof of Lemma A.10. By a change of variables one can reduce
the problem to the case when H0 is the identity matrix and r◦ ≡ 1 . Moreover, one
can easily see that υ ◦ = 0 is the hardest case. The case of υ ◦ 6= 0 can be considered
similarly. By a further change the matrix S = G2 can be represented in diagonal form:
S = diag{g 2 ≤ . . . ≤ gp2 } . Let υ ] be any point with υ ] ≤ 1 and kGυ ] k ≤ 1 , and
1
r ≤ 1 . Simple arguments show that the measure of the set Br (υ ] ) over all such vectors
υ ] is minimized at υ ] = (1, 0, . . . , 0)> . Define r[ = r − r2 /2 , υ [ = (1 − r2 /2)υ ] . Fix
any υ such that u = υ − υ [ fulfills kuk ≤ r[ , kGuk ≤ r , and u1 < 0 . As Gυ [ = 0 , it
holds
kGυk = kGuk ≤ r ≤ 1,
kυ − υ ] k ≤ kυ − υ [ k + kυ [ − υ ] k ≤ r[ + r2 /2 = r,
kυk2 = kυ [ + uk2 = kυ [ k2 + kuk2 + 2u> υ [ ≤ (1 − r2 /2)2 + |r[ |2 < 1.
This yields that π B1 (0) ∩ Br (υ ] ) ≥ π Br[ (0) /2 . Moreover, let the index pe (r) be
defined as the largest j with rgj < 1 . Consider any υ ∈ B1 (0) and construct another
point υ(r) by multiplying with r every element υ j for j ≤ pe (r) . The construction
ensures that υ(r) ∈ Br (0) . This implies
2π B1 (0)
π B1 (0)
≤
≤ 2|r[ |−pe (r) .
]
π B1 (0) ∩ Br (υ )
π Br[ (0)
Application of this bound for k ≥ 1 , rk+1 = 2−k , and pk = pe (rk+1 ) yields that
2Mk+1 ≤ 22+kpk (1 − 2−k−1 )−pk . The quantity Q(Υ ◦ ) can now be evaluated as
Q(Υ ◦ ) ≤
∞
∞
k=1
k=1
1
2 X −k
2 X −k
log(22+pe ) +
2 log(22+kpk ) −
2 pk log(1 − 2−k−1 ).
3
3
3
49
spokoiny, v.
Further, for each g > 1 , it holds with k(g) = max{k : g < 2k }
def
s(g) =
∞
X
2−k+1 k 1I 2−k g < 1 ≤ 2k(g)2−k(g) ≤ 2k(g)/g ≤ 2g −1 log2 (2g).
k=1
Thus,
∞
X
−k
2
kpk ≤
k=1
∞
X
−k
2
k
p
X
1I 2
−k
j=1
k=1
p
X
gj < 1 ≤ 2
s(gj ).
j=1
This easily implies the result (A.21); cf. the proof of Lemma A.10.
The first result adjusts Theorem A.11 to the penalized case. The maximum of the
process U is taken over a ball Br (υ) from (A.19) which is smaller than the similar ball
in the non-penalized case. This explains the gain in the entropy term Q .
Theorem A.14. Let (ED) hold with some g and a matrix H(υ) H0 for all υ ∈ Υ .
Then for any λ ≤ ν0 g , r◦ ≥ 1 , and υ ◦ ∈ Υ
log IE exp
n
λ
3ν0 r◦
sup
o
U(υ) − U(υ ◦ ) ≤ λ2 /2 + Q,
υ∈Br◦ (υ ◦ )
where Br◦ (υ ◦ ) is given by (A.19), Q = e1 ps , and ps is the effective dimension from
(A.22).
Proof. The result follows from Corollary A.5. It is only required to evaluate the local
entropy Q(Υ◦ (r◦ )) . This is done in Lemma A.13.
The magnitude of the process U over Br◦ (υ ◦ ) is of order r◦ and it grows with r◦ .
The use of the negative drift allows to establish a unified result.
Theorem A.15. Let r◦ ≥ 1 be fixed and let (ED) hold with some g and a matrix
H(υ) H0 for all υ ∈ Br◦ (υ 0 ) . Then
n 1
2 o
1
IP
sup
U(υ, υ 0 ) − H0 (υ − υ 0 )
≥ z(x, Q) ≤ exp(−x),
2
υ∈Br◦ (υ 0 ) 3ν0
where z(x, Q) is given by (A.13) with Q = e1 ps .
The result of Theorem A.6 for the non-penalized case applies without big changes to
the penalized case.
50
A.7
Parametric estimation. Finite sample theory
Auxiliary facts
P
Lemma A.16. For any r.v.’s ξk and λk ≥ 0 such that Λ = k λk ≤ 1
X
X
log IE exp
λ k ξk ≤
λk log IEeξk .
k
k
Proof. Convexity of ex and concavity of xΛ imply
X
X
1
Λ
Λ
ξk
ξk
≤ IE exp
IE exp
λk ξk − log IEe
λk ξk − log IEe
Λ
Λ
k
k
≤
1X
λk IE exp ξk − log IEeξk
Λ
Λ
= 1.
k
Lemma A.17. Let a r.v. ξ fulfill IEξ = 0 , IEξ 2 = 1 and IE exp(λ1 |ξ|) = κ < ∞ for
some λ1 > 0 . Then for any % < 1 there is a constant C1 depending on κ , λ1 and %
only such that for λ < %λ1
log IEeλξ ≤ C1 λ2 /2.
Moreover, there is a constant λ2 > 0 such that for all λ ≤ λ2
log IEeλξ ≥ %λ2 /2.
Proof. Define h(x) = (λ − λ1 )x + m log(x) for m ≥ 0 and λ < λ1 . It is easy to see by
a simple algebra that
max h(x) = −m + m log
x≥0
m
.
λ1 − λ
Therefore for any x ≥ 0
λx + m log(x) ≤ λ1 x + log
m
e(λ1 − λ)
m
.
This implies for all λ < λ1
m
IE|ξ| exp(λ|ξ|) ≤
m
e(λ1 − λ)
m
IE exp(λ1 |ξ|).
Suppose now that for some λ1 > 0 , it holds IE exp(λ1 |ξ|) = κ(λ1 ) < ∞ . Then the
function h0 (λ) = IE exp(λξ) fulfills h0 (0) = 1 , h00 (0) = IEξ = 0 , h000 (0) = 1 and for
λ < λ1 ,
h000 (λ) = IEξ 2 eλξ ≤ IEξ 2 eλ|ξ| ≤
1
IE exp(λ1 |ξ|).
(λ1 − λ)2
spokoiny, v.
This implies by the Taylor expansion for λ < %λ1 that
h0 (λ) ≤ 1 + C1 λ2 /2
with C1 = κ(λ1 )/ λ21 (1 − %)2 , and hence, log h0 (λ) ≤ C1 λ2 /2 .
51
52
Parametric estimation. Finite sample theory
References
Andresen, A. and Spokoiny, V. (2013). Critical dimension in profile semiparametric
estimation. Manuscript. arXiv:1303.4640.
Baraud, Y. (2010). A Bernstein-type inequality for suprema of random processes with
applications to model selection in non-Gaussian regression. Bernoulli, 16(4):1064–1085.
Bednorz, W. (2006). A theorem on majorizing measures. Ann. Probab., 34(5):1771–1781.
Birg´e, L. (2006). Model selection via testing: an alternative to (penalized) maximum
likelihood estimators. Annales de l’Institut Henri Poincare (B) Probability and Statistics.
Birg´e, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators.
Probab. Theory Relat. Fields, 97(1-2):113–150.
Birg´e, L. and Massart, P. (1998). Minimum contrast estimators on sieves: Exponential
bounds and rates of convergence. Bernoulli, 4(3):329–375.
Boucheron, S., Lugosi, G., and Massart, P. (2003). Concentration inequalities using the
entropy method. Ann. Probab., 31(3):1583–1614.
Ibragimov, I. and Khas’minskij, R. (1981). Statistical estimation. Asymptotic theory.
Transl. from the Russian by Samuel Kotz. New York - Heidelberg -Berlin: SpringerVerlag .
Le Cam, L. (1960). Locally asymptotically normal families of distributions. Certain
approximations to families of distributions and their use in the theory of estimation
and testing hypotheses. Univ. California Publ. Stat., 3:37–98.
Le Cam, L. and Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts.
Springer-Verlag, New York.
McCullagh, P. and Nelder, J. (1989). Generalized linear models. 2nd ed. Monographs on
Statistics and Applied Probability. 37. London etc.: Chapman and Hall. xix, 511 p. .
Spokoiny, V., Wang, W., and H¨ardle, W. (2013). Local quantile regression (with rejoinder). J. of Statistical Planing and Inference. To appear as Discussion Paper.
ArXiv:1208.5384.
Talagrand, M. (1996).
24(3):1049–1103.
Majorizing measures: the generic chaining.
Ann. Probab.,
spokoiny, v.
53
Talagrand, M. (2001). Majorizing measures without measures. Ann. Probab., 29(1):411–
417.
Talagrand, M. (2005). The generic chaining. Springer Monographs in Mathematics.
Springer-Verlag, Berlin. Upper and lower bounds of stochastic processes.
Van de Geer, S. (1993). Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Stat., 21(1):14–44.
van der Vaart, A. and Wellner, J. A. (1996). Weak convergence and empirical processes.
With applications to statistics. Springer Series in Statistics. New York, Springer.