Exercise 8: Bias-variance decomposition of mean squared error

CS331: Machine Learning
Prof. Dr. Volker Roth
FS 2015
Aleksander Wieczorek
Dept. of Mathematics and Computer Science
Spiegelgasse 1
4051 Basel
Date: Monday, April 13th 2015
Exercise 8: Bias-variance decomposition of mean squared error
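Suppose that data $(x_i, y_i)$ are observed, where
$$y_i = f(x_i) + \varepsilon_i, \qquad i = 1, \dots, n,$$
and
• $x_i = (x_{i1}, \dots, x_{ip}) \in \mathbb{R}^p$
• $y_i \in \mathbb{R}$
• $f : \mathbb{R}^p \to \mathbb{R}$
• $\varepsilon_i$ is an error term with $E[\varepsilon_i] = 0$, $\mathrm{Var}[\varepsilon_i] = \sigma$, $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$ for $i \neq j$.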
Assume that a function $\hat{f}$, constructed from the data $(x_i, y_i)$, is used to approximate the unknown $f$.
The mean squared error (MSE) of $\hat{f}$ measures how well $\hat{f}$ approximates $f$ (i.e. predicts $y$
given a new $x$):
$$\mathrm{MSE}(\hat{f}(x)) = E[(\hat{f}(x) - f(x))^2]$$
For a new data point $x$, the MSE can be shown to depend on the bias and variance of $\hat{f}(x)$:
$$\mathrm{MSE}(\hat{f}(x)) = \mathrm{Bias}(\hat{f}(x))^2 + \mathrm{Var}(\hat{f}(x))$$
where
$$\mathrm{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$$
$$\mathrm{Var}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$
Prove the above result. What does it mean? What is its interpretation in the context of
regression / regularized (ridge) regression?
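The decomposition can also be checked numerically before proving it. The following is a minimal simulation sketch, not part of the exercise: it assumes a linear true function, Gaussian noise and a fixed design, and all names (`ridge_fit`, `bias_variance_at`, `noise_sd`, the coefficient vector `beta`) are hypothetical choices made for illustration. The expectation over the data is approximated by refitting a ridge estimator on many freshly drawn noise vectors.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y, without intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def bias_variance_at(x_new, lam, n=30, noise_sd=1.0, n_rep=2000, seed=0):
    """Monte Carlo estimates of bias, variance and MSE of the ridge prediction at x_new."""
    rng = np.random.default_rng(seed)
    p = x_new.shape[0]
    beta = np.arange(1, p + 1, dtype=float)  # assumed true coefficients (illustration only)
    X = rng.normal(size=(n, p))              # fixed design, reused in every replication
    f_new = x_new @ beta                     # true value f(x) at the new point
    preds = np.empty(n_rep)
    for r in range(n_rep):
        y = X @ beta + noise_sd * rng.normal(size=n)  # fresh noise in each replication
        preds[r] = x_new @ ridge_fit(X, y, lam)
    bias = preds.mean() - f_new
    var = preds.var()                        # ddof=0, so the identity holds for the sample
    mse = np.mean((preds - f_new) ** 2)
    return bias, var, mse

if __name__ == "__main__":
    x_new = np.ones(5)
    for lam in (0.0, 1.0, 10.0, 50.0):
        bias, var, mse = bias_variance_at(x_new, lam)
        print(f"lambda={lam:5.1f}  bias^2={bias**2:.4f}  var={var:.4f}  mse={mse:.4f}")
```

In this setup a larger penalty `lam` typically lowers the variance of the prediction and raises its squared bias, which is the trade-off the interpretation question asks about.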
Exercise 9: Hoeffding’s inequality
Consider independent random variables $X_1, \dots, X_n$ which are bounded, i.e. $X_i$ takes values in
$[a_i, b_i]$ with probability 1, $i = 1, \dots, n$. Then, for any $t > 0$, the sum $S_n = \sum_{i=1}^{n} X_i$ fulfils
the following inequality:
$$P(|S_n - E S_n| \geq t) \leq 2 \exp\left( \frac{-2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right). \tag{1}$$
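To get a feeling for how loose or tight the bound (1) is, one can compare it with a simulated tail probability. The sketch below is only an illustration under the assumption that the $X_i$ are uniform on $[a_i, b_i]$; the function names `hoeffding_bound` and `empirical_tail` are hypothetical.

```python
import numpy as np

def hoeffding_bound(t, a, b):
    """Right-hand side of (1): 2 * exp(-2 t^2 / sum_i (b_i - a_i)^2)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 2.0 * np.exp(-2.0 * t ** 2 / np.sum((b - a) ** 2))

def empirical_tail(t, a, b, n_rep=100_000, seed=0):
    """Monte Carlo estimate of P(|S_n - E S_n| >= t) for X_i ~ Uniform[a_i, b_i]."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    X = rng.uniform(a, b, size=(n_rep, a.size))  # one row per simulated sample
    S = X.sum(axis=1)
    ES = np.sum((a + b) / 2.0)                   # exact mean of the sum
    return np.mean(np.abs(S - ES) >= t)

if __name__ == "__main__":
    a, b = np.zeros(20), np.ones(20)
    for t in (1.0, 2.0, 3.0, 4.0):
        print(t, empirical_tail(t, a, b), hoeffding_bound(t, a, b))
```

The simulated frequency should stay below the bound for every `t`; the gap shows that Hoeffding's inequality is a worst-case statement over all distributions on $[a_i, b_i]$.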
Exercise
Give a proof of Hoeffding’s Inequality.
This can be done in several steps:
1. Show that for independent rv $X_1, \dots, X_n$ and any $s > 0$ we have:
$$P(S_n - E S_n \geq t) \leq e^{-st} \prod_{i=1}^{n} E e^{s(X_i - E X_i)}. \tag{2}$$
(a) Multiply both sides of the inequality $S_n - E S_n \geq t$ by $s$ and apply $\exp$.
(b) Use Markov's inequality: for a non-negative rv $X$ and $t > 0$ we have that $P(X \geq t) \leq \frac{E X}{t}$.
(c) Use the independence of $X_1, \dots, X_n$.
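As a rough guide to how hints (a) to (c) fit together, the chain of (in)equalities might be organised as follows. This is only an outline of one possible route; each step still has to be justified in the proof.

```latex
\begin{align*}
P(S_n - E S_n \geq t)
  &= P\left( e^{s(S_n - E S_n)} \geq e^{st} \right)
     && \text{(a): } s > 0 \text{ and } \exp \text{ is increasing} \\
  &\leq e^{-st} \, E\, e^{s(S_n - E S_n)}
     && \text{(b): Markov's inequality} \\
  &= e^{-st} \prod_{i=1}^{n} E\, e^{s(X_i - E X_i)}
     && \text{(c): independence of } X_1, \dots, X_n
\end{align*}
```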
2. Show that for a rv $X$ with $E X = 0$, if $X$ takes values in $[a, b]$ with probability 1 then for
any $s > 0$:
$$E e^{sX} \leq e^{s^2 (b-a)^2 / 8} \tag{3}$$
(a) Since $\exp$ is a convex function, $e^{sX} \leq C e^{sb} + D e^{sa}$ for some $C$ and $D$ to
determine.
(b) Take the expectation $E e^{sX}$.
(c) Take the Taylor series expansion of $\log(E e^{sX})$.
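Before working through hints (a) to (c), inequality (3) can be sanity-checked numerically. The sketch below assumes one natural choice of weights, $C = (X - a)/(b - a)$ and $D = (b - X)/(b - a)$, which comes from writing $X$ as a convex combination of $a$ and $b$; the function names are hypothetical. With $E X = 0$, taking expectations of the chord gives $(-a\, e^{sb} + b\, e^{sa})/(b - a)$, which is compared against the right-hand side of (3).

```python
import numpy as np

def chord(x, s, a, b):
    """Chord of exp(s * x) between x = a and x = b, i.e. C e^{sb} + D e^{sa}."""
    return ((x - a) * np.exp(s * b) + (b - x) * np.exp(s * a)) / (b - a)

def convexity_bound_mgf(s, a, b):
    """Expectation of the chord when E X = 0 (requires a < 0 < b)."""
    return (-a * np.exp(s * b) + b * np.exp(s * a)) / (b - a)

def hoeffding_lemma_rhs(s, a, b):
    """Right-hand side of (3): exp(s^2 (b - a)^2 / 8)."""
    return np.exp(s ** 2 * (b - a) ** 2 / 8.0)

if __name__ == "__main__":
    a, b = -0.5, 2.0
    for s in (0.1, 0.5, 1.0, 2.0):
        print(s, convexity_bound_mgf(s, a, b), hoeffding_lemma_rhs(s, a, b))
```

A numerical check of this kind is no substitute for the proof, but it helps catch sign or factor mistakes in the Taylor expansion of hint (c).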
3. Combining (2) and (3) we obtain:
$$P(S_n - E S_n \geq t) \leq \inf_{s > 0} \left( e^{-st} \prod_{i=1}^{n} e^{s^2 (b_i - a_i)^2 / 8} \right)$$
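One way to finish from here, sketched under the assumption that the infimum is evaluated by minimising the exponent in $s$ (the shorthand $c$ is introduced only for this sketch):

```latex
\begin{align*}
g(s) &= -st + \frac{s^2}{8} \sum_{i=1}^{n} (b_i - a_i)^2
      = -st + \frac{c\, s^2}{8},
      \qquad c := \sum_{i=1}^{n} (b_i - a_i)^2, \\
g'(s) &= -t + \frac{c\, s}{4} = 0
      \quad \Longrightarrow \quad s^{*} = \frac{4t}{c} > 0, \\
g(s^{*}) &= -\frac{4t^2}{c} + \frac{2t^2}{c} = -\frac{2t^2}{c},
      \qquad \text{so} \qquad
      P(S_n - E S_n \geq t) \leq \exp\left( \frac{-2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).
\end{align*}
```

Applying the same argument to the variables $-X_i$ bounds $P(S_n - E S_n \leq -t)$ by the same quantity, and adding the two one-sided bounds gives the factor 2 in (1).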