MA427 Ergodic Theory

Course Notes (2012-13)
1 Introduction
1.1 Orbits
Let X be a mathematical space. For example, X could be the unit interval [0, 1], a circle, a
torus, or something far more complicated like a Cantor set. Let T : X → X be a function
that maps X into itself.
Let x ∈ X be a point. We can repeatedly apply the map T to the point x to obtain the
sequence:
{x, T(x), T(T(x)), T(T(T(x))), . . .}.
We will often write T^n(x) = T(· · ·(T(T(x)))) (n times). The sequence of points
x, T(x), T^2(x), . . .
is called the orbit of x.
We think of applying the map T as the passage of time. Thus we think of T (x) as where
the point x has moved to after time 1, T 2 (x) is where the point x has moved to after time
2, etc.
Some points x ∈ X return to where they started. That is, T^n(x) = x for some n ≥ 1.
We say that such a point x is periodic with period n.
By way of contrast, points may move densely around the space X. (A sequence is
said to be dense if (loosely speaking) it comes arbitrarily close to every point of X.)
If we take two points x, y of X that start very close then their orbits will initially be close.
However, it often happens that in the long term their orbits move apart and indeed become
dramatically different. This is known as sensitive dependence on initial conditions, and is
popularly known as chaos.
In general, for a given dynamical system T it is impossible to understand the orbit structure
of every orbit. Ergodic theory takes a more qualitative approach: we aim to describe the long
term behaviour of a typical orbit, at least in the case when T satisfies a technical condition
called ‘measure-preserving’.
To make the notion of ‘typical’ precise, we need to use measure theory. Roughly speaking,
a measure is a function that assigns a ‘size’ to a given subset of X. One of the simplest
measures is Lebesgue measure on [0, 1]; here the measure of an interval [a, b] ⊂ [0, 1] is just
its length b − a.
Let T : [0, 1] → [0, 1] and fix a subinterval [a, b] ⊂ [0, 1]. Let x ∈ [0, 1]. What is the
frequency with which the orbit of x hits the set [a, b]? Recall that the characteristic function
χ_A of a subset A is defined by
χ_A(x) = 1 if x ∈ A,  χ_A(x) = 0 if x ∉ A.
Then the number of times the first n points of the orbit of x hits [a, b] is given by
Σ_{j=0}^{n−1} χ_[a,b](T^j(x)).
Thus the proportion of the first n points in the orbit of x that lie in [a, b] is equal to
(1/n) Σ_{j=0}^{n−1} χ_[a,b](T^j(x)).
Hence the frequency with which the orbit of x lies in [a, b] is given by
lim_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](T^j(x))
(assuming of course that this limit exists!).
One of the main results of the course, namely Birkhoff’s ergodic theorem, tells us that
when T is ergodic (a technical property—albeit an important one—that we won’t define here)
then for ‘most’ orbits the above frequency is equal to the measure of the interval [a, b]. In
the case of Lebesgue measure, this means that:
lim_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](T^j(x)) = b − a, for almost all x ∈ X.
(Here ‘almost all’ is the technical measure-theoretic way of saying ‘most’.)
One way of looking at Birkhoff’s ergodic theorem is the following: the time average of a
typical point x ∈ X (i.e. the frequency with which its orbit lands in a given subset) is equal
to the space average (namely, the measure of that subset).
In this course, we develop the necessary background that builds up to Birkhoff’s ergodic
theorem, together with some illuminating examples. We also study en route some interesting
diversions to other areas of mathematics, notably number theory.
1.2 Introducing the doubling map
Let X = [0, 1] denote the unit interval. Define the map T : X → X by:
T(x) = 2x mod 1 = { 2x if 0 ≤ x < 1/2;  2x − 1 if 1/2 ≤ x ≤ 1 }
(‘mod 1’ stands for ‘modulo 1’ and means ‘ignore the integer part’; for example 3.456 mod 1
is 0.456).
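As a quick illustration (a Python sketch, not part of the original notes; the helper name is ours), exact rational arithmetic lets us follow orbits of the doubling map without floating-point error. It confirms that the point 7/15 of Exercise 1.1 below is periodic:

```python
from fractions import Fraction

def doubling(x):
    """The doubling map T(x) = 2x mod 1 on [0, 1)."""
    y = 2 * x
    return y - 1 if y >= 1 else y

# Exact rational arithmetic sidesteps floating-point trouble:
# the orbit of 7/15 returns to itself after 4 steps.
x = Fraction(7, 15)
orbit = [x]
for _ in range(4):
    orbit.append(doubling(orbit[-1]))
# orbit == [7/15, 14/15, 13/15, 11/15, 7/15]
```

Note that any rational point with odd denominator is periodic in this way, since doubling permutes the residues mod the denominator.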
Exercise 1.1. Draw the graph of the doubling map. By sketching the orbit in this graph,
show that 7/15 is periodic. Try sketching the orbits of some points near 7/15.
In §1.1 we mentioned in passing that we will be interested in a technical condition called
‘measure-preserving’. We can illustrate this property here. Fix an interval [a, b] and consider
the set
T^{−1}[a, b] = {x ∈ [0, 1] | T(x) ∈ [a, b]}.
One can easily check that
T^{−1}[a, b] = [a/2, b/2] ∪ [(a + 1)/2, (b + 1)/2],
so that T^{−1}[a, b] is the union of two intervals, each of length (b − a)/2. Hence the length
of T^{−1}[a, b] is equal to b − a, which is the same as the length of [a, b].
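The frequency discussed in §1.1 can be estimated numerically. A Python sketch (not part of the original notes; the helper name is ours): iterating 2x mod 1 naively in floating point collapses to 0 after about 53 steps, since the binary digits are used up, so instead we realise a 'typical' point by drawing random binary digits.

```python
import random

def doubling_orbit_frequency(a, b, n_iterates, seed=0):
    """Estimate how often the orbit of a 'typical' x under T(x) = 2x mod 1
    lands in [a, b].  Since T shifts the binary expansion of x one place
    to the left, T^j(x) is the number whose binary expansion starts at
    digit j; we draw those digits at random to model a typical point."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n_iterates + 60)]
    hits = 0
    for j in range(n_iterates):
        # value of T^j(x) to about 60 binary digits of accuracy
        y = sum(bits[j + i] / 2.0 ** (i + 1) for i in range(60))
        if a <= y <= b:
            hits += 1
    return hits / n_iterates

freq = doubling_orbit_frequency(0.2, 0.5, 20000)
# the measure-preservation computed above is what makes the answer b - a
```

The estimate comes out close to b − a = 0.3, in line with what Birkhoff's ergodic theorem predicts for almost every starting point.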
1.3 Leading digits
The leading digit of a number n ∈ N is the digit (between 1 and 9) that appears at the
leftmost end of n when n is written in base 10. Thus the leading digit of 4629 is 4, etc.
Consider the sequence 2n :
1, 2, 4, 8, 16, 32, 64, 128, . . .
and consider the sequence of leading digits:
1, 2, 4, 8, 1, 3, 6, 1, . . . .
Exercise 1.2. By writing down the sequence of leading digits for 2n for n = 1, 2, . . . , something
large of your choosing, try guessing the frequency with which the digit 1 appears as a leading
digit. (Hint: it isn’t 3/10ths.) Do the same for the digit 2. Can you guess the frequency
with which the digit r appears?
We will study this problem in greater detail later.
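Exercise 1.2 can be explored by machine. The Python sketch below (not part of the original notes; `leading_digit` is our own helper) tabulates the leading digits of 2^n for n up to 3000; the counts suggest that digit 1 occurs with frequency close to log10 2 ≈ 0.30103 rather than exactly 3/10, and more generally that digit r occurs with frequency about log10((r + 1)/r).

```python
from math import log10

def leading_digit(n):
    """Leading (leftmost) base-10 digit of a positive integer n."""
    return int(str(n)[0])

# Empirical frequency of each leading digit among 2^1, ..., 2^N.
N = 3000
counts = {r: 0 for r in range(1, 10)}
for n in range(1, N + 1):
    counts[leading_digit(2 ** n)] += 1

freq_1 = counts[1] / N
# freq_1 is noticeably different from the naive guess 3/10
```

We will see in §2 why fractional parts of n log10 2 govern this behaviour.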
2 Uniform Distribution
2.1 Uniform distribution and Weyl's criterion
Before we discuss dynamical systems in greater detail, we shall consider a simpler setting
which highlights some of the main ideas in ergodic theory.
Let xn be a sequence of real numbers. We may decompose xn as the sum of its integer
part [xn ] = sup{m ∈ Z | m ≤ xn } (i.e. the largest integer which is less than or equal to xn )
and its fractional part {xn } = xn − [xn ]. Clearly, 0 ≤ {xn } < 1. The study of xn mod 1 is the
study of the sequence {xn } in [0, 1).
Definition 2.1. We say that the sequence xn is uniformly distributed mod 1 if for every a, b
with 0 ≤ a < b < 1, we have that
(1/n) card{j | 0 ≤ j ≤ n − 1, {x_j} ∈ [a, b]} → b − a
as n → ∞.
(The condition is saying that the proportion of the sequence {xn } lying in [a, b] converges to
b − a, the length of the interval.)
Remark 2.2. We can replace [a, b] by [a, b), (a, b] or (a, b) with the same result.
Exercise 2.3. Show that if xn is uniformly distributed mod 1 then {xn } is dense in [0, 1).
The following result gives a necessary and sufficient condition for xn to be uniformly
distributed mod 1.
Theorem 2.4 (Weyl’s Criterion). The following are equivalent:
(i) the sequence xn is uniformly distributed mod 1;
(ii) for each ℓ ∈ Z \ {0}, we have
(1/n) Σ_{j=0}^{n−1} e^{2πiℓx_j} → 0
as n → ∞.
2.2 The sequence x_n = nα
The behaviour of the sequence xn = nα depends on whether α is rational or irrational. If
α ∈ Q, it is easy to see that {nα} can take on only finitely many values in [0, 1): if α = p/q
(p ∈ Z, q ∈ N, hcf(p, q) = 1) then {nα} takes the q values
0, p/q, 2p/q, . . . , (q − 1)p/q.
In particular, {nα} is not uniformly distributed mod 1.
If α ∈ R \ Q then the situation is completely different. We shall apply Weyl's Criterion.
For ℓ ∈ Z \ {0}, e^{2πiℓα} ≠ 1, so we have
(1/n) Σ_{j=0}^{n−1} e^{2πiℓjα} = (1/n) (e^{2πiℓnα} − 1)/(e^{2πiℓα} − 1).
Hence
|(1/n) Σ_{j=0}^{n−1} e^{2πiℓjα}| ≤ (1/n) · 2/|e^{2πiℓα} − 1| → 0,
as n → ∞.
Hence nα is uniformly distributed mod 1.
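Both the equidistribution of {nα} and the decay of the exponential sum can be checked numerically. A Python sketch (not part of the original notes; the function names are ours), using α = √2:

```python
import cmath
import math

def equid_frequency(alpha, a, b, n):
    """Proportion of the points {j*alpha}, j = 0, ..., n-1, lying in [a, b]."""
    return sum(1 for j in range(n) if a <= (j * alpha) % 1.0 <= b) / n

def weyl_sum(alpha, ell, n):
    """|(1/n) sum_{j=0}^{n-1} e^{2 pi i ell j alpha}|."""
    return abs(sum(cmath.exp(2j * cmath.pi * ell * j * alpha)
                   for j in range(n))) / n

alpha = math.sqrt(2)   # an irrational rotation number
f = equid_frequency(alpha, 0.1, 0.4, 5000)   # should approach b - a = 0.3
s = weyl_sum(alpha, 1, 5000)                 # should decay like 1/n
```

The computed Weyl sum sits below the bound 2/(n·|e^{2πiα} − 1|) derived above.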
Remarks 2.5. (i) More generally, we could consider the sequence xn = nα + β. It is easy to
see by modifying the above arguments that xn is uniformly distributed mod 1 if and only if α
is irrational.
(ii) Fix α > 1 and consider the sequence x_n = α^n x, for some x ∈ (0, 1). Then it is possible
to show that for almost every x, the sequence x_n is uniformly distributed mod 1. We will
prove this later in the course, at least in the cases α = 2, 3, 4, . . ..
(iii) Suppose in the above remark we fix x = 1 and consider the sequence x_n = α^n. Then one
can show that x_n is uniformly distributed mod 1 for almost all α > 1. However, not a single
example of such an α is known!
Exercise 2.6. Calculate the frequency with which 2^n has r (r = 1, . . . , 9) as the leading digit
of its base 10 representation. (You may assume that log10 2 is irrational.)
(Hint: first show that 2^n has leading digit r if and only if
r · 10^ℓ ≤ 2^n < (r + 1) · 10^ℓ
for some ℓ ∈ Z^+.)
Exercise 2.7. Calculate the frequency with which 2^n has r (r = 0, 1, . . . , 9) as the second
digit of its base 10 representation.
2.3 Proof of Weyl's criterion
Proof. Since e^{2πix_j} = e^{2πi{x_j}}, we may suppose, without loss of generality, that x_j = {x_j}.
(i) ⇒ (ii): Suppose that xj is uniformly distributed mod 1. If χ[a,b] is the characteristic
function of the interval [a, b], then we may rewrite the definition of uniform distribution in
the form
(1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) → ∫_0^1 χ_[a,b](x) dx, as n → ∞.
From this we deduce that
(1/n) Σ_{j=0}^{n−1} f(x_j) → ∫_0^1 f(x) dx, as n → ∞,
whenever f is a step function, i.e., a linear combination of characteristic functions of intervals.
Now let g be a continuous function on [0, 1] (with g(0) = g(1)). Then, given ε > 0, we
can find a step function f with ‖g − f‖_∞ ≤ ε. We have the estimate
|(1/n) Σ_{j=0}^{n−1} g(x_j) − ∫_0^1 g(x) dx|
≤ |(1/n) Σ_{j=0}^{n−1} (g(x_j) − f(x_j))| + |(1/n) Σ_{j=0}^{n−1} f(x_j) − ∫_0^1 f(x) dx| + |∫_0^1 f(x) dx − ∫_0^1 g(x) dx|
≤ 2ε + |(1/n) Σ_{j=0}^{n−1} f(x_j) − ∫_0^1 f(x) dx|.
Since the last term converges to zero, we thus obtain
lim sup_{n→∞} |(1/n) Σ_{j=0}^{n−1} g(x_j) − ∫_0^1 g(x) dx| ≤ 2ε.
Since ε > 0 is arbitrary, this gives us that
(1/n) Σ_{j=0}^{n−1} g(x_j) → ∫_0^1 g(x) dx
as n → ∞, and this holds, in particular, for g(x) = e^{2πiℓx}. If ℓ ≠ 0 then
∫_0^1 e^{2πiℓx} dx = 0,
so the first implication is proved.
(ii) ⇒ (i): Suppose now that Weyl’s Criterion holds. Then
(1/n) Σ_{j=0}^{n−1} g(x_j) → ∫_0^1 g(x) dx, as n → ∞,
whenever g(x) = Σ_{k=1}^{m} α_k e^{2πiℓ_k x} is a trigonometric polynomial.
Let f be any continuous function on [0, 1] with f(0) = f(1). Given ε > 0 we can find
a trigonometric polynomial g such that ‖f − g‖_∞ ≤ ε. (This is a consequence of Fejér's
Theorem.) As in the first part of the proof, we can conclude that
(1/n) Σ_{j=0}^{n−1} f(x_j) → ∫_0^1 f(x) dx, as n → ∞.
Now consider the interval [a, b] ⊂ [0, 1). Given ε > 0, we can find continuous functions
f1 , f2 (with f1 (0) = f1 (1), f2 (0) = f2 (1)) such that
f1 ≤ χ[a,b] ≤ f2
and
∫_0^1 (f_2(x) − f_1(x)) dx ≤ ε.
We then have that
lim inf_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) ≥ lim inf_{n→∞} (1/n) Σ_{j=0}^{n−1} f_1(x_j) = ∫_0^1 f_1(x) dx
≥ ∫_0^1 f_2(x) dx − ε ≥ ∫_0^1 χ_[a,b](x) dx − ε
and
lim sup_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) ≤ lim sup_{n→∞} (1/n) Σ_{j=0}^{n−1} f_2(x_j) = ∫_0^1 f_2(x) dx
≤ ∫_0^1 f_1(x) dx + ε ≤ ∫_0^1 χ_[a,b](x) dx + ε.
Since ε > 0 is arbitrary, we have shown that
lim_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) = ∫_0^1 χ_[a,b](x) dx = b − a,
so that x_n is uniformly distributed mod 1.
2.4 Generalisation to Higher Dimensions
We shall now look at the distribution of sequences in R^k.
Definition 2.8. A sequence x_n = (x_{n,1}, . . . , x_{n,k}) ∈ R^k is said to be uniformly distributed mod 1
if, for each choice of k intervals [a_1, b_1], . . . , [a_k, b_k] ⊂ [0, 1), we have that
(1/n) Σ_{j=0}^{n−1} Π_{i=1}^{k} χ_[a_i,b_i]({x_{j,i}}) → Π_{i=1}^{k} (b_i − a_i),
as n → ∞.
We have the following criterion for uniform distribution.
Theorem 2.9 (Multi-dimensional Weyl's Criterion). The sequence x_n ∈ R^k is uniformly distributed mod 1 if and only if
(1/n) Σ_{j=0}^{n−1} e^{2πi(ℓ_1 x_{j,1} + · · · + ℓ_k x_{j,k})} → 0, as n → ∞,
for all ℓ = (ℓ_1, . . . , ℓ_k) ∈ Z^k \ {0}.
Remark 2.10. Here and throughout 0 ∈ Zk denotes the zero vector (0, . . . , 0).
Proof. The proof is essentially the same as in the case k = 1.
We shall apply this result to the sequence xn = (nα1 , . . . , nαk ), for real numbers α1 , . . . , αk .
Suppose first that the numbers α1 , . . . , αk , 1 are rationally independent. This means that
if r1 , . . . , rk , r are rational numbers such that
r1 α1 + · · · + rk αk + r = 0,
then r1 = · · · = rk = r = 0. In particular, for ` = (`1 , . . . , `k ) ∈ Zk \ {0} and n ∈ N,
ℓ_1 nα_1 + · · · + ℓ_k nα_k ∉ Z,
so that
e^{2πi(ℓ_1 nα_1 + · · · + ℓ_k nα_k)} ≠ 1.
We therefore have that
|(1/n) Σ_{j=0}^{n−1} e^{2πi(ℓ_1 jα_1 + · · · + ℓ_k jα_k)}| = (1/n) |(e^{2πin(ℓ_1 α_1 + · · · + ℓ_k α_k)} − 1)/(e^{2πi(ℓ_1 α_1 + · · · + ℓ_k α_k)} − 1)|
≤ (1/n) · 2/|e^{2πi(ℓ_1 α_1 + · · · + ℓ_k α_k)} − 1| → 0,
as n → ∞.
Therefore, by Weyl’s Criterion, (nα1 , . . . , nαk ) is uniformly distributed mod 1.
Now suppose that the numbers α1 , . . . , αk , 1 are rationally dependent, i.e. there exist
rational numbers r1 , . . . , rk , r , not all equal to zero, such that r1 α1 + · · · + rk αk + r = 0. Then
there exists ` = (`1 , . . . , `k ) ∈ Zk \ {0} such that
`1 α1 + · · · + `k αk ∈ Z.
Thus e^{2πi(ℓ_1 nα_1 + · · · + ℓ_k nα_k)} = 1 for all n ∈ N and so
(1/n) Σ_{j=0}^{n−1} e^{2πi(ℓ_1 jα_1 + · · · + ℓ_k jα_k)} = 1, which does not tend to 0
as n → ∞.
Therefore, (nα1 , . . . , nαk ) is not uniformly distributed mod 1.
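The dichotomy can be seen numerically. A Python sketch (not part of the original notes; the helper name is ours): with α_1 = √2 and α_2 = √3 the numbers α_1, α_2, 1 are rationally independent (a standard fact), so hit frequencies match areas; with α_1 = α_2 = √2 the points all lie on the diagonal, so a box away from the diagonal is never hit.

```python
import math

def equid_frequency_2d(alpha, beta, box, n):
    """Proportion of ({j*alpha}, {j*beta}), j = 0, ..., n-1, in the box
    [a1, b1] x [a2, b2]."""
    (a1, b1), (a2, b2) = box
    hits = sum(1 for j in range(n)
               if a1 <= (j * alpha) % 1.0 <= b1
               and a2 <= (j * beta) % 1.0 <= b2)
    return hits / n

# independent case: frequency approaches the area 0.25
f_indep = equid_frequency_2d(math.sqrt(2), math.sqrt(3),
                             ((0, 0.5), (0, 0.5)), 20000)
# dependent case (alpha = beta): points sit on the diagonal x = y,
# so a box off the diagonal is essentially never visited
f_dep = equid_frequency_2d(math.sqrt(2), math.sqrt(2),
                           ((0, 0.5), (0.5, 1.0)), 20000)
```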
2.5 Generalisation to polynomials
We shall now consider another generalisation of the sequence nα. Write
p(n) = α_k n^k + α_{k−1} n^{k−1} + · · · + α_1 n + α_0.
Theorem 2.11 (Weyl). If any one of α1 , . . . , αk is irrational then p(n) is uniformly distributed
mod 1.
(Note that it is irrelevant whether or not α_0 is irrational.) To prove this theorem we shall
need the following technical result.
Lemma 2.12 (van der Corput’s Inequality). Let z0 , . . . , zn−1 ∈ C and let 1 ≤ m ≤ n − 1.
Then
m² |Σ_{j=0}^{n−1} z_j|² ≤ m(n + m − 1) Σ_{j=0}^{n−1} |z_j|² + 2(n + m − 1) Re Σ_{j=1}^{m−1} (m − j) Σ_{i=0}^{n−1−j} z_{i+j} z̄_i.
Proof. Consider the following sums:
S_1 = z_0
S_2 = z_0 + z_1
...
S_m = z_0 + z_1 + · · · + z_{m−1}
S_{m+1} = z_1 + z_2 + · · · + z_m
...
S_n = z_{n−m} + z_{n−m+1} + · · · + z_{n−1}
S_{n+1} = z_{n−m+1} + z_{n−m+2} + · · · + z_{n−1}
...
S_{n+m−2} = z_{n−2} + z_{n−1}
S_{n+m−1} = z_{n−1}.
Notice that each zj occurs in exactly m of the sums Sk . Thus
S_1 + · · · + S_{n+m−1} = m Σ_{j=0}^{n−1} z_j
and so
m² |Σ_{j=0}^{n−1} z_j|² = |S_1 + · · · + S_{n+m−1}|²
≤ (|S_1| + · · · + |S_{n+m−1}|)²
≤ (n + m − 1)(|S_1|² + · · · + |S_{n+m−1}|²),
using the fact that
(Σ_{k=1}^{l} a_k)² ≤ l Σ_{k=1}^{l} a_k².
Now, using the formula
|Σ_{k=1}^{l} a_k|² = Σ_{k=1}^{l} |a_k|² + 2 Re Σ_{i<j} a_i ā_j,
we have
|S_1|² + · · · + |S_{n+m−1}|² = m Σ_{j=0}^{n−1} |z_j|² + 2 Re Σ_{r=1}^{m−1} (m − r) Σ_{j=0}^{n−r−1} z_{j+r} z̄_j.
Hence
m² |Σ_{j=0}^{n−1} z_j|² ≤ m(n + m − 1) Σ_{j=0}^{n−1} |z_j|² + 2(n + m − 1) Re Σ_{j=1}^{m−1} (m − j) Σ_{i=0}^{n−1−j} z_{i+j} z̄_i,
as required.
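Van der Corput's inequality is easy to sanity-check numerically. A Python sketch (not part of the original notes; `vdc_sides` is our own helper) evaluates both sides of the inequality for random points on the unit circle:

```python
import cmath
import random

def vdc_sides(z, m):
    """Left and right sides of van der Corput's inequality for z_0, ..., z_{n-1}
    and a given 1 <= m <= n - 1."""
    n = len(z)
    lhs = m ** 2 * abs(sum(z)) ** 2
    rhs = m * (n + m - 1) * sum(abs(w) ** 2 for w in z)
    # correlation term: sum over j of (m - j) * sum_i z_{i+j} conj(z_i)
    corr = sum((m - j) * sum(z[i + j] * z[i].conjugate()
                             for i in range(n - j))
               for j in range(1, m))
    rhs += 2 * (n + m - 1) * corr.real
    return lhs, rhs

rng = random.Random(1)
z = [cmath.exp(2j * cmath.pi * rng.random()) for _ in range(200)]
lhs, rhs = vdc_sides(z, 10)
# lhs <= rhs for every admissible m
```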
Let x_n ∈ R. For each m ≥ 1 define the sequence x_n^{(m)} = x_{n+m} − x_n of mth differences.
The following lemma allows us to infer the uniform distribution of the sequence x_n if we know
the uniform distribution of each of the mth differences of x_n.
Lemma 2.13. Let x_n ∈ R be a sequence. Suppose that for each m ≥ 1 the sequence x_n^{(m)} of
mth differences is uniformly distributed mod 1. Then x_n is uniformly distributed mod 1.
Proof. We shall apply Weyl's Criterion. We need to show that if ℓ ∈ Z \ {0} then
(1/n) Σ_{j=0}^{n−1} e^{2πiℓx_j} → 0,
as n → ∞.
Let z_j = e^{2πiℓx_j} for j = 0, . . . , n − 1. Note that |z_j| = 1. Let 1 < m < n. By van der
Corput's inequality,
(m²/n²) |Σ_{j=0}^{n−1} e^{2πiℓx_j}|² ≤ (m/n²)(n + m − 1)n + (2(n + m − 1)/n) Re Σ_{j=1}^{m−1} (m − j) (1/n) Σ_{i=0}^{n−1−j} e^{2πiℓ(x_{i+j} − x_i)}
= (m/n)(m + n − 1) + (2(n + m − 1)/n) Re Σ_{j=1}^{m−1} (m − j) A_{n,j},
where
A_{n,j} = (1/n) Σ_{i=0}^{n−1−j} e^{2πiℓ(x_{i+j} − x_i)} = (1/n) Σ_{i=0}^{n−1−j} e^{2πiℓ x_i^{(j)}}.
As the sequence x_i^{(j)} of jth differences is uniformly distributed mod 1, by Weyl's criterion we
have that A_{n,j} → 0 for each j = 1, . . . , m − 1. Hence for each m ≥ 1
lim sup_{n→∞} (m²/n²) |Σ_{j=0}^{n−1} e^{2πiℓx_j}|² ≤ lim sup_{n→∞} m(n + m − 1)/n = m.
Hence, for each m > 1 we have
lim sup_{n→∞} |(1/n) Σ_{j=0}^{n−1} e^{2πiℓx_j}| ≤ 1/√m.
As m > 1 is arbitrary, the result follows.
Proof of Weyl’s Theorem. We will only prove Weyl’s theorem in the special case where the
leading digit αk of
p(n) = αk nk + · · · + α1 n + α0
is irrational. (The general case, where αi is irrational for some 1 ≤ i ≤ k can be deduced
very easily from this special case, but we will not go into this.)
We shall use induction on the degree of p. Let ∆(k) denote the statement ‘for every
polynomial q of degree ≤ k, with irrational leading coefficient, the sequence q(n) is uniformly
distributed mod 1’. We know that ∆(1) is true.
Suppose that ∆(k − 1) is true. Let p(n) = α_k n^k + · · · + α_1 n + α_0 be an arbitrary polynomial
of degree k with α_k irrational. For each m ∈ N, we have that
p(n + m) − p(n) = α_k (n + m)^k + α_{k−1} (n + m)^{k−1} + · · · + α_1 (n + m) + α_0 − α_k n^k − α_{k−1} n^{k−1} − · · · − α_1 n − α_0
= α_k n^k + α_k k n^{k−1} m + · · · + α_{k−1} n^{k−1} + α_{k−1} (k − 1) n^{k−2} m + · · · + α_1 n + α_1 m + α_0 − α_k n^k − α_{k−1} n^{k−1} − · · · − α_1 n − α_0.
After cancellation, we can see that, for each m, p(n + m) − p(n) is a polynomial of degree
k − 1, with irrational leading coefficient αk km. Therefore, by the inductive hypothesis,
p(n + m) − p(n) is uniformly distributed mod 1. We may now apply Lemma 2.13 to conclude
that p(n) is uniformly distributed mod 1 and so ∆(k) holds. This completes the induction.
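Weyl's theorem can also be seen empirically. A Python sketch (not part of the original notes; `frequency_mod1` is our own helper) checks that the fractional parts of p(n) = n²√2 fill [0, 1) evenly:

```python
import math

def frequency_mod1(seq, a, b):
    """Proportion of the sequence, reduced mod 1, falling in [a, b]."""
    pts = [x % 1.0 for x in seq]
    return sum(1 for p in pts if a <= p <= b) / len(pts)

# Weyl: p(n) = n^2 * sqrt(2) has irrational leading coefficient,
# so p(n) is uniformly distributed mod 1
vals = [n * n * math.sqrt(2) for n in range(1, 20001)]
f = frequency_mod1(vals, 0.25, 0.75)   # should approach b - a = 0.5
```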
Exercise 2.14. Let p(n) = α_k n^k + α_{k−1} n^{k−1} + · · · + α_1 n + α_0 and q(n) = β_k n^k + β_{k−1} n^{k−1} +
· · · + β_1 n + β_0. Show that (p(n), q(n)) is uniformly distributed mod 1 if at least one of the
triples (α_k, β_k, 1), . . . , (α_1, β_1, 1) is rationally independent.
3 Examples of Dynamical Systems
3.1 The circle
Several of the key examples in the course take place on the circle. There are two different—
although equivalent—ways of thinking about the circle.
We can think of the circle as the quotient group
R/Z = {x + Z | x ∈ R}
which is easily seen to be equivalent to [0, 1) mod 1. We refer to this as additive notation.
Alternatively, we can regard the circle as
S^1 = {z ∈ C | |z| = 1} = {exp 2πiθ | θ ∈ [0, 1)}.
We refer to this as multiplicative notation.
The two viewpoints are obviously equivalent, and we shall use whichever is most convenient
given the circumstances.
We will also be interested in maps of the k-dimensional torus. The k-dimensional torus
is defined to be
R^k/Z^k = {x + Z^k | x ∈ R^k} = [0, 1)^k mod 1
(in additive notation) and
S^1 × · · · × S^1 (k times) = {(exp 2πiθ_1, . . . , exp 2πiθ_k) | θ_1, . . . , θ_k ∈ [0, 1)}
(in multiplicative notation).
3.2 Rotations on a circle
Fix α ∈ [0, 1) and define the map
T : R/Z → R/Z : x ↦ x + α mod 1.
(In multiplicative notation this is: exp 2πiθ ↦ exp 2πi(θ + α).) This map acts on the circle by
rotating it by angle α. Clearly, we have that T^n(0) = nα mod 1 = {nα}, i.e. the fractional
parts we considered in section 2 form the orbit of 0.
Suppose that α = p/q is rational (here p, q ∈ Z, q ≠ 0). Then
T^q(x) = x + q(p/q) mod 1 = x + p mod 1 = x.
Hence every point of R/Z is periodic.
When α is irrational, one can show that every point x ∈ R/Z has a dense orbit. This can
be deduced from uniform distribution, but it can also be proved directly.
Exercise 3.1. Prove that, for an irrational rotation of the circle, every orbit is dense. (Recall
that the orbit of x is dense if: for all y ∈ R/Z and for all ε > 0, there exists n > 0 such that
d(T^n(x), y) < ε.)
(Hints: (1) First show that T^n(x) = T^n(0) + x and conclude that it is sufficient to prove
that the orbit of 0 is dense. (2) Prove that T^n(x) ≠ T^m(x) for n ≠ m. (3) Show that for
each ε > 0 there exists n > 0 such that 0 < nα mod 1 < ε (you will need to remember that
the circle is sequentially compact). (4) Now show that the orbit of 0 is dense.)
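Density of irrational-rotation orbits can be probed numerically. A Python sketch (not part of the original notes; `hit_time` is our own helper) finds, for a given target y and tolerance ε, the first time the orbit of 0 comes within ε of y on the circle:

```python
import math

def hit_time(alpha, y, eps, n_max=10**6):
    """Smallest n >= 1 with circle distance d(T^n(0), y) < eps, where T is
    rotation by alpha; returns None if not found within n_max steps
    (for irrational alpha a hit always occurs eventually)."""
    for n in range(1, n_max + 1):
        x = (n * alpha) % 1.0
        d = abs(x - y)
        if min(d, 1.0 - d) < eps:   # distance on the circle R/Z
            return n
    return None

alpha = (math.sqrt(5) - 1) / 2   # an irrational rotation number
n = hit_time(alpha, 0.37, 1e-3)  # some finite hitting time exists
```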
3.3 The doubling map
We have already seen the doubling map
T : R/Z → R/Z : x ↦ 2x mod 1.
(In multiplicative notation this is
T(exp 2πiθ) = exp 2πi(2θ),
or, writing z = e^{2πiθ}, T(z) = z².)
Proposition 3.2. Let T be the doubling map.
(i) There are 2^n − 1 points of period n.
(ii) The periodic points are dense.
(iii) There exists a dense orbit.
Proof. We prove (i). Notice that
T^n(x) = 2^n x mod 1 = x
if and only if there exists an integer p ≥ 0 such that
2^n x = x + p.
Hence
x = p/(2^n − 1).
We get distinct values of x ∈ [0, 1) for p = 0, 1, . . . , 2^n − 2. Hence there are 2^n − 1 points
of period n.
We leave (ii) as an exercise.
Exercise 3.3. Prove (ii).
We sketch the proof of (iii). Let us denote the interval [0, 1/2) by the symbol 0 and
denote the interval [1/2, 1) by 1. Let x ∈ [0, 1). For each n ≥ 0 let xn denote the symbol
corresponding to the interval in which T n (x) lies. Thus to each x ∈ [0, 1) we associate a
sequence (x0 , x1 , . . .) of 0s and 1s. It is easy to see that
x = Σ_{n=0}^{∞} x_n/2^{n+1},
so that the sequence (x0 , x1 , . . .) corresponds to the base 2 expansion of x.
Notice that if x has coding (x0 , x1 , . . .) then
T(x) = 2x mod 1 = Σ_{n=0}^{∞} 2x_n/2^{n+1} mod 1 = x_0 + Σ_{n=0}^{∞} x_{n+1}/2^{n+1} mod 1 = Σ_{n=0}^{∞} x_{n+1}/2^{n+1},
so that T(x) has expansion (x_1, x_2, . . .), i.e. T can be thought of as acting on the coding of
x by shifting the associated sequence one place to the left.
For each n-tuple x0 , x1 , . . . , xn−1 let
I(x_0, . . . , x_{n−1}) = {x ∈ [0, 1) | T^k(x) lies in the interval labelled x_k for k = 0, 1, . . . , n − 1}.
That is, I(x0 , . . . , xn−1 ) corresponds to the set of all x ∈ [0, 1) whose base 2 expansion starts
x0 , . . . , xn−1 . We call I(x0 , . . . , xn−1 ) a cylinder of rank n.
Exercise 3.4. Draw all cylinders of length ≤ 4.
One can show:
(i) a cylinder of rank n is an interval of length 2^{−n}.
(ii) for each x ∈ [0, 1) with base 2 expansion x0 , x1 , . . ., the intervals I(x0 , . . . , xn ) ‘converge’
as n → ∞ (in an appropriate sense) to x.
From these observations it is easy to see that, in order to construct a dense orbit, it is
sufficient to construct a point x such that for every cylinder I there exists n = n(I) such that
T n (x) ∈ I. To do this, firstly write down all possible cylinders (there are countably many):
0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, 0001, . . . .
Now take x to be the point with base 2 expansion
010001101100000101001110010111011100000001 . . .
(that is, just adjoin all the symbolic representations of all cylinders in some order). One can
easily check that such a point x has a dense orbit.
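The construction just sketched can be carried out and checked by machine. A Python sketch (not part of the original notes; the function names are ours) builds the expansion by adjoining all binary blocks and verifies that the resulting orbit enters every cylinder of small rank:

```python
from itertools import product

def concatenated_blocks(max_len):
    """Binary expansion obtained by adjoining all 0-1 blocks of length
    1, 2, ..., max_len in order, as in the dense-orbit construction."""
    return ''.join(''.join(block)
                   for L in range(1, max_len + 1)
                   for block in product('01', repeat=L))

def orbit_visits_every_cylinder(expansion, max_len):
    """Check that some shift of the expansion begins with each 0-1 block of
    length <= max_len, i.e. the orbit enters every such cylinder."""
    windows = {expansion[i:i + max_len]
               for i in range(len(expansion) - max_len + 1)}
    prefixes = {w[:L] for w in windows for L in range(1, max_len + 1)}
    return all(''.join(b) in prefixes
               for L in range(1, max_len + 1)
               for b in product('01', repeat=L))

x_bits = concatenated_blocks(4)   # 0 1 00 01 10 11 000 001 ...
# every cylinder of rank <= 4 is visited by the orbit of this point
```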
Exercise 3.5. Write down the proof of Proposition 3.2(iii), adding in complete details.
Remark 3.6. This technique of coding the orbits of a given dynamical system by partitioning
the space X and forming an itinerary map is a very powerful technique that can be used to
study many different classes of dynamical system.
3.4 Shifts of finite type
Let S = {1, 2, . . . , k} be a finite set of symbols. We will be interested in sets consisting
of sequences of these symbols, subject to certain conditions. We will impose the following
conditions: we assume that for each symbol i we allow certain symbols (depending only on
i ) to follow i and disallow all other symbols.
This information is best recorded in a k × k matrix A with entries in {0, 1}. That is, we
allow the symbol j to follow the symbol i if and only if the corresponding (i, j)th entry of the
matrix A (denoted by Ai,j ) is equal to 1.
Definition 3.7. Let A be a k × k matrix with entries in {0, 1}. Let
Σ_A^+ = {(x_j)_{j=0}^{∞} | A_{x_j,x_{j+1}} = 1, for j ∈ Z^+}
denote the set of all infinite sequences of symbols (x_j) where symbol j can follow symbol i
precisely when A_{i,j} = 1. We call Σ_A^+ a (one-sided) shift of finite type.
Let
Σ_A = {(x_j)_{j=−∞}^{∞} | A_{x_j,x_{j+1}} = 1, for j ∈ Z}
denote the set of all bi-infinite sequences of symbols subject to the same conditions. We call
Σ_A a (two-sided) shift of finite type.
Sometimes for brevity we refer to Σ_A^+ or Σ_A as a shift space.
An alternative description of Σ_A^+ and Σ_A can be given as follows. Consider a directed graph
with vertex set {1, 2, . . . , k} and with a directed edge from vertex i to vertex j precisely when
A_{i,j} = 1. Then Σ_A^+ and Σ_A correspond to the set of all infinite (respectively, bi-infinite) paths
in this graph.
Define
σ^+ : Σ_A^+ → Σ_A^+
by
(σ^+(x))_j = x_{j+1}.
Then σ^+ takes a sequence in Σ_A^+ and shifts it one place to the left (deleting the first term).
We call σ^+ the (one-sided, left) shift map.
There is a corresponding shift map on the two-sided shift space. Define
σ : Σ_A → Σ_A
by
(σ(x))_j = x_{j+1},
so that σ shifts sequences one place to the left. Notice that in this case we do not need to
delete any terms in the sequence. We call σ the (two-sided, left) shift map.
Notice that σ is invertible but σ^+ is not. For ease of notation, we shall often write σ to
denote both the one-sided and the two-sided shift map.
Examples 3.8.
Take A to be the k × k matrix with each entry equal to 1. Then any symbol can follow any
other symbol. Hence Σ_A^+ is the space of all sequences of symbols {1, 2, . . . , k}. In this case
we write Σ_k^+ for Σ_A^+ and refer to it as the full one-sided k-shift. Similarly, we can define the
full two-sided k-shift.
Take A to be the matrix
(1 1
 1 0).
Then Σ_A^+ consists of all sequences of 1s and 2s subject to the condition that each 2 must be
followed by a 1.
The following two exercises show that, for certain A, Σ_A^+ (or Σ_A) can be rather uninteresting.
Exercise 3.9. Let
A = (0 1
     0 0).
Show that Σ_A^+ is empty.
Exercise 3.10. Let
A = (1 1
     0 1).
Calculate Σ_A^+.
The following conditions on A guarantee that Σ_A^+ (or Σ_A) is more interesting than the
examples in Exercises 3.9 and 3.10.
Definition 3.11. Let A be a k × k matrix with entries in {0, 1}. We say that A is irreducible
if for each i, j ∈ {1, 2, . . . , k} there exists n = n(i, j) > 0 such that (A^n)_{i,j} > 0. (Here, (A^n)_{i,j}
denotes the (i, j)th entry of the nth power of A.)
Definition 3.12. Let A be a k × k matrix with entries in {0, 1}. We say that A is aperiodic
if there exists n > 0 such that for all i, j ∈ {1, 2, . . . , k} we have (A^n)_{i,j} > 0.
In graph-theoretic terms, the matrix A is irreducible if there exists a path along edges
from any vertex to any other vertex. The matrix A is aperiodic if this path can be chosen
to have the same length (i.e. consist of the same number of edges), irrespective of the two
vertices chosen.
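Both conditions are mechanical to check for small matrices by taking powers. A Python sketch (not part of the original notes; the function names are ours):

```python
def mat_mult(A, B):
    """Multiply two square nonnegative integer matrices."""
    k = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]

def is_irreducible(A):
    """A is irreducible iff for each (i, j) some power A^n has (A^n)_{ij} > 0.
    Shortest paths in the graph have length at most k, so it suffices to
    check the powers n = 1, ..., k."""
    k = len(A)
    reach = [[False] * k for _ in range(k)]
    P = A
    for _ in range(k):
        for i in range(k):
            for j in range(k):
                reach[i][j] = reach[i][j] or P[i][j] > 0
        P = mat_mult(P, A)
    return all(all(row) for row in reach)

def is_aperiodic(A, max_power=64):
    """A is aperiodic iff a single power A^n has every entry > 0.
    max_power is just a search cutoff, enough for small matrices."""
    P = A
    for _ in range(max_power):
        if all(e > 0 for row in P for e in row):
            return True
        P = mat_mult(P, A)
    return False
```

For instance, the matrix of Exercise 3.9 fails irreducibility, consistent with its shift space being empty.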
Exercise 3.13.
(i) Consider the matrix
(1 1
 1 0).
Draw the corresponding directed graph. Is this matrix irreducible? Is it aperiodic?
(ii) Consider the matrix
(0 1 0 1
 1 0 1 0
 0 1 0 1
 1 0 1 0).
Draw the corresponding directed graph. Is this matrix irreducible? Is it aperiodic?
Remark 3.14. These shift spaces may seem very strange at first sight—it takes a long time
to get used to them. However (as we shall see) they are particularly tractable examples of
chaotic dynamical systems. Moreover, a wide class of dynamical systems (notably hyperbolic
dynamical systems) can be modeled in terms of shifts of finite type. We have already seen
a particularly simple example of this: the doubling map can be modeled by the full one-sided
2-shift.
3.5 Periodic points
A sequence x = (x_j)_{j=0}^{∞} ∈ Σ_A^+ is periodic for the shift σ if there exists n > 0 such that
σ^n x = x. One can easily check that this means that
x_j = x_{j+n} for all j ∈ Z^+.
That is, the sequence x is determined by a finite block of symbols x_0, . . . , x_{n−1} and
x = (x_0, x_1, . . . , x_{n−1}, x_0, x_1, . . . , x_{n−1}, . . .).
Exercise 3.15. Consider the full one-sided k-shift. How many periodic points of period n are
there?
3.6 Cylinders
Later on we will need a particularly tractable class of subsets of shift spaces. These are the
cylinder sets and are formed by fixing a finite set of co-ordinates. More precisely, in ΣA we
define
[y_{−m}, . . . , y_{−1}, y_0, y_1, . . . , y_n]_{−m,n} = {x ∈ Σ_A | x_j = y_j, −m ≤ j ≤ n},
and in Σ_A^+ we define
[y_0, y_1, . . . , y_n]_{0,n} = {x ∈ Σ_A^+ | x_j = y_j, 0 ≤ j ≤ n}.
3.7 A metric on Σ_A^+
What does it mean for two sequences in Σ_A^+ to be ‘close’? Heuristically we will say that two
sequences (x_j)_{j=0}^{∞} and (y_j)_{j=0}^{∞} are close if they agree for a large number of initial places.
More formally, for two sequences x = (x_j)_{j=0}^{∞}, y = (y_j)_{j=0}^{∞} ∈ Σ_A^+ we define n(x, y) by
setting n(x, y) = n if x_j = y_j for j = 0, . . . , n − 1 but x_n ≠ y_n. Thus n(x, y) is the first place
in which the sequences x and y disagree. (We set n(x, y) = ∞ if x = y.)
We define a metric d on Σ_A^+ by
d((x_j)_{j=0}^{∞}, (y_j)_{j=0}^{∞}) = (1/2)^{n(x,y)} if x ≠ y,
and d((x_j)_{j=0}^{∞}, (y_j)_{j=0}^{∞}) = 0 if x = y.
Exercise 3.16. Show that this is a metric.
In the two-sided case, we can define a metric in a similar way. Let x = (x_j)_{j=−∞}^{∞}, y =
(y_j)_{j=−∞}^{∞} ∈ Σ_A. Define n(x, y) by setting n(x, y) = n if x_j = y_j for |j| ≤ n − 1 and either
x_n ≠ y_n or x_{−n} ≠ y_{−n}. Thus n(x, y) is the first place, going either forwards or backwards, in
which the sequences x, y disagree. (We again set n(x, y) = ∞ if x = y.)
We define a metric d on Σ_A in the same way:
d((x_j)_{j=−∞}^{∞}, (y_j)_{j=−∞}^{∞}) = (1/2)^{n(x,y)} if x ≠ y,
and d((x_j)_{j=−∞}^{∞}, (y_j)_{j=−∞}^{∞}) = 0 if x = y.
Theorem 3.17. Let Σ_A^+ be a shift of finite type.
(i) Σ_A^+ is a compact metric space.
(ii) The shift map σ is continuous.
Remark 3.18. The corresponding statements for the two-sided case also hold.
Proof. (i) If Σ_A^+ = ∅ or if Σ_A^+ is finite then trivially it is compact. Thus we may assume that
Σ_A^+ is infinite.
Let x^(m) ∈ Σ_A^+ be a sequence (in reality, a sequence of sequences!). We need to show
that x^(m) has a convergent subsequence. Since Σ_A^+ = ∪_{i=1}^{k} [i], at least one cylinder [i]
contains infinitely many elements of the sequence x^(m); call this cylinder [y_0]. Thus there are
infinitely many m for which x^(m) ∈ [y_0].
Since [y_0] = ∪_{A_{y_0,i}=1} [y_0 i], we similarly obtain a cylinder of length 2, [y_0 y_1] say, containing
infinitely many elements of the sequence x^(m).
Continue inductively in this way to obtain a nested family of cylinders [y_0, . . . , y_n], n ≥ 0,
each containing infinitely many elements of the sequence x^(m).
Set y = (y_n)_{n=0}^{∞} ∈ Σ_A^+. Then for each n ≥ 0, there exist infinitely many m for which
d(y, x^(m)) ≤ (1/2)^n. Thus y is the limit of some subsequence of x^(m).
(ii) We want to show the following: ∀ε > 0 ∃δ > 0 s.t. d(x, y) < δ ⇒ d(σ(x), σ(y)) < ε.
Let ε > 0. Choose n such that 1/2^n < ε. Let δ = 1/2^{n+1}. Suppose that d(x, y) < δ.
Then n(x, y) > n + 1, so that x and y agree in the first n + 1 places. Hence σ(x) and
σ(y) agree in the first n places, so that n(σ(x), σ(y)) > n. Hence d(σ(x), σ(y)) =
1/2^{n(σ(x),σ(y))} < 1/2^n < ε.
Exercise 3.19. Let A be an irreducible k × k matrix with entries in {0, 1}. Show that the set
of all periodic points for σ is dense in Σ_A^+. (Recall that a subset Y of a metric space X is said
to be dense if: for all x ∈ X and for all ε > 0 there exists y ∈ Y such that d(x, y) < ε, i.e. any
point of X can be arbitrarily well approximated by a point of Y.)
Exercise 3.20. Let A be an irreducible k × k matrix with entries in {0, 1}. Show that there
exists a point x ∈ Σ_A^+ with a dense orbit. (Hint: first show that if the orbit of a point visits
each cylinder then it is dense. To construct such a point, mimic the argument used for the
doubling map above. Use irreducibility to show that one can concatenate cylinders together
by inserting finite strings of symbols between them.)
3.8 The continued fraction map
Every x ∈ (0, 1) can be expressed as a continued fraction:
x = 1/(x_0 + 1/(x_1 + 1/(x_2 + 1/(x_3 + · · · ))))    (1)
for x_n ∈ N.
For example,
(−1 + √5)/2 = 1/(1 + 1/(1 + 1/(1 + · · · ))),
3/4 = 1/(1 + 1/3),
π = 3 + 1/(7 + 1/(15 + 1/(1 + 1/(292 + · · · )))).
One can show that rational numbers have a finite continued fraction expansion (that
is, the above expression terminates at x_n for some n). Conversely, it is clear that a finite
continued fraction expansion gives rise to a rational number.
Thus each irrational x ∈ (0, 1) has an infinite continued fraction expansion of the form
(1). Moreover, one can show that this expansion is unique. For brevity, we will sometimes
write (1) as x = [x_0; x_1; x_2; . . .].
Recall that earlier in this section we saw how the doubling map x ↦ 2x mod 1 can be used to determine the base 2 expansion of x. Here we introduce a dynamical system that allows us to determine the continued fraction expansion of x.
We can read off the numbers x_i from the transformation T : [0, 1] → [0, 1] defined by T(0) = 0 and, for 0 < x < 1,

T(x) = 1/x mod 1.

Then

x_0 = ⌊1/x⌋, x_1 = ⌊1/Tx⌋, . . . , x_n = ⌊1/T^n x⌋.

This is called the continued fraction map or the Gauss map.
Exercise 3.21. Draw the graph of the continued fraction map.
Later in the course we will study the ergodic theoretic properties of the continued fraction map and use them to deduce some interesting facts about continued fractions.
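The Gauss map gives a practical algorithm for reading off continued fraction digits. The sketch below is our own illustration, not part of the notes (the function names are ours): it iterates T and records x_n = ⌊1/T^n x⌋. Floating-point error grows with each iteration, so only the first few digits are reliable.

```python
import math

def gauss_map(x):
    """The Gauss map T(x) = 1/x mod 1, with T(0) = 0."""
    if x == 0:
        return 0.0
    return (1.0 / x) % 1.0

def cf_digits(x, n):
    """Read off the first n continued fraction digits x_k = floor(1/T^k x).
    Floating point only; trust just the first few digits."""
    digits = []
    for _ in range(n):
        if x == 0:
            break
        digits.append(math.floor(1.0 / x))
        x = gauss_map(x)
    return digits

# The golden-ratio example (-1 + sqrt(5))/2 = [1; 1; 1; ...]
print(cf_digits((math.sqrt(5) - 1) / 2, 5))  # -> [1, 1, 1, 1, 1]
```

Running it on 3/4 returns [1, 3], matching the expansion 3/4 = 1/(1 + 1/3) above.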
3.9 Endomorphisms of a torus
Take X = R^k/Z^k to be the k-torus.
Let A = (a_ij) be a k × k matrix with entries in Z and with det A ≠ 0. We can define a linear map R^k → R^k by sending the column vector (x_1, . . . , x_k)^T to A(x_1, . . . , x_k)^T.
For brevity, we shall often write this as (x_1, . . . , x_k) ↦ A(x_1, . . . , x_k).
Since A is an integer matrix, it maps Z^k to itself. We claim that A allows us to define a map

T = T_A : R^k/Z^k → R^k/Z^k,   (x_1, . . . , x_k) ↦ A(x_1, . . . , x_k) mod 1.
To see that this map is well defined, we need to check that if x, y ∈ R^k determine the same point in R^k/Z^k then Ax mod 1 and Ay mod 1 are the same point in R^k/Z^k. But this is clear: if x, y ∈ R^k give the same point in the torus, then x = y + n for some n ∈ Z^k. Hence Ax = A(y + n) = Ay + An. As A maps Z^k to itself, we see that An ∈ Z^k so that Ax, Ay determine the same point in the torus.
Definition 3.22. Let A = (a_ij) denote a k × k matrix with integer entries such that det A ≠ 0. Then we call the map T_A : R^k/Z^k → R^k/Z^k a linear toral endomorphism.
The map T is not invertible in general. However, if det A = ±1 then A^{-1} exists and is an integer matrix. Hence we have a map T^{-1} given by

T^{-1}(x_1, . . . , x_k) = A^{-1}(x_1, . . . , x_k) mod 1.

One can easily check that T^{-1} is the inverse of T.
Definition 3.23. Let A = (a_ij) denote a k × k matrix with integer entries such that det A = ±1. Then we call the map T_A : R^k/Z^k → R^k/Z^k a linear toral automorphism.
Example 3.24. Take A to be the matrix

A = ( 2 1 ; 1 1 )

and define T : R^2/Z^2 → R^2/Z^2 to be the induced map:

T(x_1, x_2) = (2x_1 + x_2 mod 1, x_1 + x_2 mod 1).

Then T is a linear toral automorphism and is called Arnold's cat map. (CAT stands for 'C'ontinuous 'A'utomorphism of the 'T'orus.)
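As an illustration (ours, not part of the notes), the cat map and its inverse can be iterated exactly on rational points using Python's Fraction type, which avoids the rounding problems of floating-point arithmetic mod 1:

```python
from fractions import Fraction as F

def cat_map(p):
    """One step of Arnold's cat map: (x1, x2) -> (2x1 + x2, x1 + x2) mod 1."""
    x1, x2 = p
    return ((2 * x1 + x2) % 1, (x1 + x2) % 1)

def cat_map_inverse(p):
    """Inverse map, induced by A^{-1} = [[1, -1], [-1, 2]] (valid since det A = 1)."""
    x1, x2 = p
    return ((x1 - x2) % 1, (-x1 + 2 * x2) % 1)

# Exact arithmetic with rationals; rational points have periodic orbits.
p = (F(1, 5), F(3, 5))
q = cat_map(p)                  # q = (0, 4/5), since 2/5 + 3/5 = 1 = 0 mod 1
print(cat_map_inverse(q) == p)  # True
```

Note that the inverse is computed from A^{-1}, exactly as in the discussion of linear toral automorphisms above.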
Definition 3.25. Suppose that det A = ±1. Then we call T a hyperbolic toral automorphism
if A has no eigenvalues of modulus 1.
Exercise 3.26. Check that Arnold's cat map is hyperbolic. Decide whether the following matrices give hyperbolic toral automorphisms:

A_1 = ( 1 1 ; 0 1 ),   A_2 = ( 1 1 ; 1 0 ).
Let us consider the special case of a toral automorphism of the 2-dimensional torus R^2/Z^2.
Proposition 3.27. Let T be a hyperbolic toral automorphism of R^2/Z^2 with corresponding matrix A having eigenvalues λ_1, λ_2.
(i) The periodic points of T correspond precisely with the set of rational points of R^2/Z^2:

{ (p_1/q, p_2/q) + Z^2 | p_1, p_2, q ∈ N, 0 ≤ p_1, p_2 < q }.

(In particular, the periodic points are dense.)
(ii) Suppose that det A = 1. Then the number of points of period n is given by:

card{x ∈ R^2/Z^2 | T^n(x) = x} = |λ_1^n + λ_2^n − 2|.
Proof.
(i) If (x_1, x_2) = (p_1/q, p_2/q) has rational co-ordinates then we can write

T^n(x_1, x_2) = (p_1^(n)/q, p_2^(n)/q)

where 0 ≤ p_1^(n), p_2^(n) < q are integers. As there are at most q^2 distinct possibilities for p_1^(n), p_2^(n), this sequence (in n) must be eventually periodic. Hence there exists n_1 > n_0 such that T^{n_1}(x_1, x_2) = T^{n_0}(x_1, x_2). As T is invertible, we see that T^{n_1 − n_0}(x_1, x_2) = (x_1, x_2) so that (x_1, x_2) is periodic.
Conversely, if (x_1, x_2) ∈ R^2/Z^2 is periodic then T^n(x_1, x_2) = (x_1, x_2) for some n > 0. Hence

A^n (x_1, x_2)^T = (x_1, x_2)^T + (n_1, n_2)^T    (2)

for some n_1, n_2 ∈ Z. As A is hyperbolic, A has no eigenvalues of modulus 1. Hence A^n has no eigenvalues of modulus 1, and in particular 1 is not an eigenvalue. Hence A^n − I is invertible. Hence solutions to (2) have the form

(x_1, x_2)^T = (A^n − I)^{-1} (n_1, n_2)^T.

As A^n − I has entries in Z, the matrix (A^n − I)^{-1} has entries in Q. Hence x_1, x_2 ∈ Q.
(ii) A point (x_1, x_2) is periodic with period n for T if and only if

(A^n − I) (x_1, x_2)^T = (n_1, n_2)^T    (3)

for some n_1, n_2 ∈ Z. We may take x_1, x_2 ∈ [0, 1). Let u = (A^n − I)(0, 1)^T, v = (A^n − I)(1, 0)^T. The map A^n − I maps [0, 1) × [0, 1) onto the parallelogram

R = {αu + βv | 0 ≤ α, β < 1}.

For the point (x_1, x_2) ∈ [0, 1) × [0, 1) to be periodic, it follows from (3) that (A^n − I)(x_1, x_2)^T must be an integer point of R. Thus the number of periodic points of period n corresponds to the number of integer points in R. One can check that the number of such points is equal to the area of R. Hence the number of periodic points of period n is given by |det(A^n − I)|.
Let us calculate the eigenvalues of A^n − I. Let μ be an eigenvalue of A^n − I with eigenvector v. Then

(A^n − I)v = μv ⇔ A^n v = (μ + 1)v

so that μ + 1 is an eigenvalue of A^n. As the eigenvalues of A are given by λ_1, λ_2, the eigenvalues of A^n are given by λ_1^n, λ_2^n. Hence the eigenvalues of A^n − I are λ_1^n − 1, λ_2^n − 1. As the determinant of a matrix is given by the product of its eigenvalues, we have that

|det(A^n − I)| = |(λ_1^n − 1)(λ_2^n − 1)| = |(λ_1 λ_2)^n + 1 − (λ_1^n + λ_2^n)| = |λ_1^n + λ_2^n − 2|,

as λ_1 λ_2 = det A = 1.
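The count card{x : T^n x = x} = |λ_1^n + λ_2^n − 2| is easy to check numerically for the cat map: |det(A^n − I)| can be computed with exact integer arithmetic and compared against the eigenvalue expression. A small sketch of ours (helper names are not from the notes):

```python
def mat_mul(A, B):
    """Product of two 2x2 integer matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_pow(A, n):
    """A^n by repeated multiplication (n >= 0)."""
    R = [[1, 0], [0, 1]]
    for _ in range(n):
        R = mat_mul(R, A)
    return R

def count_periodic(A, n):
    """Number of x on the torus with T^n x = x, namely |det(A^n - I)|."""
    M = mat_pow(A, n)
    M[0][0] -= 1
    M[1][1] -= 1
    return abs(M[0][0] * M[1][1] - M[0][1] * M[1][0])

A = [[2, 1], [1, 1]]          # Arnold's cat map
l1 = (3 + 5 ** 0.5) / 2       # eigenvalues of A: (3 ± sqrt 5)/2
l2 = (3 - 5 ** 0.5) / 2
for n in range(1, 8):
    # exact integer count vs. eigenvalue formula; the two columns agree
    print(n, count_periodic(A, n), round(abs(l1 ** n + l2 ** n - 2)))
```

For n = 1, 2, 3, 4 this gives 1, 5, 16, 45 points, consistent with the proposition.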
4 Measure Theory
4.1 Background
In section 1 we remarked that ergodic theory is the study of the qualitative distributional
properties of typical orbits of a dynamical system and that these properties are expressed
in terms of measure theory. Measure theory therefore lies at the heart of ergodic theory.
However, we will not need to know the (many!) intricacies of measure theory and this
section will be devoted to an expository account of the required facts.
4.2 Measure spaces
Loosely speaking, a measure is a function that, when given a subset of a space X, will say
how ‘big’ that subset is. A motivating example is given by Lebesgue measure. The Lebesgue
measure of an interval is given by its length. In defining an abstract measure space, we will be
taking the properties of ‘length’ (or, in higher dimensions, ‘volume’) and abstracting them,
in much the same way that a metric space abstracts the properties of ‘distance’.
It turns out that in general it is not possible to define the measure of an arbitrary subset of X. Instead, we will usually have to restrict our attention to a class of subsets of X.
Definition 4.1. A collection B of subsets of X is called a σ-algebra if the following properties
hold:
(i) ∅ ∈ B,
(ii) if E ∈ B then its complement X \ E ∈ B,
(iii) if E_n ∈ B, n = 1, 2, 3, . . ., is a countable sequence of sets in B then their union ∪_{n=1}^∞ E_n ∈ B.
Examples 4.2.
The trivial σ-algebra is given by B = {∅, X}.
The full σ-algebra is given by B = P(X), i.e. the collection of all subsets of X.
Here are some easy properties of σ-algebras:
Lemma 4.3. Let B be a σ-algebra of subsets of X. Then
(i) X ∈ B;
(ii) if E_n ∈ B then ∩_{n=1}^∞ E_n ∈ B.
Exercise 4.4. Prove Lemma 4.3.
In the special case when X is a compact metric space there is a particularly important
σ-algebra.
Definition 4.5. Let X be a compact metric space. We define the Borel σ-algebra B(X) to
be the smallest σ-algebra of subsets of X which contains all the open subsets of X.
Remarks 4.6.
By ‘smallest’ we mean that if C is another σ-algebra that contains all open subsets of X then
B(X) ⊂ C.
We say that the Borel σ-algebra is generated by the open sets. We call a set in B(X) a Borel set.
By Definition 4.1(ii), the Borel σ-algebra also contains all closed sets and is the smallest
σ-algebra with this property.
Let X be a set and let B be a σ-algebra of subsets of X.
Definition 4.7. A function μ : B → R^+ ∪ {∞} is called a measure if:
(i) μ(∅) = 0;
(ii) if E_n is a countable collection of pairwise disjoint sets in B (i.e. E_n ∩ E_m = ∅ for n ≠ m) then

μ( ∪_{n=1}^∞ E_n ) = Σ_{n=1}^∞ μ(E_n).
(If µ(X) < ∞ then we call µ a finite measure.) We call (X, B, µ) a measure space.
If µ(X) = 1 then we call µ a probability or probability measure and refer to (X, B, µ) as
a probability space.
Remark 4.8. Thus a measure just abstracts properties of ‘length’ or ‘volume’. Condition (i)
says that the empty set has zero length, and condition (ii) says that the length of a disjoint
union is the sum of the lengths of the individual sets.
Definition 4.9. We say that a property holds almost everywhere if the set of points on which
the property fails to hold has measure zero.
We will usually be interested in studying measures on the Borel σ-algebra of a compact
metric space X. To define such a measure, we need to define the measure of an arbitrary
Borel set. In general, the Borel σ-algebra is extremely large. In the next section we see that
it is often unnecessary to do this and instead it is sufficient to define the measure of a certain
class of subsets.
4.3 The Kolmogorov Extension Theorem
A collection A of subsets of X is called an algebra if:
(i) ∅ ∈ A,
(ii) if A, B ∈ A then A ∩ B ∈ A;
(iii) if A ∈ A then A^c ∈ A.
Thus an algebra is like a σ-algebra, except that we do not assume that A is closed under
countable unions.
Example 4.10. Take X = [0, 1], and A = {all finite unions of subintervals}.
Let B(A) denote the σ-algebra generated by A, i.e., the smallest σ-algebra containing A.
(In the above example B(A) is the Borel σ-algebra.)
Theorem 4.11 (Kolmogorov Extension Theorem). Let A be an algebra of subsets of X. Suppose that μ : A → R^+ satisfies:
(i) μ(∅) = 0;
(ii) there exist finitely or countably many sets X_n ∈ A such that X = ∪_n X_n and μ(X_n) < ∞;
(iii) if E_n ∈ A, n ≥ 1, are pairwise disjoint and if ∪_{n=1}^∞ E_n ∈ A then

μ( ∪_{n=1}^∞ E_n ) = Σ_{n=1}^∞ μ(E_n).

Then there is a unique measure μ : B(A) → R^+ which is an extension of μ : A → R^+.
Remarks 4.12.
(i) The important hypotheses are (i) and (iii). Thus the Kolmogorov Extension Theorem says
that if we have a function µ that looks like a measure on an algebra A, then it is indeed a
measure when extended to B(A).
(ii) We will often use the Kolmogorov Extension Theorem as follows. Take X = [0, 1] and
take A to be the algebra consisting of all finite unions of subintervals of X. We then define
the ‘measure’ µ of a subinterval in such a way as to be consistent with the hypotheses of the
Kolmogorov Extension Theorem. It then follows that µ does indeed define a measure on the
Borel σ-algebra.
(iii) Here is another way in which we shall use the Kolmogorov Extension Theorem. Suppose
we have two measures, µ and ν, and we want to see if µ = ν. A priori we would have to
check that µ(B) = ν(B) for all B ∈ B. The Kolmogorov Extension Theorem says that it
is sufficient to check that µ(E) = ν(E) for all E in an algebra A that generates B. For
example, to show that two measures on [0, 1] are equal, it is sufficient to show that they give
the same measure to each subinterval.
4.4 Examples of measure spaces
Lebesgue measure on [0, 1]. Take X = [0, 1] and take A to be the collection of all finite
unions of subintervals of [0, 1]. For a subinterval [a, b] define
µ([a, b]) = b − a.
This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure
on the Borel σ-algebra B. This is Lebesgue measure.
Lebesgue measure on R/Z. Take X = R/Z = [0, 1) mod 1 and take A to be the collection
of all finite unions of subintervals of [0, 1). For a subinterval [a, b] define
µ([a, b]) = b − a.
This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure
on the Borel σ-algebra B. This is Lebesgue measure on the circle.
Lebesgue measure on the k-dimensional torus. Take X = R^k/Z^k = [0, 1)^k mod 1 and take A to be the collection of all finite unions of k-dimensional sub-cubes Π_{j=1}^k [a_j, b_j] of [0, 1)^k. For a sub-cube Π_{j=1}^k [a_j, b_j] of [0, 1)^k, define

μ( Π_{j=1}^k [a_j, b_j] ) = Π_{j=1}^k (b_j − a_j).

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. This is Lebesgue measure on the torus.
Stieltjes measures. Take X = [0, 1] and let ρ : [0, 1] → R+ be an increasing function such
that ρ(1) − ρ(0) = 1. Take A to be the algebra of finite unions of subintervals and define
µρ ([a, b]) = ρ(b) − ρ(a).
This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure
on the Borel σ-algebra B. We say that µρ is the measure on [0, 1] with density ρ.
Dirac measures. Finally, we give an example of a class of measures that do not fall into
the above categories. Let X be an arbitrary space and let B be an arbitrary σ-algebra. Let
x ∈ X. Define the measure δ_x by

δ_x(A) = 1 if x ∈ A,  0 if x ∉ A.
Then δx defines a probability measure. It is called the Dirac measure at x.
4.5 Integration: The Riemann integral
Before discussing the Lebesgue theory of integration, we briefly review the construction of
the Riemann integral. This gives a method for defining the integral of (sufficiently nice)
functions defined on [a, b]. In the next subsection we will see how the Lebesgue integral is a
generalisation of the Riemann integral, in the sense that it allows us to integrate functions
defined on spaces more general than subintervals of R (as well as a wider class of functions).
However, the Lebesgue integral has other nice properties, for example it is well-behaved with
respect to limits. Here we give a brief exposition about some inadequacies of the Riemann
integral and how they motivate the Lebesgue integral.
Let f : [a, b] → R be a bounded function (for the moment we impose no other conditions
on f ).
A partition ∆ of [a, b] is a finite set of points ∆ = {x0 , x1 , x2 , . . . , xn } with
a = x0 < x1 < x2 < · · · < xn = b.
In other words, we are dividing [a, b] up into subintervals.
We then form the upper and lower Riemann sums

U(f, ∆) = Σ_{i=0}^{n−1} sup_{x∈[x_i, x_{i+1}]} f(x) (x_{i+1} − x_i),
L(f, ∆) = Σ_{i=0}^{n−1} inf_{x∈[x_i, x_{i+1}]} f(x) (x_{i+1} − x_i).
The idea is then that if we make the subintervals in the partition small, these sums will be a good approximation to (our intuitive notion of) the integral of f over [a, b]. More precisely, if

inf_∆ U(f, ∆) = sup_∆ L(f, ∆),

where the infimum and supremum are taken over all possible partitions of [a, b], then we write

∫_a^b f(x) dx

for their common value and call it the (Riemann) integral of f between those limits. We also say that f is Riemann integrable.
The class of Riemann integrable functions includes continuous functions and step functions
(i.e. finite linear combinations of characteristic functions of intervals).
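For a continuous function the upper and lower sums squeeze together as the partition is refined. The sketch below (ours, not from the notes) approximates the sup and inf on each subinterval by sampling, which is adequate for a monotone piece like x ↦ x², and shows U(f, ∆) and L(f, ∆) converging to ∫₀¹ x² dx = 1/3:

```python
def upper_lower_sums(f, xs, samples=1000):
    """Approximate U(f, partition) and L(f, partition) for xs = [x0, ..., xn]
    by sampling f on each subinterval (endpoints included)."""
    U = L = 0.0
    for a, b in zip(xs, xs[1:]):
        vals = [f(a + (b - a) * k / samples) for k in range(samples + 1)]
        U += max(vals) * (b - a)
        L += min(vals) * (b - a)
    return U, L

f = lambda x: x * x
for n in (10, 100, 1000):
    xs = [i / n for i in range(n + 1)]   # uniform partition of [0, 1]
    U, L = upper_lower_sums(f, xs)
    print(n, round(L, 4), round(U, 4))   # both columns approach 1/3
```

The same sampling scheme is useless for the function that is 1 on the rationals and 0 elsewhere, for exactly the reason explained below: no interval, however small, separates the two values.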
However, there are many functions for which one wishes to define an integral but which
are not Riemann integrable, making the theory rather unsatisfactory. For example, define
f : [0, 1] → R by

f(x) = 1 if x ∈ Q,  0 otherwise.
Since between any two distinct real numbers we can find both a rational number and an irrational number, given 0 ≤ y < z ≤ 1, we can find y < x < z with f(x) = 1 and y < x′ < z with f(x′) = 0. Hence for any partition ∆ = {x_0, x_1, . . . , x_n} of [0, 1], we have

U(f, ∆) = Σ_{i=0}^{n−1} (x_{i+1} − x_i) = 1,   L(f, ∆) = 0.
Taking the infimum and supremum, respectively, over all partitions ∆, shows that f is not
Riemann integrable.
Why does Riemann integration not work for the above function and how could we go about improving it? Let us look again at (and slightly rewrite) the formulæ for U(f, ∆) and L(f, ∆). We have

U(f, ∆) = Σ_{i=0}^{n−1} sup_{x∈[x_i, x_{i+1}]} f(x) l([x_i, x_{i+1}])

and

L(f, ∆) = Σ_{i=0}^{n−1} inf_{x∈[x_i, x_{i+1}]} f(x) l([x_i, x_{i+1}]),
where, for an interval [y, z],

l([y, z]) = z − y

denotes its length. In the example above, things didn't work because dividing [0, 1] into intervals (no matter how small) did not 'separate out' the different values that f could take. But suppose we had a notion of 'length' that worked for more general sets than intervals. Then we could do better by considering more complicated 'partitions' of [0, 1], where by partition we now mean a collection of subsets {E_1, . . . , E_m} of [0, 1] such that E_i ∩ E_j = ∅ if i ≠ j, and ∪_{i=1}^m E_i = [0, 1].
In the example, for instance, it might be reasonable to write

∫_0^1 f(x) dx = 1 × l([0, 1] ∩ Q) + 0 × l([0, 1] \ Q) = l([0, 1] ∩ Q).
Instead of using subintervals, the Lebesgue integral uses a much wider class of subsets
(namely sets in the given σ-algebra) together with a notion of ‘generalised length’ (namely,
measure).
4.6 Integration: The Lebesgue integral
Let (X, B, µ) be a measure space. We are interested in how to integrate functions defined
on X with respect to the measure µ. In the special case when X = [0, 1], B is the Borel
σ-algebra and µ is Lebesgue measure, this will extend the definition of the Riemann integral
to functions that are not Riemann integrable.
Definition 4.13. A function f : X → R is measurable if f^{-1}(D) ∈ B for every Borel subset D of R, or, equivalently, if f^{-1}(c, ∞) ∈ B for all c ∈ R.
A function f : X → C is measurable if both the real and imaginary parts, Re f and Im f, are measurable.
We define integration via simple functions.
Definition 4.14. A function f : X → R is simple if it can be written as a linear combination of characteristic functions of sets in B, i.e.:

f = Σ_{i=1}^r a_i χ_{A_i},

for some a_i ∈ R, A_i ∈ B, where the A_i are pairwise disjoint.
For a simple function f : X → R we define

∫ f dμ = Σ_{i=1}^r a_i μ(A_i)

(which can be shown to be independent of the representation of f as a simple function).
Thus for simple functions, the integral can be thought of as being defined to be the area
underneath the graph.
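The defining formula ∫ f dμ = Σ a_i μ(A_i) is directly computable whenever the μ(A_i) are. A toy sketch of ours on a finite measure space (all names hypothetical, not from the notes):

```python
def integrate_simple(coeffs_and_sets, mu):
    """Integral of a simple function f = sum_i a_i * chi_{A_i}:
    by definition, integral f dmu = sum_i a_i * mu(A_i)."""
    return sum(a * mu(A) for a, A in coeffs_and_sets)

# Toy measure space: X = {0, 1, ..., 9} with the uniform probability measure.
mu = lambda A: len(A) / 10.0

# f = 2 * chi_{0..4} + 5 * chi_{5,6}, with disjoint pieces as in Definition 4.14.
f = [(2.0, {0, 1, 2, 3, 4}), (5.0, {5, 6})]
print(integrate_simple(f, mu))  # 2 * 0.5 + 5 * 0.2 = 2.0
```

The same one-liner works for any σ-algebra on which μ can be evaluated; only the representation of the sets changes.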
If f : X → R, f ≥ 0, is measurable then one can show that there exists an increasing sequence of simple functions f_n such that f_n ↑ f pointwise as n → ∞ (i.e. for every x, f_n(x) is an increasing sequence and f_n(x) → f(x) as n → ∞) and we define

∫ f dμ = lim_{n→∞} ∫ f_n dμ.

This can be shown to be independent of the choice of sequence f_n.
For an arbitrary measurable function f : X → R, we write f = f^+ − f^−, where f^+ = max{f, 0} ≥ 0 and f^− = max{−f, 0} ≥ 0 and define

∫ f dμ = ∫ f^+ dμ − ∫ f^− dμ.
Finally, for a measurable function f : X → C, we define

∫ f dμ = ∫ Re f dμ + i ∫ Im f dμ.

We say that f is integrable if

∫ |f| dμ < +∞.
4.7 Examples
Lebesgue measure. Let X = [0, 1] and let μ denote Lebesgue measure on the Borel σ-algebra. If f : [0, 1] → R is Riemann integrable then it is also Lebesgue integrable and the two definitions agree.
The Stieltjes integral. Let ρ : [0, 1] → R^+ and suppose that ρ is differentiable. Then

∫ f dμ_ρ = ∫ f(x) ρ′(x) dx.
Integration with respect to Dirac measures. Let x ∈ X and recall that we defined the Dirac measure by

δ_x(A) = 1 if x ∈ A,  0 if x ∉ A.

If χ_A denotes the characteristic function of A then

∫ χ_A dδ_x = 1 if x ∈ A,  0 if x ∉ A.

Hence if f = Σ a_i χ_{A_i} is a simple function then ∫ f dδ_x = a_i where x ∈ A_i. Now let f : X → R. By choosing an increasing sequence of simple functions, we see that

∫ f dδ_x = f(x).
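In code, integration against δ_x is literally evaluation at x, and the same idea extends to any finite atomic measure Σ w_i δ_{x_i}. A sketch of ours (names hypothetical):

```python
def integrate_dirac(f, x):
    """Integral of f against the Dirac measure at x is just f(x)."""
    return f(x)

def integrate_atomic(f, atoms):
    """More generally, for a purely atomic measure mu = sum_i w_i * delta_{x_i},
    integral f dmu = sum_i w_i * f(x_i). `atoms` is a list of (x_i, w_i)."""
    return sum(w * f(x) for x, w in atoms)

f = lambda x: x * x
print(integrate_dirac(f, 3.0))                        # f(3) = 9.0
print(integrate_atomic(f, [(1.0, 0.5), (3.0, 0.5)]))  # (1 + 9) / 2 = 5.0
```

With weights summing to 1, the atomic case is a probability measure, a convex combination of Dirac measures in the sense of Proposition 5.4 below.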
4.8 The L^p Spaces
Let us say that two measurable functions f, g : X → C are equivalent if f = g μ-a.e. We shall write L^1(X, B, μ) (or L^1(μ)) for the set of equivalence classes of integrable functions on (X, B, μ). We define

‖f‖_1 = ∫ |f| dμ.

Then d(f, g) = ‖f − g‖_1 is a metric on L^1(X, B, μ).
More generally, for any p ≥ 1, we can define the space L^p(X, B, μ) consisting of (equivalence classes of) measurable functions f : X → C such that |f|^p is integrable. We can again define a metric on L^p(X, B, μ) by defining d(f, g) = ‖f − g‖_p where

‖f‖_p = ( ∫ |f|^p dμ )^{1/p}.
It is worth remarking that convergence in Lp neither implies nor is implied by convergence
almost everywhere.
If (X, B, µ) is a finite measure space and if 1 ≤ p < q then
Lq (X, B, µ) ⊂ Lp (X, B, µ).
Apart from L^1, the most interesting L^p space is L^2(X, B, μ). This is a Hilbert space with the inner product

⟨f, g⟩ = ∫ f ḡ dμ.
Remark 4.15. We shall continually abuse notation by saying that, for example, a function
f ∈ L1 (X, B, µ) when, strictly speaking, we mean that the equivalence class of f lies in
L1 (X, B, µ).
Exercise 4.16. Give an example of a sequence of functions f_n ∈ L^1([0, 1], B, μ) (μ = Lebesgue) such that f_n → 0 μ-a.e. but f_n ↛ 0 in L^1.
Exercise 4.17. Give an example to show that L^2(R, B, μ) ⊄ L^1(R, B, μ) where μ is Lebesgue measure.
4.9 Convergence theorems
We state the following two convergence theorems for integration.
Theorem 4.18 (Monotone Convergence Theorem). Suppose that f_n : X → R is an increasing sequence of integrable functions on (X, B, μ). If ∫ f_n dμ is a bounded sequence of real numbers then lim_{n→∞} f_n exists μ-a.e. and is integrable and

∫ lim_{n→∞} f_n dμ = lim_{n→∞} ∫ f_n dμ.
Theorem 4.19 (Dominated Convergence Theorem). Suppose that g : X → R is integrable and that f_n : X → R is a sequence of measurable functions with |f_n| ≤ g μ-a.e. and lim_{n→∞} f_n = f μ-a.e. Then f is integrable and

lim_{n→∞} ∫ f_n dμ = ∫ f dμ.
Remark 4.20. Both the Monotone Convergence Theorem and the Dominated Convergence
Theorem fail for Riemann integration.
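A standard concrete instance of the Dominated Convergence Theorem is f_n(x) = x^n on [0, 1]: the pointwise limit is 0 for x < 1 (so 0 almost everywhere), the sequence is dominated by g ≡ 1, and indeed ∫ f_n dμ = 1/(n+1) → 0 = ∫ lim f_n dμ. A numerical sketch of ours:

```python
def integral_fn(n, steps=100000):
    """Midpoint-rule approximation of the integral of x^n over [0, 1]."""
    h = 1.0 / steps
    return sum(((k + 0.5) * h) ** n for k in range(steps)) * h

for n in (1, 5, 50):
    # numeric value vs. exact 1/(n+1); the two columns agree and tend to 0
    print(n, integral_fn(n), 1.0 / (n + 1))
```

Note that the convergence of the integrals is not given by monotonicity (the sequence decreases), so it is the domination by g ≡ 1 that does the work here.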
5 Measures on Compact Metric Spaces
5.1 The Riesz Representation Theorem
Let X be a compact metric space and let

C(X, R) = {f : X → R | f is continuous}

denote the space of all continuous functions on X. Equip C(X, R) with the metric

d(f, g) = ‖f − g‖_∞ = sup_{x∈X} |f(x) − g(x)|.
Let B denote the Borel σ-algebra on X and let μ be a probability measure on (X, B). Then we can think of μ as a functional that acts on C(X, R), namely

C(X, R) → R : f ↦ ∫ f dμ.

We will often write μ(f) for ∫ f dμ.
Notice that this map enjoys several natural properties:
(i) the functional defined by μ is continuous: i.e. if f_n ∈ C(X, R) and f_n → f then μ(f_n) → μ(f).
(i′) the functional defined by μ is bounded: i.e. if f ∈ C(X, R) then |μ(f)| ≤ ‖f‖_∞.
(ii) the functional defined by μ is linear:

μ(λ_1 f_1 + λ_2 f_2) = λ_1 μ(f_1) + λ_2 μ(f_2)

where λ_1, λ_2 ∈ R and f_1, f_2 ∈ C(X, R).
(iii) if f ≥ 0 then μ(f) ≥ 0 (i.e. the map μ is positive);
(iv) consider the function 1 defined by 1(x) ≡ 1 for all x; then μ(1) = 1 (i.e. the map μ is normalised).
Exercise 5.1. Prove the above assertions.
Remark 5.2. It can be shown that a linear functional is continuous if and only if it is bounded. Thus in the presence of (ii), we have that (i) is equivalent to (i′).
The Riesz Representation Theorem says that the above properties characterise all Borel
probability measures on X. That is, if we have a map w : C(X, R) → R that satisfies the
above four properties, then w must be given by integrating with respect to a Borel probability
measure. This will be a very useful method of constructing measures: we need only construct
continuous positive normalised linear functionals!
Theorem 5.3 (Riesz Representation Theorem). Let w : C(X, R) → R be a functional such
that:
(i) w is bounded: i.e. for all f ∈ C(X, R) we have |w(f)| ≤ ‖f‖_∞;
(ii) w is linear: i.e. w(λ_1 f_1 + λ_2 f_2) = λ_1 w(f_1) + λ_2 w(f_2);
(iii) w is positive: i.e. if f ≥ 0 then w(f) ≥ 0;
(iv) w is normalised: i.e. w(1) = 1.
Then there exists a Borel probability measure μ ∈ M(X) such that

w(f) = ∫ f dμ.

Moreover, μ is unique.
5.2 The space M(X)
In all of the examples that we shall consider, X will be a compact metric space and B will be
the Borel σ-algebra.
We will also be interested in the space of continuous R-valued functions
C(X, R) = {f : X → R | f is continuous}.
This space is also a metric space. We can define a metric on C(X, R) by first defining

‖f‖_∞ = sup_{x∈X} |f(x)|

and then defining

ρ(f, g) = ‖f − g‖_∞.

This metric turns C(X, R) into a complete metric space. (Recall that a metric space is said to be complete if every Cauchy sequence is convergent.) Note also that C(X, R) is a vector space.
An important property of C(X, R) that will prove to be useful later on is that it is separable, that is, it contains a countable dense subset.
Rather than fixing one measure on (X, B), it is interesting to consider the totality of
possible (probability) measures. To formalise this, let M(X) denote the set of all probability
measures on (X, B). The following simple fact will be useful later on.
Proposition 5.4. The space M(X) is convex: if µ1 , µ2 ∈ M(X) and 0 ≤ α ≤ 1 then
αµ1 + (1 − α)µ2 ∈ M(X).
Exercise 5.5. Prove the above proposition.
5.3 The weak∗ topology on M(X)
It will be very important to have a sensible notion of convergence in M(X); this is called weak∗ convergence. We say that a sequence of probability measures μ_n weak∗ converges to μ, as n → ∞ if, for every f ∈ C(X, R),

∫ f dμ_n → ∫ f dμ, as n → ∞.

If μ_n weak∗ converges to μ then we write μ_n ⇀ μ. (Note that with this definition it is not necessarily true that μ_n(B) → μ(B), as n → ∞, for B ∈ B.) We can make M(X) into a metric space compatible with this definition of convergence by choosing a countable dense subset {f_n}_{n=1}^∞ ⊂ C(X, R) and, for μ, m ∈ M(X), setting

d(μ, m) = Σ_{n=1}^∞ (1 / (2^n ‖f_n‖_∞)) | ∫ f_n dμ − ∫ f_n dm |.

However, we will not need to work with a particular metric: what is important is the definition of convergence.
Notice that there is a continuous embedding of X in M(X) given by the map X → M(X) : x ↦ δ_x, where δ_x is the Dirac measure at x:

δ_x(A) = 1 if x ∈ A,  0 if x ∉ A

(so that ∫ f dδ_x = f(x)).
Exercise 5.6. Show that the map δ : X → M(X) is continuous. (Hint: This is really just
unravelling the underlying definitions.)
Exercise 5.7. Let X be a compact metric space. For μ ∈ M(X) define

‖μ‖ = sup_{f∈C(X,R), ‖f‖_∞≤1} | ∫ f dμ |.

We say that μ_n converges strongly to μ if ‖μ_n − μ‖ → 0 as n → ∞. The topology this determines is called the strong topology (or the operator topology).
(i) Show that if μ_n → μ strongly then μ_n ⇀ μ in the weak∗ topology.
(ii) Show that X ↪ M(X) : x ↦ δ_x is not continuous in the strong topology.
(iii) Prove that ‖δ_x − δ_y‖ = 2 if x ≠ y. (You may use Urysohn's Lemma: Let A and B be disjoint closed subsets of a metric space X. Then there is a continuous function f ∈ C(X, R) such that 0 ≤ f ≤ 1 on X while f ≡ 0 on A and f ≡ 1 on B.)
Hence prove that M(X) is not compact in the strong topology when X is infinite.
Exercise 5.8. Give an example of a sequence of measures μ_n and a set B such that μ_n ⇀ μ but μ_n(B) ↛ μ(B).
5.4 M(X) is weak∗ compact
We can use the Riesz Representation Theorem to establish another important property of
M(X): that it is compact.
Theorem 5.9. Let X be a compact metric space. Then M(X) is weak∗ compact.
Proof. In fact, we shall show that M(X) is sequentially compact, i.e., that any sequence μ_n ∈ M(X) has a convergent subsequence. For convenience, we shall write μ(f) = ∫ f dμ.
Since C(X, R) is separable, we can choose a countable dense subset of functions {f_i}_{i=1}^∞ ⊂ C(X, R). Given a sequence μ_n ∈ M(X), we shall first consider the sequence of real numbers μ_n(f_1) ∈ R. We have that |μ_n(f_1)| ≤ ‖f_1‖_∞ for all n, so μ_n(f_1) is a bounded sequence of real numbers. As such, it has a convergent subsequence, μ_n^(1)(f_1) say.
Next we apply the sequence of measures μ_n^(1) to f_2 and consider the sequence μ_n^(1)(f_2) ∈ R. Again, this is a bounded sequence of real numbers and so it has a convergent subsequence μ_n^(2)(f_2).
In this way we obtain, for each i ≥ 1, nested subsequences {μ_n^(i)} ⊂ {μ_n^(i−1)} such that μ_n^(i)(f_j) converges for 1 ≤ j ≤ i. Now consider the diagonal sequence μ_n^(n). Since, for n ≥ i, μ_n^(n) is a subsequence of μ_n^(i), μ_n^(n)(f_i) converges for every i ≥ 1.
We can now use the fact that {f_i} is dense to show that μ_n^(n)(f) converges for all f ∈ C(X, R), as follows. For any ε > 0, we can choose f_i such that ‖f − f_i‖_∞ ≤ ε. Since μ_n^(n)(f_i) converges, there exists N such that if n, m ≥ N then

|μ_n^(n)(f_i) − μ_m^(m)(f_i)| ≤ ε.

Thus if n, m ≥ N we have

|μ_n^(n)(f) − μ_m^(m)(f)| ≤ |μ_n^(n)(f) − μ_n^(n)(f_i)| + |μ_n^(n)(f_i) − μ_m^(m)(f_i)| + |μ_m^(m)(f_i) − μ_m^(m)(f)| ≤ 3ε,

so μ_n^(n)(f) converges, as required.
To complete the proof, write w(f) = lim_{n→∞} μ_n^(n)(f). We claim that w satisfies the hypotheses of the Riesz Representation Theorem and so corresponds to integration with respect to a probability measure.
(i) By construction, w is a linear mapping: w(λf + μg) = λw(f) + μw(g).
(ii) As |w(f)| ≤ ‖f‖_∞, we see that w is bounded.
(iii) If f ≥ 0 then it is easy to check that w(f) ≥ 0. Hence w is positive.
(iv) It is easy to check that w is normalised: w(1) = 1.
Therefore, by the Riesz Representation Theorem, there exists μ ∈ M(X) such that w(f) = ∫ f dμ. We then have that ∫ f dμ_n^(n) → ∫ f dμ, as n → ∞, for all f ∈ C(X, R), i.e., that μ_n^(n) converges weak∗ to μ, as n → ∞.
6 Measure Preserving Transformations
6.1 Invariant measures
Let (X, B, μ) be a probability space. A transformation T : X → X is said to be measurable if T^{-1}B ∈ B for all B ∈ B.
Definition 6.1. We say that T is a measure-preserving transformation (m.p.t.) or, equivalently, that μ is a T-invariant measure, if μ(T^{-1}B) = μ(B) for all B ∈ B.
Remark 6.2. We write L^1(X, B, μ) for the space of (equivalence classes of) all functions f : X → R that are integrable with respect to the measure μ, i.e.

L^1(X, B, μ) = { f : X → R | f is measurable and ∫ |f| dμ < ∞ }.
Lemma 6.3. The following are equivalent:
(i) T is a measure-preserving transformation;
(ii) for each f ∈ L^1(X, B, μ), we have

∫ f dμ = ∫ f ◦ T dμ.
Proof. Suppose (ii) holds. For B ∈ B, χ_B ∈ L^1(X, B, μ) and χ_B ◦ T = χ_{T^{-1}B}, so we have

μ(B) = ∫ χ_B dμ = ∫ χ_B ◦ T dμ = ∫ χ_{T^{-1}B} dμ = μ(T^{-1}B).

This proves one implication.
Conversely, suppose that T is a measure-preserving transformation. For any characteristic function χ_B, B ∈ B,

∫ χ_B dμ = μ(B) = μ(T^{-1}B) = ∫ χ_{T^{-1}B} dμ = ∫ χ_B ◦ T dμ

and so the equality holds for any simple function (a finite linear combination of characteristic functions). Given any f ∈ L^1(X, B, μ) with f ≥ 0, we can find an increasing sequence of simple functions f_n with f_n → f pointwise, as n → ∞. For each n we have

∫ f_n dμ = ∫ f_n ◦ T dμ

and, applying the Monotone Convergence Theorem to both sides, we obtain

∫ f dμ = ∫ f ◦ T dμ.

To extend the result to general real-valued f, consider the positive and negative parts. This completes the proof.
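The invariance condition μ(T^{-1}B) = μ(B) can be checked by hand for the doubling map T(x) = 2x mod 1 on [0, 1]: the preimage of an interval [a, b] is the union of the two half-length intervals [a/2, b/2] and [(a+1)/2, (b+1)/2], so Lebesgue measure is T-invariant. A sketch of ours using exact rational arithmetic (function names hypothetical):

```python
from fractions import Fraction as F

def preimage_doubling(a, b):
    """Preimage of [a, b] in [0, 1] under T(x) = 2x mod 1:
    two intervals, each of half the length."""
    return [(a / 2, b / 2), ((a + 1) / 2, (b + 1) / 2)]

def total_length(intervals):
    return sum(b - a for a, b in intervals)

a, b = F(3, 10), F(4, 5)
pre = preimage_doubling(a, b)
print(total_length(pre) == b - a)  # True: mu(T^{-1}[a, b]) = mu([a, b])
```

Since intervals generate the Borel σ-algebra, checking invariance on intervals suffices (by the uniqueness part of the Kolmogorov Extension Theorem).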
6.2 Continuous transformations
We shall now concentrate on the special case where X is a compact metric space, B is the
Borel σ-algebra and T is a continuous mapping (in which case T is measurable). The map
T induces a mapping on the set of (Borel) probability measures M(X) as follows:
Definition 6.4. Define the induced mapping T_∗ : M(X) → M(X) by

(T_∗μ)(B) = μ(T^{-1}B).

(We call T_∗μ the push-forward of μ by T.)
Exercise 6.5. Check that T∗ µ is a probability measure.
Then µ is T -invariant if and only if T∗ µ = µ. Write
M(X, T ) = {µ ∈ M(X) | T∗ µ = µ}.
Lemma 6.6. For f ∈ C(X, R) we have

∫ f d(T_∗μ) = ∫ f ◦ T dμ.
Proof. From the definition, for B ∈ B,

∫ χ_B d(T_∗μ) = ∫ χ_B ◦ T dμ.

Thus the result also holds for simple functions. If f ∈ C(X, R) is such that f ≥ 0, we can choose an increasing sequence of simple functions f_n converging to f pointwise. We have

∫ f_n d(T_∗μ) = ∫ f_n ◦ T dμ

and, applying the Monotone Convergence Theorem to each side, we obtain

∫ f d(T_∗μ) = ∫ f ◦ T dμ.

The result extends to general real-valued f ∈ C(X, R) by considering positive and negative parts.
Lemma 6.7. Let T : X → X be a continuous mapping of a compact metric space. The
following are equivalent:
(i) T∗ µ = µ;
(ii) for all f ∈ C(X, R)
∫ f dµ = ∫ f ◦ T dµ.
Proof. (i) ⇒ (ii): This follows from Lemma 6.3, since C(X, R) ⊂ L1 (X, B, µ).
(ii) ⇒ (i): Define two linear functionals w1 , w2 : C(X, R) → R as follows:
w1 (f ) = ∫ f dµ,   w2 (f ) = ∫ f d(T∗ µ).
Note that both w1 and w2 are bounded positive normalised linear functionals on C(X, R).
Moreover, by Lemma 6.6
w2 (f ) = ∫ f d(T∗ µ) = ∫ f ◦ T dµ = ∫ f dµ = w1 (f )
so that w1 and w2 determine the same linear functional. By uniqueness in the Riesz Representation Theorem, this implies that T∗ µ = µ.
Exercise 6.8. Show that the map T∗ : M(X) → M(X) is continuous in the weak∗ topology.
6.3 Existence of invariant measures
Given a continuous mapping T : X → X of a compact metric space, it is natural to ask
whether invariant measures necessarily exist, i.e., whether M(X, T ) ≠ ∅. The next result
shows that this is the case.
Theorem 6.9. Let T : X → X be a continuous mapping of a compact metric space. Then
there exists at least one T -invariant probability measure.
Proof. Let σ ∈ M(X) be a probability measure (for example, we could take σ to be a Dirac
measure). Define the sequence µn ∈ M(X) by
µn = (1/n) ∑_{j=0}^{n−1} T∗^j σ,
so that, for B ∈ B,
µn (B) = (1/n) (σ(B) + σ(T −1 B) + · · · + σ(T −(n−1) B)).
Since M(X) is weak∗ compact, some subsequence µnk converges, as k → ∞, to a measure µ ∈ M(X). We shall show that µ ∈ M(X, T ). By Lemma 6.7, this is equivalent to showing that
∫ f dµ = ∫ f ◦ T dµ for all f ∈ C(X, R).
To see this, note that
| ∫ f ◦ T dµ − ∫ f dµ | = lim_{k→∞} | ∫ f ◦ T dµnk − ∫ f dµnk |
= lim_{k→∞} (1/nk) | ∫ ∑_{j=0}^{nk−1} (f ◦ T^{j+1} − f ◦ T^j ) dσ |
= lim_{k→∞} (1/nk) | ∫ (f ◦ T^{nk} − f ) dσ |
≤ lim_{k→∞} 2‖f ‖∞ / nk = 0.
Therefore, µ ∈ M(X, T ), as required.
6.4 Properties of M(X, T )
We now know that M(X, T ) 6= ∅. The next result gives us some basic information about its
structure.
Theorem 6.10. (i) M(X, T ) is convex: i.e. µ1 , µ2 ∈ M(X, T ) ⇒ αµ1 + (1 − α)µ2 ∈
M(X, T ), for all 0 ≤ α ≤ 1.
(ii) M(X, T ) is weak∗ closed (and hence compact).
Proof. (i) If µ1 , µ2 ∈ M(X, T ) and 0 ≤ α ≤ 1 then
(αµ1 + (1 − α)µ2 )(T −1 B)
= αµ1 (T −1 B) + (1 − α)µ2 (T −1 B)
= αµ1 (B) + (1 − α)µ2 (B) = (αµ1 + (1 − α)µ2 )(B),
so αµ1 + (1 − α)µ2 ∈ M(X, T ).
(ii) Let µn be a sequence in M(X, T ) and suppose that µn ⇀ µ ∈ M(X), as n → ∞. For f ∈ C(X, R),
∫ f dµn = ∫ f ◦ T dµn .
As n → ∞, the left-hand side converges to ∫ f dµ and the right-hand side converges to ∫ f ◦ T dµ. Hence ∫ f dµ = ∫ f ◦ T dµ and so, by Lemma 6.7, µ ∈ M(X, T ). This shows that M(X, T ) is closed. It is compact since it is a closed subset of the compact set M(X).
6.5 Simple examples
We give two methods by which one can show that a given dynamical system preserves a given
measure. We shall illustrate these two methods by proving that (i) a rotation of a torus, and
(ii) the doubling map preserve Lebesgue measure. Let us first recall how these examples are
defined.
6.5.1 Rotations on tori
Take X = Rk /Zk , the k-dimensional torus. Recall that Lebesgue measure µ is defined by
first defining the measure of a k-dimensional cube [a1 , b1 ] × · · · × [ak , bk ] to be
µ([a1 , b1 ] × · · · × [ak , bk ]) = ∏_{j=1}^{k} (bj − aj )
and then extending this to the Borel σ-algebra by using the Kolmogorov Extension Theorem.
Fix α = (α1 , . . . , αk ) ∈ Rk and define T : X → X by
T (x1 , . . . , xk ) = (x1 + α1 , . . . , xk + αk ) mod 1.
(In multiplicative notation this becomes:
T (e 2πiθ1 , . . . , e 2πiθk ) = (e 2πi(θ1 +α1 ) , . . . , e 2πi(θk +αk ) ).)
This is the rotation of the k-dimensional torus Rk /Zk by the vector (α1 , . . . , αk ).
In dimension k = 1 we get a rotation of a circle defined by
T : R/Z → R/Z : x 7→ x + α mod 1.
6.5.2 The doubling map
Let X = R/Z denote the circle. The doubling map is defined to be
T : R/Z → R/Z : x 7→ 2x mod 1.
6.6 Kolmogorov Extension Theorem
Recall the Kolmogorov Extension Theorem:
Theorem 6.11 (Kolmogorov Extension Theorem). Let A be an algebra of subsets of X.
Suppose that µ : A → R+ ∪ {∞} satisfies:
(i) µ(∅) = 0;
(ii) there exist finitely or countably many sets Xn ∈ A such that X = ∪n Xn and µ(Xn ) < ∞;
(iii) if En ∈ A, n ≥ 1, are pairwise disjoint and if ∪_{n=1}^{∞} En ∈ A then
µ( ∪_{n=1}^{∞} En ) = ∑_{n=1}^{∞} µ(En ).
Then there is a unique measure µ : B(A) → R+ ∪ {∞} which is an extension of µ : A → R+ ∪ {∞}.
That is, if something looks like a measure on an algebra A, then it extends uniquely to a
measure defined on the σ-algebra B(A) generated by A.
Corollary 6.12. Let A be an algebra of subsets of X. Suppose that µ1 and µ2 are two
measures on B(A) such that µ1 (E) = µ2 (E) for all E ∈ A. Then µ1 = µ2 on B(A).
To show that a dynamical system T preserves a probability measure µ we have to show
that T∗ µ = µ. By the above corollary, we see that it is sufficient to check that T∗ µ = µ on
an algebra that generates the σ-algebra.
Recall that the collection of all finite unions of sub-intervals forms an algebra of subsets of
both [0, 1] and R/Z that generates the Borel σ-algebra. Similarly, the collection of all finite
unions of k-dimensional sub-cubes of Rk /Zk forms an algebra of subsets of the k-dimensional
torus Rk /Zk that generates the Borel σ-algebra.
Thus to show that a dynamical system T defined on R/Z preserves a measure µ we
need only check that
T∗ µ(a, b) = µ(T −1 (a, b)) = µ(a, b)
for all subintervals (a, b).
6.6.1 Rotations of a circle
We claim that the rotation T (x) = x + α mod 1 preserves Lebesgue measure µ. First note
that
T −1 (a, b) = {x | T (x) ∈ (a, b)} = (a − α, b − α).
Hence
T∗ µ(a, b) = µ(T −1 (a, b)) = µ(a − α, b − α) = (b − α) − (a − α) = b − a = µ(a, b).
Hence T∗ µ = µ on the algebra of finite unions of subintervals. As this algebra generates the
Borel σ-algebra, by uniqueness in the Kolmogorov Extension Theorem we see that T∗ µ = µ;
i.e. Lebesgue measure is T -invariant.
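This calculation can also be checked numerically. The following Python sketch (ours, not part of the notes; the function names are our own) estimates the push-forward measure (T∗ µ)(a, b) = µ(T −1 (a, b)) by Monte Carlo sampling and compares it with µ(a, b) = b − a.

```python
import random

def rotation(x, alpha):
    """The circle rotation T(x) = x + alpha (mod 1)."""
    return (x + alpha) % 1.0

def pushforward_measure(T, a, b, n=200_000, seed=0):
    """Monte Carlo estimate of (T*mu)(a,b) = mu(T^{-1}(a,b)) for Lebesgue mu:
    the fraction of uniformly sampled points x whose image T(x) lies in (a,b)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if a < T(rng.random()) < b)
    return hits / n

alpha = 0.41421356  # an arbitrary rotation angle
a, b = 0.2, 0.7
estimate = pushforward_measure(lambda x: rotation(x, alpha), a, b)
```

Up to Monte Carlo error, the estimate agrees with b − a = 0.5, and this does not depend on the choice of α.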
6.6.2 The doubling map
We claim that the doubling map T (x) = 2x mod 1 preserves Lebesgue measure µ. First note
that
T −1 (a, b) = {x | T (x) ∈ (a, b)} = (a/2, b/2) ∪ ((a + 1)/2, (b + 1)/2).
Hence
T∗ µ(a, b) = µ(T −1 (a, b))
= µ((a/2, b/2)) + µ(((a + 1)/2, (b + 1)/2))
= (b/2 − a/2) + ((b + 1)/2 − (a + 1)/2)
= b − a = µ(a, b).
Hence T∗ µ = µ on the algebra of finite unions of subintervals. As this algebra generates
the Borel σ-algebra, by uniqueness in the Kolmogorov Extension Theorem we see that T∗ µ =
µ; i.e. Lebesgue measure is T -invariant.
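The two-branch computation above can be mirrored in a few lines of Python (a sketch of ours, not part of the notes):

```python
def doubling_preimage_measure(a, b):
    """Lebesgue measure of T^{-1}(a,b) for the doubling map T(x) = 2x mod 1.
    The preimage is the disjoint union (a/2, b/2) ∪ ((a+1)/2, (b+1)/2):
    each branch is half the length of (a,b), and there are two branches."""
    return (b/2 - a/2) + ((b + 1)/2 - (a + 1)/2)
```

For every subinterval the two half-length branches together restore the original length b − a.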
6.7 Fourier series
Fourier series
Let B denote the Borel σ-algebra on R/Z and let µ be Lebesgue measure. Given a Lebesgue
integrable function f ∈ L1 (R/Z, B, µ), we can associate to f the Fourier series
a0 /2 + ∑_{n=1}^{∞} (an cos 2πnx + bn sin 2πnx),
where
an = 2 ∫_0^1 f (x) cos 2πnx dµ,   bn = 2 ∫_0^1 f (x) sin 2πnx dµ.
(Notice that we are not claiming that the series converges—we are just formally associating
the Fourier series to f .)
We shall find it more convenient to work with a complex form of the Fourier series:
∑_{n=−∞}^{∞} cn e^{2πinx} ,
where
cn = ∫_0^1 f (x)e^{−2πinx} dµ.
(In particular, c0 = ∫_0^1 f dµ.)
We are still not making any assumption as to (i) whether the series converges at all, or
(ii) whether, if the series does converge, it converges to f (x). In general, answering these
questions relies on the class of function to which f belongs.
The weakest class of function is f ∈ L1 (X, B, µ). In this case, we only know that the coefficients cn → 0 as |n| → ∞. Although this condition is clearly necessary for ∑_{n=−∞}^{∞} cn e^{2πinx} to converge, it is not sufficient, and there exist examples of functions f ∈ L1 (X, B, µ) for which the series does not converge to f (x).
Lemma 6.13 (Riemann–Lebesgue Lemma). If f ∈ L1 (R/Z, B, µ) then cn → 0 as |n| → ∞, i.e.:
lim_{n→±∞} ∫_0^1 f (x)e^{−2πinx} dµ = 0.
It is of great interest and practical importance to know when and in what sense the Fourier
series converges to the original function f . For convenience, we shall write the nth partial
sum of a Fourier series as
sn (x) = ∑_{ℓ=−n}^{n} cℓ e^{2πiℓx}
and the average of the first n partial sums as
σn (x) = (1/n) (s0 (x) + s1 (x) + · · · + sn−1 (x)).
We define L2 (X, B, µ) to be the set of all functions f : X → R such that ∫ |f |2 dµ < ∞. Notice that L2 ⊂ L1 .
Theorem 6.14. (i) (Riesz–Fischer Theorem) If f ∈ L2 (R/Z, B, µ) then sn converges to f in L2 , i.e.,
∫ |sn − f |2 dµ → 0, as n → ∞.
(ii) (Fejér's Theorem) If f ∈ C(R/Z) then σn converges uniformly to f as n → ∞, i.e.,
‖σn − f ‖∞ → 0, as n → ∞.
In summary:

Class of function   Property of Fourier coefficients        Fourier series converges to function
L1                  cn → 0                                  Not in general
L2                  partial sums sn converge                Yes, sn → f (convergence in L2 sense)
continuous          averages σn of partial sums converge    Yes, σn → f (uniform convergence)

6.7.1 Rotations of a circle
Let T (x) = x + α mod 1 be a circle rotation. We now give an alternative method of proving
that µ is T -invariant using Fourier series. Recall Lemma 6.7: µ is T -invariant if and only if
∫ f ◦ T dµ = ∫ f dµ for all f ∈ C(X, R).
Heuristically, the argument is as follows. First note that
∫ e^{2πinx} dµ = 0 if n ≠ 0, and = 1 if n = 0.
If f ∈ C(X, R) has Fourier series ∑_{n∈Z} cn e^{2πinx} then f ◦ T has Fourier series ∑_{n∈Z} cn e^{2πinα} e^{2πinx} . The underlying idea is the following:
∫ f ◦ T dµ = ∫ ∑_{n∈Z} cn e^{2πinα} e^{2πinx} dµ = ∑_{n∈Z} cn e^{2πinα} ∫ e^{2πinx} dµ = c0 = ∫ f dµ.
Notice that the above involves saying that 'the integral of an infinite sum is the infinite sum of the integrals'. This is not necessarily the case, so to make this argument rigorous we need to use Fejér's Theorem (Theorem 6.14(ii)) to justify this step.
Let f ∈ C(X, R). Then f has a Fourier series
∑_{n∈Z} cn e^{2πinx} .
Let sn (x) denote the nth partial sum:
sn (x) = ∑_{ℓ=−n}^{n} cℓ e^{2πiℓx} .
Then
sn (T x) = ∑_{ℓ=−n}^{n} cℓ e^{2πiℓα} e^{2πiℓx}
and this is the nth partial sum for the Fourier series of f ◦ T . As ∫ e^{2πiℓx} dµ = 0 unless ℓ = 0, it follows that
∫ sn dµ = c0 = ∫ sn ◦ T dµ.
Consider
σn (x) = (1/n)(s0 + · · · + sn−1 )(x).
Then σn (x) → f (x) uniformly. Moreover, σn (T x) → f (T x) uniformly. Hence
∫ f dµ = lim_{n→∞} ∫ σn dµ = c0 = lim_{n→∞} ∫ σn ◦ T dµ = ∫ f ◦ T dµ
and Lemma 6.7 implies that Lebesgue measure is invariant.
6.7.2 The doubling map
Define T : X → X by
T (x) = 2x mod 1.
Heuristically, the argument is as follows: if f has Fourier series ∑_n cn e^{2πinx} then f ◦ T has Fourier series ∑_n cn e^{2πi2nx} . Hence
∫ f ◦ T dµ = ∫ ∑_n cn e^{2πi2nx} dµ = ∑_n cn ∫ e^{2πi2nx} dµ = c0 = ∫ f dµ.
Again, this needs to be made rigorous, and the argument is similar to that above.
6.7.3 Higher dimensional Fourier series
Let X = Rk /Zk be the k-dimensional torus and let µ denote Lebesgue measure on X. Let
f ∈ L1 (X, B, µ) be an integrable function defined on the torus. For each n = (n1 , . . . , nk ) ∈
Zk define
cn = ∫ f (x)e^{−2πi⟨n,x⟩} dµ,
where ⟨n, x⟩ = n1 x1 + · · · + nk xk . Then we can associate to f the Fourier series:
∑_{n∈Zk} cn e^{2πi⟨n,x⟩} ,
where n = (n1 , . . . , nk ), x = (x1 , . . . , xk ). Essentially the same convergence results hold as in the case k = 1, provided that we write
sn (x) = ∑_{ℓ1 =−n}^{n} · · · ∑_{ℓk =−n}^{n} cℓ e^{2πi⟨ℓ,x⟩} .
Exercise 6.15. For an integer k ≥ 2 define T : R/Z → R/Z by T (x) = kx mod 1. Show that
T preserves Lebesgue measure.
Exercise 6.16. Let β > 1 denote the golden ratio (so that β 2 = β + 1). Define T : [0, 1] → [0, 1] by T (x) = βx mod 1. Show that T does not preserve Lebesgue measure. Define the measure µ by µ(B) = ∫_B k(x) dx where
k(x) = 1/(1/β + 1/β 3 ) on [0, 1/β),   k(x) = 1/(β(1/β + 1/β 3 )) on [1/β, 1).
By using the Kolmogorov Extension Theorem, show that T preserves µ.
Exercise 6.17. Define the logistic map T : [0, 1] → [0, 1] by T (x) = 4x(1 − x). Define the
measure µ by
µ(B) = (1/π) ∫_B 1/√(x(1 − x)) dx.
(i) Check that µ is a probability measure.
(ii) By using the Kolmogorov Extension Theorem, show that T preserves µ.
6.8 The continued fraction map
Recall that the continued fraction map T : [0, 1) → [0, 1) is defined by
T (x) = 0 if x = 0, and T (x) = 1/x mod 1 if 0 < x < 1.
One can easily show that the continued fraction map does not preserve Lebesgue measure,
i.e. there exists B ∈ B such that T −1 B and B have different measure. (Indeed, choose B to
be any interval.)
Although the continued fraction map does not preserve Lebesgue measure, it does preserve
Gauss’ measure µ, defined by
µ(B) = (1/log 2) ∫_B 1/(1 + x) dx.
Remark 6.18. Two measures are said to be equivalent if they have the same sets of measure
zero. Gauss’ measure and Lebesgue measure are equivalent. This means that any property
that holds for µ-almost every point also holds for Lebesgue almost every point. This remark
will have applications later when we use Birkhoff’s Ergodic Theorem to describe properties
of the continued fraction expansion for typical (i.e. Lebesgue almost every) points.
Proof. Using the Kolmogorov Extension Theorem argument, we only have to check that
µ(T −1 I) = µ(I) for intervals. If I = (a, b) then
T −1 (a, b) = ∪_{n=1}^{∞} (1/(b + n), 1/(a + n)).
Thus
µ(T −1 (a, b)) = (1/log 2) ∑_{n=1}^{∞} ∫_{1/(b+n)}^{1/(a+n)} 1/(1 + x) dx
= (1/log 2) ∑_{n=1}^{∞} [ log(1 + 1/(a + n)) − log(1 + 1/(b + n)) ]
= (1/log 2) ∑_{n=1}^{∞} [ log(a + n + 1) − log(a + n) − log(b + n + 1) + log(b + n) ]
= lim_{N→∞} (1/log 2) ∑_{n=1}^{N} [ log(a + n + 1) − log(a + n) − log(b + n + 1) + log(b + n) ]
= (1/log 2) lim_{N→∞} [ log(a + N + 1) − log(a + 1) − log(b + N + 1) + log(b + 1) ]
= (1/log 2) ( log(b + 1) − log(a + 1) + lim_{N→∞} log((a + N + 1)/(b + N + 1)) )
= (1/log 2) ( log(b + 1) − log(a + 1) )
= (1/log 2) ∫_a^b 1/(1 + x) dx = µ(a, b),
as required.
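The series above converges quickly, so the invariance of Gauss' measure can also be verified numerically. The sketch below (our code, not part of the notes) sums the Gauss measures of the first branches of T −1 (a, b):

```python
import math

def gauss_measure(a, b):
    """Gauss measure of an interval (a,b) ⊂ (0,1)."""
    return (math.log(1 + b) - math.log(1 + a)) / math.log(2)

def gauss_measure_of_preimage(a, b, branches=100_000):
    """Gauss measure of T^{-1}(a,b) = union over n >= 1 of (1/(b+n), 1/(a+n)),
    truncated to the first `branches` terms of the series."""
    return sum(gauss_measure(1/(b + n), 1/(a + n))
               for n in range(1, branches + 1))

a, b = 0.2, 0.7
error = abs(gauss_measure_of_preimage(a, b) - gauss_measure(a, b))
```

The truncation error of the series is of order (b − a)/branches, so `error` is small but not exactly zero.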
Exercise 6.19. Define the map T : [0, 1] → [0, 1] by
T (x) = x/(1 − x) if 0 ≤ x ≤ 1/2,   (1 − x)/x if 1/2 ≤ x ≤ 1.
Define the measure µ on [0, 1] by
µ(B) = ∫_B dx/x
(note that the measure µ is not a probability measure as µ([0, 1]) = ∞).
(i) Show that µ([a, b]) = log b − log a.
(ii) Show that
T −1 [a, b] = [a/(1 + a), b/(1 + b)] ∪ [1/(1 + b), 1/(1 + a)].
(iii) Show that µ is T -invariant.
(iv) Define h : [0, 1] → [0, ∞] by
h(x) = 1/x − 1.
Define S = hT h−1 : [0, ∞] → [0, ∞] (so that S and T are topologically conjugate—i.e. they have the same dynamics). Show that we have
S(x) = x − 1 if 1 ≤ x < ∞,   1/x − 1 if 0 ≤ x < 1.
Relate the map S to continued fractions.
6.9 Linear toral endomorphisms
Let T : Rk /Zk → Rk /Zk be a linear toral endomorphism. Recall that this means that T is
given as follows:
T (x1 , . . . , xk ) = A(x1 , . . . , xk ) mod 1
where A = (ai,j ) is a k × k matrix with entries in Z and with det A ≠ 0.
We shall show that µ is T -invariant by using Fourier series.
6.9.1 Fourier series in higher dimensions
Let X = Rk /Zk be the k-dimensional torus and let µ denote Lebesgue measure on X. Let
f ∈ L1 (X, B, µ) be an integrable function defined on the torus. For each n = (n1 , . . . , nk ) ∈
Zk define
cn = ∫ f (x1 , . . . , xk )e^{−2πi⟨n,x⟩} dµ
where ⟨n, x⟩ = n1 x1 + · · · + nk xk .
Then we can associate to f the Fourier series:
∑_{n∈Zk} cn e^{2πi⟨n,x⟩} ,
where n = (n1 , . . . , nk ), x = (x1 , . . . , xk ). Essentially the same convergence results hold as in the case k = 1, provided that we write
sn (x) = ∑_{ℓ1 =−n}^{n} · · · ∑_{ℓk =−n}^{n} cℓ e^{2πi⟨ℓ,x⟩} .
As in the one-dimensional case, we have that
c0 = ∫ f dµ
and
∫ e^{2πi⟨n,x⟩} dµ = 0 if n ≠ (0, . . . , 0), and = 1 if n = (0, . . . , 0).

6.9.2 Lebesgue measure is an invariant measure for a toral endomorphism
Let µ denote Lebesgue measure. To show that µ is T -invariant, it is sufficient to prove that
for each continuous function f ∈ C(X, R) we have
∫ f ◦ T dµ = ∫ f dµ.
We associate to such an f its Fourier series:
∑_{n∈Zk} cn e^{2πi⟨n,x⟩} .
Then f ◦ T has Fourier series
∑_{n∈Zk} cn e^{2πi⟨n,Ax⟩} .
Intuitively, we can write
∫ f ◦ T dµ = ∫ ∑_{n∈Zk} cn e^{2πi⟨n,Ax⟩} dµ = ∫ ∑_{n∈Zk} cn e^{2πi⟨nA,x⟩} dµ = ∑_{n∈Zk} cn ∫ e^{2πi⟨nA,x⟩} dµ.
Using the fact that det A ≠ 0, we see that nA = 0 if and only if n = 0. Hence, all of the integrals above are 0, except for the term corresponding to n = 0. Hence
∫ f ◦ T dµ = c0 = ∫ f dµ.
(This argument can be made rigorous as for circle rotations.)
Therefore, by Lemma 6.7, µ is T -invariant.
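As a numerical illustration (ours, not from the notes), one can compare Monte Carlo estimates of ∫ f dµ and ∫ f ◦ T dµ for a concrete toral automorphism, here the standard 'cat map' given by A = [[2, 1], [1, 1]]:

```python
import math
import random

def cat_map(x, y):
    """The toral automorphism of R^2/Z^2 given by A = [[2, 1], [1, 1]]."""
    return ((2*x + y) % 1.0, (x + y) % 1.0)

def compare_integrals(f, n=200_000, seed=1):
    """Monte Carlo estimates of (∫ f dµ, ∫ f∘T dµ) for Lebesgue µ on the 2-torus."""
    rng = random.Random(seed)
    s = s_T = 0.0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        s += f(x, y)
        s_T += f(*cat_map(x, y))
    return s / n, s_T / n

f = lambda x, y: math.cos(2*math.pi*x) * math.sin(2*math.pi*y) + x
i1, i2 = compare_integrals(f)
```

Both estimates agree up to sampling error, reflecting the invariance of Lebesgue measure under T .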
Exercise 6.20. Fix α ∈ R and define the map T : R2 /Z2 → R2 /Z2 by
T (x, y ) = (x + α, x + y ).
By using Fourier Series, sketch a proof that Lebesgue measure is T -invariant.
6.10 Shifts of finite type
Shifts of finite type
Let A be a k × k matrix with entries equal to 0 or 1. Recall that we have defined the
(two-sided) shift of finite type by
ΣA = {x = (xn ) ∈ {1, . . . , k}Z | A(xn , xn+1 ) = 1 ∀n ∈ Z}
and the (one-sided) shift of finite type
+
Z
Σ+
| A(xn , xn+1 ) = 1 ∀n ∈ Z+ }.
A = {x = (xn ) ∈ {1, . . . , k}
+
The shift maps σ : ΣA → ΣA , σ : Σ+
A → ΣA are defined by
σ(. . . , x1 ,
x0 , x1 , x2 , . . .) = (. . . , x0 , x1 , x2 , x3 , . . .),
|{z}
|{z}
0th place
0th place
σ(x0 , x1 , x2 , x3 , . . .) = (x1 , x2 , x3 , . . .),
respectively, i.e., σ shifts sequences one place to the left.
As an analogue of intervals in this case, we have so-called ‘cylinder sets’, formed by fixing
a finite set of co-ordinates. More precisely, in ΣA we define
[y−m , . . . , y−1 , y0 , y1 , . . . , yn ] = {x ∈ ΣA | xi = yi , −m ≤ i ≤ n},
and in Σ+A we define
[y0 , y1 , . . . , yn ] = {x ∈ Σ+A | xi = yi , 0 ≤ i ≤ n}.
In each case the cylinder sets form a semi-algebra which generates the Borel σ-algebra.
(Cylinder sets are both open and closed.)
6.10.1 The Perron-Frobenius theorem
The following standard result will be useful.
Theorem 6.21 (Perron–Frobenius). Let B be a non-negative aperiodic k × k matrix (i.e. Bi,j ≥ 0 for each 1 ≤ i, j ≤ k and there exists n > 0 such that (B^n)i,j > 0 for all 1 ≤ i, j ≤ k). Then:
(i) there exists a positive eigenvalue λ > 0 such that all other eigenvalues λi ∈ C satisfy |λi | < λ;
(ii) the eigenvalue λ is simple (i.e. the corresponding eigenspace is one-dimensional);
(iii) there is a unique right-eigenvector v = (v1 , . . . , vk )T such that vj > 0, ∑_{j=1}^{k} |vj | = 1 and Bv = λv ;
(iv) there is a unique positive left-eigenvector u = (u1 , . . . , uk ) such that uj > 0, ∑_{j=1}^{k} |uj | = 1 and uB = λu;
(v) eigenvectors corresponding to eigenvalues other than λ are not positive: i.e. at least one co-ordinate is positive and at least one co-ordinate is negative.
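For a concrete matrix, λ and the positive eigenvector can be approximated by power iteration. A minimal sketch (ours; `power_iteration` is not a standard library routine):

```python
def power_iteration(B, iters=500):
    """Approximate the Perron-Frobenius eigenvalue and the positive right
    eigenvector of a non-negative aperiodic matrix B: repeatedly apply B
    and renormalise so that the entries of v sum to 1."""
    k = len(B)
    v = [1.0 / k] * k
    lam = 0.0
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(k)) for i in range(k)]
        lam = sum(w)            # since sum(v) = 1, sum(Bv) approximates lambda
        v = [wi / lam for wi in w]
    return lam, v

B = [[1, 1], [1, 0]]            # aperiodic: B^2 has strictly positive entries
lam, v = power_iteration(B)     # here lambda is the golden ratio (1 + sqrt 5)/2
```

Applying the same iteration to the transpose of B gives the left eigenvector u.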
6.10.2 Markov measures
We will now see how to construct a large class of σ-invariant measures on shifts of finite
type.
Definition 6.22. A k × k matrix P is said to be stochastic if:
(i) P (i, j) ≥ 0 for i, j = 1, . . . , k;
(ii) ∑_{j=1}^{k} P (i, j) = 1 for i = 1, . . . , k.
Suppose that P is compatible with A, i.e.,
P (i, j) > 0 ⇐⇒ A(i, j) = 1.
Suppose in addition that P , or equivalently A, is aperiodic, i.e., there exists n such that for
each i , j we have P n (i, j) > 0.
Applying the Perron-Frobenius theorem, we see that there exists a unique maximal eigenvalue λ. As P is stochastic, we must have that λ = 1 (why?). Moreover, by (ii) in the above definition, the right-eigenvector is (1, . . . , 1)T . Let p = (p1 , . . . , pk ) be the corresponding (strictly positive) left eigenvector, normalised so that ∑_{i=1}^{k} pi = 1.
Now we define a probability measure µ = µP on ΣA , Σ+A on cylinder sets by
µP [yℓ , yℓ+1 , . . . , yn ] = pyℓ P (yℓ , yℓ+1 ) · · · P (yn−1 , yn ).
(By the Kolmogorov Extension Theorem, this uniquely defines a measure on the whole Borel σ-algebra.)
We shall show that the measure µP on Σ+A is σ-invariant. By the Kolmogorov Extension Theorem, it is enough to show that µP and σ∗ µP agree on cylinder sets. Now
σ∗ µP [y0 , y1 , . . . , yn ] = µP (σ −1 [y0 , y1 , . . . , yn ])
= µP ( ∪_{j=1}^{k} [j, y0 , y1 , . . . , yn ] )
= ∑_{j=1}^{k} µP [j, y0 , y1 , . . . , yn ]
= ∑_{j=1}^{k} pj P (j, y0 )P (y0 , y1 ) · · · P (yn−1 , yn )
= py0 P (y0 , y1 ) · · · P (yn−1 , yn )
= µP [y0 , y1 , . . . , yn ],
as required (where to get the penultimate line we have used the fact that pP = p). Probability measures of this form are called Markov measures.
Given an aperiodic k ×k matrix A there are of course many compatible stochastic matrices
P . Each such stochastic matrix generates a different Markov measure. However, there are
several naturally defined measures that turn out to be Markov, and we give two of them here.
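The invariance computation above can be checked directly on a small example. In the sketch below (our code; the helper names are ours), P is a stochastic matrix compatible with A = [[1, 1], [1, 0]]:

```python
def stationary_vector(P, iters=1000):
    """Left eigenvector p with pP = p and sum(p) = 1, found by iterating p -> pP."""
    k = len(P)
    p = [1.0 / k] * k
    for _ in range(iters):
        p = [sum(p[i] * P[i][j] for i in range(k)) for j in range(k)]
    return p

def markov_cylinder(p, P, word):
    """mu_P[y_0, ..., y_n] = p_{y_0} P(y_0, y_1) ... P(y_{n-1}, y_n)."""
    m = p[word[0]]
    for a, b in zip(word, word[1:]):
        m *= P[a][b]
    return m

P = [[0.5, 0.5], [1.0, 0.0]]   # stochastic and compatible with A = [[1, 1], [1, 0]]
p = stationary_vector(P)
word = (0, 1, 0, 0)
# sigma^{-1}[word] is the disjoint union of the cylinders [j, word], j = 0, 1,
# so its measure should equal mu_P[word].
preimage = sum(markov_cylinder(p, P, (j,) + word) for j in range(2))
```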
6.10.3 Full shifts
Recall that if A(i, j) = 1 for all i, j then
ΣA = {1, . . . , k}Z ,   Σ+A = {1, . . . , k}Z+
are the full shifts on k symbols. In this case we may define a (family of) measures by taking p = (p1 , . . . , pk ) to be any (positive) probability vector (i.e., pi > 0, ∑_{i=1}^{k} pi = 1) and defining
µ[yℓ , . . . , yn ] = pyℓ · · · pyn .
Such a µ is called a Bernoulli measure.
Exercise 6.23. Show that Bernoulli measures are Markov measures (i.e. construct a matrix P for which (p1 , . . . , pk )P = (p1 , . . . , pk )).
6.10.4 The Parry measure
As A is a non-negative aperiodic matrix, by the Perron-Frobenius theorem there exists a
unique maximal eigenvalue λ with corresponding left and right eigenvectors u = (u1 , . . . , uk )
and v = (v1 , . . . , vk ), respectively. Define
Pi,j = Ai,j vj / (λvi )   and   pi = ui vi / c,   where c = ∑_{i=1}^{k} ui vi .
Exercise 6.24. Show that P is a stochastic matrix and that p is a normalised left-eigenvector for P .
The corresponding Markov measure is called the Parry measure.
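The construction can be explored numerically. The following sketch (our code, using naive power iteration, not a routine from the notes) builds P for the matrix A = [[1, 1], [1, 0]] and makes its stochasticity visible:

```python
def perron_data(A, iters=2000):
    """Approximate the Perron eigenvalue lambda and the right eigenvector v of a
    non-negative aperiodic 0-1 matrix A by power iteration (v normalised to sum 1)."""
    k = len(A)
    v = [1.0 / k] * k
    lam = 1.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(k)) for i in range(k)]
        lam = sum(w)                      # sum(v) = 1, so sum(Av) approximates lambda
        v = [wi / lam for wi in w]
    return lam, v

def parry_matrix(A):
    """The Parry stochastic matrix P_{i,j} = A_{i,j} v_j / (lambda v_i)."""
    lam, v = perron_data(A)
    k = len(A)
    return [[A[i][j] * v[j] / (lam * v[i]) for j in range(k)] for i in range(k)]

A = [[1, 1], [1, 0]]
P = parry_matrix(A)   # each row sums to 1 precisely because Av = lambda * v
```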
7 Ergodicity
7.1 The definition of ergodicity
In this section, we introduce what it means to say that a transformation is ergodic with
respect to an invariant measure. Ergodicity is an important concept for many reasons, not
least because Birkhoff’s Ergodic Theorem holds:
Theorem 7.1. Let T be an ergodic transformation of the probability space (X, B, µ) and let
f ∈ L1 (X, B, µ). Then
(1/n) ∑_{j=0}^{n−1} f (T j x) → ∫ f dµ
for µ-almost every x ∈ X.
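As a numerical illustration (ours, not part of the notes), the ergodic average for an irrational circle rotation, which is shown to be ergodic in Section 7.3, settles down to the space average:

```python
import math

def birkhoff_average(f, T, x, n):
    """The ergodic average (1/n) * sum_{j=0}^{n-1} f(T^j x)."""
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n

alpha = 2**0.5 - 1                            # irrational rotation number
T = lambda x: (x + alpha) % 1.0               # the rotation T(x) = x + alpha mod 1
f = lambda x: math.cos(2 * math.pi * x) ** 2  # here the space average is 1/2
avg = birkhoff_average(f, T, x=0.1, n=100_000)
```

For this ergodic T the time average `avg` is close to ∫ f dµ = 1/2, for almost every (indeed here every) starting point.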
Definition 7.2. Let (X, B, µ) be a probability space and let T : X → X be a measure-preserving transformation. We say that T is an ergodic transformation (or µ is an ergodic measure) if, for B ∈ B,
T −1 B = B ⇒ µ(B) = 0 or 1.
Remark 7.3. One can view ergodicity as an indecomposability condition. If ergodicity does not hold and we have T −1 A = A with 0 < µ(A) < 1, then one can split T : X → X into T : A → A and T : X \ A → X \ A with invariant probability measures (1/µ(A)) µ(· ∩ A) and (1/(1 − µ(A))) µ(· ∩ (X \ A)), respectively.
It will sometimes be convenient for us to weaken the condition T −1 B = B to µ(T −1 B △ B) = 0, where △ denotes the symmetric difference:
A △ B = (A \ B) ∪ (B \ A).
The next lemma allows us to do this.
Lemma 7.4. If B ∈ B satisfies µ(T −1 B △ B) = 0 then there exists B∞ ∈ B with T −1 B∞ = B∞ and µ(B △ B∞ ) = 0. (In particular, µ(B) = µ(B∞ ).)
Proof. For each j ≥ 0, we have the inclusion
T −j B △ B ⊂ ∪_{i=0}^{j−1} (T −(i+1) B △ T −i B) = ∪_{i=0}^{j−1} T −i (T −1 B △ B)
and so (since T preserves µ)
µ(T −j B △ B) ≤ jµ(T −1 B △ B) = 0.
Let
B∞ = ∩_{j=0}^{∞} ∪_{i=j}^{∞} T −i B.
We have that
µ( B △ ∪_{i=j}^{∞} T −i B ) ≤ ∑_{i=j}^{∞} µ(B △ T −i B) = 0.
Since the sets ∪_{i=j}^{∞} T −i B decrease as j increases, we hence have µ(B △ B∞ ) = 0. Also,
T −1 B∞ = ∩_{j=0}^{∞} ∪_{i=j}^{∞} T −(i+1) B = ∩_{j=0}^{∞} ∪_{i=j+1}^{∞} T −i B = B∞ ,
as required.
Corollary 7.5. If T is ergodic and µ(T −1 B △ B) = 0 then µ(B) = 0 or 1.
Remark 7.6. Occasionally, instead of saying that µ(A △ B) = 0, we will say that A = B a.e. or A = B mod 0.
7.2 An alternative characterisation of ergodicity
The next result characterises ergodicity in a convenient way.
Proposition 7.7. Let T be a measure-preserving transformation of (X, B, µ). The following
are equivalent:
(i) T is ergodic;
(ii) whenever f ∈ L1 (X, B, µ) satisfies f ◦ T = f µ-a.e. we have that f is constant µ-a.e.
Remark 7.8. We can replace L1 in the statement by measurable or by L2 .
Proof. (i) ⇒ (ii): Suppose that T is ergodic and that f ∈ L1 (X, B, µ) with f ◦ T = f µ-a.e.
For k ∈ Z and n ∈ N, define
X(k, n) = { x ∈ X | k/2^n ≤ f (x) < (k + 1)/2^n } = f −1 ([k/2^n , (k + 1)/2^n )).
Since f is measurable, X(k, n) ∈ B.
We have that
T −1 X(k, n) △ X(k, n) ⊂ {x ∈ X | f (T x) ≠ f (x)}
so that
µ(T −1 X(k, n) △ X(k, n)) = 0.
Hence µ(X(k, n)) = 0 or µ(X(k, n)) = 1.
For each fixed n, the union ∪_{k∈Z} X(k, n) is equal to X up to a set of measure zero, i.e.,
µ( X △ ∪_{k∈Z} X(k, n) ) = 0,
and this union is disjoint. Hence we have
∑_{k∈Z} µ(X(k, n)) = µ(X) = 1
and so there is a unique kn for which µ(X(kn , n)) = 1. Let
Y = ∩_{n=1}^{∞} X(kn , n).
Then µ(Y ) = 1 and, by construction, f is constant on Y , i.e., f is constant µ-a.e.
(ii) ⇒ (i): Suppose that B ∈ B with T −1 B = B. Then χB ∈ L1 (X, B, µ) and χB ◦T (x) =
χB (x) ∀ x ∈ X, so, by hypothesis, χB is constant µ-a.e. Since χB only takes the values 0
and 1, we must have χB = 0 µ-a.e. or χB = 1 µ-a.e. Therefore
µ(B) = ∫_X χB dµ = 0 or 1,
and T is ergodic.
7.3 Rotations of a circle
Fix α ∈ R and define T : R/Z → R/Z by T (x) = x + α mod 1. We have already seen that
T preserves Lebesgue measure.
Theorem 7.9. Let T (x) = x + α mod 1.
(i) If α ∈ Q then T is not ergodic.
(ii) If α ∉ Q then T is ergodic.
Proof. Suppose that α ∈ Q and write α = p/q for p, q ∈ Z with q ≠ 0. Define
f (x) = e 2πiqx ∈ L2 (X, B, µ).
Then f is not constant but
f (T x) = e 2πiq(x+p/q) = e 2πi(qx+p) = e 2πiqx = f (x).
Hence T is not ergodic.
Suppose that α ∉ Q and suppose that f ∈ L2 (X, B, µ) is such that f ◦ T = f a.e. Suppose that f has Fourier series
∑_{n=−∞}^{∞} cn e^{2πinx} .
Then f ◦ T has Fourier series
∑_{n=−∞}^{∞} cn e^{2πinα} e^{2πinx} .
Comparing Fourier coefficients we see that
cn = cn e^{2πinα}
for all n ∈ Z. As α ∉ Q, e^{2πinα} ≠ 1 unless n = 0. Hence cn = 0 for n ≠ 0. Hence f has Fourier series c0 , i.e. f is constant a.e.
Exercise 7.10. Show that, when α ∈ Q, the rotation T (x) = x + α mod 1 is not ergodic from
the definition, i.e. find an invariant set B = T −1 B which has Lebesgue measure 0 < µ(B) < 1.
7.4 The doubling map
Let X = R/Z and define T : X → X by T (x) = 2x mod 1.
Proposition 7.11. The doubling map T is ergodic with respect to Lebesgue measure µ.
Proof. Let f ∈ L2 (R/Z, B, µ) and suppose that f ◦ T = f µ-a.e. Let f have Fourier series
f (x) = ∑_{m=−∞}^{∞} am e^{2πimx} (in L2 ).
For each j ≥ 0, f ◦ T j has Fourier series
∑_{m=−∞}^{∞} am e^{2πim2^j x} .
Comparing Fourier coefficients we see that
am = a_{2^j m}
for all m ∈ Z and each j = 0, 1, 2, . . .. The Riemann–Lebesgue Lemma says that an → 0 as |n| → ∞. Hence, if m ≠ 0, we have that am = a_{2^j m} → 0 as j → ∞. Hence for m ≠ 0 we have that am = 0. Thus f has Fourier series a0 , and so must be equal to a constant a.e.
Hence T is ergodic with respect to µ.
7.5 Linear toral automorphisms
Let X = Rk /Zk and let µ denote Lebesgue measure. Let A be a k × k integer matrix with
det A = ±1 and define T : X → X by
T (x1 , . . . , xk ) = A(x1 , . . . , xk ) mod 1.
Proposition 7.12. A linear toral automorphism T is ergodic with respect to µ if and only if
no eigenvalue of A is a root of unity.
Remark 7.13. In particular, hyperbolic toral automorphisms (i.e. no eigenvalues of modulus
1) are ergodic with respect to Lebesgue measure.
To prove this criterion, we shall use the following:
Lemma 7.14. Let T be a linear toral automorphism. The following are equivalent:
(i) T is ergodic with respect to µ;
(ii) the only m ∈ Zk for which there exists p > 0 such that
e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} µ-a.e.
is m = 0.
Proof. (i) ⇒ (ii): Suppose that T is ergodic and that there exist m ∈ Zk and p > 0 such that
e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} µ-a.e.
Let p be the least such exponent and define
f (x) = e^{2πi⟨m,x⟩} + e^{2πi⟨m,Ax⟩} + · · · + e^{2πi⟨m,A^{p−1} x⟩} = e^{2πi⟨m,x⟩} + e^{2πi⟨mA,x⟩} + · · · + e^{2πi⟨mA^{p−1} ,x⟩} .
Then f ∈ L2 (Rk /Zk , B, µ) and, since e^{2πi⟨m,·⟩} ◦ T = e^{2πi⟨m,A·⟩} , we have f ◦ T = f µ-a.e. Since T is ergodic, we thus have f = constant a.e. However, the only way that this can happen is if m = 0.
(ii) ⇒ (i): Now suppose that if, for some m ∈ Zk and p > 0, we have
e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} µ-a.e.,
then m = 0. Let f ∈ L2 (Rk /Zk , B, µ) and suppose that f ◦ T = f µ-a.e. We shall show that T is ergodic by showing that f is constant µ-a.e.
We may expand f as a Fourier series
f (x) = ∑_{m∈Zk} am e^{2πi⟨m,x⟩} (in L2 ).
Since f ◦ T p = f µ-a.e., for all p > 0, we have
∑_{m∈Zk} am e^{2πi⟨mA^p ,x⟩} = ∑_{m∈Zk} am e^{2πi⟨m,x⟩} ,
for all p > 0. By the uniqueness of Fourier expansions, we can compare coefficients and obtain, for every m ∈ Zk ,
am = a_{mA} = · · · = a_{mA^p} = · · · .
If am ≠ 0 then there can only be finitely many distinct indices in the above list, for otherwise it would contradict the fact that am → 0 as |m| → ∞. In other words, there exists p > 0 such that
m = mA^p .
But then e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} and so, by hypothesis, m = 0. Thus am = 0 for all m ≠ 0 and so f is equal to the constant a0 µ-a.e. Therefore T is ergodic.
Proof of Proposition 7.12. We shall prove the contrapositive statements in each case.
First suppose that T is not ergodic. Then, by the Lemma, there exist m ∈ Zk \ {0} and p > 0 such that
e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} ,
or, equivalently, e^{2πi⟨mA^p ,x⟩} = e^{2πi⟨m,x⟩} , which is to say that mA^p = m. Thus A^p has 1 as an eigenvalue and hence A has a p th root of unity as an eigenvalue.
Now suppose that A has a p th root of unity as an eigenvalue. Then A^p has 1 as an eigenvalue and so
m(A^p − I) = 0
for some m ∈ Rk \ {0}. Since A is an integer matrix, we can in fact take m ∈ Zk \ {0}. We have
e^{2πi⟨m,A^p x⟩} = e^{2πi⟨mA^p ,x⟩} = e^{2πi⟨m,x⟩} ,
so, by the Lemma, T is not ergodic.
Exercise 7.15. Define T : R2 /Z2 → R2 /Z2 by
T (x, y ) = (x + α, x + y ).
Suppose that α ∉ Q. By using Fourier series, show that T is ergodic with respect to Lebesgue
measure.
Exercise 7.16. (This exercise is outside the scope of the course and so is not examinable.)
It is easy to construct lots of examples of hyperbolic toral automorphisms (i.e. no eigenvalues of modulus 1—the cat map is such an example), which must necessarily be ergodic with
respect to Lebesgue measure. It is harder to show that there are ergodic toral automorphisms
with some eigenvalues of modulus 1.
(i) Show that to have an ergodic toral automorphism of Rk /Zk with an eigenvalue of modulus
1, we must have k ≥ 4.
Consider the matrix
     (  0   1   0   0 )
A =  (  0   0   1   0 )
     (  0   0   0   1 )
     ( −1   8  −6   8 ) .
(ii) Show that A defines a linear toral automorphism TA of the 4-dimensional torus R4 /Z4 .
(iii) Show that A has four eigenvalues, two of which have modulus 1.
(iv*) Show that TA is ergodic with respect to Lebesgue measure. (Hint: you have to show
that the two eigenvalues of modulus 1 are not roots of unity, i.e. are not solutions to
λn − 1 = 0 for some n. The best way to do this is to use results from Galois theory on
the irreducibility of polynomials.)
7.6 Existence of ergodic measures
We now return to the general theory of studying the structure of continuous transformations of compact metric spaces. Recall that we have already seen that the space M(X, T ) of T -invariant probability measures is always non-empty. We now describe how ergodic measures (for T ) fit into the picture we have developed of M(X, T ). We shall then use this to show
(for T ) fit in to the picture we have developed of M(X, T ). We shall then use this to show
that, in this case, ergodic measures always exist.
Recall that M(X, T ) is convex: if µ1 , µ2 ∈ M(X, T ) then αµ1 + (1 − α)µ2 ∈ M(X, T ) for
every 0 ≤ α ≤ 1.
A point in a convex set is called an extremal point if it cannot be written as a non-trivial
convex combination of (other) elements of the set. More precisely, µ is an extremal point of
M(X, T ) if whenever
µ = αµ1 + (1 − α)µ2 ,
with µ1 , µ2 ∈ M(X, T ), 0 < α < 1 then we have µ1 = µ2 = µ.
Remarks 7.17.
(i) Let Y be the unit square
Y = {(x, y) | 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} ⊂ R^2.
Then the extremal points of Y are the corners (0, 0), (0, 1), (1, 0), (1, 1).
(ii) Let Y be the (closed) unit disc
Y = {(x, y) : x^2 + y^2 ≤ 1} ⊂ R^2.

Then the set of extremal points of Y is precisely the unit circle {(x, y) | x^2 + y^2 = 1}.
The next result will allow us to show that ergodic measures for continuous transformations
on compact metric spaces always exist.
Theorem 7.18. The following are equivalent:
(i) the T -invariant probability measure µ is ergodic;
(ii) µ is an extremal point of M(X, T ).
Proof. For the moment, we shall only prove (ii) ⇒ (i): if µ is extremal then it is ergodic. In
fact, we shall prove the contrapositive. Suppose that µ is not ergodic; we show that µ is not
extremal. As µ is not ergodic, there exists B ∈ B such that T −1 B = B and 0 < µ(B) < 1.
Define probability measures µ1 and µ2 on X by

µ1(A) = µ(A ∩ B)/µ(B),    µ2(A) = µ(A ∩ (X \ B))/µ(X \ B).
(The condition 0 < µ(B) < 1 ensures that the denominators are not equal to zero.) Clearly,
µ1 ≠ µ2, since µ1(B) = 1 while µ2(B) = 0.
Since T^{-1}B = B, we also have T^{-1}(X \ B) = X \ B. Thus we have

µ1(T^{-1}A) = µ(T^{-1}A ∩ B)/µ(B)
            = µ(T^{-1}A ∩ T^{-1}B)/µ(B)
            = µ(T^{-1}(A ∩ B))/µ(B)
            = µ(A ∩ B)/µ(B)
            = µ1(A)
and (by the same argument)
µ2(T^{-1}A) = µ(T^{-1}A ∩ (X \ B))/µ(X \ B) = µ2(A),
i.e., µ1 and µ2 are both in M(X, T ).
However, we may write µ as the non-trivial (since 0 < µ(B) < 1) convex combination
µ = µ(B)µ1 + (1 − µ(B))µ2 ,
so µ is not extremal.
We defer the proof of (i) ⇒ (ii) until later (as an appendix to section 9) as it requires the
Radon-Nikodym Theorem, which we have yet to state.
Theorem 7.19. Let T : X → X be a continuous mapping of a compact metric space. Then
there exists at least one ergodic measure in M(X, T ).
Proof. By Theorem 7.18, it is equivalent to prove that M(X, T ) has an extremal point.
Choose a countable dense subset {f_i}_{i=0}^∞ of C(X, R). Consider the first function f0.
Since the map

M(X, T) → R : µ ↦ ∫ f0 dµ

is (weak*) continuous and M(X, T) is compact, there exists (at least one) ν ∈ M(X, T) such that

∫ f0 dν = sup_{µ ∈ M(X,T)} ∫ f0 dµ.
If we define

M0 = { ν ∈ M(X, T) | ∫ f0 dν = sup_{µ ∈ M(X,T)} ∫ f0 dµ }

then the above shows that M0 is non-empty. Also, M0 is closed and hence compact.
We now consider the next function f1 and define

M1 = { ν ∈ M0 | ∫ f1 dν = sup_{µ ∈ M0} ∫ f1 dµ }.
By the same reasoning as above, M1 is a non-empty closed subset of M0 .
Continuing inductively, we define

Mj = { ν ∈ Mj−1 | ∫ fj dν = sup_{µ ∈ Mj−1} ∫ fj dµ }
and hence obtain a nested sequence of sets
M(X, T ) ⊃ M0 ⊃ M1 ⊃ · · · ⊃ Mj ⊃ · · ·
with each Mj non-empty and closed.
Now consider the intersection

M∞ = ⋂_{j=0}^∞ Mj.
Recall that the countable intersection of a decreasing sequence of non-empty compact sets
is non-empty. Hence M∞ is non-empty and we can pick µ∞ ∈ M∞ . We shall show that µ∞
is extremal (and hence ergodic).
Suppose that we can write µ∞ = αµ1 + (1 − α)µ2 with µ1, µ2 ∈ M(X, T), 0 < α < 1. We have to show that µ1 = µ2. Since {f_j}_{j=0}^∞ is dense in C(X, R), it suffices to show that

∫ fj dµ1 = ∫ fj dµ2   for all j ≥ 0.
Consider f0. By assumption

∫ f0 dµ∞ = α ∫ f0 dµ1 + (1 − α) ∫ f0 dµ2.

In particular,

∫ f0 dµ∞ ≤ max{ ∫ f0 dµ1, ∫ f0 dµ2 }.
However µ∞ ∈ M0 and so

∫ f0 dµ∞ = sup_{µ ∈ M(X,T)} ∫ f0 dµ ≥ max{ ∫ f0 dµ1, ∫ f0 dµ2 }.
Therefore

∫ f0 dµ1 = ∫ f0 dµ2 = ∫ f0 dµ∞.
Thus, the first identity we require is proved and µ1 , µ2 ∈ M0 . This last fact allows us to
employ the same argument on f1 (with M(X, T ) replaced by M0 ) and conclude that
∫ f1 dµ1 = ∫ f1 dµ2 = ∫ f1 dµ∞
and µ1 , µ2 ∈ M1 .
Continuing inductively, we show that for an arbitrary j ≥ 0,
∫ fj dµ1 = ∫ fj dµ2
and µ1 , µ2 ∈ Mj . This completes the proof.
Exercise 7.20. Define the North-South Map as follows. Let X be the circle of radius 1 centred
at (0, 1) in R2 . Call (0, 2) the North Pole (N) and (0, 0) the South Pole (S) of X. Define a
map φ : X \ {N} → R × {0} by drawing a straight line through N and x and denoting by φ(x)
the unique point on the x-axis that this line crosses (this is just stereographic projection of
the circle). Define T : X → X by

T(x) = φ^{-1}( (1/2) φ(x) )   if x ∈ X \ {N},
T(x) = N                      if x = N.

Hence T(N) = N, T(S) = S and if x ≠ N, S then T^n(x) → S as n → ∞.
(i) Show that δN and δS (the Dirac delta measures at N and S, respectively) are T-invariant.
(ii) Show that any invariant measure assigns zero measure to X \ {N, S}.
(Hint: take x ≠ N, S and consider the interval I = [x, T(x)). Then ⋃_{n∈Z} T^{-n}I is a disjoint union.)
(iii) Hence find all invariant measures for the North-South map.
(iv) Find all ergodic measures for the North-South map.
7.7 Bernoulli Shifts
Let σ : Σk → Σk be the full shift on k symbols. (The following discussion works equally well for the one-sided version Σk^+.) Let p = (p1, . . . , pk) be a probability vector and let µp be the Bernoulli measure determined by p, i.e., on cylinder sets

µp[z0, . . . , zn−1] = p_{z0} · · · p_{z_{n−1}}.
We shall show that σ is ergodic with respect to µp .
We shall use the following fact: given B ∈ B and ε > 0 we can find a finite disjoint collection of cylinder sets C1, . . . , CN such that

µp( B △ ⋃_{j=1}^N Cj ) < ε.
Suppose that B ∈ B satisfies σ^{-1}B = B. Choosing C1, . . . , CN as above and writing E = ⋃_{j=1}^N Cj, we have

|µp(B) − µp(E)| < ε.
The key point is the following: if n is sufficiently large then F = σ^{-n}E and E depend on different co-ordinates. Hence, since µp is defined by a product,

µp(F ∩ E) = µp(F) µp(E) = µp(σ^{-n}E) µp(E) = µp(E)^2,

since µp is σ-invariant.
We also have the estimate

µp(B △ F) = µp(σ^{-n}B △ σ^{-n}E) = µp(σ^{-n}(B △ E)) = µp(B △ E) < ε.
Since B △ (E ∩ F) ⊂ (B △ E) ∪ (B △ F), we therefore obtain

µp(B △ (E ∩ F)) ≤ µp(B △ E) + µp(B △ F) < 2ε

and hence

|µp(B) − µp(E ∩ F)| < 2ε.
Hence we can estimate

|µp(B) − µp(B)^2| ≤ |µp(B) − µp(E ∩ F)| + |µp(E ∩ F) − µp(B)^2|
                 < 2ε + |µp(E)^2 − µp(B)^2|
                 = 2ε + (µp(E) + µp(B)) |µp(E) − µp(B)|
                 ≤ 2ε + 2ε = 4ε.

Since ε > 0 is arbitrary, we have µp(B) = µp(B)^2. This is only possible if µp(B) = 0 or µp(B) = 1. Therefore σ is ergodic with respect to µp.
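Ergodicity of µp can be illustrated numerically: by Birkhoff's Ergodic Theorem (proved in section 9), for µp-almost every sequence the frequency with which a given word occurs along the orbit equals the µp-measure of the corresponding cylinder. A minimal Python sketch, where the vector p = (0.3, 0.7) and the word 01 are arbitrary choices:

```python
import random

random.seed(0)
p = (0.3, 0.7)                     # probability vector defining the Bernoulli measure mu_p
n = 200_000

# One mu_p-typical point of the one-sided shift: an i.i.d. {0,1}-valued sequence.
x = [0 if random.random() < p[0] else 1 for _ in range(n)]

# Orbit frequency of the cylinder [0, 1]: positions j with x_j = 0 and x_{j+1} = 1.
hits = sum(1 for j in range(n - 1) if x[j] == 0 and x[j + 1] == 1)
freq = hits / (n - 1)
print(freq, p[0] * p[1])           # both close to 0.21
```

The observed frequency agrees with µp[0, 1] = p0·p1, as the ergodic theorem predicts for typical points.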
Remark 7.21. For general subshifts of finite type σ : ΣA → ΣA, σ is ergodic with respect to the Markov measure µP if and only if the stochastic matrix P is irreducible (i.e., for each (i, j) there exists n > 0 such that P^n(i, j) > 0).
Exercise 7.22. We have seen that there are lots of (indeed, uncountably many) ergodic
measures for the full one-sided two-shift. We can use this fact to show that there are
uncountably many ergodic measures for the doubling map.
Let Σ2^+ = {0, 1}^{Z^+} be the one-sided full shift on two symbols. Define π : Σ2^+ → R/Z by

π(x0, x1, . . .) = x0/2 + x1/2^2 + · · · + xn/2^{n+1} + · · ·
(i) Show that π is continuous.
(ii) Let T : R/Z → R/Z be the doubling map: T (x) = 2x mod 1. Show that π ◦ σ = T ◦ π.
(Thus T is a factor of σ.)
(iii) If µ is a σ-invariant probability measure on Σ2^+, show that π∗µ (where π∗µ(B) = µ(π^{-1}B) for a Borel subset B ⊂ R/Z) is a T-invariant probability measure on R/Z. (Lebesgue measure on R/Z corresponds to choosing µ to be the Bernoulli (1/2, 1/2)-measure on Σ2^+.)
(iv) Show that if µ is an ergodic measure for σ, then π∗ µ is an ergodic measure for T .
(v) Conclude that there are uncountably many ergodic measures for the doubling map.
7.8 Remarks on the continued fraction map
Recall that the continued fraction map T : [0, 1) → [0, 1) is defined by

T(x) = 0             if x = 0,
T(x) = 1/x mod 1     if 0 < x < 1.
This sends the point

x = 1/(x0 + 1/(x1 + 1/(x2 + 1/(x3 + · · ·))))

to the point

T(x) = 1/(x1 + 1/(x2 + 1/(x3 + 1/(x4 + · · ·)))).
That is, T acts by shifting the continued fraction expansion one place to the left (and
forgetting the 0th term). Thus we can think of T as a full one-sided subshift, albeit with an
infinite number of symbols.
Using this analogy, we can then define a cylinder to be a set of the form

I(x0, x1, . . . , xn) = {x ∈ (0, 1) | x has continued fraction expansion starting x0, x1, . . . , xn}.
Recall that T preserves Gauss’ measure, defined by

µ(B) = (1/log 2) ∫_B 1/(1 + x) dx.
We claim that µ is an ergodic measure. One proof of this uses similar ideas as the proof
that Bernoulli measures for subshifts of finite type are ergodic. However, a crucial property
of Bernoulli measures that was used is the following: given two cylinders, E and F , we have
µ(E ∩ σ^{-n}F) = µ(E)µ(F) provided n is sufficiently large. This equality holds because the
formula for the Bernoulli measure of a cylinder is ‘locally constant’, i.e. it depends only on a
finite number of co-ordinates. The formula for Gauss’ measure is not locally constant: the
function 1/(1 + x) depends on all (i.e. infinitely many) co-efficients in the continued fraction
expansion of x. However, with some effort, one can prove that there exist constants c, C > 0
such that
c µ(E)µ(F) ≤ µ(E ∩ σ^{-n}F) ≤ C µ(E)µ(F)
for ‘cylinders’ for the continued fraction map. It turns out that this is sufficient to prove
ergodicity. In summary:
Proposition 7.23. Let T denote the continued fraction map. Then T is ergodic with respect
to Gauss’ measure.
8 Recurrence and Unique Ergodicity

8.1 Poincaré’s Recurrence Theorem
We now go back to the general setting of a measure-preserving transformation of a probability
space (X, B, µ). The following is the most basic result about the distribution of orbits.
Theorem 8.1 (Poincaré’s Recurrence Theorem). Let T : X → X be a measure-preserving transformation of (X, B, µ) and let A ∈ B have µ(A) > 0. Then for µ-a.e. x ∈ A, the orbit {T^n x}_{n=0}^∞ returns to A infinitely often.
Proof. Let

E = {x ∈ A | T^n x ∈ A for infinitely many n};

then we have to show that µ(A \ E) = 0.
If we write

F = {x ∈ A | T^n x ∉ A for all n ≥ 1}

then we have the identity

A \ E = ⋃_{k=0}^∞ (T^{-k}F ∩ A).
Thus we have the estimate

µ(A \ E) = µ( ⋃_{k=0}^∞ (T^{-k}F ∩ A) ) ≤ µ( ⋃_{k=0}^∞ T^{-k}F ) ≤ Σ_{k=0}^∞ µ(T^{-k}F).
Since µ(T^{-k}F) = µ(F) for all k ≥ 0 (because the measure is preserved), it suffices to show that µ(F) = 0.
First suppose that n > m and that T^{-m}F ∩ T^{-n}F ≠ ∅. If y lies in this intersection then T^m y ∈ F and T^{n−m}(T^m y) = T^n y ∈ F ⊂ A, which contradicts the definition of F. Thus T^{-m}F and T^{-n}F are disjoint.
Since {T^{-k}F}_{k=0}^∞ is a disjoint family, we have

Σ_{k=0}^∞ µ(T^{-k}F) = µ( ⋃_{k=0}^∞ T^{-k}F ) ≤ µ(X) = 1.

Since the terms in the summation have the constant value µ(F), we must have µ(F) = 0.
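The theorem is easy to observe numerically. The sketch below iterates the circle rotation T(x) = x + α mod 1 (which preserves Lebesgue measure) starting from a point of a small interval A and records the return times; the orbit re-enters A again and again. Here α = √2 − 1, the interval A and the starting point are arbitrary choices:

```python
import math

alpha = math.sqrt(2) - 1           # an arbitrary irrational rotation number
A = (0.0, 0.05)                    # an interval of Lebesgue measure 0.05
x = 0.01                           # a starting point inside A

returns = []                       # times n at which the orbit is back in A
for n in range(1, 100_000):
    x = (x + alpha) % 1.0
    if A[0] <= x < A[1]:
        returns.append(n)

print(len(returns))                # roughly 0.05 * 100000 = 5000 returns
print(returns[:5])                 # the orbit keeps coming back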
Exercise 8.2. Construct an example to show that Poincaré’s recurrence theorem does not hold
on infinite measure spaces. (Recall that a measure space (X, B, µ) is infinite if µ(X) = ∞.)
8.2 Unique Ergodicity
We finish this section by looking at the case where T : X → X has a unique invariant
probability measure.
Definition 8.3. Let (X, B) be a measurable space and let T : X → X be a measurable
transformation. If there is a unique T -invariant probability measure then we say that T is
uniquely ergodic.
Remark 8.4. You might wonder why we don’t just call such T ‘uniquely invariant’. Recall from Theorem 7.18 that the extremal points of M(X, T) are precisely the ergodic measures. If M(X, T) consists of just one measure then that measure is extremal, and so must be ergodic.
Unique ergodicity (for continuous maps) implies the following strong convergence result.
Theorem 8.5. Let X be a compact metric space and let T : X → X be a continuous
transformation. The following are equivalent:
(i) T is uniquely ergodic;
(ii) for each f ∈ C(X) there exists a constant c(f) such that

(1/n) Σ_{j=0}^{n−1} f(T^j x) → c(f),

uniformly for x ∈ X, as n → ∞.
Proof. (ii) ⇒ (i): Suppose that µ, ν are T -invariant probability measures; we shall show that
µ = ν.
Integrating the expression in (ii), we obtain

∫ f dµ = lim_{n→∞} (1/n) Σ_{j=0}^{n−1} ∫ f ∘ T^j dµ
       = lim_{n→∞} ∫ (1/n) Σ_{j=0}^{n−1} f ∘ T^j dµ
       = ∫ c(f) dµ = c(f),
and, by the same argument,

∫ f dν = c(f).

Therefore

∫ f dµ = ∫ f dν   for all f ∈ C(X)
and so µ = ν (by the Riesz Representation Theorem).
(i) ⇒ (ii): Let M(X, T) = {µ}. If (ii) is true then, by the Dominated Convergence Theorem, we must necessarily have c(f) = ∫ f dµ. Suppose that (ii) is false. Then we can find f ∈ C(X) and sequences nk ∈ N and xk ∈ X such that

lim_{k→∞} (1/nk) Σ_{j=0}^{nk−1} f(T^j xk) ≠ ∫ f dµ.
For each k ≥ 1, define a measure νk ∈ M(X) by

νk = (1/nk) Σ_{j=0}^{nk−1} T^j_* δ_{xk},

so that

∫ f dνk = (1/nk) Σ_{j=0}^{nk−1} f(T^j xk).
By the proof of Theorem 13.1, νk has a subsequence which converges weak* to a measure ν ∈ M(X, T). In particular, we have

∫ f dν = lim_{k→∞} ∫ f dνk ≠ ∫ f dµ.
Therefore, ν ≠ µ, contradicting unique ergodicity.
8.3 Example: The Irrational Rotation
Let X = R/Z, T : X → X : x ↦ x + α mod 1, α irrational. Then T is uniquely ergodic (and µ = Lebesgue measure is the unique invariant probability measure).
Proof. Let m be an arbitrary T-invariant probability measure; we shall show that m = µ. Write ek(x) = e^{2πikx}. Then

∫ ek(x) dm = ∫ ek(Tx) dm = ∫ ek(x + α) dm = e^{2πikα} ∫ ek(x) dm.
Since α is irrational, if k ≠ 0 then e^{2πikα} ≠ 1 and so

∫ ek dm = 0.   (4)
Let f ∈ C(X) have Fourier series Σ_{k=−∞}^∞ ak ek, so that a0 = ∫ f dµ. For n ≥ 1, we let σn denote the average of the first n partial sums. Then σn → f uniformly as n → ∞ (Fejér’s theorem). Hence

lim_{n→∞} ∫ σn dm = ∫ f dm.
However, using (4), we may calculate that

∫ σn dm = a0 = ∫ f dµ.

Thus we have that ∫ f dm = ∫ f dµ for every f ∈ C(X), and so m = µ.
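Unique ergodicity can be checked numerically: the Birkhoff averages of a continuous function converge to the same constant for every starting point. A sketch with f(x) = cos 2πx, whose integral against Lebesgue measure is 0; the rotation number α = √5 − 2 and the starting points are arbitrary choices:

```python
import math

alpha = math.sqrt(5) - 2           # an arbitrary irrational rotation number

def birkhoff_average(x, n):
    """Average of f(T^j x) for j = 0, ..., n-1, with f(x) = cos(2 pi x)."""
    total = 0.0
    for _ in range(n):
        total += math.cos(2 * math.pi * x)
        x = (x + alpha) % 1.0
    return total / n

# The limit c(f) = integral of f dx = 0 is the same for every starting point.
for x0 in (0.0, 0.25, 0.7):
    print(birkhoff_average(x0, 100_000))   # all close to 0
```

By Theorem 8.5 the convergence is even uniform in the starting point, which is why the three printed values agree so closely.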
9 Birkhoff’s Ergodic Theorem

9.1 Introduction
An ergodic theorem is a result that describes the limiting behaviour of the sequence
(1/n) Σ_{j=0}^{n−1} f ∘ T^j   (5)
as n → ∞. The precise formulation of an ergodic theorem depends on the class of function f (for example, one could assume that f is integrable, L^2, or continuous), and the notion of convergence that we use (for example, we could study pointwise convergence, L^2 convergence, or uniform convergence). The result that we are interested in here—Birkhoff’s Ergodic Theorem—deals with pointwise convergence of (5) for an integrable function f.
9.2 Conditional expectation
We will need the concepts of Radon-Nikodym derivatives and conditional expectation.
Definition 9.1. Let µ be a measure on (X, B). We say that a measure ν is absolutely continuous with respect to µ, and write ν ≪ µ, if ν(B) = 0 whenever µ(B) = 0, B ∈ B.
Remark 9.2. Thus ν is absolutely continuous with respect to µ if sets of µ-measure zero also have ν-measure zero (but there may be more sets of ν-measure zero).
For example, let f ∈ L^1(X, B, µ) be non-negative and define a measure ν by

ν(B) = ∫_B f dµ.

Then ν ≪ µ.
The following theorem says that, essentially, all absolutely continuous measures occur in
this way.
Theorem 9.3 (Radon-Nikodym). Let (X, B, µ) be a probability space. Let ν be a measure defined on B and suppose that ν ≪ µ. Then there is a non-negative measurable function f such that

ν(B) = ∫_B f dµ   for all B ∈ B.
Moreover, f is unique in the sense that if g is a measurable function with the same property
then f = g µ-a.e.
Exercise 9.4. If ν ≪ µ then it is customary to write dν/dµ for the function given by the Radon-Nikodym theorem, that is,

ν(B) = ∫_B (dν/dµ) dµ.
Prove the following relations:
(i) If ν ≪ µ and f is a µ-integrable function then

∫ f dν = ∫ f (dν/dµ) dµ.

(ii) If ν1, ν2 ≪ µ then

d(ν1 + ν2)/dµ = dν1/dµ + dν2/dµ.

(iii) If λ ≪ ν ≪ µ then

dλ/dµ = (dλ/dν)(dν/dµ).
Let A ⊂ B be a sub-σ-algebra. Note that µ defines a measure on A by restriction. Let f ∈ L^1(X, B, µ), with f non-negative. Then we can define a measure ν on A by setting

ν(A) = ∫_A f dµ.

Note that ν ≪ µ|A. Hence by the Radon-Nikodym theorem, there is a unique A-measurable function E(f | A) such that

ν(A) = ∫_A E(f | A) dµ.
We call E(f | A) the conditional expectation of f with respect to the σ-algebra A.
So far, we have only defined E(f | A) for non-negative f . To define E(f | A) for an
arbitrary f , we split f into positive and negative parts f = f+ − f− where f+ , f− ≥ 0 and define
E(f | A) = E(f+ | A) − E(f− | A).
Thus we can view conditional expectation as an operator

E(· | A) : L^1(X, B, µ) → L^1(X, A, µ).
Note that E(f | A) is uniquely determined by the two requirements that
(i) E(f | A) is A-measurable, and
(ii) ∫_A f dµ = ∫_A E(f | A) dµ for all A ∈ A.
Intuitively, one can think of E(f | A) as the best approximation to f in the smaller space of
all A-measurable functions.
Exercise 9.5.
(i) Prove that f 7→ E(f | A) is linear.
(ii) Suppose that g is A-measurable and |g| < ∞ µ-a.e. Show that E(f g | A) = gE(f | A).
(iii) Suppose that T is a measure-preserving transformation. Show that E(f | A) ∘ T = E(f ∘ T | T^{-1}A).
(iv) Show that E(f | B) = f .
(v) Let N denote the trivial σ-algebra consisting of all sets of measure 0 and 1. Show that E(f | N) = ∫ f dµ.
To state Birkhoff’s Ergodic Theorem precisely, we will need the sub-σ-algebra I of T-invariant subsets, namely:

I = {B ∈ B | T^{-1}B = B a.e.}.
Exercise 9.6. Prove that I is a σ-algebra.
9.3 Birkhoff’s Pointwise Ergodic Theorem
Birkhoff’s Ergodic Theorem deals with the behaviour of (1/n) Σ_{j=0}^{n−1} f(T^j x) for µ-a.e. x ∈ X and for f ∈ L^1(X, B, µ).
Theorem 9.7 (Birkhoff’s Ergodic Theorem). Let (X, B, µ) be a probability space and let T : X → X be a measure-preserving transformation. Let I denote the σ-algebra of T-invariant sets. Then for every f ∈ L^1(X, B, µ), we have

(1/n) Σ_{j=0}^{n−1} f(T^j x) → E(f | I)

for µ-a.e. x ∈ X.
Corollary 9.8. Let (X, B, µ) be a probability space and let T : X → X be an ergodic measure-preserving transformation. Let f ∈ L^1(X, B, µ). Then

(1/n) Σ_{j=0}^{n−1} f(T^j x) → ∫ f dµ,   as n → ∞,

for µ-a.e. x ∈ X.
Proof. If T is ergodic then I is the trivial σ-algebra N consisting of sets of measure 0 and 1. If f ∈ L^1(X, B, µ) then E(f | N) = ∫ f dµ. The result follows from the general version of Birkhoff’s ergodic theorem.
9.4 The proof of Birkhoff’s Ergodic Theorem
The proof is something of a tour de force of hard analysis. It is based on the following
inequality.
Theorem 9.9 (Maximal Inequality). Let (X, B, µ) be a probability space, let T : X → X be a measure-preserving transformation and let f ∈ L^1(X, B, µ). Define f0 = 0 and, for n ≥ 1,

fn = f + f ∘ T + · · · + f ∘ T^{n−1}.

For n ≥ 1, set

Fn = max_{0≤j≤n} fj

(so that Fn ≥ 0). Then

∫_{{x | Fn(x) > 0}} f dµ ≥ 0.
Proof. Clearly Fn ∈ L^1(X, B, µ). For 0 ≤ j ≤ n, we have Fn ≥ fj, so Fn ∘ T ≥ fj ∘ T. Hence

Fn ∘ T + f ≥ fj ∘ T + f = f_{j+1}

and therefore

Fn ∘ T(x) + f(x) ≥ max_{1≤j≤n} fj(x).

If Fn(x) > 0 then

max_{1≤j≤n} fj(x) = max_{0≤j≤n} fj(x) = Fn(x),

so we obtain that

f ≥ Fn − Fn ∘ T

on the set A = {x | Fn(x) > 0}.
Hence

∫_A f dµ ≥ ∫_A Fn dµ − ∫_A Fn ∘ T dµ
         = ∫_X Fn dµ − ∫_A Fn ∘ T dµ
         ≥ ∫_X Fn dµ − ∫_X Fn ∘ T dµ
         = 0,

where we have used
(i) Fn = 0 on X \ A,
(ii) Fn ∘ T ≥ 0,
(iii) µ is T-invariant.
Corollary 9.10. If g ∈ L^1(X, B, µ) and if

Bα = { x ∈ X | sup_{n≥1} (1/n) Σ_{j=0}^{n−1} g(T^j x) > α }

then for all A ∈ B with T^{-1}A = A we have that

∫_{Bα ∩ A} g dµ ≥ α µ(Bα ∩ A).
Proof. Suppose first that A = X. Let f = g − α. Then

Bα = ⋃_{n=1}^∞ { x | Σ_{j=0}^{n−1} g(T^j x) > nα }
   = ⋃_{n=1}^∞ { x | fn(x) > 0 }
   = ⋃_{n=1}^∞ { x | Fn(x) > 0 }

(since fn(x) > 0 ⇒ Fn(x) > 0 and Fn(x) > 0 ⇒ fj(x) > 0 for some 1 ≤ j ≤ n). Write Cn = {x | Fn(x) > 0} and observe that Cn ⊂ Cn+1. Thus χ_{Cn} converges to χ_{Bα} and so f χ_{Cn} converges to f χ_{Bα}, as n → ∞. Furthermore, |f χ_{Cn}| ≤ |f|. Hence, by the Dominated Convergence Theorem,

∫_{Cn} f dµ = ∫_X f χ_{Cn} dµ → ∫_X f χ_{Bα} dµ = ∫_{Bα} f dµ,

as n → ∞.
Applying the maximal inequality, we have, for all n ≥ 1,

∫_{Cn} f dµ ≥ 0.

Therefore

∫_{Bα} f dµ ≥ 0,

i.e.,

∫_{Bα} g dµ ≥ α µ(Bα).
For the general case, we work with the restriction of T to A, T : A → A, and apply the maximal inequality on this subset to get

∫_{Bα ∩ A} g dµ ≥ α µ(Bα ∩ A),
as required.
Proof of Birkhoff’s Ergodic Theorem. Let

f^*(x) = lim sup_{n→∞} (1/n) Σ_{j=0}^{n−1} f(T^j x)

and

f_*(x) = lim inf_{n→∞} (1/n) Σ_{j=0}^{n−1} f(T^j x).
Writing

an(x) = (1/n) Σ_{j=0}^{n−1} f(T^j x),

observe that

((n+1)/n) a_{n+1}(x) = an(Tx) + (1/n) f(x).

Taking the lim sup and lim inf as n → ∞ gives us that f^* ∘ T = f^* and f_* ∘ T = f_*.
We have to show

(i) f^* = f_* µ-a.e.,
(ii) f^* ∈ L^1(X, B, µ),
(iii) ∫ f^* dµ = ∫ f dµ.
We prove (i). For α, β ∈ R, define

E_{α,β} = {x ∈ X | f_*(x) < β and f^*(x) > α}.

Note that

{x ∈ X | f_*(x) < f^*(x)} = ⋃_{β<α, α,β∈Q} E_{α,β}

(a countable union). Thus, to show that f^* = f_* µ-a.e., it suffices to show that µ(E_{α,β}) = 0 whenever β < α. Since f_* ∘ T = f_* and f^* ∘ T = f^*, we see that T^{-1}E_{α,β} = E_{α,β}. If we write

Bα = { x ∈ X | sup_{n≥1} (1/n) Σ_{j=0}^{n−1} f(T^j x) > α }

then E_{α,β} ∩ Bα = E_{α,β}.
Applying Corollary 9.10 we have that

∫_{E_{α,β}} f dµ = ∫_{E_{α,β} ∩ Bα} f dµ ≥ α µ(E_{α,β} ∩ Bα) = α µ(E_{α,β}).

Replacing f, α and β by −f, −β and −α, and using the fact that (−f)^* = −f_* and (−f)_* = −f^*, we also get

∫_{E_{α,β}} f dµ ≤ β µ(E_{α,β}).

Therefore

α µ(E_{α,β}) ≤ β µ(E_{α,β})
and since β < α this shows that µ(E_{α,β}) = 0. Thus f^* = f_* µ-a.e. and

lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f(T^j x) = f^*(x)   µ-a.e.
We prove (ii). Let

gn = | (1/n) Σ_{j=0}^{n−1} f ∘ T^j |.

Then gn ≥ 0 and

∫ gn dµ ≤ ∫ |f| dµ,

so we can apply Fatou’s Lemma to conclude that lim_{n→∞} gn = |f^*| is integrable, i.e., that f^* ∈ L^1(X, B, µ).
We prove (iii). For n ∈ N and k ∈ Z, define

D_k^n = { x ∈ X | k/n ≤ f^*(x) < (k+1)/n }.

For every ε > 0, we have that

D_k^n ∩ B_{k/n − ε} = D_k^n.

Since T^{-1}D_k^n = D_k^n, we can apply Corollary 9.10 again to obtain

∫_{D_k^n} f dµ ≥ (k/n − ε) µ(D_k^n).
Since ε > 0 is arbitrary, we have

∫_{D_k^n} f dµ ≥ (k/n) µ(D_k^n).
Thus

∫_{D_k^n} f^* dµ ≤ ((k+1)/n) µ(D_k^n) ≤ (1/n) µ(D_k^n) + ∫_{D_k^n} f dµ

(where the first inequality follows from the definition of D_k^n). Since

X = ⋃_{k∈Z} D_k^n
(a disjoint union), summing over k ∈ Z gives
∫_X f^* dµ ≤ (1/n) µ(X) + ∫_X f dµ = 1/n + ∫_X f dµ.
Since this holds for all n ≥ 1, we obtain

∫_X f^* dµ ≤ ∫_X f dµ.
Applying the same argument to −f gives

∫ (−f)^* dµ ≤ ∫ (−f) dµ,

so that

∫ f_* dµ ≥ ∫ f dµ.

Therefore

∫ f^* dµ = ∫ f_* dµ = ∫ f dµ,

as required.
Finally, we prove that f^* = E(f | I). First note that as f^* is T-invariant, it is measurable with respect to I. Moreover, if I is any T-invariant set then

∫_I f dµ = ∫_I f^* dµ.

Hence f^* = E(f | I).
9.5 Consequences of the Ergodic Theorem
Here we give some simple corollaries of Birkhoff’s Ergodic Theorem. The first result says that,
for a typical orbit of an ergodic dynamical system, ‘time averages’ equal ‘space averages’.
Corollary 9.11. If T is ergodic and if B ∈ B then for µ-a.e. x ∈ X, the frequency with which the orbit of x lies in B is given by µ(B), i.e.,

lim_{n→∞} (1/n) card{ j ∈ {0, 1, . . . , n − 1} | T^j x ∈ B } = µ(B)   µ-a.e.
Proof. Apply the Birkhoff Ergodic Theorem with f = χB .
It is possible to characterise ergodicity in terms of the behaviour of sets, rather than
points, under iteration. The next result deals with this.
Theorem 9.12. Let (X, B, µ) be a probability space and let T : X → X be a measure-preserving transformation. The following are equivalent:

(i) T is ergodic;

(ii) for all A, B ∈ B,

(1/n) Σ_{j=0}^{n−1} µ(T^{-j}A ∩ B) → µ(A)µ(B),
as n → ∞.
Proof. (i) ⇒ (ii): Suppose that T is ergodic. Since χA ∈ L^1(X, B, µ), Birkhoff’s Ergodic Theorem tells us that

(1/n) Σ_{j=0}^{n−1} χA ∘ T^j → µ(A), as n → ∞,

µ-a.e. Multiplying both sides by χB gives

(1/n) Σ_{j=0}^{n−1} (χA ∘ T^j) χB → µ(A) χB, as n → ∞,

µ-a.e. Since the left-hand side is bounded (by 1), we can apply the Dominated Convergence Theorem to see that

(1/n) Σ_{j=0}^{n−1} µ(T^{-j}A ∩ B) = (1/n) Σ_{j=0}^{n−1} ∫ (χA ∘ T^j) χB dµ
                                   = ∫ (1/n) Σ_{j=0}^{n−1} (χA ∘ T^j) χB dµ → µ(A)µ(B),

as n → ∞.
(ii) ⇒ (i): Now suppose that the convergence holds. Suppose that T^{-1}A = A and take B = A. Then µ(T^{-j}A ∩ B) = µ(A) so

(1/n) Σ_{j=0}^{n−1} µ(A) → µ(A)^2,

as n → ∞. This gives µ(A) = µ(A)^2. Therefore µ(A) = 0 or 1 and so T is ergodic.
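Characterisation (ii) can be probed numerically for the irrational rotation, which is ergodic with respect to Lebesgue measure. The sketch below estimates each µ(T^{-j}A ∩ B) on a grid and checks that the Cesàro averages approach µ(A)µ(B); the intervals A, B and the rotation number α = √2 − 1 are arbitrary choices:

```python
import math

alpha = math.sqrt(2) - 1           # an arbitrary irrational rotation number
A = (0.0, 0.3)                     # mu(A) = 0.3
B = (0.2, 0.6)                     # mu(B) = 0.4

def inter_measure(j, grid=4_000):
    """Estimate the Lebesgue measure of T^{-j}A ∩ B on a grid:
    x lies in T^{-j}A iff x + j*alpha (mod 1) lies in A."""
    hits = 0
    for i in range(grid):
        x = (i + 0.5) / grid
        if B[0] <= x < B[1] and A[0] <= (x + j * alpha) % 1.0 < A[1]:
            hits += 1
    return hits / grid

n = 1_000
avg = sum(inter_measure(j) for j in range(n)) / n
print(avg, 0.3 * 0.4)              # Cesàro average close to mu(A)mu(B) = 0.12
```

Note that the individual terms µ(T^{-j}A ∩ B) do not converge for the rotation; only their averages do, which is exactly what (ii) asserts.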
9.6 Normal numbers
A number x ∈ [0, 1) is called normal (in base 2) if it has a unique binary expansion, the
digit 0 occurs in its binary expansion with frequency 1/2, and the digit 1 occurs in its binary
expansion with frequency 1/2. We will show that Lebesgue a.e. x ∈ [0, 1) is normal.
To see this, observe that Lebesgue almost every x ∈ [0, 1) has a unique binary expansion x = ·x1 x2 . . ., xi ∈ {0, 1}. Define Tx = 2x mod 1. Then xn = 0 if and only if T^{n−1}x ∈ [0, 1/2). Thus

(1/n) card{1 ≤ i ≤ n | xi = 0} = (1/n) Σ_{i=0}^{n−1} χ_{[0,1/2)}(T^i x).

Since T is ergodic (with respect to Lebesgue measure), for Lebesgue almost every point x the above expression converges to ∫ χ_{[0,1/2)}(x) dx = 1/2. Similarly the frequency with which the digit 1 occurs is equal to 1/2. Hence Lebesgue almost every point in [0, 1) is normal.
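Whether any explicitly given constant is normal is notoriously open, but the statement is easy to probe numerically. The sketch below extracts the binary digits of √2 − 1 with exact integer arithmetic (so no floating-point error accumulates) and counts the ones; empirically the frequency is close to 1/2, although normality of √2 is unproven:

```python
import math

n = 100_000                        # number of binary digits to examine

# floor((sqrt(2) - 1) * 2^n), computed exactly:
# math.isqrt(2 * 4^n) = floor(2^n * sqrt(2)), and 2^n is an integer.
frac = math.isqrt(2 * 4**n) - 2**n

bits = bin(frac)[2:].zfill(n)      # the first n binary digits of sqrt(2) - 1
freq_ones = bits.count("1") / n
print(freq_ones)                   # empirically close to 0.5
```

The key point of the trick is that the doubling map is numerically unstable in floating point (each step discards one bit), so any honest experiment must work with the digits directly.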
Exercise 9.13. (i) Let r ≥ 2. What would it mean to say that a number x ∈ [0, 1) is
normal in base r ?
(ii) Prove that for each r , Lebesgue a.e. x ∈ [0, 1) is normal in base r .
(iii) Conclude that Lebesgue a.e. x ∈ [0, 1) is simultaneously normal in every base r =
2, 3, 4, . . ..
Exercise 9.14. Prove that the arithmetic mean of the digits appearing in the base 10 expansion of Lebesgue-a.e. x ∈ [0, 1) is equal to 4.5, i.e. prove that if x = Σ_{j=0}^∞ xj/10^{j+1}, xj ∈ {0, 1, . . . , 9}, then

lim_{n→∞} (1/n)(x0 + x1 + · · · + x_{n−1}) = 4.5   a.e.
9.7 Continued fractions
We will show that for Lebesgue a.e. x ∈ (0, 1) the frequency with which the natural number k occurs in the continued fraction expansion of x is given by

(1/log 2) log( (k+1)^2 / (k(k+2)) ).
Let λ denote Lebesgue measure and let µ denote Gauss’ measure. Then λ-a.e. and µ-a.e. x ∈ (0, 1) is irrational and has an infinite continued fraction expansion

x = 1/(x1 + 1/(x2 + 1/(x3 + · · ·))).
Let T denote the continued fraction map. Then xn = [1/T^{n−1}x].
Fix k ∈ N. Then xn = k precisely when [1/T^{n−1}x] = k, i.e.

k ≤ 1/T^{n−1}x < k + 1,

which is equivalent to requiring

1/(k+1) < T^{n−1}x ≤ 1/k.
Hence

(1/n) card{1 ≤ i ≤ n | xi = k} = (1/n) Σ_{i=0}^{n−1} χ_{(1/(k+1), 1/k]}(T^i x)
  → ∫ χ_{(1/(k+1), 1/k]} dµ   for µ-a.e. x
  = (1/log 2) [ log(1 + 1/k) − log(1 + 1/(k+1)) ]
  = (1/log 2) log( (k+1)^2 / (k(k+2)) ).

As µ and λ are equivalent, this holds for Lebesgue almost every point.
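This prediction is easy to test numerically. The partial quotients of a rational p/q are exactly the quotients produced by Euclid's algorithm, which runs in exact integer arithmetic; for a random p/q with a huge denominator, the first several hundred partial quotients behave like those of a Lebesgue-typical real. A sketch, where the denominator size 10^300 and the sample count are arbitrary choices:

```python
import math
import random

random.seed(1)

def partial_quotients(p, q):
    """Continued fraction digits of p/q (0 < p < q) via Euclid's algorithm."""
    digits = []
    while q:
        digits.append(p // q)
        p, q = q, p % q
    return digits[1:]              # drop the integer part a_0 = 0

counts = {}
total = 0
for _ in range(500):
    q = random.randrange(10**300, 10**301)   # a "generic" huge denominator
    p = random.randrange(1, q)
    for a in partial_quotients(p, q):
        counts[a] = counts.get(a, 0) + 1
        total += 1

for k in (1, 2, 3):
    predicted = math.log((k + 1)**2 / (k * (k + 2)), 2)
    print(k, counts[k] / total, predicted)   # observed vs (1/log 2) log((k+1)^2/(k(k+2)))
```

For k = 1 the predicted frequency is log2(4/3) ≈ 0.415, and the observed frequencies match the Gauss-measure prediction to a couple of decimal places.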
Exercise 9.15. (i) Deduce from Birkhoff’s Ergodic Theorem that if T is an ergodic measure-preserving transformation of a probability space (X, B, µ) and f ≥ 0 is measurable but ∫ f dµ = ∞ then

(1/n) Σ_{j=0}^{n−1} f(T^j x) → ∞   µ-a.e.

(Hint: define fM = min{f, M} and note that fM ∈ L^1(X, B, µ). Apply Birkhoff’s Ergodic Theorem to each fM.)
(ii) For x ∈ (0, 1) \ Q write its infinite continued fraction expansion as

x = 1/(x1 + 1/(x2 + 1/(x3 + · · ·))).
Show that for Lebesgue almost every x ∈ (0, 1) we have
(1/n)(x1 + x2 + · · · + xn) → ∞
as n → ∞. (That is, for a typical point x, the average value of the co-efficients in its
continued fraction expansion is infinite.)
9.8 Appendix
Completion of the proof of Theorem 7.18. Suppose that µ is ergodic and that µ = αµ1 + (1 − α)µ2, with µ1, µ2 ∈ M(X, T) and 0 < α < 1. We shall show that µ1 = µ (so that µ2 = µ, also), i.e., that µ is extremal.
If µ(A) = 0 then µ1(A) = 0, so µ1 ≪ µ. Therefore the Radon-Nikodym derivative dµ1/dµ ≥ 0 exists. One can easily deduce from the statement of the Radon-Nikodym Theorem that µ1 = µ if and only if dµ1/dµ = 1 µ-a.e. We shall show that this is indeed the case by showing that the sets where, respectively, dµ1/dµ < 1 and dµ1/dµ > 1 both have µ-measure zero.
Let

B = { x ∈ X : (dµ1/dµ)(x) < 1 }.
We have that (*)

∫_{B ∩ T^{-1}B} (dµ1/dµ) dµ + ∫_{B \ T^{-1}B} (dµ1/dµ) dµ = ∫_B (dµ1/dµ) dµ
  = µ1(B) = µ1(T^{-1}B)
  = ∫_{T^{-1}B} (dµ1/dµ) dµ
  = ∫_{B ∩ T^{-1}B} (dµ1/dµ) dµ + ∫_{T^{-1}B \ B} (dµ1/dµ) dµ.
79
9.8
Appendix
9
BIRKHOFF’S ERGODIC THEOREM
Comparing the first and last terms, we see that

∫_{B \ T^{-1}B} (dµ1/dµ) dµ = ∫_{T^{-1}B \ B} (dµ1/dµ) dµ.

In fact, these integrals are taken over sets of the same µ-measure:

µ(T^{-1}B \ B) = µ(T^{-1}B) − µ(T^{-1}B ∩ B) = µ(B) − µ(T^{-1}B ∩ B) = µ(B \ T^{-1}B).
However on the LHS of (*) the integrand satisfies dµ1/dµ < 1 and on the RHS of (*) the integrand satisfies dµ1/dµ ≥ 1. Thus we conclude that µ(B \ T^{-1}B) = µ(T^{-1}B \ B) = 0, which is to say that µ(T^{-1}B △ B) = 0. Therefore (since T is ergodic) by Corollary 8.5, µ(B) = 0 or µ(B) = 1.
We can rule out the possibility that µ(B) = 1 by observing that if µ(B) = 1 then

1 = µ1(X) = ∫_X (dµ1/dµ) dµ = ∫_B (dµ1/dµ) dµ < µ(B) = 1,

a contradiction. Therefore µ(B) = 0.
If we define

C = { x ∈ X : (dµ1/dµ)(x) > 1 }

then repeating essentially the same argument gives µ(C) = 0.
Hence

µ{ x ∈ X : (dµ1/dµ)(x) = 1 } = µ(X \ (B ∪ C)) = µ(X) − µ(B) − µ(C) = 1,

i.e., dµ1/dµ = 1 µ-a.e. Therefore µ1 = µ, as required.
10 Entropy

10.1 The Classification Problem
The classification problem is to decide when two measure-preserving transformations are ‘the same’. We say that two measure-preserving transformations are ‘the same’ if they are (measure theoretically) isomorphic.
Definition 10.1. We say that two measure-preserving transformations (X, B, µ, T ) and (Y, C, m, S)
are (measure theoretically) isomorphic if there exist M ∈ B and N ∈ C such that
(i) T M ⊂ M, SN ⊂ N,
(ii) µ(M) = 1, m(N) = 1,
and there exists a bijection φ : M → N such that
(i) φ, φ−1 are measurable and measure-preserving,
(ii) φ ◦ T = S ◦ φ.
Remark 10.2. We often say ‘metrically isomorphic’ in place of ‘measure-theoretically isomorphic’. Here, ‘metric’ is a contraction of ‘measure-theoretic’; it has no connection with metric spaces!
How can we decide whether two measure-preserving transformations are isomorphic?
A partial answer is given by looking for isomorphism invariants. The most important and
successful invariant is the (measure theoretic) entropy. This is a non-negative real number
that characterizes the complexity of the measure-preserving transformation. It was introduced
by Kolmogorov and Sinai in 1958/59 and immediately solved the outstanding open problem
in the subject: whether, for example,
S : R/Z → R/Z : x ↦ 2x mod 1,
T : R/Z → R/Z : x ↦ 3x mod 1
are isomorphic. (The invariant measure is Lebesgue in each case.) The answer is no since
the systems have different entropies (log 2 and log 3, respectively).
10.2 Conditional expectation
Recall that if A ⊂ B is a σ-algebra then we define the operator
E(· | A) : L1 (X, B, µ) → L1 (X, A, µ)
such that if f ∈ L1 (X, B, µ) then
(i) E(f | A) is A-measurable, and
(ii) for all A ∈ A, ∫_A E(f | A) dµ = ∫_A f dµ.
Definition 10.3. Let A ⊂ B be a σ-algebra. We define the conditional probability of B ∈ B
given A to be the function
µ(B | A) = E(χB | A).
Suppose that α is a countable partition of the set X. By this we mean that α = {A1, A2, . . .}, Ai ∈ B, and
(i) ⋃_i Ai = X,
(ii) Ai ∩ Aj = ∅ if i ≠ j
(up to sets of measure zero). (More precisely, µ(⋃_i Ai △ X) = 0 and µ(Ai ∩ Aj) = 0 if i ≠ j.)
The partition α generates a σ-algebra. By an abuse of notation, we denote this σ-algebra
by α. The conditional expectation of an integrable function f with respect to the partition
α is easily seen to be:
E(f | α)(x) = Σ_{A∈α} χ_A(x) (∫_A f dµ)/µ(A).
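For a concrete example, take X = [0, 1] with Lebesgue measure, α = {[0, 1/2), [1/2, 1]} and f(x) = x. The formula gives E(f | α) = 1/4 on the first cell and 3/4 on the second: on each cell, f is replaced by its average value over that cell. A sketch approximating the integrals by midpoint Riemann sums:

```python
def average_on(f, a, b, steps=100_000):
    """(1/(b - a)) * integral of f over [a, b], via a midpoint Riemann sum."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h / (b - a)

f = lambda x: x
cells = [(0.0, 0.5), (0.5, 1.0)]   # the partition alpha

# E(f | alpha) is constant on each cell A, equal to (integral_A f dmu) / mu(A).
for a, b in cells:
    print((a, b), average_on(f, a, b))   # 0.25 on the first cell, 0.75 on the second
```

This makes precise the slogan below: E(f | α) is the best guess of f if one only knows which cell of α the point lies in.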
Finally, we will need the following useful result.
Theorem 10.4 (Increasing martingale theorem). Let A1 ⊂ A2 ⊂ · · · ⊂ An ⊂ · · · be an
increasing sequence of σ-algebras such that An ↑ A (i.e. ∪n An generates A). Then
(i) E(f | An ) → E(f | A) a.e., and
(ii) E(f | An) → E(f | A) in L1, i.e. ∫ |E(f | An) − E(f | A)| dµ → 0
as n → ∞.
Proof. Omitted.
10.3 Information and Entropy
We begin with some motivation. Suppose we are trying to locate a point x in a probability
space (X, B, µ). To do this we use a (countable) partition α = {A1 , A2 , . . .}, Aj ∈ B. By
this we mean that
(i) ∪i Ai = X
(ii) Ai ∩ Aj = ∅ if i 6= j
(up to sets of measure zero). (More precisely µ(∪i Ai 4X) = 0 and µ(Ai 4Aj ) = 0 if i 6= j.)
If we find that x ∈ Ai then we have received some information. We want to define a
function
I(α) : X → R+
such that I(α)(x) is the amount of information we receive on learning that x ∈ Ai . We would
like this to depend only on the size of Ai , i.e., µ(Ai ), and to be large when µ(Ai ) is small and
small when µ(Ai ) is large. Thus we require I(α) to have the form
I(α)(x) = Σ_{A∈α} χA(x) φ(µ(A))    (6)
for some function φ : [0, 1] → R+ , as yet unspecified.
Let α = {A1 , A2 , A3 , . . .}, β = {B1 , B2 , B3 , . . .} be two partitions. Define the join α ∨ β
of α and β to be the partition
α ∨ β = {A ∩ B | A ∈ α, B ∈ β}.
We say that two partitions α, β are independent if µ(A ∩ B) = µ(A)µ(B), whenever
A ∈ α, B ∈ β.
It is then natural to require that if α and β are two independent partitions then
I(α ∨ β) = I(α) + I(β).    (7)
That is, if α and β are independent then the amount of information we obtain by knowing
which element of α ∨ β we are in is equal to the amount of information we obtain by knowing
which element of α we are in together with the amount of information we obtain by knowing
which element of β we are in.
Applying (7) to (6), we see that we have
φ(µ(A ∩ B)) = φ(µ(A)µ(B)) = φ(µ(A)) + φ(µ(B)).
If we also want φ to be continuous, then φ(t) must be (a multiple of) − log t. Throughout,
we will use the convention that 0 × log 0 = 0.
Definition 10.5. Given a partition α, we define the information I(α) : X → R+ by
I(α)(x) = − Σ_{A∈α} χA(x) log µ(A).
We define the entropy H(α) of the partition α to be the average value, i.e.,
H(α) = ∫ I(α) dµ
     = − Σ_{A∈α} ∫ χA log µ(A) dµ
     = − Σ_{A∈α} µ(A) log µ(A).
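As a quick numerical illustration (the particular partition below is an invented example, not one from the notes): for the partition {[0, 1/2), [1/2, 3/4), [3/4, 1)} of [0, 1] with Lebesgue measure, H(α) = −(1/2) log(1/2) − 2 · (1/4) log(1/4) = (3/2) log 2.

```python
import math

def partition_entropy(masses):
    """H(alpha) = -sum mu(A) log mu(A), with the convention 0 log 0 = 0."""
    assert abs(sum(masses) - 1.0) < 1e-12, "measures of the elements must sum to 1"
    return -sum(m * math.log(m) for m in masses if m > 0)

# alpha = {[0,1/2), [1/2,3/4), [3/4,1)}, Lebesgue measure
H = partition_entropy([0.5, 0.25, 0.25])
```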
10.4 Conditional Information and Entropy
Let A be a sub-σ-algebra of B. We define the conditional information of α given A to be
I(α | A)(x) = − Σ_{A∈α} χA(x) log µ(A | A)(x),
where µ(A|A) = E(χA |A).
Once again, the conditional entropy H(α|A) is the average
H(α | A) = ∫ − Σ_{A∈α} χA log µ(A | A) dµ
         = ∫ − Σ_{A∈α} µ(A | A) log µ(A | A) dµ
(by one of the properties of conditional expectation and the Monotone Convergence Theorem).
As a special case, we have that I(α|N ) = I(α) and H(α|N ) = H(α), where N is the
trivial σ-algebra consisting of sets of measure 0 and 1.
Exercise 10.6. Show that H(α|A) = 0 (or I(α|A) ≡ 0) if and only if α ⊂ A. (In particular,
H(α|B) = 0, I(α|B) ≡ 0.)
10.5 Basic Properties
Recall that if α is a countable partition of a measurable space (X, B) and if C ⊂ B is a sub-σ-algebra then we define the conditional information and conditional entropy of α relative to C to be
I(α | C) = − Σ_{A∈α} χA log µ(A | C)
and
H(α | C) = − ∫ Σ_{A∈α} χA log µ(A | C) dµ
         = − ∫ Σ_{A∈α} µ(A | C) log µ(A | C) dµ,
respectively.
Exercise 10.7. Show that if γ is a countable partition of X then
µ(A | γ) = Σ_{C∈γ} χC (∫_C χA dµ) / µ(C) = Σ_{C∈γ} χC µ(A ∩ C) / µ(C).
Lemma 10.8 (The Basic Identities). For three countable partitions α, β, γ we have that
I(α ∨ β | γ) = I(α | γ) + I(β | α ∨ γ),
H(α ∨ β | γ) = H(α | γ) + H(β | α ∨ γ).
Proof. We only need to prove the first identity; the second follows by integration.
If x ∈ A ∩ B, A ∈ α, B ∈ β, then
I(α ∨ β | γ)(x) = − log µ(A ∩ B | γ)(x)
and
µ(A ∩ B | γ) = Σ_{C∈γ} χC µ(A ∩ B ∩ C) / µ(C)
(exercise). Thus, if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, we have
I(α ∨ β | γ)(x) = − log (µ(A ∩ B ∩ C) / µ(C)).
On the other hand, if x ∈ A ∩ C, A ∈ α, C ∈ γ, then
I(α | γ)(x) = − log (µ(A ∩ C) / µ(C))
and if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, then
I(β | α ∨ γ)(x) = − log (µ(B ∩ A ∩ C) / µ(A ∩ C)).
Hence, if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, we have
I(α | γ)(x) + I(β | α ∨ γ)(x) = − log (µ(A ∩ B ∩ C) / µ(C)) = I(α ∨ β | γ)(x).
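The second Basic Identity with γ trivial reads H(α ∨ β) = H(α) + H(β | α), and can be checked numerically on a small example; the two partitions of [0, 1] below are invented for illustration.

```python
import math

def entropy(masses):
    """H of a partition: -sum mu(cell) log mu(cell), with 0 log 0 = 0."""
    return -sum(m * math.log(m) for m in masses if m > 0)

def cond_entropy(join_masses, containing_masses):
    """H(beta | alpha) = -sum over cells of alpha v beta of
       mu(A ∩ B) log( mu(A ∩ B) / mu(A) )."""
    return -sum(mAB * math.log(mAB / mA)
                for mAB, mA in zip(join_masses, containing_masses) if mAB > 0)

# alpha = {[0,1/2), [1/2,1)},  beta = {[0,1/3), [1/3,1)}  (Lebesgue measure).
# Cells of alpha v beta: [0,1/3), [1/3,1/2), [1/2,1), with measures:
join = [1/3, 1/6, 1/2]
alpha_of_cell = [1/2, 1/2, 1/2]   # mu(A) for the alpha-element containing each cell

H_join = entropy(join)
H_alpha = entropy([1/2, 1/2])
H_beta_given_alpha = cond_entropy(join, alpha_of_cell)
```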
Definition 10.9. Let α and β be countable partitions of X. We say that β is a refinement
of α and write α ≤ β if every set in α is a union of sets in β.
Exercise 10.10. Show that if α ≤ β then I(α | β) = 0. (This corresponds to an intuitive understanding of how information should behave: if α ≤ β then we receive no information on learning which element of α a point is in, given that we already know which element of β it lies in.)
Corollary 10.11. If γ ≥ β then
I(α ∨ β | γ) = I(α | γ),
H(α ∨ β | γ) = H(α | γ).
Proof. If γ ≥ β then β ≤ γ ≤ α ∨ γ and so I(β | α ∨ γ) ≡ 0, H(β | α ∨ γ) = 0. The result
now follows from the Basic Identities.
Corollary 10.12. If α ≥ β then
I(α | γ) ≥ I(β | γ),
H(α | γ) ≥ H(β | γ).
Proof. If α ≥ β then
I(α | γ) = I(α ∨ β | γ) = I(β | γ) + I(α | β ∨ γ) ≥ I(β | γ).
The same argument works for entropy.
We next need to show the harder result that if γ ≥ β then H(α | β) ≥ H(α | γ). This
requires the following inequality.
Proposition 10.13 (Jensen’s Inequality). Let φ : [0, 1] → R+ be continuous and concave
(i.e., for 0 ≤ p ≤ 1, φ(px + (1 − p)y ) ≥ pφ(x) + (1 − p)φ(y )). Let f : X → [0, 1] be
measurable (on (X, B)) and let A be a sub-σ-algebra of B. Then
φ(E(f | A)) ≥ E(φ(f ) | A) µ-a.e.
Proof. Omitted.
As a consequence we obtain:
Lemma 10.14. If γ ≥ β then H(α | β) ≥ H(α | γ).
Remark 10.15. Note!: the corresponding statement for information is not true.
Proof. Set φ(t) = −t log t, 0 < t ≤ 1, φ(0) = 0; this is continuous and concave on [0, 1].
Pick A ∈ α and define f(x) = µ(A | γ)(x) = E(χA | γ)(x). Then, applying Jensen’s Inequality with β in the role of A and γ in the role of B (note that β ⊂ γ as σ-algebras, since γ refines β), we have
φ(E(f | β)) ≥ E(φ(f) | β).
Now, by one of the properties of conditional expectation,
E(f | β) = E(E(χA | γ) | β) = E(χA | β) = µ(A | β).
Therefore, we have that
−µ(A | β) log µ(A | β) = φ(µ(A | β)) ≥ E(−µ(A | γ) log µ(A | γ) | β).
Integrating, we can remove the conditional expectation on the right-hand side and obtain
∫ −µ(A | β) log µ(A | β) dµ ≥ ∫ −µ(A | γ) log µ(A | γ) dµ.
Finally, summing over A ∈ α gives H(α | β) ≥ H(α | γ).
10.6 Entropy of a Transformation Relative to a Partition
We are now (at last!) in a position to bring measure-preserving transformations back into
the picture. We are going to define the entropy of a measure-preserving transformation T
relative to a partition α (with H(α) < +∞). Later we shall remove the dependence on α to
obtain the genuine entropy.
We first need the following standard analytic lemma.
Lemma 10.16. Let an be a sub-additive sequence of real numbers (i.e. an+m ≤ an + am ).
Then the sequence an /n converges to its infimum as n → ∞.
Proof. Omitted. (As an exercise, you might want to try to prove this.)
Exercise 10.17. Let α be a countable partition of X. Show that T −1 α = {T −1 A | A ∈ α} is
a countable partition of X. Show that H(T −1 α) = H(α).
Let us write
Hn(α) = H(⋁_{i=0}^{n−1} T^{−i}α).
Using the basic identity (with γ equal to the trivial partition) we have that
H_{n+m}(α) = H(⋁_{i=0}^{n+m−1} T^{−i}α)
           = H(⋁_{i=0}^{n−1} T^{−i}α) + H(⋁_{i=n}^{n+m−1} T^{−i}α | ⋁_{i=0}^{n−1} T^{−i}α)
           ≤ H(⋁_{i=0}^{n−1} T^{−i}α) + H(⋁_{i=n}^{n+m−1} T^{−i}α)
           = Hn(α) + H(T^{−n} ⋁_{i=0}^{m−1} T^{−i}α)
           = Hn(α) + Hm(α).
We have just shown that Hn(α) is a sub-additive sequence. Therefore, by Lemma 10.16,
lim_{n→∞} (1/n) Hn(α)
exists and we can make the following definition.
Definition 10.18. We define the entropy of a measure-preserving transformation T relative to a partition α (with H(α) < +∞) to be
h(T, α) = lim_{n→∞} (1/n) H(⋁_{i=0}^{n−1} T^{−i}α).
Remark 10.19. Since
Hn (α) ≤ Hn−1 (α) + H(α) ≤ · · · ≤ nH(α)
we have
0 ≤ h(T, α) ≤ H(α).
Remark 10.20. Here is an alternative formula for h(T, α). Let
α^n = α ∨ T^{−1}α ∨ · · · ∨ T^{−(n−1)}α.
Then
H(α^n) = H(α | T^{−1}α ∨ · · · ∨ T^{−(n−1)}α) + H(T^{−1}α ∨ · · · ∨ T^{−(n−1)}α)
       = H(α | T^{−1}α ∨ · · · ∨ T^{−(n−1)}α) + H(α^{n−1}).
Hence
H(α^n)/n = (1/n) H(α | T^{−1}α ∨ · · · ∨ T^{−(n−1)}α) + (1/n) H(α | T^{−1}α ∨ · · · ∨ T^{−(n−2)}α) + · · · + (1/n) H(α | T^{−1}α) + (1/n) H(α).
Since
H(α | T^{−1}α ∨ · · · ∨ T^{−(n−1)}α) ≤ H(α | T^{−1}α ∨ · · · ∨ T^{−(n−2)}α) ≤ · · · ≤ H(α)
and
H(α | T^{−1}α ∨ · · · ∨ T^{−(n−1)}α) → H(α | ⋁_{i=1}^{∞} T^{−i}α)
(by the Increasing Martingale Theorem), we have
h(T, α) = lim_{n→∞} (1/n) H(α^n) = H(α | ⋁_{i=1}^{∞} T^{−i}α).
10.7 The entropy of a measure-preserving transformation
Finally, we can define the entropy of T with respect to the measure µ.
Definition 10.21. Let T be a measure-preserving transformation of the probability space
(X, B, µ). Then the entropy of T with respect to µ is defined to be
h(T ) = sup{h(T, α) | α is a countable partition such that H(α) < ∞}.
10.8 Entropy as an isomorphism invariant
Recall the definition of what it means to say that two measure-preserving transformations
are metrically isomorphic.
Definition 10.22. We say that two measure-preserving transformations (X, B, µ, T ) and
(Y, C, m, S) are (measure theoretically) isomorphic if there exist M ∈ B and N ∈ C such
that
(i) T M ⊂ M, SN ⊂ N,
(ii) µ(M) = 1, m(N) = 1,
and there exists a bijection φ : M → N such that
(i) φ, φ−1 are measurable and measure-preserving (i.e. µ(φ−1 A) = m(A) for all A ∈ C),
(ii) φ ◦ T = S ◦ φ.
We prove that two metrically isomorphic measure-preserving transformations have the
same entropy.
Theorem 10.23. Let T : X → X be a measure-preserving transformation of (X, B, µ) and let S : Y → Y be a measure-preserving transformation of (Y, C, m). If T and S are isomorphic then h(T ) = h(S).
Proof. Let M ⊂ X, N ⊂ Y and φ : M → N be as above. If α is a partition of Y then
(changing it on a set of measure zero if necessary) it is also a partition of N. The inverse
image φ−1 α = {φ−1 A | A ∈ α} is a partition of M and hence of X. Furthermore,
Hµ(φ^{−1}α) = − Σ_{A∈α} µ(φ^{−1}A) log µ(φ^{−1}A)
            = − Σ_{A∈α} m(A) log m(A)
            = Hm(α).
More generally,
Hµ(⋁_{j=0}^{n−1} T^{−j}(φ^{−1}α)) = Hµ(φ^{−1}(⋁_{j=0}^{n−1} S^{−j}α)) = Hm(⋁_{j=0}^{n−1} S^{−j}α).
Therefore, dividing by n and letting n → ∞, we have
h(S, α) = h(T, φ−1 α).
Thus
h(S) = sup{h(S, α) | α partition of Y, Hm (α) < ∞}
= sup{h(T, φ−1 α) | α partition of Y, Hm (α) < ∞}
≤ sup{h(T, β) | β partition of X, Hµ (β) < ∞}
= h(T ).
By symmetry, we also have h(T ) ≤ h(S). Therefore h(T ) = h(S).
Note that the converse to Theorem 10.23 is false in general: if two measure-preserving
transformations have the same entropy then they are not necessarily metrically isomorphic.
10.9 Calculating entropy
At first sight, the entropy of a measure-preserving transformation seems hard to calculate
as it involves taking a supremum over all possible (finite entropy) partitions. However, some
short cuts are possible.
10.10 Generators and Sinai’s theorem
A major complication in the definition of entropy is the need to take the supremum over all finite entropy partitions. Sinai’s theorem guarantees that h(T ) = h(T, α) for a partition α whose refinements generate the full σ-algebra.
We begin by proving the following result.
Theorem 10.24 (Abramov’s theorem). Suppose that α1 ≤ α2 ≤ · · · ↑ B are countable partitions such that H(αn) < ∞ for all n ≥ 1. Then
h(T ) = lim_{n→∞} h(T, αn).
Proof. Choose any countable partition β such that H(β) < ∞. Fix n > 0. Then
H(⋁_{j=0}^{k−1} T^{−j}β) ≤ H(⋁_{j=0}^{k−1} T^{−j}β ∨ ⋁_{j=0}^{k−1} T^{−j}αn)
                         ≤ H(⋁_{j=0}^{k−1} T^{−j}αn) + H(⋁_{j=0}^{k−1} T^{−j}β | ⋁_{j=0}^{k−1} T^{−j}αn),
by the basic identity.
Observe that
H(⋁_{j=0}^{k−1} T^{−j}β | ⋁_{j=0}^{k−1} T^{−j}αn)
  = H(β | ⋁_{j=0}^{k−1} T^{−j}αn) + H(⋁_{j=1}^{k−1} T^{−j}β | β ∨ ⋁_{j=0}^{k−1} T^{−j}αn)
  ≤ H(β | αn) + H(⋁_{j=1}^{k−1} T^{−j}β | ⋁_{j=1}^{k−1} T^{−j}αn)
  = H(β | αn) + H(⋁_{j=0}^{k−2} T^{−j}β | ⋁_{j=0}^{k−2} T^{−j}αn).
Continuing this inductively we see that
H(⋁_{j=0}^{k−1} T^{−j}β | ⋁_{j=0}^{k−1} T^{−j}αn) ≤ kH(β | αn).
Hence
h(T, β) = lim_{k→∞} (1/k) H(⋁_{j=0}^{k−1} T^{−j}β)
        ≤ lim_{k→∞} (1/k) H(⋁_{j=0}^{k−1} T^{−j}αn) + H(β | αn)
        = h(T, αn) + H(β | αn).
We now prove that H(β | αn) → 0 as n → ∞. To do this, it is sufficient to prove that I(β | αn) → 0 in L1 as n → ∞. Recall that
I(β | αn)(x) = − Σ_{B∈β} χB(x) log µ(B | αn)(x) = − log µ(B | αn)(x)
if x ∈ B, B ∈ β. By the Increasing Martingale Theorem, we know that
µ(B | αn)(x) → χB(x) a.e.
Hence for x ∈ B
I(β | αn)(x) → − log χB(x) = 0.
Hence for any countable partition β with H(β) < ∞ we have that h(T, β) ≤ lim_{n→∞} h(T, αn). The result follows by taking the supremum over all such β.
Definition 10.25. We say that a countable partition α is a generator if T is invertible and
⋁_{j=−(n−1)}^{n−1} T^{−j}α → B
as n → ∞.
We say that a countable partition α is a strong generator if
⋁_{j=0}^{n−1} T^{−j}α → B
as n → ∞.
Remark 10.26. To check whether a partition α is a generator (respectively, a strong generator) it is sufficient to check that it separates almost every pair of points. That is, for almost every x, y ∈ X, there exists n such that x and y are in different elements of the partition ⋁_{j=−(n−1)}^{n−1} T^{−j}α (respectively, ⋁_{j=0}^{n−1} T^{−j}α).
The following important theorem will be the main tool in calculating entropy.
Theorem 10.27 (Sinai’s theorem). Suppose α is a strong generator or that T is invertible
and α is a generator. If H(α) < ∞ then
h(T ) = h(T, α).
Proof. The proofs of the two cases are similar; we prove the case when T is invertible and α is a generator of finite entropy.
Let n ≥ 1. Then
h(T, ⋁_{j=−n}^{n} T^{−j}α) = lim_{k→∞} (1/k) H(T^{n}α ∨ · · · ∨ T^{−n}α ∨ T^{−(n+1)}α ∨ · · · ∨ T^{−(n+k−1)}α)
                           = lim_{k→∞} (1/k) H(α ∨ · · · ∨ T^{−(2n+k−1)}α)
                           = h(T, α)
for each n. As α is a generator, we have that
⋁_{j=−n}^{n} T^{−j}α → B.
By Abramov’s theorem, h(T, α) = h(T ).
10.11 Entropy of a power
Observe that if T preserves the measure µ then so does T k . The following result relates the
entropy of T and T k .
Theorem 10.28.
(i) For k ≥ 0 we have that h(T^k) = kh(T ).
(ii) If T is invertible then h(T ) = h(T^{−1}).
Proof. We prove (i), leaving the case k = 0 as an exercise. Choose a countable partition α with H(α) < ∞. Then
h(T^k, ⋁_{j=0}^{k−1} T^{−j}α) = lim_{n→∞} (1/n) H(⋁_{j=0}^{nk−1} T^{−j}α)
                              = k lim_{n→∞} (1/nk) H(⋁_{j=0}^{nk−1} T^{−j}α) = kh(T, α).
Thus,
kh(T ) = sup_{H(α)<∞} kh(T, α) = sup_{H(α)<∞} h(T^k, ⋁_{j=0}^{k−1} T^{−j}α) ≤ sup_{H(α)<∞} h(T^k, α) = h(T^k).
On the other hand,
h(T^k, α) = lim_{n→∞} (1/n) H(⋁_{j=0}^{n−1} T^{−jk}α)
          ≤ lim_{n→∞} (1/n) H(⋁_{j=0}^{nk−1} T^{−j}α)    (by Corollary 10.12)
          = k lim_{n→∞} (1/nk) H(⋁_{j=0}^{nk−1} T^{−j}α) = kh(T, α),
and so h(T k ) ≤ kh(T ), completing the proof.
We prove (ii). We have
H(⋁_{j=0}^{n−1} T^{−j}α) = H(T^{n−1} ⋁_{j=0}^{n−1} T^{−j}α) = H(⋁_{j=0}^{n−1} T^{j}α).
Therefore
h(T, α) = lim_{n→∞} (1/n) H(⋁_{j=0}^{n−1} T^{−j}α) = lim_{n→∞} (1/n) H(⋁_{j=0}^{n−1} T^{j}α) = h(T^{−1}, α).
Taking the supremum over α gives h(T ) = h(T −1 ).
Exercise 10.29. Prove that the entropy of the identity map is zero.
10.12 Calculating entropy using generators
In this subsection, we show how generators and Sinai’s theorem can be used to calculate the
entropy for some of our examples.
10.13 Subshifts of finite type
Let A be an irreducible k × k matrix with entries from {0, 1}. Recall that we define the shifts of finite type to be the spaces
ΣA = {(xn)_{n=−∞}^{∞} ∈ {1, . . . , k}^Z | A(xn, xn+1) = 1 for all n ∈ Z},
Σ_A^+ = {(xn)_{n=0}^{∞} ∈ {1, . . . , k}^N | A(xn, xn+1) = 1 for all n ∈ N},
and the shift maps σ : ΣA → ΣA, σ : Σ_A^+ → Σ_A^+ by (σx)n = xn+1.
Let P be a stochastic matrix and let p be a normalised left eigenvector so that pP = p.
Suppose that P is compatible with A, so that Pi,j > 0 if and only if A(i, j) = 1. Recall that
we define the Markov measure µP by defining it on cylinder sets by
µP [z0 , z1 , . . . , zn ] = pz0 Pz0 z1 · · · Pzn−1 zn ,
and then extending it to the full σ-algebra by using the Kolmogorov Extension Theorem.
We shall calculate hµP (σ) for the one-sided shift which for notational brevity we denote
by σ : ΣA → ΣA ; the calculation for the two-sided shift is similar.
Let α be the partition {[1], . . . , [k]} of ΣA into cylinders of length 1. Then
H(α) = − Σ_{i=1}^{k} µP[i] log µP[i] = − Σ_{i=1}^{k} p_i log p_i < ∞.
The partition α^n = ⋁_{i=0}^{n} σ^{−i}α consists of all allowed cylinders of length n + 1:
⋁_{i=0}^{n} σ^{−i}α = {[z0, z1, . . . , zn] | A(zi, zi+1) = 1, i = 0, . . . , n − 1}.
Hence α is a strong generator. Moreover, we have
H(⋁_{i=0}^{n} σ^{−i}α)
 = − Σ_{[z0,z1,...,zn]∈α^n} µ[z0, z1, . . . , zn] log µ[z0, z1, . . . , zn]
 = − Σ_{[z0,z1,...,zn]∈α^n} p_{z0} P_{z0 z1} · · · P_{zn−1 zn} log(p_{z0} P_{z0 z1} · · · P_{zn−1 zn})
 = − Σ_{i0=1}^{k} · · · Σ_{in=1}^{k} p_{i0} P_{i0 i1} · · · P_{in−1 in} (log p_{i0} + log P_{i0 i1} + · · · + log P_{in−1 in})
 = − Σ_{i0=1}^{k} p_{i0} log p_{i0} − n Σ_{i,j=1}^{k} p_i P_{ij} log P_{ij},
where we have used the identities Σ_{j=1}^{k} P_{ij} = 1 and Σ_{i=1}^{k} p_i P_{ij} = p_j. Therefore
hµP(σ) = hµP(σ, α) = lim_{n→∞} (1/(n + 1)) H(⋁_{i=0}^{n} σ^{−i}α) = − Σ_{i,j=1}^{k} p_i P_{ij} log P_{ij}.
Exercise 10.30. Carry out the above calculation for a full shift on k symbols with a Bernoulli measure determined by the probability vector p = (p1, . . . , pk) to show that in this case the entropy is − Σ_{i=1}^{k} p_i log p_i.
Recall from Theorem 10.23 that if two measure-preserving transformations are metrically isomorphic then they have the same entropy but that the converse is not necessarily true. However, for Markov measures on two-sided shifts of finite type entropy is a complete invariant:
Theorem 10.31 (Ornstein’s theorem). Any two 2-sided Bernoulli shifts with the same entropy
are metrically isomorphic.
Theorem 10.32 (Ornstein and Friedman). Any two 2-sided aperiodic Markov shifts with the
same entropy are metrically isomorphic.
Remark 10.33. Both of these theorems are false for 1-sided shifts. The isomorphism problem
for 1-sided shifts is a very subtle problem.
10.14 The continued fraction map
Recall that the continued fraction map is defined by T(x) = 1/x mod 1 and preserves Gauss’ measure µ defined by
µ(B) = (1/log 2) ∫_B 1/(1 + x) dx.
Let An = (1/(n + 1), 1/n) and let α be the partition α = {An | n = 1, 2, 3, . . .}.
Exercise 10.34. Check that H(α) < ∞. (Hint: use the fact that Gauss’ measure µ and Lebesgue measure λ are comparable, i.e. there exist constants c, C > 0 such that c ≤ µ(B)/λ(B) ≤ C for all B ∈ B.)
We claim that α is a strong generator for T . To see this, recall that each irrational x
has a distinct continued fraction expansion. Hence α separates irrational, hence almost all,
points.
For notational convenience let
[x0 , . . . , xn−1 ] = Ax0 ∩ T −1 Ax1 ∩ · · · ∩ T −(n−1) Axn−1
= {x ∈ [0, 1] | T j (x) ∈ Axj for j = 0, . . . , n − 1}
so that [x0 , . . . , xn−1 ] is the set of all x ∈ [0, 1] whose continued fraction expansion starts
x0 , . . . , xn−1 .
If x ∈ [x0, . . . , xn] then
I(α | T^{−1}α ∨ · · · ∨ T^{−n}α)(x) = − log (µ([x0, . . . , xn]) / µ([x1, . . . , xn])).
We will use the following fact: if In(x) is a nested sequence of intervals such that In(x) ↓ {x} as n → ∞ then
lim_{n→∞} (1/λ(In(x))) ∫_{In(x)} f(y) dy = f(x)
where λ denotes Lebesgue measure. We will also need the fact that
lim_{n→∞} λ([x0, . . . , xn]) / λ([x1, . . . , xn]) = 1/|T′(x)|.
Hence
µ([x0, . . . , xn]) / µ([x1, . . . , xn])
 = (∫_{[x0,...,xn]} dx/(1 + x)) / (∫_{[x1,...,xn]} dx/(1 + x))
 = ((1/λ([x0, . . . , xn])) ∫_{[x0,...,xn]} dx/(1 + x)) / ((1/λ([x1, . . . , xn])) ∫_{[x1,...,xn]} dx/(1 + x)) × λ([x0, . . . , xn])/λ([x1, . . . , xn])
 → (1 + T x)/(1 + x) × 1/|T′(x)|.
Hence
I(α | ⋁_{j=1}^{∞} T^{−j}α)(x) = − log ((1 + T x)/(1 + x) × 1/|T′(x)|).
Using the fact that µ is T-invariant we see that
H(α | ⋁_{j=1}^{∞} T^{−j}α) = ∫ I(α | ⋁_{j=1}^{∞} T^{−j}α) dµ
                           = ∫ − log (1/|T′(x)|) dµ
                           = ∫ log |T′(x)| dµ.
Now T(x) = 1/x mod 1 so that T′(x) = −1/x². Hence
h(T ) = H(α | ⋁_{j=1}^{∞} T^{−j}α) = − (1/log 2) ∫_0^1 (log x²)/(1 + x) dx,
which cannot be simplified much further.
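The definite integral does, however, have a known value: a classical computation (not carried out in these notes) shows h(T ) = π²/(6 log 2) ≈ 2.373. A quick midpoint-rule check (the grid size is an arbitrary choice; the integrand has only a mild logarithmic singularity at 0):

```python
import math

# h(T) = -(1/log 2) * int_0^1 log(x^2)/(1+x) dx, approximated by the midpoint rule.
n = 1_000_000
total = 0.0
for k in range(n):
    x = (k + 0.5) / n
    total += math.log(x * x) / (1 + x)
h = -total / n / math.log(2)

expected = math.pi ** 2 / (6 * math.log(2))   # classical closed form (assumed here)
```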
Exercise 10.35. Define T : [0, 1] → [0, 1] to be the doubling map T (x) = 2x mod 1. Let µ
denote Lebesgue measure. We know that µ is a T -invariant probability measure. Prove that
h(T ) = log 2.
Exercise 10.36. Define T : [0, 1] → [0, 1] by T(x) = 4x(1 − x). Define the measure µ by
µ(B) = (1/π) ∫_B 1/√(x(1 − x)) dx.
We have seen in a previous exercise that µ is an invariant probability measure. Show that
h(T ) = log 2.
(Hint: you may use the fact that the partition α = {[0, 1/2], [1/2, 1]} is a strong generator.)
Exercise 10.37. Let β > 1 be the golden mean, so that β² = β + 1. Define T(x) = βx mod 1. Define the density
k(x) = { 1/(1/β + 1/β³)       on [0, 1/β),
       { (1/β)/(1/β + 1/β³)   on [1/β, 1),
and define the measure
µ(B) = ∫_B k(x) dx.
In a previous exercise, we saw that µ is T-invariant. Assuming that α = {[0, 1/β), [1/β, 1]} is a strong generator, show that h(T ) = log β.
Exercise 10.38. Let T(x) = 1/x mod 1 and let
µ(B) = (1/log 2) ∫_B 1/(1 + x) dx
be Gauss’ measure. Let Ak = (1/(k + 1), 1/k]. Explain why α = {Ak}_{k=1}^{∞} is a strong generator for T. Show that the entropy of T with respect to µ can be written as
h(T ) = − (1/log 2) ∫_0^1 (log x²)/(1 + x) dx.