Course compendium, Applied matrix analysis

Christopher Engström
Karl Lundengård
Sergei Silvestrov

November 14, 2014
Contents

1 Review of basic linear algebra
  1.1 Notation and elementary matrix operations
  1.2 Linear equation systems
      1.2.1 Determinants
  1.3 Eigenvalues and eigenvectors
  1.4 Exercises

2 Matrix factorization and canonical forms
  2.1 Important types of matrices and some useful properties
      2.1.1 Triangular and Hessenberg matrices
      2.1.2 Hermitian matrices
      2.1.3 Unitary matrices
      2.1.4 Positive definite matrices
  2.2 Factorization
      2.2.1 Spectral factorization
      2.2.2 Rank factorization
      2.2.3 LU and Cholesky factorization
      2.2.4 QR factorization
      2.2.5 Canonical forms
      2.2.6 Reduced row echelon form
      2.2.7 Jordan normal form
      2.2.8 Singular value decomposition (SVD)
  2.3 Applications
      2.3.1 Portfolio evaluation using Monte Carlo simulation

3 Non-negative matrices and graphs
  3.1 Introduction to graphs
  3.2 Connectivity and irreducible matrices
  3.3 The Laplacian matrix
  3.4 Exercises
  3.5 Applications
      3.5.1 Shortest path
      3.5.2 Resistance distance

4 Perron-Frobenius theory
  4.1 Perron-Frobenius for non-negative matrices
  4.2 Exercises
  4.3 Applications
      4.3.1 Leontief input-output model

5 Stochastic matrices and Perron-Frobenius
  5.1 Stochastic matrices and Markov chains
  5.2 Irreducible, primitive stochastic matrices
  5.3 Irreducible, imprimitive stochastic matrices
  5.4 Reducible, stochastic matrices
  5.5 Hitting times and hitting probabilities
  5.6 A short look at continuous time Markov chains
  5.7 Exercises
  5.8 Applications
      5.8.1 Return to the Leontief model

6 Linear spaces and projections
  6.1 Linear spaces
  6.2 Inner product spaces
  6.3 Projections
  6.4 Applications
      6.4.1 Fourier approximation
      6.4.2 Finding a QR decomposition using the Gram-Schmidt process
      6.4.3 Principal component analysis (PCA) and dimension reduction

7 Linear transformations
  7.1 Linear transformations in geometry
  7.2 Surjection, injection and combined transformations
  7.3 Application: Transformations in computer graphics and homogeneous coordinates
  7.4 Kernel and image
  7.5 Isomorphisms

8 Least Square Method (LSM)
  8.1 Finding the coefficients
  8.2 Generalized LSM
  8.3 Weighted Least Squares Method (WLS)
  8.4 Pseudoinverses

9 Matrix functions
  9.1 Matrix power function
      9.1.1 Calculating A^n
  9.2 Matrix root function
      9.2.1 Root of a diagonal matrix
      9.2.2 Finding the square root of a matrix using the Jordan canonical form
  9.3 Matrix polynomials
      9.3.1 The Cayley-Hamilton theorem
      9.3.2 Calculating matrix polynomials
      9.3.3 Companion matrices
  9.4 Exponential matrix
      9.4.1 Solving matrix differential equations
  9.5 General matrix functions

10 Matrix equations
  10.1 Tensor products
  10.2 Solving matrix equations

11 Numerical methods for computing eigenvalues and eigenvectors
  11.1 The Power method
      11.1.1 Inverse Power method
  11.2 The QR-method
  11.3 Applications
      11.3.1 PageRank

Index
Notation

v             vector (usually represented by a row or column matrix)
|v|           length (Euclidean norm) of a vector
‖v‖           norm of a vector
M             matrix
M^⊤           transpose
M^{-1}        inverse
M^*           pseudoinverse
M̄            complex conjugate
M^H           Hermitian transpose
M^k           the k:th power of the matrix M, see section 9.1
m_{i,j}       the element on the i:th row and j:th column of M
m^(k)_{i,j}   the element on the i:th row and j:th column of M^k
M_{i.}        i:th row of M
M_{.j}        j:th column of M
⊗             Kronecker (tensor) product
a             scalar
|a|           absolute value of a scalar
⟨·, ·⟩        inner product (scalar product)
Z             set of integers
Z^+           set of positive integers
R             set of real numbers
C             set of complex numbers
M_{m×n}(K)    set of matrices with dimension m × n and elements from the set K
1 Review of basic linear algebra
The reader should already be familiar with the concepts and content in this chapter; it serves mainly as a reminder and overview of what we already know. We will start by giving some notation used throughout the book as well as reminding ourselves of the most basic matrix operations. We will then take a look at linear systems, eigenvalues and eigenvectors and related concepts as seen in elementary courses in linear algebra. While the methods presented here work, they are often very slow or unstable when used on a computer. Much of our later work will relate to finding better ways to compute these or similar quantities, or to giving other examples of where and why they are useful.
1.1 Notation and elementary matrix operations
We will here go through the most elementary matrix operations as well as give the notation which will be used throughout the book. We remind ourselves that a matrix is a rectangular array of numbers, symbols or expressions. Some examples of matrices can be seen below:
A = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}, \quad
B = \begin{pmatrix} 0 & 1-i \end{pmatrix}, \quad
C = \begin{pmatrix} 1 & -2 \\ -2 & 3 \\ 3 & 1 \end{pmatrix}, \quad
D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 3 & 0 \end{pmatrix}
We note that matrices need not be square, as in C and D. We also note that although matrices can be complex valued, in most of our examples and practical applications we need only work with real valued matrices.
• We denote every element of A as a_{i,j}, where i is its row number and j is its column number. For example, looking at C we have c_{3,2} = 1.
• The size of a matrix is the number of rows and columns (or the index of the bottom
right element): We say that C above is a 3 × 2 matrix.
• A matrix with only one row or column is called a row vector or column vector respectively. We usually denote vectors with an arrow, \vec{v}, and we assume all vectors are column vectors unless stated otherwise.
• The diagonal of a matrix consists of the elements running diagonally from the top left corner towards the bottom right (a_{1,1}, a_{2,2}, ..., a_{n,n}). A matrix where all elements except those on the diagonal are zero is called a diagonal matrix. For example A is a diagonal matrix, but so is D, although we will usually only consider square diagonal matrices.
• The trace of a matrix is the sum of the elements on the diagonal.
Matrix addition
Add every element in the first matrix to the corresponding element in the second matrix. Both matrices need to be of the same size for addition to be defined.
\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} + \begin{pmatrix} 1 & -2 \\ 2 & -1 \end{pmatrix} = \begin{pmatrix} 1+1 & 0-2 \\ 0+2 & 2-1 \end{pmatrix} = \begin{pmatrix} 2 & -2 \\ 2 & 1 \end{pmatrix}
Although addition between a scalar and a matrix is undefined, sometimes authors write it as such, in which case they usually mean to add the scalar to every element of the matrix, as if the scalar were a matrix of the same size with every element equal to the scalar.
The general definition of matrix addition is:
Let A = B + C, where all three matrices have the same number of rows and columns. Then a_{i,j} = b_{i,j} + c_{i,j} for every element in A.
Matrix multiplication
Given the two matrices A and B we get the product AB as the matrix whose elements
ei,j are found by multiplying the elements in row i of A with the elements of column j
of B and adding the results.
\begin{pmatrix} 3 & 2 & 0 & 1 \\ 1 & 2 & 3 & 4 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 1 & 3 \end{pmatrix}
\begin{pmatrix} 3 & 2 & 0 & 1 \\ 1 & 2 & 3 & 4 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 1 & 3 \end{pmatrix}
=
\begin{pmatrix} 11 & 10 & 7 & 14 \\ 5 & 9 & 16 & 24 \\ 1 & 4 & 8 & 9 \\ 0 & 1 & 5 & 10 \end{pmatrix}

For the highlighted element (row 2, column 1 of the product) we have: 1 · 3 + 2 · 1 + 3 · 0 + 4 · 0 = 5
We note that we need the number of columns in A to be the same as the number of
rows in B but A can have any number of rows and B can have any number of columns.
Also we have:
• Generally AB ≠ BA, why?

• The size of AB is (number of rows of A) × (number of columns of B).

• The identity matrix I is the matrix with ones on the diagonal and zeros elsewhere. For the identity matrix we have: AI = IA = A.
Multiplying a scalar with a matrix is done by multiplying the scalar with every element
of the matrix.
3 \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 3 \cdot 1 & 3 \cdot 0 \\ 3 \cdot 0 & 3 \cdot 2 \end{pmatrix} = \begin{pmatrix} 3 & 0 \\ 0 & 6 \end{pmatrix}
The general definition of matrix multiplication is: let A = BC, where B is m × n and C is n × p. Then

a_{i,j} = \sum_{k=1}^{n} b_{i,k} c_{k,j}

for each element in A.
Matrix transpose
The transpose of a matrix is obtained by flipping the rows and columns such that the elements of the first row become the elements of the first column. For example, given the matrix A below:
A = \begin{pmatrix} 1 & -2 \\ -1 & 0 \\ 3 & 1 \end{pmatrix}

We get A transpose, denoted A^⊤, as

A^⊤ = \begin{pmatrix} 1 & -1 & 3 \\ -2 & 0 & 1 \end{pmatrix}
The general definition of the transpose is: let B = A^⊤, then b_{i,j} = a_{j,i} for all elements in B.

For complex valued matrices we usually instead talk about the Hermitian transpose, where we first take the transpose and then take the complex conjugate (change the sign of the imaginary part). We usually denote this by A^H (sometimes this is denoted by A^*, but here we will reserve that notation for pseudoinverses, see section 8.4).

The general definition of the Hermitian transpose (also known as the Hermitian conjugate or conjugate transpose) is: let B = A^H, then b_{i,j} = \overline{a_{j,i}}, where the bar denotes the complex conjugate, for all elements in B.

A real matrix where A = A^⊤ is called a symmetric matrix, and a complex matrix where A^H = A is called a Hermitian matrix.
1.2 Linear equation systems
Very often in applications we end up with a system of linear equations which we want to solve, or for which we want to find as good a solution as possible, either repeatedly for the same or similar systems or only once. Matrices provide powerful tools which we can use to solve these problems; in fact, even if we have a nonlinear system, we very often approximate it with a linear system and try to solve that instead. Linear systems will be treated repeatedly throughout the book, but here we take a short look at the problem itself, when we can solve it, and how, using what we should be familiar with from courses in linear algebra.
Given a linear equation system
\begin{cases}
a_{1,1}x_1 + a_{1,2}x_2 + \ldots + a_{1,n}x_n = b_1, \\
a_{2,1}x_1 + a_{2,2}x_2 + \ldots + a_{2,n}x_n = b_2, \\
\qquad \vdots \\
a_{m,1}x_1 + a_{m,2}x_2 + \ldots + a_{m,n}x_n = b_m,
\end{cases}
we can represent it using a matrix and column vectors
\begin{pmatrix}
a_{1,1} & a_{1,2} & \ldots & a_{1,n} \\
a_{2,1} & a_{2,2} & \ldots & a_{2,n} \\
\vdots & \vdots & & \vdots \\
a_{m,1} & a_{m,2} & \ldots & a_{m,n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}
Or in a more compact form as Ax = b. Solving this system of equations is the same as finding an inverse matrix A^{-1} such that A^{-1}A = I and multiplying both sides from the left with this matrix. The reader should be familiar with using Gaussian elimination to write the system in an easy-to-solve form, or going a step further and finding the inverse A^{-1} using Gauss-Jordan elimination. We remember that Gaussian elimination uses elementary row operations (adding and multiplying rows) to find pivot elements in order to bring the matrix to row echelon form.
\begin{pmatrix}
0 & \star & \star & \star & \star & \star & \star & \star \\
0 & 0 & 0 & \star & \star & \star & \star & \star \\
0 & 0 & 0 & 0 & \star & \star & \star & \star \\
0 & 0 & 0 & 0 & 0 & \star & \star & \star \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \star \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
After we have done this we can easily solve the system as long as it is solvable. Not
all systems can be solved exactly, for example the system below is obviously unsolvable:
\begin{pmatrix} 1 & 2 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}
For square matrices we can go one step further by using Gauss-Jordan elimination, reducing the elements above the diagonal to zero as well in the same manner. Similarly, some systems have more than one solution. For example, any point on the line y = b^{-1}(c − ax) will solve the linear equation system

\begin{pmatrix} 4 & -1 \\ -8 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ -4 \end{pmatrix}
We are often interested in whether there is a single unique solution to a linear equation system.
Theorem 1.1. The following statements are equivalent for A ∈ M_{n×n}(K):

1. AX = B has a unique solution for all B ∈ M_{n×m}, where X ∈ M_{n×m}.

2. AX = 0 only has the trivial solution (X = 0).

3. A has linearly independent row/column vectors.

4. A is invertible/non-degenerate/non-singular.

5. det(A) ≠ 0.

6. A has maximum rank (rank(A) = n).

7. A's row/column space has dimension n.

8. A's null space has dimension 0.

9. A's row/column vectors span K^n.
Most of these statements should be familiar already; some will be looked at more in depth later, especially when we look at linear spaces in chapter 6.
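To connect these statements to computation, here is a small hedged NumPy sketch (the example matrices and right-hand sides are made up) that checks a few of the equivalent conditions numerically and solves Ax = b when they hold.

```python
import numpy as np

A = np.array([[4.0, -1.0], [-8.0, 2.0]])   # the singular example from above
b = np.array([2.0, -4.0])
print("det(A)  =", np.linalg.det(A))        # 0, so no unique solution
print("rank(A) =", np.linalg.matrix_rank(A))

A2 = np.array([[1.0, 2.0], [3.0, 4.0]])     # a made-up non-singular example
b2 = np.array([1.0, 2.0])
if not np.isclose(np.linalg.det(A2), 0.0):
    x = np.linalg.solve(A2, b2)             # unique solution since det(A2) != 0
    print("x =", x)
```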
1.2.1 Determinants
One of the ways to guarantee that a linear system Ax = b has a unique solution is that the determinant is non-zero, det(A) ≠ 0. We remember the determinant as the product of the eigenvalues. For a small matrix we can easily calculate the determinant by hand; we had the formulas for 2 × 2 and 3 × 3 matrices as:
\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc, \qquad
\begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = aei - afh - bdi + bfg + cdh - ceg

And for larger matrices we could write it as a sum of determinants of smaller matrices:

\begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix}
= a \begin{vmatrix} e & f \\ h & i \end{vmatrix}
- b \begin{vmatrix} d & f \\ g & i \end{vmatrix}
+ c \begin{vmatrix} d & e \\ g & h \end{vmatrix}

And in general we get the determinant function

\det : M_{n×n}(K) \to K, \qquad \det(A) = \det(A_{.1}, A_{.2}, \ldots, A_{.k}, \ldots, A_{.n})
Some properties of the determinant:
• det(A^⊤) = det(A)

• det(AB) = det(A) det(B), but in general det(A + B) ≠ det(A) + det(B).

• det(cA) = c^n det(A)

• det is multilinear

• det is alternating

• det(I) = 1
With multilinear we mean that it is linear in each argument:

\det(A_{.1}, A_{.2}, \ldots, a \cdot A_{.k} + b \cdot B, \ldots, A_{.n}) = a \det(A_{.1}, A_{.2}, \ldots, A_{.k}, \ldots, A_{.n}) + b \det(A_{.1}, A_{.2}, \ldots, B, \ldots, A_{.n})

For example, let A be a 3 × 3 matrix and B, C be 3 × 1 matrices, and let M = [A_{.1} \; A_{.2} + B \; A_{.3} + C]. Then
det(M) = det(A.1 , A.2 + B, A.3 + C) =
= det(A.1 , A.2 , A.3 + C) + det(A.1 , B, A.3 + C) =
= det(A) + det(A.1 , B, A.3 ) + det(A.1 , A.2 , C) + det(A.1 , B, C)
Since the determinant is alternating, switching two columns (or rows) of A switches
the sign of det(A)
det(B) = det(A.1 , A.2 , . . . , A.k+1 , A.k , . . . , A.n ) = − det(A)
But more importantly, consider what happens when we have two linearly dependent vectors:

\det(A_{.1}, A_{.2}, \ldots, B, B, \ldots, A_{.n}) = 0

The multilinear and alternating properties then give: linear dependence between columns (or rows) in A ⇔ det(A) = 0.
Apart from using column or row expansion to calculate the determinant there is an
alternative formula for the determinant
\det(A) = \sum_{\sigma \in S_n} a_{\sigma_1,1} a_{\sigma_2,2} \cdots a_{\sigma_n,n} \operatorname{sign}(\sigma) \qquad (1)
σ is a permutation of columns (rows), Sn is the set of all possible permutations, aσi ,i is
the element that ai,i is switched with. sign is a function that is equal to 1 if an even
number of rows have been switched and -1 if an odd number of rows have been switched.
For example, for a 2 × 2 matrix

A = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}

we get two possible permutations:

σ^a: do not switch places of the columns, a_{\sigma^a_1,1} = a_{1,1}, a_{\sigma^a_2,2} = a_{2,2}.

σ^b: switch places of the columns, a_{\sigma^b_1,1} = a_{1,2}, a_{\sigma^b_2,2} = a_{2,1}.

And we get the determinant \det(A) = a_{\sigma^a_1,1} a_{\sigma^a_2,2} \operatorname{sign}(\sigma^a) + a_{\sigma^b_1,1} a_{\sigma^b_2,2} \operatorname{sign}(\sigma^b) = a_{1,1}a_{2,2} - a_{2,1}a_{1,2}.
The basic idea for the formula is as follows.
We start by rewriting columns as sums of columns of the identity matrix
A_{.k} = \sum_{\sigma_k=1}^{n} a_{\sigma_k,k} I_{.\sigma_k}

Using the multilinearity of the determinant function we get

\det\Big(\sum_{\sigma_1=1}^{n} a_{\sigma_1,1} I_{.\sigma_1}, A_{.2}, \ldots, A_{.n}\Big)
= \sum_{\sigma_1=1}^{n} a_{\sigma_1,1} \det(I_{.\sigma_1}, A_{.2}, \ldots, A_{.n})
= \sum_{\sigma_1=1}^{n} a_{\sigma_1,1} \sum_{\sigma_2=1}^{n} a_{\sigma_2,2} \det(I_{.\sigma_1}, I_{.\sigma_2}, \ldots, A_{.n})
= \sum_{\sigma_1=1}^{n} a_{\sigma_1,1} \sum_{\sigma_2=1}^{n} a_{\sigma_2,2} \cdots \sum_{\sigma_n=1}^{n} a_{\sigma_n,n} \det(I_{.\sigma_1}, I_{.\sigma_2}, \ldots, I_{.\sigma_n})

Terms where \sigma_i = \sigma_k for some i ≠ k, 1 ≤ i, k ≤ n, will be equal to zero, so

\det(A) = \sum_{\sigma \in S_n} a_{\sigma_1,1} a_{\sigma_2,2} \cdots a_{\sigma_n,n} \det(I_{.\sigma_1}, I_{.\sigma_2}, \ldots, I_{.\sigma_n})

The sign function captures the alternating behaviour:

\operatorname{sign}(\sigma) = \frac{\prod_{1 \le i < j \le n} (\sigma_j - \sigma_i)}{\prod_{1 \le i < j \le n} (j - i)} = \pm 1
While the formula given here is correct, it is extremely slow for large matrices; for an n × n matrix there are n! permutations. Instead we mostly calculate the determinant by column (row) expansion, as we are used to for 3 × 3 matrices:
\det(A) = \underbrace{\sum_{k=1}^{n} a_{k,i} A_{ki}}_{\text{column } i} = \underbrace{\sum_{k=1}^{n} a_{i,k} A_{ik}}_{\text{row } i} \qquad (2)

where A_{ki} is equal to (-1)^{k+i} \det(\tilde{A}_{ki}), and \tilde{A}_{ki} is equal to A with the k:th row and i:th column removed. A_{ki} is called the ki-cofactor of A.
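As an illustration of the cofactor expansion in equation (2), here is a small recursive Python sketch; like the permutation formula it has factorial-like cost, so it is only meant for small matrices.

```python
def det(A):
    """Determinant by cofactor expansion along the first row."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for i in range(n):
        # minor: remove row 0 and column i
        minor = [row[:i] + row[i + 1:] for row in A[1:]]
        total += (-1) ** i * A[0][i] * det(minor)
    return total

print(det([[1, 2], [3, 4]]))            # 1*4 - 2*3 = -2
print(det([[3, 3, 4], [0, 1, 2], [0, 0, 3]]))  # triangular, so 3*1*3 = 9
```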
1.3 Eigenvalues and eigenvectors
We are often interested in finding eigenvalues or eigenvectors of a matrix.
Definition 1.1
If Av = λv where A ∈ M_{n×n}(K), λ ∈ K, v ∈ M_{n×1}(K), then λ is an eigenvalue of A and v is a (right) eigenvector if v ≠ 0.
We call an eigenvalue λ with corresponding eigenvector v an eigenpair (λ, v).
A matrix can have (and usually does have) multiple eigenvalues and eigenvectors. A single eigenvalue can also be associated with multiple eigenvectors. Often we are interested in the "dominant" eigenvalue or the spectral radius of a matrix A.
Definition 1.2
The set of eigenvalues for a matrix A is called the spectrum of A and is denoted by Sp(A).
The largest absolute value of the eigenvalues is called the spectral radius,

ρ(A) = \max_{\lambda \in Sp(A)} |\lambda|
The eigenvalue with the largest absolute value (if there is only one) is often called the dominant eigenvalue. This and its corresponding eigenvector are often of particular interest, as we will see for example while working with stochastic matrices in chapter 5.
We can find the eigenvalues by solving det(A − λI) = 0, where det(A − λI) is an nth degree polynomial in λ. p_A(λ) = det(A − λI) is called the characteristic polynomial and p_A(λ) = 0 is called the characteristic equation.
Some properties of eigenvalues are:
• The trace of A is the sum of the eigenvalues of A: \operatorname{tr}(A) = \sum_{i=1}^{n} a_{i,i} = \sum_{i=1}^{n} \lambda_i

• \det(A) = \prod_{i=1}^{n} \lambda_i

• If A ∈ M_{n×n}(K) has n different eigenvalues then A is diagonalizable.

• If A has no eigenvalue λ = 0 then the determinant is non-zero and the linear system Ax = b has a single unique solution.
Often when interested in finding the eigenvalues or eigenvectors we first find a ”similar”
matrix with the same unknown eigenvalues and then solve the problem for this new
matrix instead.
Definition 1.3
Given two n × n matrices A, B we say that they are similar if there is an invertible n × n
matrix P such that
B = P−1 AP
Similar matrices share not only eigenvalues but also many other properties because of
that.
Theorem 1.2. Let matrices A, B be similar, then A, B have the same
• Eigenvalues (but generally not the same eigenvectors)
• Determinant, det(A) = det(B).
• Rank
• Trace
We will look much more at eigenvalues and eigenvectors for special types of matrices
in later chapters, as well as how to compute them efficiently in chapter 11.
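As a quick numerical illustration, the following hedged NumPy sketch (with a made-up matrix) computes eigenpairs, the spectral radius, and checks the trace and determinant properties above.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])       # made-up example matrix
eigvals, eigvecs = np.linalg.eig(A)           # columns of eigvecs are eigenvectors
print("eigenvalues:", eigvals)
print("spectral radius:", max(abs(eigvals)))

v = eigvecs[:, 0]                             # check the eigenpair property A v = lambda v
print(np.allclose(A @ v, eigvals[0] * v))     # True
print(np.isclose(np.trace(A), eigvals.sum())) # trace = sum of eigenvalues
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # det = product of eigenvalues
```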
1.4 Exercises
a) Show that this linear equation system has one single solution.

\begin{cases} -x + 2y - z = 2 \\ -x + y = 3 \\ 2x - y + 2z = 1 \end{cases}

b) Show that this linear equation system has infinitely many solutions.

\begin{cases} x + y - z = 1 \\ 2x + y + 2z = 2 \\ 3x + 2y + z = 3 \end{cases}

c) Show that this linear equation system has no solution.

\begin{cases} x - 2y + z = 3 \\ x + 3y - 2z = 2 \\ 2x + y - z = 1 \end{cases}
Calculate these determinants:

a = \begin{vmatrix} 5 \end{vmatrix}, \qquad
b = \begin{vmatrix} 2 & 3 \\ 3 & 2 \end{vmatrix}, \qquad
c = \begin{vmatrix} 1 & 5 & 4 \\ 2 & -1 & 5 \\ 5 & 0 & -1 \end{vmatrix}
1.4.1
Let

A = \begin{pmatrix} 3 & 3 & 4 & 4 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 3 & 2 \\ 0 & 0 & 0 & 4 \end{pmatrix}, \quad
B = \begin{pmatrix} 2 & 3 & 4 & 4 \\ 3 & 1 & 2 & 3 \\ 3 & 1 & 3 & 2 \\ 2 & 1 & 3 & 4 \end{pmatrix}, \quad
C = \begin{pmatrix} 2 & 3 & 4 & 4 \\ 3 & 1 & 2 & 2 \\ 3 & 1 & 3 & 3 \\ 3 & 2 & 1 & 3 \end{pmatrix}, \quad \det(C) = -26
The following determinants should all be very simple to calculate. Calculate them.
a) det(A)
b) det(B)
c) det(C^⊤)
1.4.2
If the linear equation systems in 1.4 were written in the form Ax = y, for which systems would A^{-1} exist?
1.4.3
Find the eigenvalues and eigenvectors of M = \begin{pmatrix} 4 & 1 \\ 3 & 2 \end{pmatrix}
1.4.4
Given
A = \begin{pmatrix} 1 & 1 & 2 \\ -1 & 1 & 0 \\ 0 & 2 & 3 \end{pmatrix}, \quad
B = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 3 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad
P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}

Show that A and B are similar.
2 Matrix factorization and canonical forms
Matrix factorization, also known as matrix decomposition, refers to methods in which we write one matrix as a product of several matrices. The aim is typically to change the problem into another, easier problem through the choice of these factor matrices. Matrix factorization is very often used in order to speed up and/or stabilize calculations in numerical algorithms, such as when finding eigenvalues of a matrix or solving a linear system. We will give some examples of applications, but primarily we will introduce different factorizations which we will then use in later chapters.
One example we should already be familiar with is the diagonalization of a diagonalizable matrix A by elementary row (or column) operations. Since elementary row operations consist of multiplying rows with non-zero scalars and adding them together, the application of a series of elementary row operations on a linear equation system, Ax = y, can be written as multiplication by a matrix, S, from the left.

Elementary row operations: Ax = y → SAx = Sy

The matrix S is always square, and since it is easy to revert any elementary row operation it will also be invertible. Thus it is always allowed to write

SAS^{-1}Sx = Sy \;\Leftrightarrow\; SAS^{-1}\hat{x} = \hat{y}, \quad \hat{x} = Sx, \ \hat{y} = Sy
Suppose that the matrix A was diagonalized in this manner.
SAS−1 = D, D diagonal matrix
This relation makes the following definition of a diagonalizable matrix quite natural
Definition 2.1
A matrix A is diagonalizable if there exists an invertible matrix S such that

A = S^{-1}DS

where D is a diagonal matrix.

This is of course an interesting property, since a matrix equation Ax = y has a unique solution if A is diagonalizable and all diagonal elements of D are non-zero.
But there are many other important factorizations such as:
• Spectral decomposition QΛQ^{-1}

• LU-factorization

• QR-factorization

• Rank factorization CF

• Jordan canonical form S^{-1}JS

• Singular value factorization UΣV^H

• Cholesky factorization GG^H
Before we look at how to find these factorizations and why they are useful we will
start by looking at a couple of important types of matrices with some useful properties.
2.1 Important types of matrices and some useful properties
We have already looked extensively at non-negative matrices in the earlier chapters; however, there are many other types of matrices which are useful when trying to find a suitable matrix factorization.
2.1.1 Triangular and Hessenberg matrices
A triangular matrix as we should already know is a matrix with all zeros above or below
the diagonal. A matrix where all elements above the diagonal are zero we call lower
triangular and a matrix where all elements below the diagonal are zero such as the one
below, we call upper triangular.
A = \begin{pmatrix}
\star & \star & \cdots & \star \\
0 & \star & \cdots & \star \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \star
\end{pmatrix}
We note that a triangular matrix does not need to be square although that is when
they are the most useful. For a non-square matrix the diagonal is always the one starting
in the upper left corner.
To understand why triangular matrices are desirable we will consider two important properties. First of all, we can very easily solve a linear system Ax = b if A is triangular, since the last row is an equation of the form ax_n = b. Solving this, we can substitute the answer into the row above and get a new equation in just one variable.
A second important property of triangular matrices is that the eigenvalues of a triangular matrix are the same as the elements on the diagonal. You can easily check this by looking at what happens with the characteristic polynomial det(A − Iλ) when A is triangular. This gives rise to one important question: is it possible to rewrite a matrix as a product of a triangular matrix and another matrix which doesn't change the eigenvalues when multiplied with this matrix, and if possible, can it be done fast?
Related to triangular matrices are the Hessenberg matrices. These are matrices that are 'almost' triangular: instead of demanding that all elements below the diagonal are zero, as for an upper triangular matrix, a matrix is an upper Hessenberg matrix if we also allow the elements immediately below the diagonal to be non-zero. This gives an upper Hessenberg matrix A the form:
A = \begin{pmatrix}
\star & \star & \star & \cdots & \star & \star & \star \\
\star & \star & \star & \cdots & \star & \star & \star \\
0 & \star & \star & \cdots & \star & \star & \star \\
0 & 0 & \star & \cdots & \star & \star & \star \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \star & \star & \star \\
0 & 0 & 0 & \cdots & 0 & \star & \star
\end{pmatrix}
While not as immediately useful as a triangular matrix, Hessenberg matrices are useful just from the fact that they are nearly triangular. The diagonal doesn't hold the eigenvalues as in a triangular matrix, but it often gives a good approximation or initial guess for iterative methods. Multiplying an upper Hessenberg matrix with an upper triangular matrix also results in a new upper Hessenberg matrix. We will see examples of how and why Hessenberg matrices are useful later when we look at the QR-method in section 11.2.
2.1.2 Hermitian matrices
We start by defining the Hermitian conjugate of a matrix:
Definition 2.2
The Hermitian conjugate of a matrix A is denoted A^H and is defined by (A^H)_{i,j} = \overline{(A)_{j,i}}, where the bar denotes the complex conjugate. In other words, we transpose it and change the sign of the imaginary parts. So if A is real valued, the Hermitian conjugate is the same as the transpose.
Definition 2.3
A matrix is said to be Hermitian (or self-adjoint) if AH = A
Notice the similarities with a symmetric matrix, A^⊤ = A; a Hermitian matrix can be seen as the complex extension of real symmetric matrices. One property of Hermitian (and real symmetric) matrices is that they are normal and thus diagonalizable.
2.1.3 Unitary matrices
Matrix factorizations using unitary matrices are often used over other factorization techniques because of their stability properties for numerical methods.
Definition 2.4
A matrix, A, is said to be unitary if AH = A−1 .
We will look more at unitary matrices when we look at linear spaces and projections
in chapter 6.2. Unitary matrices have a large number of ”good” properties.
Theorem 2.1. Let U be a unitary matrix, then
a) U is always invertible.
b) U−1 is also unitary.
c) | det(U)| = 1
d) (UV)H = (UV)−1 if V is also unitary.
e) For any λ that is an eigenvalue of U, λ = e^{iω}, 0 ≤ ω ≤ 2π.

f) Let v be a vector, then |Uv| = |v| (the Euclidean norm is preserved).

g) The rows/columns of U are orthonormal, that is U_{i.}U_{j.}^H = 0 for i ≠ j, and U_{k.}U_{k.}^H = 1.
One of the important things we can find from this is that if we transform a matrix A with a unitary matrix U as U^H AU, then U^H AU will have the same eigenvalues as A. We will meet unitary matrices again in later chapters when we need them and can more easily illustrate why they are useful. One example of a unitary matrix is the matrix C below, which rotates a vector by the angle θ around the x-axis
C = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{pmatrix}
2.1.4 Positive definite matrices
Positive definite matrices often come up in practical applications such as covariance
matrices in statistics. Positive definite matrices have many properties which makes
them easier to work with than general matrices.
Definition 2.5
We consider a square symmetric real valued n × n matrix A, then:
• A is positive definite if x^⊤Ax is positive for all non-zero vectors x.

• A is positive semidefinite if x^⊤Ax is non-negative for all non-zero vectors x.
We define negative definite and negative semidefinite in the same way, except demanding that the resulting scalar is negative or non-positive instead. Similarly, we can define the same notions for matrices with complex valued elements by replacing the transpose with the Hermitian transpose.
2.2 Factorization
As was mentioned in the introduction, factorization (also called decomposition) means taking a matrix and rewriting it as a product of several matrices. This can make it easier to solve certain problems, similarly to how factorization of numbers can make division easier.
A well known and previously visited example of factorization is rewriting a diagonalizable matrix as a product of an invertible matrix and a diagonal matrix
A = S−1 DS, D is diagonal
2.2.1 Spectral factorization
Spectral factorization is a special version of diagonal factorization. It is sometimes
referred to as eigendecomposition.
Theorem 2.2. Let A be a square (n × n) matrix with n linearly independent eigenvectors. Then
A = QΛQ−1
where Q is a square n × n matrix whose columns are the eigenvectors of A and Λ is a
diagonal matrix with the corresponding eigenvalues.
The spectral factorization can be thought of as a 'standardized' version of diagonal factorization. It also relates to projections, see section ??.
2.2.2 Rank factorization
Theorem 2.3. Let A be an m × n matrix with rank(A) = r (A has r independent rows/columns). Then there exists a factorization such that

A = CF

where C ∈ M_{m×r} and F ∈ M_{r×n}.
We take a short look at how to find such a factorization. We start by rewriting the matrix in reduced row echelon form:
B = \begin{pmatrix}
0 & 1 & \star & 0 & 0 & 0 & \star & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & \star & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & \star & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & \star & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
We then create C by removing all columns in A that correspond to a non-pivot column in B. In this example we get

C = [A_{.2} \; A_{.4} \; A_{.5} \; A_{.6} \; A_{.8}]

We then create F by removing all zero rows in B. In this example we get

F = \begin{pmatrix} B_{1.} \\ B_{2.} \\ B_{3.} \\ B_{4.} \\ B_{5.} \end{pmatrix}
This gives the Rank factorization of A.
2.2.3 LU and Cholesky factorization
The LU factorization is a factorization of the form

A = LU

where L is lower triangular and U is upper triangular. If we can find such a factorization we can, for example, easily find the solution to a linear system Ax = L(Ux) = b by first solving Ly = b and then solving Ux = y. Both these systems are easy to solve since L and U are both triangular. However, not every matrix A has an LU factorization, not even every square invertible matrix. Instead we often talk about the LUP factorization:
Theorem 2.4. If A is a square matrix, then there exists a factorization such that

PA = LU

where P is a permutation matrix, U is an upper triangular matrix and L is a lower triangular matrix. P^{-1}LU is called the LUP factorization of A.
Provided that we have found such a factorization, we can still easily solve the equation system Ax = b by instead solving PAx = Pb. The LU factorization can be seen as a variation of the Gaussian elimination which we often use when attempting to solve a linear system by hand.
LU decomposition can also be used to calculate the determinant of A = P^{-1}LU. We get the determinant as

\det(A) = \det(P^{-1}) \det(L) \det(U) = \det(P^{-1}) \left( \prod_{i=1}^{n} l_{i,i} \right) \left( \prod_{i=1}^{n} u_{i,i} \right)

since we remember that the eigenvalues of a triangular matrix are the values on the diagonal, and the determinant is their product. The determinant of a permutation matrix can be found to be \det(P^{-1}) = (-1)^S, where S is the number of row exchanges.
If A is Hermitian and positive-definite, however, then we can find another related factorization, namely a Cholesky factorization:

Theorem 2.5. If A is a Hermitian, positive-definite matrix, then there exists a factorization such that

A = LL^H

where L is a lower triangular matrix. LL^H is called the Cholesky factorization of A.
The Cholesky factorization is mainly used instead of the LUP factorization, when possible, to solve linear systems. Some examples where such matrices naturally occur are:

• Any covariance matrix is positive (semi)definite, and any covariance matrix based on measured data is going to be symmetric and real-valued. This means it is Hermitian and we can use the Cholesky factorization.

• When simulating multiple correlated variables using Monte Carlo simulation, the correlation matrix can be decomposed using the Cholesky factorization. L can then be used to simulate the dependent samples.
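A hedged NumPy/SciPy sketch of solving a symmetric positive definite system with the Cholesky factor (all matrices below are made-up examples; `numpy.linalg.cholesky` returns the lower triangular factor with A = L L^⊤ for real matrices):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])             # symmetric positive definite example
b = np.array([2.0, 1.0, 0.0])

L = np.linalg.cholesky(A)                   # A = L @ L.T, L lower triangular
y = solve_triangular(L, b, lower=True)      # forward substitution: L y = b
x = solve_triangular(L.T, y, lower=False)   # back substitution:    L^T x = y
print(np.allclose(A @ x, b))                # True

# det(A) = det(L)^2 = (product of L's diagonal elements)^2
print(np.isclose(np.linalg.det(A), np.prod(np.diag(L)) ** 2))
```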
2.2.4 QR factorization
The QR-factorization is one of the most useful factorizations. While we will only show what it is and give one example of what it can be used for, section 11 is dedicated to how we can find the QR-factorization efficiently, as well as its use in the QR-method, one of the most widely used methods to find the eigenvalues of a general square matrix.
Theorem 2.6. Every n × m matrix A has a matrix decomposition

A = QR

where

• R is an n × m upper triangular matrix.

• Q is an n × n unitary matrix.
If A is a square real matrix then Q is real and therefore an orthogonal matrix, Q^⊤ = Q^{-1}. A QR-factorization can be computed using for example Householder transformations or the Gram-Schmidt process. We will look more into this in section 6.4.2.
The QR-factorization has uses in for example:

• Given a QR-factorization we can solve a linear system Ax = b by solving Rx = Q^{-1}b = Q^H b, which can be done fast since R is a triangular matrix.

• QR-factorization can also be used in solving the linear least square problem.

• It plays an important role in the QR-method used to calculate eigenvalues of a matrix.
2.2.5 Canonical forms
Some factorizations are referred to as canonical forms. A canonical form is a standard
way of describing an object. A single object can have several different canonical forms
and which one we use might depend on what we want to find or which canonical form
we can easily calculate. We have already seen one example of a canonical form which we
used when looking at reducible stochastic matrices. Some other examples of canonical
forms for matrices are:
• Diagonal form (for diagonalizable matrices)
• Reduced row echelon form (for all matrices)
• Jordan canonical form (for square matrices)
• Singular value factorization form (for all matrices)
2.2.6 Reduced row echelon form
Definition 2.6
A matrix is written in reduced row echelon form when it is written in echelon form, its pivot elements are all equal to one, and all other elements in a pivot column are zero.
B = \begin{pmatrix}
0 & 1 & \star & 0 & 0 & 0 & \star & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & \star & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & \star & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & \star & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
Theorem 2.7. All matrices are row equivalent to some matrix in reduced row echelon form.
We recognize this as the result of applying Gauss-Jordan elimination to the original
matrix.
2.2.7 Jordan normal form
The Jordan normal form, while not especially useful in practical applications because of its sensitivity to rounding errors in numerical calculations, is still good to be familiar with, since it is often useful for developing mathematical theory. We start by defining the building blocks with which we build the Jordan normal form.
Definition 2.7 (Jordan block)
A Jordan block is a square matrix of the form
J_m(\lambda) = \begin{pmatrix}
\lambda & 1 & 0 & \cdots & 0 \\
0 & \lambda & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda & 1 \\
0 & 0 & \cdots & 0 & \lambda
\end{pmatrix}
We call a matrix made up of Jordan blocks a Jordan matrix:
Definition 2.8 (Jordan matrix)
A Jordan matrix is a square matrix of the form
J = \begin{pmatrix}
J_{m_1}(\lambda_1) & 0 & \cdots & 0 \\
0 & J_{m_2}(\lambda_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & J_{m_k}(\lambda_k)
\end{pmatrix}
Theorem 2.8. All square matrices are similar to a Jordan matrix. The Jordan matrix
is unique except for the order of the Jordan blocks. This Jordan matrix is called the
Jordan normal form of the matrix.
We remember that two matrices A, B are similar if there is an invertible matrix S such that A = S^{-1}BS. This means that if we have found a Jordan matrix J similar to A, they have the same eigenvalues. And since Jordan matrices are upper triangular, we can easily find the eigenvalues along the diagonal of J.
Theorem 2.9 (Some other interesting properties of the Jordan normal form). Let A = S^{-1}JS.

a) The eigenvalues of J are the same as the diagonal elements of J.

b) J has one eigenvector per Jordan block.

c) The rank of J is equal to n minus the number of Jordan blocks corresponding to the eigenvalue zero.
d) The normal form is sensitive to perturbations. This means that a small change in
the normal form can mean a large change in the A matrix and vice versa.
2.2.8 Singular value decomposition (SVD)
Singular value decomposition is a canonical form used for example in statistics and
information processing in dimension reducing methods such as Principal component
analysis or calculating pseudoinverses (see section 8.4).
Theorem 2.10. All A ∈ M_{m×n} can be factorized as

A = UΣV^H

where U and V are unitary matrices and

Σ = \begin{pmatrix} S_r & 0 \\ 0 & 0 \end{pmatrix}

where S_r is a diagonal matrix with r = rank(A). The diagonal elements of S_r are called the singular values. The singular values are uniquely determined by the matrix A (but not necessarily their order).
The singular value decomposition is related to the spectral decomposition. Given a
singular value decomposition A = UΣVH we have:
• The columns of U are the eigenvectors of AA^H.

• The columns of V are the eigenvectors of A^H A.

• The singular values on the diagonal of Σ are the square roots of the non-zero eigenvalues of both AA^H and A^H A.
Having the singular value decomposition, it is easy to solve a lot of problems found in many practical applications. For example, if we have a singular value decomposition A = UΣV^H we can:

• Calculate the pseudoinverse A^+ = VΣ^+U^H, where Σ^+ is easily found by taking the reciprocal (x^{-1}) of the non-zero elements of the diagonal and transposing the resulting matrix.

• It is also used in Principal component analysis (PCA), as we will see later.
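A hedged NumPy sketch of the pseudoinverse construction above (using the 2 × 3 example matrix D from chapter 1, and comparing with NumPy's built-in `pinv`):

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])             # 2x3 example matrix

U, s, Vh = np.linalg.svd(A)                  # A = U @ Sigma @ Vh
tol = 1e-12
s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))   # Sigma^+ is n x m
Sigma_plus[:len(s), :len(s)] = np.diag(s_plus)
A_plus = Vh.T @ Sigma_plus @ U.conj().T      # A^+ = V Sigma^+ U^H
print(np.allclose(A_plus, np.linalg.pinv(A)))     # True
```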
2.3 Applications
2.3.1 Portfolio evaluation using Monte Carlo simulation
This example uses a large amount of statistics and terminology not otherwise described in this book. For the interested reader we refer to books or courses in Monte Carlo methods and stochastic processes for the details.
Suppose we have two stocks A, B with prices at time t given by S^a(t), S^b(t). Initially we start with n_a stocks of A and n_b stocks of B. This gives initial wealth:

W_0 = n_a S^a(0) + n_b S^b(0)
We let the stocks be for a time T, after which we are interested in the new wealth of the portfolio, or more specifically how sure we can be that no great losses have occurred. We assume the stock prices can be modelled as geometric Brownian motions (GBM) with parameters S^a ∼ GBM(µ_a, σ_a) and S^b ∼ GBM(µ_b, σ_b). We have estimated correlation ρ between the returns on A and B.
To evaluate our portfolio we are interested in the probability that the total wealth of our portfolio W_T at the end of the period T has dropped by more than 10%. We can write this as

P\left( \frac{W_T}{W_0} \leq 0.9 \right)
Finding this probability analytically is rather hard, especially if we work with not 2 stocks but say 10 dependent stocks. Instead we use Monte Carlo simulation to simulate the process a large number of times, every time checking at the end if W_T ≤ 0.9 W_0.
The value of stock A at time T can be written

S_T^a = S_0^a \exp\left( (\mu_a - \sigma_a^2/2)T + \sqrt{T}\, V_a \right)

where V_a, V_b are multivariate normally distributed variables, (V_a, V_b) ∼ N(0, Σ),

Σ = \begin{pmatrix} \sigma_a^2 & \sigma_a \sigma_b \rho \\ \sigma_a \sigma_b \rho & \sigma_b^2 \end{pmatrix}
So in order to generate one sample of S_T^a, S_T^b we need to simulate one sample from N(0, Σ). We can usually rather easily simulate independent normally distributed variables N(0, I). Our goal is then to use these independent samples to generate the dependent samples N(0, Σ). The normal distribution has the property that a linear combination of normally distributed variables is itself normal. In fact, if we have normally distributed variables with mean zero and variance one, Z ∼ N(0, I), then C^⊤Z is distributed as follows:

C^⊤Z ∼ N(0, C^⊤C)

In other words, if we can find a Cholesky decomposition C^⊤C = Σ then we need only calculate the product C^⊤Z to get samples from the desired distribution. Σ is obviously real valued and symmetric, but it is also at least positive semidefinite. In fact, the only time it isn't strictly positive definite is when one of the variables is set to be an exact linear combination of the others. Since we can be almost sure Σ is a real, symmetric, positive definite matrix we can use Cholesky factorization to find a decomposition C^⊤C = Σ. We thus generate a large number of samples C^⊤Z and check how often W_T ≤ 0.9 W_0 to find the probability P(W_T ≤ 0.9 W_0).
Some things to think about:
• Can you think of another situation where you would like to take samples from
another multivariate normal distribution?
• Can you think of why the covariance matrix Σ is always a real, symmetric, positive
(semi)definite matrix?
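A minimal hedged NumPy sketch of this simulation follows; all parameter values are made up, and note that `numpy.linalg.cholesky` returns the lower triangular factor (Σ = C C^⊤), so the roles of C and C^⊤ are transposed relative to the text.

```python
import numpy as np

# Made-up example parameters: drifts, volatilities, correlation, horizon, holdings.
mu_a, sigma_a, mu_b, sigma_b, rho, T = 0.05, 0.2, 0.03, 0.3, 0.5, 1.0
n_a, n_b, S0_a, S0_b = 10, 20, 100.0, 50.0
W0 = n_a * S0_a + n_b * S0_b

Sigma = np.array([[sigma_a**2, sigma_a * sigma_b * rho],
                  [sigma_a * sigma_b * rho, sigma_b**2]])
C = np.linalg.cholesky(Sigma)               # Sigma = C @ C.T

rng = np.random.default_rng(0)
n_sim = 100_000
Z = rng.standard_normal((n_sim, 2))         # independent N(0, I) samples
V = Z @ C.T                                 # correlated N(0, Sigma) samples
ST_a = S0_a * np.exp((mu_a - sigma_a**2 / 2) * T + np.sqrt(T) * V[:, 0])
ST_b = S0_b * np.exp((mu_b - sigma_b**2 / 2) * T + np.sqrt(T) * V[:, 1])
WT = n_a * ST_a + n_b * ST_b
print("estimated P(WT <= 0.9 W0):", np.mean(WT <= 0.9 * W0))
```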
3 Non-negative matrices and graphs
Non-negative matrices are common in many applications such as when working with
stochastic processes and Markov chains as well as when working with things we can
represent with a graph, such as electrical networks, water flows, random walks, and so
on. Non-negative and positive matrices (not to be confused with positive (semi) definite matrices) have many known properties, especially concerning their eigenvalues and
eigenvectors which makes them easier to work with than general matrices.
This and the next two chapters are very much related, in that we will start by looking at some graph theory, how to represent a graph with a matrix, and some things which we can calculate here. After that we will take a look at the properties of non-negative matrices themselves in chapter 4, and last look at a special type of non-negative matrix related to random walks on graphs in chapter 5.
3.1 Introduction to graphs
A graph is a representation of a set of objects (vertices) where some pairs of objects have links (edges) connecting them. The vertices can represent a state in a Markov chain, a point on a grid, or a crossing in a road network, among other things. Likewise the edges could in the same examples represent the possible transitions in the Markov chain, the lines on the grid or the actual roads in a road network. Sometimes in the literature vertices are called nodes and edges are called links. An example of two graphs can be seen in Figure 1.
[Figure 1: A directed graph (left) and an undirected graph (right), both on the vertices A, B, C, D.]
We will concern ourselves with mainly two types of graphs: directed graphs, as in the one to the left, where every edge connecting two vertices has a direction, and undirected graphs, as in the one to the right, where the edges don't have a direction but instead only denote that the two vertices are connected. Sometimes we will allow edges which go from a vertex back to itself (so called loops) and sometimes we won't; generally loops will be allowed for directed graphs but often not when working with undirected graphs. In both cases we can also assign weights to the edges, as we will see later.
We can represent graphs such as the ones in Figure 1 with a matrix by defining the
adjacency matrix for a graph.
Definition 3.1
The adjacency matrix A of a graph G with n vertices (|V| = n) is a square n × n matrix with elements a_{i,j} such that

a_{i,j} = \begin{cases} 1, & \text{if there is an edge from vertex } i \text{ to vertex } j \\ 0, & \text{otherwise} \end{cases}
If the graph is undirected we consider every edge between two vertices as linking in
both directions.
We note one important thing: we don't define in which order we take the vertices. Sometimes it is natural to order the vertices in a certain way, for example depending on vertex labels, however any order is valid as long as you know what order you choose.
[Figure 2: A directed graph and corresponding adjacency matrix with vertices ordered {A, B, C, D, E}.]
If the graph instead was undirected we would get the graph and adjacency matrix in Figure 3. We note that we don't need to order the vertices in that order; for example the matrix
A = \begin{pmatrix}
0 & 1 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{pmatrix}
is also an adjacency matrix of the graph in Figure 3, where we instead have ordered the vertices as {C, B, A, D, E}.
[Figure 3: An undirected graph and corresponding adjacency matrix with vertices ordered {A, B, C, D, E}:

A = \begin{pmatrix}
0 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix} ]
One common addition to graphs is that we also assign weights to the edges; usually we only allow the weights to be positive real values. The weights can for example be used to represent transition probabilities in a Markov chain, the distance of a stretch of road in a road network, the time needed to travel between two vertices or the capacity in a flow network. If we have a graph with weighted edges we often talk about the distance matrix rather than the adjacency matrix.
Definition 3.2
The distance matrix D of a graph G with n vertices and weighted edges E is a square n × n matrix with elements d_{i,j} such that

d_{i,j} = \begin{cases} E_{i,j}, & \text{if there is an edge from vertex } i \text{ to vertex } j \\ 0, & \text{otherwise} \end{cases}

where E_{i,j} is the weight of the edge from vertex i to vertex j.
As with the adjacency matrix, if the graph is undirected we consider every edge
between two vertices as linking in both directions.
An example of a weighted directed graph and corresponding distance matrix can be seen in Figure 4.
As in the case with the adjacency matrix we don't have a defined order of the vertices, so we can use any order as long as we know what order we use. While there would be no problem using negative or non-real weights on the edges in the matrix representation of the graph, in practice we still usually consider only the cases with only positive weights. This is because we have a lot of useful theory concerning specifically non-negative matrices (which we get if we allow only positive weights). In most real world examples where we are interested in representing the problem with a graph we also naturally end up with only positive weights, such as when the weights represent for example distance, time, probability, capacity or a flow.
[Figure 4: A weighted directed graph and corresponding distance matrix with vertices ordered {A, B, C, D, E}.]
3.2 Connectivity and irreducible matrices
It’s often interesting to see not only if there is an edge between two vertices but also if
there is a connection between two vertices if we are allowed to traverse more than one
edge. This is interesting both in the case of directed graphs as well as undirected ones:
can we get from A to B by following the edges? And if so in how many steps? Or what
is the shortest distance between A and B?
We take a look at so called paths in a graph.
Definition 3.3
Given a weighted or unweighted graph.
• A path in a graph is a sequence of vertices v_1, v_2, ..., v_n such that for every vertex in the sequence there is an edge to the next vertex in the sequence.

• If the first vertex in a path is the same as the last vertex we call it a cycle.

• The length of a path is the number of edges in the path (counting multiple uses of the same edge).
We take a look at the adjacency matrix although the same is true for the distance
matrix as long as we only allow positive weights.
Theorem 3.1. If the element a^{(k)}_{i,j} of the adjacency matrix A to the power of k is non-zero, there is a path of length exactly k from vertex i to vertex j. In other words:

a^{(k)}_{i,j} > 0 ⇒ there is a path of length k from vertex i to vertex j
The proof is left as an exercise.
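A small hedged NumPy sketch of this idea, using the adjacency matrix from Figure 3 (vertices ordered {A, B, C, D, E}) and a matrix power to test for paths of a given length:

```python
import numpy as np

A = np.array([[0, 1, 0, 1, 0],
              [1, 1, 1, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]])             # adjacency matrix of Figure 3

k = 3
Ak = np.linalg.matrix_power(A, k)           # element (i,j) counts walks of length k
i, j = 0, 4                                 # vertices A and E
print("walks of length", k, "from A to E:", Ak[i, j])
print("path of length", k, "exists:", Ak[i, j] > 0)
```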
The problem of finding a path, or the existence of a path, between two vertices is an important problem in many applications regarding graphs, for example in finding the shortest path in a road network, as well as finding if and how likely it is to get from one state to another in a Markov chain. We divide graphs into two different groups depending on whether we can find a path from all vertices to all other vertices.
Definition 3.4
Given a weighted or unweighted graph.
• Two vertices are said to be connected if there exists a path from either vertex to the other in the undirected graph.

• A graph is said to be connected if it is connected for all pairs of vertices in the undirected graph.

• If it is also connected for all pairs in the directed graph we say that it is strongly connected.
We see that if a graph is connected, or strongly connected in the case of directed graphs, we have a guarantee that we have a path from any vertex to any other. Before continuing we take a short look at connected parts of a graph.
Definition 3.5
Given a (possibly weighted) graph:

• A connected component in an undirected graph is a maximal part of the graph where all vertices are connected with each other.

• A strongly connected component is a maximal part of the directed graph which is strongly connected.
Examples of connected components in an undirected graph can be seen in Figure 5 below. We see that the connected components can themselves be considered connected graphs. An example of strongly connected components in a directed graph can be seen in Figure 6. How many edges would you need to make the graph strongly connected?
[Figure 5: An undirected graph with 3 connected components.]

[Figure 6: A directed graph with 3 strongly connected components.]

We are now ready to move over to the matrix side of things. We look at general square non-negative matrices, to which both the adjacency matrix and the distance matrix belong if we only allow positive weights.

Theorem 3.2. A non-negative square n × n matrix A is said to be irreducible if any of the following equivalent properties is true:

1. The graph corresponding to A is strongly connected.

2. For every pair i, j there exists a natural number m such that a^{(m)}_{i,j} > 0.

3. (I + A)^{n-1} is a positive matrix.

4. A cannot be conjugated into a block upper triangular matrix by a permutation matrix P.
We leave the last property for later and give a short motivation of why the first 3 are equivalent. It is clear that since the graph is strongly connected (1), there exists a path from any vertex to any other vertex. However, if the length of the path between vertex i and vertex j is m, we know that a^{(m)}_{i,j} > 0. So if (1) is fulfilled so is (2), and if (2) is fulfilled we know there is a path between all vertices, so the graph needs to be strongly connected as well (1). For the third we can take the binomial expansion and get:

(I + A)^{n-1} = \sum_{k=0}^{n-1} \binom{n-1}{k} A^k

Since \binom{n-1}{k} is positive, the matrix is positive if for every pair i, j there is at least one 0 ≤ k ≤ n − 1 such that a^{(k)}_{i,j} > 0. However, since the longest possible path without repeating any vertices is n − 1 (go through all vertices in the graph), we see that this is true exactly when (2) is true, since we need not check any m larger than n − 1.
The question of whether a non-negative matrix is irreducible or not is a very important property which we will use repeatedly in the coming chapters, both with regards to graphs as well as general non-negative matrices. Having a clear understanding of what is and isn't an irreducible matrix, and how they relate to graphs and their distance or adjacency matrix, will give a much easier understanding when we talk about Perron-Frobenius theory and Markov chains later.
3.3 The Laplacian matrix
The Laplacian matrix is another matrix representation of graphs with no self loops (so called simple graphs). It can be used, among other things, to find the resistance between nodes in an electrical network, to estimate the robustness of a network, or to find if there are any "bottlenecks" in a network. It is also useful in finding some other properties of the graph, and it appears naturally in for example the stiffness matrix of simple spring networks.
Before defining the Laplacian matrix we need to define another matrix, namely the
degree matrix of a graph.
Definition 3.6
The degree matrix of an undirected graph is a diagonal matrix where the elements on the diagonal are the number of edges connected with that vertex. In other words, given a graph G with n vertices the degree matrix D is an n × n matrix with elements

d_{i,j} := \begin{cases} \deg(i), & i = j \\ 0, & i \neq j \end{cases}

where deg(i) is the number of edges connected with vertex v_i.

• If there are loops present (vertices linking to themselves) we count those twice for the degree.

• If the graph is directed we use either the in-degree (number of edges pointing towards the vertex) or the out-degree (number of edges pointing out from the vertex). Which one we use depends on the application.

• With a directed graph we also only count loops once rather than twice.
You can see an undirected graph and its degree matrix in Figure 7, and a directed graph with its in-degree matrix in Figure 8.

[Figure 7: Undirected graph on vertices 1, 2, 3, 4 and corresponding degree matrix D = diag(3, 4, 1, 2).]

[Figure 8: Directed graph on vertices 1, 2, 3, 4 and corresponding degree matrix using in-degree, D = diag(2, 3, 1, 0).]

Using this and the adjacency matrix we can define the Laplacian matrix:
Definition 3.7
The Laplacian matrix L of an undirected graph with no self loops is defined as

L := D − A

where D is the degree matrix and A is the adjacency matrix. This gives elements l_{i,j} of L:

l_{i,j} := \begin{cases} \deg(i), & i = j \\ -1, & i \neq j, \ v_i \text{ links to } v_j \\ 0, & \text{otherwise} \end{cases}

where deg(i) is the number of edges connected with vertex v_i.
An example of a graph and corresponding Laplacian matrix can be seen in Figure 9.

[Figure 9: Simple undirected graph on vertices 1, 2, 3, 4 and corresponding Laplacian matrix

L = \begin{pmatrix} 3 & -1 & -1 & -1 \\ -1 & 2 & 0 & -1 \\ -1 & 0 & 1 & 0 \\ -1 & -1 & 0 & 2 \end{pmatrix} ]

Some properties of the Laplacian matrix L and its eigenvalues λ_1 ≤ λ_2 ≤ ... ≤ λ_n:

• L is positive semidefinite (x^⊤Lx ≥ 0).

• λ_1 = 0, and the number of connected components is its algebraic multiplicity.

• The smallest non-zero eigenvalue is called the spectral gap.

• The second smallest eigenvalue of L is the algebraic connectivity of the graph.
We immediately see that the eigenvalues of the Laplacian seem to give us a lot of information about the graph. This is especially important since L is positive semidefinite, which makes more effective calculations of the eigenvalues possible; this might be important if we need to work with very large graphs. We will look more into how we can use the Laplacian matrix in the application part.
3.4 Exercises
3.1.
3.5 Applications
3.5.1 Shortest path
A common problem when working with networks such as a road network or in routing on
the internet is the shortest path problem. You could for example want to find the best
order in which to visit a number of places (the traveling salesman problem) or maybe
you want to find the shortest path between two places such as in the case of GPS in
cars. There are a number of related problems and solutions depending on whether we need the shortest distance between all pairs of vertices or whether a subset is enough, such as a group of vertices or a single pair of vertices.
One common algorithm for finding the shortest path from one vertex to all others is Dijkstra's algorithm, which most other similar methods are based upon. The method works by assigning a distance from the initial vertex to every other vertex and then traveling through the graph, always choosing as the next vertex the unvisited vertex that minimizes the distance. At every step we update the distance to every vertex linked to by the current one.
More exactly, the method works through the following steps using the distance matrix of the graph:
1. Assign distance zero to the initial vertex and infinity to all other vertices.
2. Mark all vertices as unvisited and set the initial vertex as the current vertex.
3. For the current vertex, calculate the distance to all its unvisited neighbors as the current distance plus the distance to the neighbor. Update these vertices' distances if the newly found distance is lower.
4. Mark the current vertex as visited.
5. If the smallest distance among the unvisited vertices is infinity: stop! Otherwise set the unvisited vertex with the smallest distance as the current vertex and go to step 3.
If we are only interested in the shortest distance from the initial vertex to one specific vertex, we can stop as soon as the distance to the target vertex is less than the distance to the next current vertex. However, for the method to work we need to make a few assumptions, and there are also a couple of properties of the graph itself that say something about the expected result of the algorithm. Take a few minutes to think through the following questions and how they affect the result or what assumptions we need.
• What could happen if there are negative weights in the graph?
• What if the graph is directed? Should a transportation network be considered
directed or undirected?
• If we know the graph is or isn’t irreducible, can we say something about the result
beforehand?
3.5.2 Resistance distance
When planning a large electrical network it's important to try to get as low a resistance between vertices as possible. It is however not an easy task to use the common rules for combining resistors when the network gets large, and we might want to know how to expand the network in order to lower the resistance not just between two specific vertices but overall.
The resistance distance between two vertices vi , vj in a graph G is defined as the
effective resistance between them when each edge is replaced with a 1 Ω resistor.
We define

Γ = L + \frac{1}{n}\mathbf{1}

where L is the Laplacian matrix, n is the number of vertices in G and 1 is the n × n matrix with all elements equal to one.
The elements of the resistance distance matrix Ω are then:

(\Omega)_{i,j} = \Gamma^{-1}_{i,i} + \Gamma^{-1}_{j,j} - 2\Gamma^{-1}_{i,j}.
For example, consider the graph and corresponding Laplacian matrix in Figure 10 below:
L = \begin{pmatrix} 3 & -1 & 0 & -1 & -1 \\ -1 & 3 & -1 & 0 & -1 \\ 0 & -1 & 3 & -1 & -1 \\ -1 & 0 & -1 & 3 & -1 \\ -1 & -1 & -1 & -1 & 4 \end{pmatrix}

Figure 10: Undirected graph on the vertices A, B, C, D, E and corresponding Laplacian matrix (rows and columns ordered A, B, C, D, E).
We get Γ = L + (1/5)1 and its inverse:

\Gamma^{-1} = \begin{pmatrix}
0.4267 & 0.1600 & 0.0933 & 0.1600 & 0.1600 \\
0.1600 & 0.4267 & 0.1600 & 0.0933 & 0.1600 \\
0.0933 & 0.1600 & 0.4267 & 0.1600 & 0.1600 \\
0.1600 & 0.0933 & 0.1600 & 0.4267 & 0.1600 \\
0.1600 & 0.1600 & 0.1600 & 0.1600 & 0.3600
\end{pmatrix}
We get the resistance distance matrix by computing (\Omega)_{i,j} = \Gamma^{-1}_{i,i} + \Gamma^{-1}_{j,j} - 2\Gamma^{-1}_{i,j} for every element:
\Omega = \begin{pmatrix}
0 & 0.5333 & 0.6667 & 0.5333 & 0.4667 \\
0.5333 & 0 & 0.5333 & 0.6667 & 0.4667 \\
0.6667 & 0.5333 & 0 & 0.5333 & 0.4667 \\
0.5333 & 0.6667 & 0.5333 & 0 & 0.4667 \\
0.4667 & 0.4667 & 0.4667 & 0.4667 & 0
\end{pmatrix}
And we have found the effective resistance between vertices as the values of the elements in the matrix. We can see that we have the largest resistance between the vertices A, C and B, D, which is quite natural since they are the ones furthest from each other. Between opposite vertices we can also easily check the result, since the network can be seen as two resistors in series, each made up of three parallel-connected resistors. This would give the resistance:
R_{A,C} = \frac{R_1 R_2 R_3}{R_1 R_2 + R_1 R_3 + R_2 R_3} + \frac{R_4 R_5 R_6}{R_4 R_5 + R_4 R_6 + R_5 R_6} = \frac{1}{1+1+1} + \frac{1}{1+1+1} = \frac{2}{3}
since all the resistances are 1 from the definition of resistance distance. Checking the result for the resistance between, for example, A and E is harder using the usual computation rules though.
4 Perron-Frobenius theory
4.1 Perron-Frobenius for non negative matrices
Perron-Frobenius theory gives information about many of the most essential properties of non-negative square matrices. We will start by formulating Theorem 4.1, which is essential to this chapter. While it is not immediately easy to see why these properties are useful, we will see throughout the chapter why they are important and how and when we can use them. Be sure to review what we mean by irreducible matrices if you're not sure, since irreducibility is one of the recurring themes of this chapter.
Theorem 4.1. Let A be a square non-negative irreducible matrix with spectral radius
ρ(A) = r. Then we have:
1. r, called the Perron-Frobenius eigenvalue, is an eigenvalue of A. r is a positive
real number.
2. The (right) eigenvector x belonging to r, Ax = rx, called the (right) Perron vector, is positive (x > 0).
3. The left eigenvector y belonging to r, called the left Perron vector, is positive (y^⊤A = ry^⊤, y > 0).
4. No eigenvectors except those belonging to r have only positive elements.
5. The Perron-Frobenius eigenvalue r has algebraic and geometric multiplicity 1.
6. The Perron-Frobenius eigenvalue r satisfies: \min_i \sum_j a_{i,j} \le r \le \max_i \sum_j a_{i,j}.
We remember that the spectral radius is the largest (in absolute value) eigenvalue. The theorem says a lot about this "largest" eigenvalue as well as the corresponding eigenvector. It is also often called the dominant eigenvalue, and its value and eigenvector are in many applications among the most important things to know about the system, as can be seen in the application part. We give a short motivation for the correctness of the theorem.
Given (2) we can easily prove that the eigenvalue r is positive: if x is a positive vector and A is non-negative and irreducible, then Ax > 0 (no row of A is all zeroes). Since Ax = rx if r is the eigenvalue belonging to the eigenvector x, we have r > 0: if r were negative, zero or complex then rx could not be a positive vector, but Ax is.
We also prove (4), that the only positive eigenvectors are the ones belonging to r: let A be a non-negative irreducible square matrix, set ρ(A) = r and suppose y > 0 is the left Perron vector, y^⊤A = ry^⊤. Assume we have a (right) eigenpair (λ, v) with v ≥ 0, v ≠ 0. This gives:

r y^⊤v = y^⊤Av = λ y^⊤v ⇒ r = λ

where the last step uses that y^⊤v > 0, since y > 0 and v ≥ 0 is non-zero.
We give no further proof, but leave some of the other statements as exercises. Initially the theorem was stated for positive matrices only and later generalised to irreducible non-negative matrices. It's easy to see that if A is positive it is also non-negative and irreducible.
Theorem 4.2. If A is positive, in addition to all properties for irreducible non-negative
matrices, the following is true as well:
• The Perron-Frobenius eigenvalue r has the strictly largest absolute value of all eigenvalues. I.e., for any eigenvalue λ ≠ r we have |λ| < r.
This last property, that there is only one eigenvalue on the spectral radius, is a very useful property as well. In the section on Markov chains we will see that it determines convergence properties. However, this was stated for positive matrices, and it isn't true for general non-negative irreducible matrices, as in the example below.
We look at the permutation matrix:

P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}

It's obviously non-negative and irreducible. From Perron-Frobenius we get that the spectral radius ρ(P) = r = 1, since both row sums equal one. However, the eigenvalues can easily be found to be ±1, which are both on the unit circle.
While there are irreducible matrices with more than one eigenvalue on the spectral radius, there are also irreducible matrices with only one. We let this last property, having or not having only one eigenvalue on the spectral radius, divide the non-negative irreducible matrices into two different classes. We say that a non-negative irreducible matrix is primitive if it has only one eigenvalue with absolute value equal to the spectral radius. If a non-negative irreducible matrix is not primitive, it is imprimitive.
Theorem 4.3. We let A be a square non-negative irreducible matrix, then:
• A is primitive if and only if there is only one eigenvalue r = ρ(A) on the spectral
radius.
• A is primitive if and only if \lim_{k\to\infty} (A/r)^k exists.
For the limit we have: \lim_{k\to\infty} (A/r)^k = \frac{xy^\top}{y^\top x} > 0, where x, y are the right and left Perron vectors respectively.
• A is imprimitive if and only if there are h > 1 eigenvalues on the spectral radius. h is called its index of imprimitivity.
We note the easily missed part that for a matrix to be either primitive or imprimitive it first needs to be non-negative and irreducible. We also see that having only one eigenvalue on the spectral radius is equivalent to the limit \lim_{k\to\infty} (A/r)^k existing.
We return to the permutation matrix:

P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}

We consider the limit \lim_{k\to\infty} (P/r)^k, given r = 1 which we got earlier. That means that if \lim_{k\to\infty} P^k exists then P is primitive. However P^2 = I, and the sequence P, P^2, P^3, . . . alternates between P and I. Since it alternates between two matrices it obviously doesn't converge, and P is therefore imprimitive. In fact all irreducible permutation matrices are imprimitive, since for all of them there is a k such that P^k = I (why?) and r = 1 since every row sums to one.
There are some common ways to tell whether a non-negative irreducible matrix is primitive:
Theorem 4.4. Let A be a non-negative irreducible matrix. Then:
1. If any diagonal element a_{i,i} > 0 then A is primitive (but A could still be primitive even if the diagonal is all zeroes).
2. A is primitive if and only if A^m > 0 for some m > 0.
3. A is primitive if and only if A is aperiodic.
Proof.
• If (1) is true it's obvious that there is an m > 0 such that A^m > 0. Looking at the graph: irreducibility guarantees there is a path from every vertex to every other vertex, and A^m > 0 means there is a path of length exactly m between every pair of vertices. Since we have a diagonal element a_{i,i} > 0, any path between two vertices passing through vertex i can be made one step longer simply by looping at vertex i once.
• Suppose A^m > 0 and that A had an eigenvalue λ ≠ r with |λ| = ρ(A). Then λ^m would be an eigenvalue on the spectral radius of A^m. However, since A^m is positive the only eigenvalue on ρ(A^m) is positive and simple, hence A is primitive. Conversely, if A is primitive then the limit \lim_{m\to\infty} A^m/r^m exists and is positive. But for this to be possible A^m must be positive for large enough m.
Especially the first statement gives a way to often immediately conclude that a matrix is in fact primitive. When the first can not be used you are left with trying the second; however, you might need to try quite large m before you can tell whether a matrix is primitive or not. To show that a matrix is imprimitive it's often easiest to show that it's periodic, as we did with the permutation matrix. The second statement can also be used, since it can be shown that if A is primitive, the smallest possible m which gives a positive matrix is bounded by m ≤ n² − 2n + 2. So if you try an m at least as large as this and the matrix still isn't positive, it's imprimitive.
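A minimal numerical check of the second criterion (our own sketch; the bound m ≤ n² − 2n + 2 is the one quoted in the text above):

import numpy as np

def is_primitive(A):
    # Check primitivity of a non-negative irreducible matrix by testing A^m > 0
    n = A.shape[0]
    bound = n * n - 2 * n + 2          # bound on m quoted in the text
    M = np.eye(n)
    B = (A > 0).astype(int)            # only the zero pattern matters
    for m in range(1, bound + 1):
        M = (M @ B > 0).astype(int)    # zero pattern of A^m
        if M.min() > 0:
            return True                # A^m > 0 for some m <= bound
    return False                       # imprimitive (assuming A is irreducible)

P = np.array([[0, 1], [1, 0]])         # the permutation matrix from the text
print(is_primitive(P))                              # False
print(is_primitive(np.array([[1, 1], [1, 0]])))     # True (positive diagonal element)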
We give a short note on imprimitive matrices as well:
Theorem 4.5. Let A be an imprimitive matrix with h eigenvalues λ_1, λ_2, . . . , λ_h on the spectral radius ρ(A) = r. Then the following are true:
• alg mult(λ_n) = 1, n = 1, 2, . . . , h.
• λ_1, λ_2, . . . , λ_h are r times the h-th roots of unity, so that:

{λ_1, λ_2, . . . , λ_h} = {r, rω, rω², . . . , rω^{h−1}},   ω = e^{i2π/h}
We will not look at the imprimitive matrices more than this, except noting that the eigenvalues on the spectral radius of imprimitive matrices are actually very special: not only do they all have algebraic multiplicity 1, they are also evenly spaced around the circle of radius ρ(A) in the complex plane.
4.2 Exercises
4.1. Consider the matrices:





1
1/2 0 1/2 0
0 0 1 0
1
 0 1/2 1/2 0 
0 0 0 1




A=
0 1 0 0 B = 1/3 1/3 0 1/3 C = 2
3
0
0
0
1
1 0 0 0
1
3
1
2
2
1
1
3

3
2

1
1
For each matrix consider:
• Is it non-negative?
• Is it irreducible?
• Is it primitive?
• What is the maximum and minimum possible value of the Perron-Frobenius eigenvalue?
• What can you say about the other eigenvalues?
4.2. Show that for all irreducible permutation matrices there is a k such that Pk = I.
Prove that all irreducible permutation matrices are imprimitive.
4.3. Consider the matrix:

A = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 1 & 1 \\ 2 & 2 & 1 \end{pmatrix}
• Calculate the eigenvalues of this matrix.
• Calculate the (right) eigenvector to the Perron-Frobenius eigenvalue.
4.3 Applications
4.3.1 Leontief input-output model
Suppose we have a closed economic system made up of n large industries, each producing
one commodity. Let a J − unit be what industry j produces that sells for one unit of
money. We note that this could be a part of one commodity, for example a car would
typically sell for many K − units.
We create a vector S with elements sj containing the produced commodities and a
matrix A with consumption of other commodities needed to create the commodities. We
have:
• Elements s_j ∈ S, 0 ≤ s_j = number of J-units produced by industry j in one year.
• Elements a_{i,j} ∈ A, 0 ≤ a_{i,j} = number of I-units needed to produce one J-unit.
• a_{i,j} s_j = number of I-units consumed by industry j in one year.
• If we sum over all j we can then find the total number of I-units that are available to the public (not consumed by the industries themselves); we call this d_i.
• d_i = s_i − \sum_{j=1}^{n} a_{i,j} s_j.
• S − AS is then a vector with the total amount available to the public of all commodities (elements d_i).
Suppose the public demands a certain amount of every commodity; we create a demand vector d = (d_1, d_2, . . . , d_n) ≥ 0. We want to find the minimal supply vector S = [s_1, s_2, . . . , s_n]^⊤ ≥ 0 that satisfies the demand. We are also interested in knowing whether we can guarantee that there actually is a solution to the system, or whether we need to make further assumptions.
We get the linear system:
S − AS = d ⇒ S = (I − A)^{-1} d
If (I − A) is invertible we can solve and find a solution relatively easily. However we don’t
know if there is one or if the solution could have negative values in S (which would be
equivalent to producing a negative amount of a commodity).
• We use the fact that if the Neumann series \sum_{n=0}^{\infty} A^n converges, then (I − A)^{-1} = \sum_{n=0}^{\infty} A^n.
• If we assume ρ(A) < 1 we get \lim_{k\to\infty} A^k = 0, which is the same as saying that the Neumann series converges.
• Since A is non-negative, \sum_{n=0}^{\infty} A^n is as well, and we have guaranteed not only a solution, but also that S ≥ 0.
• The question is what assumptions we need to make in order for ρ(A) < 1.
We look at the column sums \sum_i a_{i,j}; such a sum can be seen as the total number of units required to create one J-unit. It seems reasonable to assume \sum_i a_{i,j} ≤ 1, with strict inequality for at least one industry: otherwise some industry would go with a loss, and we can assume at least one goes with a profit.
Then there is a matrix E ≥ 0, E ≠ 0, with elements e_{i,j} such that \sum_i (a_{i,j} + e_{i,j}) = 1 (every column sum of A + E equals one). If u is the vector with all elements equal to 1, we get:

u^⊤ = u^⊤(A + E)
We assume ρ(A) ≥ 1 and use the (right) Perron vector x of A (Ax = ρ(A)x, x > 0) to get:

u^⊤x = u^⊤(A + E)x = u^⊤Ax + u^⊤Ex

This can't be true, since u^⊤Ex > 0 (E non-negative and non-zero, x > 0) and u^⊤Ax = ρ(A)u^⊤x ≥ u^⊤x (since ρ(A) ≥ 1), so the right hand side is strictly larger than the left.
So if we assume the column sums of A to be less than or equal to one, with at least one strictly less than one, we get that ρ(A) < 1 and we are finished.
We also note that if A is irreducible, then \sum_{n=0}^{\infty} A^n is even positive and not only non-negative. In other words, if we increase the demand for any commodity, we will need to increase the supply of all commodities and not only the one we increased the demand for.
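A small numerical sketch of solving the system under these assumptions (our own example; the matrix A and demand d below are made up for illustration):

import numpy as np

# Hypothetical consumption matrix: column sums <= 1, at least one < 1
A = np.array([[0.2, 0.3, 0.1],
              [0.3, 0.1, 0.3],
              [0.2, 0.3, 0.2]])
d = np.array([10.0, 5.0, 8.0])             # demand vector

print(np.max(np.abs(np.linalg.eigvals(A))))  # spectral radius < 1, so (I - A) is invertible
S = np.linalg.solve(np.eye(3) - A, d)        # supply needed to satisfy the demand
print(S)                                     # non-negative, as the theory predicts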
Take a few minutes to think through the assumptions of the system. Apart from assuming that no industry makes a loss and at least one makes a profit, we also assumed A to be irreducible. This corresponds to every commodity being directly or indirectly needed to produce every other commodity, which doesn't seem unreasonable considering that we work with large industries.
5 Stochastic matrices and Perron-Frobenius
5.1 Stochastic matrices and Markov chains
A stochastic matrix, sometimes also called transition matrix or probability matrix, is
used to describe the transitions between states in a Markov chain. Markov chains are
used in applications in a large number of fields; some examples are:
• Markov chain Monte Carlo for generating numbers from very complex probability
distributions in statistics.
• Modeling of market behaviour, such as market crashes or periods of high or low
volatility in economics.
• Population processes in for example biology, queuing theory and finance.
• Other less obvious places like in speech recognition, error correction, thermodynamics and PageRank from the previous chapter.
Our aim here is to get an understanding of how and why Markov chains behave as
they do using what we know of Perron-Frobenius theory.
We start by defining stochastic matrices and Markov chains.
Definition 5.1
A square non-negative matrix is
• Row stochastic if every row sums to one.
• Column stochastic if every column sums to one.
• If it is both row and column stochastic we say that it is doubly stochastic.
From now on we will simply say stochastic matrix when we in fact mean row stochastic matrix. We also notice that stochastic matrices need not be irreducible; however, when they are, we can immediately use Perron-Frobenius to find a lot of information regarding the matrix. A common example of doubly stochastic matrices is the permutation
matrices since only one element in every row and column is non-zero and that element
is one.
Every stochastic matrix corresponds to one Markov chain:
Definition 5.2
A Markov chain is a random process, with usually a finite number of states, where the
probability of transitions between states only depend on the current state and not any
previous state. A Markov chain is often said to be memoryless.
• Formal expression of the fact that a discrete time Markov chain is memoryless:
P (Xn+1 = x|X1 = x1 , X2 = x2 , . . . , Xn = xn )
= P (Xn+1 = x|Xn = xn )
• Every stochastic matrix defines a Markov chain and every Markov chain defines a
stochastic matrix.
You can see a discrete time Markov chain as a sequence of states X0 , X1 , X2 , . . .
happening one after another in discrete time and the state in the next step depends only
on the current state. So if our Markov chain has 3 different states, then this corresponds to a 3 × 3 stochastic matrix with every element a_{i,j} corresponding to the transition probability from state x_i to state x_j, i.e., a_{i,j} = P(X_{n+1} = j | X_n = i). Another way to
represent a Markov chain is with a directed graph by considering the stochastic matrix
as the graph’s distance matrix, where the distance is now a probability rather than a true
distance. In this case, with the edges representing transition probabilities, it’s natural
to only consider graphs with positive weights.
As an example of a Markov chain, let us consider this simple weather model:
• We know the weather on day zero to be sunny.
• If it’s sunny one day it will be sunny the next with probability 0.7, cloudy with
probability 0.2 and rainy with probability 0.1.
• If it’s cloudy it will be sunny with probability 0.3, cloudy with probability 0.4 and
rainy with probability 0.3.
• If it’s rainy it will be sunny with probability 0.2, cloudy with probability 0.3 and
rainy with probability 0.5.
Seeing this in text makes it quite hard to follow, especially if we would like to add more states. However, since the weather tomorrow only depends on the weather today, we can describe the model with a Markov chain and corresponding stochastic matrix. We order the states in the stochastic matrix from sunny to rainy and put in the transition probabilities to get the stochastic matrix P and its corresponding graph below. The vector X_0 represents the probability distribution of the weather on day zero.

(sunny, cloudy, rainy), \quad X_0 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad P = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}

[Graph of the weather Markov chain with states sunny, cloudy, rainy and edges labeled by the transition probabilities in P.]
When we worked with the distance matrix and adjacency matrix of graphs we saw that we could take powers of the matrix to find out if there is a path between vertices. In the same way we can take powers of the transition matrix P, and if an element p_{i,j} of P^k is non-zero then it's possible to get from state i to state j in k steps in the corresponding Markov chain. However we can say even more in the case of stochastic matrices: we can tell not only whether it's possible to be in a state after k steps, but also how likely it is. We can easily see that if we multiply the initial state (or distribution) X_0^⊤ with P from the right, we get the probabilities of being in each of the states in the next step, X_1^⊤. Generally, since we work with probabilities, the initial distribution should be non-negative and sum to one.

X_1^\top = X_0^\top P = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix} = \begin{pmatrix} 0.7 & 0.2 & 0.1 \end{pmatrix}
And we see there is only a probability 0.1 that it will rain tomorrow provided it is sunny today. Now we can of course multiply this new vector X_1 in the same way to get the probabilities for the weather on the day after tomorrow.

X_2^\top = X_0^\top P^2 = X_1^\top P = \begin{pmatrix} 0.7 & 0.2 & 0.1 \end{pmatrix} \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix} = \begin{pmatrix} 0.57 & 0.25 & 0.18 \end{pmatrix}
And we get the probability of rain two days from now as 0.18 if it's sunny today. So to get the probability distribution of the weather on day k we can take the transition matrix to the power k and multiply with the initial distribution X_0^⊤ from the left:

X_k^\top = X_0^\top P^k

Sometimes the transposed transition matrix is used instead, X_k = (P^⊤)^k X_0; both obviously work equally well.
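A quick check of these computations (our own sketch, using NumPy):

import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
x0 = np.array([1.0, 0.0, 0.0])                 # sunny on day zero

print(x0 @ P)                                  # day 1: [0.7, 0.2, 0.1]
print(x0 @ np.linalg.matrix_power(P, 2))       # day 2: [0.57, 0.25, 0.18]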
• One important question in the theory of Markov chains is whether a stationary distribution π^⊤ = π^⊤P exists. Or, in other words, whether there is a distribution for which the distribution in the next step is the same as the distribution in the current step.
• We are also interested to know if the limit π^⊤ = \lim_{k\to\infty} X_0^⊤P^k exists. If it exists, it means that if we iterate the Markov chain for long enough, we will eventually arrive at a stationary distribution π^⊤ = π^⊤P.
• The stationary distribution can also be seen as the distribution of the time the Markov chain spends in every state if it is iterated a large number of times.
• Last, when the limit exists, we are also interested to know whether it converges towards the same stationary distribution π regardless of the initial state X_0.
In order to answer these questions we divide the stochastic matrices into 4 different classes:
1. Irreducible with \lim_{k\to\infty} P^k existing (P is primitive).
2. Irreducible with \lim_{k\to\infty} P^k not existing (P is imprimitive).
3. Reducible with \lim_{k\to\infty} P^k existing.
4. Reducible with \lim_{k\to\infty} P^k not existing.
5.2 Irreducible, primitive stochastic matrices
If P is irreducible we can use Perron-Frobenius for irreducible non-negative matrices. From this we get that there is one eigenvector with only positive values, namely the Perron-Frobenius eigenvector with eigenvalue r. From the same theorem we also have that r = 1, since r is bounded by the minimum and maximum row sums, both of which are equal to one since P is a stochastic matrix. The stationary distribution is then the same as the left Perron vector, since we have y^⊤ = y^⊤P if r = 1. So if P is irreducible, a stationary distribution exists, and it is unique.
We remember that the requirement for an irreducible matrix to be primitive is that the limit \lim_{k\to\infty} (P/r)^k exists. Since r = ρ(P) = 1, the limit \lim_{k\to\infty} P^k exists if P is primitive, and thus π^⊤ = \lim_{k\to\infty} X_0^⊤P^k must exist as well.
Using Perron-Frobenius for primitive matrices, for P we get the (right) Perron vector (x = Px) as x = e/n, where e is the vector with every element equal to one and n is the number of states. You can easily verify this since r = 1 and every row sums to one. If we let π = (π_1, π_2, . . . , π_n) be the left Perron vector (π^⊤ = π^⊤P) we get the limit:

\lim_{k\to\infty} P^k = \frac{(e/n)\pi^\top}{\pi^\top (e/n)} = e\pi^\top > 0

Calculating the probability distribution of the limit gives:

\lim_{k\to\infty} X_0^\top P^k = X_0^\top e\pi^\top = \pi^\top
And we see that if P is irreducible and primitive, not only does the stationary distribution π exist and is unique, we also get that the Markov chain converges towards this distribution regardless of the initial distribution X_0. Additionally, Perron-Frobenius gives us an easy way to calculate the stationary distribution, since it's equal to the left Perron vector of P. We give a short example using the earlier weather model.
We had the transition matrix:

P = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}
Since P is positive, it's obviously irreducible and primitive as well, and as such we know that there is a stationary distribution and that the distribution of the Markov chain will converge towards it. For the limit we get:

\lim_{k\to\infty} X_0^\top P^k = X_0^\top e\pi^\top = \pi^\top
To find the stationary distribution we calculate the left Perron vector π, which yields π = [0.46 0.28 0.26]^⊤, our stationary distribution. From this we can say that on average it will be sunny 46% of the time, cloudy 28% of the time and rainy 26% of the time.
5.3 Irreducible, imprimitive stochastic matrices
Since P is irreducible we know that there exists a stationary distribution and that it's unique. However, since P is imprimitive the limit \lim_{k\to\infty} P^k doesn't exist, and the Markov chain therefore doesn't necessarily converge towards the stationary distribution. However, we can easily see that the (right) Perron vector is the same as in the primitive case (since every row sums to one). Similarly, finding the stationary distribution can be done in the same way for imprimitive matrices as for primitive ones, namely as the left Perron vector.
While we in the primitive case would also have been able to find the stationary distribution simply by iterating the chain a large number of times, this is not the case when working with imprimitive matrices.
Even though the Markov chain doesn't converge, we are still interested in seeing how the Markov chain behaves if we let it iterate for a long time. This is useful for example when we want to sample from a distribution using an imprimitive Markov chain in Markov chain Monte Carlo methods. While the limit \lim_{k\to\infty} P^k doesn't exist, what we are interested in isn't necessarily the limit itself but the distribution of the time the Markov chain spends in the different states if we let it iterate a large number of times. For primitive matrices this is the same as the limiting distribution. In the imprimitive case we can calculate the Cesàro sum instead:

\lim_{k\to\infty} \frac{1}{k}\sum_{j=0}^{k} P^j

This converges and we get:

\lim_{k\to\infty} \frac{1}{k}\sum_{j=0}^{k} P^j = e\pi^\top > 0
We multiply the limit with the initial vector X_0 and get, just like in the primitive case:

X_0^\top \lim_{k\to\infty} \frac{1}{k}\sum_{j=0}^{k} P^j = X_0^\top e\pi^\top = \pi^\top
In other words the stationary distribution π exists and is the same regardless of the choice of initial state X_0. The Markov chain doesn't converge to it but rather 'oscillates' around it. More on why the Cesàro sum always exists for these matrices can be read in [?].
We give a short example using our previous small permutation matrix:

P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}

We already saw that it's imprimitive and therefore the limit doesn't exist. However, the limit of the Cesàro sum does exist and we still get the stationary distribution π^⊤ as the left Perron vector. This gives the stationary distribution π = [0.5 0.5]^⊤.
So while the Markov chain won't converge, it will on average spend half the time in state 1 and half the time in state 2. In fact we can easily see that it spends exactly every second step in state 1 and every other step in state 2.
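A small numerical illustration of the Cesàro average for this matrix (our own sketch):

import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Powers of P alternate between P and I, so P^k does not converge,
# but the Cesaro average (1/k) * sum of P^j does.
k = 1000
avg = sum(np.linalg.matrix_power(P, j) for j in range(k)) / k
print(avg)   # approaches [[0.5, 0.5], [0.5, 0.5]] = e * pi^T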
5.4 Reducible, stochastic matrices
Reducible matrices are not as common to work with in practice as irreducible matrices, often because the model itself is constructed in such a way as to ensure it produces an irreducible matrix. However, sometimes it's desirable to use a reducible chain, for example an absorbing Markov chain (one or more states which we can't leave after the Markov chain ends up in them) in risk analysis.
Since we can't immediately apply Perron-Frobenius to reducible Markov chains, we for example don't know whether there exists any stationary distribution, or whether there perhaps exist many. What we can do is try to get as close as possible to the irreducible case.
We recall that a reducible matrix A can be permuted into block upper triangular form by a permutation matrix P:

P^\top A P = \begin{pmatrix} X & Y \\ 0 & Z \end{pmatrix}
If X or Z is reducible as well we can permute the original matrix further until we end up with:

P^\top A P = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,k} \\ 0 & X_{2,2} & \cdots & X_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_{k,k} \end{pmatrix}
where X_{i,i} is either irreducible or a single zero element. One thing we can immediately see already is that if we leave, for example, block X_{2,2} we can never come back, since none of the states in any block X_{i,i} have any connection to any of the previous blocks X_{j,j}, j < i.
Last, we look for blocks where only the diagonal block X_{i,i} contains non-zero elements and permute those to the bottom, so we end up with:

P^\top A P = \begin{pmatrix}
X_{1,1} & X_{1,2} & \cdots & X_{1,j} & X_{1,j+1} & X_{1,j+2} & \cdots & X_{1,k} \\
0 & X_{2,2} & \cdots & X_{2,j} & X_{2,j+1} & X_{2,j+2} & \cdots & X_{2,k} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & X_{j,j} & X_{j,j+1} & X_{j,j+2} & \cdots & X_{j,k} \\
0 & 0 & \cdots & 0 & X_{j+1,j+1} & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & X_{j+2,j+2} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & X_{k,k}
\end{pmatrix}
This is called the canonical form of the stochastic matrix, and we notice a couple of things. First we remind ourselves that this is still the same Markov chain; we have just relabeled the vertices by permuting the matrix using a permutation matrix, and we remind ourselves that the blocks X_{i,j} are matrices and not necessarily single elements. We also note that this form is generally not unique, since we could swap the positions of two different blocks in the bottom right part, or sometimes in the top left as well. Using this form we can say some things about the behaviour of the Markov chain which might not be as apparent otherwise.
We call the states in the upper left the transient states. Whenever we leave one of the diagonal blocks X_{i,i}, i ≤ j, we can never get back to the same or any previous block since all elements below the diagonal blocks are zeroes. Also, for every block X_{i,i} in the top left, at least one of the blocks X_{i,i+1}, . . . , X_{i,k} is non-zero (or it would belong to the bottom right part). This means that if we start in any state in the top left part, we will eventually end up in one of the blocks in the bottom right.
We call the states in the bottom right the ergodic states or the absorbing states, since once you enter any of the blocks in the bottom right part you can never leave it: either because it's a single vertex only linking back to itself, or because the block is in itself an irreducible stochastic matrix.
The first thing to note is that the initial state obviously influences the behaviour of a reducible Markov chain; for example, if we start in one of the ergodic states we obviously get a different sequence than if we start in another ergodic state. So even if there were to exist a stationary distribution, we can expect it to depend on the initial state.
Often with reducible Markov chains we aren't as interested in finding where the chain is after a long time (especially if there is only one absorbing state); we are often more interested in how long we expect it to take to reach an absorbing state, or in the probability of ending up in a certain absorbing state when there is more than one. One example would be when modelling the failure rate of a machine: we know it will fail eventually, but for how long can we expect it to run?
Assume a reducible stochastic matrix A is written in canonical form:

A = \begin{pmatrix} X & Y \\ 0 & Z \end{pmatrix}
We can then find the limit \lim_{k\to\infty} A^k if all blocks in Z are primitive matrices, or the corresponding limit of the Cesàro sum if at least one of them is imprimitive. We look at the case where all blocks in Z are primitive. The limit can then be written as:

\lim_{k\to\infty} A^k = \begin{pmatrix} 0 & (I - X)^{-1}YE \\ 0 & E \end{pmatrix}, \qquad E = \begin{pmatrix} e\pi_{j+1}^\top & 0 & \cdots & 0 \\ 0 & e\pi_{j+2}^\top & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & e\pi_{k}^\top \end{pmatrix}
Obviously we don’t find a unique stationary distribution (since every block in Z is
absorbing), however the limit is still very useful. First of all we see that E contains
the individual Perron vectors for the blocks in Z. So if we start in one of the absorbing
blocks we find a stationary distribution unique to that block. In fact in that case we can
obviously disregard the rest of the Markov chain and only concern ourselves with that
one block if we have the transition matrix in canonical form.
Secondly, the elements a_{p,q} of (I − X)^{-1}Y hold the probability that, starting in p, we leave the transient states and eventually hit state q in one of the absorbing blocks. If q is a single absorbing state, a_{p,q} is thus the probability of ending up in q when starting in p. This is commonly called the hitting probability of q starting in p.
The vector (I − X)^{-1}e is of particular interest as well. It holds the average number of steps needed to leave the transient states; we call these the hitting times for reaching the absorbing states. We take a short look at hitting times and hitting probabilities in the next section from a slightly different perspective to understand why this is correct.
5.5 Hitting times and hitting probabilities
When we have a reducible Markov chain (with a finite number of states) we know that we will eventually reach an absorbing state. We consider a stochastic matrix P with transient states D and two absorbing states A, B. We want to find the hitting probability h_i that we end up in A starting in i ∈ D.
We illustrate how to find such a hitting probability with a short example where we consider the Markov chain described by the graph below.
[Graph: states 1, 2, 3, 4 where 1 and 4 are absorbing; from state 2 the chain moves to 1 with probability 1/3 and to 3 with probability 2/3, and from state 3 it moves to 2 with probability 2/3 and to 4 with probability 1/3.]
We ask ourselves: what is the probability to end up in 1 starting in 2? We write this
probability as hi = Pi (hit 1).
After one step we are either in state 1 with probability 1/3 or we are in 3 with
probability 2/3. In other words we could write h2 as:
h_2 = P_{2,1} h_1 + P_{2,3} h_3 = \frac{1}{3}h_1 + \frac{2}{3}h_3
But h_1 is the probability of hitting 1 starting in 1; since we are already in 1, this probability is obviously 1. We then continue and find the expression for h_3, which can be written:
h_3 = P_{3,4} h_4 + P_{3,2} h_2 = \frac{1}{3}h_4 + \frac{2}{3}h_2
Since state 4 is absorbing, the probability h_4 of hitting 1 from 4 is zero and we get the equation system:

\begin{cases} h_2 = \frac{1}{3} + \frac{2}{3}h_3 \\ h_3 = \frac{2}{3}h_2 \end{cases}
Solving this gives the hitting probability h2 = 3/5. While this method can be done by
hand with a low number of states, it is mostly used to find a recurrence relation when
we have an infinite number of states such as a random walk on the integers. If we have
a finite number of states and thus can write out our transition matrix P we can find the
hitting probability using the following:
Theorem 5.1. Consider the Markov chain with transition matrix P and n states. Let A be a subset of states. The vector of hitting probabilities h_i^A = P_i(hit A) is the minimal non-negative solution to the linear system:

\begin{cases} h_i^A = 1, & i \in A \\ h_i^A = \sum_{j=1}^{n} p_{i,j} h_j^A, & i \notin A \end{cases}

By minimal solution we mean that for any other solution x to the system we have x_i ≥ h_i.
By writing P in canonical form we can verify that this corresponds to the result in the previous section. The minimality is needed in order to get h_i^A = 0 in absorbing states other than those in the absorbing block we seek. When we solve the system we can set h_i^A = 0 if i is any absorbing state not in A. We note that it does not matter whether A consists of absorbing states, since we are not interested in what happens after we have entered A. In terms of the previous example we could use the theorem to find the probability that we reach state 3 at some point before reaching an absorbing state.
Next we continue with our example but instead ask ourselves: What is the expected
number of steps starting in 2 until we hit either of the absorbing states? We define the
hitting time:
ki = Ei (time to hit {1,4} )
As with the hitting probability we look at the next step but now also add a 1 for the
time it takes to reach the next step:
k_2 = P_{2,1} k_1 + P_{2,3} k_3 + 1 = \frac{1}{3}k_1 + \frac{2}{3}k_3 + 1 = \frac{2}{3}k_3 + 1
where we noted that k_1 = 0 since we then have already reached one of the absorbing states. Next we continue with k_3 and get:

k_3 = P_{3,4} k_4 + P_{3,2} k_2 + 1 = \frac{1}{3}k_4 + \frac{2}{3}k_2 + 1 = \frac{2}{3}k_2 + 1
Solving this system gives the expected number of steps until absorption as k_2 = 3. In the same way as for hitting probabilities we get in general:

Theorem 5.2. Consider the Markov chain with transition matrix P, n states and a set of absorbing states A. The vector of hitting times k_i^A = E_i(time to hit A) is the minimal non-negative solution to the linear system:

\begin{cases} k_i^A = 0, & i \in A \\ k_i^A = 1 + \sum_{j=1}^{n} p_{i,j} k_j^A, & i \notin A \end{cases}
Once again we can write the system in canonical form to verify that this corresponds to the result in the previous section. We do note however that care needs to be taken so that we don't have other absorbing states outside of A, since we would then get the relation k_j = 1 + k_j for a single absorbing state outside of A.
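For the small four-state example above, both quantities can be computed directly from the linear systems in Theorems 5.1 and 5.2 (our own sketch):

import numpy as np

# Transition matrix of the example: states 1, 2, 3, 4 (1 and 4 absorbing)
P = np.array([[1,   0,   0,   0  ],
              [1/3, 0,   2/3, 0  ],
              [0,   2/3, 0,   1/3],
              [0,   0,   0,   1  ]])

transient = [1, 2]                     # indices of states 2 and 3
X = P[np.ix_(transient, transient)]    # transitions among the transient states

# Hitting probabilities of state 1: h = P[transient, 0] + X h
b = P[transient, 0]
h = np.linalg.solve(np.eye(2) - X, b)
print(h)                               # h_2 = 0.6, h_3 = 0.4

# Hitting times of {1, 4}: k = 1 + X k
k = np.linalg.solve(np.eye(2) - X, np.ones(2))
print(k)                               # k_2 = 3, k_3 = 3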
5.6 A short look at continuous time Markov chains
Here we will give a short introduction to continuous time Markov chains and their relation to what we already know about discrete time chains, without going too much into probability theory. This is not meant to give a full understanding of continuous time chains, but rather to highlight similarities and relations with discrete time chains and previous chapters.
A continuous time Markov chain can be seen as a discrete time chain, but instead of jumping to a new state at fixed intervals 1, 2, 3, . . . we instead jump to a new state randomly in time. By randomly in time we mean that the time until a jump from a state is exponentially distributed with intensity λ (which might be different between vertices). Exponentially distributed jumps mean two important things. First, if we assume there is only one jump at time T in the interval (0, t), then the jump time is uniformly distributed on this interval. Second, the jumps are memoryless, which means that if there was no jump in (0, s) then the probability of a jump at a point in time after s is independent of s: P(T ∈ (0, t)) = P(T ∈ (s, s + t) | T ≥ s). We will only concern ourselves with chains with a finite state space. With an infinite state space there are more things that need to be taken into consideration, such as the chance for the chain to "explode", or the chance of eventually getting back to where we started being less than one (even if the chain is irreducible).
For a continuous time chain with a finite state space we can make a graph of the chain. Now however we no longer have the probabilities of jumping to another state as weights on the edges, but rather the intensity λ of the jumps. We also naturally have no self-loops, since these are handled by the intensities of the jumps to the other states. An example of what it could look like can be seen below.
[Graph: states 1, 2, 3, 4 with jump intensities λ_{1,2} = 1, λ_{1,3} = 2, λ_{2,4} = 1, λ_{3,1} = 1, λ_{3,4} = 1 and λ_{4,3} = 2, matching the Q matrix below.]
In continuous time we talk about two different matrices. First we have the Q matrix, which is the negative Laplacian matrix where we take into consideration the weights of the edges. This means that we get elements q_{i,j} = λ_{i,j}, i ≠ j, where λ_{i,j} is the intensity of jumps from state i to state j. The diagonal elements we choose such that the sum of every row is zero. For the Markov chain described by the graph above we get the Q matrix:

Q = \begin{pmatrix} -3 & 1 & 2 & 0 \\ 0 & -1 & 0 & 1 \\ 1 & 0 & -2 & 1 \\ 0 & 0 & 2 & -2 \end{pmatrix}
If Q is irreducible and finite then there exists a stationary distribution π, which we can find from π^⊤Q = 0. In continuous time we can't have any periodicities, and irreducibility is actually enough for a finite chain to guarantee convergence towards this distribution.
While we can solve this system to find the stationary distribution, there is also another, more familiar matrix we can work with, namely the transition matrix P(t) = e^{tQ}. The element p_{i,j}(t) in P(t) describes the probability of being in state j at time t if the chain started in state i at time 0.
We will look more at the exponential of a matrix and how to calculate it later in ??.
For the moment it is enough to know that we get:

P(t) = e^{tQ} = \sum_{k=0}^{\infty} \frac{(tQ)^k}{k!}
P(t) is then a stochastic matrix for all t ≥ 0, and we can find the stationary distribution (using any t > 0) in the same way as we did for a discrete time Markov chain. For the Q above we get, using t = 1:

P(1) = \begin{pmatrix} 0.1654 & 0.2075 & 0.3895 & 0.2376 \\ 0.0569 & 0.3841 & 0.2533 & 0.3057 \\ 0.1663 & 0.0982 & 0.4711 & 0.2645 \\ 0.1394 & 0.0569 & 0.4720 & 0.3317 \end{pmatrix}
And we can also find the stationary distribution by solving the linear system π^⊤P(1) = π^⊤, for example by finding the left Perron vector π. It's easy to check that π = [0.1429 0.1429 0.4286 0.2857]^⊤ is a stationary distribution and that it satisfies both π^⊤P(t) = π^⊤ and π^⊤Q = 0.
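A small sketch of these computations using the matrix exponential (our own illustration; scipy.linalg.expm computes e^{tQ}):

import numpy as np
from scipy.linalg import expm, null_space

Q = np.array([[-3,  1,  2,  0],
              [ 0, -1,  0,  1],
              [ 1,  0, -2,  1],
              [ 0,  0,  2, -2]], dtype=float)

P1 = expm(Q)                            # P(1) = e^{Q}, a stochastic matrix
print(np.round(P1, 4))

# Stationary distribution: left null vector of Q, normalized to sum to one
pi = null_space(Q.T)[:, 0]
pi = pi / pi.sum()
print(np.round(pi, 4))                  # approximately [0.1429, 0.1429, 0.4286, 0.2857]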
5.7 Exercises
5.1. Consider the stochastic matrix:

P = \begin{pmatrix} 0 & 0.5 & 0.5 \\ 0 & 0 & 1 \\ 0.5 & 0 & 0.5 \end{pmatrix}
• Show that P is irreducible.
• If we start in state 1, what is the probability to be in state 3 after 3 steps (P (X3 =
s3 |X0 = s1 )).
• What is the stationary distribution of this Markov chain?
5.2. Consider the stochastic matrix:

T = \begin{pmatrix} 0 & 0.5 & 0 & 0.5 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}
• Write T in canonical form.
• Calculate the hitting times of this Markov chain. What is the expected number of
steps needed to reach the absorbing state if we start in the first state?
5.8 Applications
5.8.1 Return to the Leontief model
We return to the Leontief input-output model in subsubsection 4.3.1 for modeling an economic system composed of n industries. We remember that we got a linear system S − AS = d. We found that if A was irreducible (every industry directly or indirectly depends on all others) and every column sums to one or less, with at least one column with sum less than one (at least one industry goes with a profit and none goes with a loss), we could guarantee that we could satisfy any given demand d.
We are now interested in answering some more questions about the system, such as: what is the gross national product (GNP) of the system? To answer this we will look at the system as a Markov chain with transition matrix Q = A^⊤. We note that we take the transpose of the previous system matrix so that every row, rather than every column, now sums to one or less, and the elements q_{i,j} now correspond to the amount of J-units required by industry i to create one I-unit. To account for the demand vector, which essentially takes units of produced commodities out of the system, we introduce a new state d in the transition matrix. Since any produced commodity that enters the "public" can't come back into the system, we let d be an absorbing state such that Q_{i,d} = 1 − \sum_j q_{i,j}. In other words we modify the transition matrix by adding elements to the column corresponding to the demand d such that every row of A^⊤ sums to one.
We get a transition matrix of the form:

Q = \begin{pmatrix} q_{1,1} & \cdots & q_{1,n} & q_{1,d} \\ \vdots & \ddots & \vdots & \vdots \\ q_{n,1} & \cdots & q_{n,n} & q_{n,d} \\ 0 & \cdots & 0 & 1 \end{pmatrix}
Since the top left n × n part is irreducible, and at least one of q_{1,d}, . . . , q_{n,d} is positive since at least one industry goes with a profit, we see that we can reach the absorbing state d from any other state. Our transition matrix therefore describes an absorbing Markov chain, and it's also easy to see that any initial supply S leads to supplying some demand d.
GNP is the total supply S produced both to satisfy the demand and to satisfy the internal demand of the system. We look at the mean number of steps required to reach the absorbing state from the transient states: T = (I − A^⊤)^{-1}u.
• We can see the elements t_i as the expected production in units needed to create one product unit of industry i.
• Multiplying the demand vector d with T, (T^⊤d), gives the expected total number of produced units, i.e. the GNP.
6 Linear spaces and projections
In this chapter, which will be a bit more theoretical, we will take a look at linear spaces and their relation to matrices. This is in order to get a better understanding of some of the underlying theory, as well as of some commonly used methods and concepts used later. We will start by defining linear spaces and inner product spaces, and in the last part take a closer look at projections and how they can be used to, for example, find a basis of a matrix.
6.1 Linear spaces
We start by defining a linear space:
Definition 6.1
A linear space or vector space over a field F is a set V together with two binary operations
(+, ∗) defined such that:
u, v ∈ V ⇒ u + v ∈ V, and u ∈ V, a ∈ F ⇒ a ∗ u ∈ V
With the following being true, where we let u, v, w be vectors in V and a, b be scalars
in F .
For vector addition (+) we have:
• Associativity u + (v + w) = (u + v) + w.
• Commutativity: u + v = v + u.
• Existence of identity element 0 ∈ V such that v + 0 = v for all v ∈ V .
• Existence of an inverse element −v ∈ V such that v + (−v) = 0.
For scalar multiplication (∗) we have:
• Distributivity with respect to vector addition: a ∗ (u + v) = a ∗ u + a ∗ v.
• Distributivity with respect to field addition: (a + b) ∗ v = a ∗ v + b ∗ v
• Compatibility with field multiplication: a ∗ (b ∗ v) = (ab) ∗ v.
• Existence of identity element 1 ∈ F such that 1 ∗ v = v for all v ∈ V .
Usually we omit the sign (∗) for multiplication, as we are used to. In order to not go too much into algebraic structures, we consider a field to be a set with two operations (+, ·) following the "usual" rules such as associativity and commutativity. To read more about fields and other algebraic structures we refer to a book on algebra such as [?]. Two examples of fields are the real and complex numbers R, C.
We take a closer look at the definition of a linear space and the requirements on the two operations (+, ∗) for V to be a linear space. First of all we see that + is an operation between two vectors in V such that the result is a new vector also in V. We also need it to be associative and commutative, so that we can add in whatever order we want and/or swap the places of the two vectors. There also exist a zero (identity) element and inverse elements. These are all things we are used to from the real numbers.
Second, we have another operation ∗ between a scalar in F and a vector in V such that the result is a vector in V. Similarly to + we have a few requirements on the operation, such as distributivity with respect to both vector and field addition. This means for example that we can either add two vectors and then multiply, or multiply each vector and then add the results. We also have compatibility with field multiplication as well as the existence of a multiplicative identity element.
Some examples of linear spaces using the common addition and multiplication operations are:
• Rn .
• Cn .
• Matrices of size n × n with elements in F : Mn×n (F ).
• {polynomials}
In the future, unless otherwise noted, we will assume the common addition and multiplication operations are used. We take a look at R² with vectors u = (u_1, u_2), v = (v_1, v_2), w = (w_1, w_2) ∈ V and scalars a, b ∈ F.
If we add two vectors we get (u_1, u_2) + (v_1, v_2) = (u_1 + v_1, u_2 + v_2); since the addition of two real numbers (for example u_1 + v_1) is a new real number, this obviously belongs to R² as well. Both associativity and commutativity immediately follow from addition in R as well. The identity element with respect to addition is obviously the zero vector 0 = (0, 0), since (u_1, u_2) + (0, 0) = (u_1 + 0, u_2 + 0) = (u_1, u_2). Similarly we find inverse elements simply by multiplying every element of the vector with −1.
For scalar multiplication a ∗ (v_1, v_2) = (av_1, av_2) we can show that it fulfills the requirements in a similar way, and this is left to the reader.
We now introduce something called subspaces of linear spaces; a subspace can be seen as another linear space contained within a linear space.
Definition 6.2
If V is a vector space, a subset U ⊆ V is called a subspace of V if U is itself a vector
space with the same operations (+, ∗) as V .
We also get a way to test whether a subset U is a subspace of a linear space V by checking just a couple of different things, rather than all the requirements in the original definition of a linear space.
Theorem 6.1. The subset U ⊆ V is a subspace of V if and only if it satisfies the
following 3 conditions.
• 0 lies in U , where 0 is the zero vector (additive identity) of V .
• If u, v ∈ U then u + v ∈ U .
• If u ∈ U and a ∈ F then au ∈ U .
For example, we look at the linear space R² and want to see if the points on the line y = 2x, x, y ∈ R, form a subspace of R². We can write the points on the line as:

U = {a(1, 2) | a ∈ R}

U is obviously a subset of R², so we only need to show the three requirements above. If we choose a = 0 it's clear that 0 ∈ V also lies in U. Adding two vectors in U gives a(1, 2) + b(1, 2) = (a + b)(1, 2), which is obviously in U since a + b ∈ R if a, b ∈ R. Last we multiply with a scalar: b ∗ a(1, 2) = (ba)(1, 2), which also obviously belongs to U since ba ∈ R.
Definition 6.3
A vector v such that:

v = c_1 v_1 + c_2 v_2 + . . . + c_m v_m = \sum_{j=1}^{m} c_j v_j

is called a linear combination of the vectors {v_1, v_2, . . . , v_m} ⊆ V with coefficients {c_1, c_2, . . . , c_m} ∈ F.
We illustrate this with an example. Consider the two vectors u = (1, −2, 1) and v = (1, 1, 2). Then the vector (3, −3, 4) is a linear combination of u, v since 2u + v = 2(1, −2, 1) + (1, 1, 2) = (3, −3, 4). However, the vector (1, 0, 1) is not a linear combination of these vectors, since if we write it as an equation system su + tv = (1, 0, 1) we get:

1 = s + t
0 = −2s + t
1 = s + 2t

The only way to satisfy the first and third equations is s = 1, t = 0, but then the middle equation is not fulfilled.
We often talk about not one linear combination of vectors but all linear combinations
of a set of vectors:
Definition 6.4
If {v_1, v_2, . . . , v_m} is a set of vectors in V, then the set of all linear combinations of these vectors is called their span, denoted:
span(v1 , v2 , . . . , vm )
• If V = span(v1 , v2 , . . . , vm ) then we call these vectors a spanning set for V .
For example, given the two vectors u = (1, −2, 1) and v = (1, 1, 2) above, we get their span:

span(u, v) = {su + tv | s, t ∈ R} = {(s + t, −2s + t, s + 2t) | s, t ∈ R}

So the span is the set of all vectors which we can write in this way. We recognize that the span above actually describes a plane in R³. In fact all spans of vectors in a vector space V are subspaces of V.
Theorem 6.2. For a span U = span(v1 , v2 , . . . , vm ) in a vector space V we have:
• U is a subspace of V containing each of v1 , v2 , . . . , vm .
• U is the smallest subspace containing v1 , v2 , . . . , vm , in that all subspaces containing v1 , v2 , . . . , vm must also contain U .
We prove the first statement:
Proof. We use the test described earlier to see if this subset is also a subspace.
• U obviously contains 0 = 0v1 + 0v2 + . . . + 0vm .
• Given u = a1 v1 + a2 v2 + . . . + am vm and v = b1 v1 + b2 v2 + . . . + bm vm and c ∈ F .
• u + v = (a1 + b1 )v1 + (a2 + b2 )v2 + . . . + (am + bm )vm ∈ U
• cu = (ca1 )v1 + (ca2 )v2 + . . . + (cam )vm ∈ U .
• And we have proven that U is a subspace of V .
We take a look at the span span{(1, 1, 1), (1, 0, 1), (0, 1, 1)} and want to find out whether it is the whole of R³. span{(1, 1, 1), (1, 0, 1), (0, 1, 1)} is obviously contained in R³, and it's clear that R³ = span{(1, 0, 0), (0, 1, 0), (0, 0, 1)}. We can then prove that (1, 0, 0), (0, 1, 0), (0, 0, 1) are contained in our span and use Theorem 6.2 to prove that it contains R³. We do this by writing (1, 0, 0), (0, 1, 0), (0, 0, 1) as linear combinations of our vectors:
(1, 0, 0) = (1, 1, 1) − (0, 1, 1)
(0, 1, 0) = (1, 1, 1) − (1, 0, 1)
Using the first two vectors found we find the last:
(0, 0, 1) = (1, 1, 1) − (1, 0, 0) − (0, 1, 0)
It's clear that there is generally more than one way to write a linear space as a span. We will now move on to look at two different types of spanning sets. We compare two spanning sets in R²:

span((1, −1), (1, 1), (1, 2)) = R²
span((1, 2), (2, 3)) = R²

Using the first spanning set we can represent the vector (3, 2) as a linear combination of the spanning vectors in two ways:

(3, 2) = 1(1, −1) + 1(1, 1) + 1(1, 2)
(3, 2) = 0(1, −1) + 4(1, 1) − 1(1, 2)

However, using the second spanning set we get:

(3, 2) = s(1, 2) + t(2, 3) = (s + 2t, 2s + 3t) ⇒ s = −5, t = 4
In fact, any vector (a, b) has only one unique representation (a, b) = s(1, 2) + t(2, 3).
We introduce the notion of linearly dependent or independent sets of vectors.
Definition 6.5
A set of vectors {v_1, v_2, . . . , v_m} is called linearly independent if the following is true:

a_1 v_1 + a_2 v_2 + . . . + a_m v_m = 0 ⇒ a_1 = a_2 = . . . = a_m = 0

In other words, a set of vectors {v_1, v_2, . . . , v_m} is linearly independent if the only way to represent 0 in these vectors is with all coefficients zero:
0 = 0v1 + 0v2 + . . . + 0vm
One important consequence of this is that any other vector v can only be represented as a linear combination of {v_1, v_2, . . . , v_m} in one way as well. Otherwise we could find two representations of 0 by subtracting v − v using two different representations of v.
If a set of vectors isn't linearly independent, it is linearly dependent.
Definition 6.6
A set of vectors {v_1, v_2, . . . , v_m} is linearly dependent if and only if some v_j is a linear combination of the others.
We consider three vectors {(1 + x), (2x + x²), (1 + x + x²)} in P_2 and want to find out whether they are linearly independent or not by looking for a linear combination of them that results in the zero vector.
0 = s(1 + x) + t(2x + x2 ) + u(1 + x + x2 )
We get a set of linear equations:

\begin{cases} s + u = 0 \\ s + 2t + u = 0 \\ t + u = 0 \end{cases}

It's clear that the only solution to the system is s = t = u = 0, and we see that the vectors are linearly independent. Considering this relation between solving an equation system and linear independence, we can easily prove the following useful theorem.
Theorem 6.3. Suppose a vector space V can be spanned by n vectors. Then for any set {v_1, v_2, . . . , v_m} of linearly independent vectors in V, m ≤ n.

In the proof, which we leave to the reader, we show that the m vectors are linearly dependent if m > n by setting up the equation system \sum_{j=1}^{m} x_j v_j = 0. We represent the m linearly independent vectors as linear combinations of a spanning set of n vectors. This results in a system of n equations with m unknown variables, which has a nontrivial solution.
In practice the theorem means that every linearly independent set of vectors that spans a vector space V has the same number of vectors; we call such a set a basis.
Definition 6.7
A set of vectors {v1 , v2 , . . . , vm } is called a basis in V if the following is true:
• {v_1, v_2, . . . , v_m} is linearly independent.
• V = span{v1 , v2 , . . . , vm }.
We summarize our findings about bases in a linear space.
• If {v_1, v_2, . . . , v_m} and {w_1, w_2, . . . , w_n} are two bases in V, then m = n.
• In other words, all bases in a vector space contain the same number of vectors.
• If V has a basis of n vectors, we say that V has dimension n: dim V = n.
• We say that a vector space is finite dimensional if either V = {0} or dim V < ∞.
In fact, not only do all bases in a linear space V have the same number of vectors; any set of that same number of linearly independent vectors in V is a basis of V.
Theorem 6.4. If dim V = n < ∞ and {v_1, v_2, . . . , v_n} is a set of linearly independent vectors, then {v_1, v_2, . . . , v_n} is a basis in V.
Proof. The set is already linearly independent, so we need to prove that {v_1, v_2, . . . , v_n} spans V. If we assume {v_1, v_2, . . . , v_n} does not span V, we could then choose a new vector v_{n+1} outside span{v_1, v_2, . . . , v_n}. Then {v_1, v_2, . . . , v_n, v_{n+1}} is linearly independent. But we already know there is a basis {w_1, w_2, . . . , w_n} which spans V since dim V = n, and for any set of linearly independent vectors {v_1, v_2, . . . , v_m} we have m ≤ n. So we get a contradiction, and {v_1, v_2, . . . , v_n} must span V.
One common way to solve a problem is to fit a new, more appropriate basis to the problem and use that instead. The theorem guarantees that any set of n linearly independent vectors is also a basis, making it unnecessary to check whether a proposed new basis really is a basis. In essence it makes it easy to find a new basis if we already have one to start with. For example, changing to spherical coordinates, finding the eigenvectors of a matrix, computing many matrix decompositions, or making a Fourier approximation of a function can all, more or less, be seen as finding a new basis suited to the problem. Before continuing we give a short connection to matrices and how they can be, and often are, used to represent a basis.
Theorem 6.5. For a (real) n × n matrix M the following properties are equivalent:
• The rows of M are linearly independent in Rn
• The columns of M are linearly independent in Rn
• The rows of M span Rn .
• The columns of M span Rn .
• M is invertible.
6.2 Inner product spaces
We note that so far we have only seen vectors as collections of elements in F. However, when we talk about vectors we often talk about the length of, or the angle between, vectors. To be able to do this we need to add something to our linear space; namely, we add another operation called an inner product and call the resulting space an inner product space.
Definition 6.8
An inner product on a vector space V over F is a map h·, ·i that takes two vectors
u, v ∈ V and maps to a scalar s ∈ F .
We denote the inner product of two vectors u, v by hu, vi, for which the following statements must be true:
• hu, vi = hv, ui, (symmetric in Rn ).
• hau, vi = ahu, vi.
• hu + v, wi = hu, wi + hv, wi.
• hu, ui > 0 for all u 6= 0 in V .
• A vector space V with an inner product h·, ·i is called an inner product space.
The most commonly used inner product is the dot product, sometimes called the scalar product: for u, v ∈ Rn,
u · v = u1 v1 + u2 v2 + . . . + un vn
Using the definition above we can show that the dot product is in fact an inner product.
• For the first property we get u · v = u1 v1 + u2 v2 + . . . un vn = v1 u1 + v2 u2 + . . . vn un .
Which is obviously true since xy = yx in R.
• For the second we get: au·v = au1 v1 +au2 v2 +. . . aun vn = a(v1 u1 +v2 u2 +. . . vn un ).
Which also immediately follows from multiplication in R.
• Next we get (u + v) · w = (u1 + v1 )w1 + (u2 + v2 )w2 + . . . (un + vn )wn = (v1 w1 +
v2 w2 + . . . vn wn ) + (u1 w1 + u2 w2 + . . . un wn ) = (v · w) + (u · w) and the third
statement is fulfilled as well.
• Last we note that hu, ui is the sum of the square of all elements. Since the square
of a real number r is positive if r 6= 0 the only way for the dot product to equal
zero is if u = 0.
But although the dot product is the most commonly used inner product, there are other inner products as well, such as for example the following on the polynomials of order n, Pn:
\langle p, q \rangle = \int_0^1 p(x) q(x) \, dx
where p, q are polynomials in Pn. It is easy for the interested reader to show from the definition that this is an inner product as well.
You should be familiar with the Euclidean norm |v| = \sqrt{v^\top v} of a vector; we define the norm of an inner product space in a similar way.
Definition 6.9
If h·, ·i is an inner product in the vector space V , the norm (sometimes called length)
kvk of a vector v in V is defined as:
\|v\| = \sqrt{\langle v, v \rangle}
We immediately see that when using the dot product we get the Euclidean norm. The norm in an inner product space shares some properties with those of a vector norm:
• kvk ≥ 0, kvk = 0 if and only if v = 0.
• kcvk = |c| kvk.
• kv + wk ≤ kvk + kwk.
But we also have one additional important property of norms in an inner product space, called the Schwarz inequality:
• If u, v are vectors in V , then:
hu, vi2 ≤ kuk2 kvk2
• With equality if and only if one of u, v is a scalar multiple of the other.
Now that we have a notion of length (norm) in our inner product space we might also
want the distance between vectors. Once again we start by looking at how we calculate
distance using the dot product and the common geometric distance (in R3 ). Using the
dot product we get the distance between two vectors d(u, v) as:
d(u, v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + (u_3 - v_3)^2} = \sqrt{(u - v) \cdot (u - v)}
Which we recognize as the Euclidean norm |u − v|. We define the distance between
vectors in a general inner product space in the same way:
Definition 6.10
The distance between two vectors u, v in an inner product space V is defined as:
d(u, v) = ku − vk
For the distance d(u, v) we have the following properties for vectors u, v, w in the
inner product space V .
• d(u, v) ≥ 0 with equality iff u = v.
• d(u, v) = d(v, u).
• d(u, w) ≤ d(u, v) + d(v, w).
While we can now calculate the length of a vector or the distance between two vectors
we still have no idea how to find out if for example two vectors are parallel or orthogonal.
To do this we need to define the angle between vectors. Using Schwarz inequality we
define the angle between two vectors u, v.
Definition 6.11
If 0 ≤ θ ≤ π is the angle between two vectors u, v, we have:
\cos\theta = \frac{\langle u, v \rangle}{\|u\|\,\|v\|}
We remember the Schwarz inequality saying:
\langle u, v \rangle^2 \le \|u\|^2 \|v\|^2
This gives:
\left( \frac{\langle u, v \rangle}{\|u\|\,\|v\|} \right)^2 \le 1 \;\Leftrightarrow\; -1 \le \frac{\langle u, v \rangle}{\|u\|\,\|v\|} \le 1
which we then use to define the angle between two vectors.
One very important case is when the inner product is zero (hu, vi = 0) and the angle between the vectors is therefore π/2 radians. This is not only interesting geometrically but for general inner product spaces as well.
Definition 6.12
Two vectors u, v in an inner product space V are said to be orthogonal if:
hu, vi = 0
More important is if we have a whole set of vectors, all of which are orthogonal to each other:
Definition 6.13
A set of vectors {v1 , v2 , . . . , vm } is called an orthogonal set of vectors if:
• vi 6= 0 for all vi .
• hvi , vj i = 0 for all i 6= j.
• If kvi k = 1 for all i as well, we call it an orthonormal set.
If we have an orthogonal set {v1 , v2 , . . . , vm } and want an orthonormal set, we can easily obtain one simply by normalizing every vector:
\left\{ \frac{v_1}{\|v_1\|}, \frac{v_2}{\|v_2\|}, \ldots, \frac{v_m}{\|v_m\|} \right\}
Given an orthogonal set we have a couple of important properties; first of all, it is also linearly independent.
Theorem 6.6. If {v1 , v2 , . . . , vm } is an orthogonal set of vectors then it is linearly independent.
We now have another way to find a basis in an inner product space V: rather than trying to find a set of linearly independent vectors, we could instead look for an orthogonal set of vectors. The orthogonal set we find will then automatically be linearly independent as well. We will look more into how to do this later, when we take a closer look at projections. Additionally, we have one more property concerning the norm and orthogonal vectors.
Theorem 6.7. If {v1 , v2 , . . . , vm } is an orthogonal set of vectors then:
kv1 + v2 + . . . + vm k2 = kv1 k2 + kv2 k2 + . . . + kvm k2
We give a short motivation showing it for two vectors, the proof can then easily be
extended for n vectors as well. Given two orthogonal vectors v, u we get:
kv + uk2 = hv + u, v + ui = hv, vi + hu, ui + 2hv, ui
Since v, u are orthogonal hv, ui = 0 and we get:
kv + uk2 = hv, vi + hu, ui = kvk2 + kuk2
A matrix whose rows or columns form a basis of orthonormal vectors is called a unitary
matrix. We remember that for a unitary matrix we have:
Theorem 6.8. For a n × n matrix U the following conditions are equivalent.
• U is invertible and U−1 = UH .
• The rows of U are orthonormal with respect to the inner product.
• The columns of U are orthonormal with respect to the inner product.
• U is a unitary matrix.
If U is a real matrix and we use the ordinary dot product, we say that U is an orthogonal matrix if it satisfies the conditions above. Worth noting is that if U is real, the Hermitian transpose is the same as the transpose and we get U−1 = UT . One example of unitary matrices are the permutation matrices; in fact we can see the unitary matrices as a sort of generalization of them. In the same way as multiplying with a permutation matrix, PAPT , can be seen as changing from one basis to another by reordering the basis vectors (relabelling a graph), multiplication with a unitary matrix, UAUH , can also be seen as a change of basis. This is often used in practice: given M, find a suitable matrix decomposition M = UAUH and then continue working with the presumably easier A instead.
6.3 Projections
We will now look closer at why we usually want to work with an orthogonal basis and how to construct such a basis. We still have not given a reason why we would want to find a new orthogonal basis, given that we usually start with the standard orthonormal basis represented by the identity matrix. This will, however, mainly have to wait until the examples and applications.
One of the large advantages of orthogonal bases is the ease of finding a representation of any vector in the basis. We have already seen that the representation is unique, since the basis vectors are linearly independent, so in general we could find the new representation by solving a linear system.
However, if we have an orthogonal basis {e1 , e2 , . . . , em } of an inner product space V, then for any vector v ∈ V:
• v = \frac{\langle v, e_1 \rangle}{\|e_1\|^2} e_1 + \frac{\langle v, e_2 \rangle}{\|e_2\|^2} e_2 + \ldots + \frac{\langle v, e_m \rangle}{\|e_m\|^2} e_m
• We call this the expansion of v as a linear combination of the basis vectors
{e1 , e2 , . . . , em }.
This gives us an easy way to express v given an orthogonal basis without the need to
solve a linear equation system. Even better, if the basis is orthonormal it’s even easier
since then all kei k2 = 1 and we simply need to take the inner product of the vector v
and all the basis vectors.
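A minimal numerical sketch of this expansion (not from the original text; Python/numpy, with the ordinary dot product as inner product and an orthogonal basis chosen for illustration):

```python
import numpy as np

# An orthogonal (not normalized) basis of R^3 and a vector to expand.
e = [np.array([1.0, 1.0, 0.0]),
     np.array([1.0, -1.0, 1.0]),
     np.array([1.0, -1.0, -2.0])]
v = np.array([3.0, 1.0, -2.0])

# Coefficients <v, e_i> / ||e_i||^2; no linear system needs to be solved.
coeffs = [np.dot(v, ei) / np.dot(ei, ei) for ei in e]
reconstruction = sum(c * ei for c, ei in zip(coeffs, e))

print("coefficients:", coeffs)
print("reconstruction equals v:", np.allclose(reconstruction, v))
```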
The next step is to construct an orthogonal basis; we will take a look at projections and how to use them to create a basis. While we have already constructed bases many times before, for example every time we use Gauss elimination, we are now interested in how we can ensure that we create an orthogonal basis rather than only a linearly independent one. One method to do this is the Gram-Schmidt process; however, before looking at the method we must first show that such a basis actually exists.
Theorem 6.9. In every inner product space V with dim V < ∞ there exists an orthogonal basis.
Proof. We let V be an inner product space with dimension dim V = n and show it using
induction.
• If n = 1 then any basis is orthogonal.
• We then assume that every inner product space of dimension n has an orthogonal basis.
Now let dim V = n + 1 and let {v1 , v2 , . . . , vn+1 } be any basis of V. Let U = span(v1 , v2 , . . . , vn ) and let {e1 , e2 , . . . , en } be an orthogonal basis of U (which exists by the induction assumption).
It is then enough to find any vector v such that hv, ei i = 0 for all i, since then {e1 , e2 , . . . , en , v} will be an orthogonal basis in V. We consider:
v = vn+1 − t1 e1 − t2 e2 − . . . − tn en
Where we need to find all ti . Since vn+1 is not in U , v 6= 0 for all choices of ti . So if
we can find ti for all i such that hv, ei i = 0 we are finished.
For the orthogonality with ei we get hv, ei i
= hvn+1 , ei i − t1 he1 , ei i − . . . − ti hei , ei i − . . . − tn hen , ei i
= hvn+1 , ei i − 0 − . . . − ti kei k2 − . . . − 0
= hvn+1 , ei i − ti kei k2
And we get hv, ei i = 0 if ti = hvn+1 , ei i / kei k2 .
We notice that the proof not only shows that there is an orthogonal basis, it also gives us a way to find one given any other basis. It is recommended that you take some time to really understand the proof and why it works. We use the results here to create the Gram-Schmidt process, used to find an orthogonal basis. We summarize the previous findings in
Theorem 6.10. Let {e1 , e2 , . . . , en } be an orthogonal set of vectors in a inner product
space V and v be a vector not in span(e1 , e2 , . . . , en ).
• Then if:
e_{n+1} = v - \frac{\langle v, e_1 \rangle}{\|e_1\|^2} e_1 - \frac{\langle v, e_2 \rangle}{\|e_2\|^2} e_2 - \ldots - \frac{\langle v, e_n \rangle}{\|e_n\|^2} e_n
• {e1 , e2 , . . . , en , en+1 } is an orthogonal set of vectors.
Using this we construct the Gram-Schmidt process:
• Let V be an inner product space, and {v1 , v2 , . . . vn } be a basis of V .
• We then construct the vectors e1 , e2 , . . . en of an orthogonal basis in V succesively
using:
• e1 = v1 .
• e_2 = v_2 - \frac{\langle v_2, e_1 \rangle}{\|e_1\|^2} e_1.
• e_3 = v_3 - \frac{\langle v_3, e_1 \rangle}{\|e_1\|^2} e_1 - \frac{\langle v_3, e_2 \rangle}{\|e_2\|^2} e_2.
• ...
• e_n = v_n - \frac{\langle v_n, e_1 \rangle}{\|e_1\|^2} e_1 - \ldots - \frac{\langle v_n, e_{n-1} \rangle}{\|e_{n-1}\|^2} e_{n-1}.
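As a small illustration, a direct translation of these steps into code might look as follows (a sketch in Python/numpy, not part of the original text; it assumes the input vectors are given as the columns of a matrix and are linearly independent):

```python
import numpy as np

def gram_schmidt(V):
    """Classical Gram-Schmidt on the columns of V.
    Returns a matrix E whose columns form an orthogonal basis
    of the column space of V (no normalization is done here)."""
    n_vectors = V.shape[1]
    E = np.zeros_like(V, dtype=float)
    for i in range(n_vectors):
        e = V[:, i].astype(float)
        # Subtract the projection of the original v_i on every previous e_j.
        for j in range(i):
            ej = E[:, j]
            e = e - (np.dot(V[:, i], ej) / np.dot(ej, ej)) * ej
        E[:, i] = e
    return E

# Example: the columns below are linearly independent vectors in R^3.
V = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
E = gram_schmidt(V)
print(np.round(E.T @ E, 10))  # off-diagonal entries should be (numerically) zero
```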
We note that to get an orthonormal basis (and an orthogonal matrix) we also need to normalize the resulting vectors. We can do that either at the end, when we have found an orthogonal basis, or during the algorithm itself. While this works in theory, when implemented on a computer there can be some stability problems. When computing the vector ek there might be some small error in precision, resulting in ek not being completely orthogonal to the previous vectors. This error is then compounded when we repeatedly orthogonalize further vectors against this vector and the resulting vectors. Luckily we can make a small adjustment of the method in order to handle these problems.
Modified Gram-Schmidt works like Gram-Schmidt with a small modification.
• u_1 = v_1.
• u_k^{(1)} = v_k - \frac{\langle v_k, u_1 \rangle}{\|u_1\|^2} u_1.
• u_k^{(2)} = u_k^{(1)} - \frac{\langle u_k^{(1)}, u_2 \rangle}{\|u_2\|^2} u_2.
• ...
• u_k^{(n-1)} = u_k^{(n-2)} - \frac{\langle u_k^{(n-2)}, u_{n-1} \rangle}{\|u_{n-1}\|^2} u_{n-1}.
• In every step j we orthogonalize all vectors u_k^{(j)}, k > j, against u_j.
We note that while we assumed that {v1 , v2 , . . . vn } is a basis, it does not actually need to be linearly independent. If we apply the process to a set of linearly dependent vectors, then for some i we get ui = 0; we can simply discard such vectors and not orthogonalize further vectors against them. In this case the number of resulting vectors will equal the dimension of the space spanned by the original vectors. One example of where we can use Gram-Schmidt is in finding a QR decomposition of a matrix; we look more at that in the applications part here and in the QR method later.
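A sketch of the modified version, again in Python/numpy and not from the original text; it also discards vectors that turn out to be (numerically) zero, which handles linearly dependent input as described above:

```python
import numpy as np

def modified_gram_schmidt(V, tol=1e-12):
    """Modified Gram-Schmidt on the columns of V.
    Returns an orthonormal basis (as columns) of the column space of V;
    columns that become numerically zero are discarded."""
    U = V.astype(float).copy()
    basis = []
    n_vectors = U.shape[1]
    for j in range(n_vectors):
        norm = np.linalg.norm(U[:, j])
        if norm < tol:
            continue                      # linearly dependent on earlier vectors
        q = U[:, j] / norm                # normalize the current vector
        basis.append(q)
        # Orthogonalize all remaining vectors against q immediately.
        for k in range(j + 1, n_vectors):
            U[:, k] = U[:, k] - np.dot(U[:, k], q) * q
    return np.column_stack(basis)

# The third column is the sum of the first two, so only two basis vectors remain.
V = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [0.0, 1.0, 1.0]])
Q = modified_gram_schmidt(V)
print(Q.shape)                # (3, 2)
print(np.round(Q.T @ Q, 10))  # 2x2 identity
```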
While we have already used projections, we have not actually defined what a projection is with respect to an inner product space.
Definition 6.14
If v = v1 + v2 with v1 ∈ U and v2 ∈ U ⊥ , where U ⊥ is orthogonal to U .
• The vector v1 is called the projection of v on U , denoted: projU (v).
• The vector v2 is called the component of v orthogonal to U , given by: v2 =
v − projU (v).
Theorem 6.11. If V is a inner product space and U is a subspace of V . Then every
vector v ∈ V can be written uniquely as:
v = projU (v) + (v − projU (v))
• If e1 , e2 , . . . , em is any orthogonal basis of U , then:
\mathrm{proj}_U(v) = \frac{\langle v, e_1 \rangle}{\|e_1\|^2} e_1 + \frac{\langle v, e_2 \rangle}{\|e_2\|^2} e_2 + \ldots + \frac{\langle v, e_m \rangle}{\|e_m\|^2} e_m
• Additionally, for the subspaces U, U ⊥ we have:
dim V = dim U + dim U ⊥
The projection of a vector v in V on a subspace U can be seen as the vector in U most similar to v. Looking at the Gram-Schmidt process earlier, we see that another way to view the algorithm is that each new vector ei is the remainder after subtracting from vi its projection on the subspace spanned by the previous vectors:
ei = vi − projU (vi ), U = span(e1 , . . . , ei−1 )
Theorem 6.12. If V is an inner product space and U is a subspace of V, then projU (v) is the vector in U closest to v ∈ V:
kv − projU (v)k ≤ kv − uk
for all u ∈ U .
Proof. For v − u we get:
v − u = (v − projU (v)) + (projU (v) − u)
Since the first term lies in U ⊥ and the second lies in U, they are orthogonal and we can use Pythagoras:
\|v - u\|^2 = \|v - \mathrm{proj}_U(v)\|^2 + \|\mathrm{proj}_U(v) - u\|^2 \ge \|v - \mathrm{proj}_U(v)\|^2
We look at an example in R3 where we want to find the point on a plane closest to a
point in space. We consider the plane given by the equation x − y − 2z = 0 and find the
point on the plane closest to the point v = (3, 1, −2).
• For the plane we get x = y + 2z and the subspace U = {(s + 2t, s, t) | s, t ∈ R} = span((1, 1, 0), (2, 0, 1)).
• Using Gram-Schmidt we get: e_1 = (1, 1, 0), e_2 = (2, 0, 1) - \frac{\langle (2, 0, 1), (1, 1, 0) \rangle}{\|(1, 1, 0)\|^2} (1, 1, 0).
• e_2 = (2, 0, 1) - (1, 1, 0) = (1, -1, 1).
• Then we find the closest point by finding the projection of v on U:
\mathrm{proj}_U(v) = \frac{\langle v, e_1 \rangle}{\|e_1\|^2} e_1 + \frac{\langle v, e_2 \rangle}{\|e_2\|^2} e_2 = 2e_1 + 0e_2 = (2, 2, 0)
Since we have found an orthogonal basis for the plane, we can now easily also find the point on the plane closest to any general point v = (x, y, z) in space as:
\mathrm{proj}_U(v) = \frac{\langle v, e_1 \rangle}{\|e_1\|^2} e_1 + \frac{\langle v, e_2 \rangle}{\|e_2\|^2} e_2 = \frac{x + y}{2} (1, 1, 0) + \frac{x - y + z}{3} (1, -1, 1)
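A small check of this example in code (a sketch with numpy, not part of the original text):

```python
import numpy as np

e1 = np.array([1.0, 1.0, 0.0])   # orthogonal basis of the plane x - y - 2z = 0
e2 = np.array([1.0, -1.0, 1.0])

def proj_plane(v):
    """Projection of v onto span(e1, e2) using the orthogonal expansion."""
    return (np.dot(v, e1) / np.dot(e1, e1)) * e1 + (np.dot(v, e2) / np.dot(e2, e2)) * e2

v = np.array([3.0, 1.0, -2.0])
p = proj_plane(v)
print(p)                                     # [2. 2. 0.]
print(p[0] - p[1] - 2 * p[2])                # lies in the plane: 0
print(np.dot(v - p, e1), np.dot(v - p, e2))  # residual is orthogonal to the plane
```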
6.4 Applications
6.4.1 Fourier approximation
We inspect the space C[−π, π] of continuous functions on the interval [−π, π] using the
inner product:
\langle f, g \rangle = \int_{-\pi}^{\pi} f(x) g(x) \, dx
• Then {1, sin x, cos x, sin 2x, cos 2x, . . .} is an orthogonal set.
• We can then approximate a function f by projecting it on a finite-dimensional subspace spanned by part of this set:
T_m = span({1, sin x, cos x, sin 2x, cos 2x, . . . , sin mx, cos mx})
This gives f(x) ≈ t_m(x) = a_0 + a_1 cos x + b_1 sin x + a_2 cos 2x + b_2 sin 2x + . . . + a_m cos mx + b_m sin mx.
• The coefficients a_k, b_k we get as:
• a_0 = \frac{\langle f(x), 1 \rangle}{\|1\|^2} = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \, dx.
• a_k = \frac{\langle f(x), \cos(kx) \rangle}{\|\cos(kx)\|^2} = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(kx) \, dx.
• b_k = \frac{\langle f(x), \sin(kx) \rangle}{\|\sin(kx)\|^2} = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(kx) \, dx.
• It’s obvious that as we take more coefficients we get a better approximation of f
since T1 ⊆ T2 ⊆ T3 ⊆ . . ..
• If we take an infinite number of coefficients we call this the Fourier series of f .
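As an illustration, the coefficients can be computed numerically. The sketch below (Python/numpy, not from the original text) approximates the integrals for f(x) = x on [−π, π] with a simple Riemann sum.

```python
import numpy as np

f = lambda x: x                       # function to approximate on [-pi, pi]
x = np.linspace(-np.pi, np.pi, 20001)
dx = x[1] - x[0]

m = 3
a0 = np.sum(f(x)) * dx / (2 * np.pi)
a = [np.sum(f(x) * np.cos(k * x)) * dx / np.pi for k in range(1, m + 1)]
b = [np.sum(f(x) * np.sin(k * x)) * dx / np.pi for k in range(1, m + 1)]

def t_m(xx):
    """Partial Fourier sum t_m(x) = a0 + sum_k (a_k cos kx + b_k sin kx)."""
    s = a0 * np.ones_like(xx)
    for k in range(1, m + 1):
        s += a[k - 1] * np.cos(k * xx) + b[k - 1] * np.sin(k * xx)
    return s

print(np.round(b, 3))          # for f(x) = x these approach 2, -1, 2/3 while a_k ~ 0
print(t_m(np.array([1.0])))    # approximation of f at x = 1
```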
6.4.2 Finding a QR decomposition using the Gram-Schmidt process
We remember that given a matrix M we can decompose it into two matrices, M = QR, such that Q is a unitary matrix and R is an upper triangular matrix. The QR decomposition can then in turn be used to solve, or find a least square solution to, the linear system defined by M. It can also be used to more effectively find the eigenvalues and eigenvectors of M. Both of these uses of the QR decomposition are handled elsewhere in the book; however, one problem first needs to be solved.
If we want to use the QR decomposition to speed up calculations, we need to be able to find the QR decomposition itself as fast (and stably) as possible. If we look at the columns of A as a (possibly linearly dependent) set of basis vectors, then QR can be seen as a change to the basis Q with coefficients in R. If we look at the Gram-Schmidt process used on the column vectors v of A we get (before normalization):
e_i = v_i - \frac{\langle v_i, e_1 \rangle}{\|e_1\|^2} e_1 - \ldots - \frac{\langle v_i, e_{i-1} \rangle}{\|e_{i-1}\|^2} e_{i-1}
\;\Leftrightarrow\;
v_i = e_i + \frac{\langle v_i, e_1 \rangle}{\|e_1\|^2} e_1 + \ldots + \frac{\langle v_i, e_{i-1} \rangle}{\|e_{i-1}\|^2} e_{i-1}
So we write column i using i basis vectors. If we collect the coefficients in R and the basis vectors in Q we get a QR decomposition. You can easily check this by considering one column of A as we did above. In practice we usually first calculate Q and then find R using the multiplication R = QH A, since Q is unitary.
We give an example considering the matrix A below, using the unmodified Gram-Schmidt process:
A = \begin{pmatrix} 1 & -1 & 2 & 3 \\ 2 & 0 & 2 & 1 \\ 2 & 2 & 3 & 2 \end{pmatrix}
This gives:
Q_{\cdot 1} = \frac{A_{\cdot 1}}{\|A_{\cdot 1}\|} = \frac{1}{3}\begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix}, \quad
A'_{\cdot 2} = A_{\cdot 2} - \left(Q_{\cdot 1}^H A_{\cdot 2}\right) Q_{\cdot 1} = \frac{1}{3}\begin{pmatrix} -4 \\ -2 \\ 4 \end{pmatrix}, \quad
A'_{\cdot 3} = \frac{1}{3}\begin{pmatrix} 2 \\ -2 \\ 1 \end{pmatrix}, \quad
A'_{\cdot 4} = \frac{1}{3}\begin{pmatrix} 6 \\ -3 \\ 0 \end{pmatrix}
Continuing with the next column we get:
Q_{\cdot 2} = \frac{A'_{\cdot 2}}{\|A'_{\cdot 2}\|} = \frac{1}{3}\begin{pmatrix} -2 \\ -1 \\ 2 \end{pmatrix}, \quad
A''_{\cdot 3} = A'_{\cdot 3} - \left(Q_{\cdot 2}^H A'_{\cdot 3}\right) Q_{\cdot 2} = \frac{1}{3}\begin{pmatrix} 2 \\ -2 \\ 1 \end{pmatrix}, \quad
A''_{\cdot 4} = \frac{1}{3}\begin{pmatrix} 4 \\ -4 \\ 2 \end{pmatrix}
Q_{\cdot 3} = \frac{A''_{\cdot 3}}{\|A''_{\cdot 3}\|} = \frac{1}{3}\begin{pmatrix} 2 \\ -2 \\ 1 \end{pmatrix}
We now have a full basis and can stop, since Q_{\cdot 4} must be zero.
For the QR decomposition A = QR we then get the matrices:
Q = (Q_{\cdot 1}, Q_{\cdot 2}, Q_{\cdot 3}) = \frac{1}{3}\begin{pmatrix} 1 & -2 & 2 \\ 2 & -1 & -2 \\ 2 & 2 & 1 \end{pmatrix}
and
R = Q^H A = \frac{1}{3}\begin{pmatrix} 9 & 3 & 12 & 9 \\ 0 & 6 & 0 & -3 \\ 0 & 0 & 3 & 6 \end{pmatrix}
Here are a couple of things to consider and think about:
• Can you recognize the Gram-Schmidt process when written in this form? Where are we doing the normalization?
• Can you find the QR decomposition using the modified algorithm?
• Using that Q is unitary we can write a linear system Ax = b as QRx = b ⇒ Rx = QH b, which is very easy to solve since R is triangular. How does this compare to other standard methods such as Gauss elimination?
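The example can also be checked numerically. The sketch below (numpy, not from the original text) compares the hand computation with numpy's built-in QR factorization; note that a library routine may return columns of Q with opposite signs.

```python
import numpy as np

A = np.array([[1.0, -1.0, 2.0, 3.0],
              [2.0,  0.0, 2.0, 1.0],
              [2.0,  2.0, 3.0, 2.0]])

# The Q computed by hand above.
Q = (1.0 / 3.0) * np.array([[1.0, -2.0,  2.0],
                            [2.0, -1.0, -2.0],
                            [2.0,  2.0,  1.0]])
R = Q.T @ A                    # R = Q^H A, since Q has orthonormal columns

print(np.round(3 * R, 10))     # should match (9 3 12 9; 0 6 0 -3; 0 0 3 6)
print(np.allclose(Q @ R, A))   # Q R reproduces A
print(np.allclose(Q.T @ Q, np.eye(3)))   # columns of Q are orthonormal

Q2, R2 = np.linalg.qr(A)       # library QR for comparison (signs may differ)
print(np.allclose(np.abs(Q2), np.abs(Q)))
```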
6.4.3 Principal component analysis (PCA) and dimension reduction
Let us consider n sets of measurements at times t = 1, 2, 3, . . . , m. For example the values of n stocks, the consumption of n commodities, the rainfall at n locations, etc. While our analysis of the data might be easy if n is small, as n grows the time needed for the analysis typically grows faster (for example as n2). When n grows large enough we reach a point where we can no longer do our analysis in a reasonable time.
However, let us assume the measurements are dependent; this seems like a rather reasonable assumption. Our goal is to find a new basis in which we can represent the n sets using fewer basis vectors. If one measurement is an exact linear combination of the others we could use for example Gram-Schmidt to find a smaller basis. However, in practice this is rarely the case, especially for the typical stochastic processes considered here.
Since Gram-Schmidt gives an orthogonal, and hence linearly independent, basis, it already finds a basis with the minimum number of basis vectors; finding a basis with fewer basis vectors that captures all the data exactly is therefore impossible. But what if we can find a new basis which captures most of the data? For example, we might have points in 3 dimensions which all lie on or close to a plane. In this case projecting the points onto this plane should give a good approximation of the data, as well as reducing the dimension of the problem.
One method used to reduce the dimension of a problem is principal component analysis (PCA). Our first step is to subtract the mean of every individual data set. We use this to construct a data matrix Xn×m by inserting the different sets of zero-mean measurements as rows, so that every row of X has mean zero.
Next we compute the singular value decomposition (SVD):
X = WΣVT
We remember that W is a unitary matrix of the eigenvectors of XXT , Σ is a diagonal
matrix with the singular values on the diagonal and V is a unitary matrix which contains
the eigenvectors of XT X. The singular values Σi,i are the square roots of the eigenvalues
of XXT . The eigenvalues are proportional to the variance of the data captured by corresponding eigenvector. Essentially we can see the first eigenvector corresponding to the
largest eigenvalue as the direction in which our original data have the largest variance.
The second eigenvector corresponds to the direction orthogonal to the first which captures the most of the remaining variance, the third is orthogonal to the first two, and so on.
If we project our data onto the space spanned by the first L eigenvectors, we thus get the L-dimensional space that captures the most variance in the data. If the last principal components are small they can be disregarded without a large loss of information.
If we let Yn×m be the projection on the space described by the eigenvectors of W we
get:
Y = WT X = WT WΣVT = ΣVT
If we only want the space described by the first L eigenvectors we get:
Y_L = W_L^T X = \Sigma_L V^T
where W_L^T is an L × n matrix consisting of the first L rows of W^T (the transpose of the first L columns of W), and \Sigma_L is an L × m matrix composed of the first L rows of \Sigma. We can now work with our L × m matrix Y_L rather than X, with little loss of information if the remaining singular values are small.
We note that it is also possible to work with the covariance matrix C = XXT, since the columns of W are the eigenvectors of XXT. As such, if we have the covariance matrix XXT we can calculate its eigendecomposition rather than the singular value decomposition of X. However, it is recommended to use the SVD, since explicitly constructing the covariance matrix XXT can give a loss of precision with the covariance method.
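A minimal sketch of these steps in code (numpy, not part of the original text), with randomly generated correlated series standing in for real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# n = 3 measurement series of length m = 500, deliberately correlated.
m = 500
t = rng.normal(size=m)
X = np.vstack([t + 0.1 * rng.normal(size=m),
               2 * t + 0.1 * rng.normal(size=m),
               -t + 0.1 * rng.normal(size=m)])

X = X - X.mean(axis=1, keepdims=True)              # subtract the mean of every row

W, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = W Sigma V^T

L = 1                                              # keep the first L components
Y_L = W[:, :L].T @ X                               # = Sigma_L V^T, an L x m representation

print(np.round(s, 2))          # one large singular value, the rest are small
print(Y_L.shape)               # (1, 500)
```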
While PCA gives the ”best” orthogonal basis, there are other methods, as well as other variations of PCA, to reduce the dimension of a dataset. Some things to think about concerning the use of PCA are:
• PCA assumes the data are scaled equally. If the data consist of temperatures, for example, what happens if one dataset is in Fahrenheit while all the others are in Celsius? Similar problems arise whenever one dataset has a much larger or smaller variance than the others.
• How do outliers (datapoints far from the others, often the result of measurement errors) affect the results of PCA? Can you think of any way to handle them better?
• What happens if the datasets are independent or close to independent? Why can't we use PCA then? Is there something else you could possibly do in this case?
• Can you think of any examples in your main field where PCA or similar methods
could prove useful?
7 Linear transformations
A linear transformation, also called a linear map, is a function mapping elements from one set to another (possibly the same) set while fulfilling certain conditions. Namely, if the two sets are linear spaces, then the map preserves vector addition and scalar multiplication. In this chapter we will take a look at linear transformations, the properties that make them important, and some examples of where we can use them in applications. This chapter, like the previous one, has a heavier focus on theory than the other chapters. The definition of a linear transformation is as follows:
Definition 7.1
Given two vector spaces V, W over the same field F , then a function T : V → W is a
linear transformation if:
• T (v1 + v2 ) = T (v1 ) + T (v2 ), v1 , v2 ∈ V .
• T (rv) = rT (v), v ∈ V, r ∈ F .
We see that the first condition is equivalent to saying that T needs to preserve vector
addition and the second that it preserves scalar multiplication. We note that the two vector spaces V, W need not be the same vector space, or even have the same dimension. This also means that the addition operation need not be the same, since v1 + v2 is computed in V while T (v1 ) + T (v2 ) is computed in W. From the definition a couple of properties follow immediately; it is recommended to stop and think about why, if it is not immediately clear.
• T (0) = 0
• T (−v) = −T (v), for all v ∈ V .
• T (r1 v1 + r2 v2 + . . . rn vn ) = T (r1 v1 ) + T (r2 v2 ) + . . . + T (rn vn )
for all v ∈ V and all r ∈ F .
Many important transformations are in fact linear, some important examples are:
• The identity map x → x and the zero map x → 0.
• Rotation, scaling, reflection and projection in geometry.
• Differentiation.
• Change of one basis to another.
• The expected value of a random variable.
The importance of many linear transformations gives us a motivation to study them further and try to find properties of them that we can use. In fact, just by knowing that a transformation is linear we already know two useful properties from the definition. For example, it does not matter if we first add two vectors and then change their basis, or if we first change both their bases and then add them. Often it is possible to check whether a transformation is linear using only the definition, as in the example below.
We define a function T : R2 → R3 such that:
T \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + y \\ 2x - y \\ y \end{pmatrix}
We then want to show that it's a linear transformation.
T \left( \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} + \begin{pmatrix} x_2 \\ y_2 \end{pmatrix} \right)
= \begin{pmatrix} (x_1 + y_1) + (x_2 + y_2) \\ (2x_1 - y_1) + (2x_2 - y_2) \\ y_1 + y_2 \end{pmatrix}
= \begin{pmatrix} x_1 + y_1 \\ 2x_1 - y_1 \\ y_1 \end{pmatrix} + \begin{pmatrix} x_2 + y_2 \\ 2x_2 - y_2 \\ y_2 \end{pmatrix}
= T \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} + T \begin{pmatrix} x_2 \\ y_2 \end{pmatrix}
And the first requirement is true.
For the second requirement T (rv) = rT (v) we get:
T \left( r \begin{pmatrix} x \\ y \end{pmatrix} \right)
= T \begin{pmatrix} rx \\ ry \end{pmatrix}
= \begin{pmatrix} rx + ry \\ 2rx - ry \\ ry \end{pmatrix}
= r \begin{pmatrix} x + y \\ 2x - y \\ y \end{pmatrix}
= r T \begin{pmatrix} x \\ y \end{pmatrix}
And the second requirement is true as well. Next we look at how linear transformations act on basis vectors and how this can be used in order to create a linear transformation.
Theorem 7.1. We let V, W be vector spaces over the field F .
Given a basis e1 , e2 , . . . , en of V and any set of vectors {w1 , w2 , . . . , wn } ∈ W .
There exists a unique linear transformation T : V → W such that T (ei ) = wi .
• Additionally we have:
T (v1 e1 + v2 e2 + . . . + vn en ) = v1 w1 + v2 w2 + . . . vn wn
Proof. We start by looking at the uniqueness.
If two transformations S, T exist such that T (ei ) = S(ei ) = wi, then they must be the same since e1 , e2 , . . . , en spans V. You can easily see that they must be the same transformation by writing S(v) = T (v) for any v and expanding v in the basis vectors. We can thus conclude that the transformation is unique if it exists.
Next we want to show that such a linear transformation always exists.
Given v = v1 e1 + v2 e2 + . . . + vn en ∈ V , where v1 , v2 , . . . , vn are uniquely determined
since e1 , e2 , . . . , en is a basis of V .
We define a transformation T : V → W as:
T (v) = T (v1 e1 + v2 e2 + . . . + vn en )
= v1 w1 + v2 w2 + . . . + vn wn
for all v ∈ V. We clearly have that T (ei ) = wi, and showing that T is linear can be done by checking directly that:
T (av1 + bv2 ) = aT (v1 ) + bT (v2 )
Using this we see that it is quite easy to define a linear transformation, since we only need to specify where the basis vectors are mapped; everything else is then handled by the linearity. This result also gives an easy way to check whether two linear transformations are equal: simply check whether they give the same result when applied to the basis vectors. If they do, they must be equal because of the uniqueness. We continue by looking at matrices multiplied with vectors acting as transformations.
Consider the m × n matrix A and define a function TA : Rn → Rm such that
TA (v) = Av for all (column vectors) v ∈ Rn . Show that TA is a linear transformation.
We use the definition and immediately get:
TA (v1 + v2 ) = A(v1 + v2 ) = Av1 + Av2 = TA (v1 ) + TA (v2 )
TA (rv) = A(rv) = r(Av) = rTA (v)
And TA must be a linear transformation.
We see that every m × n matrix A defines a linear transformation TA : Rn → Rm .
This gives us a new perspective in which to view matrices, namely we can view them
as linear transformations on vector spaces. But even more importantly we will also see
that every linear transformation T : Rn → Rm can be written in this way using a m × n
matrix.
Theorem 7.2. If T : Rn → Rm is a linear transformation, then there exists a m×n matrix A such that T (v) = Av for all column vectors v ∈ Rn . Additionally if e1 , e2 , . . . , en
is the standard basis in Rn , then A can be written as:
A = [T (e1 ) T (e2 ) . . . T (en )]
Where T (ei ) are the columns of A. The matrix A is called the standard matrix of the
linear transformation T .
7.1 Linear transformations in geometry
The matrix representation of linear transformations is especially useful in geometry and
3d graphics. Especially since most commonly used transformations such as rotation and
scaling are in fact linear. If we can find the standard matrix for a certain transformation
we can then easily apply that transformation on a large number of vectors with a simple matrix-vector multiplication. This is especially important in for example computer
graphics where we often work with a huge number of points even for very primitive objects, which need to be moved whenever the object or camera (view point) moves. Here
we will give some examples of linear transformations in geometry and show a couple of
methods which we can use to find the standard matrix.
Using the previous result we find the standard matrix for a couple of common transformations from Rn to Rm. We start in R2 → R2 by finding the standard matrix for counterclockwise rotation about the origin.
By drawing the unit circle we can easily see that if we rotate the first basis vector [1 0]> by the angle θ we end up at [cos θ sin θ]>, and if we rotate the second one, [0 1]>, we end up at [− sin θ cos θ]>. Our standard matrix is thus
A = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
And we can use this to rotate any point by the angle θ about the origin. For example, remembering that the angle between two orthogonal vectors is θ = 90◦, given a vector v = [x y]> we can find an orthogonal vector in 2D by multiplying with:
A = \begin{pmatrix} \cos 90^\circ & -\sin 90^\circ \\ \sin 90^\circ & \cos 90^\circ \end{pmatrix} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
giving Av = [−y x]>, which can easily be seen to give zero when taking the dot product with v. But what if we now instead consider this rotation in the xy-plane in
together with v. But what if we now instead consider this rotation in the xy−plane in
R3 ? When looking at rotation around the z-axis we get the matrix:
A = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}
Since for the first two basis vectors we rotate in the plane we had in two dimensions,
and the last basis-vector in the direction of the z−axis should remain unchanged. We
can find the rotation matrices around the other axes in the same way.
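A short sketch in code (numpy, not part of the original text) of the 2D rotation matrix and the rotation about the z-axis in 3D:

```python
import numpy as np

def rot2d(theta):
    """Standard matrix for counterclockwise rotation by theta in R^2."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

def rot_z(theta):
    """Rotation by theta about the z-axis in R^3; the z-component is unchanged."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

v = np.array([1.0, 0.0])
print(np.round(rot2d(np.pi / 2) @ v, 10))                          # [0, 1]
print(np.round(rot_z(np.pi / 2) @ np.array([1.0, 0.0, 5.0]), 10))  # [0, 1, 5]
# Rotating by 90 degrees gives a vector orthogonal to the original one:
print(np.dot(rot2d(np.pi / 2) @ np.array([3.0, 2.0]), np.array([3.0, 2.0])))  # 0
```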
Next we take a look at projection in R2 → R2 where we want to find the standard
matrix for projection on the line y = ax. We let d = [1 a]> be the direction vector for
the line. Then using the projection formula with the basis vectors we get:
\mathrm{proj}_d(1, 0) = \frac{\langle (1, 0), (1, a) \rangle}{\|(1, a)\|^2} d = \frac{1}{1 + a^2} (1, a)
\mathrm{proj}_d(0, 1) = \frac{\langle (0, 1), (1, a) \rangle}{\|(1, a)\|^2} d = \frac{a}{1 + a^2} (1, a)
Which gives the standard matrix
A = \frac{1}{1 + a^2} \begin{pmatrix} 1 & a \\ a & a^2 \end{pmatrix}
But what if we instead want to find the standard matrix for projection on the plane
described by x − z = 0 in R3 ? We introduce a slightly different method more suitable
for higher dimensions. We have an orthogonal basis U = {(1, 0, 1), (0, 1, 0)} for points
in the plane. Then instead of finding the projection of each basis vector on the plane
individually, we use the projection formula to find the projection of a general point onto
this plane. Doing this yields:
\mathrm{proj}_U(x, y, z) = \frac{\langle (x, y, z), (1, 0, 1) \rangle}{\|(1, 0, 1)\|^2} (1, 0, 1) + \frac{\langle (x, y, z), (0, 1, 0) \rangle}{\|(0, 1, 0)\|^2} (0, 1, 0) = \left( \frac{x + z}{2},\; y,\; \frac{x + z}{2} \right)
We then split it up into the parts belonging to x, y, z to get the columns for the individual basis vectors:
A = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \end{pmatrix}
While we probably could have found the standard matrix for this projection regardless,
we can use the same methodology for more complex subspaces in higher dimensions.
• First find an orthogonal basis for the subspace.
• Then find the projection of the basis vectors in the original space on the subspace.
We then get a matrix for this transformation such that we can calculate this projection
of any points onto the subspace with a simple matrix-vector multiplication. Before
continuing with some more theory we return to R2 and look at the reflection in the line
y = ax
• The reflection of v is equal to 2projd (v) − v (make a picture).
This gives:
A = \frac{2}{1 + a^2} \begin{pmatrix} 1 & a \\ a & a^2 \end{pmatrix} - \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \frac{1}{1 + a^2} \begin{pmatrix} 1 - a^2 & 2a \\ 2a & a^2 - 1 \end{pmatrix}
We note that the methodology presented here is possible since the transformations in question are all linear. While most transformations needed in this setting are linear, we will see later that there is at least one critical non-linear transformation we need to be able to do in many applications. Can you think of which one?
7.2 Surjection, injection and combined transformations
While it is nice that we can for example rotate a point, what if we want to combine transformations? For example, maybe we want to rotate around a point which itself rotates around another point. We will also give a way to turn certain non-linear transformations into linear ones, a method essential to many applications in computer graphics. We start by defining what we mean by a surjection or injection.
Definition 7.2
A linear transformation (or more generally a function) F : V → W is a:
• Surjection, or onto, if every element w ∈ W can be written w = F (v) for at least one vector v ∈ V .
• Injection, or one-to-one, if every element in W is mapped to by at most one element in V .
• If it is both surjective and injective it is called a bijection.
One important property is that for F to be invertible it needs to be an injection, since otherwise F −1 (w) would correspond to at least two elements in V for some w ∈ W .
Theorem 7.3. Let F : V → W and G : W → U be two linear transformations. Then
the combined transformation G ◦ F : V → U is linear. If F is invertible, then the inverse
F −1 : W → V is linear.
Proof. For the first part we get:
G ◦ F (au + bv) = G(F (au + bv)) = G(aF (u) + bF (v)) = aG(F (u)) + bG(F (v))
For the second part we get:
F (F −1 (au + bv)) = au + bv = aF (F −1 (u)) + bF (F −1 (v))
= F (aF −1 (u) + bF −1 (v))
84
Since F is linear and injective. From this follows that:
F −1 (au + bv) = aF −1 (u) + bF −1 (v)
But if the composition of two linear transformations is itself a linear transformation,
how do we represent the combined linear transformation G ◦ F ? If we look at the matrix
representations of the transformations we get:
AF = [F (e1 ) F (e2 ) . . . F (en )]
AG = [G(e1 ) G(e2 ) . . . G(en )]
Then:
AG◦F = [G(F (e1 )) G(F (e2 )) . . . G(F (en ))] = AG AF
In other words, we can combine two linear transformations simply by multiplying their respective standard matrices. We note that since matrix multiplication is not commutative, the order in which we apply the linear transformations does impact the resulting transformation: the transformation we want to apply ”first” should be to the right in the matrix multiplication.
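A small sketch (numpy, not from the original text) illustrating that composing transformations corresponds to multiplying their standard matrices, and that the order matters:

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],     # rotation by 90 degrees
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[2.0, 0.0],                          # scaling x by 2
              [0.0, 1.0]])

v = np.array([1.0, 0.0])

# "First scale, then rotate" is R S (the transformation applied first is to the right).
print(np.round(R @ S @ v, 10))   # [0, 2]
# "First rotate, then scale" is S R, which gives a different result.
print(np.round(S @ R @ v, 10))   # [0, 1]
```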
7.3 Application: Transformations in computer graphics and homogeneous
coordinates
We want to make a simple computer graphics animation of our solar system containing the sun and the planets as well as their moons. For simplicity we assume all bodies are spherical and all bodies move in circular paths around something else. To work with, we have the points of a unit sphere located at the origin. This is a good example of how we can combine relatively simple transformations in order to create very complex paths, such as the movement of someone on the Moon's surface relative to the sun.
We start by assuming the sun is stationary, positioned in the middle of the solar system (the origin), and everything else revolves around it. Then we only need to scale our unit sphere to the appropriate size as well as rotate all the points on the sphere around some axis. Since both rotation and scaling are linear transformations we can apply this transformation without further problems. We let t be a point in time, Rsun be the rotation of the sun around its own axis and Ssun be the standard matrix for the scaling needed to get the appropriate size; we then get the standard matrix for the points on the unit sphere representing the sun as:
Asun = Rsun Ssun
Let's say we choose to let the sun rotate around the z-axis; we then get the rotation matrix as before, and the transformation matrix at time t becomes:
A_{sun} = \begin{pmatrix} \cos\theta_s t & -\sin\theta_s t & 0 \\ \sin\theta_s t & \cos\theta_s t & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} s_{sun} & 0 & 0 \\ 0 & s_{sun} & 0 \\ 0 & 0 & s_{sun} \end{pmatrix} = \begin{pmatrix} s_{sun} \cos\theta_s t & -s_{sun} \sin\theta_s t & 0 \\ s_{sun} \sin\theta_s t & s_{sun} \cos\theta_s t & 0 \\ 0 & 0 & s_{sun} \end{pmatrix}
We are now finished with the sun; in every frame we multiply all the points on the unit sphere with the corresponding matrix to get their positions in our animation.
However, as soon as we want to animate our first planet we immediately run into problems: moving a point a fixed amount in some direction (translation) is not a linear transformation. You can easily realize this if you let T be translation in R: T (x) = x + a gives T (0 + 0) = a ≠ T (0) + T (0) = 2a. The same is obviously true for higher dimensions.
While we could start by adding a vector to every point and then rotating around the origin, we would need to take great care in order to get the rotation around the planet's own axis right (rotate the points, then add the vector, then finally rotate around the origin), and it would now require two matrix multiplications and one vector addition. Additionally, as soon as we want to add a moon or make more complex objects, the problem becomes more and more complicated. Luckily there is a way to handle these transformations, a trick you could say, in which we represent our n-dimensional vectors in n + 1 dimensions. We represent a vector v = (v1 , v2 , . . . , vn ) in n dimensions as the vector v = (v1 , v2 , . . . , vn , 1) in n + 1 dimensions. For two dimensions this can be seen as representing the points in the xy-plane by the points in the plane one unit above the xy-plane in three dimensions.
Now we can represent the translation x' = x + t_x, y' = y + t_y, z' = z + t_z (in three dimensions) using matrix multiplication:
\begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
This is commonly referred to as homogeneous coordinates. We note that the last
element of every vector is always 1 and the last row of the standard matrix is always the
same as well. So by using one extra dimension even if the value of every vector in this
extra dimension is always one, we can make translation into a linear transformation.
• We note that the origin is no longer represented by the zero vector (0, 0, 0) in homogeneous coordinates, since it is there represented by (0, 0, 0, 1).
• It is also easy to write any other, already linear, transformation in homogeneous coordinates: we simply add another row and column with all zeros except for the bottom right corner, which we set to one.
We are now ready to return to our animation of the solar system using homogeneous
coordinates instead.
• We start by rewriting the transformation matrix for the points on the sun's surface simply by adding the additional row and column:
A_{sun} = \begin{pmatrix} s_{sun} \cos\theta_s t & -s_{sun} \sin\theta_s t & 0 & 0 \\ s_{sun} \sin\theta_s t & s_{sun} \cos\theta_s t & 0 & 0 \\ 0 & 0 & s_{sun} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
Next we want to find the transformation matrix for the points on the earth’s surface
from another unit sphere in the origin. A good guideline is to start by applying all
transformations of the object in reference to itself, and then apply the transformation
with respect to the ”closest” reference point, in this case the position of the sun.
• We start by applying the scaling SE and the rotation around its own axis RE, just like we did with the sun.
• Then we need to translate the planet away from the sun at the origin. We apply the translation matrix TES so that it is appropriately far away.
• We then apply another rotation, RES, to rotate all of these points around the sun (the origin), representing the rotation of the earth around the sun.
• Applying all of the transformations in order gives the transformation matrix AE = RES TES RE SE .
In particular, we note that the first two factors RES TES describe the centre position of the earth relative to the sun, and the last two RE SE give the position of the points on the earth's surface relative to a unit sphere positioned at the earth's core.
We can use the first (already calculated) half of the transformation when finding the transformation matrix for the moon. In the same way we get the transformations SM , RM applying the scaling and the rotation around the moon's own axis. We consider the position of the earth as the origin and translate and rotate around that, giving TM E , RM E . We notice that we now have the position of the moon relative to the centre of the earth; we then apply the transformations RES TES previously applied to the earth's centre to get the moon's coordinates relative to the origin. This gives the transformation matrix:
AM = RES TES RM E TM E RM SM
We can interpret this (reading from right to left) as:
• RM SM – the position of points on the moon's surface relative to a unit sphere at the centre of the moon, times:
• RM E TM E – the position of the centre of the moon relative to the centre of the earth, times:
• RES TES – the position of the centre of the earth relative to the centre of the sun (the origin).
This methodology is used extensively in computer graphics and robotics in order to describe very complex movements using relatively simple transformations and hierarchies.
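A compact sketch of this hierarchy in code (numpy, not part of the original text; the sizes, distances and angular speeds are made-up illustration values):

```python
import numpy as np

def rot_z_h(theta):
    """Rotation about the z-axis in homogeneous coordinates (4x4)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0, 0],
                     [s,  c, 0, 0],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1.0]])

def scale_h(s):
    """Uniform scaling in homogeneous coordinates."""
    return np.diag([s, s, s, 1.0])

def translate_h(tx, ty, tz):
    """Translation in homogeneous coordinates."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

t = 1.0                                                # some point in time
A_earth = (rot_z_h(0.5 * t) @ translate_h(10, 0, 0)    # earth's centre around the sun
           @ rot_z_h(3.0 * t) @ scale_h(1.0))          # spin and scale of the earth itself
A_moon = (rot_z_h(0.5 * t) @ translate_h(10, 0, 0)     # reuse the earth-centre transform
          @ rot_z_h(6.0 * t) @ translate_h(2, 0, 0)    # moon's centre around the earth
          @ rot_z_h(9.0 * t) @ scale_h(0.3))           # spin and scale of the moon itself

p = np.array([1.0, 0.0, 0.0, 1.0])                     # a point on the unit sphere, homogeneous
print(A_earth @ p)
print(A_moon @ p)
```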
7.4 Kernel and image
To every linear transformation T : V → W we associate two linear spaces. One of them
a subspace of V and the other a subspace of W .
Definition 7.3
Given the linear transformation T .
• R(T ) = {w | w = T (v), v ∈ V } ⊂ W is called the image of T.
• In other words, it is the subspace of W which is mapped to from V.
• N (T ) = {v | T (v) = 0} ⊂ V is called the kernel of T or the nullspace of T.
• In other words, it is the subspace of V which the linear transformation maps to the zero vector in W.
If we look at the transformation in terms of its standard matrix A, we see that the image R(T ) consists of all right-hand sides b for which the matrix equation Ax = b has a solution, while the kernel N (T ) is the set of solutions to Ax = 0.
Theorem 7.4. Given the linear transformation T : V → W , the kernel N (T ) is a
subspace of V and the image R(T ) is a subspace of W .
Proof. Since T (0) = 0 they both contain the zero vector.
For the kernel we get:
T (av1 + bv2 ) = aT (v1 ) + bT (v2 ) = a0 + b0 = 0, v1 , v2 ∈ N (T )
And av1 + bv2 lies in N (T ) since it satisfies the condition. So N (T ) is a subspace of V .
For the image we use:
w1 = T (v1 ), w2 = T (v2 )
v1 , v2 ∈ V
This gives:
aw1 + bw2 = aT (v1 ) + bT (v2 ) = T (av1 ) + T (bv2 ) = T (av1 + bv2 )
And aw1 + bw2 lies in R(T ). So R(T ) is a subspace of W .
We can use this to easily check whether a linear transformation is injective, surjective or bijective (and invertible).
Theorem 7.5. A linear transformation T : V → W is:
• surjective ⇔ R(T ) = W
• injective ⇔ N (T ) = {0}
• bijective (and invertible) ⇔ N (T ) = {0} and R(T ) = W .
If the two vector spaces have the same dimension (square standard matrix) it’s actually
enough to show only that it’s injective or surjective and it will always be bijective as
well.
Theorem 7.6. Assume a linear transformation T : V → W , where dim V = dim W .
• If N (T ) = {0} or R(T ) = W then T is bijective.
In fact we can show even more, we can show something about the combined dimension
of the kernel and image:
Theorem 7.7. Assume we have a linear transformation T : V → W , where V is finite
dimensional, then:
• dim V = dim N (T ) + dim R(T )
We move on to a short example showing how we can use what we have learned here. Hermite interpolation is a method for finding a curve through a given number of points. This is often needed in computer graphics as well as in construction, for calculating the strength of materials. We will look at one subproblem where we want to fit a curve through two points in the plane, where the slope is specified in both points. The simplest interpolation function we can think of in this case is a third order polynomial:
y = a + bx + cx2 + dx3
Which we want to fit to the points such that the curve goes through the points with the
given slope. In order to do this we get a system of four unknowns and four equations:
y(x_1) = a + b x_1 + c x_1^2 + d x_1^3 = y_1
y'(x_1) = b + 2 c x_1 + 3 d x_1^2 = s_1
y(x_2) = a + b x_2 + c x_2^2 + d x_2^3 = y_2
y'(x_2) = b + 2 c x_2 + 3 d x_2^2 = s_2
Writing it in matrix form we get a linear system in the four unknowns (a, b, c, d) with matrix A:
A = \begin{pmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 0 & 1 & 2x_1 & 3x_1^2 \\ 1 & x_2 & x_2^2 & x_2^3 \\ 0 & 1 & 2x_2 & 3x_2^2 \end{pmatrix}
While we know how to solve this problem, we ask ourselves, is it always possible to
solve this system? Obviously we could simply solve it and see, but doing so (especially
by hand) is rather tedious work even for this relatively small matrix.
• You could also calculate the determinant, which is less work than solving the system but still a lot of work.
• Instead we look at the transformation
T : y \mapsto T(y) = \begin{pmatrix} y(x_1) \\ y'(x_1) \\ y(x_2) \\ y'(x_2) \end{pmatrix} = \begin{pmatrix} a + b x_1 + c x_1^2 + d x_1^3 \\ b + 2 c x_1 + 3 d x_1^2 \\ a + b x_2 + c x_2^2 + d x_2^3 \\ b + 2 c x_2 + 3 d x_2^2 \end{pmatrix}
as the transformation from V = {y | y(x) ∈ P3 } to W = R4 . Using the basis {1, x, x2 , x3 } in V this is obviously a linear transformation with the matrix A above. We take a look at the kernel of T. For a polynomial in N (T ) we have
y(x1 ) = y(x2 ) = 0, \quad y'(x1 ) = y'(x2 ) = 0
so it has roots in x1 and x2, both with multiplicity two (we assume x1 ≠ x2). However, y is a third order polynomial and can only have 3 roots (counting multiplicity), except for the zero polynomial. So N (T ) = {0} and T is injective. Since it is injective and dim P3 = dim R4, T is bijective and invertible, so the system has a unique solution.
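A small numerical sketch of this subproblem (numpy, not part of the original text), building the matrix A and solving for the coefficients for one choice of points and slopes:

```python
import numpy as np

x1, y1, s1 = 0.0, 1.0, 0.0     # point and slope at x1
x2, y2, s2 = 2.0, 3.0, 1.0     # point and slope at x2

A = np.array([[1, x1, x1**2,  x1**3],
              [0, 1,  2 * x1, 3 * x1**2],
              [1, x2, x2**2,  x2**3],
              [0, 1,  2 * x2, 3 * x2**2]], dtype=float)
rhs = np.array([y1, s1, y2, s2])

a, b, c, d = np.linalg.solve(A, rhs)     # unique solution since x1 != x2
p = lambda x: a + b * x + c * x**2 + d * x**3
dp = lambda x: b + 2 * c * x + 3 * d * x**2

print(p(x1), dp(x1))   # 1.0 0.0
print(p(x2), dp(x2))   # 3.0 1.0
```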
We remember the definition of rank for a square matrix:
Definition 7.4
For A ∈ Mn×n (F) the rank(A) is equivalent to:
• The number of linearly independent rows or columns.
• The dimension of the column space of A.
But we can give even more definitions of rank; we now move on to define the rank of a matrix A ∈ Mm×n (F) using the kernel and image of A.
Definition 7.5
Given a linear transformation T : V → W with standard matrix A.
• Rank or the column rank r(A) is the dimension of the image of A: dim R(A), with
elements Ax, x ∈ F n .
• Row rank is the dimension of the image of A> : dim R(A> ), with elements A> y, y ∈
F m.
• Nullity or the right nullity v(A) is the dimension of the kernel of A: dim N (A),
with elements Ax = 0, x ∈ F n .
• Left nullity is the dimension of the kernel of A> : dim N (A> ), with elements satisfying A> y = 0, y ∈ F m .
From this immediately follows:
• Nullity zero: v(A) = 0 ⇔ A has a left inverse.
• Rank m: r(A) = m ⇔ A has a right inverse.
We can of course do the same for A>; however, we will see that they are related in such a way that knowing only one of the four dimensions dim R(A), dim R(A> ), dim N (A), dim N (A> ), we can immediately find the others.
Theorem 7.8. For every matrix A we have:
dim R(A) = dim R(A> )
In other words, for non-square matrices, just as for square matrices, the column rank and the row rank are the same.
As with a square matrix you can show it by writing the matrix in echelon form. We
end this part with a short example:
Let A be an n × m matrix with rank r. We want to show that the space of all solutions to Ax = 0, a system of n homogeneous equations in m variables, has dimension m − r. This space is the kernel N (T ) of the transformation T : Rm → Rn given by T (v) = Av. We use
dim V = dim N (T ) + dim R(T )
We already have the dimension of the image R(T ), which is the rank r. The dimension of V = Rm is m, and we immediately get the dimension of the kernel as dim N (T ) = m − r.
7.5 Isomorphisms
Definition 7.6
A linear transformation T : V → W is called an Isomorphism if it’s both surjective and
injective (bijective).
• Two vector spaces V and W are called isomorphic if there exists an isomorphism
T : V → W.
• We denote this by writing V ≅ W .
We see that for two vector spaces that are isomorphic, we have a pairing v → T (v), v ∈
V, T (v) ∈ W . For this pairing we preserve the vector addition and scalar multiplication
in respective space. Because addition and multiplication in one space is completely
determined by the same operation in the other all vector space properties of one can be
determined by those of the other. In essence you could say the two vector spaces are
identical apart from notation.
Theorem 7.9. For two finite dimensional spaces V, W and a linear transformation
T : V → W the following are equivalent:
• T is an isomorphism.
• If {e1 , e2 , . . . , en } is a basis of V then {T (e1 ), T (e2 ), . . . , T (en )} is a basis of W .
• There exists a basis {e1 , e2 , . . . , en } of V such that {T (e1 ), T (e2 ), . . . , T (en )} is a
basis of W .
• There exists an inverse linear transformation T −1 : W → V such that T −1 T = 1V and T T −1 = 1W .
Here we see one of the important relations between isomorphic vector spaces: if we have a basis in one, we can, given an isomorphism, also find a basis in the other. If both vector spaces have finite dimension we can say even more.
Theorem 7.10. For the finite dimensional spaces V, W the following are equivalent.
• V ≅ W (V and W are isomorphic).
• dim V = dim W
In other words, the dimension of a vector space decides which spaces it’s isomorphic to.
For example we can immediately conclude that the linear space R3 is isomorphic to the
polynomials of order two: P2 (Polynomials that can be written on the form a + bx + cx2 ).
Some examples of important isomorphisms are:
• The Laplace transform, mapping (hard) differential equations into algebraic equations.
• Graph isomorphisms T : V → W such that if there is an edge between the vertices u, v in V, then there is an edge between T (u), T (v) in W.
• Others include isomorphisms on groups and rings (such as fields).
This also gives a little motivation for why the study of isomorphisms is useful: if we have a problematic vector space where something we want to do is very complicated, and we can find another, isomorphic vector space where the corresponding operation is easier, then we can formulate our problem in this new vector space instead, do our now easier calculations and then change back, effectively sidestepping the whole issue.
8 Least Square Method (LSM)
Many of the properties and ideas we have examined apply to square matrices, a good
example of this is theorem 1.1 which list nine conditions that are all equivalent to a square
matrix being invertible. If a square matrix A is invertible then the linear equation system
Ax = y is solvable.
In this section we will discuss a scenario where we have a system of linear equations
where the A-matrix has more rows than columns. This kind of system is called an overdetermined system (we have more equations than unknowns) and if A is a m × n matrix
with m > n it is said to have m − n redundant equations. An overdetermined system
can be solvable, for example if the elements of the vector y represent points on a straight line.
[Figure: (a) A set of points on a single line. (b) A set of points near but not on a single line.]
Having
a number of points near but not on a straight line is a very common situation. One of
the simplest ways to model a relation between two quantities is to use a straight line,
l(x) = a + bx, and often experiments are designed such that this kind of relation is
expected or the quantities measured are chosen such that the relation should appear.
In reality any system, be it the temperature of a piece of machinery, the amplitude of a signal or the price of a bond, will be subject to some disturbance, often referred to as noise, that will move the measured values away from the ideal line.
In this situation the two most common problems are:
• Hypothesis testing: does this data fit our theory?
• Parameter estimation: how well can we fit this model to our data?
In both of these cases it is important to somehow define the ’best-fitting’ line. Usually
this is done in what is called the least-square sense. Our problem can be described in
the following way:
Ac = y, \quad A = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad c = \begin{pmatrix} a \\ b \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
where x = [x_1 x_2 . . . x_n ]> and y are the data and c is the vector of coefficients that we want to find. To define what we mean by the least-square sense we first note that for each x_i there is a corresponding point on the line, a + bx_i, and a corresponding measurement y_i. We can collect the differences between these for all measurements in a vector called the residual vector or the error vector:
e = Ac − y, the residual vector.
We can then measure the normal (Euclidean) length of this vector.
|e| = \sqrt{\sum_{i=1}^{n} |e_i|^2} = \sqrt{e^\top e}
Minimizing this length gives us the best-fitting line in the least-square sense.
This can also easily be expanded into other kinds of functions, not just straight lines.
Suppose we instead want to fit a parabola (a second degree polynomial, p(x) = a + bx + cx^2); then the problem can be written:
Ac = y, \quad A = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}, \quad c = \begin{pmatrix} a \\ b \\ c \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
Here the residual vector can easily be defined in the same way as for the straight line.
This can be extended in the same way for a polynomial of any degree smaller than n − 1
where n is the number of measurements. If a polynomial of degree n − 1 is chosen we
get the following situation:
Ac = y, \quad A = \begin{pmatrix} 1 & x_1 & x_1^2 & \dots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \dots & x_2^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^{n-1} \end{pmatrix}, \quad c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
Here the matrix A will be square and the system will be solvable as long as all the
x-values are different. This means that for every set of n measurements there is a
polynomial of degree n − 1 that passes through each point in the measurement. This
is often referred to as interpolation of the data set (and in this case it is equivalent
to Lagrange’s method of interpolation). The matrix A here is known as a Vandermonde
matrix. It and similar matrices appear in many different applications and it is also
known for having a simple formula for calculating its determinant.
8.1 Finding the coefficients
How can we actually find the coefficients that minimize the length of the residual vector?
First we can define the square of the length of the residual vector as a function of the coefficients:
s = e^\top e = \sum_{i=1}^{n} |e_i|^2 = (Ac - y)^\top (Ac - y)
This is a non-negative second degree polynomial and thus has a global minimum where \partial s / \partial c_j = 0 for all j. We can find the global minimum by looking at the derivatives of the function; each e_i is determined by c and
\frac{\partial e_i}{\partial c_j} = A_{i,j}
thus
\frac{\partial s}{\partial c_j} = \sum_{i=1}^{n} 2 e_i \frac{\partial e_i}{\partial c_j} = \sum_{i=1}^{n} 2 (A_{i \cdot} c - y_i) A_{i,j} = 0 \quad\Leftrightarrow\quad A^\top A c = A^\top y
The linear equation system
A^\top A c = A^\top y
is called the normal equations. The normal equations are not overdetermined and will be solvable if the x-values are different. Finding the solution to the normal equations gives the coefficients that minimize the error function s.
The reason that A> Ac = A> y are called the normal equations is that B = A> A is a
normal matrix.
Definition 8.1
A normal matrix is a matrix for which AH A = AAH .
An alternate definition would be A = UDUH where U is a unitary matrix and D is a
diagonal matrix.
Proof that the definitions are equivalent. Here we can use a type of factorization that we have not discussed previously in the course, called the Schur factorization:
A = URU^H
where U is unitary and R is triangular. Then
A^H A = UR^H U^H URU^H = UR^H RU^H \quad\text{and}\quad AA^H = URU^H UR^H U^H = URR^H U^H
so A^H A = AA^H if and only if R^H R = RR^H, which for a triangular R holds if and only if R is diagonal (check this yourself using matrix multiplication). Conversely, if A = UDU^H with D diagonal, then
A^H A = UD^H U^H UDU^H = UD^H DU^H = UDD^H U^H = AA^H
since diagonal matrices always commute.
Normal matrices are always diagonalizable, which means they are always invertible if
D has full rank. If B = A^\top A and A has full rank (equivalent to rank(A) = n) then
B = U D U^H with rank(D) = n. Thus: if rank(A) = n then c = (A^\top A)^{-1} A^\top y minimizes
the error function, or ’solves’ Ac = y in the least square sense.
Sometimes the normal equations can become easier to solve if the A-matrix is factorized. One useful example of a factorization that can simplify the calculation is the
QR-factorization that you will learn how to do in section 11.2.
Using QR-factorization we get A = QR with Q being an orthogonal matrix, Q> = Q−1 ,
and R an upper triangular matrix. The normal equations can then be rewritten in this
way:
(A^\top A) c = \big( (QR)^\top QR \big) c = R^\top Q^\top Q R\, c = R^\top R\, c
A^\top y = (QR)^\top y = R^\top Q^\top y

so the normal equations become R^\top R\, c = R^\top Q^\top y, which (since R is invertible) reduces to

R c = Q^\top y
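A small sketch of this (Python/NumPy, with the same made-up data as before; not taken from the text):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # made-up data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
A = np.column_stack([np.ones_like(x), x])

Q, R = np.linalg.qr(A)                        # A = QR with R upper triangular
c = np.linalg.solve(R, Q.T @ y)               # solve R c = Q^T y
print(c)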
8.2 Generalized LSM
LSM can be used to fit polynomials, as shown previously, but also other functions that
can be written on the form

f(x) = \sum_{i} c_i g_i(x)

where the g_i are known functions and the c_i are the unknown coefficients.
For example, if we want to fit a trigonometric function, f (x), instead of a polynomial
to some data and f (x) looks like this
f (x) = a sin(x) + b sin(2x) + c cos(x)
then the fitting problem can be described in this way
Ac = y

A = \begin{bmatrix} \sin(x_1) & \sin(2x_1) & \cos(x_1) \\ \sin(x_2) & \sin(2x_2) & \cos(x_2) \\ \vdots & \vdots & \vdots \\ \sin(x_n) & \sin(2x_n) & \cos(x_n) \end{bmatrix}, \quad
c = \begin{bmatrix} a \\ b \\ c \end{bmatrix}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
For this kind of problem, however, we do not have the same simple guarantee that the
system is solvable (that A^\top A is invertible).
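A minimal sketch (Python/NumPy, made-up sample points and measurements) of fitting f(x) = a sin(x) + b sin(2x) + c cos(x) in the least-square sense:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 30)                        # made-up sample points
y = 1.5*np.sin(x) - 0.7*np.sin(2*x) + 0.3*np.cos(x) + 0.05*rng.standard_normal(x.size)

# design matrix with the basis functions sin(x), sin(2x), cos(x)
A = np.column_stack([np.sin(x), np.sin(2*x), np.cos(x)])

c, *_ = np.linalg.lstsq(A, y, rcond=None)            # least-square coefficients
print(c)                                             # roughly [1.5, -0.7, 0.3]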
There are also many functions that can be rewritten on the form

f(x) = \sum_{i} c_i g_i(x)

after a suitable transformation;
for example, suppose we have a trigonometric function with a polynomial argument
f(x) = \tan(a + bx + cx^2) \;\Rightarrow\; \arctan(f(x)) = a + bx + cx^2
And even though we have a non-linear model we have turned it into the linear problem
Ac = y

A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \quad
c = \begin{bmatrix} a \\ b \\ c \end{bmatrix}, \quad
y = \begin{bmatrix} \arctan y_1 \\ \arctan y_2 \\ \vdots \\ \arctan y_n \end{bmatrix}
that can be solved with the regular least square method. Note though that when doing
this we no longer fit the original function but instead the rewritten function. In some
applications this is problematic and in some it is not. We will not concern ourselves
further with this in this course but in a real application this sort of rewrite should not
be used indiscriminately.
8.3 Weighted Least Squares Method (WLS)
It was previously described that the least square method is useful in situations where
some data has been collected and we want to analyze the data or fit a specific
model to the data. In both these cases certain parts of the data can be more relevant
than others. The data can be a compilation of data from many different sources,
it can be measured by a tool that is more accurate for certain x, or there can be a certain
region where it is more important that the model fits the data well than everywhere
else. For example, if we describe a system that is intended to spend most of its time in a state
corresponding to the value x, then it is probably more important that our model fits well near
that x than far away from that x.
This can be taken into consideration by using the weighted least square method. When
we use the weighted least square method we add a weight in the error function:

s(e) = e^\top W e \;\Big( = \sum_{i=1}^{n} W_{i,i} |e_i|^2 \text{ when } W \text{ is diagonal} \Big)
Then we get a new set of normal equations

A^\top W A\, c = A^\top W\, y
How to choose W? For physical experiments W is dictated by physical properties of
the measured system or physical properties of the measurement tools. In statistics the
correct weights for independent measurements are the reciprocals of the variances, w_i = 1/\sigma_i^2,
and for measurements with covariance matrix \Sigma the weights are given by W = \Sigma^{-1}, which
gives the best linear unbiased estimate. If the different measurements are independent
then W is diagonal, but if there is some correlation between them then there will be
off-diagonal elements as well.
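A minimal sketch (Python/NumPy, made-up data and assumed standard deviations) of weighted least squares with a diagonal weight matrix:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # made-up data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.1, 0.5, 0.5, 0.1])      # assumed standard deviations

A = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)                      # diagonal weights 1/sigma_i^2

# weighted normal equations A^T W A c = A^T W y
c = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
print(c)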
8.4 Pseudoinverses
For square matrices we could describe the solution of a linear equation system using the
inverse matrix
Ax = y ⇔ x = A−1 y
Not all matrices have an inverse, for example diagonalizable matrices where one eigenvalue is zero,
but the idea behind the inverse can be generalized and made to extend
to these matrices as well. For the regular inverse we have
AA−1 = A−1 A = I
In this section we are going to discuss pseudoinverses, which are denoted by A^* and obey
the following rules

A^* A A^* = A^*, \qquad A A^* A = A
The inverse of a matrix is unique but a matrix can have many different pseudoinverses,
for example, if

A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}

then all of the following are pseudoinverses

A_1^* = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}, \quad
A_2^* = \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix}, \quad
A_3^* = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
In engineering there is a very important pseudoinverse called the Moore-Penrose pseudoinverse.
Definition 8.2
The Moore-Penrose pseudoinverse of a matrix A is defined as follows

A^* = (A^\top A)^{-1} A^\top
The Moore-Penrose pseudoinverse is interesting because it is closely connected to the
least square method and writing

Ax = y \;\Leftrightarrow\; x = A^* y

is the same as solving Ax = y using the least square method. Another good property
of the Moore-Penrose pseudoinverse is that if the matrix A is invertible then A^* = A^{-1}.
In practice calculating the Moore-Penrose pseudoinverse is often done using the singular
value decomposition (SVD) or the QR-factorization.
Showing how to calculate the Moore-Penrose pseudoinverse using QR-factorization is
very similar to showing how it is done for the least square method:

A^* = (A^\top A)^{-1} A^\top = \big( (QR)^\top QR \big)^{-1} (QR)^\top
    = \big( R^\top Q^\top Q R \big)^{-1} R^\top Q^\top = \big( R^\top R \big)^{-1} R^\top Q^\top
    = R^{-1} (R^\top)^{-1} R^\top Q^\top = R^{-1} Q^\top
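As a sketch (Python/NumPy, with a made-up matrix of full column rank), the pseudoinverse computed through the QR-factorization, compared against NumPy's SVD-based np.linalg.pinv:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # made-up matrix with full column rank

Q, R = np.linalg.qr(A)              # reduced QR: Q is 3x2, R is 2x2
A_pinv = np.linalg.solve(R, Q.T)    # R^{-1} Q^T, the Moore-Penrose pseudoinverse

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True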
9 Matrix functions
Functions are a very useful mathematical concept in many different areas.
Power function: f(x) = x^n, x \in \mathbb{C}, n \in \mathbb{Z}

Root function: f(x) = \sqrt[n]{x}, x \in \mathbb{R}, n \in \mathbb{Z}^+

Polynomials: p(x) = \sum_{k=0}^{n} a_k x^k, a_k, x \in \mathbb{C}, n \in \mathbb{Z}^+

Exponential function: f(x) = e^x, x \in \mathbb{C}

Natural logarithm: f(x) = \ln(x), x \in \mathbb{C}

Table 1: Some common functions on the real and complex numbers.
Each of the functions in Table 1 is based on some fundamental idea that makes it
useful for solving problems involving numbers of different kinds. If new functions that
have corresponding or similar properties to these functions but act on matrices instead
of numbers could be defined, then there might be ways to use them to solve problems
that involve matrices of some kind. This can actually be done to some extent and this
section will explore it further.
9.1 Matrix power function
The normal power function with positive integer exponents looks like this

f(x) = x^n = \underbrace{x \cdot x \cdot \ldots \cdot x}_{n}, \quad x \in \mathbb{C},\ n \in \mathbb{Z}^+
Inspired by this it is quite intuitive to define a corresponding matrix version of the
power function
Definition 9.1
For any square matrix A \in M_{k \times k}, k \in \mathbb{Z}^+, the matrix power function is defined as

A^n = \underbrace{A A \ldots A}_{n}

for any n \in \mathbb{Z}^+.
The regular power function is not only defined for positive integers, but also for 0 and
negative integers. How can definition 9.1 be extended to let n be any integer, not just a
positive one?
First consider that x^0 = 1 for any complex number x. For matrices this corresponds to
Definition 9.2
For any square matrix A ∈ Mk×k , k ∈ Z+
A0 = I
where I is the identity matrix.
For negative exponents it is well known that

x^{-n} = \frac{1}{x^n}, \quad n \in \mathbb{Z}^+

Here it is not as intuitive and simple to find a corresponding definition for matrices.
But consider that

x \cdot \frac{1}{x} = x x^{-1} = 1

This means that \frac{1}{x} can be viewed as the inverse of x, and therefore that there is a
natural way to define A^{-n}, n \in \mathbb{Z}^+, for any invertible matrix A.
Definition 9.3
For any square and invertible matrix A \in M_{k \times k}, k \in \mathbb{Z}^+,

A^{-n} = (A^{-1})^n = \underbrace{A^{-1} A^{-1} \ldots A^{-1}}_{n}, \quad n \in \mathbb{Z}^+

where A^{-1} is the inverse of A.
Now a matrix version of the power function An with integer exponent n ≥ 0 has been
defined for any square matrix A. If A is also invertible An is defined for all integer n.
When discussing (real or complex) numbers, x, the function xn is not limited to integer
n. In fact, n can be any (real or complex) number. In the next section the definition
of An is extended to fractional n and in section 9.5 we will show how to define matrix
versions of more general functions, including An for any (real or complex) number n.
9.1.1 Calculating An
Calculating the matrix An can be done in a very straightforward manner by multiplying
A with itself n times (or the inverse of A with itself −n times if n is negative). But the
calculations can sometimes be simplified further. Previously, see page 13 and definition
2.1, the concept of a diagonalizable matrix has been examined. If A is a diagonalizable
matrix, A = S^{-1} D S, then the power function (for positive integer exponents) becomes

A^n = A A \ldots A = S^{-1} D \underbrace{S S^{-1}}_{=I} D \underbrace{S S^{-1}}_{=I} D \ldots \underbrace{S S^{-1}}_{=I} D S = S^{-1} D^n S

where D^n of course becomes
D^n = \begin{bmatrix} d_{1,1}^n & 0 & \dots & 0 \\ 0 & d_{2,2}^n & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & d_{m,m}^n \end{bmatrix}
Thus if a matrix A is simple to diagonalize then An becomes much less tedious to
calculate for high n.
The same idea can be generalized to similar matrices. Consider two matrices, A and
B, that are similar, in the mathematical sense given in definition 1.3.
A = P−1 BP
In this case the power function of A can be calculated as follows

A^n = A A \ldots A = P^{-1} B \underbrace{P P^{-1}}_{=I} B \underbrace{P P^{-1}}_{=I} B \ldots \underbrace{P P^{-1}}_{=I} B P = P^{-1} B^n P
In some cases, for example when B is a diagonal matrix and n is large, it can be easier
to calculate An in this way.
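A small sketch (Python/NumPy, made-up diagonalizable matrix) of computing A^n through an eigendecomposition instead of repeated multiplication; note that NumPy returns the factorization as A = V D V^{-1}, with the eigenvectors as the columns of V:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # made-up diagonalizable matrix
n = 10

eigvals, V = np.linalg.eig(A)           # A = V D V^{-1}
Dn = np.diag(eigvals**n)                # D^n: just raise the diagonal entries
An = V @ Dn @ np.linalg.inv(V)

print(np.allclose(An, np.linalg.matrix_power(A, n)))   # True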
In theorem 2.8 we saw that even if a matrix was not diagonalizable it could be rewritten
as a matrix which is almost diagonal:

A = P^{-1} J P = P^{-1} \begin{bmatrix} J_{m_1}(\lambda_1) & 0 & \dots & 0 \\ 0 & J_{m_2}(\lambda_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_{m_k}(\lambda_k) \end{bmatrix} P
Here J_{m_i}, 1 \le i \le k, are all Jordan blocks, that is matrices with the value \lambda_i on the
diagonal and 1 in the elements above the diagonal, see definition 2.7, and k is the number of
Jordan blocks of A.
This means that for any square matrix A the power function can be written like this

A^n = P^{-1} J^n P = P^{-1} \begin{bmatrix} J_{m_1}(\lambda_1)^n & 0 & \dots & 0 \\ 0 & J_{m_2}(\lambda_2)^n & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_{m_k}(\lambda_k)^n \end{bmatrix} P
Since all the Jordan blocks, J_{m_i}(\lambda_i), are either small or contain a lot of zeros, each
J_{m_i}(\lambda_i)^n is simple to calculate and thus J^n will often be easier to calculate than A^n.
One kind of matrix for which it can be easier to calculate An are nilpotent matrices.
Definition 9.4
A square matrix A is said to be a nilpotent matrix if there is some positive integer,
n \in \mathbb{Z}^+, such that

A^n = 0

From this it follows that A^k = 0 for all k \ge n.
For nilpotent matrices it is obviously easy to calculate A^n for sufficiently high exponents.
One example of a nilpotent matrix is a strictly triangular matrix (a triangular matrix
with zeros on the diagonal).
Remember that any square matrix is similar to a matrix written on the Jordan canonical
form, which is upper triangular with the eigenvalues of the matrix along the diagonal. This
means that if all the eigenvalues of a matrix are equal to zero then the matrix will be
similar to a strictly triangular matrix, which means that it will be nilpotent.
9.2 Matrix root function
For (real or complex) numbers we have the following relation between the power function
x^n and the root function \sqrt[n]{x}:

y = x^n \;\Leftrightarrow\; x = \sqrt[n]{y} = y^{1/n}
For any positive integer, n ∈ Z+ , it is easy to do the corresponding definition for
matrices
Definition 9.5
If the following relation holds for two square matrices A, B \in M_{k \times k},

A = \underbrace{B B \ldots B}_{n}

then B is said to be the nth root of A. This is annotated B = \sqrt[n]{A} = A^{1/n}.

Simply defining the function is not very useful though. In order to make \sqrt[n]{A} useful a
few questions need to be answered:

• How do you find \sqrt[n]{A}?

• For how many different B is B = \sqrt[n]{A}?

In this course we will only consider the square root of a matrix, \sqrt{A} = \sqrt[2]{A}.
9.2.1 Root of a diagonal matrix
It is not obvious how to find the square root of a matrix, it is not even clear whether
there exist any roots at all for a given matrix. For this reason it is suitable to first
consider a specific kind of matrix for which the roots exist and are easy to find.
Diagonal matrices (see page 6) are generally simple to work with. It is also easy to
find their roots. Consider a diagonal matrix D \in M_{m \times m}:

D = \begin{bmatrix} d_{1,1} & 0 & \dots & 0 \\ 0 & d_{2,2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & d_{m,m} \end{bmatrix}
Then we can define a matrix C in this way

C = \begin{bmatrix} \sqrt{d_{1,1}} & 0 & \dots & 0 \\ 0 & \sqrt{d_{2,2}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sqrt{d_{m,m}} \end{bmatrix}
\;\Rightarrow\;
CC = C^2 = \begin{bmatrix} d_{1,1} & 0 & \dots & 0 \\ 0 & d_{2,2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & d_{m,m} \end{bmatrix}
\qquad (3)
From the definition it follows that C is a root of D. It is not the only root though. Each
of the diagonal elements d_{i,i} has two roots, one positive and one negative, and regardless
of which root is chosen in C equation (3) will still hold. This means that there are 2^m
different matrices C that are roots of D. These are not the only possible roots though. For an
example of how many roots a matrix can have, see the following theorem.
Theorem 9.1. All of the following matrices are roots of I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}:

\begin{bmatrix} 1 & 0 \\ 0 & \pm 1 \end{bmatrix}, \quad
\begin{bmatrix} \pm 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\frac{1}{t}\begin{bmatrix} \pm s & \mp r \\ \mp r & \mp s \end{bmatrix}, \quad
\frac{1}{t}\begin{bmatrix} \mp s & \mp r \\ \mp r & \pm s \end{bmatrix}

if r^2 + s^2 = t^2.
Thus the simple matrix I2 has infinitely many roots. Clearly roots are more complicated for matrices than they are for numbers. Fortunately though it is not always
necessary to find all the roots of a matrix, often a single root can be enough, and there
are several ways to make it easier to construct.
For a diagonal matrix D it is very simple to find a root: simply take the matrix that
has the positive square root of each diagonal element of D on its diagonal. This is
often referred to as the principal root.
9.2.2 Finding the square root of a matrix using the Jordan canonical form
The technique for finding a root for a diagonal matrix can be extended to diagonalizable
matrices. Consider a diagonalizable matrix A:

A = S^{-1} D S = S^{-1} \sqrt{D}\,\sqrt{D}\, S = S^{-1} \sqrt{D}\, S\, S^{-1} \sqrt{D}\, S = \big( S^{-1} \sqrt{D}\, S \big)^2
\;\Rightarrow\; \sqrt{A} = S^{-1} \sqrt{D}\, S
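A small sketch (Python/NumPy, made-up symmetric matrix, NumPy's A = V D V^{-1} convention) of computing a square root of a diagonalizable matrix in this way:

import numpy as np

A = np.array([[5.0, 2.0],
              [2.0, 3.0]])                 # made-up diagonalizable matrix
eigvals, V = np.linalg.eig(A)              # A = V D V^{-1}
sqrtD = np.diag(np.sqrt(eigvals))          # principal root of the diagonal matrix
B = V @ sqrtD @ np.linalg.inv(V)           # a square root of A

print(np.allclose(B @ B, A))               # True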
By an analogous argument you can find a similar relation between any square matrix
and its Jordan canonical form,

\sqrt{A} = P^{-1} \sqrt{J}\, P

Finding the square root of J is not as simple as it is for a diagonal matrix, but there
are some similarities:
\sqrt{A} = P^{-1} \sqrt{J}\, P = P^{-1} \begin{bmatrix} \sqrt{J_{m_1}(\lambda_1)} & 0 & \dots & 0 \\ 0 & \sqrt{J_{m_2}(\lambda_2)} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sqrt{J_{m_k}(\lambda_k)} \end{bmatrix} P

This means that if a square root of each of the Jordan blocks is known then \sqrt{A} can be
calculated.
Let J be a Jordan block:

J = \begin{bmatrix} \lambda & 1 & 0 & \dots & 0 \\ 0 & \lambda & 1 & \dots & 0 \\ 0 & 0 & \lambda & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & \lambda \end{bmatrix}
= \lambda \begin{bmatrix} 1 & \frac{1}{\lambda} & 0 & \dots & 0 \\ 0 & 1 & \frac{1}{\lambda} & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix}
= \lambda (I + K)
Consider the function f(x) = \sqrt{1 + x}; this function can be written as a Taylor series

f(x) = \sqrt{1+x} = f(0) + \frac{f'(0)}{1!}x + \frac{f''(0)}{2!}x^2 + \frac{f^{(3)}(0)}{3!}x^3 + \frac{f^{(4)}(0)}{4!}x^4 + \dots
= 1 + \frac{1}{2}x - \frac{1}{8}x^2 + \frac{1}{16}x^3 - \frac{5}{128}x^4 + \dots
It turns out that if you switch 1 for I and x for K the equality still holds (see page
115). K is also a strictly upper-triangular matrix and is therefore a nilpotent matrix (see
definition 9.4). Thus the following formula will always converge (the sum is in fact finite since
K is nilpotent):

\sqrt{I + K} = I + \frac{1}{2}K - \frac{1}{8}K^2 + \frac{1}{16}K^3 - \frac{5}{128}K^4 + \dots

which makes it possible to find a square root for any square matrix.
9.3 Matrix polynomials
A polynomial of degree n over a set K (for example the real or complex numbers) is a
function p(x) that can be written as
p(x) = \sum_{k=0}^{n} a_k x^k, \quad a_k \in K
where n is a finite non-negative integer and x is a variable taken from some set such
that xn and addition is defined in a sensible way (more specifically x can only take
values from a ring). Examples of sets that the variable x can belong to are real numbers,
complex numbers and matrices.
Definition 9.6
A matrix polynomial (with complex coefficients) is a function p : M_{n \times n} \to M_{n \times n} that
can be written as follows

p(A) = \sum_{k=0}^{n} a_k A^k, \quad a_k \in \mathbb{C},\ A \in M_{n \times n}
k=0
where n is a non-negative integer.
Matrix polynomials have a number of convenient properties; one of them can be said
to be inherited from the power function.
Theorem 9.2. For two similar matrices, A and B, it is true for any polynomial p that
p(A) = p(P−1 BP) = P−1 p(B)P
Proof. This theorem is easily proved by replacing A with P−1 BP in the expression describing the polynomial and using An = P−1 Bn P.
9.3.1 The Cayley-Hamilton theorem
A very useful and important theorem for matrix polynomials is the Cayley-Hamilton
theorem.
Theorem 9.3 (Cayley-Hamilton theorem). Let pA (λ) = det(λI − A) then pA (A) = 0
The Cayley-Hamilton theorem can be useful both for simplifying the calculation of
matrix polynomials and for solving regular polynomial equations. This will be demonstrated in the following sections. The rest of this section focuses on proving the Cayley-Hamilton theorem.
Proving the Cayley-Hamilton theorem becomes easier using the following formula.
Theorem 9.4 (The adjugate formula). For any square matrix A the following equality
holds
A adj(A) = adj(A)A = det(A)I
(4)
where adj(A) is a square matrix such that

\operatorname{adj}(A)_{ki} = A_{ik}

where A_{ik} is the ik-cofactor of A (see equation (2)).
Proof of the adjugate formula. Using equation (2) gives

\det(A) = \sum_{k=1}^{n} a_{i,k} A_{i,k}
If we collect all the cofactors in a matrix such that

(\operatorname{adj}(A))_{ki} = A_{ik}

(note the transpose), it can then easily be verified that

A \operatorname{adj}(A) =
\begin{bmatrix}
\sum_{k=1}^{n} a_{1,k} A_{1k} & \sum_{k=1}^{n} a_{1,k} A_{2k} & \dots & \sum_{k=1}^{n} a_{1,k} A_{nk} \\
\sum_{k=1}^{n} a_{2,k} A_{1k} & \sum_{k=1}^{n} a_{2,k} A_{2k} & \dots & \sum_{k=1}^{n} a_{2,k} A_{nk} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{k=1}^{n} a_{n,k} A_{1k} & \sum_{k=1}^{n} a_{n,k} A_{2k} & \dots & \sum_{k=1}^{n} a_{n,k} A_{nk}
\end{bmatrix}
=
\begin{bmatrix}
\det(A) & \sum_{k=1}^{n} a_{1,k} A_{2k} & \dots & \sum_{k=1}^{n} a_{1,k} A_{nk} \\
\sum_{k=1}^{n} a_{2,k} A_{1k} & \det(A) & \dots & \sum_{k=1}^{n} a_{2,k} A_{nk} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{k=1}^{n} a_{n,k} A_{1k} & \sum_{k=1}^{n} a_{n,k} A_{2k} & \dots & \det(A)
\end{bmatrix}
\qquad (5)
In order to get equation (4) we need to show that all off-diagonal elements are equal
to zero. To do this, consider a square matrix B. From this matrix we construct a new
matrix, C, by replacing the i:th row with the j:th row (i \ne j). This means that in C row
i and row j are identical,

c_{i,k} = c_{j,k} = b_{j,k}, \quad 1 \le k \le n

The matrices B and C will also share any cofactors created by removing row i,

C_{ik} = B_{ik}

Using (2) on C gives

\det(C) = \sum_{k=1}^{n} c_{i,k} C_{ik} = \sum_{k=1}^{n} c_{j,k} C_{ik} = \sum_{k=1}^{n} b_{j,k} B_{ik}, \quad i \ne j

Since C has two identical rows the rows are not linearly independent, which means
\det(C) = 0. Thus, for any square matrix B it is true that

\sum_{k=1}^{n} b_{j,k} B_{ik} = 0, \quad i \ne j
This means that all elements in matrix (5) except the diagonal elements are equal to
zero. This gives the adjugate formula

A \operatorname{adj}(A) = \det(A)\, I

Checking that the same holds for \operatorname{adj}(A) A can be done in an analogous way.
Now the adjugate formula can be used to prove the Cayley-Hamilton theorem
Proof of the Cayley-Hamilton theorem. Set pA (λ) = det(λI − A) and let B = adj(λI − A)
then the adjugate formula gives
pA (λ)I = (λI − A)B
Every element in B is a polynomial in \lambda of degree n − 1 or lower, since each element in B
is the determinant of an (n − 1) \times (n − 1) matrix. Thus B can be rewritten as a matrix
polynomial

B = \sum_{k=0}^{n-1} \lambda^k B_k

where B_k is a matrix containing the coefficients of the polynomial terms of degree k
in B. Combining the two formulas gives

(\lambda I - A) B = (\lambda I - A) \sum_{k=0}^{n-1} \lambda^k B_k
= \lambda^n B_{n-1} + \lambda^{n-1} (B_{n-2} - A B_{n-1}) + \dots + \lambda (B_0 - A B_1) + (-A B_0)

while on the other hand, writing p_A(\lambda) = \lambda^n + a_{n-1} \lambda^{n-1} + \dots + a_1 \lambda + a_0,

(\lambda I - A) B = p_A(\lambda) I = \lambda^n I + \lambda^{n-1} a_{n-1} I + \dots + \lambda a_1 I + a_0 I
If this expression holds for all \lambda then the following three conditions must be fulfilled:

a_k I = B_{k-1} - A B_k \text{ for } 1 \le k \le n-1 \qquad (6)

I = B_{n-1} \qquad (7)

a_0 I = -A B_0 \qquad (8)

Next consider the polynomial p_A(A),

p_A(A) = A^n + A^{n-1} a_{n-1} + \dots + A a_1 + I a_0

Combining this with conditions (6)–(8) results in
p_A(A) = A^n + A^{n-1} (B_{n-2} - A B_{n-1}) + \dots + A (B_0 - A B_1) + (-A B_0)
= A^n \underbrace{(I - B_{n-1})}_{=0} + A^{n-1} \underbrace{(B_{n-2} - B_{n-2})}_{=0} + \dots + A \underbrace{(B_0 - B_0)}_{=0} = 0
9.3.2 Calculating matrix polynomials
In section 9.1.1 a number of different techniques and properties worth considering when
calculating the power function were described. Naturally all of this is applicable to
matrix polynomials as well. But there are other techniques that can be used to simplify
the calculation of a matrix polynomial.
On page 106 it was mentioned that the Cayley-Hamilton theorem could be used to
calculate matrix polynomials. It is very simple to see how the following theorem follows
from Cayley-Hamilton.
Theorem 9.5. Let p_A(\lambda) = \det(\lambda I - A) where A is a square matrix. Then the following
formulas hold:

p_A(A) = 0

c \cdot p_A(A) = 0 \text{ where } c \in \mathbb{C}

q(A) \cdot p_A(A) = 0 \text{ where } q \text{ is a polynomial}
Many things that can be done with polynomials of numbers can also be done with matrix
polynomials. Factorization is one such thing. If

p(x) = (x - \lambda_1)(x - \lambda_2) \dots (x - \lambda_n)

then

p(A) = (A - \lambda_1 I)(A - \lambda_2 I) \dots (A - \lambda_n I)

This means that if a polynomial of ordinary numbers can be simplified by factorization,
the corresponding matrix polynomial can be simplified in a corresponding way. It also
means that polynomial division can be used for matrix polynomials as well as for ordinary
polynomials. That is, any matrix polynomial, p(A), of degree n can be rewritten
as p(A) = q(A)\pi(A) + r(A) where q(A), \pi(A) and r(A) are all polynomials with degrees
\deg(q) = n - k, \deg(\pi) = k and \deg(r) < k, for some 1 \le k \le n.
If \pi(A) is chosen as one of the polynomials in theorem 9.5 then \pi(A) = 0 and p(A) =
r(A), which potentially lowers the degree of p(A), making it easier to calculate.
One especially interesting consequence of this is the following theorem
Theorem 9.6. For two polynomials, a and b, with different degree, deg(a) ≥ deg(b),
such that a(A) = 0 and b(A) = 0 then the following relation must apply a(A) = c(A)b(A)
where c(A) is a polynomial with degree deg(c) = deg(a) − deg(b).
Proof. Dividing a and b gives a(A) = c(A)b(A) + r(A). From b(A) = 0 it follows that
a(A) = r(A) but a(A) = 0 thus r(A) = 0.
9.3.3 Companion matrices
If λ is an eigenvalue of a square matrix A then det(λI − A) = 0. Since an n × n matrix
has n eigenvalues (some of which can be the same) and since pA (λ) = det(λI − A) is a
polynomial of λ with degree n this means that each eigenvalue corresponds to a root of
the polynomial pA (λ). This can be expressed as a formula
pA (λ) = det(λI − A) = (λ − λ1 )(λ − λ2 ) . . . (λ − λn )
where λi for 1 ≤ i ≤ n are eigenvalues of A.
This means that if we have found the eigenvalues of some matrix we have also found
the roots of some polynomial and vice versa. That solving polynomials can be used to
find eigenvalues of matrices is something that we have seen before, the standard method
of solving det(λI − A) = 0 is exactly this. But there are other ways of finding the
eigenvalues of a matrix, in section 11 a number of different ways of calculating the roots
numerically will be shown for example.
There is also a simple way of constructing a matrix that has the roots of a given
polynomial as eigenvalues. This matrix is called the companion matrix of a polynomial.
The origin of this kind of matrix is linear differential equations.
Consider a linear differential equation of order n,

y^{(n)} + a_{n-1} y^{(n-1)} + \dots + a_1 y^{(1)} + a_0 y = f

where y^{(k)} is the k:th derivative of y and f is a function. When solving this kind of
differential equation it is often helpful to find the roots of the characteristic polynomial
p(x) = xn + an−1 xn−1 + . . . + a1 x + a0
To get the companion matrix we rewrite the differential equation using the state variables

x_0 = y, \quad x_1 = y^{(1)}, \quad \dots, \quad x_{n-1} = y^{(n-1)}

Using the state variables x_0, \dots, x_{n-1} the differential equation can be written as a system
of first order differential equations. If we denote a vector containing all x_i by x and a
vector containing all x_i' by x', then the differential equation can be written as a matrix
equation (with the inhomogeneous term f appearing in the last row):

x' = C(p)^\top x \quad \Leftrightarrow \quad
\begin{cases}
x_0' = x_1 \\
x_1' = x_2 \\
\;\;\vdots \\
x_{n-2}' = x_{n-1} \\
x_{n-1}' = -a_0 x_0 - a_1 x_1 - \dots - a_{n-1} x_{n-1} + f
\end{cases}
With C(p) defined as follows.
Definition 9.7
Let p(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{n-1} x^{n-1} + x^n; then the companion matrix of p is

C(p) = \begin{bmatrix}
0 & 0 & \dots & 0 & -a_0 \\
1 & 0 & \dots & 0 & -a_1 \\
0 & 1 & \dots & 0 & -a_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & 1 & -a_{n-1}
\end{bmatrix}
In some literature C(p)> is referred to as the companion matrix instead of C(p).
Theorem 9.7. If C(p) is a companion matrix to the polynomial p then the eigenvalues
of C(p) are roots of p.
Proof. Let e_i denote the vectors in the standard basis, that is column vectors with a 1 in
element i and 0 everywhere else. For these vectors multiplication with C(p) results in the
following:

C(p) e_i = e_{i+1} \text{ for } 1 \le i < n \qquad (9)

C(p) e_n = -\sum_{i=1}^{n} a_{i-1} e_i \qquad (10)

From (9) it can easily be shown that

e_i = C(p)^{i-1} e_1 \text{ for } 1 \le i \le n \qquad (11)

e_n = C(p)^{n-1} e_1 \qquad (12)

Combining equations (9)–(12) gives

C(p)^n e_1 = \sum_{i=0}^{n-1} -a_i\, C(p)^i e_1 \quad \Leftrightarrow \quad
\big( C(p)^n + a_{n-1} C(p)^{n-1} + \dots + a_1 C(p) + a_0 I \big) e_1 = 0

Since every basis vector can be written as e_i = C(p)^{i-1} e_1 and powers of C(p) commute, the
same matrix also sends every e_i to zero, so

p(C(p)) = 0
By the Cayley-Hamilton theorem p_C(C(p)) = 0 with p_C(\lambda) = \det(\lambda I - C(p)). Thus we
have two polynomials that are zero for the companion matrix:

p(C(p)) = 0, \qquad p_C(C(p)) = 0

Then we can use theorem 9.6 to see that p(\lambda) = c \cdot p_C(\lambda) (since p and p_C have
the same degree). This means that p(\lambda) and p_C(\lambda) must have the same roots. The
eigenvalues of C(p) are roots of p_C(\lambda) and thus the theorem is proved.
Calculating roots of polynomials using companion matrices can be quite efficient,
especially since a polynomial of high degree gives a sparse matrix, that is a matrix with
only a few non-zero elements. Sparse matrices can be handled quite efficiently by computers
and many algorithms can be specifically modified to work much faster on sparse matrices
than on non-sparse matrices. The Power method described in section 11.1 can be used to
efficiently find numerical eigenvalues of sparse matrices. Many computer programs that
find roots of polynomials use companion matrices, for example the roots command in
MATLAB (see [5]).
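A minimal sketch (Python/NumPy) of this idea, using the made-up polynomial p(x) = x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3) as an example:

import numpy as np

# p(x) = a_0 + a_1 x + a_2 x^2 + x^3 with a_0 = -6, a_1 = 11, a_2 = -6
a = np.array([-6.0, 11.0, -6.0])

n = a.size
C = np.zeros((n, n))
C[1:, :-1] = np.eye(n - 1)              # ones on the subdiagonal
C[:, -1] = -a                           # last column holds -a_0, ..., -a_{n-1}

print(np.sort(np.linalg.eigvals(C)))    # approximately [1. 2. 3.], the roots of p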
9.4 Exponential matrix
So far we have constructed matrix functions by taking the definition of a function for
numbers and replaced numbers with matrices when it was possible. Now we are going
to construct a matrix version of a function in a different way, we are going to identify
the most important properties of a function and then construct a matrix function that
has the same properties.
The function that we are going to construct a matrix version of is the exponential function. A common definition for the exponential function is
e^x = \lim_{n \to \infty} \left( 1 + \frac{x}{n} \right)^n

The exponential function is useful in many ways but most of its usefulness comes from
the following three properties:
\frac{d}{dt}\left( e^{at} \right) = a e^{at} = e^{at} a \qquad (13)

\left( e^{at} \right)^{-1} = e^{-at} \qquad (14)

e^a \cdot e^b = e^{a+b} = e^{b+a} = e^b \cdot e^a \qquad (15)
Here a, b and t are all real or complex numbers. Multiplication of numbers is usually
commutative (a · b = b · a) but matrix multiplication is often not commutative. In
equations (13)–(15) the various places where commutativity comes in have been written
out explicitly since we want to have commutativity in the same way for the matrix
version.
The most complex function we have defined so far is the matrix polynomial. Can
we construct a polynomial such that it fulfills (13)–(15)? Consider what happens to a
general polynomial of At when we take the first time derivative:

p(At) = \sum_{k=0}^{n} a_k (At)^k \;\Rightarrow\; \frac{d}{dt}\big( p(At) \big) = \sum_{k=1}^{n} a_k k A^k t^{k-1} = A \sum_{l=0}^{n-1} a_{l+1} (l+1) A^l t^l

The last equality is achieved by simply taking l = k − 1. Compare this to

A\, p(At) = A \sum_{l=0}^{n} a_l A^l t^l
If the polynomial coefficients are chosen such that a_{l+1}(l+1) = a_l these expressions
would be identical except for the final term. Let us choose the correct coefficients,

a_{l+1}(l+1) = a_l \;\Leftrightarrow\; a_l = \frac{1}{l!}

and simply avoid the problem of the last term not matching up by making the polynomial
have infinite degree.
Definition 9.8
The matrix exponential function is defined as

e^{At} = \sum_{k=0}^{\infty} \frac{t^k}{k!} A^k, \quad A \in M_{n \times n}

It is simple to create e^A from this by setting t = 1.
An important thing to note here is that since the sum in the definition is infinite
we should check for convergence. As it turns out the exponential function will always
converge. This follows from a more general theorem presented in section 9.5.
Theorem 9.8. For the matrix exponential function the following is true

\frac{d}{dt}\left( e^{At} \right) = A e^{At} = e^{At} A \qquad (16)

\left( e^{At} \right)^{-1} = e^{-At} \qquad (17)

e^A \cdot e^B = e^{A+B} = e^{B+A} = e^B \cdot e^A \;\text{ if } AB = BA \qquad (18)

where A and B are square matrices.
This definition of eAt has all the properties we want, with the exception that we need
the extra condition on (18) compared to (15). There is no way around this unfortunately.
That the definition gives a function that has property (16) is clear from the construction. Property (17) and (18) can be shown by direct calculations using the definition.
Calculating the exponential matrix can be challenging. First of all we should note
that the exponential function for numbers can be written like this

e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}
using a Taylor series. Using this we can see how to calculate the matrix exponential
for a diagonal matrix D:

e^D = \begin{bmatrix}
\sum_{k=0}^{\infty} \frac{d_{1,1}^k}{k!} & 0 & \dots & 0 \\
0 & \sum_{k=0}^{\infty} \frac{d_{2,2}^k}{k!} & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \sum_{k=0}^{\infty} \frac{d_{n,n}^k}{k!}
\end{bmatrix}
= \begin{bmatrix}
e^{d_{1,1}} & 0 & \dots & 0 \\
0 & e^{d_{2,2}} & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & e^{d_{n,n}}
\end{bmatrix}
Since the exponential matrix is essentially a polynomial it should not be surprising
that the following relation holds.
A = S−1 BS ⇒ eA = S−1 eB S
On page 102 it was discussed that nilpotent matrices could make it easier to calculate
the matrix power function. This is of course also true for the matrix exponential function,
since it turns the infinite sum into a finite sum and thus turns the exponential function
into a finite polynomial.
There are also many ways of computing the exponential matrix function numerically.
A nice collection can be found in [10].
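A small sketch (Python, made-up matrix, assuming SciPy is available) comparing a truncated version of the defining series with SciPy's built-in scipy.linalg.expm:

import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])            # made-up matrix

# truncated series sum_{k=0}^{N} A^k / k!
N = 30
term = np.eye(2)
S = np.eye(2)
for k in range(1, N + 1):
    term = term @ A / k
    S = S + term

print(np.allclose(S, expm(A)))         # True (up to truncation error)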
9.4.1 Solving matrix differential equations
One of the most common applications of the exponential function for numbers is solving
differential equations,

\frac{df}{dx} = k f(x) \;\Rightarrow\; f(x) = c \cdot e^{kx}

The matrix exponential function can be used in the same way.

Theorem 9.9. The exponential matrix e^{xA} is the solution to the initial value problem

\begin{cases} \dfrac{dU}{dx} = A U \\ U(0) = I \end{cases}

Proof. From the construction of e^{xA} we already know that it fulfills the differential
equation. Checking that the initial condition is fulfilled can be done simply by setting
x = 0 in the definition.
9.5 General matrix functions
There is a systematic way of creating a matrix version of a function. When the matrix
exponential function was defined in the previous section (see page 113) we found a matrix
polynomial of infinite degree that created the matrix version. One way of doing this is
given below
Definition 9.9 (Matrix function via Hermite interpolation)
If a function f(\lambda) is defined on all \lambda that are eigenvalues of a square matrix A then the
matrix function f(A) is defined as

f(A) = \sum_{i=1}^{h} \sum_{k=0}^{s_i - 1} \frac{f^{(k)}(\lambda_i)}{k!} \phi_{ik}(A)

where \phi_{ik} are the Hermite interpolation polynomials (defined below), \lambda_i are the h different
eigenvalues and s_i is the multiplicity of each eigenvalue.
Definition 9.10
Let \lambda_i, 1 \le i \le n, be complex numbers and s_i, 1 \le i \le n, be positive integers. Let s
be the sum of all s_i. The Hermite interpolation polynomials are a set of polynomials
with degree lower than s that fulfill the conditions

\frac{\phi_{ik}^{(l)}(\lambda_j)}{l!} = \begin{cases} 0 & \text{if } i \ne j \text{ or } k \ne l \\ 1 & \text{if } i = j \text{ and } k = l \end{cases}
Using this kind of definition to create a matrix function works as long as the function
f is defined on the eigenvalues of the matrix.
There is another definition of a matrix function that is equivalent to definition 9.9
but is only applicable to analytical functions, that is functions that can be written as a
power series.
Definition 9.11 (Matrix function via Cauchy integral)
Let C be a curve that encloses all eigenvalues of the square matrix A. If the function f
is analytical (can be written as a power series) on C and inside C then

f(A) = \frac{1}{2\pi i} \oint_C f(z) (zI - A)^{-1} \, dz

The matrix R_A(z) = (zI - A)^{-1} is called the resolvent matrix of A.
Proofs that these two definitions make sense and are equivalent can be found in [?]
on pages 1–8.
The formula in definition 9.11 is a matrix version of Cauchy’s integral formula, which
can be used to derive the formula for a Taylor series. As a consequence of this, we
can construct the matrix version of any analytical function by replacing the number
variable in a Taylor series expansion (or other convergent power series) with a square
matrix. For an example of this, see the formula for the matrix square root (page 105) and
compare the definition of the matrix exponential function with the Taylor series for the
exponential function for numbers.
10 Matrix equations
A matrix equation is an equation that involves matrices and vectors instead of numbers.
A simple example is Ax = y where A is a matrix, y is a known vector and x is an unknown
vector. You are already familiar with some ways of solving this kind of equations, for
example Gaussian elimination and matrix inverses.
There are of course many other kinds of matrix equations. One example that we have
seen before is the first order differential equation that was solved using the exponential
matrix function, see page 114.
Another simple example would be exchanging the x and y vectors in the typical matrix equation for a pair of square matrices, X
and Y. There are two variations of this kind of equation:
AX = Y
XA = Y
Another example of a matrix equation that is important in many applications (first and
foremost in control theory) is Sylvester’s equation
AX + XB = C
There is also a version of Sylvester’s equation that is especially important, known as the
continuous Liapunov¹ equation:
AX + XAH = −C
There are many methods for solving specific matrix equations, but here we will take a
look at a basic tool for examining certain properties of the solutions of matrix equations
and turning the more exotic kinds of matrix equations into the typical matrix-vector
equation that we already know how to solve. For this a new tool will be introduced.
10.1 Tensor products
The tensor product, usually denoted by ⊗ is an interesting tool that can be defined for
many kinds of mathematical objects, not only matrices. It is sometimes called the outer
product or direct product. A ⊗ B is usually pronounced ’A tensor B’.
Definition 10.1
The tensor product for matrices is a bilinear function of matrices that has two properties:
1. Bilinearity:
(µA + ηB) ⊗ C = µA ⊗ C + ηB ⊗ C
A ⊗ (µB + ηC) = µA ⊗ B + ηA ⊗ C
2. Associativity: (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)
¹ Sometimes spelled Lyapunov or Ljapunow.
It can be possible to find different functions that fulfil these conditions for a given set of
matrices, but the most general (and most common) tensor product is the Kroenecker
product.
Definition 10.2
The Kroenecker product ⊗ between two matrices, A \in M_{m \times n} and B \in M_{p \times q}, is defined
as

A \otimes B = \begin{bmatrix}
a_{1,1} B & a_{1,2} B & \dots & a_{1,n} B \\
a_{2,1} B & a_{2,2} B & \dots & a_{2,n} B \\
\vdots & \vdots & \ddots & \vdots \\
a_{m,1} B & a_{m,2} B & \dots & a_{m,n} B
\end{bmatrix}

or

A \otimes B = \begin{bmatrix}
A b_{1,1} & A b_{1,2} & \dots & A b_{1,q} \\
A b_{2,1} & A b_{2,2} & \dots & A b_{2,q} \\
\vdots & \vdots & \ddots & \vdots \\
A b_{p,1} & A b_{p,2} & \dots & A b_{p,q}
\end{bmatrix}
these two definitions are equivalent but not equal, one is a permutation of the other.
The Kroenecker product is quite simple and intuitive to work with, even if it creates
large matrices.
Theorem 10.1. For the Kroenecker product the following is true
1. (µA + ηB) ⊗ C = µA ⊗ C + ηB ⊗ C
2. A ⊗ (µB + ηC) = µA ⊗ B + ηA ⊗ C
3. (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)
4. (A ⊗ B)> = A> ⊗ B>
5. A ⊗ B = A ⊗ B (complex conjugate)
6. (A ⊗ B)H = AH ⊗ BH
7. (A ⊗ B)−1 = A−1 ⊗ B−1 , for all invertible A and B
8. det(A ⊗ B) = det(A)k det(B)n , with A ∈ Mn×n , B ∈ Mk×k
9. (A ⊗ B)(C ⊗ D) = AC ⊗ BD, whenever the products AC and BD are defined

10. A ⊗ B = (A ⊗ I_{k \times k})(I_{n \times n} ⊗ B), A \in M_{n \times n}, B \in M_{k \times k}
The first 7 properties listed here show that the Kroenecker product is bilinear and associative and that it reacts in a very intuitive way to transpose, complex conjugate, Hermitian
conjugate and inverse operations. Number 8 shows that determinants are almost as intuitive but not quite (note that the determinant of A has been raised to the number of rows
in B and vice versa). Properties 9 and 10 are examples of how matrix multiplication and
Kroenecker products relate to one another.
Taking the Kroenecker product of two matrices gives a new matrix. Many of the
properties of the new matrix relate to the properties of the old matrices in a fairly uncomplicated manner. One property that this applies to is the eigenvalues.
Theorem 10.2. Let {λ} be the eigenvalues of A and {µ} be the eigenvalues of B. Then
the following is true:
{λµ} are the eigenvalues of A ⊗ B

{λ + µ} are the eigenvalues of A ⊗ I + I ⊗ B
Sketch of proof.
a) Use Schur’s lemma, A = S^{-1} R_A S, B = T^{-1} R_B T, where R_A and R_B
are triangular matrices with the eigenvalues along the diagonal. Then

A \otimes B = (S^{-1} R_A S) \otimes (T^{-1} R_B T) = (S^{-1} R_A \otimes T^{-1} R_B)(S \otimes T) = (S \otimes T)^{-1} (R_A \otimes R_B)(S \otimes T)

R_A \otimes R_B will be triangular and will have the elements a_{i,i} b_{j,j} = \lambda_i \mu_j on the diagonal;
thus \lambda_i \mu_j is an eigenvalue of A \otimes B since similar matrices have the same eigenvalues.

b) The same argument as above gives that A \otimes I and I \otimes B have the same eigenvalues as A
and B respectively; adding the two terms together gives that \lambda + \mu is an eigenvalue of
A \otimes I + I \otimes B.
10.2 Solving matrix equations
The tensor product can be used to make it easy to solve equations like these
AX = Y
XA = Y
Taking the tensor product between the identity matrix and some square matrix results
in the following theorem
Theorem 10.3.

AX = Y \;\Leftrightarrow\; (I \otimes A) \widetilde{X} = \widetilde{Y}

XA = Y \;\Leftrightarrow\; (A^\top \otimes I) \widetilde{X} = \widetilde{Y}

where

B = \begin{bmatrix} B_{.1} & B_{.2} & \dots & B_{.n} \end{bmatrix}, \qquad
\widetilde{B} = \begin{bmatrix} B_{.1} \\ B_{.2} \\ \vdots \\ B_{.n} \end{bmatrix}
This turns an equation consisting of n × n matrices into an equation with one n^2 × n^2
matrix and two n^2-vectors. This equation can be solved the usual way. Solving the
equation this way is equivalent to assigning each element in the unknown matrix a variable
and then solving the resulting linear equation system. Using the Kroenecker product can
make subsequent manipulations of the matrices involved easier though, thanks to the
properties shown in theorem 10.1.
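A minimal sketch (Python/NumPy, made-up matrices) of theorem 10.3, solving AX = Y by stacking the columns of X and Y:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))        # made-up square matrices
X = rng.standard_normal((3, 3))
Y = A @ X                              # so that AX = Y has the known solution X

I = np.eye(3)
vecY = Y.flatten(order="F")            # stack the columns of Y
vecX = np.linalg.solve(np.kron(I, A), vecY)
X_rec = vecX.reshape((3, 3), order="F")

print(np.allclose(X_rec, X))           # True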
Often it is interesting to know if there exists a solution to a given equation and if that
solution is unique. Using the Kroenecker product we can see for which matrices A and
B that there is a unique solution to Sylvester’s equation.
Theorem 10.4. Sylvester’s equation AX + XB = C has a unique solution if and only if
A and −B have no common eigenvalues.
Proof. Rewrite the equation using the Kroenecker product:

AX + XB = C \quad\Leftrightarrow\quad \underbrace{(I \otimes A + B^\top \otimes I)}_{K} \widetilde{X} = \widetilde{C}

This is a normal matrix equation which is solvable if \lambda = 0 is not an eigenvalue of K.
The eigenvalues of K are \lambda + \mu where \lambda is an eigenvalue of A and \mu is an eigenvalue of B.
Thus if A and −B have no common eigenvalues the eigenvalues of K will all be nonzero.
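A small sketch (Python/NumPy, made-up matrices) of solving Sylvester's equation through this Kroenecker rewrite:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))                    # made-up coefficient matrices
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

n = 3
I = np.eye(n)
K = np.kron(I, A) + np.kron(B.T, I)                # the n^2 x n^2 system matrix
vecX = np.linalg.solve(K, C.flatten(order="F"))
X = vecX.reshape((n, n), order="F")

print(np.allclose(A @ X + X @ B, C))               # True

For larger matrices forming the n^2 × n^2 matrix explicitly becomes expensive; dedicated routines such as scipy.linalg.solve_sylvester avoid this.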
As mentioned previously Sylvester’s equation is important in control theory especially
the version of it that is called the continuous Liapunov equation. This will be illustrated
using two theorems (neither theorem will be proved).
A stable system is a system that will in some sense ’autocorrect’ itself. If left alone it
will approach some equilibrium state and not increase indefinitely or behave chaotically.
If the system is described by a first order differential matrix equation then the following
condition gives stability.
Definition 10.3
A matrix, A, is stable when all solutions to

\frac{dx}{dt} = A x(t)

converge to 0, x(t) \to 0 when t \to \infty.
\Leftrightarrow
All the eigenvalues of A have negative real part, \operatorname{Re}(\lambda) < 0 for all \lambda that are eigenvalues of A.
Theorem 10.5. If A and B are stable matrices then

AX + XB = -C

has the unique solution

X = \int_0^{\infty} e^{tA}\, C\, e^{tB}\, dt
Theorem 10.6. Let P be a positive definite matrix. Then A is a stable matrix if and
only if there is a positive definite solution X to

A^H X + X A = -P
11 Numerical methods for computing eigenvalues and
eigenvectors
In practice we can rarely use an analytical solution to find eigenvalues or eigenvectors
of a system: either because the system is too large, making Gauss elimination, solving
the characteristic polynomial or similar methods too slow, or because we only need one or a
few eigenvalues or eigenvectors, or a rough estimate is enough. You have probably
seen by now how time consuming it is to find eigenvalues and eigenvectors analytically.
Since there are a huge number of different problems where we want to find eigenvalues or
eigenvectors there are also a large number of different methods with different strengths
and weaknesses depending on the size and structure of the problem. We will look at
two different related methods, the Power method and the QR-method, and a couple of
variations of both.
While this chapter gives some insight into how to use or implement some methods,
it also serves to help in showing that there is no general method that somehow always
works. If a general method doesn’t work, there might be another specialized method for
your type of problem. It is then important to know what to look for, both in our matrix
as well as in what we are actually trying to find. It also serves to show how we can start
with a very primitive algorithm and then carefully build upon it in order to remove some
or most of its limitations while still keeping its strengths.
Some things to take into consideration when choosing which method to use are:

• The size of the matrix, especially for very large matrices which we maybe cannot even
fit in memory.

• The density of non-zero elements: is the matrix a full matrix (few zeros) or a sparse
matrix (few non-zero elements)?

• Does the matrix have any kind of useful structure we could use, for example if it is
positive definite, symmetric, Hessenberg or a band matrix?

• What do we seek? Do we need all eigenvalues and eigenvectors or just one or a
few? Do we need exact results or is a good approximation enough? Or maybe we
do not need both eigenvalues and eigenvectors?
11.1 The Power method
The Power method is a very simple algorithm for finding the eigenvalue \lambda_1 with the
largest absolute value and its corresponding eigenvector. While the method has a
couple of disadvantages making it unsuitable for general matrices, it proves its worth
in a couple of specialized problems unsuitable for more general methods such as the
QR-method which we will look at later. The Power method also acts as an introduction
to many of the thoughts behind the QR-method and its variations.
The Power method in its basic form is a very simple and easy to implement method.
We consider the matrix A for which we want to find the ”largest” eigenvalue and corresponding eigenvector.
• We assume there is only one eigenvalue on the spectral radius and that A is diagonalizable.

• The method works by iterating x_{k+1} = \frac{A x_k}{\| A x_k \|}.

• If \lambda_1 \ne 0 then \frac{A x_k}{\| A x_k \|} approaches a vector parallel to the eigenvector of the eigenvalue with largest absolute value, \lambda_1.

• We get the eigenvalue \lambda_1 from the relation A^k x \approx \lambda_1 A^{k-1} x when k is large. (A small sketch of the iteration in code is given directly below.)
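A minimal sketch (Python/NumPy) of the basic Power method; the eigenvalue estimate here uses the Rayleigh quotient x^T A x rather than the component ratio described above, and the matrix is the one used in the example below:

import numpy as np

def power_method(A, num_iter=100):
    # basic Power method: approximate dominant eigenpair of A
    x = np.ones(A.shape[0])                 # simple starting vector
    for _ in range(num_iter):
        y = A @ x
        x = y / np.linalg.norm(y)           # normalize in every step
    lam = x @ (A @ x)                       # Rayleigh quotient estimate of lambda_1
    return lam, x

A = np.array([[-2.0, 1.0, 1.0],
              [ 3.0, -2.0, 0.0],
              [ 1.0, 3.0, 1.0]])
print(power_method(A))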
It is easy to see that if the matrix A^\top is a stochastic matrix, running the Power method
can be seen as iterating the Markov chain. This also motivates why we need there to be
only one eigenvalue on the spectral radius, since otherwise neither the Markov chain nor
the Power method would converge. We will not look into how fast
the convergence is here, but be content by noting that the relative error can be shown to
be of the order |\lambda_2 / \lambda_1|^k. In other words it is not only other eigenvalues on the spectral
radius that are problematic; if there are eigenvalues close to the spectral radius they will
slow down the convergence as well.
We give a short example using the following matrix:

A = \begin{bmatrix} -2 & 1 & 1 \\ 3 & -2 & 0 \\ 1 & 3 & 1 \end{bmatrix}
We start with x_0 = [1\ 1\ 1]^\top and iterate:

A x_0 = \begin{bmatrix} -2 & 1 & 1 \\ 3 & -2 & 0 \\ 1 & 3 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 5 \end{bmatrix}

Next we calculate the norm of this vector:

\| A x_0 \| = \sqrt{26}

And last we use the norm \| A x_0 \| in order to normalize A x_0:

x_1 = \frac{A x_0}{\| A x_0 \|} = [0, 1, 5] / \sqrt{26}
Continuing for the next iteration we get:

A x_1 = \begin{bmatrix} -2 & 1 & 1 \\ 3 & -2 & 0 \\ 1 & 3 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 5 \end{bmatrix} \frac{1}{\sqrt{26}} = \begin{bmatrix} 6 \\ -2 \\ 8 \end{bmatrix} \frac{1}{\sqrt{26}}

x_2 = \frac{A x_1}{\| A x_1 \|} = [6, -2, 8] / \sqrt{104}
And if we continue for a couple of more iterations we get:
• x2 = [0.59, −0.17, 0.78]> .
• x3 = [−0.25, 0.91, 0.33]> .
• x4 = [0.41, −0.61, 0.67]> .
• x7 = [−0.28, 0.85, −0.44]> .
• x10 = [0.28, −0.79, 0.55]> .
• For reference the true eigenvector is xtrue = [0.267, −0.802, 0.525].
We see that it does seem to converge, but note that it could take quite some time.
However, since every step gives a slightly better approximation, if we only need a rough
estimate we could stop considerably earlier. Because of the assumptions regarding the
system as well as the problem with convergence if λ1 ≈ λ2 this simple version of the
algorithm is rarely used unless we already have a good understanding of the system.
In short a couple of things to consider when using the power method are:
• If there is more than one eigenvalue on the spectral radius or if its multiplicity is
larger than one, the method won’t converge.
• Since the error is of the order |λ2 /λ1 |k , if λ2 is close to λ1 the convergence can be
very slow.
• We only get the (in absolute value) largest eigenvalue and corresponding eigenvector.
While these are some considerable disadvantages (we can’t use it on all matrices and the
convergence speed is uncertain), there are a couple of things we can do to handle these
problems. However while the method does have a couple of negatives, it does have a
couple of very strong points as well.
• In every step we make only a single vector-matrix multiplication, making it very
fast if we can ensure a fast convergence.
• If A is a sparse matrix (few non-zero elements) the iterations will be fast even for
extremely large matrices.
• Since we do no matrix decomposition (as is done in most other common methods) we do not ”destroy” the sparsity of the matrix.
We especially see the advantage of the method on sparse matrices, since most other
commonly used methods use a matrix decomposition of the matrix, usually destroying
the sparsity of the matrix in the process. This means that while the method might not
be as reliable or as fast as other methods such as the QR-method in the general case, if
we have a very large sparse matrix we might not even be able to use many other methods
since they would be too slow to be usable. In some cases the matrix itself might not even
fit in memory anymore after matrix decomposition, further increasing the time needed.
11.1.1 Inverse Power method
In order to handle some of the problems found earlier we will take a look at something
called the inverse Power method. While this fixes some of the problems it doesn’t fix
all of them, and there is a cost in that the method is no longer as suitable for sparse
matrices as before.
We consider the invertible matrix A with one eigenpair λ, x. From the definition of
eigenvalues and eigenvectors we then have Ax = λx. This gives:
Ax = λx ⇔ (A − µI)x = (λ − µ)x ⇔ x = (A − µI)−1 (λ − µ)x
⇔ (λ − µ)−1 x = (A − µI)−1 x
And we see that ((λ − µ)−1 , x) is an eigenpair of (A − µI)−1
If we instead iterate using this new matrix (A - \mu I)^{-1} the method becomes:

x_{k+1} = \frac{(A - \mu I)^{-1} x_k}{\| (A - \mu I)^{-1} x_k \|}
And just like the ordinary Power method this will converge towards the eigenvector
of the dominant eigenvalue of (A - \mu I)^{-1}. The dominant eigenvalue of (A - \mu I)^{-1} is
not necessarily the same as the dominant eigenvalue of A. Rather, we get the dominant
eigenvalue (\lambda - \mu)^{-1} of (A - \mu I)^{-1} for the \lambda closest to \mu.
If we take a look at the error we get:

\left| \frac{(\lambda_{\text{second closest to } \mu} - \mu)^{-1}}{(\lambda_{\text{closest to } \mu} - \mu)^{-1}} \right|^k
= \left| \frac{\lambda_{\text{closest to } \mu} - \mu}{\lambda_{\text{second closest to } \mu} - \mu} \right|^k
We can easily see that if \mu is a good approximation of an eigenvalue of A then (\lambda_{\text{closest to } \mu} - \mu) \approx 0
and the error will be close to zero as well; this especially improves the convergence when there are
multiple close eigenvalues. With a good approximation of \lambda the method will converge
very fast, often in only one or a couple of iterations. Especially in the case where we
know the eigenvalue exactly beforehand, such as when working with irreducible stochastic matrices, we get the error equal to zero and the method should converge in only
one step. We also see that if we change \mu we can find the eigenvectors of not only
the dominant eigenvalue, but the eigenvectors of multiple eigenvalues. However we still
have problems with eigenvalues with multiplicity larger than one, or when more than
one has the same absolute value. The inverse Power method also requires us to solve
a linear system or calculate the inverse of the matrix (A - \mu I). This makes the method
unsuitable for large sparse matrices, where the basic Power method shows its greatest
strength, since calculating the inverse matrix usually destroys the sparsity of the matrix.
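A minimal sketch (Python/NumPy) of the inverse Power method, solving a linear system in each step instead of forming the inverse; the matrix and the shift mu = 7 are the ones from the example below, and the eigenvalue estimate uses the Rayleigh quotient:

import numpy as np

def inverse_power_method(A, mu, num_iter=10):
    # eigenvector of the eigenvalue of A closest to mu
    n = A.shape[0]
    x = np.ones(n)
    M = A - mu * np.eye(n)
    for _ in range(num_iter):
        y = np.linalg.solve(M, x)          # solve (A - mu I) y = x
        x = y / np.linalg.norm(y)
    lam = x @ (A @ x)                      # Rayleigh quotient estimate
    return lam, x

A = np.array([[3.0, 3.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 3.0],
              [1.0, 0.0, 1.0, 3.0],
              [1.0, 3.0, 0.0, 2.0]])
print(inverse_power_method(A, mu=7.0))     # eigenvalue close to 5.8445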
We give a short example using the inverse power method to find the eigenvector of
the dominant eigenvalue of:

A = \begin{bmatrix} 3 & 3 & 1 & 0 \\ 1 & 1 & 0 & 3 \\ 1 & 0 & 1 & 3 \\ 1 & 3 & 0 & 2 \end{bmatrix}
Since A is non-negative and irreducible (why?) we can use Perron Frobenius Theorem 4.1 to find bounds on the dominant eigenvalue. We remember this means the
dominant eigenvalue is at least equal to the lowest row sum and at most equal to the
largest row sum. Since at least one of the diagonal elements are non-zero we know that
A is primitive as well so that we know that there is only one eigenvalue on the spectral
radius. This ensures the convergence (when seeking the dominant eigenvalue).
We choose µ as the largest possible value of the dominant eigenvalue: µ = max row sum =
7. This ensures that the dominant eigenvalue is the eigenvalue closest to µ, ensuring we
converge towards the right eigenvector. We get:

A - \mu I = A - 7I = \begin{bmatrix} -4 & 3 & 1 & 0 \\ 1 & -6 & 0 & 3 \\ 1 & 0 & -6 & 3 \\ 1 & 3 & 0 & -5 \end{bmatrix}
(A − µI)−1

−0.4038
−0.1538
=
−0.1538
−0.1731
−0.3173
−0.3590
−0.1923
−0.2788
−0.0673
−0.0256
−0.1923
−0.0288

−0.2308
−0.2308

−0.2308
−0.3846
Starting with x_0 = [1\ 1\ 1\ 1]^\top gives:

x_1 = \frac{(A - \mu I)^{-1} x_0}{\| (A - \mu I)^{-1} x_0 \|} = [-0.5913\ \ -0.4463\ \ -0.4463\ \ -0.5020]^\top

If we continue to iterate a couple of times, multiplying the results that are negative by −1
so that we end up with a positive vector, we get:
• x_1 = [0.5913\ 0.4463\ 0.4463\ 0.5020]^\top

• x_2 = [0.6074\ 0.4368\ 0.4368\ 0.4995]^\top

• x_3 = [0.6105\ 0.4351\ 0.4351\ 0.4986]^\top

• x_{true} = [0.6113\ 0.4347\ 0.4347\ 0.4983]^\top
We see that we are very close to the true eigenvector after only three iterations;
in fact we have a rather good approximation after only one iteration. Using the found
eigenvector we can now also find a better approximation of the true dominant eigenvalue
\lambda = 5.8445 from the relation Ax = \lambda x. For example, using x_3 we get:

A x_3 = [3.5721\ 2.5414\ 2.5414\ 2.9131]^\top \;\rightarrow\; \lambda \approx \frac{3.5721}{0.6105} = 5.8511
Which is very close to the true eigenvalue λ = 5.8445.
The Inverse power method is often used when you have a good approximation of an
eigenvalue and need to find an approximation of the corresponding eigenvector. This
is often the case in real time systems where a particular eigenvector might need to be
computed multiple times during a very short time as the matrix changes.
Overall both the Power method and the Inverse Power method are unsuitable for
general matrices since we cannot guarantee the convergence without further inspecting
the matrix. In order to solve the limitations of these methods we will need to calculate
many eigenvalues at the same time as for example in the QR-method below.
11.2 The QR-method
Often we might not know anything about the matrix beforehand and we might want to
find not only one or a couple of eigenvalues/eigenvectors but all of them. We also don’t
want to be limited to only finding the eigenpairs for simple eigenvalues.
The QR-method solves all of these problems: it finds all eigenpairs, and it is generally
very fast unless we are working with very large matrices where we might need specialized algorithms. Since the method is fast, reliable and finds all eigenpairs it is almost
always the QR-method (or a variation thereof) that is used whenever you calculate
eigenvalues/eigenvectors using computational software. Unless you have very specific requirements or already know a more suitable method it is usually a good idea to start by
trying the QR-method and then continue looking at other methods if that is insufficient.
Like the Power method, the QR-method is an iterative method which after some time
converges and gives us the eigenvalues and eigenvectors. Like the name suggests, the
method uses the QR-decomposition of the original matrix. We remember that we could
find a QR-decomposition A = QR, where Q is a unitary matrix and R is upper triangular,
using for example the Gram-Schmidt process.
The method works by iterating (a small sketch of the basic iteration in code is given after the list):

• We start by assigning A_0 := A.

• In every step we then compute the QR-decomposition A_k = Q_k R_k.

• We then form A_{k+1} = R_k Q_k = Q_k^H Q_k R_k Q_k = Q_k^H A_k Q_k = Q_k^{-1} A_k Q_k.

• Since Q_k is unitary all A_k are similar and therefore have the same eigenvalues.

• When the method converges A_k will be a triangular matrix and therefore have its
eigenvalues on the diagonal.
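A minimal sketch (Python/NumPy, made-up symmetric matrix so that the basic unshifted iteration behaves nicely) of the QR-method:

import numpy as np

def qr_method(A, num_iter=200):
    # basic (unshifted) QR-method; the diagonal of the result approximates the eigenvalues
    Ak = A.copy()
    for _ in range(num_iter):
        Q, R = np.linalg.qr(Ak)     # A_k = Q_k R_k
        Ak = R @ Q                  # A_{k+1} = R_k Q_k, similar to A_k
    return Ak

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])     # made-up symmetric matrix
Ak = qr_method(A)
print(np.sort(np.diag(Ak)))                 # approximate eigenvalues
print(np.sort(np.linalg.eigvalsh(A)))       # for comparison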
The QR-method can be seen as a variation of the Power method, but instead of
working with only one vector, we work with a complete basis of vectors, using the
QR-decomposition to normalize and orthogonalize in every step. In practice a few modifications are needed in order to handle a couple of problems in its most basic form, some
of which are:
• Computing the QR-decomposition of a matrix is expensive, and we need to do it
in every step.
• The convergence itself like with the Power method can also be very slow.
• Like the Power method we have problems with complex and non-simple eigenvalues.
However if we can handle these problems we have a very stable and fast algorithm
for general matrices. Especially the stability of the algorithm is one of its huge selling
points and the biggest reason why it is used over other similar methods using for example
the LUP decomposition. But how do we improve the algorithm?
We start by looking at how we can speed up the computations of the QR-decomposition
since even if we use Gram-Schmidt or some other method it is still a very expensive
operation. We remember a matrix (A) is (upper) Hessenberg if all elements ai,j = 0, i >
j + 1. In other words A is of the form:

A = \begin{bmatrix}
* & * & * & \cdots & * & * & * \\
* & * & * & \cdots & * & * & * \\
0 & * & * & \cdots & * & * & * \\
0 & 0 & * & \cdots & * & * & * \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & * & * & * \\
0 & 0 & 0 & \cdots & 0 & * & *
\end{bmatrix}
If A is Hessenberg then we would speed up the algorithm considerably since:

• Calculating the QR-factorization of a Hessenberg matrix is much faster than for a
general matrix.

• Multiplication of a Hessenberg matrix with a triangular matrix (both in the same
direction) is itself a Hessenberg matrix, so if A_0 is Hessenberg every A_k, k = 0, 1, 2, \dots
is Hessenberg as well.

• Since a Hessenberg matrix is nearly triangular, its convergence towards a triangular matrix will be improved, resulting in fewer total steps as well.
We don’t want to limit our algorithm to only Hessenberg matrices, however what if
we can change the matrix we work with to one in Hessenberg form? This is possible
since:
Theorem 11.1. Every square matrix A has a matrix decomposition

A = U^H H U

where

• U is a unitary matrix.

• H is a Hessenberg matrix.

Since U is a unitary matrix, A and H have the same eigenvalues, so in order to find
the eigenvalues of A we can instead find the eigenvalues of H, which is, as we wanted, Hessenberg. A Hessenberg decomposition can be calculated efficiently using for example
Householder transformations.
While we generally have a faster convergence using H, it can still be very slow. We
remember we could solve this problem in the Power method by shifting the eigenvalues
of the matrix. In the same way we can shift the eigenvalues in the QR-method using
(A_k - \sigma I). For A_k we then get:

A_{k+1} = R_k Q_k + \sigma I = Q_k^H (A_k - \sigma I) Q_k + \sigma I = Q_k^H A_k Q_k

We see that A_k and A_{k+1} are still similar and therefore will have the same eigenvalues. As
in the Power method we want \sigma to be as close as possible to one of the eigenvalues.
Usually the diagonal elements of the Hessenberg matrix are used since they prove to give
a rather good approximation. Since the diagonal elements of A_k should give increasingly
accurate approximations of the eigenvalues we can change our shift to another eigenvalue
or improve the current one between iterations as well.
As soon as a subdiagonal element is close to zero we have either found an eigenvalue,
if it is one of the outer elements, or we can divide the problem into two smaller ones, since
we would have:

A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ 0 & A_{2,2} \end{bmatrix}
This still leaves the problem with complex eigenvalues of real matrices. In order to
solve this problem, an implicit method using multiple shifts at the same time is used in
practice. We leave this implicit method, also called the Francis algorithm, by noting that
it exists and works by much the same principle as the inverse Power algorithm.
We give a summary of the modified QR-method using a single shift:
1. Compute the Hessenberg form (A = U^H H U) using for example Householder transformations.
2. Assign A_0 = H. Then, in every step, choose a shift σ as one of the diagonal elements
of A_k.
3. Compute the QR-factorization Q_k R_k of (A_k − σI).
4. Compute A_{k+1} = R_k Q_k + σI.
5. Iterate 2-4 until A_k is (numerically) triangular, optionally dividing the problem in
two whenever a subdiagonal element becomes zero.
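A minimal Python sketch of these steps is given below; it is an illustration only, assuming a real matrix with real eigenvalues, using the last diagonal element as shift, and skipping deflation and convergence checks.

    import numpy as np
    from scipy.linalg import hessenberg

    def qr_method(A, iterations=100):
        Ak = hessenberg(A)                           # step 1: reduce to Hessenberg form
        n = Ak.shape[0]
        I = np.eye(n)
        for _ in range(iterations):
            sigma = Ak[-1, -1]                       # step 2: shift from the diagonal
            Q, R = np.linalg.qr(Ak - sigma * I)      # step 3: QR of the shifted matrix
            Ak = R @ Q + sigma * I                   # step 4: recombine and shift back
        return np.sort(np.diag(Ak))                  # approximate eigenvalues

    B = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])                  # symmetric, so real eigenvalues
    print(qr_method(B))
    print(np.linalg.eigvalsh(B))                     # reference values for comparison

Since every step is a similarity transformation, the diagonal of A_k converges to the eigenvalues of B, and the two printed vectors should agree to high accuracy.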
11.3 Applications
11.3.1 PageRank
PageRank is the method originally designed by Google to rank homepages in their search
engine, in order to show the most relevant pages first. In the ranking algorithm we
consider homepages as nodes in a directed graph, and a link between two homepages as
an edge between the corresponding vertices in the graph. Since the number of homepages
is huge (and rapidly increasing), this not only puts high demands on the accuracy
of the model (few look at more than the first couple of pages of results), it also puts a
huge demand on the speed of the algorithm. The principle used in the ranking is that
homepages that are linked to by many other homepages, and/or homepages that are
linked to by other important homepages, are considered more important (higher rank)
than others. An image illustrating the principle can be seen in Figure ??.
PageRank uses the adjacency matrix (3.1) of the graph with a couple of modifications.
First, we do not allow any homepage to link to itself. Secondly, any homepage which links
to no other page, usually called a dangling node, needs to be handled separately. Usually
such nodes are instead assumed to link to all pages (even to themselves, even though we otherwise do not
allow self-links). This is the method we will use here, although there are other methods available
as well.
Before calculating PageRank we make one more modification to the adjacency matrix: we
weight the matrix by normalizing every non-zero row such that it sums to
one. This is our matrix A. PageRank is then the eigenvector corresponding to the dominant eigenvalue of the matrix M, normalized such that its elements sum to one, where

M = c(A + gv^T)^T + (1 − c)ve^T
[Figure: illustration of the PageRank principle (image by Felipe Micaroni Lalli).]
Here 0 < c < 1 is a scalar, usually c = 0.85, e is the one-vector and A as above
is a non-negative square matrix whose non-zero rows all sum to one. v is a non-negative weight
vector whose elements sum to one. g is a vector whose elements are equal to
zero except for those corresponding to dangling nodes, which are equal to
one. We see that saying that dangling nodes link to all other nodes is
not strictly true; rather, we assume that they link according to a weight vector with sum one.
Usually, however, all or at least most of the elements of v are equal, so that this is more
or less true.
To illustrate the method we give a short example where we consider the following
system:

[Figure: a directed graph on five nodes with edges 1→2, 1→3, 1→4, 2→4, 4→5 and 5→4; node 3 has no outgoing links and is thus a dangling node.]
We assume all elements of v are equal, so that no node is given a higher weight. Starting from
the adjacency matrix of the graph, normalizing every row to sum to one and adding the change
in the third row corresponding to the dangling node, we get:

(A + gv^T) = \begin{pmatrix}
0 & 1/3 & 1/3 & 1/3 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}
Using this we can now get our matrix M = c(A + gv^T)^T + (1 − c)ve^T:

M = \begin{pmatrix}
0 & c/3 & c/3 & c/3 & 0 \\
0 & 0 & 0 & c & 0 \\
c/5 & c/5 & c/5 & c/5 & c/5 \\
0 & 0 & 0 & 0 & c \\
0 & 0 & 0 & c & 0
\end{pmatrix}^T
+ \frac{1-c}{5}
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1
\end{pmatrix}
If we choose c = 0.85 and calculate the eigenvector corresponding to the dominant eigenvalue, we end
up with the PageRank p = [0.0384, 0.0492, 0.0492, 0.4458, 0.4173]^T.
From the result we can see a couple of things. First, node one, which has
no other node linking to it, has the lowest PageRank, and nodes two and three, which
are only linked to by node one, have equal and the second lowest PageRank. Node four
has a very high PageRank since it is linked to by many nodes, and node five has a high
PageRank since it is linked to by node four, which has a very high PageRank.
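The small example can be verified numerically; the sketch below (NumPy, illustration only) builds M for the graph above and extracts the dominant eigenvector, normalized to sum one.

    import numpy as np

    c = 0.85
    A = np.array([[0, 1/3, 1/3, 1/3, 0],   # node 1 links to 2, 3, 4
                  [0, 0,   0,   1,   0],   # node 2 links to 4
                  [0, 0,   0,   0,   0],   # node 3 is dangling
                  [0, 0,   0,   0,   1],   # node 4 links to 5
                  [0, 0,   0,   1,   0]])  # node 5 links to 4
    g = np.array([0, 0, 1, 0, 0])          # indicator of dangling nodes
    v = np.full(5, 1/5)                    # uniform weight vector
    e = np.ones(5)

    M = c * (A + np.outer(g, v)).T + (1 - c) * np.outer(v, e)

    w, V = np.linalg.eig(M)
    p = np.real(V[:, np.argmax(np.real(w))])   # eigenvector of the dominant eigenvalue
    p = p / p.sum()                            # normalize so the elements sum to one
    print(np.round(p, 4))                      # approx. [0.0384 0.0492 0.0492 0.4458 0.4173]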
We see that the definition seems reasonable; however, we still have a large problem
to consider: how do we find the eigenvector of the dominant eigenvalue of such a huge
matrix in a reasonable time? For reference, in 2008-2009 both Google and Bing reported
indexing over 1,000,000,000,000 webpages, which in turn means that our matrix M would have
about 10^{24} elements. Obviously something needs to be done.
We note that the matrix M has been constructed such that it is a non-negative matrix,
and it also seems reasonable to assume that it is irreducible thanks to our vector v. In fact,
unless some nodes have zero weight in v, M will be a positive matrix. We also have
that the sum of every column of M is equal to one, so M is actually a
(column) stochastic matrix. This gives us a more intuitive way to describe PageRank,
looking at it from a Markov chain perspective.
Consider an Internet surfer surfing the web. The surfer starts at a
random page; then, with probability c, it follows one of the links from the current page,
ending up on a new page, or, with probability 1 − c, it starts over at a new random page.
The surfer then repeatedly either follows a link from the current page
or starts over at a new random page. The PageRank of a node (page) can then be seen
as the probability that the surfer is at that page after a long time.
While this by itself does not make the problem easier, we note something about A. Since every
homepage on average only links to a couple of other homepages, A is an extremely
sparse matrix: while A may have around 10^{24} elements, only about 10^{12}·m of them are non-zero, where m is the average
number of links on a homepage. If we have a sparse matrix, we can use
the Power method and exploit the sparsity of the data rather than trying (and failing) to use
other methods such as the QR-method. We note that even though M is a full matrix,
the multiplication x_{n+1} = Mx_n in the Power method can be done fast by first calculating
the sparse product A^T x_n.
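A rough sketch of one such Power method step is given below (SciPy/NumPy, illustration only; it also relates to the first question below). M is never formed explicitly, and the only expensive operation is the sparse product A^T x_n.

    import numpy as np
    import scipy.sparse as sp

    def pagerank_step(x, A_sparse, v, g, c=0.85):
        # M x = c A^T x + c v (g^T x) + (1 - c) v (e^T x); for a probability
        # vector x we have e^T x = 1, so only the sparse product A^T x is costly.
        return c * (A_sparse.T @ x) + (c * (g @ x) + (1 - c)) * v

    # Reusing the five-node example; in practice A is huge but very sparse.
    A = sp.csr_matrix([[0, 1/3, 1/3, 1/3, 0],
                       [0, 0,   0,   1,   0],
                       [0, 0,   0,   0,   0],
                       [0, 0,   0,   0,   1],
                       [0, 0,   0,   1,   0]])
    g = np.array([0, 0, 1, 0, 0])
    v = np.full(5, 1/5)

    x = np.full(5, 1/5)                    # start from the uniform distribution
    for _ in range(100):
        x = pagerank_step(x, A, v, g)
    print(np.round(x, 4))                  # converges to the same PageRank as before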
Take a couple of minutes to think through the following:
• How can we calculate x_{n+1} = Mx_n by instead multiplying x_n with cA^T, so that we
can use that A is sparse?
• How can the weight vector v be used in order to combat link farms, where people
try to artificially increase their PageRank?
• The attribute ’nofollow’ can be used on links to not increase the PageRank of the
target page. Can you think of any sites where this could be useful?
• What influence does the value c have on the PageRank? For example, what is
important for a high PageRank if c is close to zero rather than 0.85?
Index
adjacency matrix, 28
adjugate formula, 106
adjugate matrix, adj(A), 106
aperiodic, 41
associativity, 60
basis, 65
bijection, 84
bijective, 88
canonical form, 22
stochastic matrix, 52
Cayley-Hamilton theorem, 106
Cesàro sum, 50
characteristic polynomial, 13
Cholesky factorization, 21
commutativity, 60
connected (graph), 31
connected component (graph), 31
continuous time Markov chain, 56
covariance matrix, 22
cycle (graph), 30
degree matrix, 33
determinant, 10, 21
determinant function, 10
diagonal matrix, 6
diagonalizable, 13
distance, 68
distance matrix, 29
distributivity, 60
echelon form, 20
eigenvalue, 13, 39
eigenvector, 13, 39
error vector, 94
field, 60
Gauss elimination, 9
Hermitian conjugate, 18
Hermitian matrix, 18, 21
hermitian matrix, 8
hermitian transpose, 8
Hessenberg matrix, 17
hitting probability, 54
hitting time, 55
homogeneous coordinates, 86
Image, R(T), 88
imprimitive matrix, 42
injection, 84
injective, 88
inner product space, 66
interpolation, 95
irreducible matrix, 31, 39
Isomorphism, 91
Jordan block, 23
Jordan normal form, 23
Kernel, N(T), 88
Kronecker product, 117
Laplacian matrix, 34
least-square sense, 93
linear combination, 62
linear space, 60
linear system, 8
linear transformation, 79
linearly dependent, 64
linearly independent, 64
LU-factorization, 21
LUP-factorization, 21
Markov chain, 46
matrix decomposition, 16
matrix exponential function, e^A, 113
matrix factorization, 16
matrix polynomial, 106
matrix power function, A^n, 100
matrix root function, √A, A^{1/n}, 103
Moore-Penrose pseudoinverse, 98
multilinear, 11
nilpotent matrix, 102
non-negative matrix, 39
normal equations, 95
Nullity, 90
Nullspace, N(T), 88
orthogonal, 69
orthogonal set, 69
orthonormal set, 69
path (graph), 30
permutation matrix, 21
Perron-Frobenius (theorem), 39
positive definite matrix, 19, 21
positive matrix, 40
Power method, 121
primitive matrix, 40
pseudoinverses, 98
QR-factorization, 22
QR-method, 126
rank, 90
rank factorization, 20
reduced row echelon form, 23
residual vector, 94
rotation matrix, 82
row echelon form, 9
Schwarz inequality, 68
similar matrix, 14
simple graph, 33
singular value decomposition, 24
span, 62
spanning set, 63
sparse matrix, 112
spectral factorization, 20
spectral radius, 13, 39
spectrum, 13
square root of a matrix, √A, A^{1/2}, 103
stable, 119
standard matrix, 81
stationary distribution, 48
stochastic matrix, 46
irreducible and imprimitive, 50
irreducible and primitive, 49
reducible, 51
strongly connected (graph), 31
strongly connected component, 31
subspace, 61
surjection, 84
surjective, 88
SVD, 24
symmetric matrix, 8
trace, 6
transpose, 8
triangular matrix, 17, 21, 51
unitary matrix, 18, 22, 70
vector space, 60
weighted least square method, 97