Principal Components Analysis in Tree Space

Tom M. W. Nye, Newcastle University
Isaac Newton Institute, 21 June 2011
Summary
The problem
Take a sample of trees on a fixed set of taxa
- gene trees
- bootstrap or posterior sample
How can you characterize the data set? Quantify variability?
Standard tools of multivariate analysis: clustering,
Multi-Dimensional Scaling (MDS), Principal Components
Analysis (PCA)
Trees aren’t vectors – so how will PCA work?
A solution
Regard PCA as a geometrical procedure
Re-frame this procedure in tree space
Example
[Figure: three example trees on the same taxa: human, pantro, mmulatta, mmusculus, rnorvegicus, canis, btaurus, monodelphis, gallus, xtropicalis, takifugu, tetraodon, danio, ciona, dmelanogaster, dpseudoobscura, anopheles, apis, celegans, cbriggsae, yeast]
Data thanks to Anne Kupczok and Arndt von Haeseler
Why PCA?
Potentially have 1000s of trees on 100s of species.
How do the data vary?
Principal components analysis:
- Which features vary the most?
- How are features correlated?
- Quantification: proportion of variance explained
Analysing tree geometry (rather than topology) has advantages:
- Geometry and topology are interdependent
- Differences in geometry offer finer resolution of comparison
A Naive Approach
A split is a bi-partition of the leaves induced by cutting a branch
[Figure: tree on leaves A, B, C, D, E illustrating the split AB|CDE]
Trees consist of sets of compatible splits
e.g. AB|CDE and AC|BDE cannot both be in a tree
Embed tree-space in R^M:
- On m leaves there are M = 2^(m−1) − 1 possible splits
- Associate each split with a basis vector of R^M and represent each tree as a vector of branch lengths
- Each tree contains at most 2m − 3 ≪ M splits
Principal components in R^M usually do not correspond to trees
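The split definitions above can be checked with a small sketch. It assumes a split is stored as a pair of leaf sets, and that two splits are compatible when at least one of the four pairwise intersections of their sides is empty (the standard criterion). The function names are illustrative and not taken from the talk.

```python
def make_split(side, leaves):
    """Return a split as a frozenset holding its two sides (hypothetical helper)."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})

def compatible(s1, s2):
    """Splits are compatible iff some side of s1 is disjoint from some side of s2."""
    return any(not (a & b) for a in s1 for b in s2)

def num_splits(m):
    """M = 2^(m-1) - 1 bipartitions of m leaves into two non-empty sides."""
    return 2 ** (m - 1) - 1

def max_splits_in_tree(m):
    """A fully resolved unrooted tree on m leaves has 2m - 3 branches."""
    return 2 * m - 3

leaves = "ABCDE"
ab = make_split("AB", leaves)
ac = make_split("AC", leaves)
de = make_split("DE", leaves)

print(compatible(ab, ac))  # False: AB|CDE and AC|BDE cannot share a tree
print(compatible(ab, de))  # True: AB|CDE and DE|ABC can
print(num_splits(20), max_splits_in_tree(20))  # 524287 37
```

With m = 20 leaves the embedding space already has over half a million coordinates, of which any one tree uses at most 37.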
Geometry of PCA
[Figure: scatter of data points, built up over four steps]
1. Find centre of data (point which minimizes sum of squared distances)
2. Consider line through centre
3. Project points onto line
4. Identify line that maximizes variance (or minimizes sum of squared distances)
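The four geometric steps can be sketched for ordinary points in the plane. This is a minimal illustration in Euclidean space, not in tree space, and it replaces the usual eigenvector computation with a brute-force search over line directions so that each step stays visible. The function names and the step count are assumptions.

```python
import math

def centre(points):
    """Step 1: the mean minimizes the sum of squared Euclidean distances."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def projected_variance(points, c, theta):
    """Steps 2-3: project onto the line through c at angle theta and
    return the variance of the projected coordinates."""
    ux, uy = math.cos(theta), math.sin(theta)
    t = [(x - c[0]) * ux + (y - c[1]) * uy for x, y in points]
    return sum(v * v for v in t) / len(t)  # mean of t is zero after centring

def first_component(points, steps=1800):
    """Step 4: search over directions for the line of maximal variance."""
    c = centre(points)
    angles = (i * math.pi / steps for i in range(steps))
    best = max(angles, key=lambda th: projected_variance(points, c, th))
    return c, best

points = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]
c, theta = first_component(points)
print(c, round(math.degrees(theta), 1))
```

In tree space there is no closed-form solution, so a search of this kind, over both direction and topology, takes the place of the eigen-decomposition.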
Topological Decomposition of Tree-Space (1)
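Trees with the same fixed topology are represented as points in the positive orthant of R^(m−3) (ignoring leaf branches)
e.g. m = 5 taxa represented by a quadrant in R^2
[Figure: quadrant with axes labelled by the splits AB|CDE and ABC|DE, with example trees on leaves A to E drawn at points in the quadrant]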
Topological Decomposition of Tree-Space (2)
Orthants are stitched together along their faces
[Figure: quadrants sharing a common axis, with axes labelled AB|CDE, ABC|DE, ABE|CD and ABD|CE; moving between adjacent quadrants corresponds to a nearest neighbour interchange]
Geodesic Metric
Billera, Holmes, Vogtmann (2001) proved existence of the geodesic metric
- Consider paths composed of straight line segments in each orthant
- Define length of path to be sum of segment lengths
Then:
- The geodesic between two points x, y is the shortest path between the points
- Geodesic metric d(x, y) = length of geodesic
- Within each orthant, geodesic metric = Euclidean metric
Calculation:
- Anne Kupczok and Megan Owen presented algorithms (2007)
- Owen and Provan (2009) developed O(m^3) algorithm
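The definition can be illustrated on a chain of three quadrants of 5-taxon tree space laid flat in the plane. This is a sketch under stated assumptions, not the Owen and Provan algorithm: quadrant k shares its second axis with the first axis of quadrant k + 1, a point is given as (k, a, b) with branch lengths a and b on those two axes, and the function names are hypothetical. If the unfolded angle between two points is below π the geodesic is a straight segment; otherwise it passes through the origin (the star tree).

```python
import math

def to_polar(k, a, b):
    """Polar coordinates of a point in quadrant k (0, 1 or 2) of the unfolded chain."""
    r = math.hypot(a, b)
    angle = k * math.pi / 2 + math.atan2(b, a)
    return r, angle

def geodesic_distance(p, q):
    """Geodesic distance between two points of the three-quadrant chain."""
    (r1, t1), (r2, t2) = to_polar(*p), to_polar(*q)
    gap = abs(t1 - t2)
    if gap >= math.pi:
        return r1 + r2  # path through the origin
    # straight segment in the unfolded plane (law of cosines)
    return math.sqrt(max(0.0, r1 * r1 + r2 * r2 - 2 * r1 * r2 * math.cos(gap)))

print(geodesic_distance((0, 3, 0), (0, 0, 4)))  # same quadrant: Euclidean
print(geodesic_distance((0, 1, 1), (1, 1, 1)))  # adjacent quadrants
print(geodesic_distance((0, 1, 1), (2, 1, 1)))  # through the origin
```

Within a single quadrant the formula reduces to the Euclidean distance, matching the statement that the two metrics agree inside each orthant.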
Example
[Figure: adjacent quadrants of tree space with axes labelled by the splits AB|CDE, DE|ABC, AC|BDE and BE|ACD, with example trees on leaves A to E and marked points x1, x2, y1, y2]
ΦPCA: PCA with the geodesic metric
For first principal component:
1. Find centre x0 of data x1, ..., xn
2. Consider each line through x0:
- line L is geodesic between all points x, y ∈ L
- lines extend to infinity in two directions
3. Project x1, ..., xn onto line by finding points yi on geodesic closest to xi
4. Identify line that maximizes variance of y1, ..., yn (or minimizes Σ d(xi, yi)^2)
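The steps can be traced by hand in the smallest tree space. The sketch below assumes 4 taxa with leaf branches ignored, so tree space is three rays (one per split) meeting at the star tree. It further assumes the centre x0 is the star tree rather than computing it, and it uses the residual criterion Σ d(xi, yi)^2. Each line through x0 is then a pair of rays. All names are illustrative, and this toy is not the search procedure from the talk.

```python
from itertools import combinations

SPLITS = ("AB|CD", "AC|BD", "AD|BC")  # the three rays of 4-taxon tree space

def project(tree, line):
    """Closest point on the line to the tree, and the geodesic distance to it."""
    split, length = tree
    if split in line:
        return (split, length), 0.0   # tree already lies on the line
    return (None, 0.0), length        # nearest point is the star tree x0

def residual(trees, line):
    """Sum of squared geodesic distances from the trees to their projections."""
    return sum(project(t, line)[1] ** 2 for t in trees)

def first_component(trees):
    """Line (pair of rays through x0) with the smallest residual."""
    return min(combinations(SPLITS, 2), key=lambda line: residual(trees, line))

trees = [("AB|CD", 1.0), ("AB|CD", 0.8), ("AC|BD", 1.2), ("AD|BC", 0.1)]
print(first_component(trees))  # ('AB|CD', 'AC|BD')
```

Here the best line joins the two splits that carry almost all of the branch length, leaving only the short AD|BC tree off the line.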
Details
Finding centre x0:
- Cannot ‘add’ trees, so cannot compute a mean
- Minimizing Σ d(x0, xi)^2 is computationally expensive
- Use the majority consensus tree
Projection:
- Find closest point on L to each point xi
- Numerical search made efficient by
  - Euclidean approximation
  - shifting end points slightly: geodesic doesn’t change much
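The topology of a majority consensus tree can be sketched as follows, assuming each tree is given as a set of canonically named splits and using the usual majority rule (keep splits found in more than half of the trees). How branch lengths are assigned to the consensus is not stated on the slide and is left out here. The function name is hypothetical.

```python
from collections import Counter

def majority_consensus(trees):
    """Splits present in more than half of the trees."""
    counts = Counter(s for tree in trees for s in set(tree))
    return {s for s, c in counts.items() if 2 * c > len(trees)}

trees = [
    {"AB|CDE", "ABC|DE"},
    {"AB|CDE", "ABE|CD"},
    {"AB|CDE", "ABC|DE"},
    {"AC|BDE", "ABC|DE"},
]
print(majority_consensus(trees))
```

Any two splits that each occur in more than half of the trees must occur together in at least one tree, so the retained splits are always mutually compatible and define a valid tree.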
Searching for the Optimal Line
No analytic solution – need to search for optimal line
Geometrical problem: direction vector in initial quadrant
Topological problem: which orthants?
ΦPCA algorithm:
Add one split p and its NNI replacement p′ at a time
- p, p′ must satisfy compatibility constraint with L
- optimize direction vector given splits p, p′
- for each pair p, p′ and direction vector, perform projection
Analogous to one coordinate at a time in regular PCA
Search algorithms for p, p′:
- greedy approach
- simulated annealing: birth / death moves
Results
1. Application to metazoan (animal) data set
2. Long branch attraction simulation example
Metazoan data set (von Haeseler group, J. Comp. Biol. 2008)
118 gene trees from 20 metazoan species
Metazoan Data Set
[Figures: sequence of tree diagrams for the metazoan data set; images not recoverable from the transcript]
Long branch attraction (LBA) simulation
When the true tree contains long branches next to short branches, long branches are erroneously placed together in estimated phylogenies
Simulation:
- Simulate 100 DNA alignments from ‘true tree’ (genes)
- Estimate trees from alignments (gene trees)
- Apply ΦPCA to resultant collection of alternative trees
Long Branch Attraction Tree
Tree taken from Philippe, Syst. Biol. 2005, ‘An Empirical assessment of
LBA artefacts’
[Figure: tree on 41 taxa: apicomp_1 to apicomp_6, cilliates, diatoms_1, diatoms_2, algae, oomycetes, plants_1 to plants_6, conosa_1, conosa_2, rhodophyta, fungi_1 to fungi_8, metazoa_1 to metazoa_4, choanoflagellida, crenarch_1 to crenarch_3, euryarch_1 to euryarch_3, guillardia, microsporidia]
LBA Simulation
[Figures: sequence of tree diagrams from the LBA simulation; images not recoverable from the transcript]
Summary
How can you understand variability in large sets of trees?
Attempt to re-interpret PCA as a geometrical procedure in
tree-space
Incorporates geometrical and topological information about
trees via the geodesic metric
Computational feasibility is achieved by greedy or stochastic
search for optimal line
Method successfully identifies the most variable features and
correlations between features
Captures known variability in experimental data and reveals
long branch attraction in simulated data
Further work:
- Higher components: approximation by ‘planes’
- Better theory of distributions on tree space
Thanks to:
- Anne Kupczok and Arndt von Haeseler
- Two anonymous reviewers
