The DNA Within: Mutation, Selection, Evolution and
Phylogenetic Dance Composition
François-Joseph Lapointe
Département de sciences biologiques, Université de Montréal,
CP 6128, Succ. Centre-ville, Montréal (QC), H3C 3J7, Canada
[email protected]
F- J. Lapointe – The DNA Within
Abstract. The science of phylogenetics aims at recovering the evolution of a
gene over time by estimating the ancestor-descendant relationships among
different species, or populations. The complex processes of mutation and
selection are the main forces driving the evolution of genes that make up our
DNA, and the complete history of a particular gene can only be studied using
appropriate models of evolutionary change. A choreographer also applies
mutation and selection to generate dance pieces. Whereas the genes compose
the words of the genetic vocabulary, the movements represent the vocabulary
of dance. Consequently, a phylogenetic approach can be applied to transform
movement sequences over time, to mutate the movement vocabulary, and to
evolve choreographies. In this paper, I will propose different models of DNA
evolution, in the specific context of phylogenetic dance composition. This
process can be easily implemented using a computer algorithm, which takes as
input an original movement sequence to produce a number of mutant
sequences that are obtained through various types of choreographic operations.
The history of movement sequences over time, the phylogenetic
choreography, thus retraces the cumulative transformation of ancestral
sequences into descendant sequences, based on a choreographic model of
evolutionary change.
Keywords: Dance composition, Maximum Likelihood Inference,
Phylogenetic Trees, Models of Sequence Evolution.
1. Introduction
What if one could go back in time to document the genesis of an artistic
piece from its inception to the end? What if one could retrace the
chronological process leading to a masterpiece? What if one could reverse-engineer creativity? That is surely impossible. Yet, using very simplistic
models of evolutionary computation, it may be feasible to propose a solution
to this difficult task. Retracing the evolution of a piece in time is not unlike
inferring the evolution of Life. There exists a wide range of mathematical
models to estimate the phylogenetic history of species. These methods take as
input a set of objects (species) represented by a given number of attributes
(molecular sequences), and return a tree (a phylogeny) depicting the
relationships among these objects. Several trees, each one representing a
different phylogenetic hypothesis, are distinguishable. The real challenge is to
infer the best tree, given a particular model of sequence evolution.
In this paper, it is postulated that the phylogenetic approach can be applied
to dance composition. Assuming that the species can be replaced by the
different dancers, each represented by a sequence of movements, phylogenetic
inference can be applied to retrace the original sequence of movements from
which all other sequences may have evolved. In other words, this process aims
at finding the mother of all movement sequences, that from which the
choreographer draws inspiration.
In order to present the details of the algorithmic process for deciphering
dance composition, phylogenetic analysis will first be introduced, with
definitions of trees and the symbolic notations that will be used throughout
this paper. Different models of nucleotide evolution will also be presented
with respect to maximum likelihood inference. These molecular models will
then be translated into various choreographic models of movement sequence
evolution, and the phylogenetic approach will be illustrated with a specific
example. Possible extensions and generalizations to other artistic fields will
finally be discussed.
2. Phylogenetic Analysis in a Nutshell
We all have a family tree: a genealogy retracing the relationships among
siblings, parents, grandparents, uncles and cousins. A phylogeny is nothing
more than a family tree of species. However, unlike a genealogy for which the
ancestors are usually known (at least for recent generations), a phylogenetic
tree merely presents a hypothesis about ancestor-descendant relationships.
Fig. 1. A phylogenetic tree T, with labeled leaves corresponding to a set S = {A, B, C, D, E}
of different species. The internal nodes are represented by black circles, whereas the terminal
nodes (the leaves) are represented by white circles. The path between D and E is illustrated by
dashed edges. The root imposes a direction on the branches from the bottom to the top of the
tree. This tree is unweighted.
In mathematical terms, a phylogenetic tree T is defined as an acyclic and
connected graph. That is, it depicts a set V of vertices (the nodes of the
tree) connected by a set E of edges (the branches of the tree), in such a way
that there exists one and only one path of adjacent edges between any two
nodes (Fig. 1). Typically, labels are attached to the nodes of a tree to represent
objects. In a phylogeny, only the terminal nodes (the leaves of the tree) are
labeled, whereas internal nodes are also labeled in a genealogy. In both cases,
one node can be labeled as the ancestor to all other nodes (the root of the tree)
to impose a direction on the branches of the tree. Furthermore, a tree can have
branch lengths (or weights) attached to the edges to represent the amount of
time separating two nodes. In a phylogeny, these weights are a function of
genetic distances between nodes, so that the sum of branch lengths along the
path connecting any two leaves represents the evolutionary distance between a
pair of species. Unweighted trees do not have branch lengths and evolutionary
distances are defined in such cases as the number of branches along the path
between two nodes.
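The definition above can be illustrated with a minimal sketch. The topology below is hypothetical (it is not claimed to be the tree of Fig. 1), and the internal node names `u`, `v`, `w` and `root` are labels introduced only for this example. The sketch recovers the unique path between two nodes and the unweighted distance, counted as the number of branches along that path.

```python
from collections import deque

# Hypothetical unweighted tree on the leaves A..E (assumed topology).
EDGES = [("root", "u"), ("root", "v"), ("u", "A"), ("u", "B"),
         ("v", "C"), ("v", "w"), ("w", "D"), ("w", "E")]


def adjacency(edges):
    """Build an undirected adjacency list from a list of edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def path(adj, start, goal):
    """Return the unique path between two nodes of a tree (breadth-first search)."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    nodes = []
    node = goal
    while node is not None:
        nodes.append(node)
        node = parent[node]
    return nodes[::-1]


def distance(adj, a, b):
    """Unweighted evolutionary distance: number of branches on the path."""
    return len(path(adj, a, b)) - 1


ADJ = adjacency(EDGES)
print(path(ADJ, "D", "E"), distance(ADJ, "D", "E"))
```

For a weighted tree, the same path would be used, but the branch lengths along it would be summed instead of counted.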
2.1 The Number of Phylogenetic Trees
Given a set of species S = {1, 2, …, n}, the objective of phylogenetic
inference is to find the tree T that best represents the evolutionary
relationships among these species. When the number of leaves n is small, this
problem can be solved exhaustively by considering each and every possible
hypothesis of relationships and finding the one that optimizes a given criterion
(see section 2.3). However, the number of distinguishable trees Tn increases
dramatically with the number of leaves n [1].
$$T_n = \frac{(2n - 3)!}{2^{n-2}\,(n - 2)!} \qquad (1)$$
Namely, there are 15 possible topologies (depicting rooted phylogenies) for
a set of 4 objects, but that number reaches 34,459,425 for 10 objects.
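As a quick check of Equation (1), the following sketch evaluates the formula for small values of n. The function name `num_rooted_trees` is introduced here for illustration only.

```python
from math import factorial


def num_rooted_trees(n):
    """Number of rooted binary topologies for n labeled leaves, Equation (1)."""
    if n < 2:
        raise ValueError("Equation (1) is used here for n >= 2 leaves")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 10):
    print(n, num_rooted_trees(n))
```

The values for 4 and 10 leaves agree with the counts quoted in the text (15 and 34,459,425).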
Because it is impossible to evaluate all possible trees when the number of
species is too large, heuristic algorithms are employed to approximate the
results of phylogenetic analysis, and numerous different search strategies have
been proposed over the years to speed up the process and improve its accuracy
[2, 3]. Yet, it is important to note at this stage that the “true” phylogeny, which
is sought, is not known, and that phylogenetic trees represent, at best, probable
relationships among extant species. In order to recover that elusive tree of life,
different models of molecular evolution are thus required.
2.2 Models of Molecular Evolution
To estimate phylogenetic trees from a set of molecular sequences, one first
has to define a model of molecular evolution to compute the substitution rates
among the different nucleotides. For the sake of clarity, this section will only
focus on models of DNA evolution, but other models for different types of
sequences (e.g., proteins) are also available [2]. All models are based on a
different number of parameters that describe (1) the frequencies of the four
nucleotides in DNA sequences (A, C, T, and G), and (2) the substitution
probabilities among these four states, among others (for more complex
models, the variance of the evolutionary rates among sites and the number of
invariable sites can also be considered).
Fig. 2. Examples of models of DNA evolution. (a) JC model [4], with only one rate of
substitutions (α) among all nucleotides. (b) K2P model [6], with different substitution rates for
transitions (α) and transversions (β). (Adapted from [2]).
There exist six possible types of substitution between pairs of nucleotides.
The simplest model (JC model), introduced by Jukes and Cantor [4], assumes that
the frequencies of all nucleotides are equal in the sequences, and that the
substitution rates from one nucleotide to another are all the same (see Fig. 2a).
In other words, there is no difference between transitions (substitutions
between purines {A, G} or between pyrimidines {C, T}) and transversions
(substitutions between a purine and a pyrimidine), and at equilibrium, all
nucleotides should appear in the sequences with probability 1/4. Because
this model does not accurately represent actual sequences, a large number of
alternative scenarios have been proposed (see Fig. 3 for the relationships
among some of these models).
Fig. 3. Relationships between different models of DNA evolution for different types of
substitution rates, and with and without equal nucleotide frequencies. (Modified from [3]).
For example, the F81 model [5] assumes that all substitution rates are equal,
but that the frequencies of nucleotides in the sequence can differ. On the other
hand, one can fix the nucleotide frequencies to be equal, but allow transition
rates to differ from transversion rates (see Fig. 2b); this is the so-called Kimura
2-parameter model (K2P [6]) or the corresponding HKY [7] model for
unequal nucleotide frequencies. Of course, a 3-parameter model [8] is also
available, as well as any other that assigns different rates to specific types of
substitutions [3]. The richest of all, the General Time-Reversible model (GTR)
assumes that all six substitution rates are different, and that the frequencies of
nucleotides can vary [9]. Although the GTR model may better represent actual
sequences, it is also much more difficult to compute. One of the challenges of
phylogenetic analysis is to find the optimal model to infer a phylogeny, and
several tests are available to do so [10]. In the following section, the maximum
likelihood approach will be introduced to infer a phylogeny with respect to a
given model of DNA evolution, and to allow for comparisons among different
phylogenetic hypotheses.
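To make the two simplest models concrete, the sketch below computes the probability of observing nucleotide j after time t given nucleotide i, using the standard closed-form solution of the K2P model. It assumes that α is the rate of transitions and β the rate of each of the two possible transversions, which is one common parameterization and may differ from the scaling used in Fig. 2. Setting α = β recovers the JC model. Function names are introduced for illustration only.

```python
import math

NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def k2p_prob(i, j, t, alpha, beta):
    """Probability that nucleotide i is replaced by j after time t under K2P.

    Assumption: alpha is the transition rate and beta the rate of each
    of the two transversions available to a nucleotide.
    """
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    if i == j:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (i in PURINES) == (j in PURINES):
        # Transition: purine to purine, or pyrimidine to pyrimidine.
        return 0.25 + 0.25 * e1 - 0.5 * e2
    # Transversion: purine to pyrimidine, or the reverse.
    return 0.25 - 0.25 * e1


def jc_prob(i, j, t, alpha):
    """JC model as the special case of K2P with a single rate."""
    return k2p_prob(i, j, t, alpha, alpha)


for j in NUCLEOTIDES:
    print("A ->", j, round(k2p_prob("A", j, 0.1, 2.0, 0.5), 4))
```

With a higher transition rate than transversion rate, a change from A to G is more probable than a change from A to C or T over the same time span, and all probabilities approach 1/4 as t grows.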
2.3 Maximum Likelihood Inference of a Phylogenetic Tree
Given a set of DNA sequences of length m, collected on a set S of n
species, the objective of phylogenetic inference is to find the tree T that best
explains the actual distribution of sequences D, with respect to a given
model of sequence evolution. Formally speaking, we are thus looking for the
tree that will maximize the following function [2]:
$$L = \mathrm{Prob}(D \mid T) \qquad (2)$$
where L is the likelihood of the data D, given a phylogenetic hypothesis T.
To be able to do so, a large number of trees have to be visited by a search
algorithm (when n is small the tree space can be evaluated exhaustively), and
the tree with the best likelihood is selected as the most probable solution.
However, because different trees can be obtained when using different models
of DNA evolution, log-likelihood ratio tests [10] have to be computed to select
the most parsimonious model (the one with the fewest parameters that still fits the data adequately). Given
this optimal tree, it then becomes possible to infer the ancestral sequences of
DNA at internal nodes, going as far back as the root of the tree [11]. This
approach is exactly what will be used to estimate the mother of all movement
sequences for phylogenetic dance composition.
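As an illustration of Equation (2), the following sketch computes the log-likelihood of a small rooted tree under the JC model with a recursive (pruning-style) computation, in which the unknown states at internal nodes are summed out. It is a sketch under stated assumptions, not the procedure used in this paper: the tree encoding (a leaf is a sequence string, an internal node is a list of (child, branch length) pairs), the toy sequences and the branch lengths are all hypothetical, sites are assumed to evolve independently, and the root state is assumed to be uniform.

```python
import math

NUCLEOTIDES = "ACGT"


def jc_prob(i, j, t, alpha=1.0):
    """JC probability of going from nucleotide i to j along a branch of length t."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial_likelihoods(node, site):
    """Return {state: Prob(leaves below node at this site | node is in state)}."""
    if isinstance(node, str):
        # Leaf: the observed nucleotide has probability 1, all others 0.
        return {s: 1.0 if node[site] == s else 0.0 for s in NUCLEOTIDES}
    result = {s: 1.0 for s in NUCLEOTIDES}
    for child, t in node:
        below = partial_likelihoods(child, site)
        for s in NUCLEOTIDES:
            result[s] *= sum(jc_prob(s, x, t) * below[x] for x in NUCLEOTIDES)
    return result


def log_likelihood(tree, n_sites):
    """log Prob(D | T), assuming independent sites and a uniform root state."""
    total = 0.0
    for site in range(n_sites):
        partial = partial_likelihoods(tree, site)
        total += math.log(sum(0.25 * partial[s] for s in NUCLEOTIDES))
    return total


# Hypothetical toy data: species a and b are identical, c differs at two sites.
seq_a, seq_b, seq_c = "ACGTAC", "ACGTAC", "TCGAAC"
tree_ab = [([(seq_a, 0.1), (seq_b, 0.1)], 0.1), (seq_c, 0.1)]
tree_ac = [([(seq_a, 0.1), (seq_c, 0.1)], 0.1), (seq_b, 0.1)]
print(log_likelihood(tree_ab, 6), log_likelihood(tree_ac, 6))
```

With these branch lengths, the tree grouping the two identical sequences obtains the higher likelihood. A full analysis would also optimize the branch lengths for each candidate tree before comparing them.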
3. Phylogenetic Dance Composition
The rationale of this paper is that a phylogenetic approach can be employed
to reverse engineer dance composition. That is, it may be possible to infer
the choreographic process by going back in time. Assuming that the leaves of
a tree are dancers, and that the corresponding terminal sequences
represent the movements they must execute, the phylogenetic analysis then
amounts to retracing the evolution of movements from a common ancestor, or
to turning the tree upside down to infer the creative process that may have
generated that piece. However, the result of this process greatly depends on
the model at hand.
In the following sections, the models of nucleotide evolution will be
generalized to movement sequence evolution. This choreographic model will
then be implemented and illustrated to infer a choreographic tree for three
dancers, using a maximum likelihood criterion.
3.1 Models of Choreographic Evolution
The creation of a dance involves a series of choreographic operations,
which are not unlike the mutations occurring at the DNA level. Consequently,
models of molecular evolution can be applied to model the evolution of
movement sequences in time. However, different types of mutations are
required for dance composition, in addition to the simple substitution of a
movement by another. Indeed, six choreographic operators between movement
sequences have already been defined to model the evolution of dance using
a genetic algorithm [12, 13]: substitution, insertion, deletion, repetition,
inversion, and translocation (see Fig. 4 for details). However, contrary to DNA
sequences, these mutations apply to the entire sequence and not to single
nucleotides (or movements). Several different models of choreographic
evolution may thus be defined. The simplest one, related to the JC model,
assumes that all movements are equiprobable and that all types of mutations
have the same rate of evolution. At the other end of the spectrum, the equivalent
of the GTR model allows the different movements to have unequal
frequencies, and assumes different rates for the different types of mutations.
As for molecular sequences, intermediate models with a restricted number of
mutations are also available, depending on the choreography.
Fig. 4. Different types of choreographic mutations between movement sequences. The top line
represents the original movement sequence, whereas the bottom line is the mutant sequence.
In each case, the movements affected by the mutation are underlined.
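The six operators can be sketched as simple list operations. The exact definitions of Fig. 4 are not reproduced here, so the semantics below (positions, segment boundaries, and the way a segment is repeated or moved) are assumptions made for illustration, as are the function names and the `RATES` dictionary. Equal rates correspond to the simplest, JC-like choreographic model.

```python
import random


def substitution(seq, i, move):
    """Replace the movement at position i by another movement."""
    return seq[:i] + [move] + seq[i + 1:]


def insertion(seq, i, move):
    """Insert a movement before position i."""
    return seq[:i] + [move] + seq[i:]


def deletion(seq, i):
    """Remove the movement at position i."""
    return seq[:i] + seq[i + 1:]


def repetition(seq, i, j):
    """Repeat the segment seq[i:j] immediately after itself."""
    return seq[:j] + seq[i:j] + seq[j:]


def inversion(seq, i, j):
    """Reverse the order of the segment seq[i:j]."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def translocation(seq, i, j, k):
    """Cut the segment seq[i:j] and paste it at position k of the remainder."""
    segment, rest = seq[i:j], seq[:i] + seq[j:]
    return rest[:k] + segment + rest[k:]


# Assumed rates: all six operators equally likely (JC-like model).
RATES = {"substitution": 1.0, "insertion": 1.0, "deletion": 1.0,
         "repetition": 1.0, "inversion": 1.0, "translocation": 1.0}


def random_mutant(seq, vocabulary, rates, rng):
    """Apply one operator to a non-empty sequence, drawn in proportion to its rate."""
    name = rng.choices(list(rates), weights=list(rates.values()), k=1)[0]
    i = rng.randrange(len(seq))
    j = rng.randrange(i + 1, len(seq) + 1)
    if name == "substitution":
        return substitution(seq, i, rng.choice(vocabulary))
    if name == "insertion":
        return insertion(seq, i, rng.choice(vocabulary))
    if name == "deletion":
        return deletion(seq, i)
    if name == "repetition":
        return repetition(seq, i, j)
    if name == "inversion":
        return inversion(seq, i, j)
    k = rng.randrange(len(seq) - (j - i) + 1)
    return translocation(seq, i, j, k)


ancestor = [1, 2, 3, 4]
print(random_mutant(ancestor, [1, 2, 3, 4], RATES, random.Random(0)))
```

Unequal values in `RATES` would give the GTR-like end of the spectrum, and setting some rates to zero would give the intermediate models with a restricted number of mutations.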
In addition to the models of movement sequence evolution, dance
composition algorithms also require information about the transitions from
one movement to the next. For example, some combinations of movements
may be restricted or impossible, whereas other combinations may be more
likely. To account for this syntactic process, a transition probability matrix
between all possible pairs of movements in a sequence is computed. This
matrix P contains the Pij values derived from the number of times that
movement i is followed by movement j in the sequences. It can be estimated
empirically from the actual sequences, or be defined theoretically to impose a
preferred syntax on the sequences. When all Pij values are equal, the order of
movements in the sequence is not important for the model.
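The empirical estimation of P can be sketched as follows. The sketch counts how often movement i is immediately followed by movement j and then normalizes each row so that it sums to 1. The row normalization, the uniform fallback for movements that are never followed by anything, and the two example sequences are assumptions made for illustration.

```python
def transition_matrix(sequences, vocabulary):
    """Estimate P[i][j], the probability that movement i is followed by j."""
    counts = {i: {j: 0 for j in vocabulary} for i in vocabulary}
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    matrix = {}
    for i in vocabulary:
        total = sum(counts[i].values())
        if total == 0:
            # Movement i never has a successor: fall back to a uniform row.
            matrix[i] = {j: 1.0 / len(vocabulary) for j in vocabulary}
        else:
            matrix[i] = {j: counts[i][j] / total for j in vocabulary}
    return matrix


# Hypothetical movement sequences over the vocabulary {1, 2, 3, 4}.
P = transition_matrix([[1, 2, 3, 4], [1, 2, 4, 3]], [1, 2, 3, 4])
for i, row in P.items():
    print(i, row)
```

A theoretical matrix imposing a preferred syntax could be supplied in the same dictionary format instead of being estimated from the sequences.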
Additional parameters, such as tempo, spacing, orientation, can also be
included in the choreographic models, at the expense of algorithmic
complexity. For the sake of clarity, these parameters have been ignored in the
present paper.
3.2 Maximum Likelihood Inference of a Choreographic Tree
For simplicity, let us assume that only 4 movements {1, 2, 3, 4} are used to
infer a choreography generated for three dancers {A, B, C}. The movement
sequences corresponding to the different dancers are known, and the objective
is to find the best possible tree relating these sequences, in order to infer the
sequence at the root of the tree from which they have evolved. A maximum
likelihood approach implementing the full choreographic model with six
possible types of mutations is employed. The resulting tree is shown in Fig. 5.
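For three dancers, the tree space is small enough to be searched exhaustively. The sketch below enumerates every rooted binary topology on a set of labels as nested tuples; the function name and the tuple encoding are choices made for this example. Scoring each candidate under the choreographic model is not shown.

```python
from itertools import combinations


def rooted_topologies(leaves):
    """Return every rooted binary topology on the given labels as nested tuples."""
    leaves = list(leaves)
    if len(leaves) == 1:
        return [leaves[0]]
    first, rest = leaves[0], leaves[1:]
    trees = []
    # The first label always goes to the left subtree, so that each
    # unordered split of the labels is generated exactly once.
    for k in range(len(rest)):
        for companions in combinations(rest, k):
            left = [first] + list(companions)
            right = [x for x in rest if x not in companions]
            for left_tree in rooted_topologies(left):
                for right_tree in rooted_topologies(right):
                    trees.append((left_tree, right_tree))
    return trees


candidates = rooted_topologies(["A", "B", "C"])
for tree in candidates:
    print(tree)
```

Three candidate trees are produced for dancers A, B and C, in agreement with Equation (1) for n = 3.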
The application of the phylogenetic dance composition model clearly
shows that it is possible to infer ancestral movement sequences. Interestingly,
these estimates not only provide information about the evolution of the
choreographic process, they can also be used to generate dance pieces. For
example, by traveling down the tree, from the leaves to the root, the dancers
will converge to the same original sequence and unison will appear as the
result of this process. Such a creative approach may allow the choreographer
to create syntactic structures, based on a purely objective model of movement
sequence evolution.
Fig. 5. Application of the maximum likelihood approach for inferring a choreographic tree.
The movement sequences for the three dancers are represented at the leaves of the branches
(white circles), and the estimated ancestral sequences are shown at the internal nodes (black
circles). The types of mutations required to transform ancestral sequences into descendant
sequences are presented on the branches of the tree. This solution is the optimal tree obtained
under a complete choreographic model with unequal frequencies for the movements, and with
independent rates of evolution for the six types of mutations.
4. Extensions and Conclusion
In this paper, I have shown that a phylogenetic model can be used to reverse
engineer the creative process or to generate dance pieces. However, this approach
may not only shed light on the dance composition technique, it may also
provide insights about the styles of different choreographers. By comparing
the types of choreographic mutations employed by various composers, a
classification of choreographers could be derived. Also, the evolution of an
artist’s work throughout his or her career could be explained under a
phylogenetic model. Still, this naive process of dance composition is far from
reality. Several elements of style are not accounted for in the model, such
as tempo, energy, position of the dancers in space, the orientation of the
movements, and the spacing among dancers, among others. Also, because the
models are based on movement sequences and not on movements per se, a
fixed vocabulary has to be defined beforehand to be able to apply the
phylogenetic model. Future work in this direction will investigate mutations
on movements and hybridization among different vocabularies.
Interestingly, the phylogenetic approach may also be extended to other
artistic fields. For example, the composition of a musical piece could be
modeled with a similar technique. Likewise, phylogenetic analysis may
apply to visual art, for studying the relationships among different pieces created
by an artist. As long as there exist ways of measuring the attributes
defining a piece of art (e.g., pitch, frequency, duration, hue, saturation, etc.),
and specific models of evolution relating the different modalities of these
attributes are available, the maximum likelihood criterion could be employed
to infer the creative process.
Acknowledgments. This work is part of the author’s Ph.D. research in
Études et Pratiques des Arts at Université du Québec à Montréal.
References
1. Felsenstein, J. 1978. The Number of Evolutionary Trees, Systematic Zoology, 27: 27-33.
2. Felsenstein, J. 2004. Inferring Phylogenies, Sinauer Associates, Sunderland, Mass.
3. Swofford, D.L., Olsen, G.J., Waddell, P.J., & Hillis, D.M. 1996. Phylogenetic Inference. In: Hillis, D.M., Moritz, C., & Mable, B.K. (eds.) Molecular Systematics, Sinauer Associates, Sunderland, Mass.: 407-514.
4. Jukes, T.H., & Cantor, C.R. 1969. Evolution of Protein Molecules. In: Munro, H.N. (ed.) Mammalian Protein Metabolism, Academic Press, New York: 21-132.
5. Felsenstein, J. 1981. Evolutionary Trees From DNA Sequences: A Maximum Likelihood Approach, Journal of Molecular Evolution, 17: 368-376.
6. Kimura, M. 1980. A Simple Method for Estimating Evolutionary Rate of Base Substitutions Through Comparative Studies of Nucleotide Sequences, Journal of Molecular Evolution, 16: 111-120.
7. Hasegawa, M., Kishino, H., & Yano, T. 1985. Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA, Journal of Molecular Evolution, 21: 160-174.
8. Kimura, M. 1981. Estimation of Evolutionary Distances Between Homologous Nucleotide Sequences, Proceedings of the National Academy of Sciences of the USA, 78: 454-458.
9. Lanave, C., Preparata, C., Saccone, C., & Serio, G. 1984. A New Method for Calculating Evolutionary Substitution Rates, Journal of Molecular Evolution, 20: 86-93.
10. Posada, D., & Crandall, K.A. 1998. Modeltest: Testing the Model of DNA Substitution, Bioinformatics, 14: 817-818.
11. Koshi, J.M., & Goldstein, R.A. 1996. Probabilistic Reconstruction of Ancestral Protein Sequences, Journal of Molecular Evolution, 42: 313-320.
12. Lapointe, F.-J. 2005. Choreogenetics: The Generation of Choreographic Variants Through Genetic Mutations and Selection. In: Rothlauf, F. (ed.) Proceedings of the Genetic and Evolutionary Computation Conference, ACM Press, New York: 366-369.
13. Lapointe, F.-J., & Époque, M. 2005. The Dancing Genome Project: Generation of Human-Computer Choreography Using a Genetic Algorithm. In: Proceedings of ACM-Multimedia Conference, ACM Press, New York: 555-558.