The DNA Within: Mutation, Selection, Evolution and Phylogenetic Dance Composition

François-Joseph Lapointe

Département de sciences biologiques, Université de Montréal, CP 6128, Succ. Centre-ville, Montréal (QC), H3C 3J7, Canada
[email protected]

F- J. Lapointe – The DNA Within

Abstract. The science of phylogenetics aims at recovering the evolution of a gene over time by estimating the ancestor-descendant relationships among different species or populations. The complex processes of mutation and selection are the main forces driving the evolution of the genes that make up our DNA, and the complete history of a particular gene can only be studied using appropriate models of evolutionary change. A choreographer also applies mutation and selection to generate dance pieces. Whereas the genes compose the words of the genetic vocabulary, the movements represent the vocabulary of dance. Consequently, a phylogenetic approach can be applied to transform movement sequences over time, to mutate the movement vocabulary, and to evolve choreographies. In this paper, I will propose different models of DNA evolution, in the specific context of phylogenetic dance composition. This process can be easily implemented using a computer algorithm, which takes as input an original movement sequence and produces a number of mutant sequences obtained through various types of choreographic operations. The history of movement sequences over time, the phylogenetic choreography, thus retraces the cumulative transformation of ancestral sequences into descendant sequences, based on a choreographic model of evolutionary change.

Keywords: Dance Composition, Maximum Likelihood Inference, Models of Sequence Evolution, Phylogenetic Trees.

1 Introduction

What if one could go back in time to document the genesis of an artistic piece from its inception to the end? What if one could retrace the chronological process leading to a masterpiece?
What if one could reverse-engineer creativity? That is surely impossible. Yet, using very simplistic models of evolutionary computation, it may be feasible to propose a solution to this difficult task.

Retracing the evolution of a piece in time is not unlike inferring the evolution of Life. There exists a wide range of mathematical models to estimate the phylogenetic history of species. These methods take as input a set of objects (species) represented by a given number of attributes (molecular sequences), and return a tree (a phylogeny) depicting the relationships among these objects. Several trees, each one representing a different phylogenetic hypothesis, are distinguishable. The real challenge is to infer the best tree, given a particular model of sequence evolution.

In this paper, it is postulated that the phylogenetic approach can be applied to dance composition. Assuming that the species can be replaced by the different dancers, each represented by a sequence of movements, phylogenetic inference can be applied to retrace the original sequence of movements from which all other sequences may have evolved. In other words, this process aims at finding the mother of all movement sequences, the one from which the choreographer draws inspiration.

In order to present the details of the algorithmic process for deciphering dance composition, phylogenetic analysis will first be introduced, with definitions of trees and the symbolic notations that will be used throughout this paper. Different models of nucleotide evolution will also be presented with respect to maximum likelihood inference. These molecular models will then be translated into various choreographic models of movement sequence evolution, and the phylogenetic approach will be illustrated with a specific example. Possible extensions and generalizations to other artistic fields will finally be discussed.

2 Phylogenetic Analysis in a Nutshell

We all have a family tree: a genealogy retracing the relationships among siblings, parents, grandparents, uncles and cousins. A phylogeny is nothing more than a family tree of species. However, unlike a genealogy, for which the ancestors are usually known (at least for recent generations), a phylogenetic tree merely presents a hypothesis about ancestor-descendant relationships.

Fig. 1. A phylogenetic tree T, with labeled leaves corresponding to a set S = {A, B, C, D, E} of different species. The internal nodes are represented by black circles, whereas the terminal nodes (the leaves) are represented by white circles. The path between D and E is illustrated by dashed edges. The root imposes a direction on the branches from the bottom to the top of the tree. This tree is unweighted.

In mathematical terms, a phylogenetic tree T is defined as an acyclic and connected graph. That is, it depicts a set V of vertices (the nodes of the tree) connected by a set E of edges (the branches of the tree), in such a way that there exists one and only one path of adjacent edges between any two nodes (Fig. 1). Typically, labels are attached to the nodes of a tree to represent objects. In a phylogeny, only the terminal nodes (the leaves of the tree) are labeled, whereas internal nodes are also labeled in a genealogy. In both cases, one node can be labeled as the ancestor of all other nodes (the root of the tree) to impose a direction on the branches of the tree. Furthermore, a tree can have branch lengths (or weights) attached to the edges to represent the amount of time separating two nodes. In a phylogeny, these weights are a function of genetic distances between nodes, so that the sum of branch lengths along the path connecting any two leaves represents the evolutionary distance between a pair of species.
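
The definition above can be made concrete with a minimal sketch. The topology, the internal node names (root, u, v, w) and every branch length below are invented for illustration and are not read off Fig. 1; the sketch only checks that a graph is connected with |V| - 1 edges and sums branch lengths along the unique path between two leaves.

```python
from collections import deque

# Hypothetical weighted tree on the leaf set S = {A, B, C, D, E}.
# Internal node names and all branch lengths are assumptions.
EDGES = [
    ("root", "u", 1.0), ("root", "v", 2.0),
    ("u", "A", 0.5), ("u", "B", 0.7),
    ("v", "C", 0.4), ("v", "w", 0.6),
    ("w", "D", 0.3), ("w", "E", 0.2),
]


def build_adjacency(edges):
    """Store each undirected branch in both directions."""
    adj = {}
    for a, b, weight in edges:
        adj.setdefault(a, []).append((b, weight))
        adj.setdefault(b, []).append((a, weight))
    return adj


def is_tree(adj, edges):
    """A graph is a tree if it is connected and has |V| - 1 edges."""
    nodes = list(adj)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt, _ in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes) and len(edges) == len(nodes) - 1


def path_length(adj, start, goal):
    """Sum of branch lengths along the unique path between two nodes."""
    queue = deque([(start, None, 0.0)])
    while queue:
        node, parent, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt, weight in adj[node]:
            if nxt != parent:  # a tree has no cycles, so never walk back
                queue.append((nxt, node, dist + weight))
    raise ValueError("nodes are not connected")


adj = build_adjacency(EDGES)
print(is_tree(adj, EDGES))
print(path_length(adj, "D", "E"))
```

Because a tree has exactly one path between any two nodes, the search only has to avoid stepping back to the node it just came from.
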
Unweighted trees do not have branch lengths, and evolutionary distances are defined in such cases as the number of branches along the path between two nodes.

2.1 The Number of Phylogenetic Trees

Given a set of species S = {1, 2, …, n}, the objective of phylogenetic inference is to find the tree T that best represents the evolutionary relationships among these species. When the number of leaves n is small, this problem can be solved exhaustively by considering each and every possible hypothesis of relationships and finding the one that optimizes a given criterion (see section 2.3). However, the number of distinguishable trees T_n increases dramatically with the number of leaves n [1]:

$$T_n = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!} \tag{1}$$

Namely, there are 15 possible topologies (depicting rooted phylogenies) for a set of 4 objects, but that number reaches 34,459,425 for 10 objects. Because it is impossible to evaluate all possible trees when the number of species is too large, heuristic algorithms are employed to approximate the results of phylogenetic analysis, and numerous search strategies have been proposed over the years to speed up the process and improve its accuracy [2, 3]. Yet, it is important to note at this stage that the “true” phylogeny, which is sought, is not known, and that phylogenetic trees represent, at best, probable relationships among extant species. In order to recover that elusive tree of life, different models of molecular evolution are thus required.

2.2 Models of Molecular Evolution

To estimate phylogenetic trees from a set of molecular sequences, one first has to define a model of molecular evolution to compute the substitution rates among the different nucleotides. For the sake of clarity, this section will only focus on models of DNA evolution, but other models for different types of sequences (e.g., proteins) are also available [2].
All models are based on a different number of parameters that describe (1) the frequencies of the four nucleotides in DNA sequences (A, C, T, and G), and (2) the substitution probabilities among these four states, among others (for more complex models, the variance of the evolutionary rates among sites and the number of invariable sites can also be considered).

Fig. 2. Examples of models of DNA evolution. (a) JC model [4], with only one rate of substitution (α) among all nucleotides. (b) K2P model [6], with different substitution rates for transitions (α) and transversions (β). (Adapted from [2]).

There exist six possible types of substitution between pairs of nucleotides. The simplest model (JC model), introduced by Jukes and Cantor [4], assumes that the frequencies of all nucleotides are equal in the sequences, and that the substitution rates from one nucleotide to another are all the same (see Fig. 2a). In other words, there is no difference between transitions (substitutions between purines {A, G} or between pyrimidines {C, T}) and transversions (substitutions between a purine and a pyrimidine), and at equilibrium, all nucleotides should appear in the sequences with a probability of 1/4. Because this model does not accurately represent actual sequences, a large number of alternative scenarios have been proposed (see Fig. 3 for the relationships among some of these models).

Fig. 3. Relationships between different models of DNA evolution for different types of substitution rates, and with and without equal nucleotide frequencies. (Modified from [3]).

For example, the F81 model [5] assumes that all substitution rates are equal, but that the frequencies of nucleotides in the sequence can differ. On the other hand, one can fix the nucleotide frequencies to be equal, but allow transition rates to differ from transversion rates (see Fig. 2b); this is the so-called Kimura 2-parameter model (K2P [6]), or the corresponding HKY [7] model for unequal nucleotide frequencies. Of course, a 3-parameter model [8] is also available, as well as any other that assigns different rates to specific types of substitutions [3]. The richest of all, the General Time-Reversible model (GTR), assumes that all six substitution rates are different, and that the frequencies of nucleotides can vary [9]. Although the GTR model may better represent actual sequences, it is also much more difficult to compute. One of the challenges of phylogenetic analysis is to find the optimal model to infer a phylogeny, and several tests are available to do so [10]. In the following section, the maximum likelihood approach will be introduced to infer a phylogeny with respect to a given model of DNA evolution, and to allow for comparisons among different phylogenetic hypotheses.

2.3 Maximum Likelihood Inference of a Phylogenetic Tree

Given a set of DNA sequences of length m, collected on a set S of n species, the objective of phylogenetic inference is to find the tree T that best explains the actual distribution of sequences D, with respect to a model of sequence evolution. Formally speaking, we are thus looking for the tree that will maximize the following function [2]:

$$L = \mathrm{Prob}(D \mid T) \tag{2}$$

where L is the likelihood of the data D, given a phylogenetic hypothesis T. To be able to do so, a large number of trees have to be visited by a search algorithm (when n is small, the tree space can be evaluated exhaustively), and the tree with the best likelihood is selected as the most probable solution. However, because different trees can be obtained when using different models of DNA evolution, log-likelihood ratio tests [10] have to be computed to select the most parsimonious model (the one with the fewest parameters). Given this optimal tree, it then becomes possible to infer the ancestral sequences of DNA at internal nodes, going as far back as the root of the tree [11]. This approach is exactly what will be used to estimate the mother of all movement sequences for phylogenetic dance composition.

3 Phylogenetic Dance Composition

The rationale of this paper is that a phylogenetic approach can be employed to reverse-engineer dance composition. That is, it may be possible to infer the choreographic process by going back in time. Assuming that the leaves of a tree are dancers, and that the corresponding terminal sequences represent the movements they must execute, the phylogenetic analysis then amounts to retracing the evolution of movements from a common ancestor, or to turning the tree upside down to infer the creative process that may have generated that piece. However, the result of this process greatly depends on the model at hand. In the following sections, the models of nucleotide evolution will be generalized to movement sequence evolution. This choreographic model will then be implemented and illustrated to infer a choreographic tree for three dancers, using a maximum likelihood criterion.

3.1 Models of Choreographic Evolution

The creation of a dance involves a series of choreographic operations, which are not unlike the mutations occurring at the DNA level. Consequently, models of molecular evolution can be applied to model the evolution of movement sequences in time. However, different types of mutations are required for dance composition, in addition to the simple substitution of one movement by another. Indeed, six choreographic operators between movement sequences have already been defined to model the evolution of dance using a genetic algorithm [12, 13]: substitution, insertion, deletion, repetition, inversion, and translocation (see Fig. 4 for details). However, contrary to DNA sequences, these mutations apply to the entire sequence and not to single nucleotides (or movements).
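
As a rough illustration of the six operators named above, the following sketch treats a movement sequence as a Python list of movement labels. The exact semantics of each operator are assumptions made here for illustration (Fig. 4 is the authoritative description), as are the function signatures and the index conventions.

```python
# Each function returns a new list and leaves the ancestral sequence intact.
# Segments are half-open slices seq[i:j]; this convention is an assumption.

def substitution(seq, i, j, new):
    """Replace the segment seq[i:j] by the movements in `new`."""
    return seq[:i] + list(new) + seq[j:]


def insertion(seq, i, new):
    """Insert the movements in `new` before position i."""
    return seq[:i] + list(new) + seq[i:]


def deletion(seq, i, j):
    """Remove the segment seq[i:j]."""
    return seq[:i] + seq[j:]


def repetition(seq, i, j):
    """Duplicate the segment seq[i:j] immediately after itself."""
    return seq[:j] + seq[i:j] + seq[j:]


def inversion(seq, i, j):
    """Reverse the order of the movements in seq[i:j]."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def translocation(seq, i, j, k):
    """Cut seq[i:j] and reinsert it at index k of the remaining sequence."""
    segment, rest = seq[i:j], seq[:i] + seq[j:]
    return rest[:k] + segment + rest[k:]


ancestor = [1, 2, 3, 4]  # hypothetical ancestral movement sequence
print(inversion(ancestor, 1, 4))
print(translocation(ancestor, 0, 2, 2))
```

Returning new lists rather than mutating in place keeps every ancestral sequence available, which is what a tree of ancestors and descendants needs.
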
Several different models of choreographic evolution may thus be defined. The simplest one, related to the JC model, assumes that all movements are equiprobable and that all types of mutations have the same rate of evolution. At the other end of the spectrum, the equivalent of the GTR model allows the different movements to have unequal frequencies, and assumes different rates for the different types of mutations. As for molecular sequences, intermediate models for a restricted number of mutations are also available, depending on the choreography.

Fig. 4. Different types of choreographic mutations between movement sequences. The top line represents the original movement sequence, whereas the bottom line is the mutant sequence. In each case, the movements affected by the mutation are underlined.

In addition to the models of movement sequence evolution, dance composition algorithms also require information about the transitions from one movement to the next. For example, some combinations of movements may be restricted or impossible, whereas other combinations may be more likely. To account for this syntactic process, a transition probability matrix between all possible pairs of movements in a sequence is computed. This matrix P contains the P_ij values corresponding to the number of times that movement i is followed by movement j in the sequences. It can be estimated empirically from the actual sequences, or be defined theoretically to impose a preferred syntax on the sequences. When all P_ij values are equal, the order of movements in the sequence is not important for the model. Additional parameters, such as tempo, spacing, and orientation, can also be included in the choreographic models, at the expense of algorithmic complexity. For the sake of clarity, these parameters have been ignored in the present paper.
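
The empirical estimation of the matrix P described above might look like the following sketch. The three movement sequences are hypothetical, and the row normalization that turns counts into probabilities is an assumption added here; the text itself defines P_ij as a count.

```python
def transition_counts(sequences, movements):
    """Count how often movement i is immediately followed by movement j."""
    index = {m: k for k, m in enumerate(movements)}
    size = len(movements)
    counts = [[0] * size for _ in range(size)]
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[index[current]][index[following]] += 1
    return counts


def row_normalize(counts):
    """Turn each row of counts into probabilities (assumed normalization)."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Hypothetical sequences for three dancers over the movements {1, 2, 3, 4}.
sequences = [[1, 2, 3, 4], [1, 2, 2, 4], [4, 3, 2, 1]]
counts = transition_counts(sequences, movements=[1, 2, 3, 4])
for row in row_normalize(counts):
    print(row)
```

A theoretical syntax could be imposed instead by writing the matrix by hand, for example by setting an entry to zero to forbid a given pair of consecutive movements.
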

3.2 Maximum Likelihood Inference of a Choreographic Tree

For simplicity, let us assume that only four movements {1, 2, 3, 4} are used to infer a choreography generated for three dancers {A, B, C}. The movement sequences corresponding to the different dancers are known, and the objective is to find the best possible tree relating these sequences, in order to infer the sequence at the root of the tree from which they have evolved. A maximum likelihood approach implementing the full choreographic model with six possible types of mutations is employed. The resulting tree is shown in Fig. 5.

The application of the phylogenetic dance composition model clearly shows that it is possible to infer ancestral movement sequences. Interestingly, these estimates not only provide information about the evolution of the choreographic process, but can also be used to generate dance pieces. For example, by traveling down the tree, from the leaves to the root, the dancers will converge to the same original sequence, and unison will appear as the result of this process. Such a creative approach may allow the choreographer to create syntactic structures, based on a purely objective model of movement sequence evolution.

Fig. 5. Application of the maximum likelihood approach for inferring a choreographic tree. The movement sequences for the three dancers are represented at the leaves of the branches (white circles), and the estimated ancestral sequences are shown at the internal nodes (black circles). The types of mutations required to transform ancestral sequences into descendant sequences are presented on the branches of the tree. This solution is the optimal tree obtained under a complete choreographic model with unequal frequencies for the movements, and with independent rates of evolution for the six types of mutations.
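
With only three dancers, the tree space is small enough to be searched exhaustively. The sketch below evaluates Eq. (1) and lists the candidate rooted topologies for three leaves; the function names are hypothetical, and the nested-tuple encoding of a topology is an assumption made for illustration. It does not compute any likelihood.

```python
from itertools import combinations
from math import factorial


def num_rooted_trees(n):
    """Number of rooted, bifurcating, leaf-labeled topologies, as in Eq. (1)."""
    if n < 2:
        raise ValueError("Eq. (1) is applied here to n >= 2 leaves")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def three_leaf_topologies(leaves):
    """Enumerate ((x, y), z): the pair (x, y) shares the most recent ancestor."""
    if len(leaves) != 3:
        raise ValueError("this helper only handles three leaves")
    return [
        ((x, y), z)
        for x, y in combinations(leaves, 2)
        for z in leaves
        if z not in (x, y)
    ]


dancers = ["A", "B", "C"]
candidates = three_leaf_topologies(dancers)
print(candidates)
print(len(candidates) == num_rooted_trees(3))
print(num_rooted_trees(4), num_rooted_trees(10))
```

Each candidate would then be scored under the chosen choreographic model and the one with the highest likelihood retained, which is the exhaustive strategy mentioned in section 2.1 for small n.
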

4 Extensions and Conclusion

In this paper, I have shown that a phylogenetic model can be used to reverse-engineer the creative process or to generate dance pieces. Moreover, this approach may not only shed light on the dance composition technique, but may also provide insights about the styles of different choreographers. By comparing the types of choreographic mutations employed by various composers, a classification of choreographers could be derived. Also, the evolution of an artist’s work throughout his or her career could be explained under a phylogenetic model.

Still, this naive process of dance composition is far from reality. Several elements of style are not accounted for in the model, such as tempo, energy, the position of the dancers in space, the orientation of the movements, and the spacing among dancers, among others. Also, because the models are based on movement sequences and not on movements per se, a fixed vocabulary has to be defined beforehand to be able to apply the phylogenetic model. Future work in this direction will investigate mutations on movements and hybridization among different vocabularies.

Interestingly, the phylogenetic approach may also be extended to other artistic fields. For example, the composition of a musical piece could be modeled with a similar technique. Likewise, phylogenetic analysis may apply to visual art, for studying the relationships among different pieces created by an artist. As long as there exist ways of measuring the attributes defining a piece of art (e.g., pitch, frequency, duration, hue, saturation), and specific models of evolution relating the different modalities of these attributes are available, the maximum likelihood criterion could be employed to infer the creative process.

Acknowledgments. This work is part of the author’s Ph.D. research in Études et Pratiques des Arts at Université du Québec à Montréal.

References

1. Felsenstein, J. 1978. The Number of Evolutionary Trees, Systematic Zoology, 27: 27-33.
2. Felsenstein, J. 2004. Inferring Phylogenies, Sinauer Associates, Sunderland, Mass.
3. Swofford, D.L., Olsen, G.J., Waddell, P.J., & Hillis, D.M. 1996. Phylogenetic Inference. Hillis, D.M., Moritz, C., & Mable, B.K. (eds.) Molecular Systematics, Sinauer Associates, Sunderland, Mass: 407-514.
4. Jukes, T.H., & Cantor, C.R. 1969. Evolution of Protein Molecules. Munro, H.N. (ed.) Mammalian Protein Metabolism, Academic Press, New York: 21-132.
5. Felsenstein, J. 1981. Evolutionary Trees From DNA Sequences: A Maximum Likelihood Approach, Journal of Molecular Evolution, 17: 368-376.
6. Kimura, M. 1980. A Simple Method for Estimating Evolutionary Rate of Base Substitutions Through Comparative Studies of Nucleotide Sequences, Journal of Molecular Evolution, 16: 111-120.
7. Hasegawa, M., Kishino, H., & Yano, T. 1985. Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA, Journal of Molecular Evolution, 21: 160-174.
8. Kimura, M. 1981. Estimation of Evolutionary Distances Between Homologous Nucleotide Sequences, Proceedings of the National Academy of Sciences of the USA, 78: 454-458.
9. Lanave, C., Preparata, C., Saccone, C., & Serio, G. 1984. A New Method for Calculating Evolutionary Substitution Rates, Journal of Molecular Evolution, 20: 86-93.
10. Posada, D., & Crandall, K.A. 1998. Modeltest: Testing the Model of DNA Substitution, Bioinformatics, 14: 817-818.
11. Koshi, J.M., & Goldstein, R.A. 1996. Probabilistic Reconstruction of Ancestral Protein Sequences, Journal of Molecular Evolution, 42: 313-320.
12. Lapointe, F.-J. 2005. Choreogenetics: The Generation of Choreographic Variants Through Genetic Mutations and Selection. Rothlauf, F. (ed.) Proceedings of the Genetic and Evolutionary Computation Conference, ACM Press, New York: 366-369.
13. Lapointe, F.-J., & Époque, M. 2005. The Dancing Genome Project: Generation of Human-Computer Choreography Using a Genetic Algorithm. Proceedings of ACM-Multimedia Conference, ACM Press, New York: 555-558.