symmetries in genetic information and algebraic biology
Symmetry: Culture and Science
The journal of the Symmetrion
Founding editors: G. Darvas and D. Nagy
Editor: György Darvas
Volume 23, Numbers 3-4, 225-448, 2012

SYMMETRIES IN GENETIC INFORMATION AND ALGEBRAIC BIOLOGY

CONTENTS

ANNOUNCEMENT
Symmetry Festival 2013, 2-7 August, Delft, The Netherlands

EDITORIAL, Sergey Petoukhov

SYMMETRY IN SCIENCE AND ART
Genome symmetries, Paul Dan Cristea
Symmetry of mitochondrial DNA. The case of COXn genes in primates and carnivores, Teodora Popovici and Paul Dan Cristea
Symmetries of the genetic code, hypercomplex numbers and genetic matrices with internal complementarities, Sergey V. Petoukhov
Fractal genetic nets and symmetry principles in long nucleotide sequences, S.V. Petoukhov, V.I. Svirin
A Markov information source for the syntactic characterization of amino acid substitutions in protein evolution, Miguel A. Jiménez-Montaño
Symmetries in molecular-genetic systems and musical harmony, G. Darvas, A.A. Koblyakov, S.V. Petoukhov, I.V. Stepanian
Modeling “cognition” with nonlinear dynamic systems, Yuri V. Andreyev, Alexander S. Dmitriev
The irregular (integer) tetrahedron as a warehouse of biological information, Tidjani Négadi
Theory of topological coding of proteins and nature of antisymmetry of the amino acids canonical set, Vladimir A. Karasev

SYMMETRY: CULTURE AND SCIENCE is the journal of and is published by the Symmetrion, http://symmetry.hu/. Edition is backed by the Executive Board and the Advisory Board (http://symmetry.hu/isa_leadership.html) of the International Symmetry Association. The views expressed are those of individual authors, and not necessarily shared by the boards and the editor.
Editor: György Darvas
Any correspondence should be addressed to the Symmetrion.
Mailing address: Symmetrion c/o G.
Darvas, 29 Eötvös St., Budapest, H-1067 Hungary
Phone: 36-1-302-6965
E-mail: [email protected]
http://symmetry.hu/

Annual subscription:
Normal: € 120.00
Members of ISA: € 90.00
Student Members of ISA: € 60.00
Benefactors: € 900.00
Institutional Members: please contact the Symmetrion.
Make checks payable to the Symmetrion and mail to the above address, or transfer to the following account: Symmetrology Foundation, IBAN: HU24 1040 5004 5048 5557 4953 1021, SWIFT: OKHBHUHB, K&H Bank, 20 Arany J. St., Budapest, H-1051.

© Symmetrion. No part of this publication may be reproduced without written permission from the publisher.
ISSN 0865-4824 – print version
ISSN 2226-1877 – electronic version
Cover layout: Günter Schmitz; Image on the front cover: Matjuska Teja Krasek: Star(s) for Donald, 2000 (tribute to H.S.M. Coxeter); Images on the back cover: Matjuska Teja Krasek: Twinstar and Octapent; Ambigram on the back cover: Douglas R. Hofstadter.

Symmetry: Culture and Science Vol. 23, Nos. 3-4, 225-448, 2012
SYMMETRIES IN GENETIC INFORMATION AND ALGEBRAIC BIOLOGY
A thematic issue
Guest editor: Sergey V. Petoukhov

Symmetry: Culture and Science Vol. 23, Nos. 3-4, 229-231, 2012
EDITORIAL

The biological meaning of genetic informatics is reflected in the brief statement: "life is a partnership between genes and mathematics" (I. Stewart, 1999, Life’s Other Secret: The New Mathematics of the Living World. New York: 304 p.). But what kind of mathematics has partner relations with the genetic code, and what kind of mathematics is behind genetic phenomenology? This question is one of the main challenges in the exact natural sciences today.
This thematic issue, Symmetries in genetic information and algebraic biology, focuses on several impressive properties of living matter: noise-immune transmission of genetic information along the chain of generations; genetic coordination of all inherited subsystems of an organism into a single unit, including the members of a huge chorus of its cyclic processes; genetically concerted self-reproduction of the genetic system and of the whole organism; and so on. The authors of the issue try to understand such properties from the mathematical point of view of the exact natural sciences, where symmetry principles play a leading role.

Modern science knows that deep knowledge of the phenomenological symmetry relations among the separate parts of a complex natural system can reveal much about the evolution and mechanisms of that system. Physics and other natural sciences have a great number of successful applications of symmetry approaches. Nowadays many physical theories, from the theory of relativity to quantum mechanics, are built as theories of invariants of mathematical groups of transformations, in other words as theories of special kinds of symmetry. The study of symmetries and asymmetries in molecular structures is one of the important branches of chemistry. For example, functional differences between the right-handed and left-handed forms of molecules in living organisms became known through investigations of symmetry in biological molecules.

Biological organisms belong to a category of very complex natural systems, which correspond to a huge number of biological species with inherited properties. Surprisingly, however, molecular genetics has discovered that all organisms are identical to each other in their basic molecular-genetic structures. This revolutionary discovery brought about a great unification of all biological organisms in science.
The information-genetic line of investigation has become one of the most promising lines not only in biology, but in science as a whole. The basic system of genetic coding has turned out to be strikingly simple. Its simplicity and orderliness present challenges to specialists from many scientific fields.

Bioinformatics considers each biological organism as an ensemble of interrelated information systems. The genetic coding system is the basic one. All other biological systems must be correlated with this system in order to be transmitted to the next generations of organisms. The natural technology of genetic coding is the major and most effective technology of life on our planet. Using this natural technology, a huge biomass of living matter with unique and valuable properties is produced around the world.

Bioinformatics and biotechnology have been applied in many areas, such as biology, medicine, and the life sciences. Bioinformatical knowledge is used to manufacture biological organisms with new properties, to extend human life, to diagnose and treat disease on the basis of “personal genetics”, to clone organisms, to develop new computer technologies, to create new materials with unique characteristics, to propose new genetic algorithms for technical applications, and so on. It seems that all fields of human life will be influenced in the future by progress in bioinformatics.

Modern science recognizes the key role of information principles in the inherited self-organization of living matter. Modern informatics is an independent branch of science, which possesses its own language and mathematical formalisms and exists alongside physics, chemistry and other scientific branches. The problem of the informational evolution of living matter has been investigated intensively in recent decades, in addition to studies of the classical problem of biochemical evolution.
Not only physics and chemistry deal with principles and methods of symmetry; informatics and digital signal processing also pay great attention to them. How is the theory of signal processing connected to geometry and geometrical symmetries? Signals are represented there as a sequence of the numeric values of their amplitude at reference points. The theory of signal processing is based on the interpretation of discrete signals as vectors in multi-dimensional spaces. At every time step ("tact"), the signal value is interpreted as the value of the corresponding coordinate in a multi-dimensional vector space of signals. In this way, the theory of discrete signals turns out to be the science of geometries of multi-dimensional spaces, where different multi-dimensional numeric systems can be useful. The number of dimensions of such a space is equal to the number of reference points of the signal. Metric notions and all other necessary tools are introduced in these multi-dimensional vector spaces to address particular problems of reliability, speed, and economy of signal information. For example, the important notions of the energy and the power of a discrete signal appear in the multi-dimensional geometry of the signal space as, respectively, the square of the length of the signal vector and the square of that length divided by the number of dimensions of the space. On this geometrical basis, many methods and algorithms are constructed for the recognition of signals and images, information coding, detection and correction of information errors, artificial intelligence, and the training of robots. One can add here the importance of symmetries in permutations of components for coding signals, in spectral analysis of signals, in orthogonal and other transformations of signals, and so on.
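The geometric reading of signal energy and power given in this paragraph can be made concrete with a minimal sketch. The function names `signal_energy` and `signal_power` and the sample values are illustrative assumptions, not part of the editorial.

```python
def signal_energy(samples):
    """Energy of a discrete signal: squared Euclidean length of its vector."""
    return sum(x * x for x in samples)


def signal_power(samples):
    """Power: squared length divided by the number of dimensions (samples)."""
    return signal_energy(samples) / len(samples)


# A four-sample signal is a point in a four-dimensional signal space.
samples = [3.0, -4.0, 0.0, 0.0]
energy = signal_energy(samples)  # 3^2 + (-4)^2 = 25.0
power = signal_power(samples)    # 25.0 / 4 = 6.25
```

The number of samples plays the role of the dimension of the space, so a longer signal with the same energy has a lower power.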
Investigation of symmetrical and structural analogies between computer informatics and genetic informatics is also needed for the creation of DNA computers and DNA robotics.

This thematic issue contains nine articles in the following order. The first four articles (by P. Cristea; T. Popovici and P. Cristea; S. Petoukhov; S. Petoukhov and V. Svirin) are devoted to the study of hidden regularities in long nucleotide sequences and to the possible use of multi-dimensional numerical systems for understanding the structural organization of the genetic coding system. The article by M.A. Jiménez-Montaño introduces a theoretical model to understand some aspects of protein evolution. The article by G. Darvas, A. Koblyakov and S. Petoukhov presents some relations between musical harmony and the genetic coding system that have applications, in particular, in the field of musical culture. The article by Y. Andreyev and A. Dmitriev reviews different biological applications of the theory of non-linear dynamic systems and proposes a method for storing and processing information on the basis of dynamic chaos. The article by T. Négadi is devoted to his classification model of the amino acids and to some numeric features of the genetic code. The article by V. Karasev describes his theory of topological coding of proteins and the nature of antisymmetry of the canonical set of amino acids.

Sergey Petoukhov

Symmetry: Culture and Science Vol. 23, Nos. 3-4, 233-254, 2012
SYMMETRY IN SCIENCE AND ART

GENOME SYMMETRIES
Paul Dan Cristea

Electronics Engineer, Physicist (1941-2013)
Affiliation: BioMedical Engineering Center, University “Politehnica” of Bucharest (UPB), Splaiul Independentei no. 313, 060042 Bucharest, Romania (e-mail: [email protected]).
Current position: Professor of Electrical Engineering and Applied Information Science, Member of the Romanian Academy, Fellow IEEE, Member of Honor of the Romanian Scientists Academy.
Publications: Over 510 papers on Genomic Signals (including Symmetry in Genomics), Signal & Image Digital Processing, Neural Systems, Evolutionary Intelligent Agents, Intelligent e-Learning Environments, Circuit Theory and Design, Computerized Medical Equipment, Measurement Equipment, High Performance Electrical Batteries, and Semiconductor Thin Layers and Technical Physics.

Abstract: The paper gives an overview of the nucleotide genomic signal (NuGS) methodology and its applications in revealing regularities and symmetries in genomes. The long-range symmetries make a genome resemble, from the structural point of view, less a "plain text" and more a "poem," which obeys rules evoking “rhythm” and “rhyme.” Both tangible symmetries, present in the genomes of extant taxa, and hidden symmetries, which existed in ancestral genomes but have disappeared under the evolutionary pressures linked to species separation, can be detected. The approach also makes it possible to apply signal processing methods to the analysis of genomic data in the local study of genes, inserts, motifs, mtDNA, etc.

Keywords: nucleotide genomic signals, nucleotide imbalance, nucleotide pair imbalance

1. INTRODUCTION

The currently accepted concept of symmetry reaches much further than the classical one, which only required “the harmony of the different parts of an object, the good proportion between its constituent parts” (Darvas, 2007). The link between symmetry and conservation laws was formulated in the work and thinking of the early physicists, such as Galileo and Newton, but especially as a result of Einstein's theories on the invariance of equations (Wigner, 1964). It is now clear that symmetry, regularity and conservation share an ontological dimension. The existence of a system, object or property requires its conservation from one moment to the next, which is essential for its identification and recognition.
Any conservation of a system, object or property with respect to a transformation corresponds to a specific symmetry, ostensive or not. Moreover, symmetry is much more than the harmony between the parts of some objects; it actually underlies the deep harmony of the cosmos, granting its existence. The lack of symmetry and regularity, which is the main feature of chaos, is always relative, being accompanied by hidden symmetries and regularities. Cosmos and chaos form a dialectic dichotomy, as opposite faces of the reality of the universe. Science focuses on reproducible systems or phenomena, so that it always assumes their conservation and the existence of the corresponding essential symmetry.

In the living world, the very existence of life as we know it, a large variety of extremely slowly changing forms, requires very well structured genetic information and its almost strict conservation in non-stationary, closely interacting, multi-component ecosystems. On close analysis, it is a matter of wonder that genomes survived and even evolved on Earth, in their anti-entropic march towards increasing complexity, from the first primitive self-replicating single-celled organisms up to the tremendously complex extant organisms, including Homo sapiens. This process has been going on for more than 3.5 billion years, over many successive generations and many parallel or successive interrelated species. It continues nowadays, in spite of the intricate complexity of genomes, the apparent vulnerability of their replication mechanisms, and the large number of perturbing factors potentially hindering their replication. Genomes can consist of thousands of nucleotides for some viruses, millions of nucleotides for bacteria, and up to billions of nucleotides for eukaryotes, again including humans.
For all extant species, the replication accuracy is high enough to conserve the essential genetic information to a degree at which the various correction, repair, and selection mechanisms, acting at molecular, subcellular, cellular, tissue, organism, population, species and biosphere levels, are able to assure the continuity of life, i.e., survival. In fact, this means about one error per 10,000 replicated nucleotides for most genomic areas, and less than one error per 10,000,000 replicated nucleotides for critical genomic areas (Kunkel, 2004). A large variety of DNA polymerases and repair enzymes are used to carry out and control these processes.

The replication speed is also impressively high. In humans, it is typically about 50 nucleotides per second per replication fork (initiation site). Consequently, the whole genome can be copied in only a few hours, owing to the many replication forks operating simultaneously. The replication speed can reach over 1,000 nucleotides per second in bacteria, with a considerably higher error rate as well.

The multiple adverse factors that tend to disturb accurate genome replication, introducing genetic noise, cover a wide range: from simple physical factors (thermal noise, ultraviolet, radioactive and cosmic radiation) and chemical factors (fluctuating pH, unstable environment, various active radicals triggering interconversion), up to more complex molecular factors (insertion, deletion, recombination, transposition, chromosome breakage) and pathogen-level factors (viral modulation of chromatin, retrovirus invasion), and many more. Despite the amazing complexity of the mechanisms through which it is put into action, the strategy that proved effective in the conservation and evolution of genomes and of their corresponding organisms during all the successive phases of biosphere development on Earth remains essentially simple: replication is fast enough to keep the accumulation of mutations low.
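The statement that many simultaneous replication forks are needed to copy the human genome in a few hours can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: the genome size of about 3.2 billion base pairs and the 8-hour target are assumptions introduced here (the text gives only the fork rate), and each fork is treated as an independent copier working at a constant rate.

```python
import math

HUMAN_GENOME_BP = 3_200_000_000  # assumption: approximate human genome size
FORK_RATE_NT_PER_S = 50          # rate per replication fork quoted in the text


def single_fork_days(genome_bp, rate_nt_per_s):
    """Days needed if one fork had to copy the whole genome by itself."""
    return genome_bp / rate_nt_per_s / 86_400


def forks_needed(genome_bp, rate_nt_per_s, hours):
    """Smallest number of forks that finishes within the given time."""
    copied_per_fork = rate_nt_per_s * hours * 3600
    return math.ceil(genome_bp / copied_per_fork)


days_with_one_fork = single_fork_days(HUMAN_GENOME_BP, FORK_RATE_NT_PER_S)
forks_for_8_hours = forks_needed(HUMAN_GENOME_BP, FORK_RATE_NT_PER_S, 8)
# days_with_one_fork is about 741 days; forks_for_8_hours is 2223
```

Under these assumptions a single fork would need roughly two years, so completing replication within hours implies thousands of forks working in parallel.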
In this way, potentially harmful factors can create only a small number of deviant genomes, cells and, finally, organisms. The first line of defense against this genetic noise is given by the correction, repair and maintenance mechanisms acting at molecular and cellular levels. These mechanisms have themselves evolved over time, gradually becoming highly complex and powerful, and are now able to assure the surprisingly good replication accuracy already mentioned. But even more important are the processes that act at a higher level, involving variability and selection as creative tools that enable evolution.

The vast majority of the genome is copied faultlessly, or is properly corrected, carrying the intact genomic information to the next generation. Most of the remaining mutations simply harm the functionality and replication capacity of genomes and/or of their corresponding organisms, compromising their survival. Natural selection subsequently eliminates these faulty genomes and organisms, without significant effects on the majority of the population. Nevertheless, there remain exceptionally few mutations that confer advantages on genomes and/or on their corresponding organisms. These exceptional mutations are successful and, under the effect of the same natural selection, spread over the whole species by a diffusion-like process in a few thousand years. In this way, the potentially harmful genetic noise, which seems to threaten genome integrity, actually emerges as fruitful genetic variability, providing the informational support on which natural selection can enrich the genetic pool of species with positive mutations, i.e., evolution. This fundamental mechanism uses properly selected “noise” to produce evolution, by reversing the flow of entropy towards higher order and complexity. The process is neither neat nor gentle, and occurs through attacks on genomes by all the mentioned and many other adverse factors (Albrecht-Buehler, 2006).
Still, without it, natural selection would have no variants to act upon, and evolution would be impossible.

2. EXPLORING REGULARITIES IN DNA MOLECULES

The regularity and symmetry of genomes, which result in highly correlated distributions of nucleotides along DNA molecules, are among the conditions necessary for the existence of low-level correction mechanisms in genome replication. By using the Nucleotide Genomic Signal (NuGS) methodology (Cristea, 2002, 2005), we have revealed such correlations for both long and short sequences of nucleotides. We have also proven theoretically, and verified experimentally for all available genomes, that cross-over and recombination conserve the distribution of pairs of nucleotides (Cristea, 2004). We have used this result to build an efficient model for the prediction of nucleotide sequences, similar to the models used in time series prediction. A prediction system with improved performance has been obtained using a novel two-component architecture comprising a Principal Component Analysis (PCA) block and an Artificial Neural Network (ANN) acting as a learning machine (Cristea et al., 2007). For real-life sequences, such as genome nucleotide sequences, the system requires learning a much smaller number of parameters, thus greatly reducing the training time in comparison with classical time-series prediction systems of similar complexity. A retraining algorithm, which further reduces the computational burden and increases the prediction accuracy, has also been used. Elements of such a system will be discussed in Section 5.

There have been many attempts to reveal the symmetry and regular inner structure of a DNA molecule.
The best-known result is Chargaff’s first law for mono- and oligo-nucleotides in a double-stranded DNA molecule (Chargaff, 1951):

“The numbers of mono-, di-, tri-, ..., nucleotides on one strand of a double-helix DNA molecule are equal to the numbers of their reverse complements on the other.”

This result was fundamental in guiding Watson and Crick to formulate their well-known double-helix model of the DNA molecule (Watson & Crick, 1953). It clearly pointed towards the complementarity of the two strands of the DNA molecule, and opened the way to establishing the existence of the nitrogenous base pairs adenine-thymine (A-T) and cytosine-guanine (C-G), formed by nucleotides placed on the two strands. Conversely, the law is a direct consequence of nucleotide pair complementarity. It is satisfied with utmost precision by all natural double-helix DNA molecules.

Chargaff also formulated a second law for mono- and oligo-nucleotides, which refers to each strand of a DNA molecule:

“The numbers of mono-, di-, tri-, ..., nucleotides on one strand of a natural DNA molecule (>100 Kb) are equal to the numbers of their reverse complements on the same strand.”

This statement refers only to the nucleotides on one strand, so it has no connection to the complementarity of nitrogenous base pairs in the DNA double helix. The rule is a truly statistical law, valid only approximately and only for large enough nucleotide sequences (> 100 Kb) of nuclear DNA. It is not valid for mitochondrial DNA or plasmids. Chloroplast DNA contains two inverted repeats (IRA and IRB); therefore many genes encoded by a chloroplast genome have two complementary copies, and Chargaff's second law is valid for them. Mitochondrial, chloroplast and plasmid DNA is naked, i.e., not associated with histones.
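The two laws quoted above can be tested on any sequence by counting k-mers and their reverse complements. The following is a minimal sketch, not the procedure used by the author; the helper names (`reverse_complement`, `kmer_counts`, `second_law_gap`) and the toy strand are hypothetical, and a real test of the second law would need a sequence longer than 100 Kb.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count all overlapping k-mers on the given strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def second_law_gap(seq, k):
    """Largest relative gap |n - m| / (n + m) between the count n of a k-mer
    and the count m of its reverse complement on the SAME strand.
    0.0 means the second law holds exactly for this k."""
    counts = kmer_counts(seq, k)
    worst = 0.0
    for kmer, n in counts.items():
        m = counts.get(reverse_complement(kmer), 0)
        worst = max(worst, abs(n - m) / (n + m))
    return worst


strand = "AACGGT"                           # toy example, far too short to be meaningful
other_strand = reverse_complement(strand)   # "ACCGTT"

# First law: holds exactly by construction of the complementary strand.
first_law_holds = all(
    n == kmer_counts(other_strand, 2)[reverse_complement(kmer)]
    for kmer, n in kmer_counts(strand, 2).items()
)

# Second law: only statistical; on this toy strand the gap is 1/3.
gap = second_law_gap(strand, 1)
```

The first check always succeeds, which mirrors the remark that the first law is a direct consequence of base-pair complementarity, whereas the second quantity has to be measured.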
Chargaff’s second law is satisfied by the total numbers of mono- and oligo-nucleotides, but not locally, by the density of their distribution along a nucleotide sequence, as is the case for Chargaff’s first law. In many cases, the law is better satisfied by whole chromosomes or whole genomes. Still, this law is important, as it reveals a marked regularity in the longitudinal structure of DNA.

The genome pixel image (GPxI) method has been introduced as a simple and straightforward graphical representation of the distribution of nucleotides along a DNA strand, in order to explore its possible ordered structure (Albrecht-Buehler, 2006). Different gray-tone values (or colors; Cristea, 2002) are arbitrarily assigned to the four bases (A, C, G, T) of a DNA sequence, transforming the symbolic sequence of nucleotides into a continuous line of pixels with varying gray values. Similarly, the method can be extended to visualize the distribution of di- and tri-nucleotides along a strand, by using a large enough number of color hues. The resulting line is arranged in a rectangular frame, as successive lines of an arbitrarily chosen length w, placed one below the other like a text. It is expected that a computer-generated random sequence would yield a featureless dot pattern, while for exactly repetitive motifs there exist specific values of the width w for which the motifs align perfectly and generate a pattern of vertical lines. Furthermore, in the more general case of pseudo-repetitive sequences, otherwise easily overlooked relationships could be revealed, as well as the possible causes of the broken periodicity, e.g., mutations. Nevertheless, the need to carefully select the visualized segment of the analyzed DNA sequence, and to decide upon the width w for which seemingly significant images are obtained, makes the method too subjective.
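The construction of a genome pixel image can be sketched in a few lines. This is only an illustration of the idea described above: the gray levels in `GRAY` are arbitrary choices (as the method itself allows), and the function names are hypothetical.

```python
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}  # arbitrary gray levels, one per base


def genome_pixel_image(seq, width):
    """Turn a nucleotide string into rows of gray values, `width` pixels per row."""
    pixels = [GRAY[base] for base in seq]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def has_vertical_lines(rows):
    """True if all complete rows are identical, i.e. every column is uniform."""
    full_rows = [row for row in rows if len(row) == len(rows[0])]
    return all(row == full_rows[0] for row in full_rows)


repeat = "ACG" * 4                             # exactly repetitive motif of period 3
aligned = genome_pixel_image(repeat, 3)        # width equals the period
misaligned = genome_pixel_image(repeat, 4)     # width does not match the period
```

With `width=3` every row equals `[0, 85, 170]`, so the image consists of vertical lines; with `width=4` the motif drifts from row to row. The dependence of the picture on the chosen width is exactly the source of subjectivity noted in the text.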
Much more objective methods are based on the Nucleotide Genomic Signal (NuGS) methodology, mentioned above and discussed in the next section.

3. NUCLEOTIDE GENOMIC SIGNALS

The NuGS method is based on the conversion of symbolic nucleotide sequences into digital signals (Cristea, 2005). The approach is suitable for the analysis of large-scale genomic sequences, including both exons and introns, but also for local DNA analysis, such as the study of the variability of pathogens. Comparisons of sets of closely related signals describing the same genomic area in various individuals of a population, or across species, are of particular interest. An important case is the development of pathogen resistance to treatment (Cristea, 2006).

A key feature of this approach is that it operates with signals. As a direct consequence, the observed regularities take the form of mathematical properties, such as a piece-wise linear distribution of nucleotides or pairs of nucleotides, so that the extent to which a regularity is satisfied can be measured. The method reveals significant regularities not only in the distribution of nucleotides along DNA sequences, but also in the distribution of nucleotide pairs, similar to Chargaff’s second law. One important difference is that not only the total numbers of nucleotides, di-nucleotides (pairs) or other oligo-nucleotides are considered, but also the distributions of these structural elements of the genetic alphabet along DNA sequences. Significant regularities have been found in all studied genomes: Archaea, Bacteria and Eukaryota. A genome appears to be more than a plain text, satisfying symmetry restrictions that evoke the rhythm and rhyme of poems (Cristea, 2010).

Such regularities help to identify exogenous inserts in the genomes of prokaryotes, because such inserts show clearly different regularities. Inserts can comprise entire retroviruses, individual genes, or some non-coding repeats.
It is interesting to note that even when these inserts are not fully active viruses, they can nevertheless retain a certain pathogenicity, as they correspond to genes of enzymes, such as protease, which facilitate the multiplication and dissemination of certain viruses, generating an increased susceptibility to contamination with the corresponding virus.

For convenience, we briefly review here the main features of the 2D (complex) NuGS representation we have mostly used in our work (Cristea, 2002, 2010, 2012). A graphical representation with many similarities has been used for both DNA and protein sequences (Randić et al., 2011). The mapping is an unbiased representation of nucleotide classes, which uses the ordinality of numbers (their capability to order classes) instead of their cardinality (their capability to express quantities) (Cristea, 2002). Four complex numbers (a = 1+j, c = –1–j, g = –1+j, and t = 1–j), having the quadrantal symmetry shown in Fig. 1, are attached to the four nucleotides (adenine, cytosine, guanine and thymine) of the DNA alphabet. The complex representation in Fig. 1 resulted from a recursive process of refining the nucleotide representation through genome analysis, which showed that it is possible, and desirable, to ignore the less important “amino” vs. “keto” dichotomy we had used in a 3D tetrahedral representation (Cristea, 2002), thus reducing dimensionality without information loss. The real part (Re) of these numbers corresponds to the strength of the hydrogen bonds within the nucleotide pairs in the Watson-Crick double-helix DNA model (Watson & Crick, 1953), expressing the dichotomy “strong bonds” (C-G pair, Re = –1) vs. “weak bonds” (A-T pair, Re = +1). Similarly, the imaginary part (Im) corresponds to the nucleotide molecular structure, expressing the dichotomy “pyrimidines” – heterocyclic aromatic compounds similar to benzene, containing two nitrogen atoms in positions 1 and 3 of a ring (C-T pair, Im = –1) vs.
“purines” – heterocyclic aromatic compounds consisting of a pyrimidine ring fused to a five-membered imidazole ring (A-G pair, Im = +1).

Figure 1: The complex representation of nucleotides

All the complex representations of the nucleotides in Fig. 1 have the same absolute value ($\sqrt{2}$), but their phases can be used to build the NuGSs associated with nucleotide sequences. Two phase signals are particularly useful for this purpose:

- The cumulated phase – the sum of the phases of the complex representations of the nucleotides in a sequence, from the first to the current $h$th sample in the sequence:

$$\theta_c(h) = \sum_{k=1}^{h} \arg\left(C\{Nu(k)\}\right) = \frac{\pi}{4}\, N(h), \quad h = 1, \ldots, n_b, \tag{1}$$

where $Nu(k)$ is the $k$th nucleotide in the sequence, $C\{Nu(k)\}$ – its complex representation, $N(h)$ – the nucleotide imbalance, and $n_b$ – the number of nucleotides (bases) in the sequence. We have shown (Cristea, 2005) that the nucleotide imbalance is a signature of the distribution of nucleotides in the sequence:

$$N(h) = 3\left[n_G(h) - n_C(h)\right] + \left[n_A(h) - n_T(h)\right], \quad h = 1, \ldots, n_b, \tag{2}$$

where $n_A(h)$, $n_C(h)$, $n_G(h)$ and $n_T(h)$ are the numbers of occurrences of adenine, cytosine, guanine and thymine nucleotides, respectively, in the first $h$ samples of the sequence.

- The unwrapped phase – the phase of the elements in the sequence, corrected by adding an integer multiple of $2\pi$ (i.e., $2m\pi$, $m \in \mathbb{Z}$, $\mathbb{Z}$ – the set of integers), so that the absolute value of the difference of phase between any two successive entries of the sequence is smaller than $\pi$:

$$\theta_u(1) = \arg\left(C\{Nu(1)\}\right), \qquad \theta_u(h) = \arg\left(C\{Nu(h)\}\right) + 2m\pi, \; m \in \mathbb{Z}, \ \text{so that} \ \left|\theta_u(h) - \theta_u(h-1)\right| < \pi. \tag{3}$$

We have also shown (Cristea, 2005) that:

$$\theta_u(h) = \theta_u(1) + \frac{\pi}{2}\, P(h), \quad h \in \{2, \ldots, n_b\}, \tag{4}$$

where $P(h)$ is the nucleotide pair imbalance, a signature of the distribution of di-nucleotides (pairs of nucleotides) in the sequence, given by:

$$P(h) = n_{+}(h) - n_{-}(h), \quad h \in \{2, \ldots, n_b\}, \tag{5}$$

where $n_{+}$ is the number of positive pairs (AG, GC, CT, TA) and $n_{-}$ is the number of negative pairs (AT, TC, CG, GA) formed by the first $h$ samples of the sequence.
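The signals defined in equations (1), (2) and (5) can be computed directly from a nucleotide string. The sketch below is a minimal illustration, not the author's implementation: the function and constant names are hypothetical, and pairs that belong to neither listed set (repeated nucleotides and the A-C and G-T transitions) are simply counted as contributing nothing, which is what equation (5) implies.

```python
import cmath
import math

# Complex representation of the nucleotides (a = 1+j, c = -1-j, g = -1+j, t = 1-j).
NUCLEOTIDE = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}
POSITIVE_PAIRS = {"AG", "GC", "CT", "TA"}
NEGATIVE_PAIRS = {"AT", "TC", "CG", "GA"}


def nucleotide_imbalance(seq):
    """N(h) for h = 1..len(seq), from eq. (2): 3(nG - nC) + (nA - nT)."""
    weight = {"A": 1, "C": -3, "G": 3, "T": -1}
    signal, total = [], 0
    for base in seq:
        total += weight[base]
        signal.append(total)
    return signal


def cumulated_phase(seq):
    """theta_c(h): running sum of the phases of the complex representations."""
    signal, total = [], 0.0
    for base in seq:
        total += cmath.phase(NUCLEOTIDE[base])
        signal.append(total)
    return signal


def pair_imbalance(seq):
    """P(h) for h = 2..len(seq), from eq. (5): positive minus negative pairs."""
    signal, total = [], 0
    for prev, cur in zip(seq, seq[1:]):
        pair = prev + cur
        if pair in POSITIVE_PAIRS:
            total += 1
        elif pair in NEGATIVE_PAIRS:
            total -= 1
        signal.append(total)
    return signal


seq = "AGCTA"
N = nucleotide_imbalance(seq)   # [1, 4, 1, 0, 1]
P = pair_imbalance(seq)         # [1, 2, 3, 4]
theta_c = cumulated_phase(seq)

# Eq. (1): the cumulated phase equals (pi/4) * N(h) at every position.
eq1_holds = all(
    math.isclose(t, math.pi / 4 * n, abs_tol=1e-9) for t, n in zip(theta_c, N)
)
```

Working with the integer signals `N` and `P` avoids floating-point phase arithmetic altogether, which is one practical reason to prefer them.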
Usually, $\theta_u(1)$ is negligible. As they have a direct statistical significance and are expressed by integer numbers, it is convenient to use the nucleotide imbalance ($N$) and the nucleotide pair imbalance ($P$) in genomic signal analysis, instead of the cumulated phase ($\theta_c$) and the unwrapped phase ($\theta_u$), respectively.

For the comparative analysis of genes or other conserved nucleotide sequences, e.g., when studying variability, one considers a set of $n$ similar NuGSs derived from $n$ individuals in a group, among which mutations might have occurred. In such cases, the set of signals $S_k$ ($k = 1, \ldots, n$) can be characterized by a pair of signals: $R$ – the reference against which we compare the set of signals, usually a signal that expresses the common trend of the signals, and $O_k = S_k - R$ – the individual offset of $S_k$ ($k = 1, \ldots, n$) with respect to $R$. Most of the time, the reference can be chosen as: (1) the average (mean) or any other linear combination of the signals in the set (weighted mean), including the choice of one of the signals in the set, (2) the median, or (3) the ModeStep. As its name suggests, the ModeStep signal is built by selecting at each point the variation (step) that occurs the largest number of times (the mode) at that point, over the entire set of signals. The starting point is chosen as the median of the starting points.

To monitor the variability of a pathogen, e.g., to detect and track the development of its resistance to treatment, the natural choice for $R$ is the wild type (WT) of the pathogen nucleotide sequence, usually downloaded from a genomic database such as GenBank (GenBank, 2012). This is feasible when the variability is small enough, as in the case of Mycobacterium tuberculosis, so that the pathogens in the isolates from various patients are not too different from the WT.
But in the case of highly variable pathogens, such as HIV, the differences between individual signals and the WT signal might be too large, so that the reference must be constructed from the set of signals itself. The resolution can be further improved by using the digital derivatives of the offsets. This is particularly useful to identify point mutations (single-nucleotide genetic variations), which correspond to step variations in the offsets, or to determine the distance between the individual signals and the reference.
4. REGULARITIES IN GENOMIC SIGNALS
To illustrate the NuGS methodology and the regularities in the nucleotide distribution it can reveal, we present some results in the global and local analysis of DNA molecules. We briefly discuss the phase (Subsection 4.1) and di-nucleotide (Subsection 4.2) statistical analysis of a prokaryote genome (Helicobacter pylori), as well as a comparison of Hominidae family mitochondrial DNA (mtDNA) genes (Subsection 4.3). The three problems presented below have been chosen not only to show the versatility of the NuGS approach in tackling global and local aspects of nucleotide sequence analysis, but also because of the intrinsic interest in understanding molecular scale properties of genomes and their ecological role. This is certainly true for Helicobacter pylori, a bacterium with a significant function in both the normal and the pathological state of the human gastrointestinal tract, and for which the advisability and management of antibiotic treatment need further improvement. On the other hand, it is important to find new methods to study gene variability and evolution, to better understand gene functions and the possibility of harmless control.
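The reference-and-offset description introduced in Section 3 can be sketched as follows for the ModeStep reference. This is only an illustration under stated assumptions: the signals are equal-length lists of integers, ties between equally frequent steps are resolved in favor of the step encountered first, and the function names are hypothetical.

```python
from collections import Counter
from statistics import median


def mode_step_reference(signals):
    """Build a ModeStep reference from a list of equal-length integer signals."""
    # Starting point: the median of the starting points of all signals.
    reference = [median(s[0] for s in signals)]
    for h in range(1, len(signals[0])):
        # Step taken by each signal between positions h-1 and h.
        steps = [s[h] - s[h - 1] for s in signals]
        # Most frequent step at this position (first encountered wins a tie).
        most_common_step = Counter(steps).most_common(1)[0][0]
        reference.append(reference[-1] + most_common_step)
    return reference


def offsets(signals, reference):
    """Return O_k = S_k - R for every signal in the set."""
    return [[v - r for v, r in zip(s, reference)] for s in signals]


signals = [[1, 4, 1, 0], [1, 4, 5, 4], [1, 2, -1, -2]]
reference = mode_step_reference(signals)
print(reference)                    # [1, 4, 1, 0]
print(offsets(signals, reference))  # [[0, 0, 0, 0], [0, 0, 4, 4], [0, -2, -2, -2]]
```

With an even number of signals, `statistics.median` averages the two middle starting values, so the reference may start at a non-integer value; a different convention could be substituted if integer signals are required.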
4.1 Entire genome NuGS analysis
Figure 2 presents the nucleotide imbalance (N) and the nucleotide pair imbalance (P) along the DNA sequence of the entire Helicobacter pylori 26695c genome, downloaded from GenBank (GenBank 2012) with the accession number NC_000915. The length of the DNA sequence is 1,667,867 bp, comprising the whole circular chromosome of the H. pylori genome. H. pylori is a helix-shaped bacterium, about 3 μm long and about 0.5 μm in diameter, found in 1982 by B. Marshall and R. Warren in patients with chronic gastritis and gastric ulcers, conditions previously not believed to have a microbial cause. It is also linked to duodenal ulcers and stomach cancer. More than 50% of the world's population harbor H. pylori in their upper gastrointestinal tract, but about 80% of those infected are asymptomatic, and it is considered that the bacterium plays an important role in the natural stomach ecology (Yamaoka, 2008). The study of the genome is focused on understanding pathogenesis, the ability of about 29% of the loci to cause disease, believed to be linked to a 40 kb Cag pathogenicity island (PAI), which contains genes of virulence proteins. The low GC content of the cag PAI relative to the rest of the H. pylori genome suggests it was acquired by horizontal transfer from another bacterial species.
Figure 2: Nucleotide imbalance (N) and nucleotide pair imbalance (P) for H. pylori 26695c genome (GenBank 2012, NC_000915).
The circular DNA of H. pylori is divided by the variation of the nucleotide imbalance signal N into two segments: one having 3nG + nA in excess of 3nC + nT, resulting in a positive slope (+0.0504) for N, the other having the reversed property (slope -0.0768). Taking into account equation (2) and Chargaff’s second law, which states that the number of occurrences of complementary nucleotides in a DNA molecule single strand should be the same, one would expect that nG, nC and nA, nT balance each other, and N remains close to zero.
This is the case for most eukaryotes, but obviously not for H. pylori, the variation of N being approximately piece-wise linear. As can be seen by direct visual inspection of the curve in Fig. 2, the linearity of the two segments of the nucleotide imbalance signal N is not very good. This type of regularity, usually significantly better than for H. pylori, is found in all bacteria and some archaea, with typical features and parameters defining a “physiognomy” for each genome. The separation points have a biological meaning: the minimum of the nucleotide imbalance signal N, at the origin of the sequence, corresponds quite accurately to the origin of replication, while the maximum ($8.14 \times 10^5$ bp) corresponds, with less precision, to the terminus of replication. We have used the following measures of the linearity:
- Mean Absolute Error (MAE) – the average per nucleotide of the absolute differences between the actual values and the values of the best (least mean square error) linear fit; it estimates the error of the linear fit;
- Linear to Absolute Error Ratio (LAER) – the ratio between the variation per nucleotide of the best (least mean square error) linear fit and MAE; it compares the estimated linear variation per nucleotide with the fluctuations with respect to the linear fit.
LAER is a ratio which compares the variation of the best linear fit with the absolute error, on the same length of the nucleotide sequence, for each segment of an (approximately) piece-wise linear signal. As the error has been expressed by the mean absolute error, computed as an average per nucleotide, the same approach has been used for the linear (regular) variation, to shorten the wording of the definition. Certainly, dividing both the numerator and the denominator by the number of nucleotides does not change the ratio. For the ascending branch (0 – $8.14 \times 10^5$ bp) of the H.
pylori nucleotide imbalance signal N, these measures of linearity are MAE = 1.1 and LAER = 7.65, showing a rather high fluctuation of the N signal with respect to its inferred regular (linear) variation. The behavior is a little better for the descending branch ($8.14 \times 10^5$ – $16.68 \times 10^5$ bp), where MAE = 0.85 and LAER = 9.13, in accordance with the visual estimation of the linearity. In contrast, a much better linearity is found for the variation of the nucleotide pair imbalance P signal along the entire DNA strand. According to (5), such a feature corresponds to a uniform statistical difference between the n+ pairs and the n− pairs. The linearity of the nucleotide pair imbalance P is a general property found in all the investigated genomes, but the slope of P is positive for animal eukaryotes and negative for plant eukaryotes and bacteria. We have shown (Cristea, 2003) that recombination and crossing-over conserve this regularity, while local random mutations, such as uncorrelated SNPs (“snips” – single nucleotide polymorphisms), tend to destroy it. For the nucleotide pair imbalance P signal of H. pylori, the slope is +0.0231 for the entire genome, while the linearity measures are MAE = 0.35 and LAER = 19.45, showing a lower fluctuation and a variation closer to the linear one. Much better linearity of P has been found in other genomes, such as Mycobacterium tuberculosis, for which MAE = 0.12 and LAER = 170.51. The linearity of the nucleotide pair imbalance P is one of the most striking regularities of the DNA nucleotide genomic signals, especially since it occurs even at the points where the N signals change slope (Cristea, 2005), as can be seen in Fig. 2. As mentioned above, recombination and crossing-over – which imply the simultaneous direction reversal and strand switching of a DNA double-helix segment – conserve both n+ and n− in (5), so that the nucleotide pair imbalance P signal remains unchanged, whereas the nucleotide imbalance N signal changes.
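The conservation property stated above can be checked on a small example: reversing a segment and switching strands (taking its reverse complement) leaves the counts of positive and negative pairs unchanged, while the final value of N changes sign. The sketch below assumes an isolated upper-case A/C/G/T segment and ignores the pairs formed at the junctions with the flanking sequence; the helper names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
N_STEP = {"G": 3, "C": -3, "A": 1, "T": -1}
POSITIVE_PAIRS = {"AG", "GC", "CT", "TA"}
NEGATIVE_PAIRS = {"AT", "TC", "CG", "GA"}


def reverse_complement(seq):
    """Direction reversal plus strand switching of a segment."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def final_n(seq):
    """Value of the nucleotide imbalance N at the end of the segment."""
    return sum(N_STEP[base] for base in seq)


def final_p(seq):
    """Value of the pair imbalance P at the end of the segment."""
    pairs = [a + b for a, b in zip(seq, seq[1:])]
    n_plus = sum(pair in POSITIVE_PAIRS for pair in pairs)
    n_minus = sum(pair in NEGATIVE_PAIRS for pair in pairs)
    return n_plus - n_minus


segment = "AGGCTTAGCA"
flipped = reverse_complement(segment)
print(flipped)                            # TGCTAAGCCT
print(final_n(segment), final_n(flipped)) # 4 -4
print(final_p(segment), final_p(flipped)) # 6 6
```

The invariance follows from the fact that each positive pair maps to a positive pair under reverse complementation (AG to CT, CT to AG, GC and TA to themselves), and likewise for the negative pairs.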
This property suggests exploring a possible “hidden” symmetry of DNA molecules, by re-orienting all exons in the genome along the same positive direction (Cristea, 2004). The positive or negative orientation of the coding segments (exons) in a nucleotide sequence is known from the tRNAs and the proteins which are synthesized, but there is no such information for the non-coding segments. Consequently, it is possible to re-orient in the same positive direction only the exons. As expected, the nucleotide pair imbalance P signal does not change, but the nucleotide imbalance N signal becomes almost linear along the entire strand. This outcome points to a putative less differentiated ancestral genomic structure, from which the current nucleotide structures, revealed by their specific piece-wise linear N signals, have evolved (Cristea, 2010). Figure 3 gives the nucleotide imbalance N signals for the H. pylori complete genome (marked all, length 1,667,867 bp, similarly to Fig. 2) and for the 1,626 concatenated exons, kept in their initial orientation in the genome (marked NOCS – non-reoriented coding segments, length 1,527,953 bp). The figure also shows the nucleotide imbalance N signals for the complete genome and for the concatenated exons, after the reorientation of all exons in the same positive direction (marked rfr – re-framed, containing both reoriented exons and non-reoriented non-coding segments, and ROCS – reoriented coding segments, containing reoriented exons, respectively). The most important property of the nucleotide imbalance N signals revealed by the exon reorientation shown in Fig. 3 is the transformation of the piece-wise linear variation of the all and NOCS signals into the quite surprising approximately linear variation of the rfr and ROCS signals.
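The MAE and LAER linearity measures used in this subsection can be sketched under one literal reading of their definitions: MAE is the mean absolute deviation from the least-squares line, and LAER is taken as the absolute slope of that line divided by MAE. This reading is an assumption, the function names are hypothetical, and the sketch does not attempt to reproduce the values reported in the text, which may rely on a normalization not stated here.

```python
def linear_fit(values):
    """Least-squares line through the points (1, values[0]), ..., (n, values[n-1])."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = (n + 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mae_and_laer(values):
    """Return (slope, MAE, LAER) for one approximately linear signal segment."""
    slope, intercept = linear_fit(values)
    fitted = [slope * x + intercept for x in range(1, len(values) + 1)]
    # MAE: average per sample of |actual value - fitted value|.
    mae = sum(abs(y - f) for y, f in zip(values, fitted)) / len(values)
    # LAER (assumed reading): linear variation per sample divided by MAE.
    laer = abs(slope) / mae if mae > 0 else float("inf")
    return slope, mae, laer


slope, mae, laer = mae_and_laer([0, 2, 1, 3, 2, 4])
print(round(slope, 4), round(mae, 4), round(laer, 4))
```

For a piece-wise linear signal such as N in Fig. 2, the function would be applied separately to each branch.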
It is remarkable that the linearity of the ROCS signal (MAE = 0.66 and LAER = 42.75), which contains only exons oriented in the same positive direction, is better than the linearity of the rfr signal (MAE = 1.01 and LAER = 30), which also contains some unchanged non-coding segments.
Figure 3: Nucleotide imbalance (N) of H. pylori 26695c genome (GenBank 2012, NC_000915, all – complete genome, rfr – reoriented exons and non-reoriented non-coding segments, NOCS – 1,626 non-reoriented concatenated exons, ROCS – reoriented concatenated exons).
As mentioned above, the four nucleotide pair imbalance P signals, for the complete genome (all and rfr), on the one hand, and for the non-reoriented and reoriented concatenated exons (NOCS and ROCS), on the other, do not change after reorienting the exons and remain approximately linear. In a graphical representation (not shown here), the curves would appear as two pairs of superposed lines. These results reveal not only the large scale regularities in the genomes of extant species, but also those in putative ancestral genomes, which disappeared in the process of evolution, under the pressure of species separation.
4.2 Whole genome di-nucleotide statistical analysis
On a DNA strand it is possible to have 4 x 4 = 16 distinct di-nucleotide pairs (aa, ac, ag, at, … , tt), which can be arranged in 8 complementary di-nucleotide couple difference signals in accordance with Watson-Crick rules (tg – ca, gt – ac, gg – cc, aa – tt, ct – ag, ga – tc, at – at, cg – cg). The first six difference signals are non-trivial, the other two are identically zero, as they contain pairs of self-complementary di-nucleotides.
For the special case of di-nucleotides, Chargaff's second law states that the numbers of di-nucleotides on one strand of a natural DNA molecule (>100 kb) should be equal to the numbers of their reverse complements on the same strand, meaning that the complementary di-nucleotide couple difference signals, which start from the origin, should become zero again at the end of the sequence (null return value). This statement is only approximately fulfilled, and only for some of the mentioned signals. Figure 4 presents the complementary di-nucleotide difference signals for the H. pylori 26695c whole genome (GenBank 2012, NC_000915), whereas Table 1 gives the range, the return value, as well as the Chargaff’s second law and di-nucleotide distribution percent errors for each of these signals. The error in Chargaff’s second law can be measured as the ratio of the return value, which measures the global imbalance in the complementary di-nucleotide pair (e.g., tg – ca), to the total number of di-nucleotide pairs in the sequence (tg + ca). The di-nucleotide distribution error is defined as the ratio of the range – the difference between the maximum and minimum values of the complementary di-nucleotide difference signal along the sequence (e.g., max(tg – ca) – min(tg – ca)) – to the total number of di-nucleotide pairs in the sequence (tg + ca). Chargaff’s second law is statistically well satisfied by the first four pairs of complementary di-nucleotides in Table 1 and the corresponding difference signals (gt – ac, tg – ca, gg – cc, aa – tt), which comprise neutral pairs, and less accurately by the two difference signals (ct – ag, ga – tc), containing pairs of positive and negative di-nucleotides. It is known that the aa and tt di-nucleotides satisfy specific distributions, symmetrical relative to each other (Ioshikhes et al., 1992). The di-complementary pairs are also distributed quite uniformly along DNA sequences, but the errors are relatively larger.
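The difference signals and the two percent errors defined above can be sketched as follows. The sketch assumes a lower-case a/c/g/t string, counts overlapping di-nucleotides with a sliding window of length two, and reports the Chargaff error from the absolute return value; these choices and the function names are assumptions rather than details given in the text.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(dinuc):
    """Reverse complement of a di-nucleotide, e.g. 'tg' -> 'ca'."""
    return "".join(COMPLEMENT[base] for base in reversed(dinuc))


def difference_signal(seq, dinuc):
    """Running count of `dinuc` minus running count of its reverse complement."""
    partner = reverse_complement(dinuc)
    value, signal = 0, [0]  # the signal starts from the origin
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        # Self-complementary di-nucleotides give an identically zero signal.
        value += (pair == dinuc) - (pair == partner)
        signal.append(value)
    return signal


def percent_errors(seq, dinuc):
    """Return (Chargaff's second law error, distribution error), both in percent."""
    partner = reverse_complement(dinuc)
    signal = difference_signal(seq, dinuc)
    pairs = [a + b for a, b in zip(seq, seq[1:])]
    total = pairs.count(dinuc) + pairs.count(partner)
    if total == 0:
        raise ValueError("neither di-nucleotide occurs in the sequence")
    chargaff = 100 * abs(signal[-1]) / total                # return value / total
    distribution = 100 * (max(signal) - min(signal)) / total  # range / total
    return chargaff, distribution


seq = "tgcatgaatt"
print(difference_signal(seq, "tg"))  # [0, 1, 1, 0, 0, 1, 1, 1, 1, 1]
print(difference_signal(seq, "at"))  # all zeros: 'at' is self-complementary
```

As a consistency check against Table 1, the ratio of the distribution error to the Chargaff error for each row equals the ratio of the range to the absolute return value (e.g., (3027 + 877) / 719 for tg – ca).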
The uniformity of the di-nucleotide distribution for H. pylori 26695c whole genome is shown in Figs. 5 and 6, for the neutral, positive and negative di-nucleotides. The distribution of the eight neutral di-nucleotides in Fig. 5 is quite uniform along the sequence, as shown by the MAE and LAER measures of linearity (best for the tt di-nucleotide), and the complementary di-nucleotide pairs compensate each other well, as can be seen from the low slope errors (best for the cc – gg pair). The linearity for the positive and negative di-nucleotides in Fig. 6 is also very good, or even better (for the self-complementary ta, at, gc, and cg di-nucleotides), but with a less accurate pairing of the remaining nontrivial complementary di-nucleotide pairs (ct – ag, and ga – tc).
Figure 4: Complementary di-nucleotide difference signals for H. pylori 26695c whole genome (GenBank 2012, NC_000915, 1,667,867 bp). Horizontal axis: nucleotides; vertical axis: di-nucleotide differences (number of pairs).
Table 1: Complementary di-nucleotide difference signals for H. pylori 26695c whole genome
| Group | Di-nucleotide signal | Minimum value | Maximum value | Return value | Chargaff’s second law error (%) | Di-nucl. distribution error (%) |
|---|---|---|---|---|---|---|
| Neutral | tg - ca | -877 | 3027 | -719 | 0.37524 | 2.0374 |
| Neutral | gt - ac | -1087 | 2488 | -691 | 0.52342 | 2.7080 |
| Neutral | gg - cc | -921 | 6067 | -838 | 0.57120 | 4.7632 |
| Neutral | aa - tt | -5220 | 263 | -5167 | 1.2121 | 1.2863 |
| Positive | ct - ag | -3552 | 4174 | 4099 | 2.1325 | 4.0194 |
| Negative | ga - tc | -4176 | 4246 | -4122 | 2.3924 | 4.8881 |
Figure 5: Neutral di-nucleotide signals for H. pylori 26695c whole genome (GenBank 2012).
Figure 6: Positive and negative di-nucleotide signals for H. pylori 26695c whole genome (GenBank 2012).
Figure 7: Neutral di-nucleotide signals for H. pylori 26695c for the 1,626 NOCS.
Figure 8: Neutral di-nucleotide signals for H.
pylori 26695c for the 1,626 ROCS.
It is also interesting to analyze the linearity of the di-nucleotide distribution for the 1,626 concatenated exons (length 1,527,953 bp) of the H. pylori 26695c genome (GenBank 2012), for which the nucleotide imbalance N has been given in Fig. 3, before (Fig. 7, NOCS) and after (Fig. 8, ROCS) the re-orientation of the exons in the same positive direction. Only the neutral di-nucleotide distribution is considered here, in order to show the contribution of the non-coding DNA segments in satisfying the regularity of the DNA segments. As shown in Table 4, the elimination of the non-coding segments reduces the slope of the di-nucleotide signals about 3 times, and similarly increases the slope error, but keeps almost the same linearity measures. In contrast, the re-orientation of all exons in the same positive direction changes the slope (even if switching the components of each complementary di-nucleotide pair) and the linearity measures only a little, but largely increases the slope error.
4.3 Comparison of mtDNA genes
The NuGS methodology can also be used in the local analysis of nucleotide sequences, such as in the study of pathogen variability, primarily for the detection of the development of resistance to drugs and to treatment (Cristea, 2006), the investigation of genetic inserts (Cristea et al., 2008), or the comparison of genes belonging to individuals in the same or related species (Cristea et al., 2011). In Figs. 9 to 11 we consider for illustration the case of the ND6 mitochondrial (mt) gene, one of the Complex I genes in the respiratory chain. Fig.
9 presents the nucleotide imbalance signals N of the ND6 mt gene for seven species of the Hominidae family: Homo sapiens (shortened to Hs), Hs neanderthalensis (Hsn), Pan troglodytes (Pat, the Chimpanzee), Pan paniscus (Papa, the Bonobo Chimpanzee), Gorilla gorilla (Gg), Pongo pygmaeus abelii (Popya, Sumatran Orangutan) and Pongo pygmaeus (Popy, Bornean Orangutan). The genes have the same length (525 bp) and are aligned. The distance between two homologous genes, from different species or from different individuals in the same species, is defined as the sum of the absolute values of the differences between the NuGSs describing the two genes. The distance between two species, from the point of view of the genes in some specified set of genes, is defined as the Euclidean distance in a space in which each considered gene is an independent coordinate. As mentioned in Section 3, the resolution of the NuGSs in Fig. 9 is not good enough to measure the distances between genes, and the description based on reference, offsets and the digital derivatives of the offsets has to be used (Cristea et al., 2011). Fig. 10 gives the offsets of the N signals of the ND6 mt genes in Fig. 9 with respect to the Hs signal, chosen as reference (thus, its offset is zero). The step variations in the other offsets correspond to the points where there are differences (mutations) with respect to the Hs signal. One can already appreciate the distances between the genes (e.g., Hsn is the closest to Hs). The resolution increases by using the digital derivatives of the offsets. In Fig. 11 we have represented the digital derivatives of the offsets of the N signals shown in Fig. 10. The pulses along these lines correspond to the differences in the N signals of the ND6 mt genes for the seven Hominidae species. The distances between each gene and the reference gene for Hs are given by the numbers at the right of the lines.
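The offset, digital derivative and distance definitions given above can be sketched as follows. This is an illustration only: it assumes aligned, equal-length sequences, takes the digital derivative to be the first difference of the offset, and applies the gene distance literally as the sum of absolute differences between the two N signals; the function names are hypothetical.

```python
from itertools import accumulate
from math import sqrt

N_STEP = {"G": 3, "C": -3, "A": 1, "T": -1}


def nucleotide_imbalance(seq):
    """Nucleotide imbalance signal N for an upper-case A/C/G/T string."""
    return list(accumulate(N_STEP[base] for base in seq))


def offset(signal, reference):
    """Offset of a signal with respect to the reference signal."""
    return [s - r for s, r in zip(signal, reference)]


def digital_derivative(values):
    """First difference; non-zero entries mark where the two signals step differently."""
    return [b - a for a, b in zip(values, values[1:])]


def gene_distance(signal_a, signal_b):
    """Sum of the absolute values of the differences between two aligned signals."""
    return sum(abs(a - b) for a, b in zip(signal_a, signal_b))


def species_distance(gene_distances):
    """Euclidean distance with each considered gene as an independent coordinate."""
    return sqrt(sum(d * d for d in gene_distances))


reference = nucleotide_imbalance("AGCT")  # [1, 4, 1, 0]
variant = nucleotide_imbalance("AGTT")    # [1, 4, 3, 2]
off = offset(variant, reference)
print(off)                                # [0, 0, 2, 2]
print(digital_derivative(off))            # [0, 2, 0]
print(gene_distance(variant, reference))  # 4
print(species_distance([3, 4]))           # 5.0
```

The single substitution at the third position appears as a step in the offset and as an isolated pulse in its digital derivative, which is the behavior described for Figs. 10 and 11.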
The distances between the genes are now expressed quantitatively (e.g., Hsn is the closest to Hs – distance 5, whereas Popy is the farthest – distance 71, from the ND6 mt gene point of view). Fig. 12 represents graphically the distances between the seven Hominidae species considered above, evaluated on the basis of the distances among their genes in the mtDNA respiratory chain. The succession of the genes along the horizontal axis corresponds to their position in the mtDNA nucleotide sequence. The rhythmicity of the variation of the mt gene distances along the mtDNA molecule cannot be fully attributed to the differences in the gene lengths and seems to indicate the existence of hotspots in an mtDNA molecule from the variability point of view. Similar results have been obtained in the study of pathogen variability (Cristea, 2006).
Figure 9: Nucleotide imbalance N signals of the ND6 mt gene for seven species of the Hominidae family (abbreviations: Hs – Homo sapiens, NC001807, Hsn – Hs neanderthalensis, NC011137, Pat – Pan troglodytes, NC001643, Papa – Pan paniscus, NC001644, Popya – Pongo pygmaeus abelii, NC002083, Popy – Pongo pygmaeus, NC001646, Gg – Gorilla gorilla, NC001645). The accession codes of the mtDNA genes in GenBank (2012) are given.
Figure 10: Offsets of nucleotide imbalance signals (N) of the ND6 mt gene for the seven species in Fig. 9 with respect to the Hs signal as reference.
Figure 11: Digital derivatives of the offsets in Fig. 10. The distances between each gene and the homologous gene for Hs are on the right.
Figure 12: Distances between the respiratory chain & ATP synthase genes and the homologous Hs mt genes.
5. CONCLUSIONS
The study of nucleotide sequences by using NuGSs reveals regularities in the structure of DNA and RNA molecules. This approach has been applied to describe the structure of nucleotide sequences, both in the current state and in a putative ancestral state, from which they have evolved.
The structural restrictions in genomic sequences are reflected in symmetries and regularities observed in the corresponding genomic signals. Results on the Entire genome NuGS analysis (Subsection 4.1), Whole genome di-nucleotide statistical analysis (4.2) and the Comparison of mtDNA genes (4.3) are presented in the paper.
REFERENCES
Albrecht-Buehler, G. (2006) Asymptotically increasing compliance of genomes with Chargaff's second parity rules through inversions and inverted transpositions, Proc. Natl. Acad. Sci. U.S.A., 103, 17828-17833.
Chargaff, E. (1951) Structure and function of nucleic acids as cell constituents, Federal Proceedings, 10, 654–659.
Cristea, P.D. (2002) Conversion of Nitrogenous Base Sequences into Genomic Signals, Journal of Cellular and Molecular Medicine, 6, no. 2, 279–303.
Cristea, P.D. (2004) Genomic Signals of Re-Oriented ORFs, Eurasip – Journal on Applied Signal Processing, [Special Issue on Genomic Signal Processing], 2004, no. 1, 132–137.
Cristea, P.D. (2005) Chapter 1: Representation and analysis of DNA sequences, in Genomic Signal Processing and Statistics, Dougherty, E., Shmulevich, I., Chen, J. and Wang, Z.J., eds. Eurasip Book Series on Signal Processing and Communications, Hindawi Publ. Corp., p. 15–65.
Cristea, P.D. (2006) Genomic Signal Analysis of Pathogen Variability, Progress in Biomedical Optics and Imaging, Proceedings of SPIE, 6088, P1-P12.
Cristea, P.D., Tuduce, R., Cornelis J., Deklerck, R., Nastac, I., Andrei, M. (2007) Signal Representation and Processing of Nucleotide Sequences, Proceeding of the 7th IEEE Intl. Conf. on Bioinformatics and Bioengineering (IEEE BIBE 2007), 1214-1219, Harvard Medical School, Boston, USA.
Cristea, P.D., Tuduce, R. (2008) Use of Nucleotide Genomic Signals in the Analysis of Variability and Inserts in Prokaryote Genomes, Proceedings of the 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08), Ed. H.R. Arabnia, Las Vegas, Nevada, USA, pp. 241-247.
Cristea, P.D. (2010) Symmetry in Genomics, The Journal of the Symmetrion – Symmetry: Culture and Science – Symmetry in Mathematical Education, ISSN 0865-4824, 21, no. 1-304, 71-86, http://symmetry.hu/aus_journal_thematic_issues.html#SME.
Cristea, P.D., Tuduce, R. (2011) Hominidae mtDNA Analysis by Using Nucleotide Genomic Signals, Proceedings of 2nd International Workshop on Genomic Signals Processing (GSP2011), Bucharest, Romania, pp. 61-65.
Cristea, P.D. (2012) Building Phylogenetic Trees by Using Gene Nucleotide Genomic Signals, 2012 IEEE Int'l Conf. of the Engineering in Medicine & Biology Soc. (EMBS), San Diego, CA.
Darvas, G. (2007) Symmetry, Basel: Birkhauser, 508 pp.
GenBank (2012) NIH - National Institutes of Health, National Centre for Biotechnology Information, National Library of Medicine, (NCBI/GenBank), http://www.ncbi.nlm.nih.gov/
Ioshikhes I, Bolshoy A, Trifonov E.N. (1992) Preferred positions of AA and TT dinucleotides in aligned nucleosomal DNA sequences, J Biomol Struct Dyn., 9, no. 6, 1111-7.
Kunkel, T.A. (2004) DNA Replication Fidelity, The Journal of Biological Chemistry, 279, no. 17, 16895–16898.
Randić M, Zupan J, Balaban AT, Vikić-Topić D, Plavsić D (2011) Graphical representation of proteins, Chem Rev., 111(2), 790-862.
Telenti, A., Imboden, P., Marchesia, F., Matter, L., Schopfer, K., Bodmer, T., Lowrie, D., Colston, M.J., Cole, S. (1993) Detection of rifampin-resistance mutations in Mycobacterium tuberculosis, Lancet, 341, 647-650.
Watson, J.D., Crick, F.H.C. (1953) A structure for deoxyribose nucleic acid, Nature, 171, no. 4356, 737–738.
Wigner, E.P. (1964) Symmetry and conservation laws, Proc. Natl. Acad. Sci. U.S.A., 51, 956-965.
Yamaoka, Y. (2008) Helicobacter pylori: Molecular Genetics and Cellular Biology, Caister Academic Pr.
Symmetry: Culture and Science Vol. 23, Nos. 3-4, 255-274, 2012
SYMMETRY OF MITOCHONDRIAL DNA.
THE CASE OF COXn GENES IN PRIMATES AND CARNIVORES
Teodora Popovici1 and Paul Dan Cristea2
Abstract: The Nucleotide Genomic Signals (NuGSs) have been proven to be an effective measure of the characteristic features of nucleotide strands, and have illustrated the regularities and symmetries within these structures. Together with a proper alignment of two mitochondrial DNA genes from individuals of different species, they can help compute the distances between the genes. Furthermore, clustering algorithms applied to the data set of distances can illustrate patterns of the relationships among species. This paper presents an approach to the study of these symmetries, using distances among genomic signals of previously aligned genes. For the purpose of this research, the Distance Computer application has been designed and tested on the Primates and Carnivora orders. The results support the idea that, by using the Nucleotide Genomic Signal as a mathematical abstraction of the nucleotide strands, along with a proper software alignment tool, relevant conclusions can be drawn about the inner symmetries of the mitochondrial DNA.
Keywords: mitochondrial DNA, DNA symmetry, inter-gene distances, nucleotide genomic signals, nucleotide imbalance.
1 Teodora Popovici graduated from the Artificial Intelligence Master at the Computer Science Department of the University “Politehnica” of Bucharest (UPB), Romania, in July 2012, and is currently affiliated to the Bio-Medical Engineering Center of UPB (e-mail: [email protected]).
2 Paul Dan Cristea is Professor of Electrical Engineering and Applied Information Sciences at the University “Politehnica” of Bucharest (UPB), Splaiul Independentei no. 313, 060042 Bucharest, Romania, Member of the Romanian Academy, Fellow IEEE, Member of Honor of the Romanian Scientists Academy, director of the Bio-Medical Engineering Center of UPB.
1. INTRODUCTION
The concept of symmetry is closely related to regularity and harmony.
In order to explain the universe, mankind has always searched for rules and patterns. These rules create certain symmetries that one can find in almost any existing system. Sometimes, though, these symmetrical patterns are not that clear; they are hidden by the way one observes the system. However, these symmetries emerge once the observer analyzes the system from another point of view. This new point of view could appear by applying some sort of transformation to the system data, for example. Genetics is one of the domains that enclose a very large number of rules and symmetries, but at the same time it also withholds patterns and regulations from us. New discoveries are constantly made and new patterns are revealed by means of technology and science. Things that were once considered chaotic and meaningless could later gain a well-defined and regulated structure. Researchers strive to create better methods of analyzing the genome, in order to understand more of its rules and symmetries. One of these methods is the Nucleotide Genomic Signal (NuGS) approach, which will be explained in a later section. This concept creates an interesting approach to genome analysis, which enables one to use powerful signal processing techniques on DNA data. The paper presents the studies conducted on genomic data using techniques like sequence alignment, NuGS and hierarchical clustering. This research aims to prove the effectiveness and applicability of the above mentioned methods in the field of DNA analysis. Studies in this area can help improve the knowledge one has on gene functionality, involvement in diseases, trends of related species in the same family and many other aspects, more broadly discussed in the Motivation chapter of this paper. This study can also reveal the natural symmetry of the mitochondrial DNA, supported by the accurate species classification that can be created using the previously mentioned techniques.
The following sections will give an overview of the terms used across this paper and an introductory part on DNA analysis. Further along, there will be a description of the NuGS technology and its geometrical importance in the field of DNA analysis. In the State of the Art chapter, one may read about related research in the area of nucleotide genomic signals. The final parts of this paper present the developed application and several interesting results that were obtained.
1.1. DNA analysis
Deoxyribonucleic acid (DNA) is the hereditary material in almost all living organisms. Only RNA viruses are the exception to this. DNA is a type of nucleic acid (in almost all cases, it resides in the nucleus of the cell) that contains information, the “genetic instructions” that determine the functioning of organisms (Watson and Crick, 1969) and (Pearson, 2006). The genes are the portions of the DNA strands that carry the useful genetic information. The other segments may have structural or regulating functions. The genes are the molecular units that code for a polypeptide or an RNA chain. Basically, genes contain the information relevant for the building and maintaining of cells and can pass this information on to offspring. For example, genes are responsible for all traits of an organism, either physically visible (like hair or eye color) or more hidden, like predisposition to a certain condition and so on. The mitochondrial DNA (mtDNA) is a special kind of DNA that does not reside in the nucleus of the cell, but in an organelle named the mitochondrion. The mitochondria are structures residing in cells that convert food into energy for the use of the cell. The mitochondria have the special ability to replicate themselves independently of the gene information in the DNA, unlike all the other cells in the organism. Also, unlike normal DNA, mtDNA is in most species solely inherited from the mother.
The mitochondrial DNA is a circular structure of approximately 16,500 base pairs, in humans. It codes for 37 genes and also has a non-coding section named D-loop, in which the two strands of DNA are separated by a third one (hence the name, the displacement loop). The mtDNA is often used in forensic experiments and also lately in phylogeny research.
1.2. Mathematical models of DNA sequences
Despite the simple and standard representation, the symbolic form of nucleotide sequences (namely the enumeration of the nucleobases) limits the possibilities of exploitation to pattern matching and statistical analysis. These paths of research are sometimes difficult to use and limiting in nature. Hence, a new approach has been researched by (Cristea, 2005), one that could help interpret genomic information as signals. The resulting numerical values would have an accurate mathematical meaning and can be subject to signal processing procedures. After a process of studying different mappings between the symbolic form and the genomic signals, a tetrahedral representation and a 2D model were proposed in (Cristea, 2002, 2005). Using these models, the four bases can be assigned phases. A corresponds to π/4, G to 3π/4, T to –π/4 and C to –3π/4. The complex mapping preserves the biochemical characteristics of the bases in corresponding mathematical features. Based on the phase values of the four bases, one can compute the cumulated phase, which is defined as the sum of phases of all the nucleotides in the sequence of interest:
$$\theta_c = \frac{\pi}{4}\left[3\,(n_G - n_C) + (n_A - n_T)\right], \qquad (1)$$
where $n_A$, $n_C$, $n_G$, $n_T$ are the numbers of adenine, cytosine, guanine and thymine bases from the beginning of the sequence to the current location. A second measure of the complex representation of DNA strands is the unwrapped phase, which corrects the absolute value of the differences between consecutive elements in the sequence to be lower than π.
These two nucleotide genomic signals (NuGS) are primarily used to find large-scale features of DNA molecules. The distance between two homologous genes is computed as the sum, over the whole sequence, of the absolute differences between their signals:

\[ D(G_1, G_2) = \sum_{k=1}^{L} \left| S_{G_1}(k) - S_{G_2}(k) \right|, \tag{2} \]

where S_G is a genomic signal; in this case it is the nucleotide imbalance signal.

2. MOTIVATION AND PROBLEM DESCRIPTION

This research aims at finding an accurate sequence of steps by which DNA strands can be properly analyzed and structural symmetries can be identified. The NuGS technique mentioned above has been employed in studies by (Cristea, 2002), (Teodorescu and Cristea, 2012), and (Cristea and Tuduce, 2003, 2009, 2010). In those studies, however, the sequences were not aligned before the genomic signal distance was calculated; alignment was among the authors' recommendations for future work. An appropriate alignment of the initial nucleotide sequences could greatly enhance the quality of the distance results.

Furthermore, the application behind the current study employs a clustering phase at the end, so that conclusions can be drawn more easily about the appropriateness of the techniques for the described problem. The clustering results also reveal the patterns and symmetries in the mitochondrial DNA structure.

The main objective of this research is to compute the distance between different species at the molecular level, and to prepare for computing intra-species distances as well (Cristea and Tuduce, 2003). Such a study enables one to explore the stability of a DNA segment across different species, thus indicating its importance.

Over the past decades, well-known projects have been concluded, such as the Human Genome Project, which was conducted between 1990 and 2003 under the coordination of the U.S.
Department of Energy and the National Institutes of Health (Human Genome Project). The project's main goals were to sequence the approximately 3 billion base pairs that compose the human DNA, identify the 20,000-25,000 genes, store all the information and improve tools for data analysis. As a corollary, several other organisms were also sequenced. Although the project is officially finished, research on the sequenced data will go on for many years to come. The aim of this continuing research is to identify more genes and reveal their functionality, in order to improve research on disease-causing genes and to identify new treatment solutions.

3. STATE OF THE ART

3.1. Genomic Signals

The work in the field of genomic signals is concerned with the comparison of homologous genes in different individuals or across species. In the first case, the aim could be to identify specific variations in genes that might have a great impact on diseases, as researched by (Teodorescu and Cristea, 2012). They conducted a study on the variations of the gene TCF7L2, in order to investigate its influence in type 2 diabetes. Numerical experiments were performed on human individuals, and also on some other species. Because of the numerous inserts in this gene across the different species, the conclusions are to be finalized after a proper alignment of the genes.

Several studies have been conducted on the mitochondrial DNA genes by (Cristea and Tuduce, 2003, 2009, 2010). The mitochondrial DNA (mtDNA) is an exception to the rule that DNA is present only in the nucleus: it resides in the mitochondrion instead. The typical length of the mtDNA is 16,500 base pairs, which encode 37 genes, and 2-10 copies of the mtDNA are present in each mitochondrion. The main focus of these studies is on the Hominidae family. Apart from the genes, mtDNA also contains a non-coding region, the D-loop, which is also important, as it controls the initiation and regulation of transcription and replication of mtDNA. A distinct feature of mtDNA is that it is almost always inherited from the mother, and not from both parents, as is the case with nuclear DNA.

The research on the mtDNA of six Hominidae species conducted by (Cristea and Tuduce, 2003, 2009, 2010) emphasizes the similarities between the species, but also the characteristics that tell them apart. The nucleotide imbalance, the offsets to a reference signal (usually Homo sapiens) and the differential signals of the offsets are studied. Studies on the nucleotide path have also been conducted on several mtDNA genes. Together, these methods are an efficient way of comparing and viewing related signals, and they give an accurate view of highly related sequences.

3.2. Phylogenetic Trees

Phylogenetic trees (phylogenies or evolutionary trees) are arborescent structures that correspond to the inferred evolutionary relationships within a group of species or organisms. These branching diagrams are built using a measure of similarity among physical and/or genetic traits (Benton, 2000). The tree often takes into account restrictions such as time spans. Evolutionary trees can be used to structure classifications, to order the diversity of a system, to infer certain events that took place throughout the evolution of the system, or to guide evolutionary research, according to (Baum, 2008) and (Baum and Offner, 2008). Phylogenies have been used since the studies by Edward Hitchcock in 1840 and the theories published by Charles Darwin in 1859, but they have gained popularity in the last decades.

4. APPLICATION

The application employed for the purpose of this research is mainly implemented in Java and uses results from other pieces of software: BLAST (BLAST/NCBI), FeatureExtract (Wernersson, 2005) and MultiDendrograms (Fernández and Gómez, 2008).
The application itself is called DistanceComputer and consists of several modules. The overall architecture of the DistanceComputer application and its additional software is described in the following subsections.

Given two nucleotide sequences in symbolic format, the program calls BLAST in order to find the most appropriate matching sequences. BLAST may introduce gaps in the sequences, even in sequences of the same length, if there has been an insertion in one of the genes and the matching content is "shifted". This happens in batch mode: the program iterates through a directory of gene content files, comparing them all against each other.

The application analyzes the compact output of BLAST, which contains only the differences between the two sequences. This format proves to be convenient and efficient to process for computing distances using one of the signals as a reference, as well as for computing the cumulated phase. Other features could be added in order to process other types of BLAST output (for example, the complete alignments, along with the statistics that the program reports about the alignment).

4.1. Overall algorithm

The processing is performed in batch mode. The application can execute a variety of tasks, depending on the processing step:

1. Filtering the input files – this process enables users to select which inputs are of interest to them. For example, one might be interested in comparing all the Vertebrates with one another.

2. Separating the desired genes and grouping the organisms into families – the genes have been separated using a modified version of the FeatureExtract utility, by Rasmus Wernersson. In order to have a broad view of the selected species, organisms have been grouped from level 9 (the Primates/Carnivora level) to level 13 (the smallest family level, in order to correctly classify the organisms in the final phylogenetic trees).

3. Doing the alignment – the program accepts a list of genes of interest and a certain family of organisms and calls BLAST, for every gene in the list, for all organisms against one another.

4. Computing the distances between organism genes for each pair (every alignment) – the application reads the BLAST output in the format specified above and computes the nucleotide imbalance signals for each organism, only on the portions that disagree. The distance is then computed as the sum of the absolute values of the signal differences. When a deletion or insertion occurs (one of the signals has a gap), a distance of 7 is used to represent the gravity of this difference. This value was chosen because it is one unit bigger than the largest absolute value of the difference between nucleotide imbalance signals.

5. Post-processing phase – the distance files are rewritten in order to match the input format of the MultiDendrograms application, designed by Alberto Fernández and Sergio Gómez.

6. Running the MultiDendrograms application on the distance set of choice – this step outputs the phylogenetic tree of the considered species.

4.2. Auxiliary software

The other pieces of software used are BLAST (BLAST/NCBI), FeatureExtract (Wernersson, 2005) and MultiDendrograms (Fernández and Gómez, 2008).

BLAST was developed by researchers at the US National Center for Biotechnology Information (NCBI), and is publicly available on the web (BLAST/NCBI), both as a web service and as an executable version. BLAST performs queries between a sequence of data (nucleotides or proteins) and another sequence or a database of sequences. The algorithm is based on the existence of high-scoring segment pairs (HSPs) which form an alignment. These HSPs are searched for through a heuristic version of the Smith-Waterman algorithm.
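For reference, the exact Smith-Waterman algorithm that this heuristic approximates can be sketched in a few lines. The version below is a minimal illustration that returns only the best local alignment score; the scoring parameters (match = 2, mismatch = -1, linear gap cost = -2) are arbitrary assumptions made for the example and are not BLAST's settings.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap cost).

    Classic dynamic programming: each cell holds the best score of a local
    alignment ending at that pair of positions, floored at zero so that an
    alignment may start anywhere.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The full table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths; this is the cost that a seed-based heuristic such as BLAST avoids.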
Because BLAST uses a heuristic of this algorithm, it is much faster than the original version (reportedly 50 times faster), but on the downside it does not guarantee the same accuracy as Smith-Waterman. The compromise between speed and accuracy favors BLAST in the case of gene data alignment.

There are several parsing utilities that can assist in extracting relevant information from GenBank files. One of the open and free pieces of software that can extract sequences and annotations from this file format is FeatureExtract (Wernersson, 2005), with its command-line Python program gb2tab. The program receives its requirements as command-line arguments: which types of sequences are searched for, the name of the input GenBank file and other options, which are outside the scope of this research. It then reads the descriptive part of the file, gathering all the information it needs about the features to extract. For the purpose of this paper, the author has modified the software so that it saves the useful information (the nucleotide sequences) in several files, organized in directories based on gene type and family of the organism. The descriptive initial part of the file is also saved, in the same way as the content, for further reference. The modified software splits the organisms into smaller families.

MultiDendrograms (Fernández and Gómez, 2008) is an open software program designed to perform hierarchical clustering on real-valued data. It is implemented in Java and has a user-friendly graphical interface that enables the user to choose the desired clustering algorithm, along with visual markers for the output. The available algorithms belong to the Agglomerative Hierarchical Clustering family: Variable-group Single-Linkage, Complete Linkage, Unweighted Average, Weighted Average, Unweighted Centroid, Weighted Centroid, Joint Between-Within.

5. EXPERIMENTS

This section presents several experiments that were conducted in order to compute inter-gene distances and construct the associated phylogenetic trees. The phylogenetic trees depicted in the Appendices section were created using the Unweighted Average clustering distance.

5.1. The Respiratory Chain of mtDNA

As mentioned in the earlier sections of this paper, mtDNA is composed of several segments. Among them, 13 segments code for proteins of the electron transport chain (the respiratory chain), as shown in Table 1 (Cristea and Tuduce, 2009). In this paper, the focus falls on three COX genes of the respiratory complex, namely COX1, COX2 and COX3. The tests were done using the Euclidean distance of the component genes. It is worth mentioning that during testing, the "megablast" version of the BLAST software was used, so that even distant species could be properly aligned. This version of BLAST searches for segments of smaller length to match exactly, such that it can be used for more distant species.

Table 1: Products and Genes encoded by mtDNA

Description                                     Product        Genes
Electron transport chain (Respiratory complex)  Complex I      MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-ND5, MT-ND6
                                                Complex III    MT-CYB
                                                Complex IV     MT-COX1, MT-COX2, MT-COX3
                                                ATP synthase   MT-ATP6, MT-ATP8
Ribosomal DNA                                   mt rRNA        MT-RNR1 (12S), MT-RNR2 (16S)
Transport DNA                                   mt tRNA        MT-Ala, ... , MT-Val

5.2. The Primates Order

The Primates order is a part of the Mammalian class, mainly distinguished by the tendency for bipedalism and a mostly arboreal life, according to The Primates Order, The Haplorrhini Suborder, The Strepsirrhini Suborder, (Goodman et al., 1990), (Rylands and Mittermeier, 2009), (Saint-Hilaire, 1812), and (Groves, 2005). The Haplorrhini are a suborder of Primates, named after the characteristic of their nose: "dry-nosed" primates.
As opposed to the Strepsirrhini, the other suborder of Primates (the "wet-nosed" primates), the Haplorrhini have a much more evolved brain.

During this research, the authors had access to only a limited number of complete mtDNA genomes from the Mammalian class. Namely, only 59 species of the Primates order have been sequenced and posted on the NCBI ftp site to date (Metazoa GenBank). Among them, 45 belong to the Haplorrhini suborder, and 14 to the Strepsirrhini clade. Unfortunately, not all of the families of the order currently have a representative in the GenBank data; the most significant omissions are from the Platyrrhini parvorder. The tests performed on these species and the hierarchies built using the computed distances have shown a close resemblance to the biological hierarchy.

5.3. The Carnivora Order

The Carnivora order is also a part of the Mammalian class. As in the previous case, not all the suborders of this clade were available for testing purposes. However, the tests have been conducted on 12 species of the Feliformia suborder and 59 species of the Caniformia suborder.

5.4. Test case 1 – Primates

For the Primates order, tests have been performed on 59 species, using the three COX genes as reference. The results are similar for each of the three genes: the species have been correctly classified in almost all cases. The resulting phylogenetic tree shows accurate results for all species except the Tarsiidae, which are clustered among the Strepsirrhini, although they are officially a part of the Haplorrhini suborder. A curious fact is that throughout the last century, the Tarsiidae were alternately considered as part of the Strepsirrhini and of the Haplorrhini suborder. That is because this family has genetic traits that resemble both these suborders.
With the exception of this incorrect classification, the resulting phylogenetic trees depicted in the Appendices section of this article show a close resemblance to the actual phylogeny of these species.

It is remarkable that tests considering only one gene can give such accurate results. One can predict that by combining all the genes of the mitochondrial DNA, the results will be even more accurate. The phylogenetic trees have the scale attached, so the height of the branches represents the computed distances between the clusters.

5.5. Test case 2 – Carnivora

In the case of the Carnivora order, the results follow a similar pattern. Here, all the main families have been successfully identified and separated in the phylogeny. The exception this time is that the algorithm reports a slightly smaller distance between the cheetah and the puma, compared to the distance between the puma and the rest of its family.

The results for the three genes are quite similar to each other. Nonetheless, in the case of COX3 the phylogenetic tree is not entirely accurate, because one family from the Caniformia suborder is grouped along with the Feliformia suborder. The distance to the other members of its correct group is, however, very small. This could be interpreted as further evidence that, by taking into consideration all the mitochondrial DNA genes, a more accurate result could be obtained and these outliers would be corrected.

5.6. Numerical results

The Appendices section contains visual representations of the experimental results. The branches of the phylogenetic trees have heights that are consistent with the numerical results. For example, one may notice that in Figure 4, the smallest depicted distance is the one between the Gray Wolf and the Eurasian Wolf, namely 2. This result shows that the COX1 gene has only slightly mutated between these two closely related species.
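To make the scale of such values concrete, the following minimal sketch computes a distance between two already aligned sequences along the lines of step 4 of the overall algorithm. Several points are assumptions made for illustration only: the per-base weights (A = 1, G = 3, T = -1, C = -3, that is, the phases of section 1.2 in units of π/4), the use of '-' as the gap symbol, the per-position reading of the signal difference, and the function name. Only the gap penalty of 7 is taken from the text.

```python
# Per-base weights: the phases of section 1.2 expressed in units of pi/4.
# Using them as the per-position signal value is an assumption of this sketch.
WEIGHT = {"A": 1, "G": 3, "T": -1, "C": -3}

GAP_PENALTY = 7  # value quoted in step 4 of the overall algorithm


def aligned_distance(seq1, seq2):
    """Sum of per-position differences between two aligned sequences.

    Both sequences must have the same length; '-' marks a gap introduced
    by the alignment (hypothetical convention for this example).
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y:
            continue  # identical positions contribute nothing
        if x == "-" or y == "-":
            total += GAP_PENALTY  # insertion or deletion
        else:
            total += abs(WEIGHT[x] - WEIGHT[y])
    return total


if __name__ == "__main__":
    print(aligned_distance("ACGT", "ACAT"))  # one G/A substitution
    print(aligned_distance("AC-T", "ACGT"))  # one gap
```

Under these assumptions the largest substitution cost is 6 (G against C), which is consistent with the choice of 7 for a gap, and a single A/G or A/T substitution yields a distance of 2.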
Similar results show the resemblance between clusters of tiger species, bears or seals. In this particular case of the Carnivora order and the COX1 gene, the farthest species were the Snow Leopard and the Wolverine (of the Mustelidae family), with a reported distance of 884.

There have been cases in the experiments where BLAST could not find a proper alignment, so it reported no resemblance between the two genes. In these cases, the default "infinite" distance is 100,000. This is a path worth studying, as additional alignment techniques should be implemented for the cases in which BLAST fails to produce an alignment. The simplest of these techniques is the computation of the nucleotide imbalance signal without prior alignment of the genes. This can be misleading, because it is difficult to compare distances obtained by different techniques.

6. CONCLUSIONS AND FUTURE WORK

The tests in the previous section have shown remarkable results, in the sense that the phylogenies that were obtained closely resemble the scientifically accepted phylogeny. This can be an indicator that the employed distance measure genuinely models the differences between the species. It supports the assumption that the nucleotide genomic signal is an appropriate mathematical and signal model for DNA strands.

In some cases, BLAST may not be able to find a suitable alignment, even though the species are closely related. Providing a backup alignment method for the cases in which BLAST finds no proper alignment can enable the software to give more appropriate distances.

Other aspects worth noting are the accuracy and the reliability of the test data. For example, this study and many others make use of some data sequenced from extinct species, or from extant ones, but from very old artifacts.
These pieces of data may not be as accurate as one may desire, mainly because of the factors that have influenced the DNA for centuries and millennia. Nevertheless, the techniques used by specialists for DNA sequencing are highly advanced and can be trusted to bring about the best possible results under these conditions.

The completeness of the results is also affected by the fact that mtDNA has not yet been completely sequenced for all species. For example, large portions of the Platyrrhini parvorder from the Primates order are not covered in the test set, so this group is only partially represented in the results. A more complete study could be conducted as soon as more data is made available online or in the various university and research facilities.

6.1. Future Work

There is a wide area of possible future research in this field. The results presented in this paper support the view that the Nucleotide Genomic Signal technique, along with a proper alignment, can give good results in the comparison of DNA strands. This can be a starting point for implementing other alignment variants for the cases in which the main software (BLAST in this case) fails to find one. These can also be implemented using NuGS techniques.

This research can be extended to other fields, and there are various ways in which this kind of technology can be useful. For example, testing can be done on genes associated with some diseases, from both healthy and afflicted patients. Obtaining distances in these cases and clustering the final results can lead to important discoveries about the impact that the particular genes have on the development of the disease. Treatment plans can also be evaluated using this data.

One closely related perspective of study could be using other variants of computing distances.
For example, the Mode Step reference can be used when aligning the sequences, so that all the individuals or species can be compared against the common trend of the group. This research could offer interesting insight into which individuals are more closely connected to the common trend, and whether this common signal can be interpreted as the best solution in the evolution path or is merely an average path. Such a technique could give an even clearer image of the symmetries hidden in the mitochondrial DNA.

To conclude, using techniques such as sequence alignment and nucleotide genomic signals gives a better perspective on the symmetry of highly related sequences and improves the estimation of the distances between these sequences. This is a relatively new topic in the field of Bioinformatics, and it promises to offer great opportunities for medical studies as well.

REFERENCES

Baum, D. A. (2008), Reading a phylogenetic tree: The meaning of monophyletic groups, Nature Education, 1 (1).
Baum, D. A., and Offner, S. (2008), Phylogenies and tree thinking, American Biology Teacher, 70, 222-229.
Benton, M. J. (2000), Stems, nodes, crown clades, and rank-free lists: is Linnaeus dead?, Biological Reviews, 75, 633-648.
BLAST / NCBI, http://blast.ncbi.nlm.nih.gov/Blast.cgi.
Cristea, P. D. (2002), Conversion of nucleotides sequences into genomic signals, International Journal of Cellular and Molecular Medicine, 6, 2, 279-303.
Cristea, P. D. (2005), Representation and Analysis of DNA sequences, in Genomic Signal Processing and Statistics, Editors Dougherty E. G., Shmulevici I., Chen Jie, Wang Z. J., Book Series on Signal Processing and Communication, Hindawi, 15-65.
Cristea, P. D., and Tuduce, Rodica (2003), Signal processing of genomic information: Mitochondrial genomic signals of hominidae, 4th EURASIP Conference - Video/Image Processing and Multimedia Communications, 2-5 July 2003, 209-214.
Cristea, P. D., and Tuduce, Rodica (2009), Nucleotide genomic signal analysis of hominidae mitochondrial DNA, DSP2009 - 16th International Conference on Digital Signal Processing, 1-6.
Cristea, P. D., and Tuduce, Rodica (2009), Nucleotide genomic signal comparative analysis of homo sapiens and other hominidae mtDNA, ISSCS 2009 - International Symposium on Signals, Circuits and Systems, 1-4.
Cristea, P. D., and Tuduce, Rodica (2010), Comparative Analysis of Mitochondrial DNA by using Nucleotide Genomic Signals, Materials Science Forum, 670, 507-516.
Fernández, A., and Gómez, S. (2008), Solving Non-uniqueness in Agglomerative Hierarchical Clustering Using Multidendrograms, Journal of Classification, 25, 43-65.
Goodman, M., Tagle, D. A., Fitch, D. H., Bailey, W., Czelusniak, J., Koop, B. F., Benson, P., and Slightom, J. L. (1990), Primate evolution at the DNA level and a classification of hominoids, Journal of Molecular Evolution, 30 (3), 260-266.
Groves, C. (2005), Strepsirrhini, in Wilson D. E., Reeder D. M., Mammal Species of the World (3rd ed.), Johns Hopkins University Press, Baltimore, 111.
Human Genome Project, http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml.
Metazoa GenBank, ftp://ftp.ncbi.nlm.nih.gov/genomes/MITOCHONDRIA/Metazoa.
Pearson, H. (2006), Genetics: What is a gene?, Nature, 441 (7092), 398-401.
Rylands, A. B., and Mittermeier, R. A. (2009), The Diversity of the New World Primates (Platyrrhini), in Garber P. A., Estrada A., Bicca-Marques J. C., Heymann E. W., Strier K. B., South American Primates: Comparative Perspectives in the Study of Behavior, Ecology, and Conservation, Springer.
Saint-Hilaire, É. G. (1812), Suite au tableau des quadrumanes. Seconde famille. Lemuriens. Strepsirrhini, Annales du Muséum d'Histoire Naturelle, 19, 156-170.
Teodorescu, D., and Cristea, P. D. (2012), Nucleotide Genomic Signal comparative analysis of genes involved in diabetes type 2 for various taxons, 19th International Conference on Systems, Signals and Image Processing (IWSSIP), 518-521.
Watson, J., and Crick, F. (1969), Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid, Nature, 224, 470-471, reprinted from Nature, April 25, 1953.
Wernersson, R. (2005), FeatureExtract - extraction of sequence annotation made easy, Oxford University Press, Nucleic Acids Research, 33 (2), W567-W569.

APPENDICES

Figure 1: Primates Phylogenetic tree for gene mtCOX1.
Figure 2: Primates Phylogenetic tree for gene mtCOX2.
Figure 3: Primates Phylogenetic tree for gene mtCOX3.
Figure 4: Carnivora Phylogenetic tree for gene mtCOX1.
Figure 5: Carnivora Phylogenetic tree for gene mtCOX2.
Figure 6: Carnivora Phylogenetic tree for gene mtCOX3.

Symmetry: Culture and Science Vol. 23, Nos.3-4, 275-301, 2012

SYMMETRIES OF THE GENETIC CODE, HYPERCOMPLEX NUMBERS AND GENETIC MATRICES WITH INTERNAL COMPLEMENTARITIES

S.V.Petoukhov

Biophysicist, bioinformatician (b. Moscow, Russia, 1946).
Address: Laboratory of Biomechanical Systems, Mechanical Engineering Institute of Russian Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail: [email protected].
Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony, mathematical crystallography (also history of sciences, oriental medicine).
Awards: Gold medal of the Exhibition of Economic Achievements of the USSR, 1974; State Prize of the USSR, 1986; Honorary diplomas of a few international conferences and organizations, 2005-2012.
Publications: 1) S.V. Petoukhov (1981) Biomechanics, Bionics and Symmetry. Moscow, Nauka, 239 pp. (in Russian); 2) S.V. Petoukhov (1999) Biosolitons. Fundamentals of Soliton Biology. Moscow, GPKT, 288 pp. (in Russian); 3) S.V. Petoukhov (2008) Matrix Genetics, Algebras of the Genetic Code, Noise-immunity. Moscow, RCD, 316 pp. (in Russian); 4) S.V. Petoukhov, M. He (2010) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and Applications, Hershey, USA: IGI Global, 271 pp.; 5) He M., Petoukhov S.V. (2011) Mathematics of Bioinformatics: Theory, Practice, and Applications. USA: John Wiley & Sons, Inc., 295 pp.

Abstract: The article describes the results of a study of some symmetries of the genetic coding system by means of matrix representations of its molecular ensembles. This matrix approach is borrowed by the author from the known theory of noise-immunity coding, which has long been used in discrete signal processing for communication and computer technology. In the process, important connections between the hierarchy of genetic alphabets and complex numbers, Hamilton's quaternions and some other multi-dimensional numbers are discovered by means of an analysis of reasoned numeric representations of genetic (2^n*2^n)-matrices. It is shown that these numeric matrices belong to a class of "matrices with internal complementarities" and that they allow the creation of new mathematical tools to study the molecular-genetic system, including hidden regularities of long nucleotide sequences. The described results give some evidence of the algebraic nature of the molecular-genetic system.

Keywords: symmetry, genetic code, matrix, hypercomplex numbers, complementarity, Kronecker multiplication, long nucleotide sequences.

1. ABOUT THE PARTNERSHIP OF THE GENETIC CODE AND MATHEMATICS

Science has led to a new understanding of life itself: "Life is a partnership between genes and mathematics" (Stewart, 1999).
This article describes a system of multidimensional numeric structures together with some evidence that this mathematical system is the partner of the molecular ensembles of the genetic code. The described results are based on symmetric properties of the genetic code system and on a matrix approach which was borrowed by the author from the mathematics of noise-immunity coding to study genetic phenomenology (Petoukhov, 2008a-c, 2011, 2012; Petoukhov, He, 2010).

[Figure 1: matrix entries not recoverable from the extracted text]

Figure 1: numeric matrices H4, H8, R4 and R8 which are connected with phenomenology of the genetic coding system (Petoukhov, 2011, 2012)

The main mathematical objects of the article are the four matrices R4, R8, H4 and H8 shown in Figure 1. Why are these numeric matrices chosen from the infinite set of matrices? The reason is that they are connected with the phenomenology of the genetic code system in matrix forms of its representation, as was shown in (Petoukhov, 2011, 2012) and as will be additionally demonstrated at the end of this article, where a conclusion about the algebraic essence of genetic informatics will be made. The matrices H4 and H8 belong to the large set of Hadamard matrices, which are widely used for noise-immunity coding in signal processing technologies. The matrices R4 and R8 are conditionally termed "Rademacher matrices" because each of their columns represents one of the known Rademacher functions.

2. THE HADAMARD MATRICES H4 AND H8

Let us begin with an analysis of the (4*4)-matrix H4 (Figure 1).
One of the variants of decomposition of the matrix H4 gives a set of 4 sparse matrices H40, H41, H42 and H43 (Figure 2). This set is closed in relation to multiplication and it defines their multiplication table (Figure 2, bottom row), which is identical to the famous multiplication table of Hamilton's quaternions. From this point of view, the matrix H4 is a Hamilton quaternion with unit coordinates. (Such a type of decomposition is termed a dyadic-shift decomposition because it corresponds to the structures of matrices of dyadic shifts, well known in signal processing technology (Ahmed, Rao, 1975)).

H4 = H40 + H41 + H42 + H43
[Figure 2, upper row: entries of the four sparse matrices not recoverable from the extracted text]

         1      H41     H42     H43
  1      1      H41     H42     H43
  H41    H41    -1      H43     -H42
  H42    H42    -H43    -1      H41
  H43    H43    H42     -H41    -1

Figure 2: the dyadic-shift decomposition of the (4*4)-matrix H4 (from Figure 1) gives the set of 4 sparse matrices H40, H41, H42 and H43, which corresponds to the multiplication table of Hamilton's quaternions (bottom row). The matrix H40 is the identity matrix

But the matrix H4 is also the sum of two sparse matrices HL4 and HR4 (Figure 3). One can number the 4 columns of the matrix H4 from left to right by the numbers 0, 1, 2 and 3. In this case the two columns with non-zero entries in the matrix HL4 have the even numbers 0 and 2, and the two columns with non-zero entries in the matrix HR4 have the odd numbers 1 and 3. In view of this, such a decomposition H4=HL4+HR4 can be conditionally termed "the even-odd decomposition" (this type of decomposition will be used a few times in this article).
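The dyadic-shift decomposition described above can be checked numerically. The sketch below is only an illustration under a stated assumption: the entries of H4 used here are a reconstruction chosen to be consistent with the properties stated in the text (a Hadamard matrix whose dyadic-shift parts follow the quaternion table), not a verified copy of Figure 1. The helper names are hypothetical. Part s of the decomposition keeps the entries whose row and column indices satisfy (row XOR column) = s.

```python
# ASSUMPTION: entries of H4 reconstructed for illustration, not copied
# from Figure 1. Rows are listed top to bottom.
H4 = [
    [1, 1, -1, 1],
    [-1, 1, 1, 1],
    [1, -1, 1, 1],
    [-1, -1, -1, 1],
]


def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def negate(m):
    return [[-x for x in row] for row in m]


def dyadic_shift_parts(m):
    """Split m into sparse parts; part s keeps entries with (i XOR j) == s."""
    n = len(m)
    return [[[m[i][j] if (i ^ j) == s else 0 for j in range(n)]
             for i in range(n)] for s in range(n)]


def is_quaternion_basis(parts):
    """Check that the four parts multiply like 1, i, j, k."""
    one, i, j, k = parts
    minus_one = negate(one)
    return (
        all(matmul(one, p) == p and matmul(p, one) == p for p in parts)
        and matmul(i, i) == minus_one
        and matmul(j, j) == minus_one
        and matmul(k, k) == minus_one
        and matmul(i, j) == k and matmul(j, i) == negate(k)
        and matmul(j, k) == i and matmul(k, j) == negate(i)
        and matmul(k, i) == j and matmul(i, k) == negate(j)
    )


H40, H41, H42, H43 = dyadic_shift_parts(H4)

if __name__ == "__main__":
    print(is_quaternion_basis([H40, H41, H42, H43]))
```

Because the check is purely algebraic, it can be rerun unchanged on any other candidate for H4 to see whether its dyadic-shift parts reproduce the quaternion table.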
H4 = HL4 + HR4 = [1 0 -1 0; -1 0 1 0; 1 0 1 0; -1 0 -1 0] + [0 1 0 1; 0 1 0 1; 0 -1 0 1; 0 -1 0 1]
HL4 = HL40 + HL41 = [1 0 0 0; -1 0 0 0; 0 0 1 0; 0 0 -1 0] + [0 0 -1 0; 0 0 1 0; 1 0 0 0; -1 0 0 0]
HR4 = HR40 + HR41 = [0 1 0 0; 0 1 0 0; 0 0 0 1; 0 0 0 1] + [0 0 0 1; 0 0 0 1; 0 -1 0 0; 0 -1 0 0]

Figure 3: upper row: the representation of the matrix H4 as the sum of matrices HL4 and HR4. Other rows: representations of each of the matrices HL4 and HR4 as sums of two matrices: HL4 = HL40 + HL41, HR4 = HR40 + HR41

It is unexpected, but the set of two (4*4)-matrices HL40 and HL41 is also closed under multiplication, and its multiplication table (Figure 4) is identical to the multiplication table of complex numbers (http://en.wikipedia.org/wiki/Complex_number). Note that in matrix analysis, complex numbers are usually represented by (2*2)-matrices [a, -b; b, a]. Let us now consider the set of (4*4)-matrices CL = a0*HL40 + a2*HL41, which is an unusual representation of complex numbers (here a0, a2 are real numbers) (Figure 4). The classical identity matrix E = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1] is absent from the set of matrices CL, each of which has zero determinant. Consequently, the usual notion of the inverse matrix CL^(-1) (as CL*CL^(-1) = E) cannot be defined in relation to the classical identity matrix E, in accordance with the well-known theorem that matrices with zero determinant have no inverse (Bellman, 1960, Chapter 6, § 4). On the other hand, the set of matrices CL contains the matrix HL40, which possesses all properties of the identity matrix (or the real unit) for any member of this set (one can check that HL40*CL = CL*HL40 = CL). Within the set of matrices CL, where the matrix HL40 represents the real unit, one can define a special notion of the inverse matrix CL^(-1) for any non-zero matrix CL in relation to the matrix HL40 on the basis of the equations CL*CL^(-1) = CL^(-1)*CL = HL40.
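A short numerical illustration of this special inverse may help. The sketch below assumes the entries of HL40 and HL41 as read from Figure 3 and uses exact rational arithmetic; the function names `complex_matrix` and `inverse_relative_to_unit` are hypothetical and chosen only for this example.

```python
from fractions import Fraction

# Basis matrices of the set CL, typed in by hand from Figure 3.
HL40 = [[1, 0, 0, 0], [-1, 0, 0, 0], [0, 0, 1, 0], [0, 0, -1, 0]]
HL41 = [[0, 0, -1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [-1, 0, 0, 0]]


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def complex_matrix(a0, a2):
    """CL = a0*HL40 + a2*HL41."""
    return [[a0 * HL40[i][j] + a2 * HL41[i][j] for j in range(4)]
            for i in range(4)]


def inverse_relative_to_unit(a0, a2):
    """Inverse of CL with respect to the unit HL40 (not the identity matrix E)."""
    norm = Fraction(a0 * a0 + a2 * a2)
    if norm == 0:
        raise ZeroDivisionError("the zero matrix has no inverse")
    return complex_matrix(a0 / norm, -a2 / norm)


minus_unit = [[-x for x in row] for row in HL40]
CL = complex_matrix(3, 4)
CL_inv = inverse_relative_to_unit(3, 4)

print(matmul(HL41, HL41) == minus_unit)   # HL41 behaves like the imaginary unit
print(matmul(CL, CL_inv) == HL40)         # the product is the unit HL40, not E
```

The same construction applied to a0 = a2 = 1 gives the inverse of HL4 itself within this set.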
From this point of view, the genetic (4*4)-matrix HL4 is the complex number with unit coordinates (a0=a2=1). In the case of genetic matrices, we thus find that 4-dimensional spaces can contain 2-parametric subspaces in which complex numbers exist in the form of (4*4)-matrices CL.

        HL40   HL41
HL40    HL40   HL41
HL41    HL41   -HL40

CL = a0*HL40 + a2*HL41 = [a0 0 -a2 0; -a0 0 a2 0; a2 0 a0 0; -a2 0 -a0 0]
CL^(-1) = (a0^2 + a2^2)^(-1) * [a0 0 a2 0; -a0 0 -a2 0; -a2 0 a0 0; a2 0 -a0 0]

Figure 4: the multiplication table of the two (4*4)-matrices HL40 and HL41 (from Figure 3), which represent a set of two basic elements of complex numbers CL = a0*HL40 + a2*HL41, where a0, a2 are real numbers. Within the set of 2-parametric matrices CL, where the matrix HL40 represents the real unit, the matrix CL^(-1) is the inverse matrix for CL by definition, on the basis of the equation CL*CL^(-1) = HL40

A similar situation holds for the (4*4)-matrix HR4 = HR40 + HR41 (from Figure 3). The set of two matrices HR40 and HR41 is also closed under multiplication; its multiplication table (Figure 5) is also identical to the multiplication table of complex numbers. The set of (4*4)-matrices CR = a1*HR40 + a3*HR41, where a1, a3 are real numbers, represents complex numbers in the (4*4)-matrix form (Figure 5).

        HR40   HR41
HR40    HR40   HR41
HR41    HR41   -HR40

CR = a1*HR40 + a3*HR41 = [0 a1 0 a3; 0 a1 0 a3; 0 -a3 0 a1; 0 -a3 0 a1]
CR^(-1) = (a1^2 + a3^2)^(-1) * [0 a1 0 -a3; 0 a1 0 -a3; 0 a3 0 a1; 0 a3 0 a1]

Figure 5: the multiplication table of the two (4*4)-matrices HR40 and HR41 (from Figure 3), which represent a set of two basic elements of complex numbers CR = a1*HR40 + a3*HR41, where a1, a3 are real numbers. Within the set of 2-parametric matrices CR, where the matrix HR40 represents the real unit, the matrix CR^(-1) is the inverse matrix for any non-zero matrix CR by definition, on the basis of the equation CR*CR^(-1) = HR40.
The matrix HR40 plays the role of the real unit in this set of matrices CR. Within the set of matrices CR, where HR40 represents the real unit, the matrix CR^(-1) (Figure 5) is the inverse matrix for any non-zero matrix CR by definition, on the basis of the equations CR*CR^(-1) = CR^(-1)*CR = HR40. The genetic matrix HR4 is the complex number with unit coordinates (a1=a3=1). The two sets of (4*4)-matrices CL and CR are quite different representations of complex numbers; for example, a sum CL+CR of non-zero members of these sets belongs to neither set and is not a complex number.

One should note that the actions of the (4*4)-matrices HL4 and HR4 on 4-dimensional vectors in their planes R0(x0, 0, x2, 0) and R1(0, x1, 0, x3) rotate the vectors in opposite directions: clockwise and counterclockwise, respectively (Figure 6). Here the vectors are taken as row vectors multiplied by the matrix on the right; each rotation is by 45 degrees and is accompanied by a stretching by the factor √2. These properties of the genetic matrices can be used in studying the famous problem of dissymmetry in biological organisms.

Figure 6: The action of the matrix HL4 on a 4-dimensional vector R0(x0, 0, x2, 0) leads to a clockwise rotation of the vector (on the left). The action of the matrix HR4 on a 4-dimensional vector R1(0, x1, 0, x3) leads to a counterclockwise rotation of the vector (on the right)

As described above, we have obtained one more interesting result: the sum of two 2-dimensional complex numbers HL4 and HR4 with unit coordinates (they belong to two different matrix types of complex numbers) generates the 4-dimensional Hamilton quaternion with unit coordinates H4 = HL4 + HR4 (Figure 2). It resembles a situation when a union of Yin and Yang (or a union of female and male beginnings, or a fusion of male and female gametes) generates a new organism. Below we will meet other similar situations concerning (2^n*2^n)-matrices which represent 2^n-dimensional numbers with unit coordinates and which consist of two “complementary” halves (like the matrix H4), each of which is a 2^(n-1)-dimensional number with unit coordinates. One can name matrices of this type “matrices with internal complementarities”.
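The opposite rotation directions of HL4 and HR4 can be illustrated with a few lines of code. This is only a sketch under stated assumptions: vectors are treated as row vectors multiplied by the matrix on the right (a column-vector action would take the vectors out of their planes), the first non-zero coordinate is drawn on the horizontal axis and the second on the vertical axis, and the matrix entries are typed in from Figure 3. The names `row_times_matrix` and `turn` are hypothetical.

```python
# The two halves of H4, typed in by hand from Figure 3.
HL4 = [[1, 0, -1, 0], [-1, 0, 1, 0], [1, 0, 1, 0], [-1, 0, -1, 0]]
HR4 = [[0, 1, 0, 1], [0, 1, 0, 1], [0, -1, 0, 1], [0, -1, 0, 1]]


def row_times_matrix(v, m):
    """Row vector v multiplied by matrix m on the right."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def turn(p, q):
    """2D cross product: negative for a clockwise turn from p to q, positive otherwise."""
    return p[0] * q[1] - p[1] * q[0]


# A vector in the plane (x0, 0, x2, 0) and its image under HL4.
v_left = [2, 0, 3, 0]
w_left = row_times_matrix(v_left, HL4)
turn_left = turn((v_left[0], v_left[2]), (w_left[0], w_left[2]))

# A vector in the plane (0, x1, 0, x3) and its image under HR4.
v_right = [0, 2, 0, 3]
w_right = row_times_matrix(v_right, HR4)
turn_right = turn((v_right[1], v_right[3]), (w_right[1], w_right[3]))

print(w_left, "clockwise" if turn_left < 0 else "counterclockwise")
print(w_right, "clockwise" if turn_right < 0 else "counterclockwise")
```

With a different drawing convention for the axes the two labels swap, but the two directions remain opposite to each other.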
Matrices with internal complementarities resemble, to some extent, the complementary structure of the DNA double helix.

Let us return now to the (8*8)-matrix H8 (Figure 1) and demonstrate that it is also a matrix with internal complementarities. Figure 7 shows the matrix H8 as the sum of matrices HL8 and HR8.

H8 = HL8 + HR8, where

HL8 = [ 1 0  1 0 -1 0  1 0;
        1 0  1 0 -1 0  1 0;
       -1 0  1 0  1 0  1 0;
       -1 0  1 0  1 0  1 0;
        1 0 -1 0  1 0  1 0;
        1 0 -1 0  1 0  1 0;
       -1 0 -1 0 -1 0  1 0;
       -1 0 -1 0 -1 0  1 0]

HR8 = [0 -1 0 -1 0  1 0 -1;
       0  1 0  1 0 -1 0  1;
       0  1 0 -1 0 -1 0 -1;
       0 -1 0  1 0  1 0  1;
       0 -1 0  1 0 -1 0 -1;
       0  1 0 -1 0  1 0  1;
       0  1 0  1 0  1 0 -1;
       0 -1 0 -1 0 -1 0  1]

Figure 7: The matrix H8 (from Figure 1) is one of the matrices with internal complementarities, which are represented by its halves HL8 and HR8 (explanation in text)

HL8 = HL80 + HL81 + HL82 + HL83, where the part HL8k keeps the entries of HL8 that lie in the (2*2)-blocks whose block-row and block-column numbers (0 to 3) are related by the dyadic shift k, and has zeros elsewhere (the individual sparse matrices are not reproduced here).

        HL80   HL81    HL82    HL83
HL80    HL80   HL81    HL82    HL83
HL81    HL81   -HL80   -HL83   HL82
HL82    HL82   HL83    -HL80   -HL81
HL83    HL83   -HL82   HL81    -HL80

Figure 8: upper rows: the decomposition of the matrix HL8 (from Figure 7) as the sum of 4 matrices: HL8 = HL80 + HL81 + HL82 + HL83. Bottom row: the multiplication table of these 4 matrices HL80, HL81, HL82 and HL83, which is identical to the multiplication table of Hamilton's quaternions. The matrix HL80 represents the real unit for this matrix set

A similar situation holds for the matrix HR8 (from Figure 7).
Figure 9 shows a decomposition of the matrix HR8 as a sum of 4 matrices: HR8 = HR80 + HR81 + HR82 + HR83. The set of matrices HR80, HR81, HR82 and HR83 is closed under multiplication, and its multiplication table is identical to the multiplication table of Hamilton's quaternions. The general expression for quaternions in this case can be written as QR = a0*HR80 + a1*HR81 + a2*HR82 + a3*HR83, where a0, a1, a2, a3 are real numbers. From this point of view, the (8*8)-genomatrix HR8 is a Hamilton quaternion with unit coordinates.

HR8 = HR80 + HR81 + HR82 + HR83 (the entries of the four sparse (8*8)-matrices are not reproduced here).

        HR80   HR81    HR82    HR83
HR80    HR80   HR81    HR82    HR83
HR81    HR81   -HR80   HR83    -HR82
HR82    HR82   -HR83   -HR80   HR81
HR83    HR83   HR82    -HR81   -HR80

Figure 9: upper rows: the decomposition of the matrix HR8 (from Figure 7) as the sum of 4 matrices: HR8 = HR80 + HR81 + HR82 + HR83. Bottom row: the multiplication table of these 4 matrices HR80, HR81, HR82 and HR83, which is identical to the multiplication table of Hamilton's quaternions. HR80 represents the real unit for this matrix set

The initial (8*8)-matrix H8 (Figure 1) can also be decomposed in another way, by means of the dyadic-shift decomposition.
Figure 10 shows such a dyadic-shift decomposition H8 = H80+H81+H82+H83+H84+H85+H86+H87, in which 8 sparse matrices H80, H81, H82, H83, H84, H85, H86, H87 arise (H80 is the identity matrix). The set H80, H81, H82, H83, H84, H85, H86, H87 is closed under multiplication, and its multiplication table is shown in Figure 10. This multiplication table is identical to the multiplication table of the 8-dimensional hypercomplex numbers that are termed Hamilton's biquaternions (or Hamilton's quaternions over the field of complex numbers). The general expression for biquaternions in this case can be written as Q8 = a0*H80+a1*H81+a2*H82+a3*H83+a4*H84+a5*H85+a6*H86+a7*H87, where a0, a1, a2, a3, a4, a5, a6, a7 are real numbers. From this point of view, the (8*8)-genomatrix H8 is Hamilton's biquaternion with unit coordinates.

H8 = H80+H81+H82+H83+H84+H85+H86+H87, where the part H8k keeps the entries of H8 whose row and column numbers are related by the dyadic shift k, and has zeros elsewhere (the individual sparse matrices are not reproduced here).

       1     H81    H82    H83    H84    H85    H86    H87
1      1     H81    H82    H83    H84    H85    H86    H87
H81    H81   -1     H83    -H82   H85    -H84   H87    -H86
H82    H82   H83    -1     -H81   H86    H87    -H84   -H85
H83    H83   -H82   -H81   1      H87    -H86   -H85   H84
H84    H84   H85    -H86   -H87   -1     -H81   H82    H83
H85    H85   -H84   -H87   H86    -H81   1      H83    -H82
H86    H86   H87    H84    H85    -H82   -H83   -1     -H81
H87    H87   -H86   H85    -H84   -H83   H82    -H81   1

Figure 10: Upper rows: the decomposition of the matrix H8 (from Figure 1) as the sum of 8 matrices: H8 = H80+H81+H82+H83+H84+H85+H86+H87. Bottom row: the multiplication table of these 8 matrices H80, H81, H82, H83, H84, H85, H86, H87, which is identical to the multiplication table of Hamilton's biquaternions (or Hamilton's quaternions over the field of complex numbers). H80 is the identity matrix

Here, for the (8*8)-genomatrix H8, we have obtained an interesting result: the sum of two different 4-dimensional Hamilton quaternions with unit coordinates (they belong to two different matrix representations of Hamilton's quaternions) generates the 8-dimensional biquaternion with unit coordinates. This result resembles the results on genetic matrices with internal complementarities described above; it resembles a situation when a union of Yin and Yang (or a union of male and female beginnings, or a fusion of male and female gametes) generates a new organism.

3. THE RADEMACHER MATRICES R4 AND R8

Now let us pay attention to the Rademacher matrices R4 and R8 (Figure 1), which belong to the second important type of genetic matrices with internal complementarities. Let us first analyze the matrix R4, which is the sum of two matrices RL4 and RR4 (Figure 11).

R4 = RL4 + RR4 = [1 0 1 0; -1 0 -1 0; 1 0 1 0; -1 0 -1 0] + [0 1 0 -1; 0 1 0 -1; 0 -1 0 1; 0 -1 0 1]
RL4 = RL40 + RL41 = [1 0 0 0; -1 0 0 0; 0 0 1 0; 0 0 -1 0] + [0 0 1 0; 0 0 -1 0; 1 0 0 0; -1 0 0 0]
RR4 = RR40 + RR41 = [0 1 0 0; 0 1 0 0; 0 0 0 1; 0 0 0 1] + [0 0 0 -1; 0 0 0 -1; 0 -1 0 0; 0 -1 0 0]

Figure 11: upper row: the representation of the matrix R4 as the sum of matrices RL4 and RR4.
Other rows: representations of the matrices RL4 and RR4 as sums of the matrices RL40, RL41, RR40 and RR41.

The (4*4)-matrix RL4 is the sum of two matrices RL40 and RL41 (Figure 11), the set of which is closed under multiplication and defines the multiplication table of these matrices (Figure 12). This table is identical to the well-known multiplication table of split-complex numbers (their synonyms are Lorentz numbers, hyperbolic numbers, perplex numbers, double numbers, etc.; see http://en.wikipedia.org/wiki/Split-complex_number). Split-complex numbers form a two-dimensional commutative algebra over the real numbers.

        RL40   RL41
RL40    RL40   RL41
RL41    RL41   RL40

DL = A0*RL40 + A2*RL41 = [A0 0 A2 0; -A0 0 -A2 0; A2 0 A0 0; -A2 0 -A0 0]
DL^(-1) = (A0^2 - A2^2)^(-1) * [A0 0 -A2 0; -A0 0 A2 0; -A2 0 A0 0; A2 0 -A0 0]

Figure 12: the multiplication table of the two (4*4)-matrices RL40 and RL41 (Figure 11), which form a set of basic elements of split-complex numbers DL = A0*RL40 + A2*RL41, where A0, A2 are real numbers. The matrix RL40 represents the real unit for this matrix set. If A0^2 ≠ A2^2 (that is, A0 ≠ ±A2), the matrix DL^(-1) is the inverse matrix for DL by definition, on the basis of the equation DL*DL^(-1) = RL40

The set of (4*4)-matrices DL = A0*RL40 + A2*RL41, where A0, A2 are real numbers, represents split-complex numbers in a special (4*4)-matrix form (Figure 12). The classical identity matrix E = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1] is absent from the set of matrices DL, each of which has zero determinant. Consequently, the usual notion of the inverse matrix DL^(-1) (as DL*DL^(-1) = E) cannot be defined in relation to the classical identity matrix E, in accordance with the well-known theorem that matrices with zero determinant have no inverse (Bellman, 1960, Chapter 6, § 4). But the set of matrices DL contains the matrix RL40, which possesses all properties of the identity matrix (or the real unit) for any member of this set.
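The split-complex behaviour of RL40 and RL41, including the condition under which an inverse relative to RL40 exists, can be checked with a small sketch. It assumes the entries of RL40 and RL41 as read from Figure 11; the function names `split_complex` and `inverse_relative_to_unit` are hypothetical.

```python
from fractions import Fraction

# Basis matrices of the set DL, typed in by hand from Figure 11.
RL40 = [[1, 0, 0, 0], [-1, 0, 0, 0], [0, 0, 1, 0], [0, 0, -1, 0]]
RL41 = [[0, 0, 1, 0], [0, 0, -1, 0], [1, 0, 0, 0], [-1, 0, 0, 0]]
ZERO = [[0] * 4 for _ in range(4)]


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split_complex(a0, a2):
    """DL = A0*RL40 + A2*RL41."""
    return [[a0 * RL40[i][j] + a2 * RL41[i][j] for j in range(4)]
            for i in range(4)]


def inverse_relative_to_unit(a0, a2):
    """Inverse of DL with respect to RL40; it exists only if A0^2 != A2^2."""
    norm = Fraction(a0 * a0 - a2 * a2)
    if norm == 0:
        raise ZeroDivisionError("A0^2 == A2^2: DL is a zero divisor")
    return split_complex(a0 / norm, -a2 / norm)


DL = split_complex(2, 1)
DL_inv = inverse_relative_to_unit(2, 1)

print(matmul(RL41, RL41) == RL40)    # the non-real unit squares to +1
print(matmul(DL, DL_inv) == RL40)    # inverse relative to RL40
# A0 = A2 = 1 and A0 = 1, A2 = -1 give non-zero matrices whose product is zero.
print(matmul(split_complex(1, 1), split_complex(1, -1)) == ZERO)
```

The last line shows why the invertibility condition involves both signs: non-zero matrices with A0 = ±A2 are zero divisors and have no inverse in this set.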
Within the set of matrices DL, where the matrix RL40 represents the real unit, one can define a special notion of the inverse matrix DL^(-1) for any matrix DL with A0^2 ≠ A2^2 in relation to the matrix RL40, on the basis of the equations DL*DL^(-1) = DL^(-1)*DL = RL40 (Figure 12). From this point of view, the genetic (4*4)-matrix RL4 is the split-complex number with unit coordinates (A0=A2=1). So we find that 4-dimensional spaces can contain 2-parametric subspaces in which split-complex numbers exist in the form of (4*4)-matrices DL. It is well known that in mathematics split-complex numbers are traditionally represented in the form of the (2*2)-matrix [a0 a1; a1 a0], where a0, a1 are real numbers (http://en.wikipedia.org/wiki/Split-complex_number).

A similar situation holds for the (4*4)-matrix RR4 = RR40 + RR41 (from Figure 11). The set of two matrices RR40 and RR41 is also closed under multiplication; its multiplication table (Figure 13) is also identical to the multiplication table of split-complex numbers. The set of (4*4)-matrices DR = A1*RR40 + A3*RR41, where A1, A3 are real numbers, represents split-complex numbers in the (4*4)-matrix form (Figure 13). The matrix RR40 plays the role of the real unit in this set of matrices DR. If A1^2 ≠ A3^2, the matrix DR^(-1) (Figure 13) is the inverse matrix for DR by definition, on the basis of the equations DR*DR^(-1) = DR^(-1)*DR = RR40.

        RR40   RR41
RR40    RR40   RR41
RR41    RR41   RR40

DR = A1*RR40 + A3*RR41 = [0 A1 0 -A3; 0 A1 0 -A3; 0 -A3 0 A1; 0 -A3 0 A1]
DR^(-1) = (A1^2 - A3^2)^(-1) * [0 A1 0 A3; 0 A1 0 A3; 0 A3 0 A1; 0 A3 0 A1]

Figure 13: The multiplication table of the two (4*4)-matrices RR40 and RR41, which form a set of basic elements of split-complex numbers DR = A1*RR40 + A3*RR41, where A1, A3 are real numbers. The matrix RR40 represents the real unit in this matrix set.
If A1^2 ≠ A3^2 (that is, A1 ≠ ±A3), the matrix DR^(-1) is the inverse matrix for DR by definition, on the basis of the equation DR*DR^(-1) = RR40

The initial matrix R4 can also be decomposed in another way, by means of the dyadic-shift decomposition, as was done for the matrix H4 in Figure 2. Figure 14 shows such a dyadic-shift decomposition R4 = R04+R14+R24+R34, in which 4 sparse matrices R04, R14, R24 and R34 arise (R04 is the identity matrix). The set of these matrices R04, R14, R24 and R34 is closed under multiplication, and its multiplication table is shown in Figure 14. This multiplication table is identical to the multiplication table of the 4-dimensional hypercomplex numbers that are termed split-quaternions of J. Cockle and are well known in mathematics and physics (http://en.wikipedia.org/wiki/Split-quaternion). From this point of view, the matrix R4 is the split-quaternion with unit coordinates.

R4 = R04 + R14 + R24 + R34, where
R04 = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1]
R14 = [0 1 0 0; -1 0 0 0; 0 0 0 1; 0 0 -1 0]
R24 = [0 0 1 0; 0 0 0 -1; 1 0 0 0; 0 -1 0 0]
R34 = [0 0 0 -1; 0 0 -1 0; 0 -1 0 0; -1 0 0 0]

       R04    R14    R24    R34
R04    R04    R14    R24    R34
R14    R14    -R04   -R34   R24
R24    R24    R34    R04    R14
R34    R34    -R24   -R14   R04

Figure 14: upper row: the dyadic-shift decomposition R4 = R04+R14+R24+R34. Bottom row: the multiplication table of the sparse matrices R04, R14, R24 and R34, which is identical to the multiplication table of split-quaternions of J. Cockle (http://en.wikipedia.org/wiki/Split-quaternion). R04 is the identity matrix, which plays the role of the real unit in this form of Cockle's split-quaternions.

So we have obtained an interesting result: the sum of two 2-dimensional split-complex numbers RL4 and RR4 with unit coordinates (they belong to two different matrix types of split-complex numbers) generates the 4-dimensional split-quaternion with unit coordinates.
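The split-quaternion structure of the dyadic-shift parts of R4 can be verified in the same spirit as for H4. The sketch below assumes that part k keeps the entries of R4 with row and column numbers satisfying i XOR j = k, with R4 typed in as the sum RL4 + RR4 of Figure 11; helper names are illustrative only.

```python
def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def negate(a):
    """Entrywise negation of a matrix."""
    return [[-x for x in row] for row in a]


def dyadic_shift_parts(m):
    """Part k keeps m[i][j] where i XOR j == k and is zero elsewhere (assumed rule)."""
    n = len(m)
    return [[[m[i][j] if i ^ j == k else 0 for j in range(n)]
             for i in range(n)]
            for k in range(n)]


# R4 typed in by hand (sum of the halves RL4 and RR4 of Figure 11).
R4 = [[1, 1, 1, -1],
      [-1, 1, -1, -1],
      [1, -1, 1, 1],
      [-1, -1, -1, 1]]

R04, R14, R24, R34 = dyadic_shift_parts(R4)

# Split-quaternion signature: one unit squares to -1, the other two to +1.
square_signs = []
for part in (R14, R24, R34):
    square = matmul(part, part)
    square_signs.append(1 if square == R04 else -1 if square == negate(R04) else 0)

units_anticommute = (matmul(R14, R24) == negate(matmul(R24, R14))
                     and matmul(R24, R34) == negate(matmul(R34, R24))
                     and matmul(R14, R34) == negate(matmul(R34, R14)))
product_stays_in_set = matmul(R14, R24) in (R34, negate(R34))

print(square_signs, units_anticommute, product_stays_in_set)
```

Compared with the parts of H4, the only structural difference is the sign of two of the squares, which is what separates split-quaternions from Hamilton's quaternions.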
This result again resembles a situation when a union of Yin and Yang (a union of female and male beginnings, or a fusion of male and female gametes) generates a new organism. In particular, it means that the matrix R4 is one of the matrices with internal complementarities.

Let us return now to the (8*8)-matrix R8 (Figure 1) and demonstrate that it is also a matrix with internal complementarities. Figure 15 shows the matrix R8 as the sum of matrices RL8 and RR8.

R8 = RL8 + RR8, where

RL8 = [ 1 0  1 0  1 0 -1 0;
        1 0  1 0  1 0 -1 0;
       -1 0  1 0 -1 0 -1 0;
       -1 0  1 0 -1 0 -1 0;
        1 0 -1 0  1 0  1 0;
        1 0 -1 0  1 0  1 0;
       -1 0 -1 0 -1 0  1 0;
       -1 0 -1 0 -1 0  1 0]

RR8 = [0  1 0  1 0  1 0 -1;
       0  1 0  1 0  1 0 -1;
       0 -1 0  1 0 -1 0 -1;
       0 -1 0  1 0 -1 0 -1;
       0  1 0 -1 0  1 0  1;
       0  1 0 -1 0  1 0  1;
       0 -1 0 -1 0 -1 0  1;
       0 -1 0 -1 0 -1 0  1]

Figure 15: the matrix R8 consists of two complementary parts RL8 and RR8

Figure 16 shows a decomposition of the matrix RL8 (from Figure 15) as a sum of 4 matrices: RL8 = RL80 + RL81 + RL82 + RL83. The set of matrices RL80, RL81, RL82 and RL83 is closed under multiplication, and its multiplication table is identical to the multiplication table of Cockle's split-quaternions. The general expression for split-quaternions in this case can be written as SL = a0*RL80 + a1*RL81 + a2*RL82 + a3*RL83, where a0, a1, a2, a3 are real numbers. From this point of view, the (8*8)-genomatrix RL8 is Cockle's split-quaternion with unit coordinates.
RL8 = RL80 + RL81 + RL82 + RL83, where the part RL8k keeps the entries of RL8 that lie in the (2*2)-blocks whose block-row and block-column numbers (0 to 3) are related by the dyadic shift k, and has zeros elsewhere (the individual sparse matrices are not reproduced here).

        RL80   RL81    RL82    RL83
RL80    RL80   RL81    RL82    RL83
RL81    RL81   -RL80   -RL83   RL82
RL82    RL82   RL83    RL80    RL81
RL83    RL83   -RL82   -RL81   RL80

Figure 16: Upper rows: the decomposition of the matrix RL8 (from Figure 15) as the sum of 4 matrices: RL8 = RL80 + RL81 + RL82 + RL83. Bottom row: the multiplication table of these 4 matrices RL80, RL81, RL82 and RL83, which is identical to the multiplication table of split-quaternions of J. Cockle. RL80 represents the real unit for this matrix set

A similar situation holds for the matrix RR8 (from Figure 15). Figure 17 shows a decomposition of the matrix RR8 as a sum of 4 matrices: RR8 = RR80 + RR81 + RR82 + RR83. The set of matrices RR80, RR81, RR82 and RR83 is closed under multiplication, and its multiplication table is identical to the multiplication table of Cockle's split-quaternions. The general expression for split-quaternions in this case can be written as SR = a0*RR80 + a1*RR81 + a2*RR82 + a3*RR83, where a0, a1, a2, a3 are real numbers. From this point of view, the (8*8)-matrix RR8 is the split-quaternion with unit coordinates.
RR8 = RR80 + RR81 + RR82 + RR83 (the entries of the four sparse (8*8)-matrices are not reproduced here).

        RR80   RR81    RR82    RR83
RR80    RR80   RR81    RR82    RR83
RR81    RR81   -RR80   -RR83   RR82
RR82    RR82   RR83    RR80    RR81
RR83    RR83   -RR82   -RR81   RR80

Figure 17: upper rows: the decomposition of the matrix RR8 (from Figure 15) as the sum of 4 matrices: RR8 = RR80 + RR81 + RR82 + RR83. Bottom row: the multiplication table of these 4 matrices RR80, RR81, RR82 and RR83, which is identical to the multiplication table of Cockle's split-quaternions. RR80 represents the real unit here.

The initial (8*8)-matrix R8 (Figure 1) can also be decomposed in another way, by means of the dyadic-shift decomposition, as was done for the matrix H8 in Figure 10. Figure 18 shows such a dyadic-shift decomposition R8 = R08+R18+R28+R38+R48+R58+R68+R78, in which 8 sparse matrices R08, R18, R28, R38, R48, R58, R68, R78 arise (R08 is the identity matrix). The set R08, R18, R28, R38, R48, R58, R68, R78 is closed under multiplication, and its multiplication table is shown in Figure 18. This multiplication table is identical to the multiplication table of the 8-dimensional hypercomplex numbers that are termed Cockle's bi-split-quaternions (or split-quaternions over the field of complex numbers).
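Closure of the eight dyadic-shift parts of R8 under multiplication can be tested by brute force. The following sketch rests on two assumptions: R8 is built as the Kronecker product of R4 with [1 1; 1 1], and part k keeps the entries whose row and column numbers satisfy i XOR j = k. The helper `signed_index` is a hypothetical name introduced for this example.

```python
def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def negate(a):
    """Entrywise negation of a matrix."""
    return [[-x for x in row] for row in a]


def kron(a, b):
    """Kronecker product of two square matrices."""
    nb = len(b)
    size = len(a) * nb
    return [[a[i // nb][j // nb] * b[i % nb][j % nb] for j in range(size)]
            for i in range(size)]


def dyadic_shift_parts(m):
    """Part k keeps m[i][j] where i XOR j == k and is zero elsewhere (assumed rule)."""
    n = len(m)
    return [[[m[i][j] if i ^ j == k else 0 for j in range(n)]
             for i in range(n)]
            for k in range(n)]


R4 = [[1, 1, 1, -1], [-1, 1, -1, -1], [1, -1, 1, 1], [-1, -1, -1, 1]]
R8 = kron(R4, [[1, 1], [1, 1]])
parts = dyadic_shift_parts(R8)


def signed_index(m):
    """Return (sign, k) if m equals sign * parts[k], otherwise None."""
    for k, part in enumerate(parts):
        if m == part:
            return (1, k)
        if m == negate(part):
            return (-1, k)
    return None


# Full 8*8 table of products, each entry stored as (sign, index of the part).
table = [[signed_index(matmul(a, b)) for b in parts] for a in parts]
closed = all(entry is not None for row in table for entry in row)
square_signs = [table[k][k][0] for k in range(8)]

print(closed, square_signs)
```

In this sketch the product of parts a and b always lands, up to sign, on the part with index a XOR b, which is the pattern one expects from a dyadic-shift decomposition.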
The general expression for bi-split-quaternions in this case can be written as S8 = a0*R08+a1*R18+a2*R28+a3*R38+a4*R48+a5*R58+a6*R68+a7*R78, where a0, a1, a2, a3, a4, a5, a6, a7 are real numbers. From this point of view, the (8*8)-genomatrix R8 is the bi-split-quaternion with unit coordinates.

R8 = R08+R18+R28+R38+R48+R58+R68+R78, where the part Rk8 keeps the entries of R8 whose row and column numbers are related by the dyadic shift k, and has zeros elsewhere (the individual sparse matrices are not reproduced here).

       R08    R18    R28    R38    R48    R58    R68    R78
R08    R08    R18    R28    R38    R48    R58    R68    R78
R18    R18    R08    R38    R28    R58    R48    R78    R68
R28    R28    R38    -R08   -R18   -R68   -R78   R48    R58
R38    R38    R28    -R18   -R08   -R78   -R68   R58    R48
R48    R48    R58    R68    R78    R08    R18    R28    R38
R58    R58    R48    R78    R68    R18    R08    R38    R28
R68    R68    R78    -R48   -R58   -R28   -R38   R08    R18
R78    R78    R68    -R58   -R48   -R38   -R28   R18    R08

Figure 18: Upper rows: the decomposition of the matrix R8 (from Figure 1) as the sum of 8 matrices: R8 = R08+R18+R28+R38+R48+R58+R68+R78. Bottom row: the multiplication table of these 8 matrices R08, R18, R28, R38, R48, R58, R68 and R78, which is identical to the multiplication table of Cockle's bi-split-quaternions.
R08 is the identity matrix and represents the real unit here.

Here, for the (8*8)-genomatrix R8, we have obtained an interesting result: the sum of two different 4-dimensional Cockle split-quaternions with unit coordinates (they belong to two different matrix types of split-quaternion numbers) generates the 8-dimensional bi-split-quaternion with unit coordinates. This result resembles the above-described result about the sum of 2-dimensional split-complex numbers with unit coordinates that generates the 4-dimensional split-quaternion with unit coordinates (Figures 12-14). It also resembles a situation when a union of Yin and Yang (a union of male and female beginnings, or a fusion of male and female gametes) generates a new organism.

4. MATRICES OF GENETIC DUPLETS AND TRIPLETS

The theory of noise-immunity coding is based on matrix methods. For example, matrix methods make it possible to transmit high-quality photos of the surface of Mars across millions of kilometers in the presence of strong interference. In particular, Kronecker families of Hadamard matrices are used for this purpose. Kronecker multiplication of matrices is a well-known operation in signal processing technology, theoretical physics, etc. It is used for the transition from spaces of smaller dimension to associated spaces of higher dimension.

By analogy with the theory of noise-immunity coding, the 4-letter alphabet of RNA (adenine A, cytosine C, guanine G and uracil U) can be represented in the form of the (2*2)-matrix [C U; A G] (Figure 19) as the kernel of the Kronecker family of matrices [C U; A G](n), where (n) means a Kronecker power (Figure 19). Inside this family, the 4-letter alphabet of monoplets is connected with the alphabets of 16 duplets and 64 triplets by means of the second and third Kronecker powers of the kernel matrix: [C U; A G](2) and [C U; A G](3), where all duplets and triplets are arranged in a strict order (Figure 19).
We begin with the alphabet A, C, G, U of RNA here because mRNA sequences of triplets define protein sequences of amino acids in the course of their reading in ribosomes (below we will separately consider the case of DNA with its own alphabet). Figure 19 contains not only the 64 triplets but also the amino acids and stop-codons encoded by the triplets in the case of the Vertebrate mitochondrial genetic code, which is the most symmetrical among the known variants of the genetic code (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). One can see in Figure 19 that in the matrix [C U; A G](3) the set of columns with even numbers 0, 2, 4, 6 and the set of columns with odd numbers 1, 3, 5, 7 contain the same collection of amino acids and stop-codons. In other words, nature has constructed the distribution of amino acids and stop-codons in accordance with the principle of a matrix with internal complementarity. This fact is only one piece of evidence that the described matrices with internal complementarities are mathematical patterns of the genetic coding system (the mathematical partners of the genetic code).

Let us explain the black-and-white mosaics of [C U; A G](2) and [C U; A G](3) (Figure 19), which reflect important features of the genetic code. These features are connected with the specificity of reading mRNA sequences in ribosomes to define protein sequences of amino acids (this is the reason why we use the alphabet A, C, G, U of RNA in the matrices in Figure 19; below we will consider the case of DNA sequences separately).
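The statement about even and odd columns can be checked directly. The sketch below builds the symbolic Kronecker cube of [C U; A G] by string concatenation and translates each triplet with the Vertebrate mitochondrial code, written in the compact 64-letter form used by the NCBI genetic code tables (base order U, C, A, G; `*` marks a stop-codon). The helper names `kron_symbolic` and `translate` are illustrative assumptions, not part of the original article.

```python
BASES = "UCAG"
# Vertebrate mitochondrial code (NCBI translation table 2), one-letter amino
# acid symbols for the codons UUU, UUC, UUA, UUG, UCU, ... GGG in this order.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def translate(codon):
    """Amino acid (or '*') encoded by an RNA triplet."""
    i, j, k = (BASES.index(letter) for letter in codon)
    return AMINO[16 * i + 4 * j + k]


def kron_symbolic(a, b):
    """Kronecker product of symbolic matrices: entries are concatenated."""
    nb = len(b)
    size = len(a) * nb
    return [[a[i // nb][j // nb] + b[i % nb][j % nb] for j in range(size)]
            for i in range(size)]


KERNEL = [["C", "U"], ["A", "G"]]
cube = kron_symbolic(kron_symbolic(KERNEL, KERNEL), KERNEL)

# Collections of encoded objects in the even and in the odd columns.
even_columns = sorted(translate(cube[i][j]) for i in range(8) for j in (0, 2, 4, 6))
odd_columns = sorted(translate(cube[i][j]) for i in range(8) for j in (1, 3, 5, 7))

# Stronger check: each even column coincides entry by entry with its right neighbour.
neighbours_match = all(translate(cube[i][j]) == translate(cube[i][j + 1])
                       for i in range(8) for j in (0, 2, 4, 6))

print(even_columns == odd_columns, neighbours_match)
```

In this arrangement neighbouring columns differ only in the third letter of the triplet (C versus U, or A versus G), so the check amounts to the symmetry of this code dialect under these two replacements in the third position.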
[C U; A G] =
C | U
A | G

[C U; A G](2) =
CC | CU | UC | UU
CA | CG | UA | UG
AC | AU | GC | GU
AA | AG | GA | GG

[C U; A G](3) =
CCC PRO | CCU PRO | CUC LEU  | CUU LEU  | UCC SER  | UCU SER  | UUC PHE | UUU PHE
CCA PRO | CCG PRO | CUA LEU  | CUG LEU  | UCA SER  | UCG SER  | UUA LEU | UUG LEU
CAC HIS | CAU HIS | CGC ARG  | CGU ARG  | UAC TYR  | UAU TYR  | UGC CYS | UGU CYS
CAA GLN | CAG GLN | CGA ARG  | CGG ARG  | UAA STOP | UAG STOP | UGA TRP | UGG TRP
ACC THR | ACU THR | AUC ILE  | AUU ILE  | GCC ALA  | GCU ALA  | GUC VAL | GUU VAL
ACA THR | ACG THR | AUA MET  | AUG MET  | GCA ALA  | GCG ALA  | GUA VAL | GUG VAL
AAC ASN | AAU ASN | AGC SER  | AGU SER  | GAC ASP  | GAU ASP  | GGC GLY | GGU GLY
AAA LYS | AAG LYS | AGA STOP | AGG STOP | GAA GLU  | GAG GLU  | GGA GLY | GGG GLY

Figure 19: the first three representatives of the Kronecker family of RNA-alphabetic matrices [C U; A G](n). Black color marks the 8 strong duplets CC, CU, CG, AC, UC, GC, GU, GG in the matrix [C U; A G](2) and the 32 triplets with strong roots (triplets beginning with these duplets) in the matrix [C U; A G](3). The 20 amino acids and the stop-codons that correspond to the triplets are also shown in the matrix [C U; A G](3) for the case of the Vertebrate mitochondrial genetic code

The combination of letters in the first two positions of each triplet is usually termed the “root” of this triplet (Konopelchenko, Rumer, 1975a,b; Rumer, 1968). Modern science recognizes many variants (or dialects) of the genetic code, data about which are shown on the NCBI website http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi. There exist 17 variants (or dialects) of the genetic code that differ from one another in some details of the correspondences between triplets and the objects encoded by them. Most of these dialects (including the so-called Standard Code and the Vertebrate Mitochondrial Code) share a general symmetrological scheme of these correspondences, in which there are 32 “black” triplets with “strong” roots and 32 “white” triplets with “weak” roots (see details in (Petoukhov, 2008c)).
In this basic scheme, the set of 64 triplets contains 16 subfamilies of triplets, each of which contains 4 triplets with the same two letters in the first positions (an example of such a subfamily is the four triplets CAC, CAA, CAT, CAG with the same two letters CA in their first positions; the triplets are written here in the DNA alphabet, with T in place of U). We call these subfamilies the subfamilies of NN-triplets. In the described basic scheme, the set of these 16 subfamilies of NN-triplets is divided into two equal subsets.

The first subset contains 8 subfamilies of so-called “two-position” NN-triplets, the coding value of which is independent of the letter in their third position: (CCC, CCT, CCA, CCG), (CTC, CTT, CTA, CTG), (CGC, CGT, CGA, CGG), (TCC, TCT, TCA, TCG), (ACC, ACT, ACA, ACG), (GCC, GCT, GCA, GCG), (GTC, GTT, GTA, GTG), (GGC, GGT, GGA, GGG). An example of such a subfamily is the four triplets CGC, CGA, CGT, CGG, all of which encode the same amino acid Arg, though they have different letters in their third position. The 32 triplets of the first subset are termed “triplets with strong roots” (Konopelchenko, Rumer, 1975a,b; Rumer, 1968). Their 8 strong roots are the following duplets: CC, CT, CG, AC, TC, GC, GT, GG (strong duplets). All of these 32 NN-triplets and 8 strong duplets are marked in black in the matrices [C U; A G](3) and [C U; A G](2) in Figure 19.

The second subset contains 8 subfamilies of “three-position” NN-triplets, the coding value of which depends on the letter in their third position: (CAC, CAT, CAA, CAG), (TTC, TTT, TTA, TTG), (TAC, TAT, TAA, TAG), (TGC, TGT, TGA, TGG), (AAC, AAT, AAA, AAG), (ATC, ATT, ATA, ATG), (AGC, AGT, AGA, AGG), (GAC, GAT, GAA, GAG). An example of such a subfamily is the four triplets CAC, CAA, CAT, CAG, two of which (CAC, CAT) encode the amino acid His, while the other two (CAA, CAG) encode another amino acid, Gln. The 32 triplets of the second subset are termed “triplets with weak roots” (Konopelchenko, Rumer, 1975a,b; Rumer, 1968).
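The division of the 16 roots into strong and weak ones follows mechanically from a codon table. The sketch below is one possible way to do it, assuming the Vertebrate mitochondrial code in the compact 64-letter NCBI form, rewritten in the DNA alphabet (base order T, C, A, G; `*` marks a stop-codon). A root is counted as strong when all four of its triplets encode the same object; the helper name `translate` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Vertebrate mitochondrial code (NCBI translation table 2) for the codons
# TTT, TTC, TTA, TTG, TCT, ... GGG in this order.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def translate(codon):
    """Amino acid (or '*') encoded by a DNA triplet."""
    i, j, k = (BASES.index(letter) for letter in codon)
    return AMINO[16 * i + 4 * j + k]


roots = [first + second for first, second in product(BASES, repeat=2)]

# A root is strong if the third letter does not change the encoded object.
strong_roots = [root for root in roots
                if len({translate(root + third) for third in BASES}) == 1]
weak_roots = [root for root in roots if root not in strong_roots]

print("strong:", strong_roots)
print("weak:  ", weak_roots)
```

For this dialect the computed strong roots coincide with the eight strong duplets listed in the text; other dialects can be tested by replacing the 64-letter string.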
The following 8 duplets are the weak roots for them: CA, AA, AT, AG, TA, TT, TG, GA (weak duplets). All members of these 32 NN-triplets and 8 weak duplets are marked by white color in the matrices [C U; A G](3) and [C U; A G](2) on Figure 19. From the point of view of its black-and-white mosaic, each column of the genetic matrices [C U; A G](2) and [C U; A G](3) has a meander-like character and coincides with one of the Rademacher functions, which form orthogonal systems and are well known in discrete signal processing. These functions contain only the elements “+1” and “-1”. Due to this fact, one can construct Rademacher representations of the symbolic genomatrices [C U; A G](2) and [C U; A G](3) (Figure 19) by means of the following operation: each black duplet and each black triplet is replaced by the number “+1”, and each white duplet and each white triplet is replaced by the number “-1”. This operation leads immediately to the matrices R4 and R8 from Figure 1, which are the Rademacher representations of the phenomenological genomatrices [C U; A G](2) and [C U; A G](3). This fact is one piece of evidence for the algebraic nature of the genetic code. One can note that the genomatrices [C U; A G](2) and [C U; A G](3) and their Rademacher representations R4 and R8 (Figure 1) are connected on the basis of the equations (1), where ⊗ means Kronecker multiplication:

R4 ⊗ [1 1; 1 1] = R8,   [C U; A G](2) ⊗ [C U; A G] = [C U; A G](3)   (1)

Here [1 1; 1 1] is the traditional (2*2)-matrix representation of the split-complex number with unit coordinates, which can be considered as the Rademacher representation R2 of the genomatrix [C U; A G]. The equations (1) testify that, in the case of the RNA alphabet, each of its four letters in the matrix [C U; A G] should be taken as equal to the number “+1”: A=C=G=U=+1.
They also show that Rademacher representations R2 and R4 of matrices [C U; A G] and [C U; A G](2) can be considered as basic due to the fact that the Rademacher representation R8 is deduced from them by means of their Kronecker multiplication. Now let us pay attention to the DNA alphabet (adenine A, cytosine C, guanine G and thymine T) and the appropriate Kronecker family of matrices [C T; A G](n). What kind of black-and-white mosaics (or a disposition of elements “+1” and “-1” in numeric representations of these symbolic matrices) can be appropriate in this case for the basic matrix [C T; A G] and [C T; A G](2)? The important phenomenological fact is that the thymine T is a single nitrogenous base in DNA which is replaced in RNA by another nitrogenous base U (uracil) for unknown reason (this is one of the mysteries of the genetic system). In other words, in this system the letter T is the opposition in relation to the letter U, and so the letter T can be symbolized by number “-1” (instead of number “+1” for U). By this objective reason, one can construct numeric representations H2 and H4 of mentioned matrices [C T; A G] and [C T; A G](2) by means of the following algorithm of transformation of black-and-white mosaics of matrices [C U; A G] and [C U; A G](2) from Figure 19 together with their Rademacher representations R2 and R4: - in matrices [C T; A G] and [C T; A G](2), each of monoplets and duplets that begin with the letter T, should be taken with opposite color in comparison with appropriate entries in matrices [C U; A G] and [C U; A G](2) from Figure 19; correspondingly numeric representations of these DNA-alphabetic matrices [C T; A G] and [C T; A G](2) reflect the new mosaics of these symbolic matrices. 
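The constructions described so far can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's software: the helper names (`kron_sym`, `kron_num`, `gram`) are hypothetical, the set of strong roots is the one listed in the text, and the sign flip for entries beginning with T follows the stated rule. The final orthogonality check on H4 is an extra sanity check added here, not a step of the algorithm in the text.

```python
# Illustrative sketch; helper names are hypothetical, not from the article.

# Strong (black) duplets in the RNA alphabet, as listed in the text.
STRONG_ROOTS = {"CC", "CU", "CG", "AC", "UC", "GC", "GU", "GG"}


def kron_sym(a, b):
    """Kronecker product of symbolic matrices: entries are concatenated."""
    return [[x + y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kron_num(a, b):
    """Kronecker product of numeric matrices given as lists of lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def gram(m):
    """Return the product of m with its transpose."""
    return [[sum(p * q for p, q in zip(r1, r2)) for r2 in m] for r1 in m]


RNA = [["C", "U"], ["A", "G"]]
RNA2 = kron_sym(RNA, RNA)    # [C U; A G](2): 4x4 matrix of duplets
RNA3 = kron_sym(RNA2, RNA)   # [C U; A G](3): 8x8 matrix of triplets

# Rademacher representations: black (strong root) -> +1, white -> -1.
R4 = [[1 if d in STRONG_ROOTS else -1 for d in row] for row in RNA2]
R8 = [[1 if t[:2] in STRONG_ROOTS else -1 for t in row] for row in RNA3]

# Numeric half of equations (1): R4 (x) [1 1; 1 1] = R8.
R2 = [[1, 1], [1, 1]]
assert kron_num(R4, R2) == R8

# DNA alphabet: every duplet that begins with T (which replaces U)
# takes the opposite sign compared with the RNA matrix.
H4 = [[-v if d[0] == "U" else v for d, v in zip(row_d, row_v)]
      for row_d, row_v in zip(RNA2, R4)]

# Extra sanity check (not in the text's algorithm): rows of H4 are orthogonal.
assert gram(H4) == [[4 if i == j else 0 for j in range(4)] for i in range(4)]
```

Representing the symbolic matrices as nested lists of strings keeps the symbolic and numeric Kronecker products structurally identical, so the same index bookkeeping serves both halves of equations (1).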
The numeric representation H8 of the DNA-alphabetic matrix of triplets [C T; A G](3) is constructed on the basis of equations (2) by analogy with equations (1):

H4 ⊗ [1 -1; 1 1] = H8,   [C T; A G](2) ⊗ [C T; A G] = [C T; A G](3)   (2)

Here [1 -1; 1 1] is the traditional (2*2)-matrix representation of the complex number with unit coordinates. The black-and-white mosaic of the matrix [C T; A G](3) is defined by the disposition of the numbers “+1” and “-1” in its numeric representation H8. Figure 20 shows the DNA-alphabetic matrices [C T; A G], [C T; A G](2) and [C T; A G](3) with their mosaics constructed in this way, which is based on the objective properties of the molecular-genetic system and can be used in biological computers of organisms. One can see that the mosaics of these symbolic matrices [C T; A G](2) and [C T; A G](3) coincide with the disposition of the numbers “+1” and “-1” in the numeric matrices H4 and H8 (Figure 1), which can be termed “Hadamard representations” of these genomatrices because the matrices H4 and H8 satisfy the definition of Hadamard matrices (Petoukhov, 2008b, 2011).

[C T; A G]:

C  T
A  G

[C T; A G](2):

CC  CT  TC  TT
CA  CG  TA  TG
AC  AT  GC  GT
AA  AG  GA  GG

[C T; A G](3):

CCC  CCT  CTC  CTT  TCC  TCT  TTC  TTT
CCA  CCG  CTA  CTG  TCA  TCG  TTA  TTG
CAC  CAT  CGC  CGT  TAC  TAT  TGC  TGT
CAA  CAG  CGA  CGG  TAA  TAG  TGA  TGG
ACC  ACT  ATC  ATT  GCC  GCT  GTC  GTT
ACA  ACG  ATA  ATG  GCA  GCG  GTA  GTG
AAC  AAT  AGC  AGT  GAC  GAT  GGC  GGT
AAA  AAG  AGA  AGG  GAA  GAG  GGA  GGG

Figure 20: The first three representatives [C T; A G], [C T; A G](2) and [C T; A G](3) of the Kronecker family of DNA-alphabetic matrices [C T; A G](n). Hadamard representations H4 and H8 of the symbolic matrices [C T; A G](2) and [C T; A G](3) with the same mosaics are shown on Figure 1.

Genetic matrices with internal complementarities resemble objects with Yin and Yang parts from doctrines of Ancient China. One can add here the following mathematical fact.
The famous Yin-Yang symbol has a symmetrical configuration: its 180-degree turn changes only its black-and-white mosaic, but the new configuration of the symbol coincides with the initial one. It is interesting that the 180-degree turn of the genetic matrices R4, R8, H4, H8 (Figure 1) leads to a similar result: the mosaics of these matrices are essentially changed, but the new matrices are again matrices with internal complementarities, the algebraic properties of which coincide with the initial ones (the same multiplication tables as on Figures 9, 10, 12-14, 16-18). So, in this case, the mythological object helps to reveal new mathematical properties of the genetic matrices. The phenomenology of the genetic system gives additional confirmations of its connection with the mosaic genomatrices [C T; A G](n), the numeric representations of which possess internal complementarities. In the matrices [C T; A G](n), let us enumerate their 2^n columns from left to right by the numbers 0, 1, 2, ..., 2^n-1 and then consider two sets of n-plets (oligonucleotides) in each of the matrices [C T; A G](n): 1) the first set contains all n-plets from columns with even numeration 0, 2, 4, … (this set is conditionally termed the even-set or the Yin-set); 2) the second set contains all n-plets from columns with odd numeration 1, 3, 5, … (this set is conditionally termed the odd-set or the Yang-set). For example, the genomatrix [C T; A G](3) (Figure 20) contains the even-set of 32 triplets in its columns with even numerations 0, 2, 4, 6 (CCC, CCA, CAC, CAA, ACC, ACA, AAC, AAA, CTC, CTA, CGC, CGA, ATC, ATA, AGC, AGA, TCC, TCA, TAC, TAA, GCC, GCA, GAC, GAA, TTC, TTA, TGC, TGA, GTC, GTA, GGC, GGA) and the odd-set of 32 triplets in its columns with odd numerations 1, 3, 5, 7 (CCT, CCG, CAT, CAG, ACT, ACG, AAT, AAG, CTT, CTG, CGT, CGG, ATT, ATG, AGT, AGG, TCT, TCG, TAT, TAG, GCT, GCG, GAT, GAG, TTT, TTG, TGT, TGG, GTT, GTG, GGT, GGG).
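The even-set and odd-set can be generated mechanically from the Kronecker family. The following is a minimal sketch under the assumption that the symbolic Kronecker product concatenates letters; the function names (`kron_sym`, `kronecker_power`, `even_odd_sets`) are hypothetical and introduced only for illustration.

```python
# Illustrative sketch; function names are hypothetical.

def kron_sym(a, b):
    """Kronecker product of symbolic matrices: entries are concatenated."""
    return [[x + y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kronecker_power(base, n):
    """Return the n-th Kronecker power of a symbolic matrix (n >= 1)."""
    result = base
    for _ in range(n - 1):
        result = kron_sym(result, base)
    return result


def even_odd_sets(matrix):
    """Split all entries into those from even columns and from odd columns."""
    even, odd = [], []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        (even if j % 2 == 0 else odd).extend(column)
    return even, odd


DNA = [["C", "T"], ["A", "G"]]
triplets = kronecker_power(DNA, 3)        # [C T; A G](3)
even_set, odd_set = even_odd_sets(triplets)

print(len(even_set), len(odd_set))        # 32 32
print(even_set[:8])                       # column 0, top to bottom
```

One consequence visible in the output is that column parity is decided by the last Kronecker factor alone, so the even-set consists exactly of the triplets whose third letter is C or A, and the odd-set of those whose third letter is T or G.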
One can show, for example, that the structure of the whole human genome is connected with the equal division of the whole set of 64 triplets into the even-set of 32 triplets and the odd-set of 32 triplets. Indeed, let us calculate the total quantities (frequencies Feven and Fodd) of members of these two sets of triplets in the whole human genome, which contains the huge number of 2,843,411,612 (about three billion) triplets. The initial data about this genome (Figure 21) are taken by the author from the article (Perez, 2010). The frequencies of different triplets in this genome are very different. For example, the frequency of the triplet CGA is equal to 6,251,611 and the frequency of the triplet TTT is equal to 109,591,342; they differ by a factor of approximately 18. But the result of our calculation shows that the total quantities of members of the even-set (Feven) and of the odd-set (Fodd) in the whole human genome are equal to each other with a precision within 0.12%: Feven = 1,420,853,821 for the even-set of 32 triplets; Fodd = 1,422,557,791 for the odd-set of 32 triplets. One should note that the work (Perez, 2010, Table 10) shows another variant of division of the set of 64 triplets into two other subsets with 32 triplets in each, based not on the matrix approach but on a traditional table of triplets and a principle of “codons and their mirror-codons”. This variant also reveals an approximate equality of the quantities of members of these two subsets with a high precision for the case of the whole human genome.
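The totals Feven and Fodd can be recomputed directly from the triplet counts tabulated in this section (data from Perez, 2010). The sketch below is an illustration only: the variable names are hypothetical, and membership in the even-set is tested through the third letter of the triplet (C or A), which is equivalent to lying in an even-numbered column of [C T; A G](3).

```python
# Illustrative sketch; counts are copied from the article's table (Perez, 2010).
TABLE = """
AAA 109143641 CAA 53776608 GAA 56018645 TAA 59167883
AAC 41380831 CAC 42634617 GAC 26820898 TAC 32272009
AAG 56701727 CAG 57544367 GAG 47821818 TAG 36718434
AAT 70880610 CAT 52236743 GAT 37990593 TAT 58718182
ACA 57234565 CCA 52352507 GCA 40907730 TCA 55697529
ACC 33024323 CCC 37290873 GCC 33788267 TCC 43850042
ACG 7117535 CCG 7815619 GCG 6744112 TCG 6265386
ACT 45731927 CCT 50494519 GCT 39746348 TCT 62964984
AGA 62837294 CGA 6251611 GGA 43853584 TGA 55709222
AGC 39724813 CGC 6737724 GGC 33774033 TGC 40949883
AGG 50430220 CGG 7815677 GGG 37333942 TGG 52453369
AGT 45794017 CGT 7137644 GGT 33071650 TGT 57468177
ATA 58649060 CTA 36671812 GTA 32292235 TTA 59263408
ATC 37952376 CTC 47838959 GTC 26866216 TTC 56120623
ATG 52222957 CTG 57598215 GTG 42755364 TTG 54004116
ATT 71001746 CTT 56828780 GTT 41557671 TTT 109591342
"""

tokens = TABLE.split()
# Pair each triplet with the count that follows it.
counts = {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}

# Even columns of [C T; A G](3) hold triplets ending in C or A.
f_even = sum(v for t, v in counts.items() if t[2] in "CA")
f_odd = sum(v for t, v in counts.items() if t[2] in "TG")

print(f_even)                                   # 1420853821
print(f_odd)                                    # 1422557791
print(round(100 * (f_odd - f_even) / f_even, 2))  # 0.12 (percent)
```

The two sums reproduce the values of Feven and Fodd quoted in the text, and their total equals the stated number of triplets, 2,843,411,612.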
TRIPLET  FREQUENCY    TRIPLET  FREQUENCY    TRIPLET  FREQUENCY    TRIPLET  FREQUENCY
AAA      109143641    CAA      53776608     GAA      56018645     TAA      59167883
AAC      41380831     CAC      42634617     GAC      26820898     TAC      32272009
AAG      56701727     CAG      57544367     GAG      47821818     TAG      36718434
AAT      70880610     CAT      52236743     GAT      37990593     TAT      58718182
ACA      57234565     CCA      52352507     GCA      40907730     TCA      55697529
ACC      33024323     CCC      37290873     GCC      33788267     TCC      43850042
ACG      7117535      CCG      7815619      GCG      6744112      TCG      6265386
ACT      45731927     CCT      50494519     GCT      39746348     TCT      62964984
AGA      62837294     CGA      6251611      GGA      43853584     TGA      55709222
AGC      39724813     CGC      6737724      GGC      33774033     TGC      40949883
AGG      50430220     CGG      7815677      GGG      37333942     TGG      52453369
AGT      45794017     CGT      7137644      GGT      33071650     TGT      57468177
ATA      58649060     CTA      36671812     GTA      32292235     TTA      59263408
ATC      37952376     CTC      47838959     GTC      26866216     TTC      56120623
ATG      52222957     CTG      57598215     GTG      42755364     TTG      54004116
ATT      71001746     CTT      56828780     GTT      41557671     TTT      109591342

Figure 21: Quantities of repetitions of each triplet in the whole human genome (from Perez, 2010).

A more general confirmation of the genetic importance of the structure of genomatrices with internal complementarities for long nucleotide sequences was revealed by the results of the study of the Symmetry Principle № 6 from the work (Petoukhov, 2008c, 6th version, section 11), where, in contrast to this article, a special notion of fractal genetic nets for long nucleotide sequences was used. Now we propose to use the relevant phenomenological data for justification and development of the new idea: the described matrices with internal complementarities are important algebraic patterns for the structurization of the genetic coding system, the nature of which has algebraic bases. The described connection between the genetic system and matrices with internal complementarities is associated with Plato's conception about androgynes.
In accordance with this ancient conception, in primal times people had doubled bodies. But at one moment the gods punished them by splitting them in half. Ever since that time, people run around saying they are looking for their other half because they are really trying to recover their primal nature (http://en.wikipedia.org/wiki/Symposium_(Plato)). This conception is frequently used in discussions of important facts of embryology and other modern scientific fields concerning hermaphroditism, including the embryological principle of primordial hermaphroditism, etc. (Dreger, 1998; Money, 1990, etc.). Taking Plato's conception into account, genetic matrices with internal complementarities can also be termed “androgynous matrices”. The results of our research lead to the idea that phenomena of hermaphroditism have a basic analogue at the molecular-genetic level. These results can be related to biological problems of genetically inherited symmetries and dissymmetry (Darvas, 2007; Gal, 2011; Hellige, 1993).

5. SOME CONCLUDING REMARKS

At the beginning of the 19th century, there was a belief in the existence of a single arithmetic that is true for all natural systems. But after the discovery of quaternions by Hamilton, science was compelled to abandon the former belief in the existence of only one true arithmetic/algebra in the world (see (Kline, 1980)). It has been recognized that various natural systems can have not only their own geometry (Euclidean or non-Euclidean geometries), but also their own algebra (arithmetic of multi-dimensional numbers). If a scientist takes an inadequate algebra to model a natural system, he/she can repeat the impressive example of Hamilton, who wasted 10 years trying to solve the task of 3D space transformations on the basis of inadequate 3-dimensional algebras (this task needs the 4-dimensional algebra of Hamilton's quaternions).
Modern theoretical physics includes, as one of its main parts, a great number of attempts to reveal what kinds of multi-dimensional numeric systems correspond to ensembles of relations in concrete physical systems. The results of our research reveal that relations in the genetic coding system correspond to the described algebraic system of matrices with internal complementarities. If a researcher does not take into account this fact and this special mathematics, he/she runs the risk of wasting a lot of time and effort because of the application of inadequate approaches to the study of algebraic properties of the genetic system. In particular, this article shows the connection of the genetic coding system with Hamilton's quaternions. Hamilton quaternions are closely related to the Pauli matrices, the theory of the electromagnetic field (Maxwell wrote his equations in the language of Hamilton quaternions), the special theory of relativity, the theory of spins, the quantum theory of chemical valency, etc. In the twentieth century thousands of works were devoted to quaternions in physics [http://arxiv.org/abs/math-ph/0511092]. Now Hamilton quaternions are manifested in the genetic code system. Our scientific direction, "matrix genetics", has led to the discovery of an important bridge among physics, biology and computer science for their mutual enrichment. In addition, our study provides a new example of the inconceivable effectiveness of mathematics: abstract mathematical structures derived by mathematicians at the tip of the pen 160 years ago are embodied inside the molecular-genetic system, which is the informational basis of living matter. And mathematics that was discovered by means of painful reflection (as by Hamilton, who spent 10 years of continuous thought to discover his quaternions) is already represented in the genetic coding system.
The described genetic matrices with internal complementarities (or “androgynous matrices”) possess many other interesting mathematical properties related to cyclic and dyadic shifts, multiplications of these matrices, Kronecker families of matrices R4⊗[1 1; 1 1](n) and H4⊗[1 -1; 1 1](n), dichotomous trees of different 2^n-dimensional numbers, rotational transformations of these numeric genomatrices into new numeric genomatrices with internal complementarities, etc. A set of (2^n*2^n)-matrices with internal complementarities contains a huge quantity of different types of matrix representations of complex numbers (or relevant algebraic fields, http://en.wikipedia.org/wiki/Field_(mathematics)) and of split-complex numbers that, as far as the author can judge, have not been specially studied in mathematics previously. The relevant 2^n-dimensional numeric systems, including the said plurality of complex and split-complex numbers and their extensions, have prospects for application in mathematical natural sciences and signal processing. Here one can remember the statement: “Profound study of nature is the most fertile source of mathematical discoveries” (Fourier, 2006). The discovery of the genetic importance of matrices with internal complementarities gives us a possibility to divide sets of amino acids and stop-signals into interesting sub-sets in accordance with the structure of the genomatrix [C T; A G](3); it also presents new approaches to the study of proteins. One should note that phenomena of complementarities play a basic role at different genetic levels. We hope to expand on this and similar topics in future publications. The notion of number is one of the main notions of mathematics. In the long evolution of this notion, many kinds of multi-dimensional numerical systems have appeared. Complex numbers and split-complex numbers occupy a particularly important place in mathematics and mathematical natural sciences.
For example, complex numbers have appeared as magic instruments for the development of theories and calculations in the field of problems of heat, light, sounds, vibrations, elasticity, gravitation, magnetism, electricity, liquid streams, and phenomena of the micro-world. These complex numbers are the mathematical basis of quantum mechanics and of many other branches of science. For example, the Schrödinger equation contains the imaginary unit, and the wave functions of quantum mechanics are complex-valued. This article shows that many kinds of complex numbers and split-complex numbers exist which are connected with the genetic matrices. One can think that this splitting of the numeric basis of mathematical natural sciences leads to a relevant splitting in mathematical natural sciences. For example, one can ask what kinds of complex numbers should be used in the Schrödinger equation. Or can different types of wave functions of quantum mechanics exist, which correspond to different kinds of complex numbers? In our opinion, such questions should be deeply analyzed in the future. This article proposes a new mathematical approach to study “a partnership between genes and mathematics” (see Section 1 above). In the author's opinion, this kind of mathematics is beautiful and it can be used for the further development of algebraic biology and theoretical physics in accordance with the famous statement by P. Dirac, who taught that the creation of a physical theory must begin with a beautiful mathematical theory: “If this theory is really beautiful, then it necessarily will appear as a fine model of important physical phenomena. It is necessary to search for these phenomena to develop applications of the beautiful mathematical theory and to interpret them as predictions of new laws of physics” (Arnold, 2007). According to Dirac, all new physics, including relativistic and quantum physics, develops in this way.
Results of matrix genetics lead to the idea that the structure of the genetic coding system is dictated by patterns of the described numeric genomatrices; here one can remember the famous Pythagorean statement that “numbers rule the world”, with the refinement that we should now talk about multi-dimensional numbers.

Acknowledgments. The described research was carried out by the author within the framework of a long-term cooperation between the Russian and Hungarian Academies of Sciences. The author is grateful to Darvas, G., Stepanyan, I.V., Svirin, V.I. for their collaboration.

REFERENCES

Ahmed, N.U., Rao, K.R. (1975) Orthogonal Transforms for Digital Signal Processing. New York: Springer-Verlag, Inc.
Arnold, V. (2007) A complexity of the finite sequences of zeros and units and geometry of the finite functional spaces. Lecture at the session of the Moscow Mathematical Society, May 13, http://elementy.ru/lib/430178/430281.
Bellman, R. (1960) Introduction to Matrix Analysis. New York: McGraw-Hill Book Company, Inc., 351 pp.
Darvas, G. (2007) Symmetry. Basel: Birkhäuser.
Dreger, A. (1998) Hermaphrodites and the Medical Invention of Sex. Harvard University Press.
Fourier, J. (2006) The Analytical Theory of Heat. Cambridge: University Press.
Gal, J. (2011) Louis Pasteur, language, and molecular chirality. I. Background and dissymmetry, Chirality, 23, 1–16.
Hellige, J.B. (1993) Hemispheric Asymmetry: What's Right and What's Left. Cambridge, Massachusetts: Harvard University Press.
Kline, M. (1980) Mathematics. The Loss of Certainty. New York: Random House, 384 p.
Konopelchenko, B.G., Rumer, Yu.B. (1975a) Classification of the codons of the genetic code. I & II. Preprints 75-11 and 75-12 of the Institute of Nuclear Physics of the Siberian department of the USSR Academy of Sciences. Novosibirsk: Institute of Nuclear Physics.
Konopelchenko, B.G., Rumer, Yu.B. (1975b) Classification of the codons in the genetic code. Doklady Akademii Nauk SSSR, 223(2), 145-153 (in Russian).
Money, J. (1990) Androgyne becomes bisexual in sexological theory: Plato to Freud and neuroscience. The Journal of the American Academy of Psychoanalysis, 18(3), 392-413 (http://www.ncbi.nlm.nih.gov/pubmed/2258314).
Perez, J.-C. (2010) Codon populations in single-stranded whole human genome DNA are fractal and fine-tuned by the golden ratio 1.618. Interdiscip Sci Comput Life Sci, 2, 1–13.
Petoukhov, S.V. (2008a) Matrix genetics, algebras of the genetic code, noise-immunity. Moscow: Regular and Chaotic Dynamics, 316 p. (in Russian; summary in English is on http://www.geocities.com/symmetrion/Matrix_genetics/matrix_genetics.html).
Petoukhov, S.V. (2008b) The degeneracy of the genetic code and Hadamard matrices. http://arXiv:0802.3366, p. 1-26 (the first version is from February 22, 2008; the last revised is from December 26, 2010).
Petoukhov, S.V. (2008c) Matrix genetics, part 1: permutations of positions in triplets and symmetries of genetic matrices. http://arxiv.org/abs/0803.0888, version 6, p. 1-34.
Petoukhov, S.V. (2011) Matrix genetics and algebraic properties of the multi-level system of genetic alphabets. Neuroquantology, 9, No 4, 60-81, http://www.neuroquantology.com/index.php/journal/article/view/501
Petoukhov, S.V. (2012) The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts. http://arxiv.org/abs/1102.3596, p. 1-80.
Petoukhov, S.V., He, M. (2010) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and Applications. Hershey, USA: IGI Global, 271 p.
Rumer, Yu.B. (1968) Systematization of the codons of the genetic code. Doklady Akademii Nauk SSSR, 183(1), p. 225-226 (in Russian).
Stewart, I. (1999) Life's Other Secret: The New Mathematics of the Living World. New York: Wiley, 304 p.

Symmetry: Culture and Science Vol. 23, Nos. 3-4, 303-322, 2012

FRACTAL GENETIC NETS AND SYMMETRY PRINCIPLES IN LONG NUCLEOTIDE SEQUENCES

S.V.
Petoukhov*, V.I. Svirin**

* Biophysicist, bioinformatician (b. Moscow, Russia, 1946). Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of Russian Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail: [email protected]. Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony, mathematical crystallography (also history of sciences, oriental medicine). Awards: Gold medal of the Exhibition of Economic Achievements of the USSR, 1974; State Prize of the USSR, 1986; Honorary diplomas of a few international conferences and organizations, 2005-2012. Publications: Biomechanics, Bionics and Symmetry, Moscow, Nauka, (1981), 239 pp. (in Russian); Biosolitons. Fundamentals of Soliton Biology, Moscow, GPKT, (1999), 288 pp. (in Russian); Matrix Genetics, Algebras of the Genetic Code, Noise-immunity, Moscow, RCD, (2008), 316 pp. (in Russian); with M. He: Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and Applications, Hershey, USA: IGI Global, (2010), 271 pp.; with M. He: Mathematics of Bioinformatics: Theory, Practice, and Applications, USA: John Wiley & Sons, Inc., (2011), 295 pp.

** Biophysicist, bioinformatician (b. Nizhnekamsk, Russia, 1987). Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of Russian Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail: [email protected]. Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony, discrete mathematics, cybernetics, neural networks. Publications: Stepanyan I.V., Cygankov V.D., Svirin V.I. and Golovanyova G.V. (2012) Neurophysiological approaches to medical cybernetics based on the creative heritage of Academician P.K. Anokhin. In: Biozashchita i Biobezopastnost', IV, №1, 28-42 (in Russian).
Abstract: This article is devoted to hidden regularities of long nucleotide sequences. It contains a description and a thematic application of a new research tool that is termed «fractal genetic nets». The described results testify in favor of the existence of new Symmetry Principles of long nucleotide sequences in addition to the known Symmetry Principle based on the generalized Chargaff's second parity rule. Our results provide new materials for Chargaff's problem of a grammar of biology and for the idea of an algebraic essence of the genetic coding system.

Keywords: genetic code, Chargaff's rule, long nucleotide sequence, grammar, fractal, genetic nets.

1. ON A GRAMMAR OF BIOLOGY AND THE NOTION OF FRACTAL GENETIC NETS

The fantastic successes of molecular genetics were determined in particular by the disclosure of phenomenological facts of symmetry in the molecular constructions of the genetic code and by a skillful implementation of these facts in theoretical modeling. A bright example is the disclosure of a symmetrological fact reflected in the famous Chargaff's first parity rule, which says that in any double-stranded DNA segment, the quantities (or frequencies) of adenine and thymine are equal, and so are the frequencies of cytosine and guanine (Chargaff, 1950). This rule was used by Watson and Crick to support their famous DNA double-helix structure model (Watson & Crick, 1953). In his works, Chargaff pursued the goal of searching for a grammar of biology that defines hidden regularities of genetic texts to construct living cells with their “confounded multi-dimensionality”, etc. One of his works, “Preface to a Grammar of Biology” (Chargaff, 1971), which was devoted to a hundred years of nucleic acid research, reflects his thought that all achievements of molecular genetics are only the first steps towards discovering such a grammar.
Besides his first parity rule for double-stranded DNA, Chargaff also perceived that the parity rule approximately holds in a sufficiently long single-stranded DNA segment. This last rule is known as Chargaff's second parity rule (CSPR), and it has been confirmed in several organisms (Mitchell & Bridge, 2006). Originally, CSPR was meant to be valid only for mononucleotide frequencies (that is, quantities of monoplets) in single-stranded DNA. “But, it occurs that oligonucleotide frequencies follow a generalized Chargaff’s second parity rule (GCSPR) where the frequency of an oligonucleotide is approximately equal to its complement reverse oligonucleotide frequency (Prahbu, 1993). This is known in the literature as the Symmetry Principle” (Yamagishi, Herai, 2011, p. 2). The work of Prahbu (1993) shows the implementation of the Symmetry Principle in long DNA-sequences for cases of complementary reverse n-plets with n = 2, 3, 4, 5 at least. In scientific publications, long genetic sequences are those sequences that contain no fewer than 50,000 nucleotides (see for example (Yamagishi, Herai, 2011)). “As correctly pointed out by Forsdyke, higher order equifrequency does imply lower order, and he therefore conjectured that the original CSPR was actually a particular case of a higher order parity rule” (Yamagishi, Herai, 2011). This Symmetry Principle was studied or described in many other publications (Albrecht-Buehler, 2006; Chargaff, 1971, 1975; Dong, Cuticchia, 2001; Forsdyke, 2002; Forsdyke, Bell, 2004; Kong, et al. 2009; Mitchell, Bridge, 2006; Sueoka, 1999). Due to this Symmetry Principle, the work (Yamagishi, Herai, 2011) has uncovered new rules of long nucleotide sequences and emphasized a fractal-like property of such sequences across a large set of genomes “since no matter of scale, the same pattern is observed (self-similarity)”.
Our previous research in the field of “matrix genetics” (Petoukhov, 2008a-d, 2011, 2012a,b,c; Petoukhov, He, 2009; Petoukhov, Svirin, 2012) has led to the hypothesis that structures of long nucleotide sequences of different organisms are connected with so-called “fractal genetic nets” (FGN). In this work we propose a novel approach to discover new Symmetry Principles in such sequences. In the general case, each variant of FGN is constructed by means of the author's “method of a positional convolution (or positional splitting) of long genetic sequences” to get a cluster of long sequences, each of which is correspondingly shorter than the original sequence. In the particular case considered in our article, the method consists in the positional convolution (or splitting) of long sequences of triplets through the removal or retention of individual positions (items) in each triplet.

1.1. Methodology

Let us explain the construction of FGN of various types using the example of the FGN for sequences of triplets (Figure 1). In each triplet, its three positions are numbered 0, 1 and 2 correspondingly. At the first level of a convolution, an initial long sequence S0 of triplets is transformed by means of a positional convolution into three new sequences of nucleotides S1/0, S1/1, S1/2, each of which is 3 times shorter than the initial sequence (the numerator of the index in this notation of sequences shows the level of the convolution, and the denominator shows the position in the triplets that is used for the convolution): the sequence S1/0 includes one by one all the nucleotides that are in the initial position "0" of the triplets of the original sequence S0; the sequence S1/1 includes one by one all the nucleotides that are in the middle position "1" of the triplets of the original sequence S0; the sequence S1/2 includes one by one all the nucleotides that are in the last position "2" of the triplets of the original sequence S0.
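The first-level positional convolution just described can be sketched as follows. This is a minimal illustration, not the authors' program: the function names and the dictionary layout are hypothetical, and dropping an incomplete trailing n-plet is an assumption, since the text does not say how sequence lengths that are not multiples of n are handled. Applying the same step again to each resulting sequence gives the deeper levels.

```python
# Illustrative sketch; names and the handling of leftover letters are assumptions.

def positional_convolution(seq, n=3):
    """Split seq into n sequences: the k-th keeps position k of every n-plet."""
    usable = len(seq) - len(seq) % n      # assumption: ignore an incomplete tail
    return [seq[k:usable:n] for k in range(n)]


def fractal_genetic_net(seq, levels, n=3):
    """Return {positions: sequence}; '' is S0, '0' is S1/0, '01' is S2/01, etc."""
    net = {"": seq}
    frontier = {"": seq}
    for _ in range(levels):
        next_frontier = {}
        for key, parent in frontier.items():
            for k, child in enumerate(positional_convolution(parent, n)):
                next_frontier[key + str(k)] = child
        net.update(next_frontier)
        frontier = next_frontier
    return net


s0 = "ACGTTGCAA"                          # toy sequence: triplets ACG TTG CAA
print(positional_convolution(s0))         # ['ATC', 'CTA', 'GGA']

net = fractal_genetic_net(s0, levels=2)
print(len(net))                           # 13 sequences: S0, 3 at level 1, 9 at level 2
```

Keying the net by the string of chosen positions mirrors the index notation of the text, so the sequence written S2/01 is simply `net["01"]`.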
At the final stage of the first level of the positional convolution, each of the sequences of nucleotides S1/0, S1/1, S1/2 is represented as a sequence of triplets, where the three positions inside each triplet are numbered again 0, 1 and 2. To construct the second level of the convolution, each of the sequences S1/0, S1/1, S1/2 is transformed by means of the same positional convolution into three new sequences: S1/0 is convolved into S2/00, S2/01, S2/02; S1/1 into S2/10, S2/11, S2/12; S1/2 into S2/20, S2/21, S2/22. Similarly, the third level and subsequent levels of the convolution are constructed to form a multi-level net of sequences of nucleotides called "the fractal genetic net for the triplet convolution" or briefly "FGN-3" (Figure 1).

Figure 1: The scheme of the fractal genetic net (FGN-3) for a sequence of triplets

This FGN possesses a fractal-like character if only the enumeration of positions is taken into account: each of the long sequences of this FGN can be taken as an initial sequence to form a similar genetic net on its basis (Figure 1). In the general case, the FGN can be built not only for triplets, but also for other n-plets (n = 2, 4, 5, ...) or oligonucleotides by means of a repeated positional convolution of each of the sequences from the previous level into "n" sequences of the next level of the convolution. In this way one can build FGN-2, FGN-3, FGN-4, FGN-5, etc. for n = 2, 3, 4, 5, … correspondingly. (Each of these FGN-2, FGN-3, FGN-4, FGN-5, etc. is a tree, but all of them form a net of separate trees; in a wide sense, FGN is the complete set of such separate trees.) This article, however, concentrates only on the results related to the FGN-3.

2. ON SYMMETRY PRINCIPLES IN LONG NUCLEOTIDE SEQUENCES AND THE FGN-3

To test the author's hypothesis that structures of long nucleotide sequences of different organisms are connected with fractal genetic nets (first of all with FGN-3), we analyze the implementation of the known Symmetry Principle for long nucleotide sequences at different levels of a positional convolution in the fractal genetic net for the triplet convolution (FGN-3). In our article we use two different notions of complementary oligonucleotides (or n-plets): 1) complementary oligonucleotides in the traditional sense (for example, ACGTG and TGCAC are a pair of complementary oligonucleotides in the traditional sense); 2) complementary reverse oligonucleotides (Prahbu, 1993), briefly called CR-oligonucleotides or reverse complements (for example, ACGTG and CACGT are a pair of CR-oligonucleotides). The mentioned Symmetry Principle has been revealed for pairs of CR-oligonucleotides. Taking this into account, we began testing the author's hypothesis by analyzing the frequencies (or quantities) of all variants of pairs of CR-oligonucleotides in long DNA-sequences of different organisms at different levels of their FGN-3. We test frequencies of n-plets in the FGN-3 with n = 1, 2, 3, 4, 5 only, because of our computer limitations, but we assume that our described results for FGN-3 hold true also for n > 5. Initial nucleotide sequences for testing are taken from (NCBI, 2012a). To test the proposed hypothesis, we use special software written by V.I. Svirin in the Python programming language.
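The two notions of complementarity and the counting of CR-pairs can be illustrated with a short sketch. It is not the software mentioned in the text: all function names are hypothetical, and counting non-overlapping n-plets read from the first position is an assumption made here, since the text does not state the counting scheme.

```python
# Illustrative sketch; not the authors' software. Names are hypothetical.
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(oligo):
    """Complement in the traditional sense: A<->T, C<->G, same direction."""
    return oligo.translate(COMPLEMENT)


def reverse_complement(oligo):
    """CR-oligonucleotide: the complement read in reverse order."""
    return complement(oligo)[::-1]


def nplet_counts(seq, n):
    """Count non-overlapping n-plets read from position 0 (assumption)."""
    usable = len(seq) - len(seq) % n
    return Counter(seq[i:i + n] for i in range(0, usable, n))


def cr_pair_counts(counts):
    """Map each pair (oligo, CR-oligo), listed once, to its two counts."""
    pairs = {}
    seen = set(counts) | {reverse_complement(o) for o in counts}
    for oligo in sorted(seen):
        cr = reverse_complement(oligo)
        if oligo <= cr:                   # keep one ordering of each pair
            pairs[(oligo, cr)] = (counts[oligo], counts[cr])
    return pairs


print(complement("ACGTG"))                # TGCAC
print(reverse_complement("ACGTG"))        # CACGT
print(cr_pair_counts(nplet_counts("AAATTTAAA", 3)))
# {('AAA', 'TTT'): (2, 1)}
```

For a real test one would pass each sequence of the net to `nplet_counts` and compare the two numbers in every pair; for triplets there are 32 such pairs, because no odd-length oligonucleotide is its own reverse complement.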
In our preliminary studies we have revealed the following:
1) the Symmetry Principle for pairs of CR-oligonucleotides is realized in each of the long nucleotide sequences at different levels of the convolution in FGN-3 (for oligonucleotides, or n-plets, of length n = 1, 2, 3, 4, 5 at least);
2) a series of new Symmetry Principles exists in those initial long nucleotide sequences where the well-known Symmetry Principle for pairs of CR-oligonucleotides holds;
3) each of these new Symmetry Principles holds for n-plets in each of the long nucleotide sequences at different levels of the convolution in FGN-3 (n = 1, 2, 3, 4, 5 at least).

Let us take, for example, the long nucleotide sequence of the Mycoplasma crocodyli MP145 chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). This sequence contains 934379 nucleotides. Figure 2 shows realizations of the mentioned Symmetry Principle (we will call it the Symmetry Principle № 1) in the 13 sequences at the first three levels of convolution in the FGN-3 of this sequence. It displays the number of occurrences of 32 triplets (AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC, AGG, ATA, ATC, ATG, CAA, CAC, CAG, CCA, CCC, CCG, CGA, CGC, CTA, CTC, GAA, GAC, GCA, GCC, GGA, GTA, TAA, TCA) and their 32 CR-triplets (TTT, GTT, CTT, ATT, TGT, GGT, CGT, AGT, TCT, GCT, CCT, TAT, GAT, CAT, TTG, GTG, CTG, TGG, GGG, CGG, TCG, GCG, TAG, GAG, TTC, GTC, TGC, GGC, TCC, TAC, TTA, TGA) in the long sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21, S2/22 at the first three levels of the FGN-3 (the limited volume of the article does not allow other levels of this FGN to be shown). The straight line in each frame has slope 1 (it is the bisector of the coordinate angle). Each dot in a frame represents one pair "triplet and CR-triplet"; its X coordinate shows the number of occurrences (the frequency) of the triplet, and its Y coordinate shows the frequency of its CR-triplet on the same strand of the sequence. Each frame contains all 32 pairs "triplet and its CR-triplet". The dots cluster along the line of slope 1, demonstrating that the numbers of occurrences (frequencies) of the two members of each of the 32 pairs "triplet and its CR-triplet" are approximately equal in each of the sequences at each level of convolution in the FGN-3. This means that the Symmetry Principle № 1 holds for each of these sequences.

Figure 2: Realizations of the Symmetry Principle № 1 in the long sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21, S2/22 at the first three levels of the FGN-3 for Mycoplasma crocodyli MP145 chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). The initial sequence S0 contains 934379 nucleotides.

These results show the effectiveness of the proposed fractal genetic nets as a research tool, and they also testify in favor of the existence of the generalized Symmetry Principle № 1: in long nucleotide sequences at different levels of convolution in FGN-3, oligonucleotide frequencies follow a generalized Chargaff's second parity rule, where the frequency of each oligonucleotide is approximately equal to the frequency of its complementary reverse oligonucleotide.

Now let us present our research results that testify in favor of the existence of new Symmetry Principles of long nucleotide sequences. Below, we formulate these new Symmetry Principles directly and then provide some data confirming their existence.

The Symmetry Principle № 2 (concerning FGN): the frequency of each oligonucleotide is approximately the same in all the long nucleotide sequences of each level of FGN-3.
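The comparison behind the Symmetry Principle № 1 (the count of each n-plet against the count of its CR-n-plet on the same strand) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' program: the function names are hypothetical, and n-plets are counted without overlap starting from the first position, which is our reading of the counting procedure. The toy sequence is built as a fragment followed by its own reverse complement, so the two counts coincide exactly; for a real genome the equality is only approximate.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo: str) -> str:
    """Complementary reverse (CR) oligonucleotide."""
    return oligo.translate(_COMPLEMENT)[::-1]


def count_nplets(seq: str, n: int = 3) -> Counter:
    """Count non-overlapping n-plets read from the first position
    (assumed counting scheme); an incomplete tail is ignored."""
    usable = len(seq) - len(seq) % n
    return Counter(seq[i:i + n] for i in range(0, usable, n))


def cr_pairs(counts: Counter) -> list[tuple[str, str, int, int]]:
    """One row per pair: (n-plet, CR-n-plet, count, CR count)."""
    rows, seen = [], set()
    candidates = set(counts) | {reverse_complement(o) for o in counts}
    for oligo in sorted(candidates):
        if oligo in seen:
            continue
        cr = reverse_complement(oligo)
        seen.update((oligo, cr))
        rows.append((oligo, cr, counts[oligo], counts[cr]))
    return rows


if __name__ == "__main__":
    half = "ACGTTGCAA"
    toy = half + reverse_complement(half)  # symmetric by construction
    for oligo, cr, x, y in cr_pairs(count_nplets(toy, 3)):
        print(oligo, cr, x, y)  # x and y are the coordinates of one dot
```

Each printed row corresponds to one dot of a frame in Figure 2: the third value is the X coordinate and the fourth is the Y coordinate.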
Figure 3: Frequencies of the triplet ACG in 40 long nucleotide sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21, S2/22, ..., S3/221, S3/222 at the first four levels of the FGN-3 of Mycoplasma crocodyli MP145 chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). Coordinate X shows the 40 sequences and coordinate Y shows the corresponding frequencies of the triplet ACG in them.

Figure 3 demonstrates an example of the frequencies of the triplet ACG in 40 long nucleotide sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21, S2/22, ..., S3/221, S3/222 at the first four levels of the FGN-3 of the same initial sequence as in Figure 2. Figure 4 shows the frequencies of all 64 triplets in 12 long nucleotide sequences at the first three levels of FGN-3 of the same initial sequence as in Figure 2.

Triplet     S0   S1/0   S1/1   S1/2  S2/00  S2/01  S2/02  S2/10  S2/11  S2/12  S2/20  S2/21
AAA      19832   5786   5679   5768   1975   1944   1944   1986   1899   1952   1954   1935
AAC       6246   1709   1707   1643    550    560    567    587    543    557    504    531
AAG       7087   1859   1783   1940    607    619    651    607    630    611    615    679
AAT      15037   5320   5352   5428   1784   1685   1769   1770   1758   1743   1757   1775
ACA       5049   1527   1564   1635    542    513    546    492    566    521    537    517
ACC       2363    747    755    747    233    253    266    241    273    253    231    203
ACG       1029    660    663    684    214    205    197    175    210    208    228    203
ACT       4714   1713   1745   1702    526    548    553    536    568    544    552    537
AGA       5272   1784   1688   1737    657    635    640    631    600    598    691    644
AGC       2754    590    623    586    226    216    200    220    199    181    198    223
AGG       2150    973    880    912    293    294    322    292    259    287    314    288
AGT       4713   1700   1704   1820    530    595    543    513    492    523    562    549
ATA      11126   4952   5051   4886   1688   1629   1637   1671   1655   1659   1706   1635
ATC       5250   1619   1570   1568    507    537    511    518    527    583    531    522
ATG       5499   1450   1488   1505    494    547    582    510    529    538    551    514
ATT      15079   5397   5390   5419   1797   1802   1835   1816   1832   1799   1857   1745
CAA       7427   1620   1615   1661    526    538    568    565    502    508    540    561
CAC       1872    700    746    733    225    236    201    255    262    243    222    238
CAG       2105    553    653    605    212    206    211    205    203    221    197    203
CAT       5375   1473   1497   1388    547    546    519    548    497    539    543    507
CCA       2750    727    712    664    235    248    252    267    252    246    218    255
CCC        569    522    622    513    181    166    170    188    198    183    164    196
CCG        681    336    390    326     81    113     84    105    109    118     94    109
CCT       2181    887    945    885    292    277    293    323    302    307    281    292
CGA       1171    635    607    600    189    208    178    196    207    194    204    208
CGC        508    321    341    319    106    113     89    104    101    107    109    116
CGG        693    402    365    366    124     93    102    121    137    105     85     97
CGT        989    671    663    684    214    175    194    199    189    215    164    204
CTA       4326   1664   1659   1562    545    515    520    526    532    542    543    524
CTC       1786    832    893    841    310    306    295    289    306    316    283    312
CTG       2115    577    647    636    215    236    206    207    172    211    196    173
CTT       6917   1950   1913   1785    635    646    641    684    653    577    617    614
GAA       7190   1823   1801   1812    651    689    675    640    610    655    602    654
GAC       1404    585    598    611    208    204    178    195    223    208    189    201
GAG       1820    932    833    930    289    289    274    284    275    295    291    291
GAT       5225   1664   1563   1555    555    530    523    488    507    534    577    557
GCA       2974    572    580    631    228    195    197    241    210    196    196    198
GCC        710    276    353    288     97     91    110    102     92    118    106     99
GCG        497    377    321    361    113     97    121    109    108    109    103    104
GCT       2973    622    582    585    227    233    218    181    227    192    207    178
GGA       2330    958    875    888    283    302    296    297    272    316    308    318
GGC        676    286    275    283    115    110    100    112    112    108     93    104
GGG        616    551    546    555    185    200    193    187    170    200    215    174
GGT       2446    733    755    728    246    240    281    232    249    246    229    243
GTA       3636   1680   1606   1709    516    537    532    518    544    501    513    546
GTC       1374    601    577    578    199    198    218    194    217    226    202    203
GTG       2083    711    720    777    265    257    238    219    226    237    244    239
GTT       6587   1694   1736   1770    575    541    553    560    559    530    551    577
TAA      13334   5401   5418   5430   1732   1708   1678   1693   1760   1717   1762   1723
TAC       3369   1625   1685   1624    520    528    509    571    571    535    492    554
TAG       4368   1685   1596   1641    498    538    556    475    505    561    587    544
TAT      11019   4895   4950   4973   1635   1730   1667   1648   1672   1649   1711   1695
TCA       6993   1542   1679   1586    549    514    546    542    559    575    573    542
TCC       2302    872    904    854    291    276    293    340    353    277    295    316
TCG       1240    681    662    669    183    201    204    193    204    206    220    212
TCT       5710   1794   1866   1798    639    645    600    646    634    658    647    669
TGA       6550   1642   1533   1602    577    570    560    530    557    525    568    556
TGC       2790    616    610    611    190    233    201    219    219    213    205    191
TGG       2658    720    689    796    222    285    245    246    241    240    239    293
TGT       5123   1727   1583   1619    585    568    547    565    592    546    531    536
TTA      13519   5470   5525   5585   1755   1753   1760   1765   1793   1802   1733   1777
TTC       7131   1839   1864   1809    644    631    659    669    660    672    632    680
TTG       7918   1700   1745   1689    576    579    583    563    562    559    541    553
TTT      20219   5886   5876   5921   1997   1929   2004   2034   1960   2010   1995   1969

Figure 4: The table of frequencies of 64 triplets in the long nucleotide sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21 at the first three levels of FGN-3 of Mycoplasma crocodyli MP145 chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)).

The Symmetry Principle № 3: for each of the long nucleotide sequences at each level of FGN-3 the following rules hold true: the sum of the frequencies of all the oligonucleotides that begin with the letter A is approximately equal to the sum of the frequencies of all the oligonucleotides that begin with the letter T; the sum of the frequencies of all the oligonucleotides that begin with the letter C is approximately equal to the sum of the frequencies of all the oligonucleotides that begin with the letter G. In particular, these rules hold not only for the long sequences at the lower levels of FGN-3 but also for the initial long sequence S0.

Figure 5 illustrates the Symmetry Principle № 3 using examples of n-plets (n = 2, 3, 4, 5) in the sequences S0, S1/0, …, S2/12 at the first levels of the FGN-3 of the same sequence as in Figure 2.
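The sums F(A), F(T), F(C) and F(G) that enter the Symmetry Principle № 3 can be sketched in Python as follows. The function names and the non-overlapping counting scheme are assumptions made for illustration, not the authors' implementation. The second part of the sketch compares the S0 triplet totals reported in Figure 5, taking the larger value of each pair as the denominator, which is also an assumption.

```python
from collections import Counter


def count_nplets(seq: str, n: int) -> Counter:
    """Count non-overlapping n-plets read from the first position
    (assumed counting scheme); an incomplete tail is ignored."""
    usable = len(seq) - len(seq) % n
    return Counter(seq[i:i + n] for i in range(0, usable, n))


def first_letter_totals(counts: Counter) -> dict[str, int]:
    """Sum the counts of all n-plets that begin with A, T, C or G."""
    totals = dict.fromkeys("ATCG", 0)
    for oligo, count in counts.items():
        if oligo[0] in totals:
            totals[oligo[0]] += count
    return totals


def percent_difference(a: int, b: int) -> float:
    """Difference in percent, relative to the larger value (assumption)."""
    return 100.0 * abs(a - b) / max(a, b)


if __name__ == "__main__":
    # Toy sequence: triplets AAA, TTT, CCC, GGG, ATG.
    print(first_letter_totals(count_nplets("AAATTTCCCGGGATG", 3)))

    # S0 triplet totals as reported in Figure 5 of the article.
    reported = {"A": 113200, "T": 114243, "C": 41465, "G": 42541}
    print(percent_difference(reported["A"], reported["T"]))
    print(percent_difference(reported["C"], reported["G"]))
```

Applying first_letter_totals to the n-plet counts of each sequence of the net gives one row of a table of the Figure 5 kind.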
The total frequencies of the sets of duplets in sequences of FGN-3:

           F(A)    F(T)    F(C)    F(G)
S0       169757  171420   62531   63471
S1/0      56674   57022   20763   21265
S1/1      56197   57247   21486   20794
S1/2      56747   57179   20600   21198
S2/00     19065   18870    6929    7044
S2/01     18857   18937    6903    7211
S2/02     18836   19067    6882    7123
S2/10     18764   19102    7113    6929
S2/11     18756   19155    7148    6849
S2/12     18713   19040    7178    6977

The total frequencies of the sets of triplets in sequences of FGN-3:

           F(A)    F(T)    F(C)    F(G)
S0       113200  114243   41465   42541
S1/0      37786   38095   13870   14065
S1/1      37642   38185   14268   13721
S1/2      37980   38207   13568   14061
S2/00     12623   12593    4637    4752
S2/01     12582   12688    4622    4713
S2/02     12763   12612    4523    4707
S2/10     12565   12699    4782    4559
S2/11     12540   12842    4622    4601
S2/12     12557   12745    4632    4671

The total frequencies of the sets of 4-plets in sequences of FGN-3:

           F(A)    F(T)    F(C)    F(G)
S0        84955   85573   31189   31867
S1/0      28475   28439   10286   10662
S1/1      27999   28601   10758   10504
S1/2      28493   28639   10270   10460
S2/00      9522    9531    3409    3492
S2/01      9391    9532    3438    3593
S2/02      9274    9624    3499    3557
S2/10      9342    9552    3573    3487
S2/11      9417    9570    3578    3389
S2/12      9434    9538    3594    3388

The total frequencies of the sets of 5-plets in sequences of FGN-3:

           F(A)    F(T)    F(C)    F(G)
S0        67729   68688   25144   25304
S1/0      22626   22951    8242    8470
S1/1      22512   22918    8503    8356
S1/2      22788   22764    8217    8520
S2/00      7551    7620    2780    2812
S2/01      7506    7613    2762    2882
S2/02      7542    7684    2701    2836
S2/10      7491    7626    2851    2795
S2/11      7417    7668    2907    2771
S2/12      7431    7677    2849    2806

Figure 5: The illustration of the Symmetry Principle № 3 in the case of Mycoplasma crocodyli MP145 chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). Here F(A), F(T), F(C) and F(G) denote the sums of the frequencies of the oligonucleotides (or n-plets) that begin with the letters A, T, C or G correspondingly. The tables show that F(A) ≈ F(T) and F(C) ≈ F(G) for sets of n-plets (n = 2, 3, 4, 5) in each of the long nucleotide sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12 at the first three levels of the FGN-3 of this sequence.

This result was obtained in connection with studies related to the genetic matrices [C T; A G](n) (see (Petoukhov, 2012c) in this issue). Each of the 4 quadrants of such genomatrices contains all the oligonucleotides that begin with one of the 4 letters C, T, A or G. In these genomatrices, each oligonucleotide and its complementary oligonucleotide are located inverse-symmetrically relative to the center of the matrix. In accordance with the Symmetry Principle № 3, the total frequencies of oligonucleotides in the two quadrants along the main diagonal of these genomatrices are approximately equal to each other (F(C) ≈ F(G)); the total frequencies of oligonucleotides in the two quadrants along the second diagonal are also approximately equal to each other (F(A) ≈ F(T)).

An additional illustration of the Symmetry Principle № 3 is obtained from the initial data on the frequencies of separate triplets in the whole human genome from the work (Perez, 2010). This genome contains 2843411612 triplets. Figure 6 shows the total frequencies FC, FG, FA and FT of the sets of triplets that begin with one of the four letters C, G, A or T. The percentage difference between the total frequencies FC and FG is equal to 0.05%, and between FA and FT it is equal to 0.16%. However, we do not have data on the quantities of separate triplets in the convolved sequences S1/0, S1/1, … at the lower levels of the FGN-3 for this genome, because we have no information about the order of triplets in its huge sequence S0.
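The two percentages quoted above can be re-derived from the four totals that the article reports in Figure 6. The short Python check below is only a worked illustration. The formula (absolute difference divided by the larger total) is an assumption, since the article does not state how the percentage difference was computed.

```python
# Totals of triplets by first letter, as reported in Figure 6 of the article.
FC = 581026275
FG = 581343106
FA = 839827642
FT = 841214589


def percent_difference(a: int, b: int) -> float:
    """Absolute difference in percent of the larger value (assumed formula)."""
    return 100.0 * abs(a - b) / max(a, b)


if __name__ == "__main__":
    print("total triplets:", FC + FG + FA + FT)
    print("FC vs FG: %.2f%%" % percent_difference(FC, FG))
    print("FA vs FT: %.2f%%" % percent_difference(FA, FT))
```

The four totals add up to the 2843411612 triplets stated for the genome, and the two differences round to 0.05% and 0.16%.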
The total frequencies of the sets of triplets which begin with C:
FC = F(CCC+CCT+CCA+CCG+CTC+CTT+CTA+CTG+CAC+CAT+CAA+CAG+CGC+CGT+CGA+CGG) = 581026275
The total frequencies of the sets of triplets which begin with G:
FG = F(GGG+GGA+GGT+GGC+GAG+GAA+GAT+GAC+GTG+GTA+GTT+GTC+GCG+GCA+GCT+GCC) = 581343106

The total frequencies of the sets of triplets which begin with A:
FA = F(ACC+ACT+ACA+ACG+ATC+ATT+ATA+ATG+AAC+AAT+AAA+AAG+AGC+AGT+AGA+AGG) = 839827642
The total frequencies of the sets of triplets which begin with T:
FT = F(TGG+TGA+TGT+TGC+TAG+TAA+TAT+TAC+TTG+TTA+TTT+TTC+TCG+TCA+TCT+TCC) = 841214589

Figure 6: The approximate equality of the total frequencies of the sets of triplets that begin with the letters C and G (upper table) and with the letters A and T (bottom table) in the case of the sequence S0 of the whole human genome. Initial data on the frequencies of separate triplets are taken from the work (Perez, 2010).

Now let us introduce the Symmetry Principle № 4, which deals with reading frame shifts, deletion mutations, and also positional permutations in oligonucleotides.
For those DNA sequences (including the sequence of Figures 2-5) that have been tested to date in our laboratory, we have discovered the following phenomenological facts (this study is being continued for a wide list of DNA sequences of different organisms and organelles):
- a transformation of long nucleotide sequences by a reading frame shift preserves the realization of all the described Symmetry Principles in the new long nucleotide sequences (in our tests, a reading frame shift means that the reading of the sequence begins not with its first position but with one of the subsequent positions; the skipped fragment of the sequence can be moved to the end of the sequence, and in this case a reading frame shift leads to a simple change in the order of all sequences at each of the lower levels of the FGN);
- a transformation of long nucleotide sequences by a deletion mutation (when short parts of them are missing) preserves the realization of all the described Symmetry Principles in the new long nucleotide sequences.

One should separately consider the question of positional permutations in oligonucleotides. The theory of noise-immunity coding pays special attention to permutations of the elements of transmitted signals. Obviously, the number of possible permutations of positions depends on the length of the n-plet (it is equal to n!): for duplets two variants of positional permutations exist (1-2 and 2-1); for triplets six variants exist (1-2-3, 2-3-1, 3-1-2, 3-2-1, 2-1-3, 1-3-2); for 4-plets 24 variants exist (1-2-3-4, 2-3-4-1, …); for 5-plets 120 variants exist (1-2-3-4-5, 2-3-4-5-1, …).
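The transformations discussed here (a cyclic reading frame shift and a positional permutation applied inside every n-plet of a sequence) can be sketched in Python as follows. The function names are hypothetical and the handling of an incomplete trailing n-plet (it is left unchanged) is our assumption. The demonstration uses the short triplet sequence CGA-TAA-AGC-GTC-TAG-CGC-ATC and the order 2-3-1 taken from the article's own example.

```python
from itertools import permutations


def frame_shift(seq: str, k: int) -> str:
    """Start reading at position k and move the skipped fragment to the end."""
    return seq[k:] + seq[:k]


def permute_nplets(seq: str, order: tuple[int, ...]) -> str:
    """Apply the same positional permutation inside every n-plet of seq.

    order uses 1-based positions as in the article, e.g. (2, 3, 1).
    An incomplete trailing n-plet is kept unchanged (assumption).
    """
    n = len(order)
    usable = len(seq) - len(seq) % n
    parts = []
    for i in range(0, usable, n):
        nplet = seq[i:i + n]
        parts.append("".join(nplet[p - 1] for p in order))
    return "".join(parts) + seq[usable:]


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, "positions:", len(list(permutations(range(1, n + 1)))), "orders")
    print(permute_nplets("CGATAAAGCGTCTAGCGCATC", (2, 3, 1)))
    print(frame_shift("CGATAAAGCGTCTAGCGCATC", 1))
```

The transformed sequence can then be passed through the same positional convolution and counting steps as the original one to compare the symmetry tests before and after the transformation.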
It is also evident that if a long nucleotide sequence is interpreted as a sequence of a certain type of oligonucleotides (duplets, or triplets, or 4-plets, or 5-plets, …), and one of the possible positional permutations is made simultaneously inside all of its oligonucleotides, then a quite new long nucleotide sequence appears (we call such simultaneous positional permutations inside all oligonucleotides of a certain type "collective positional permutations" inside these oligonucleotides). For example, if we initially have a sequence of triplets CGA-TAA-AGC-GTC-TAG-CGC-ATC-…, then after changing the positional order from the initial order 1-2-3 to the new order 2-3-1 inside each triplet, we obtain the quite different sequence GAC-AAT-GCA-TCG-AGT-GCC-TCA-… . But our studies of a wide set of long nucleotide sequences (including the sequence of Figures 2-5) demonstrated that the FGN-3 of such a new long nucleotide sequence obeys the same Symmetry Principles №№ 1-3 described above. These results attest to the possible existence of the Symmetry Principle № 4, which can be briefly formulated as follows.

The Symmetry Principle № 4: reading frame shifts and deletion mutations in long nucleotide sequences, and also collective positional permutations inside their oligonucleotides, do not essentially violate the realization of all the Symmetry Principles for long nucleotide sequences and their fractal genetic net (FGN-3).

The final part of this article illustrates an additional application of the FGN-3 approach to the study of hidden regularities of long nucleotide sequences from the point of view of the black-and-white mosaic of the genetic matrix [C T; A G](3) from the article (Petoukhov, 2012c, Figure 20). Figure 7 shows this matrix, which reflects phenomenological properties of the genetic coding system and which contains the complete set of 64 triplets in a strict order.
The mosaic of this matrix is identical to the mosaic of one of the Hadamard (8*8)-matrices that are widely used in noise-immunity coding (for example, codes based on Hadamard matrices were used on the spacecraft "Mariner" and "Voyager", which made it possible to obtain high-quality photos of Mars, Jupiter, Saturn, Uranus and Neptune in spite of the distortion and weakening of the incoming signals; Hadamard matrices also underlie the Hadamard gates used in quantum computing, etc.). In addition, this Hadamard representation of the genetic matrix [C T; A G](3) is the biquaternion by Hamilton with unit coordinates (see details in the article (Petoukhov, 2012c)). A possible connection between the black-and-white mosaic of this genetic matrix and hidden regularities of long nucleotide sequences is of special interest. Below we present our initial results of studying this connection.

CCC  CCT  CTC  CTT  TCC  TCT  TTC  TTT
CCA  CCG  CTA  CTG  TCA  TCG  TTA  TTG
CAC  CAT  CGC  CGT  TAC  TAT  TGC  TGT
CAA  CAG  CGA  CGG  TAA  TAG  TGA  TGG
ACC  ACT  ATC  ATT  GCC  GCT  GTC  GTT
ACA  ACG  ATA  ATG  GCA  GCG  GTA  GTG
AAC  AAT  AGC  AGT  GAC  GAT  GGC  GGT
AAA  AAG  AGA  AGG  GAA  GAG  GGA  GGG

Figure 7: The genetic matrix [C T; A G](3) of 64 triplets with its black-and-white mosaic, which reflects phenomenological properties of the genetic coding system (from the work (Petoukhov, 2012c)).

The matrix [C T; A G](3) in Figure 7 contains two subsets with 28 kinds of white triplets and 36 kinds of black triplets. The authors calculate the total quantities (frequencies FWHITE and FBLACK) of the members of these two subsets in long nucleotide sequences. For example, we calculated the total frequencies for the whole human genome, which contains 2843411612 (about three billion) triplets. The initial data on this genome, taken from the article (Perez, 2010), are shown in Figure 8. The frequencies of different triplets in this genome are very different. For example, the frequency of the triplet CGA is equal to 6251611 and the frequency of the triplet TTT is equal to 109591342; they differ by a factor of about 18. But our calculation shows that in this genome the percentage difference between FWHITE and FBLACK is approximately equal to 0.1%, because the total quantity FWHITE of white triplets is equal to 1422456641 and the total quantity FBLACK of black triplets is equal to 1420954971.

TRIPLET  FREQUENCY    TRIPLET  FREQUENCY    TRIPLET  FREQUENCY    TRIPLET  FREQUENCY
AAA      109143641    CAA       53776608    GAA       56018645    TAA       59167883
AAC       41380831    CAC       42634617    GAC       26820898    TAC       32272009
AAG       56701727    CAG       57544367    GAG       47821818    TAG       36718434
AAT       70880610    CAT       52236743    GAT       37990593    TAT       58718182
ACA       57234565    CCA       52352507    GCA       40907730    TCA       55697529
ACC       33024323    CCC       37290873    GCC       33788267    TCC       43850042
ACG        7117535    CCG        7815619    GCG        6744112    TCG        6265386
ACT       45731927    CCT       50494519    GCT       39746348    TCT       62964984
AGA       62837294    CGA        6251611    GGA       43853584    TGA       55709222
AGC       39724813    CGC        6737724    GGC       33774033    TGC       40949883
AGG       50430220    CGG        7815677    GGG       37333942    TGG       52453369
AGT       45794017    CGT        7137644    GGT       33071650    TGT       57468177
ATA       58649060    CTA       36671812    GTA       32292235    TTA       59263408
ATC       37952376    CTC       47838959    GTC       26866216    TTC       56120623
ATG       52222957    CTG       57598215    GTG       42755364    TTG       54004116
ATT       71001746    CTT       56828780    GTT       41557671    TTT      109591342

Figure 8: Quantities of repetitions of each triplet in the whole human genome (from (Perez, 2010)).

Similar results on the approximate equality of FWHITE and FBLACK were obtained for all 811 long fragments of the human genome studied in the student's thesis of one of the authors, V. Svirin, who pioneered this comparative analysis of FWHITE and FBLACK in long nucleotide sequences from the point of view of the phenomenological genomatrix shown in Figure 7.
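The totals FWHITE and FBLACK are plain sums of triplet counts over two complementary subsets of the 64 triplets. The Python sketch below shows the calculation. The function name is hypothetical and the white subset must be supplied by the reader from the mosaic of Figure 7, because it is not encoded here. The one-element set in the toy call is a placeholder, not the article's mosaic. The last lines check the percentage difference using the two totals reported in the text, with the larger total as the denominator (an assumed formula).

```python
def mosaic_shares(counts: dict[str, int], white: set[str]) -> tuple[float, float]:
    """Return (FWHITE %, FBLACK %) for the given triplet counts.

    white is the set of triplets treated as white; every other triplet
    present in counts is treated as black.
    """
    f_white = sum(c for t, c in counts.items() if t in white)
    f_black = sum(c for t, c in counts.items() if t not in white)
    total = f_white + f_black
    return 100.0 * f_white / total, 100.0 * f_black / total


# Totals reported in the article for the whole human genome.
REPORTED_WHITE = 1422456641
REPORTED_BLACK = 1420954971

if __name__ == "__main__":
    # Toy call with a placeholder white set (NOT the mosaic of Figure 7).
    print(mosaic_shares({"AAA": 3, "CCC": 1}, white={"AAA"}))

    difference = 100.0 * (REPORTED_WHITE - REPORTED_BLACK) / REPORTED_WHITE
    print(REPORTED_WHITE + REPORTED_BLACK, "%.1f%%" % difference)
```

The two reported totals add up to the 2843411612 triplets of the genome, and their difference is about 0.1% of the larger total.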
What can be concluded from applying the FGN-3 method to the study of the total quantities FWHITE and FBLACK in long nucleotide sequences? Figure 9 shows typical results of the comparative analysis of FWHITE and FBLACK in the sequences at different levels of the FGN-3 for the same initial sequence S0 of the Mycoplasma crocodyli MP145 chromosome.

Sequence  FWHITE %  FBLACK %    Sequence  FWHITE %  FBLACK %    Sequence  FWHITE %  FBLACK %    Sequence  FWHITE %  FBLACK %
S0           52        48       S2/20        49        51       S3/021       49        51       S3/122       49        51
S1/0         49        51       S2/21        49        51       S3/022       50        50       S3/200       50        50
S1/1         49        51       S2/22        49        51       S3/100       50        50       S3/201       49        51
S1/2         49        51       S3/000       50        50       S3/101       50        50       S3/202       50        50
S2/00        49        51       S3/001       49        51       S3/102       50        50       S3/210       50        50
S2/01        49        51       S3/002       50        50       S3/110       49        51       S3/211       49        51
S2/02        50        50       S3/010       49        51       S3/111       50        50       S3/212       49        51
S2/10        50        50       S3/011       50        50       S3/112       49        51       S3/220       49        51
S2/11        49        51       S3/012       50        50       S3/120       50        50       S3/221       50        50
S2/12        49        51       S3/020       49        51       S3/121       50        50       S3/222       49        51

Figure 9: Percentages of the frequencies FWHITE and FBLACK of white and black triplets (from Figure 7) in the long sequences S0, S1/0, ..., S3/222 at the first four levels of the FGN-3 for the Mycoplasma crocodyli MP145 chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). The sequence S0 contains 934379 nucleotides.

From Figure 9, one can observe the approximate equality of the total quantities of white and black triplets in all these sequences S0, S1/0, ..., S3/222.

It appears that the described FGN-3 and the fractal-like properties of long genetic sequences that are related to the invariance of these Symmetry Principles have a biological value (a biological sense) associated with mutational changes of such sequences and with the evolutionary creation of new types of DNA sequences. The authors presume that the mechanisms of biological evolution use these permutational and other described properties of long nucleotide sequences in producing new biological organisms and organelles. For instance, new DNA sequences could be constructed in the course of the biological evolution of organisms by means of combinatorics of nucleotide sequences from different levels of the FGN (including genetic crossing among long nucleotide sequences from different levels of the FGN, by analogy with well-known examples of genetic crossing). One should note here that the question of the permutation properties of DNA sequences is very important, because some biological organisms differ from each other only by permutations in their DNA sequences (see, for example, the book (Pevzner, 2000)). We regard the proposed FGN method as a new and useful approach in the fields of bioinformatics, molecular genetics and evolutionary biology. It generates new data in the field of symmetrology (Darvas, 2007; Cristea, 2005, etc.).

In addition, one can mention here fractal images in genetic systems. A number of publications are devoted to fractal features of genetic texts (Gusev et al., 2009; Jeffrey, 1990; Pellionisz et al., 2012a; Petoukhov, 2008b; Petoukhov, He, 2009; Yam, 1995, etc.). Interesting data on fractal approaches in genetics, including materials about an important connection of fractal defects with cancer, are presented at the website (Pellionisz, 2012b). Research in this direction continues all over the world. In this article, the authors propose Fractal Genetic Nets (FGN) as a new tool for studying fractal-like properties of long DNA sequences, a tool that also reveals new fractal-like properties of such nucleotide sequences. We believe that these FGN and the fractal-like properties of long nucleotide sequences can lead to new principles and systems in the fields of signal processing, image recognition and artificial intelligence. The list of such scientific tools also includes genetic algorithms, which have been developed intensively during the last decades (for example, see (Goldberg, Korb, Deb, 1989; Forrest, Mitchell, 1991)).
The findings described here add to the evidence for the idea of the algebraic essence of the genetic coding system (Petoukhov, 2008a-d, 2011, 2012a,b,c; Petoukhov, He, 2009). We plan to publish in the near future other results of our studies of FGN and the Symmetry Principles related to a wide list of long DNA sequences of different organelles and organisms from different taxonomic classes. These results would require a large volume for their publication and, therefore, are not included in the limited volume of this article.

3. DISCUSSION

The genetic coding system possesses impressive noise-immunity properties. Modern technology of noise-immunity coding is based on matrix representations of discrete signals. This technology allows, for example, the noise-immune transfer of photos of the surface of Mars through millions of kilometers of noisy space, so that high-quality photos are received on Earth. The authors study hidden regularities of the genetic coding system by means of known matrix methods from this communication technology. As a result, a special scientific direction called "matrix genetics" has been developing in recent years (Petoukhov, 2008a-d, 2011, 2012a,b,c; Petoukhov, He, 2009; Petoukhov, Svirin, 2012). The results described in our article are closely connected with many other results of this line of research, in which many connections have been revealed between the genetic coding system and the mathematics of discrete signal processing, including noise-immunity coding. In particular, the list of relevant mathematical formalisms includes Hadamard matrices, orthogonal systems of Walsh functions and Rademacher functions, Kronecker families of matrices, dyadic-shift matrices and dyadic-shift decompositions of matrices, hypercomplex numbers (including Hamilton quaternions and biquaternions), new matrix representations of complex numbers and split-complex numbers, algebras of projective operators, etc. Without these algebraic results, we could not have offered fractal genetic nets as a new tool for genetic analyses, and we could not have obtained the described phenomenological data on the proposed symmetry principles in long nucleotide sequences. Our matrix approach to the genetic system makes it possible to obtain data in favor of the existence not only of the symmetry principles proposed above but also of some other symmetry principles, which will be published in the near future.

Modern science knows that deep knowledge of the phenomenological relations of symmetry among separate parts of a complex natural system can tell many important things about the evolution and mechanisms of such systems. It should be noted that the remarkable successes of molecular genetics were determined, in particular, by the disclosure of phenomenological facts of symmetry in the molecular constructions of the genetic code and by the skilful use of these facts in theoretical modeling. A bright example is the disclosure of the symmetrological fact, reflected in the first rule of E. Chargaff, of the equality of the quantities of nitrogenous bases in their appropriate pairs (adenine-thymine and cytosine-guanine) in molecules of DNA in different organisms. This phenomenological rule was used skilfully, together with additional symmetrological principles, in the theoretical modeling of the double helix of DNA by F. Crick and J. Watson.
Biological organisms belong to a category of very complex natural systems, which correspond to a huge number of biological species with inherited properties. But, surprisingly, molecular genetics has discovered that all organisms are identical to each other in their basic molecular-genetic structures. Due to this revolutionary discovery, a great unification of all biological organisms has taken place in science. The information-genetic line of investigations has become one of the most promising lines not only in biology but also in science as a whole. The more science studies living matter, the more facts of unification in other physiological systems (metabolic biosystems, energy biosystems, etc.) are discovered. The search for unification principles in living matter is an important direction in the development of modern science. The materials of our article belong to this direction.

Modern science recognizes the key meaning of information principles for the inherited self-organization of living matter. Modern informatics is an independent branch of science, which possesses its own language and mathematical formalisms and exists together with physics, chemistry and other scientific branches. The problem of the information evolution of living matter has been investigated intensively in the last decades, in addition to studies of the classical problem of biochemical evolution. Not only physics and chemistry deal with principles and methods of symmetry; informatics and digital signal processing also pay great attention to them. How is the theory of signal processing connected to geometry and geometrical symmetries? Signals are represented there in the form of a sequence of the numeric values of their amplitude at reference points. The theory of signal processing is based on the interpretation of discrete signals as vectors in multi-dimensional spaces. At each time step, a signal value is interpreted as the corresponding value of a coordinate in a multi-dimensional vector space of signals. In this way, the theory of discrete signals turns out to be the science of the geometries of multi-dimensional spaces, where different multi-dimensional numeric systems can be useful. The number of dimensions of such a space is equal to the number of reference points of the signal. Metric notions and all other necessary things are introduced in these multi-dimensional vector spaces for various problems of maintaining the reliability, speed and economy of signal information. On this geometrical basis, many methods and algorithms for the recognition of signals and images, the coding of information, the detection and correction of information mistakes, artificial intelligence and the training of robots are constructed. One can add here the importance of symmetries in permutations of components for coding signals, in the spectral analysis of signals, in orthogonal and other transformations of signals, and so on.

Investigation of symmetrical and structural analogies between computer informatics and genetic informatics is also needed for the creation of DNA-computers and DNA-robotics, for the so-called "genetic algorithms" that are widely used in modern engineering, etc. The authors hope that the symmetry principles proposed in this article will be useful not only for fundamental knowledge but also for technological applications. The thoughts and dreams of Chargaff about the disclosure of a grammar of biology on the basis of a symmetrological analysis of hidden regularities of DNA are still valid, and they determine an important area of research that is additionally supported by this article.

Acknowledgments: The described research was conducted in the framework of a long-term cooperation between the Russian and Hungarian Academies of Sciences. The authors are grateful to G. Darvas, M. He, A. Pellionisz and I. Stepanyan for their support.
Some results of this paper were made possible by the Russian State scientific contract P377 of July 30, 2009.

REFERENCES

Albrecht-Buehler, G. (2006) Asymptotically increasing compliance of genomes with Chargaff's second parity rules through inversions and inverted transpositions. Proceedings of the National Academy of Sciences, November 21, 103(47), 17828-17833.
Bell, S. J., Forsdyke, D. R. (1999) Deviations from Chargaff's Second Parity Rule Correlate with Direction of Transcription. Journal of Theoretical Biology, 197, 63-76.
Chargaff, E. (1950) Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia, 6, 201.
Chargaff, E. (1971) Preface to a Grammar of Biology: A hundred years of nucleic acid research. Science, 172, 637-642, http://www.sciencemag.org/content/172/3984/637.full.pdf?ijkey=99298aa2ffc516d64de947c301cfa5f6a56d3c08&keytype2=tf_ipsecsha
Chargaff, E. (1975) A fever of reason. Annual Review of Biochemistry, 44, 1-20.
Cristea, P.D. (2005) Representation and analysis of DNA sequences. Genomic Signal Processing and Statistics, Chapter 1, E. Daugherty et al. Eds., Hindawi Publishing Corp., pp. 15-65.
Darvas, G. (2007) Symmetry. Basel: Birkhäuser, xi + 508 pp.
Dong, Q., Cuticchia, A.J. (2001) Compositional symmetries in complete genomes. Bioinformatics, 17, 557-559.
Forrest, S., Mitchell, M. (1991) The performance of genetic algorithms on Walsh polynomials: Some anomalous results and their explanation. In R.K. Belew and L.B. Booker, editors, Proceedings of the Fourth International Conference on Genetic Algorithms, pp. 182-189. Morgan Kaufmann, San Mateo, CA.
Forsdyke, D. R. (2002) Symmetry observations in long nucleotide sequences: a commentary on the Discovery Note of Qi and Cuticchia. Bioinformatics letter, v. 18, 1, 215-217.
Forsdyke, D. R., Bell, S. J. (2004) A discussion of the application of elementary principles to early chemical observations. Applied Bioinformatics, 3, 3-8.
Goldberg, D.E., Korb, B., Deb, K. (1989) Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems, 3(5), 493-530.
Gusev, V., Miroshnichenko, L., Chuzhanova, N. (2009) Detection of fractal structures in the DNA sequences. International Book Series “Information Science and Computing”, Book 8, Classification, Forecasting, Data Mining, pp. 117-124, in Russian (Supplement to International Journal "Information Technologies and Knowledge", v. 3), http://www.foibg.com/ibs_isc/ibs-08/ibs-08-p17.pdf
Jeffrey, H.J. (1990) Chaos game representation of gene structure. Nucleic Acids Research, v. 18, 8, 2163-2170.
Kong, S-G, Fan, W-L, Chen, H-D, Hsu, Z-T, Zhou, N, et al. (2009) Inverse Symmetry in Complete Genomes and Whole-Genome Inverse Duplication. PLoS ONE 4(11): e7553. doi:10.1371/journal.pone.0007553
Mitchell, D., Bridge, R. (2006) A test of Chargaff's second rule. Biochemical and Biophysical Research Communications, 340(1): 90-94, http://www.ncbi.nlm.nih.gov/pubmed/16364245
NCBI. (2012a). http://www.ncbi.nlm.nih.gov/
NCBI. (2012b). http://www.ncbi.nlm.nih.gov/nuccore/294155300
Pellionisz, A.J., Graham, R., Pellionisz, P.A., Perez, J.C. (2012a) Recursive Genome Function of the Cerebellum: Geometric Unification of Neuroscience and Genomics. In: M. Manto, D.L. Gruol, J.D. Schmahmann, N. Koibuchi, F. Rossi (eds.), Handbook of the Cerebellum and Cerebellar Disorders (Springer Handbook "The Cerebellum"), pp. 1381-1423. Submitted October 20, accepted November 1, 2011. DOI 10.1007/978-94-007-1333-8_61, © Springer Science+Business Media Dordrecht 2012 (full text in http://fr.scribd.com/doc/111439455/BOOK-Unification-of-Neuroscience-and-GenomicsPellionisz-Et-Al-in-Section-4-Springer-the-Cerebellum-Handbook-2012).
Pellionisz, A.J. (2012b). http://www.junkdna.com/the_genome_is_fractal.html
Perez, J.-C. (2010) Codon populations in single-stranded whole human genome DNA are fractal and fine-tuned by the golden ratio 1.618. Interdisciplinary Sciences: Computational Life Sciences, 2, 1-13. http://www.ncbi.nlm.nih.gov/pubmed/20658335, full text in: http://fr.scribd.com/doc/95641538/Codon-Populations-in-Single-stranded-Whole-Human-GenomeDNA-Are-Fractal-and-Fine-tuned-by-the-Golden-Ratio-1-618
Petoukhov, S.V. (2008a) The degeneracy of the genetic code and Hadamard matrices. arXiv:0802.3366 [q-bio.QM].
Petoukhov, S.V. (2008b) Matrix genetics, algebras of the genetic code, noise immunity. Moscow: RCD, 316 p. (in Russian).
Petoukhov, S.V. (2008c) Matrix genetics, part 1: Permutations of positions in triplets and symmetries of genetic matrices, http://arxiv.org/abs/0803.0888, 6th version, 1-34.
Petoukhov, S.V. (2008d) Matrix genetics, part 3: the evolution of the genetic code from the viewpoint of the genetic octave Yin-Yang-algebra. arXiv:0805.4692 [q-bio.QM].
Petoukhov, S.V. (2011) Hypercomplex numbers and the algebraic system of genetic alphabets. Elements of algebraic biology. Hypercomplex numbers in geometry and physics, v. 8, 2(16), 118-139 (Gipercompleksnyie chisla v geometrii i fizike, in Russian).
Petoukhov, S.V. (2012a) The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts (7th version from January 30, 2012), http://arxiv.org/abs/1102.3596
Petoukhov, S.V. (2012b) On fractal structure of long nucleotide sequences. Joint scientific journal (Ob’edinennyi nauchnyi journal), # 6-7, 50 (in Russian).
Petoukhov, S.V. (2012c) Symmetries of the genetic code, hypercomplex numbers and genetic matrices with internal complementarities. Symmetry: Culture and Science, in this issue.
Petoukhov, S.V., He, M. (2009) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and Applications. Hershey, USA: IGI Global. 271 p.
Petoukhov, S.V., Svirin, V.I. (2012) Fractal genetic nets and the rules of long genetic sequences. Joint scientific journal (Ob’edinennyi nauchnyi journal), # 8-9, 50-52 (in Russian).
Pevzner, P.A. (2000) Computational molecular biology. An algorithmic approach. Cambridge, Massachusetts: MIT Press.
Prabhu, V. V. (1993) Symmetry observation in long nucleotide sequences. Nucleic Acids Research, 21, 2797-2800.
Sueoka, N. (1999) Two aspects of DNA base composition: G + C content and translation-coupled deviation from intra-strand rule of A = T and G = C. Journal of Molecular Evolution, 49, 49-62.
Yam, Ph. (1995) Talking trash (Linguistic patterns show up in junk DNA). Scientific American, 272(3), 12-15.
Yamagishi, M.E.B., Herai, R.H. (2011) Chargaff’s “Grammar of Biology”: New Fractal-like Rules. arXiv:1112.1528v1 from 07.12.2011.
Watson, J. D., Crick, F. H. C. (1953) Molecular Structure of Nucleic Acids. Nature, 4356, 737.

Symmetry: Culture and Science Vol. 23, Nos. 3-4, 323-342, 2012

A MARKOV INFORMATION SOURCE FOR THE SYNTACTIC CHARACTERIZATION OF AMINO ACID SUBSTITUTIONS IN PROTEIN EVOLUTION

Miguel A. Jiménez-Montaño

BioPhysicist (b. México, D.F., MEXICO, 1941).
Address: Faculty of Physics and Artificial Intelligence, University of Veracruz, Sebastián Camacho # 5, Col. Centro, C.P. 91000, Xalapa, Ver., México. E-mail: [email protected].
Fields of interest: The structure of the genetic code, technological evolution and informational measures and algorithmic complexity of sequences of symbols and nerve signals.
Awards: Research Award, 1989; Dean Award, 2004; both from Universidad Veracruzana. Fulbright Fellow, 1982. First Prize National Contest on Scientific Non-technical Essay, 1990.
Publications:
Ebeling W., Jiménez-Montaño M. A. (1980)*. On Grammars, Complexity, and Information Measures of Biological Macromolecules. Mathematical Biosciences Vol. 52: 53-71.
Jiménez-Montaño M. A. (1984). On the Syntactic Structure of Protein Sequences, and Concept of Grammar Complexity. Bulletin of Mathematical Biology, Vol. 46: 641-660.
Jiménez-Montaño M.A., de la Mora-Basáñez R., Pöschel T. (1996)*. The Hypercube Structure of the Genetic Code Explains Conservative and Non-Conservative Aminoacid Substitutions in Vivo and in Vitro. BioSystems Vol. 39: 117-125.
Jiménez-Montaño M.A. (1999)*. Protein Evolution Drives the Evolution of the Genetic Code and Vice Versa. BioSystems Vol. 54: 47-64.
Weiss O., Jiménez-Montaño M.A., Herzel H. (2000). Information content of protein sequences. J. Theor. Biol., Vol. 206: 379-386.
Jiménez-Montaño M. A. (2004). Applications of Hyper Genetic Code to Bioinformatics. Journal of Biological Systems, Vol. 12: 5-20.
Jiménez-Montaño M. A. (2009)*. The fourfold way of the genetic code. BioSystems, Vol. 98 (2), 105-114.

Abstract: We introduce a theoretical model consisting of a Markov Information Source that generates codon sequences, and from them amino acid sequences, that maintain the same or very similar functions and structures, as a direct consequence of the structure of the genetic code and of general physical chemical constraints. With the help of the model, we propose a codon dendrogram to describe a hierarchy of codon categorizations, which explains the pattern of frequent amino acid substitutions in short-term evolution.

Keywords: Markov source, genetic code, codon, amino acid, protein evolution.

1. INTRODUCTION

Understanding protein evolution remains as major a challenge in molecular biology today as it was a decade ago (Dokholyan and Shakhnovich, 2001 and references therein), despite the huge amount of data on genes, protein sequences and structures presently available. Our knowledge of the relation between the genotype (DNA coding for a protein) and the phenotype (a protein’s structure and its pattern of specific traits related to its biological function), which is central to the Theory of Evolution and all biology, is still at a very primitive stage (Thorne and Goldman, 2001; Wagner, 2012).
While the mechanisms of mutations in DNA sequences that code for proteins are known (Parkhomchuk et al., 2009; Skipper et al., 2012), the contribution of the genetic code in creating new information, as against the part played by natural selection in its fixation in the population, is not completely appreciated. According to Abel and Trevors (2006), “Genetic prescription of computation precedes and produces phenotypic realization. And this prescription is ‘written in stone’”. Only recently has the full complexity of the genotype/phenotype map been recognized in the literature (Crutchfield and Schuster, 2003). According to DePristo et al. (2005), “Taken as a whole, recent findings from biochemistry and evolutionary biology indicate that our understanding of protein evolution is incomplete, if not fundamentally flawed”. They suggest joining the fields of protein biophysics and molecular evolution by highlighting the shared questions. In the same line of thought, Pàl et al. (2006) argue that an integrated view of this field should embrace genomic, structural and population levels of description. These different levels are displayed graphically in Fig. 1.

However, this aim is difficult to achieve for two reasons: on the one hand, these levels belong to fields of knowledge with radically different conceptual frameworks; on the other, the relationship between physics and biology is degenerate. It is well known that many proteins with no apparent sequence similarity display the same folds (Kleiger et al., 2000). Thus, the many-to-one map M(Ai/S), depicted in Fig. 1, gives the amino acid at position i of any of the sequences corresponding to the given structure (fold) S. In the concise statement by Hietpas et al. (2011): “Biology is governed by physical interactions, but biological requirements can have multiple physical solutions”.
[Figure 1 (diagram). Levels shown: Environment (natural selection, neutral evolution); virus quasi-species f1, f2, …, fn; ψ1, ψ2, …, ψn (concentration space; population genetics; PHENOTYPE); protein structure (form space; biophysics and biochemistry, thermodynamics); protein sequence space (coding of structure); DNA sequence space (genotypes); translation apparatus (codon usage); genetic code space (hypercube; MUTATIONS; molecular biology and bioinformatics; GENOTYPE). Maps shown: M(S/ψi), M(Ai/S), M(Cj/Ai), M(Cm'/Cm), M(Ai'/C'm), M(S'/Aj'), M(f'/S'). Legend: S = 3-d structure of the protein; Ai = amino acid at site i of the protein sequence; Ci = codon at site i of the protein; fi = frequency of species i; ψi = phenotype of species i.]

Figure 1: The conceptual scheme for protein evolution. The connection between the domain of population genetics (where Darwinian selection operates and true evolution occurs) and the genome of an organism (where mutations, an essential ‘raw material’ of evolution, occur) is mediated by the physics and chemistry of proteins. The complexity implicit in this diagram comes from the degenerate relationship between physics and biology. Given a population, e.g. of virus quasi-species with frequencies {fi, i = 1, 2, …, n} and corresponding phenotypes {ψi, i = 1, 2, …, n}, M(S/ψi) specifies the common structure (S) of a protein family (the set π of orthologous proteins associated with phenotype ψi). Then, M(Ai/S) is the amino acid at position i of any of the sequences corresponding to the structure S. In the same way, M(Ci/Ai) gives one of the codons coding for the amino acid, according to the genetic code. Let Cm be the chosen codon; then M(C'm/Cm) represents the single-nucleotide mutation from codon Cm to codon C'm, obeying the structure of the genetic code. In turn, M(A'j/C'm) gives the mutated amino acid A'j, coded by C'm, and M(S'/A'j) maps A'j to the corresponding new structure S'.
This modified protein structure is mapped to a new phenotype ψ+ through M(ψ+/S'), which through Darwinian selection starts the evolutionary cycle again. Notice that M(S'/A'j) maps A'j to the corresponding new structure S', and M(ψ+/S') maps the new protein structure (fold) to the new phenotype ψ+. Certainly, there is no direct feedback from the old to the new structure. Proteins are concrete molecules that do not evolve.

The main purpose of the present work is to provide a theoretical model, built upon an empirical codon substitution matrix (Schneider et al., 2005), which explains the pattern of amino acid substitutions in proteins that maintain the same or very similar functions and structures. In the model this pattern is a direct consequence of two factors: the structure of the genetic code (Jiménez-Montaño, 1994), which controls the possible amino acid changes from single nucleotide mutations, and general physical chemical constraints, which are responsible for the stability of the protein.

2. MODELS OF PROTEIN EVOLUTION

2.1 Amino acid models of protein evolution

Notwithstanding the complexity of protein evolution, the phenomenological approach to amino acid substitutions in protein families, which started with the empirical work of Margaret Dayhoff and her colleagues (1978), is still widely used for a wide range of applications such as database search, sequence alignment, protein family classification and phylogenetic inference, among many others. Following in Dayhoff’s footsteps, and with the help of the large databases available in subsequent years, various authors built amino acid substitution matrices based on observed mutation counts in protein alignments (e.g. the updated Dayhoff matrices by Gonnet et al., 1992 or Jones et al., 1992). This formalism operates in protein space (see below), and thus completely ignores the underlying mutational process that occurs at the DNA level.
Dayhoff’s PAM matrices describe the probabilities of amino acid substitutions for a given period of evolution. They are derived from a model in which amino acids mutate randomly and independently of one another. Each substitution probability during some time interval depends only on the identities of the initial and replacement residues. Mathematically speaking, the dynamics of amino acid substitution resembles a time-homogeneous, first-order, reversible Markov chain (Dayhoff et al., 1972, 1978; Gonnet et al., 1992; Jones et al., 1992; Müller and Vingron, 2000). Of course, the above assumptions are not strictly true, and various authors have pointed out that the dynamics of amino acid substitutions is neither Markovian, nor stationary, nor homogeneous (Crooks and Brenner, 2005).

Sequence space, the abstract space of all sequences of length n drawn from an alphabet of k letters, was first introduced in coding theory by Hamming (1950). It is a metric space with respect to the Hamming distance, dH (Hamming, 1950), which is the minimum number of changes required to convert one sequence into another. Maynard Smith (1970) applied this concept to amino acid sequences, defining the concept of protein space (see also Kauffman, 1989 and references therein). As recently described in a delightful paper by Frances Arnold (2011), in protein space each sequence is surrounded by its one-mutant neighbors, that is, by all the proteins that differ from it by a change in a single amino acid letter. As described in (Kauffman, 1989), “The concept of protein space is a high-dimensional space in which each point represents one protein, and is next to 19N points representing all the 1-mutant neighbors of that protein. The protein space therefore simultaneously represents the entire ensemble of 20^N proteins and keeps track of which proteins are 1-mutant neighbors of each other”.
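The counts in the Kauffman quotation can be checked on a toy sequence. The following is a minimal Python sketch, not part of the original article: the function names `hamming` and `one_mutant_neighbors` and the three-letter example peptide "MKV" are illustrative assumptions. It computes the Hamming distance and enumerates the one-mutant neighbors of a sequence of length N over the 20-letter amino acid alphabet.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def one_mutant_neighbors(seq, alphabet=AMINO_ACIDS):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, old in enumerate(seq):
        for new in alphabet:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]


peptide = "MKV"  # hypothetical example, N = 3
neighbors = list(one_mutant_neighbors(peptide))
print(len(neighbors))                  # 19 * N neighbors
print(len(AMINO_ACIDS) ** len(peptide))  # 20 ** N points in protein space
```

For N = 3 the sketch lists 19 × 3 = 57 neighbors out of 20^3 = 8000 sequences, each at Hamming distance 1 from the starting peptide.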
However, the autonomous description of protein evolution in protein (sequence) space is misleading, because it violates the Central Dogma of molecular biology. In nature, amino acids do not interchange among themselves during evolution. The space of possibilities is at the DNA level, where the meaningful unit is a base triplet or codon. Therefore, the relevant space to describe codon substitutions is a genetic code space (Fig. 1), where codon mutations occur (Swanson, 1984; Jiménez-Montaño et al., 1996; Petoukhov, 1999; Stambuk, 2000; Jiménez-Montaño, 2004; Jiménez-Montaño and He, 2009). Thus many of the 19 other amino acids cannot be reached from the original amino acid by a single-nucleotide mutation, and so have zero probability of appearance. This is the first place where the symmetry associated with the concept of random mutations is broken. The single-nucleotide mutations among the four bases are indeed random (although not necessarily equally probable), but the corresponding amino acid substitution probabilities cannot be equal, due to the structure of the genetic code. The dynamics of amino acid substitutions refers to an aggregate level (see below).

2.2 Codon models of protein evolution

Goldman and Yang (1994) and, independently, Muse and Gaut (1994) introduced the first models of a Markovian dynamics at the DNA (codon) level. In these models all substitution rates are derived from parameters. We will not discuss parametric codon models here; instead, we employ an adaptation for short-term evolution of the empirical codon substitution model proposed by Gaston Gonnet and his group (Schneider et al., 2005). In this case, all substitution rates were estimated from a large data set of aligned vertebrate coding sequences and then fixed. Assuming a Markovian dynamics at the DNA (codon) level, the dynamics of amino acid substitutions is defined by an aggregation (grouping) of codon states.
However, Görnerup and Jacobi (2010) pointed out that in general the dynamics on the aggregated level is not closed, since the partition of the original space introduces memory on the aggregated level. Only in the special case when the aggregated dynamics is indeed closed does the stochastic process over the partitions constitute a Markov chain of the same order as the original process. Employing the same empirical codon substitution matrix (Schneider et al., 2005) as we do, they showed that the substitution process operates hierarchically on multiple levels, from nucleotides to codons, to groups of codons associated with amino acids, and to amino acid groups which form “reduced alphabets”. Since each level approximately has its own closed dynamics, the original dynamics and the partition of the state space then define a new stochastic process on the coarser level. These theoretical aspects of molecular evolution were corroborated by our computer simulations.

Recently, Kosiol and Goldman (2011) proposed a closely related approach in terms of aggregated Markov processes (AMPs), which models protein evolution as time-homogeneous Markovian at the DNA (codon) level but observed (via the genetic code) only at the amino acid level. They showed that this approach leads to time-dependent and non-Markovian observations of amino acid sequence evolution. The main difference between their work, on the one hand, and the paper by Görnerup and Jacobi (2010) and our model, on the other, is that Kosiol and Goldman employed a parametric codon substitution matrix. Nonetheless, our model is consistent with their assertion that the genetic code and amino acids' physiochemical properties “influence the average substitution patterns observed over collections of proteins at all evolutionary distances in the same way”. That is, we assert that it is not accurate to say that the influence of the genetic code dominates in the short term and that of physicochemical properties in the long term, as supposed by Benner et al. (1994).
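The notion of a closed aggregated dynamics can be illustrated on a toy chain. The sketch below is not the procedure of Görnerup and Jacobi; it assumes one standard sufficient criterion, strong lumpability in the sense of Kemeny and Snell: the aggregated process is Markov if, for every pair of blocks, all states of the first block have the same total probability of jumping into the second. The two 3-state matrices and the names `block_sums`, `is_lumpable` and `lumped_matrix` are invented for illustration and have nothing to do with the 61 x 61 codon matrix.

```python
def block_sums(P, partition):
    """For each state i and each block B, the probability of jumping from i into B."""
    return [[sum(row[j] for j in block) for block in partition] for row in P]


def is_lumpable(P, partition, tol=1e-12):
    """Strong lumpability: states in the same block share identical block sums."""
    sums = block_sums(P, partition)
    for block in partition:
        reference = sums[block[0]]
        for i in block[1:]:
            if any(abs(a - b) > tol for a, b in zip(sums[i], reference)):
                return False
    return True


def lumped_matrix(P, partition):
    """Transition matrix over blocks (meaningful only if the chain is lumpable)."""
    sums = block_sums(P, partition)
    return [sums[block[0]] for block in partition]


partition = [[0], [1, 2]]  # hypothetical grouping of three states into two blocks

# States 1 and 2 both jump to block {0} with probability 0.2: closed dynamics.
P_closed = [[0.5, 0.25, 0.25],
            [0.2, 0.30, 0.50],
            [0.2, 0.60, 0.20]]

# State 2 now jumps to block {0} with probability 0.4: the aggregate has memory.
P_open = [[0.5, 0.25, 0.25],
          [0.2, 0.30, 0.50],
          [0.4, 0.40, 0.20]]

print(is_lumpable(P_closed, partition), lumped_matrix(P_closed, partition))
print(is_lumpable(P_open, partition))
```

In the first matrix the two-block process is itself a Markov chain; in the second, knowing only the current block is not enough to predict the next one, which is the memory effect described above.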
3. THE MARKOVIAN CODON-SUBSTITUTION MODEL

Markov processes (chains, models) were first developed by Andrei A. Markov. Their first use was for a linguistic purpose, modeling the letter sequences in works of Russian literature (Markov, 1913). Later on, Markov models were developed as a general statistical tool and applied to problems in natural language processing (Christopher and Schutze, 2003) and in computational biology (Nielsen, 2005; Ewens et al., 2001; Yang, 2006), among many other applications.

3.1 The Markov Information Source

Probabilistic finite state automata (PFA), like hidden Markov models (HMM), are widely used in computational linguistics, machine learning, time series analysis, computational biology and speech recognition, among other fields of research. Their definition, given in (Vidal et al., 2005), is equivalent to the definition of a stochastic regular grammar. PFA are built to deal with the problem of probabilizing a structured space by adding probabilities to structure. This is precisely what we want to do: to bring codon transition probabilities into the structure of the genetic code. This is necessary because in our model, as in the parametric models by Goldman and Yang (1994) and by Halpern and Bruno (1998), the state space of the Markov process corresponds to the standard genetic code (or its variants). In the model of Goldman and Yang, “The states of the Markov process are the 61 sense codons. The three nonsense (stop) codons are not considered in the model, as mutations to or from stop codons can be assumed to affect drastically the structure and function of the protein and therefore will rarely survive”. But, except for sharing the same abstract space, our approach is not related to the mentioned models. Rather, its formulation and interpretation are closer to those of informational and linguistic models.
Therefore, we interpret the genetic code as a Markov information source, exactly as this expression is understood in information theory (Ash, 1965, p. 172). That is, a finite Markov chain, together with a function f whose domain is the set of states S and whose range is a finite set Γ called the alphabet of the source. In our case, Γ = {A, G, S, T, …, Y, W} is the amino acid alphabet. The PFA can be displayed graphically as a six-dimensional Boolean hypercube (Jiménez-Montaño et al., 1996; Petoukhov, 1999; Stambuk, 2000; Jiménez-Montaño, 2004; Sánchez et al., 2004; Karasev and Soronkin, 1997). In a forthcoming paper (Jiménez-Montaño and Ramos-Fernández, 2013) we describe an implementation of the PFA with the help of the software tool GSEQUENCE, which we developed specially to simulate the generation of codon sequences, and from them amino acid sequences, in protein evolution.

Just as Shannon (1948) did not mean that his statistical description of human language, with a Markov information source, is the actual manner in which human discourse is generated, we are not suggesting that Nature really produces proteins with the help of a Markov information source at the codon level. This is only a mathematical device to describe the correlations among the amino acid substitutions along the evolutionary process.

As mentioned above, for our model we have adapted for short-term evolution (i.e., for single-nucleotide changes) the 61 x 61 codon matrix introduced in (Schneider et al., 2005), for which all substitution rates have been estimated from a set of 17,502 alignments of orthologous genetic sequences from five vertebrate genomes. The codon transition probabilities are fixed, and correspond to protein divergence between 25 and 60 accepted point mutations per 100 amino acids (PAMs). Besides the influence of the genetic code, this matrix, like any other empirical codon substitution matrix, includes variable factors such as codon usage, transition/transversion bias and selective pressures. Inversions and duplications are not considered in this paper.

Out of the 190 possible interchanges among the 20 amino acids, we consider only the 75 that can be obtained by single-base substitutions. Therefore, we employ a reduced empirical matrix (REM), obtained by setting to zero all entries of the original matrix that correspond to more than one nucleotide change and normalizing the resulting matrix; see (Jiménez-Montaño and He, 2009) for more details. In this way, we take into account the local structure of the genetic code around each codon. In one step, a codon can change in nine different ways and generally can have from zero to three synonymous changes and from six to nine non-synonymous changes, except in the cases of six-fold degeneracy such as serine, leucine and arginine. We consider all possible one-step changes for all 61 codons, disregarding the three stop codons.

The important contribution of the genetic code to protein evolution has recently been underlined by Hietpas et al. (2011), who found that the genetic code is highly optimized (+2.4σ) to favor single-base substitutions between codons with WT-like fitness compared with randomly generated codes. Thus, the genetic code generally permits single-base substitution pathways between codons with WT-like fitness.

4. SELECTIVE CONSTRAINTS

The functions and structure of individual proteins impose different constraints on their evolution. Irrespective of their dispensability, most proteins require a suitable three-dimensional structure to function. Therefore, any polypeptide having a well-defined globular structure must be subject to strong selection, and its sequence is, from this point of view, nearly optimal in terms of stability (Sánchez et al., 2006). Therefore, a majority of positions in a protein globular domain are selected for stability.
The fundamental role of selection for thermodynamic stability in shaping molecular evolution has been demonstrated by studies that simulated sequence evolution under structural constraints (Parisi and Echave, 2001). The amino acid substitution probabilities derived from the REM matrix are highly anti-correlated with the values taken from the amino acid substitution matrix suggested long ago by Miyata et al. (1979) (Table 2). Therefore, the selective constraints are approximately captured by the amino acid properties of hydrophobicity and volume.

More than thirty years ago, while analyzing the globin fold, Lesk and Chothia (1980) had already reached a similar conclusion. So long as some basic physical-chemical constraints are satisfied, there is considerable latitude in primary structure (Axe, 2004). At about the same time, a related conclusion was reached by Sander and Schulz (1979) in their study of the degeneracy of the information content of amino acid sequences from overlaid genes. From the families of homologous proteins known at that time, they concluded that the information contained in a sequence is degenerate with respect to function. They quantified this degeneracy on the basis of viral overlays and found that five amino acid groups is the largest number of groups for which the assumption is tenable that there exists one group sequence per protein function. Ever since, a great number of reduced alphabets have been proposed, based on different amino acid properties or observed substitutions (Miyata et al., 1979; Jiménez-Montaño, 1984; Murphy et al., 2000; Solis and Rackovsky, 2000; Cannata et al., 2002; Li et al., 2003, among many others). Here, following (Görnerup and Jacobi, 2010), we interpret reduced amino acid alphabets simply as a result of the various codon sub-dynamics among different groups of codons, which are neighbors according to the topology of the genetic code space.
Recently, Chothia (Sasidharan and Chothia, 2007) returned to the problem from a different perspective. For the divergence process in proteins that maintain the same or very similar functions and structures, Sasidharan and Chothia reported very similar overall patterns of divergence by counting observed amino acid substitutions in three very different groups of orthologs. They interpret this result to mean that individual responses of most proteins are variations on a common set of selective constraints which govern the types of frequent mutations that are acceptable. In RESULTS we show that the frequencies of amino acid pair substitutions deduced from our computer simulations are in very good agreement with the mutation profile obtained in their paper.

5. PROTEIN SYNTAX

Paraphrasing Prince and Smolensky (1997) in their attempt to relate the sciences of the brain with the sciences of the mind, we can say in the present context that: “It is evident that statistical thermodynamics and molecular biology are separated by many gulfs, not the least of which lies between the formal methods appropriate for continuous dynamical systems and those for discrete symbol structures”.

For an amino acid substitution to be acceptable it is necessary, first of all, that the alteration it produces in the protein structure be as small as possible. Therefore, the general Darwinian principle of “gradual change” is interpreted in the sense that the destabilization of the structure should be as small as possible. Thus, thermodynamics requires minimization of the Gibbs free energy change. However, this continuous optimization is hampered by the discrete nature of the amino acid change. A rough estimate of the effect produced by the substitution consists, for example, in calculating the Miyata et al. (1979) distance between the original and the new amino acid.
However, this distance should not be calculated between the original and any of the other 19 amino acids, but only between the original amino acid and the amino acids accessible after a single-nucleotide mutation (that is, at most nine amino acids). In this way the genetic code modulates acceptable mutations.

Following the parallelism between linguistics and a formal protein language (Jiménez-Montaño, 1984), we recall that an important challenge of the former discipline is to discover an architecture for grammars that both allows variation and limits its range to what is actually possible in human language (Prince and Smolensky, 1997). Furthermore, these authors remark that “… a central element in the architecture of grammar is a formal means for managing the pervasive conflict between grammatical constraints”. “The key observation is this: In a variety of clear cases where there is a strength asymmetry between two conflicting constraints, no amount of success on the weaker constraint can compensate for failure on the stronger one”. Finally, “… a grammar consists entirely of constraints arranged in a strict domination hierarchy, in which each constraint is strictly more important than - takes absolute priority over - all constraints lower-ranked in the hierarchy. With this type of constraint interaction, it is only the ranking of constraints in the hierarchy that matters for the determination of optimality; no particular numerical strengths, for example, are necessary”. Below we show in which sense these concepts can be applied to the characterization of amino acid substitutions.

First, we need to say a few words about amino acid categorizations and the syntactic structure of proteins at the letter-unit level, which we discussed in detail in (Jiménez-Montaño, 1984). We will call any classification of the 20 amino acid types into r groups (under different criteria) an amino acid categorization.
The set of symbols denoting the group names will be called a reduced alphabet. When the reduced alphabet corresponds to a pattern of substitutions according to an empirical matrix (Dayhoff et al., 1972, 1978; Gonnet et al., 1992; Jones et al., 1992; etc.), the resulting pattern is called a pattern of substitution classes. If the categorization is based on physical-chemical properties (Grantham, 1974; Miyata et al., 1979; etc.), the resulting patterns are called amino acid property sequences. In the paper mentioned above, we approached the question of how to select a fixed number of categories which best mirror amino acid replacements. The mutational categories should reflect the most frequent amino acid substitutions observed. In the same way, if the selected physical-chemical properties truly determine the protein's architecture, both patterns will be consistent. In other words, the reduced alphabets obtained under different criteria will be almost equivalent. It is on this basis that it is reasonable to assume that the constraints responsible for a given fold are somehow encoded in the pattern of substitution classes.

A hierarchy (inverted tree) of amino acid categorizations represents a syntactic structure of proteins at the letter-unit level. For our discussion we shall employ the hierarchy introduced in Fig. 1 of (Jiménez-Montaño, 1984). In the same paper we discussed two more hierarchies, one from Sneath (1966) and the other from Lim (1974). Twelve more hierarchies (also called dendrograms), associated with the same number of popular amino acid substitution matrices, are displayed in Fig. 4 of (Johnson and Overington, 1993). See also Fig. 1 in (Fan and Wang, 2003), and (Venkatarajan and Braun, 2001), among many other proposals in the literature. With this background, the interpretation of the above quotations from Optimality Theory (Prince and Smolensky, 1997) in the context of the present article is straightforward.
A hierarchy of amino acid categorizations encodes physical-chemical constraints arranged in a strict domination hierarchy. Thus, the dominant partition in the dendrogram in Fig. 1 of (Jiménez-Montaño, 1984) separates amino acids into non-hydrophobic, represented by the group symbol a, and hydrophobic, represented by the group symbol b. This constraint dominates over the lower constraints (for example, size). Therefore, we expect that an amino acid of a given class will be substituted with another amino acid of the same class. In this case, we say that the substitution is syntactically correct, and that the new sequence belongs to the language generated by the grammar.

Let us illustrate with an example how the grammar generates average amino acid substitutions obeying general physical-chemical constraints that preserve the stability of the protein. If, at a given site of a protein sequence, we have aspartic acid (D), we expect that it will be replaced by glutamic acid (E), because the node n in Fig. 1 of (Jiménez-Montaño, 1984) is the smallest class that includes both amino acids. Next, we have node e, which includes two more amino acids, Q and N; that is, e = {D, E, N, Q}. Thus, the category represented by the symbol n corresponds to the most conservative substitution, then follows the wider category e, and so on up to the category represented by the symbol a, which embraces the non-hydrophobic amino acids.

The amino acid dendrograms we are considering were derived from a number of amino acid substitution matrices by several authors, who employed different clustering procedures; these procedures have a significant influence on the result. Besides, this approach in protein space disregards the fact that two amino acids in the same group may be separated by two or three nucleotide substitutions, and are thus unlikely to substitute one another.
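The nested classes of the D/E example can be sketched in a few lines of Python. This is only an illustration: the classes n, e, c, h and i are written as they are named in this article, the memberships given for a and b are deliberately partial (only the subclasses mentioned in the text), and `smallest_common_class` is a hypothetical helper name. Only the ranking of classes matters here, with no numerical strengths, in the spirit of the strict domination hierarchy quoted above.

```python
# Partial hierarchy of amino acid classes, restricted to groups named in the text.
# ASSUMPTION: "a" and "b" list only the subclasses mentioned in the article;
# the complete memberships are those of Fig. 1 in (Jiménez-Montaño, 1984).
CLASSES = {
    "n": set("DE"),
    "e": set("DENQ"),
    "c": set("PAGST"),
    "h": set("LIVM"),
    "i": set("FYW"),
    "a": set("DENQ") | set("PAGST"),  # non-hydrophobic (partial)
    "b": set("LIVM") | set("FYW"),    # hydrophobic (partial)
}


def smallest_common_class(x, y):
    """Return the name of the smallest class containing both residues.

    Returns None when the two residues share no class in this partial
    hierarchy, i.e. the substitution crosses the dominant a/b partition.
    """
    shared = [name for name, members in CLASSES.items()
              if x in members and y in members]
    return min(shared, key=lambda name: len(CLASSES[name]), default=None)


for pair in [("D", "E"), ("D", "Q"), ("D", "S"), ("D", "L")]:
    print(pair, smallest_common_class(*pair))
```

A smaller common class corresponds to a more conservative substitution; a result of None marks a substitution that violates the dominant constraint in this partial sketch.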
As pointed out long ago by Miyata et al. (1979): “Amino acids separated by two or three codon position differences are unlikely to interchange even if they are chemically similar”. Recently, we discussed this problem (Jiménez-Montaño and He, 2009). For example, the category i = {F, Y, W} in Fig. 1 of (Jiménez-Montaño, 1984), which includes the three large hydrophobic amino acids, should be refined into two groups: {F, Y} and {W}. This is so because going from the codon of W to any of the codons of the other two amino acids requires two nucleotide changes; thus, W constitutes a separate group by itself. This splitting of W was already proposed, for example, in (Murphy et al., 2000), but for a different reason (which is a consequence of the above reason): the small number of substitutions observed between W and the other two amino acids, as reflected in the empirical BLOSUM 50 matrix. Therefore, it is clear that to improve over previous approaches it is necessary to have a syntactic structure at the codon level.

6. RESULTS

The first result of this paper is the proposal of the codon dendrogram shown in Fig. 2. It was obtained by applying the clustering algorithm UPGMA (unweighted pair-group method using arithmetic averages) to the full codon substitution matrix introduced in (Schneider et al., 2005). This classification of codons, inferred from an empirical matrix, induces a corresponding arrangement for amino acids. We observe that the codons for D and E share the same group; therefore, we expect these two amino acids to exchange frequently, both because they are very similar and because they are neighbors in codon space. They are ranked one in our simulations (Table 1) and in Human-Chicken orthologs, and ranked two in Escherichia coli and Salmonella orthologs; both of the latter come from the observed data (Table 6 in supp. material from Sasidharan and Chothia, 2007).
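The statements about neighbors in codon space can be checked directly against the genetic code. The following is a minimal sketch, assuming the standard genetic code (NCBI translation table 1); the helper names are hypothetical and are not part of the GSEQUENCE tool. It computes the smallest number of nucleotide changes needed to pass from any codon of one amino acid to any codon of another.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_of(aa):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]


def hamming(c1, c2):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(c1, c2))


def min_codon_distance(aa1, aa2):
    """Smallest number of nucleotide changes between codons of aa1 and aa2."""
    return min(hamming(c1, c2)
               for c1 in codons_of(aa1) for c2 in codons_of(aa2))


for a, b in [("D", "E"), ("F", "Y"), ("W", "F"), ("W", "Y"),
             ("D", "N"), ("E", "Q"), ("N", "Q"), ("E", "N")]:
    print(a, b, min_codon_distance(a, b))
```

With this code table, D/E, F/Y, D/N and E/Q are one nucleotide apart, whereas W/F, W/Y, N/Q and E/N require two changes, in line with the pairs discussed in this section.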
However, the group e = {D, E, N, Q} does not occur in Fig. 2; N and S2 (S with AGY codons) form one category, and Q and H another. Therefore, the letters in category e are not completely equivalent from the point of view of their substitutability. From the results in Table 1, the substitutions DN and EQ are in ranks between 12 and 14, while NQ and EN are ranked 44 and 45, respectively, in (Table 6 in supp. material from Sasidharan and Chothia, 2007). In the first case the amino acids have neighboring codons; in the second case the corresponding codons differ by two bases.

Figure 2: Dendrogram (codon rooted tree) obtained from the full empirical codon-substitution matrix (Schneider et al., 2005), employing the UPGMA method.

MUTATION   E. COLI-S. ENTERICA   MARKOV SOURCE   RANDOM GENERATOR
           RANK ORDER            RANK ORDER      RANK ORDER
DE/ED      2                     1               10
IV/VI      1                     3               6
ST/TS      4                     4               3
AT/TA      5                     8               5
SN/NS      10                    5               6
AS/SA      3                     9               2
RK/KR      15                    2               6
SG/GS      18                    15              2
AV/VA      8                     13              5
IL/LI      6                     14              4
SP/PS      -                     12              3
VL/LV      13                    6               3
LM/ML      7                     7               9
AP/PA      19                    20              5
AG/GA      16                    17              5
QH/HQ      21                    3               10
FY/YF      17                    10              10
TN/NT      23                    19              8
VM/MV      -                     16              10
IM/MI      29                    10              11
PQ/QP      -                     23              8
LF/FL      25                    10              6
TP/PT      -                     22              5
TI/IT      -                     19              6
RQ/QR      24                    18              6
ND/DN      14                    12              10
SC/CS      -                     19              7
KQ/QK      11                    9               10
TM/MT      -                     28              10
NH/HN      20                    17              10
AE/EA      9                     24              8
EQ/QE      12                    8               10
KN/NK      22                    11              10
AD/DA      26                    25              8
TV/VT      27                    -               5
RH/HR      28                    20              6
QL/LQ      30                    22              6
GE/EG      -                     20              8
PL/LP      -                     24              2

Table 1: Comparison of rank positions of the most frequent mutation types found in pairs of orthologs from Escherichia coli and Salmonella enterica, for < 10 % divergence (Sasidharan and Chothia, 2007), with the ones obtained from simulations generated with our Markov information source, implemented with the help of the software tool GSEQUENCE (Jiménez-Montaño and Ramos-Fernández, 2013). In the third column we display results obtained with a random source. A dash marks a mutation type for which no rank is listed in that column.

Despite the discrepancies just explained, there is very good agreement between most of the categories in Fig. 1 from (Jiménez-Montaño, 1984) and those in the codon dendrogram (Fig. 2). For example, the important category h = {L, I, V, M} of aliphatic amino acids in our hierarchy of amino acid categorizations coincides with the category d in Fig. 2. The same is true for the small neutral amino acids in group c = {P, A, G, S, T}, except for glycine (G), which in Fig. 2 belongs to class b = {N, S2, G, D, E}. There are other coincidences and differences that the reader can easily find. An important difference worth mentioning is the following: while in the grouping based on amino acid properties I and L are in the same category q, in the codon dendrogram I joins V to form a group. Even though there are proofreading processes in protein synthesis to correct translation errors, the substitutions VI are ranked three and the substitutions LI are ranked fourteen in our simulations, and are ranked one and six, respectively, in pairs of orthologs from Escherichia coli and Salmonella enterica (see our Table 1 and Table 6 in supp. material from Sasidharan and Chothia, 2007).

The amino acid substitution pairs from our codon dendrogram, (D, E), (K, R), (I, V), (Y, F), (M, L), (N, S2), (Q, H) and (A, S), are consistent with the ones reported in the dendrogram displayed in Fig. 3 of (Görnerup and Jacobi, 2010), except for minor differences. These are in the last two groups: in their paper (which employs a completely different agglomeration procedure), Q and H form separate groups, and A pairs with T instead of S. However, the higher categorizations are completely different in both dendrograms.
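For readers unfamiliar with the clustering step behind Fig. 2, the following is a minimal pure-Python sketch of UPGMA. It is not the author's implementation: the four labels and their pairwise distances are invented toy values, and how the empirical codon substitution matrix was converted into distances is not specified here, so that step is left out.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster labels with UPGMA.

    dist maps frozenset({a, b}) -> distance for every pair of labels.
    Returns (tree, heights): tree is a nested tuple of labels and heights
    lists the node height (half the merge distance) of each merge in order.
    """
    trees = {i: lab for i, lab in enumerate(labels)}   # active clusters
    sizes = {i: 1 for i in trees}                      # leaves per cluster
    d = {(i, j): dist[frozenset((labels[i], labels[j]))]
         for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    heights = []
    while len(trees) > 1:
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])  # closest pair
        heights.append(dij / 2)
        new = next_id
        next_id += 1
        for k in trees:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted mean, i.e. the arithmetic average over all leaf pairs.
            d[(k, new)] = (sizes[i] * dik + sizes[j] * djk) / (sizes[i] + sizes[j])
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        trees[new] = (trees.pop(i), trees.pop(j))
        sizes[new] = sizes.pop(i) + sizes.pop(j)
    (tree,) = trees.values()
    return tree, heights


# TOY distances, invented for illustration only (not taken from any matrix).
TOY_DIST = {
    frozenset("DE"): 1, frozenset("DN"): 3, frozenset("EN"): 3,
    frozenset("DK"): 6, frozenset("EK"): 6, frozenset("NK"): 5,
}
tree, heights = upgma(["D", "E", "N", "K"], TOY_DIST)
print(tree, heights)
```

Because the averaging is weighted by cluster size, each merge distance equals the mean over all leaf pairs between the two clusters, which is what distinguishes UPGMA from the weighted variant (WPGMA).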
The separation of hydrophobic and hydrophilic amino acids in our dendrogram (Fig. 2) is consistent with that in Fig. 1 of (Jiménez-Montaño, 1984). The only difference comes from the small neutral amino acids, which are in the non-hydrophobic group in our former publication and are grouped with the hydrophobic amino acids in the codon dendrogram.

The second, but no less important, result is that six of the ten most frequent amino acid substitution pairs obtained from simulations generated with our Markov information source, implemented with the help of the software tool GSEQUENCE (Jiménez-Montaño and Ramos-Fernández, 2013), agree with the pairs in the three sets of orthologs from E. coli-S. enterica, Human-Mouse and Human-Chicken, respectively, from Table 6 in supp. material from (Sasidharan and Chothia, 2007). These pairs are: DE, IV, ST, NS, AT, AS. The pair KR agrees with two of the three sets of orthologs. Additionally, we have in descending order the pairs VL, LM and FY, which are ranked twelve, thirteen and seventeen, respectively, in the same source. Seven of these amino acid substitution pairs agree with the codon pairs in the codon dendrogram (Fig. 2); they are: DE, IV, NS, AS, KR, LM, FY. These results are consistent with the most frequent amino acid exchanges found in (Schmitt et al., 2007). In our simulations, the corresponding codons outline approximately closed dynamics, as discussed in (Görnerup and Jacobi, 2010). Therefore, these cycles of the codon dynamics produce amino acid substitutions which are fixed in the population because of the similarity of the corresponding amino acids. These outcomes take us to the third and last result of this paper.
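The per-codon comparison reported in Table 2 pairs two lists: the substitution probabilities from a codon to its single-nucleotide neighbors, and the dissimilarity indices of the corresponding amino acids. A minimal sketch of such a calculation follows, assuming that the ordinary Pearson coefficient is meant. The numbers are invented placeholders, not values from REM or from Miyata et al. (1979).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# PLACEHOLDER data for one codon and five of its neighbors (invented values):
# probability of substitution to each neighbor, and the dissimilarity index
# between the original amino acid and the amino acid of that neighbor.
substitution_probs = [0.30, 0.22, 0.10, 0.05, 0.02]
dissimilarities = [0.4, 1.0, 1.8, 2.5, 3.1]

r = pearson(substitution_probs, dissimilarities)
print(round(r, 4))  # negative when frequent substitutions pair with small distances
```

A coefficient close to -1 for a codon means that its most probable substitutions lead to the most similar amino acids.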
AA/CODON   CORR COEF     AA/CODON   CORR COEF     AA/CODON   CORR COEF
K  AAA     -0.8791       I  ATA     -0.5617       D  GAC     -0.7193
K  AAG     -0.9266       I  ATC     -0.5668       D  GAT     -0.7201
N  AAC     -0.6477       I  ATT     -0.5791       A  GCA     -0.3349
N  AAT     -0.7568       M  ATG     -0.6726       A  GCC     -0.3815
T  ACA     -0.6894       Q  CAA     -0.7166       A  GCG      0.7656
T  ACC     -0.4522       Q  CAG     -0.5851       A  GCT     -0.4195
T  ACG      0.4707       H  CAC     -0.6559       G  GGA     -0.4947
T  ACT     -0.4727       H  CAT     -0.6578       G  GGC     -0.8047
R  AGA     -0.8543       P  CCA     -0.8373       G  GGG     -0.2542
R  AGG     -0.8397       P  CCC     -0.8056       G  GGT     -0.7988
R  CGA     -0.8787       P  CCG      0.3259       V  GTA     -0.4626
R  CGC     -0.9202       P  CCT     -0.7635       V  GTC     -0.6288
R  CGG     -0.9238       L  CTA     -0.6903       V  GTG     -0.8257
R  CGT     -0.9403       L  CTC     -0.964        V  GTT     -0.6141
S2 AGC     -0.7744       L  CTG     -0.9467       Y  TAC     -0.9487
S2 AGT     -0.6739       L  CTT     -0.9273       Y  TAT     -0.9621
S  TCA     -0.9765       L  TTA     -0.5547       C  TGC     -0.788
S  TCC     -0.8915       L  TTG     -0.7221       C  TGT     -0.7812
S  TCG     -0.7629       E  GAA     -0.6831       W  TGG     -0.7817
S  TCT     -0.9292       E  GAG     -0.8145       F  TTC     -0.4676
                                                  F  TTT     -0.4737

Table 2: Anti-correlation between the substitution probabilities from the reduced empirical matrix, REM (Jiménez-Montaño and He, 2009), and the physical-chemical dissimilarity index (distance) from (Miyata et al., 1979). For a given codon, e.g. AAA (K), I calculated the correlation between the list of substitution probabilities to its neighbors (AGA (R), GAA (E), etc.) and the list of values of the index for the associated amino acids (in parentheses).

In Table 2 we display the anti-correlation between the substitution probabilities from the reduced empirical matrix, REM (Jiménez-Montaño and He, 2009), and the physical-chemical dissimilarity index (distance) from (Miyata et al., 1979), which is based on the hydrophobicity and volume of amino acids. As expected, amino acid pairs whose codons substitute frequently have small values of the dissimilarity index, and vice versa. So, this well-known result from comparisons of amino acid substitution matrices is corroborated at the codon level.

7. CONCLUSIONS

After presenting a general conceptual framework for the analysis of protein evolution, we introduced a theoretical model, which consists of a Markov Information Source that generates codon sequences, and from them amino acid sequences, that maintain the same or very similar functions and structures. This invariance is a consequence not only of natural selection (which preserves the sequences that obey general physical-chemical constraints responsible for the stability of the protein), but also of the structure of the genetic code, which controls the amino acid changes that are possible through single-nucleotide mutations. With the help of the model, we introduced a syntactic formulation (codon dendrogram) to describe a hierarchy of codon categorizations which explains the pattern of frequent amino acid substitutions in short-term evolution. From our computer simulations (Jiménez-Montaño and Ramos-Fernández, 2013) we interpreted the reduced amino acid alphabets simply as a result of the various codon sub-dynamics among different clusters of codons, which are neighbors according to the topology of the genetic code space.

Acknowledgements: I wrote this paper while commissioned at Dirección General de Investigaciones de la Universidad Veracruzana. I want to thank director César I. Beristain-Guevara for his support. I also thank Q.F.B. Antero Ramos-Fernández for his help in doing some calculations and preparing the tables and figures. I thank David Abel for suggestions to make the manuscript clearer and for some references. I express thanks to Sistema Nacional de Investigadores, México, for partial support. Finally, I thank my wife, Ma. Eta. Castellanos G., for her patience and understanding.

REFERENCES

Abel, D.L. and Trevors, J.T. (2006) More than Metaphor: Genomes are Objective Sign Systems, Journal of BioSemiotics 1 253-267.
Arnold, F.H. (2011) The Library of Maynard-Smith: My Search for Meaning in the Protein Universe, Microbe, ASM News 6(7) 316-318.
Ash, R. (1965) Information Theory, New York: Interscience Publishers, 339pp.
Axe, D.D. (2004) Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds, Journal of Molecular Biology 341 1295-1315.
Benner, S.A., Cohen, M.A. and Gonnet, G.H. (1994) Amino acid substitution during functionally constrained divergent evolution of protein sequences, Protein Engineering 7 1323-1332.
Cannata, N., Toppo, S., Romualdi, C. and Valle, G. (2002) Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices, Bioinformatics 18 1102-1108.
Crooks, G.E. and Brenner, S.E. (2005) An alternative model of amino acid replacement, Bioinformatics 21 975-980.
Crutchfield, J.P. and Schuster, P. (2003) Evolutionary Dynamics: Exploring the Interplay of Accident, Selection, Neutrality, and Function, Oxford University Press, New York, 452pp.
Dayhoff, M.O., Eck, R.V. and Park, C.M. (1972) A model of evolutionary change in proteins. In: Dayhoff M, ed. Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, D.C., 5 89-99pp.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) A model of evolutionary change in proteins. In: Dayhoff M, ed. Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, D.C., 5(3) 345-352pp.
DePristo, M.A., Weinreich, D.M. and Hartl, D.L. (2005) Missense meanderings in sequence space: a biophysical view of protein evolution, Nature Reviews Genetics 6 678-687.
Dokholyan, N.V. and Shakhnovich, E.I. (2001) Understanding hierarchical protein evolution from first principles, Journal of Molecular Biology 312 289-307.
Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction, Springer-Verlag, New York, 476pp.
Fan, K. and Wang, W. (2003) What is the Minimum Number of Letters Required to Fold a Protein?, Journal of Molecular Biology 328 921-926.
Goldman, N. and Yang, Z. (1994) A Codon-based Model of Nucleotide Substitution for Protein-coding DNA Sequences, Molecular Biology and Evolution 11 725-736.
Gonnet, G.H., Cohen, M.A. and Benner, S.A. (1992) Exhaustive matching of the entire protein sequence database, Science 256 1443-1445.
Görnerup, O. and Jacobi, M.N. (2010) A model-independent approach to infer hierarchical codon substitution dynamics, BMC Bioinformatics 11 201.
Grantham, R. (1974) Amino acid difference formula to help explain protein evolution, Science 185 862-864.
Halpern, A.L. and Bruno, W.J. (1998) Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies, Molecular Biology and Evolution 15 910-917.
Hamming, R.W. (1950) Error detecting and error correcting codes, Bell System Technical Journal 29 147-160.
Hietpas, R.T., Jensen, J.D. and Bolon, D.N.A. (2011) Experimental illumination of a fitness landscape, Proceedings of the National Academy of Sciences 108 7896-7901.
Jiménez-Montaño, M.A. (1984) On the syntactic structure of protein sequences and the concept of grammar complexity, Bulletin of Mathematical Biology 46 641-659.
Jiménez-Montaño, M.A. (1994) On the Syntactic Structure and Redundancy Distribution of the Genetic Code, BioSystems 32 11-23.
Jiménez-Montaño, M.A. (2004) Applications of Hyper Genetic Code to Bioinformatics, Journal of Biological Systems 12 5-20.
Jiménez-Montaño, M.A. and He, M. (2009) Irreplaceable Amino Acids and Reduced Alphabets in Short-term and Directed Protein Evolution. In Bioinformatics Research and Applications. Mandoiu, Ion; Narasimhan, Giri; Zhang, Yanquing (Eds.). Springer-Verlag Berlin Heidelberg, 297-309pp.
Jiménez-Montaño, M.A. and Ramos-Fernández, A. (2013) Simulation of protein evolution with a Markovian empirical codon-substitution model. Manuscript in preparation.
Jiménez-Montaño, M.A., de la Mora-Basáñez, R. and Pöschel, T. (1996) The Hypercube Structure of the Genetic Code Explains Conservative and Non-Conservative Aminoacid Substitutions in Vivo and in Vitro, BioSystems 39 117-125.
Johnson, M.S. and Overington, J.P. (1993) A structural basis for sequence comparisons: an evaluation of scoring methodologies, Journal of Molecular Biology 233 716-738.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences, Computer Applications in the Biosciences 8 275-282.
Karasev, V.A. and Soronkin, S.G. (1997) Topological structure of the genetic code, Russian Journal of Genetics 33 622-628.
Kauffman, S. (1989) Adaptation on Rugged Fitness Landscapes. In Lectures in the Sciences of Complexity. Stein, D.L., Editor. Addison-Wesley Publishing Company, Redwood City, California, 527-618pp.
Kleiger, G., Beamer, L.J., Grothe, R., Mallick, P. and Eisenberg, D. (2000) The 1.7 Å Crystal Structure of BPI: A Study of How Two Dissimilar Amino Acid Sequences can Adopt the Same Fold, Journal of Molecular Biology 299 1019-1034.
Kosiol, C. and Goldman, N. (2011) Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models, Journal of Molecular Biology 411 910-923.
Lesk, A.M. and Chothia, C. (1980) How different amino acid sequences determine similar protein structures: The structure and evolutionary dynamics of the globins, Journal of Molecular Biology 136 225-270.
Li, T., Fan, K., Wang, J. and Wang, W. (2003) Reduction of protein sequence complexity by residue grouping, Protein Engineering 16 323-330.
Lim, V.I. (1974) Algorithms for prediction of alpha-helical and beta-structural regions in globular proteins, Journal of Molecular Biology 88 873-94.
Markov, A.A. (1913) Primer statisticheskogo issledovanija nad tekstom `Evgenija Onegina' illjustrirujuschij svjaz' ispytanij v tsep' (An example of statistical study on the text of `Eugene Onegin' illustrating the linking of events to a chain), Izvestija Imp. Akademii nauk, serija VI, 3 153-162.
Miyata, T., Miyazawa, S. and Yasunaga, T. (1979) Two types of amino acid substitutions in protein evolution, Journal of Molecular Evolution 12 219-236.
Müller, T. and Vingron, M. (2000) Modeling amino acid replacement, Journal of Computational Biology 7 761-776.
Murphy, L.R., Wallqvist, A. and Levy, R.M. (2000) Simplified amino acid alphabets for protein fold recognition and implications for folding, Protein Engineering 13 149-152.
Muse, S.V. and Gaut, B.S. (1994) A Likelihood Approach for Comparing Synonymous and Nonsynonymous Nucleotide Substitution Rates, with Application to the Chloroplast Genome, Molecular Biology and Evolution 11 715-724.
Manning, C.D. and Schutze, H. (1999) Foundations of Statistical Natural Language Processing, Cambridge, Massachusetts, MIT Press, 680pp. Reprint: Cambridge, Massachusetts, MIT Press 2003.
Nielsen, R. (2005) Statistical Methods in Molecular Evolution, Springer Verlag, New York, 508pp.
Pál, C., Papp, B. and Lercher, M.J. (2006) An integrated view of protein evolution, Nature Reviews Genetics 7 337-348.
Parisi, G. and Echave, J. (2001) Structural constraints and emergence of sequence patterns in protein evolution, Molecular Biology and Evolution 18 750-756.
Parkhomchuk, D., Amstislavskiy, V., Soldatov, A. and Ogryzko, V. (2009) Use of high throughput sequencing to observe genome dynamics at a single cell level, Proceedings of the National Academy of Sciences 106 20830-20835.
Petoukhov, S.V. (1999) Genetic code and the ancient Chinese book of changes, Symmetry: Culture and Science 10 211-226.
Prince, A. and Smolensky, P. (1997) Optimality: From Neural Networks to Universal Grammar, Science 275 1604-1610.
Sanchez, I.E., Tejero, J., Gomez-Moreno, C., Medina, M. and Serrano, L. (2006) Point Mutations in Protein Globular Domains: Contributions from Function, Stability and Misfolding, Journal of Molecular Biology 363 422-432.
Sánchez, R., Morgado, E. and Grau, R. (2004) The Genetic Code Boolean Lattice, Communications in Mathematical and in Computer Chemistry 52 29-46.
Sander, C. and Schulz, G.E. (1979) Degeneracy of the information contained in amino acid sequences: Evidence from overlaid genes, Journal of Molecular Evolution 13 245-252.
Sasidharan, R. and Chothia, C. (2007) The selection of acceptable protein mutations, Proceedings of the National Academy of Sciences 104 10080-10085.
Schmitt, A.O., Schuchhardt, J., Ludwig, A. and Brockmann, G.A. (2007) Protein evolution within and between species, Journal of Theoretical Biology 249 376-383.
Schneider, A., Cannarozzi, G.M. and Gonnet, G.H. (2005) Empirical codon substitution matrix, BMC Bioinformatics 6 134.
Shannon, C. (1948) A Mathematical Theory of Communication, The Bell System Technical Journal 27 379-423, 623-656, July, October.
Skipper, M., Dhand, R. and Campbell, P. (2012) Nature/Encode. 2001 Will always be remembered as the year of the human genome, Nature 489 45.
The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome, Nature 489 57-74. http://www.nature.com/encode/
Smith, J.M. (1970) Natural Selection and the Concept of a Protein Space, Nature 225 563-564.
Sneath, P.H.A. (1966) Relations between chemical structure and biological activity in peptides, Journal of Theoretical Biology 12 157-195.
Solis, A.D. and Rackovsky, S. (2000) Optimized representations and maximal information in proteins, Proteins: Structure, Function, and Bioinformatics 38 149-164.
Stambuk, N. (2000) Universal metric properties of the genetic code, Croatica Chemica Acta 73 1123-1139.
Swanson, R. (1984) A unifying concept for the amino acid code, Bulletin of Mathematical Biology 46 187-203.
Thorne, J.L. and Goldman, N. (2001) Probabilistic models for the study of protein evolution. Balding, D.J., Bishop, M., Cannings, C. (Eds.), Handbook of Statistical Genetics. John Wiley, Chichester, UK, 67-82pp.
Venkatarajan, M.S. and Braun, W. (2001) New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties, Journal of Molecular Modeling 7 445-453.
Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F. and Carrasco, R.C. (2005) Probabilistic finite-state machines, Part I, IEEE Trans. Pattern Analysis and Machine Intelligence 27 1013-1025.
Yang, Z. (2006) Computational molecular evolution, Oxford: Oxford University Press, 374pp.
Wagner, A. (2012) The Role of Randomness in Darwinian Evolution, Philosophy of Science 79 95-119.

Symmetry: Culture and Science Vol. 23, No. 3-4, 343-375, 2012

SYMMETRIES IN MOLECULAR-GENETIC SYSTEMS AND MUSICAL HARMONY

G. Darvas*, A.A. Koblyakov**, S.V. Petoukhov***, I.V. Stepanian****

* Physicist, philosopher (b. Budapest, Hungary, 1948).
Address: Symmetrion, 29 Eötvös St. Budapest, H-1067 Hungary; [email protected].
Fields of interest: symmetry in arts and sciences, especially physics; interrelations of sciences and arts.
Publication: Symmetry, Basel: Birkhauser, (2008), xi+508 p.

** Composer, musicologist (b. Kuibyshev, Russia, 1951).
Address: dean of Composer Faculty, Moscow State Conservatory by P.I. Tchaikovsky, Bolshaya Nikitskaya street 13/6, 125009 Moscow, Russian Federation. E-mail: [email protected]
Fields of interest: music, interdisciplinary research, logic, mathematics, biology, physics.
Awards: Laureate of the International Composers' Competition.
Publications: 1) Synergetics and creativity // Synergetic paradigm. Moscow, 2000 (in Russian); 2) From disjunction to conjunction (the contours of the general theory of creation) // The language of science - the languages of art. Moscow, 2000 (in Russian); 3) Semantic aspects of self-similarity in music // Symmetry: culture and science, v.6, number 2, 1995; 4) About one model defining art in its broadest sense // Sustainable development, science and practice. M., 2003, № 2 (in Russian); 5) Discrete and continuous in the field of music from the viewpoint of problem-sense approach // Proceedings of the International Conference "Mathematics and Art", Moscow, 1997 (in Russian).

*** Biophysicist, bioinformatist (b. Moscow, Russia, 1946).
Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of Russian Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail: [email protected].
Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony, mathematical crystallography (also history of sciences, oriental medicine).
Awards: Gold medal of the Exhibition of Economic Achievements of the USSR, 1974; State Prize of the USSR, 1986; Honorary diplomas of a few international conferences and organizations, 2005-2012.
Publications: 1) S.V. Petoukhov (1981) Biomechanics, Bionics and Symmetry. Moscow, Nauka, 239 pp. (in Russian); 2) S.V. Petoukhov (1999) Biosolitons. Fundamentals of Soliton Biology. Moscow, GPKT, 288 pp. (in Russian); 3) S.V. Petoukhov (2008) Matrix Genetics, Algebras of the Genetic Code, Noise-immunity. Moscow, RCD, 316 pp. (in Russian); 4) S.V. Petoukhov, M. He (2010) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and Applications, Hershey, USA: IGI Global, 271 pp.; 5) He M., Petoukhov S.V. (2011) Mathematics of Bioinformatics: Theory, Practice, and Applications. USA: John Wiley & Sons, Inc., 295 pp.

**** Mathematician, biologist (b. Moscow, Russia, 1980).
Fields of interest: algebraic biology, quantum neural computing, DNA music therapy, bioinformatics and bionics.
Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of Russian Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail: [email protected]
Publications: 1) I.V. Stepanian (2011) Neural network algorithms for acoustic spirometry data recognition (method of diagnosis of pulmonary occupational diseases). LAP LAMBERT Academic Publishing GmbH & Co.: Saarbrücken, Germany, 200 pp. (ISBN: 978-3-8473-2767-7); 2) I.V. Stepanian, A.L. Krugly (2011) An example of the stochastic dynamics of a causal set, in Foundations of Probability and Physics - 6, Växjö-Kalmar, Sweden, 14-16 June 2011, AIP Conference Proceedings, V. 1424, edited by Mauro D’Ariano, Shao-Ming Fei, Emmanuel Haven, Beatrix Hiesmayr, Gregg Jaeger, Andrei Khrennikov, and Jan-Åke Larsson, (2012), pp. 206-210 (arXiv: 1111.5474 [gr-qc]); 3) S.V. Petoukhov, V.I. Svirin, I.V. Stepanian (2012) Matrix genetics, hypercomplex numbers and the rules of long genetic sequences. Proceedings of the VIII International conference «Finsler extensions of relativity theory», Moscow-Fryazino, 25 June-1 July, 2012, p. 71-72.

Abstract: The Moscow State Conservatory by P.I. Tchaikovsky has recently created a special “Center for interdisciplinary researches of musical creativity”. One of the main tasks of this center is to study genetic musical scales from different viewpoints, including new opportunities for composers and for musical therapy. This article is devoted to scientific aspects of the genetic musical scales, which are based on symmetric features of molecular ensembles of genetic systems.
These musical scales were revealed in the course of a symmetrological study of representations of molecular-genetic ensembles in a unified form of mathematical matrices (Kronecker families of genetic matrices). This study has discovered a relation of genetic systems with the golden section and Fibonacci numbers, which play a role in a hierarchical system of these musical scales and which are well known in biological phyllotaxis laws and in the aesthetics of proportions. Some historical and biological aspects of musical harmony are also considered.

Keywords: symmetry, musical harmony, genetic code, golden section, Fibonacci numbers.

1. ABOUT THE GENETIC CODING SYSTEM AND GENETICALLY INHERITED PERCEPTION OF MUSIC

Since ancient times, understanding the phenomenon of music and building musical structures have been associated with mathematics. G. Leibniz, the creator of one of the first mechanical calculating machines, wrote: “Music is a secret arithmetical exercise and the person who indulges in it does not realize that he is manipulating numbers” and “music is the pleasure the human mind experiences from counting without being aware that it is counting” (http://thinkexist.com/quotes/g._wilhelm_leibniz/).

The range of human sound perception contains an infinite set of sound frequencies. Pythagoras discovered that certain mathematical rules, based on integers, allow one to separate from this infinite set of frequencies a discrete set of frequencies which determines a harmonious sound set. In other words, certain combinations of sounds from this set are perceived by living organisms as pleasant to hear (consonances). In addition, Pythagoras linked the phenomenon of harmonic sounds with the parameters of a physical object: the oscillation frequencies of a stretched string, the length of which is varied in accordance with appropriate numerical rules.
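One classical way to obtain such a discrete set from integer ratios is to stack perfect fifths (frequency ratio 3:2) and fold each result back into a single octave. The sketch below assumes this standard construction of the Pythagorean scale; the function name, the choice of seven notes and the starting fifth are illustrative assumptions, not taken from this article.

```python
from fractions import Fraction


def pythagorean_scale(n_notes=7, start=-1):
    """Frequency ratios obtained by stacking fifths and folding into [1, 2).

    The k-th note is (3/2)**k for k = start, ..., start + n_notes - 1,
    multiplied or divided by 2 until it lies within one octave.
    """
    ratios = []
    for k in range(start, start + n_notes):
        r = Fraction(3, 2) ** k
        while r >= 2:   # too high: drop an octave
            r /= 2
        while r < 1:    # too low: raise an octave
            r *= 2
        ratios.append(r)
    return sorted(ratios)


scale = pythagorean_scale()
print([str(r) for r in scale])
# String lengths are inversely proportional to frequency on a stretched string.
print([str(1 / r) for r in scale])
```

Exact fractions are used instead of floating-point numbers so that the integer character of the ratios stays visible.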
But these discoveries by Pythagoras do not exclude the existence of other discrete sets of sound frequencies that also form harmonious sets of sounds. This article describes some results of research on molecular ensembles of the genetic coding system. The results reveal that sets of parameters of this molecular genetic system are related to the well-known Pythagorean musical scale and also to a hierarchy of special mathematical sets. This hierarchy can be interpreted and used as the basis of a new system of musical scales, because the corresponding sets of sound frequencies may possess harmonic properties for human hearing. According to our assumption, it is essential that these musical systems be connected with the molecular-genetic system, because the phenomenon of musical perception is inherited.

Scientific studies of the physiological mechanisms of musical perception began long ago. A review of this topic can be found in the article (Weinberger, 2004). Infants as young as 4 months turn toward a source of pleasant sounds (consonances) and turn away from a source of unpleasant sounds (dissonances). The human brain does not possess a special center of music. The feeling of love for music seems to be dispersed throughout the whole organism. Musical sound addresses everything in a person, or a person’s archetypes. There are reports that the first cry of a newborn baby corresponds, as a rule, to the frequency of the musical note “la” (440 Hz), irrespective of its timbre and loudness (http://www.rods.ru/Html/Russian/MoreResonance.html). This frequency is traditionally used for tuning musical instruments by means of a tuning fork. This points to a certain biological unification of musical sounds. According to statistics, physical reactions to music (in the form of skin reactions, tears, laughter, etc.) arise in 80 % of adults. Animals are also not indifferent to human music. All such data show that the perception of music has a biological essence and that the feeling of musical harmony is based on inborn mechanisms. Therefore it is necessary to search for connections of the genetic system with musical harmony. This article presents such a search.

It can be mentioned that thoughts about the key significance of musical harmony in the organization of the world have existed since ancient times. For example, the Pythagoreans thought about musical intervals in the planetary system and in everything around; J. Kepler wrote the famous book Harmonices Mundi; and so on. Modern atomic physics found harmonic ratios in the spectral series of T. Lyman in the hydrogen atom, which A. Einstein and A. Sommerfeld named the “music of atomic spheres” (Voloshinov, 2000). The importance of Pythagorean ideas about the role of musical harmony was also emphasized by the Nobel Prize winner in physics R. Feynman (1963, v. 4, Chapter 50).

Living substance is frequently compared with crystals. For example, E. Schrödinger (1955) named it an “aperiodic crystal”. Do the annals of modern science contain any data about a connection of musical harmony with crystals? Yes, such data exist (see, for example, the book (Berger, 1997, p. 270-281)). In 1818, C.S. Weiss, who discovered the crystallographic systems and was one of the founders of crystallography, emphasized a musical analogy in crystallographic systems. He investigated ratios among segments formed by the faces of crystals of the cubic system. Weiss showed that these ratios are exactly identical to ratios between musical tones. In 1829, J. Grassmann, who wrote the well-known book “Zur Physischen Kristallonomie und Geometrischen Combinationslehre” and developed many mathematical methods in crystallography, noted impressive musical analogies in the field of crystallography.
He described many analogies between the ratios of musical tones and the ratios of segments formed by faces of the same zone of crystals. According to his figurative expression, a “crystal polyhedron is a fallen asleep chord - a chord of the molecular fluctuations made in time of its formation” (from (Berger, 1997, p. 270)). At the end of the 1890s the outstanding crystallographer V. Goldschmidt returned to the same ideas. The prominent Russian mineralogist and geochemist A.E. Fersman wrote about his publications on this theme: “These works represent the historical page in crystallography, which has led Goldschmidt to revealing by him laws of harmonic ratios. Goldschmidt has extended these laws logically from the world of crystals into the world of other correlations in the regions of paints, colors, sounds and even biological correlations. It has become one of the most favourite themes of philosophical researches by Goldschmidt” (from (Berger, 1997, p. 270)). This list of historical examples can be continued.

Taking into account that Schrödinger called living substance an aperiodic crystal and that the classics of crystallography emphasized a connection between crystal structures and musical harmony, it seems natural to try to find traces of musical harmony in living substance as well. The idea of a possible participation of musical harmony in the organization of biological organisms is not new for modern biophysics. For example, the famous Russian biophysicist S. Shnoll (1989) wrote: “From possible consequences of interaction of macromolecules of enzymes, which are carrying out conformational (cyclic) fluctuations, we shall consider pulsations of pressure - sound waves. The range of numbers of turns of the majority of enzymes corresponds to acoustic sound frequencies.
We shall consider … a fantastic picture of "musical interactions" among biochemical systems, cells, bodies, and a possible physiological role of these interactions. …… It leads to pleasant thoughts about nature of hearing, about an origin of musical perception and about many other things, which already belong to the area of biochemical aesthetics”. The term “biochemical aesthetics”, proposed by Shnoll, reflects the subject matter of our article.

Let us recall some fundamental notions of the theory of musical harmony. Each musical note is characterized by a certain sound frequency. For a musical melody, the ratios between the frequencies of neighboring notes are important, not the absolute values of the frequencies of separate notes. For this reason a melody is easily recognized irrespective of the acoustic range of frequencies in which it is produced, for example, by a child, a woman or an adult man with quite different voices. An aggregate of frequency ratios between the sounds of a musical system is called a musical scale. The same note, for example the note “do”, is perceived as the same note if its frequency is doubled or halved, that is, if it belongs to another octave. The interval of frequencies from some note frequency $f_0$ up to the frequency $2f_0$ is called an octave. Each note “do” is usually considered the beginning of the corresponding octave. For example, the first octave reaches from a frequency of approximately 260 Hz (the note “do” of the first octave) up to the doubled frequency 520 Hz (the note “do” of the second octave). Only a small number of discrete frequencies within the octave range are traditionally used for musical notes. The notes that correspond to these frequencies form a certain sequence in ascending order of frequency. A musical scale represents a sequence of numerical values (“interval values”) between the frequencies of adjacent notes (musical tones).
For Europeans the idea of the musical harmony of the universe is connected mainly with the name of Pythagoras and his school. Following ancient thinkers (first of all, ancient Chinese thinkers), the Pythagoreans considered that the world is arranged according to principles of musical harmony. The Pythagorean musical scale, which is based on the quint ratio 3:2, played the main role in these views. One should note that this musical scale was known in Ancient China long before Pythagoras, who presumably became acquainted with it during his life in Egypt and Babylon (these questions are analyzed in detail in the book (Needham, v.4, 1962)). In Ancient China this quint musical scale had a cosmic meaning connected with “The Book of Changes” (“I Ching”): the numbers 2 and 3 were named there the “numbers of Earth and Heaven”. Following Ancient China, the Pythagoreans considered the numbers 2 and 3 as the female and male numbers, which can give birth to new musical tones in their interconnection. According to some data, the quint system is the most ancient among the known systems in the history of musical scales (http://www.arbuz.uz/t_octava.html).

The Ancient Greeks attached extraordinary significance to the search for the quint 3:2 in natural systems because of their thoughts about musical harmony in the organization of the world. For example, the great mathematician and mechanician Archimedes considered the detection of the quint 3:2 between the volumes and the areas of a cylinder and a sphere inscribed in it (Voloshinov, 2000) to be the best result of his life. These geometrical figures with the quint ratio were pictured on his gravestone in accordance with Archimedes’ will. Thanks to these figures Cicero found Archimedes’ grave 200 years after his death. This article demonstrates, in particular, the connection of the Kronecker family of the genomatrices of hydrogen bonds with the Pythagorean musical scale based on the quint ratio 3:2.

2. NUMERIC GENOMATRICES OF HYDROGEN BONDS

One of the effective methods of cognition of a complex natural system, including the genetic coding system, is the investigation of symmetries. Modern science knows that deep knowledge about phenomenological relations of symmetry among separate parts of a complex natural system can reveal many important things about the evolution and mechanisms of such systems. This article studies some symmetry properties of the genetic coding system by means of matrix representation and analysis of molecular ensembles of the genetic system. The initial choice of this form of presentation of molecular ensembles of the genetic code is explained by the following main reasons.

Information is usually stored in computers in the form of matrices.

The genetic coding system provides noise-immunity properties; noise-immunity codes are constructed on the basis of matrices.

Genetic molecules obey the principles of quantum mechanics, which utilizes matrix operators. A connection between genetic matrices and these matrix operators can be revealed. The significance of the matrix approach is emphasized by the fact that quantum mechanics arose in the form of the matrix mechanics of W. Heisenberg.

Complex and hypercomplex numbers, which are utilized in physics and mathematics, possess matrix forms of presentation. The notion of number is the main notion of mathematics and the mathematical natural sciences. In view of this, investigation of a possible connection of the genetic code with multidimensional numbers in their matrix presentations can lead to very significant results.

Matrix analysis is one of the main investigation tools in the mathematical natural sciences. The study of possible analogies between matrices that are specific to the genetic code and famous matrices from other branches of science can be heuristic and useful.
Matrices, which unite many components into a single whole, are subject to certain mathematical operations, which determine substantial connections between collectives of many components. Such connections can be essential for collectives of genetic elements of different levels as well.

In the history of science, the first publication about a matrix representation of molecular ensembles of the genetic coding system was the work (Konopel’chenko, Rumer, 1975), which studied symmetries in the genetic system. It represented the genetic alphabet A (adenine), C (cytosine), G (guanine), T (thymine) and the set of 16 duplets in the form of the two square matrices $[C\ G;\ T\ A]$ and $[C\ G;\ T\ A]^{(2)}$ respectively (here the superscript (2) denotes the Kronecker power).

In our article we continue studying the molecular genetic system in its matrix forms of representation, which has given some interesting results in the last 10 years (Petoukhov, 2005, 2008; Petoukhov, He, 2010). In these works we studied the Kronecker family of genetic matrices $[C\ T;\ A\ G]^{(n)}$ (here the superscript (n) denotes the Kronecker power), the first representatives of which are shown in Fig. 1.

\[
[C\ T;\ A\ G] = \begin{pmatrix} C & T \\ A & G \end{pmatrix}, \qquad
[C\ T;\ A\ G]^{(2)} = \begin{pmatrix}
CC & CT & TC & TT \\
CA & CG & TA & TG \\
AC & AT & GC & GT \\
AA & AG & GA & GG
\end{pmatrix}
\]

\[
[C\ T;\ A\ G]^{(3)} = \begin{pmatrix}
CCC & CCT & CTC & CTT & TCC & TCT & TTC & TTT \\
CCA & CCG & CTA & CTG & TCA & TCG & TTA & TTG \\
CAC & CAT & CGC & CGT & TAC & TAT & TGC & TGT \\
CAA & CAG & CGA & CGG & TAA & TAG & TGA & TGG \\
ACC & ACT & ATC & ATT & GCC & GCT & GTC & GTT \\
ACA & ACG & ATA & ATG & GCA & GCG & GTA & GTG \\
AAC & AAT & AGC & AGT & GAC & GAT & GGC & GGT \\
AAA & AAG & AGA & AGG & GAA & GAG & GGA & GGG
\end{pmatrix}
\]

Figure 1: The first members of the Kronecker family of genetic symbolic matrices $[C\ T;\ A\ G]^{(n)}$. Here A, C, G and T are adenine, cytosine, guanine and thymine respectively.

Numeric genomatrices can be derived by replacing each symbol A, C, G, T of the nitrogenous bases in the symbolic genomatrices $[C\ T;\ A\ G]^{(n)}$ (Figure 1) with quantitative parameters of these bases.
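The Kronecker construction of Figure 1 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names `kron_symbolic` and `kron_power` are hypothetical, and the "product" of two symbols is taken to be string concatenation with the letter of the left factor written first, which is the convention that reproduces the rows of Figure 1.

```python
def kron_symbolic(a, b):
    """Kronecker product of two matrices of strings.

    The product of two entries is their concatenation (an assumption made
    here to mimic the symbolic multiplets of Figure 1).
    """
    return [[x + y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kron_power(base, n):
    """n-th Kronecker power of a symbolic matrix, for n >= 1."""
    result = base
    for _ in range(n - 1):
        result = kron_symbolic(base, result)
    return result


BASE = [["C", "T"], ["A", "G"]]  # the alphabet matrix [C T; A G]

second = kron_power(BASE, 2)  # 4 x 4 matrix of the 16 duplets
third = kron_power(BASE, 3)   # 8 x 8 matrix of the 64 triplets

for row in third:
    print(" ".join(row))
```

The printed rows can be compared directly with the 8 x 8 matrix of triplets in Figure 1; for instance, the triplet CAT appears in the third row and second column.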
For example, let us consider the genomatrices of hydrogen bonds of these nitrogenous bases. Different authors have long suspected that the numbers of hydrogen bonds, 2 and 3, of the complementary letters of the genetic alphabet carry important informational meaning. In addition, hydrogen plays the main role in the composition of our Universe, where hydrogen accounts for 93% of all atoms and where the “chemical influence of omnipresent hydrogen is the defining factor” (Ponnamperuma, 1972). Thus the investigation of a possible meaning of hydrogen bonds in genetic information deserves special interest.

The complementary letters C and G have 3 hydrogen bonds (C = G = 3) and the complementary letters A and T have 2 hydrogen bonds (A = T = 2). Let us replace each multiplet in the Kronecker family of the genomatrices $[C\ T;\ A\ G]^{(n)}$ by the product of the numbers of hydrogen bonds of its letters. In this case, we get the Kronecker family of numeric matrices $[3\ 2;\ 2\ 3]^{(n)}$. For example, the triplet CAT is replaced by the number 12 (= 3·2·2) in the genomatrix $[3\ 2;\ 2\ 3]^{(3)}$. Figure 2 shows the three initial genomatrices of this Kronecker family $[3\ 2;\ 2\ 3]^{(n)}$ constructed in this way. The numeric characteristics of each genomatrix $[3\ 2;\ 2\ 3]^{(n)}$ are connected with the quint ratio 3:2; for this reason we conditionally call such genomatrices quint genomatrices.

\[
Q = \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}, \qquad
Q^{(2)} = \begin{pmatrix}
9 & 6 & 6 & 4 \\
6 & 9 & 4 & 6 \\
6 & 4 & 9 & 6 \\
4 & 6 & 6 & 9
\end{pmatrix}
\]

\[
Q^{(3)} = \begin{pmatrix}
27 & 18 & 18 & 12 & 18 & 12 & 12 & 8 \\
18 & 27 & 12 & 18 & 12 & 18 & 8 & 12 \\
18 & 12 & 27 & 18 & 12 & 8 & 18 & 12 \\
12 & 18 & 18 & 27 & 8 & 12 & 12 & 18 \\
18 & 12 & 12 & 8 & 27 & 18 & 18 & 12 \\
12 & 18 & 8 & 12 & 18 & 27 & 12 & 18 \\
12 & 8 & 18 & 12 & 18 & 12 & 27 & 18 \\
8 & 12 & 12 & 18 & 12 & 18 & 18 & 27
\end{pmatrix}
\]

Figure 2: The beginning of the family of the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$, which are based on the product of the numbers of hydrogen bonds (C = G = 3, A = T = 2).

3. THE NUMERIC GENOMATRICES AND THE GOLDEN SECTION

In biology, a genetic system provides the self-reproduction of biological organisms across generations.
In mathematics, the “golden section” (or the “divine proportion”) and its properties have been a mathematical symbol of self-reproduction since the Renaissance, and they were studied by Leonardo da Vinci, J. Kepler and many other prominent thinkers (see details in (Darvas, 2007; Shubnikov, Koptsik, 2005) and on the website “Museum of Harmony and Golden Section” by A. Stakhov, www.goldenmuseum.com). Is there any connection between these two systems? Yes, and this article demonstrates such an unexpected connection.

The golden section is the value $\varphi = (1+\sqrt{5})/2 = 1.618\ldots$ (sometimes the inverse of this value is called the golden section in the literature). If the simplest quint genomatrix $[3\ 2;\ 2\ 3]$ is raised to the power 1/2 in the ordinary sense (that is, if we take the square root), the result is the bi-symmetric matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi] = [3\ 2;\ 2\ 3]^{1/2}$, the matrix elements of which are equal to the golden section and to its inverse value. And if any other quint genomatrix $[3\ 2;\ 2\ 3]^{(n)}$ is raised to the power 1/2 in the ordinary sense, the result is the bi-symmetric matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)} = ([3\ 2;\ 2\ 3]^{(n)})^{1/2}$, the matrix elements of which are equal to the golden section in various integer powers, with elements of symmetry among these powers (Figure 3).

Let us recall what the square root of a nonsingular square matrix $M$ means. It is a square matrix $M^{1/2}$ whose second power is equal to the initial matrix $M$: $(M^{1/2})^2 = M$. Many software packages (for example, MATLAB) can compute square roots of nonsingular square matrices. Let us demonstrate, for example, that the golden genomatrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]$ is a square root of the quint genomatrix $[3\ 2;\ 2\ 3]$. Indeed, using the ordinary rules of matrix multiplication and the identity $\varphi^{2} + \varphi^{-2} = 3$, we get:

\[
\begin{pmatrix} \varphi & \varphi^{-1} \\ \varphi^{-1} & \varphi \end{pmatrix}
\begin{pmatrix} \varphi & \varphi^{-1} \\ \varphi^{-1} & \varphi \end{pmatrix}
=
\begin{pmatrix}
\varphi\varphi + \varphi^{-1}\varphi^{-1} & \varphi\varphi^{-1} + \varphi^{-1}\varphi \\
\varphi^{-1}\varphi + \varphi\varphi^{-1} & \varphi^{-1}\varphi^{-1} + \varphi\varphi
\end{pmatrix}
=
\begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}.
\]
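The same check can be carried out numerically for the first few Kronecker powers. The sketch below is only an illustration with hypothetical helper names (`kron`, `kron_power`, `matmul`, `max_abs_diff`); it uses plain Python lists rather than a matrix library, and it verifies the square-root relation by squaring the golden matrices instead of computing a matrix square root.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # the golden section

QUINT = [[3, 2], [2, 3]]
GOLDEN = [[PHI, 1 / PHI], [1 / PHI, PHI]]


def kron(a, b):
    """Kronecker product of two numeric matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kron_power(base, n):
    """n-th Kronecker power, for n >= 1."""
    result = base
    for _ in range(n - 1):
        result = kron(base, result)
    return result


def matmul(a, b):
    """Ordinary matrix product of two square matrices of equal size."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def max_abs_diff(a, b):
    """Largest absolute entrywise difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


for n in (1, 2, 3):
    golden_n = kron_power(GOLDEN, n)
    quint_n = kron_power(QUINT, n)
    # The difference should vanish up to floating-point rounding.
    print(n, max_abs_diff(matmul(golden_n, golden_n), quint_n))
```

The differences printed for n = 1, 2, 3 are at the level of floating-point rounding, in agreement with the relation between the golden and the quint genomatrices stated above.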
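Similar results can be checked for other corresponding pairs of the genomatrices: $([\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)})^2 = [3\ 2;\ 2\ 3]^{(n)}$. Matrices all of whose elements are equal to the golden section $\varphi$ in different integer powers can be referred to as “golden matrices”. Figuratively speaking, the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ have a secret substrate in the golden matrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ (below we will explain a deep geometrical relationship between the quint matrices and the golden matrices, which represent their square roots).

\[
[3\ 2;\ 2\ 3]^{1/2} = \begin{pmatrix} \varphi & \varphi^{-1} \\ \varphi^{-1} & \varphi \end{pmatrix}, \qquad
([3\ 2;\ 2\ 3]^{(2)})^{1/2} = \begin{pmatrix}
\varphi^{2} & \varphi^{0} & \varphi^{0} & \varphi^{-2} \\
\varphi^{0} & \varphi^{2} & \varphi^{-2} & \varphi^{0} \\
\varphi^{0} & \varphi^{-2} & \varphi^{2} & \varphi^{0} \\
\varphi^{-2} & \varphi^{0} & \varphi^{0} & \varphi^{2}
\end{pmatrix}
\]

\[
([3\ 2;\ 2\ 3]^{(3)})^{1/2} = \begin{pmatrix}
\varphi^{3} & \varphi^{1} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{-1} & \varphi^{-3} \\
\varphi^{1} & \varphi^{3} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{-3} & \varphi^{-1} \\
\varphi^{1} & \varphi^{-1} & \varphi^{3} & \varphi^{1} & \varphi^{-1} & \varphi^{-3} & \varphi^{1} & \varphi^{-1} \\
\varphi^{-1} & \varphi^{1} & \varphi^{1} & \varphi^{3} & \varphi^{-3} & \varphi^{-1} & \varphi^{-1} & \varphi^{1} \\
\varphi^{1} & \varphi^{-1} & \varphi^{-1} & \varphi^{-3} & \varphi^{3} & \varphi^{1} & \varphi^{1} & \varphi^{-1} \\
\varphi^{-1} & \varphi^{1} & \varphi^{-3} & \varphi^{-1} & \varphi^{1} & \varphi^{3} & \varphi^{-1} & \varphi^{1} \\
\varphi^{-1} & \varphi^{-3} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{3} & \varphi^{1} \\
\varphi^{-3} & \varphi^{-1} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{1} & \varphi^{3}
\end{pmatrix}
\]

Figure 3: The beginning of the Kronecker family of the golden matrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)} = ([3\ 2;\ 2\ 3]^{(n)})^{1/2}$, where $\varphi = (1+\sqrt{5})/2 = 1.618\ldots$ is the golden section.

The matrix elements of the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)} = ([3\ 2;\ 2\ 3]^{(n)})^{1/2}$ can be constructed directly from a combination of $\varphi$ and $\varphi^{-1}$ by the following algorithm. We take the corresponding multiplet of the genomatrix $[C\ T;\ A\ G]^{(n)}$ and change its letters C and G to $\varphi$. Then we take the letters A and T in this multiplet and change each of them to $\varphi^{-1}$. As a result, we obtain a chain with n links, where each link is $\varphi$ or $\varphi^{-1}$. The product of all such links gives the value of the corresponding matrix element in the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$. For example, in the case of the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(3)}$, let us calculate the matrix element located at the same place as the triplet CAT in the matrix $[C\ T;\ A\ G]^{(3)}$. According to the described algorithm, one should change the letter C to $\varphi$ and the letters A and T to $\varphi^{-1}$. In the considered example, we obtain the following product: $\varphi \cdot \varphi^{-1} \cdot \varphi^{-1} = \varphi^{-1}$.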
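The letter-replacement rule can be written as a short function. The sketch below is illustrative only: the names `HYDROGEN_BONDS`, `GOLDEN_EXPONENT`, `hydrogen_product` and `golden_exponent` are our own, and the golden entry is stored as an integer exponent of φ rather than as a floating-point value, so that it can be compared with the exponents shown in Figure 3.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # the golden section

# Numbers of hydrogen bonds of the complementary letters (C = G = 3, A = T = 2).
HYDROGEN_BONDS = {"C": 3, "G": 3, "A": 2, "T": 2}

# Exponent of phi assigned to each letter: C, G -> phi, A, T -> phi^-1.
GOLDEN_EXPONENT = {"C": 1, "G": 1, "A": -1, "T": -1}


def hydrogen_product(multiplet):
    """Entry of the quint genomatrix at the place of the given multiplet."""
    return math.prod(HYDROGEN_BONDS[letter] for letter in multiplet)


def golden_exponent(multiplet):
    """Exponent k such that the golden-matrix entry at this place is phi**k."""
    return sum(GOLDEN_EXPONENT[letter] for letter in multiplet)


print(hydrogen_product("CAT"))        # product 3 * 2 * 2
print(golden_exponent("CAT"))         # exponent 1 - 1 - 1
print(PHI ** golden_exponent("CAT"))  # numerical value of phi^-1
```

For the triplet CAT this gives the number 12 in the quint genomatrix and the exponent -1 in the golden genomatrix, as in the worked examples of the text.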
The value $\varphi^{-1}$ obtained for the triplet CAT is the desired matrix element of the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(3)}$ in Figure 3.

The ratio between adjacent numbers in the numerical sequences inside each of the matrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ (for example, $\ldots, \varphi^{3}, \varphi^{1}, \varphi^{-1}, \varphi^{-3}, \ldots$) is always equal to $\varphi^{2}$ (or $\varphi^{-2}$). The same ratio $\varphi^{2}$ exists in regular five-pointed stars (Figure 4) as the ratio between the sides of adjacent stars inscribed in each other. Below we will use the name “pentagram musical scales” for the new musical scales connected with the golden genomatrices. In view of this, let us recall that the pentagram and its metaphysical associations were explored by the Pythagoreans, who considered it an emblem of perfection and health. The Pythagoreans swore by it and used the pentagram as a distinctive sign of belonging to their community. But the pentagram was known long before Pythagoras as a sign that protects from all evil; in Ancient Babylon it was depicted on the doors of stores and warehouses to protect goods from damage and theft. It was also a sign of power and was used on royal seals. The first known images of pentagrams date back to around 3500 BC; they were found on the territory of Ancient Mesopotamia. For early Christians, the pentagram was a reminder of the five wounds of Christ: from the crown of thorns on his forehead and from the nails in his hands and feet.

Figure 4: The sizes of pentagrams inscribed in each other differ by the scale factor $\varphi^{2}$ (or $\varphi^{-2}$).

Let us recall that the value $\varphi^{2}$ (or $\varphi^{-2}$) is also well known in other genetically inherited phenomena of biological organisms, which are united under the term “phyllotaxis laws” (the authors do not know how these two cases of realization of $\varphi^{2}$ are interconnected by biological mechanisms).
Hundreds of books and articles around the world are devoted to these genetically inherited laws, which are connected with Fibonacci numbers and the golden section and which describe genetically inherited configurations of a huge number of living bodies at different levels and branches of biological evolution (see the review in the book (Jean, 2010)). For example, leaf arrangements of a lime tree, an elm tree and a beech are characterized by the ratio 2/1; those of an alder tree, a nut-tree, a vine and a sedge by the ratio 3/1; those of a raspberry, a pear tree, a poplar and a barberry by the ratio 8/3; those of an almond tree and a sea-buckthorn by the ratio 13/5; cones of coniferous trees correspond to the ratios 21/8, 34/13, 55/21 in various cases. All of these integers are Fibonacci numbers, and the sequence of these ratios tends to the value $\varphi^{2} = 2.618\ldots$. The ideal angle in phyllotaxis laws, which is termed “the Fibonacci angle”, is equal to $\varphi^{-2}$ of a full turn (Jean, 2010, section 2.2.1).

The golden section is present in 5-fold-symmetrical objects of biological bodies (flowers, etc.), which are widespread in living nature but which are forbidden in classical crystallography. It exists as well in many figures of modern generalized crystallography: quasi-crystals of D. Shechtman, R. Penrose’s mosaics, dodecahedra of ensembles of water molecules, icosahedral figures of viruses, biological phyllotaxis laws, etc. (Darvas, 2007). The article (Carrasco et al, 2009) shows that ice chains about 1 nm wide that nucleate on the metal surface Cu(110) are built from a face-sharing arrangement of water pentagons. The pentagon structure is favored over others because it maximizes the water–metal bonding while maintaining a strong hydrogen-bonding network. It reveals an unanticipated structural adaptability of water–ice films.
In recent years, unexpected connections have been discovered between the golden section and the micro-world of quantum mechanics, which includes genetic molecules. The article (Coldea et al., 2010) describes how a chain of atoms in certain circumstances acts like a nanoscale guitar string. The journal “Science Daily” gave a special title to its report on this discovery: “Golden Ratio Discovered in Quantum World: Hidden Symmetry Observed for the First Time in Solid State Matter”. The principal author of this paper, R. Coldea, says: “Here the tension comes from the interaction between spins causing them to magnetically resonate. For these interactions we found a series (scale) of resonant notes: the first two notes show a perfect relationship with each other. Their frequencies (pitch) are in the ratio of 1.618…, which is the golden ratio famous from art and architecture. … It reflects a beautiful property of the quantum system - a hidden symmetry. Actually quite a special one called E8 by mathematicians, and this is its first observation in a material" (http://www.sciencedaily.com/releases/2010/01/100107143909.htm).

The new theme of the golden section in genetic matrices seems to be important because many physiological systems and processes are connected with it. It is known that proportions of the golden section characterize many physiological processes: cardiovascular processes, respiratory processes, electric activity of the brain, locomotion activity, etc. The golden section has also long been described and investigated in phenomena of aesthetic perception. Taking these facts into account, the golden section should be considered a candidate for the role of one of the basic elements in an inherited interlinking of the physiological subsystems, which provides the unity of an organism.
The matrix relation between the golden section $\varphi$ and significant parameters of genetic codes testifies in favor of a molecular-genetic clue to such physiological phenomena. One can hope that the algebra of bi-symmetric genetic matrices, which are connected with the theme of the golden section, will be useful for the explanation and numeric forecast of separate parameters in different physiological sub-systems of biological organisms with their cooperative essence and golden section phenomena.

One should emphasize the deep geometrical sense of the connection between the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ and the golden genomatrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$. This connection deals with the notion of the “metric tensor”, which is the main notion of Riemannian geometry (all other notions of Riemannian geometry, such as the curvature tensor and geodesic lines, can be deduced from this main notion) (Rashevsky, 1964; http://en.wikipedia.org/wiki/Metric_tensor). The statement is that the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ are metric tensors, and the golden genomatrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ are matrices of the basis vectors of the frame of reference on which this tensor is built. Let us explain this in more detail.

By definition, a metric tensor in an n-dimensional affine space where the operation of scalar product exists is determined by means of a nonsingular matrix $\|g_{ij}\|$ with the symmetry condition $g_{ij} = g_{ji}$ (Rashevsky, 1964, p. 157). The coordinates $g_{ij}$ of the metric tensor are equal to the scalar products of pairs of the basis vectors $e_i$, $e_j$ of the frame of reference on which this tensor is built. The square root of the metric tensor $\|g_{ij}\|$ gives a square matrix whose columns can be taken as the basis vectors $e_i$ of the frame of reference. The quint matrices $[3\ 2;\ 2\ 3]^{(n)}$ satisfy the definition of metric tensors. Above we took the square root of the quint metric tensor $[3\ 2;\ 2\ 3]^{(n)}$, and as a result we obtained the golden genomatrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$.
This means that the metric tensors $[3\ 2;\ 2\ 3]^{(n)}$ are built on the corresponding bunches of “golden” vectors (as the basis vectors of their frames of reference), all components of which are equal to the golden section $\varphi$ in an integer power. For example, the genomatrix $[3\ 2;\ 2\ 3]$ can be interpreted as a metric tensor that is built on a special affine frame of reference. This frame consists of two basis vectors: the golden vector $e_1$ with coordinates $(\varphi, \varphi^{-1})$ and the golden vector $e_2$ with coordinates $(\varphi^{-1}, \varphi)$. These two golden vectors coincide with the columns of the golden genomatrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]$. The scalar products of pairs of these vectors are equal to the components of the quint genomatrix $[3\ 2;\ 2\ 3]$:

\[
\begin{aligned}
\langle e_1, e_1 \rangle &= \varphi\cdot\varphi + \varphi^{-1}\cdot\varphi^{-1} = 3, &
\langle e_1, e_2 \rangle &= \varphi\cdot\varphi^{-1} + \varphi^{-1}\cdot\varphi = 2, \\
\langle e_2, e_1 \rangle &= \varphi^{-1}\cdot\varphi + \varphi\cdot\varphi^{-1} = 2, &
\langle e_2, e_2 \rangle &= \varphi^{-1}\cdot\varphi^{-1} + \varphi\cdot\varphi = 3.
\end{aligned}
\]

One additional remark is the following. To interpret the matrix $[3\ 2;\ 2\ 3]$ correctly as a metric tensor of a 2-dimensional plane, one should indicate a group of transformations in relation to which this matrix plays the role of a tensor. In the considered case, for example, the group of rotations can be taken, because its transformations conserve the values of the scalar products of the golden frame vectors (though the coordinates of these vectors change under such transformations). A similar situation holds for the other corresponding pairs of the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ and golden genomatrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$. As a result, one can say that the considered Kronecker families of the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ and golden genomatrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ are closely connected from a geometrical point of view or, in other words, that they form a geometric organic whole.
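The statement about scalar products and rotations can be illustrated numerically. The following sketch is ours and only indicative: the helper names `dot`, `gram` and `rotate` are hypothetical, and the rotation angle 0.7 rad is an arbitrary choice. It builds the matrix of pairwise scalar products (the Gram matrix) of the golden vectors and repeats the computation after rotating both vectors by the same angle.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # the golden section

E1 = (PHI, 1 / PHI)  # golden vector e1
E2 = (1 / PHI, PHI)  # golden vector e2


def dot(u, v):
    """Ordinary scalar product of two plane vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram(vectors):
    """Matrix of pairwise scalar products of the given vectors."""
    return [[dot(u, v) for v in vectors] for u in vectors]


def rotate(v, angle):
    """Rotate a plane vector counterclockwise by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = v
    return (c * x - s * y, s * x + c * y)


frame = [E1, E2]
rotated_frame = [rotate(v, 0.7) for v in frame]  # arbitrary angle

print(gram(frame))          # close to [[3, 2], [2, 3]]
print(gram(rotated_frame))  # the same values, although the coordinates changed
```

Both Gram matrices agree with the quint genomatrix up to rounding, while the coordinates of the rotated vectors differ from those of the original golden vectors.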
It should be added that Riemannian geometry is essential for studying genetically inherited curved surfaces and lines of biological bodies: these curvilinear configurations are endowed with an internal metric, which is described by means of Riemannian geometry (in view of this, some mathematical models of biological morphogenesis can be developed on the basis of this geometry and its metric tensors).

The molecular system of the genetic alphabet is constructed by nature in such a manner that not only the numeric parameters of hydrogen bonds lead to the quint and golden genomatrices; some other significant parameters of genetic molecules lead to quint and golden matrices by analogy. For example, the quantities of atoms in the molecular rings of pyrimidines and purines are such parameters: the ring of a pyrimidine contains 6 atoms and the rings of a purine contain 9 atoms (Figure 7). From the viewpoint of this kind of parameter, C = T = 6 and A = G = 9. The ratio 9:6 is equal to the quint 3:2. Thus the symbolic matrices $[A\ C;\ T\ G]^{(n)}$, $[G\ C;\ T\ A]^{(n)}$, $[A\ T;\ C\ G]^{(n)}$, $[G\ T;\ A\ C]^{(n)}$ become threefold quint matrices in the Kronecker power n when their symbolic elements are replaced by these numbers 9 and 6. The square roots of such numeric matrices are evidently connected with the golden matrices.

A biological organism is a master of the use of a set of parallel information channels. It is enough to recall the many sensory channels by means of which we obtain sensory information simultaneously: visual, acoustical, tactile, etc. It is probable that many kinds of genetic matrices are used by organisms in parallel information channels as well.

4. THE GENOMATRICES, MUSICAL HARMONY AND PYTHAGOREAN MUSICAL SCALE

The theme of the harmony of living nature is frequently discussed by many authors. The word “harmony” arose in Ancient Greece in relation to the Pythagorean musical scale.
In the ancient theory of music the word "harmony" acquired its modern meaning: the agreement of the discordant. Seven musical notes carry names familiar to all: do (C), re (D), mi (E), fa (F), sol (G), la (A), si (B). These seven notes are interrelated by their frequencies not in an accidental manner; they form a regular uniform ensemble. Indeed, it is well known that the seven notes of the Pythagorean musical scale, taken from the appropriate octaves, form a regular geometric progression based on the quint ratio 3:2 between the frequencies of adjacent members of the sequence (Figure 5). The quint 3:2, which is the ratio between the frequencies of the third and the second harmonics of an oscillating string, plays the role of the factor of this geometric progression. The frequency 293 Hz of the note re (D1) of the first octave stands in the middle of this frequency series. The ratios of the frequencies of all the notes to the frequency of the note re (D1) form a series that is symmetrical in the signs and sizes of the powers of the quint: from the power "-3" up to the power "+3".

Note                  fa (F)      do (C)      sol (G)     re (D1)    la (A1)    mi (E2)    si (B2)
Frequency, Hz         87          130         196         293        440        660        990
Ratio to re (D1)      (3/2)^-3    (3/2)^-2    (3/2)^-1    (3/2)^0    (3/2)^1    (3/2)^2    (3/2)^3

Figure 5: The quint (or perfect fifth) sequence of the 7 notes of the Pythagorean musical scale. The upper row shows the notes. The second row shows their frequencies. The third row shows the ratios of the frequencies of these notes to the frequency 293 Hz of the note re (D1). The notes are designated in the Helmholtz system. The frequency values are rounded to integers.

The Kronecker family of the genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ is connected with the Pythagorean musical scale. Let us consider this more closely.
Each genomatrix of the family [3 2; 2 3]^(n) demonstrates the quint (or perfect fifth) principle in its structure because it has the quint ratio 3:2 at different levels: between the numerical sums in the top and bottom quadrants, sub-quadrants, sub-sub-quadrants, etc., including quint ratios between neighboring numbers in them. For example, [3 2; 2 3]^(3) contains 4 kinds of numbers, 27, 18, 12, 8, with the quint ratio between them: 27/18 = 18/12 = 12/8 = 3/2. Each quint genomatrix [3 2; 2 3]^(n) contains (n+1) kinds of numbers from a geometric progression whose factor is equal to the quint 3/2:

[3 2; 2 3]^(1):   3, 2
[3 2; 2 3]^(2):   9, 6, 4
[3 2; 2 3]^(3):   27, 18, 12, 8
...
[3 2; 2 3]^(6):   729, 486, 324, 216, 144, 96, 64
...

Let us write out these kinds of numbers in columns, one column for each genomatrix [3 2; 2 3]^(n), to arrive at the "genetic" triangle, which is shown on the left side of expression (1):

3    9    27   81   243  ....          1    3    9    27
2    6    18   54   162  ....          2
     4    12   36   108  ....          4
          8    24   72   ....          8                      (1)
               16   48   ....
                    32   ....

On the right side of expression (1) the historically famous numeric triangle of Plato is shown. This triangle was used by the Ancient Greeks to create the Pythagorean musical scale on the basis of its main proportions. One can see the analogy between the "genetic" triangle and Plato's triangle. Moreover, as Jay Kappraff (USA) informed one of the authors of this article in a private letter, this genetic triangle, which was obtained from the matrices of the genetic code, was known many centuries ago: it is identical to the famous triangle that was published 2000 years ago by Nichomachus of Gerasa in his famous book "Introduction to Arithmetic". Nichomachus belonged to the Pythagorean society, and this triangle was famous for centuries as the basis of the Pythagorean theory of musical harmony and aesthetics.
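The statement that each quint genomatrix [3 2; 2 3]^(n) contains exactly (n+1) kinds of numbers can be checked numerically. The following is only a minimal sketch in plain Python; the helper names kron, kron_power and kinds_of_numbers are illustrative assumptions and do not come from the article.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_power(matrix, n):
    """n-th Kronecker power of a matrix (n >= 1)."""
    result = matrix
    for _ in range(n - 1):
        result = kron(result, matrix)
    return result


QUINT = [[3, 2], [2, 3]]


def kinds_of_numbers(n):
    """Distinct entries of [3 2; 2 3]^(n), largest first."""
    entries = {value for row in kron_power(QUINT, n) for value in row}
    return sorted(entries, reverse=True)


# Each printed list is one column of the "genetic" triangle.
for n in (1, 2, 3, 6):
    print(n, kinds_of_numbers(n))
```

Each list printed for a given n is a fragment of the geometric progression with factor 3/2, and it corresponds to one column of the "genetic" triangle described above.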
In accordance with this triangle, the Parthenon (Kappraff, 2006) and other great architectural objects were created, because architecture was interpreted as motionless music and music was interpreted as dynamic architecture. Nichomachus of Gerasa was one of the great figures in the theory of musical harmony and aesthetics. The Cambridge library holds an ancient picture in which Nichomachus is shown together with other great figures in this field: Pythagoras, Plato and Boethius (http://www.jcsparks.com/painted/boethius.html). More details about the triangle of Nichomachus of Gerasa can be found in the publications (Kappraff, 2000, 2002). This unexpected connection across the ages lends additional plausibility to the presented matrix approach to genetic systems and to the assumed connection of genetic systems with the Pythagorean musical scale, reflected unconsciously in Nichomachus' triangle.

As mentioned above, the set of kinds of numbers in each genomatrix [3 2; 2 3]^(n) reproduces a fragment of a geometric progression with the quint factor. Thus sequences of such kinds of numbers can be compared to the quint sequence of musical notes from Figure 5. If one matches the least number of a quint genomatrix with the frequency 87 Hz of the musical note "fa" (F), which has the lowest frequency in Figure 5, then the whole sequence of such kinds of numbers automatically corresponds to the series of the frequencies of the musical notes: for example, the sequence of numbers 8, 12, 18, 27 of [3 2; 2 3]^(3) is assumed to correspond to the frequency sequence 87, 130, 196, 293 Hz of the notes fa (F) - do (C) - sol (G) - re (D1).
Genomatrix [3 2; 2 3]^(6) contains the sequence of 7 numbers (64, 96, 144, 216, 324, 486, 729), which is assumed to correspond to the whole quint sequence of the frequencies 87, 130, 196, 293, 440, 660, 990 Hz of the 7 notes of Figure 5: fa (F) - do (C) - sol (G) - re (D1) - la (A1) - mi (E2) - si (B2). For this reason, we assume that each genomatrix [3 2; 2 3]^(n) can be presented in the form of a matrix PMUSIC^(n) of frequencies of notes (or a "music-matrix"). For instance, Figure 6 shows the genomatrix [3 2; 2 3]^(3) of the 64 triplets as a music-matrix PMUSIC^(3) of the frequencies of the appropriate four notes (the common factor 293/27 brings the numbers 8, 12, 18, 27 of the genomatrix [3 2; 2 3]^(3) into agreement with the numeric values of the note frequencies).

re (D1)   sol (G)   sol (G)   do (C)    sol (G)   do (C)    do (C)    fa (F)
sol (G)   re (D1)   do (C)    sol (G)   do (C)    sol (G)   fa (F)    do (C)
sol (G)   do (C)    re (D1)   sol (G)   do (C)    fa (F)    sol (G)   do (C)
do (C)    sol (G)   sol (G)   re (D1)   fa (F)    do (C)    do (C)    sol (G)
sol (G)   do (C)    do (C)    fa (F)    re (D1)   sol (G)   sol (G)   do (C)
do (C)    sol (G)   fa (F)    do (C)    sol (G)   re (D1)   do (C)    sol (G)
do (C)    fa (F)    sol (G)   do (C)    sol (G)   do (C)    re (D1)   sol (G)
fa (F)    do (C)    do (C)    sol (G)   do (C)    sol (G)   sol (G)   re (D1)

Figure 6: A presentation of the genomatrix [3 2; 2 3]^(3)*(293/27) in the form of the music-matrix PMUSIC^(3) of the frequencies of the musical notes (see Figure 5).

The four numbers 8 = 2*2*2, 12 = 2*2*3, 18 = 2*3*3, 27 = 3*3*3, which are presented in the genomatrix [3 2; 2 3]^(3) in Figure 2, characterize those four kinds of triplets that differ in the numbers of hydrogen bonds of their nitrogenous bases. For instance, the number 18 = 2*3*3 belongs to those triplets that have one nitrogenous base with 2 hydrogen bonds and two bases with 3 hydrogen bonds (the mathematics of genomatrices testifies that the products of the numbers of hydrogen bonds should be taken into account here, not their sums; this has precedents and a justification in information theory, in particular in the theory of parallel channels of coding and processing of information). Different sequences of these four numbers, for example 12-8-27-12-8-18-18-…, determine corresponding successions of the musical ratios (3:2)^0, (3:2)^1, (3:2)^2, (3:2)^3 and their inverses (in this example, (2:3) - (3:2)^3 - (2:3)^2 - (2:3) - (3:2)^2 - (3:2)^0 - …). Such a succession can evidently be interpreted as a kind of genetic music of triplets, which is connected with their hydrogen bonds. Each gene and each part of DNA and RNA has its own genetic "melody of hydrogen bonds", which can be played by means of musical instruments.

But the described musical sequence is by no means the only one in the DNA molecule. DNA can be considered as a set of joint sequences, which are very different in their physical-chemical sense: a sequence of nitrogenous bases; a sequence of hydrogen bonds of complementary pairs of these bases; a sequence of triplets; a sequence of rings of nitrogenous bases; a sequence of ensembles of protons in rings of nitrogenous bases, etc. One can note the phenomenological fact that many of these sequences are built on quint ratios between the quantitative characteristics of their neighboring members, ratios that are typical for the Pythagorean musical scale (as mentioned above). Correspondingly, each of these sequences of ratios can be interpreted as a special kind of genetic melody. The whole set of such sequences in DNA can be considered as a polyphonic (coordinated) musical ensemble. An investigation of this musical ensemble seems to be an important scientific task.
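The passage above, on triplets and their hydrogen bonds, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the sample DNA string is invented solely so that its triplets reproduce the numbers 12-8-27-12-8-18-18 of the example. It is not a real gene.

```python
from fractions import Fraction

# Hydrogen bonds per base, as used in the text: A and T form 2, C and G form 3.
HYDROGEN_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}
QUINT = Fraction(3, 2)


def triplet_numbers(dna):
    """Product of the hydrogen-bond numbers for each complete triplet."""
    usable = len(dna) - len(dna) % 3
    numbers = []
    for i in range(0, usable, 3):
        a, b, c = dna[i:i + 3]
        numbers.append(HYDROGEN_BONDS[a] * HYDROGEN_BONDS[b] * HYDROGEN_BONDS[c])
    return numbers


def quint_exponents(numbers):
    """Exponent k with following/previous = (3/2)^k for each adjacent pair."""
    exponents = []
    for previous, following in zip(numbers, numbers[1:]):
        ratio = Fraction(following, previous)
        k = next(k for k in range(-3, 4) if QUINT ** k == ratio)
        exponents.append(k)
    return exponents


# Hypothetical sequence, chosen only to reproduce 12-8-27-12-8-18-18.
SAMPLE = "ATGAAAGGCTACTTTGCACAG"
numbers = triplet_numbers(SAMPLE)
exponents = quint_exponents(numbers)
print(numbers)
print(exponents)
```

Here a negative exponent k denotes a descending step, for example k = -1 for the ratio 2:3, so the list of exponents is one possible encoding of the "melody of hydrogen bonds".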
Let us demonstrate a few additional examples of sequences with musical ratios in DNA. A sequence of triplets in DNA also has another kind of genetic music, which is connected with the quantity of protons in the molecular rings of nitrogenous bases (Figure 7). The pyrimidines C and T have 40 protons in their rings; the purines A and G have 60 protons in their rings. (Each complementary pair has precisely 100 protons in its rings.) The ratio 60:40 is equal to the quint 3:2. Let us represent each triplet by the product of the proton numbers 40 and 60 of its rings (as we did above with the numbers 2 and 3 of the hydrogen bonds of triplets). Then any triplet has one of four proton numbers: 64000 = 40*40*40; 96000 = 40*40*60; 144000 = 40*60*60; 216000 = 60*60*60. This proton set of four numbers differs from the set of four numbers 8, 12, 18, 27 of hydrogen bonds in the triplets only by the factor 8000. In other words, the ratio of a larger number to a smaller one in this proton set again has a quint character and is equal to one of the values (3:2)^k, where k = 1, 2, 3. One can note that the sequence of triplets of one DNA filament carries two different sequences with the same typical ratios: one sequence for the hydrogen-bond characteristic of the triplets and another for the characteristic of protons in the triplet rings. Generally speaking, these two sequences differ from each other in the disposition of these ratios along the DNA filament (Figure 7). So any triplet sequence bears two different genetic melodies on these two parameters.

Figure 7: On top: complementary pairs of the four nitrogenous bases in DNA: A - T and C - G. Dotted lines indicate the hydrogen bonds in these pairs. Black circles are atoms of carbon, small white circles are hydrogen, circles with the letter N are nitrogen, and circles with the letter O are oxygen.
At bottom: the numerical representations of a sequence of complementary base pairs in DNA as a sequence of the numbers of hydrogen bonds in the given pairs (the middle row, made up of the numbers 2 and 3) and as a numerical sequence of the protons in the molecular rings of these nitrogenous bases.

The sequential dispositions of musical ratios for these two parameters of triplets (and of nitrogenous bases as well) are different on the two filaments of DNA, but they are connected in a regular manner because the bases form complementary pairs. Figuratively speaking, the two filaments of DNA bear complementary kinds of genetic music on these parameters.

An atomic parameter of nitrogenous bases should also be mentioned: the quantity of non-hydrogen atoms in the molecular rings of the pyrimidines C and T is equal to 6, and the quantity of non-hydrogen atoms in the molecular rings of the purines A and G is equal to 9. Their quint ratio 9:6 = 3:2 can be considered as a basis for the "atomic" genetic music of the nitrogenous bases and triplets along DNA. But these sequences of ratios are identical to the sequences of ratios in the case, considered above, of the 40 and 60 protons in the rings of the pyrimidines and the purines. For this reason these sequences add nothing new from the musical viewpoint, though they can have an important meaning in the ensemble of genetic music because they are organized on the higher, atomic, level.

A sequence of the numbers 2 and 3 of hydrogen bonds between complementary nitrogenous bases along DNA (for instance, 3-2-2-3-2-3-…) determines a sequence of ratios between each member and the preceding one (in the considered example, 2:3 - 2:2 - 3:2 - 2:3 - 3:2 - …). This simple sequence contains only the ratios (3/2)^(-1), (3/2)^0 and (3/2)^1. From the viewpoint of the musical analogy, this sequence determines a special kind of very simple genetic music.
The quantities of molecular rings in the pyrimidines and the purines are characterized by the octave ratio 2:1. This fact gives an additional possibility to consider sequences of nitrogenous bases and triplets in DNA as genetic melodies. But the sequences of ratios in these cases contain only octave ratios and are not so interesting from the musical viewpoint, though they can play an important role in the whole ensemble of genetic music. The total quantities of protons in the two pairs of nitrogenous bases A-T and C-G are the same and are equal to 136. On this numeric parameter, a sequence of nitrogenous bases has the constant ratio 1:1 along DNA.

The full list of the different kinds of such genetic music, for different parameters and levels of the genetic system, permits one to reproduce a polyphonic musical score for each gene and for other parts of the genetic system. These musical sequences were created by nature itself. Each gene and each protein has its own genetic musical composition (or briefly "genomusic"). The natural music of genes can be reproduced in the acoustic range not only for aesthetic pleasure but, perhaps, also for medical therapy, for theoretical needs, etc. (applications of genomusic in the field of musical therapy have not been tested by the authors). This natural genomusic and its compositions can be connected to the deep physiological archetypes that were introduced into science by the creator of analytic psychology, Carl Jung. From the viewpoint of musical harmony in the structures of the molecular-genetic system, outstanding composers are researchers of harmony in the organization of living substance. According to the famous expression by G.
Leibnitz, music is the mysterious arithmetic of the soul, which calculates itself without understanding this action (here one should note the difference in the order of magnitude between the wavelengths of musical tones and the size of the molecular nitrogenous bases; this fact testifies in favor of the informational nature of musical action). It is often reported that some kinds of music stimulate the growth of plants, help to cure people, etc. The American Music Therapy Association unites a few thousand members, many of whom are professional therapists.

One should emphasize that the "melodies" of the mentioned genetic music are not imposed by any person in an arbitrary way but are defined by natural sequences of parameters in chain genetic molecules (although applications of genetic music in musical therapy have not been tested yet, we may suppose that this kind of music is closer to biological organisms than other kinds). Such genetic melodies are conditionally named "natural genetic music" to distinguish them from the variants of "genetic music" sometimes offered by other authors on the basis of evidently arbitrary approaches without sufficient support in the molecular features of genetic sequences. In particular, some authors (see for example http://www.youtube.com/watch?v=tQv5Ho8zsKI) propose their own "genetic music" on the basis of an arbitrary correspondence of the genetic letters or triplets to musical notes, without sufficient attention to the musical correspondence of the ratios of natural numeric parameters of adjacent genetic elements. Such attempts to create arbitrary "genetic music" are related to the long-standing hypothesis that it is the genetic system that carries the genetically inherited connection of biological organisms with the phenomenon of music (see for example http://discovermagazine.com/2001/aug/featmusic#.UMyvN0I3tXU).
All physiological systems should be coordinated structurally with the genetic code in order to be transferred genetically to the next generations and to survive in the course of biological evolution. For this reason we collect examples of harmonious ratios (first of all, the quint 3:2) in structures and functions on different levels of biological systems, including the supra-molecular level. For example, the quint ratio 3:2 exists between: the durations of the phases of activity and rest in human cardio-cycles (0.6 sec and 0.4 sec correspondingly); the plasmatic and globular volumes of blood (60% and 40%); the albumins and globulins of blood (60% and 40%); the 60S and 40S subunits in the composition of ribosomes (from http://vivovoco.rsl.ru/VV/JOURNAL/NATURE/08_03/KISSELEV.HTM).

Now let us consider a well-known algorithm for constructing the Pythagorean musical scale from a geometric progression whose factor is equal to the quint. This algorithm, which is useful for the theme of the next section, creates the sequence of the notes do-re-mi-fa-sol-la-si-do on the frequency interval {1, 2} of one octave, in which the lowest note "do" has the conventional frequency 1 and the lowest note of the next octave has the conventional frequency 2. The algorithm contains the following steps:

1. Taking the first seven members of the geometric progression with the quint factor 3/2 that begins from the inverse value of the quint: (3/2)^(-1), (3/2)^0, (3/2)^1, (3/2)^2, (3/2)^3, (3/2)^4, (3/2)^5;

2. Returning into the octave interval {1, 2} those members of this sequence whose values overstep the limits of this interval; the return is made by multiplying or dividing these values by the number 2 as many times as necessary. As a result of this operation, a new sequence appears (it can be named "the geometric progression with return into the octave"): 2*(3/2)^(-1), (3/2)^0, (3/2)^1, (3/2)^2/2, (3/2)^3/2, (3/2)^4/4, (3/2)^5/4;

3. Permuting these seven members in accordance with their increasing values from 1 up to 2 (the number 2 is included in this sequence as the end of the octave): (3/2)^0, (3/2)^2/2, (3/2)^4/4, 2*(3/2)^(-1), (3/2)^1, (3/2)^3/2, (3/2)^5/4, 2.

In this last sequence, the ratio of a number to the adjacent smaller number is referred to as the interval factor. Only two kinds of interval factors exist in this sequence: 9/8, which is named the tone-interval T, and 256/243, which is named the semitone-interval S. One can check that the sequence of interval factors in this case is T-T-S-T-T-T-S. These five tone-intervals and two semitone-intervals cover the octave precisely: (9/8)^5 * (256/243)^2 = 2. It is known that the name "semitone-interval" in the Pythagorean musical scale is used by convention only, because the semitone-interval 256/243 = 1.0535… is not equal to half of the tone-interval, that is, to the square root of the tone-interval: (9/8)^0.5 = 1.0607… . If one takes not 7 but 6 or 8 members of the initial quint geometric progression (see the first step of the algorithm), then the same Pythagorean algorithm does not give a binary sequence of interval factors T and S, because three kinds of interval factors arise.

A similar algorithm will be used in the next section to construct a new mathematical scale on the basis of the described data about the genetic code and its genomatrices.
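The three steps of the Pythagorean algorithm described above can be reproduced with exact rational arithmetic. The following is a minimal sketch, not the authors' software; the names fold_into_octave and scale_from_quints are assumptions introduced for illustration.

```python
from fractions import Fraction
from math import prod


def fold_into_octave(x):
    """Multiply or divide by 2 until 1 <= x < 2 (step 2 of the algorithm)."""
    while x >= 2:
        x /= 2
    while x < 1:
        x *= 2
    return x


def scale_from_quints(members, factor=Fraction(3, 2)):
    """Steps 1-3: take `members` terms starting at factor**-1, fold, sort."""
    progression = [factor ** k for k in range(-1, members - 1)]
    steps = sorted(fold_into_octave(x) for x in progression)
    steps.append(Fraction(2))  # the end of the octave
    intervals = [high / low for low, high in zip(steps, steps[1:])]
    return steps, intervals


steps, intervals = scale_from_quints(7)
TONE, SEMITONE = Fraction(9, 8), Fraction(256, 243)
pattern = "-".join("T" if i == TONE else "S" for i in intervals)
print([str(s) for s in steps])
print(pattern)
print(prod(intervals))

# With 6 or 8 members, a third kind of interval factor appears.
for members in (6, 7, 8):
    print(members, sorted(set(scale_from_quints(members)[1])))
```

Because Fraction keeps every value exact, the number of distinct interval factors can be counted with an ordinary set, with no rounding tolerance.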
5 PENTAGRAM MUSICAL SCALES AND FIBONACCI NUMBERS

Many theorists of music have paid attention to the connection between the structure of many musical compositions of prominent composers and the golden section φ = (1 + 5^0.5)/2 = 1.618… (see, for example, (Lendvai, 1993) and the web-site about the Hungarian composer Bela Bartok and the musicologist Erno Lendvai http://mathcs.holycross.edu/~groberts/Courses/Mont2/Handouts/Lectures/Bartokweb.pdf). The results of matrix genetics reveal a new direction of thought about the relation between the golden section, Fibonacci numbers and music, because the structures of the genetic code are also (although in another way) connected with the golden section. Similarly to a quint genomatrix [3 2; 2 3]^(n), which contains a sequence of (n+1) kinds of numbers from a geometric progression with the quint factor 3/2, the corresponding golden genomatrix Φ^(n) contains a sequence of (n+1) kinds of numbers from a geometric progression whose factor is equal to φ^2 = 2.618…:

Φ^(1):   φ^1, φ^(-1)
Φ^(2):   φ^2, φ^0, φ^(-2)
Φ^(3):   φ^3, φ^1, φ^(-1), φ^(-3)
...                                                        (2)

The previous section demonstrated that the Kronecker family of the quint genomatrices is connected with the Pythagorean musical scale. Now we turn to the Kronecker family of the golden genomatrices and to the geometric progressions with the factor φ^2. Is it possible to apply the described Pythagorean algorithm to such geometric progressions with the factor φ^2 to arrive at a new musical (or mathematical) scale in which only two interval factors exist, by analogy with the Pythagorean musical scale? Investigation of this question seems to be important because such a new scale or scales can be essential for the theory of musical harmony and for the creation of musical compositions with increased physiological activity.
Research on this question gave a beautiful positive result: yes, it is possible every time one takes one of the Fibonacci numbers 2, 3, 5, 8, 13 (see Figure 8) as the number of initial members of such a geometric progression (the situation becomes more difficult for the higher Fibonacci numbers 21, 34, …). The mathematical scales formed in these cases contain each of their two interval factors a number of times that is also equal to a Fibonacci number. Moreover, the value of each of these two interval factors is expressed by means of Fibonacci numbers, too.

n:     0   1   2   3   4   5   6   7    8    9    10   11   …
F_n:   0   1   1   2   3   5   8   13   21   34   55   89   …

Figure 8: The Fibonacci series, where F_(n+1) = F_n + F_(n-1).

Such interrelated scales, each of which has interval factors of two kinds only and is based on the geometric progression with the coefficient φ^2, are named "the Fibonacci-stage scales" or "the pentagram scales". Let us consider the example of the 8-stage pentagram scale. We should construct a new mathematical scale of frequencies that fills up the octave {1, 2} by means of the Pythagorean algorithm on the basis of a geometric progression with the irrational coefficient φ^2 (instead of the quint coefficient 3/2). As a result we should arrive at a scale that possesses only two kinds of interval factors, by analogy with the Pythagorean musical scale.

One can note that the factor φ^2 = 2.618… exceeds the considered interval of the octave {1, 2}. Therefore it is convenient to use from the very beginning the factor half as large, φ^2/2 = p = 1.309…, the value of which belongs to this octave interval. It is easy to check that the final sequence (3) of the 8-stage pentagram scale does not depend on whether we use the factor φ^2 or the factor φ^2/2, which are equivalent to each other in the given problem.
This factor p = φ^2/2 has been known for a long time in the field of investigations of biological symmetries and morphological invariants under the name of the golden wurf (Petoukhov, 1981; Petoukhov, He, 2010).

Now let us construct the 8-stage pentagram scale by means of the analogue of the described Pythagorean algorithm, using the factor p = φ^2/2 in the initial geometric progression (instead of the quint factor 3/2). All three steps of the Pythagorean algorithm are reproduced:

1. Taking the first eight (!) members of the geometric progression with the factor p = φ^2/2 that begins from the inverse value of this factor: p^(-1), p^0, p^1, p^2, p^3, p^4, p^5, p^6;

2. Returning into the octave interval {1, 2} those members of this sequence whose values overstep the limits of this interval; the return is made by multiplying or dividing these values by the number 2 as many times as necessary. As a result of this operation, a new sequence is obtained (it can be named "the geometric progression with return to the octave"): 2*p^(-1), p^0, p^1, p^2, p^3/2, p^4/2, p^5/2, p^6/4;

3. Permuting these eight members in accordance with their increasing values from 1 up to 2 (the number 2 is included in this sequence as the end of the octave):

1, p^3/2, p^6/4, p^1, p^4/2, 2*p^(-1), p^2, p^5/2, 2          (3)

This final sequence (3) satisfies the initial condition concerning the existence of only two kinds of interval factors. Indeed, it is easy to check directly that all ratios of adjacent members of this sequence are equal to two values only, which play the role of the interval factors. For this sequence (3), the first kind of interval is T = p^3/2 = 1.1215… and the second kind of interval is S = 4*p^(-5) = 1.0407… . The sequence of these interval factors is T-T-S-T-S-T-T-S. This sequence fills the whole octave exactly: (p^3/2)^5 * (4*p^(-5))^3 = 2. The quantities of the two kinds of interval factors are equal to Fibonacci numbers here.
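The construction of sequence (3) can be checked numerically. The sketch below uses ordinary floating-point numbers, so interval factors are compared with a small tolerance; the function names and the tolerance value are assumptions made for illustration only.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
P = PHI ** 2 / 2  # the factor p = phi^2 / 2 = 1.309...


def pentagram_scale(members, factor=P):
    """Fold `members` terms factor**-1, factor**0, ... into [1, 2) and sort."""
    folded = []
    for k in range(-1, members - 1):
        x = factor ** k
        folded.append(x / 2 ** math.floor(math.log2(x)))
    steps = sorted(folded) + [2.0]  # 2 closes the octave
    intervals = [high / low for low, high in zip(steps, steps[1:])]
    return steps, intervals


def kinds(intervals, tol=1e-9):
    """Distinct interval values, merging floating-point near-duplicates."""
    distinct = []
    for value in sorted(intervals):
        if not distinct or value - distinct[-1] > tol:
            distinct.append(value)
    return distinct


steps, intervals = pentagram_scale(8)
small, large = kinds(intervals)  # S and T of the 8-stage scale
pattern = "-".join("T" if abs(i - large) < 1e-9 else "S" for i in intervals)
print(pattern)
print(round(large, 4), round(small, 4))

# Number of kinds of interval factors for several numbers of initial members.
for members in (2, 3, 4, 5, 6, 8, 9, 13):
    print(members, len(kinds(pentagram_scale(members)[1])))
```

In this sketch the numbers of members 2, 3, 5, 8 and 13 give exactly two kinds of interval factors, whereas 4, 6 and 9 give three.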
Indeed, 3 intervals S, 5 intervals T and in total 8 interval factors exist here. It is interesting that if we take a non-Fibonacci number (for example, 4, 6 or 9) as the number of initial members of the geometric progression in the first step of the Pythagorean algorithm, the final sequences have more than two kinds of interval factors.

Let us compare the classical 7-stage Pythagorean musical scale with the obtained 8-stage pentagram scale. Figure 9 shows the minimal difference between the sequences of the two kinds of intervals inside the octave for both scales. The initial and final parts of both sequences coincide completely, and only one additional semitone-interval arises in the middle part of the octave. This additional interval of the second kind S exists because the factor "p" is less than the quint factor.

Pythagorean scale (7 stages):   T  T  S  T     T  T  S
Pentagram scale (8 stages):     T  T  S  T  S  T  T  S

Figure 9: Sequences of interval factors in the 7-stage Pythagorean scale of C major (the upper row) and in the 8-stage pentagram scale (the bottom row). In each row, the intervals of the first kind are marked by T and the intervals of the second kind are marked by S (though the values of T and S in the upper row differ from the values of T and S in the bottom row).

Using the sequence (3) of the intervals, one can construct the sequence of tones (musical notes) that is named the "8-stage pentagram scale of C major" by analogy with the Pythagorean scale of C major (Figure 10). The frequencies of these tones in the first octave are chosen in such a way that the scale contains the frequency 440 Hz, which corresponds to the note "la" in the Pythagorean scale and in the equal-temperament scale and which is traditionally used for tuning musical instruments. Figure 10 compares the 7-stage Pythagorean scale of C major and the 8-stage pentagram scale for the first octave.
Taking into account the minimal difference between the two scales, the majority of the notes of the pentagram scale are named by analogy with the corresponding notes of the Pythagorean scale but with the letter "m" at the end (for instance, "rem" instead of "re"). The additional fifth note is named "pim".

Pythagorean scale:   DO1 260.7    RE 293.3    MI 330.0    FA 347.6                  SOL 391.1    LA 440     SI 495.0    DO2 521.5
Pentagram scale:     DOM1 256.8   REM 288.0   MIM 323.0   FAM 336.1   PIM 377.0   SOLM 392.3   LAM 440    SIM 493.5   DOM2 513.6

Figure 10: The upper row shows the frequencies of the tones in the 7-stage Pythagorean scale of C major in the first octave. The bottom row shows the frequencies of the tones in the 8-stage pentagram scale of C major in the same octave. Numbers are frequencies in Hz. The names of the notes are given.

K = 0: 1 stage. T0 = 2. Order: T0
K = 1 (n = 3): 2 stages. T1 = 2^1*p^(-1) = 1.5279…; S1 = 2^0*p^1 = 1.3090… . Order: T1-S1
K = 2 (n = 4): 3 stages. T2 = 2^0*p^1 = 1.3090…; S2 = 2^1*p^(-2) = 1.1672… . Order: T2-S2-T2
K = 3 (n = 5): 5 stages. T3 = 2^1*p^(-2) = 1.1672…; S3 = 2^(-1)*p^3 = 1.1215… . Order: S3-T3-T3-S3-T3
K = 4 (n = 6): 8 stages. T4 = 2^(-1)*p^3 = 1.1215…; S4 = 2^2*p^(-5) = 1.0407… . Order: T4-T4-S4-T4-S4-T4-T4-S4
K = 5 (n = 7): 13 stages. T5 = 2^2*p^(-5) = 1.0407…; S5 = 2^(-3)*p^8 = 1.0776… . Order: S5-T5-S5-T5-T5-S5-T5-T5-S5-T5-S5-T5-T5
K = 6 (n = 8): 21 stages. T6 = 2^(-3)*p^8 = 1.0776…; S6 = 2^5*p^(-13) = 0.9657… . Order: T6-T6-S6-T6-T6-S6-T6-S6-T6-T6-S6-T6-S6-T6-T6-S6-T6-T6-S6-T6-S6
K = 7 (n = 9): 34 stages. T7 = 2^5*p^(-13) = 0.9657…; S7 = 2^(-8)*p^21 = 1.1159… . Order: S7-T7-S7-T7-T7-S7-T7-S7-T7-T7-S7-T7-T7-S7-T7-S7-T7-T7-S7-T7-T7-S7-T7-S7-T7-T7-S7-T7-S7-T7-T7-S7-T7-T7
K = 8 (n = 10): 55 stages. T8 = 2^(-8)*p^21 = 1.1159…; S8 = 2^13*p^(-34) = 0.8655… . Order: T8-T8-S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-S8

Figure 11: The values and the order of both kinds of intervals T_K and S_K in the first Fibonacci-stage pentagram scales. "K" is the serial number of a pentagram scale; "n" is the serial number of the Fibonacci value from Figure 8 that gives the number of stages of the scale.
This pentagram scale (Figure 10), which was constructed in connection with parameters of the genetic code, has many analogies with the Pythagorean musical scale in its internal symmetries and proportions. Its main difference from the Pythagorean scale is connected with the irrational values of its interval factors. Irrational factors are also used in the modern equal-temperament scale. According to some data, the Ancient Chinese knew about the equal-temperament scale but neglected it, preferring the Pythagorean scale, in which they saw cosmic and biological importance.

The history of attempts to create new musical scales includes the names of many prominent scientists: J. Kepler, R. Descartes, G. Leibnitz, L. Euler, etc. But these authors had no possibility of using data about the genetic code in their attempts. The data about the genetic code allow one to create new musical scales.

By analogy with the 8-stage pentagram scale, other Fibonacci-stage scales can be constructed. Figure 11 shows both kinds of interval factors T and S in the pentagram scales with different Fibonacci numbers of stages (see more details in (Petoukhov, 2008)).

Each of the pentagram scales (Figure 11) contains a Fibonacci quantity of each of the interval factors T_K and S_K: the interval T_K is repeated F_(n-1) times and the interval S_K is repeated F_(n-2) times. The total number of intervals T_K and S_K is equal to F_n, and they always exhaust the octave interval exactly:

(T_K^F_(n-1)) * (S_K^F_(n-2)) = 2,          (4)

where the symbol "^" means exponentiation.

In addition, the values of each of T_K and S_K are also expressed simply via Fibonacci numbers. One can see from the table in Figure 11 that the following recurrent relations exist for the system of the pentagram scales:

T_(K+2) = T_K / T_(K+1);   S_K = T_(K+1)   (K = 0, 1, 2, …; T0 = 2, T1 = 2*p^(-1))          (5)

These relations lead to a recurrent algorithm (6) for calculating interval factors in the pentagram scales.
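The recurrent relations (5) and the octave condition (4) can be checked together in a short numerical sketch. The function names below are assumptions introduced for illustration; the relation n = K + 2 between the serial number of a scale and the serial number of the Fibonacci value is read off from Figure 11.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
P = PHI ** 2 / 2  # p = phi^2 / 2


def interval_factors(k_max):
    """T_0 .. T_(k_max + 1) from T_(K+2) = T_K / T_(K+1), T_0 = 2, T_1 = 2/p."""
    t = [2.0, 2.0 / P]
    while len(t) < k_max + 2:
        t.append(t[-2] / t[-1])
    return t


def fibonacci(n):
    """F_0 = 0, F_1 = 1, F_2 = 1, ... as in Figure 8."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


T = interval_factors(8)
octave_products = []
for k in range(1, 9):
    n = k + 2                  # serial number of the Fibonacci value F_n
    t_k, s_k = T[k], T[k + 1]  # relation (5): S_K = T_(K+1)
    octave = t_k ** fibonacci(n - 1) * s_k ** fibonacci(n - 2)  # relation (4)
    octave_products.append(octave)
    print(k, fibonacci(n), round(t_k, 4), round(s_k, 4), round(octave, 9))
```

Each printed row gives K, the number of stages F_n, the rounded values of T_K and S_K, and the product from relation (4), which should be equal to 2 up to floating-point error.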
This new algorithm can be considered as an alternative to the Pythagorean algorithm described above. On the basis of the values T1 and T2, this new algorithm allows one to calculate the values of the interval factors T_K for K = 3, 4, 5, 6, …, which correspond to the 5-, 8-, 13-, 21- and higher-order Fibonacci-stage scales, without using the Pythagorean algorithm. Indeed, the recurrent relations (5) generate the relations (6) for the values T_K as functions of T1 and T2:

T_K = [ (T1^F_(K-2)) / (T2^F_(K-1)) ]^((-1)^(K+1)),          (6)

where the symbol "^" means exponentiation; F_(K-2) and F_(K-1) are Fibonacci numbers.

Due to the expression T_(K+2) = T_K / T_(K+1) from (5), this family of Fibonacci-stage scales is connected with Pascal's triangle and Newton's binomial coefficients, because

T0^1 = T1^1*T2^1 = T2^1*T3^2*T4^1 = T3^1*T4^3*T5^3*T6^1 = T4^1*T5^4*T6^6*T7^4*T8^1 = … .

The exponents in these products coincide with the binomial coefficients.

Another algorithm exists to determine the order of T_K and S_K in each of the pentagram scales on the basis of knowledge about their order in the first pentagram scales: T0 and T1-S1. This algorithm is connected with the classical problem by Fibonacci about the reproduction of rabbits. The algorithm is based on the fact that under the transition from the pentagram scale K to the next pentagram scale K+1, each interval T_K is replaced by the two intervals T_(K+1) and S_(K+1), and each interval S_K is replaced by the interval T_(K+1). It should be noted that under the transition from a scale with odd number K to a scale with even number (for example, from K = 3 to K = 4) the interval T_K is replaced by T_(K+1) and S_(K+1) (the order of T_(K+1) and S_(K+1) is essential here). On the contrary, under the transition from a scale with even number K to a scale with odd number (for example, from K = 4 to K = 5) the interval T_K is replaced by S_(K+1) and T_(K+1) (the reverse order).
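The replacement rule just described can be written as a simple string substitution. The sketch below starts from the order T1-S1 of the scale K = 1 and applies the rule repeatedly; the encoding of intervals as the letters "T" and "S" and the function names are assumptions made for illustration.

```python
def next_order(order, k):
    """Order of T and S in scale K = k + 1, given the order in scale K = k.

    Every S becomes T. Every T becomes "TS" when k is odd and "ST" when
    k is even, as stated in the text.
    """
    pair = "TS" if k % 2 == 1 else "ST"
    return "".join(pair if interval == "T" else "T" for interval in order)


def order_of_scale(k):
    """Order of intervals in the pentagram scale with serial number k >= 1."""
    order = "TS"  # the scale K = 1: T1-S1
    for current in range(1, k):
        order = next_order(order, current)
    return order


for k in range(1, 9):
    order = order_of_scale(k)
    print(k, len(order), order.count("T"), order.count("S"), order)
```

For each K the sketch prints the total number of intervals and the numbers of T and S, which can be compared with the Fibonacci quantities and with the orders listed in Figure 11.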
Let us explain this with the example of the sequence S3-T3-T3-S3-T3, indicating in brackets, for each S3 and T3, its algorithmic transformation into T4 and S4 under the transition from the pentagram scale with K = 3 to the next scale with K = 4: S3(T4) - T3(T4-S4) - T3(T4-S4) - S3(T4) - T3(T4-S4). Reading only the sequence of T4 and S4 inside the brackets, we get the familiar sequence T4-T4-S4-T4-S4-T4-T4-S4 of the pentagram scale with K = 4 in Figure 11.

Figure 12 shows the tree of reproduction of the interval factors T_K and S_K, which is constructed on the basis of this algorithm and which corresponds to the sequences of T_K and S_K in the table of pentagram scales in Figure 11. A similar tree, in which the number of elements of each of the two kinds at each level is a Fibonacci number, has been known since before the Renaissance. It appeared in the "biological" problem by Fibonacci. This problem concerns breeding rabbits: each pair gives birth to a new pair every month, but only from the second month after its own birth. The history of the problem and the applications of Fibonacci numbers in many fields of contemporary science are described in the book (Vorobyev, 2003).

Figure 12: The tree of reproduction of the intervals T_K (black circles) and S_K (white circles) for the ensemble of the Fibonacci-stage pentagram scales from the table in Figure 11.

Along with the similarities between the trees in Fibonacci's classical problem and in our musical "task of the octave", which is associated with expression (4), the following mathematical differences exist between these problems and between their trees:

Our task of the octave analyzes not only the numbers of T_K and S_K but also the values of each T_K and S_K, which are expressed through the Fibonacci numbers. Fibonacci's classical problem does not consider the parameters of each rabbit (e.g., weight or size), and the rabbits differ from each other only in sexual maturity.
The octave task determines the order of TK and SK for each level of the tree of the pentagram scales. Fibonacci's task does not consider the order of the two kinds of rabbits (those that have reached sexual maturity and those that have not) at each level of the Fibonacci tree. Octaves are not considered at all in the Fibonacci task. Taking these facts into account, our "task of the octave" can be regarded as a complication or generalization of the classical Fibonacci task.

On the basis of the table in Figure 11, one can construct systems of sound frequencies for the pentagram scales corresponding to different Fibonacci stages. In this case a new interesting result appears: all the musical frequencies of any pentagram scale are repeated in all higher pentagram scales, which have more steps. In other words, a fractal-like principle exists, which incorporates the set of sound frequencies of a lower pentagram scale into the set of frequencies of the higher pentagram scales. Each subsequent pentagram scale contains information about the musical frequencies of all previous "generations" of pentagram scales.

The Moscow State Conservatory named after P.I. Tchaikovsky has recently created a special “Center for interdisciplinary research of musical creativity” headed by one of the authors of the article. One of the main tasks of this center is to study the genetic musical scales described in this article. This study is now being conducted from different viewpoints, including new opportunities for composers. The study revealed that the pentagram musical scales possess beautiful and rich harmony for sound perception and can be used to compose music. The system of the musical pentagram scales offers many more possibilities to produce harmonic sounds than the equal tempered scale, which is widely used now.
The authors of the article have created a few musical instruments and special software in the Python programming language to produce appropriate musical products, which are used in this study. A group of specialists from different fields of science, medicine and culture also participates in these works. In addition, theoretical research in this field is conducted at the international institute “Symmetrion” (Budapest, Hungary). Initial results of this broad study testify in favor of the great prospects of this direction for science and culture.

In our opinion, the aesthetic aspects of genetic music are connected not with a mechanical resonance of molecular structures under the influence of sound waves but with informational aspects, which provide an effect of recognition (in a way not yet identified) of a kindred language while listening to genetic music. This effect of recognition can be provided by biological algorithms of signal processing inside organisms. For example, in the case of pentagram music from the outside world, our organism can recognize those ratios on which our genetic system and the whole inherited physiology are built, and the organism responds positively to this manifestation of a structural kinship of the outside world with its own genetic physiology. This positive reaction can be compared with the mutual understanding between two persons when they begin to talk in the same language (if they talk in different languages, mutual understanding and interaction do not arise, even though these persons may speak more and more loudly and energetically).

Music is not limited to the relationships of sounds emitted by a system of stretched strings, which were studied by Pythagoras. The purpose of music is to call up emotions, associations and living pictures from bio-informational memory. This may be done effectively not only by sounds from classical musical systems of stretched strings but also (and more effectively?)
by other sets of sounds, which are structured and tuned on the basis of inherited algorithms of biological processing of genetic information. No wonder the sense of musical harmony is innate.

6 SOME CONCLUDING REMARKS

The facts described in this article about the relations of genetic systems with musical harmony are essential for the problem of the genetic bases of aesthetics and the inborn feeling of harmony. In the words of the famous physicist Richard Feynman about the feeling of musical harmony, "we may question whether we are any better off than Pythagoras in understanding why [stressed] only certain sounds are pleasant to our ear. The general theory of aesthetics is probably no further advanced now than in the time of Pythagoras" (Feynman, Leighton, & Sands, 1963, Chapter 50).

A cultural direction of “genetic art” (or briefly “genoart”) can be developed additionally on the basis of these data of matrix genetics. Genoart has many patterns, which are revealed by matrix genetics and can be used to create new works of art, designs, and architectural and musical compositions. For example, the quint genomatrices can be presented in the form of color mosaics if the matrix numbers are replaced by colors. One can see a regular complication of the color mosaics along the family of the genomatrices as their Kronecker powers increase.

The discovery of the connection of the genetic code with the golden section shows the molecular-genetic basis of many known facts about the aesthetic meaning of the golden section. Specifically, the described facts give new material for the question of architectural canons, where the golden section has long been used; for example, the famous modulor by Le Corbusier (1948, 1953, http://en.wikipedia.org/wiki/Le_Corbusier) is based on the golden section. The pentagram Fibonacci-stage scales can additionally be utilized for architectural proportions (in the role of a “pentagram modulor”).
There is no doubt that applications of numeric genetic matrices to investigations of the various ensembles of parameters of the genetic system can give many unexpected and useful results in the future as well. This direction of theoretical research will be developed in parallel with the development of matrix applications in many other branches of science. The matrix-genetic approach to phenomena of the golden section in genetic systems and aesthetics can be developed in many theoretical ways and can give new interesting mathematical models.

According to the described materials, each gene, each DNA, each protein can be characterized by its own “musical ensemble”. Sequences of appropriate musical intervals from such genetic melodies can be reproduced in the form of sequences of sounds, colors (“color music”), electrical stimuli, impulses of laser beams, etc. for different needs (though the natural frequencies of these physical media are very different). Does such "natural genetic music" (or compositions on its basis) possess a special physiological effectiveness for the treatment of people and animals, stimulation of the growth of plants and microorganisms, and so forth? Only future experiments can give a more precise answer. It seems that the creation of a computer bank of genetic music would be useful for theoretical and practical needs.

One can add here that the creator of analytical psychology, Carl Jung, studying archetypes of human consciousness, created the medical method of amplification. This method is based on an active interaction of his patients with these archetypes, including the famous tables of the Ancient Chinese “I Ching”, which are connected with the genetic matrices (Petoukhov, 2005, 2008; Petoukhov, He, 2010; Tusa, 1994).

Many composers declared early on a mysterious connection of music with the golden section and Fibonacci numbers.
In our opinion, this connection is based on a musical scale tuned to the described scale, which was constructed by analogy with the mathematical sequences discovered in the algebraic structure of genetic coding. The described facts are related to the problem of the genetic bases of aesthetics and the inborn feeling of harmony.

Investigations of numeric genetic matrices are an effective scientific instrument for analyzing multi-component and multi-parametric ensembles of molecular-genetic systems. The obtained results give a new vision of the connections of genetic systems with well-known mathematical objects and theories from other branches of science and culture. Owing to the results of matrix genetics, new opportunities arise to demonstrate the close connection between science and culture. One of them is the problem of multidimensional spaces, including multidimensional musical spaces, which need appropriate algebraic formalisms for their analysis (Kappraff, Petoukhov, 2009; Koblyakov, 1995, 2000a,b).

One should note that our attempt to create a mathematical scale of the golden section, where the factor of the geometric progression is equal to the golden section φ (and not to φ^2), led to a scale that differs cardinally from the Pythagorean musical scale and that was considered not so interesting from the musical viewpoint. Furthermore, such scales of the golden section had no evident connection with Fibonacci numbers in their interval factors.

REFERENCES

Berger L. G. (2001). Epistemology of art. Moscow: Iskusstvo (in Russian).
Carrasco, J., Michaelides, A., Forster M., Haq S., Raval R., Hodgson, A. (2009). A one-dimensional ice structure built from pentagons. Nature Materials, v. 8, 427-431.
Coldea R., Tennant D. A., E. M. Wheeler, E. Wawrzynska, D. Prabhakaran, M. Telling, K. Habicht, P. Smeibidl, K. Kiefer (2010). Quantum Criticality in an Ising Chain: Experimental Evidence for Emergent E8 Symmetry. Science, Jan. 8, Vol. 327, no.
5962, 177-180.
Darvas, G. (2007). Symmetry. Basel: Birkhäuser.
Feynman, R., Leighton, R., Sands, M. (1963). The Feynman lectures. New York: Pergamon Press.
Jean, R.V. (2006). Phyllotaxis. A systemic study in plant morphogenesis. Cambridge: Cambridge University Press.
Kappraff, J. (2000). The arithmetic of Nichomachus of Gerasa and its applications to systems of proportions. Nexus Network Journal, 2(4). Retrieved October 3, 2000, from http://www.nexusjournal.com/Kappraff.html
Kappraff, J. (2002). Beyond measure: essays in nature, myth, and number. Singapore: World Scientific.
Kappraff, J., Petoukhov S.V. (2009). Symmetries, generalized numbers and harmonic laws in matrix genetics. Symmetry: Culture and Science, v. 20, 1-4, 23-50.
Koblyakov, A.A. (1995). Semantic aspects of self-similarity in music. Symmetry: Culture and Science, v. 6, 2, 63-74.
Koblyakov, A.A. (2000a). Synergetics and creativity. Synergetic paradigm. Moscow (in Russian).
Koblyakov, A.A. (2000b). From disjunction to conjunction (the contours of the general theory of creation). The language of science - the languages of art. Moscow, 2000, 75-86 (in Russian).
Konopelchenko, B. G., & Rumer, Yu. B. (1975). Classification of the codons in the genetic code. Doklady Akademii Nauk SSSR, 223(2), 145-153 (in Russian).
Le Corbusier, Sh. (1948). Modulor. Boulogne: Collection Ascoral.
Le Corbusier, Sh. (1953). Der Modulor. Stuttgart: DVA.
Lendvai, E. (1993). Symmetries of music, an introduction to semantics of music. Kecskemét: Kodály Institute.
Needham, J. (1962). Science and civilization in China. Cambridge: Cambridge University Press.
Petoukhov, S. V. (1981). Biomechanics, bionics and symmetry. Moscow: Nauka.
Petoukhov, S.V. (2005). The rules of degeneracy and segregations in genetic codes. The chronocyclic conception and parallels with Mendel’s laws. In: He M., Narasimhan G., Petoukhov S. eds.
Advances in Bioinformatics and its Applications, Proceedings of the International Conference (Florida, USA, 16-19 December 2004), Series in Mathematical Biology and Medicine, v.8, 2005, New Jersey-London-Singapore-Beijing, World Scientific, ISBN 981-256-148-X.
Petoukhov, S.V. (2008). Matrix genetics, algebras of the genetic code, noise-immunity. Moscow: RCD (in Russian).
Petoukhov, S.V., He, M. (2010). Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and Applications. Hershey, USA: IGI Global. 271 p.
Ponnamperuma, C. (1972). The origin of life. New York: E.P.Dutton.
Rashevsky P.K. (1964). Riemannian geometry and tensor analysis. Moscow: Nauka (in Russian).
Schrodinger, E. (1955). What is life? The physical aspect of the living cell. Cambridge: University Press.
Shnoll, S.E. (1989). Physical-chemical factors of biological evolution. Moscow: Nauka (in Russian).
Shubnikov, A. V., & Koptsik, V. A. (1974). Symmetry in science and art. New York: Plenum Press.
Shults, G.E., & Schirmer, R.H. (1979). Principles of protein structure. Berlin: Springer-Verlag.
Tusa, E. (1994). Lambdoma - “I Ging” - Genetic code. Symmetry: Culture and Science, 5(3), 305-310.
Voloshinov, A.V. (2000). Mathematics and arts. Moscow: Prosveschenie (in Russian).
Vorobiev, N.N. (2003). Fibonacci numbers. Basel: Birkhäuser.
Weinberger, N.M. (2004). Music and brain. Sci. Amer., 291(5), 88-95.

Symmetry: Culture and Science Vol. 23, Nos. 3-4, 377-402, 2012

MODELING “COGNITION” WITH NONLINEAR DYNAMIC SYSTEMS

Yuri V. Andreyev*, Alexander S. Dmitriev**

* Physicist (b. Ufa, Russia, 1960). Address: Laboratory of Information and Communication Technologies based on Dynamic Chaos (InformChaos Lab.), Institute of Radio Engineering and Electronics of Russian Academy of Sciences; Mokhovaya st., 11, building 7, Moscow, 125009, Russia. E-mail: [email protected]. Fields of interest: dynamic chaos, chaos for communications, symmetry of chaos, information theory, chaotic cryptography.
Publications: Radiotekhnika, 2008, No. 8, p. 83 (in Russian); Int. J. Bifurcation and Chaos (2005) vol. 15, No. 11, pp. 3639; IEEE Trans. Circuits and Systems-I, 2003, vol. 50, No. 5, pp. 613; Chaos, Solitons and Fractals, 2003, vol. 17, No. 2-3, pp. 531; Int. Journal Bifurcation and Chaos. 1999, vol. 9, no. 12, pp. 2165; Nonlinear Phenomena in Complex Systems, 1999, vol. 2, no. 4, pp. 48.

** Physicist, informatician (b. Kuibyshev, Russia, 1948). Address: Laboratory of Information and Communication Technologies based on Dynamic Chaos (InformChaos Lab.), Institute of Radio Engineering and Electronics of Russian Academy of Sciences; Mokhovaya st., 11, building 7, Moscow, 125009, Russia. E-mail: [email protected]. Fields of interest: dynamic chaos and bifurcation phenomena; generation of dynamic chaos; information processes in complex dynamic systems; information technologies based on dynamic chaos and nonlinear phenomena; application of dynamic chaos in information networks and communications. Awards: State Awards of the USSR Council of Ministers (1984 and 1989); 2 Gold Medals of the 4th Int. Invention Fair in the Middle East (Kuwait, 2011); IEEE Circuits and Systems Society Chapter-of-the-Year Award (2001); Medaille d'Honneur of Int. Exhibition of Inventions (2000) Paris, France; Grand Prize of the Contest of the works on Image Recognition (HP Labs – Bristol, 1992). Publications: Dmitriev A.S., Efremova E.V., Kuzmin L.V., Miliou A.N., Panas A.I., Starkov S.O.: Chapter 15: “Secure Transmission of Analog Information using Chaos”, in: Chaos Synchronization and Cryptography for Secure Communications: Applications for Encryption, ed. Santo Banerjee, IGI Global (2010) pp. 337-360; "Generation of chaos", Tekhnosfera (2012) 424 p. (in Russian); "Dynamic chaos: novel information carriers for communication systems", Fiz.-Mat. Lit (2002) 252 p. (in Russian); "Dynamic chaos as information carrier", New in synergetics: Glimpse in the third millennium. Nauka (2002) pp. 82–122.
(in Russian); Andreyev, Yu.V., Dmitriev, A.S., and Kuminov, D.A. Chaotic processors, Advances in Modern Radioelectronics. (Foreign Radioelectronics), 1997, No. 10, pp. 50-79 (in Russian).

Abstract: In this paper we consider the realization of information processing and recognition with dynamic systems. A method for storing and processing information is described, which is connected with symmetries and in which digital information blocks are related to dynamic attractors (periodic orbits or chaotic attractors) of a specially designed nonlinear dynamic system. Such a system has interesting features concerning information processing, in particular, associative access (retrieval of the whole stored image when a small part of it is given as a request), search by content, a novelty filter, etc. The designed dynamic systems can be used as storage for texts, pictures, digital sequences, etc. Practical examples of using this approach to create information search engines, information archives with associative access to the stored data, and other data management solutions are given.

Keywords: dynamic chaos, symmetries, attractors, information processing, recognition.

1. INTERRELATION OF DYNAMICS AND INFORMATION

The role that dynamic chaos plays in the processing of information by human and animal brains has been extensively investigated in recent decades. The very existence of chaotic modes in the brain is considered beyond doubt, and the efforts of researchers are now concentrated on the study of those special functions of the brain for which chaos is either necessary or has some advantages compared to simple dynamics. The last few decades have witnessed a sharp growth of interest in the processing, memorizing and storing of information in living systems.
Unlike the addressed memory now used in computers, the memory of humans and animals is associative, i.e., both storing and retrieval of information are based not on the index of a memory cell but on the content (Kohonen, 1980). There exist quite a number of concepts for realizing the association principle to one extent or another. One of the most popular among them is the use of neural network models (Grossberg, 1988; Hopfield, 1982; Carpenter, 1989). Such models are described as dynamic systems, and the objects being memorized or recognized are related to basic attractors, viz. stable modes. The basin of attraction of each attractor defines the limits of recognition of one image or another.

The functional role of cortical chaos appearing due to thalamo-cortical interaction is discussed in (Nicolis, 1982; Nicolis & Tsuda, 1985). Chaos is considered as a possible mechanism of self-referential logic, and as a machine for a short-term memory based on that logic.

W.J. Freeman observed chaotic activity in the learning process in the rabbit’s olfactory system (Freeman, 1987; Freeman, Yao & Burke, 1988; Skarda & Freeman, 1987; Eisenberg, Freeman & Burke, 1989; Yao & Freeman, 1990). He found that the rabbit remembers a known smell by coding it in spatially coherent and temporally almost periodic activity of the olfactory potential. When the animal encounters a new smell, the coding mechanism does not work and the activity of the olfactory bulb becomes low-dimensional chaos, as if it were a filter of "novelty", forming the state "I don't know".
Based on the analysis of human electroencephalograms, a hypothesis was suggested (Babloyantz, 1986; Babloyantz & Destexhe, 1986; Destexhe, Sepulchre & Babloyantz, 1988) that the functional role of chaos is determined by the property of chaotic dynamics to increase the resonance capacity of the brain, giving a chance for extremely rich responses to an external stimulus. Among other hypotheses about the functional role of chaos we want to note: a nonlinear pattern classifier (Freeman, Yao & Burke, 1988; Yao & Freeman, 1990), a catalyst of learning (Skarda & Freeman, 1987), a stimulus interpreter (Tsuda, 1984), a memory searcher (Tsuda, Koerner & Shimizu, 1987), etc. A more thorough list of possible roles of cortical chaos in information processing can be found in (Tsuda, 1992), along with a rich reference base for the studies of the 1970s to 1990s.

Procaccia expressed a few ideas (Procaccia, 1988) that point to a connection between chaos, unstable periodic orbits and the information properties of dynamic systems. First, chaotic orbits can be organized around the skeleton of unstable periodic orbits. Each periodic orbit (or point) can be universally encoded using symbolic dynamics. As for the symbolic dynamics, there is a “grammar” that defines the permitted “words”, or periodic orbits. Such a grammar can also be universal, which means that different dynamic systems belonging to the same universality class have the same distribution of periodic orbits in corresponding space points. Finally, periodic orbits and their eigenvalues can be derived directly from experimental data. A corresponding algorithm is given in (Auerbach, Cvitanovic & Eckmann, 1987). Some details of the above ideas are also discussed in (Gunaratne & Procaccia, 1987; Cvitanovic, 1988).

The problem of information streams in 1-D maps is analyzed in (Wiegrinch & Tennekes, 1990).
The authors refer to the studies (Shaw, 1981; Schuster, 2004; Farmer, 1982), in which information is argued to be a fundamental concept in the theory of dynamic systems and chaos. In particular, sensitivity to initial conditions is rigorously related to information production. Further, they consider a dynamic system described by a mapping f of a segment into itself and study how iteration of the map f produces a special process, which the authors call an information stream. In (Matsumoto & Tsuda, 1988) the rates of information streams are estimated, and restrictions imposed by computers are discussed. The rate of an information stream is interpreted as the volume of new data per unit time. For coupled 1-D maps derived from experimental data on the Belousov-Zhabotinsky reaction, the authors showed that the data rate is equivalent to the Kolmogorov entropy KS.

Another reason to consider dynamic chaos from the information viewpoint is the existence of natural objects with deterministic chaotic dynamics (Voges, Atmanspacher & Scheingraber, 1987; Atmanspacher, Scheingraber & Voges, 1988) or with mixed dynamics, containing both deterministic chaos and a random process. As a rule, there is a 1-D signal, which is to be processed in order to obtain more or less detailed information about the object dynamics. Such processing is a method of getting information about the object from the chaotic process that takes place in it.

One can note the known connection of the genetic coding system with dynamic chaos and symmetric patterns of fractals (see, for example, (Almeida et al., 2001; Basu et al., 1997; Deschavanne et al., 1999; Dutta & Das, 1992; Fiser, Tusnady & Simon, 1994; Goldman, 1993; Gutierrez, Rodriguez & Abramson, 2001; Joseph & Sasikumar, 2006; Oliver et al., 1993; Petoukhov & He, 2010; Wang et al., 2005; Yu, Anh & Lau, 2004)).
The pioneering work in this field was (Jeffrey, 1990), where an application of the known symmetrologic method “Chaos Game Representation” (CGR) allowed visualization of long DNA sequences and revealed different fractal patterns in them. When identifying chaos in experimental data such as biological sequences, a researcher can search for a strange attractor in the dynamics, identified by its fractal structure. Having found such an attractor, one can try to estimate its dimension, which is a measure of the number of active variables and hence of the complexity of the equations required to model the dynamics. Fractals are to chaos what geometry is to algebra. They are the useful geometric manifestation of chaotic dynamics and are sometimes called "the fingerprints of chaos". Revealing the fractal structure of such CGR patterns of DNA and protein sequences shows hidden connections of these genetic structures with nonlinear dynamics or chaotic dynamical systems.

Jeff Hawkins's book “On Intelligence” is of special interest (Hawkins & Blakeslee, 2004). The author analyzed contemporary knowledge of the human brain from the viewpoint of technical implementation of the main principles of brain functioning. He claims that intelligence must be treated as memory rather than behavior, and points out that the brain is a purely dynamic system with nothing static in it. In his opinion, the brain operates as a predictive hierarchical associative memory, in which patterns are stored as dynamic structures (e.g., cycles), and the input signal to the brain is time-varying. Associative memory is activated (accessed) by applying time-varying patterns (or parts of patterns, or distorted patterns) at the input. Here, we would like to stress the idea that the carriers of information in such a model of the brain are dynamic objects, not fixed points of any kind.
Thus, experimental investigations of the electric activity of the brain and of certain of its neural subsystems, simulation of various neural networks, and qualitative analysis of information processes in the brain have made it possible to suggest and, to some extent, to prove several hypotheses about the role of chaos in brain activity. The use of different approaches, models and methods in the study of the functional role of chaos leads to the idea that there exist general principles of information processing in chaotic systems, independent of the concrete nature and realization of the systems. This gives hope that the main regularities of information processing can be investigated using simple models. Here the problem arises of a proper choice of the dynamical system, which must be convenient, i.e., simple enough to allow thorough description, and at the same time exhibit complex and chaotic behavior.

The approach that we follow in this paper implies that information processing in a dynamical system is associated with the notion of an attractor in the system phase space that carries information. Information processing, e.g., recognition, is associated with structural transformations of the attractors (bifurcations) and an essential change in the system’s behavior.

The first step of information processing, the storing, is coupled with the synthesis of a nonlinear dynamical system with a phase space of a special structure, i.e., with attractors corresponding to the stored information. This approach is used, for example, in neural networks, where for a given set of images a neural network is synthesized (trained) such that these images correspond to equilibrium states of this dynamical system. The simplest type of attractor, a stable point in the system phase space, is used as the carrier of information.

Attempts are also known (e.g., (Tsuda, 1992, 1994; Baird & Eeckman, 1992)) to use more complicated attractors, such as cycles (periodic orbits) and strange attractors, for carrying information in neural networks. But the enormous complexity of the cooperative motion of the neurons in conventional neural networks makes direct synthesis (calculation) of these networks very difficult or even practically impossible. Instead, to design such a dynamical system, one has to use time-consuming training procedures, which obscures investigation of the general principles of information processing.

Proceeding from the above concept of the existence of general principles of information processing independent of the concrete dynamical system, we proposed to use a class of discrete-time one-dimensional systems, namely, piecewise-linear maps of a segment (an interval) into itself, $x_{n+1} = f(x_n)$. The efforts were concentrated on the synthesis of dynamical systems with prescribed cycles in the system phase space. As a result, a method of storing information using stable limit cycles of 1-D maps as information carriers was proposed (Dmitriev, 1991; Dmitriev, Panas & Starkov, 1991). The use of more complicated attractors, i.e., cycles rather than equilibrium points, offers new capabilities of information processing, for example, associative memory (Dmitriev, 1991; Dmitriev, Panas & Starkov, 1991).

Further investigations have shown that this method of storing information can be applied in practice to storing pictures, texts, signals, etc. (Andreyev, Belsky & Dmitriev, 1992; Andreyev et al., 1992; Dmitriev, 1993; Dmitriev et al., 1993). The method was also extended to storing information in 2-D and multi-dimensional maps and to storing multi-dimensional information sequences (cycles of vectors) (Andreyev, Belsky & Dmitriev, 1994).

In this paper we describe the original method and its developments and generalizations, including the use of chaotic systems and unstable limit cycles for storing information, and discuss opportunities of information processing in such systems.
In section 2 we present the original method (Dmitriev, 1991; Dmitriev, Panas & Starkov, 1991). In section 3 we briefly describe the dynamics of the map with stored information. In section 4 we discuss memory scanning using unstable cycles. In section 5 we demonstrate that information can be stored as chaotic attractors. In section 6 we investigate 1-D maps as recognition machines. A model of “long-term” and “short-term” memory is shown. Finally, we summarize the information processing functions realized in the proposed dynamical systems, and draw some conclusions on the role of chaos in the discussed models.

2. STORING INFORMATION AS DYNAMIC ATTRACTORS

Processing information with dynamic systems implies:
- choice of attractor types suitable for processing;
- choice of dynamic phenomena necessary to implement basic operations of information processing;
- development of principles of unambiguous relation of information with trajectories of the dynamic system;
- development of concrete mathematical models that allow processing information as map trajectories and controlling dynamic phenomena, in order to implement basic processing operations;
- development of software to simulate dynamic processors on a PC;
- investigation of dynamic processor models;
- solution, with the dynamic processor, of complex problems that are hard to solve with traditional approaches.

Each trajectory of a dynamic system can be treated as an information signal, so the set of map trajectories is a certain information “depository”. This “depository” has a number of interesting features, depending on the type of the dynamic system attractors.

2.1. Storing as synthesis of 1-D map

In (Dmitriev, 1991; Dmitriev, Panas & Starkov, 1991) we proposed to store information in nonlinear dynamic systems as dynamic attractors (cycles, periodic orbits or even chaotic attractors). A method of storing information as stable cycles of a map of a segment into itself was proposed.
Since an arbitrary map need not have the required cycles, this is actually a method for the synthesis of a dynamic system (1-D map) in whose phase space there exist cycles of prescribed structure.

This method is based, in part, on the ideas of symbolic dynamics (Lind & Marcus, 1995; Kitchens, 1998). We partition the phase space of the dynamic system into adjacent regions and assign each region a symbol. So, when the phase trajectory visits some region, we treat this as the “appearance” or “production” of the corresponding symbol by this dynamic system. In the theory of symbolic dynamics this partition (a generating partition) must be precise, in order to provide an unambiguous relation between the system variables and the produced symbols. Here, we design an artificial dynamic system, so we may partition the phase space at will. Then we construct the cycles in the system phase space that run through the necessary space regions, which means production of the required symbols in the prescribed order, and finally, design a dynamic system which has these cycles in its phase space.

The procedure is as follows. Since a cycle is finite and can carry a limited portion of information, we store finite blocks of information. Let us introduce the main notions and terms using the example of storing two information blocks, e.g., the 1-D strings “babe” and “add”. For simplicity, we use a subset of the Latin characters, A = {a, b, c, d, e}, as the alphabet. The length of the alphabet is $N_A = 5$. Our aim is to design the function f of a 1-D map $x_{n+1} = f(x_n)$ such that stable cycles exist in the phase space of this dynamic system, and each information block of length n stored in the map is unambiguously related to a cycle of period n. The symbols of the strings are coded by the amplitude of the mapping variable $x_n$. We will store the words in a 1-D map using second-level storing, which means that each point of the cycles is determined by a pair of successive symbols.
We divide the phase space of the dynamical system (the unit interval $I = [0, 1]$) into $N_A$ subintervals of the first level (each of length $1/N_A = 0.2$) and relate them to the elements of the alphabet. Then we repeat the procedure and divide each first-level subinterval into subintervals of the second level (of length $1/N_A^2 = 0.04$), which are also related to the alphabet elements, as shown in Fig. 1.

Now we design two cycles $\gamma_n = \{x_1, x_2, \dots, x_n\}$ unambiguously related to the stored information blocks. The three cycle points for the word add are related to the block fragments (pairs) ad, dd, and da (the information block is mentally closed in a loop). The cycle point corresponding to the fragment ad is the center of the second-level subinterval d located within the first-level subinterval corresponding to the symbol a. The other cycle points are created similarly. The cycle points corresponding to the block babe are determined by the pairs ba, ab, be, eb.

Having created the cycles $\gamma_n$ in the 1-D phase space, we construct a dynamical system whose phase space has this structure. To obtain a map with a cycle of period $n$ passing through the points $x_1, \dots, x_n$, it is necessary to plot the points $(x_1, x_2), (x_2, x_3), \dots, (x_n, x_1)$ on the plane $(X_m, X_{m+1})$ and draw a curve $y = f(x)$ passing through these points. We therefore plot the pairs of successive points $(x_i, x_{i+1})$ for all the cycles; these points form the “skeleton” of the map function $f(x)$. Through these points we then draw short straight-line segments (called information regions), all with the same fixed slope $s$. We will control the stability of the cycles by changing the slope of these segments (turning each of them around its central point, which coincides with the cycle point lying on the segment).

Figure 1: Storing two information blocks “babe” and “add” in 1-D map. Cycle points are designated with squares and diamonds, respectively.
Storage level q = 2, s = 0.5. (a) The map function. (b) Information carrying cycles.

As is known, the stability of a cycle is determined by its multiplier $\mu$. In the case of a 1-D map $x_{n+1} = f(x_n)$, the eigenvalue for the cycle $\gamma_n = \{x_1, x_2, \dots, x_n\}$ is equal to

$$\mu = \left(f^{(n)}\right)'(x_1) = f'(x_1)\, f'(x_2) \cdots f'(x_n). \qquad (1)$$

Here, $\mu = s^n$. If $|\mu| < 1$ ($s < 1$) the cycle is stable, otherwise it is unstable.

To complete the synthesis of the piecewise-linear map function $f(x)$, we connect the information regions and the endpoints of the unit interval in series with straight-line segments, which we will call non-information segments. The plot of the map with the information cycles is shown in Fig. 1.

Iterates of the designed map produce the output information stream: an occurrence of the system variable $x_i$ in a first-level subinterval is treated as “generation” of the corresponding alphabet element. Mathematically, $m_i = \mathrm{int}(N_A x_i)$, where $m_i$ is the order number of this element in the alphabet and int() denotes the integer part of a number. Thus, the motion of the phase trajectory over a cycle in the system phase space is accompanied by continuous reproduction of the corresponding information block.

Storing information as stable limit cycles allows easy associative access to the stored information. If an equilibrium point is used as an information carrier, all or most of the information is needed to access the point and retrieve the information. If an image is stored as a stable cycle, as in our case, each point of the cycle is related to only a part of the image, so only a piece of the original information is necessary to get a point near the cycle and to retrieve the whole image by iterating the dynamical system. Thus, associative access to the stored information becomes possible, though at the expense of iteration time.
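The synthesis procedure of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software: the names `cycle_points`, `build_map` and `read_out` are hypothetical, each information region is taken to span one full second-level subinterval, fragments are taken cyclically (so the last pair of babe joins e back to b), and the values of the map at the endpoints 0 and 1 are assumed to be 0 because the text does not specify them. Blocks whose information regions touch each other or the endpoints are not handled.

```python
from bisect import bisect_right

ALPHABET = "abcde"      # alphabet A from the example
N = len(ALPHABET)       # N_A = 5
Q = 2                   # storage level

def cycle_points(block, q=Q):
    """Centre of the level-q subinterval for each cyclic length-q fragment."""
    idx = [ALPHABET.index(ch) for ch in block]
    n = len(idx)
    return [
        sum(idx[(i + k) % n] / N ** (k + 1) for k in range(q)) + 0.5 / N ** q
        for i in range(n)
    ]

def build_map(blocks, s=0.5, q=Q):
    """Piecewise-linear f with an information region of slope s at every cycle point."""
    w = 1.0 / N ** q                     # assumed width of an information region
    knots = [(0.0, 0.0), (1.0, 0.0)]     # ASSUMPTION: endpoint values are not given in the text
    for block in blocks:
        pts = cycle_points(block, q)
        for x, y in zip(pts, pts[1:] + pts[:1]):      # pairs (x_i, x_{i+1}), closed in a loop
            knots.append((x - w / 2, y - s * w / 2))  # left end of the information region
            knots.append((x + w / 2, y + s * w / 2))  # right end of the information region
    knots.sort()
    xs = [k[0] for k in knots]
    ys = [k[1] for k in knots]

    def f(x):
        # locate the straight-line segment containing x and interpolate on it
        i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    return f

def read_out(f, x0, steps):
    """Iterate the map and emit the symbol with order number int(N_A * x_i)."""
    out, x = [], x0
    for _ in range(steps):
        out.append(ALPHABET[min(int(N * x), N - 1)])
        x = f(x)
    return "".join(out)

f = build_map(["babe", "add"], s=0.5)
print(read_out(f, 0.23, 8))   # start near the cycle point for the pair "ba"
print(read_out(f, 0.14, 6))   # start on the cycle point for the pair "ad"
```

With s = 0.5 a starting point inside an information region is pulled towards the cycle, and the symbol stream repeats the stored block from whatever phase the starting point selects.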
Indeed, if we take an excerpt $a_i a_{i+1} \dots a_j$ of an information block with length equal to or greater than the storage level $q$, we can apply the same procedure as in creating the cycle points and get a point lying exactly on the corresponding cycle. Note that this is direct access to information: the offered excerpt is not compared with all the images; instead, an initial point lying on the required cycle is calculated directly, so the access to information is very fast.

2.2. Storing texts, signals, images in 1-D map

Not only simple letter sequences can be stored in 1-D maps. Any kind of information that can be represented as a symbolic sequence over a certain alphabet can be stored. Texts, DNA sequences, vocal sheet music, etc., can be represented by symbols from a finite set, i.e., they can be stored as cycles of a 1-D map. Moreover, the method can also be applied to more complex data, e.g., 2-D digital images, because they can also be transformed into 1-D symbol sequences, as shown in Fig. 2.

As a cycle of period $p$ repeats itself after $p$ iterations (here, $p = mn$), all cyclic permutations are equivalent and generate the same trajectory. We can say that the system loses its initial phase when converging towards the cycle. This is undesirable when working with pictures, since a picture has a definite beginning and it is difficult to recognize a cyclic permutation of the picture. Therefore, we mark the beginning of the information block with a label (a special symbol of the alphabet). Correspondingly, when restoring the pattern from the values of the mapping variable on the limit cycle, we consider the first element after the label as the beginning of the picture (the label itself is not displayed). Thus the alphabet is augmented with a special symbol for the label. The label element is used only once in each information block.
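The role of the label can be illustrated with a short Python sketch. It assumes that the symbol stream read from the limit cycle is already available as a string and that the label is the symbol 2; the function name `restore_block` and the five-symbol block used here are hypothetical.

```python
LABEL = "2"   # assumed label symbol

def restore_block(stream, period):
    """Return one period of a cyclic symbol stream, starting right after the label."""
    window = stream[:2 * period]               # two periods always contain a label followed by a full block
    start = window.index(LABEL) + 1            # first element after the label
    return window[start:start + period - 1]    # the label itself is not displayed

# A stored block "2" + "1101", read out from an arbitrary phase of the cycle.
stream = "0121101211012110"
print(restore_block(stream, period=5))
```

Whatever phase the trajectory has when it reaches the cycle, the restored pattern starts at the same element, which is the purpose of using the label only once per block.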
Figure 2: Representation of the letter E. The label symbol is added to the string of black and white elements.

As an example, let us make an information block from a picture of the letter E (Fig. 2). The picture is represented here by an 8×8 array. Assume that the alphabet consists of three elements 0, 1 and 2, where 0 corresponds to black, 1 corresponds to white, and 2 corresponds to the label. Then the information block is written as

21111111001100010011010000111100001101000011000101111111000000000. (2)

Since the above string contains two identical substrings (001101000011) of length 12, the minimum number of levels necessary to store the image with the original storing method cannot be less than 13.

Let us estimate how much information can be stored in a 1-D map at level $q$. Since the size of an information region is $N^{-q}$, no more than $N_A^q$ symbols can be stored (in that case, all function segments are information regions). Proceeding to the more usual bits, we obtain the utmost capacity limit of $E_{max} = N_A^q \log_2(N_A)$ bits. If we consider, for definiteness, storing blocks of equal length $l$, the information capacity $E$ of the method can be estimated as (see the reasoning in Andreyev et al., 1992):

$$E \approx \begin{cases} N^{l} / l, & l \le q; \\ N^{q} / l, & l > q. \end{cases} \qquad (3)$$

2.4. Storing and coding arbitrary information blocks

As can be seen from the example in Fig. 2, it is difficult to store information blocks having large identical fragments. The number of storage levels must exceed the length of the identical fragments, but an increase in the number of storage levels leads to an exponential decrease in the size of the information regions. The finite accuracy of computer calculations puts a limit on the potential number of storage levels. To overcome this difficulty, a development of the original method was proposed in (Andreyev, Belsky & Dmitriev, 1992; Andreyev et al., 1992). Analysis of information sequences (pictures, texts, etc.)
shows that the main storing difficulties are associated with repeating pieces of data. This leads us to the idea of compressing information (eliminating redundancy) before storing it. A compression method matched to the storing requirements (e.g., at level $q$), using an alphabet of repeating fragments, was proposed (Andreyev et al., 1992; Andreyev, Belsky & Dmitriev, 1994; Andreyev et al., 1996a). The idea is to substitute repeating fragments of length $q$ with new symbols of the alphabet. Thus, shorter information blocks are obtained, while the alphabet is extended. The procedure is repeated until the whole set of blocks becomes "q-storable", i.e., contains no identical fragments of length $q$. Thus, the encoding procedure eliminates information redundancy by encoding the repeating fragments with symbols. Using this coding method, any set of information blocks can be stored at any level, beginning from the second.

The coding method is reversible, i.e., lossless. To decode information, symbols of the addition alphabet are substituted by length-$q$ fragments; if necessary, this is repeated several times. From the viewpoint of compressing information, the described method is very close to the known lossless Lempel-Ziv (LZ) compression methods, so its compression ratio is very close to theirs. The difference is that this compression method is matched with the method of storing at level $q$.

2.4.1 Associative access to encoded information

To implement associative access, i.e., recovery of a stored image by an arbitrary fragment of it, it is necessary to set the initial conditions on the related attractor. For this, a fragment $(a_j a_{j+1} \dots a_{j+q-1})$ of length $q$ of the information block is required, which is used to calculate the initial point $x_0$:

$$x_0 = \sum_{k=1}^{q} \frac{a_{j+k-1}}{N^{k}}. \qquad (4)$$

Associative access in the case of encoded information is organized as follows.
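Whether a set of blocks is "q-storable" can be checked mechanically. The Python sketch below is an illustration only: the helper names are hypothetical, and every block is assumed to be closed in a loop, as in the storing procedure. It is applied to the letter E block (2) and to the pair “babe” and “add”.

```python
def cyclic_fragments(block, q):
    """All length-q fragments of a block closed in a loop, one per starting symbol."""
    n = len(block)
    doubled = block * (q // n + 2)      # long enough for every wrapped window
    return [doubled[i:i + q] for i in range(n)]

def is_q_storable(blocks, q):
    """True if no length-q fragment occurs twice in the whole set of blocks."""
    frags = [fr for b in blocks for fr in cyclic_fragments(b, q)]
    return len(frags) == len(set(frags))

def min_storage_level(blocks):
    """Smallest level q at which the set can be stored without any coding."""
    limit = sum(len(b) for b in blocks)
    for q in range(1, limit + 1):
        if is_q_storable(blocks, q):
            return q
    raise ValueError("blocks share a fragment at every level (duplicate or periodic block)")

# Information block (2): label symbol 2 followed by the 64 picture elements.
E_BLOCK = "2" + "1111111001100010011010000111100001101000011000101111111000000000"

print(min_storage_level([E_BLOCK]))          # level needed without coding
print(is_q_storable(["babe", "add"], 2))     # the two words at the second level
```

A check of this kind shows when the substitution of repeating fragments can stop: coding is repeated until `is_q_storable` holds at the chosen level.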
If a fragment of the encoded information block of length $q$ is given as a request, then according to expression (4) the initial point $x_0$ can be calculated, the encoded information block recovered, and the original information block then decoded. However, it is much more interesting to recover information from fragments of the original (not coded) blocks. To implement associative access in this case, we should be able to transform a request given in the initial alphabet into the corresponding encoded fragment.

The first step is encoding this request using the present addition alphabet, i.e., the table of fragments obtained by orthogonalization of the information blocks. The problem is that the request fragment can begin with an arbitrary symbol of the original information block, so it can begin or end with symbols that do not exist in the encoded block because they were incorporated into new symbols. For example, if a block (abcdefghijk) after coding at the third level became (xdyhz), where x = (abc), y = (efg), z = (ijk), and a fragment (cdefghi) of the initial block is presented as a request, then after encoding this fragment with the addition alphabet the sequence (cdyhi) is obtained, which contains a “correct” piece of the encoded block (dyh) in the center and "garbage" elements at the beginning and at the end.

To get rid of them, we move a q-element window along the encoded request until we find an initial point $x_0$ that hits an information interval. We can do this because we know the map function. Then we iterate the map beginning from this initial point and compare the generated information stream with the symbols of the encoded request. If there is a match, at least for a few iterates, then the matching symbols are “correct” and the other symbols are “garbage”. If the encoded request fragment contains a correct piece of length at least $q$, it will be found.
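The window search can be sketched for the example above. The Python code below is a simplified, purely symbolic illustration: the function names are hypothetical, the ordering of symbols in the extended alphabet is an assumption, and membership of a window in the set of stored fragments stands in for "the point hits an information interval". The comparison of the iterated stream with the request is omitted.

```python
TABLE = {"x": "abc", "y": "efg", "z": "ijk"}   # addition alphabet from the example
ALPHABET = "abcdefghijk" + "".join(TABLE)      # ASSUMPTION: ordering of the symbols is arbitrary
N = len(ALPHABET)
Q = 3                                          # storage level

def encode(text):
    """Replace every tabulated fragment by its new symbol."""
    for sym, frag in TABLE.items():
        text = text.replace(frag, sym)
    return text

def decode(text):
    """Substitute the new symbols back by their fragments."""
    for sym, frag in TABLE.items():
        text = text.replace(sym, frag)
    return text

def cell(fragment):
    """Integer index of the level-Q subinterval addressed by a fragment."""
    c = 0
    for ch in fragment:
        c = c * N + ALPHABET.index(ch)
    return c

def initial_point(fragment):
    """Expression (4): x0 = sum over k of a_k / N**k, for a fragment of length Q."""
    return cell(fragment) / N ** Q

def recall(request, stored):
    """Slide a Q-element window over the encoded request; return (x0, decoded block) or None."""
    n = len(stored)
    doubled = stored * 2
    regions = {cell(doubled[i:i + Q]): i for i in range(n)}   # information intervals of the cycle
    enc = encode(request)
    for j in range(len(enc) - Q + 1):
        frag = enc[j:j + Q]
        if cell(frag) in regions:             # the window hits an information interval
            i = regions[cell(frag)]
            return initial_point(frag), decode(doubled[i:i + n])
    return None

print(encode("cdefghi"))
print(recall("cdefghi", "xdyhz"))
```

The block is returned as a cyclic permutation that starts at the matched fragment; a label symbol, as used for pictures, would fix its beginning.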
When the initial point $x_0$ is obtained, the entire information sequence can easily be restored and, consequently, the entire image. Thus, the system of associative memory based on the described principles, in response to the offered information block or a part of it, practically immediately forms one of two answers: it either returns an initial point $x_0$ on the corresponding attractor (limit cycle), from which the entire block can be restored, or it answers that the offered information is insufficient to recover the information unambiguously (perhaps because no such information is stored in the map).

Note that forming the initial point on the attractor and, consequently, recovering the initial image from its fragment take place without comparing the request fragment with all stored images. After encoding the request and calculating the initial point $x_0$, just a few iterates are necessary to determine whether the point lies on a map attractor, and the time of each iteration of the map is proportional to the logarithm of the volume of stored information. Thus, the described associative access operates as a very fast correlator.

2.5. Storing information in 2-D and multi-dimensional maps

To store information at the $q$th level with the original method, we used nested subdivisions of the system phase space. Instead, the other dimensions of the space $R^q$ can be used. The method for storing information as cycles of dynamic systems has been generalized to 2-D and higher dimensions (Andreyev, Belsky & Dmitriev, 1994). Regular procedures for designing multi-dimensional maps have been developed. Even multi-dimensional signals can be stored in such maps. Storing multi-dimensional signals with the capability of quick associative search can find practical application in geophysical studies, tomography, the construction of hierarchical systems, etc.

3. DYNAMICS OF MAPS WITH STORED INFORMATION

By construction, cycles with stored information are stable if the slope $s$ of the information regions of the map is less than 1. Here we discuss the phenomena that occur when these limit cycles are made unstable, i.e., in the case $s > 1$ (Andreyev, 1995). To investigate these phenomena, bifurcation diagrams are built for the parameter $s$.

As shown in (Maistrenko, Maistrenko & Sushko, 1994a, 1994b), in piecewise-linear maps, besides the ordinary cycles of points $\gamma_m = \{x_1, x_2, \dots, x_m\}$, there may be cycles of intervals $\Gamma_m = \{I_1, I_2, \dots, I_m\}$, i.e., chaotic attracting sets composed of a finite number of intervals $I_k$. Each interval is mapped exactly into the next one, i.e., $I_{k+1} = f(I_k)$. When moving on this attractor, the trajectory $x_{n+1} = f(x_n)$ passes through points $x_1, x_2, \dots, x_m, \dots$ located sequentially on the intervals $I_1, I_2, \dots, I_m$. In (Maistrenko, Maistrenko & Sushko, 1994a, 1994b) the theory of piecewise-linear maps with a single extremum was developed, and the birth mechanisms of cycles of chaotic intervals and the dynamic properties of these attractors were investigated.

For example, consider bifurcation diagrams of the map with two information blocks 12345 and 97583 stored at the second level. The size of the information regions here is 0.01. When the cycle stability is lost at $s = 1$, two interval cycles are born for each cycle of points, one at each end of the information regions. Fig. 3(a) depicts the case when, upon stability loss of the limit cycle carrying information block 97583, the phase trajectory is attracted to the right interval cycle in the vicinity of this information cycle. At $s \approx 1.03$ the interval cycle at the right border loses stability, and the trajectory leaves it and goes towards the attractor on the left border. If the latter were stable at that moment, the trajectory would remain there.
But due to a similar instability, the trajectory remains there for some time, then returns to the first attractor, and so on. Thus, the birth of a united interval cycle that embraces the entire information segments of cycle 97583 is observed.

Figure 3: Bifurcation diagrams of the maps with two information blocks, panels (a) and (b).

The slopes of the non-information segments adjacent to different information segments are different, so the “lifetimes” and the stability conditions with respect to the parameter $s$ are different for each of these interval cycles. Fig. 3(b) depicts the case in which, at the bifurcation of stability loss of the limit cycle corresponding to information block 12345, the trajectory is attracted to the newly born stable interval cycle that goes through the right border of the information segments. But it loses stability practically immediately, and the trajectory goes to the interval cycle at the other border of the information segments. At $s \approx 1.015$ the united interval cycle $\Gamma_5$ is born. At the moment when it also loses stability (at $s \approx 1.025$), there still exists a stable interval cycle connected with the other information block 97583, and the phase trajectory is attracted to it.

At $s = s_2 \approx 1.07$, the united interval cycle for block 97583 loses stability. Since no stable structures remain in the phase space of the dynamic system, the phase trajectory starts wandering over the system phase space. At the moment of their appearance, the holes in the interval cycles are small, and the trajectory spends most of its time in the vicinity of the interval cycles, but with increasing slopes the holes widen and the distribution of the map variable becomes more uniform.

Analysis of bifurcation diagrams of the maps with stored information shows that at $s < 1$ the only stable attractors in this system are the information limit cycles, and at $1 < s < s_2$ they are the chaotic interval cycles. At $s > s_2$ a transition to global chaos through intermittency takes place in the dynamic system.

4. RANDOM ACCESS TO STORED IMAGES

In section 2 we described the method for storing information in 1-D maps using stable limit cycles. In this section we show the possibility of realizing random access, in the sense that the trajectory visits all regions that correspond to information stored in the map. It is clear that we cannot use stable limit cycles for this purpose. We want the trajectory to visit each region of the phase space (or just the regions where information is stored) from time to time. For this purpose, we can use a strange attractor with a nonzero distribution of the system variable at every point of the interval [0, 1]. However, in general, a trajectory on a strange attractor visits each region only briefly. For the random access memory, we exploit the phenomenon of intermittency. The idea of memory scanning using intermittency was discussed qualitatively by Nicolis (1986, 1991). In the intermittency mode, the phase trajectory "lingers" in some definite regions of the phase space. So, we want the trajectory to linger in the vicinities of the limit cycles corresponding to information blocks.

"Random access" can therefore be realized using unstable cycles to store information blocks. Each cycle can easily be made unstable by letting the absolute value of the cycle multiplier (1) (i.e., the product of the slopes at the cycle points) be slightly larger than 1. A trajectory starting from the vicinity of a cycle will leave the cycle after some time and wander in the phase space until it gets into the neighborhood of the same limit cycle or another limit cycle with the same stability property. Then the trajectory will linger near that limit cycle for some time before leaving it again. If, independently of the initial condition, the trajectory visits the neighborhoods of all points, intermittency with respect to all cycles corresponding to stored information will be observed over short but random time intervals.
It must be noted that intermittency takes place only in a rather narrow region of the parameter space.

Let us demonstrate random-access memory on an example of 2-D data, i.e., graphic images. Ten information blocks, corresponding to ten 8×8 patterns (depicting the letters A, b, C, d, E, f, G, h, I, j), are stored in a 1-D map at the second level (Fig. 4a). The information blocks were compressed beforehand. The map is designed in the same way as described earlier, with the only exception that the slopes $s$ of the information segments are slightly greater than 1. Specifically, s = 1.02; hence, the cycles are unstable. The sequence of snapshots in Fig. 4b demonstrates the intermittent appearance of all ten images. As can be seen in Fig. 4b, the trajectory visits all regions with stored information and lingers in the vicinity of the unstable information cycles long enough. By switching the slopes of the information segments and making them less than 1, an unstable cycle can be made stable.

Figure 4: (a) Set of letters stored on 1-D map, and (b) intermittency between cycles representing letters. Snapshots are taken at arbitrary moments.

5. STORING INFORMATION AS CHAOTIC ATTRACTORS

Motion over an interval cycle is chaotic, yet the cycle itself is stable and confined in space. An interval cycle consists of a finite number of continuous intervals that embrace the information intervals of the corresponding information cycle (or parts of them) and small parts of the adjacent non-information intervals. The order in which the trajectory goes round the intervals of this chaotic attractor coincides with the order of the points of the information limit cycle located on these intervals. Therefore, if we consider the information stream $a_i a_{i+1} a_{i+2} \dots$ (where $a_i = \mathrm{int}(N x_i)$ and int(x) is the integer part of x) produced by the map $x_{n+1} = f(x_n)$ during the motion on this chaotic attractor, it reproduces the stored information block.
Figure 5: Interval cycles of the map with information block 375 stored at the first level: (a) time series, (b) phase portrait, (c) probability distribution.

In Fig. 5, the use of chaotic attractors (interval cycles) for storing information is shown for the case of information block 375 stored at the first level. The solution for the slope s = 1.125 is depicted.

6. UNSTABLE CYCLES AND RECOGNITION

Storing information as periodic motions in 1-D maps is an easy and efficient method of organizing associative access. However, an analysis of the papers devoted to information processing in natural brains indicates essentially more complicated behavior in natural neural networks (e.g., see the review in (Tsuda, 1992)). In particular, periodic motion often points to some “degenerate” states in the brain, the “usual” state being chaotic (Skarda & Freeman, 1987; Yao & Freeman, 1990; Babloyantz & Destexhe, 1986). Besides, the existence of stable limit cycles in the phase space of the 1-D map leads to competition of cycles, which means that iterates from an arbitrary initial point can converge to any cycle, because the attraction basins of the cycles have a fractal structure (Dmitriev, 1991; 1993).

In the storing procedure described above, the cycles can easily be made unstable. All that is necessary is to change the slope of the information regions of the piecewise-linear map function to $s > 1$.

6.1. Direct map function control

A number of methods can be used to retrieve information from the map. Here we demonstrate the method of direct control of the map function (slope switching), which makes the desired cycle stable while keeping the others unstable, and apply this method to image recognition (Andreyev, Belsky & Dmitriev, 1992).
According to the map design procedure, the phase space of the map contains a skeleton of unstable periodic orbits coupled with the stored information blocks and passing through the information regions of the map function. Unstable periodic orbits coupled only with non-information segments of the map are ignored in the further discussion.

We want to derive a regular procedure for deforming the map function $f$ such that, if a presented image coincides with one of the stored images (or is close enough to it), the corresponding cycle becomes stable and attracting. The phase trajectory then converges to this stable periodic orbit, and the stored image is reproduced by the system, which can be treated as recognition. Otherwise, no stable cycles appear, and the motion in the dynamical system remains chaotic. Thus, the character of the motion in the modified dynamical system, regular or chaotic, indicates the result of recognition.

Let $M$ information blocks be stored in the map at the $q$-th level, and let an image be presented for recognition in the form of a string of $L$ symbols. The question that the system is to answer is whether this image corresponds to any of the stored information blocks. The procedure is as follows. We look through the $L$ fragments of this string, each $q$ symbols long, and change the slopes of those information regions of the map function that correspond to the fragments, so that they become less than one in magnitude, as in Fig. 6. If the presented image coincides with an image stored in the system, all information regions of the map coupled with this image become switched, and the cycle becomes stable. If we now iterate the modified map beginning from an arbitrary initial point, we find that after some time the phase trajectory converges to this single stable limit cycle, and the system’s behavior becomes regular.
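The slope-switching rule can be illustrated symbolically, without building the map itself. In the Python sketch below the slope values 1.5 and 0.5 and the function names are assumptions made for the example; the stability of each stored cycle is judged from the product of the slopes of its information regions, i.e., from the cycle multiplier.

```python
from math import prod

Q = 2                               # storage level
S_UNSTABLE, S_STABLE = 1.5, 0.5     # ASSUMED slopes before and after switching

def cyclic_fragments(block, q=Q):
    """All length-q fragments of a block closed in a loop."""
    doubled = block * (q // len(block) + 2)
    return [doubled[i:i + q] for i in range(len(block))]

def recognize(presented, stored_blocks):
    """Switch the slopes addressed by the presented string; report which cycles became stable."""
    # every information region initially has a slope greater than 1
    slopes = {fr: S_UNSTABLE for b in stored_blocks for fr in cyclic_fragments(b)}
    for fr in cyclic_fragments(presented):
        if fr in slopes:
            slopes[fr] = S_STABLE    # switch this information region
    multipliers = {b: prod(slopes[fr] for fr in cyclic_fragments(b)) for b in stored_blocks}
    stable = [b for b, mu in multipliers.items() if abs(mu) < 1]
    return stable, multipliers

print(recognize("babe", ["babe", "add"]))   # the cycle of "babe" becomes stable
print(recognize("cec", ["babe", "add"]))    # nothing in common: no stable cycle
```

Because stability depends on the product of the slopes, a cycle can already become stable when only some of its regions are switched, which is one way to read the remark that a presented image only has to be close enough to a stored one.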
If the presented image has nothing in common with the stored images, the map function $f$ is not distorted, and the motion in the dynamical system remains chaotic.

Figure 6: Switching the slope of an information region.

It is important that only one attracting cycle appears in the system phase space. Therefore, if the global chaotic attractor comprises the information regions of the map, then in the modified map the trajectory will inevitably “fall”, after some time, onto an information region of the stable cycle and converge to this cycle. Pictorially, this can be described as the appearance of a "hole" in the chaotic set, through which the trajectory “leaks” out from the chaotic to the regular mode. The stable limit cycle appearing because of the crisis is unique, so the recognition process is practically independent of the initial conditions for the phase trajectory. The choice of an initial point determines only the duration of the transient process from the metastable chaotic set to the stable limit cycle, i.e., the time of recognition.

The method of direct map function control that was designed to retrieve information from the map is based on knowledge of the concrete map construction, but some general methods, such as cycle stabilization by the OGY procedure (Ott, Grebogi & Yorke, 1990) or chaotic synchronization (Afraimovich, Verichev & Rabinovich, 1986; Pecora & Carroll, 1990), also seem applicable for this purpose.

6.2. Adaptive Model and Recognition

The possibility discussed above of retrieving information stored as unstable cycles of a 1-D map is an intermediate step in creating a model of a “living” system that processes a continuous information stream and is capable of selecting (“recalling”) information images stored in the system.
If there are excerpts of “known” information objects in the input information stream, the system “recalls” them and reproduces them completely (because of the associative property) in the output information stream; if there are no “known” objects in the input stream, the system returns to the initial chaotic state. These different states of the system can be related to the “short-term” memory (“inspiration”) and “long-term” memory (“storage”) inherent in brain systems.

Realization of these properties is possible with an adaptive model, a generalization of the above model, in which information is stored as unstable cycles and the form of the map function is controlled by an external signal (Fig. 7). The external signal here is an endless sequence of symbols fed to the system input. The elements of the external signal all belong to the same alphabet as the stored information blocks. They are fed to the system input synchronously, one per iteration. The input signal does not influence the system variable $x_n$ but directly controls the system function $f$ by changing the slopes of the information regions. The slopes in this model are not switched but “slowly” oscillate between two boundaries. Now, when an image at the input is not known immediately as a whole, the map function is modified permanently at each step according to the piece of information available at the moment. If the input contains pieces of information related to some information regions, their slopes are rapidly decreased with iterates; if not, they slowly return to the initial value.

Figure 7: Block diagram of the adaptive recognition model.

If the storage was made at the $q$-th level, the input symbols are accumulated to form a fragment of length $q$ and, consequently, a point $X$ on the map. If point $X$ hits information region $j$ at the $i$-th time moment, the slope $s_j^i$ of this region is decreased according to some rule and can become less than one in magnitude.
Besides, backward relaxation is introduced: if at moment $i$ the input signal does not correspond to information region $j$, then the slope $s_j^i$ begins to return to its initial state, i.e., to the value of the slope in the absence of the external signal. Thus, at each moment only one information region of the map function can be turned downward; all the others are turned upward. If the input sequence consists of successive repetitions of a stored image, the related information regions will eventually become modified (their slopes become less than one in magnitude), and a stable cycle will appear in the system. Thus, the system permanently adapts itself to the input signal.

Let us designate the upper boundary for the slopes of the information regions as $S_u$ (> 1) and the lower boundary as $S_s$ (< 1). The general equation describing the dynamics of $s_j^i$ is then

$$s_j^{i+1} = \left[\alpha s_j^i + (1 - \alpha) S_s\right]\delta_{ij} + \left[\beta s_j^i + S_u (1 - \beta)\right](1 - \delta_{ij}), \qquad (5)$$

where $\delta_{ij} = 1$ if $i = j$, otherwise $\delta_{ij} = 0$ (this means that at each time moment each information region is turned in only one direction); $\beta$ determines the rate of relaxation, and $\alpha$ the rate of convergence. In numerical experiments we used the values $\alpha = 0.1$ and $\beta = 0.9$, which provided fast convergence to $S_s$ and slow return to $S_u$. As follows from (5), convergence to $S_s$ in the presence of the corresponding external signal may take place only if the condition $0 < |\alpha^n| \ll |\beta| < 1$ is satisfied, where $n$ is the length of the corresponding cycle. The simplest case $\alpha = 0$ corresponds to one-step convergence.

6.3. Models of "long-term memory" and "short-term memory"

We will now show how the notions of "long-term memory" and "short-term memory", widely used in the study of the principles of memory functioning in living systems (e.g., (Klatzky, 1975)), apply to the behavior of the adaptive system.

Let us begin with the long-term memory. After information blocks are stored, they are present in the system all the time, and the carriers of information are the unstable cycles. So, such a system may be interpreted as a long-term memory. Information is present in the system, but an external stimulus is necessary to retrieve it. The external stimulus is a signal containing information blocks stored in the system, either precise or with some errors. In general, the system does not respond to other information and remains in the chaotic state.

If an external signal with stored information is fed to the system input, a stable limit cycle appears in the place of one of the unstable cycles. When the external stimulation ceases, this cycle remains stable until the slopes of the corresponding information regions return to their initial values, i.e., while the condition for cycle stability $|\mu| < 1$ on the cycle multiplier $\mu$ is fulfilled.

Figure 8: “Online” recognition of the input stream representing repetitions of the stored information blocks.

An example of an “online” system is given in Fig. 8, where the dynamical system with three information blocks 123, 14568, 97583, stored at the second level, demonstrates transitions between the three stored information blocks. The input signal, the output information stream and the map trajectory are presented. The input signal (upper plot) consists of pieces of oscillations representing the stored images, 200 points each. As seen from the figure, the phase trajectory $x_n$ follows the input images (though with a time delay), so the dynamical system successfully recognizes the images in the input signal. A certain delay in switching between the cycles in the output stream is associated with the discussed effect of the short-term memory.

Competition of stable information cycles caused by strong dependence on the initial conditions disappears, because a single attracting cycle exists in the recognition process. Chaos can be treated here as a reservoir containing “useful” trajectories (along with many other ones).
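The interplay of fast convergence and slow relaxation can be sketched numerically. The Python code below assumes one particular reading of the slope dynamics: a region hit by the input moves towards the lower boundary with coefficient 0.1, and every other region relaxes towards the upper boundary with coefficient 0.9. The boundary values 1.2 and 0.5, the three-point cycle and the function names are assumptions made only for this illustration.

```python
from math import prod

ALPHA, BETA = 0.1, 0.9    # rates quoted in the text (convergence, relaxation)
S_U, S_S = 1.2, 0.5       # ASSUMED upper and lower slope boundaries

def step(slopes, hit):
    """One update of all slopes; `hit` is the index of the region addressed by the input (or None)."""
    return [
        ALPHA * s + (1 - ALPHA) * S_S if j == hit else BETA * s + (1 - BETA) * S_U
        for j, s in enumerate(slopes)
    ]

def run(n=3, periods=30):
    """Drive an n-point cycle with its own block, then remove the stimulus."""
    slopes = [S_U] * n
    for i in range(n * periods):
        slopes = step(slopes, i % n)     # the input repeats the stored block
    mu_driven = prod(slopes)             # cycle multiplier under stimulation
    hold = 0
    while prod(slopes) < 1:              # stimulus removed: pure relaxation
        slopes = step(slopes, None)
        hold += 1
    return mu_driven, hold

mu, hold = run()
print(mu)     # below 1: the cycle is stable while the block is presented
print(hold)   # iterations for which the cycle stays stable after the input stops
```

The second number is a rough counterpart of the short-term memory effect: the cycle stays stable for a finite number of iterations after the stimulus is removed, until the slopes have relaxed far enough for the multiplier to exceed 1.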
The main role of the global chaos in these systems seems to be the global mixing, which provides guaranteed though random access to all the stored images: independently of the initial point, the system phase trajectory will sooner or later arrive in the vicinity of any required cycle (in a properly designed map).

7. CONCLUSIONS

Models of human cognition using nonlinear dynamic systems with information stored as dynamic attractors are described. They implement a wide range of information processing functions, such as storing and retrieval, associative memory, memory scanning based on intermittency, image recognition based on storing with unstable cycles and direct map function control, “novelty filter”, “long-term” and “short-term” memories, etc.

Associative memory is realized by storing information as a dynamic attractor (stable cycle), which allows access to a stored image by its fragment, i.e., from a small part of the stored image the system “recalls” the place where the image is stored, finds a point on the corresponding cycle and retrieves the whole image by means of iteration (going along the cycle).

Memory scanning or random access is meant here as a mode in which the phase trajectory wanders chaotically over the phase space and comes “at random” into the vicinity of this or that information cycle, lingers there for some time, sufficient to recover the stored image, and then leaves this vicinity and goes wandering further. This is a “slightly chaotic” mode, with information cycles made unstable but with the cycle multipliers (1) very close to unity.

The designed systems can operate as a “novelty filter”, i.e., when a piece of data is presented to the system with information blocks stored as stable cycles, the system immediately answers the question of whether this data piece is known to the system (i.e., it is stored) or is “new”.
For the presented data piece the system calculates a point in the phase space (or a set of points) and checks whether they fit any cycle point.

Image recognition by means of direct map function control, and the “long-term” and “short-term” memory described in Section 6, become possible when information is stored as unstable cycles and the cycle stability is varied depending on the request presented.

Note also the surprisingly high performance of the chaotic processing models, which makes it possible to solve rather complex and large problems by ordinary computer simulation. It seems that this is due not only to a good model but also to the properties of the chaotic systems themselves, e.g., flexibility and quick reaction to external influence. The discussed models are very simple and allow complete description, yet they possess considerable information capacity, and may be of practical interest from the viewpoint of information processing technologies using chaos.

In addition, one can see that the described materials are closely connected with methods of symmetrology for studying many characteristics of cyclic processes, periodic orbits, fractal structures and other typical objects in the field of dynamic chaos.

REFERENCES

Afraimovich V.S., Verichev N.I., & Rabinovich M.I. (1986) Chaotic synchronization of oscillations in dissipative systems, Izvestiya VUZov. Radiofizika, vol. 29, no. 9, 1050 (in Russian).
Almeida, J. S., Carrico, J. A., Maretzek, A. M., Noble, P. A., & Fletcher, M. (2001). Analysis of genomic sequences by chaos game representation. Bioinformatics, 17, 429-437.
Andreyev Yu.V. (1995) Attractors and bifurcation phenomena in 1-D dynamical systems with stored information. Izvestiya VUZov. Prikladnaya nelineinaya dinamika, vol. 3, no. 5, 3-15 (in Russian).
Andreyev Yu.V., Belsky Yu.L., & Dmitriev A.S. (1992) Information processing in nonlinear systems with dynamic chaos. Proceedings of Int. Seminar Nonlinear Circuits and Systems, Moscow, vol. 1, 51-60.
Andreyev Yu.V., Dmitriev A.S., Chua L.O., & Wu C.W. (1992) Associative and random access memory using one-dimensional maps. International Journal of Bifurcation and Chaos, vol. 2, no. 3, 483-504.
Andreyev Yu.V., Belsky Yu.L., & Dmitriev A.S. (1994) Storing and recognition of information using stable cycles of 2-D and multi-dimensional maps. Radiotekhnika i elektronika, vol. 39, no. 1, 114-123 (in Russian).
Andreyev Yu.V., Belsky Yu.L., Dmitriev A.S., & Kuminov D.A. (1996a) Information processing using dynamical chaos. IEEE Transactions on Neural Networks, vol. 7, 290–291.
Atmanspacher H., Scheingraber H., & Voges W. (1988) Global Scaling Properties of the Chaotic Attractor Reconstructed from Experimental Data. Physical Review A, vol. 37, 1314–1322.
Auerbach D., Cvitanovic P., & Eckmann J.-P. (1987) Exploring chaotic motion through periodic orbits. Physical Review Letters, vol. 58, no. 23, 2387.
Babloyantz A. (1986) Evidence of chaotic dynamics of brain activity during the sleep cycle. In Dimension and Entropies in Chaotic Systems (G. Mayer-Kress, Ed.), Berlin: Springer-Verlag, 252-259.
Babloyantz A. & Destexhe A. (1986) Low-dimensional chaos in an instance of epilepsy. Proceedings of National Academy of Sciences USA, vol. 83, 3513-3517.
Baird B. & Eeckman F. (1992) A normal form projection algorithm for associative memory. In: Associative Neural Memories: Theory and Implementation (Ed. M. H. Hassoun), New York: Oxford University Press, 33-51.
Basu, S., Pan, A., Dutta, C., & Das, J. (1997). Chaos game representation of proteins. Journal of molecular graphics & modeling, October, 279-289.
Carpenter G.A. (1989) Neural network models for pattern recognition and associative memory. Neural Networks, vol. 2, 243.
Cvitanovic P. (1988) Invariant Measurement of Strange Sets in Terms of Cycles. Physical Review Letters, vol. 61, no. 24, 2729–2732.
Deschavanne, P., Giron, A., Vilain, J., Fagot, G., & Fertil, B. (1999). Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol. 16(October), 1391-1399.
Destexhe A., Sepulchre J.A., & Babloyantz A. (1988) A comparative study of the experimental quantification of deterministic chaos. Physics Letters A, vol. 132, 101-106.
Dmitriev A.S. (1991) Storing and recognition information in one-dimensional dynamical systems. Radiotekhnika i Elektronika, vol. 36, no. 1, 101-108 (in Russian).
Dmitriev A.S. (1993) Chaos and information processing in dynamical systems. Radiotekhnika i elektronika, vol. 38, no. 1, 1-24 (in Russian).
Dmitriev A.S., Panas A.I., & Starkov S.O. (1991) Storing and recognition information based on stable cycles of one-dimensional maps. Physics Letters A, vol. 155, no. 8/9, 494-499.
Dmitriev A.S., Kuminov D.A., Pavlov V.V., & Panas A.I. (1993) Storing and processing texts in 1-D dynamical systems. Preprint no. 3 (585), Institute of Radioengineering and Electronics RAS, Moscow (in Russian).
Dutta, C., & Das, J. (1992). Mathematical characterization of Chaos Game Representation: New algorithms for nucleotide sequence analysis. J. Mol. Biol., 228, 715–729.
Eisenberg J., Freeman W.J., & Burke B. (1989) Hardware architecture of a neural network model simulating pattern recognition by the olfactory bulb. Neural Networks, vol. 2, 315-325.
Farmer, J.D. (1982) Information Dimension and the Probabilistic Structure of Chaos. Zeitschrift für Naturforschung, vol. 37A, 1304-1325.
Fiser, A., Tusnady, G.E., & Simon, I. (1994). Chaos game representation of protein structures. Journal of molecular graphics, 12, 302-304.
Freeman W.J. (1987) Simulation of chaotic EEG patterns with a dynamic model of the olfactory system. Biological Cybernetics, vol. 56, 139-150.
Freeman W.J., Yao Y., & Burke B. (1988) Central pattern generating and recognizing in olfactory bulb. Neural Networks, vol. 1, 277-278.
Hawkins J., Blakeslee S. (2004) On intelligence, New York, NY: Times Books.
Hopfield J.J. (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of National Academy of Sciences, USA. Apr; vol. 79, n 8, 2554–2558.
Goldman, N. (1993). Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Research, May, 2487-2491.
Grossberg S. (1988) Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, vol. 1, 17-61.
Gunaratne G.N. & Procaccia I. (1987) Organization of Chaos. Physical Review Letters, vol. 59, no. 13, 1377–1380.
Gutierrez, J.M., Rodriguez, M.A., & Abramson, G. (2001). Multifractal analysis of DNA sequences using novel chaos-game representation. Physica A, 300, 271–284.
Joseph, J., & Sasikumar, R. (2006). Chaos game representation for comparison of whole genomes. BMC Bioinformatics, 7(May), 243-246.
Kitchens B. (1998) Symbolic Dynamics: One-sided, Two-sided and Countable State Markov Chains. Springer.
Klatzky R.J. (1975) Human Memory. Structures and Processes. W. H. Freeman and Co., San Francisco.
Kohonen T. (1980) Content-addressable memories (Springer, Berlin).
Lind D. & Marcus B. (1995) An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, Cambridge.
Maistrenko Yu.L., Maistrenko V.L., Sushko I.M. (1994a) Order for the appearance of attractors in piecewise linear systems. In: Chaos and Nonlinear Mechanics, Series B, vol. 7, World Scientific.
Maistrenko Yu.L., Maistrenko V.L., Sushko I.M. (1994b) Bifurcation phenomena in generators with delay lines. Radiotekhnika i elektronika, Moscow, No. 8-9, 1367-1380.
Matsumoto K. & Tsuda I. (1988) Calculation of information flow rate from mutual information. Journal of Physics A: Mathematical and General, vol. 21, 1405–1414.
Nicolis J.S. (1982) Should a reliable information processor be chaotic? Kybernetes, vol. 11, 269-274.
Nicolis, J.S. (1986) Dynamics of Hierarchical Systems: An Evolutionary Approach (Springer-Verlag).
Nicolis, J.S. (1991) Chaos and Information Processing: A Heuristic Outline (World Scientific).
Nicolis J.S. & Tsuda I. (1985) Chaotic dynamics of information processing – The “magic number seven plus-minus two” revisited. Bulletin of Mathematical Biology, vol. 47, 343-365.
Oliver, J.L., Bernaola-Galvan, P., Guerrero-Garcia, J., & Roman-Roldan, R. (1993). Entropic profiles of DNA sequences through chaos-game-derived images. Journal of theoretical biology, February 21, 457-70.
Ott E., Grebogi C., & Yorke J.A. (1990) Controlling chaos. Physical Review Letters, vol. 64, 1196-1199.
Pecora L.M. & Carroll T.L. (1990) Synchronization in chaotic systems, Physical Review Letters, vol. 64, 821-824.
Petoukhov S.V. & He M. (2010) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and Applications. Hershey, USA: IGI Global.
Procaccia I. (1988) The organization of chaos by periodic orbits: Topological universality of complex systems. In: Universalities in Condensed Matter. Ed. Jullien R., Springer, 213.
Skarda C.A. & Freeman W.J. (1987) How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, vol. 10, 161-165.
Shaw, R. (1981) Strange attractors, chaotic behavior and information flow. Zeitschrift für Naturforschung, vol. 36a, 80-112.
Schuster H.G. (2004) Deterministic Chaos. An Introduction (Wiley-VCH).
Tsuda I. (1984) A hermeneutic process of the brain. Progr. Theor. Phys., Supplement, vol. 79, 241-259.
Tsuda I. (1992) Dynamic link of memory - chaotic memory map in nonequilibrium neural networks. Neural Networks, vol. 5, 313-326.
Tsuda I. (1994) Can stochastic renewal of maps be a model for cerebral cortex? Physica D, vol. 75, 165-178.
Tsuda I., Koerner E., & Shimizu H. (1987) Memory dynamics in asynchronous neural networks. Progress of Theoretical Physics, vol. 78, 51-71.
Voges W., Atmanspacher H., & Scheingraber H. (1987) Deterministic Chaos in Accreting Systems: Analysis of the X-Ray Variability of Hercules X-1. The Astrophysical Journal, vol. 320, 794–802.
Wang, Y., Hill, K., Singh, S., & Kari, L. (2005). The spectrum of genomic signatures: from dinucleotides to chaos game representation. GENE, February, 173-185.
Wiegrinch W. & Tennekes H. (1990) On the Information Flow for One-Dimensional Maps. Physics Letters A, vol. 144, no. 3, 145–152.
Yao Y. & Freeman W.J. (1990) Model of biological pattern recognition with spatially chaotic dynamics. Neural Networks, vol. 3, no. 2, 153-170.
Yu, Z. G., Anh, V., & Lau, K. S. (2004). Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses. Journal of theoretical biology, 226(February), 341-348.

Symmetry: Culture and Science
Vol. 23, Nos. 3-4, 403-426, 2012

THE IRREGULAR (INTEGER) TETRAHEDRON AS A WAREHOUSE OF BIOLOGICAL INFORMATION

Tidjani Négadi

Address: Physics Department, Faculty of Science, University of Oran, 31100, Oran, Algeria; E-mail: [email protected]

Abstract: A “variable geometry” classification model of the 20 L-amino acids and the 20 D-amino acids, based on twenty physically and mathematically labeled positions on tetrahedrons, and extending Filatov’s recent model, is presented. We also establish several physical and mathematical identities (or constraints) that are very useful in applications. The passage from a tetrahedron with (possibly) maximum symmetry to a tetrahedron with no symmetry at all, here a distinguished integer Heronian tetrahedron, which could “describe” some kind of symmetry breaking process, reveals a lot of meaningful biological numerical information. Before symmetry breaking, and as a first supporting result, we discover that the L- and D-tetrahedrons together encode the nucleon-content in the 61 amino acids of the genetic code table and the atom-content in the 64 DNA-codons.
After a (geometric) symmetry breaking, and also an accompanying (physical) “quantitative symmetry” restoration concerning atom numbers, more results appear, for example the atom-content in, this time, 64 RNA-codons (61 amino acids and three stops), the remarkable Downes-Richardson-shCherbak nucleon-number balance and, most importantly, the structure of the famous protonated serine octamer Ser8+H+ (L- and D-versions), thought by many people to be a “key player” in the origin of homochirality in living organisms because of its unique property of forming exceptionally stable clusters and also its strong preference for homochirality. Using all the labeling possibilities, we find the more fundamental neutral serine octamer Ser8 (L- and D-versions). We also revisit, in this paper, the number 23!, which is at the basis of our recent arithmetic approach to the structure of the genetic code. New consequences, not yet published, and also new results, especially in connection with the serine octamer, are given. Finally, a remark on the inclusion of the “non-standard” versions of the genetic code in the present formalism is made.

Keywords: integer tetrahedron; amino acids; homochirality; serine octamer

1. INTRODUCTION

This paper is devoted, first, to a new classification of the twenty amino acids based on the Heronian (integer) tetrahedron, including and extending the recent one by Filatov (Filatov, 2009) based on the usual tetrahedron, and including also the twenty mirror-image D-amino acids. We start with the regular, achiral tetrahedron, the most basic of all polyhedra (the first of the five Platonic solids), with maximum symmetry, and end up with the irregular, chiral, integer Heronian tetrahedron, noted symbolically “117”, with no symmetry at all.
In this way we have, besides a “symmetry-breaking” process “begetting chirality”, a coherent chiral-framework classification of the amino acids, distributed on the faces, vertices and edges of the chiral integer Heronian tetrahedron (and also its mirror-image for the 20 D-amino acids). Recall that in the case of the ordinary tetrahedron the framework need not be chiral, whereas the amino acids which are disposed on it are chiral (except of course glycine). The whole object, that is the chiral Heronian tetrahedron together with the chiral amino acids on it, is, in this way, chiral. At the same time, a slight “quantitative symmetry-breaking”, at the level of atom number and inherent to Filatov’s ordinary tetrahedron (extended) model, with only physical labeling (nucleon, carbon, hydrogen and atom numbers), is cured and the symmetry restored. Another nice virtue of the above “117” Heronian tetrahedron is that it incorporates (encodes) naturally and strikingly, through its geometric characteristics, the correct numeric structure of the famous protonated serine octamer Ser8+H+, thought to be a promising actor in the origin of homochirality (Cooks et al., 2001; Hodyss et al., 2001). It is precisely the passage from a regular tetrahedron, with possible maximum symmetry, to the Heronian (irregular) tetrahedron, with no symmetry at all (no two edges equal), that is, a symmetry breaking, that leads to the serine octamer. By introducing a supplementary and concomitant mathematical labeling for the 20 positions (amino acids) on the vertices, edges and faces of the tetrahedron, the so-called “generalized Plato’s Lambda” numbers or Tetraktys, its more fundamental neutral form Ser8 is revealed. This is examined in section 2.

Our second aim in this paper, in section 3, is to revisit the number 23!, which was at the basis of our arithmetic model of the genetic code (Négadi, 2007, 2008, 2009), but in the context of the present work.
We show, in particular, that it incorporates, too, the above serine octamer structure and also encodes several other mathematical characteristics of the genetic code. As this paper is intended first for the readers of Symmetry: Culture and Science, and maybe other people at large, we include two Appendices to (i) define the physical (chemical) and mathematical entities used, (ii) ease reading and (iii) make verifications by the reader possible. In the first, we collect detailed chemical data concerning the 20 amino acids as well as the DNA (RNA) units Thymine (Uracil), Cytosine, Adenine and Guanine. In the second, we give some mathematical definitions concerning certain (elementary) arithmetic functions used in this paper (and in others), which help to unveil the hidden “biological information”. In particular, Euler’s famous totient function is briefly presented, with interesting applications here.

2. AN EXTENDED CLASSIFICATION MODEL OF THE AMINO ACIDS AND THE SERINE OCTAMER SER8

Of about seven hundred known amino acids, exactly twenty are coded by the genetic code. It is also a recognized fact that life uses exclusively the L-form amino acids to make proteins and exclusively the D-form sugars in the backbones of DNA and RNA. However, and this is a genuine paradigm shift (or turning point) in biology, the D-amino acids are also used in many living beings, from bacteria to mammals (see for example Yang et al. 2003), in various biological strategies. In fact, L or D is a matter of convention: the two forms (also called enantiomers) both exist, and some people say that life could even have begun “blind” to the D- or L-forms of the amino acids, maybe used both, and later “chose” one of them, the L-form, by inventing “quality control”, proofreading mechanisms and “checkpoints”.
In the contemporary ribosome, it is mainly the Aminoacyl-t-RNA synthetases that play a central role for the correct handedness (L-amino acids), because the other components, the Peptidyl Transferase Center and the codon-anticodon Decoding Center, are “blind” to the handedness of the amino acids. The Decoding Center guarantees only the identity of an amino acid but not its handedness. Enantiomers have very similar physical properties and are identical with respect to ordinary chemical reactions; the difference arises only when they interact. In recent years, the field of D-amino acid chemistry and biochemistry has grown and these now have the same status as their L-forms. In the following, therefore, we shall consider an extended classification comprising the 20 L-amino acids as well as the 20 D-amino acids.

We start from Filatov’s classification model of the twenty amino acids, which is a consequence of his symmetric table of the genetic code and is based on the tetrahedron. The author does not specify the geometric nature of the tetrahedron (regular or irregular), but any tetrahedron has 4 faces, 4 vertices and 6 edges. On one side, we have the regular tetrahedron with 24 isometries and, on the other, we have the irregular tetrahedron with 7 possible isometries. The extreme case where all the edges of the tetrahedron are different corresponds to a complete loss of symmetry, and we are precisely interested in this case in this work. The 20 amino acids are disposed on the tetrahedron as shown in Figure 1 below (see Filatov, 2009). Four amino acids A, N, L and F are at the centers of the four faces (blue), four amino acids G, P, K and Y (green) are on the four vertices and the remaining twelve amino acids (in red) are disposed on the 6 edges (2 amino acids per edge). Filatov has discovered a “quantitative” symmetry at the level of the nucleon number (the Hasegawa-Miyata Parameter (HMP); Hasegawa and Miyata, 1980).
The nucleon (proton or neutron) is the basic building-block of (ordinary) matter.

Figure 1: The L-amino acids (left) and the D-amino acids (right); the vertical line: the mirror.

The sum of the nucleon numbers of the amino acids (side-chains only) in the two pairs of faces is (for one tetrahedron, the L-tetrahedron say):

628+627=626+629=1255 (n)

The nucleon numbers of the 20 amino acids are given in Appendix 1. Here, the two pairs of faces are defined in the figure by (PKY/YKG) and (PGY/PGK), respectively. We have used Filatov’s tetrahedron model to deduce the existence of the equivalents of equation (n) for carbon and for hydrogen. For atom number, there is, a priori, no (exact) quantitative symmetry as there is for carbon and hydrogen, but we shall see below that this “quantitative symmetry-breaking” could be cured, thanks to the mathematical properties of the Heronian tetrahedron, and the (quantitative) symmetry restored, with interesting consequences. For carbon, it reads (for the respective faces, as in (n)):

33+34=32+35=67 (c)

An elementary (funny) manipulation of numbers, adding the “digits” of the numbers in Eq.(n) in base-100, gives 34+33=32+35, which is Eq.(c) up to a trivial permutation. The nucleon and carbon atom numbers appear therefore, somehow, "linked", as we have already experienced earlier (see Négadi, 2009) for the total number of nucleons in the 20 amino acids (12+55=67). Now, for hydrogen, we have

64+53=62+55=117 (h)

That each pair of faces gives the (correct) nucleon, carbon and hydrogen numbers for the 20 amino acids is not at all trivial.
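The base-100 “digit” manipulation linking Eq.(n) to Eq.(c) is easy to check mechanically. The following is only a small sketch using the face sums quoted above; the helper name `digits` is ours and not part of the original text.

```python
def digits(n, base=100):
    """Digits of a positive integer n in the given base, most significant first."""
    out = []
    while n > 0:
        n, r = divmod(n, base)
        out.append(r)
    return out[::-1]


# Face sums quoted in Eqs. (n), (c) and (h), grouped by pair of faces.
nucleon_faces = [(628, 627), (626, 629)]
carbon_faces = [(33, 34), (32, 35)]
hydrogen_faces = [(64, 53), (62, 55)]

# Primary identities: both pairs of faces give the same total.
nucleon_totals = [a + b for a, b in nucleon_faces]
carbon_totals = [a + b for a, b in carbon_faces]
hydrogen_totals = [a + b for a, b in hydrogen_faces]

# Base-100 digit sums of the nucleon face sums, e.g. 628 -> 6 + 28 = 34.
digit_sums = [tuple(sum(digits(n)) for n in pair) for pair in nucleon_faces]

print(nucleon_totals, carbon_totals, hydrogen_totals)
print(digit_sums)
print(sum(digits(1255)))  # base-100 digit sum of the nucleon total
```

The digit sums of each pair of nucleon face sums coincide, up to order, with the carbon face sums of the same pair, and the base-100 digit sum of 1255 is the carbon total.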
As a matter of fact, in each pair of faces, there are only 16 different amino acids, with four of them situated on an edge and contributing two times, and these 16 amino acids give the correct nucleon, carbon and hydrogen numbers for the 20 (different) amino acids. Thus, there is something non-trivial at work. If now we consider atom numbers, we have for the two pairs 108+97=205 and 104+99=203, neither of which is 204, the number of atoms in the 20 amino acids, and the “quantitative” symmetry is broken. However, their mean gives the correct value: (205+203)/2=204. We shall see below how this “quantitative symmetry-breaking” for atom numbers could be “restored”. There exists also a secondary “quantitative” symmetry at the level of carbon (C), hydrogen (H), atom and nucleon numbers, but now between the four vertices (v: G, P, Y, K) and the four face centers (f.c.: L, A, F, N):

#C-atoms(v) = #C-atoms(f.c.) = 14 (1)
#H-atoms(v) = #H-atoms(f.c.) = 23 (2)
#atoms(v) = #atoms(f.c.) = 39 (3)
#nucleons(v) = #nucleons(f.c.) = 221 (4)

We could also take Eqs.(1)-(4) collectively, and represent this “secondary” quantitative symmetry by the sum of the four numbers in the following interesting identity:

297=297 (5)

The identities (or constraints) such as (n), (c) and (h) will be called primary in the following, and those in (1)-(5) secondary. We shall show below the importance of the relation in Eq.(5) concerning serine and for other interesting consequences.

As mentioned in the introduction, the above tetrahedron(s) could also be labeled by the numbers of the 3-D generalization of Plato’s famous Lambda, or 3-D Tetraktys¹. The numbers on the four faces, when the latter are taken separately, are as shown below (each face listed row by row, from one vertex to the opposite edge):

1; 2, 4; 4, 8, 16; 8, 16, 32, 64
1; 3, 4; 9, 12, 16; 27, 36, 48, 64
1; 2, 3; 4, 6, 9; 8, 12, 18, 27
8; 12, 16; 18, 24, 32; 27, 36, 48, 64

Let us note, first, that the sums of the numbers in each of the above four number-patterns are as follows: 155, 220; 90, 285, respectively.
Now, when the tetrahedron is reconstructed, that is when the four “faces” are assembled, some numbers become redundant so that the sum of all these numbers is only 350 (see Phillips, ref. in footnote 1, and figure 2 below). In this way, we could establish, also for the Tetraktys, the equivalent of Filatov’s nucleon number “quantitative” symmetry (see Eq.(n), same order):

285+90=220+155=375 (T)

Figure 2: The Heronian tetrahedron “117” hosting the 20 amino acids. Labels, given as amino acid (nucleon number, Tetraktys number): G(1,1), N(58,8), Q(72,3), A(15,6), F(91,12), P(41,64), T(45,32), R(100,16), H(81,9), K(72,8), V(43,12), M(75,16), C(47,2), W(130,4), I(57,4), L(57,24), S(31,48), E(73,36), D(59,18), Y(107,27). Edge lengths: 51, 52, 53, 80, 84, 117.

This Tetraktys-number identity could be shown to be also derivable from the above secondary identities (1)-(5). As a matter of fact, by adding (5) to the sum of the prime factors of (1) through (4), or their a0-functions (see Appendix 2), we have (two times)

297+a0(14)+a0(23)+a0(39)+a0(221)=375 (6)

¹ Stephen M. Phillips, Plato’s Lambda - its meaning, generalization and connection to the Tree of Life (Article 11, http://smphillips.8m.com).

Let us now return to the Heronian tetrahedron which was mentioned several times above. This interesting mathematical object (shown in figure 2 above), also called “perfect pyramid” (Buchholz, 1992), has integer edges, face areas and volume. We have represented in the figure only one Heronian tetrahedron, say the L-Heronian tetrahedron; its mirror-image D-Heronian tetrahedron is also considered, but not represented here. Also, for each amino acid, the first number in the parentheses is the number of nucleons and the second the Tetraktys number. Now, from the large family of Heronian tetrahedrons known in mathematics, the integer Heronian tetrahedron having the smallest maximum edge-length is the one with edge-lengths 51, 52, 53, 80, 84, 117.
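The sums quoted for the labels of Figure 2, the secondary nucleon identity (4) and Eq.(6) can be verified with a few lines. This is a sketch under one stated assumption: Appendix 2 is not reproduced here, so a0 is taken to be the sum of the prime factors counted with multiplicity, as the text describes it.

```python
# Labels read from Figure 2: amino acid -> (nucleon number, Tetraktys number).
LABELS = {
    "G": (1, 1),   "N": (58, 8),   "Q": (72, 3),  "A": (15, 6),  "F": (91, 12),
    "P": (41, 64), "T": (45, 32),  "R": (100, 16), "H": (81, 9), "K": (72, 8),
    "V": (43, 12), "M": (75, 16),  "C": (47, 2),  "W": (130, 4), "I": (57, 4),
    "L": (57, 24), "S": (31, 48),  "E": (73, 36), "D": (59, 18), "Y": (107, 27),
}
VERTICES = "GPKY"       # amino acids on the four vertices
FACE_CENTERS = "ANLF"   # amino acids at the four face centers


def a0(n):
    """Sum of the prime factors of n, with multiplicity (assumed definition)."""
    n, total, p = abs(n), 0, 2
    while p * p <= n:
        while n % p == 0:
            total += p
            n //= p
        p += 1
    return total + (n if n > 1 else 0)


def total(keys, column):
    """Sum one label column (0 = nucleons, 1 = Tetraktys) over some amino acids."""
    return sum(LABELS[k][column] for k in keys)


nucleons_total = total(LABELS, 0)
tetraktys_total = total(LABELS, 1)
tetraktys_values = [t for _, t in LABELS.values()]
repeated = sorted(v for v in set(tetraktys_values) if tetraktys_values.count(v) == 2)
eq6 = 297 + sum(a0(x) for x in (14, 23, 39, 221))

print(nucleons_total, tetraktys_total)
print(total(VERTICES, 0), total(FACE_CENTERS, 0))
print(repeated, eq6)
```

The repeated Tetraktys values found this way are the ones the text later singles out as having multiplicity 2.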
As mentioned in the introduction, this tetrahedron is often noted “117”, in reference to its special property (smallest maximum edge-length). As also mentioned there, this (irregular) tetrahedron has all its edges different, so that it is chiral. The faces, in the order of equation (n), have the (integer) areas 1800, 1170, 1890 and 2016. The volume V is equal to 18144 (see Buchholz, 1992).

As a first application, consider the integer volume. Mathematically, the volume of the mirror image is –V, via the Cayley-Menger determinant, but this has no impact because the arithmetic functions we use here are insensitive to the sign of an integer, as they are only the sum of the prime factors. We have a0(V)=a0(-V)=29, so that a0(a0(V)+a0(-V))=31, SOD(V)=18 and B0(V)=56, where SOD is the Sum Of Digits and the functions a0 and B0 are defined in Appendix 2. This is serine: it has 31 nucleons in its side-chain, 56 in the block of its “residue” and 18 in the water molecule that completes the block. (Note that 56+18=74 is the number of nucleons in a block, the same for 19 amino acids; as is well known, proline is an exception.)

Serine is also “present” when considering the Tetraktys numbers with sum 350. Looking at the numbers in detail, we see that the numbers 4, 8, 12 and 16, and only these, appear two times, i.e. with multiplicity 2. This gives for the total sum 80+270=350, where 80 is for the above four numbers, having multiplicity 2, and 270 corresponds to the rest of the numbers, with no multiplicity. We have B0(270)=31 and B0(270)+B0(80)=56, so that 2B0(270)+B0(80)=31+56=87 is equal to the nucleon number in the “residue” of serine, 105-18, where 18 is for the nucleons in a water molecule as mentioned above. The latter could be found in several ways, two of which are the following: B0(375-350)=18 or, simply, a0(375)=18, where 375 is the sum of the Tetraktys numbers in the two pairs of faces (see Eq.(T)).
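The volume and the serine numbers quoted above can be reproduced with a short sketch. Two things in it are assumptions, since Appendix 2 is not reproduced here: a0 is taken as the sum of the prime factors with multiplicity, and B0 as that sum plus the sum of the prime indices (2 is the 1st prime, 3 the 2nd, and so on) plus the number of prime factors; this reading matches every B0 value quoted in the text, but it is our reconstruction. The assignment of the six edges to vertex pairs A, B, C, D is also ours, chosen to agree with the four face triples (53, 80, 117), (51, 52, 53), (51, 84, 117) and (52, 80, 84).

```python
from math import isqrt


def prime_factors(n):
    """Prime factors of |n| with multiplicity, in increasing order."""
    n, out, p = abs(n), [], 2
    while p * p <= n:
        while n % p == 0:
            out.append(p)
            n //= p
        p += 1
    if n > 1:
        out.append(n)
    return out


def prime_index(p):
    """Position of the prime p in the sequence 2, 3, 5, 7, ... (2 -> 1)."""
    def is_prime(k):
        return k > 1 and all(k % d for d in range(2, isqrt(k) + 1))
    return sum(1 for k in range(2, p + 1) if is_prime(k))


def a0(n):
    """Sum of prime factors with multiplicity (assumed definition)."""
    return sum(prime_factors(n))


def B0(n):
    """Assumed definition: a0 + sum of prime indices + number of prime factors."""
    f = prime_factors(n)
    return sum(f) + sum(prime_index(p) for p in f) + len(f)


def sod(n):
    """Sum of the decimal digits of |n|."""
    return sum(int(c) for c in str(abs(n)))


def det(m):
    """Integer determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * v * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, v in enumerate(m[0]) if v)


# Edge lengths between the four vertices A, B, C, D (our labeling).
AB, AC, AD, BC, BD, CD = 84, 52, 51, 80, 117, 53
sq = lambda x: x * x
cayley_menger = [
    [0, 1, 1, 1, 1],
    [1, 0, sq(AB), sq(AC), sq(AD)],
    [1, sq(AB), 0, sq(BC), sq(BD)],
    [1, sq(AC), sq(BC), 0, sq(CD)],
    [1, sq(AD), sq(BD), sq(CD), 0],
]
# For a tetrahedron, the Cayley-Menger determinant equals 288 * V^2.
V = isqrt(det(cayley_menger) // 288)

print("V =", V)
print("a0(V) =", a0(V), " a0(a0(V) + a0(-V)) =", a0(a0(V) + a0(-V)))
print("SOD(V) =", sod(V), " B0(V) =", B0(V))
```

Because the determinant only yields the square of the volume, the sign of V (L- versus D-tetrahedron) does not enter this computation.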
These three numbers 31, 56 and 18, characterizing serine, are identical to the above three, obtained from the Heronian tetrahedron.

The Heronian tetrahedron has as the sum of its six integer edges 437, but when the two pairs of faces are considered each edge contributes twice, so that the total sum is 2×437=874. This number identifies nicely with the number of nucleons in the protonated serine octamer Ser8+H+, in a state called “maximum exchange” where it “catches” exactly 33 hydrogen nuclei or nucleons. As a matter of fact, we could write for the total number of nucleons of eight serines (the octamer) and its “charging” proton, on the one hand, and the 33 “exchangeable” hydrogens, on the other, 8×(31+74)+1+33=841+33=874. (Each serine has 4 exchangeable hydrogens, in H2N- (amine group), HO- (hydroxy group) and –COOH (acid group), and there is a charge-carrying proton, that is 4×8+1=33 hydrogen atoms.)² In experiments, serine clusters can appear between the mass-to-charge ratio m/z=841 (no exchange) and m/z=874 (complete exchange). It is interesting, and remarkable, that the sum of all the edges of the two pairs of faces of the above Heronian (L-)tetrahedron, 874, could be made to reproduce numerically the details of the protonated octamer and its 33 exchangeable hydrogens. It suffices to introduce the primary identity (c), for carbon, in the form 33+34-(32+35)=33-33=0, which selects the number 33: 874-33+33=841+33=874. Selecting instead the number 34 (33+charging proton), we obtain the neutral serine octamer Ser8 and the rest in protons (hydrogen atoms): 840+1+33=874. Also, selecting from the six edges the two equal maximum lengths, we get 2×117+640=874 and, introducing this time the secondary identity for carbon in Eq.(1), and assuming that the first primary identity is already applied, we end up with 248+593+33=874.
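The bookkeeping for the protonated serine octamer can be laid out step by step. The sketch below only restates the numbers given in the text (31 nucleons per side-chain, 74 per block, 4 exchangeable hydrogens per serine, one charging proton); the variable names are ours.

```python
SIDE_CHAIN = 31        # nucleons in the serine side-chain
BLOCK = 74             # nucleons in the common amino acid block
EXCHANGEABLE = 4       # exchangeable hydrogens per serine
EDGES = (51, 52, 53, 80, 84, 117)   # edges of the "117" Heronian tetrahedron

n = 8                                             # octamer
neutral = n * (SIDE_CHAIN + BLOCK)                # Ser8: 8 x 105
protonated = neutral + 1                          # Ser8 + H+ (no exchange)
exchange = n * EXCHANGEABLE + 1                   # 4 x 8 + 1, as counted in the text
maximum_exchange = protonated + exchange          # complete exchange

side_chains = n * SIDE_CHAIN                      # nucleons in the 8 side-chains
blocks_plus_proton = n * BLOCK + 1                # 8 blocks and the charging proton

print(neutral, protonated, exchange, maximum_exchange)
print(side_chains, blocks_plus_proton, 2 * sum(EDGES))
```

The last line prints the three quantities entering the decomposition of twice the edge sum into side-chains, blocks with the charging proton, and exchangeable hydrogens.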
In the last decomposition, the first two numbers are respectively the number of nucleons in the 8 side-chains of the 8 serines forming the octamer (8×31=248) and the number of nucleons in their 8 blocks together with the charging proton (8×74+1=592+1=593).

The serine tetramer (4 serines) is the smallest serine cluster known to exhibit homochiral preference (Cooks et al., 2001). It is also thought that the octamer could be made of two tetramers (Nemes, 2005; Costa and Cooks, 2001). Purely as a speculative exercise, and following the reasoning about the serine octamer, the tetramer (Ser4+H+) would have 4×4=16 “exchangeable hydrogens” and, adding a “charging” proton, we would end with a tetramer in “maximum exchange” having a number of nucleons equal to 4×105+1+16=421+16=437. It is remarkable that this number is half of 874, that is, the sum of the six edges of the Heronian tetrahedron (51+52+53+80+84+117). The introduction of the identity (5) in 437 gives 140+297=437, which has a simple meaning: the first term, 140, is the number of nucleons in the four side-chains and the sixteen “exchangeable hydrogens”, 4×31+16, and the second, 297, is the number of nucleons in the four blocks and the “charging” proton, 4×74+1. This is in agreement with the fact that the octamer is thought to be made of two tetramers. Finally, the D-octamer is reproduced the same way, using this time the mirror image, or D-tetrahedron.

² I would like to thank Lars Konermann (Department of Chemistry, The University of Western Ontario, Canada) for very kind explanations concerning the exchangeable hydrogens.

Now, we come back to the “quantitative symmetry-breaking” mentioned above. We had for the two pairs 108+97=205 and 104+99=203. In fact, the discrepancy comes from the NOS-atoms (nitrogen, oxygen and sulfur), as for hydrogen and carbon the symmetry is obeyed.
In detail, for the four faces we have 108=64+44, 97=53+44, 104=62+42 and 99=55+44, where in each case the first number is the hydrogen number and the second the number of CNOS-atoms. There exist numerical patterns suggesting that hydrogen is in some way separated from the other four atoms (see the next section), and we shall keep this separation in the above four relations, where the numbers of CNOS-atoms are, respectively, 44, 44, 42 and 44. These correspond to the faces “628”, “627”, “626” and “629”, respectively, and we shall show below that their atom numbers are rather given by the B0-functions of the areas of these faces. The three edges of these faces are (53, 80, 117), (51, 52, 53), (51, 84, 117) and (52, 80, 84), respectively. The (integer) areas of these faces have, as mentioned above, the values 1800, 1170, 1890 and 2016. The formula for the area (Heron of Alexandria, 10-75 A.D.) is simply the square root of s(s-a)(s-b)(s-c), where s=(a+b+c)/2 is the semiperimeter. The B0-functions of the areas are calculated to be B0(1800)=42, B0(1170)=45, B0(1890)=43 and B0(2016)=44, respectively. Adding the respective hydrogen atom numbers gives 64+42=106, 53+45=98, 62+43=105 and 55+44=99, respectively. Finally, grouping the two pairs of faces apart gives

106+98=105+99=204 (a)

Thus, by replacing the CNOS-number of each face of the tetrahedron by the B0-function of the area of that face, the “quantitative symmetry” at the level of atom number is restored, as mentioned before (recall that the quantitative symmetry for hydrogen is not broken: 64+53=62+55=117).

Now, we come to some interesting applications. Consider, first, both (ordinary) tetrahedrons, the L-tetrahedron and the D-tetrahedron (see the figure). The 20 positions on each of them are, first, labeled by the nucleon numbers of the amino acids.
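The face areas and B0 values quoted in the discussion of the four faces can be re-derived mechanically. The following Python sketch is only an illustration: the helper names (`prime_factors`, `prime_index`, `B0`, `heron_area`) are not from the paper, and B0 is implemented following the definition in Appendix 2, under the assumption that repeated prime factors are counted with multiplicity in all three terms.

```python
from math import isqrt

def prime_factors(n):
    """Prime factors of n with multiplicity, e.g. 12 -> [2, 2, 3]."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def prime_index(p):
    """Position of the prime p in the list of primes (2 is the 1st)."""
    return sum(1 for m in range(2, p + 1) if len(prime_factors(m)) == 1)

def B0(n):
    """Assumed reading of B0: a0(n) + sum of prime indices + Omega(n)."""
    f = prime_factors(n)
    return sum(f) + sum(prime_index(p) for p in f) + len(f)

def heron_area(a, b, c):
    """Integer area from 16*A^2 = (a+b+c)(-a+b+c)(a-b+c)(a+b-c)."""
    sixteen_a2 = (a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c)
    root = isqrt(sixteen_a2)
    if root * root != sixteen_a2 or root % 4:
        raise ValueError("triangle does not have an integer area")
    return root // 4

# Faces "628", "627", "626", "629" and their hydrogen numbers, as in the text.
faces = [(53, 80, 117), (51, 52, 53), (51, 84, 117), (52, 80, 84)]
hydrogens = [64, 53, 62, 55]

areas = [heron_area(*face) for face in faces]
cnos = [B0(area) for area in areas]          # replaces the CNOS counts
totals = [h + c for h, c in zip(hydrogens, cnos)]

print(areas)
print(cnos)
print(totals[0] + totals[1], totals[2] + totals[3])
```

Working with 16A² keeps the whole computation in integers, so a non-Heronian triangle is rejected rather than silently rounded.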
We know that there is a “balance” between two pairs of faces (primary identity (n)) and four “balances” between the four vertices and the four centers of the faces (secondary identities (1)-(4) or (5)) involving the carbon, hydrogen, atom and nucleon numbers. Finally, each tetrahedron involves 4 vertices, 4 centers and 6 edges, or 14 “invariant” objects. Writing down the sum for the two tetrahedrons (L- and D-) gives

2(2×1255+2×297+14)=2²(1255+297+7) (7)

The first expression, on the left of the equal sign, means 2 tetrahedra (L- and D-) and the second, on the right, means twice two pairs of faces where, in each pair, 7 (concomitantly) means 2 faces sharing 5 edges, or 2+5=7. Writing, first, 1255 as 627+628, 297 as 221+76 and 7 as 3+4 (or 2+5) and, second, noting that the different parts are independent, we could for example group in the parenthesis the odd numbers together and the even ones together, to obtain the following interesting partition

2²(627+221+3)=3404 (8)
2²(628+76+4)=2832 (9)

The first equation gives the total number of nucleons in the 61 amino acids and the second the total number of atoms in 64 DNA-codons. As a matter of fact, there are 145 nucleons in the 5 quartets, 188 in the 3 sextets, 660 in the 9 doublets, 57 in the triplet and 75+130=205 in the two singlets (see Appendix 1) and therefore 145×4+188×6+660×2+57×3+205×1=3404 nucleons in 61 amino acids. For 64 DNA-codons, each one of the four nucleobases T, C, A and G appears 48 times and there are therefore 48×(15+13+15+16)=2832 atoms (see Appendix 1). Now, taking instead 1255=626+629 in (8) and (9), the result would be 3404+8=3412 for the first and 2824 for the second, so that it is essentially the same result as above. Note also that 3412 is the number of nucleons in 61 amino acids where these are in their “physiological” state, that is, when a few amino acids are charged (Downes and Richardson, 2002; shCherbak, 2008).
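The two totals in Eqs. (8) and (9) can be checked directly against the Appendix 1 data. The sketch below is illustrative only; the dictionary names are made up for this example, and the side-chain nucleon numbers and nucleobase atom counts are copied from Appendix 1.

```python
# Side-chain nucleon numbers from Appendix 1, grouped by codon multiplicity M.
NUCLEONS_BY_MULTIPLICITY = {
    4: [41, 15, 45, 43, 1],                       # P, A, T, V, G
    6: [31, 57, 100],                             # S, L, R
    2: [91, 107, 47, 81, 72, 58, 72, 59, 73],     # F, Y, C, H, Q, N, K, D, E
    3: [57],                                      # I
    1: [75, 130],                                 # M, W
}

# Atom counts of the four DNA nucleobases (Appendix 1).
DNA_BASE_ATOMS = {"T": 15, "C": 13, "A": 15, "G": 16}

# One copy of each amino acid: the 20 side-chains.
nucleons_20 = sum(sum(v) for v in NUCLEONS_BY_MULTIPLICITY.values())

# Each amino acid counted once per codon: the "61 amino acids".
sense_codons = sum(m * len(v) for m, v in NUCLEONS_BY_MULTIPLICITY.items())
nucleons_61 = sum(m * sum(v) for m, v in NUCLEONS_BY_MULTIPLICITY.items())

# 64 codons x 3 positions = 192 bases, i.e. 48 of each of the four bases.
atoms_64_codons = 48 * sum(DNA_BASE_ATOMS.values())

print(nucleons_20, sense_codons, nucleons_61, atoms_64_codons)
```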
Concerning the number 3412, mentioned above, Downes and Richardson (2002) have shown the existence of an exact nucleon-number balance between the side-chains of the 61 amino acid residue molecules (in their “physiological” state) and the 61 blocks (57×56+4×55), where proline alone has 55 nucleons in its block instead of 56, the number for the remaining 19 amino acids. (Note that in this case the number of nucleons in the 20 amino acids is 1256, or 1255+1.) To see the balance, we consider the three identities (n), (c) and (a), where in the latter we add to both members 90+90=90+90 (corresponding to the blocks). We have

L-tetrahedron → 2(1255+67+384)=3412 (10)
D-tetrahedron → 2(1255+67+384)=3412 (11)

and the equality of the two tetrahedra could “fit” (or encode) the remarkable Downes-Richardson nucleon-number balance. Downes and Richardson (2002; see also shCherbak, 2008) considered that a few amino acids (in their “physiological” state) are charged, and also the case where proline has 42 nucleons in its side chain and 73 nucleons in its block, and showed that there are 3412 nucleons in the 61 side-chains and 3412 nucleons in the corresponding 61 residue blocks (57 blocks of 74-18=56 nucleons and 4 of 73-18=55, that is, 57×56+4×55). We have also recently established this very numerical balance by using other mathematical techniques (Négadi, 2011).

Now, we have already established the restoration of the “quantitative” symmetry of atom number. Consider therefore the sum Σ=2(1255+67+117+204), including the nucleon, carbon, hydrogen and atom numbers in the two pairs of faces of the tetrahedron. We have (Σ=2×31×53)

B0(Σ)=86+31=117 (12)

where 86 corresponds to the sum of the prime factors, a0(Σ), and the rest, 31, to the sum of the prime indices and the big Omega function (or the number of prime factors, here 3; see Appendix 2).
The result is the total number of hydrogen atoms in the 20 amino acids, in agreement with the pattern “16+7=23” (see below): 86 hydrogen atoms in 5 quartets, 9 doublets and 2 singlets (21+50+7+8) and 31 hydrogen atoms in 3 sextets and 1 triplet (22+9).

Now, we include the blocks in the number of atoms and collect everything for one pair of faces: Eqs.(n), (5), (c), (h) and (a), to which we add 2×90=180, for the blocks, and also the number of edges (5) and centers of faces (2), that is, 7. The sum of all the mentioned quantities is 2127 (=3×709; 709 is the 127th prime), and for the sum of its prime factors and their indices we have

A0(1255+297+67+117+384+7)=841 (13)

This number, again, corresponds to the protonated serine octamer. The selection of the sub-set comprising the last four numbers above is also interesting. Computing the quantity

2²×A0(67+117+384+7)=4×48=192 (14)

for both tetrahedrons, L- and D-, that is, for 2² pairs of faces, we find this time the total number of nucleobases in 64 codons (each of the 4 nucleobases appears 48 times). Next, we introduce the quantity Q0=2×1255+2×375+(1170+1800+1890+2016), comprising the nucleon numbers, the Tetraktys numbers and also the (integer) areas of the two pairs of faces of the Heronian tetrahedron, the L-tetrahedron, say. We have

B0(Q0)=248 (15)

The nucleon numbers and the Tetraktys numbers for the four vertices and the four face centers give Σ(v, f.c.)=2×221+150=592. Adding the two quantities above gives

B0(Q0)+Σ(v, f.c.)=248+592=840 (16)

The relation (16) corresponds exactly to the neutral serine octamer Ser₈⁰: there are 248 nucleons in its eight side-chains (8×31) and 592 nucleons in its eight blocks (8×74). B0(Q0) and Σ(v, f.c.) therefore give these two parts. Also, taking the A0-function of these two numbers, we obtain

A0(248)+A0(592)=112 (17)

Amazingly, 112 is equal to the number of atoms in the neutral serine octamer, as serine has 14 atoms (side-chain and block) and 8×14=112.
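Equations (12) to (17) are all evaluations of the arithmetic functions defined in Appendix 2, so they can be replayed in a few lines. This is a minimal sketch: the function names mirror the paper's a0, A0 and B0, but the implementation, including the choice to count prime indices with multiplicity, is an assumption made for this example.

```python
def prime_factors(n):
    """Prime factors of n with multiplicity, e.g. 12 -> [2, 2, 3]."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def prime_index(p):
    """Position of the prime p in the list of primes (2 is the 1st)."""
    return sum(1 for m in range(2, p + 1) if len(prime_factors(m)) == 1)

def a0(n):
    """Sum of the prime factors of n, with multiplicity."""
    return sum(prime_factors(n))

def A0(n):
    """a0(n) plus the sum of the prime indices of the prime factors."""
    return a0(n) + sum(prime_index(p) for p in prime_factors(n))

def B0(n):
    """A0(n) plus Omega(n), the number of prime factors."""
    return A0(n) + len(prime_factors(n))

face_areas = [1170, 1800, 1890, 2016]

eq12 = B0(2 * (1255 + 67 + 117 + 204))
eq13 = A0(1255 + 297 + 67 + 117 + 384 + 7)
eq14 = 2**2 * A0(67 + 117 + 384 + 7)
Q0 = 2 * 1255 + 2 * 375 + sum(face_areas)
eq15 = B0(Q0)
eq16 = eq15 + (2 * 221 + 150)
eq17 = A0(248) + A0(592)

print(eq12, eq13, eq14, eq15, eq16, eq17)
```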
Finally, considering the two tetrahedra (L- and D-), we have B0(2×112)=32=31+1, and this corresponds to the protonated monomer Ser⁺ (side-chain only). In the equations (16) and (17) nothing prevents us from adding the number of tetrahedra involved, here 1, to get 840+1=841 (nucleons) and 112+1=113 (atoms) for the protonated serine octamer. For the two tetrahedra we could form the product [B0(112)+1][B0(112)+1], which is equal to 29²=841, again the protonated serine octamer. Now, we form the two following quantities

Q1=2×1255+(1170+1800+1890+2016) (18)
Q2=2×375+(1170+1800+1890+2016) (19)

where, in Q1, the nucleon and the Heronian tetrahedron numbers are considered and, in Q2, the Tetraktys numbers replace those of the nucleons. First, we have A0(Q1)+A0(Q2)=76+104=180, which could be written as 114+66. This is the carbon content in the 61 amino acids: 114 in 5 quartets, 9 doublets and 2 singlets (4×9+2×33+3+9), on the one hand, and 66 in the 3 sextets and the triplet (6×9+3×4), on the other. This is also in agreement with the pattern “16+7=23” (see above and below). Now, we use B0 instead. We have in this case B0(Q1)+B0(Q2)=80+108=188. This is the number of nucleons in the three sextets serine, leucine and arginine. As Q1=2×13×19² and Q2=2×3×31×41, we get immediately 31+157, which selects serine’s number of nucleons, 31. Some little additional manipulation gives the final result 31+57+100, respectively. Taking the total sum, 180+188=368=2⁴×23, we have a0(368)=31. The sum of the prime indices and the number of factors gives 18, the water molecule. Equivalently, B0(368)=31+18=49. Finally, crossing the four values, we introduce the numbers 156=76+80 and 212=104+108, from which we have A0(156)+A0(212)=105. This is the nucleon number of serine (side-chain and block).
As we have already shown in this paper, these same numbers refer to serine; in particular, the difference A0(156)+A0(212)-B0(368) gives the right number of nucleons in the block of the residue of serine (31+74=49+56=87+18=105).

To end this section, let us return to the two important, and interesting, numbers 248 and 840. Both correspond to the serine octamer; in the first, the block is not included. We show below that, in fact, the number 248 begets the number 840 and shares with the Heronian tetrahedron rare mathematical properties. As a matter of fact, the number 248 is the smallest number greater than 1 for which the three Pythagorean means (the arithmetic, geometric and harmonic means) of Euler’s totient function φ and the sum-of-divisors function σ are all integers. We have φ(248)=120 and σ(248)=480. As for the means, we have the a-mean=(120+480)/2=300, the g-mean=√(120×480)=240 (an integer; see the equation below) and finally the h-mean=(2×120×480)/(120+480)=192. From all these relations we have

φ(248)+σ(248)+g-mean[φ(248),σ(248)]=120+480+240=248+592=840 (20)

In the last step, we isolated the proper divisors of 248 and wrote σ(248)=248+232. This result is the same as in Eq.(16), the neutral serine octamer. We have thus seen that the complete serine octamer is derivable from the side-chain part, and the number 248 seems therefore more fundamental than 840. It is even “hidden” in the L/D-classification based on the regular tetrahedron. In Eq.(7), the right-hand side is equal to 2²×1559, which is the factorization of the sum 6236 (3404+2832). It appears that the sum of the prime indices is equal to 248 (1559 is the 246th prime). Now, A0(840)=33, and we find that 840, itself, begets the 33 “exchangeable” hydrogens (see above). The sum of the three numbers in the chain 248→840→33 (for one tetrahedron) leads to B0(1121)=105 and, considering the two tetrahedra, L- and D-, we have A0(2×1121)=106=105+1. These are, respectively, the neutral monomer of serine and the protonated monomer, experimentally observed.
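The statements about the totient and divisor-sum of 248 and their three means are easy to reproduce. The sketch below uses deliberately naive definitions of φ and σ (counting coprime integers, summing divisors); the function names are chosen for this example only and are not the Maple procedures mentioned in Appendix 2.

```python
from math import gcd, isqrt

def phi(n):
    """Euler's totient: how many k in 1..n are coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def sigma(n):
    """Sum of all positive divisors of n."""
    return sum(d for d in range(1, n + 1) if n % d == 0)

def has_integer_means(n):
    """True when the arithmetic, geometric and harmonic means of
    phi(n) and sigma(n) are all integers."""
    p, s = phi(n), sigma(n)
    g = isqrt(p * s)
    return (p + s) % 2 == 0 and g * g == p * s and (2 * p * s) % (p + s) == 0

p, s = phi(248), sigma(248)
a_mean = (p + s) // 2
g_mean = isqrt(p * s)
h_mean = 2 * p * s // (p + s)

print(p, s, a_mean, g_mean, h_mean)
print(p + s + g_mean)                      # left-hand side of Eq. (20)

# Brute-force search for the first n > 1 with three integer means.
print(next(n for n in range(2, 249) if has_integer_means(n)))
```

The last line is a brute-force search over 2..248, so it also shows whether any smaller number shares the property.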
The number 192, which is the total number of nucleobases in 64 codons, appears here twice: (i) φ(840)=192 and (ii) h-mean[φ(248), σ(248)]=192 (see above). This could describe 64 RNA-codons and 64 DNA-codons, inasmuch as 840 could also reveal the four DNA units {T(126), C(111), A(135), G(151)} and the four RNA units {U(112), C(111), A(135), G(151)} where, in parentheses, the number of nucleons in each nucleobase is given. As a matter of fact, using Eq.(16), we get, by adding 840 and its φ-function (see section 3),

840+φ(840)=1032 (21)

This last number is identical with the total nucleon sum of the eight units: 523 for the DNA units and 509 for the RNA units. The number 1032 could be written 442+590, by isolating the nucleon identity (4), that is, the physical term 2×221, and next, by applying the carbon identity (c), 67-67=0, we find (442+67)+(590-67)=509+523, which is the result. Considering also the two tetrahedra and using the results following Eqs.(18)-(19), we have

2×1255+2(B0(Q1)+B0(Q2))=2(1255+188)=2×1443=2886 (22)

This is also an interesting result, as it gives the total number of atoms in 61 RNA-codons, 2560, and in 3 stops, 326 (see Rakočević, 2009). Rakočević used the nucleotides for the 61 RNA-codons and the ribonucleosides for the 3 stop-codons (UAA, UAG and UGA): 45U+48C+44A+46G giving 2560 atoms and 3UMP+4AMP+2GMP giving 326 atoms. Let us note that 326 could be written as 128 for the nucleotide part, or the “side-chain”, and 198 for the ribose/phosphate part, or the “block”. Let us also write one copy of 1255 as 245+1010, according to the pattern “16+7” (see above and below), where 245=188+57 (3 sextets S, L, R, and the triplet I). Knowing that 188 is also equal to B0(Q1)+B0(Q2)=108+80, we could therefore rewrite (22) as (108+57)+(2×188+80+1010+1255)=2886. It is now sufficient to apply the primary identity (c) for carbon to get (108+57+33)+(2721-33)=198+2688. We have therefore found the “block” part, 198 (see above).
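The base counts 45U+48C+44A+46G and 3U+4A+2G behind the totals 2560 and 326 follow from enumerating the 64 codons and setting the three stops aside. The sketch below illustrates this; the constant names are invented for the example, and the atom counts are those listed in Appendix 1 (12, 13, 15, 16 for U, C, A, G and 34, 35, 37, 38 for UMP, CMP, AMP, GMP).

```python
from itertools import product

BASE_ATOMS = {"U": 12, "C": 13, "A": 15, "G": 16}         # bare nucleobases
MONOPHOSPHATE_ATOMS = {"U": 34, "C": 35, "A": 37, "G": 38}  # UMP, CMP, AMP, GMP
STOP_CODONS = ["UAA", "UAG", "UGA"]

# All 64 triplets over the RNA alphabet, then the 61 sense codons.
all_codons = ["".join(triplet) for triplet in product("UCAG", repeat=3)]
sense_codons = [c for c in all_codons if c not in STOP_CODONS]

def base_counts(codons):
    """How many times each base occurs in a list of codons."""
    return {base: sum(c.count(base) for c in codons) for base in "UCAG"}

sense_counts = base_counts(sense_codons)
stop_counts = base_counts(STOP_CODONS)

sense_atoms = sum(BASE_ATOMS[b] * n for b, n in sense_counts.items())
stop_atoms = sum(MONOPHOSPHATE_ATOMS[b] * n for b, n in stop_counts.items())

print(sense_counts, sense_atoms)
print(stop_counts, stop_atoms)
print(sense_atoms + stop_atoms)
```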
Using now, in 2721, the decomposition of 188 as 57+131 and using the secondary identity for carbon (1), we have (198+128)+2560=326+2560, that is, the correct partition into codons and stops (see above and Rakočević, 2009).

It is quite astonishing that we could also reveal the structure of serine using nucleon and atom numbers in all the chemical ingredients: amino acids, DNA and RNA. Take for example the two numbers 2886 and 2560. Their difference, 326, was associated above with the 3 stops, and they are both relevant: A0(2886)=67+9=76, the number of carbon atoms in 23 Amino Acids Signals (AASs), and a0(2560)=18+5=23, precisely these 23 AASs. Also, their difference (the three stops) gives A0(326)=204, which is the number of atoms in the 20 amino acids (see also Rakočević, 2009). Consider therefore the quantity Q3=3404+2886+2560=2×3×5²×59. We have B0(Q3)=74+31=105, precisely the detailed structure of serine (see above). Add now to Q3 the number of atoms in 64 DNA codons, Q4=Q3+2832 (see Eq.(9)), to get A0(Q4)=105. We also have B0(½Q4)=106, the number of nucleons in the protonated monomer of serine, 105+1. Finally, Q5=3404+2832+2560=2²×3×733 (733 is the 130th prime) leads to A0(Q5)=874, which corresponds to the protonated serine octamer in its state of maximum exchange (see above in this section).

Finally, the two main numbers of the tetrahedron, 437 and 874, are also informative by themselves. First, B0(437), written as A0(437)+Ω(437)=59+2=61, has a nice interpretation with respect to the mathematical structure of the genetic code: 59 codons for degenerate amino acids and one codon for each of the two singlets (M and W). Interestingly, the sum of the a0-functions of the four semiperimeters 125, 78, 126 and 108 of the Heronian tetrahedron also gives 61.
As for 874, closely related to the serine octamer, and homochirality, as we have seen in this paper, we have B0(874)=65, and this last number is conspicuously identical with the number of what are known as the “biological” space groups, which survive, out of a total of 230 crystal groups, when considering chiral molecules (see for example Mainzer, 1996).

3. REVISITING 23!

In 2007 (Négadi, 2007), we designed an arithmetic model of the genetic code based on the number 23!. From its twofold, decimal and prime-factorization, representations in equations (II) and (III)

23! = 1×2×3×4×…×21×22×23 (I)
23! = 25852016738884976640000 (II)
23! = 2¹⁹×3⁹×5⁴×7³×11²×13×17×19×23 (III)
(I)-(II) = 0 (IV)
(I)-(III) = 0 (V)

we have deduced the multiplet structure of the (standard) genetic code and computed the right degeneracies as well as many other interesting results (Négadi, 2008, 2009). Every digit from 1 to 9 in Eq.(II) was associated with an amino acid coded by more than one codon. Two zeros were associated with the two singlets methionine and tryptophane and three zeros with the three stop codons. One of these results, for example, gives the total number of hydrogen atoms in the 61 amino acids as a0(23!)+Ω(23!)+117=200+41+117=358, where 117 is the sum of all the digits in (II) as well as their number (zero excluded). This last number is nothing but the number of hydrogen atoms in the 20 amino acids, and the remaining part, 200+41=241, corresponds to the 41 degenerate codons (amino acids).

It appears that we did not fully exploit all the numerical facets hidden in 23!. For example, a nice result that went unnoticed comes from the above two representations, (II) and (III), and consists in adding all the (47) digits as “individuals”, ignoring the place value (for example, counting 11 as 1+1) and also the factorial sign. In this way we obtain 99+74+2×(2+3)=183, which identifies nicely with the total number of nucleobases in the 61 codons, inasmuch as the prime factorization of 183, 3×61, is “tailor-made”: 61 codons and 3 stops, because a0(183)=61+3=64, and also 18 amino acids with degeneracy and 2 non-degenerate singlets, because the sum of the prime indices is SPI(183)=18+2=20. This is not the end, because if we consider the first “representation”, (I), which is in fact the definition of the factorial, and proceed as above for its (individual) digits, we get 114, which identifies with the number of nucleobases in 38 codons (3 nucleobases per codon). Subtracting 114 from 183 gives 69, the number of nucleobases in 23 codons. We obtain therefore the pattern 23+38 for the 61 codons. Note that the sum 183+114 gives 297. (114: 73 from 1 to 16 and 41 from 17 to 23.)

Another interesting result comes from the three representations of 23! written above as (I), (II) and (III). In the table below, we compute the sum of all the digits, and their number, needed to write each representation (excluding zeros and ignoring place values, exponent positions, etc.; just the digits).

Representation   #digits (zero excluded)   sum
I                35                        114 (73+41)
II               18 (11+7)                 99
III              20                        74

The total sum is 360 and, adding its number of divisors σ0(360)=24, we have 360+σ0(360)=360+24=384. This number is equal to the number of atoms in the 20 amino acids (side-chain and block). In the table, the number 114 is also written according to the partition 73+41, already mentioned. Moreover, the number of digits in the decimal place-value representation is also partitioned according to the parity of the digits (even/odd): 18=11+7. Using these two partitions, we have (i) (73+11)+300=84+300=384 and (ii) (99+74+7)+(73+41+11+20+35)+σ0(360)=180+204=384. The case (ii) corresponds to the blocks and the side-chains, respectively, and case (i) describes the partition into the set comprising the 5 quartets, the 9 doublets and the 2 singlets, 16 amino acids having 76+177+20+27=300 atoms, on the one hand, and the 3 sextets and the triplet, 7 amino acids having 62+22=84 atoms, on the other.
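The digit counts and digit sums in the table can be recomputed from the three representations themselves. The sketch below is an illustration under one stated assumption: representation (III) is rendered as a string with `^` marking exponents, and every digit of the bases and exponents is counted, as the text describes. The helper name `digit_stats` is made up for this example.

```python
from math import factorial

def digit_stats(text):
    """Return (number of non-zero digits, sum of digits) found in text."""
    digits = [int(ch) for ch in text if ch.isdigit() and ch != "0"]
    return len(digits), sum(digits)

# (I): the factors 1, 2, ..., 23 written one after another.
rep_I = " ".join(str(k) for k in range(1, 24))

# (II): the decimal expansion of 23!.
rep_II = str(factorial(23))

# (III): the prime factorization, built by dividing out the primes up to 23.
remaining, parts = factorial(23), []
for p in (2, 3, 5, 7, 11, 13, 17, 19, 23):
    exponent = 0
    while remaining % p == 0:
        remaining //= p
        exponent += 1
    parts.append(f"{p}^{exponent}" if exponent > 1 else str(p))
rep_III = " ".join(parts)

stats = {name: digit_stats(rep) for name, rep in
         (("I", rep_I), ("II", rep_II), ("III", rep_III))}
grand_total = sum(count + total for count, total in stats.values())
divisors_of_total = sum(1 for d in range(1, grand_total + 1)
                        if grand_total % d == 0)

print(rep_III)
print(stats)
print(grand_total, divisors_of_total, grand_total + divisors_of_total)
```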
Let us denote this pattern by “16+7=23”; we shall meet it again later in the next section (see Appendix 1).

The number of atoms, 384, could be derived otherwise. As a matter of fact, including the sum of the three 23s in the first members of Eqs.(I)-(III) and the nine “missing” zeros, including those of Eqs.(IV)-(V), which are “necessary” to express the equivalence of the three relations (I), (II) and (III), we get 360+(5+5+5)+9=384. Pushing the computation to its extreme, by including the number of digits in the three 23s, 6, we end up with 390, a very useful number (see below). Above, the total number of hydrogen atoms in the 61 amino acids was computed as a0(23!)+Ω(23!)+117=200+41+117=358, where 117 is the sum of all the digits in (II) as well as their number (zero excluded). Now, if we include the five zeros, we have 99+23=122 and, adding the sum of all the digits in (I), 114, we get 236. This is the number of CNOS-atoms (carbon, nitrogen, oxygen and sulfur) in the 61 amino acids. In this way we find the total number of atoms in the 61 amino acids, 236+358=594: 236 for CNOS and 358 for hydrogen H. Returning to the above number 390, it appears that it is just equal to the number of atoms in the 41 degenerate codons (amino acids) and, subtracting it from 594, gives 594-390=204, which is the number of atoms in 20 amino acids.

Now, we turn to another type of consequence of the number 23! that has not been published before. We introduced (Négadi, 2009) the use of certain simple, information-generating algorithms known as mathematical “black-holes” to “produce” biological information. One of them, call it the “black-hole(123)” algorithm, where the “black-hole” is the number 123, works as follows: (i) start with any number and count the number of even digits and the number of odd digits, (ii) write them down next to each other (by concatenation), followed by their sum.
Treat the result as a new number and continue the process. The process is very quick even for big numbers such as 23!. The first iteration, using (II), with 16 even digits and 7 odd digits, gives 16723 (16+7=23), and this first iteration agrees with the pattern “16+7”, mentioned several times in the first section. The second iteration gives 235, the third 123 and finally the fourth (check) also 123.

23! → 16723 → csod(16723)=1
SPI(16723)=359 → 235 → 123 → 123 → 840+1=841 → 841+A0(840)=874
a0(a0(16723))=603 → 235 → 123 → 123 → 1443 (+1)
φ⁽³⁾(16723)=1440 → 1443

As the number 16723 is a five-digit number, but with special consideration (see above), we shall treat it differently from the rest of the iterations, which are all three-digit numbers. It is capable of giving, as its consequences, two numbers: (i) 359, as the sum of the prime indices of its prime factors, and (ii) 603, as the sum of the prime factors of its a0-function (see the schematic diagram above). We see that the sum of the four numbers in the second line gives 840, the neutral serine octamer number. Adding the complete sum of the digits, 840+csod(16723), gives the protonated form 840+1=841. Finally, adding the A0-function of 840 leads to 840+csod(16723)+A0(840)=841+33=874, the serine octamer in the state of “maximum exchange” (see above). Now, we use the φ-function and consider its iterations applied to 16723. We get, at the third iteration, φ⁽³⁾(16723)=1440. If we just add to this last number the number of iterations, here 3, we have 1440+3=1443. This number is not unknown; it is the number of nucleons in the 23 AASs, 1255+188 (see above). The addition of csod(16723)=1, or 1443+1, could even fit the usual case where proline has 41+1 nucleons.
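The “black-hole(123)” procedure described in steps (i) and (ii) is short enough to state as code. The following sketch is one possible reading of those steps; the function names `blackhole_step`, `blackhole_trajectory` and `csod` are chosen here for illustration, and zero is counted as an even digit, which is what the count of 16 even digits in 23! implies.

```python
from math import factorial

def blackhole_step(n):
    """Concatenate (#even digits, #odd digits, #digits) of n into a new number."""
    digits = str(n)
    even = sum(1 for ch in digits if int(ch) % 2 == 0)   # zero counts as even
    odd = len(digits) - even
    return int(f"{even}{odd}{even + odd}")

def blackhole_trajectory(n):
    """Iterate blackhole_step until the value repeats; return the visited values."""
    seen = [n]
    while True:
        n = blackhole_step(n)
        if n == seen[-1]:
            return seen
        seen.append(n)

def csod(n):
    """Complete sum of digits: repeat the digit sum until one digit remains."""
    while n >= 10:
        n = sum(int(ch) for ch in str(n))
    return n

path = blackhole_trajectory(factorial(23))
print(path[1:])          # the iterates after 23! itself
print(csod(16723))
```

Calling `blackhole_trajectory` on other starting values is a quick way to watch them fall into 123 as well.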
The sum a0(a0(16723))+SPI(16723)+[φ⁽³⁾(16723)+3]+235+2×123=2×1443=2886 gives the same result as in Eq.(22), that is, the total number of atoms in 61 RNA-codons (2560) and in 3 stops (326), on the sole condition that we introduce the carbon identity, used several times above, into SPI(16723) and write it as 359-33+33=326+33. Rearranging the terms, we end up with 2560+326=2886.

We have shown in the previous section how the numeric structure of the serine octamer arises from the mathematical characteristics of the Heronian tetrahedron. It could be shown that the Tetraktys, too, is able to reveal this strange structure. First, we have to take the two tetrahedrons in order to include L- as well as D-amino acids. The number 350 is the sum of all the numbers situated at the 20 places, as explained above, and 375 corresponds to the (same) sum on the two pairs of faces (see above). These four numbers are 155, 220, 90 and 285. The following relation

(2×350+A0(155)+A0(220)+A0(90)+A0(285))+B0(375)
=(700+140+1)+33
=(840+1)+33
=874 (23)

which is constructed from the Tetraktys numbers only, gives again the same result for the serine octamer as in the first section, from the Heronian tetrahedron, or as above from the “black-hole(123)” algorithm. We speculate that there may exist a “link” between these different approaches. For example, the number of partitions of the number 23 is equal to 1255 (procedure numbpart(.) in computer algebra software). It appears also that the number of partitions of the number 20 is equal to 627. The difference leads to 1255-627=628, or 627+628=1255, and this relation is precisely one of the two Filatov identities mentioned at the beginning of section 2 (see Eq.(n)). Another interesting result comes when we take the ratio between the number of divisors of 23! and the φ-function of the number of partitions of 23:

(number of divisors of 23!)/φ(1255)=192000/1000=192 (24)

The result, 192, is equal to the total number of nucleotides in 64 codons and, as φ(840) is also equal to 192, we get 2×192=384, a nice starting point for establishing the multiplet structure of the genetic code in another way, that is, from one small number (Négadi, 2011).

As a last word, let us remark that a close (and strange) relation exists between serine, which was the focus of this paper, and proline, also known to be a quite singular amino acid (e.g., “entangled” side-chain and block). Notably, it has also been shown that chirality has a significant impact on the assembly of proline clusters (Myung et al., 2006). Looking at the acute apex of the Heronian tetrahedron in Figure 2, where proline resides, we have that the sum of the nucleon number of proline’s side-chain, 41, and its associated Tetraktys number, 64, is equal to 105=41+64, which is precisely the number of nucleons in serine (side-chain and block). Also, by calling again on the carbon identity (c), we get (41+33)+(64-33)=74+31, which is serine in detail: 31 for the side-chain and 74 for the block. Inversely, and first, the positive difference between the numbers for serine, 48 and 31, is equal to 17, which fits the atom number of proline. Second, take now the total sum of the nucleon numbers on the tetrahedron, 1255, and the Tetraktys numbers, 350, that is, 1605 (=3×5×107), and compute the sum of the prime factors of this sum: we find 3+5+107=8+107=115, which is precisely the total number of nucleons in proline. Introducing the carbon identity (33-33=0) gives 41+74; this is the correct partition into side-chain and block. Moreover, adding the sum of the prime indices of the prime factors and their number (Ω-function) to the numbers for serine, 31 and 48, gives 31+48+36=115.
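The partition numbers and the ratio in Eq. (24), discussed earlier in this section, can be checked with standard elementary algorithms. The sketch below does not use the Maple procedure named in the text; `partition_count`, `divisor_count` and `phi` are plain Python stand-ins written for this example.

```python
from math import factorial, gcd

def partition_count(n):
    """Number of partitions of n, by the classic coin-change recurrence."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

def divisor_count(n):
    """Number of positive divisors of n, from its prime exponents."""
    count, d = 1, 2
    while d * d <= n:
        exponent = 0
        while n % d == 0:
            n //= d
            exponent += 1
        count *= exponent + 1
        d += 1
    if n > 1:
        count *= 2
    return count

def phi(n):
    """Euler's totient: how many k in 1..n are coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

p20, p23 = partition_count(20), partition_count(23)
divisors = divisor_count(factorial(23))
ratio = divisors // phi(p23)

print(p20, p23, p23 - p20)
print(divisors, phi(p23), ratio)
print(phi(840))
```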
Finally, writing 36 as 8+28, we have (48+8+31)+28 and, by introducing the atom number (secondary) identity (1), we get 73+42=115. This is the other form of proline’s nucleon number: 41+1 in the side-chain and 74-1 in its block.

The “manifestation” of serine and its clusters seems not to be something linked only to the tetrahedron classification of the 20 amino acids considered in this paper. We have recently studied the small set of the amino acid precursors and found a similar “manifestation” of serine and its clusters, in particular the octamer (Négadi, 2011).

We end this paper by considering the possibility of including the “nonstandard” genetic codes, mentioned at the end of the introduction. We have seen in this paper that a prominent role is given to the number of amino acids, 61, and also to the 3 stops, i.e., to the standard genetic code. A question raised by a reviewer was whether something could be said, in the present model formalism, about the experimentally known “nonstandard” genetic codes, which are however also known to concern only very few living organisms compared to the standard, or quasi-universal, genetic code. This is an interesting question, and we shall show that indeed something could be said. As a matter of fact, we start from the number of nucleotides in the 61 coding (or sense) codons, 183=3×61, i.e., three nucleotides per codon. This number (derived by us several times, as in section 3, and also mentioned in Appendix 2) has the prime factorization 3×61, so that its a0-function (sum of the prime factors) is 61+3=64 with, here, an immediate new “interpretation”: 61 amino acid sense codons and 3 stop-codons. Please note in this latter case the “new” function of the number 3, as the number of “stops” for the “standard” genetic code which, we recall, concerns the great majority of living organisms.
Now, some 18 “nonstandard” genetic codes have been discovered in recent years (see Elzanowski and Ostell, 2010, for a recent updated compilation). Looking at the tables of these genetic codes, we see that the number of sense codons oscillates between 60 and 63, with only four possible (observed) cases: 60, 61, 62 and 63. Equivalently, the possible number of stop-codons oscillates between 1 and 4, with also four possible cases: 4, 3, 2 and 1. In general, in these “re-assignments”, and without entering into the details of the biochemical machinery, a sense codon could become a stop codon and, conversely, a stop codon could become a sense codon, coding for some “new” amino acid (in the same canonical set of 20 amino acids, or even the 21st or the 22nd amino acids, Selenocysteine or Pyrrolysine). It is precisely at this point that, one more time, Euler’s totient function φ, and also σ, the sum-of-divisors function, could help to “describe” these features. Take, as the starting point, the numbers 61 and 3 for the “standard” genetic code case, where the former refers to the sense codons and the latter to the stops. They are both prime (see above) and we have

φ(61)=60=61-1
σ(61)=62=61+1
φ(3)=2=3-1
σ(3)=4=3+1

For a prime p, recall that φ(p)=p-1 and σ(p)=p+1. It is not difficult to see that, by introducing the above relations in the “standard form” 61+3=64, three and only three other emerging cases, (ii)-(iv), are possible.
In summary, we have

(i) 64=61+3
(ii) 64=60+4
(iii) 64=62+2
(iv) 64=63+1

These four relations seemingly describe all the following 18 observed cases (and also the case of Selenocysteine and Pyrrolysine), where the number of stop-codons is indicated in parentheses (see Elzanowski and Ostell, 2010): The Standard Code (3); The Vertebrate Mitochondrial Code (4); The Yeast Mitochondrial Code (2); The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (2); The Invertebrate Mitochondrial Code (2); The Ciliate, Dasycladacean and Hexamita Nuclear Code (1); The Echinoderm and Flatworm Mitochondrial Code (2); The Euplotid Nuclear Code (2); The Bacterial, Archaeal and Plant Plastid Code (3); The Alternative Yeast Nuclear Code (3); The Ascidian Mitochondrial Code (2); The Alternative Flatworm Mitochondrial Code (1); Blepharisma Nuclear Code (2); Chlorophycean Mitochondrial Code (2); Trematode Mitochondrial Code (2); Scenedesmus Obliquus Mitochondrial Code (3); Thraustochytrium Mitochondrial Code (4); and Pterobranchia Mitochondrial Code (2). It appears that the inclusion of all these genetic code variants could be extended also to the level of the number of nucleotides itself, beyond the number of codons, thanks again to the φ-function. We shall develop these new results in a forthcoming paper.

Appendix 1

In this appendix we give some numeric data concerning the 20 amino acids and the DNA and RNA units used in the text.
M   Amino acid          H     C    N/O/S      #Atom   #Nucleon
4   Proline (P)         5     3    0          8       41
4   Alanine (A)         3     1    0          4       15
4   Threonine (T)       5     2    0/1/0      8       45
4   Valine (V)          7     3    0          10      43
4   Glycine (G)         1     0    0          1       1
6   Serine (S)          3     1    0/1/0      5       31
6   Leucine (L)         9     4    0          13      57
6   Arginine (R)        10    4    3/0/0      17      100
2   Phenylalanine (F)   7     7    0          14      91
2   Tyrosine (Y)        7     7    0/1/0      15      107
2   Cysteine (C)        3     1    0/0/1      5       47
2   Histidine (H)       5     4    2/0/0      11      81
2   Glutamine (Q)       6     3    1/1/0      11      72
2   Asparagine (N)      4     2    1/1/0      8       58
2   Lysine (K)          10    4    1/0/0      15      72
2   Aspartic acid (D)   3     2    0/2/0      7       59
2   Glutamic acid (E)   5     3    0/2/0      10      73
3   Isoleucine (I)      9     4    0          13      57
1   Methionine (M)      7     3    0/0/1      11      75
1   Tryptophane (W)     8     9    1/0/0      18      130
    Total               117   67   9/9/2=20   204     1255

The 20 amino acids atomic composition

In the above table, the detailed atomic composition of the amino acid side-chains is given: H for hydrogen, C for carbon, N for nitrogen, O for oxygen and S for sulfur. The atom and nucleon numbers are also given. The 20 amino acids are organized into the five known multiplets of the standard genetic code: 5 quartets (M=4), 3 sextets (M=6), 9 doublets (M=2), 1 triplet (M=3) and 2 singlets (M=1); the multiplicity M gives the number of codons. The amino acids are given with their usual one-letter code (in parentheses), and only the numbers for the side chains are given. When one considers also the blocks, the corresponding number for the (common) block must be added; for example, for the number of atoms one must add 9, for the nucleons 74, etc.

Second, the number of atoms in the five nucleobases (or nucleotides) is as follows: Uracil (U, 12)/Thymine (T, 15), Cytosine (C, 13), Adenine (A, 15) and Guanine (G, 16); see the picture below (courtesy of Dr. Gary E. Kaiser). When the “block”, made of the ribose sugar and phosphate, is added to the nucleotides, the ribonucleosides have the following content in atoms: UMP (C9H13N2O9P, 34), CMP (C9H14N3O8P, 35), AMP (C10H14N5O7P, 37), GMP (C10H14N5O8P, 38) (see Rakočević, 1997).
Figure: The nitrogenous nucleobases of RNA and DNA

Appendix 2

In this appendix, we give the definitions of some elementary mathematical tools used in this paper. First, we use the Fundamental Theorem of Arithmetic, which states that every natural number $n > 1$ can be written, uniquely, as a product of primes, each raised to a given exponent:

$$n = p_1^{a_1} p_2^{a_2} p_3^{a_3} p_4^{a_4} \cdots$$

For a given number $n$, the arithmetic function $a_0(n)$ gives the sum of the prime factors of $n$, including multiplicity. When the multiplicities are discarded, the corresponding function is called $a_1(n)$. We also define the function $A_0(n)$ to be the sum of $a_0(n)$ and the sum of the prime indices of the prime factors. The big-Omega function $\Omega(n)$ counts the number of prime factors, counted with multiplicity. We also define the function $B_0(n)$ as $A_0(n) + \Omega(n)$.

Let us give an example. Take the number $n = 183$, mentioned in section 3. Its prime decomposition is $3 \times 61$, the prime indices of the two prime factors 3 and 61 are respectively 2 and 18, and $\Omega(183) = 2$. We have $a_0(183) = 3 + 61 = 64$, $A_0(183) = a_0(183) + 2 + 18 = 84$ and $B_0(183) = A_0(183) + \Omega(183) = 86$.

We also use the famous Euler totient or $\varphi$-function (a fundamental mathematical object in cryptography), which gives the number of positive integers not exceeding $n$ that are co-prime to $n$; in particular, for a prime $p$ it is simply $p - 1$. In other words, it counts how many numbers in the set {1, 2, 3, …, n} share no common factor greater than one with $n$. For any number $n$, the general formula

$$\varphi(n) = p_1^{a_1 - 1}(p_1 - 1)\, p_2^{a_2 - 1}(p_2 - 1) \cdots p_k^{a_k - 1}(p_k - 1)$$

can be used to compute $\varphi$ by hand, but one can also, more quickly, use computer software such as phi(n) and sigma(n) in Maple 6, used here.

Acknowledgments: I thank the reviewers for their very constructive comments.

REFERENCES

Abramowitz, M.; Stegun, I. A. (1964), Handbook of Mathematical Functions, New York: Dover Publications, ISBN 0-486-61272-4. See paragraph 24.3.2.
(see also http://en.wikipedia.org/wiki/Euler's_totient_function)

Buchholz, R. H. (1992) Perfect Pyramids, Bull. Austral. Math. Soc., 45, 3, 353-368. (See also http://mathworld.wolfram.com/HeronianTetrahedron.html)

Cooks, R. G., Zhang, D., Koch, K. J., Gozzo, F. C., Eberlin, M. N. (2001) Chiroselective self-directed octamerization of serine: implications for homochirogenesis, Anal. Chem., 73, 3646-3655.

Costa, A. A., Cooks, R. G. (2001) Origin of chiral selectivity in gas-phase serine tetramers, Phys. Chem. Chem. Phys., 877-85.

Downes, A. M., Richardson, B. J. (2002) Relationships between genomic base content and distribution of mass in coded proteins, J. Mol. Evol., 55, 476-490.

Elzanowski, A. and Ostell, J. (2010) http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

Filatov, F. (2009) A molecular mass gradient is the key parameter of the genetic code organization, http://arxiv.org: q-bio. OT/0907.3537.

Hasegawa, M. and Miyata, T. (1980) On the antisymmetry of the amino acid code table, Origins of Life, 10, 265-270.

Hodyss, R., Julian, R. R., Beauchamp, J. L. (2001) Spontaneous chiral separation in noncovalent molecular clusters, Chirality, 13, 703.

Mainzer, K. (1996) Symmetries of Nature, de Gruyter, Berlin (German Edition: 1988).

Myung, S., Fioroni, M., Julian, R. R., Koeniger, S. L., Baik, M.-H., Clemmer, D. E. (2006) Chirally Directed Formation of Nanometer-Scale Proline Clusters, J. Am. Chem. Soc., 128, 10833-10839.

Négadi, T. (2007) The genetic code multiplet structure, in one number, Symmetry: Culture and Science, 18, 23, 149-160. (Available also at http://arxiv.org: q-bio. OT/0707.2011).

Négadi, T. (2008) The genetic code via Gödel encoding, The Open Physical Chemistry Journal, 2, 1-5. (Available also at http://arxiv.org: q-bio. OT/0805.0695).

Négadi, T. (2009) The genetic code degeneracy and the amino acids chemical composition are connected, NeuroQuantology, 7, 1, 181-187. (Available also at http://arxiv.org: q-bio. OT/0903.4131).
Négadi, T. (2009) A taylor-made arithmetic model of the genetic code and applications, Symmetry: Culture and Science, 20, 1-4, 51-76.

Négadi, T. (2011) The multiplet structure of the genetic code, from one and small number, NeuroQuantology, 9, 4, 767-771.

Négadi, T. (2011) A "quantum-like" approach to the genetic code, NeuroQuantology, 9, 4, 785-798.

Nemes, P., Schlosser, G., Vékey, K. (2005) Amino acid cluster formation studied by electrospray mass spectrometry, Journal of Mass Spectrometry, 40, 43-49.

Rakočević, M. M. (1997) Genetic Code as a unique system, SKC Niš, Appendix 3.

Rakočević, M. M. (2009) Genetic Code Table: A note on the three splittings into amino acid classes, http://arxiv.org: q-bio. OT/0903.4110.

Scherbak, V. (2008) The Arithmetical Origin of the genetic code, in The Codes of Life: The Rules of Macroevolution, M. Barbieri (ed.), Springer 2008.

Yang, H., Sheng, G., Peng, X., Qiang, B., Yuan, J. (2003) D-Amino acids and D-Tyr-tRNATyr deacylase: stereospecificity of the translation machine revisited, FEBS Letters, 552, 95-98.

Symmetry: Culture and Science Vol. 23, Nos. 3-4, 427-447, 2012

THEORY OF TOPOLOGICAL CODING OF PROTEINS AND NATURE OF ANTISYMMETRY OF THE AMINO ACIDS CANONICAL SET

Vladimir A. Karasev

Biochemist, (b. Saint-Petersburg (Leningrad), Russian Federation (USSR), 1947).
Address: Microtechnology Center, St. Petersburg State Electrotechnical University, Prof. Popov str. 5, 197376 St. Petersburg, Russia. E-mail: [email protected].
Fields of interest: Biochemistry, bioinformatics, theoretical biology, mathematical biology.
Publications: Karasev, V.A. and Sorokin, S.G. (1997) Topological structure of the genetic code, Russ. J. Genetics, B33, 622-628; Karasev, V.A. and Stefanov, V.E. (2001) Topological nature of the genetic code, J. Theor. Biol. B209, 303-317; Karasev V.A., Luchinin V.V. and Stefanov V.E. (2005) Series in Mathematical Biology and Medicine, B.8. Proceedings of the International Conference "Advances in Bioinformatics and Its Applications", New Jersey: World Scientific Publishing Co., pp. 482-493; Karasev V.A., Luchinin V.V. and Stefanov V.E. (2007) A model of the "molecular vector machine" for protein folding, Proceedings of the 3rd Moscow conference on computational molecular biology, Moscow, Russia, July 27-31, 2007, pp. 134-136; Karasev V.A. and Luchinin V.V. (2009) Vvedenie v konstruirovanie bionicheskich nanosistem [Introduction to the design of bionical nanosystems, in Russian], Moscow: "Fizmatlit", 464 pp.

Abstract: The present review is dedicated to the theory of topological encoding of proteins. Consistent development of this theory has led to a model of molecular vector machines (MVM) of proteins, which includes: a protein fragment containing five amino acids (pentafragment), the dodecahedron containing a group of 20 vectors, a canonical set of changeable physical operators (side chains of amino acids) and the tetrahedral i-th Cα atom. The action of the vectors is aimed at the formation of the hydrogen bond NiH…Oi-4=C in the protein pentafragment, the side chains of amino acids realizing this action as physical operators. It is shown that the group of vectors possesses symmetry, while the side chains of amino acids manifest antisymmetry. A dodecahedron-based model for the structure of the canonical set of amino acids is proposed. The model forms part of the MVM. Antisymmetry of the side chains of amino acids is considered in connection with their involvement in the MVM structure of the protein.

Keywords: Genetic code, amino acids canonical set, antisymmetry, theory of topological coding, proteins.

1. INTRODUCTION

The nature of the canonical set of 20 amino acids which are contained in protein molecules and coded by the genetic code is still unclear (Schulz and Schirmer, 1979; Higgs and Pudritz, 2009). One of the approaches to this problem is the development of classifications for amino acids.
There are several ways to classify the amino acids' side chains. The most common is the classification according to the physicochemical properties of their radicals (Campbell and Smith, 1994) (Table 1). Chemically, side chains are rather heterogeneous. For example, Asp and Glu are carboxylic acids and Lys is a primary amine; five side chains contain cycles: Pro is a saturated imino acid, whereas Phe, His, Tyr and Trp are aromatic compounds. The physicochemical approach cannot disclose grouping principles for the side chains.

The genetic code is a natural basis for classification of amino acids. One of the approaches considers the complementarity of amino acids encoded by complementary triplets (Mekler and Idlis, 1993; Siemion et al., 2004). A version of the periodic table of amino acids based on properties of the genetic code was proposed by Biro et al. (2003).

Table 1: Structure of side chains of canonical amino acids
[Structural formulas not recoverable from the extracted text; the groups and their members are listed below.]
Nonpolar and weakly polar: Glycine (Gly), Alanine (Ala), Proline (Pro), Leucine (Leu), Isoleucine (Ile), Valine (Val), Serine (Ser)(-), Threonine (Thr)(-)
Polar: Aspartic acid (Asp)(-), Glutamic acid (Glu)(-), Lysine (Lys)(+), Arginine (Arg)(+)
Neutral and sulfur-containing: Asparagine (Asn), Glutamine (Gln), Cysteine (Cys)(-), Methionine (Met)
Cyclic: Phenylalanine (Phe), Tyrosine (Tyr), Histidine (His), Tryptophane (Trp)
* The side chains are attached to the α-carbon atom, marked with a circle.

Methods of group theory were applied to genetic code triplets (Balakrishnan, 2002).
Four groups of multiplets of triplets coding amino acids with different properties were attributed to SU(4) group symmetry. There are also spatial models of the genetic code (Jimenez-Montaño et al., 1996; Karasev and Sorokin, 1997), in which amino acids play a subordinate role. A number of approaches to the classification of amino acids were stimulated by the need to predict protein structure (Taylor, 1986; Kosiol et al., 2004; Esteve and Falceto, 2005). Amino acids can also be classified on the basis of their role in the protein structure: passive chains and active functional modules (Karasev et al., 1994). Another classification considers amino acid chains as physical operators reconstituting the encoded structure (Karasev and Stefanov, 2001).

The purpose of the present work is a spatial representation of the structure of the amino acids' canonical set based on the theory of topological encoding of proteins. Earlier the problem was addressed in a preliminary study (Karasev et al., 2005; 2007).

2. THE BASIC PRINCIPLES OF THE THEORY FOR TOPOLOGICAL CODING OF PROTEINS

2.1 Topological code

2.1.1 Definitions

The main elements of the theory for topological coding of proteins are the topological code, the system of physical operators recreating the encoded structure, and the model of molecular vector machines (Karasev et al., 2000; Karasev and Stefanov, 2001; Karasev and Luchinin, 2009, a, b, c).

As an elementary unit of the protein we isolated a fragment of five amino acids (pentafragment). Our choice is motivated by the fact that it is the minimal fragment of a protein capable of forming a cycle with one H-bond (between NiH and Oi-4=C), which has the loose conformation of an α-helix (Schulz and Schirmer, 1979).
A link in the protein, in the framework of our theory, is a fragment of two amino acids joined by a peptide bond. Accordingly, four links can be distinguished in the structure of a pentafragment; therefore we also call pentafragments 4-unit fragments of the protein. The mathematical analogues of protein pentafragments are 4-arc chain graphs (Fig. 1, a, b) (Karasev and Stefanov, 2001).

Figure 1: Cyclic 4-unit fragment of the protein with an H-bond between NiH…Oi-4=C (a); its 4-arc chain graph with a connectivity edge between the i-th and (i-4)-th vertices (b); its matrix description (c); matrix of 6 variables (d).

We have used 4-arc chain graphs to construct a model of the topological code. In the structure of the graph two types of edges are distinguished: the structural edges connecting adjacent vertices in a linear chain and the connectivity edges connecting nonadjacent vertices. These edges define a variety of conformations of the graph (Karasev and Stefanov, 2001). The location of the edges can be described using upper triangular matrices of six variables, each taking the value 0 (no connectivity edge) or 1 (presence of the connectivity edge), as shown in Fig. 1, c and Fig. 1, d.

We distinguish two types of conformations (connectivity states) of protein pentafragments and their graphs:

Acyclic: conformations which lack the H-bond NiH…Oi-4=C in the protein fragment and contain no connection between the i-th and (i-4)-th vertices in the graph (x3 = 0);

Cyclic: conformations in which the hydrogen bond NiH…Oi-4=C in the protein fragment is formed and the connection between the i-th and (i-4)-th vertices exists (x3 = 1).

2.1.2 Supermatrix conformations of 4-arc graph

All 64 possible conformations of the 4-arc graph were considered (Karasev et al., 2000; Karasev and Stefanov, 2001). As can be seen in Figure 2, the supermatrix contains 4 blocks.
The main property of each block is a common value of the second pair of variables, x3x4 (shown in the headings of the blocks). Blocks 00 and 01 contain acyclic graph conformations (x3 = 0), and blocks 10 and 11 contain cyclic conformations (x3 = 1). The blocks are constructed according to the following rules: rows are generated by the first pair of variables (x1x2) in the sequence 00, 10, 01, 11 and columns by the third pair of variables (x5x6) in the sequence 00, 01, 10, 11.

2.1.3 Symmetry in the supermatrix

One can find two types of symmetry in the supermatrix. One exists within the blocks. Thus, in the first block (x3x4 = 00) the conformations of the first and third pairs are symmetrical, i.e. 00 ↔ 00, 10 ↔ 01, etc. The corresponding matrices and graphs are arranged symmetrically with respect to the main diagonal. A particular case of this symmetry is the intrinsic symmetry of matrices lying on the main diagonals, e.g. 000000, 100001, 010010, 110011.

Figure 2: Supermatrix conformations of 4-arc graph and their matrix description

The second type of symmetry is related to the structure of the supermatrix as a whole. Two groups of matrices, in which the 0-elements of a matrix belonging to one group correspond to the 1-elements of a matrix of the other group and vice versa, e.g. 000000 ↔ 111111, 100000 ↔ 011111, 010000 ↔ 101111, occupy positions which are related by C2 symmetry (separated by a solid line in Fig. 2). This type of symmetry was called antisymmetry (Karasev et al., 2000), and the transformation 0 ↔ 1 itself the conversion of antisymmetry. Acyclic and cyclic conformations related by the C2 symmetry are antisymmetric. Thus the completely disconnected graph, described by the matrix of six "0" (located in the upper left corner of the supermatrix), is antisymmetric to the graph with a cyclic conformation described by the matrix of six "1" (located in the lower right corner).
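The construction rules and the two symmetries can be checked by enumeration. The sketch below is one possible reading of them, not the authors' own procedure: each conformation is the 6-bit string x1…x6, and its position is given by three indices (block, row, column). The names cell, mirror and complement are introduced here for illustration, and the linear ordering of the blocks (00, 01, 10, 11) is an assumption; the actual layout of the blocks in Fig. 2 is not recoverable from the text.

```python
from itertools import product

ROW_ORDER = ["00", "10", "01", "11"]    # x1x2: rows within a block
COL_ORDER = ["00", "01", "10", "11"]    # x5x6: columns within a block
BLOCK_ORDER = ["00", "01", "10", "11"]  # x3x4: one value per block (assumed order)


def cell(b, r, c):
    """6-bit string x1..x6 located at block b, row r, column c."""
    return ROW_ORDER[r] + BLOCK_ORDER[b] + COL_ORDER[c]


def complement(bits):
    """Conversion of antisymmetry: exchange 0 and 1 in every position."""
    return "".join("1" if ch == "0" else "0" for ch in bits)


def mirror(bits):
    """Exchange the first and third pairs, each read backwards (x1<->x6, x2<->x5)."""
    return bits[5] + bits[4] + bits[2:4] + bits[1] + bits[0]


cells = {idx: cell(*idx) for idx in product(range(4), repeat=3)}
position = {bits: idx for idx, bits in cells.items()}

# Symmetry within a block: mirroring swaps the row and column indices.
within_block = all(position[mirror(bits)] == (b, c, r)
                   for (b, r, c), bits in cells.items())

# Antisymmetry: complementing reverses all three indices (a half-turn).
half_turn = all(position[complement(bits)] == (3 - b, 3 - r, 3 - c)
                for (b, r, c), bits in cells.items())

diagonal_of_first_block = [cell(0, k, k) for k in range(4)]
print(len(position), within_block, half_turn, diagonal_of_first_block)
```

Under these assumptions the row and column orders are exactly what makes the mirror partner of each entry land in the transposed position, and the diagonal of the first block reproduces the four self-symmetric matrices quoted in the text.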
The spatial representation of the resulting supermatrix composed of "6 variables" elements is the Boolean hypercube B6, which reflects all of the possible single-bit transitions between elements (Karasev et al., 2000; Karasev and Stefanov, 2001; Karasev and Luchinin, 2009, a).

2.1.4 Transformation of supermatrix into the triplet genetic code

Information about the graph structure presented in matrix form cannot be used for transmission, reproduction and copying. It should be transformed into the suitable form of an unbranched chain (Karasev and Stefanov, 2001). The number of variables in the matrices describing the conformation of the 4-arc graph is equal to 6, i.e. 3 pairs. For the pairs of variables $x_i x_{i+1}$ we introduce the notation XYZ (Scheme 1):

$$x_1x_2 \to X, \qquad x_3x_4 \to Y, \qquad x_5x_6 \to Z \tag{1}$$

The values 00, 10, 01, 11 assumed by $x_i x_{i+1}$ can be denoted by the symbols C, G, U, A of the genetic code (Scheme 2):

$$C = 00, \qquad G = 10, \qquad U = 01, \qquad A = 11 \tag{2}$$

Using this correspondence, we transform the supermatrix into the code (Fig. 3). Triplets appear together with the amino acids they code for. Information on the structure of the 4-arc graph in terms of the 4-letter code assumes the form of a linear chain. The second letter of the triplet (Y), the same for the whole block, codes for the variables x3 and x4. It contains the main information on the graph structure.

As can be seen in Figure 3, the second pairs of variables were transformed into the second letters of the triplets in the sequence C, U, G, A. The third letters of the triplets have the same order, while the first letters are arranged in a different order, C, G, U, A, which is conditioned by the rules of the conversion.

2.1.5 Symmetry in the table of genetic code

Symmetric matrices describing symmetric conformations of the 4-arc graph are encoded by triplets arranged symmetrically with respect to the main diagonals of the blocks, for example CCU – GCC, UCC – CCG (Fig. 3).
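The conversion of Schemes 1 and 2 and the symmetric pairs quoted above can be reproduced with a few lines of code. This is a minimal sketch under the reading that the symmetric partner of a matrix is obtained by exchanging x1 with x6 and x2 with x5; the function names are introduced here for illustration only. The helper neighbours lists the single-bit transitions, i.e. the six edges meeting at each vertex of the hypercube B6.

```python
PAIR_TO_LETTER = {"00": "C", "10": "G", "01": "U", "11": "A"}  # Scheme 2
LETTER_TO_PAIR = {letter: pair for pair, letter in PAIR_TO_LETTER.items()}


def bits_to_triplet(bits):
    """Scheme 1: read x1x2, x3x4, x5x6 as the letters X, Y, Z."""
    return "".join(PAIR_TO_LETTER[bits[i:i + 2]] for i in (0, 2, 4))


def triplet_to_bits(triplet):
    """Inverse conversion: three letters back to the six variables."""
    return "".join(LETTER_TO_PAIR[letter] for letter in triplet)


def mirror_triplet(triplet):
    """Partner across the block diagonal (x1<->x6, x2<->x5, middle pair kept)."""
    b = triplet_to_bits(triplet)
    return bits_to_triplet(b[5] + b[4] + b[2:4] + b[1] + b[0])


def neighbours(triplet):
    """Triplets that differ from the given one in exactly one variable."""
    b = triplet_to_bits(triplet)
    flipped = (b[:i] + ("1" if b[i] == "0" else "0") + b[i + 1:] for i in range(6))
    return [bits_to_triplet(f) for f in flipped]


print(bits_to_triplet("000001"), mirror_triplet("CCU"), mirror_triplet("UCC"))
print(neighbours("CCC"))
```

With this reading, CCU and UCC are mapped to GCC and CCG respectively, as in the examples of the text, and every triplet has exactly six single-bit neighbours.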
Antisymmetric matrices, related by the symmetry C2 (separated by a thick line), are encoded by triplets which transform into each other according to Rumer's rule (Rumer, 1968): C ↔ A, G ↔ U, for example CCC ↔ AAA, CCU ↔ AAG, CCG ↔ AAU and so on.

Figure 3: Transformation of the supermatrix describing the conformation of the 4-arc graph into the triplet genetic code (according to Karasev and Stefanov, 2001).

Since the code is based, as we have shown, on a description of the conformations of the 4-arc graph (or, equivalently, of protein pentafragments), it is clear that Rumer's rule connects antisymmetric conformations of the protein encoded by triplets.

The spatial structure of the supermatrix is the Boolean hypercube B6 (see Section 2.1.3). It is clear that the triplet genetic code, obtained on the basis of the correspondences of Schemes 1 and 2, must have a structure that is isomorphic to the hypercube B6 (Jimenez-Montaño et al., 1996; Karasev and Sorokin, 1997).

Thus, we have shown that the genetic code has a topological nature and is associated with the encoding of conformations of protein pentafragments. The nature of the "triplet – amino acid assignment" can be explained from the same position (Karasev and Stefanov, 2001).

2.2 Physical operators and their assignment to genetic code triplets

2.2.1 The definition of "physical operator"

For the protein to recreate the encoded graph structure (acyclic or cyclic conformation), a definite assignment must exist between the coding triplets and the side chains of the protein (Karasev et al., 2000; Karasev and Stefanov, 2001). The only site that can be affected by the side chain R of the just bound i-th amino acid is the area where the hydrogen bond between the groups NiH…Oi-4 is formed (Fig. 4, a, link shown by an arrow). This is represented by the variable x3 in the matrix.

Connectivity and anti-connectivity operators can be distinguished by their mode of action (Fig. 4, b and 4, c). To realize their functions, they must meet a number of requirements:
• they must have a group capable of acting on the bond NiH...Oi-4;
• their size should be of the same order as the scope;
• their spatial position should always be the same.

The latter requirement can be met only if the protein links are chiral (D- or L-type). Thus, in our approach the chirality of amino acids exists due to their participation, as physical operators, in the reconstruction of the encoded protein conformations.

Connectivity operators are amino acid side chains which provide additional fixation of pentafragments, e.g. due to hydrogen bonds, in accordance with the encoded cyclic fragment of the 4-arc graph. For this, connectivity operators should have end groups capable of forming hydrogen bonds. The connectivity operator type is shown in Fig. 4, b. In the matrix x3 = 1.

Anti-connectivity operators are amino acid side chains which obstruct the formation of a cyclic protein pentafragment in accordance with the encoded fragment of the 4-arc graph, providing recreation of the acyclic conformation. The side chains of the anti-connectivity operators should be introduced into the region of the NiH...Oi-4 bond and prevent the formation of the H-bond (Fig. 4, c). In the matrix x3 = 0. The side chains of anti-connectivity operators, as a rule, should not have groups capable of forming hydrogen bonds.

It is shown in (Karasev and Stefanov, 2001; Karasev and Luchinin, 2009, a) that the side chains of amino acids satisfy the above requirements.

Figure 4: The definition of "physical operator".
a: the scope of the physical operator; b: connectivity operator; c: anti-connectivity operator.

2.2.2 Assignment of physical operators to blocks of triplets of the genetic code

In the supermatrix of the genetic code, as follows from Figure 3, there are two types of blocks: two blocks of triplets coding for acyclic conformations of the protein (C = 00 and U = 01), whose matrices include the variable x3 = 0, and two blocks with cyclic conformations (G = 10 and A = 11), for which x3 = 1.

The property of the anti-connectivity operators is the recreation of the conformations of the acyclic graph (x3 = 0), so they should be assigned to the blocks C = 00 and U = 01. The group property of the connectivity operators is the reconstruction of cyclic conformations (x3 = 1), so they should correspond to the blocks G = 10 and A = 11.

As can be seen from Figure 3, mainly non-polar side chains (Pro, Ala, Val, Leu, Ile, Phe and Met) correspond to blocks C = 00 and U = 01, which meets the requirements for the anti-connectivity operators. At the same time, blocks G = 10 and A = 11 correspond to amino acid side chains capable of forming hydrogen bonds (Arg, Ser, Cys, Trp, His, Gln, Asp, Glu, Tyr, Asn, Lys), which is also consistent with the above requirements.

It should be noted that the side chain of Ser is present both in block C = 00, containing acyclic conformations (triplets UCC, UCU, UCG, UCA), and in block G = 10, which includes cyclic conformations (triplets AGC, AGU). Within our approach, this may be due to the fact that the C–OH group of Ser can form an H-bond both with the Oi-3=C group and with the Oi-4=C group. In the first case, the C–OH group contributes to the formation of the H-bond NiH…Oi-3=C in block C = 00 (the pentafragment here is acyclic, since a cyclic one must include the bond NiH…Oi-4=C by definition, see Section 2.1.1). In the second case, the C–OH group promotes the formation of the cyclic pentafragment in block G = 10 due to the connection NiH…Oi-4=C.
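The assignment by second letter can be tabulated directly. The sketch below uses the standard genetic code table, which is not reproduced in the text and is entered here in the common UCAG ordering as an assumption of this example; the rule x3 = 0 for second letters C and U and x3 = 1 for G and A follows Scheme 2. The function name residues_by_x3 is hypothetical.

```python
BASES = "UCAG"

# Standard genetic code in UCAG order (first, second, third letter); "*" = stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

STANDARD_CODE = {
    first + second + third: AMINO[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}

# x3 is the first bit of the pair coded by the second letter (C=00, U=01, G=10, A=11).
X3_OF_SECOND_LETTER = {"C": 0, "U": 0, "G": 1, "A": 1}


def residues_by_x3():
    """Collect the amino acids (one-letter code) found in the x3 = 0 and x3 = 1 blocks."""
    groups = {0: set(), 1: set()}
    for codon, residue in STANDARD_CODE.items():
        if residue != "*":
            groups[X3_OF_SECOND_LETTER[codon[1]]].add(residue)
    return groups


groups = residues_by_x3()
print("x3 = 0:", sorted(groups[0]))
print("x3 = 1:", sorted(groups[1]))
print("in both:", sorted(groups[0] & groups[1]))
```

In this tabulation serine is the only residue that appears in both groups, which is the special case discussed in the preceding paragraph.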
In general, the problem of triplet-amino acid assignment in the genetic code (Crick, 1968; Knight et al., 1999) can be solved in the framework of the concepts of connectivity and anti-connectivity physical operators.

2.2.3 Recreating symmetric conformation of the protein

We have considered the action of the operators encoded by different triplets, recreating the symmetric conformation (Karasev and Stefanov, 2001; Karasev and Luchinin, 2009, a). Fig. 5 shows the proposed mechanism of action of these operators. Suppose that these are connectivity operators with similar properties but of different size (Fig. 5). Let us denote the functional groups situated at the end of the chains as O=CNH2. As seen from Fig. 5, the hydrogen bonds of two side chains of different length have different slopes and, hence, differently directed field lines. The connectivity of the i-th and (i-4)-th α-carbon atoms in the two cycles is the same (dotted line), whereas the connectivity of the other atoms is different.

Figure 5: Assignment of physical operators to triplets encoding symmetric conformations of the graph (panels a, b).

The longer side chain forms lines of force directed to the left, and there is a connectivity edge between i-2 and i-4 (variable x6 = 1, Fig. 5, a), while the shorter chain forms lines of force directed to the right, and there is a connectivity edge between i and i-2 (variable x1 = 1, Fig. 5, b). The reasoning is, apparently, applicable to the anti-connectivity operators as well.

The consecutive analysis of the possible actions of all side chains on the bond area NiH...Oi-4=C from the standpoint of the theory of topological coding has led to the concept of the "molecular vector machine" of proteins.

3. MOLECULAR VECTOR MACHINE AND STRUCTURE OF THE AMINO ACIDS CANONICAL SET

3.1 Model of molecular vector machine

3.1.1 The allocation of planes of symmetry and setting the vectors

Earlier (Karasev and Luchinin, 2009, a, c) the molecular vector machine model was described for chain polymers.
Let us consider the area of the NiH…Oi-4=C bond of the main protein chain in detail (Fig. 6). In addition, let us take into account that the HNC=O groups in proteins have a partially delocalized double bond, which makes the considered group flat (Schulz and Schirmer, 1979). Moreover, all six atoms surrounding the group between the (i-3)-th and (i-4)-th residues (Cαi-3, H, Ni-3, Ci-4, Oi-4, Cαi-4) are located in one plane.

Due to the partial delocalization, the electron environment of the HNC=O group can be described by three sp2-hybridized clouds (at Oi-4, Ci-4 and Ni-3). One of them (that of the oxygen atom) is shown in Figure 6. One can draw three mutually perpendicular planes (I - III, Fig. 6, a) through the Ci-4=Oi-4 and NiH groups, dividing the sp2-hybridized clouds into parts (plane I: right and left parts; plane II: front and rear parts; plane III: upper and lower parts).

Possible directions of the action of the side chains on the NiH…Oi-4=Ci-4 bond (Fig. 6, b-d) were considered, which revealed at least 20 vectors of action. Within plane I two pairs of vectors are allocated: along the NiH…Oi-4=Ci-4 bond and across this bond (Fig. 6, b). On the basis of reflection transformations (for the transitions through planes I and II) and a rotation transformation (for the transition through plane III), two subgroups of eight vectors, directed respectively along plane II (Fig. 6, c) and along plane III (Fig. 6, d), are allocated.

3.1.2 Setting the directions of vectors. Model of the molecular vector machine

The action of the vectors is realized with the aid of the side chains of amino acids, which have real physical dimensions. Therefore, they must build a kind of spatial figure. The most suitable figure for this purpose is the dodecahedron. It has 20 vertices, which corresponds to the number of amino acids in the canonical set.
Figure 6: The introduction of planes of symmetry (a) and the possible arrangement of vectors in the area of the NiH...Oi-4=Ci-4 bond (b-d).

Atom Oi-4 is placed in the center of the dodecahedron, group Ni in one of its vertices, and the vectors are directed to the vertices of the dodecahedron (Fig. 7). The dodecahedron vertices and the vectors directed to them are divided into four groups and designated according to their symmetry. In plane I, the vertex corresponding to atom Ni is marked with the letter A, and the vertex connected with the vertex A via the operation of rotation is marked by adding a minus sign to A (-A). Together they form subgroup 1. Two other vertices located in plane I and designated as B and (-B) are also interconnected by this operation and form subgroup 2.

The vertices symmetric with respect to plane I and located above plane III on the left and on the right are designated by the letter A with either a lower right or an upper left index, respectively. The vertices located below plane III have the same notation with a minus sign in front. Together, they form the third subgroup, consisting of 8 vertices connected by symmetry transformations (Table 2). Similar notation is used for the fourth subgroup, consisting of eight elements designated by the letter B.

Figure 7: Model of molecular vector machines

Subgroup   Vertices (vectors)
1          A, -A
2          B, -B
3          A_1, A_2, ^1A, ^2A, -A_1, -A_2, -^1A, -^2A
4          B_1, B_2, ^1B, ^2B, -B_1, -B_2, -^1B, -^2B

Table 2: The subgroups of vectors connected by transformations of symmetry

Thus, the 20 vertices of the dodecahedron and the 20 vectors directed to them form four subgroups according to the transformations about the planes drawn in the dodecahedron.
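The counting behind this partition can be illustrated on an explicit dodecahedron. The sketch below assumes the standard golden-ratio coordinates of a regular dodecahedron centred at the origin and takes plane I to be the coordinate plane x = 0; both the orientation and the variable names are choices of this example, not of the model. It only checks the counts (4 vertices in plane I, 16 off it in mirror pairs, 10 diameters) and does not attempt to reproduce the split of the 16 off-plane vertices into subgroups 3 and 4.

```python
from itertools import product
from math import isclose, sqrt

PHI = (1 + sqrt(5)) / 2  # golden ratio


def dodecahedron_vertices():
    """The 20 vertices of a regular dodecahedron centred at the origin."""
    verts = [tuple(float(s) for s in signs) for signs in product((-1, 1), repeat=3)]
    for a, b in product((-1, 1), repeat=2):
        verts.append((0.0, a / PHI, b * PHI))
        verts.append((a / PHI, b * PHI, 0.0))
        verts.append((a * PHI, 0.0, b / PHI))
    return verts


VERTS = dodecahedron_vertices()


def is_vertex(point):
    """True if the point coincides with one of the vertices (within tolerance)."""
    return any(all(isclose(p, q, abs_tol=1e-12) for p, q in zip(point, v))
               for v in VERTS)


# Plane I is taken as x = 0 (an assumption about the orientation).
in_plane_I = [v for v in VERTS if v[0] == 0.0]
off_plane_I = [v for v in VERTS if v[0] != 0.0]

# Vertices on one side of plane I whose mirror image is again a vertex.
mirror_pairs = sum(1 for v in off_plane_I
                   if v[0] > 0 and is_vertex((-v[0], v[1], v[2])))

# Every vertex has an opposite vertex, so the 20 vertices form 10 diameters.
diameters = sum(1 for v in VERTS if is_vertex(tuple(-c for c in v))) // 2

print(len(VERTS), len(in_plane_I), len(off_plane_I), mirror_pairs, diameters)
```

In this orientation the four in-plane vertices play the role of A, -A, B and -B, and the remaining sixteen are exchanged in pairs by the reflection through plane I.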
To develop the concept of the MVM further, let us add an arrow, marked with the letter Si, to introduce a changeable physical operator (amino acid side chain), and a fragment of the (i+1)-th link with an arrow showing the direction to the (i+1)-th Cα atom (Fig. 7). In the structure of the MVM four components can be identified (Karasev et al., 2007; Karasev and Luchinin, 2009, a, b): a protein fragment consisting of five amino acids (pentafragment); a dodecahedron containing a group of 20 vectors (radii of the dodecahedron); a canonical set of exchangeable physical operators; and the tetrahedral i-th Cα atom.

The operation principle of the MVM consists in the consecutive action, during protein synthesis, of the side chains of the amino acids being attached to the i-th α-carbon atom on the bond NiH...Oi-4=Ci-4 of the protein pentafragment. Each amino acid, in accordance with its length, is represented by the vertex of the dodecahedron to which the vector corresponding to that amino acid is directed. Connectivity operators realize their effect through hydrogen bonding of their terminal groups with the group Oi-4=Ci-4 (their vectors are directed mainly upwards), whereas the anti-connectivity operators act via collisions of this group with the electron shells of the side chains' terminal groups (their vectors are directed downwards). In this case, the side chains associated with the i-th tetrahedral Cα atom, by changing the direction Cαi → Cαi+1 (see Figure 7), determine the direction of growth of the polypeptide chain.

3.2 Dodecahedron model of the canonical set of 20 amino acids

This model was proposed in earlier work (Karasev et al., 2005). In its present form (Figure 8), it is modified to meet the requirements of the MVM.

Figure 8: Dodecahedron model of the canonical set of amino acids. Subgroups 1 and 2: dark gray circles; subgroup 3: grey circles; subgroup 4: white circles.

The side chains of amino acids are indicated in the circles corresponding to the dodecahedron vertices. The alpha-carbon atom, to which the side chains are attached, is situated at the top, and the side chains are oriented downwards. Side chains are arranged from the top to the bottom in increasing order of their size. According to the MVM model, the dodecahedron structure is divided by three planes, I, II and III. Shorter side chains are situated on the right of plane I, whereas their heavier analogs are on the left. In this model four groups of antisymmetrical chains can be distinguished: 1) chains antisymmetrical about plane I (e.g. Ser : Thr, etc.), 2) chains antisymmetrical about plane II (e.g. Ser : Cys, etc.), 3) chains antisymmetrical about plane III (e.g. Ser : His) and 4) chains antisymmetrical about the center of the dodecahedron (e.g. Ser : Trp).
If this structure is included in the generalized structure of the MVM, leaving only the names of the amino acids, it yields the MVM model of proteins (Karasev and Luchinin, 2009a).

3.3 Properties of the canonical set of amino acids derived from the MVM model

3.3.1 Side chains should have different lengths

Longer side chains should yield vectors directed downwards, towards the group Oi-4=Ci-4, while the vectors corresponding to shorter chains are directed towards NiH. Hence, the side chains of the amino acids are placed on the dodecahedron in order of increasing length (Fig. 8).

3.3.2 Quasi-mirror antisymmetry

As follows from Figure 7, the i-th α-carbon atom, to which the side chains are attached, is located to the right of the dodecahedron (asymmetrically). Side chains yielding vectors that are symmetric with respect to plane I should therefore be shorter on the right and longer on the left. In the dodecahedron model, side chains with similar properties but different lengths are arranged symmetrically with respect to plane I. We call this quasi-mirror antisymmetry. Pairs of amino acids related by this type of antisymmetry yield symmetric vectors connected by the corresponding transformation. They are shown in Table 3, part A.

3.3.3 Non-mirrored antisymmetry

The side chains yielding vectors symmetric with respect to plane II, which are related by the corresponding transformation, as seen in Fig. 7, should have similar chain lengths but different terminal groups. We call this property non-mirrored antisymmetry. Side chains of amino acids linked by this type of antisymmetry, as seen in Figure 8, are located on opposite sides of plane II. They are listed in Table 3, part B.

3.3.4 Rotary antisymmetry

As follows from Figure 7, the side chains yielding vectors symmetric with respect to plane III, which are connected by the corresponding transformation, should have different lengths.
We take into account that the top half of the dodecahedron can be brought into coincidence with the bottom half only by rotation about an axis lying in plane III. With this in mind, we can assume that rotary antisymmetry of the vectors is possible between the side chains of the upper and lower halves of the dodecahedron. At the same time, there are features of complementarity in the properties of the opposing side chains: for example, the short chains Ser and Thr oppose the cyclic His and Trp, and the negatively charged Asp and Glu oppose the positively charged Arg and Lys. Pairs of side chains related by this type of antisymmetry can also be seen in Fig. 8. They are shown in Table 3, part C.

Table 3: Antisymmetry types for the side chains of amino acids

A. Quasi-mirror    B. Non-mirrored    C. Rotary     D. Complementarity
Thr - Ser          Thr - Met          Thr - Trp     Thr - His
Met - Cys          Ser - Cys          Met - Tyr     Ser - Trp
Glu - Asp          Glu - Gln          Glu - Lys     Met - Phe
Gln - Asn          Asp - Asn          Gln - Ile     Cys - Tyr
Lys - Arg          Lys - Ile          Ser - His     Glu - Arg
Ile - Val          Arg - Val          Cys - Phe     Asp - Lys
Trp - His          Trp - Tyr          Asp - Arg     Gln - Val
Tyr - Phe          His - Phe          Asn - Val     Asn - Ile
                                      Pro - Gly     Pro - Gly
                                      Ala - Leu     Ala - Leu

3.3.5 Antisymmetry of complementarity

From Figure 7 it follows that two vectors directed to opposite vertices of the dodecahedron form its diameter. They make the same angle but act in opposite directions. Accordingly, the side chains responsible for the effect of these vectors should be mutually complementary in their properties. Indeed, as follows from Fig. 8 and Table 3, part D, side chains with complementary properties are located at opposite vertices: Ser, with a short side chain, opposes Trp, with a bulky side chain; the more massive Thr opposes the less massive His, etc. By analogy with the symmetry transformations described for vectors (Table 2), the antisymmetry transformations connecting the amino acid side chains are presented in Table 4.
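Before turning to Table 4, the pairings of Table 3 can be checked mechanically. The sketch below is only an illustration: it assumes that part A pairs the amino acids column-wise (Thr with Ser, Met with Cys, and so on, in line with the Ser : Thr example of Section 3.2), and the helper names involution and compose are hypothetical. It confirms that each part defines a one-to-one pairing and that, for the 16 amino acids of part A, applying the quasi-mirror pairing followed by the rotary pairing reproduces the complementarity pairing. The latter is an observation about the table as listed, not a statement made in the text.

```python
def involution(pairs):
    """Turn a list of unordered pairs into a symmetric lookup table."""
    table = {}
    for a, b in pairs:
        table[a] = b
        table[b] = a
    return table


# Table 3, part A (assumed column-wise pairing of the two rows).
QUASI_MIRROR = involution([
    ("Thr", "Ser"), ("Met", "Cys"), ("Glu", "Asp"), ("Gln", "Asn"),
    ("Lys", "Arg"), ("Ile", "Val"), ("Trp", "His"), ("Tyr", "Phe"),
])
# Table 3, part B.
NON_MIRRORED = involution([
    ("Thr", "Met"), ("Ser", "Cys"), ("Glu", "Gln"), ("Asp", "Asn"),
    ("Lys", "Ile"), ("Arg", "Val"), ("Trp", "Tyr"), ("His", "Phe"),
])
# Table 3, part C.
ROTARY = involution([
    ("Thr", "Trp"), ("Met", "Tyr"), ("Glu", "Lys"), ("Gln", "Ile"),
    ("Ser", "His"), ("Cys", "Phe"), ("Asp", "Arg"), ("Asn", "Val"),
    ("Pro", "Gly"), ("Ala", "Leu"),
])
# Table 3, part D.
COMPLEMENTARITY = involution([
    ("Thr", "His"), ("Ser", "Trp"), ("Met", "Phe"), ("Cys", "Tyr"),
    ("Glu", "Arg"), ("Asp", "Lys"), ("Gln", "Val"), ("Asn", "Ile"),
    ("Pro", "Gly"), ("Ala", "Leu"),
])


def compose(f, g):
    """Return the map x -> f(g(x)) wherever both lookups are defined."""
    return {x: f[g[x]] for x in g if g[x] in f}


# Apply the quasi-mirror pairing first, then the rotary pairing.
ROTARY_AFTER_MIRROR = compose(ROTARY, QUASI_MIRROR)
MATCHES = all(ROTARY_AFTER_MIRROR[x] == COMPLEMENTARITY[x]
              for x in QUASI_MIRROR)
print(len(QUASI_MIRROR), len(ROTARY), MATCHES)
```

Parts C and D cover all 20 amino acids, while parts A and B cover the 16 that remain once Pro, Gly, Ala and Leu are set aside.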
Table 4: Antisymmetry transformation connecting the side chains of amino acids

Subgroups 1 2 3 4
Gly Ala Ser Asp Thr Glu Cys Asn Pro Leu His Arg Met Gln Trp Lys Phe Val Tyr Ile

4. CONCLUSION

The main objective of the present review was a consistent presentation of the theory of topological coding of proteins, illustrated by a model of the spatial structure of the canonical set of amino acids on a dodecahedron, developed on the basis of this theory. Unlike classifications of amino acid side chains based on their physical and chemical properties (Campbell and Smith, 1994), on principles of complementarity (Mekler and Idlis, 1993) or on the genetic code (Biro et al., 2003; Balakrishnan, 2002), our model emerged in the course of developing the earlier proposed theory of topological coding of proteins (Karasev et al., 2000; Karasev and Stefanov, 2001; Karasev and Luchinin, 2009a, b, c). One of the concluding results derived from the theory is the molecular vector machine (MVM) of proteins presented in this paper. Within the MVM model, 20 vectors are introduced in the structure of the dodecahedron; they affect the formation of the hydrogen bond NiH...Oi-4=Ci-4 of the protein pentafragment. The side chains of the amino acids realize this action as physical operators. The group of vectors manifests symmetry, whereas the side chains of the amino acids manifest antisymmetry. The structure model for the canonical set of amino acids on the dodecahedron uses principles of antisymmetry and is a part of the MVM. Thus, a rational explanation of the antisymmetric nature of the amino acid side chains is provided.

Current data on the structure of ribosomes and on protein biosynthesis suggest that the secondary structure of proteins is formed within the ribosome cotranslationally, i.e. at the time of their biosynthesis (Kramer et al., 2009). The MVM model, in which the side chains of amino acids act as physical operators, is consistent with these data.
Further development of the structure model for the canonical set of amino acids is connected with the two-level scheme (Karasev and Luchinin, 2009a), which explains the nature of the degeneracy of the genetic code triplets, and with the group-theoretical approach (Karasev and Luchinin, 2009a, b), which considers amino acid side chains as irreducible representations of the group composed of the vectors. More detailed information, as well as applied aspects of the approach, can be found on the websites http://genetic-code.narod.ru, http://amino-acids-20.narod.ru and http://vector-machine.narod.ru.

Acknowledgment: We are grateful to V.V. Luchinin and V.E. Stefanov for useful discussion of the paper.

REFERENCES

Balakrishnan, J. (2002) Symmetry scheme for amino acid codons, Phys. Rev. E, Stat. Nonlin. Soft Matter Phys., 65 (2 Pt 1), 021912.
Biro, J.C., Benyo, B., Sansom, C., Szlavecz, A., Fordos, G., Micsik, T. and Benyo, Z. (2003) A common periodic table of codons and amino acids, Biochem. Biophys. Res. Commun., 306, 408-415.
Campbell, P.N. and Smith, A.D. (1994) Biochemistry Illustrated, Edinburgh: Churchill Livingstone, pp. 8-9.
Crick, F.H.C. (1968) The origin of the genetic code, J. Mol. Biol., 38, 367-379.
Esteve, J.G. and Falceto, F. (2005) Classification of amino acids induced by their associated matrices, Biophys. Chem., 115, 177-180.
Higgs, P. and Pudritz, R.E. (2009) A thermodynamic basis for prebiotic amino acid synthesis and the nature of the first genetic code, Astrobiology, 9, 483-490.
Jenni, S. and Ban, N. (2003) The chemistry of protein synthesis and voyage through the ribosomal tunnel, Curr. Opin. Struct. Biol., 13, 212-219.
Jimenez-Montaño, M.A., de la Mora-Basañez, C.R. and Poschel, Th. (1996) The hypercube structure of the genetic code explains conservative and non-conservative amino acid substitutions in vivo and in vitro, Bio Systems, 39, 117-125.
Karasev, V.A., Luchinin, V.V. and Stefanov, V.E. (1994) A model of molecular electronics based on the concept of conjugated ionic-hydrogen bond systems, Adv. Mater. Opt. Electron., 4, 203-218.
Karasev, V.A. and Sorokin, S.G. (1997) Topological structure of the genetic code, Russ. J. Genetics, 33, 622-628.
Karasev, V.A., Demchenko, E.L. and Stefanov, V.E. (2000) Topological coding of polymers and protein structure prediction, In: Bonchev, D. and Rouvray, D., eds., Chemical Topology: Applications and Techniques, Ser. Math. Chem., Vol. 6, New York: Gordon & Breach, 295-345.
Karasev, V.A. and Stefanov, V.E. (2001) Topological nature of the genetic code, J. Theor. Biol., 209, 303-317.
Karasev, V.A., Luchinin, V.V. and Stefanov, V.E. (2005) Series in Mathematical Biology and Medicine, Vol. 8, Proceedings of the International Conference "Advances in Bioinformatics and Its Applications", New Jersey: World Scientific Publishing Co., pp. 482-493.
Karasev, V.A., Luchinin, V.V. and Stefanov, V.E. (2007) A model of the "molecular vector machine" for protein folding, Proceedings of the 3rd Moscow Conference on Computational Molecular Biology, Moscow, Russia, July 27-31, 2007, pp. 134-136.
Karasev, V.A. and Luchinin, V.V. (2009a) Vvedenie v konstruirovanie boinicheskich nanosistem [Introduction to the design of bionical nanosystems, in Russian], Moscow: Fizmatlit, 464 pp.
Karasev, V.A. and Luchinin, V.V. (2009b) Model topologocheskogo codirovania tsepnikh polimerov dlia bonicheskoi nanoelectroniki. I. Topologicheslii cod i sootvetstviia fisicheskikh operatorov tripletam coda [Model of topological coding of chain polymers for bionical nanoelectronics. I. A topological code and assignment of the physical operators to triplets of a code, in Russian], Biotekhnosfera, No. 1, pp. 2-10.
Karasev, V.A. and Luchinin, V.V. (2009c) Model topologocheskogo codirovania tsepnikh polimerov dlia bonicheskoi nanoelectroniki. II. Molekuliarnaia vektornaia mashina i struktura kanonicheskogo nabora phisicheskikh operatorov [Model of topological coding of chain polymers for bionical nanoelectronics. II. The molecular vector machine and structure of the canonical set of physical operators, in Russian], Biotekhnosfera, No. 2, pp. 6-12.
Knight, R.D., Freeland, S.J. and Landweber, L.F. (1999) Selection, history and chemistry: three faces of the genetic code, Trends Biochem. Sci., 24, 241-247.
Kosiol, C., Goldman, N. and Buttimore, N.H. (2004) A new criterion and method for amino acid classification, J. Theor. Biol., 228, 97-106.
Kramer, G., Boehringer, D., Ban, N. and Bukau, B. (2009) The ribosome as a platform for co-translational processing, folding and targeting of newly synthesized proteins, Nat. Struct. Mol. Biol., 16, 589-597.
Mekler, L.B. and Idlis, R.G. (1993) Obschii stereokhimicheskii geneticheskii cod – put k biotechnologii i unuversalnoi medizine XXI veka uzhe segodnia [General stereochemical genetic code – towards biotechnology and universal medicine of the XXI century, in Russian], Priroda, No. 5, 29-63.
Pauling, L. (1960) The Nature of the Chemical Bond, 3rd ed., Ithaca: Cornell Univ. Press, 644 pp.
Rumer, Yu.B. (1968) Sistematizacija kodonov v geneticheskom code [Systematization of codons in the genetic code, in Russian], Dokl. Acad. Nauk SSSR, 183, 225-226.
Schulz, G.E. and Schirmer, R.H. (1979) Principles of Protein Structure, New York: Springer-Verlag, 354 pp.
Siemion, I.Z., Cebrat, M. and Kluczyk, A. (2004) The problem of amino acid complementarity and antisense peptides, Curr. Protein Pept. Sci., 5, 507-527.
Taylor, W.R. (1986) The classification of amino acid conservation, J. Theor. Biol., 119, 205-218.

AIMS AND SCOPE

SYMMETRY: CULTURE AND SCIENCE provides an interdisciplinary forum for representatives of the various fields of art, science, and technology.
According to its established tradition, it publishes papers by scientists addressed to their colleagues active in other disciplines, or even in different fields of the arts; and also papers by artists addressed to the representatives of the sciences and diverse fields of technology. Symmetry appears in articles of the various disciplinary and art periodicals, however those tend not to reach scholars in other fields of study. The journal SYMMETRY aims at conveying to them knowledge, methods, and novelties which are applicable to their main fields of interest and creative work. Its basic goal is building bridges between various fields of the arts and sciences, between various disciplines, and between different cultures. Symmetry is suitable for such a bridging function. It is a concept, a phenomenon, a class of properties, and a method. It is present in almost all disciplines and fields of art and technology. As a concept, it has roots in both science and art. As a phenomenon, symmetry or its lack is present in all fields of art, science, and technology. Finally, properties and methods, based on the application and the investigation of symmetry (and symmetry breaking) are transferred from one field to another. Symmetry is understood here in a broad sense, and approach to its study will be referred to as symmetrology. In contrast to the common geometric concept, one can speak about a more general scientific meaning of symmetry if: (i) under any kind of transformation (operation), (ii) at least one property, (iii) of an object is left invariant (intact). This generalised concept of symmetry makes possible the application of symmetry to both animate and inanimate material objects, as well as to products of our mind. 
In addition to geometric (morphological) symmetries (such as reflection, rotation, translation, etc.), the scope of the journal covers functional symmetries and asymmetries (e.g., in the human brain), gauge symmetries (of physical phenomena), and properties, like color, tone, shading, weight, and so on (of artistic objects). The journal focuses not only on the concept of symmetry, but also on its associates (asymmetry, dissymmetry, and antisymmetry) and related concepts (such as proportion, harmony, rhythm, and invariance) in an interdisciplinary and intercultural context. SYMMETRY publishes original papers on symmetry and related questions which present new results, or new connections between known results. The papers are addressed to a broad non-specialist public, without becoming too general, and have an interdisciplinary character in any of the following senses: (1) they describe concrete interdisciplinary ‘bridges’ between different fields of art, science, and technology using the concept or related to the phenomenon of symmetry; (2) they survey the importance of the application of symmetry (antisymmetry, etc.) in a concrete field with an emphasis on possible ‘bridges’ to other fields. The journal also has a special interest in historic and educational questions, as well as in symmetry-related methods and processes.