Qiang Li
Estimation of evolutionary distance based on K-string composition
Qiang Li
Department of Combinatorics and Geometry
CAS-MPG Partner Institute for Computational Biology

Difficulties in phylogenomics
• Species trees versus gene trees
• Too few common genes: “The tree of one percent”
• Model selection: a science, or an art?
• Computational complexity
• …
[Figure: a tree based on 31 genes (Ciccarelli ’06)]

Phylogenetics above sequence-level
• Rare genomic change
• Gene content & gene order
• K-string composition
“In summary, the methods for inferring trees based on whole-genome features are at an early stage of their development, which might be comparable to that of sequence-based methods in the early 1970s. In particular, they generally lack a global probabilistic modeling.” (Philippe 2005)

CVTree: Compositional vector
• “Random background”: (K − 2)-order Markov chain
\[ f^0(\alpha_1 \cdots \alpha_K) = \frac{f(\alpha_1 \cdots \alpha_{K-1})\, f(\alpha_2 \cdots \alpha_K)}{f(\alpha_2 \cdots \alpha_{K-1})} \]
• “Subtraction”, for i ∈ Σ^K
\[ a_i = \begin{cases} \dfrac{f(i) - f^0(i)}{f^0(i)}, & f^0(i) \neq 0 \\ 0, & f^0(i) = 0 \end{cases} \]
• Distance (dissimilarity)
\[ d(a, b) = \frac{1 - \cos(a, b)}{2} \in [0, 1] \]

f versus f^0
[Figure: f versus f^0; courtesy of Alexander Grossmann]

CVTree
• simple
• efficient
• parameter free
• truly whole genomic
• …
• mysterious
(Qi et al ’04)

CVTree of chloroplasts
CVTree of coronavirus
CVTree of dsDNA viruses
CVTree of fungi
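The composition-vector construction from the “CVTree: Compositional vector” slide can be sketched in a few lines of Python. This is a minimal illustration, not the CVTree implementation: the function names (`kstring_freqs`, `composition_vector`, `cv_distance`) are hypothetical, vectors are stored sparsely as dictionaries, and K-strings with f^0(i) = 0 are simply left out because their component is 0 by definition.

```python
from collections import Counter
from math import sqrt


def kstring_freqs(seq, k):
    """Relative frequencies of the overlapping k-strings of seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: c / total for word, c in counts.items()}


def composition_vector(seq, k):
    """Sparse vector a_i = (f(i) - f0(i)) / f0(i) over K-strings with f0(i) != 0."""
    if k < 2:
        raise ValueError("k must be at least 2")
    f = kstring_freqs(seq, k)
    f1 = kstring_freqs(seq, k - 1)
    f2 = kstring_freqs(seq, k - 2)  # for k == 2 this is {'': 1.0}
    # Group the (k-1)-strings by their first k-2 letters so that every
    # K-string with a non-zero Markov prediction can be enumerated.
    by_prefix = {}
    for s in f1:
        by_prefix.setdefault(s[:-1], []).append(s)
    vec = {}
    for head in f1:
        mid = head[1:]
        for tail in by_prefix.get(mid, []):
            word = head + tail[-1]
            f0 = f1[head] * f1[tail] / f2[mid]  # (K-2)-order Markov prediction
            vec[word] = (f.get(word, 0.0) - f0) / f0
    return vec


def cv_distance(a, b):
    """d(a, b) = (1 - cos(a, b)) / 2 for two sparse composition vectors."""
    dot = sum(v * b.get(word, 0.0) for word, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return (1.0 - dot / (norm_a * norm_b)) / 2.0
```

A K-string that never occurs but has a non-zero prediction gets the component −1, which is why the sketch enumerates predicted words rather than only observed ones.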
[Figure: tree with leaf labels omitted; scale bar 0.05]

The tree plotted to scale
“Concentrating on the topology of the trees in the first place, we did not scale the branch lengths on the tree. However, these lengths should reflect evolution rates in terms of K-string composition changes. The calibration of branch lengths is further complicated by the overlapping nature of the K-strings when K ≥ 2.”
• “However, the methods proposed so far are extremely crude. Oligonucleotide or oligopeptide frequencies are transformed into distances without any underlying model of evolution. It is nevertheless remarkable that something considered as a bias in standard sequence-based methods (Lockhart et al. 1994) contains a phylogenetic signal, but it is not yet clear whether accurate methods can be developed to extract it.” (Philippe et al. 2005)
• “Why, for example, does the K-string complement of a proteome yield a tree that is similar to sequence-based trees?” (Snel et al. 2005)

A heuristic probabilistic model
• Transition of frequency distribution: assume that the evolution process is stationary and that the average mutation rate per site is 1 − α. Let L = |S| − K + 1; then the conditional distribution is
\[ f\bigl(S^{(t+1)}(i) \mid S^{(t)}(i) = n\bigr) = \mathrm{binomial}\bigl(n, \alpha^K\bigr) * \mathrm{binomial}\Bigl(L, \frac{(1 - \alpha^K)\, E\,S(i)}{L}\Bigr) \]
• Time evolution of expectation
\[ E\bigl(S^{(t')}(i) \mid S^{(t)}(i) = n\bigr) = E\,S(i) + \alpha^{K(t' - t)} \bigl(n - E\,S(i)\bigr) \]

Calibration of CVTree
• Generalized CV
\[ w(S, i) = \bigl(S(i) - S^0(i)\bigr)\, c(i) \]
where S^0(i) is an unbiased estimator of S(i), and c(i) is an arbitrary “normalization” factor.
• Angle between vectors vs. sequence identity
\[ E \cos\bigl(w(S), w(S')\bigr) = q^K \]
• Calibration
\[ \hat{p}(S, S') = 1 - \cos^{1/K}\bigl(w(S), w(S')\bigr) \]

[Figure: The calibrated tree (leaf labels omitted; scale bar 0.05)]

Using pre-/absence of words
• Any sequence S as above can also be represented by just the set X(S) of its K-strings.
In a “gedanken alignment”, ignoring repeats and collisions, we may put
\[ Q^{(K)}(S, S') := \frac{1}{2} \left( \frac{|X(S) \cap X(S')|}{|X(S)|} + \frac{|X(S) \cap X(S')|}{|X(S')|} \right) \]
and
\[ q^{(K)}(S, S') := Q^{(K)}(S, S')^{1/K} \]

[Figures: trees for K = 9 and K = 10 (leaf labels omitted; scale bar 0.05)]
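As a rough illustration of the presence/absence idea, the sketch below computes Q^(K) and q^(K) for two sequences. It assumes that Q^(K) is the average fraction of K-strings that the two sets X(S) and X(S') share, which is one reading of the formula on this slide; the function names are hypothetical.

```python
def kstring_set(seq, k):
    """X(S): the set of distinct overlapping k-strings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(s, t, k):
    """Q^(K)(S, S'): average fraction of shared K-strings (assumed reading)."""
    xs, xt = kstring_set(s, k), kstring_set(t, k)
    if not xs or not xt:
        raise ValueError("both sequences must be at least k letters long")
    common = len(xs & xt)
    return 0.5 * (common / len(xs) + common / len(xt))


def identity_estimate(s, t, k):
    """q^(K)(S, S') = Q^(K)(S, S') ** (1 / K)."""
    return shared_fraction(s, t, k) ** (1.0 / k)
```

For example, with K = 2 the sequences "ACGTTGCA" and "ACGTAGCA" each have seven distinct 2-strings and share five of them, so Q^(2) = 5/7.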
[Figures: trees for K = 11 and K = 12 (leaf labels omitted; scale bar 0.05)]
[Figures: trees for K = 13 and K = 14 (leaf labels omitted; scale bar 0.05)]

Using pre-/absence of words
• Any sequence S as above can also be represented by just the set X(S) of its K-strings. In a “gedanken alignment”, ignoring repeats and collisions, we may put
\[ Q^{(K)}(S, S') := \frac{1}{2} \left( \frac{|X(S) \cap X(S')|}{|X(S)|} + \frac{|X(S) \cap X(S')|}{|X(S')|} \right) \]
and
\[ q^{(K)}(S, S') := Q^{(K)}(S, S')^{1/K} \]
or
\[ q^{(K)}(S, S') := \frac{Q^{(K+1)}(S, S')}{Q^{(K)}(S, S')} \]

Making up alignments
• Pattern: 1: key, must match; 0: space, to watch
• “Align” pairwise: for keys that occur in both sequences, concatenate the letters in the spaces.
• Sequence-based distance
\[ \hat{p} = n_d / n \]
Example with pattern 111110:
X: VKAAWG, HAGEYG, … → GG…
Y: VKAAWS, HAGEYG, … → SG…

[Figure: 011111110 tree (leaf labels omitted; scale bar 0.1)]

The missing words approach
• Probabilities of double- vs. single-missing
\[ p_0 = u^{2 - q^K} \;\Rightarrow\; \hat{q} = \left( 2 - \frac{\ln p_0}{\ln u} \right)^{1/K} \]
• Estimation of probabilities
\[ \hat{p}_0 = \frac{n_0}{|\Sigma|^K}, \qquad \hat{u} = \sqrt{ \frac{n_0 + n_A}{|\Sigma|^K} \cdot \frac{n_0 + n_B}{|\Sigma|^K} } \]

[Figure: MW9 tree (leaf labels omitted; scale bar 0.05)]
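The missing-words estimate can be sketched as follows. Several things here are assumptions rather than statements from the slides: n_0 is taken to be the number of K-strings missing from both sequences, n_A and n_B the numbers missing from only one of them, û the geometric mean of the two single-missing fractions, and the result is clamped to [0, 1] because the raw estimate of q^K can fall outside that range for very dissimilar sequences. The function name is hypothetical.

```python
from math import log, sqrt


def missing_word_identity(s, t, k, alphabet):
    """Estimate q from the words missing in both vs. in one sequence.

    Assumes both sequences use only letters from `alphabet`.
    """
    total = len(alphabet) ** k  # |Sigma|^K possible words
    xs = {s[i:i + k] for i in range(len(s) - k + 1)}
    xt = {t[i:i + k] for i in range(len(t) - k + 1)}
    n0 = total - len(xs | xt)   # missing from both sequences
    n_a = len(xt - xs)          # missing from s only
    n_b = len(xs - xt)          # missing from t only
    p0 = n0 / total
    u = sqrt(((n0 + n_a) / total) * ((n0 + n_b) / total))
    if not (0.0 < p0 and 0.0 < u < 1.0):
        raise ValueError("estimate undefined: need 0 < p0 and 0 < u < 1")
    q_k = 2.0 - log(p0) / log(u)      # estimate of q^K
    q_k = min(1.0, max(0.0, q_k))     # clamp (a choice made for this sketch)
    return q_k ** (1.0 / k)
```

The estimate only makes sense when K is large enough that a substantial share of the |Σ|^K possible words is missing from each sequence.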
A tree based on the “Jensen-Shannon” distance JSh(S, S'), the “symmetrized” Kullback-Leibler distance between the “induced K-string probabilities”
[Figure: tree based on JSh(S, S') (leaf labels omitted)]

[Figure: The tree drawn to scale (leaf labels omitted; scale bar 0.1)]
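A minimal sketch of a Jensen-Shannon distance between K-string distributions, together with the calibration 1 − (1 − JSh)^{1/K} used for the calibrated tree. It assumes that the “induced K-string probabilities” are the relative K-string frequencies and that JSh is the standard Jensen-Shannon divergence with base-2 logarithms, so that it lies in [0, 1]; the slides do not spell out either detail, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def kstring_probs(seq, k):
    """Relative frequencies of the overlapping k-strings of seq."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: c / total for word, c in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    js = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        m = 0.5 * (pw + qw)  # mixture probability, positive for every word here
        if pw > 0.0:
            js += 0.5 * pw * log2(pw / m)
        if qw > 0.0:
            js += 0.5 * qw * log2(qw / m)
    return js


def calibrated_distance(jsh, k):
    """1 - (1 - JSh)^(1/K); the max() guards against rounding above 1."""
    return 1.0 - max(0.0, 1.0 - jsh) ** (1.0 / k)
```

Two sequences with no K-string in common give JSh = 1 and a calibrated distance of 1; identical K-string distributions give 0 for both.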
\[ 1 - \bigl(1 - \mathrm{JSh}(S, S')\bigr)^{1/K} \]
[Figure: The tree using the calibrated distance (leaf labels omitted)]

Comparison with “standard” taxonomy
[Figure: number of differences (# diff) versus word length (8 to 14), with two series: JSh and cali.]

Outlook
• More realistic evolutionary dynamics
• Taking account of the variance of the frequency distribution
  – Better construction of composition vector
  – Optimal value of K

Acknowledgement
• Bailin Hao
• Andreas Dress
• Zhao Xu