SDT: a virus species classification tool
SDT (species demarcation tool): a virus classification tool based on genome-wide pairwise identity calculation
CHPC National Meeting 2012
Brejnev Muhire
University of Cape Town, Computational Biology Group
05/12/2012

Overview
• Background
• Virus classification protocol
• Limitations of current protocol
• SDT (species demarcation tool)
• SDTMPI (using mpi4py)
• Classification of mastrevirus species
• Proposed future work

Background
Virus
Nucleic acid (ssDNA, dsDNA, ssRNA, dsRNA)

Background
Hierarchy of recognised viral taxa
Order 6 (Herpesvirales)
Family 94 (Herpesviridae)
Subfamily 22 (Alphaherpesvirinae)
Genus 395 (Simplexvirus)
Species 2480 (Human herpesvirus)
Strain
ICTV (International Committee on Taxonomy of Viruses)

Criteria for virus classification
• Vector
• Host
• Pathogenicity
• Genome size
• Genome organisation
• Sequence comparison
Mastrevirus (Geminiviridae)

ICTV current guideline for mastrevirus classification
Multiple sequence alignment
Pairwise identity calculation (1 - n/m)
Use a rooted phylogenetic tree to sort scores
Use a 74 % species demarcation threshold and a 95 % strain demarcation threshold

Limitation of current protocol
Three problems
1. Based on multiple sequence alignment
2. How to treat the gaps
3. Lack of a computer program
Consequence
• ICTV receives proposals of many questionable new mastrevirus species

Comparison of current to proposed protocol
Current: multiple sequence alignment
1. Quick process
2. Scores change with alignment size and due to gaps
3. Not always objective
Proposed: based on pairwise sequence alignment & excludes gaps
1. Long process
2. Always provides accurate scores
3. Objective and highly reproducible

SDT species demarcation tool
FASTA file, N sequences
Pairwise alignments, ((N x N) - N)/2: MUSCLE, ClustalW2, MAFFT, Needleman-Wunsch algorithm
Compute pairwise identity score (1 - n/m)
Rearrange scores using phylogenetic relationships of the sequences: Neighbor (PHYLIP 3.63)
2D pairwise identity matrix, and a distribution plot of identity scores

2D graphical representation
Identity scores

2D graphical representation
Clustered identity scores

SDT full colour mode
http://web.cbio.uct.ac.za/SDT

SDT three colour mode

Pairwise Identity Distribution Plot
Candidate demarcation cut-offs

SDT limitations
[Figure: number of pairs plotted against number of sequences; labelled points: Mastrevirus 903, Begomovirus 2116]
2116 sequences: 2,350,000 alignments and score calculations, 350 h

SDTMPI
Features:
Modules:
• os
• sys
• subprocess
• Biopython
• mpi4py
Methods:
• Sequence alignment
• Pairwise identity scores calculation

Work flow
N sequences
Create a sequence list object
Lists of indices used to easily access sequences in the sequence-array object
[0, 0, 0, 0, 0, …, 0, 1, 1, 1, …, 1, 2, 2, 2, …, 2, 3, 3, 3, …, 3, …, N-1]
[1, 2, 3, 4, 5, …, N, 2, 3, 4, …, N, 3, 4, 5, …, N, 4, 5, 6, …, N, …, N]
Work load distribution: cores 1, 2, 3, ..., S
Align each pair
Compute the score
Write all the scores into one text file
Output

Speed up
10 sequences: 45 pairs
2116 sequences: 2 344 695 pairs
SDT, serial: 349 h (~14 days)
SDTMPI, 12 cores: 15.5 h
SDTMPI, 40 cores: 8.1 h

Application to mastrevirus
“A genome-wide pairwise-identity-based proposal for the classification of viruses in the genus Mastrevirus (Geminiviridae)” for all 939 mastrevirus sequences that were publicly available in May 2012
New cut-offs: 78 % (species) and 94 % (strain)
Old cut-offs: 74 % for species and 95 % for strain

Mastrevirus species and strains
Chickpea chlorotic dwarf virus
Chickpea chlorosis virus
Wheat dwarf virus
Phylogenetic support

Future work
1. Full implementation of a web-based version of the tool.
2. Extend the method for analysis of protein sequences.

Acknowledgements
• Dr Darren Martin, University of Cape Town
• Dr Arvind Varsani, University of Canterbury
• ICTV Geminiviridae Study Group
• HPC Winter School 2012 (CHPC & UFS)

Thank you
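
Supplementary sketch (not part of the original slides). The SDTMPI work flow above (two index lists, one score per aligned pair, work shared across cores) can be illustrated with a minimal Python sketch. It is not the SDTMPI source code. It assumes that in the score 1 - n/m, n counts mismatching positions and m counts aligned positions where neither sequence has a gap, and that pairs are dealt to cores round-robin. The function names pair_indices, pairwise_identity and share_for_core are hypothetical, the index lists are 0-based, and the alignment step itself (MUSCLE, ClustalW2 or MAFFT in SDT) is left out: the sequences passed in are assumed to be already aligned.

```python
from itertools import combinations


def pair_indices(n):
    """Return two 0-based index lists covering all (n*n - n)/2 unordered pairs."""
    pairs = list(combinations(range(n), 2))
    first = [i for i, _ in pairs]
    second = [j for _, j in pairs]
    return first, second


def pairwise_identity(aligned_a, aligned_b, gap="-"):
    """Score one aligned pair as 1 - n/m.

    Assumption: n = mismatching positions, m = positions where neither
    sequence carries a gap character (gapped columns are skipped).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # exclude gapped columns from the score
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 1.0 - mismatches / compared


def share_for_core(items, core, n_cores):
    """Round-robin share of the pair list for one core (assumed scheme)."""
    return items[core::n_cores]


if __name__ == "__main__":
    first, second = pair_indices(10)
    print(len(first), "pairs for 10 sequences")
    print("first pairs:", list(zip(first, second))[:3])
    # One gapped column is skipped; 1 mismatch over 5 compared positions.
    print("identity:", pairwise_identity("ACGT-A", "ACCTTA"))
    pairs = list(zip(first, second))
    for core in range(3):
        print("core", core, "gets", len(share_for_core(pairs, core, 3)), "pairs")
```

In an mpi4py program the core number and the number of cores would typically come from the communicator's rank and size, and the per-core scores would be gathered before being written to a single text file, as in the work flow.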