SDT a virus species classification tool

SDT (species demarcation tool): a virus classification tool based on genome-wide pairwise identity calculation
CHPC National Meeting 2012
Brejnev Muhire
University of Cape Town
Computational Biology Group
05/12/2012
Overview
• Background
• Virus classification protocol
• Limitations of current protocol
• SDT (species demarcation tool)
• SDTMPI (using mpi4py)
• Classification of mastrevirus species
• Proposed future work
Background
Virus
Nucleic acid (ssDNA, dsDNA, ssRNA, dsRNA)
Hierarchy of recognised viral taxa
Order: 6 (Herpesvirales)
Family: 94 (Herpesviridae)
Subfamily: 22 (Alphaherpesvirinae)
Genus: 395 (Simplexvirus)
Species: 2480 (Human herpesvirus)
Strain
ICTV (International Committee on Taxonomy of Viruses)
Criteria for virus classification
• Vector
• Host
• Pathogenicity
• Genome size
• Genome organisation
• Sequence comparison
Mastrevirus (Geminiviridae)
ICTV current guideline for mastrevirus classification
1. Multiple sequence alignment
2. Pairwise identity calculation (1 - n/m)
3. Use a rooted phylogenetic tree to sort scores
4. Use a 74% species demarcation threshold and a 95% strain demarcation threshold
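The two thresholds can be read as a simple decision rule on a pairwise identity score. The sketch below is illustrative only: the function name `classify_pair` and its parameters are hypothetical, and it assumes that a score at or above a cut-off falls on the "same" side, which the slides do not specify.

```python
def classify_pair(identity, species_cutoff=0.74, strain_cutoff=0.95):
    """Classify two genomes from their pairwise identity (a fraction, 0..1).

    Assumption: a score at or above a cut-off counts as 'same'.
    Defaults are the 74% / 95% thresholds quoted on the slide.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    if identity >= strain_cutoff:
        return "same species, same strain"
    if identity >= species_cutoff:
        return "same species, different strains"
    return "different species"


if __name__ == "__main__":
    for score in (0.97, 0.80, 0.70):
        print(score, "->", classify_pair(score))
```

Passing other values for `species_cutoff` and `strain_cutoff` lets the same rule be applied with different demarcation thresholds.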
Limitations of current protocol
Three problems
1. Based on multiple sequence alignment
2. Unclear how to treat the gaps
3. Lack of a computer program
Consequence
• ICTV receives many proposals for questionable new mastrevirus species
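To illustrate problem 2, the sketch below scores one already-aligned pair in two ways: counting gap columns as mismatches, and skipping them. It assumes that n in 1 - n/m is the number of mismatching columns and m the number of columns compared, which the slides do not spell out. The sequences are made up.

```python
def identity(seq_a, seq_b, count_gaps=False, gap="-"):
    """Return 1 - n/m for two aligned sequences of equal length.

    Assumed definitions: n = mismatching columns, m = columns compared.
    count_gaps=True  -> a column with a gap in one sequence is a mismatch.
    count_gaps=False -> columns with a gap in either sequence are skipped.
    Columns where both sequences have a gap are always skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = m = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue
        if (a == gap or b == gap) and not count_gaps:
            continue
        m += 1
        if a != b:
            n += 1
    if m == 0:
        raise ValueError("no comparable columns")
    return 1 - n / m


aligned_a = "ATGC--ATGCAT"
aligned_b = "ATGCTTATGAAT"
print(round(identity(aligned_a, aligned_b, count_gaps=True), 3))   # 0.75
print(round(identity(aligned_a, aligned_b, count_gaps=False), 3))  # 0.9
```

The same pair scores 75% or 90% depending only on the gap convention, which is enough to move it across a demarcation threshold.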
Comparison of current to proposed protocol
Current: multiple sequence alignment
1. Quick process
2. Scores change with alignment size and due to gaps
3. Not always objective
Proposed: based on pairwise sequence alignment & excludes gaps
1. Long process
2. Always provides accurate scores
3. Objective and highly reproducible
SDT species demarcation tool
1. Input: FASTA file with N sequences
2. Pairwise alignments: ((N x N) - N)/2 pairs (MUSCLE, ClustalW2, MAFFT; Needleman-Wunsch algorithm)
3. Compute pairwise identity score (1 - n/m)
4. Rearrange scores using the phylogenetic relationships of the sequences (Neighbor, PHYLIP 3.63)
5. Output: 2D pairwise identity matrix and a distribution plot of identity scores
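Steps 2 and 3 of the pipeline can be sketched in plain Python. This is a teaching sketch, not SDT's code: a small Needleman-Wunsch implementation stands in for the external aligners, and the scoring values (match 1, mismatch -1, linear gap -2) are arbitrary choices. The sequences are made up, and n and m are assumed to be the mismatching and compared columns after excluding gaps.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns two aligned strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_identity(aligned_a, aligned_b):
    """1 - n/m over columns where neither sequence has a gap (assumed definition)."""
    n = m = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        m += 1
        if x != y:
            n += 1
    if m == 0:
        raise ValueError("no comparable columns")
    return 1 - n / m


# Made-up sequences; every unique pair is aligned independently of the others.
sequences = {"seq1": "ATGCATGCAT", "seq2": "ATGCTTATGAAT", "seq3": "ATGCATGGAT"}
for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
    al_a, al_b = needleman_wunsch(seq_a, seq_b)
    print(name_a, name_b, round(pairwise_identity(al_a, al_b), 3))
```

Because each pair is aligned on its own, adding or removing sequences from the dataset does not change any existing score.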
2D graphical representation: identity scores
2D graphical representation: clustered identity scores
SDT full colour mode
SDT three colour mode
Pairwise identity distribution plot: candidate demarcation cut-offs
http://web.cbio.uct.ac.za/SDT
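A distribution plot of this kind is essentially a histogram of all pairwise identity scores, in which sparsely populated intervals suggest candidate cut-offs. The sketch below is a rough illustration with made-up scores and integer-percent bins. Flagging completely empty bins is a simplification chosen here for illustration, not SDT's procedure, and the function names are hypothetical.

```python
from collections import Counter


def identity_histogram(scores_percent, bin_width=1):
    """Count identity scores (in percent) per bin, keyed by the bin's lower edge.

    bin_width is assumed to be a positive integer number of percent.
    """
    return Counter(int(score // bin_width) * bin_width for score in scores_percent)


def empty_bins_between(hist, bin_width=1):
    """Return the lower edges of empty bins lying between occupied bins."""
    occupied = sorted(hist)
    if not occupied:
        return []
    return [edge for edge in range(occupied[0], occupied[-1], bin_width)
            if hist.get(edge, 0) == 0]


# Made-up scores, for illustration only.
scores = [62.1, 63.4, 64.8, 79.5, 80.3, 81.0, 95.2, 96.8, 99.1]
hist = identity_histogram(scores)
print(sorted(hist.items()))
print(empty_bins_between(hist))
```

With real data the bins between clusters are rarely empty, so in practice one would look for troughs rather than zeros.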
SDT limitations
[Plot: number of pairs against number of sequences, with Mastrevirus (903) and Begomovirus (2116) marked]
2116 sequences: 2,350,000 alignments and score calculations, 350 h
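The limitation follows from the pair count growing quadratically with the number of sequences. A minimal sketch of the ((N x N) - N)/2 formula, where the function name `n_pairs` is hypothetical:

```python
def n_pairs(n_sequences):
    """Number of unique unordered pairs among N sequences: ((N x N) - N) / 2."""
    if n_sequences < 0:
        raise ValueError("number of sequences cannot be negative")
    return (n_sequences * n_sequences - n_sequences) // 2


for n in (10, 903, 2116):
    print(n, n_pairs(n))
# 10 45
# 903 407253
# 2116 2237670
```

Doubling the number of sequences roughly quadruples the work. Note that the formula gives 2,237,670 pairs for 2116 sequences, somewhat below the roughly 2,350,000 quoted here. The figure 2 344 695 is the exact count for 2166 sequences, so either the sequence count or the pair count on the slides may be approximate.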
SDTMPI
Features:
Modules:
• os
• sys
• subprocess
• Biopython
• mpi4py
Methods:
• Sequence alignment
• Pairwise identity scores calculation
Work flow
1. Create a sequence list object from the N sequences
2. Build two lists of indices used to easily access sequences in the sequence-array object:
[0, 0, 0, 0, 0, …, 0, 1, 1, 1, …, 1, 2, 2, 2, …, 2, 3, 3, 3, …, 3, …, N-1]
[1, 2, 3, 4, 5, …, N, 2, 3, 4, …, N, 3, 4, 5, …, N, 4, 5, 6, …, N, …, N]
3. Distribute the pairs over cores 1, 2, 3, …, S
4. On each core, align each pair and compute the score
5. Write all the scores into one text file
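The two index lists can be built in a few lines. The sketch below is one possible construction, not necessarily how SDTMPI does it: the function name `pair_index_lists` is hypothetical, and indices are 0-based, so for N sequences they run from 0 to N-1.

```python
from itertools import combinations


def pair_index_lists(n_sequences):
    """Return two parallel lists (first, second) covering every unique pair.

    Position k in both lists identifies pair k: sequences[first[k]] is
    compared with sequences[second[k]]. Indices are 0-based.
    """
    first, second = [], []
    for i, j in combinations(range(n_sequences), 2):
        first.append(i)
        second.append(j)
    return first, second


first, second = pair_index_lists(4)
print(first)   # [0, 0, 0, 1, 1, 2]
print(second)  # [1, 2, 3, 2, 3, 3]
```

Flattening the pairs into parallel lists means that any pair can be addressed by a single integer, which makes it straightforward to hand out ranges of pairs to different cores.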
Work load distribution
Output
10 sequences: 45 pairs
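One way to spread the 45 pairs over several cores is a round-robin split, sketched below in plain Python so that it runs without an MPI installation. The round-robin scheme and the core count of 4 are assumptions, since the slides do not say how SDTMPI divides the work. Under mpi4py, `rank` and `size` would typically come from `MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()`.

```python
from itertools import combinations


def pairs_for_rank(pairs, rank, size):
    """Round-robin share of the pair list for one worker (0 <= rank < size)."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return pairs[rank::size]


pairs = list(combinations(range(10), 2))  # 10 sequences give 45 pairs
size = 4                                  # hypothetical number of cores
shares = [pairs_for_rank(pairs, rank, size) for rank in range(size)]
print([len(share) for share in shares])   # [12, 11, 11, 11]
```

Each worker would then align its own pairs and compute their scores, after which the results are collected and written to a single file.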
Speed up
2116 sequences: 2 344 695 pairs
SDT (serial): 349 h (~14 days)
SDTMPI (12 cores): 15.5 h
SDTMPI (40 cores): 8.1 h
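The run times can be turned into speed-up and parallel efficiency figures with the usual definitions (speed-up = serial time / parallel time, efficiency = speed-up / cores). A small sketch using only the numbers quoted on the slide:

```python
def speedup(serial_hours, parallel_hours):
    """Speed-up of a parallel run relative to the serial run."""
    return serial_hours / parallel_hours


def efficiency(serial_hours, parallel_hours, cores):
    """Speed-up per core; 1.0 would be ideal linear scaling."""
    return speedup(serial_hours, parallel_hours) / cores


serial = 349.0
for cores, hours in ((12, 15.5), (40, 8.1)):
    print(cores,
          round(speedup(serial, hours), 1),
          round(efficiency(serial, hours, cores), 2))
# 12 22.5 1.88
# 40 43.1 1.08
```

Both efficiencies come out above 1. The slides do not say whether the serial and parallel runs used the same hardware or the same alignment program, so these ratios are best read as wall-clock comparisons rather than as a pure scaling measurement.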
Application to mastrevirus
“A genome-wide pairwise-identity-based proposal for
the classification of viruses in the genus Mastrevirus
(Geminiviridae)”
For all 939 mastrevirus sequences that were publicly available in May 2012
New cut-offs: 78% for species and 94% for strain
Old cut-offs: 74% for species and 95% for strain
Mastrevirus species and strains
Chickpea chlorotic dwarf virus
Chickpea chlorosis virus
Wheat dwarf virus
Phylogenetic support
Future work
1. Full implementation of a web-based version of the tool.
2. Extend the method to the analysis of protein sequences.
Acknowledgements
• Dr Darren Martin, University of Cape Town
• Dr Arvind Varsani, University of Canterbury
• ICTV Geminiviridae Study Group
• HPC Winter School 2012 (CHPC & UFS)
Thank you