Protein structure Basel 9.04 Content

Protein structure Basel 9.04
How to search and correctly align
the target and the template
Content
• Why
• Algorithms
ƒ Pairwise
ƒ Search
ƒ Multiple
ƒ Structural
• Problems
• Conclusion
Swiss Institute of Bioinformatics
Swiss EMBnet node
LF 09.2004
Why do we need to align?
• Aligning sequences allows us to infer
biological evidence.
• Even more so with 3D structures
• Relationships between sequences
Algorithms
• Pairwise (one to one)
ƒ Local, global, dotplot, etc.
• Search (one to many)
ƒ BLAST, FASTA, etc.
• Multiple (many to many)
ƒ ClustalW, T-Coffee, etc.
• Structural (same with structure information)
ƒ 3DCoffee, DALI, FUGUE, etc.
Pairwise alignment
• Concept of an alignment
ƒ Identity, similarity, homology
• Local vs. global?
Pairwise alignment
• Scoring: intuitive or formal
• S = 12
• S = 5
Pairwise alignment
• Scoring: substitution matrix
• Examples: BLOSUM62, PAM250
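
As a rough illustration of scoring with a substitution matrix, the sketch below sums matrix entries over the columns of an ungapped alignment. The small `TOY_MATRIX` and the helper names are hypothetical placeholders, not entries taken from BLOSUM62 or PAM250; a real matrix covers all 20 amino acids.

```python
# Hypothetical toy substitution matrix (NOT real BLOSUM62/PAM250 values).
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "V"): 0, ("A", "W"): -3,
    ("V", "V"): 4, ("V", "W"): -3,
    ("W", "W"): 11,
}

def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for residues a and b in a symmetric matrix."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def ungapped_score(seq1, seq2, matrix=TOY_MATRIX):
    """Sum substitution scores column by column (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))

print(ungapped_score("AVW", "AVW"))  # 4 + 4 + 11 = 19
print(ungapped_score("AVW", "VAW"))  # 0 + 0 + 11 = 11
```

Identical residues score highest, and a conservative exchange (here A/V) is penalised less than an unlikely one (A/W), which is the idea behind similarity as opposed to identity.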
Pairwise alignment
• Gaps ?
• Scoring: gap penalties for indels
ƒ With a match score of 1 and mismatch score of 0
ƒ With an opening penalty of 4 and an extension
penalty of 1
S = 13 x 1 - 4 - 6 x 1 = 3
S = 13 x 1 - 4 x 5 - 6 x 1 = -13
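
The two scores above can be reproduced with a short sketch, assuming each gap costs the opening penalty once plus the extension penalty for every gapped position. The aligned sequences are hypothetical examples built to contain 13 matches and 6 gapped positions, grouped either in one gap or in five gaps; they are not the sequences from the original slide.

```python
def affine_alignment_score(row1, row2, match=1, mismatch=0,
                           gap_open=4, gap_extend=1):
    """Score a gapped alignment; each gap costs gap_open + length * gap_extend."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # row that currently carries a gap: None, 1 or 2
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if gap_row != row:       # a new gap starts here
                score -= gap_open
            score -= gap_extend      # every gapped position is charged
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score

# Hypothetical alignments: 13 matches, 6 gapped positions.
one_gap = ("ACDEFGHQRSTVWIKLMNP", "ACDEFGH------IKLMNP")
five_gaps = ("ACQDERFGSTHIKVLMWNP", "AC-DE-FG--HIK-LM-NP")

print(affine_alignment_score(*one_gap))    # 13 - 4 - 6 = 3
print(affine_alignment_score(*five_gaps))  # 13 - 4*5 - 6 = -13
```

With the same number of matches and gapped positions, the alignment with a single long gap scores higher, which is the intended effect of separating opening and extension penalties.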
Pairwise alignment
• Exact algorithms --> Dynamic Programming
ƒ Global: Needleman & Wunsch
ƒ Local: Smith & Waterman
• Graphic --> Dotplot
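
The difference between the global and the local recurrence can be sketched in a few lines. This is a minimal version that returns only the optimal score, assuming a simple match/mismatch scheme and a linear gap penalty (the parameter values are illustrative); real tools use substitution matrices, affine gaps and a traceback to recover the alignment.

```python
def nw_score(s, t, match=1, mismatch=-1, gap=-2):
    """Needleman & Wunsch: best score aligning s and t end to end."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def sw_score(s, t, match=1, mismatch=-1, gap=-2):
    """Smith & Waterman: best score over all pairs of subsequences."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)  # never below 0
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

print(nw_score("AAAWWWAAA", "CCCWWWCCC"))  # -3: flanking mismatches are charged
print(sw_score("AAAWWWAAA", "CCCWWWCCC"))  # 3: only the shared WWW is reported
```

Both functions fill a table of size len(s) x len(t), which is why exhaustive dynamic programming becomes expensive for database searches.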
Pairwise alignment
• LALIGN
ƒ http://www.ch.embnet.org/software/LALIGN_form.html
• SIM
ƒ http://www.expasy.org/tools/sim-prot.html
• EMBOSS (water & needle)
ƒ http://www.ebi.ac.uk/emboss/align
• BLAST2SEQ
ƒ http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
• PRSS
ƒ http://www.ch.embnet.org/software/PRSS_form.html
• Dotlet
ƒ http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Searching databases
• How to find the best match for your protein?
• Solution: align it against all sequences of the
database and compare the scores
ƒ OK but Dynamic Programming is slow!
ƒ OK but what about biased sequences?
ƒ OK but what is a significant match?
Searching databases
• Dynamic Programming is slow
• Solution: heuristic algorithm
• Faster, but not guaranteed to find the
best alignment
ƒ FASTA
ƒ BLAST
ƒ SSAHA
ƒ BLAT
ƒ …
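
One common heuristic idea is to index short words (k-mers) and only examine database regions that share words with the query. The sketch below illustrates that seeding step only; it is a simplified assumption about the general approach, not the actual implementation of FASTA, BLAST, SSAHA or BLAT, and the database sequences and function names are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(database, k=3):
    """Map every k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def seed_hits(query, index, k=3):
    """Count exact k-mer matches per (sequence, diagonal).

    The diagonal is database offset minus query offset; several hits on
    one diagonal suggest a region worth aligning properly.
    """
    hits = defaultdict(int)
    for q in range(len(query) - k + 1):
        for name, d in index.get(query[q:q + k], []):
            hits[(name, d - q)] += 1
    return dict(hits)

# Hypothetical two-sequence database.
database = {"seq1": "MKVLAAGIVGLW", "seq2": "PPPPGGGGSSSS"}
index = build_kmer_index(database)
print(seed_hits("VLAAGI", index))  # {('seq1', 2): 4}
```

The index is built once per database, so each query costs a few dictionary lookups instead of a full dynamic programming table per database sequence. The price is that alignments without any exact shared word are missed.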
Searching databases
• Biased sequences create false positives
• Solution: filtering
• BLAST implements several filters (seg,
xnu, coils)
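
To give a feel for what such a filter does, the sketch below masks windows with low Shannon entropy. It is a simplified stand-in for the idea of low-complexity filtering, not the algorithm used by seg or any other BLAST filter; the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue lying in a low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            for j in range(i, i + window):
                masked[j] = "X"
    return "".join(masked)

# Hypothetical sequence with a poly-Q stretch in the middle.
seq = "ACDEFGHIKLMN" + "Q" * 12 + "PRSTVWYACDEF"
print(mask_low_complexity(seq))
```

Masked residues no longer seed or extend matches, so a query rich in one amino acid does not collect high-scoring hits against unrelated proteins with the same bias.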
Searching databases
• How to compare scores from different
alignments to identify significant matches?
• Solution: normalised scores & statistics
ƒ The distribution of random ungapped local alignments
was shown to follow an Extreme Value Distribution
(EVD)
Searching databases
• How to compare scores to identify
significant matches?
• Solution: e-value
ƒ BLAST uses precalculated statistics to
transform the raw score into an e-value
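
The transformation can be sketched with the Karlin-Altschul formula E = K m n exp(-lambda S), where m and n are the query and database lengths. The values of `lam` and `k` used below are illustrative placeholders, not the precalculated parameters of any particular matrix; in practice they depend on the scoring matrix and gap penalties.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalised score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Placeholder statistical parameters (assumptions for illustration only).
LAM, K = 0.3, 0.1

for s in (40, 60, 80):
    e = e_value(s, query_len=200, db_len=1_000_000, lam=LAM, k=K)
    print(s, round(bit_score(s, LAM, K), 1), f"{e:.2e}")
```

The e-value drops exponentially as the raw score grows and scales linearly with the size of the search space, which is why the same alignment becomes less significant in a larger database.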
Searching databases
• FASTA
ƒ http://www.ebi.ac.uk/fasta33
• BLAST ([t]blast[npx], psi, phi, mega)
ƒ http://www.ncbi.nlm.nih.gov/BLAST/
ƒ http://www.ch.embnet.org/software/aBLAST.html
ƒ http://www.ebi.ac.uk/blast2
• SSAHA
ƒ See www.ensembl.org
• BLAT
ƒ http://bioinformatics-blat.ccr.buffalo.edu/cgi-bin/hgBlat
Multiple Sequence Alignment
• Comparing two sequences is good
• Comparing many sequences is better!
ƒ But requires very intensive computing
• Different heuristic algorithms
ƒ Carrillo & Lipman (MSA, DCA)
ƒ Segment based (Dialign)
ƒ Iterative (Profiles, HMMs)
ƒ Progressive (ClustalW, T-Coffee)
Multiple Sequence Alignment
• The progressive algorithms found in ClustalW
and T-Coffee are fast and reliable
• No guarantee of finding the best alignment
• No scoring system
Multiple Sequence Alignment
• ClustalW
ƒ http://www.ch.embnet.org/software/ClustalW.html
ƒ http://www.ebi.ac.uk/clustalw/
• T-Coffee
ƒ http://www.ch.embnet.org/software/TCoffee.html
ƒ http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• Other tools
ƒ Dialign
• http://www.genomatix.de/cgi-bin/dialign/dialign.pl
ƒ Multalin
• http://prodes.toulouse.inra.fr/multalin/multalin.html
Structural alignment
• Cases where a good sequence alignment
is not observed in the structure!
• Example: Cytochrome B reductase
Pairwise vs structural alignment
Cytochrome B reductase

Sequence alignment:
Pig     VDLVIKVYFKDTHP
         ||  |||||  ||
E.coli  FDLLVKVYFKNEHP

9 out of 14 identities, but not at the structural level…

Structural alignment:
Pig     VDLVIKVYFKDTHP-
E.coli  -FDLLVKVYFKNEHP

Saqi, Russell & Sternberg
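
The contrast between the two alignments can be checked by counting identical columns. The helper below is a small illustrative sketch (the function name is hypothetical); the sequences are the two fragments shown above.

```python
def count_identities(row1, row2):
    """Count columns where both rows carry the same residue (gaps never count)."""
    return sum(1 for a, b in zip(row1, row2) if a == b and a != "-")

sequence_alignment = ("VDLVIKVYFKDTHP", "FDLLVKVYFKNEHP")
structural_alignment = ("VDLVIKVYFKDTHP-", "-FDLLVKVYFKNEHP")

print(count_identities(*sequence_alignment))    # 9
print(count_identities(*structural_alignment))  # 0
```

Shifting the register by a single residue, as the structures require, removes every identity, so a sequence-based score alone would never prefer the structurally correct alignment.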
Structural alignment
• Cases where a good sequence alignment is
not observed in the structure!
• Why?
ƒ Sequence alignments do not take into account:
• Secondary and tertiary structure
• Charges and hydrophobicity
• Distance constraints affecting indels
• Cofactors and solvent
ƒ Evaluation of the model (scoring needed)
ƒ Solution: use additional information to align
Structural alignment
• Similar to sequences
ƒ One to one (CE, FUGUE, LSQRMS, DALI, SAL, SAP)
ƒ One to many (DALI, CE, FUGUE)
ƒ Many to many (3DCoffee, CE)
• But with additional data, i.e., PDB or HOMSTRAD
or other derived structural data
• Many tools not on the web:
ƒ especially MODELLER (MALIGN3D)
http://salilab.org/modeller/modeller.html
ƒ SAP (Orengo, Taylor) used in 3DCoffee
Structural alignment
• 3DCoffee (combination of SAP, PDB and pairwise alignment)
ƒ http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• DALI (network of services -> MSA of neighboring structures)
ƒ http://www.ebi.ac.uk/dali/
• CE (« Combinatorial Extension », many tools)
ƒ http://cl.sdsc.edu/ce.html
• FUGUE (Specific matrices and gap penalties)
ƒ http://www-cryst.bioc.cam.ac.uk/~fugue
• LSQRMS (least-squares root mean square)
ƒ http://www.molmovdb.org/align/
• SAL (vector based)
ƒ http://www.bioinformatics.buffalo.edu/current_buffalo/skolnick/sal.html
Other tools
• SAM-T02 (search for potential structures)
ƒ http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02query.html
• STRAP (MSA editor with structural info)
ƒ http://www.charite.de/bioinf/strap/
Problems
• Compositional Bias or low-complexity
• Repeats
Problems
• Coiled-coils
• Leu-zipper
Problems
• Compositional bias or low complexity
• Coiled-coils
• Repeats
• Solution: use tools to detect these
problems and, if necessary, filter the region
ƒ seg, coils, marcoil, dotplots…
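
As an example of such a detection tool, a dotplot of a sequence against itself reveals repeats as diagonals parallel to the main diagonal. The sketch below is a minimal text version using exact word matches; the window size and the example sequence are hypothetical, and tools such as Dotlet score windows with a substitution matrix instead.

```python
def dotplot(seq1, seq2, window=3):
    """Return text rows marking exact window matches ('*') between two sequences."""
    rows = []
    for i in range(len(seq1) - window + 1):
        row = ""
        for j in range(len(seq2) - window + 1):
            row += "*" if seq1[i:i + window] == seq2[j:j + window] else "."
        rows.append(row)
    return rows

seq = "MKVLAGMKVLAG"  # hypothetical sequence containing a tandem repeat
plot = dotplot(seq, seq)
print("\n".join(plot))
```

The main diagonal is the trivial self-match; the second diagonal, offset by six positions, marks the repeated MKVLAG unit.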
Conclusion
• If you don’t have a choice: pairwise
ƒ Lalign
• With more sequences: MSA
ƒ T-Coffee
• With 3D structures: Structural alignments
ƒ Modeller
ƒ SAP
• Searches for structure:
ƒ DALI
ƒ SAM-T02