Protein structure Basel 9.04 Content
Protein structure Basel 9.04
How to search and correctly align the target and the template

Content
• Why
• Algorithms
  Pairwise
  Search
  Multiple
  Structural
• Problems
• Conclusion

Swiss Institute of Bioinformatics Swiss EMBnet node LF 09.2004

Why do we need to align?
• Aligning sequences allows us to infer biological evidence.
• Even more so with 3D structure
• Relationships between sequences

Algorithms
• Pairwise (one to one)
  Local, global, dotplot, etc.
• Search (one to many)
  BLAST, FASTA, etc.
• Multiple (many to many)
  ClustalW, T-Coffee, etc.
• Structural (same with structure information)
  3DCoffee, DALI, FUGUE, etc.

Pairwise alignment
• Concept of an alignment
  Identity, similarity, homology
• Local vs global?

Pairwise alignment
• Scoring: intuitive or formal
  S = 12
  S = 5

Pairwise alignment
• Scoring: substitution matrix
• Examples: BLOSUM62, PAM250

Pairwise alignment
• Gaps?
• Scoring: gap penalties for indels
  With a match score of 1 and a mismatch score of 0
  With an opening penalty of 4 and an extension penalty of 1
  One gap covering 6 positions:   S = 13 x 1 - 4 - 6 x 1 = 3
  Five gaps covering 6 positions: S = 13 x 1 - 4 x 5 - 6 x 1 = -13

Pairwise alignment
• Exact algorithms --> Dynamic Programming
  Global: Needleman & Wunsch
  Local: Smith & Waterman
• Graphic --> Dotplot

Pairwise alignment
• LALIGN
  http://www.ch.embnet.org/software/LALIGN_form.html
• SIM
  http://www.expasy.org/tools/sim-prot.html
• EMBOSS (water & needle)
  http://www.ebi.ac.uk/emboss/align
• BLAST2SEQ
  http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
• PRSS
  http://www.ch.embnet.org/software/PRSS_form.html
• Dotlet
  http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Searching databases
• How to find the best match for your protein?
• Solution: align it against all sequences of the database and compare the scores
  OK, but Dynamic Programming is slow!
  OK, but what about biased sequences?
  OK, but what is a significant match?

Searching databases
• Dynamic Programming is slow
• Solution: heuristic algorithm
• Faster, but not guaranteed to find the best alignment
  FASTA, BLAST, SSAHA, BLAT, …

Searching databases
• Biased sequences create false positives
• Solution: filtering
• BLAST implements several filters (seg, xnu, coils)

Searching databases
• How to compare scores from different alignments to identify significant matches?
• Solution: normalised scores & statistics
  The scores of random ungapped local alignments were shown to follow an Extreme Value Distribution (EVD)

Searching databases
• How to compare scores to identify significant matches?
• Solution: E-value
  BLAST uses precalculated statistics to transform the raw score into an E-value

Searching databases
• FASTA
  http://www.ebi.ac.uk/fasta33
• BLAST ([t]blast[npx], psi, phi, mega)
  http://www.ncbi.nlm.nih.gov/BLAST/
  http://www.ch.embnet.org/software/aBLAST.html
  http://www.ebi.ac.uk/blast2
• SSAHA
  See www.ensembl.org
• BLAT
  http://bioinformatics-blat.ccr.buffalo.edu/cgi-bin/hgBlat

Multiple Sequence Alignment
• Comparing two sequences is good
• Comparing many sequences is better!
  But it requires very intensive computing
• Different heuristic algorithms
  Carrillo & Lipman (MSA, DCA)
  Segment based (Dialign)
  Iterative (Profiles, HMMs)
  Progressive (ClustalW, T-Coffee)

Multiple Sequence Alignment
• The progressive algorithms found in ClustalW and T-Coffee are fast and reliable
• No guarantee of finding the best alignment
• No scoring system

Multiple Sequence Alignment
• ClustalW
  http://www.ch.embnet.org/software/ClustalW.html
  http://www.ebi.ac.uk/clustalw/
• T-Coffee
  http://www.ch.embnet.org/software/TCoffee.html
  http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• Other tools
  Dialign: http://www.genomatix.de/cgi-bin/dialign/dialign.pl
  Multalin: http://prodes.toulouse.inra.fr/multalin/multalin.html

Structural alignment
• Cases where a good sequence alignment is not observed in the structure!
• Example: Cytochrome B reductase

Pairwise vs structural alignment
Cytochrome B Reductase

  Pig     VDLVIKVYFKDTHP
           ||  |||||  ||
  E.coli  FDLLVKVYFKNEHP

9 out of 14 identities, but not at the structural level…

  Pig     VDLVIKVYFKDTHP-
  E.coli  -FDLLVKVYFKNEHP

Saqi, Russell & Sternberg

Structural alignment
• Cases where a good sequence alignment is not observed in the structure!
• Why? Sequence alignments do not take into account
  • Secondary and tertiary structure
  • Charges and hydrophobicity
  • Distance constraints affecting indels
  • Cofactors and solvent
  Evaluation of the model (scoring needed)
  Solution: use additional information to align

Structural alignment
• Similar to sequences
  One to one (CE, FUGUE, LSQRMS, DALI, SAL, SAP)
  One to many (DALI, CE, FUGUE)
  Many to many (3DCoffee, CE)
• But with additional data, i.e., PDB or HOMSTRAD or other derived structural data
• Many tools are not on the web, especially
  MODELLER (MALIGN3D)
  http://salilab.org/modeller/modeller.html
  SAP (Orengo, Taylor), used in 3DCoffee

Structural alignment
• 3DCoffee (combination of SAP, PDB and pairwise alignment)
  http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• DALI (network of services -> MSA of neighboring structures)
  http://www.ebi.ac.uk/dali/
• CE (« Combinatorial Extension », many tools)
  http://cl.sdsc.edu/ce.html
• FUGUE (specific matrices and gap penalties)
  http://www-cryst.bioc.cam.ac.uk/~fugue
• LSQRMS (least square root mean square)
  http://www.molmovdb.org/align/
• SAL (vector based)
  http://www.bioinformatics.buffalo.edu/current_buffalo/skolnick/sal.html

Other tools
• SAM-T02 (search for potential structures)
  http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02query.html
• STRAP (MSA editor with structural info)
  http://www.charite.de/bioinf/strap/

Problems
• Compositional bias or low-complexity
• Coiled-coils (e.g. Leu-zipper)
• Repeats
• Solution: use tools to detect these problems and, if needed, filter the region
  seg, coils, marcoil, dotplots…

Conclusion
• If you don't have the choice: pairwise
  Lalign
• With more sequences: MSA
  T-Coffee
• With 3D structures: structural alignments
  Modeller, SAP
• Searches for structure:
  DALI, SAM-T02
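
Worked example: gap penalties

The gap-penalty arithmetic from the pairwise alignment slides can be reproduced with a short script. This is only a minimal sketch, not part of the original course material. The function name affine_gap_score and the toy sequences are hypothetical. The convention used (one opening penalty per run of gapped columns, plus one extension penalty for every gapped column) is an assumption chosen because it matches the two scores given on the slide, 3 and -13.

```python
def affine_gap_score(a, b, match=1, mismatch=0, gap_open=4, gap_extend=1):
    """Score two already-aligned strings ('-' marks a gap).

    Assumed convention: each run of gapped columns costs gap_open once,
    and every gapped column additionally costs gap_extend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                score -= gap_open      # a new gap run starts here
                in_gap = True
            score -= gap_extend        # charged for every gapped column
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# Hypothetical toy sequences: 13 identical columns and 6 gapped columns each.
top       = "ACDEFGHIKLMNPQRSTVW"
one_gap   = "ACDEFGH------QRSTVW"   # one gap covering 6 positions
five_gaps = "A-DE-GH--LM-PQ-STVW"   # five gaps covering 6 positions

print(affine_gap_score(top, one_gap))    # 13*1 - 4   - 6*1 = 3
print(affine_gap_score(top, five_gaps))  # 13*1 - 4*5 - 6*1 = -13
```

Spreading the same six gapped positions over five gaps costs four extra opening penalties, which is why this kind of scoring favours a few long indels over many short ones.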