Alignments
BLAST, BLAT
Genome vs. gene
                       Genome                              Gene
Built of...            DNA                                 DNA
Describes              Organism                            Protein
Stored as...           Single molecule, or a few of them   Part of genome
Circular/linear        Both (depending on the species)     Linear
“Life cycle”           DNA-DNA-DNA-...                     DNA-RNA-protein
Size                   0.5Mb*) ... 3500Mb                  100b ... 100000b
Amount per cell        1                                   500 .. 50000
Information content    2% ... 100%                         ~30%
The amount of genetic
information in organisms
Name                        Genome size (Mb)   # genes
Mycoplasma genitalium       0.5                470
Escherichia coli            4.5                4400
Saccharomyces cerevisiae    12                 5500
Drosophila melanogaster     120                18000
Caenorhabditis elegans      97                 22000
Homo sapiens                3000               23457
Zea mays                    2500               50000
The amount of genetic
information in organisms
http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html
Largest genome: amoeba Chaos chaos (200x human genome)
http://www.lawrence.edu/dept/biology/animal/
Sequence searching challenges

Exponential growth of
databases
Sequence searching –
definition

Task:
– Query: short, new sequence (~1000
letters)
– Database (searching space): very
many sequences
– Goal: find seqs homologous to the
query
Sequence searching –
definition

We want:
– fast tool
– primarily a filter: most sequences will
be unrelated to the query
– fine-tune the alignment later
Database Search Algorithms:
Sensitivity, Selectivity
• True Positive (TP) – a homology detected (positive)
correctly (true)
Signal   Detected   Name
Yes      Yes        True Positive
No       No         True Negative
Yes      No         False Negative
No       Yes        False Positive
Database Search Algorithms:
Sensitivity, Selectivity
• Sensitivity =TP/(TP+FN)
• Selectivity =TN/(TN+FP)
[Figure: sensitivity vs. selectivity]
Courtesy of Gary Benson (ISSCB 2003)
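The two definitions above can be sketched in a few lines of Python. The function name and the example counts are illustrative assumptions, not taken from the slides.

```python
def sensitivity_selectivity(tp, fp, tn, fn):
    """Return (sensitivity, selectivity) from the four outcome counts."""
    sensitivity = tp / (tp + fn)   # fraction of real homologs that were detected
    selectivity = tn / (tn + fp)   # fraction of unrelated sequences correctly rejected
    return sensitivity, selectivity


# Hypothetical search: 90 homologs found, 10 missed,
# 9900 unrelated sequences rejected, 100 reported by mistake.
sens, sel = sensitivity_selectivity(tp=90, fp=100, tn=9900, fn=10)
print(f"sensitivity={sens:.2f} selectivity={sel:.2f}")
```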
What is BLAST


Basic Local Alignment Search Tool
Bad news: it is only a heuristic
– Heuristic: a rule of thumb that often helps in solving
a certain class of problems, but makes no guarantees.
Perkins, DN (1981) The Mind's Best Work

Basic idea:
– High-scoring segments have a well-conserved
(almost identical) part
– Once a well-conserved part is identified,
extend it to the real alignment
What does “well conserved” mean
for BLAST?

BLAST works with k-words (words of length
k)
– k is a parameter
– different for DNA (>10) and proteins
(2..4)

word w1 is T-similar to w2 if the sum of pair
scores is at least T (e.g. T=12)
Similar 3-words
W1:      R   K   P
W2:      R   R   P
Score:   9  -1   7     ∑ = 15
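A minimal sketch of the T-similarity test, using only the three pair scores shown in the example above. The dictionary and function names are assumptions made for illustration; a real search would look scores up in a full substitution matrix.

```python
# Pair scores copied from the slide example (not a complete matrix).
PAIR_SCORE = {("R", "R"): 9, ("K", "R"): -1, ("P", "P"): 7}


def word_score(w1, w2, pair_score=PAIR_SCORE):
    """Sum of per-position pair scores for two words of equal length."""
    return sum(pair_score[(a, b)] for a, b in zip(w1, w2))


def is_t_similar(w1, w2, t, pair_score=PAIR_SCORE):
    """True if the summed pair score reaches the threshold T."""
    return word_score(w1, w2, pair_score) >= t


print(word_score("RKP", "RRP"))           # 9 + (-1) + 7
print(is_t_similar("RKP", "RRP", t=12))
```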
BLAST algorithm
3 basic steps
1) Preprocess the query: extract all the k-words
2) Scan for T-similar matches in the database
3) Extend them to alignments
BLAST, Step 1:
Preprocess the query




Take the query (e.g. LVNRKPVVP)
Chop it into overlapping k-words (k=3 in this case)
Query: LVNRKPVVP
Word1: LVN
Word2: VNR
Word3: NRK
…
For each word find all similar words (scoring at least T)
E.g. for RKP the following 3-words are similar:
QKP KKP RQP REP RRP RKP
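The preprocessing step can be sketched as follows. The scoring function here is a hypothetical stand-in (+5 for identity, -1 otherwise) and the five-letter alphabet is chosen only to keep the output small, so the resulting word list differs from the BLOSUM-based list on the slide.

```python
from itertools import product


def kwords(query, k):
    """All overlapping words of length k, in query order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def toy_score(a, b):
    # Hypothetical scoring, NOT a real substitution matrix.
    return 5 if a == b else -1


def neighbourhood(word, alphabet, t, score=toy_score):
    """All words over `alphabet` whose summed pair score with `word` is >= t."""
    return sorted(
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= t
    )


words = kwords("LVNRKPVVP", 3)
similar = neighbourhood("RKP", alphabet="EKPQR", t=9)
print(words)
print(similar)
```

With the full 20-letter alphabet the same enumeration visits 20^3 = 8000 candidates per 3-word, which is still cheap because the query is short.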
Finite state machine




An abstract machine
with a constant amount of memory (states)
used in the theory of computation and formal languages
recognizes regular expressions, e.g.
– cp dmt*.pdf /home/john
– AC*T|GGC
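The second pattern above can be tried directly with Python's standard `re` module. Note that `re` is a backtracking matcher rather than a literal finite state machine, so this only illustrates which strings the pattern accepts; the helper name is an assumption.

```python
import re

# "A", any number of "C", then "T"; or the literal "GGC".
pattern = re.compile(r"AC*T|GGC")


def recognised(text):
    """True if the whole string is accepted by the pattern."""
    return pattern.fullmatch(text) is not None


for s in ["AT", "ACCCT", "GGC", "ACG"]:
    print(s, recognised(s))
```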
BLAST, Step 2:
Find “exact” matches
with scanning

Use all the T-similar k-words to build the Finite State Machine
Scan the database for exact matches of any of these words:
QKP KKP RQP REP RRP RKP ...
Scanning moves along the database sequence:
...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
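A minimal sketch of the scan step on the sequence above. For brevity it uses a set lookup on a sliding window instead of building a finite state machine; the function name is an assumption.

```python
def scan(database_seq, words):
    """Return (position, word) for every exact occurrence of a listed word."""
    word_set = set(words)
    k = len(next(iter(word_set)))            # all words share the same length k
    hits = []
    for pos in range(len(database_seq) - k + 1):
        window = database_seq[pos:pos + k]
        if window in word_set:
            hits.append((pos, window))
    return hits


similar_words = ["QKP", "KKP", "RQP", "REP", "RRP", "RKP"]
hits = scan("VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA", similar_words)
print(hits)
```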
BLAST, Step 3:
Extending “exact” matches

Having the list of exact matches, we extend
each alignment in both directions
Query:       L   V   N   R   K   P   V   V   P
T-similar:               R   R   P
Subject:     G   V   C   R   R   P   L   K   C
Score:      -3   4  -3   5   2   7   1  -2  -3
…until the sum of scores drops below some
level X (e.g. X=-100) relative to the best score known so far
– what about gaps?
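The ungapped extension can be sketched like this. The match/mismatch scoring (+1/-1), the `x_drop` convention (stop once the running score falls more than `x_drop` below the best seen) and all names are assumptions for illustration only.

```python
def simple_score(a, b):
    # Hypothetical scoring: +1 for identity, -1 for mismatch.
    return 1 if a == b else -1


def xdrop_extend(query, subject, q_pos, s_pos, k, x_drop, score=simple_score):
    """Extend a k-word hit in both directions without gaps.

    Returns (total score, query start, query end) of the best extension.
    """
    def walk(pairs):
        best = total = best_len = 0
        for n, (a, b) in enumerate(pairs, start=1):
            total += score(a, b)
            if total > best:
                best, best_len = total, n
            elif best - total > x_drop:
                break                      # fell too far below the best score
        return best, best_len

    seed = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right, right_len = walk(zip(query[q_pos + k:], subject[s_pos + k:]))
    left, left_len = walk(zip(reversed(query[:q_pos]), reversed(subject[:s_pos])))
    return seed + left + right, q_pos - left_len, q_pos + k + right_len


result = xdrop_extend("TTACGTACGGA", "GGACGTACGCC", q_pos=4, s_pos=4, k=3, x_drop=2)
print(result)
```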
Gapped BLAST
(now standard)


gapped local alignments are computed:
much, much, much slower
therefore: modified “Hit criteria”
Hit criteria

Extend the alignment only if there are
two close hits on the same diagonal
– sensitivity would drop unless T is lowered
– reduces extensions (90% of the time is spent
on extensions)

Gapped local alignments are computed
– increased sensitivity allows us to raise T
– raising T speeds up the search
[Figure: two close hits on the same diagonal; axes: query pos, db pos]
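A simplified sketch of the two-hit criterion: hits are grouped by diagonal, and a pair of neighbouring hits on one diagonal triggers an extension if they are close enough. The `window` parameter and the function name are assumptions, and the sketch ignores details such as requiring the two hits not to overlap.

```python
from collections import defaultdict


def two_hit_seeds(hits, window):
    """hits: iterable of (qpos, dbpos). Returns sorted (diagonal, qpos1, qpos2) triples."""
    by_diag = defaultdict(list)
    for qpos, dbpos in hits:
        by_diag[qpos - dbpos].append(qpos)

    seeds = []
    for diag, positions in by_diag.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first <= window:
                seeds.append((diag, first, second))
    return sorted(seeds)


hits = [(10, 110), (25, 125), (40, 300), (90, 190)]
print(two_hit_seeds(hits, window=40))
```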
Gapped BLAST vs. BLAST

We end up with
– same speed
– gapped alignments!
– much higher sensitivity
BLAST flavours

blastp: protein query, protein db

blastn: DNA query, DNA db

blastx: DNA query, protein db
– in all reading frames. Used to find potential
translation products of an unknown nucleotide
sequence.

tblastn: protein query, DNA db
– database dynamically translated in all reading
frames.

tblastx: DNA query, DNA db
– all translations of query against all translations of
db
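"All reading frames" means three offsets on each strand. The sketch below translates a DNA string in all six frames using the standard genetic code; it assumes upper-case input containing only A, C, G, T, and the function names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON[dna[i:i + 3]] for i in range(0, usable, 3))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Frames +1, +2, +3 followed by -1, -2, -3."""
    rc = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, rc) for offset in range(3)]


frames = six_frames("ATGGCC")
print(frames)
```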
PSI-BLAST




Position-Specific Iterated BLAST
A profile is derived from the results
of the first search
The database is then searched with the
profile (instead of a sequence)
Up to 3 iterations
Profile


Profile is a generalized form of a sequence:
probabilities instead of a letter at each position
(rows: letters, columns: positions)
A    0.5   0.3   0.2   ...   0
C    0     0.1   0.0   ...   0.5
D    0     0     0.1   ...   0.2
.    .     .     .     ...   .
W    0     0.3   0.4   ...   0.1
Y    0.5   0.3   0.3   ...   0.2
Score of the profile
$\mathrm{score}(p, i, A) = \sum_{B} p[i, B] \cdot \mathrm{score}_{\mathrm{blosum62}}(A, B)$
where p is the profile, i the position, and A the letter
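The profile score is a probability-weighted average of substitution scores. A minimal sketch, assuming a tiny three-letter substitution table with illustrative values in place of the full BLOSUM62 matrix, and a profile stored as one dictionary per position:

```python
# Small illustrative substitution table (stand-in for BLOSUM62).
SUBST = {("A", "A"): 4, ("A", "C"): 0, ("A", "Y"): -2,
         ("C", "C"): 9, ("C", "Y"): -2, ("Y", "Y"): 7}


def subst(a, b):
    """Symmetric lookup in the substitution table."""
    return SUBST[(a, b)] if (a, b) in SUBST else SUBST[(b, a)]


def profile_score(profile, i, letter):
    """score(p, i, A) = sum over B of p[i][B] * subst(A, B)."""
    return sum(prob * subst(letter, b) for b, prob in profile[i].items())


profile = [{"A": 0.5, "Y": 0.5},              # position 0
           {"A": 0.3, "C": 0.1, "Y": 0.6}]    # position 1
print(profile_score(profile, 0, "A"))          # 0.5*4 + 0.5*(-2)
print(profile_score(profile, 0, "Y"))          # 0.5*(-2) + 0.5*7
```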
Constructing a profile




Take significant BLAST results
Make an alignment
Assign weights to sequences
Construct the profile
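The last two steps can be sketched as weighted column frequencies. This is a simplification under stated assumptions: the alignment has no gaps, the weights are given, and no pseudocounts are added (PSI-BLAST itself does use pseudocounts). All names are illustrative.

```python
from collections import defaultdict


def build_profile(aligned, weights):
    """aligned: equal-length sequences; weights: one weight per sequence.

    Returns a list with one {letter: probability} dict per alignment column.
    """
    total = sum(weights)
    profile = []
    for column in zip(*aligned):
        freq = defaultdict(float)
        for letter, weight in zip(column, weights):
            freq[letter] += weight / total
        profile.append(dict(freq))
    return profile


profile = build_profile(["AC", "AD", "YC"], weights=[1, 1, 2])
print(profile)
```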
BLAT


The Blast-Like Alignment Tool
Large-scale genome comparison:
– query can be large

Preprocessing phase:
– BLAST: query
– BLAT: db
BLAT, Step 1:
Preprocess the
database

Index the database with k-words
– k=8..16 for nucleotides
– k=3..5 for proteins

For each k-word store in which
sequences it appears
k-word: RKP
Hashed DB:
QKP: HUgn0151194, Gene14, IG0, ...
KKP: haemoglobin, Gene134, IG_30, ...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30, ...
RRP: Z17368, Creatine kinase, ...
...
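A minimal sketch of such an index, built as a Python dictionary from each k-word to the places where it occurs. The tiny database, the `step` parameter (1 indexes every overlapping k-word) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def build_index(database, k, step=1):
    """database: {name: sequence}. Returns {k-word: [(name, position), ...]}."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(0, len(seq) - k + 1, step):
            index[seq[pos:pos + k]].append((name, pos))
    return index


db = {"s1": "ACGTAC", "s2": "GTACGT"}
index = build_index(db, k=3)
print(index["GTA"])
```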
Hashing – associative
arrays


Indexing with the object itself
Hash function: maps the large set of possible objects
to a small set of hash values (fits in memory)
– Objects should be “well spread”
Hashing - examples

T9 Predictive Text in mobile phones
– “hello” in Multitap:
4, 4, 3, 3, 5, 5, 5,
(pause) 5, 5, 5, 6, 6, 6
– “hello” in T9:
4, 3, 5, 5, 6
– Collisions: 4, 6:
“in”, “go”
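The T9 example above behaves like a hash function: many words map to one short key sequence. A small sketch, assuming the usual phone keypad layout; the function names are illustrative.

```python
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
KEY_OF = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def t9(word):
    """One key press per letter: the 'hash' of the word."""
    return "".join(KEY_OF[ch] for ch in word.lower())


def multitap(word):
    """Repeated presses per letter, e.g. 'h' is the second letter on key 4."""
    return [KEY_OF[ch] * (KEYPAD[KEY_OF[ch]].index(ch) + 1) for ch in word.lower()]


print(t9("hello"), multitap("hello"))
print(t9("in"), t9("go"))   # a collision: both words hash to the same keys
```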
BLAT, Step 1:
Index to find exact matches
with hashing

The database is preprocessed only
once! (independent of the
query)
BLAT, Step 2:
Hit criteria


In constant time we can get the
sequences containing a given k-word
Relaxing the hit definition improves
sensitivity:
– allow imperfect hits
• costly: the already huge hash grows several times!
– or shorten k (on its own this would lead to more FPs),
but require two hits (see BLAST)
BLAT, Step 3:
Identifying homologous
regions



Exclude common k-words
For all k-words from the query
– find their positions in the db
For results (qpos, dbpos):
– split into buckets (64kbp)
– sort on the diagonal (diag = qpos − dbpos)
BLAT, Step 3:
Identifying homologous
regions

continued...
– from diagonally close hits (gap
limit) create “pre-clusters”
– sort each “pre-cluster” on dbpos
– create clusters from close hits
– run Local Alignment for each
cluster
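The bucketing and clustering steps can be sketched as below; the final local alignment per cluster is left out. Only the 64 kbp bucket size comes from the slides. The gap limits `diag_gap` and `db_gap`, and all function names, are hypothetical values chosen for the example.

```python
def split_runs(items, key, max_gap):
    """Split items (already sorted by key) wherever neighbours differ by > max_gap."""
    runs = []
    for item in items:
        if runs and key(item) - key(runs[-1][-1]) <= max_gap:
            runs[-1].append(item)
        else:
            runs.append([item])
    return runs


def diagonal(hit):
    return hit[0] - hit[1]          # qpos - dbpos


def db_position(hit):
    return hit[1]


def cluster_hits(hits, bucket_size=64 * 1024, diag_gap=2, db_gap=100):
    """hits: list of (qpos, dbpos). Returns a list of clusters (lists of hits)."""
    buckets = {}
    for hit in hits:
        buckets.setdefault(db_position(hit) // bucket_size, []).append(hit)

    clusters = []
    for _, bucket in sorted(buckets.items()):
        bucket.sort(key=diagonal)
        for pre_cluster in split_runs(bucket, diagonal, diag_gap):
            pre_cluster.sort(key=db_position)
            clusters.extend(split_runs(pre_cluster, db_position, db_gap))
    return clusters


hits = [(0, 1000), (10, 1010), (20, 1021), (5, 5000), (300, 70000)]
print(cluster_hits(hits))
```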
Seeds – improving sensitivity


More general form of k-word is a
seed
The seed
CT.GT.AT.

gives “hits” with both sequences
...CTCGTTATA...
...CTAGTAATG...
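A short sketch of seed matching, where "." in the seed stands for a position that may hold any letter. The function names are assumptions.

```python
def matches_seed(seed, window):
    """True if window agrees with seed at every non-'.' position."""
    return len(seed) == len(window) and all(
        s == "." or s == w for s, w in zip(seed, window))


def seed_hits(seed, seq):
    """Start positions in seq where the seed matches."""
    n = len(seed)
    return [i for i in range(len(seq) - n + 1) if matches_seed(seed, seq[i:i + n])]


print(matches_seed("CT.GT.AT.", "CTCGTTATA"))
print(matches_seed("CT.GT.AT.", "CTAGTAATG"))
print(seed_hits("CT.GT.AT.", "GGCTCGTTATAGG"))
```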
How to detect homology?


Take the score of a maximal local
alignment
can it be obtained by chance?
– any score can be obtained from
comparing (long enough) random
sequences
What is a “chance”?


Extracting local alignments from
random sequences
P-value (e.g. =0.01)
– The probability of obtaining the result
by pure chance
– An alignment with a P-value lower than
the threshold set by the user is considered a hit.
Best Local Alignments
by chance

Create random seqs, each 1000aa long

Find the max local align

Repeat
[Figure: histogram of best local alignment scores obtained by chance; x-axis: Score (25 to 75), y-axis: # Alignments (0 to 7000)]
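The experiment above can be sketched with a plain Smith-Waterman score computation. The scoring scheme (+5 match, -4 mismatch, -8 per gap position) and the uniform letter frequencies are assumptions, not the settings behind the original histogram; short sequences are used so that the example runs quickly.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def best_local_score(a, b, match=5, mismatch=-4, gap=-8):
    """Score of the best local alignment (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def simulate(n_pairs, length, seed=0):
    """Best local alignment scores for n_pairs pairs of random sequences."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        b = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        scores.append(best_local_score(a, b))
    return scores


scores = simulate(n_pairs=5, length=30, seed=1)
print(scores)
```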
The Statistics of local
alignment
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Subst matrix must guarantee
• E(score(a,b)) < 0 for random a, b

Analytical solution
– sum of i.i.d. variables -> normal
distribution
– max of i.i.d. -> extreme value
distribution (EVD)
Expected number of aligns

E-value: the expected number of
alignments scoring >= S
$E = K m n e^{-\lambda S}$

2x size of seq -> 2x number of aligns
2x S -> E drops exponentially
E-value depends on n, m

Example
– For comparing seqA with seqB: S=88
-> E = 0.001
– For comparing seqA with 1000 seqs:
score 88 -> E=1

Important for db searching:
– n – size of query, m - size of db
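The dependence on n and m can be checked numerically with the formula $E = K m n e^{-\lambda S}$. The default values of K and λ below are placeholders for illustration only; the real values depend on the substitution table and gap costs.

```python
import math


def e_value(score, m, n, K=0.13, lam=0.32):
    """Expected number of chance alignments scoring >= score."""
    return K * m * n * math.exp(-lam * score)


one_seq = e_value(88, m=300, n=300)
thousand_seqs = e_value(88, m=1000 * 300, n=300)
print(thousand_seqs / one_seq)   # searching 1000x more sequence gives 1000x larger E
```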
Deriving K, λ

$E = K m n e^{-\lambda S}$
The above equation is a theoretical result
for gapless (g=inf) alignment
– K, λ can be derived from the substitution
table
For the gapped case
– it seems that the equation still holds
– we can derive K, λ from “experiments”
[Figure: histogram of alignment scores; x-axis: Score, y-axis: # Alignments]
Bit score S'

$E = K m n e^{-\lambda S}$
Score S depends on the substitution
table
What if we want a table-independent
score?
$E = m n 2^{-S'}$
where
$S' = \frac{\lambda S - \ln K}{\ln 2}$
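A quick numerical check that both forms give the same E-value, assuming the relation $S' = (\lambda S - \ln K) / \ln 2$. The values of K and λ used in the example are placeholders.

```python
import math


def bit_score(raw_score, K, lam):
    """Convert a raw score S into a bit score S'."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bits(bits, m, n):
    """E-value from the bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


K, lam, S, m, n = 0.13, 0.32, 88, 300, 300_000
from_raw = K * m * n * math.exp(-lam * S)
from_bits = e_from_bits(bit_score(S, K, lam), m, n)
print(from_raw, from_bits)
```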
Why does BLAST work?
Relevant riddle

Are there at least 2 people in
Amsterdam with the same
number of hairs?
– At most 500 000 hairs on
each head
– 700 000 people living in
Amsterdam
Why does BLAST work?
Pigeons

pigeonhole principle: having 9 boxes and 10
pigeons, there is at least one box with more than 1 pigeon
– n=9, k=7 case:
http://en.wikipedia.org/wiki/Pigeonhole_principle
Why does BLAST work?
Average case



pigeonhole principle describes the worst
case!
On average we'll expect two pigeons in
the same box much earlier
Birthday paradox: among 23 people, the
probability that at least two of them share a
birthday is > 0.5
– note: 365 boxes and only 23 pigeons!
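The birthday figure can be reproduced directly, assuming all 365 days are equally likely. The function name is illustrative.

```python
def p_shared_birthday(people, boxes=365):
    """Probability that at least two of `people` land in the same box."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (boxes - i) / boxes
    return 1.0 - p_all_distinct


print(round(p_shared_birthday(22), 3))
print(round(p_shared_birthday(23), 3))
print(p_shared_birthday(366))   # pigeonhole: a shared birthday is certain
```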
[Figure: Birthday paradox]
Why does BLA[S]T work?

Forget the T-similar words, now use only identities
2 sequences, 100 nucleotides each:
– What's the minimal sequence identity that guarantees
a string of 3 consecutive identities?
0 identities, 100 mismatches: no hit
67 identities, 33 mismatches: still possibly no hit
(a mismatch at every third position)
68th identity: a run of 3 is guaranteed
but if seqs are 50% id, we'll detect it with prob. 99%
28% id -> we'll detect it with prob. 50%
– how is it calculated?
Expected sensitivity


We assume that letters are
independent
I – identity between seqs, for
human-mouse
– 86% for DNA,
– 89% for proteins
$p_{\text{word id}} = I^k$
Expected sensitivity


Q – query size
number of non-overlapping words
$R = Q / k$
prob. of a “hit”
$p_{\text{detect}} = 1 - (1 - p_{\text{word id}})^R$
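These formulas can be evaluated for the 100-nucleotide example with k=3. The sketch assumes R is rounded down to a whole number of words; the function name is illustrative.

```python
def p_detect(identity, k, query_size):
    """Probability that at least one non-overlapping k-word matches exactly."""
    p_word = identity ** k              # all k letters identical
    r = query_size // k                 # number of non-overlapping words
    return 1.0 - (1.0 - p_word) ** r


print(round(p_detect(0.50, k=3, query_size=100), 3))
print(round(p_detect(0.28, k=3, query_size=100), 3))
```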
Expected specificity


How many matches by chance (C)?
G – genome size
$C = (Q - k + 1) \cdot \frac{G}{k} \cdot \left(\frac{1}{4}\right)^k$
For h-m, to get 99% sensitivity we
have to set k=7, and for Q=1000
C ~= 25,000,000
– 7h assuming 1/1000 s per alignment
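The estimate above can be reproduced as follows, assuming G = 3000 Mb (the Homo sapiens genome size from the earlier table) and 1/1000 s per alignment. The function name is illustrative.

```python
def chance_matches(q, g, k, alphabet=4):
    """Expected number of k-word matches by chance: (Q-k+1) * (G/k) * (1/alphabet)^k."""
    return (q - k + 1) * (g / k) * (1 / alphabet) ** k


c = chance_matches(q=1000, g=3_000_000_000, k=7)
hours = c / 1000 / 3600          # 1/1000 s per alignment
print(f"C = {c:,.0f}, about {hours:.1f} h")
```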