Sparse Problems in Bioinformatics
Olgica Milenkovic
Joint work with: Wei Dai and Amin Emad
University of Illinois at Urbana-Champaign
Acknowledgment: NSF CCF 1117980, NSF CCF 0729216, NSF CCF 0809895,
NSF CCF 01218764, NSF CSoI
Thanks to: Yaniv Erlich and Noam Shental
Milenkovic et al. (UIUC)
CS and LRMC
May 2012
1 / 54
What information theorists should know about
Bioinformatics
Bioinformatics may be a source for many interesting problems in
coding theory, statistics and signal processing, but there exists no
unifying theory of bioinformatics.
Bioinformatics
Bioinformatics is data-driven: without testing a model/theory on
data, results have very little credibility. Data is noisy and modeling is
hard.
Bioinformatics
It appears to be a difficult task to reconcile the two areas...
Something for both Sides: Sparsity
Group Testing for Experimental Design.
  - Group testing for genotyping.
  - Group testing for synonymous coding studies.
Causal Compressive Sensing and Low-Rank Completion.
  - Gene regulatory networks.
  - Synthetic lethality/Protein-Protein Interaction (PPI) inference.
And Some Hard Problems...
Reconstructing sequences from traces and other problems
regarding sequences...
  - De novo protein sequencing via Tandem Mass Spectrometry:
    reconstructing sequences based on composition multisets.
Let’s start with group testing...
Test matrix (n rows, one per test; N columns, one per subject; m positive subjects) and test result:
$$
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 0
\end{bmatrix}
\qquad
\text{Result: }
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 1 \end{bmatrix}
$$
Origin of group testing: blood pooling of soldiers during recruitment in WWII
(Dorfman, 1943); n tests and N test subjects, described via a test matrix, with
m ≪ N positives.
Mathematical Formulation I
Group testing amounts to finding the indices of subjects $\{i_1, \dots, i_m\}$ such that
$$y = x_{i_1} \vee x_{i_2} \vee \dots \vee x_{i_m},$$
for given y. Here, $\vee$ denotes binary OR, and $x_i \in \{0, 1\}^n$ denotes the
signature of subject i (i.e., the i-th column of the test matrix).
  - OR: if one subject in a pool is positive, the test is positive.
Problem of interest: what is the smallest number of non-adaptive
measurements n needed to discover m defectives among N
subjects?
$$n \approx \text{const} \cdot m^2 \log(N)$$
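The OR model above can be tried out on a small instance. The following is a minimal sketch, assuming NumPy is available; the 4 x 6 matrix, the chosen positives, and the function names `or_test_results` and `comp_decode` are illustrative, and the decoder is the simple rule "clear every subject that appears in a negative test", not a method attributed to the talk.

```python
import numpy as np

def or_test_results(A, positives):
    """Return y = OR of the columns of A indexed by `positives` (0-based)."""
    A = np.asarray(A, dtype=bool)
    return A[:, list(positives)].any(axis=1)

def comp_decode(A, y):
    """Keep only the subjects that never appear in a negative test."""
    A = np.asarray(A, dtype=bool)
    y = np.asarray(y, dtype=bool)
    in_negative_test = A[~y].any(axis=0)  # subjects pooled in some negative test
    return [i for i in range(A.shape[1]) if not in_negative_test[i]]

# Illustrative test matrix: 4 tests (rows), 6 subjects (columns).
A = [[1, 0, 1, 0, 0, 1],
     [0, 1, 0, 0, 0, 1],
     [1, 0, 0, 0, 1, 1],
     [1, 0, 1, 1, 1, 0]]
positives = [1, 3]            # assumed positive subjects (0-based indices)
y = or_test_results(A, positives)
decoded = comp_decode(A, y)
```

For this instance the two negative tests clear every subject except the assumed positives, so the decoder returns exactly the positive set; in general this rule can only guarantee a superset of the positives.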
Mathematical Formulation II
Two coding-theoretic paradigms (Kautz and Singleton, 1964)
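  - m-separable matrices:
    $$x_{j_1} \vee x_{j_2} \vee \dots \vee x_{j_l} \neq x_{i_1} \vee \dots \vee x_{i_s}, \quad l, s \le m$$
  - m-disjunct matrices:
    $$\mathrm{supp}(x_i) \not\subseteq \bigcup_{l \neq i,\; l \le m} \mathrm{supp}(x_l)$$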
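The disjunctness condition can be checked by brute force on small matrices. This is a sketch only, assuming NumPy; the name `is_disjunct` is hypothetical and the search is exponential in m, so it is meant for toy examples.

```python
from itertools import combinations
import numpy as np

def is_disjunct(A, m):
    """True if no column's support lies inside the union of the supports
    of any m other columns (brute-force check)."""
    A = np.asarray(A, dtype=bool)
    n_cols = A.shape[1]
    for i in range(n_cols):
        others = [j for j in range(n_cols) if j != i]
        # Checking groups of exactly m other columns suffices: a union of
        # fewer columns is contained in the union of some m columns.
        for group in combinations(others, min(m, len(others))):
            union = A[:, list(group)].any(axis=1)
            if not (A[:, i] & ~union).any():
                return False  # column i is covered by this group
    return True

# Illustrative matrix: every column has weight 2, no column contains another.
B = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
one_disjunct = is_disjunct(B, 1)
two_disjunct = is_disjunct(B, 2)
```

In this toy case the matrix is 1-disjunct but not 2-disjunct, since the union of any two columns covers all three rows.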
Applications in Bioinformatics
Out of $3 \times 10^9$ bases, about $10^7$ single point mutations (SNPs).
GAT
GAT
GAT
GAT
GAT
GAT
ATTCGTACGGAAT
ATTCGTACTGAAT
ATTCGTACGGAAT
ATTCGTACGGAAT
GTTCGTACGGAAT
GTTCGTACGGAAT
SNPs (Single Nucleotide Polymorphisms)
Applications in Bioinformatics
Out of $3 \times 10^9$ bases, about $10^7$ single point mutations (SNPs).
Asthma
Systemic Sclerosis
Lung Cancer
Type II Diabetes
Lupus
END1
Fibrilin 1
MMP1
Syn1A
Pro
Healthy individual: 2 normal genes (alleles).
Carrier of disease: 1 normal and 1 faulty gene (allele).
We are still mixing blood, but...
Illumina high-throughput sequencer: take DNA from an individual, sequence,
determine which SNPs are present
Source: giga.ulg.ac.be
Cost of sample preparation/sequencing is high!
Illumina high-throughput sequencer: take DNA from a number of individuals,
sequence whole pool, determine which SNPs are present based on
combinatorial designs!
(Erlich et al., 2009; Shental et al., 2010)
Source: giga.ulg.ac.be
On the Cover of a Magazine
Source: Erlich Lab
Synonymous Coding
DNA-RNA-Protein-Life
4 bases, three bases per codon: $4^3 = 64$ options; only 20 amino acids.
[Figure: the codons TTA and TTG, both mapped to Leu, in a sequence of positions labeled W and S.]
Synonymous coding (SC)
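The counting argument behind synonymous coding can be reproduced in a few lines. This is a small sketch using only the standard library; the helper name `synonymous_variants` is hypothetical, and only the six leucine codons of the standard genetic code are listed, not a full codon table.

```python
from itertools import product

BASES = "TCAG"
# All three-letter words over four bases: 4**3 of them.
codons = ["".join(c) for c in product(BASES, repeat=3)]

# The six codons that encode leucine (Leu) in the standard genetic code.
LEUCINE = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}

def synonymous_variants(codon, family):
    """Other codons in `family` that encode the same amino acid as `codon`."""
    return sorted(family - {codon}) if codon in family else []

leu_alternatives = synonymous_variants("TTA", LEUCINE)
```

Because 64 codons map to only 20 amino acids, most amino acids have several interchangeable codons; swapping TTA for TTG changes the DNA but not the protein.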
Genotyping and Group Testing: Plug-and-Play?
Compressed genotyping (Erlich et al., IT Special Issue on Molecular Biology,
2010; Erlich and Shental, 2011)
  - Diagonally spaced ones in the test matrix (robotic arm movements
    highly restricted);
  - Need sparse matrices to keep the mixing simple and precise
    (Thierry-Mieg et al., STDs (shifted transversal designs)).
Source: Hamilton Robotics
Genotyping and Group Testing: Plug-and-Play?
Compressed genotyping (Erlich et al., IT Special Issue on Molecular Biology,
2010; Erlich and Shental, 2011)
  - Diagonally spaced ones in the test matrix (robotic arm movements
    highly restricted);
[Figure, from a page headed "Nucleic Acids Rese": six panels (a)-(f) plotting measurement value (scale ×10^-3, 0 to 2) against pool index (0 to 50), for reads per person = 40, 400 and ∞, each with read error e_r = 0.01 and e_r = 0. One panel reports #errors: 3; the other five report #errors: 0.]
Genotyping: The Constraints
A general theory of genotyping (Emad and M., ITW'2011, ISIT'2012):
  - Non-uniform DNA sampling: availability of genetic material.

Test matrix (n rows, N columns, entries in {0, 1, 2}; m positive subjects) and test result y (entries in {0, ..., q-1}):
$$
\begin{bmatrix}
0 & 1 & 0 & 1 & 2 & 0 & 0 & 2 & 1 & 1 \\
1 & 2 & 0 & 1 & 1 & 1 & 2 & 2 & 2 & 1 \\
2 & 0 & 2 & 2 & 0 & 2 & 1 & 1 & 1 & 1 \\
0 & 2 & 1 & 0 & 2 & 0 & 1 & 2 & 0 & 0 \\
1 & 1 & 0 & 2 & 1 & 1 & 1 & 1 & 2 & 1
\end{bmatrix}
\qquad
y = \begin{bmatrix} 1 \\ 1 \\ 3 \\ 0 \\ 2 \end{bmatrix}
$$
  - Thresholds in the above example: {0, 1}{2}{3}{4, 5, 6}.
  - Copy number variation (most people have two copies of each gene,
    some have up to five copies): replication numbers
    $c_1, \dots, c_n \in \{1, 2, 3, 4, 5\}$.
  - Semi-quantitative testing: limited readout precision.
[Figure: readout scales with markings from 0 to 95.]
The output of test k is one of Q levels, set by thresholds $\eta_0 = 0 < \eta_1 < \dots < \eta_Q$:
$$y_k = q \quad \text{if} \quad \eta_q \le \sum_{j=1}^{m} x_{k,i_j} \le \eta_{q+1} - 1, \qquad q \in \{0, 1, \dots, Q-1\}.$$
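The thresholded readout can be sketched on a small instance with thresholds {0, 1}{2}{3}{4, 5, 6}. This is a minimal example assuming NumPy; the 5 x 10 matrix is one reading of a ternary test-matrix example, the positive set is an assumption picked for illustration, and `semi_quantitative_output` is a hypothetical name.

```python
import bisect
import numpy as np

def semi_quantitative_output(A, positives, lower_edges):
    """Quantize the per-test sums over the positive columns.

    `lower_edges` holds the smallest value of every interval except the
    first, e.g. [2, 3, 4] for the intervals {0,1}{2}{3}{4,5,6}.
    """
    A = np.asarray(A)
    sums = A[:, list(positives)].sum(axis=1)
    return [bisect.bisect_right(lower_edges, int(s)) for s in sums]

# Illustrative ternary test matrix: 5 tests (rows), 10 subjects (columns).
A = [[0, 1, 0, 1, 2, 0, 0, 2, 1, 1],
     [1, 2, 0, 1, 1, 1, 2, 2, 2, 1],
     [2, 0, 2, 2, 0, 2, 1, 1, 1, 1],
     [0, 2, 1, 0, 2, 0, 1, 2, 0, 0],
     [1, 1, 0, 2, 1, 1, 1, 1, 2, 1]]
positives = [2, 3, 9]                 # assumed positives (0-based indices)
readout = semi_quantitative_output(A, positives, [2, 3, 4])
```

With these assumed positives the per-test sums are 2, 2, 5, 1, 3, which quantize to the readout 1, 1, 3, 0, 2.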
  - Two-dimensional testing (same mutation may be involved in
    multiple diseases).
  - Family tree structure: diseases run in families and testing strategy
    should be governed by Mendel's law.
  - Probabilistic group testing: individuals have a certain probability of
    being carriers.
An Information Theoretic Approach
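Group testing as an information theoretic problem (Malyutov, 1980s;
D'yachkov, 2000)
m - number of positive subjects; η - thresholds; $P_T$ - probability
distribution for thresholds
$$C = \sup_{P_T,\,\eta} \alpha(m, P_T, \eta)$$
$$\alpha(m, P_T, \eta) = \max_{i=1,\dots,m} \frac{I\big(X_{D_1^{\{i\}}},\, X_{D_2^{\{i\}}}\big)}{i}$$
[Figure: the m positives are split into a set $D_1^{\{i\}}$ of size i and a set $D_2^{\{i\}}$ of size m - i; the corresponding test-matrix entries $X_{D_1^{\{i\}}}$, $X_{D_2^{\{i\}}}$ and the test output y are shown.]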
An Information Theoretic Approach
Capacity lower bounds evaluated numerically.
[Figure: lower bound on C (vertical axis, 0.1 to 0.8) versus m (horizontal axis, 2 to 10), with curves for Q = 2 and Q = 3.]
Code constructions based on disjunct and array codes.
Positives identification via belief propagation (Emad and M., 2012;
Huang and M., 2009).
An Information Theoretic Approach
Optimal thresholds - examples for q = 3:

m    Probability Distribution   Thresholds
4    {0.18, 0.64, 0.18}         {0,1,2,3}{4}{5,6,7}
6    {0.6, 0.07, 0.33}          {0,1,2,3}{4,5}{6,7,8,9,10,11,12}
8    {0.1, 0.8, 0.1}            {0,1,...,7}{8}{9,10,...,16}
10   {0.58, 0.28, 0.14}         {0,1,...,4}{5,6}{7,8,...,20}

Code constructions based on disjunct and array codes:
$$[C_1\ C_2\ \dots\ C_b], \qquad C_i = f_i(\eta)\, C$$
Synonymous Coding: The Constraints
What are the practical constraints in DNA signal detection?
Signal detection (Lin et al.; Skiena et al., 2012) - insertion of
synonymous codes into wild-type DNA is costly.
[Figure: design with 64 entries, each either W or S.]
Need to minimize number of W-S transitions (insertions) (Skiena et al., 2012).
Theory of cyclic disjunct codes (Colbourn et al., 1990s; D'yachkov et al.;
etc.).
Synonymous Coding: The Folding Constraint
Example: two synonymously coded RNA fragments fold very
differently (Vienna folding code, Zuker et al.)
Coding theoretic approach to RNA folding (M. and Kashyap, 2007)
[Figure: secondary structures of two synonymously coded RNA fragments.]
Compressive Sensing in Bioinformatics
$x \in \mathbb{R}^n$: when is reconstruction possible? Gelfand, Kashin, 1977; Bresler et
al., 1996; Donoho et al., 2004; Candès, Romberg, Tao, 2005
CS Signal Reconstruction
$\ell_0$-minimization:
Find a signal $\hat{x}$ such that $|\mathrm{supp}(\hat{x})| = \|\hat{x}\|_0 \le K$ and $y = \Phi\hat{x}$.
$\ell_1$-minimization:
$\min \|\hat{x}\|_1$ subject to $y = \Phi\hat{x}$.
Greedy algorithms:
OMP [Tropp, 2004], SP [Dai & Milenkovic, 2008], CoSaMP [Needell & Tropp,
2008], IHT [Blumensath and Davies, 2009], ...
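Of the greedy algorithms listed, OMP is the shortest to write down. The following is a textbook-style sketch assuming NumPy, not the implementation used in any of the cited papers; the 4 x 8 sensing matrix (identity next to a scaled Hadamard matrix) and the 2-sparse signal are illustrative choices.

```python
import numpy as np

def omp(Phi, y, K, tol=1e-10):
    """Orthogonal matching pursuit: pick the column most correlated with
    the residual, then refit y on all chosen columns by least squares."""
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(K):
        if np.linalg.norm(residual) <= tol:
            break
        correlations = np.abs(Phi.T @ residual)
        if support:
            correlations[support] = 0.0      # never pick a column twice
        support.append(int(np.argmax(correlations)))
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coeffs
    x_hat = np.zeros(Phi.shape[1])
    if support:
        x_hat[support] = coeffs
    return x_hat

# Illustrative sensing matrix with unit-norm columns: [I | H/2].
H = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)
Phi = np.hstack([np.eye(4), H / 2])
x_true = np.zeros(8)
x_true[0], x_true[1] = 3.0, 2.0      # assumed 2-sparse signal
y = Phi @ x_true                     # 4 measurements of 8 unknowns
x_hat = omp(Phi, y, K=2)
```

On this instance the first iteration selects column 0 and the second selects column 1, after which the residual vanishes. Recovery is not guaranteed for arbitrary matrices and sparsity levels.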
Applications of CS to Bioinformatics
Compressive sensing DNA microarrays (Dai, Sheikh, Baraniuk and
M., 2009)
Network analysis (example: vertex distance matrix, gene
expressions, protein interaction affinities, etc.): gene regulatory
networks and protein-protein interaction networks.
Gene Regulatory Networks
Transcription factors are proteins, coded by genes.
Transcription factors suppress or promote the activity of
A Simple Dynamical System (DS) Model
Gene expressions: concentration of mRNA corresponding to
genes, $X(t) = (X_1(t), \dots, X_N(t))$
Linear/nonlinear dynamical system model:
$$X(t + 1) = F(A(t)X(t)) + n(t)$$
$A(t) \in \mathbb{R}^{N \times N}$ is sparse, with some column/row weight distribution.
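A trajectory of this model can be simulated directly. This is a minimal sketch assuming NumPy, a time-invariant matrix A, Gaussian noise, and `np.tanh` as a default nonlinearity F; all of these choices and the name `simulate_expression` are illustrative, not taken from the talk.

```python
import numpy as np

def simulate_expression(A, x0, steps, F=np.tanh, noise_std=0.0, seed=0):
    """Iterate X(t+1) = F(A X(t)) + n(t) and return all states as rows."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    X = np.empty((steps + 1, len(x0)))
    X[0] = x0
    for t in range(steps):
        noise = noise_std * rng.standard_normal(len(x0))
        X[t + 1] = F(A @ X[t]) + noise
    return X

# Two-gene toy network: gene 0 decays, gene 1 is driven by gene 0.
A = [[0.5, 0.0],
     [1.0, 0.0]]
linear_run = simulate_expression(A, [1.0, 0.0], steps=2, F=lambda v: v)
```

With the identity map for F and no noise, the state sequence is (1, 0), (0.5, 1), (0.25, 0.5).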
Compressive Sensing for DS Analysis
Granger causality.
Compressive sensing and Granger causality.
Application in inference of causal gene interactions.
Causal Compressive Sensing: Granger Causality I
Granger [1969] (Nobel Prize in Economics): how can one deduce
if a random process X causally influences another random
process Y?
Autoregressive model for X:
$$\tilde{X}_n = a_1 X_{n-1} + a_2 X_{n-2} + \dots + a_m X_{n-m}$$
$$MSE_X = \|X_n - \tilde{X}_n\|$$
Causal Compressive Sensing: Granger Causality I
Granger [1969] (Nobel Prize in Economics): how can one deduce
if a random process X causally influences another random
process Y?
Does including Y in the autoregressive model reduce the
estimation error?
$$\tilde{X}_{Y,n} = a_1 X_{n-1} + \dots + a_m X_{n-m} + b_0 Y_n + b_1 Y_{n-1} + \dots + b_s Y_{n-s}$$
$$MSE_{X,Y} = \|X_n - \tilde{X}_{Y,n}\|$$
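The comparison of the two prediction errors can be sketched with ordinary least squares. This is an illustration assuming NumPy; the helper names `lagged` and `granger_mse`, the use of the mean squared residual as the error measure, and the synthetic series are assumptions made for the example.

```python
import numpy as np

def lagged(series, lags, start):
    """Columns series[n - lag] for n = start, ..., len(series) - 1."""
    return np.column_stack(
        [series[start - lag: len(series) - lag] for lag in lags])

def granger_mse(x, y, m, s):
    """Mean squared residual of X_n predicted from its own past (lags 1..m),
    and from its own past plus Y (lags 0..s)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    start = max(m, s)
    target = x[start:]
    own = lagged(x, range(1, m + 1), start)
    both = np.hstack([own, lagged(y, range(0, s + 1), start)])

    def mse(design):
        coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
        return float(np.mean((target - design @ coeffs) ** 2))

    return mse(own), mse(both)

# Synthetic example in which Y drives X: X_n = 0.5 X_{n-1} + Y_{n-1}.
rng = np.random.default_rng(0)
y = rng.standard_normal(200)
x = np.zeros(200)
for n in range(1, 200):
    x[n] = 0.5 * x[n - 1] + y[n - 1]
mse_x, mse_xy = granger_mse(x, y, m=1, s=1)
```

Here including Y drives the residual to numerical zero while the X-only model keeps an error on the order of the variance of Y, which is the signature one looks for.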
Causal Compressive Sensing: Granger Causality II
Compressive sensing corresponds to a sparse linear model:
$$g = \Phi x = x_1 \phi_1 + x_2 \phi_2 + \dots + x_N \phi_N$$
Compressive sensing corresponds to a combination of linear
models:
$$g = [\Phi_X\ \Phi_Y] \begin{bmatrix} x \\ y \end{bmatrix} = x_1 \phi_{X,1} + \dots + x_N \phi_{X,N} + y_1 \phi_{Y,1} + \dots + y_N \phi_{Y,N}$$
Sensing vectors may be delayed versions of the same random
process:
$$\phi_t = \phi(t_0 - t).$$
Causal Compressive Sensing: Granger Causality II
Compressive sensing corresponds to a sparse linear model:
$$g = \Phi x = x_1 \phi_1 + x_2 \phi_2 + \dots + x_N \phi_N$$
In principle, can make the model non-linear:
$$g = [F(\Phi_X)\ H(\Phi_Y)] \begin{bmatrix} x \\ y \end{bmatrix} = x_1 f(\phi_{X,1}) + \dots + x_N f(\phi_{X,N}) + y_1 f(\phi_{Y,1}) + \dots + y_N f(\phi_{Y,N})$$
$$g = x_1 f^1(\phi_{X,1}, \dots, \phi_{X,N}) + \dots + x_N f^N(\phi_{X,1}, \dots, \phi_{X,N}).$$
Causal Gene Interactions I
Causal Gene Interactions II
Gene expression data: time series of gene activity levels
$\phi(t_1), \phi(t_2), \dots, \phi(t_m)$.
Target gene T: $y_T(t_1), y_T(t_2), \dots, y_T(t_m)$. Causal gene: C.
Compressive sensing linear model: activity of T is an unknown
  - sparse combination of activities of genes other than C:
    $$y_T = \Phi_{/C}\, x + r_{/C}$$
  - sparse combination of activities of genes that include C:
    $$y_T = \Phi_{+C}\, x + r_{+C}$$
Causal Gene Interactions III
Determining if C causally influences T:
  - Residuals: $r_{/C} > r_{+C}$;
  - Gene C is in the list of “strongest” contributors (large $x_C$ vs. other
    entries of x).
Remark: Self-loops and indirect causality cannot be captured.
Results - Synthetic Data I
[Figure: network graph with 100 nodes, labeled Node 1 through Node 100.]
Results I
Reduction in MSE at least 80% (coefficient of determination);
Identified 85% of causal relationships.
Almost all errors caused by problems in CS reconstruction algorithms
(no RIP, no incoherence).
Results II
[Gardner, 2004]: SOS network of E. coli

R/C    dinI    lexA   recA   recF   rpoD   rpoH   rpoS   ssb    uCD
dinI   0       1      0      0      1      0      0      0      0
lexA   0 (1)   1      0      0      1      0      0      0      0 (1)
recA   0 (1)   1      0      0      1      0      0      0      0
recF   0       0      0      0      1      0      1      0      0
rpoD   0       1      0      0      1      1      0      0      0 (?)
rpoH   0       0      0      0      1      1      0      0      0
rpoS   0 (1)   0      0      0      1      0      1      0      0
ssb    0 (?)   1      0      0      1      0      0      0      0
uCD    0 (1)   1      0      0      1      0      0      0      0
Results II
[Galkin et al., 2011]: SOS network of E. coli - the role of dinI
Formal definition of the LRMC Problem
Let $X \in \mathbb{R}^{m \times n}$ be a low-rank matrix: $r \ll \min(m, n)$.
One does not have full information about X, but only knows a subset
of entries.
  - Observation subset $\Omega \subset [m] \times [n]$.
Matrix Completion: given the low-rank assumption, Ω and $X_\Omega$, what is X?
The Framework
“$\ell_0$-approach”:
$$(P0)\quad \text{minimize } \mathrm{rank}(X') \quad \text{subject to } P_\Omega(X') = X_\Omega.$$
“$\ell_1$-approach”:
$$(P1)\quad \text{minimize } \|X'\|_* \quad \text{subject to } P_\Omega(X') = X_\Omega,$$
where $\|X'\|_* = \sum_i \sigma_i(X')$.
(P1) ≡ (P0) if the singular vectors of X are “sufficiently spread”:
uncorrelated with the standard basis.
$|\Omega| = c\,(n + m)\, r \log(\max(n, m))$ observed entries for unique recovery.
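A completion heuristic can be sketched without a convex solver. Note that the example below does not solve (P0) or (P1): it swaps in a simple alternating scheme (truncate to rank r by SVD, then restore the observed entries), assuming NumPy; the function name `complete_low_rank` and the toy matrix are illustrative.

```python
import numpy as np

def complete_low_rank(X_obs, mask, rank, iterations=500):
    """Alternate between a rank-`rank` SVD truncation and re-imposing the
    observed entries. `mask` is True where an entry of X is known."""
    mask = np.asarray(mask, dtype=bool)
    X_obs = np.where(mask, np.asarray(X_obs, dtype=float), 0.0)
    Z = X_obs.copy()                      # unknown entries start at zero
    for _ in range(iterations):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        Z = np.where(mask, X_obs, low_rank)
    return Z

# Toy example: rank-1 matrix u v^T with one hidden entry.
u = np.array([1.0, 2.0, 3.0])
X = np.outer(u, u)
mask = np.ones_like(X, dtype=bool)
mask[2, 2] = False                        # pretend this entry is unknown
completed = complete_low_rank(X, mask, rank=1)
```

The observed entries are reproduced exactly by construction; how well the hidden entries are filled in depends on the sampling pattern and on the spread of the singular vectors.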
Protein-Protein Interaction (PPI) Prediction
Sequence based methods: assumption that interacting proteins
belong to spatially confined conserved regions.
  - Sequence alignment.
  - Phylogenetic tree analysis.
Probabilistic network inference: assumption that interacting
proteins form special network motifs.
  - Graphlet models.
  - Bayesian models.
References: many, just to mention a few...
  - Grama et al., 2010.
  - Kim et al., 2006.
  - Valencia and Pazos, 2010.
  - Przulj et al., 2006-2010.
Protein-Protein Interactions
Let $X \in \mathbb{R}^{m \times n}$ be a matrix of interaction probabilities between
proteins, say

Protein \ Protein   PR1     PR2     PR3     PR4
PR1                 *       0.765   0.99    ?
PR2                 0.765   *       0.112   0.5
PR3                 0.99    0.112   *       ?
PR4                 ?       0.5     ?       *

Can you predict the missing entries?
Why LRMC for PPI inference?
Low-rank assumption can be justified in many ways:
  - Not all pairwise interactions are independent.
  - There is a relatively small number of degrees of freedom (“factors”)
    that influence protein binding (compared to the number of proteins).
    Otherwise, binding would be physically impossible.
  - For small sampling rates, the solution may not be unique. Pick
    “dense subsets” of proteins.
  - The dependencies may not be linear (easy to handle in proposed
    framework).
PPI Prediction via Completion
STRING database: Yeast Proteins (roughly 1200 out of 6000 proteins
used in experiment)

PPpair             Affinity   Biologically Plausible?
YBL032W:YDL220C    1.00       Yes
YDL220C:YDR510W    1.00       ?
YDL220C:YLL039C    1.00       ?
YDR381W:YLR418C    1.00       Yes
YBL032W:YLL002W    1.00       ?
Traces of Sequences
Genomic and Proteomic Sequences
Channel N
Substitution, Deletion/Insertion channel: Levenshtein, 1990s;
Mitzenmacher et al., 2008.
Substring extraction: Genome assembly problem.
Multiset of prefixes/suffixes: Protein mass spectrometry problem.
Traces of Sequences I
Problem formulation: given $S = s_1 s_2 s_3 \dots s_n$, $s_i \in \{0, 1\}$.
Traces:
$$\{\{s_1\}, \{s_1, s_2\}, \{s_2\}, \{s_2, s_3\}, \dots, \{s_n\}, \{s_n, s_{n-1}\}, \dots, \{s_1, s_2, \dots, s_n\}\}.$$
Example: 0001 gives $\{0, 0, 0, 1, 0^2, 0^2, 01, 0^3, 0^2 1, 0^3 1\}$
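The trace multiset of a binary string is easy to enumerate. This is a short sketch using only the standard library; the name `composition_multiset` and the encoding of a composition as a pair (number of zeros, number of ones) are choices made for the example.

```python
from collections import Counter

def composition_multiset(s):
    """Multiset of compositions (zeros, ones) over all contiguous
    substrings of the binary string s."""
    counts = Counter()
    for i in range(len(s)):
        zeros = ones = 0
        for j in range(i, len(s)):
            if s[j] == "0":
                zeros += 1
            else:
                ones += 1
            counts[(zeros, ones)] += 1   # composition of s[i..j]
    return counts

traces = composition_multiset("0001")
```

For 0001 this yields ten compositions in total: three copies of 0, one 1, two copies of 0^2, and one each of 01, 0^3, 0^2 1 and 0^3 1. A string and its reversal always produce the same multiset, so reconstruction can only be unique up to reversal.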
Traces of Sequences II
When can the sequence be reconstructed? (Acharya et al., ISIT
2010)
A string is reconstructable iff its length n satisfies one of:
  - n ≤ 7;
  - n ≥ 8 and n + 1 is a prime or twice a prime.
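The length condition can be tabulated directly. This is a small sketch that encodes the condition exactly as stated above, using trial division for primality; the function names are hypothetical.

```python
def is_prime(k):
    """Trial-division primality test, adequate for small k."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def length_is_reconstructable(n):
    """n <= 7, or n + 1 is a prime or twice a prime."""
    if n <= 7:
        return True
    k = n + 1
    return is_prime(k) or (k % 2 == 0 and is_prime(k // 2))

good_lengths = [n for n in range(8, 17) if length_is_reconstructable(n)]
```

Between 8 and 16 the qualifying lengths are 9, 10, 12, 13 and 16, since 10 = 2 · 5, 11, 13, 14 = 2 · 7 and 17 meet the condition while 9, 12, 15 and 16 do not.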
Two Announcements
IMA Workshop on Group Testing in Biology, 2012.
ISIT Tutorial on Bioinformatics (with Sharon Aviran, UC Berkeley),
2012.
Thank you!