Sparse Problems in Bioinformatics
Olgica Milenkovic
Joint work with: Wei Dai and Amin Emad
University of Illinois at Urbana-Champaign
Acknowledgment: NSF CCF 1117980, NSF CCF 0729216, NSF CCF 0809895, NSF CCF 01218764, NSF CSoI
Thanks to: Yaniv Erlich and Noam Shental
Milenkovic et al. (UIUC) CS and LRMC May 2012 1 / 54

What information theorists should know about Bioinformatics
- Bioinformatics may be a source of many interesting problems in coding theory, statistics and signal processing, but there exists no unifying theory of bioinformatics.
- Bioinformatics is data-driven: results that have not been tested on data carry very little credibility. Data is noisy and modeling is hard.
- It appears to be a difficult task to reconcile the two areas...

Something for both Sides: Sparsity
- Group testing for experimental design.
  - Group testing for genotyping.
  - Group testing for synonymous coding studies.
- Causal compressive sensing and low-rank completion.
  - Gene regulatory networks.
  - Synthetic lethality/Protein-Protein Interaction (PPI) inference.

And Some Hard Problems...
- Reconstructing sequences from traces and other problems regarding sequences...
  - De novo protein sequencing via tandem mass spectrometry: reconstructing sequences based on composition multisets.

Let's start with group testing...
[Figure: binary test matrix with n rows (tests) and N columns (subjects), m of which are positive, together with the test result vector (0, 1, 0, 1).]
Origin of group testing: blood pooling of soldiers during recruitment in WWII (Dorfman, 1943); n tests and N test subjects, described via a test matrix, with $m \ll N$ positives.
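The pooling model on this slide can be sketched in a few lines. The block below is a minimal illustration, not the authors' decoder: the 4 x 6 design `A` and the choice of positives are hypothetical, and `eliminate` is a simple elimination rule (any subject that appears in a negative pool is cleared), which in general returns a superset of the positives.

```python
def pool_outcomes(A, positives):
    """y_k is the OR of A[k][i] over the positive subjects i."""
    return [int(any(row[i] for i in positives)) for row in A]


def eliminate(A, y):
    """Subjects that appear in no negative pool (a superset of the positives)."""
    n_subjects = len(A[0])
    cleared = {i
               for row, outcome in zip(A, y) if outcome == 0
               for i in range(n_subjects) if row[i]}
    return [i for i in range(n_subjects) if i not in cleared]


# Hypothetical design: 4 pools (rows), 6 subjects (columns), 0-indexed.
A = [[1, 0, 1, 0, 0, 1],
     [0, 1, 0, 0, 0, 1],
     [1, 0, 0, 0, 1, 1],
     [1, 0, 1, 1, 1, 0]]

y = pool_outcomes(A, positives=[1, 3])
candidates = eliminate(A, y)
```

For this particular design the candidate list coincides with the true positives; with a poorly chosen matrix some cleared-by-nobody negatives would remain in the list, which is what the separability and disjunctness conditions are meant to rule out.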
Mathematical Formulation I
- Group testing amounts to finding the indices $\{i_1, \dots, i_m\}$ of the positive subjects such that
  $y = x_{i_1} \vee x_{i_2} \vee \dots \vee x_{i_m}$
  for a given $y$. Here $\vee$ denotes binary OR, and $x_i \in \{0,1\}^n$ denotes the signature of subject $i$ (i.e., the corresponding column of the test matrix). OR: if at least one subject in a pool is positive, the test is positive.
- Problem of interest: what is the smallest number of non-adaptive measurements $n$ needed to discover $m$ defectives among $N$ subjects? $\mathrm{const} \cdot m^2 \log(N)$.

Mathematical Formulation II
Two coding-theoretic paradigms (Kautz and Singleton, 1964):
- m-separable matrices: $x_{j_1} \vee x_{j_2} \vee \dots \vee x_{j_l} \neq x_{i_1} \vee \dots \vee x_{i_s}$ for any two distinct index sets with $l, s \leq m$.
- m-disjunct matrices: $\mathrm{supp}(x_i) \nsubseteq \bigcup_{l \neq i,\ l \leq m} \mathrm{supp}(x_l)$.

Applications in Bioinformatics
Out of $3 \times 10^9$ bases, about $10^7$ single point mutations (SNPs).
[Figure: six aligned DNA sequences that differ at single positions, e.g. ATTCGTACGGAAT, ATTCGTACTGAAT and GTTCGTACGGAAT. Caption: SNPs (Single Nucleotide Polymorphisms).]
[Figure: diseases (Asthma, Systemic Sclerosis, Lung Cancer, Type II Diabetes, Lupus) and associated genes (END1, Fibrillin 1, MMP1, Syn1A, Pro).]
- Healthy individual: 2 normal genes (alleles).
- Carrier of disease: 1 normal and 1 faulty gene (allele).

We are still mixing blood, but...
Illumina high-throughput sequencer: take DNA from an individual, sequence it, and determine which SNPs are present.
Source: giga.ulg.ac.be

Cost of sample preparation/sequencing is high!
Illumina high-throughput sequencer: take DNA from a number of individuals, sequence the whole pool, and determine which SNPs are present based on combinatorial designs! (Erlich et al., 2009; Shental et al., 2010)
Source: giga.ulg.ac.be
On the Cover of a Magazine
Source: Erlich Lab

Synonymous Coding
DNA-RNA-Protein-Life
4 bases, triple codon = 64 options; only 20 amino acids.
[Figure: codons TTA and TTG both code for Leu, with W/S labels. Caption: Synonymous coding (SC).]

Genotyping and Group Testing: Plug-and-Play?
Compressed genotyping (Erlich et al., IT Special Issue on Molecular Biology, 2010; Erlich and Shental, 2011):
- Diagonally spaced ones in the test matrix (robotic arm movements are highly restricted);
- Sparse matrices are needed to keep mixing simple and precise (Thierry-Mieg et al., STDs (shifted transversal designs)).
Source: Hamilton Robotics
[Figure (page header: Nucleic Acids Rese...): measurement value (scale $\times 10^{-3}$) versus pool index (0 to 50) in six panels (a)-(f), for reads per person = 40, 400 and $\infty$, each with $e_r = 0.01$ and $e_r = 0$. Reported error counts: 3 in panel (a), 0 in panels (b)-(f).]

Genotyping: The Constraints
A general theory of genotyping (Emad and M., ITW'2011, ISIT'2012):
- Non-uniform DNA sampling: availability of genetic material.
  [Figure: 5 x 10 test matrix with entries in {0, 1, 2} (alphabet size q, entries up to q-1), m positives among N subjects, and test outcome vector y = (1, 1, 3, 0, 2).]
  Thresholds in the above example: {0, 1}{2}{3}{4, 5, 6}.
- Copy number variation (most people have two copies of each gene, some have up to five copies): replication numbers $c_1, \dots, c_n \in \{1, 2, 3, 4, 5\}$.
- Semi-quantitative testing: limited readout precision.
  [Figure: dial gauge illustrating a quantized readout.]
  The outcome of test $k$ depends only on the bin into which the number of positives in the pool falls:
  $y_k = 0$ if $\sum_{j=1}^{m} x_{k,i_j} \in \{0, 1, \dots, \eta_1 - 1\}$,
  $y_k = 1$ if $\sum_{j=1}^{m} x_{k,i_j} \in \{\eta_1, \dots, \eta_2 - 1\}$,
  ...
  $y_k = Q - 1$ if $\sum_{j=1}^{m} x_{k,i_j} \in \{\eta_{Q-1}, \dots, \eta_Q - 1\}$.
- Two-dimensional testing (the same mutation may be involved in multiple diseases).
- Family tree structure: diseases run in families and the testing strategy should be governed by Mendel's law.
  - Probabilistic group testing: individuals have certain probabilities of being carriers.

An Information Theoretic Approach
Group testing as an information theoretic problem (Malyutov 1980's, Dyachkov 2000).
$m$ - number of positive subjects; $\eta$ - thresholds; $P_T$ - probability distribution for thresholds.
$C = \sup_{P_T,\, \eta} \alpha(m, P_T, \eta)$
$\alpha(m, P_T, \eta) = \max_{i = 1, \dots, m} I\big(X_{D_1^{\{i\}}}, X_{D_2^{\{i\}}}\big)$
[Figure: the m positives split into two groups of sizes i and m - i, with signatures $X_{D_1^{\{i\}}}$ and $X_{D_2^{\{i\}}}$ and test outcome y.]

An Information Theoretic Approach
Capacity lower bounds evaluated numerically.
[Figure: lower bound on C (0.1 to 0.8) versus m (2 to 10), for Q = 2 and Q = 3.]
Code constructions based on disjunct and array codes.
Positives identification via belief propagation (Emad and M., 2012; Huang and M., 2009).
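The semi-quantitative test model from the Genotyping: The Constraints slides can be made concrete with a short sketch. The threshold list `ETA` encodes the bins {0, 1}{2}{3}{4, 5, 6} quoted on the slide; the 3 x 4 design `A`, the choice of positives, and the function names are hypothetical and only serve the illustration.

```python
import bisect


def sq_outcome(count, thresholds):
    """Return r such that eta_r <= count < eta_{r+1}, with eta_0 = 0."""
    return bisect.bisect_right(thresholds, count)


def sq_test(A, positives, thresholds):
    """Quantized outcome per pool; A[k][i] is the amount of subject i in pool k."""
    return [sq_outcome(sum(row[i] for i in positives), thresholds) for row in A]


# Thresholds eta_1, eta_2, eta_3 for the bins {0,1}{2}{3}{4,5,6}.
ETA = [2, 3, 4]

# Hypothetical design with entries in {0, 1, 2} (non-uniform sampling).
A = [[0, 1, 2, 0],
     [2, 0, 1, 1],
     [1, 1, 0, 2]]

levels = [sq_outcome(c, ETA) for c in range(7)]
y = sq_test(A, positives=[1, 2], thresholds=ETA)
```

With Q = 2 and a single threshold at 1, the same function reduces to the binary OR model of classical group testing.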
An Information Theoretic Approach
Optimal thresholds - examples for q = 3:

m    Probability distribution    Thresholds
4    {0.18, 0.64, 0.18}          {0,1,2,3}{4}{5,6,7}
6    {0.6, 0.07, 0.33}           {0,1,2,3}{4,5}{6,7,8,9,10,11,12}
8    {0.1, 0.8, 0.1}             {0,1,...,7}{8}{9,10,...,16}
10   {0.58, 0.28, 0.14}          {0,1,...,4}{5,6}{7,8,...,20}

Code constructions based on disjunct and array codes: $[C_1\ C_2\ \dots\ C_b]$, $C_i = f_i(\eta)\, C$.

Synonymous Coding: The Constraints
What are the practical constraints in DNA signal detection?
- Signal detection (Lin et al., Skiena et al., 2012): insertion of synonymous codes into wild type DNA is costly.
  [Figure: design array over the symbols W and S.]
- Need to minimize the number of W-S transitions (insertions) (Skiena et al., 2012).
- Theory of cyclic disjunct codes (Colbourn et al., 1990s, Dyachkov et al., etc.).

Synonymous Coding: The Folding Constraint
Example: two synonymously coded RNA fragments fold very differently (Vienna folding code, Zuker et al.).
Coding theoretic approach to RNA folding (M. and Kashyap, 2007).
[Figure: secondary structures of the two RNA fragments.]

Compressive Sensing in Bioinformatics
$x \in \mathbb{R}^n$: when is reconstruction possible?
Gelfand, Kashin, 1977; Bresler et al., 1996; Donoho et al., 2004; Candès, Romberg, Tao, 2005.

CS Signal Reconstruction
- $\ell_0$-minimization: find a signal $\hat{x}$ such that $|\mathrm{supp}(\hat{x})| = \|\hat{x}\|_0 \leq K$ and $y = \Phi \hat{x}$.
- $\ell_1$-minimization: $\min \|\hat{x}\|_1$ subject to $y = \Phi \hat{x}$.
- Greedy algorithms: OMP [Tropp, 2004], SP [Dai & Milenkovic, 2008], CoSaMP [Needell & Tropp, 2008], IHT [Blumensath and Davies, 2009], ...
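Of the greedy algorithms listed, OMP is the simplest to write down. The block below is a minimal sketch under stated assumptions: columns of `Phi` have equal norm, the refit uses the normal equations with a small Gaussian-elimination helper (`solve`, a hypothetical name), and the 4 x 6 sensing matrix and the 2-sparse signal are made up for the example.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def solve(G, b):
    """Solve the square system G z = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def omp(Phi, y, K):
    """Orthogonal matching pursuit: greedily build a K-sparse x with y ~ Phi x."""
    m, n = len(Phi), len(Phi[0])
    cols = [[Phi[i][j] for i in range(m)] for j in range(n)]
    support, coef, residual = [], [], list(y)
    for _ in range(K):
        # Pick the unused column most correlated with the current residual.
        best = max((j for j in range(n) if j not in support),
                   key=lambda j: abs(dot(cols[j], residual)))
        support.append(best)
        # Least-squares refit on the chosen columns (normal equations).
        G = [[dot(cols[a], cols[b]) for b in support] for a in support]
        coef = solve(G, [dot(cols[a], y) for a in support])
        residual = [y[i] - sum(c * cols[s][i] for c, s in zip(coef, support))
                    for i in range(m)]
    x = [0.0] * n
    for c, s in zip(coef, support):
        x[s] = c
    return x


# Hypothetical 4 x 6 sensing matrix with +/-1 entries (equal column norms).
Phi = [[1,  1,  1,  1,  1,  1],
       [1, -1,  1, -1,  1, -1],
       [1,  1, -1, -1,  1, -1],
       [1, -1, -1,  1, -1, -1]]
x_true = [0, 2, 0, -1, 0, 0]
y = [sum(Phi[i][j] * x_true[j] for j in range(6)) for i in range(4)]
x_hat = omp(Phi, y, K=2)
```

Here four measurements suffice to recover a 2-sparse vector of length six; recovery is not guaranteed for arbitrary matrices, which is where conditions such as RIP or incoherence enter.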
Applications of CS to Bioinformatics
- Compressive sensing DNA microarrays (Dai, Sheikh, Baraniuk and M., 2009).
- Network analysis (examples: vertex distance matrix, gene expressions, protein interaction affinities, etc.): gene regulatory networks and protein-protein interaction networks.

Gene Regulatory Networks
Transcription factors are proteins, coded by genes.
Transcription factors suppress or promote the activity of ...

A Simple Dynamical System (DS) Model
Gene expressions: concentration of mRNA corresponding to genes, $X(t) = (X_1(t), \dots, X_N(t))$.
Linear/nonlinear dynamical system model:
$X(t+1) = F(A(t) X(t)) + n(t)$
$A(t) \in \mathbb{R}^{N \times N}$ is sparse, with some column/row weight distribution.

Compressive Sensing for DS Analysis
- Granger causality.
- Compressive sensing and Granger causality.
- Application in inference of causal gene interactions.

Causal Compressive Sensing: Granger Causality I
Granger [1969] (Nobel Prize in Economics): how can one deduce if a random process X causally influences another random process Y?
Autoregressive model for X:
$\tilde{X}_n = a_1 X_{n-1} + a_2 X_{n-2} + \dots + a_m X_{n-m}$
$\mathrm{MSE}_X = \|X_n - \tilde{X}_n\|$
Does including Y in the autoregressive model reduce the estimation error?
$\tilde{X}_{Y,n} = a_1 X_{n-1} + \dots + a_m X_{n-m} + b_0 Y_n + b_1 Y_{n-1} + \dots + b_s Y_{n-s}$
$\mathrm{MSE}_{X,Y} = \|X_n - \tilde{X}_{Y,n}\|$

Causal Compressive Sensing: Granger Causality II
Compressive sensing corresponds to a sparse linear model:
$g = \Phi x = x_1 \phi_1 + x_2 \phi_2 + \dots + x_N \phi_N$
Compressive sensing corresponds to a combination of linear models:
$g = [\Phi_X\ \Phi_Y] \begin{bmatrix} x \\ y \end{bmatrix} = x_1 \phi_{X,1} + \dots + x_N \phi_{X,N} + y_1 \phi_{Y,1} + \dots + y_N \phi_{Y,N}$
Sensing vectors may be delayed versions of the same random process: $\phi_t = \phi(t_0 - t)$.
In principle, one can make the model non-linear:
$g = [F(\Phi_X)\ H(\Phi_Y)] \begin{bmatrix} x \\ y \end{bmatrix} = x_1 f(\phi_{X,1}) + \dots + x_N f(\phi_{X,N}) + y_1 h(\phi_{Y,1}) + \dots + y_N h(\phi_{Y,N})$
$g = x_1 f^1(\phi_{X,1}, \dots, \phi_{X,N}) + \dots + x_N f^N(\phi_{X,1}, \dots, \phi_{X,N})$

Causal Gene Interactions I
[Figure.]

Causal Gene Interactions II
Gene expression data: time series of gene activity levels $\phi(t_1), \phi(t_2), \dots, \phi(t_m)$.
Target gene T: $y_T(t_1), y_T(t_2), \dots, y_T(t_m)$. Causal gene: C.
Compressive sensing linear model: the activity of T is an unknown sparse combination of the activities of genes other than C:
$y_T = \Phi_{/C}\, x + r_{/C}$
Compressive sensing linear model: the activity of T is an unknown sparse combination of the activities of genes that include C:
$y_T = \Phi_{+C}\, x + r_{+C}$

Causal Gene Interactions III
Determining if C causally influences T:
- Residuals: $r_{/C} > r_{+C}$;
- Gene C is in the list of "strongest" contributors (large $x_C$ vs. other entries of $x$).
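The residual comparison can be illustrated on a toy Granger-style example. This is only a sketch under loud assumptions: it uses ordinary least squares with a single lag instead of sparse CS recovery, it omits the instantaneous term $b_0 Y_n$, and the two time series are synthetic (X is generated so that it depends on the past of Y). The helper name `residual_norm` is hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def residual_norm(target, regs):
    """Norm of the least-squares residual of target on 1 or 2 regressors."""
    if len(regs) == 1:
        (r,) = regs
        coef = [dot(r, target) / dot(r, r)]
    else:
        r1, r2 = regs
        a, b, d = dot(r1, r1), dot(r1, r2), dot(r2, r2)
        p, q = dot(r1, target), dot(r2, target)
        det = a * d - b * b  # assumed nonzero: regressors are not collinear
        coef = [(p * d - q * b) / det, (q * a - p * b) / det]
    fit = [sum(c * r[i] for c, r in zip(coef, regs)) for i in range(len(target))]
    return math.sqrt(sum((t - f) ** 2 for t, f in zip(target, fit)))


# Synthetic data: X_n = 0.5 X_{n-1} + 0.8 Y_{n-1}, so Y "causes" X.
Y = [1.0, 0.0, -1.0, 0.0] * 5
X = [0.0]
for n in range(1, len(Y)):
    X.append(0.5 * X[-1] + 0.8 * Y[n - 1])

target, X_prev, Y_prev = X[1:], X[:-1], Y[:-1]
r_without = residual_norm(target, [X_prev])         # model that excludes Y
r_with = residual_norm(target, [X_prev, Y_prev])    # model that includes Y
```

A large gap between `r_without` and `r_with` plays the role of $r_{/C} > r_{+C}$; on real expression data the gap has to be judged against noise, which this sketch does not attempt.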
Remark: Self-loops and indirect causality cannot be captured.

Results - Synthetic Data I
[Figure: network graph on 100 nodes (Node 1 to Node 100).]

Results I
- Reduction in MSE of at least 80% (coefficient of determination);
- Identified 85% of causal relationships.
- Almost all errors are caused by problems in the CS reconstruction algorithms (no RIP, no incoherence).

Results II
[Gardner, 2004]: SOS network of E.coli

R/C    dinI   lexA   recA   recF   rpoD   rpoH   rpoS   ssb    uCD
dinI   0      0 (1)  0 (1)  0      0      0      0 (1)  0 (?)  0 (1)
lexA   1      1      1      0      1      0      0      1      1
recA   0      0      0      0      0      0      0      0      0
recF   0      0      0      0      0      0      0      0      0
rpoD   1      1      1      1      1      1      1      1      1
rpoH   0      0      0      0      1      1      0      0      0
rpoS   0      0      0      1      0      0      1      0      0
ssb    0      0      0      0      0      0      0      0      0
uCD    0      0 (1)  0      0      0 (?)  0      0      0      0

Results II
[Galkin et al., 2011]: SOS network of E.coli - the role of dinI
[Figure.]

Formal definition of the LRMC Problem
Let $X \in \mathbb{R}^{m \times n}$ be a low-rank matrix of rank $r \ll \min(m, n)$.
One does not have full information about X, but only knows a subset of its entries.
- Observation subset $\Omega \subset [m] \times [n]$.
Matrix Completion: given the low-rank assumption, $\Omega$ and $X_\Omega$, what is $X$?

The Framework
"$\ell_0$-approach":
(P0) minimize $\mathrm{rank}(X')$ subject to $P_\Omega X' = X_\Omega$.
"$\ell_1$-approach":
(P1) minimize $\|X'\|_*$ subject to $P_\Omega X' = X_\Omega$, where $\|X'\|_* = \sum_i \sigma_i(X')$.
(P1) $\equiv$ (P0) if the singular vectors of $X$ are "sufficiently spread", i.e. uncorrelated with the standard basis.
$|\Omega| = c\,(n + m)\, r \log(\max(n, m))$ observed entries for unique recovery.

Protein-Protein Interaction (PPI) Prediction
Sequence based methods: assumption that interacting proteins belong to spatially confined conserved regions.
- Sequence alignment.
- Phylogenetic tree analysis.
Probabilistic network inference: assumption that interacting proteins form special network motifs.
- Graphlet models.
- Bayesian models.
References: many, just to mention a few...
- Grama et al., 2010.
- Kim et al., 2006.
- Valencia and Pazos, 2010.
- Przulj et al., 2006-2010.

Protein-Protein Interactions
Let $X \in \mathbb{R}^{m \times n}$ be a matrix of interaction probabilities between proteins, say

Protein \ Protein   PR1     PR2     PR3     PR4
PR1                 *       0.765   0.99    ?
PR2                 0.765   *       0.112   0.5
PR3                 0.99    0.112   *       ?
PR4                 ?       0.5     ?       *

Can you predict the missing entries?

Why LRMC for PPI inference?
Low-rank assumption can be justified in many ways:
- Not all pairwise interactions are independent.
- There is a relatively small number of degrees of freedom, "factors" that influence protein binding (compared to the number of proteins). Otherwise, binding would be physically impossible.
For small sampling rates, the solution may not be unique.
- Pick "dense subsets" of proteins.
- The dependencies may not be linear (easy to handle in the proposed framework).
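To make the completion idea tangible, the block below fills in missing entries of a small symmetric table. It is a swapped-in technique, not the nuclear-norm program (P1): the rank is fixed to 1 and the factors are fitted by alternating least squares on the observed entries. The data are synthetic (generated from a hypothetical hidden factor `p`, so an exact rank-1 fit exists), the observation pattern copies the PR1-PR4 table, and every row and column is assumed to contain at least one observed entry.

```python
def rank1_complete(X, iters=200):
    """Fit X[i][j] ~ u[i] * v[j] on observed entries (None marks missing)."""
    m, n = len(X), len(X[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Update each row factor with the column factors held fixed.
        for i in range(m):
            obs = [j for j in range(n) if X[i][j] is not None]
            u[i] = sum(X[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Update each column factor with the row factors held fixed.
        for j in range(n):
            obs = [i for i in range(m) if X[i][j] is not None]
            v[j] = sum(X[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]


# Hypothetical hidden factor; entry (i, j) of the full matrix is p[i] * p[j].
p = [0.9, 0.5, 0.8, 0.4]
observed_pairs = {(0, 1), (0, 2), (1, 2), (1, 3)}   # same pattern as the slide
X = [[p[i] * p[j] if (min(i, j), max(i, j)) in observed_pairs else None
      for j in range(4)] for i in range(4)]

X_hat = rank1_complete(X)
pred_03, pred_23 = X_hat[0][3], X_hat[2][3]   # entries marked "?" in the table
```

For real affinity data a rank-1 model is far too restrictive; the sketch only shows how observed entries constrain unobserved ones once a low-rank structure is assumed.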
PPI Prediction via Completion
STRING database: Yeast Proteins (roughly 1200 out of 6000 proteins used in the experiment)

PP pair            Affinity   Biologically plausible?
YBL032W:YDL220C    1.00       Yes
YDL220C:YDR510W    1.00       ?
YDL220C:YLL039C    1.00       ?
YDR381W:YLR418C    1.00       Yes
YBL032W:YLL002W    1.00       ?

Traces of Sequences
[Figure: genomic and proteomic sequences passed through a channel.]
- Substitution, deletion/insertion channel: Levenshtein, 1990s; Mitzenmacher et al., 2008.
- Substring extraction: genome assembly problem.
- Multiset of prefixes/suffixes: protein mass spectrometry problem.

Traces of Sequences I
Problem formulation: given $S = s_1 s_2 s_3 \dots s_n$, $s_i \in \{0, 1\}$.
Traces: $\{\{s_1\}, \{s_1, s_2\}, \{s_2\}, \{s_2, s_3\}, \dots, \{s_n\}, \{s_n, s_{n-1}\}, \dots, \{s_1, s_2, \dots, s_n\}\}$.
Example: 0001 gives $\{0, 0, 0, 1, 0^2, 0^2, 01, 0^3, 0^2 1, 0^3 1\}$.

Traces of Sequences II
When can the sequence be reconstructed? (Acharya et al., ISIT 2010)
A string is reconstructable iff its length n satisfies one of:
- $n \leq 7$;
- $n \geq 8$ and $n + 1$ is a prime or twice a prime.

Two Announcements
- IMA Workshop on Group Testing in Biology, 2012.
- ISIT Tutorial on Bioinformatics (with Sharon Aviran, UC Berkeley), 2012.

Thank you!