Transcription
Learning to Analyze Sequences
Fernando Pereira with Axel Bernal, Koby Crammer, Kuzman Ganchev, Ryan McDonald
Department of Computer and Information Science, University of Pennsylvania
Thanks: Steve Carroll, Artemis Hatzigeorgiu, John Lafferty, Kevin Lerman, Gideon Mann, Andrew McCallum, Fei Sha, Peter White
NSF EIA 0205456, EIA 0205448, IIS 0428193

Annotation Markup
[Screenshot of the BioIE annotation tool (file source_file_1156_28611.src, 02/10/2007) showing PubMed article #10027390, "Frequent nuclear/cytoplasmic localization of beta-catenin without exon 3 mutations in malignant melanoma" (Rimm DL, Caca K, Hu G, Harrison FB, Fearon ER; Am J Pathol 1999 Feb;154(2):325-9), with gene and protein entity annotations marked over the abstract.]

Sequences Everywhere
• Syntax: the dependency structure of a sentence such as "John saw Mary in the park"
• Content (entities): biomedical text annotated with gene and protein mentions, as in the BioIE example above
• Genes: DNA labeled with gene-structure states (Igenic, Einit, E1, Eterm, Igenic)

Analyzing Sequences
• Mapping from sequences (documents, sentences, genes) to structures

Analysis Challenges
• Interacting decisions: for example, the alternative taggings of an ambiguous phrase like "fake news show"
• Diverse sequence features
• Inference – computing the best analysis – may be costly

General Setting
• Sequences over some alphabet: x ∈ Σ*
• A set of possible analyses for each sequence: y ∈ Y(x)
• A parametric inference method for selecting a "best" analysis: ŷ = h(x, w)
• Learn the parameters w from examples

Previous Approaches
• Generative: learn the parameters of a stochastic process
    ŷ = arg max_y P(x, y | w)
• Sequential: learn how to build an analysis one decision at a time
    ŷ_i = c(x, y_1, ..., y_{i−1}, w)

Previous Approaches
• Generative
  • HMMs, probabilistic grammars
  • Require a complete representation of the input-output relation
  • Hard to model non-independent features
• Sequential
  • Allows full exploitation of input features
  • But cannot trade off decisions at different positions: the label-bias problem
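For an HMM, the generative criterion ŷ = arg max_y P(x, y | w) is exactly what Viterbi decoding computes. The sketch below is a minimal NumPy illustration, not code from the talk; the model matrices are assumed to be given.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Generative decoding: argmax_y P(x, y) for an HMM.

    obs:   observation indices x_1..x_T
    start: P(y_1 = s),                shape (S,)
    trans: P(y_t = s' | y_{t-1} = s), shape (S, S)
    emit:  P(x_t = o | y_t = s),      shape (S, O)
    """
    T = len(obs)
    score = np.log(start) + np.log(emit[:, obs[0]])   # best log-prob ending in each state
    back = np.zeros((T, len(start)), dtype=int)       # back-pointers
    for t in range(1, T):
        cand = score[:, None] + np.log(trans) + np.log(emit[:, obs[t]])[None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                     # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                  # best state sequence y_1..y_T
```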
Structured Linear Models
• Generalize linear classification:
    ŷ = arg max_y w · F(x, y)
• Features based on local domains, e.g. adjacent tags in "JJ NNS NN / fake news show":
    F(x, y) = Σ_{C ∈ C(x)} f_C(x, y),   f_C(x, y) = f_C(x, y_C)
• Efficient dynamic programming for tree-structured interactions

Learning
• Prior knowledge: local domains C(x), local feature functions f_C
• Adjust w to optimize an objective function on some training data:
    w* = arg min_w λ‖w‖² + Σ_i L(x_i, y_i; w)
  (regularizer + loss)

Margin
• Score advantage of the correct classification over a candidate classification:
    m(x, y, y′; w) = w · F(x, y) − w · F(x, y′)

Losses
• Log loss ⇒ maximize the probability of the correct output:
    L(x, y; w) = log Σ_{y′} exp(−m(x, y, y′; w))
• Hamming loss ⇒ minimize distance-adjusted misclassification:
    L(x, y; w) = max_{y′} [d(y, y′) − m(x, y, y′; w)]+
• Search over y′: dynamic programming on "good" graphs

Online Training
• Process one training instance at a time
• Very simple
• Predictable runtime, small memory
• Adaptable to different loss functions
• Basic idea:
    w = 0
    for t = 1, ..., T:
      for i = 1, ..., N:
        classify x_i, incurring loss ℓ
        update w to reduce ℓ

Online maximum margin (MIRA)
• Project onto the subspace where the correct structure scores "far enough" above all incorrect ones:
    w = 0
    for t = 1, ..., T:
      for i = 1, ..., N:
        w ← arg min_{w′} ½ ‖w′ − w‖²
            s.t. ∀y: w′ · F(x_i, y_i) − w′ · F(x_i, y) ≥ d(y_i, y)
• Exponentially many constraints y: select only the best k structures instead

Analysis by Tagging
• A structured classifier maps an input sequence x = x_1 ··· x_n to a label sequence y = y_1 ··· y_n
• Labels give the role of the corresponding inputs
• Information extraction
• Part-of-speech tagging
• Shallow parsing
• Gene prediction

Metrics
                 positive   negative
    correct      TP         TN
    incorrect    FP         FN
• Precision / specificity: Sp = P = TP / (FP + TP)
• Recall / sensitivity:    Sn = R = TP / (TP + FN)
• F1 = 2PR / (P + R)

Features
• Conjunctions of:
  • Label configuration
  • Input properties: term identity, membership in term lists, orthographic patterns
  • Conjunctions of these for the current and surrounding words
• Feature induction: generate only those conjunctions that help prediction

Gene/protein results
    features          log loss   Hamming   FP+2FN
    textual           81.2       81.8      84.6
    +clusters         82.6       82.2      85
    +dictionaries     82.5       83.3      85.7
    all               83.5       84.4      86.4
http://fable.chop.edu/index.jsp

Gene Structure
• A DNA sequence labeled with gene-structure states: intergenic (Igenic), first exon (Einit), intron, internal exon (E1), intron, last exon (Eterm), Igenic

Gene Prediction
• Accurate predictions are very important for biomedical research uses
• Ambiguous evidence, especially for first and last exons
• Many noisy evidence sources
• Only incremental advances from machine-learning methods in the last ten years

Training Methods
• Previous approach: from the annotated training set, train a signal detector and a content sensor separately, then train the gene-structure parameters; the detector, sensor, and structure parameters together form the gene model.
• Our approach: predict genes on the training set, compare the predictions with the training-set annotation to compute a parameter update, predict genes on the development set, and check convergence against the development-set annotation; repeat until the parameters converge, which yields the gene model.
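The iterative loop in "Training Methods" is an instance of the online training scheme from the earlier slides: predict with the current weights, then update them toward the annotated structure. Below is a minimal structured-perceptron-style sketch of that loop (a simplification of the MIRA update; the feature function and decoder are placeholder assumptions, not the talk's actual tagger or gene-predictor code).

```python
def train_online(data, feats, decode, epochs=5):
    """Online structured training with a perceptron-style update (simplified MIRA).

    data:   list of (x, y_gold) pairs, where y_gold is the annotated structure
    feats:  feats(x, y) -> dict of feature counts, i.e. a sparse F(x, y)
    decode: decode(x, w) -> highest-scoring structure under the current weights w
    """
    w = {}                                    # sparse weight vector
    for _ in range(epochs):                   # for t = 1, ..., T
        for x, y_gold in data:                # for i = 1, ..., N
            y_hat = decode(x, w)              # classify x_i with the current w
            if y_hat != y_gold:               # update w to reduce the loss
                for f, v in feats(x, y_gold).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in feats(x, y_hat).items():
                    w[f] = w.get(f, 0.0) - v
    return w
```

MIRA itself replaces this fixed ±1 update with the smallest change to w that makes the correct structure outscore the best k alternatives by at least their Hamming distance d(y_i, y).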
Possible Analyses
[State diagram of the gene model: an intergenic region (IG) between INI and END, with gene states on both strands (+/−): Einit, internal exons E0/E1/E2 in the three codon phases, short and long introns (Is0/Is1/Is2, IL0/IL1/IL2), Eterm, and single-exon genes (Esingle).]

Gene Features
• Statistics of protein-coding DNA: individual codons and bicodons
• Length of exon states: semi-Markov model
• Motifs and motif conjunctions

Results on ENCODE294
• Transcript- and gene-level accuracies are respectively 30% and 30.6% better than Augustus, the second-best program overall, so the accuracy gains obtained on the first two single-gene sequence sets scale well to chromosomal regions with multiple, alternatively spliced genes.

Table 4: Accuracy results for ENCODE294. Sensitivity (Sn) and specificity (Sp) for each level and exon type.

    Level              GenScan     Genezilla   GenScan++   Augustus    CRAIG
                       Sn    Sp    Sn    Sp    Sn    Sp    Sn    Sp    Sn    Sp
    Base               84.0  62.1  87.6  50.9  76.7  79.3  76.9  76.1  84.4  80.8
    Exon  All          59.6  47.7  62.5  50.5  51.6  64.8  52.1  63.6  60.8  72.7
          Initial      28.0  23.5  36.4  25.0  25.5  47.8  34.7  38.1  37.3  55.2
          Internal     72.6  54.3  73.9  63.2  68.0  62.8  59.1  74.7  71.7  81.2
          Terminal     33.0  31.6  36.7  28.5  25.7  53.9  37.6  45.5  33.3  52.6
          Single       28.1  31.0  44.1  14.5  35.0  45.7  43.9  25.5  55.9  26.4
    Transcript         8.1   11.4  10.3  9.9   6.0   17.0  10.9  16.9  13.5  23.8
    Gene               16.7  11.4  20.6  9.9   12.5  17.0  22.3  16.9  26.6  23.8

[Table of p-values for the significance of the accuracy differences for GenScan, Genezilla, GenScan++, and Augustus on the BGHM953, TIGR251, and ENCODE294 data sets.]

[Genome-browser views comparing predictions with the HAVANA annotation: ENm002, prediction of gene SLC22A4 (positions about 380,000 to 440,000), and ENm005, prediction of gene IFNAR1 (positions about 960,000 to 1,020,000), with tracks for CRAIG, Augustus, GenScan, GenScan++, Genezilla, and HMMGene.]

Higher Accuracy Gene Prediction with CRAIG

Learning to Parse
[Figure: a dependency tree for "John saw Mary in the park" rooted at "saw", compared with the corresponding lexicalized phrase-structure tree: S(saw), VP(saw), PP(in), NP(park), PN(John), V(saw), PN(Mary), P(in), D(the), N(park).]

Why Dependencies?
• Capture meaningful predicate-argument relations in language
• O(n³) parsing with a small grammar constant, versus O(n⁵) with a large grammar constant for lexicalized phrase structure
• Useful in many language processing tasks: machine translation, relation extraction, question answering, textual entailment

Dependency Parsing and Spanning Trees
• Edge-based factorization, basic (first-order) version: the score of a tree is the sum of the scores of its edges
• In what follows, x = x_1 ··· x_n is a generic input sentence with a dummy root, and y is a generic dependency tree for x; seeing y as a set of edges, we write (i, j) ∈ y if there is a dependency in y from word x_i to word x_j
• Following the common method of factoring the score of a dependency tree as the sum of the scores of all edges in the tree (Eisner, 1996), the score of an edge is the dot product between a high-dimensional feature representation of the edge and a weight vector:
    s(i, j) = w · f(i, j)
• The score of a tree is then
    s(x, y) = w · F(x, y) = Σ_{(i,j)∈y} s(i, j)
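To make the edge-based factorization concrete, here is a small Python sketch of feature-based edge scoring and edge-factored tree scoring. The feature templates and the sparse-weight representation are illustrative assumptions, not the actual feature set from the talk.

```python
def edge_features(words, tags, h, m):
    """A few illustrative first-order features for an edge from head h to modifier m."""
    return [
        f"hw={words[h]}|mw={words[m]}",                             # head word / modifier word
        f"ht={tags[h]}|mt={tags[m]}",                               # head tag / modifier tag
        f"ht={tags[h]}|mt={tags[m]}|dir={'R' if m > h else 'L'}",   # plus attachment direction
    ]

def edge_score(w, words, tags, h, m):
    """s(i, j) = w · f(i, j), with w as a sparse dict from feature name to weight."""
    return sum(w.get(f, 0.0) for f in edge_features(words, tags, h, m))

def tree_score(w, words, tags, heads):
    """s(x, y) = sum of edge scores; heads[m] is the head of word m, word 0 is the root."""
    return sum(edge_score(w, words, tags, heads[m], m) for m in range(1, len(words)))
```

For the "Scoring a Parse" example that follows, heads = [0, 2, 0, 4, 2, 2, 7, 5] over "root John hit the ball with the bat" reproduces the sum s(0, 2) + s(2, 1) + s(2, 4) + s(4, 3) + s(2, 5) + s(5, 7) + s(7, 6).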
Scoring a Parse
[Figure 1: an example dependency tree for the sentence "John hit the ball with the bat", with words indexed root 0, John 1, hit 2, the 3, ball 4, with 5, the 6, bat 7.]
    s(x, y) = s(0, 2) + s(2, 1) + s(2, 4) + s(4, 3) + s(2, 5) + s(5, 7) + s(7, 6)
• Dependency representations have a long history (Hudson, 1984). A dependency tree must satisfy the tree constraint: each word has exactly one incoming edge (its head).

Finding the Best Parse
    y* = arg max_y s(x, y)
• General idea: maximum spanning tree
[Diagram: a weighted directed graph over root, John, saw, Mary with candidate edge scores (9, 10, 0, 20, 30, 30, 3, 11); the best parse is the maximum spanning tree of this graph.]

Inference Algorithms
• Nested dependencies
  • O(n³) Eisner algorithm
  • Dynamic programming, related to CKY for probabilistic CFGs
• Crossing dependencies
  • O(n²) Chu-Liu-Edmonds algorithm
  • Greedy

Features
[Diagram: feature templates over the words and part-of-speech tags around a candidate edge.]

Parsing Multiple Languages
• Same methods, standard feature templates

    Data set      UA     LA
    Arabic        79.3   66.9
    Bulgarian     92.0   87.6
    Chinese       91.1   85.9
    Czech         87.3   80.2
    Danish        90.6   84.8
    Dutch         83.6   79.2
    German        90.4   87.3
    Japanese      92.8   90.7
    Portuguese    91.4   86.8
    Slovene       83.2   73.4
    Spanish       86.1   82.3
    Swedish       88.9   82.5
    Turkish       74.7   63.2
    Average       87.0   80.8
(UA: unlabeled attachment accuracy; LA: labeled attachment accuracy)

What's Next?
• Reduce training data requirements
• Structural correspondence learning: unsupervised adaptation from one (training) domain to another
• What if inference is intractable? Can we approximate?
• Self-training: introduce diversity in learning so that the trained predictor can estimate its own confidence and label new training data
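As a closing illustration of "Finding the Best Parse" above, the sketch below spells out ŷ = arg max_y s(x, y) by brute force over all dependency trees of a very short sentence. It only shows the objective; real parsers use the O(n³) Eisner or O(n²) Chu-Liu-Edmonds algorithms from the "Inference Algorithms" slide. The score matrix is assumed to come from an edge scorer such as edge_score above.

```python
from itertools import product

def best_parse_bruteforce(score):
    """argmax over dependency trees of a tiny sentence.

    score[h][m] is the score of an edge from head h to modifier m; node 0 is the root.
    Exponential enumeration, for illustration only.
    """
    n = len(score)
    best, best_heads = float("-inf"), None
    for heads in product(range(n), repeat=n - 1):      # candidate head for each word 1..n-1
        if not is_tree(heads):
            continue
        total = sum(score[h][m] for m, h in enumerate(heads, start=1))
        if total > best:
            best, best_heads = total, (0,) + heads     # position 0 (the root) has no head
    return best_heads

def is_tree(heads):
    """Check that every word reaches the root without cycles (one head per word)."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True
```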