Data-Driven Spell Checking: The Synergy of Two Algorithms
Data-Driven Spell Checking: The Synergy of Two Algorithms for Spelling Error Detection and Correction
Eranga Jayalatharachchi, Asanka Wasala*, Ruvan Weerasinghe
University of Colombo School of Computing, 35, Reid Avenue, Colombo 00700, Sri Lanka
*Localisation Research Centre, CSIS Department, University of Limerick, Limerick, Ireland
{dej,arw}@ucsc.cmb.ac.lk, *[email protected]

Contents
1. Introduction
2. Background
   – Sinhala Language
   – Work on Indic Languages
   – Work on Sinhala
3. Methodology
   – Subasa v1
   – Subasa v2
4. Evaluation
5. Conclusions & Future Work
6. Demonstration

Introduction
Spell Checking
• The task of identifying and flagging incorrectly spelled words in a document written in a natural language
Spell Correcting
• The process of replacing misspelled words with the most likely intended ones
Applications
• Word processing, optical character recognition (OCR), character recognition, speech recognition, computer-aided language learning (CALL), etc.

Introduction
Misspelled Words
• Non-word errors: It was teh wind
• Real-word errors: My sun is a doctor
Automatic Spelling Error Detection and Correction (Kukich 1992):
1. Non-word error detection
2. Isolated word error correction
3. Context-dependent error correction

Introduction
About 80% of all misspelled English words (non-word errors) in human typewritten text are due to single-error misspellings (Damerau 1964). For the word "the":
• Insertion: ther
• Deletion: th
• Substitution: thw
• Transposition: teh

Introduction
Correction Techniques (Kukich 1992)
1. Minimum edit distance techniques
2. Similarity key techniques
3. Rule-based techniques
4. N-gram-based techniques
5. Probabilistic techniques
6. Neural nets

Introduction
Objective
• To enhance Subasa, the only documented spell checker available to date for Sinhala (Wasala et al. 2010; Wasala et al.
2011)
• Subasa v1: n-gram
• Subasa v2: n-gram + edit distance

Introduction
N-grams
An n-gram is a sub-sequence of n items from a given sequence.
Word: intention
Letter unigrams: i n t e n t i o n
Letter bi-grams: in nt te en nt ti io on
Letter tri-grams: int nte ten ent nti tio ion

Introduction
N-gram Generating Algorithm
function get_n_grams (word, n) returns n_grams_list
    l ← length (word) - n
    n_grams_list ← empty ()
    for i from 0 to l do
        n_grams_list ← append (n_grams_list, substring (word, i, n))
    return n_grams_list

Introduction
Minimum Edit-Distance
Minimum number of editing operations required to transform one string into another (Wagner 1974)
• Insertions
• Deletions
• Substitutions

Introduction
Editing Operations
Two ways to transform "intention" into "execution":
• Alignment 1: 5 substitutions; cost = 5 x 2 = 10
• Alignment 2: 1 deletion, 3 substitutions, 1 insertion; cost = 1 + (3 x 2) + 1 = 8
Cost of Edit Operations
• Insertion = 1
• Deletion = 1
• Substitution = Deletion + Insertion = 1 + 1 = 2

Introduction
Minimum Edit Distance Calculation Algorithm
A dynamic programming algorithm for minimum edit-distance computation creates an edit-distance matrix M with one column for each symbol in the target sequence and one row for each symbol in the source sequence.
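A minimal Python sketch of the two building blocks introduced so far, letter n-gram extraction and minimum edit distance, may help make them concrete. It assumes the costs from the Editing Operations slide (insertion 1, deletion 1, substitution 2). The function and constant names are illustrative and are not taken from Subasa. The table here is indexed by source position first and target position second, and the code works on Unicode code points, which for Sinhala text may not coincide with user-perceived letters.

```python
from typing import List

# Costs taken from the "Cost of Edit Operations" slide.
INSERT_COST = 1
DELETE_COST = 1
SUBSTITUTE_COST = 2  # one deletion plus one insertion


def get_n_grams(word: str, n: int) -> List[str]:
    """Return every contiguous letter n-gram of `word`, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def minimum_edit_distance(source: str, target: str) -> int:
    """Cost of the cheapest edit sequence turning `source` into `target`."""
    m, n = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # delete every source symbol
        dist[i][0] = dist[i - 1][0] + DELETE_COST
    for j in range(1, n + 1):  # insert every target symbol
        dist[0][j] = dist[0][j - 1] + INSERT_COST
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + DELETE_COST,
                dist[i][j - 1] + INSERT_COST,
                dist[i - 1][j - 1] + (0 if same else SUBSTITUTE_COST),
            )
    return dist[m][n]


if __name__ == "__main__":
    print(get_n_grams("intention", 3))
    print(minimum_edit_distance("intention", "execution"))
```

With these costs, `minimum_edit_distance("intention", "execution")` returns 8, matching the cheaper of the two alignments on the Editing Operations slide.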
function minimum_edit_distance (source, target) returns min_distance
    m ← length (source)
    n ← length (target)
    create distance matrix M[0..n, 0..m]
    M[0,0] ← 0
    for each column i from 1 to n do
        M[i,0] ← M[i-1,0] + cost_insert (target[i])
    for each row j from 1 to m do
        M[0,j] ← M[0,j-1] + cost_delete (source[j])
    for each column i from 1 to n do
        for each row j from 1 to m do
            M[i,j] ← min ( M[i-1,j] + cost_insert (target[i]),
                           M[i-1,j-1] + cost_substitute (source[j], target[i]),
                           M[i,j-1] + cost_delete (source[j]) )
    min_distance ← M[n,m]
(cost_substitute is 0 when the two symbols are identical.)

Introduction
Edit Distance Matrix (rows: source "intention", bottom to top; columns: target "execution")
n   9  8  9 10 11 12 11 10  9  8
o   8  7  8  9 10 11 10  9  8  9
i   7  6  7  8  9 10  9  8  9 10
t   6  5  6  7  8  9  8  9 10 11
n   5  4  5  6  7  8  9 10 11 10
e   4  3  4  5  6  7  8  9 10  9
t   3  4  5  6  7  8  7  8  9  8
n   2  3  4  5  6  7  8  7  8  7
i   1  2  3  4  5  6  7  6  7  8
#   0  1  2  3  4  5  6  7  8  9
    #  e  x  e  c  u  t  i  o  n
Each cell M[i,j] contains the minimum edit distance between the first i characters of the target and the first j characters of the source. The top-right cell gives the distance between the full words: 8.

Background
Sinhala Language & Script
• Majority language of Sri Lanka
• Sinhala script is a derivative of Brahmi script
• Sinhala script is a syllabic script
• 5 pre-nasalized stops & 2 unique vowels (Nandasara, 2009)
• Sinhala is a phonetic language
• “na-Na-la-La” dissension
• Conjunct letters

Background
Work on Indic Languages
• Non-word spelling correction for Assamese (Das et al. 2002)
   – Uses similarity-key and minimum edit distance techniques
• “Rule cum Dictionary based approach” for spell checking Malayalam (Santhosh et al. 2002)
• Spelling correction for Tamil (Dhanabalan et al.
2003)
   – Non-word error detection using simple dictionary lookups
• Spell checking for Bangla (Chaudhuri 2002)
   – An adaptation of a similarity-key-based technique

Background
Work on Sinhala Language
– Thibus
   • Commercial-grade
– Mozilla Firefox Extension (addons.mozilla.org)
   • Dictionary-based
– OpenOffice Extension (openoffice.org)
   • Uses Hunspell
– Microsoft Office Word 2007 (microsoft.com)
   • Via Language Interface Pack (LIP) for Sinhala
– Subasa (v1) (Wasala et al. 2009; Wasala et al. 2010)
   • N-gram based
   • Phonetic errors

Methodology: Subasa v1
The Process
[Diagram: the input word "kat" and the phoneme class (k, c) yield the candidate words "kat" and "cat".]

Methodology: Subasa v1
The Process (contd.)
[Diagram: each candidate is split into letter bi-grams and scored by summing their values.]
• Bi-gram values: ka = 10, ca = 20, at = 5
• kat: ka, at = 10 + 5
• cat: ca, at = 20 + 5
• Output: cat

Methodology: Subasa v1
Phoneme Classes
[Table of graphemes and their phoneme classes; the Sinhala graphemes are missing from the transcript.]
Phoneme classes listed: /k/, /g/, /tʃ/, /dʒ/, /ʈ/, /ɖ/, /t̪/, /d̪/, /p/, /b/, /n/, /l/, /s/ or /ʃ/, /ɲ/

Methodology: Subasa v1
Example
• UCSC Corpus – 10 Mn Words
   – Word Unigrams (440,021)
   – Letter bi-grams (46,878)
   – Letter tri-grams (166,460)
• Dictionary of Sinhala Spelling (Koparahewa
2006)

http://subasa.ambitiouslemon.com/

Methodology: Subasa v2
The Process

Methodology: Subasa v2
The Process: Edit Distance Module

Methodology: Subasa v2
Data
• UCSC Corpus – 10 Mn Words
   – Word Unigrams (440,021)
   – Letter bi-grams (46,878)
   – Letter tri-grams (166,460)
• Dictionary of Sinhala Spelling (Koparahewa 2006)
• Word Unigrams (spell checked by Subasa v1)

Methodology: Subasa v2
New Phoneme Classes

http://subasa.ambitiouslemon.com/subasa2/

Evaluation
Compared with:
• Microsoft Word 2007
   – Sinhala Language Interface Pack 2007 for Microsoft Office
• OpenOffice.org 3.2 Writer
   – Based on Hunspell
• Subasa v1
   – Based on n-grams from UCSC Corpus
• Manual Inspection
   – By a linguist
Test cases
• Test 1: Public Sinhala Newspaper
• Test 2: Sinhala Blog Syndicator

Evaluation
Results: Test 1
6155 words from a public Sinhala newspaper (http://www.divaina.com/2010/10/28/)
                           Word         Writer       Subasa v1    Subasa v2    Manual
Incorrect Words Detected   2830 (46%)   1592 (26%)   255 (4%)     808 (13%)    1055 (17%)
Correct Words Detected     3325 (54%)   4563 (74%)   5900 (96%)   5347 (87%)   5100 (83%)

Evaluation
Results: Test 2
4117 words extracted from a Sinhala blog syndicator (http://blogs.sinhalabloggers.com/)
                           Word         Writer       Subasa v1    Subasa v2    Manual
Incorrect Words Detected   1979 (48%)   1494 (36%)   353 (9%)     953 (23%)    1047 (25%)
Correct Words Detected     2138 (52%)   2623 (64%)   3764 (91%)   3164 (77%)   3070 (75%)

Conclusions and Future Work
Conclusions
• Of the tools compared, Subasa v2 comes closest to manual inspection (words flagged as incorrect: 13% vs. 17% in Test 1, 23% vs. 25% in Test 2)
• N-gram + edit distance performs better than the n-gram-only approach
• Data driven
• Good for languages with limited resources

Conclusions and Future Work
Future Work
• Larger dictionary
• Optimizations to Edit Distance module
• Candidate correction ranking
• Word boundary analysis
• Morphological analysis

Demonstration
http://subasa.ambitiouslemon.com/ & http://subasa.ambitiouslemon.com/subasa2/

Improved Detections
[Screenshots: Subasa v1 vs. Subasa v2]

Improved Corrections
[Screenshots: Subasa v1 vs. Subasa v2]