Data-Driven Spell Checking: The Synergy of Two Algorithms

Data-Driven Spell Checking:
The Synergy of Two Algorithms for
Spelling Error Detection and Correction
Eranga Jayalatharachchi, Asanka Wasala*, Ruvan Weerasinghe
University of Colombo School of Computing, 35, Reid Avenue, Colombo 00700, Sri Lanka
*Localisation Research Centre
CSIS Department, University of Limerick, Limerick, Ireland
{dej,arw}@ucsc.cmb.ac.lk ,*[email protected]
1
Contents
1. Introduction
2. Background
– Sinhala Language
– Work on Indian Languages
– Work on Sinhala
3. Methodology
– Subasa v1
– Subasa v2
4. Evaluation
5. Conclusions & Future Work
6. Demonstration
2
Introduction
Spell Checking
• The task of identifying and flagging incorrectly
spelled words in a document written in a natural
language
Spell Correcting
• The process of replacing the misspelled words
with the most likely intended ones
Applications
• Word processing, optical character recognition
(OCR), character recognition, speech recognition,
computer aided language learning (CALL) etc.
3
Introduction
Misspelled Words
• Non-word errors: "It was teh wind"
• Real-word errors: "My sun is a doctor"
Automatic Spelling Error Detection and Correction (Kukich 1992):
1. Non-word error detection
2. Isolated word error correction
3. Context-dependent error correction
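A minimal sketch of the first task, non-word error detection, as a plain lexicon lookup. The toy `LEXICON` and the function name `find_non_words` are illustrative assumptions, not part of Subasa. The second call shows why real-word errors need context: "sun" is a valid word, so a lookup alone cannot flag it.

```python
# Hypothetical toy lexicon; a real checker would load a full word list.
LEXICON = {"it", "was", "the", "wind", "my", "sun", "son", "is", "a", "doctor"}


def find_non_words(sentence, lexicon=LEXICON):
    """Return the tokens of `sentence` that are absent from the lexicon."""
    return [tok for tok in sentence.lower().split() if tok not in lexicon]


print(find_non_words("It was teh wind"))     # ['teh']
print(find_non_words("My sun is a doctor"))  # []
```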
4
Introduction
About 80% of all misspelled English words (non-word errors) in
human typewritten text are due to single-error misspellings
(Damerau 1964).
Single-error misspellings of "the":
• Transposition: teh
• Insertion: ther
• Deletion: th
• Substitution: thw
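The four single-error types can be enumerated mechanically. The sketch below assumes a lowercase Latin alphabet and a hypothetical helper name, `single_edits`; it generates every string one edit away from a word and checks that the four misspellings of "the" shown on this slide are among them.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string that differs from `word` by exactly one edit."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Deletion: drop one character.
    deletions = {left + right[1:] for left, right in splits if right}
    # Transposition: swap two adjacent characters.
    transpositions = {left + right[1] + right[0] + right[2:]
                      for left, right in splits if len(right) > 1}
    # Substitution: replace one character with a different one.
    substitutions = {left + ch + right[1:]
                     for left, right in splits if right
                     for ch in alphabet if ch != right[0]}
    # Insertion: add one character at any position.
    insertions = {left + ch + right
                  for left, right in splits for ch in alphabet}
    return deletions | transpositions | substitutions | insertions


edits = single_edits("the")
print(all(w in edits for w in ["teh", "ther", "th", "thw"]))  # True
```

A corrector would typically run this in the other direction: generate the single edits of a misspelled word and keep those found in a lexicon.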
5
Introduction
Correction Techniques (Kukich 1992)
1. Minimum edit distance techniques
2. Similarity key techniques
3. Rule-based techniques
4. N-gram-based techniques
5. Probabilistic techniques
6. Neural nets
6
Introduction
Objective
• To enhance Subasa, the only documented
spell checker available to date for Sinhala
(Wasala et al. 2010; Wasala et al. 2011)
• Subasa v1 : n-gram
• Subasa v2: n-gram + edit distance
7
Introduction
N-grams
An n-gram is a sub-sequence of n items from a
given sequence
Word: intention
Letter unigrams: i n t e n t i o n
Letter bi-grams: in nt te en nt ti io on
Letter tri-grams: int nte ten ent nti tio ion
8
Introduction
N-gram Generating Algorithm
function get_n_grams (word, n) returns n_grams_list
    l ← length (word) - n
    n_grams_list ← empty ()
    for i from 0 to l do
        n_grams_list ← append ( n_grams_list, substring (word, i, n) )
    return n_grams_list
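The n-gram generating algorithm above translates almost directly into Python. This is a minimal sketch that reuses the slide's function name; the list-comprehension form is a stylistic choice, not something taken from the Subasa source.

```python
def get_n_grams(word, n):
    """Return the letter n-grams of `word` in order of occurrence."""
    # There are len(word) - n + 1 windows; none if the word is shorter than n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(get_n_grams("intention", 2))
# ['in', 'nt', 'te', 'en', 'nt', 'ti', 'io', 'on']
print(get_n_grams("intention", 3))
# ['int', 'nte', 'ten', 'ent', 'nti', 'tio', 'ion']
```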
9
Introduction
Minimum Edit-Distance
Minimum number of editing operations
required to transform one string to another
(Wagner 1974)
• Insertions
• Deletions
• Substitutions
10
Introduction
Editing Operations
Transforming "intention" into "execution":
Alignment 1 (position by position)
    i n t e n t i o n
    e x e c u t i o n
    5 Substitutions
    Cost = 5 x 2 = 10
Alignment 2 (with a deletion and an insertion)
    1 Deletion, 3 Substitutions, 1 Insertion
    Cost = 1 + (3 x 2) + 1 = 8
Cost of Edit Operations
• Insertion = 1
• Deletion = 1
• Substitution = Deletion + Insertion = 1 + 1 = 2
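The two costs can be checked by summing per-operation costs over an edit script. In the sketch below, the `COSTS` table follows the slide; the function name `alignment_cost` and the exact order of operations in the second script are assumptions, chosen to be one valid sequence with the slide's counts (1 deletion, 3 substitutions, 1 insertion).

```python
# Operation costs from the slide; a match (identical letters) costs nothing.
COSTS = {"insert": 1, "delete": 1, "substitute": 2, "match": 0}


def alignment_cost(operations):
    """Sum the cost of a sequence of edit operations."""
    return sum(COSTS[op] for op in operations)


# Alignment 1: i→e, n→x, t→e, e→c, n→u, then "tion" matches.
position_by_position = ["substitute"] * 5 + ["match"] * 4

# Alignment 2 (assumed order): delete i, n→e, t→x, keep e, insert c, n→u,
# then "tion" matches.
with_indels = ["delete", "substitute", "substitute", "match",
               "insert", "substitute"] + ["match"] * 4

print(alignment_cost(position_by_position))  # 10
print(alignment_cost(with_indels))           # 8
```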
11
Introduction
Minimum Edit Distance Calculation Algorithm
A dynamic programming algorithm for minimum edit-distance computation creates an
edit-distance matrix M with one column for each symbol in the target sequence and one
row for each symbol in the source sequence.
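function minimum_edit_distance (source, target) returns min_distance
    m ← length(source)
    n ← length(target)
    create distance matrix M[n+1,m+1]
    M[0,0] ← 0
    for each column i from 1 to n do
        M[i,0] ← M[i-1,0] + cost_insert(target i)
    for each row j from 1 to m do
        M[0,j] ← M[0,j-1] + cost_delete(source j)
    for each column i from 1 to n do
        for each row j from 1 to m do
            M[i,j] ← min (
                M[i-1,j] + cost_insert(target i),
                M[i-1,j-1] + cost_substitute(source j, target i),
                M[i,j-1] + cost_delete(source j)
            )
    min_distance ← M[n,m]
(cost_substitute(source j, target i) is 0 when the two symbols are identical)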
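The dynamic-programming procedure above can be sketched in Python as follows. The default costs (insertion 1, deletion 1, substitution 2) are the ones given on the earlier slide; the parameter names are assumptions. The first row and column are filled before the main loops so that every cell read by the recurrence already holds a value, and the result is read from the last cell, M[n][m].

```python
def minimum_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum cost of transforming `source` into `target`."""
    m, n = len(source), len(target)
    # M[i][j]: distance between the first i characters of the target
    # and the first j characters of the source.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # empty source: insert i characters
        M[i][0] = M[i - 1][0] + ins_cost
    for j in range(1, m + 1):          # empty target: delete j characters
        M[0][j] = M[0][j - 1] + del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[j - 1] == target[i - 1]
            M[i][j] = min(
                M[i - 1][j] + ins_cost,
                M[i - 1][j - 1] + (0 if same else sub_cost),
                M[i][j - 1] + del_cost,
            )
    return M[n][m]


print(minimum_edit_distance("intention", "execution"))  # 8
```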
12
Introduction
Edit Distance Matrix
Rows: source "intention" (bottom to top). Columns: target "execution".
Costs: insertion = 1, deletion = 1, substitution = 2.
n    9   8   9  10  11  12  11  10   9   8
o    8   7   8   9  10  11  10   9   8   9
i    7   6   7   8   9  10   9   8   9  10
t    6   5   6   7   8   9   8   9  10  11
n    5   4   5   6   7   8   9  10  11  10
e    4   3   4   5   6   7   8   9  10   9
t    3   4   5   6   7   8   7   8   9   8
n    2   3   4   5   6   7   8   7   8   7
i    1   2   3   4   5   6   7   6   7   8
#    0   1   2   3   4   5   6   7   8   9
     #   e   x   e   c   u   t   i   o   n
Each cell M[i,j] contains the minimum edit distance between the first i characters of
the target and the first j characters of the source. The top-right cell, 8, is the
minimum edit distance between the two words.
13
Background
Sinhala Language & Script
• Majority language of Sri Lanka
• Sinhala script is a derivative of Brahmi script
• Sinhala script is a syllabic script
• 5 pre-nasalized stops & 2 unique vowels (Nandasara, 2009)
• Sinhala is a phonetic language
• “na-Na-la-La” dissention
• Conjunct letters
15
Background
Work on Indic Languages
• Non-word spelling correction for Assamese (Das et al.
2002)
– Uses similarity-key and minimum edit distance techniques
• “Rule cum Dictionary based approach” for spell
checking Malayalam (Santhosh et al. 2002)
• Spelling correction for Tamil (Dhanabalan et al. 2003)
– Non-word error detection using simple dictionary lookups
• Spell checking for Bangla (Chaudhuri 2002)
– An adaptation of similarity key based technique
16
Background
Work on Sinhala Language
– Thibus
• Commercial-grade
– Mozilla Firefox Extension (addons.mozilla.org)
• Dictionary-based
– OpenOffice Extension (openoffice.org)
• Uses Hunspell
– Microsoft Office Word 2007 (microsoft.com)
• Via Language Interface Pack (LIP) for Sinhala
– Subasa (v1) (Wasala et al. 2009; Wasala et al. 2010)
• N-gram based
• Phonetic errors
17
Methodology: Subasa v1
The Process
Input word: kat
Phoneme class: (k, c)
Permutations generated: kat, cat
18
The Process (contd.)
Letter bi-grams: kat → ka, at; cat → ca, at
Bi-gram frequencies: ka = 10, ca = 20, at = 5
Scores: kat = ka + at = 10 + 5 = 15; cat = ca + at = 20 + 5 = 25
Selected word: cat
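The kat/cat walk-through can be reproduced with a short sketch. Everything here is a toy reconstruction from the slide figures: the single phoneme class (k, c), the three bi-gram frequencies, and the sum-of-frequencies score are taken from the example, while the function names and the use of `itertools.product` are assumptions. The actual Subasa v1 scoring may differ.

```python
from itertools import product

# Toy data from the slide example (not real corpus statistics).
PHONEME_CLASSES = [{"k", "c"}]
BIGRAM_FREQ = {"ka": 10, "ca": 20, "at": 5}


def permutations_of(word, classes=PHONEME_CLASSES):
    """Generate every spelling obtained by swapping letters within a class."""
    options = []
    for ch in word:
        # A letter outside every class can only stand for itself.
        cls = next((c for c in classes if ch in c), {ch})
        options.append(sorted(cls))
    return ["".join(p) for p in product(*options)]


def score(word, freq=BIGRAM_FREQ):
    """Sum the frequencies of the word's letter bi-grams (unseen ones count 0)."""
    return sum(freq.get(word[i:i + 2], 0) for i in range(len(word) - 1))


def best_candidate(word):
    """Pick the permutation with the highest bi-gram score."""
    return max(permutations_of(word), key=score)


print(permutations_of("kat"))      # ['cat', 'kat']
print(score("kat"), score("cat"))  # 15 25
print(best_candidate("kat"))       # cat
```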
19
Phoneme Classes
Each phoneme class groups two graphemes (three for /s/ or /ʃ/); the Sinhala graphemes are not preserved in this transcript.
Phoneme classes: /k/, /g/, /tʃ/, /dʒ/, /ʈ/, /ɖ/, /t̪/, /d̪/, /p/, /b/, /n/, /l/, /s/ or /ʃ/, /ɲ/
20
Example
• UCSC Corpus – 10 Mn Words
– Word Unigrams (440,021)
– Letter bi-grams (46,878)
– Letter tri-grams (166,460)
• Dictionary of Sinhala Spelling (Koparahewa 2006)
21
http://subasa.ambitiouslemon.com/
22
The Process
23
The Process : Edit Distance Module
24
Data
• UCSC Corpus – 10 Mn Words
– Word Unigrams (440,021)
– Letter bi-grams (46,878)
– Letter tri-grams (166,460)
• Dictionary of Sinhala Spelling (Koparahewa 2006)
• Word Unigrams (spell checked by Subasa v1)
25
New Phoneme Classes
26
http://subasa.ambitiouslemon.com/subasa2/
27
Evaluation
Compared with:
• Microsoft Word 2007: Sinhala Language Interface Pack 2007 for Microsoft Office
• OpenOffice.org 3.2 Writer: based on Hunspell
• Subasa v1: based on n-grams from UCSC Corpus
• Manual inspection: by a linguist
Test cases
Test 1: Public Sinhala Newspaper
Test 2: Sinhala Blog Syndicator
28
Evaluation
Results: Test 1
6155 words from a Public Sinhala Newspaper
http://www.divaina.com/2010/10/28/
                          MS Word      Writer       Subasa v1    Subasa v2    Manual
Incorrect Words Detected  2830 (46%)   1592 (26%)   255 (4%)     808 (13%)    1055 (17%)
Correct Words Detected    3325 (54%)   4563 (74%)   5900 (96%)   5347 (87%)   5100 (83%)
29
Evaluation
Results: Test 2
4117 words extracted from a Sinhala blog syndicator
http://blogs.sinhalabloggers.com/
                          MS Word      Writer       Subasa v1    Subasa v2    Manual
Incorrect Words Detected  1979 (48%)   1494 (36%)   353 (9%)     953 (23%)    1047 (25%)
Correct Words Detected    2138 (52%)   2623 (64%)   3764 (91%)   3164 (77%)   3070 (75%)
30
Conclusions and Future Work
Conclusions
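• Subasa v2 comes much closer to manual inspection than the other tools
(Test 1: 808 vs. 1055 words flagged; Test 2: 953 vs. 1047)
• The n-gram + edit distance approach performs better than the
n-gram-only approach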
• Data driven
• Good for languages with limited resources
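One rough way to quantify "closer to manual inspection" is the gap between the number of words each tool flags and the number the linguist flagged. The sketch below uses the counts from the two results slides; the dictionary layout and the helper name `gap_to_manual` are assumptions. It compares totals only, so it says nothing about whether the tools flag the same words as the linguist.

```python
# Incorrect-word counts from the Test 1 and Test 2 results.
FLAGGED = {
    "Test 1": {"MS Word": 2830, "Writer": 1592, "Subasa v1": 255,
               "Subasa v2": 808, "Manual": 1055},
    "Test 2": {"MS Word": 1979, "Writer": 1494, "Subasa v1": 353,
               "Subasa v2": 953, "Manual": 1047},
}


def gap_to_manual(test):
    """Absolute difference between each tool's count and the manual count."""
    counts = FLAGGED[test]
    manual = counts["Manual"]
    return {tool: abs(n - manual)
            for tool, n in counts.items() if tool != "Manual"}


for test in FLAGGED:
    gaps = gap_to_manual(test)
    print(test, gaps, "closest:", min(gaps, key=gaps.get))
```

On both tests the smallest gap belongs to Subasa v2 (247 words on Test 1, 94 on Test 2).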
31
Conclusions and Future Work
Future Work
• Larger dictionary
• Optimizations to Edit Distance module
• Candidate correction ranking
• Word boundary analysis
• Morphological analysis
32
Demonstration
http://subasa.ambitiouslemon.com/
&
http://subasa.ambitiouslemon.com/subasa2/
33
Improved Detections
Subasa v1
Subasa v2
34
Improved Corrections
Subasa v1 Subasa v2
35
