Annotating Syntactic Information on
5.5 Billion Word Corpus of
Japanese Blogs
Michal Ptaszynski 1, Rafal Rzepka 2, Kenji Araki 2,
Yoshio Momouchi 3
1) JSPS Research Fellow / Hokkai-Gakuen University, High-Tech Research Center
2) Hokkaido University, Graduate School of Information Science and Technology
3) Hokkai-Gakuen University, Department of Electronics and Information Engineering
Presentation Outline
• Introduction
• YACIS Corpus Description
• Syntax/Morphology Annotation
• Corpus Statistics
• Conclusions and Future Work
Introduction
• Corpora are important in many NLP tasks
– Text normalization
– Lexicon generation
– Sentiment analysis
– Dialog agent development
–…
Introduction
• There are some (somewhat) large corpora for
Japanese
– KOTONOHA: BCCWJ (Balanced Corpus of
Contemporary Written Japanese) (4,800,000w)
– Aozora Bunko (more than 10,000 books)
– Mainichi Shinbun (200,000 articles)
– Asahi Shinbun (130,000 articles)
Introduction
• A good source of casual language:
– INTERNET
• BLOGS
Introduction
• Internet based corpora for Japanese
– KWIC on WEB (http://languagecraft.jp/kwic/)
• 2,000,000 pages
– JpWaC (http://trac.sketchengine.co.uk/wiki/Corpora/JpWaC)
• 49,000 pages
–…
Introduction
• Internet based corpora for Japanese
– Problems:
• Robots.txt
• Duplicates
• Language detection
• Encoding
• No specific domain (multi-domain)
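Two of the problems listed above, robots.txt and duplicates, can be illustrated with a minimal sketch. This is not the procedure used for any of the corpora named here. The function names, the `demo-bot` user agent, and the choice of hashing whitespace-normalized text to spot duplicates are all assumptions made for illustration.

```python
import hashlib
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def drop_duplicates(pages):
    """Keep the first copy of each page text; whitespace differences are ignored."""
    seen = set()
    unique = []
    for text in pages:
        normalized = " ".join(text.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    print(may_fetch(robots, "demo-bot", "http://example.com/public/a.html"))
    print(drop_duplicates(["a  b", "a b", "c"]))
```

Exact hashing only catches verbatim copies; near-duplicates would need a fuzzier comparison.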
Introduction
• Blog based corpora for Japanese
– jBlogs
• 28,000 pages, 62 mil words
– KNB
• 249 pages, 67,000 words
jBlogs: M. Baroni, and M. Ueyama, ”Building General- and Special-Purpose Corpora by Web
Crawling”, In: Proceedings of the 13th NIJL International Symposium on Language Corpora: Their
Compilation and Application, 2006, www.tokuteicorpus.jp/result/pdf/2006 004.pdf
KNB: Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Keiji Shinzato and Masaaki
Nagata, “Construction of a Blog Corpus with Syntactic, Anaphoric, and Sentiment Annotations”
[in Japanese], Journal of Natural Language Processing, Vol 18, No. 2, pp. 175-201, 2011.
Introduction
• Written-language corpora vs. Internet corpora vs. blog corpora (corpus: scale)
• Written-language corpora
– KOTONOHA: 4,800,000w
– Aozora Bunko: >10,000 books
– Mainichi Shinbun: ~200,000 articles
– Asahi Shinbun: ~130,000 articles
• Internet corpora
– KWIC on WEB: 2,000,000 pages
– JpWaC: 49,000 pages
• Blog corpora
– jBlogs: 28,000 pages / 62 mil words
– KNB: 249 pages / 67,000 words
YACIS Corpus Description
• Need a BIG corpus of blogs for Japanese
• Looked through a number of blog services
• Ameba (www.ameba.jp/) has a clear structure
YACIS Corpus Description
...
<div class="contents">
<div class="subContents">
<!-- google_ad_section_start(name=s1, weight=.9) -->
<p><font size="2">ずいぶん前になりますが岡山に行ってきました<img alt="0" src="http://emoji.ameba.jp/img/user/ts/tsumegaeru/606746.gif" /><img alt="きら
きら" src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1956233.gif" /></font></p>
<br />
<p><font size="2">なんと人生初の一人旅(゚∀゚*)</font></p>
<br />
<p><font size="2">そしてこれはその時に買ったお土産です<img alt="パンダ" src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1114039.gif" /><br />
</font><a id="i10537038426" class="detailOn" href="http://ameblo.jp/0ppe1zryshan/image-10619583023-10537038426.html"><font size="2"><img
height="330" alt="○●ようのうまいもん日記●○"
src="http://stat.ameba.jp/user_images/20100511/19/0ppe1zryshan/42/f4/j/t02200330_0800120010537038426.jpg" width="220" border="0" /></font></a>
<font size="2"><br />
『白桃のロイヤルガレット』</font></p>
<br />
<p><font size="2">桃系はやっぱり買っとかなければ!!! ってことで買いました( ̄∀ ̄)</font></p>
<br />
<p><br />
<br />
<a id="i10537038429" class="detailOn" href="http://ameblo.jp/0ppe1zryshan/image-10619583023-10537038429.html"><font size="2"><img alt="○●ようのうま
いもん日記●○" src="http://stat.ameba.jp/user_images/20100511/19/0ppe1zryshan/5b/a7/j/t02200147_0800053410537038429.jpg" border="0"
/></font></a>
<font size="2"><br />
YACIS Corpus Description
• Extract these: the text of the post
• From between these: the surrounding HTML tags (as in the <div class="subContents"> block above)
• Get rid of:
– Pictures
– Irrelevant HTML tags
– Emoji (but leave kaomoji)
– Non-Japanese pages
– … (other post-processing)
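The clean-up steps listed above can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions, not the extraction code used for YACIS. It assumes the post text sits inside `<div class="subContents">`, as in the Ameba sample, and that emoji appear as `<img>` elements while kaomoji are plain text. The class and function names are hypothetical, and filtering of non-Japanese pages is not covered.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect text found inside <div class="subContents">, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # <div> nesting depth inside the target block (0 = outside)
        self.chunks = []  # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif ("class", "subContents") in attrs:
                self.depth = 1
        # <img> elements (pictures, image emoji) carry no text, so they vanish here.

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)  # kaomoji are ordinary text and are kept


def extract_post_text(html):
    """Return the list of text fragments found in the post body."""
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = (
        '<div class="contents"><div class="subContents"><!-- ad marker -->'
        '<p><font size="2">なんと人生初の一人旅(゚∀゚*)'
        '<img alt="x" src="emoji.gif" /></font></p></div><p>footer</p></div>'
    )
    print(extract_post_text(sample))
```

Matching the class attribute exactly keeps the sketch short; a page that puts several class names in the attribute would need a looser test.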
YACIS Corpus Description
• YACIS:
– Yet Another Corpus of Internet Sentences
• Corpus compilation:
– 2009 Dec 3-24
• Ameba blogs
• Only one query to Google:
“site:ameblo.jp” (take 1000 links)
~> crawl from page to page
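The "seed links, then crawl from page to page" idea can be sketched as a breadth-first traversal. The slides do not say which traversal order was used, so breadth-first is an assumption. The `fetch_links` callable, which returns the links found on a page, is a hypothetical stand-in for real downloading and parsing, and `max_pages` is an illustrative limit.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed_urls, fetch_links, max_pages=1000, host="ameblo.jp"):
    """Visit the seeds, then follow links from page to page, staying on `host`.

    fetch_links(url) must return an iterable of absolute URLs found on that page.
    """
    seeds = list(dict.fromkeys(seed_urls))  # drop repeated seeds, keep order
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if urlparse(link).hostname == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    # Toy link graph standing in for real pages.
    graph = {
        "http://ameblo.jp/a": ["http://ameblo.jp/b", "http://example.com/x"],
        "http://ameblo.jp/b": ["http://ameblo.jp/a", "http://ameblo.jp/c"],
        "http://ameblo.jp/c": [],
    }
    print(crawl(["http://ameblo.jp/a"], lambda u: graph.get(u, [])))
```

The `seen` set keeps the crawler from queueing a page twice, which matters once pages link back to each other.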
YACIS Corpus Description
• Written-language corpora (corpus: scale)
– KOTONOHA: 4,800,000w
– Aozora Bunko: >10,000 books
– Mainichi Shinbun: ~200,000 articles
– Asahi Shinbun: ~130,000 articles
• Internet corpora
– KWIC on WEB: 2,000,000 pages
– JpWaC: 49,000 pages
• Blog corpora
– jBlogs: 28,000 pages / 62 mil words
– KNB: 249 pages / 67,000 words
– YACIS: 12 mil pages / 28 bil. char / 5.6 bil. w.
– (unlabeled on slide): 10 bil char
YACIS Corpus Description
• What we have:
Original URL
Extraction time
Sentence ID
Tags:
<doc> one blog page
<post> one post in blog
<s> one sentence
<comments> all comments
<cmt> one comment
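The tag set above can be illustrated by building one small document with `xml.etree.ElementTree`. The tag names come from the slide. The attribute names (`url`, `time`, `id`), the nesting of `<comments>` directly under `<doc>`, and the use of `<s>` inside `<cmt>` are assumptions made for this sketch, not the actual YACIS file layout.

```python
import xml.etree.ElementTree as ET


def build_doc(url, extracted_at, post_sentences, comments):
    """Build a <doc> element for one blog page.

    post_sentences: list of sentence strings for the post.
    comments: list of comments, each a list of sentence strings.
    """
    doc = ET.Element("doc", {"url": url, "time": extracted_at})  # attribute names assumed
    post = ET.SubElement(doc, "post")
    for number, sentence in enumerate(post_sentences, start=1):
        s = ET.SubElement(post, "s", {"id": str(number)})  # sentence ID
        s.text = sentence
    comments_el = ET.SubElement(doc, "comments")
    for comment_sentences in comments:
        cmt = ET.SubElement(comments_el, "cmt")
        for sentence in comment_sentences:
            ET.SubElement(cmt, "s").text = sentence
    return doc


if __name__ == "__main__":
    doc = build_doc(
        "http://ameblo.jp/example/entry-1.html",  # hypothetical URL
        "2009-12-03T10:00:00",                    # hypothetical extraction time
        ["なんと人生初の一人旅(゚∀゚*)", "『白桃のロイヤルガレット』"],
        [["いいですね!"]],
    )
    print(ET.tostring(doc, encoding="unicode"))
```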
YACIS Corpus Description
• What we want:
– Tokenization
– POS
– Lemmatization
– Dependency structure
– Named entities
– Emotive expressions
– Emotion classes
– Positive/Negative
– Emotion objects
– Emotive sentences
– Emoticons
Syntax/Morphology Annotation
• Tokenization (T)
• Lemmatization (L)
• POS tagging (POS)
• Dependency structure (DS)
• Named entity recognition (NER)
Tools:
– T, L, POS: ChaSen, MeCab, Juman
– DS, NER: Cabocha, KNP
Syntax/Morphology Annotation
[Figure: comparison of Juman, MeCab, ChaSen (POS) and KNP, Cabocha (DS) by speed, from slower to faster]
* Subjective evaluation on a small test set, no benchmarks
** Test with “time” command in Linux
Syntax/Morphology Annotation
• Cool features of MeCab
– POS prediction
– Use of two dictionaries (ipadic, jumandic)
• Cool features of Cabocha
– Works with MeCab
– IREX (NER standard)
Syntax/Morphology Annotation
• Annotation time:
– MeCab: 2 days
– Cabocha: 7 days
• File size:
– Raw (only text, no HTML): 27 GB
– Tokenization: 32 GB
– POS/ipadic, Lemma, etc.: 286 GB
– POS/jumandic, Lemma: 286 GB
– DS with NER: 86 GB
Syntax/Morphology Annotation
• Annotation example
Corpus Statistics
• Evaluation
– Manual:
• Impossible (5.6 bil. words, 350 mil. sentences)
– 1 sentence in 1 sec. = 4050 days (11 years)
• MeCab and Cabocha are standard tools (reliable)
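The "4050 days (11 years)" estimate above follows from simple arithmetic, shown here as a short check. It assumes one annotator working around the clock at one sentence per second, with 365-day years.

```python
SENTENCES = 350_000_000          # about 350 mil. sentences in the corpus
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

days = SENTENCES / SECONDS_PER_DAY  # one sentence per second, no breaks
years = days / 365

print(f"{days:.0f} days, about {years:.0f} years")
```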
Corpus Statistics
• Evaluation
– Automatic: comparison of general features
• Now mostly for POS
1. Ipadic vs. jumandic
2. YACIS vs. other Japanese corpora
3. YACIS vs. other language corpora
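Corpus Statistics
Ipadic vs. jumandic
[Charts: POS distribution for YACIS-ipadic and YACIS-jumandic]
Differences in dictionaries.
For example: いやー (interjection)
Ipadic: いやー【感動詞】 (one token, tagged as interjection)
Jumandic: いや【感動詞】 + ー【記号・特殊】 (interjection + symbol/special)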
Corpus Statistics
YACIS vs. jBlogs and JENAAD
jBlogs: blog corpus from 2006
61 mil. words, 30 thousand blog docs
JENAAD: news corpus from 2003
4.7 mil. words, Yomiuri (1989-2001)
jBlogs: M. Baroni, and M. Ueyama, ”Building General- and Special-Purpose Corpora by Web
Crawling”, In: Proceedings of the 13th NIJL International Symposium on Language Corpora:
Their Compilation and Application, 2006, www.tokuteicorpus.jp/result/pdf/2006 004.pdf
JENAAD: Masao Utiyama and Hitoshi Isahara. (2003) “Reliable Measures for Aligning Japanese-English News Articles and Sentences”. ACL-2003, pp. 72–79.
Corpus Statistics
YACIS vs. jBlogs and JENAAD
Spearman's ρ (jBlogs and JENAAD columns: ChaSen / ipadic)
                 YACIS-ipadic   YACIS-jumandic   jBlogs   JENAAD
YACIS-ipadic     1              0.88             0.96     1
YACIS-jumandic                  1                0.79     0.85
jBlogs                                           1
JENAAD                                                    1
Statistically, part-of-speech distribution is similar for ipadic across all three corpora:
– YACIS (large): 5,600,000,000 words
– jBlogs (medium): 61,000,000 words
– JENAAD (small): 4,700,000 words
It doesn’t mean MeCab/ipadic is better.
It means: it is consistent regardless of corpus size. And even if it has errors, the errors appear consistently. *
* Within limited evaluation range: comparison of POS distribution
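Spearman's ρ compares two corpora by the rank order of their POS frequencies rather than by the raw counts, which is why corpora of very different sizes can be compared directly. The sketch below computes it in plain Python as the Pearson correlation of average ranks, a standard formulation that also handles ties. The tag names and counts are made-up illustration data, not figures from YACIS, jBlogs, or JENAAD.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(freq_a, freq_b):
    """Spearman's rho between two {POS tag: count} dictionaries."""
    tags = sorted(set(freq_a) | set(freq_b))
    a = [freq_a.get(tag, 0) for tag in tags]
    b = [freq_b.get(tag, 0) for tag in tags]
    return pearson(average_ranks(a), average_ranks(b))


if __name__ == "__main__":
    # Hypothetical counts: a small and a large corpus with the same POS ranking.
    small = {"noun": 50, "particle": 30, "verb": 20, "adjective": 5}
    large = {"noun": 500, "particle": 310, "verb": 180, "adjective": 40}
    print(spearman_rho(small, large))
```

Because only ranks enter the calculation, two corpora with the same ordering of POS tags score close to 1 however different their sizes are.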
Corpus Statistics
Japanese vs. British English and Italian
• British English: ukWaC
– 2 bil. words, .uk domain, POS, lemma
• Italian: itWaC
– 2 bil. words, .it domain, POS, lemma
• Size comparable to YACIS: >1 bil. words
• Both from WaCky (Web as Corpus kool ynitiative)
http://wacky.sslmit.unibo.it/doku.php
Corpus Statistics
Japanese vs. British English and Italian
[Charts: POS distribution (Noun, Verb, Adjective labeled) for YACIS-ipadic, YACIS-jumandic, and two charts marked *]
What does it mean?
Corpora of comparable size, but different languages (Japanese vs. 2 European languages), have different POS distribution.
If we don’t argue about POS definitions, this could be a small hint toward a proof that POS distribution is not universal across languages.
* ) Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta, “The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora”, Kluwer Academic Publishers, Netherlands, 2008.
Conclusions
• Gathered YACIS
– 5.6 bil. word corpus of Japanese blogs
• Annotated YACIS with Syntactic/Morphological
information
– POS, Tokenization, Lemma, Dependency Structure,
Named Entities
• Evaluated YACIS by comparing to other
corpora
Conclusions
• Corpora of the same language but different size have the same POS distribution
= POS tagging is consistent
• Corpora of comparable size but different languages have different POS distribution
= POS distribution is NOT universal across languages
Future Work
• Online interface!
• More detailed evaluation (e.g. of dependency)
• Lexicon generation
• N-gram version for download without limitations
• Applications
Thank you for your attention!
Michal Ptaszynski
[email protected]
Discussion
• Copyrights
– YACIS will not be put on sale
– Only for scientific purposes
– Usage of corpus will need a two-side agreement
• Gathering of the corpus is similar to search
engines
– If YACIS was illegal, Google, Yahoo,… would be
even more illegal.
