Annotating Syntactic Information on 5.5 Billion Word Corpus of Japanese Blogs
Michal Ptaszynski 1, Rafal Rzepka 2, Kenji Araki 2, Yoshio Momouchi 3
1) JSPS Research Fellow / Hokkai-Gakuen University, High-Tech Research Center
2) Hokkaido University, Graduate School of Information Science and Technology
3) Hokkai-Gakuen University, Department of Electronics and Information Engineering

Presentation Outline
• Introduction
• YACIS Corpus Description
• Syntax/Morphology Annotation
• Corpus Statistics
• Conclusions and Future Work

Introduction
• Corpora are important in many NLP tasks
  – Text normalization
  – Lexicon generation
  – Sentiment analysis
  – Dialog agent development
  – …

Introduction
• There are some (somewhat) large corpora for Japanese
  – KOTONOHA: BCCWJ (Balanced Corpus of Contemporary Written Japanese) (4,800,000 words)
  – Aozora Bunko (more than 10,000 books)
  – Mainichi Shinbun (200,000 articles)
  – Asahi Shinbun (130,000 articles)

Introduction
• A good source of casual language:
  – INTERNET
    • BLOGS

Introduction
• Internet based corpora for Japanese
  – KWIC on WEB (http://languagecraft.jp/kwic/)
    • 2,000,000 pages
  – JpWaC (http://trac.sketchengine.co.uk/wiki/Corpora/JpWaC)
    • 49,000 pages
  – …

Introduction
• Internet based corpora for Japanese
  – Problems:
    • Robots.txt
    • Duplicates
    • Language detection
    • Encoding
    • No specific domain (multi-domain)

Introduction
• Blog based corpora for Japanese
  – jBlogs
    • 28,000 pages, 62 mil words
  – KNB
    • 249 pages, 67,000 words

jBlogs: M. Baroni and M. Ueyama, "Building General- and Special-Purpose Corpora by Web Crawling", In: Proceedings of the 13th NIJL International Symposium on Language Corpora: Their Compilation and Application, 2006, www.tokuteicorpus.jp/result/pdf/2006 004.pdf
KNB: Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Keiji Shinzato and Masaaki Nagata, "Construction of a Blog Corpus with Syntactic, Anaphoric, and Sentiment Annotations" [in Japanese], Journal of Natural Language Processing, Vol 18, No. 2, pp. 175-201, 2011.

Introduction
• Written language corpora vs. Internet corpora vs. Blog corpora

corpus            scale               corpus        scale             corpus   scale
KOTONOHA          4,800,000 words     KWIC on WEB   2,000,000 pages   jBlogs   28,000 pages / 62 mil words
Aozora Bunko      >10,000 books       JpWaC         49,000 pages      KNB      249 pages / 67,000 words
Mainichi Shinbun  ~200,000 articles
Asahi Shinbun     ~130,000 articles

YACIS Corpus Description
• Need a BIG corpus of blogs for Japanese
• Looked through a number of blog services
• Ameba (www.ameba.jp/) has a clear structure

YACIS Corpus Description
...
<div class="contents">
<div class="subContents">
<!-- google_ad_section_start(name=s1, weight=.9) -->
<p><font size="2">ずいぶん前になりますが岡山に行ってきました<img alt="0" src="http://emoji.ameba.jp/img/user/ts/tsumegaeru/606746.gif" /><img alt="きらきら" src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1956233.gif" /></font></p>
<br />
<p><font size="2">なんと人生初の一人旅(゚∀゚*)</font></p>
<br />
<p><font size="2">そしてこれはその時に買ったお土産です<img alt="パンダ" src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1114039.gif" /><br />
</font><a id="i10537038426" class="detailOn" href="http://ameblo.jp/0ppe1zryshan/image-10619583023-10537038426.html"><font size="2"><img height="330" alt="○●ようのうまいもん日記●○" src="http://stat.ameba.jp/user_images/20100511/19/0ppe1zryshan/42/f4/j/t02200330_0800120010537038426.jpg" width="220" border="0" /></font></a> <font size="2"><br />
『白桃のロイヤルガレット』</font></p>
<br />
<p><font size="2">桃系はやっぱり買っとかなければ!!! ってことで買いました( ̄∀ ̄)</font></p>
<br />
<p><br />
<br />
<a id="i10537038429" class="detailOn" href="http://ameblo.jp/0ppe1zryshan/image-10619583023-10537038429.html"><font size="2"><img alt="○●ようのうまいもん日記●○" src="http://stat.ameba.jp/user_images/20100511/19/0ppe1zryshan/5b/a7/j/t02200147_0800053410537038429.jpg" border="0" /></font></a> <font size="2"><br />

YACIS Corpus Description
...
(The same HTML excerpt as on the previous slide, with two callouts:)
Extract these [the text passages]
From between these [the HTML tags around them]
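The extraction step illustrated above, keeping the text found between the tags while dropping markup and images, might be sketched as follows. This is only a minimal illustration built on Python's standard `html.parser` module, not the tooling actually used to build YACIS; the names `TextExtractor` and `extract_text` are hypothetical. In the excerpt the emoji are embedded as `<img>` elements, so ignoring tags also drops them, while kaomoji such as (゚∀゚*) are plain text and survive.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags, images and comments are simply ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text passages of an HTML fragment, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Two paragraphs taken from the slide's HTML excerpt.
sample = (
    '<p><font size="2">なんと人生初の一人旅(゚∀゚*)</font></p> <br /> '
    '<p><font size="2">そしてこれはその時に買ったお土産です'
    '<img alt="パンダ" '
    'src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1114039.gif" />'
    '<br /> </font></p>'
)

lines = extract_text(sample)
for line in lines:
    print(line)
```

A real pipeline would additionally need the steps the slides only name, such as filtering non-Japanese pages; those are outside this sketch.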
Get rid of:
– Pictures
– Irrelevant HTML tags
– Emoji (but leave kaomoji)
– Non-Japanese pages
– … (Other post-processing)

YACIS Corpus Description
• YACIS:
  – Yet Another Corpus of Internet Sentences
• Corpus compilation:
  – 2009 Dec 3-24
• Ameba blogs
• Only one query to Google: "site:ameblo.jp" (take 1000 links) ~> crawl from page to page

YACIS Corpus Description

corpus            scale               corpus        scale             corpus   scale
KOTONOHA          4,800,000 words     KWIC on WEB   2,000,000 pages   jBlogs   28,000 pages / 62 mil words
Aozora Bunko      >10,000 books       JpWaC         49,000 pages      KNB      249 pages / 67,000 words
Mainichi Shinbun  ~200,000 articles                                   YACIS    12 mil pages / 28 bil. char / 5.6 bil. words
Asahi Shinbun     ~130,000 articles

(The slide also shows "10 bil char" between the KWIC on WEB and jBlogs entries.)

YACIS Corpus Description
• What we have:
  – Original URL
  – Extraction time
  – Sentence ID
  – Tags:
    <doc> one blog page
    <post> one post in blog
    <s> one sentence
    <comments> all comments
    <cmt> one comment

YACIS Corpus Description
• What we want:
  – Dependency structure
  – Tokenization
  – POS
  – Lemmatization
  – Named Entities
  – Emotive expressions
  – Emotion classes
  – Positive/Negative
  – Emotion objects
  – Emotive sentences
  – Emoticons

Syntax/Morphology Annotation
• Tokenization (T)
• Lemmatization (L)
• POS tagging (POS)
• Dependency structure (DS)
• Named entity recognition (NER)
Candidate tools: ChaSen, MeCab, Juman, Cabocha, KNP

Syntax/Morphology Annotation
[Figure: tools compared by speed*, from slower to faster. POS: Juman, MeCab, ChaSen; DS: KNP, Cabocha. Juman is marked as slower; MeCab and Cabocha are marked as faster.]
* Subjective evaluation on a small test set, no benchmarks
** Test with "time" command in Linux

Syntax/Morphology Annotation
• Cool features of MeCab
  – POS prediction
  – Use of two dictionaries (ipadic, jumandic)
• Cool features of Cabocha
  – Works with MeCab
  – IREX (NER standard)

Syntax/Morphology Annotation
• Annotation time:
  – MeCab: 2 days
  – Cabocha: 7 days
• File size:
  – Raw (only text, no HTML): 27 GB
  – Tokenization: 32 GB
  – POS/ipadic, Lemma, etc.: 286 GB
  – POS/jumandic, Lemma: 286 GB
  – DS with NER: 86 GB

Syntax/Morphology Annotation
• Annotation example

Corpus Statistics
• Evaluation
  – Manual:
    • Impossible (5.6 bil. words, 350 mil. sentences)
      – 1 sentence in 1 sec. = 4050 days (11 years)
    • MeCab and Cabocha are standard tools (reliable)

Corpus Statistics
• Evaluation
  – Automatic: comparison of general features
    • Now mostly for POS
      1. Ipadic vs. jumandic
      2. YACIS vs. other Japanese corpora
      3. YACIS vs. other language corpora

Corpus Statistics
Ipadic vs. jumandic
[Charts: YACIS-ipadic, YACIS-jumandic]
Differences in dictionaries. For example: いやー (interjection)
  Ipadic: いやー【感動詞】 (interjection)
  Jumandic: いや【感動詞】 (interjection) + ー【記号・特殊】 (symbol, special)

Corpus Statistics
YACIS vs. jBlogs and JENAAD
jBlogs: blog corpus from 2006, 61 mil. words, 30 thousand blog docs
JENAAD: news corpus from 2003, 4.7 mil. words, Yomiuri (1989-2001)

JENAAD: Masao Utiyama and Hitoshi Isahara (2003), "Reliable Measures for Aligning Japanese-English News Articles and Sentences", ACL-2003, pp. 72–79.

Corpus Statistics
YACIS vs. jBlogs and JENAAD
[Charts: YACIS-ipadic, YACIS-jumandic]

Spearman's ρ     YACIS-ipadic  YACIS-jumandic  jBlogs  JENAAD
YACIS-ipadic     1             0.88            0.96    1
YACIS-jumandic                 1               0.79    0.85
jBlogs                                         1
JENAAD                                                 1
(jBlogs and JENAAD: ChaSen / ipadic)

Statistically, part-of-speech distribution is similar for ipadic across all three corpora:
  YACIS (large): 5,600,000,000 words
  jBlogs (medium): 61,000,000 words
  JENAAD (small): 4,700,000 words

It doesn't mean MeCab/ipadic is better.
It means: it is consistent regardless of corpus size. And even if it has errors, the errors appear consistently.*
* Within limited evaluation range: comparison of POS distribution

Corpus Statistics
Japanese vs. British English and Italian
• British English: ukWaC
  – 2 bil. words, .uk domain, POS, lemma
• Italian: itWaC
  – 2 bil. words, .it domain, POS, lemma
• Size comparable to YACIS: >1 bil. words
• Both from WaCky (Web as Corpus kool ynitiative) http://wacky.sslmit.unibo.it/doku.php

Corpus Statistics
Japanese vs. British English and Italian
[Charts: POS distribution for YACIS-ipadic, YACIS-jumandic and the two WaCky corpora*. Callouts give the order of the top categories: "Noun, Verb, Adjective" once and "Noun, Adjective, Verb" twice.]
* Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta, "The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora", Kluwer Academic Publishers, Netherlands, 2008.

Corpus Statistics
Japanese vs. British English and Italian
What does it mean?
Corpora of comparable size, but different languages (Japanese vs. 2 European languages), have different POS distribution.
If we set aside arguments about POS definitions, this could be a small hint that POS distribution is not universal across languages.

Conclusions
• Gathered YACIS
  – 5.6 bil. word corpus of Japanese blogs
• Annotated YACIS with Syntactic/Morphological information
  – POS, Tokenization, Lemma, Dependency Structure, Named Entities
• Evaluated YACIS by comparing to other corpora

Conclusions
• Corpora of the same language, but different size, have the same POS distribution
  = POS tagging is consistent
• Corpora of comparable size, but different languages, have different POS distribution
  = POS distribution IS NOT universal across languages

Future Work
• Online interface!
• More detailed evaluation (e.g. of dependency)
• Lexicon generation
• N-gram version for download without limitations
• Applications

Thank you for your attention!
Michal Ptaszynski
[email protected]

Discussion
• Copyrights
  – YACIS will not be put on sale
  – Only for scientific purposes
  – Usage of corpus will need a two-sided agreement
• Gathering of the corpus is similar to search engines
  – If YACIS were illegal, Google, Yahoo,… would be even more illegal.
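The Spearman's ρ comparison of POS distributions used in the Corpus Statistics slides can be sketched in a few lines. The code below is a plain-Python illustration of the rank correlation itself, not the authors' evaluation script; the function names and the two POS percentage tables are made up for the example and are not YACIS figures.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical POS shares (percent of tokens) for two corpora.
corpus_a = {"noun": 38.0, "particle": 29.0, "verb": 14.0, "adjective": 3.0, "adverb": 2.0}
corpus_b = {"noun": 41.0, "particle": 27.0, "verb": 11.0, "adjective": 2.5, "adverb": 3.5}

tags = sorted(corpus_a)  # fix one tag order for both corpora
rho = spearman_rho([corpus_a[t] for t in tags], [corpus_b[t] for t in tags])
print(f"Spearman's rho = {rho:.2f}")
```

Because only the rank order of the POS categories matters, two corpora of very different sizes can be compared directly from their relative frequencies.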