A look at the future of the denshi jisho
Designing the next generation of digital Japanese dictionary software

Final essay submitted in part fulfilment of the requirements for the degree Bachelor of Arts in the Japanese Languages and Cultures programme

Abstract

This essay aims to combine the fields of Japanology and Software Engineering in an attempt to leverage advances made in digital language processing and evaluate the possibilities of combining existing free and open resources into a single software dictionary application. Such an application could provide students of Japan, its language and its culture (from the beginner to the seasoned scholar) with a complete digital reference of Japanese words, names and characters that improves upon paper resources and conventional portable electronic dictionaries by allowing the user to retrieve information faster and in a more efficient way.

Final Bachelor Essay
Leiden University, Department of Japanese Languages and Cultures
Author: Jeroen Douwe Hoek, BEng
Student number: S0416851
E-mail address: [email protected]
Supervision: Dr Riikka Länsisalmi, Dr Rob Goedemans
Essay submitted on the 22nd of August 2007
Statistics: 9216 words

Table of Contents

Introduction
The evolution of digital dictionaries
A modest typography of common issues
    Obsolete kanji forms
    Japanese names
    Character information
    Finding kanji
Combining existing resources
Towards a next generation interface
    Obsolete kanji
    Japanese names
    Character information
    Finding kanji
Differing shades of “free”
The next step
Bibliography
Internet resources

Illustrations

Figure i: Obsolete kanji form – bōtoku (blasphemy), the second character (highlighted) has been simplified during the script reforms.
Figure ii: The kanji 藤 (fuji – wisteria) decomposed into its four parts.
Figure iii: The 艹 and 月 radicals decomposed into their strokes.
Figure iv: Illegible kanji – A small slightly enlarged sample from a copy of an encyclopaedia entry on the Nara period scholar and China traveller Kibi no Makibi.
Figure v: Information for the kanji 藤 with some information shown.
Figure vi: Information for the kanji 藤 tailored to the preferences of a specific user with a keen interest in the Korean language and a background in software engineering.
Figure vii: An example of how a hard to read character might be interpreted.
Figure viii: Concept kanji selector in its initial state.
Figure ix: The kanji selector after choosing a vertical split.
Figure x: The kanji selector after specifying the radical for the left section.
Figure xi: The kanji selector with all of our arguments and the only kanji in the Jōyō Kanji list that applies.
Figure xii: The kanji selector after setting a couple of arguments. Unset sections (in this case the lower right area of the kanji) act as wild cards.

Tables

Table i: Samples of hiragana, katakana and kanji. The text in the hiragana and katakana samples is a fragment of the Iroha poem, a classic poem, which is also a pangram, representing all the sounds in both syllabaries. The kanji sample is composed of a number of kanji from the Jōyō Kanji set.
Table ii: Kanji sharing a reading, with the part of the character commonly held responsible highlighted.
Table iii: Possible constraints that can be used to construct a query for searching a specific Chinese character.

Introduction

With the global popularisation of the Internet witnessed in the 1990s, a new distribution network arose that is distinctly lacking the boundaries that used to limit the dissemination of information. Now, the cost of publishing information in some paper form and the subsequent cost of distribution are no longer significantly limiting factors as they have been reduced to relatively negligible figures.
For a non-native learner of the Japanese language and culture (for studying one requires at least some degree of insight into the other), the possibilities of using, sharing and contributing to digital lexical resources for the Japanese language by means of the Internet have provided a wide variety of decent, accessible and free [1] alternatives to traditional paper dictionaries and encyclopaedias. What are the possibilities of consolidating this vast array of resources into a single user-friendly software application? This essay examines a number of possibilities by evaluating existing digitally available resources, such as dictionaries, and contrasts these against a number of problematic cases a learner of the Japanese language may encounter when reading or translating Japanese texts. For some of the more common issues described, a few conceptual software solutions (using and combining these resources) are presented.

This essay starts off with a brief overview of the history of electronic Japanese lexical resources. Then, by looking at some common scenarios encountered when working with the Japanese language and the ways in which one might deal with them using traditional resources, a reference point is established to the current state of the field. From there on a selection of existing free and open resources is reviewed and compared to the more traditional resources, and subsequently the scenarios encountered earlier are revisited with new approaches and user interface concepts, building on the free resources introduced. Finally a rationale for the strong focus on free software projects in this essay is given.

From a Western perspective, Japanese is often regarded as one of the most difficult languages to master. The issues of grammar, the differing word order [2] and the intricacies of correctly employing honorifics seem trivial compared to mastering the Japanese writing system, which uses two native syllabaries, the Roman alphabet and a rather large number of Chinese characters. [3] Getting to grips with the two phonetic syllabaries, the hiragana and katakana, each consisting of 47 characters, usually takes no more than a few weeks of practice, but having to learn to read and write the 2000 or so Chinese characters in daily use, or kanji as they are known in Japanese, is the cause of a steep incline of the learning curve.

Hiragana: い ろ は に ほ へ と ち り ぬ る を わ か よ … も せ す
Katakana: イ ロ ハ ニ ホ ヘ ト チ リ ヌ ル ヲ ワ カ ヨ … モ セ ス
Kanji: 亜 哀 愛 悪 握 圧 扱 安 暗 案 以 位 依 偉 囲 … 枠 湾 腕

Table i: Samples of hiragana, katakana and kanji. The text in the hiragana and katakana samples is a fragment of the Iroha poem, a classic poem, which is also a pangram, representing all the sounds in both syllabaries. The kanji sample is composed of a number of kanji from the Jōyō Kanji set.

Most of the issues and difficulties encountered when working with Japanese texts described in this essay come from my own practical experience as a non-native learner of the Japanese language. Although this essay is written from a Western perspective (or more specifically, a Dutch perspective), the ideas presented herein should apply to any non-native speaker of Japanese, especially those from cultures where the Chinese characters, be they known as hànzì, kanji, hanja, hán tự [4] or simply Han characters, [5] are not used in daily life.

[1] Free, in both senses of the word. That is gratis (available free of charge, as in “free beer”), and more importantly libre (granting the user the freedom to use, study, change and distribute it, as in “free speech”). Stallman 2002, p. 43.
[2] Put in linguistic terms, sentences generally use the subject-object-verb (SOV) sequence instead of subject-verb-object (SVO).
[3] The reader accustomed to Asian languages using Chinese characters might quite correctly remark that the Chinese themselves use far more Chinese characters than the Japanese in their daily usage of the language, but compared with the twenty-odd letters in the Western alphabets the Japanese do use “a fair amount” of characters.
[4] Respectively the Chinese, Japanese, Korean and Vietnamese pronunciation of the same compound word. These are the cultures where Chinese characters are, or were in the case of Vietnam, used. Lunde 1999, p. 4 & 50.
[5] Han characters is a direct translation of the two characters used for this term in Chinese, Japanese, Korean and Vietnamese.

The evolution of digital dictionaries

When speaking of Japanese lexical resources, three distinct families of resources can be discerned. Naturally, the traditional paper dictionaries and encyclopaedias are the first that spring to mind. Well-known dictionaries such as The New Nelson Japanese–English Character Dictionary, Kenkyusha's New Japanese–English Dictionary and the Kōjien (an authoritative Japanese dictionary with a good deal of encyclopaedic content) are but a few of these time-honoured and revered works. But respected as these tomes may be, they are hardly portable, which in part explains the popularity of the second family of resources: the portable electronic dictionary [6] (PED). Since the 1980s, Japanese electronics companies have been producing pocket-size devices with an ever increasing feature set, capable of providing quick and easy access to digitized versions of existing paper dictionaries. Fitted with a small keyboard and screen, these devices easily fit in the pocket of a jacket or a pair of pants and can perform their duties for ages on a single battery cell.
Current models literally carry a whole bookshelf's worth of dictionaries, practice material and other reference works, and some are able to interface with a laptop or desktop computer and provide their dictionary resources there. Prices range from €200 to €400, [7] depending on what dictionaries are included and the functionality of the device itself. The more extensive PEDs sold by manufacturers such as Canon, Casio, Seiko and Sharp come with handwriting recognition. A common feature is the ability to add more dictionaries via a memory card slot. These can be purchased separately, with the price depending on the dictionary.

Despite having the backing of a well-funded research and development department, the future of the commercial electronic dictionary is, in my opinion, limited, not by the manufacturer’s ability or lack thereof to innovate, but by the demands of the domestic Japanese market. Notwithstanding the impressive capabilities of these devices, the fact remains that they are designed and marketed primarily as an aid for Japanese students who wish to learn English or be able to consult one of the classic Japanese dictionaries at any time and any place. Or to put it in more concrete terms: students whose goal is to pass the demanding entrance exams in order to gain admission to Japan’s most prestigious schools and universities, and who are thus willing to pay for a tool that can help them study for these important events.

[6] The manufacturer Canon uses the brand name Wordtank for its product line of PEDs. In colloquial speech this term is often used as a blanket term for all PEDs with Japanese dictionaries.
[7] ¥35,000 and ¥70,000 respectively at the time of writing.
Although some of the ideas touched upon in this essay could well be of use to their primary target demographic, a lot of features that would be useful to someone learning Japanese as a secondary language might not have that same appeal. Although the producers of these devices are more likely than not well aware of a secondary group of users found in the growing group of foreign students, scholars and businessmen who use them as learning aids and portable dictionaries, they appear to remain reluctant to market this type of product beyond Japan’s national borders, or even export them for that matter. The foreign market for the Japanese electronic dictionaries depends mostly on third party importers selling products from the latest range, or a favour from a friend or colleague visiting Japan.

The third type of resource came into being with the emergence of the Internet and the World Wide Web in the 1990s. One significant aspect of this global network is the near absence of distribution costs for digital resources. If someone composes and publishes a useful resource on-line with the intent of spreading this information, then anyone, anywhere can freely access this resource. Perhaps the most well-known example of a free lexical resource dealing with the Japanese language is the Japanese–English dictionary best known as EDICT. Started in the early 1990s by Professor Jim Breen of Monash University in Australia, the project has evolved into an extensive dictionary comprising over 100,000 headwords. The fact that a lot of software dictionary applications that provide a Japanese–English dictionary (from the desktop computer to the handheld PDA, and on every operating system (OS) imaginable) use the data from EDICT is indicative of the popularity of this project.
Of course, EDICT is not the only example of a lexical resource accessible through the Internet. [8] There are a number of other successful bilingual Japanese dictionaries (the free Japanese–French and Japanese–German dictionaries are of particular note), and for an encyclopaedia there is Wikipedia, the famous poster child of social software advocates, which is available in many languages, with the English language edition holding the record of being the largest encyclopaedia in the world by far. [9]

[8] Accessible either by using a website, such as the WWWJDIC service that houses EDICT, or by downloading the material for off-line use.
[9] The English language edition is nearing two million articles as of July 2007.

A modest typography of common issues

Any non-native learner of Japanese will be familiar with at least some of the difficulties that can arise when working with Japanese texts. These include older kanji forms, place and personal names, kanji unknown to the reader, and words that are partially illegible due to age, nth-generation photocopies, or even coffee stains. Below, a modest selection of common problems is listed, along with the current solution to deciphering the particular word or character.

Obsolete kanji forms

From 1946 onwards, a series of script reforms [10] took place that led to the simplification of a number of kanji and a limit on the number of kanji permitted for use in official government publications, media and public education. When reading Japanese texts from the earlier half of the twentieth century and before, the older kanji forms can make a simple dictionary lookup a daunting task if the reader is not familiar with the older form of the character.

Original shape: 冒瀆 (ぼうとく)
Simplified shape: 冒涜 (ぼうとく)

Figure i: Obsolete kanji form – bōtoku (blasphemy), the second character (highlighted) has been simplified during the script reforms.
Figure i shows a word found in a modern text written after the script reforms, but which cites a wartime account of a soldier in the field. The text offered no furigana, [11] which complicated a dictionary lookup based on pronunciation. The word can be retrieved in two steps with the current generation of PEDs and the larger paper dictionaries, [12] provided the user is able to locate both kanji (either by drawing them into a PED, or by using a traditional radical and stroke count lookup) and realises that the second character is an older form. Limiting the search to merely examining the list of two-character words starting with 冒 is not sufficient, as neither type of resource lists the compound word with the older variant of the second kanji.

[10] Initially the number of kanji was limited to 1850 (known as the Tōyō Kanji); this set was revised in the 1980s to include 1945 kanji in total (the Jōyō Kanji). Several dozen kanji were simplified in shape (簡易字体, kan’i jitai) as part of the script reforms. Küenburg 1952, p. 230.
[11] Furigana or rubies are kana placed above or next to kanji to indicate the pronunciation. This is a common feature in texts aimed at a younger audience, but can also be employed for words that are considered unfamiliar, archaic or contain deprecated kanji.
[12] The New Nelson (Nelson 1997) for example points the reader to the simplified kanji when looking up an older form.

Japanese names

When the reading of a character or word is known, a dictionary lookup is relatively straightforward. Kanji used in Japanese names, however, can use a differing reading, called nanori, which may make finding the word in a dictionary (or more likely, an encyclopaedia) difficult. Unfamiliarity with Japanese names can aggravate the problem. Furthermore, the fact that a name is written in kanji does not guarantee that it is in fact Japanese.
A word that eluded me for the greater part of an afternoon when reading a Japanese text on wartime experiences turned out to be a Chinese name. If the encyclopaedia used to look up the word (provided the user has recognized it as being a name) is limited to Japanese topics, this may pose a problem.

Character information

In time, a learner of the Japanese language will, generally speaking, acquire a feeling for the pronunciation of kanji. In cases where a particular character is not known to the reader, the reading can often be guessed from a part of the kanji which often, though not always, corresponds to a certain reading. [13] The kanji in Table ii all share the same reading in the compound words shown (SEI in this case; the corresponding part is highlighted in the table). This information can be very useful as a mnemonic device for remembering the reading of kanji. As far as paper dictionaries go, this information is only available in specialized character dictionaries.

Kanji    Unicode codepoint    Example compound word
正       U+6B63               正義 (せいぎ, seigi – justice)
政       U+653F               政府 (せいふ, seifu – government)
征       U+5F81               征服 (せいふく, seifuku – conquest)
整       U+6574               調整 (ちょうせい, chōsei – adjustment / tuning)

Table ii: Kanji sharing a reading, with the part of the character commonly held responsible highlighted.

Any decent paper dictionary or PED will give the user plenty of general information on at the very least the Jōyō Kanji. Stroke order diagrams are commonplace and naturally the common readings, including the name readings, are present in the larger dictionaries. Chinese character dictionaries are traditionally indexed by radical, so the one radical that is considered to be the leading radical for a character is always shown. But what if one wants to know the exact decomposition of a more complex character? Chinese characters are composed of a limited set of parts. Most of these parts are the so-called radicals, but not all parts are.

Figure ii: The kanji 藤 (fuji – wisteria) decomposed into its four parts.

Of the parts of the kanji in Figure ii, three are classical Kangxi radicals, [14] namely 艹 (a variant of the grass radical ⾋), 月 (the moon radical) and 氺 (a variant of the water radical ⽔). The last part contains ⼋ (the radical eight) and four additional strokes. A part, for lack of a better term, is always composed of a number of standard strokes, as can be seen in Figure iii for the case of the grass and moon radicals. [15] A stroke is, as the name implies, a single stroke of a brush; that is, it is drawn without removing the brush from the canvas.

Figure iii: The 艹 and 月 radicals decomposed into their strokes.

[13] In compound words, often consisting of two kanji, the on-reading is used in most cases. The on-reading of a kanji is derived from the original Chinese pronunciation of the character, in contrast to the kun-reading which is native Japanese.
[14] Most character dictionaries are indexed according to the 214 traditional radicals as defined in the 1716 Kangxi dictionary. It is possible to find any kanji in the dictionary quite fast as long as the reader can recognise which part of the kanji acts as radical and how many strokes are left in the rest of the character. Nelson 1997, p. 1233.
[15] Lunde uses the term radical-like elements for the parts that aren’t directly related to a traditional radical. Lunde 1999, p. 55.
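The decomposition of 藤 into four parts can be sketched as a small tree, in which each inner node says how its children are arranged. This is only an illustration and not the data format of any particular resource: the `Node` class and the `parts` and `to_ids` helpers are hypothetical names, the layout of the parts is an assumption read off the glyph, and the fourth part (radical eight plus four strokes) is represented by a made-up placeholder label because the text gives no single character for it. The two operators are the Unicode ideographic description characters for "above to below" and "left to right".

```python
from dataclasses import dataclass
from typing import List, Union

ABOVE_BELOW = "\u2ff1"  # ⿱ Unicode ideographic description character
LEFT_RIGHT = "\u2ff0"   # ⿰ Unicode ideographic description character


@dataclass
class Node:
    """An arrangement operator applied to sub-parts (hypothetical helper)."""
    operator: str
    children: List[Union["Node", str]]


def parts(node: Union[Node, str]) -> List[str]:
    """Return the leaf parts of a decomposition, in reading order."""
    if isinstance(node, str):
        return [node]
    leaves: List[str] = []
    for child in node.children:
        leaves.extend(parts(child))
    return leaves


def to_ids(node: Union[Node, str]) -> str:
    """Serialise the tree in prefix notation (operator first, then its parts)."""
    if isinstance(node, str):
        return node
    return node.operator + "".join(to_ids(child) for child in node.children)


# Placeholder label for the part that has no character of its own in the text.
EIGHT_PLUS_FOUR = "[八+4]"

# Assumed layout: grass on top; below it, moon on the left and, on the right,
# the "eight plus four strokes" part above the water variant.
FUJI = Node(ABOVE_BELOW, [
    "艹",
    Node(LEFT_RIGHT, [
        "月",
        Node(ABOVE_BELOW, [EIGHT_PLUS_FOUR, "氺"]),
    ]),
])

if __name__ == "__main__":
    print(parts(FUJI))
    print(to_ids(FUJI))
```

A structure like this keeps both pieces of information the text asks for: which parts a character contains, and how they are arranged.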
Finding kanji

The numerous kanji present in every aspect of the Japanese language pose a challenge to learners of the language not familiar with these characters of Chinese origin. Roughly 2000 characters, known collectively as the Jōyō Kanji, are deemed suitable for daily use and are used in government publications, newspapers and education. A good 1000 of these, the Kyōiku Kanji, are taught in elementary education. Add to this list the kanji permissible for use in personal names [16] (variant shapes of kanji in the Jōyō Kanji list as well as unique shapes) and a penchant for using obsolete kanji in place of their simplified variants as a stylistic device, and it becomes clear why finding kanji in a dictionary can be a problem for non-native users.

When at least one reading of a kanji is known, looking it up won’t pose a problem. If the reading is not known, the character can be found in the index of a traditional paper dictionary or a PED by using the traditional radical and stroke count method. Most IMEs and the newer PEDs allow the user to draw the character with a stylus or mouse, but these methods depend on the user being familiar with the standard stroke order.

Figure iv: Illegible kanji – A small slightly enlarged sample from a copy of an encyclopaedia entry on the Nara period scholar and China traveller Kibi no Makibi.

[16] These kanji are known as the jinmeiyō kanji. In September of 2004 the list of allowed kanji was extended significantly to include a total of 983 characters. Japanese Standards Association 2004, p. 1.

However, finding kanji by means of the radical and stroke method or by drawing it is not always possible. Figure iv shows a copy of an encyclopaedic entry handed to students taking a course on how China and its culture affected Japanese literature throughout the years. The professor teaching the class copied the text in advance from another copy.
Unfortunately, somewhere along the way this short section of text had lost its details, making class preparation somewhat challenging. To find out what the kanji here means, the methods mentioned above are insufficient. Our search is further complicated by the fact that the kanji in this example stands alone. It is not part of a compound word, which precludes us from looking it up in an electronic dictionary by entering the other character(s) and specifying a wild card. [17] Our best bet is probably to guess at the radical and a range for the number of strokes this kanji might consist of, and spend a good deal of time poring over a dictionary in the hopes of finding it.

[17] Most PEDs support a method to define a wild card. One could search for “all compound words starting with this kanji and another kanji I don’t know” and more often than not, it will return a relatively short list of possible results.

Combining existing resources

In the next section of this essay, a number of existing free resources are combined to form a solution to the issues enumerated above. Here a number of representative free resources are introduced.

Due to the prevalence of the English language on the Internet and in scholarly research, a large number of usable English language resources exist to assist the non-native user of Japanese in understanding the language. The most well-known example of a free lexical resource on the Japanese language is the Japanese–English dictionary called EDICT, mentioned earlier. For other languages, the state of bilingual resources to or from Japanese varies greatly, from promising to disheartening. For speakers of the German language for example (over 100 million people), the WaDoku dictionary provides an excellent alternative to paper Japanese–German dictionaries.
The situation for the Dutch language stands in sharp contrast with this, with the last paper Japanese–Dutch dictionary being Van de Stadt’s Nichi–Ran Jiten published in the 1930s. [18] On the digital front some progress is being made in the form of a collaborative project run by the Catholic University of Leuven in Belgium. Their WaRan [19] project employs Wiki software [20] to allow students and scholars alike to add words and definitions.

One technique employed to create a new bilingual dictionary with Japanese being one of the languages is to use the English language as a pivot and combine EDICT with another free English to X dictionary, with X being the desired second language. This method is being used for Japanese–Swedish and Japanese–Slovene dictionaries. [21]

The Papillon project [22] goes a step further, and strives for the noble goal of a multilingual free dictionary. The idea behind this project is to combine existing efforts such as those mentioned above, and create a dictionary that uses an interlingual pivot internally. That is, it links words from different languages on the basis of their meaning. As of yet, the project appears to be still in the early stages of development.

[18] There is however a Dutch–Japanese dictionary, published by Kodansha in 1994. Currently a project is under way to digitize the Van de Stadt dictionary, and the scanned pages are all indexed and available for browsing at the project website.
[19] The transliteration of 和蘭, the Chinese characters to represent the Japanese and Dutch languages.
[20] Schiltz, Truyen & Coppens 2007, p. 97–101. Or, unscholarly as it may be, simply visit Wikipedia and read the articles on Wikis and Wikipedia itself.
[21] Paik, Bond and Satoshi 2001, Sjöbergh 2005.
[22] Mangeot 2000, p. 4–6.

Although Papillon is not mature enough to consider as a usable resource at this point in time, one of the resources it uses certainly is.
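The pivot technique mentioned above (combining a Japanese–English dictionary with an English to X dictionary) can be illustrated with a minimal sketch. It does not reproduce the method of the cited projects: the function name `build_pivot_dictionary` is hypothetical, and the two tiny word lists are invented toy data rather than entries taken from EDICT or any real English–Dutch dictionary.

```python
from collections import defaultdict
from typing import Dict, List


def build_pivot_dictionary(
    ja_en: Dict[str, List[str]],
    en_x: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Join a Japanese-English and an English-X word list on the English gloss."""
    ja_x = defaultdict(set)
    for ja_word, en_glosses in ja_en.items():
        for gloss in en_glosses:
            # Every X translation of the English gloss becomes a candidate.
            for target in en_x.get(gloss, []):
                ja_x[ja_word].add(target)
    return {ja_word: sorted(targets) for ja_word, targets in ja_x.items()}


# Invented toy data, for illustration only.
JA_EN = {
    "辞書": ["dictionary"],
    "犬": ["dog"],
    "銀行": ["bank"],
}
EN_NL = {
    "dictionary": ["woordenboek"],
    "dog": ["hond"],
    "bank": ["bank", "oever"],  # financial institution vs. river bank
}

if __name__ == "__main__":
    for ja_word, candidates in build_pivot_dictionary(JA_EN, EN_NL).items():
        print(ja_word, candidates)
```

The entry for 銀行 shows the main weakness of a naive pivot: the English gloss "bank" is ambiguous, so the river bank sense ("oever") leaks into the result. A practical implementation would need some way of filtering such candidates.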
JMdict is the name of what is essentially a superset of EDICT. It goes beyond the bilingual dictionary by combining a number of other dictionaries with Japanese lemmata into one single resource using Japanese as its pivot language, currently containing glosses in English, German and French as well as a limited number of words in Russian.

Besides JMdict and its “classic” subset EDICT, Professor Breen also provides a number of related lexical resources on his website. There are several Japanese to English dictionary files containing specialist terms, for the fields of computing, [23] bio-medical science and even forestry to name but a few. Another smaller, but useful resource is the 4JWORDS dictionary, which provides English translations for the idiomatic four-character compound words [24] used in Japan.

As with any language that uses Chinese characters, a character dictionary is extremely useful to have. Two well-established resources are Professor Breen's Kanjidic and the Unihan database made available by the Unicode Consortium. Kanjidic provides the Japanese and Chinese readings of the roughly 11,000 Chinese characters in the Japanese industrial standards JIS X 0208-1990 and JIS X 0212-1990. Other data on the characters include references to their location in a number of paper character dictionaries, the radical used to index them and the meanings associated with the kanji. The Unihan database, a by-product of the Unicode standardisation effort, provides similar data, but includes Chinese characters unique to China and Korea as well.

A different type of project is CHISE, the CHaracter Information Service Environment. The aim of this project is to develop a character processing environment that is not necessarily limited to a Chinese character being defined in Unicode or one of the national character encoding standards.
Each character is defined by a collection of its features, [25] such as its codepoints in different standards (if present), its readings and its composition. This last property in particular is used in the examples below: a set of operators is used to list the radicals or parts a character is composed of and how it is composed. This is the Ideographic Description Sequence (IDS). [26] These three resources combined provide a complete and continually evolving source of information on the Chinese characters.

[23] COMPDIC, as this glossary is called, contains over 14,000 entries.
[24] Similar to sayings and proverbs in Western languages, the meaning of these words can often not be inferred from the characters they are composed of.
[25] Morioka 2005, p. 2.

For encyclopaedic references, the largest and arguably most successful project is Wikipedia. Available in a plethora of languages and with more articles than any other encyclopaedia, it suffices to say that it is a very usable resource, although the quality of articles does vary greatly.

How do the free resources available compare to their traditional paper counterparts and the resources included on modern PEDs? This is a difficult question, and the safest answer is “it mostly depends on what the user needs”. Naturally, a rough comparison can be made by just looking at the statistics. How many lemmata does this dictionary have? How many articles in this encyclopaedia? How many Chinese characters in this character database? But quantity is only part of the equation; what the user expects from a resource and what he uses it for is an important aspect as well. A connoisseur of the Japanese classics will probably be better off with the type of time-honoured and revered Japanese dictionary that cites the Man’yōshū [27] in its examples.
An anthropologist studying contemporary Japanese culture, on the other hand, may prefer a resource that can be amended at any time because of the volatile nature of information in his field. Free resources such as EDICT and Wikipedia allow for user contributions, and both have their own means of quality control. The future of collaborative (or social) software seems promising, but this phenomenon, being a relatively recent development, is still being explored and the exact dynamics of what makes or breaks a collaborative project are vaguely understood at best. A well thought-through user interface, an active community, a solid project structure and a clear hierarchy, even something as mundane as the look and feel of the website and project tools: these are all factors that can combine to form a successful, thriving project or spell its untimely demise.

[26] A different method that focuses more on automatically generating typefaces with Chinese characters is the Character Description Language. Bishop and Cook 2003.
[27] A well-known classic in Japanese literature.

Towards a next generation interface

Obsolete kanji

Now that the technological barriers that prevented the usage of all sorts of kanji (obsolete character forms, unique shape variations and even custom Chinese characters) are slowly, but surely, being dealt with, the future of Chinese characters in computing looks bright. At the moment it would be trivial to implement a system using the data from the CHISE project to show the user which variants or relatives any character has. If a dictionary application for example encounters the situation in Figure i and cannot find any word matching the requested query, it might suggest to the user to continue looking for the word with the older characters replaced by their modern variants.
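The fallback described here can be sketched in a few lines. The sketch rests on several assumptions: the variant table holds only the single pair shown in Figure i rather than real CHISE data, the one-entry dictionary is a toy stand-in for a resource such as EDICT, and the names `modernise` and `lookup_with_fallback` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Older form -> modern form. Only the pair from Figure i is listed here;
# a real application would load such mappings from a character database.
MODERN_VARIANTS: Dict[str, str] = {"瀆": "涜"}

# Toy dictionary that, like the resources in the text, lists only the modern form.
DICTIONARY: Dict[str, str] = {"冒涜": "blasphemy"}


def modernise(word: str, variants: Dict[str, str] = MODERN_VARIANTS) -> str:
    """Replace every older character form in the word by its modern variant."""
    return "".join(variants.get(char, char) for char in word)


def lookup_with_fallback(
    word: str,
    dictionary: Dict[str, str] = DICTIONARY,
) -> Optional[Tuple[str, str]]:
    """Look the word up as typed; if that fails, retry with modern variants.

    Returns the headword that matched together with its gloss, or None.
    """
    if word in dictionary:
        return word, dictionary[word]
    candidate = modernise(word)
    if candidate != word and candidate in dictionary:
        return candidate, dictionary[candidate]
    return None


if __name__ == "__main__":
    print(lookup_with_fallback("冒瀆"))
```

Returning the matched headword alongside the gloss lets the interface tell the user that a substitution took place, so the suggestion remains a suggestion and not a silent correction.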
Conversely, if a user wants to know the older character forms of the kanji in a word, or even the current simplified Chinese equivalents of a given Japanese kanji, the application could easily provide them, as this data is available in the CHISE database as well.

Japanese names

The solution to the issue of unfamiliar names is mostly a matter of finding a way to present the potentially large amount of information to the user in a sensible manner. Resources such as Wikipedia spring to mind when searching for a word such as 吉備真備,28 whereas a dictionary such as ENAMDICT can be valuable in determining the reading of a name written in Chinese characters. It is important to recognize that, whilst in most cases it may be obvious that a word is the name of someone or something, sometimes this is not the case. When someone inputs any word into a dictionary, it should not limit itself to any specific resource, unless the user requests this, but retrieve information from wherever the word is indexed. Computers, including PEDs, have a clear advantage over paper dictionaries here. A solution that bases itself on resources such as Wikipedia also benefits from being highly up to date. When using such an application to translate, for example, news in today's newspaper, this can be of great use.

28 Again, the Japanese scholar Kibi no Makibi.

Character information

For any one Chinese character a large amount of information is freely available. The Unihan database alone offers a myriad of data, ranging from the codepoint entries in different character-set encoding schemes to the different readings. Of obvious interest are the Japanese readings of a character, but depending on who uses the application different properties might be interesting.
Someone who owns the New Nelson might want to display the index number when the character can be found in that dictionary, whilst someone taking Korean classes might like to know the Korean reading of a character as well.

Figure v: Information for the kanji 藤 with some information shown.

Figure v shows a basic set of information for the character 藤, including its composition drawn from the CHISE database and readings and variants from the Unihan database. The idea behind this design is that all the button-like fields with the white background and grey border act as hyperlinks to more information. Clicking on one of the readings could present the user with a list of characters that share this reading; clicking on one of the radicals might show some more information on the radical, such as the origin of its shape. It could also display all characters that use that radical as their leading radical. Basically, allowing the user to follow link after link to his heart’s content is very similar to surfing the Internet or browsing Wikipedia: with minimal effort it is possible to quickly jump to related information. Figure vi shows information for the same character, but with the interface configured to the needs of a specific user.

Figure vi: Information for the kanji 藤 tailored to the preferences of a specific user with a keen interest in the Korean language and a background in software engineering.

Finding kanji

Using the radical and stroke method to find the blurred kanji from the earlier example is tricky, and drawing the character on a PED or in an IME is out of the question when the character is this hard to read. However, although we may not be able to clearly distinguish the strokes, some features do stand out. Provided the user is familiar with the more common radicals and shapes that occur, he might reason as follows:

• This could be the ⾦ (gold) or ⾷ (eat) radical.
• Looks like […] and the radical 一 (one) stacked on top of each other.

Figure vii: An example of how a hard to read character might be interpreted.

Even clever indexes such as the Universal Radical Index29 in the New Nelson won’t be of help here. These allow the user to find any kanji as long as he can recognise a radical appearing in it, regardless of position, and knows the total stroke count, but that is information we do not have in this case. What if a dictionary allowed the user to retrieve this type of hard-to-find kanji by specifying a query of sorts using arguments such as the pair above? With CHISE we have a database with information about the composition of the kanji. It would be quite possible, although challenging, to generate a database structure that allows the user to do exactly this. Consider the following widget30 (Figure viii).

29 A more comprehensive explanation can be found in Nelson 1997, p. 1370, followed by the index at ibid., p. 1371–1600.

Figure viii: Concept kanji selector in its initial state.

The icons on the “Add rule” bar are the basic set of tools to construct our query. The upper row of icons corresponds largely to the characters defined in the IDS block in Unicode, and serves a similar purpose. However, instead of defining the exact composition of a Chinese character, these help define a set of rules the sought-after kanji must comply with. It should be noted that the user needs no knowledge of CHISE, IDS, Unicode or the exact IDS for the kanji. In fact, if the composition of a certain character can be specified in more than one way, the application should transparently handle this.31 The “Structure unknown” icon can be used to add a wild card to the query with regard to the structure of the kanji. This makes it possible to create a query that says “somewhere in this part of the kanji I want, this structure or radical occurs”.
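Before any such query can be evaluated, the composition data has to be available in a structured form. The following is a minimal sketch of how an IDS string might be turned into a tree, assuming that only the twelve original operators of the Unicode Ideographic Description Characters block (U+2FF0 to U+2FFB) occur and that every other character is a leaf component. The helpers parse_ids and components are hypothetical names and not part of CHISE. The two strings used are the two descriptions of 藤 discussed in note 31.

```python
# Arity of each Ideographic Description Character (U+2FF0..U+2FFB).
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY[chr(0x2FF2)] = 3  # left, middle, right
ARITY[chr(0x2FF3)] = 3  # top, middle, bottom


def parse_ids(ids):
    """Parse a prefix-notation IDS string into nested tuples:
    a leaf is a single character, a node is (operator, child, ...)."""
    def parse(pos):
        ch = ids[pos]
        if ch not in ARITY:
            return ch, pos + 1
        children = []
        pos += 1
        for _ in range(ARITY[ch]):
            child, pos = parse(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters in IDS: " + ids[end:])
    return tree


def components(tree):
    """List the leaf components of a parsed IDS from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in components(child)]


CHISE_IDS = "⿱艹⿸⿰月龹氺"     # as used in the CHISE project
EXPECTED_IDS = "⿱艹⿰月⿱龹氺"  # the form one would probably expect

same_parts = components(parse_ids(CHISE_IDS)) == components(parse_ids(EXPECTED_IDS))
```

Both descriptions yield the same list of leaf components even though their tree shapes differ. Comparing components rather than exact trees is one possible way, among others, for an application to handle alternative descriptions transparently.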
Next are four constraints that can be added to any section of the query. Using these we can specify that a specific section (that is to say, a part of the kanji described by the description icons above) either contains a specified radical, and possibly some other bits, or is exactly that radical. There is no reason to limit this functionality to radicals only: provided the user already has an IME for Japanese, he could just as easily specify that “this section of the kanji I’m looking for is exactly this kanji that I already know”. The last two options can be used, respectively, to set a global limit on which collection or collections of kanji we should search in, and how many strokes the whole or a section of the kanji may have. This stroke limit can be set as a range.

30 The term widget is usually employed for the buttons, text fields and menus that make up a graphical application, but can also be applied to larger interface elements that perform a specific function.
31 For example, the IDS used in the CHISE project may seem odd at times. The IDS for 藤 is ⿱艹⿸⿰月龹氺, where one would probably expect ⿱艹⿰月⿱龹氺.

Table iii: Possible constraints that can be used to construct a query for searching a specific Chinese character (icons omitted).

• Vertical split
• Horizontal split
• Full enclosure
• Surround from lower left
• Surround from upper left
• Surround from upper right
• Enclose from the left
• Enclose from above
• Enclose from below
• Overlap
• Structure unknown
• Section is radical
• Section contains radical
• Section is kanji
• Section contains kanji
• Must be in collection(s)
• Section has a number of strokes in this range

It should be noted that these icons are by no means definitive: although the Chinese characters used here are illustrative of the constraints they represent, this is probably not the most accessible way to present them. From this starting position, we enter our first argument.
We suspect either the gold or eat radical to be positioned all along the left side of our elusive character, so we enter this in two steps. First we indicate that we want to define the left side of the kanji, so we select the vertical split, and subsequently click the left section created by the split to add another constraint (Figure ix).

Figure ix: The kanji selector after choosing a vertical split.

Next we choose the “section is radical” option for the left section of our query (Figure x), and specify both radicals that are likely candidates in this case.

Figure x: The kanji selector after specifying the radical for the left section.

As we are limiting ourselves to the Jōyō Kanji, the list of possible results is already limited to a more manageable number. There should now be 36 kanji in the result list, and the user could probably already spot the outline of our blurry friend at a glance. So far this process seems straightforward enough: selecting these radicals from a grid ordered by stroke count would generally be quite fast, as the user will be familiar with, at the very least, the most common radicals. However, the next part is tricky, as we want to enter a part into our query that is not a traditional radical, but a fragment composed of a non-radical part and the radical 一 (one). Note that because the exact composition of the section where this kanji part lies is not known to us, we add the “Structure unknown” rule, and after that the horizontal split with the two parts we have identified, or at least suspect. With the last arguments in place, the kanji turns out to be 鑑.

Figure xi: The kanji selector with all of our arguments and the only kanji in the Jōyō Kanji list that applies.
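The walkthrough above can be mimicked with a small rule matcher. This is a minimal sketch rather than a design for the actual widget: the decompositions in KANJI are hand-written for illustration and not taken from CHISE, JOYO is a five-character stand-in for the real collection, and the pattern notation (ANY, sets of alternatives, the "somewhere" wrapper) as well as the names matches, subtrees and search are assumptions made for this example only.

```python
ANY = None  # wild card: an unset section matches anything

# Hand-written toy decompositions as (operator, child, child) tuples.
KANJI = {
    "鑑": ("⿰", "金", ("⿱", ("⿰", "臣", ("⿱", "𠂉", "一")), "皿")),
    "銀": ("⿰", "金", "艮"),
    "鉄": ("⿰", "金", "失"),
    "飲": ("⿰", "食", "欠"),
    "語": ("⿰", "言", "吾"),
}
JOYO = set(KANJI)  # stand-in for the "must be in collection" constraint


def subtrees(tree):
    """Yield a section and every section nested inside it."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)


def matches(tree, pattern):
    """Check one section of a kanji against one section of the query."""
    if pattern is ANY:
        return True
    if isinstance(pattern, (set, frozenset)):   # "section is one of these"
        return tree in pattern
    if isinstance(pattern, str):                # "section is exactly this"
        return tree == pattern
    if pattern[0] == "somewhere":               # "structure unknown"
        return any(matches(sub, pattern[1]) for sub in subtrees(tree))
    return (isinstance(tree, tuple)             # structural rule
            and tree[0] == pattern[0]
            and len(tree) == len(pattern)
            and all(matches(t, p) for t, p in zip(tree[1:], pattern[1:])))


def search(pattern, collection=None):
    """Return all kanji in the collection whose composition fits the query."""
    return sorted(k for k, tree in KANJI.items()
                  if (collection is None or k in collection)
                  and matches(tree, pattern))


# Step 1: split into left and right, left section is the gold or eat radical.
first_step = ("⿰", {"金", "食"}, ANY)
# Step 2: somewhere in the right section, something is stacked on top of 一.
full_query = ("⿰", {"金", "食"}, ("somewhere", ("⿱", ANY, "一")))
```

On this toy data the first query keeps the four characters with 金 or 食 on the left, and the full query narrows the result to 鑑 alone. Re-running search after every added rule gives the kind of shrinking candidate list the walkthrough describes.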
In the illustrations above, the kanji that satisfy the arguments (the result list) might be displayed below or next to the widget. If this list is updated dynamically as the user adds constraints, he has the opportunity to stop adding more arguments and simply select the desired kanji as soon as he spots it in the candidate list.32

This “kanji selector” concept could be called from the main dictionary interface, and would start with one constraint already added: a sensible constraint initially limiting the results to the Jōyō Kanji. The user could of course choose to set this limiter to a different collection such as only the Kyōiku Kanji, Jinmeiyō Kanji, or kanji defined in a character encoding standard, such as the Japanese JIS X 0213 or perhaps a Chinese or Korean standard. As long as a collection is defined it could be used, including a set defined by the user himself. Being optional, the constraint can also be removed altogether to broaden the search to basically any kanji defined by CHISE or a similar database.

Of course, such a method is by no means a replacement for quickly entering kanji and compound words by pronunciation or, if that fails, drawing them, but it may prove an interesting solution for cases where the requirements for these traditional methods cannot be met. One application might be a case where an eager student of Japanese wonders what kanji featuring the ⾨ (gate) radical appear in the Jōyō Kanji list (12) or in the Kyōiku Kanji subset (8). Or consider a situation where the exact composition of a kanji seen somewhere just barely escapes the user, but he remembers quite clearly that it contained the ⺾ (grass) radical at the top, and had a ⽉ (moon) radical below it on the left (Figure xii). A tool such as this might also be of use to linguistic scholars interested in performing some form of quantitative analysis on the Japanese language.
It would be unreasonable to expect every user presented with this method to be able to distinguish between the ⽉ (moon) and ⾁ (meat) radicals, which have the same shape in some positions. An implementation of this concept should consider these radicals the same when executing the query, unless the user explicitly requests only one of them. This problem is hardly new: it is not uncommon for dictionaries to index kanji with a moon or meat radical under only one of these.33

32 This is similar to the behaviour of the kanji drawing widgets found in IMEs and modern PEDs.
33 The New Nelson for instance indexes all kanji with the moon radical under radical 140, meat, except for the moon radical itself. Nelson 1997, p. 2530.

Figure xii: The kanji selector after setting a couple of arguments. Unset sections (in this case the lower right area of the kanji) act as wild cards.

When dealing with a limited set of Chinese characters, such as the Japanese Jōyō Kanji list, this type of input method’s usefulness may appear to be limited to being a last resort for situations such as those described above, rather than an actively used tool. However, more scientific applications of this method could very well arise with the ever increasing capabilities of the OS34 to handle even the most obscure or even made-up Chinese characters. Whilst the Chinese, Korean or Japanese IMEs are geared towards entering characters from their respective common use character collections, this type of structured method could be used for characters that cannot be entered via an IME.
This could potentially be useful in providing a method to access characters that are available at the software level (either encoded in the character set used, or through an environment that provides a different way of accessing these characters) considerably faster than picking them from a huge matrix, such as those provided by character map applications.

34 Or more specifically, the underlying libraries responsible for handling and displaying text.

Differing shades of “free”

Throughout this essay a heavy focus on free resources has purposely been maintained. The term free, when applied to software or other digital resources, can be slightly confusing due to its ambiguity in the English language. Often it is interpreted as meaning gratis. It is true that a resource licensed under a free licence is often available free of charge, but this is not what free means in this case. There are plenty of examples of resources related to the subject of this essay that are gratis but not free. Consider for example a piece of software that provides a dictionary application, but which is not released under a free licence.35 Or think of a popular on-line dictionary provided as a service by a large company. This statement by a team of researchers working on a free Korean–Japanese dictionary is not uncommon:

Finally, this research was made possible by the existence of a number of open source resources. The results of this research will, of course, be made open, and we have filed bug reports and updates with many of the resources we use. In doing so, we produce better resources for everyone to use, so that the tedious process of compiling lexicons does not have to be repeated over and over again.
We hope and expect that this will become standard, so that each generation of researchers can build not only on the ideas of their predecessors, but also on the knowledge that they have compiled.36

This project made use of the fact that a large number of words in Korean and Japanese share the same Chinese characters37 and coupled words that matched. Existing digital Japanese–English and Korean–English dictionaries were used to further refine the results by using the English words as a pivot. Unfortunately, the results of this project are no longer distributed because of strong suspicions that the Korean–English dictionary used as a resource was not a free resource, but copied from an existing non-free dictionary.38

35 These types of software often carry the label freeware (not to be confused with free software) or shareware.
36 Paik, Bond & Satoshi 2001, p. 66.
37 The Korean hanja weren’t simplified after the war in the way the Japanese kanji were, but it is relatively simple to create a conversion table between the traditional hanja and simplified kanji.
38 This team would not have used this resource had they known that it was not a free resource.

Anecdotes such as these help stress the importance of being aware of the restrictions placed upon resources when creating a derivative work. Once such information is free, it stays free.39 In my opinion, at the risk of sounding too idealistic, the use of free licences is not only beneficial to scientific progress, but inherently superior to non-permissive licensing models, as resources using them often benefit humanity as a whole. Most of the academics and developers working on the type of resources described in this essay appear to subscribe to this view.

39 Software philosopher Richard M. Stallman is generally regarded as an authority on the subject of free software. The essays in Stallman 2002 provide a solid introduction to the matter.
The next step

Due to the way characters are encoded, variant shapes of kanji are not always natively supported on the modern OS. Figuring out a way to standardize the handling of character variants, obscure Chinese characters from ancient sources and even custom-made characters40 is one of the challenges that lie ahead.41 Projects like CHISE may help in providing a future method of encoding these characters, but this does not address all of the usage issues. Consider for instance a user who is running a search query on a selection of documents. If he’s looking for a name42 that can be written with a variant of a more common character, then it would be useful if documents containing that name written with that particular character could show up in the search results as well. Within the limits of currently available software, this problem can be worked around by providing the dictionary software with its own search routines that can do this, but in time this kind of lower-level text processing feature should be dealt with at the OS level. Being able to deal with all sorts of characters is a good thing, especially in light of the desire to digitise all kinds of cultural documents, from present day novels to the Kojiki.

A realisation anyone venturing into the field of digital dictionaries will quickly come to is that it is not just one field. Consider building the “perfect digital dictionary application”, with a pleasant and usable interface capable of displaying potentially large amounts of information drawn from several different resources in a clear and concise manner, well-thought-out input capabilities (either through the OS’s input methods or methods similar to the kanji selector discussed above) and well-written documentation.
Add to this the cultural and linguistic aspects of the Japanese language and the complexity of the Chinese characters, and you end up with a project that happily crosses the boundaries between Linguistics, Software Engineering, Human-Computer Interaction and Japanology. Presenting colourful mock-up illustrations of some application design concepts is one thing; creating a working, intuitive tool is a completely different matter. One thing to look out for when trying to make numerous resources accessible through a single interface is the risk that the casual user gets overwhelmed by the information.43

40 This is an issue in Taiwan, where people have the option of creating new Chinese characters for use in personal names.
41 Jenkins 2004, p. 29.
42 One example sometimes cited is the case of the former Japanese prime minister Yoshida Shigeru’s name (𠮷田・茂). The first character of his name is written with a variant of the more common 吉 (note the differing stroke length of the topmost horizontal stroke). Although this particular variant has been added to the Unicode standard recently, many people use the more common kanji simply because it is what the IME returns for the name Yoshida.

Where do we end up if we think of the future of electronic dictionaries? With the current influx of information made available through the Internet and the ever increasing versatility of (portable) computers, it is likely we will end up with a device capable of providing us with much more than a digitised bookshelf.
A dictionary lookup of the name of any Japanese town would yield not only the pronunciation and information about the kanji in the name, but also information about the town itself, its history, its environment, the people, local customs, notable sights and museums, where to get a great meal, or where to eat and sleep on a shoestring budget, maps, directions, the list goes on.44 A device capable of doing this (that is, providing context to the character, word or phrase we are looking for) sounds remarkably similar to the (fictional) Hitchhiker’s Guide to the Galaxy,45 albeit slightly more focussed on earthly matters. Or perhaps we will end up with a form of universal translator as seen in the science fiction series Star Trek? Experiments with devices that go beyond the written language and are capable of dealing with phrases spoken to and from it seem promising.

The list of existing Japanese lexical resources mentioned in this essay is by no means exhaustive. New initiatives that could be of use for a user-friendly software application that allows a user to perform the actions described above continue to emerge, and existing projects continue to evolve and improve. Just keeping tabs on all the available relevant resources and their progress is quite challenging. It will be interesting to experiment with new interface concepts and the combination of the various resources to find out what works and what does not.

43 Human interface guidelines tend to stress this point, and rightly so. Benson et al. 2004, p. 4.
44 There is of course no reason to think these developments will stop at “only” providing foreigners with a firm selection of resources to help comprehend Japan and its language and culture. Information going beyond a dictionary definition can help to put words, phrases and names in context for anyone, regardless of the language or culture concerned.
45 As eloquently described in the five-part trilogy The Hitchhiker’s Guide to the Galaxy by science fiction author Douglas Adams.

Bibliography

• Benson, Calum, et al., GNOME Human Interface Guidelines 2.0, The GNOME Usability Project, 2004.
• Bishop, Tom and Richard Cook, “A Specification for CDL – Character Description Language” presented at the Glyph and Typesetting Workshop, Kyoto, 2003.
• Breen, Jim, “Building an Electronic Japanese–English Dictionary” presented at the Japanese Studies Association of Australia Conference, Brisbane, 1995.
• Breen, Jim, “Computing in Japanese – what are the frontiers now?” presented at the Workshop on Computational Japanese Studies, University of Tokyo, 2007.
• Japanese Standards Association – Information Technology Standardisation Center, “Jinmeiyō Kanji no Moji Fugō ni Kansuru Kikaku Kentōkai Hōkoku” in Hyōjunka Jānaru 34:11 (2004).
• Jenkins, John H., “The Dao of Unihan” in Proceedings of the 26th International Unicode Conference (IUC-26), 2004.
• Küenburg, Max, “Tōyō Kanji – The Story of Modern Japanese Characters” in Monumenta Nipponica 8:1/2 (1952), p. 230–238.
• Lunde, Ken, CJKV Information Processing, Sebastopol: O’Reilly & Associates, 1999.
• Mangeot, Mathieu, “Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links” presented at WAINS 2000.
• Morioka, Tomohiko, “Character Processing Based on Character Ontology” presented at the Sino-Japanese Joint Symposium on New Technologies Concerning the Storage of Chinese Character Literature, Beijing, 2005.
• Nelson, Andrew N. and John H. Haig, The New Nelson Japanese–English Character Dictionary, Tuttle Publishing, 1997.
• Paik, Kyonghee, Francis Bond and Shirai Satoshi, “Using Multiple Pivots to align Korean and Japanese Lexical Resources” in Proceedings of NLPRS (2001), p. 63–67.
• Schiltz, Michael, Frederik Truyen and Hans Coppens, “Cutting the Trees of Knowledge: Social Software, Information Architecture and Their Epistemic Consequences” in Thesis Eleven 89 (2007), p. 94–114.
• Sjöbergh, Jonas, “Creating a free digital Japanese–Swedish lexicon” in Proceedings of PACLING (2005), p. 296–300.
• Stallman, Richard M., Free Software, Free Society: Selected Essays of Richard M. Stallman, GNU Press, 2002.

Internet resources

• 4JWORDS, Four-Character Idiomatic Compounds: [http://home.earthlink.net/~4jword/index3.htm].
• CHISE IDS-find interface: [http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find].
• CHISE project page: [http://kanji.zinbun.kyoto-u.ac.jp/projects/chise].
• EDICT, JMdict, Kanjidic, etc.: [http://www.csse.monash.edu.au/~jwb/japanese.html].
• Google Earth: [http://www.google.com/earth].
• Google Maps: [http://www.google.com/maps].
• Jiten.nl, Japans–Nederlands woordenboek van Peter Adriaan van de Stadt, Nichi–Ran Jiten 1934: [http://www.jiten.nl].
• Papillon Project: [http://www.papillon-dictionary.org].
• Reading Tutor: [http://language.tiu.ac.jp].
• Unicode Unihan Database: [http://www.unicode.org/charts/unihan.html].
• WaDoku, Japanisch–Deutsches Wörterbuch: [http://www.wadoku.de].
• WaRan Japans–Nederlands Woordenboek: [http://akira.arts.kuleuven.ac.be/waranwiki].
• Wikipedia, The Free Encyclopedia: [http://www.wikipedia.org].
• Wiktionary, The Free Dictionary: [http://www.wiktionary.org].
• WWWJDIC Server (web-interface to EDICT, JMdict, Kanjidic, etc.): [http://www.csse.monash.edu.au/~jwb/wwwjdic.html].