A look at the future of the denshi jisho

Designing the next generation of digital Japanese dictionary software
Final essay submitted in part fulfilment of the
requirements for the degree Bachelor of Arts in the
Japanese Languages and Cultures programme
Abstract
This essay aims to combine the fields of Japanology and Software
Engineering in an attempt to leverage advances made in digital language
processing and evaluate the possibilities of combining existing free and open
resources into a single software dictionary application. Such an application
could provide students of Japan, its language and its culture — from the
beginner, to the seasoned scholar — with a complete digital reference of
Japanese words, names and characters that improves upon paper resources and
conventional portable electronic dictionaries by allowing the user to retrieve
information faster and in a more efficient way.
Final Bachelor Essay
Leiden University
Department of Japanese Languages and Cultures
Author: Jeroen Douwe Hoek, BEng
Student number: S0416851
E-mail address: [email protected]
Supervision: Dr Riikka Länsisalmi, Dr Rob Goedemans
Essay submitted on the 22nd of August 2007
Statistics: 9216 words
Table of Contents
Introduction
The evolution of digital dictionaries
A modest typography of common issues
    Obsolete kanji forms
    Japanese names
    Character information
    Finding kanji
Combining existing resources
Towards a next generation interface
    Obsolete kanji
    Japanese names
    Character information
    Finding kanji
Differing shades of “free”
The next step
Bibliography
Internet resources
Illustrations
Figure i: Obsolete kanji form – bōtoku (blasphemy), the second character
(highlighted) has been simplified during the script reforms.
Figure ii: The kanji 藤 (fuji – wisteria) decomposed into its four parts.
Figure iii: The 艹 and 月 radicals decomposed into their strokes.
Figure iv: Illegible kanji – A small slightly enlarged sample from a copy of an
encyclopaedia entry on the Nara period scholar and China traveller Kibi no
Makibi.
Figure v: Information for the kanji 藤 with some information shown.
Figure vi: Information for the kanji 藤 tailored to the preferences of a specific user
with a keen interest in the Korean language and a background in software
engineering.
Figure vii: An example of how a hard to read character might be interpreted.
Figure viii: Concept kanji selector in its initial state.
Figure ix: The kanji selector after choosing a vertical split.
Figure x: The kanji selector after specifying the radical for the left section.
Figure xi: The kanji selector with all of our arguments and the only kanji in the
Jōyō Kanji list that applies.
Figure xii: The kanji selector after setting a couple of arguments. Unset sections
(in this case the lower right area of the kanji) act as wild cards.
Tables
Table i: Samples of hiragana, katakana and kanji. The text in the hiragana and
katakana samples is a fragment of the Iroha poem, a classic poem, which is
also a pangram, representing all the sounds in both syllabaries. The kanji
sample is composed of a number of kanji from the Jōyō Kanji set.
Table ii: Kanji sharing a reading, with the part of the character commonly held
responsible highlighted.
Table iii: Possible constraints that can be used to construct a query for searching a
specific Chinese character.
Introduction
With the global popularisation of the Internet witnessed in the 1990s, a new
distribution network arose that is distinctly lacking the boundaries that used to
limit the dissemination of information. Now, the cost of publishing information in
some paper form and the subsequent cost of distribution are no longer
significantly limiting factors as they have been reduced to relatively negligible
figures. For non-native learners of the Japanese language and culture (for
studying one requires at least some degree of insight into the other) the
possibilities of using, sharing and contributing to digital lexical resources for the
Japanese language by means of the Internet have provided a wide
variety of decent, accessible and free¹ alternatives to traditional paper
dictionaries and encyclopaedias. What are the possibilities of consolidating this vast
array of resources into a single user-friendly software application?
This essay examines a number of possibilities by evaluating existing digitally
available resources, such as dictionaries, and contrasts these against a number of
problematic cases a learner of the Japanese language may encounter when reading
or translating Japanese texts. For some of the more common issues described, a
few conceptual software solutions (using and combining these resources) are
presented. This essay starts off with a brief overview of the history of electronic
Japanese lexical resources. Then, by looking at some common scenarios
encountered when working with the Japanese language and the ways in which one
might deal with them using traditional resources, a reference point is established
to the current state of the field. From there on, a selection of existing free and open
resources is reviewed and compared to the more traditional resources, and
subsequently the scenarios encountered earlier are revisited with new approaches
and user interface concepts, building on the free resources introduced. Finally, a
rationale for the strong focus on free software projects in this essay is given.
From a Western perspective, Japanese is often regarded as one of the most
difficult languages to master. The issues of grammar, the differing word order²
and the intricacies of correctly employing honorifics seem trivial compared to
mastering the Japanese writing system, which uses two native syllabaries, the
Roman alphabet and a rather large number of Chinese characters.³ Getting to grips
with the two phonetic syllabaries, the hiragana and katakana, each consisting of 47
characters, usually takes no more than a few weeks of practice, but having to learn
to read and write the 2000 or so Chinese characters in daily use, or kanji as they
are known in Japanese, makes the learning curve considerably steeper.
1 Free, in both senses of the word. That is gratis (available free of charge, as in “free
beer”), and more importantly libre (granting the user the freedom to use, study, change
and distribute it, as in “free speech”). Stallman 2002, p. 43.
2 Put in linguistic terms, sentences generally use the subject-object-verb (SOV) sequence
instead of subject-verb-object (SVO).
Hiragana    いろはにほへとちりぬるをわかよ … もせす
Katakana    イロハニホヘトチリヌルヲワカヨ … モセス
Kanji       亜哀愛悪握圧扱安暗案以位依偉囲 … 枠湾腕
Table i: Samples of hiragana, katakana and kanji. The text in the hiragana and
katakana samples is a fragment of the Iroha poem, a classic poem, which is also a
pangram, representing all the sounds in both syllabaries. The kanji sample is
composed of a number of kanji from the Jōyō Kanji set.
Most of the issues and difficulties encountered when working with Japanese
texts described in this essay come from my own practical experience as a
non-native learner of the Japanese language. Although this essay is written from a
Western perspective (or more specifically, a Dutch perspective) the ideas
presented herein should apply to any non-native speaker of Japanese, especially
those from cultures where the Chinese characters (be they known as hànzì, kanji,
hanja, hán tự⁴ or simply Han characters⁵) are not used in daily life.
3 The reader accustomed to Asian languages using Chinese characters might quite
correctly remark that, in fact, the Chinese themselves use far more Chinese characters
than the Japanese in their daily usage of the language, but compared with the twenty-odd
letters in the Western alphabets the Japanese do use “a fair amount” of characters.
4 Respectively the Chinese, Japanese, Korean and Vietnamese pronunciation of the same
compound word. These are the cultures where Chinese characters are, or were in the
case of Vietnam, used. Lunde 1999, p. 4 & 50.
5 Han characters is a direct translation of the two characters used for this term in Chinese,
Japanese, Korean and Vietnamese.
The evolution of digital dictionaries
When speaking of Japanese lexical resources, three distinct families of
resources can be discerned. Naturally, the traditional paper dictionaries and
encyclopaedias are the first that spring to mind. Well-known dictionaries such as
The New Nelson Japanese–English Character Dictionary, Kenkyusha's New
Japanese–English Dictionary and the Kōjien (an authoritative Japanese dictionary
with a good deal of encyclopaedic content) are but a few of these time-honoured
and revered works. But respected as these tomes may be, they are hardly portable,
which in part explains the popularity of the second family of resources: the
portable electronic dictionary⁶ (PED). Since the 1980s, Japanese electronics
companies have been producing pocket-size devices with an ever increasing
feature set, capable of providing quick and easy access to digitized versions of
existing paper dictionaries. Fitted with a small keyboard and screen, these devices
easily fit in the pocket of a jacket or a pair of pants and can perform their duties
for ages on a single battery cell. Current models literally carry a whole bookshelf
worth of dictionaries, practice material and other reference works, and some are
able to interface with a laptop or desktop computer and provide their dictionary
resources there. Prices range from €200 to €400,⁷ depending on what dictionaries
are included and the functionality of the device itself. The more extensive PEDs
sold by manufacturers such as Canon, Casio, Seiko and Sharp come with
handwriting recognition. A common feature is the ability to add more dictionaries
via a memory card slot. These can be purchased separately, with the price
depending on the dictionary.
Despite having the backing of a well-funded research and development
department, the future of the commercial electronic dictionary is, in my
opinion, limited, not by the manufacturer’s ability or lack thereof to innovate, but
by the demands of the domestic Japanese market. Notwithstanding the impressive
capabilities of these devices, the fact remains that they are designed and marketed
primarily as an aid for Japanese students who wish to learn English or be able to
consult one of the classic Japanese dictionaries at any time and any place. Or, to
put it in more concrete terms: students whose goal is to pass the demanding
entrance exams in order to gain admission to Japan’s most prestigious schools and
universities, and who are thus willing to pay for a tool that can help them study for
these important events. Although some of the ideas touched upon in this essay
could well be of use to this primary target demographic, a lot of features that
would be useful to someone learning Japanese as a second language might not
have that same appeal. Although the producers of these devices are more likely
than not well aware of a secondary group of users found in the growing group of
foreign students, scholars and businessmen who use them as learning aids and
portable dictionaries, they appear to remain reluctant to market this type of
product beyond Japan’s national borders, or even export them for that matter. The
foreign market for the Japanese electronic dictionaries depends mostly on third
party importers selling products from the latest range, or a favour from a friend or
colleague visiting Japan.
6 The manufacturer Canon uses the brand name Wordtank for its product line of PEDs.
In colloquial speech this term is often used as a blanket term for all PEDs with
Japanese dictionaries.
7 ¥35,000 and ¥70,000 respectively at the time of writing.
The third type of resource came into being with the emergence of the
Internet and the World Wide Web in the 1990s. One significant aspect of this global
network is the near absence of distribution costs for digital resources. If someone
composes and publishes a useful resource on-line with the intent of spreading this
information, then anyone, anywhere can freely access this resource. Perhaps the
most well-known example of a free lexical resource dealing with the Japanese
language is the Japanese–English dictionary best known as EDICT. Started in the
early 1990s by Professor Jim Breen of Monash University in Australia, the project
has evolved into an extensive dictionary comprising over 100,000 headwords.
The fact that a lot of software dictionary applications that provide a Japanese–
English dictionary, from the desktop computer to the handheld PDA and on
every operating system (OS) imaginable, use the data from EDICT is indicative of
the popularity of this project. Of course, EDICT is not the only example of a lexical
resource accessible through the Internet.⁸ There are a number of other successful
bilingual Japanese dictionaries (the free Japanese–French and Japanese–German
dictionaries are of particular note) and, for an encyclopaedia, Wikipedia, the
famous poster child of social software advocates, is available in many languages,
with the English language edition holding the record of being the largest
encyclopaedia in the world by far.⁹
8 Accessible either by using a website, such as the WWWJDIC service that houses EDICT,
or by downloading the material for off-line use.
9 The English language edition is nearing two million articles as of July 2007.
A modest typography of common issues
Any non-native learner of Japanese will be familiar with at least some of the
difficulties that can arise when working with Japanese texts: older kanji forms,
place and personal names, kanji unknown to the reader, or words that are partially
illegible due to age, nth generation photocopies, or even coffee stains. Below, a modest
selection of common problems is listed, along with the current solution to
deciphering the particular word or character.
Obsolete kanji forms
From 1946 onwards, a series of script reforms¹⁰ took place that led to the
simplification of a number of kanji and a limit on the number of kanji
permitted for use in official government publications, media and public education.
When reading Japanese texts from the first half of the twentieth century and
before, the older kanji forms can make a simple dictionary lookup a daunting task
if the reader is not familiar with the older form of the character.
Original shape: 冒瀆 (ぼうとく)
Simplified shape: 冒涜 (ぼうとく)
Figure i: Obsolete kanji form – bōtoku (blasphemy), the second character
(highlighted) has been simplified during the script reforms.
Figure i shows a word found in a modern text written after the script
reforms, but which cites a wartime account of a soldier in the field. The text
offered no furigana,¹¹ complicating a dictionary lookup based on pronunciation. The
word can be retrieved in two steps with the current generation of PEDs and the larger
paper dictionaries,¹² provided the user is able to locate both kanji (either by
drawing them into a PED, or by using a traditional radical and stroke count
lookup) and realises that the second character is an older form. Limiting the
search to merely examining the list of two-character words starting with 冒 is not
sufficient, as neither resource lists the compound word with the older variant of
the second kanji.
10 Initially the number of kanji was limited to 1850 (known as the Tōyō Kanji); this set was
revised in the 1980s to include 1945 kanji in total (the Jōyō Kanji). Several dozen
kanji were simplified in shape (簡易字体, kan’i jitai) as part of the script reforms.
Küenburg 1952, p. 230.
11 Furigana or rubies are kana placed above or next to kanji to indicate the pronunciation.
This is a common feature in texts aimed at a younger audience, but can also be employed
for words that are considered unfamiliar, archaic or contain deprecated kanji.
12 The New Nelson (Nelson 1997) for example points the reader to the simplified kanji
when looking up an older form.
Japanese names
When the reading of a character or word is known, a dictionary lookup is
relatively straightforward. Kanji used in Japanese names, however, can take a
different reading, called nanori, which may make finding the word in a
dictionary (or more likely, an encyclopaedia) difficult. Unfamiliarity with
Japanese names can aggravate the problem. Furthermore, the fact that a name is
written in kanji does not guarantee that it is in fact Japanese. A word that eluded me
for the greater part of an afternoon when reading a Japanese text on wartime
experiences turned out to be a Chinese name. If the encyclopaedia used to look up
the word — provided the user has recognized it as being a name — is limited to
Japanese topics this may pose a problem.
Character information
In time, a learner of the Japanese language will, generally speaking, acquire
a feeling for the pronunciation of kanji. In cases where a particular character is not
known to the reader, the reading can often be guessed from a part of the kanji
which often, though not always, corresponds to a certain reading.¹³ The following
kanji all share the same reading in these compound words (SEI in this case; the
corresponding part is highlighted in Table ii).
This information can be very useful as a mnemonic device for remembering
the reading of kanji. As far as paper dictionaries go, this information is only
available in specialized character dictionaries. Any decent paper dictionary or PED
will give the user plenty of general information on at the very least the Jōyō Kanji.
Stroke order diagrams are commonplace and naturally the common readings,
including the name readings, are present in the larger dictionaries. Chinese
character dictionaries are traditionally indexed by radical, so the one radical that
is considered to be the leading radical for a character is always shown. But what
if one wants to know the exact decomposition of a more complex character?
Chinese characters are composed of a limited set of parts. Most of these parts are
the so-called radicals, but not all parts are.
13 In compound words, often consisting of two kanji, the on-reading is used in most cases.
The on-reading of a kanji is derived from the original Chinese pronunciation of the
character, in contrast to the kun-reading which is native Japanese.
Kanji   Unicode codepoint   Example compound word
正      U+6B63              正義 (せいぎ, seigi – justice)
政      U+653F              政府 (せいふ, seifu – government)
征      U+5F81              征服 (せいふく, seifuku – conquest)
整      U+6574              調整 (ちょうせい, chōsei – adjustment / tuning)
Table ii: Kanji sharing a reading, with the part of the character commonly held
responsible highlighted.
Figure ii: The kanji 藤 (fuji – wisteria) decomposed into its four parts.
Of the parts of the kanji in Figure ii, three are classical Kangxi radicals,¹⁴
namely 艹 (a variant of the grass radical ⾋), 月 (the moon radical) and 氺 (a variant
of the water radical ⽔). The last part contains ⼋ (the radical eight) and four
additional strokes. A part, for lack of a better term, is always composed of a
number of standard strokes, as can be seen in Figure iii for the case of the
grass and water radicals.¹⁵ A stroke is, as the name implies, a single stroke of a
brush; that is, it is drawn without removing the brush from the canvas.
14 Most character dictionaries are indexed according to the 214 traditional radicals as
defined in the 1716 Kangxi dictionary. It is possible to find any kanji in the dictionary
quite fast as long as the reader can recognise which part of the kanji acts as radical and
how many strokes are left in the rest of the character. Nelson 1997, p. 1233.
15 Lunde uses the term radical-like elements for the parts that aren’t directly related to a
traditional radical. Lunde 1999, p. 55.
Figure iii: The 艹 and 月 radicals decomposed into their strokes.
Finding kanji
The numerous kanji present in every aspect of the Japanese language pose a
challenge to learners of the language not familiar with these characters of Chinese
origin. Roughly 2000 characters, known collectively as the Jōyō Kanji, are deemed
suitable for daily use and are used in government publications, newspapers and
education. A good 1000 of these, the Kyōiku Kanji, are taught in elementary
education. Add to this list the kanji permissible for use in personal names¹⁶ (variant
shapes of kanji in the Jōyō Kanji list as well as unique shapes) and a penchant for
using obsolete kanji in place of their simplified variants as a stylistic device, and it
becomes clear why finding kanji in a dictionary can be a problem for non-native
users. When at least one reading of a kanji is known, looking it up won’t pose a
problem. If the reading is not known, the kanji can be found by looking it up in the
index of a traditional paper dictionary or a PED by using the traditional radical and
stroke count method. Most input method editors (IMEs) and the newer PEDs allow
the user to draw the character with a stylus or mouse, but this method depends on
the user being familiar with the standard stroke order.
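The radical and stroke count method described above can be sketched in a few lines of Python. This is a minimal illustration only: the four sample entries are entered by hand (they are not read from Kanjidic or any other resource named in this essay), and the names SAMPLE_KANJI, build_index and lookup are hypothetical.

```python
from collections import defaultdict

# Hand-entered sample data, an assumption for illustration only:
# kanji -> (indexing radical, total stroke count, strokes in the radical).
SAMPLE_KANJI = {
    "正": ("止", 5, 4),
    "政": ("攵", 9, 4),
    "征": ("彳", 8, 3),
    "整": ("攵", 16, 4),
}


def build_index(kanji_data):
    """Group kanji by (radical, strokes remaining outside the radical)."""
    index = defaultdict(list)
    for kanji, (radical, total, radical_strokes) in kanji_data.items():
        index[(radical, total - radical_strokes)].append(kanji)
    return index


def lookup(index, radical, residual_strokes):
    """Return all kanji filed under the radical with that residual count."""
    return index.get((radical, residual_strokes), [])


index = build_index(SAMPLE_KANJI)
print(lookup(index, "攵", 5))
```

A paper dictionary performs the same two-step narrowing by hand: first the radical section, then the group with the right number of remaining strokes.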
Figure iv: Illegible kanji – A small slightly enlarged sample from a copy of an
encyclopaedia entry on the Nara period scholar and China traveller Kibi no Makibi.
16 These kanji are known as the jinmeiyō kanji. In September of 2004 the list of allowed
kanji was extended significantly to include a total of 983 characters. Japanese Standards
Association 2004, p 1.
However, finding kanji by means of the radical and stroke method or by
drawing it is not always possible. Figure iv shows a copy of an encyclopaedic entry
handed to students taking a course on how China and its culture affected Japanese
literature throughout the years. The professor teaching the class copied the text in
advance from another copy. Unfortunately, somewhere along the way this short
section of text had lost its details, making class preparation somewhat challenging.
To find out what the kanji here means, the methods mentioned above are
insufficient. Our search is further complicated by the fact that the kanji in this
example stands alone. It is not part of a compound word, which precludes us from
looking it up in an electronic dictionary by entering the other character(s) and
specifying a wild card.¹⁷ Our best bet is probably to guess at the radical and a
range for the number of strokes this kanji might consist of, and spend a good deal
of time poring over a dictionary in the hopes of finding it.
17 Most PEDs support a method to define a wild card. One could search for “all compound
words starting with this kanji and another kanji I don’t know” and more often than not, it
will return a relatively short list of possible results.
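A wild card search of the kind described in the footnote above could look roughly like the following Python sketch. The headword list is a tiny hand-picked sample and the choice of "?" as the wild card symbol is an assumption; an actual PED or dictionary application may use a different symbol and a far larger word list.

```python
import re

# Tiny hand-picked sample of headwords (illustrative only).
SAMPLE_HEADWORDS = ["正義", "政府", "征服", "調整", "冒涜", "冒険"]


def wildcard_search(pattern, headwords, wildcard="?"):
    """Return headwords matching the pattern, where each wildcard
    stands for exactly one unknown character."""
    regex = re.compile(
        "".join("." if ch == wildcard else re.escape(ch) for ch in pattern)
    )
    return [word for word in headwords if regex.fullmatch(word)]


# "All two-character words starting with 冒 and a kanji I don't know."
print(wildcard_search("冒?", SAMPLE_HEADWORDS))
```

As the main text notes, this approach only helps when the unknown character is part of a compound; a kanji standing alone leaves nothing to anchor the search on.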
Combining existing resources
In the next section of this essay, a number of existing free resources are
combined to form a solution to the issues enumerated above. Here a number of
representative free resources are introduced. Due to the prevalence of the English
language on the Internet and in scholarly research, a large number of usable
English language resources exist to assist the non-native user of Japanese in
understanding the language. The most well-known example of a free lexical
resource on the Japanese language is the Japanese–English dictionary called
EDICT, mentioned earlier. For other languages, the state of bilingual resources to
or from Japanese varies greatly, from promising to disheartening. For speakers of
the German language for example (over 100 million people) the WaDoku
dictionary provides an excellent alternative to paper Japanese–German
dictionaries. The situation for the Dutch language stands in sharp
contrast with this, with the last paper Japanese–Dutch dictionary being Van de
Stadt’s Nichi–Ran Jiten published in the 1930s.¹⁸ On the digital front some
progress is being made in the form of a collaborative project run by the Catholic
University of Leuven in Belgium. Their WaRan¹⁹ project employs Wiki software²⁰ to
allow students and scholars alike to add words and definitions. One technique
employed to create a new bilingual dictionary with Japanese as one of the
languages is to use the English language as a pivot and combine EDICT with
another free English to X dictionary, with X being the desired second language.
This method is being used for Japanese–Swedish and Japanese–Slovene
dictionaries.²¹
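The pivot technique can be illustrated with a short Python sketch. It assumes, purely for the sake of the example, that both dictionaries are available as plain mappings from a headword to a list of glosses; the real EDICT file format and the methods used in the cited projects are more involved. The function name pivot_dictionary and the sample entries are hypothetical, with Dutch standing in for language X.

```python
def pivot_dictionary(ja_to_en, en_to_x):
    """Combine a Japanese-English and an English-X dictionary by
    matching on the English glosses (the pivot)."""
    result = {}
    for headword, english_glosses in ja_to_en.items():
        translations = []
        for gloss in english_glosses:
            for candidate in en_to_x.get(gloss, []):
                if candidate not in translations:  # keep order, drop repeats
                    translations.append(candidate)
        if translations:
            result[headword] = translations
    return result


# Hand-made sample entries (illustrative only).
JA_EN = {
    "正義": ["justice"],
    "政府": ["government"],
    "調整": ["adjustment", "tuning"],
}
EN_NL = {
    "justice": ["gerechtigheid"],
    "government": ["regering", "overheid"],
}

print(pivot_dictionary(JA_EN, EN_NL))
```

Note that 調整 drops out of the result because neither of its English glosses occurs in the second dictionary. A naive pivot of this kind can also pair up unrelated senses of an ambiguous English word, so the output would presumably need some form of checking.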
The Papillon project²² goes a step further, and strives for the noble goal of a
multilingual free dictionary. The idea behind this project is to combine existing
efforts such as those mentioned above, and create a dictionary that uses an
interlingual pivot internally. That is, it links words from different languages on the
basis of their meaning. As yet, the project appears to be in the early stages
of development. Although Papillon is not mature enough to consider as a usable
resource at this point in time, one of the resources it uses certainly is. JMdict is the
name of what is essentially a superset of EDICT. It goes beyond the bilingual
dictionary by combining a number of other dictionaries with Japanese lemmata
into one single resource using Japanese as its pivot language, currently containing
glosses in English, German and French as well as a limited number of words in
Russian. Besides JMdict and its “classic” subset EDICT, Professor Breen also
provides a number of related lexical resources on his website. There are several
Japanese to English dictionary files containing specialist terms, for the fields of
computing,²³ bio-medical science and even forestry to name but a few. Another
smaller, but useful resource is the 4JWORDS dictionary, which provides English
translations for the idiomatic four-character compound words²⁴ used in Japan.
18 There is however a Dutch–Japanese dictionary, published by Kodansha in 1994. Currently
a project is under way to digitize the Van de Stadt dictionary, and the scanned pages are
all indexed and available for browsing at the project website.
19 The transliteration of 和蘭, the Chinese characters to represent the Japanese and Dutch
languages.
20 Schiltz, Truyen & Coppens 2007, p. 97–101. Or, unscholarly as it may be, simply visit
Wikipedia and read the articles on Wikis and Wikipedia itself.
21 Paik, Bond and Satoshi 2001, Sjöbergh 2005.
22 Mangeot 2000, p. 4–6.
23 COMPDIC, as this glossary is called, contains over 14,000 entries.
24 Similar to sayings and proverbs in Western languages, the meaning of these words can
often not be inferred from the characters they are composed of.
25 Morioka 2005, p. 2.
As with any language that uses Chinese characters, a character dictionary is
extremely useful to have. Two well-established resources are Professor Breen's
Kanjidic and the Unihan database made available by the Unicode Consortium.
Kanjidic provides the Japanese and Chinese readings of the roughly 11,000
Chinese characters in the Japanese industrial standards JIS X 0208-1990 and
JIS X 0212-1990. Other data on the characters include references to their location
in a number of paper character dictionaries, the radical used to index them and the
meanings associated with the kanji. The Unihan database, a by-product of the Unicode
standardisation effort, provides similar data, but includes Chinese characters
unique to China and Korea as well. A different type of project is CHISE, the
CHaracter Information Service Environment. The aim of this project is to develop a
character processing environment that is not necessarily limited to a Chinese
character being defined in Unicode or one of the national character encoding
standards. Each character is defined by a collection of its features,²⁵ such as its
codepoints in different standards (if present), its readings and its composition. This
last property in particular is used in the examples below: a set of operators is used
to list the radicals or parts a character is composed of and how it is composed. This
is the Ideographic Description Sequence (IDS).²⁶ These three resources combined
provide a complete and continually evolving source of information on the Chinese
characters.
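To give an impression of how an Ideographic Description Sequence can be processed, the following Python sketch parses a sequence into a tree and lists its parts. It handles the twelve description characters U+2FF0 to U+2FFB, of which ⿲ and ⿳ take three operands and the others two; any later additions to the standard are ignored. The sequence used for 藤 is written out by hand as an assumption, with 龹 standing in for the fourth part; the data actually stored in CHISE may differ.

```python
# Ideographic Description Characters U+2FF0..U+2FFB and their operand counts.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # ⿲ left, middle, right
IDC_ARITY["\u2FF3"] = 3  # ⿳ top, middle, bottom


def parse_ids(sequence):
    """Parse an IDS string into a nested tuple (operator, operand, ...)."""

    def parse(pos):
        if pos >= len(sequence):
            raise ValueError("IDS ended before all operands were read")
        ch = sequence[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a plain part: radical or radical-like element
        operands = []
        for _ in range(IDC_ARITY[ch]):
            operand, pos = parse(pos)
            operands.append(operand)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete IDS")
    return tree


def leaf_parts(tree):
    """List the parts of a parsed IDS from left to right."""
    if isinstance(tree, str):
        return [tree]
    parts = []
    for child in tree[1:]:
        parts.extend(leaf_parts(child))
    return parts


# Hand-written sequence for 藤 (an assumption, not copied from CHISE).
FUJI_IDS = "⿱艹⿰月⿱龹氺"
print(leaf_parts(parse_ids(FUJI_IDS)))
```

Because the operators are written in prefix notation, a simple recursive reader suffices; the resulting tree records both which parts occur and where they sit within the character.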
For encyclopaedic references, the largest and arguably most successful
project is Wikipedia. Available in a plethora of languages and with more articles
than any other encyclopaedia, it suffices to say that it is a very usable resource,
although the quality of articles does vary greatly.
How do the free resources available compare to their traditional paper
counterparts and the resources included on modern PEDs? This is a difficult
question, and the safest answer is “it mostly depends on what the user needs”.
Naturally, a rough comparison can be made by just looking at the statistics. How
many lemmata does this dictionary have? How many articles in this encyclopaedia?
How many Chinese characters in this character database? But quantity is only part
of the equation; what the user expects from a resource and what he uses it for is
an important aspect as well. A connoisseur of the Japanese classics will probably
be better off with the type of time-honoured and revered Japanese dictionary that
cites the Man’yōshū²⁷ in its examples. An anthropologist studying contemporary
Japanese culture on the other hand, may prefer a resource that can be amended at
any time because of the volatile nature of information in his field. Free resources
such as EDICT and Wikipedia allow for user contributions, and both have their own
means of quality control. The future of collaborative — or social — software seems
promising, but this phenomenon being a relatively recent development is still
being explored and the exact dynamics of what makes or breaks a collaborative
project are vaguely understood at best. A well thought-through user interface, an
active community, a solid project structure and a clear hierarchy, even something
as mundane as the look and feel of the website and project tools, these are all
factors that can combine to form a successful, thriving project or spell its untimely
demise.
26 A different method that focuses more on automatically generating typefaces with
Chinese characters is the Character Description Language. Bishop and Cook 2003.
27 A well-known classic in Japanese literature.
Towards a next generation interface
Obsolete kanji
Now that the technological barriers that prevented the usage of all sorts of
kanji (obsolete character forms, unique shape variations and even custom
Chinese characters) are slowly but surely being dealt with, the future of
Chinese characters in computing looks bright. At the moment it would be trivial to
implement a system using the data from the CHISE project to show the user which
variants or relatives any character has. If a dictionary application, for example,
encounters the situation in Figure i and cannot find any word matching the
requested query, it might suggest that the user continue looking for the word with
the older characters replaced by their modern variants. Conversely, if a user
wants to know the older character forms of the kanji in a word, or even the current
simplified Chinese equivalents of a given Japanese kanji, the application could
easily provide them, as this data is available in the CHISE database as well.
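The fallback behaviour suggested here might be sketched as follows in Python. The variant table contains only the single pair from Figure i, entered by hand; in a real application it would be filled from a resource such as CHISE. The names OLD_TO_NEW, modernise and lookup_with_variants are hypothetical.

```python
# Variant table with the one pair shown in Figure i (hand-entered).
OLD_TO_NEW = {"瀆": "涜"}

# Minimal stand-in for a dictionary: headword -> gloss.
HEADWORDS = {"冒涜": "blasphemy"}


def modernise(word, variants):
    """Replace every older character form by its modern variant."""
    return "".join(variants.get(ch, ch) for ch in word)


def lookup_with_variants(word, headwords, variants):
    """Look the word up as typed; if that fails, retry with the older
    forms replaced and report which spelling was actually found."""
    if word in headwords:
        return word, headwords[word]
    candidate = modernise(word, variants)
    if candidate != word and candidate in headwords:
        return candidate, headwords[candidate]
    return None


print(lookup_with_variants("冒瀆", HEADWORDS, OLD_TO_NEW))
```

Returning the spelling that matched, rather than only the gloss, lets the interface tell the user that the word was found under its simplified form.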
Japanese names
The solution to the issue of unfamiliar names is mostly a matter of finding a
way to provide the user with the potentially large amount of information in a
sensible manner. Resources such as Wikipedia spring to mind when searching for a
word such as 吉備真備,²⁸ whereas a dictionary such as ENAMDICT can be valuable
in determining the reading of a name in Chinese characters. It is important to
recognize that whilst it may be obvious that a word is the name of someone or
something in most cases, sometimes this may not be the case. When someone
inputs any word into a dictionary, it should not limit itself to any specific
resource — unless the user requests this — but retrieve information from wherever
the word is indexed. Computers, including PEDs, have a clear advantage over
paper dictionaries here. A solution that bases itself on resources such as Wikipedia
also benefits from being highly up to date. When using such an application to
translate, for example, news in today's newspaper, this can be of great use.
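The idea of querying every available resource at once, unless the user restricts the search, can be sketched in Python as follows. The three resources are reduced to small hand-made mappings for the sake of illustration; their names, the sample entries and the function lookup_everywhere are assumptions and do not reflect the actual formats of ENAMDICT or Wikipedia.

```python
def lookup_everywhere(word, resources, only=None):
    """Collect entries for the word from every resource that indexes it.

    `only` may hold the names of the resources the user wants to
    restrict the search to; by default all resources are consulted.
    """
    hits = {}
    for name, entries in resources.items():
        if only is not None and name not in only:
            continue
        if word in entries:
            hits[name] = entries[word]
    return hits


# Hand-made stand-ins for a word dictionary, a name dictionary and an
# encyclopaedia (illustrative only).
RESOURCES = {
    "dictionary": {"正義": "justice"},
    "names": {"吉備真備": "Kibi no Makibi"},
    "encyclopaedia": {"吉備真備": "Nara period scholar and China traveller"},
}

print(lookup_everywhere("吉備真備", RESOURCES))
```

With this arrangement the user does not need to know beforehand that a word is a name: the name dictionary and the encyclopaedia simply turn up in the results when the ordinary dictionary has nothing to offer.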
28 Again, the Japanese scholar Kibi no Makibi.
Character information
For any one Chinese character a large amount of information is freely
available. The Unihan database alone has a myriad of data available ranging from
the codepoint entries in different character-set encoding schemes to the different
readings. Of obvious interest are the Japanese readings of a character, but
depending on who uses the application different properties might be interesting.
Someone who owns the New Nelson might want to display the index number when
the character can be found in that dictionary, whilst someone taking Korean
classes might like to know the Korean reading of a character as well.
Figure v: Information for the kanji 藤 with some information shown.
Figure v shows a basic set of information for the character 藤 , including its
composition drawn from the CHISE database and readings and variants from the
Unihan database. The idea behind this design is that all the button-like fields with
the white background and grey border act as hyperlinks to more information.
Clicking on one of the readings could present the user with a list of characters that
share this reading, clicking on one of the radicals might show some more
information on the radical, such as the origin of its shape. It could also display all
characters that use that radical as their leading radical. Allowing the user
to follow link after link to his heart’s content is very similar to surfing the Internet
or browsing Wikipedia: with minimal effort it is possible to quickly jump to related
information. Figure vi shows information for the same character, but with the
interface configured to the needs of a specific user.
Figure vi: Information for the kanji 藤 tailored to the preferences of a specific user
with a keen interest in the Korean language and a background in software
engineering.
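The difference between Figure v and Figure vi comes down to which properties are selected for display. A minimal sketch of such user profiles is given below. The field names and profile names are hypothetical (they are not Unihan property names), and the readings are filled in by hand for illustration only.

```python
# Hand-written sample record; a real application would read this from
# resources such as Unihan and CHISE.
CHARACTER_INFO = {
    "藤": {
        "japanese_on": ["トウ"],
        "japanese_kun": ["ふじ"],
        "korean": ["등"],
        "components": ["艹", "月", "龹", "氺"],
    }
}

# Which fields each (hypothetical) user profile wants to see, in order.
PROFILES = {
    "default": ["japanese_on", "japanese_kun", "components"],
    "korean_engineer": ["japanese_on", "korean", "components", "codepoint"],
}


def describe(char, profile="default"):
    """Return only the fields the given profile asks for."""
    info = dict(CHARACTER_INFO[char])
    # The code point can be computed rather than stored.
    info["codepoint"] = f"U+{ord(char):04X}"
    return {field: info[field] for field in PROFILES[profile] if field in info}
```

Because the profile is only a list of field names, adding a new property to the display does not require any change to the lookup code itself.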
Finding kanji
Using the radical and stroke method to find the blurred kanji in Figure iv
is tricky, and drawing the character on a PED or
in an IME is out of the question when the character is this hard to read. We
may not be able to clearly distinguish the strokes, but some features do stand
out. Provided the user is familiar with the more common radicals and shapes that
occur, he might reason as follows:
This could be the ⾦ (gold) or ⾷ (eat) radical.
Looks like [shape not reproduced] and the radical 一 (one) stacked on top of each other.
Figure vii: An example of how a hard to read character might be interpreted.
Even clever indexes such as the Universal Radical Index29 in the New Nelson
won’t be of help here. These allow the user to find any kanji as long as he can
recognise a radical appearing in it, regardless of position, and if he knows the
total stroke count, but that is information we do not have in this case. What if a
dictionary allowed the user to retrieve this type of hard to find kanji by specifying
a query of sorts using arguments such as the pair above? With CHISE we have a
database with information about the composition of the kanji. It would be quite
possible, although challenging, to generate a database structure that allows the
user to do exactly this. Consider the following widget30 (Figure viii).
29 A more comprehensive explanation can be found in Nelson 1997, p. 1370, followed by the
index at ibid., p. 1371–1600.
Figure viii: Concept kanji selector in its initial state.
The icons on the “Add rule” bar are the basic set of tools to construct our
query. The upper row of icons corresponds largely to the characters defined in the
IDS block in Unicode, and serves a similar purpose. However, instead of defining
the exact composition of a Chinese character, these help define a set of rules the
sought-after kanji must comply with. It should be noted that the user needs no
knowledge of CHISE, IDS, Unicode or the exact IDS for the kanji. In fact, if the
composition of a certain character can be specified in more than one way, the
application should transparently handle this.31 The “Structure unknown” icon can
be used to add a wild card to the query with regard to the structure of the kanji.
This makes it possible to create a query that says “somewhere in this part of the
kanji I want, this structure or radical occurs”.
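To make the point about alternative compositions concrete, the sketch below parses an IDS string into a tree and compares two descriptions by their leaf components. It uses the two sequences for 藤 quoted in footnote 31; the parser is a simplified illustration written for this essay (it handles only plain characters and the twelve description characters U+2FF0 to U+2FFB), not code taken from CHISE.

```python
BINARY = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")   # operators that take two parts
TERNARY = set("⿲⿳")               # operators that take three parts


def parse_ids(ids):
    """Parse a prefix-notation IDS string into nested tuples."""
    def walk(pos):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        arity = 2 if ch in BINARY else 3 if ch in TERNARY else 0
        if arity == 0:
            return ch, pos + 1          # a plain component
        children = []
        pos += 1
        for _ in range(arity):
            child, pos = walk(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = walk(0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree


def leaves(tree):
    """List the plain components of a parsed IDS, left to right."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:
        result.extend(leaves(child))
    return result


def same_components(ids_a, ids_b):
    """True if two IDS strings are built from the same components."""
    return sorted(leaves(parse_ids(ids_a))) == sorted(leaves(parse_ids(ids_b)))
```

The two sequences from footnote 31 parse to different trees but share the components 艹, 月, 龹 and 氺, so a query phrased in terms of components finds 藤 under either description. Comparing leaves is only one possible way of handling such differences; a stricter implementation might normalise the trees instead.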
Next are four constraints that can be added to any section of the query.
Using these we can specify that a specific section (that is to say, a part of the
kanji described by the description icons above) either contains a specified
radical, and possibly some other bits, or is exactly that radical. There is no reason
to limit this functionality to only radicals: provided the user already has an IME for
Japanese, he could just as easily specify that “this section of the kanji I’m looking
for is exactly this kanji that I already know”. The last two options can be used,
respectively, to set a global limit on which collection or collections of kanji we
should search in, and how many strokes the whole or a section of the kanji may
have. This stroke limit can be set as a range. It should be noted that these icons
are by no means definitive: although the Chinese characters used here are illustrative
of the constraints they represent, this is probably not the most accessible presentation.
30 The term widget is usually employed for the buttons, text fields and menus that make up
a graphical application, but can also be applied to larger interface elements that perform
a specific function.
31 For example, the IDS used in the CHISE project may seem odd at times. The IDS for 藤 is
⿱艹⿸⿰月龹氺, where one would probably expect ⿱艹⿰月⿱龹氺.
Function
Vertical split
Horizontal split
Full enclosure
Surround from lower left
Surround from upper left
Surround from upper right
Enclose from the left
Enclose from above
Enclose from below
Overlap
Structure unknown
Section is radical
Section contains radical
Section is kanji
Section contains kanji
Must be in collection(s)
Section has a number of strokes in this range
Table iii: Possible constraints that can be used to construct a query for searching a
specific Chinese character.
From this starting position, we enter our first argument. We suspect either
the gold or eat radical to be positioned all along the left side of our elusive
character, so we enter this in two steps. First we indicate that we want to define
the left side of the kanji, so we select the vertical split, and subsequently click the
left section created by the split to add another constraint (Figure ix).
Figure ix: The kanji selector after choosing a vertical split.
Next we choose the “section is radical” option for the left section of our
query (Figure x), and specify both the radicals that make a likely candidate in this
case.
Figure x: The kanji selector after specifying the radical for the left section.
As we are limiting ourselves to the Jōyō Kanji, the list of possible results is
already limited to a more manageable number. There should now be 36 kanji in the
result list, and the user could probably already spot the outline of our blurry friend at a
glance. So far this process seems straightforward enough: selecting these radicals
from a grid ordered by stroke count would generally be quite
fast, as the user will be familiar with the most common radicals at the very least.
However, the next part is tricky as we want to enter a part into our query that is
not a traditional radical, but a fragment composed of a non-radical part and the
radical 一 (one). Note that because the exact composition of the section where this
kanji part lies is not known to us, we add the “Structure unknown” rule, and after
that the horizontal split with the two parts we have identified, or at least suspect.
With the last arguments in place, the kanji turns out to be 鑑.
Figure xi: The kanji selector with all of our arguments and the only kanji in the Jōyō
Kanji list that applies.
In the illustrations above, the kanji that satisfy the arguments, the result
list, might be displayed below or next to the widget. If this list is updated
dynamically as the user adds constraints, he has the opportunity to stop adding more
arguments and simply select the desired kanji if he spots it in the candidate list.32
This “kanji selector” concept could be called from the main dictionary interface,
and would start with one constraint already added: a sensible default limiting
the results to the Jōyō Kanji. The user could of course choose to set this limiter to a
different collection such as only the Kyōiku Kanji, Jinmeiyō Kanji, or kanji defined
in a character encoding standard, such as the Japanese JIS X 0213 or perhaps a
Chinese or Korean standard. As long as a collection is defined it could be used,
including a set defined by the user himself. Being optional, the constraint can also
be removed altogether to broaden the search to basically any kanji defined by
CHISE or a similar database.
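The behaviour described here, a result list that shrinks with every added constraint and an optional collection limit, could be sketched as follows. The class and helper names are hypothetical, and both the decompositions and the collections are tiny hand-picked samples, not the full Jōyō or Kyōiku lists and not CHISE data.

```python
# Toy data: kanji -> (left part, right part) of a left/right split.
LEFT_RIGHT = {
    "鑑": ("金", "監"), "銀": ("金", "艮"), "鉄": ("金", "失"),
    "飯": ("食", "反"), "飲": ("食", "欠"), "錆": ("金", "青"),
}

# Toy collections; only samples of the real lists.
COLLECTIONS = {
    "jouyou": set("鑑銀鉄飯飲"),
    "kyouiku": set("銀鉄飯飲"),
}


def left_is(*radicals):
    """Constraint: the left section is one of the given radicals."""
    return lambda kanji: LEFT_RIGHT[kanji][0] in radicals


def right_is(part):
    """Constraint: the right section is exactly the given part."""
    return lambda kanji: LEFT_RIGHT[kanji][1] == part


class KanjiQuery:
    def __init__(self, collection="jouyou"):
        self.collection = collection    # None searches every known kanji
        self.constraints = []

    def add(self, constraint):
        """Add a constraint and return the updated candidate list."""
        self.constraints.append(constraint)
        return self.results()

    def results(self):
        allowed = None if self.collection is None else COLLECTIONS[self.collection]
        return [k for k in LEFT_RIGHT
                if (allowed is None or k in allowed)
                and all(c(k) for c in self.constraints)]
```

Because add returns the current candidates, an interface built on top of it can redraw the result list after every step and let the user stop as soon as the desired kanji is visible.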
Of course, such a method is by no means a replacement for quickly entering
kanji and compound words by pronunciation, or if that fails drawing it, but it may
prove an interesting solution for cases where the requirements for these
traditional methods cannot be met. One application might be a case where an
eager student of Japanese wonders what kanji featuring the ⾨ (gate) radical
appear in the Jōyō Kanji list (12) or in the Kyōiku Kanji subset (8). Or consider a
situation where the exact composition of a kanji seen somewhere just barely
escapes the user, but he remembered quite clearly that it contained the ⺾ (grass)
radical at the top, and had a ⽉ (moon) radical below it on the left (Figure xii). A
tool such as this might also be of use to linguistic scholars interested in performing
some form of quantitative analysis on the Japanese language.
It would be unreasonable to expect every user presented with this method to
be able to distinguish between the ⽉ (moon) and ⾁ (meat) radicals, which have
the same shape in some positions. An implementation of this concept should
consider these radicals the same when executing the query, unless the user
explicitly requests only one of these radicals. This problem is hardly new: it is not
uncommon for dictionaries to index kanji with a moon or meat radical under only
one of these.33
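Treating the moon and meat radicals as interchangeable unless the user asks otherwise might look like the sketch below. The equivalence group and the function names are assumptions made for this illustration. The call to unicodedata.normalize with the NFKC form is standard Python and is used here so that code points from the Kangxi Radicals block (such as U+2F49) compare equal to the ordinary ideograph.

```python
import unicodedata

# Groups of radicals a search treats as the same unless told otherwise.
EQUIVALENT_GROUPS = [frozenset({"月", "肉"})]


def normalise(radical):
    """Fold Kangxi Radicals block code points onto ordinary ideographs."""
    return unicodedata.normalize("NFKC", radical)


def same_radical(a, b, strict=False):
    """Compare two radicals; strict=True disables the equivalence groups."""
    a, b = normalise(a), normalise(b)
    if a == b:
        return True
    if strict:
        return False
    return any(a in group and b in group for group in EQUIVALENT_GROUPS)
```

A query for the moon radical then also matches kanji indexed under meat by default, while the strict flag covers the case where the user explicitly wants only one of the two.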
32 This is similar to the behaviour of the kanji drawing widgets found in IMEs and
modern PEDs.
33 The New Nelson for instance indexes all kanji with the moon radical under radical 140,
meat, except for the moon radical itself. Nelson 1997, p. 2530.
Figure xii: The kanji selector after setting a couple of arguments. Unset sections (in
this case the lower right area of the kanji) act as wild cards.
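The query in Figure xii, grass on top, moon below it on the left, and the rest left open, can be expressed as a pattern tree in which unset sections match anything. The sketch below is one possible way to do this; the decomposition trees are written by hand for a handful of kanji (the one for 藤 follows the form footnote 31 calls the expected one) and all names are hypothetical.

```python
ANY = None  # an unset section: acts as a wild card

# Toy data: kanji -> (operator, part, part), with parts nested where needed.
TREES = {
    "藤": ("⿱", "艹", ("⿰", "月", ("⿱", "龹", "氺"))),
    "花": ("⿱", "艹", ("⿰", "亻", "匕")),
    "草": ("⿱", "艹", "早"),
    "明": ("⿰", "日", "月"),
    "胴": ("⿰", "月", "同"),
}


def matches(pattern, tree):
    """True if tree fits pattern, with ANY matching any section."""
    if pattern is ANY:
        return True
    if isinstance(pattern, str):
        return pattern == tree
    if isinstance(tree, str) or len(pattern) != len(tree) or pattern[0] != tree[0]:
        return False
    return all(matches(p, t) for p, t in zip(pattern[1:], tree[1:]))


def find(pattern):
    """Return every kanji in the toy data whose tree fits the pattern."""
    return [kanji for kanji, tree in TREES.items() if matches(pattern, tree)]


# Grass on top; below it a left/right split with moon on the left.
FIGURE_XII_QUERY = ("⿱", "艹", ("⿰", "月", ANY))
```

On this toy data the query returns only 藤, whereas relaxing it to ("⿱", "艹", ANY) returns every sample kanji with the grass radical on top. Note that this sketch matches one fixed decomposition per kanji; handling alternative decompositions would need an extra normalisation step.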
When dealing with a limited set of Chinese characters, such as the Japanese
Jōyō Kanji list, this type of input method’s usefulness may appear to be limited to
being a last resort for situations such as those described above, rather than an actively
used tool. However, more scientific applications of this method could very well
arise with the ever increasing capabilities of the OS34 to handle even the most
obscure or even made-up Chinese characters. Whilst the Chinese, Korean or
Japanese IMEs are geared towards entering characters from their respective
common use character collections, this type of structured method could be used
when characters cannot be entered via an IME. This could potentially be
useful in providing a method to access characters that are available at the software
level (either encoded in the character set used, or through an environment that
provides a different way of accessing these characters) considerably faster than
picking them from a huge matrix, such as those provided by character map
applications.
34 Or more specifically, the underlying libraries responsible for handling and displaying
text.
Differing shades of “free”
Throughout this essay a heavy focus on free resources has purposely been
maintained. The term free, when applied to software or other digital resources,
can be slightly confusing due to its ambiguity in the English language. Often it is
interpreted as meaning gratis. It is true that a resource licensed under a free
licence is often available free of charge, but this is not what free means in this
case. There are plenty of examples of resources related to the subject of this
essay that are gratis without being free. Consider for example a piece of software that provides a
dictionary application, but which is not released under a free licence.35 Or think of
a popular on-line dictionary provided as a service by a large company.
This statement by a team of researchers working on a free Korean–Japanese
dictionary is not uncommon:
Finally, this research was made possible by the existence of a number of open
source resources. The results of this research will, of course, be made open,
and we have filed bug reports and updates with many of the resources we use.
In doing so, we produce better resources for everyone to use, so that the
tedious process of compiling lexicons does not have to be repeated over and
over again. We hope and expect that this will become standard, so that each
generation of researchers can build not only on the ideas of their predecessors,
but also on the knowledge that they have compiled.36
This project made use of the fact that a large number of words in Korean and
Japanese share the same Chinese characters37 and coupled words that matched.
Existing digital Japanese–English and Korean–English dictionaries were used to
further refine the results by using the English words as a pivot. Unfortunately, the
results of this project are no longer distributed because of strong suspicions that
the Korean–English dictionary used as a resource was not a free resource, but
copied from an existing non-free dictionary.38
35 These types of software often carry the label freeware (not to be confused with free
software) or shareware.
36 Paik, Bond & Satoshi 2001, p. 66
37 The Korean hanja weren’t simplified after the war in the way the Japanese kanji were,
but it is relatively simple to create a conversion table between the traditional hanja and
simplified kanji.
Anecdotes such as these help stress the importance of being aware of the
restrictions placed upon resources when creating a derivative work. Once such
information is free, it stays free.39 In my opinion, at the risk of sounding too
idealistic, the use of free licences is not only beneficial to scientific progress, but
inherently superior to non-permissive licensing models as resources using them
often benefit humanity as a whole. Most of the academics and developers working
on the type of resources described in this essay appear to subscribe to this view.
38 This team would not have used this resource had they known that it was not a free
resource.
39 Software philosopher Richard M. Stallman is generally regarded as an authority on the
subject of free software. The essays in Stallman 2002 provide a solid introduction to the
matter.
The next step
Due to the way characters are encoded, variant shapes of kanji are not
always natively supported on the modern OS. Figuring out a way to standardize
the way character variants, obscure Chinese characters from ancient sources and
even custom-made characters40 are encoded is one of the challenges that lie ahead.41 Projects
like CHISE may help in providing a future method of encoding these characters,
but this does not address all of the usage issues. Consider for instance a user who
is running a search query on a selection of documents. If he’s looking for a name42
that can be written with a variant of a more common character, then it would be
useful if documents containing that name written with that particular character
could show up in the search results as well. Within the limits of currently available
software, this problem can be worked around by providing the dictionary software
with its own search routines that can do this, but in time this kind of lower-level
text processing feature should be dealt with at the OS level. Being able to deal
with all sorts of characters is a good thing, especially in light of the desire to
digitise all kinds of cultural documents, from present-day novels to the Kojiki.
40 This is an issue in Taiwan, where people have the option of creating new Chinese
characters for use in personal names.
41 Jenkins 2004, p. 29.
42 One example sometimes cited is the case of the former Japanese prime minister Yoshida
Shigeru’s name (𠮷田茂). The first character of his name is written with a variant of the
more common 吉 (note the differing stroke length of the topmost horizontal stroke).
Although this particular variant has been added to the Unicode standard recently, many
people use the more common kanji simply because it is what the IME returns for the
name Yoshida.
A realisation anyone venturing into the field of digital dictionaries will quickly
come to is that it is not just one field. Consider building the “perfect digital
dictionary application”, with a pleasant and usable interface capable of displaying
potentially large amounts of information drawn from several different resources in
a clear and concise manner, well thought through input capabilities (either
through the OS’s input methods or methods similar to the kanji selector discussed
above) and well-written documentation. Add to this the cultural and linguistic
aspects of the Japanese language and the complexity of the Chinese characters,
and you end up with a project that happily crosses the boundaries between
Linguistics, Software Engineering, Human-Computer Interaction and Japanology.
Presenting colourful mock-up illustrations of some application design concepts is
one thing; creating a working, intuitive tool is a completely different matter. One
thing to look out for when trying to make numerous resources accessible through a
single interface is the risk that the casual user gets overwhelmed by the
information.43
Where do we end up if we think of the future of electronic dictionaries? With
the current influx of information made available through the Internet and the ever
increasing versatility of (portable) computers, it is likely we will end up with a
device capable of providing us with much more than a digitised bookshelf. A
dictionary lookup of the name of any Japanese town would yield not only the
pronunciation and information about the kanji in the name, but also information
about the town itself, its history, its environment, the people, local customs,
notable sights and museums, where to get a great meal, or where to eat and sleep
on a shoestring budget, maps, directions, the list goes on.44 A device capable of
doing this — that is, providing context to the character, word or phrase we are
looking for — sounds remarkably similar to the (fictional) Hitchhiker’s Guide to the
Galaxy,45 albeit slightly more focussed on earthly matters. Or perhaps we will end
up with a form of universal translator as seen in the science fiction series Star
Trek? Experiments with devices that go beyond the written language and are
capable of dealing with phrases spoken to and from it seem promising.
The list of existing Japanese lexical resources mentioned in this essay is by no
means exhaustive. New initiatives that could be of use for a user-friendly software
application that allows a user to perform the actions described above continue to
emerge, and existing projects continue to evolve and improve. Just keeping tabs on
all the available relevant resources and their progress is quite challenging. It will
be interesting to experiment with new interface concepts and the combination of
the various resources to find out what works and what does not.
43 Human interface guidelines tend to stress this point, and rightly so. Benson et al. 2004,
p. 4.
44 There is of course no reason to think these developments will stop at “only” providing
foreigners with a firm selection of resources to help comprehend Japan and its language
and culture. Information going beyond a dictionary definition can help to put words,
phrases and names in context for anyone, regardless of the language or culture
concerned.
45 As eloquently described in the five-part trilogy The Hitchhiker’s Guide to the Galaxy by
science fiction author Douglas Adams.
Bibliography
• Benson, Calum, et al., GNOME Human Interface Guidelines 2.0, The GNOME
Usability Project, 2004.
• Bishop, Tom and Richard Cook, “A Specification for CDL – Character
Description Language” presented at the Glyph and Typesetting Workshop,
Kyoto, 2003.
• Breen, Jim, “Building an Electronic Japanese–English Dictionary” presented at
the Japanese Studies Association of Australia Conference, Brisbane, 1995.
• Breen, Jim, “Computing in Japanese – what are the frontiers now?” presented at
the Workshop on Computational Japanese Studies, University of Tokyo, 2007.
• Japanese Standards Association – Information Technology Standardisation
Center, “Jinmeiyō Kanji no Moji Fugō ni Kansuru Kikaku Kentōkai Hōkoku” in
Hyōjunka Jānaru 34:11 (2004).
• Jenkins, John H., “The Dao of Unihan” in Proceedings of the 26th International
Unicode Conference (IUC-26), 2004.
• Küenburg, Max, “Tōyō Kanji – The Story of Modern Japanese Characters” in
Monumenta Nipponica 8:1/2 (1952), p. 230–238.
• Lunde, Ken, CJKV Information Processing, Sebastopol: O’Reilly & Associates,
1999.
• Mangeot, Mathieu, “Papillon Lexical Database Project: Monolingual Dictionaries
& Interlingual Links” presented at WAINS 2000.
• Morioka, Tomohiko, “Character Processing Based on Character Ontology”
presented at the Sino-Japanese Joint Symposium on New Technologies
Concerning the Storage of Chinese Character Literature, Beijing, 2005.
• Nelson, Andrew N. and John H. Haig, The New Nelson Japanese–English
Character Dictionary, Tuttle Publishing, 1997.
• Paik, Kyonghee, Francis Bond and Shirai Satoshi, “Using Multiple Pivots to
align Korean and Japanese Lexical Resources” in Proceedings of NLPRS (2001),
p. 63–67.
• Schiltz, Michael, Frederik Truyen and Hans Coppens, “Cutting the Trees of
Knowledge: Social Software, Information Architecture and Their Epistemic
Consequences” in Thesis Eleven 89 (2007), p. 94–114.
• Sjöbergh, Jonas, “Creating a free digital Japanese–Swedish lexicon” in
Proceedings of PACLING (2005), p. 296–300.
• Stallman, Richard M., Free Software, Free Society: Selected Essays of Richard
M. Stallman, GNU Press, 2002.
Internet resources
• 4JWORDS, Four-Character Idiomatic Compounds:
[http://home.earthlink.net/~4jword/index3.htm].
• CHISE IDS-find interface: [http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find].
• CHISE project page: [http://kanji.zinbun.kyoto-u.ac.jp/projects/chise].
• EDICT, JMdict, Kanjidic, etc.: [http://www.csse.monash.edu.au/~jwb/japanese.html].
• Google Earth: [http://www.google.com/earth].
• Google Maps: [http://www.google.com/maps].
• Jiten.nl, Japans–Nederlands woordenboek van Peter Adriaan van de Stadt,
Nichi–Ran Jiten 1934: [http://www.jiten.nl].
• Papillon Project: [http://www.papillon-dictionary.org].
• Reading Tutor: [http://language.tiu.ac.jp].
• Unicode Unihan Database: [http://www.unicode.org/charts/unihan.html].
• WaDoku, Japanisch–Deutsches Wörterbuch: [http://www.wadoku.de].
• WaRan Japans–Nederlands Woordenboek:
[http://akira.arts.kuleuven.ac.be/waranwiki].
• Wikipedia, The Free Encyclopedia: [http://www.wikipedia.org].
• Wiktionary, The Free Dictionary: [http://www.wiktionary.org].
• WWWJDIC Server (web-interface to EDICT, JMdict, Kanjidic, etc.):
[http://www.csse.monash.edu.au/~jwb/wwwjdic.html].
