I18N Primer TUT 2008

Internationalization – a primer
Tampere University of Technology
2008-01-14
Jere Käpyaho
Nokia Corporation
Software as part of culture
• Cultural aspects often neglected in software design
• Localized software as a unique selling point, legal requirement
• Software – menus and dialogs – in your own language
• Data in familiar format and order
• All this can be achieved with the right disciplines
Several different cultural models exist:
• Hofstede (1991): culture is "software of the mind"
• Trompenaars (1993): culture is "the way in which a group of people solves problems"
• Victor (1992): LESCANT model
• Hall (1990): culture is a "program of behaviour"
“Believe it or not!”
• The alphabet is not just A to Z (or even A to Ö)
• Sometimes there is no alphabet, but ideographs, or pictograms
• Upper case and lower case matter – except when they don’t
• Punctuation is often different
• Line-breaking and justification rules are important
ÅÄÖ
ÐØÆĚŮ
Санкт-Петербург
Παναθηναϊκός
zum Beispiel
ขอให้โชคดี
¿Cómo está usted?
はい、そうです。
汉语/漢語
The "Believe it or not" slides point out some of the typical (Western/American) assumptions about text
and user interfaces.
And it gets stranger...
• Text and layout are not always left to right
• User interface elements are sometimes “mirrored”
• Bidirectional languages, Arabic and Hebrew
• Latin text can occur in between
• .siht ekil kool ton seod hsilgnE tuB
Points of note:
• the tabs are right aligned
• text is also right aligned and runs from right to left, but contains Latin text which runs from left to right
• radio button alignment
• arrowhead directions
How to type it all in?
• Latin text is easy, only 26 to 50 letters
• Chinese, Japanese, Korean – thousands!
• Input Method Editor finds the equivalent character as the user types
• Full keyboards often have keys for the most frequently used characters
Microsoft Windows: Input Method Manager, Global IME
Java Input Method Framework available since Java 2, v1.2
Recommended reading:
Michael S. Kaplan and Cathy Wissink, Unicode and Keyboards on Windows
(http://www.microsoft.com/globaldev/handson/dev/inputinwin.mspx)
What year is it, anyway?
• Gregorian calendar: 2008
• What about Islamic? Hebrew? Chinese lunar?
  • 1429 in Islamic hijri
  • 5768 in Hebrew
  • 4704, the Year of the Fire Pig (year of Ding Hai, 丁亥) in Chinese lunar (until 2008-02-06)
• First day of week: Monday – or is it Sunday?
• Week numbers: not widely used in the U.S.
This Gregorian year’s equivalents in other calendars were mostly obtained from the Calendrica applet
from Reingold and Dershowitz, Calendrical Calculations - The Millennium Edition, Cambridge University
Press, 2001. The applet is online at http://www.calendarists.com.
In China the Gregorian calendar is often used in daily life, but festivities (such as the Chinese New Year)
are determined by the traditional lunar calendar. For the Chinese new year, see Wikipedia
(http://en.wikipedia.org/wiki/Chinese_calendar) or the Lunarcal utility at http://www.lunarcal.org.
The Islamic religious calendar is based on observation of the Moon, but in daily life the computational
variant is usually used.
The Japanese also use the Gregorian calendar, but traditionally years are numbered from the beginning of
the current emperor's reign. The name of the era is attached to the year (currently "Heisei").
The Buddhist calendar is identical to the Gregorian calendar in all respects except for the year. Years are
numbered from the death of the Buddha, traditionally dated to 543 BCE (Gregorian).
According to the ISO 8601 standard, the year begins with the Monday between December 29 and
January 4, and ends with a Sunday between December 28 and January 3. So the year has 52 or 53 weeks.
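The ISO 8601 week rule can be checked with a few lines of Python. This is only a sketch using the standard library's `date.isocalendar()`; the helper names `iso_week` and `weeks_in_iso_year` are made up for this illustration.

```python
from datetime import date


def iso_week(d):
    """Return (ISO year, ISO week number, ISO weekday), where Monday is 1."""
    iso = d.isocalendar()
    return (iso[0], iso[1], iso[2])


def weeks_in_iso_year(year):
    """December 28 always falls in the last ISO week of its year."""
    return date(year, 12, 28).isocalendar()[1]


# The date of this talk, and a Monday that already belongs to ISO year 2008.
print(iso_week(date(2008, 1, 14)))
print(iso_week(date(2007, 12, 31)))

# A year has 52 or 53 ISO weeks.
print(weeks_in_iso_year(2008), weeks_in_iso_year(2009))
```

Note that the ISO year of a date near New Year can differ from its calendar year, which is why the first element of the tuple matters.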
Dates and times
14.1.2008 (Finland)
1/14/08 (U.S.A)
2008年1月14日 (China)
20080114 (standard)
• Different order of date parts (day, month, year) – confusion!
• Different separators between date parts (dash, slash, dot, letter, text...)
• Leading zeros for days and months < 10
• Two or four digit years: 2008 or just 08
• International standard (ISO 8601) not in everyday use
The ISO 8601 standard defines a canonical representation for dates and times. A full date in ISO 8601
format would be something like:
2008-01-14T10:00+02:00
The date part comes first, followed by a T as a separator. Then comes the time part, which should end
with a time zone offset, or with the letter Z if you mean Coordinated Universal Time (UTC, equivalent to
the old GMT or Greenwich Mean Time, which is now obsolete as a term). The timestamp above means 10:00
local time in a zone two hours ahead of UTC, such as Finland in winter, so it corresponds to 08:00 UTC.
The ISO 8601 format is logical (proceeding from the most significant part down) and useful in storing
dates and times for machine processing, but it is different from the everyday formats most people are
used to. So it will most likely never be adopted into everyday use.
See Markus Kuhn’s summary of ISO 8601 at http://www.cl.cam.ac.uk/~mgk25/iso-time.html. For an
excellent tool to convert between time zones see http://www.timeanddate.com.
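As a small illustration of why ISO 8601 suits machine processing, the sketch below parses the example timestamp with Python's standard `datetime` module and converts it to UTC. The variable names are arbitrary; nothing here is specific to any platform mentioned in the slides.

```python
from datetime import datetime, timezone

# Parse the ISO 8601 example: 10:00 local time, two hours ahead of UTC.
stamp = datetime.fromisoformat("2008-01-14T10:00+02:00")

# Convert the same instant to UTC.
as_utc = stamp.astimezone(timezone.utc)
print(as_utc.isoformat())

# The date alone, in the extended (with dashes) and basic (no separators) forms.
print(stamp.date().isoformat())
print(stamp.strftime("%Y%m%d"))
```

Because the offset travels with the value, the conversion needs no guessing about which time zone the writer meant.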
Numbers and currencies
• Separators for whole part and decimal part
  Finland: 1 234,56
  USA: 1,234.56
• Grouping separators (thousands or something else)
• Special formats for negative numbers
• Currency symbols: before or after the number? Any white space?
• Up-to-date exchange rates for currency conversions
£$€¥
-1,10 €
($1.10)
-£1.10
N$1.10-
The ISO 4217 currency codes are mostly used in banking and finance, not in personal productivity or
normal office applications.
Interesting discussion topic: the European Commission has specified (or at least used to specify) that the
plural form of euro should also be "euro" (not "euros") and the same for cent. Many people view this as
ungrammatical. See, for example, http://www.evertype.com/standards/euro/euronames.html.
The technical use (or misuse) of the euro symbol is a frequent concern for e-mail applications and web
browsers. See http://www.cs.tut.fi/~jkorpela/html/euro.html.
Measurements
• Metric system as standard, but imperial still widely used
• Conversions from kilometers to miles, kilograms to pounds, Celsius to Fahrenheit, liters to gallons... and back
• Clearly labeled units are essential
• Allow both systems and give the user a choice
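The conversions listed above are simple arithmetic; the design point is to store one canonical unit and convert only for display. A minimal sketch follows, in which the function names and the "metric"/"imperial" setting are hypothetical, and the gallon is assumed to be the U.S. liquid gallon.

```python
# Exact definitions of the imperial/U.S. units in metric terms.
KM_PER_MILE = 1.609344
KG_PER_POUND = 0.45359237
LITERS_PER_US_GALLON = 3.785411784  # assumption: U.S. liquid gallon, not imperial


def km_to_miles(km):
    return km / KM_PER_MILE


def kg_to_pounds(kg):
    return kg / KG_PER_POUND


def liters_to_us_gallons(liters):
    return liters / LITERS_PER_US_GALLON


def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9


def format_distance(km, system):
    """Show a stored metric distance in the user's chosen system, with its unit.

    The decimal point is not localized here; that is a separate formatting step.
    """
    if system == "metric":
        return f"{km:.1f} km"
    if system == "imperial":
        return f"{km_to_miles(km):.1f} mi"
    raise ValueError(f"unknown measurement system: {system}")


print(format_distance(10, "metric"))
print(format_distance(10, "imperial"))
```

The unit label is always part of the output, so a number is never shown without saying what it measures.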
Paper sizes
• Seemingly endless varieties
• European “DIN” papers: A0, A1, A2, A3, A4, A5...
• U.S. Letter and U.S. Legal
• Japanese JIS standards
• Various envelopes
• Delegate work to operating system, allow user to choose
For a thorough introduction to paper sizes, see Markus Kuhn’s “International standard paper sizes” at
<http://www.cl.cam.ac.uk/~mgk25/iso-paper.html>.
Addresses, telephone numbers
• Order of address fields varies (country/addressee first/last)
• Mandatory fields different across countries
• Grouping of telephone number digits
067 853 815 (Italy)
01 44 76 12 05 (France)
9 33 12 36 (Germany)
45 40 16 8 (Luxembourg)
362 8090 (Finland)
456-5570 (USA)
Note that in the Russian address example the country comes first, then the postal code, city and street
address. The addressee is last.
Mandatory "U.S. state" fields are finally starting to disappear from web forms. However, there are still
many, many forms that ask for your "ZIP code" - even if you selected something else besides the United
States as your country. In other countries the equivalent is usually known as "postal code" or "post
code".
The Universal Postal Union (UPU) publishes a list of the international postal address formats. See
http://www.upu.int for guidance.
For telephone numbers, the World Telephone Numbering Guide or WTNG is one of the best resources.
See http://www.wtng.info for details.
Names and ordering
• Last name first or first name last?
• Hungary, Japan, China, Korea: family name first
• Dictionary or phonebook ordering (Germany, Iceland)
• Alphabetical order is language dependent (V and W)
• Local variations in ordering even in the same language
Because accented characters are not part of the English language, accents are usually ignored in English
sorting.
In Finnish the letters V and W are treated as equal in sorting.
In German phonebook ordering "ö" and "oe" are equal (cf. Örtner - Oertner), as are "ss" and "ß".
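To see why alphabetical order cannot be left to a plain string comparison, consider the sketch below. It sorts a made-up list of Finnish words by Unicode code point, which is what Python's built-in `sorted()` does, and compares the result with the order a Finnish reader expects (the Finnish alphabet ends ... X Y Z Å Ä Ö).

```python
import unicodedata


def nfc(text):
    """Normalize to precomposed form so each accented letter is one code point."""
    return unicodedata.normalize("NFC", text)


# Hypothetical sample data.
words = [nfc(w) for w in ["Öljy", "Zeppelin", "Åbo", "Äänekoski", "Aalto"]]

# Plain sorted() compares code points, not positions in any alphabet.
by_code_point = sorted(words)
print(by_code_point)

# The order a Finnish reader expects: Å comes before Ä.
expected_finnish = [nfc(w) for w in ["Aalto", "Zeppelin", "Åbo", "Äänekoski", "Öljy"]]
print(by_code_point == expected_finnish)
```

The code point order puts Ä before Å, which is wrong for Finnish, and it would also be wrong for German, where Ä sorts with A. A locale-aware collation service is the usual fix; in Python that could be `locale.strxfrm` as a sort key, provided the operating system has the required locale installed.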
Symbols, images, colors, gestures
• Symbols (eagle, owl, skull & crossbones) have variations in meaning
• Images from everyday life look different around the world
• Affinity to different kinds of color schemes; sacred colors; warnings
• Hand gestures: thumbs up, thumbs down, fingers crossed...
[Figure: hand gestures labeled by region: English-speaking world, France, some Mediterranean countries, Greece, Brazil, Germany, Japan, Europe, Paraguay, Australia, Europe, the Americas]
It is interesting to note how the checkmark has found its way to user interfaces. In Finland the checkmark
means "wrong answer" in a school test, especially to elderly people. This is almost the complete opposite
of the U.S. usage - it means something is done, or "taken care of".
The U.S. rural mailbox may be puzzling to some people. The raised flag means there is mail in the box
(incoming or outgoing). In other countries people may wonder about this strange elevated tin can. (Fun
fact: there are several patented mailbox flag designs, see for example
http://www.patentstorm.us/patents/5454509-description.html.)
From Axtell, Roger E. (ed.) Do's and Taboos Around the World. 3rd ed. New York: Wiley, 1993
The OK gesture:
English-speaking: OK
France: zero, nothing, worthless
Mediterranean: a homosexual man
Japan: money
Brazil & Germany: vulgar, obscene gesture
Fingers crossed:
Europe: good luck, protection
Paraguay: offensive gesture
Thumbs up:
Australia: rude gesture
Most of the world: okay
Be prepared for all this
• Design software to accommodate different languages, conventions, and content
• Isolate the parts that change from one language/culture to another
• Adapt to user settings at runtime
• Use a localization expert for the text translation and other adaptation
Internationalization (I18N)
• “The process of designing an application so that it can be adapted to various languages and regions without engineering changes.” (The Java 2 SDK documentation, my emphasis)
• Software design with cultural awareness
• Ideally an integral part of the engineering process, not an afterthought
Internationalization is what makes localization possible.
A quote from Suzanne Topping:
"Localization and internationalization are symbiotic. Without localization, there is no need to
internationalize. Without internationalization, you'll wish you never attempted to localize."
Localization (L10N)
• Translation and cultural adaptation of the user-visible text and other media content
• Preparing a product to meet regional needs
• Very difficult without internationalization
• Performed by localization experts and professional translators
Internationalization guidelines
• Separate all user-visible text from source code into resource files
• Use operating system services to format and sort data
• Have the text and content translated and adapted by localization professionals
• Design your user interface to accommodate different text lengths
Using resource bundles
• Move all user-visible text out of the program source code
• Isolate text and other content into resource files or “bundles”
• Label each resource bundle with the language (and country)
• Make the same application work with all the resource files
• Same concept in all modern operating systems, details and APIs vary
In their simplest form, resources are just text strings with a key value that is used to identify them when
loaded at runtime. You typically have several different sets of resources, labelled according to the locales.
Resources can also be images, cursors, sounds, and even arbitrary binary content (objects in Java
and .NET).
Windows: resource script file, compiled and linked to the binary
Java: resource bundles, can be key-value pairs of text or Java objects
.NET: plain text or XML, key-value strings or objects, embedded in the application or assembly
Mac OS X: string files containing key-value pairs, nib files for UI elements
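The key-value idea is the same on every platform, so it can be sketched without any particular resource API. The example below is a toy in plain Python, not how Java, Windows, .NET or Mac OS X store resources: the `RESOURCES` table, the `lookup` function, the keys and the translations are all invented for illustration.

```python
# Hypothetical resource sets, one per locale label. A real application
# would load these from separate resource files, not from source code.
RESOURCES = {
    "en": {"greeting": "Hello", "quit": "Quit"},
    "fi": {"greeting": "Hei", "quit": "Lopeta"},
    "fr": {"greeting": "Bonjour", "quit": "Quitter"},
    "fr-CA": {"greeting": "Allô"},  # only overrides what differs from "fr"
}
DEFAULT_LOCALE = "en"


def lookup(key, locale_tag):
    """Find a string by key, falling back from fr-CA to fr to the default."""
    candidates = [locale_tag]
    if "-" in locale_tag:
        candidates.append(locale_tag.split("-")[0])  # language without country
    candidates.append(DEFAULT_LOCALE)
    for tag in candidates:
        bundle = RESOURCES.get(tag, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)


print(lookup("greeting", "fr-CA"))  # found in the fr-CA set
print(lookup("quit", "fr-CA"))      # falls back to the fr set
print(lookup("quit", "sv-SE"))      # no Swedish set, falls back to the default
```

The program code never contains user-visible text, only keys, so adding a language means adding a resource set and nothing else.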
Working with locales
• Locale = combination of language and country
• Used by the operating system for user language and settings
• Examples: French as spoken in France (fr-FR), or French in Canada (fr-CA)
• Resource bundles labeled with locales
• Application queries locale at runtime, loads correct resources
There has been much debate about the actual definition of a locale. Is it just a marker for a
language-region combination, or does it also contain all the related information? The jury is still out: see,
for example, "Locale definitions: the ongoing debate" by Suzanne Topping (Multilingual Computing &
Technology #47, Volume 13 Issue 3, April/May 2002), http://www.multilingual.org.
Java: uses ISO 639 language codes and ISO 3166 country codes: fi_FI, sv_SE, fr_FR, fr_CA, en_US,
en_GB
Microsoft Windows: numeric identifier called the LCID. 0x040B = Finnish, 0x0C0C = French (Canada)
etc.
Microsoft .NET: "cultures": identified with RFC 1766 compliant strings like en-US, de-DE, ja-JP
Mac OS X: locales use ISO 639 and ISO 3166 codes
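The different locale notations carry the same two pieces of information. The sketch below splits a Java-style or RFC 1766-style tag into language and country, and splits a Windows LCID into its primary language and sublanguage parts. The function names are invented here; the bit layout (primary language in the low 10 bits of the language identifier, sublanguage in the next 6) is the one used by the Win32 PRIMARYLANGID and SUBLANGID macros.

```python
def parse_locale(tag):
    """Split 'fr-CA' or 'fr_CA' into (language, country); country may be None."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    country = parts[1].upper() if len(parts) > 1 else None
    return language, country


def split_lcid(lcid):
    """Split the language identifier of an LCID into (primary, sublanguage)."""
    lang_id = lcid & 0xFFFF  # the low 16 bits hold the language identifier
    return lang_id & 0x3FF, lang_id >> 10


print(parse_locale("fr_CA"))
print(parse_locale("fr-FR"))

# 0x040B = Finnish, 0x0C0C = French (Canada), as in the notes above.
print([hex(part) for part in split_lcid(0x040B)])
print([hex(part) for part in split_lcid(0x0C0C)])
```

Both French locales share the primary language part and differ only in the sublanguage, which mirrors fr-FR versus fr-CA in the string notation.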
Formatting and sorting
• Formatting dates and numbers correctly could be tricky on your own
• Sorting is even more difficult to get right - don’t even bother!
• Operating system APIs and libraries do all the work, just call them
• Java SE, Win32 API, .NET, Mac OS X, gettext in Linux, Dojo & GWT in JavaScript/AJAX/Web 2.0, Symbian OS...
Some pointers for formatting data in different systems:
Java: the java.text package classes, like SimpleDateFormat, DateFormat and NumberFormat
Windows / Win32 API: GetDateFormat, GetTimeFormat, GetNumberFormat
C# / .NET: classes in the System.Globalization namespace, such as DateTimeFormatInfo and
NumberFormatInfo
Mac OS X / Cocoa: NSFormatter and its subclasses, like NSDateFormatter and NSNumberFormatter
Flexible, localizable user interface
• Text will often expand when it is translated (10-300%)
• Leave room, make sure translated text is not clipped
• Don’t reuse terminology (same original word, different translation)
• Pay attention to data presentation, changes in field order
Problems with multilingual text
• ASCII, Windows code pages, ISO 8859 family: not enough codes for the world’s languages (like Chinese)
• Mutually incompatible encodings exist - different code for same character
• Transcoding between systems may result in information loss
• Users need to install special fonts, switch encodings... difficult.
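The incompatibility is easy to demonstrate with the euro sign. The sketch below uses Python's built-in codecs for Windows code page 1252, ISO 8859-15 and ISO 8859-1 (Latin-1); the variable names are arbitrary.

```python
euro = "\u20ac"  # EURO SIGN

# Same character, different code in two legacy encodings.
in_cp1252 = euro.encode("cp1252")
in_iso8859_15 = euro.encode("iso8859_15")
print(in_cp1252, in_iso8859_15)

# Reading ISO 8859-15 bytes as Latin-1 silently yields another character.
misread = in_iso8859_15.decode("latin-1")
print(misread)  # the generic currency sign, not the euro

# Latin-1 has no euro at all: transcoding fails or loses information.
try:
    euro.encode("latin-1")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
lossy = euro.encode("latin-1", errors="replace")
print(lossy)
```

The misread case is the dangerous one, because no error is raised and the damage is only noticed by a human reading the text.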
Unicode
• Unicode is a “universal character encoding scheme”
• Developed by The Unicode Consortium (www.unicode.org)
• Has enough codes for all living languages (and some dead ones, too)
• Latest version 5.0 encodes almost 100,000 characters
• Not a silver bullet, but a huge step in the right direction
Unicode and ISO 10646 were started at the same time, but luckily both groups realized soon enough that
they were doing the same thing. Being sensible people, they decided to make the two standards
compatible with each other (but still had to make two instead of just one).
The Unicode Consortium is an industry cooperation effort, whereas ISO is an international standards
body.
UTF-8
• “UCS Transformation Format, 8-bit encoding form”
• Eight-bit, variable-length encoding of Unicode
• Used by Google, XML, Linux, Mac OS X, Windows...
• Your best bet to learn and standardize on
• ASCII-transparent
UTF-8 is quite important because it is the default encoding of XML documents: if no encoding is declared
in the XML prolog and the document has no byte order mark, UTF-8 is assumed.
UTF-8 is also “ASCII transparent”, i.e. all ASCII codes are the same in UTF-8.
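Both properties, variable length and ASCII transparency, can be observed directly. This sketch encodes a few sample characters with Python's built-in UTF-8 codec; the choice of samples is arbitrary.

```python
# Sample characters: ASCII, Latin with diaeresis, euro sign, a Chinese
# ideograph, and one character outside the Basic Multilingual Plane.
samples = ["A", "\u00c4", "\u20ac", "\u6c49", "\U0001d11e"]

# Variable length: one to four bytes per character.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)

# ASCII transparency: pure ASCII text has identical bytes in both encodings.
text = "I18N Primer"
print(text.encode("utf-8") == text.encode("ascii"))

# Bytes of non-ASCII characters all have the high bit set, so they can
# never be mistaken for ASCII characters such as '/' or a quote mark.
print(all(byte >= 0x80 for byte in "\u00c4\u20ac".encode("utf-8")))
```

This is why software that only understands ASCII can often pass UTF-8 text through untouched.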
Benefits of Unicode
• Every character has its own unique codepoint
• Several possibilities for encoding (UTF-8 and also UTF-16, UTF-32)
• Every character in the repertoire has a name and semantic information
• Every XML document uses Unicode (UTF-8) by default
• The only sensible solution for multilingual documents
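The codepoint, name and semantic information of a character can be inspected with Python's standard `unicodedata` module, which exposes the Unicode character database. The helper `describe` is made up for this sketch.

```python
import unicodedata


def describe(ch):
    """Return (codepoint, name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


# Lu = uppercase letter, Sc = currency symbol, Lo = other letter.
for ch in ["\u00c4", "\u20ac", "\u6c49"]:
    print(describe(ch))

# One codepoint, several possible encodings.
a_umlaut = "\u00c4"
print(a_umlaut.encode("utf-8"))
print(a_umlaut.encode("utf-16-be"))
print(a_umlaut.encode("utf-32-be"))
```

The codepoint identifies the character; the encoding only decides how that number is written as bytes.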
Multilingual fonts
• One glyph in a font may represent many characters
• All characters for all languages = one very big font, not always practical
• Arial Unicode MS, ships with Microsoft Office
• Code 2000, shareware
• Lucida Sans Unicode
• For quality, dedicated fonts often used for Chinese, Korean, Japanese
• Outline or raster: screen resolution vs. memory size
Packaging, shipping and handling
• Resource files are just part of normal builds
• Desktop applications with installer: business as usual
• Web applications: resources loaded from the server
• Embedded software: whatever fits in ROM or flash memory
• iPod has all languages on board, Nokia phones 5-10 at a time
• Fonts included / downloaded / found in OS
Summary
• No special language skills required; no rocket science here
• I18N is an engineering problem, not a language problem
• Learn the guidelines and how to implement them in your platform
• Use existing services and established professionals
• Your mindset matters: openness and curiosity go a long way!
Thank you
Jere Käpyaho
References
• Apple/Mac OS X, Getting Started With Internationalization.
  http://developer.apple.com/referencelibrary/GettingStarted/GS_Internationalization/index.html
• Dr. International, Developing International Software, 2nd Edition. Microsoft Press, 2002.
  http://www.microsoft.com/globaldev/getWR/DIS_v2/default.mspx
• The Unicode Standard, Version 5.0. Addison-Wesley, 2006. http://www.unicode.org
  http://www.unicode.org/versions/Unicode5.0.0/
• The World Wide Web Consortium (W3C) Internationalization Activity.
  http://www.w3.org/International/
• Andrew Deitsch & David Czarnecki, Java Internationalization. O'Reilly, 2001.
  http://www.oreilly.com/catalog/javaint/
• Bill Tuthill & David Smallberg, Creating Worldwide Software, 2nd Edition. Sun Microsystems Press, 1997
• David A. Schmitt, International Programming for Microsoft Windows. Microsoft Press, 2000