I18N Primer TUT 2008
Internationalization – a primer
Tampere University of Technology, 2008-01-14
Jere Käpyaho, Nokia Corporation

Software as part of culture
• Cultural aspects often neglected in software design
• Localized software as a unique selling point, legal requirement
• Software – menus and dialogs – in your own language
• Data in familiar format and order
• All this can be achieved with the right disciplines

Several different cultural models exist:
• Hofstede (1991): culture is "software of the mind"
• Trompenaars (1993): culture is "the way in which a group of people solves problems"
• Victor (1992): LESCANT model
• Hall (1990): culture is a "program of behaviour"

“Believe it or not!”
• The alphabet is not just A to Z (or even A to Ö)
• Sometimes there is no alphabet, but ideographs, or pictograms
• Upper case and lower case matter – except when they don’t
• Punctuation is often different
• Line-breaking and justification rules are important

ÅÄÖ ÐØÆĚŮ Санкт-Петербург Παναθηναϊκός zum Beispiel ขอให้โชคดี ¿Cómo está usted? はい、そうです。 汉语/漢語

The "Believe it or not" slides point out some of the typical (Western/American) assumptions about text and user interfaces.

And it gets stranger...
• Text and layout are not always left to right
• User interface elements are sometimes “mirrored”
• Bidirectional languages, Arabic and Hebrew
• Latin text can occur in between
• .siht ekil kool ton seod hsilgnE tuB

Points of note:
• the tabs are right aligned
• text is also right aligned and runs from right to left, but contains Latin text which runs from left to right
• radio button alignment
• arrowhead directions

How to type it all in?
• Latin text is easy, only 26 to 50 letters
• Chinese, Japanese, Korean – thousands!
• Input Method Editor finds the equivalent character as the user types
• Full keyboards often have keys for the most frequently used characters

Microsoft Windows: Input Method Manager, Global IME
Java: Input Method Framework, available since Java 2, v1.2

Recommended reading: Michael S. Kaplan and Cathy Wissink, Unicode and Keyboards on Windows (http://www.microsoft.com/globaldev/handson/dev/inputinwin.mspx)

What year is it, anyway?
• Gregorian calendar: 2008
• What about Islamic? Hebrew? Chinese lunar?
  • 1429 in Islamic hijri
  • 5768 in Hebrew
  • 4704, the Year of the Fire Pig (year of Ding Hai, 丁亥) in Chinese lunar (until 2008-02-06)
• First day of week: Monday – or is it Sunday?
• Week numbers: not widely used in the U.S.

This Gregorian year’s equivalents in other calendars were mostly obtained from the Calendrica applet from Reingold and Dershowitz, Calendrical Calculations - The Millennium Edition, Cambridge University Press, 2001. The applet is online at http://www.calendarists.com.

In China the Gregorian calendar is often used in daily life, but festivities (such as the Chinese New Year) are determined by the traditional lunar calendar. For the Chinese New Year, see Wikipedia (http://en.wikipedia.org/wiki/Chinese_calendar) or the Lunarcal utility at http://www.lunarcal.org.

The Islamic religious calendar is based on observation of the Moon, but in daily life the computational variant is usually used.

The Japanese also use the Gregorian calendar, but traditionally years are numbered from the beginning of the current emperor's reign. The name of the era is attached to the year (currently "Heisei").

The Buddhist calendar is identical to the Gregorian calendar in all respects except for the year. Years are numbered from the death (parinirvana) of Buddha, traditionally dated to 543 BCE (Gregorian).

According to the ISO 8601 standard, the week-numbering year begins on the Monday that falls between December 29 and January 4, and ends on the Sunday that falls between December 28 and January 3. So the year has 52 or 53 weeks.

Dates and times
14.1.2008 (Finland)
1/14/08 (U.S.A.)
2008年1月14日 (China)
20080114 (standard)
• Different order of date parts (day, month, year) – confusion!
• Different separators between date parts (dash, slash, dot, letter, text...)
• Leading zeros for days and months < 10
• Two or four digit years: 2008 or just 08
• International standard (ISO 8601) not in everyday use

The ISO 8601 standard defines a canonical representation for dates and times. A full date and time in ISO 8601 format would be something like:

2008-01-14T10:00+02:00

The date part comes first, followed by a T as a separator. Then comes the time part, which should include a time zone offset, or the letter Z if you mean Coordinated Universal Time (UTC, roughly equivalent to the old GMT or Greenwich Mean Time, which is now obsolete as a term). The timestamp above denotes 10:00 local time in a zone two hours ahead of UTC (such as Finland in winter), which corresponds to 08:00 UTC.

The ISO 8601 format is logical (proceeding from the most significant part down) and useful in storing dates and times for machine processing, but it is different from the everyday formats most people are used to. So it will most likely never be adopted into everyday use.

See Markus Kuhn’s summary of ISO 8601 at http://www.cl.cam.ac.uk/~mgk25/iso-time.html. For an excellent tool to convert between time zones see http://www.timeanddate.com.

Numbers and currencies
• Separators for whole part and decimal part
  • Finland: 1 234,56
  • USA: 1,234.56
• Grouping separators (thousands or something else)
• Special formats for negative numbers
• Currency symbols: before or after the number? Any white space?
• Up-to-date exchange rates for currency conversions

£$€¥
-1,10 €   ($1.10)   -£1.10   N$1.10-

The ISO 4217 currency codes are mostly used in banking and finance, not in personal productivity or normal office applications.

Interesting discussion topic: the European Commission has specified (or at least used to specify) that the plural form of euro should also be "euro" (not "euros") and the same for cent. Many people view this as ungrammatical. See, for example, http://www.evertype.com/standards/euro/euronames.html.

The technical use (or misuse) of the euro symbol is a frequent concern for e-mail applications and web browsers. See http://www.cs.tut.fi/~jkorpela/html/euro.html.
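As a rough illustration of the ISO 8601 notes in the dates and times section, the offset arithmetic for the timestamp 2008-01-14T10:00+02:00 and the week-numbering rule can be checked with Python's standard datetime module. This is only a sketch: the helper names to_utc and iso_week are made up for this example and are not part of the talk.

```python
from datetime import date, datetime, timezone


def to_utc(stamp: str) -> str:
    """Parse an ISO 8601 timestamp that carries an offset and re-express it in UTC.

    Assumption: the input includes an explicit offset such as +02:00. Without
    one, astimezone() would fall back to the system's local time zone.
    """
    local = datetime.fromisoformat(stamp)
    return local.astimezone(timezone.utc).isoformat()


def iso_week(day: date) -> tuple:
    """Return (ISO week-numbering year, week number, weekday), with Monday = 1."""
    return tuple(day.isocalendar())


print(to_utc("2008-01-14T10:00+02:00"))  # 2008-01-14T08:00:00+00:00
print(iso_week(date(2008, 1, 14)))       # (2008, 3, 1)
print(iso_week(date(2007, 12, 31)))      # (2008, 1, 1)
```

The last line shows why week numbers need care: Monday 2007-12-31 already belongs to week 1 of the ISO week-numbering year 2008.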
Measurements
• Metric system as standard, but imperial still widely used
• Conversions from kilometers to miles, kilograms to pounds, Celsius to Fahrenheit, liters to gallons... and back
• Clearly labeled units are essential
• Allow both systems and give the user a choice

Paper sizes
• Seemingly endless varieties
  • European “DIN” papers: A0, A1, A2, A3, A4, A5...
  • U.S. Letter and U.S. Legal
  • Japanese JIS standards
  • Various envelopes
• Delegate work to operating system, allow user to choose

For a thorough introduction to paper sizes, see Markus Kuhn’s “International standard paper sizes” at <http://www.cl.cam.ac.uk/~mgk25/iso-paper.html>.

Addresses, telephone numbers
• Order of address fields varies (country / addressee first / last)
• Mandatory fields different across countries
• Grouping of telephone number digits
  • 067 853 815 (Italy)
  • 01 44 76 12 05 (France)
  • 9 33 12 36 (Germany)
  • 45 40 16 (Luxembourg)
  • 8 362 8090 (Finland)
  • 456-5570 (USA)

Note that in the Russian address example the country comes first, then the postal code, city and street address. The addressee is last.

Mandatory "U.S. state" fields are finally starting to disappear from web forms. However, there are still many, many forms that ask for your "ZIP code" - even if you selected something other than the United States as your country. In other countries the equivalent is usually known as "postal code" or "post code".

The Universal Postal Union (UPU) publishes a list of the international postal address formats. See http://www.upu.int for guidance.

For telephone numbers, the World Telephone Numbering Guide or WTNG is one of the best resources. See http://www.wtng.info for details.

Names and ordering
• Last name first or first name last?
  • Hungary, Japan, China, Korea: family name first
• Dictionary or phonebook ordering (Germany, Iceland)
• Alphabetical order is language dependent (V and W)
• Local variations in ordering even in the same language

Because accented characters are not part of the English language, accents are usually ignored in English sorting. In Finnish the letters V and W are treated as equal in sorting. In German phonebook ordering "ö" and "oe" are equal (cf. Örtner - Oertner), as are "ss" and "ß".

Symbols, images, colors, gestures
• Symbols (eagle, owl, skull & crossbones) have variations in meaning
• Images from everyday life look different around the world
• Affinity to different kinds of color schemes; sacred colors; warnings
• Hand gestures: thumbs up, thumbs down, fingers crossed...

[Slide graphic: hand gestures labelled by region: English-speaking world, France, some Mediterranean countries, Greece, Brazil, Germany, Japan, Europe, Paraguay, Australia, Europe, the Americas]

It is interesting to note how the checkmark has found its way to user interfaces. In Finland the checkmark means "wrong answer" in a school test, especially to elderly people. This is almost the complete opposite of the U.S. usage, where it means something is done, or "taken care of".

The U.S. rural mailbox may be puzzling to some people. The raised flag means there is mail in the box (incoming or outgoing). In other countries people may wonder about this strange elevated tin can. (Fun fact: there are several patented mailbox flag designs, see for example http://www.patentstorm.us/patents/5454509-description.html.)

From Axtell, Roger E. (ed.) Do's and Taboos Around the World. 3rd ed. New York: Wiley, 1993:

The OK gesture:
• English-speaking: OK
• France: zero, nothing, worthless
• Mediterranean: a homosexual man
• Japan: money
• Brazil & Germany: vulgar, obscene gesture

Fingers crossed:
• Europe: good luck, protection
• Paraguay: offensive gesture

Thumbs up:
• Australia: rude gesture
• Most of the world: okay

Be prepared for all this
• Design software to accommodate different languages, conventions, and content
• Isolate the parts that change from one language/culture to another
• Adapt to user settings at runtime
• Use a localization expert for the text translation and other adaptation

Internationalization (I18N)
• “The process of designing an application so that it can be adapted to various languages and regions without engineering changes.” (The Java 2 SDK documentation, my emphasis)
• Software design with cultural awareness
• Ideally an integral part of the engineering process, not an afterthought

Internationalization is what makes localization possible. A quote from Suzanne Topping: "Localization and internationalization are symbiotic. Without localization, there is no need to internationalize. Without internationalization, you'll wish you never attempted to localize."
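The "isolate the parts that change" and "adapt to user settings at runtime" points above can be sketched in a few lines of Python. Everything here is a made-up illustration rather than any platform's real API: the BUNDLES table, the sample strings, and the helper names fallback_chain and lookup are all hypothetical, and a real application would use its platform's resource mechanism instead.

```python
# Hypothetical in-memory "resource files": one key -> text table per locale.
# The sample strings are illustrative only.
BUNDLES = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "fr": {"greeting": "Bonjour", "farewell": "Au revoir"},
    "fr-CA": {"greeting": "Allô"},
}
DEFAULT_LOCALE = "en"


def fallback_chain(locale_tag: str) -> list:
    """List locale tags from most to least specific, e.g. fr-CA, fr, en."""
    parts = locale_tag.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if DEFAULT_LOCALE not in chain:
        chain.append(DEFAULT_LOCALE)
    return chain


def lookup(key: str, locale_tag: str) -> str:
    """Find the text for a key, trying the most specific locale first."""
    for tag in fallback_chain(locale_tag):
        text = BUNDLES.get(tag, {}).get(key)
        if text is not None:
            return text
    raise KeyError(key)


print(lookup("greeting", "fr-CA"))   # Allô
print(lookup("farewell", "fr-CA"))   # Au revoir
print(lookup("greeting", "fi-FI"))   # Hello
```

The program code only ever refers to keys such as "greeting"; the text itself lives in per-locale tables, so adding a language means adding a table, not changing the code.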
Localization (L10N)
• Translation and cultural adaptation of the user-visible text and other media content
• Preparing a product to meet regional needs
• Performed by localization experts and professional translators
• Very difficult without internationalization

Internationalization guidelines
• Separate all user-visible text from source code into resource files
• Have the text and content translated and adapted by localization professionals
• Use operating system services to format and sort data
• Design your user interface to accommodate different text lengths

Using resource bundles
• Move all user-visible text out of the program source code
• Isolate text and other content into resource files or “bundles”
• Label each resource bundle with the language (and country)
• Make the same application work with all the resource files
• Same concept in all modern operating systems, details and APIs vary

In their simplest form, resources are just text strings with a key value that is used to identify them when loaded at runtime. You typically have several different sets of resources, labelled according to the locales. Resources can also contain images, cursors, sounds, and even arbitrary binary content (objects in Java and .NET).

Windows: resource script file, compiled and linked to the binary
Java: resource bundles, can be key-value pairs of text or Java objects
.NET: plain text or XML, key-value strings or objects, embedded in the application or assembly
Mac OS X: string files containing key-value pairs, nib files for UI elements

Working with locales
• Locale = combination of language and country
  • Examples: French as spoken in France (fr-FR), or French in Canada (fr-CA)
• Used by the operating system for user language and settings
• Resource bundles labeled with locales
• Application queries locale at runtime, loads correct resources

There has been much debate about the actual definition of a locale. Is it just a marker for a language-region combination, or does it also contain all the related information? The jury is still out: see, for example, "Locale definitions: the ongoing debate" by Suzanne Topping (Multilingual Computing & Technology #47, Volume 13 Issue 3, April/May 2002), http://www.multilingual.org.

Java: uses ISO 639 language codes and ISO 3166 country codes: fi_FI, sv_SE, fr_FR, fr_CA, en_US, en_GB
Microsoft Windows: numeric identifier called the LCID. 0x040B = Finnish, 0x0C0C = French (Canada) etc.
Microsoft .NET: "cultures", identified with RFC 1766 compliant strings like en-US, de-DE, ja-JP
Mac OS X: locales use ISO 639 and ISO 3166 codes

Formatting and sorting
• Formatting dates and numbers correctly could be tricky on your own
• Sorting is even more difficult to get right - don’t even bother!
• Operating system APIs and libraries do all the work, just call them
  • Java SE, Win32 API, .NET, Mac OS X, gettext in Linux, Dojo & GWT in JavaScript/AJAX/Web 2.0, Symbian OS...

Some pointers for formatting data in different systems:
Java: the java.text package classes, like SimpleDateFormat, DateFormat and NumberFormat
Windows / Win32 API: GetDateFormat, GetTimeFormat, GetNumberFormat
C# / .NET: classes in the System.Globalization namespace, such as DateTimeFormatInfo and NumberFormatInfo
Mac OS X / Cocoa: NSFormatter and its subclasses, like NSDateFormatter and NSNumberFormatter

Flexible, localizable user interface
• Text will often expand when it is translated (10-300%)
• Leave room, make sure translated text is not clipped
• Don’t reuse terminology (same original word, different translation)
• Pay attention to data presentation, changes in field order

Problems with multilingual text
• ASCII, Windows code pages, ISO 8859 family: not enough codes for the world’s languages (like Chinese)
• Mutually incompatible encodings exist - different code for same character
• Transcoding between systems may result in information loss
• Users need to install special fonts, switch encodings... difficult.

Unicode
• Unicode is a “universal character encoding scheme”
• Developed by The Unicode Consortium (www.unicode.org)
• Has enough codes for all living languages (and some dead ones, too)
• Latest version 5.0 encodes almost 100,000 characters
• Not a silver bullet, but a huge step in the right direction

Unicode and ISO 10646 were started at the same time, but luckily both groups realized soon enough that they were doing the same thing. Being sensible people, they decided to make the two standards compatible with each other (but still had to make two instead of just one). The Unicode Consortium is an industry cooperation effort, whereas ISO is an international standards body.

UTF-8
• “UCS Transformation Format, 8-bit encoding form”
• Eight-bit, variable-length encoding of Unicode
• Used by Google, XML, Linux, Mac OS X, Windows...
• Your best bet to learn and standardize on
• ASCII-transparent

UTF-8 is quite important because it is the default encoding of XML documents: if no encoding is specified in the XML prolog, UTF-8 is assumed. UTF-8 is also “ASCII transparent”, i.e. all ASCII codes are the same in UTF-8.

Benefits of Unicode
• Every character has its own unique codepoint
• Several possibilities for encoding (UTF-8 and also UTF-16, UTF-32)
• Every character in the repertoire has a name and semantic information
• Every XML document uses Unicode (UTF-8) by default
• The only sensible solution for multilingual documents

Multilingual fonts
• One glyph in a font may represent many characters
• All characters for all languages = one very big font, not always practical
  • Arial Unicode MS, ships with Microsoft Office
  • Code 2000, shareware
  • Lucida Sans Unicode
• For quality, dedicated fonts often used for Chinese, Korean, Japanese
• Outline or raster: screen resolution vs. memory size

Packaging, shipping and handling
• Resource files are just part of normal builds
• Desktop applications with installer: business as usual
• Web applications: resources loaded from the server
• Embedded software: whatever fits in ROM or flash memory
  • iPod has all languages on board, Nokia phones 5-10 at a time
• Fonts included / downloaded / found in OS

Summary
• No special language skills required; no rocket science here
• I18N is an engineering problem, not a language problem
• Learn the guidelines and how to implement them in your platform
• Use existing services and established professionals
• Your mindset matters: openness and curiosity go a long way!

Thank you
Jere Käpyaho

References
• Apple/Mac OS X, Getting Started With Internationalization. http://developer.apple.com/referencelibrary/GettingStarted/GS_Internationalization/index.html
• Dr. International, Developing International Software, 2nd Edition. Microsoft Press, 2002. http://www.microsoft.com/globaldev/getWR/DIS_v2/default.mspx
• The Unicode Standard, Version 5.0. Addison-Wesley, 2006. http://www.unicode.org, http://www.unicode.org/versions/Unicode5.0.0/
• The World Wide Web Consortium (W3C) Internationalization Activity. http://www.w3.org/International/
• Andrew Deitsch & David Czarnecki, Java Internationalization. O'Reilly, 2001. http://www.oreilly.com/catalog/javaint/
• Bill Tuthill & David Smallberg, Creating Worldwide Software, 2nd Edition. Sun Microsystems Press, 1997
• David A. Schmitt, International Programming for Microsoft Windows. Microsoft Press, 2000