How to add a collation Alexander Barkov September, 2007 MySQL AB
Copyright 2007 MySQL AB
The World’s Most Popular Open Source Database

Plan for this session
• What is collation?
• The WEIGHT_STRING function
• Types of collations in MySQL
• Adding collations for 8-bit character sets without recompiling
• Adding collations for Unicode character sets without recompiling
• Adding collations into the source code

What is collation?
• A collation is a set of rules for comparing and sorting character strings.
• Every MySQL collation belongs to a single character set.
• Every MySQL character set can have one or more collations that belong to it.
• This session assumes that the character sets already exist ("How to add a new character set" would be another session).

The WEIGHT_STRING SQL function
New function in 5.2:
WEIGHT_STRING(character_string_expression) -> binary_key
Converts a character string into its binary key, which is used for comparison and sorting. Any two strings that are equal in their collation return the same result:
SELECT WEIGHT_STRING('a'); -> 0x41
SELECT WEIGHT_STRING('A'); -> 0x41
One of the main goals of WEIGHT_STRING is to make debugging and testing of collations easier.

Types of MySQL collations
• Collations for 8-bit character sets
• Collations for non-Unicode multi-byte character sets
• Collations for Unicode character sets
• xxx_bin collations
• Other collations (minority)

"Simple" collations for 8-bit character sets
The internal implementation uses an array of 256 weights, with a one-to-one mapping between a character and its weight.
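The 256-entry weight array can be pictured with a small sketch. This is only a toy model, not MySQL code: the table `WEIGHTS` and the helpers `weight_string` and `compare` are hypothetical names, and the table folds only ASCII a-z onto A-Z rather than reproducing any real collation.

```python
# Toy model of a "simple" 8-bit collation: exactly one weight per byte value.
# WEIGHTS is a hypothetical table; it only folds ASCII a-z onto A-Z.
WEIGHTS = bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in range(256))


def weight_string(data: bytes) -> bytes:
    """Map every byte of the string to its weight (one-to-one)."""
    return data.translate(WEIGHTS)


def compare(a: bytes, b: bytes) -> int:
    """Compare two strings by their weight strings: returns -1, 0, or 1."""
    wa, wb = weight_string(a), weight_string(b)
    return (wa > wb) - (wa < wb)


print(weight_string(b"a").hex().upper())  # 41
print(weight_string(b"A").hex().upper())  # 41
print(compare(b"abc", b"ABD"))            # -1
```

Because the mapping is one-to-one per byte, the weight string always has the same length as the input, which is what distinguishes this collation type from the UCA-based ones.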
mysql> create table t1 (c1 varchar(2) character set latin1 collate latin1_swedish_ci);
Query OK, 0 rows affected (0.10 sec)

mysql> insert into t1 values ('a'),('A');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select c1, hex(weight_string(c1)) from t1;
+------+------------------------+
| c1   | hex(weight_string(c1)) |
+------+------------------------+
| a    | 41                     |
| A    | 41                     |
+------+------------------------+
2 rows in set (0.00 sec)

Collations for non-Unicode multi-byte character sets: type #1
• ASCII range: one-to-one mapping from code to weight, case insensitive
• Multi-byte characters: weight = code
Example: sjis_japanese_ci:

mysql> create table t1 (c1 varchar(2) character set sjis collate sjis_japanese_ci);
Query OK, 0 rows affected (0.07 sec)

mysql> insert into t1 values ('a'),('A'),(0x82C0);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select c1, hex(weight_string(c1)) from t1;
+------+------------------------+
| c1   | hex(weight_string(c1)) |
+------+------------------------+
| a    | 41                     |
| A    | 41                     |
| ぢ   | 82C0                   |
+------+------------------------+
3 rows in set (0.00 sec)

Collations for non-Unicode multi-byte character sets: type #2
• ASCII range: one-to-one mapping from code to weight, case insensitive
• Multi-byte characters: one-to-one mapping from code to weight
Example: gbk_chinese_ci:

mysql> create table t1 (c1 varchar(2) character set gbk collate gbk_chinese_ci);
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t1 values ('a'),('A'),(0x81B0),(0x81B1);
Query OK, 4 rows affected (0.01 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select c1, hex(c1), hex(weight_string(c1)) from t1;
+------+---------+------------------------+
| c1   | hex(c1) | hex(weight_string(c1)) |
+------+---------+------------------------+
| a    | 61      | 41                     |
| A    | 41      | 41                     |
| 伆   | 81B0    | C286                   |
| 伇   | 81B1    | CACC                   |
+------+---------+------------------------+
4 rows in set (0.00 sec)

Collations for Unicode multi-byte character sets: type #1
• One-to-one mapping from code to weight, case insensitive
• Removes accents before converting from code to weight
Example: utf8_general_ci:

mysql> create table t1 (c1 char(1) character set utf8 collate utf8_general_ci);
Query OK, 0 rows affected (0.11 sec)

mysql> insert into t1 values ('a'),('A'),('À'), ('á');
Query OK, 0 rows affected (0.06 sec)
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select c1, hex(c1), hex(weight_string(c1)) from t1;
+------+---------+------------------------+
| c1   | hex(c1) | hex(weight_string(c1)) |
+------+---------+------------------------+
| a    | 61      | 0041                   |
| A    | 41      | 0041                   |
| À    | C380    | 0041                   |
| á    | C3A1    | 0041                   |
+------+---------+------------------------+
4 rows in set (0.20 sec)

Collations for Unicode character sets: type #2, based on the "Unicode Collation Algorithm" (UCA)
• One weight uses 2 bytes
• One-character-to-zero-weights: ignorable: "U+0000 NULL" doesn't have a weight (or has an empty weight)
• One-character-to-one-weight
• One-character-to-many-weights: expansion: German letter ß (SZ LIGATURE, or SHARP S)
• Many-characters-to-one-weight: contraction: "ch" is a separate single letter in Czech
• Many-characters-to-many-weights: contraction with expansion: not supported

Weights in a UCA collation

mysql> create table t1 (c1 varchar(2) character set utf8 collate utf8_czech_ci);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values (0x00),('a'),('A'),('h'),('i'),('s'),('ss'),('ß'),('ch');
Query OK, 9 rows affected (0.00 sec)
Records: 9  Duplicates: 0  Warnings: 0

mysql> select c1, hex(c1), hex(weight_string(c1)) from t1 order by c1;
+------+---------+------------------------+
| c1   | hex(c1) | hex(weight_string(c1)) |
+------+---------+------------------------+
|      | 00      |                        | <- ignorable
| a    | 61      | 0E33                   |
| A    | 41      | 0E33                   |
| h    | 68      | 0EE1                   |
| ch   | 6368    | 0EE2                   | <- contraction
| i    | 69      | 0EFB                   |
| s    | 73      | 0FEA                   |
| ss   | 7373    | 0FEA0FEA               |
| ß    | C39F    | 0FEA0FEA               | <- expansion
+------+---------+------------------------+
9 rows in set (0.00 sec)

Collations that can be added without recompiling
• "Simple" collations for 8-bit character sets
• UCA-based collations for Unicode character sets
• xxx_bin

Adding a collation: choosing an ID
First step: find a vacant ID (IDs are used in .frm, binlog, protocol, etc.)

mysql> select id from information_schema.collations order by id;
+-----+
| id  |
+-----+
|   1 |
...
|  55 |
|  57 | <- 56 is vacant
|  58 |
...
| 242 |
| 243 |
| 254 | <- 244-253 are vacant
+-----+
196 rows in set (0.02 sec)

Adding a simple collation for an 8-bit character set
Step 1: Declare the name-to-ID association in /usr/share/mysql/charsets/Index.xml:
<charset name="latin1">
  ...
  <collation name="latin1_test_ci" id="56"/>
  ...
</charset>

Step 2: Define the code-to-weight mapping table in /usr/share/mysql/charsets/latin1.xml:
<collation name="latin1_test_ci">
<map>
 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
 44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
 44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
</map>
</collation>

Adding a UCA collation: supported character sets
UCA collations can be added for the following character sets without recompiling the "mysqld" sources:
• MySQL 4.1, 5.0, 5.1: utf8 and ucs2
• MySQL 5.2.x will also support utf16 and utf32 (the patch is currently in review)

Adding a UCA-based collation
• A collation is added with the help of a subset of the "Locale Data Markup Language" (LDML): http://unicode.org/reports/tr35/
• MySQL supports LDML starting from
  versions: 5.0.46, 5.1.20, 5.2.6
• A new UCA collation uses utf8_unicode_ci as a base (or ucs2_unicode_ci, in the case of ucs2)
• Collation ordering rules do not define the whole collation; they describe only how the collation differs from utf8_unicode_ci

Adding a UCA-based collation: editing Index.xml
Collation ordering rules go between the <rules>...</rules> tags in the character set definition file /usr/share/mysql/charsets/Index.xml:
<charset name="utf8">
  ...
  <!-- associate collation name with its ID -->
  <collation name="utf8_phone_ci" id="252">
    <rules>
      ... put rules here...
    </rules>
  </collation>
  ...
</charset>

LDML ordering rule types
• Reset rules
• Shift rules:
  – Shift rules defining primary difference
  – Shift rules defining secondary difference
  – Shift rules defining tertiary difference

LDML: reset rules
<reset>A</reset> or <reset>\u0041</reset>
A reset rule does not change the order of the character itself ("A" in this example); it indicates that the characters given in the subsequent shift rules will be sorted near the letter "A".

LDML: shift rules: primary difference
<p>G</p> or <p>\u0047</p>
Reset+Shift example:
<reset>A</reset>
<p>G</p>
says that "G" will be greater than "A", but less than "B".
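The effect of a reset rule followed by a primary shift rule can be illustrated with a toy model. This sketch is an assumption-laden simplification, not how MySQL implements tailoring: `apply_primary_rules` is a hypothetical helper that works on a plain list of letters instead of UCA weights, and it handles only reset and primary shift rules.

```python
# Toy model of LDML <reset> + <p> rules (hypothetical helper, not MySQL code).
def apply_primary_rules(base_order, rules):
    """Return a new letter order after applying (kind, char) rules.

    ("reset", X) sets the anchor to X without moving X.
    ("p", Y) moves Y to the position right after the current anchor;
    Y then becomes the anchor, so consecutive shifts chain in order.
    """
    order = list(base_order)
    anchor = None
    for kind, ch in rules:
        if kind == "reset":
            anchor = ch
        elif kind == "p":
            if anchor is None:
                raise ValueError("shift rule before any reset rule")
            if ch in order:
                order.remove(ch)
            order.insert(order.index(anchor) + 1, ch)
            anchor = ch
        else:
            raise ValueError(f"unsupported rule type: {kind}")
    return order


# <reset>A</reset><p>G</p> applied to a base order A..H
order = apply_primary_rules("ABCDEFGH", [("reset", "A"), ("p", "G")])
print("".join(order))  # AGBCDEFH

# Sorting with the tailored order: G now sorts between A and B.
rank = {ch: i for i, ch in enumerate(order)}
print(sorted(["B", "G", "A"], key=rank.__getitem__))  # ['A', 'G', 'B']
```

Note that the reset letter itself stays where it was; only the letter named in the shift rule moves.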

Shift rules: secondary and tertiary difference
Secondary difference:
<s>G</s> or <s>\u0047</s>
Tertiary difference:
<t>G</t> or <t>\u0047</t>
Reset+Shift example (Swedish):
<reset>Y</reset>
<s>\u00DC</s>
<t>\u00FC</t>
Defines this sorting order: YyÜü

Secondary and tertiary difference
MySQL does not yet support secondary and tertiary differences. These rules therefore make all four letters equal to each other (not only for comparison, but for sorting as well), so the mutual order of these letters (YyÜü) in ORDER BY is not strict. Support for secondary and tertiary levels is on the TODO list (WL#896).
Which difference to use:
- Secondary - for accent
- Tertiary - for case

Pre-defined UCA collations: utf8_unicode_ci
In utf8_unicode_ci, accented letters are equal to their non-accented counterparts.

mysql> create table t1 (a char(1) character set utf8 collate utf8_unicode_ci);
Query OK, 0 rows affected (0.24 sec)

mysql> insert into t1 values ('C'),('Č'),('S'),('Š'),('Z'),('Ž'),('N'),('Ñ');
Query OK, 8 rows affected (0.01 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> select group_concat(a) from t1 group by a;
+-----------------+
| group_concat(a) |
+-----------------+
| C,Č             |
| N,Ñ             |
| S,Š             |
| Z,Ž             |
+-----------------+
4 rows in set (0.05 sec)

Some languages consider certain accented characters as separate letters. MySQL provides predefined collations for a number of languages.

Predefined UCA collations: utf8_slovenian_ci
utf8_slovenian_ci changes the ordering of the letters Č, Š and Ž, making them separate letters.
An LDML definition equivalent to the utf8_slovenian_ci ordering would be:
<rules>
  <reset>C</reset><p>\u010C</p><t>\u010D</t>
  <reset>S</reset><p>\u0160</p><t>\u0161</t>
  <reset>Z</reset><p>\u017D</p><t>\u017E</t>
</rules>
Note: Index.xml does not really have a definition for utf8_slovenian_ci. It is defined in strings/ctype-uca.c

Usage example:

mysql> create table t1 (a char(1) character set utf8 collate utf8_slovenian_ci);
Query OK, 0 rows affected (0.11 sec)

mysql> insert into t1 values ('C'),('Č'),('S'),('Š'),('Z'),('Ž'),('N'),('Ñ');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> select group_concat(a) from t1 group by a;
+-----------------+
| group_concat(a) |
+-----------------+
| C               |
| Č               |
| N,Ñ             |
| S               |
| Š               |
| Z               |
| Ž               |
+-----------------+
7 rows in set (0.02 sec)

Note: Ñ is still equal to its non-accented counterpart.

Predefined UCA collations: utf8_spanish_ci
utf8_spanish_ci changes the ordering of the letter Ñ, making it a separate letter. An LDML definition equivalent to the utf8_spanish_ci ordering:
<rules>
  <reset>N</reset><p>\u00D1</p><s>\u00F1</s>
</rules>
Note: Index.xml does not really have a definition for utf8_spanish_ci.
It is defined in strings/ctype-uca.c

Usage example:

mysql> create table t1 (a char(1) character set utf8 collate utf8_spanish_ci);

mysql> insert into t1 values ('C'),('Č'),('S'),('Š'),('Z'),('Ž'),('N'),('Ñ');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> select group_concat(a) from t1 group by a;
+-----------------+
| group_concat(a) |
+-----------------+
| C,Č             |
| N               |
| Ñ               |
| S,Š             |
| Z,Ž             |
+-----------------+
5 rows in set (0.00 sec)

Note: the letters Č, Š, Ž are still equal to their non-accented counterparts.

Adding a UCA collation: Slovenian + Spanish together
What if I need two languages at the same time? Combine the rules and put them into Index.xml!
1. Edit Index.xml and put the rules inside the "utf8" definition:
<charset name="utf8">
  ...
  <collation name="utf8_spanishslovenian_ci" id="251">
    <rules>
      <reset>C</reset><p>\u010C</p><t>\u010D</t>
      <reset>N</reset><p>\u00D1</p><s>\u00F1</s>
      <reset>S</reset><p>\u0160</p><t>\u0161</t>
      <reset>Z</reset><p>\u017D</p><t>\u017E</t>
    </rules>
  </collation>
  ...
</charset>
2. Restart mysqld
3.
Check that mysqld detected the new collation:

mysql> show collation like '%spanishslovenian%';
+--------------------------+---------+-----+---------+----------+---------+
| Collation                | Charset | Id  | Default | Compiled | Sortlen |
+--------------------------+---------+-----+---------+----------+---------+
| utf8_spanishslovenian_ci | utf8    | 251 |         |          |       8 |
+--------------------------+---------+-----+---------+----------+---------+
1 row in set (0.00 sec)

Adding a UCA collation: testing utf8_spanishslovenian_ci

mysql> create table t1 (a char(1) character set utf8 collate utf8_spanishslovenian_ci);

mysql> insert into t1 values ('C'),('Č'),('S'),('Š'),('Z'),('Ž'),('N'),('Ñ');
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> select group_concat(a) from t1 group by a;
+-----------------+
| group_concat(a) |
+-----------------+
| C               |
| Č               |
| N               |
| Ñ               |
| S               |
| Š               |
| Z               |
| Ž               |
+-----------------+
8 rows in set (0.00 sec)

A phone book application
Suppose we have a web application where users post their names and phone numbers. Phone numbers can come in very different formats:
+7-12345-67
+7-12-345-67
+7 12 345 67
+7 (12) 345 67
+71234567
Searching for a phone number is difficult.
Solution: reorder the punctuation characters, making them ignorable.

Adding utf8_phone_ci
1. Add rules to /usr/share/mysql/charsets/Index.xml:
<charset name="utf8">
  ...
  <collation name="utf8_phone_ci" id="252">
    <rules>
      <reset>\u0000</reset>
      <s>\u0020</s> <!-- space -->
      <s>\u0028</s> <!-- left parenthesis -->
      <s>\u0029</s> <!-- right parenthesis -->
      <s>\u002B</s> <!-- plus -->
      <s>\u002D</s> <!-- hyphen -->
    </rules>
  </collation>
  ...
</charset>
2. Restart mysqld
3.
Check that mysqld detected the new collation:

mysql> show collation like '%phone%';
+---------------+---------+-----+---------+----------+---------+
| Collation     | Charset | Id  | Default | Compiled | Sortlen |
+---------------+---------+-----+---------+----------+---------+
| utf8_phone_ci | utf8    | 252 |         |          |       8 |
+---------------+---------+-----+---------+----------+---------+
1 row in set (0.04 sec)

Testing utf8_phone_ci

mysql> create table phonebook (name varchar(64), phone varchar(64) character set utf8 collate utf8_phone_ci);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into phonebook values ('Svoj','+7 912 800 80 02');
Query OK, 1 row affected (0.05 sec)

mysql> insert into phonebook values ('Hf','+7 (912) 800 80 04');
Query OK, 1 row affected (0.00 sec)

mysql> insert into phonebook values ('Bar','+7-912-800-80-01');
Query OK, 1 row affected (0.00 sec)

mysql> insert into phonebook values ('Ramil','(7912) 800 80 03');
Query OK, 1 row affected (0.01 sec)

mysql> insert into phonebook values ('Sanja','+380 (912) 8008005');
Query OK, 1 row affected (0.02 sec)

mysql> select * from phonebook order by phone;
+-------+--------------------+
| name  | phone              |
+-------+--------------------+
| Sanja | +380 (912) 8008005 |
| Bar   | +7-912-800-80-01   |
| Svoj  | +7 912 800 80 02   |
| Ramil | (7912) 800 80 03   |
| Hf    | +7 (912) 800 80 04 |
+-------+--------------------+
5 rows in set (0.00 sec)

mysql> select * from phonebook where phone='+7(912)800-80-01';
+------+------------------+
| name | phone            |
+------+------------------+
| Bar  | +7-912-800-80-01 |
+------+------------------+
1 row in set (0.00 sec)

mysql> select * from phonebook where phone='+7(912)800-80-02';
+------+------------------+
| name | phone            |
+------+------------------+
| Svoj | +7 912 800 80 02 |
+------+------------------+
1 row in set (0.00 sec)

mysql> select * from phonebook where phone='+7(912)800-80-03';
+-------+------------------+
| name  | phone            |
+-------+------------------+
| Ramil | (7912) 800 80 03 |
+-------+------------------+
1 row in set (0.00 sec)

mysql> select * from phonebook where phone='+7(912)800-80-04';
+------+--------------------+
| name | phone              |
+------+--------------------+
| Hf   | +7 (912) 800 80 04 |
+------+--------------------+
1 row in set (0.00 sec)

mysql> select * from phonebook where phone='+380-912-800-80-05';
+-------+--------------------+
| name  | phone              |
+-------+--------------------+
| Sanja | +380 (912) 8008005 |
+-------+--------------------+
1 row in set (0.00 sec)

Adding a built-in collation
A conglomerate object representing a character set + collation pair:
typedef struct charset_info_st
{
  uint number;
  uint primary_number;
  uint binary_number;
  uint state;
  const char *csname;
  const char *name;
  const char *comment;
  const char *tailoring;
  uchar *ctype;
  uchar *to_lower;
  uchar *to_upper;
  uchar *sort_order;
  uint16 *contractions;
  uint16 **sort_order_big;
  uint16 *tab_to_uni;
  MY_UNI_IDX *tab_from_uni;
  MY_UNICASE_INFO **caseinfo;
  uchar *state_map;
  uchar *ident_map;
  uint strxfrm_multiply;
  uchar caseup_multiply;
  uchar casedn_multiply;
  uint mbminlen;
  uint mbmaxlen;
  uint16 min_sort_char;
  uint16 max_sort_char;
  uchar pad_char;
  my_bool escape_with_backslash_is_dangerous;
  uchar levels_for_compare;
  uchar levels_for_order;
  MY_CHARSET_HANDLER *cset;
  MY_COLLATION_HANDLER *coll;
} CHARSET_INFO;

typedef struct my_collation_handler_st
{
  my_bool (*init)(struct charset_info_st *, void *(*alloc)(size_t));
  int (*strnncoll)(struct
                   charset_info_st *, const uchar *, size_t,
                   const uchar *, size_t, my_bool);
  int (*strnncollsp)(struct charset_info_st *, const uchar *, size_t,
                     const uchar *, size_t,
                     my_bool diff_if_only_endspace_difference);
  size_t (*strnxfrm)(struct charset_info_st *, uchar *dst, size_t dstlen,
                     uint nweights, const uchar *src, size_t srclen,
                     uint flags);
  size_t (*strnxfrmlen)(struct charset_info_st *, size_t);
  my_bool (*like_range)(struct charset_info_st *,
                        const char *s, size_t s_length,
                        pchar w_prefix, pchar w_one, pchar w_many,
                        size_t res_length,
                        char *min_str, char *max_str,
                        size_t *min_len, size_t *max_len);
  int (*wildcmp)(struct charset_info_st *,
                 const char *str, const char *str_end,
                 const char *wildstr, const char *wildend,
                 int escape, int w_one, int w_many);
  int (*strcasecmp)(struct charset_info_st *, const char *, const char *);
  uint (*instr)(struct charset_info_st *,
                const char *b, size_t b_length,
                const char *s, size_t s_length,
                my_match_t *match, uint nmatch);
  /* ... */
} MY_COLLATION_HANDLER;

Thanks!
How to add a built-in collation in depth: in another session.
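
As a closing illustration, the relationship between two of the handler members, strnxfrm (build a sort key) and strnncoll (compare two strings), can be sketched with the phone book case from earlier. This is a toy model under stated assumptions: the Python functions only borrow the member names, their signatures are simplified, the weights are plain code points rather than UCA weights, and the set `IGNORABLE` is a hypothetical stand-in for the utf8_phone_ci rules.

```python
# Toy model only: names mirror the handler members, signatures are simplified.
# Assumption: the five punctuation characters are ignorable (empty weight);
# every other character weighs its upper-cased code point.
IGNORABLE = set(" ()+-")


def strnxfrm(s: str) -> tuple:
    """Build the sort key: the weights of all non-ignorable characters."""
    return tuple(ord(ch.upper()) for ch in s if ch not in IGNORABLE)


def strnncoll(a: str, b: str) -> int:
    """Compare two strings through their sort keys: returns -1, 0, or 1."""
    ka, kb = strnxfrm(a), strnxfrm(b)
    return (ka > kb) - (ka < kb)


phonebook = [
    ("Svoj", "+7 912 800 80 02"),
    ("Hf", "+7 (912) 800 80 04"),
    ("Bar", "+7-912-800-80-01"),
    ("Ramil", "(7912) 800 80 03"),
    ("Sanja", "+380 (912) 8008005"),
]

# Equality ignores the punctuation, as in the WHERE examples.
print(strnncoll("+7-912-800-80-01", "+7(912)800-80-01"))  # 0

# ORDER BY phone corresponds to sorting by the key.
for name, phone in sorted(phonebook, key=lambda row: strnxfrm(row[1])):
    print(name, phone)
```

Comparing keys rather than raw strings is the design point: once the key is built, equality and ordering both fall out of an ordinary sequence comparison.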