Demystifying identifier mapping

While tinkering with some things the other day, I noticed that if I create a table named “ñ”, MySQL creates a .frm file with the name of @0x.frm. Similar “encoding” is done for any file that needs to be created (.MYI, .ibd, .TRG, .TRN, et cetera). I’m not sure this is necessary or useful when my filesystem seems to support all kinds of fanciful characters in filenames, and especially when character_set_system is set by default (and unchangeably!) to utf8, but it’s happening nevertheless and I decided to learn more about how it works. This behavior is discussed in the reference manual at Mapping of Identifiers to File Names. There’s even a chart that claims to show how the mapping is performed. The chart and text below it on the “Mapping of Identifiers to File Names” page show that characters in the range 00C0..017F should be mapped using the pattern [@][0..4][g..z], but it wasn’t very obvious why. (I had no luck understanding any of it at first, in large part because I hadn’t yet figured out that the values in the Code Range column are UCS-2 values, not UTF-8 values.)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select convert(0x00c0 using ucs2);
+-----------------------------+
| convert(0x00c0 using ucs2) |
+----------------------------+
| À                          |
+----------------------------+
1 row in set (0.00 sec)

mysql> create table À (id int);
Query OK, 0 rows affected (0.02 sec)
$ pwd
/home/kolbe/mysql/5.6/data/test
$ ls -l
total 108
-rw-rw---- 1 kolbe kolbe  8556 2012-01-03 13:51 @0G.frm
-rw-rw---- 1 kolbe kolbe 98304 2012-01-03 13:51 @0G.ibd

Well, that shows that it works as described, but I still didn’t really understand why. I decided to start digging around the MySQL Server source code. First I looked at a lot of things in mysys, thinking that this functionality might be low-level enough to exist there. I was wrong. I looked around in sql, because that’s where the vast majority of all the server code exists, but I also didn’t have any luck there, initially, because there are tons of file operations that this algorithm applies to and I didn’t know where to start. Eventually, I stumbled across some use of build_table_filename in sql/sql_base.cc, which led me to sql/sql_table.cc, where that function is defined. Well, build_table_filename calls tablename_to_filename, which uses strconvert to actually perform the conversion of the string from one “character set” to another. The strconvert function, defined in sql/strfunc.cc, is called with a “to” character set (to_cs) of my_charset_filename. So, MySQL seems to be using some internal separate “character set” to control/define the mapping used for filenames. The to_cs argument of the strconvert function is a CHARSET_INFO struct that contains information about a given character set, including some methods that can be used to perform conversion operations. There’s a good amount of information about the CHARSET_INFO structure in strings/CHARSET_INFO.txt. The strconvert function uses the wc_mb function in the cset member (type MY_CHARSET_HANDLER) of the CHARSET_INFO struct to perform the conversion. I found my_charset_filename defined in strings/ctype-utf8.c. The cset member of my_charset_filename is my_charset_filename_handler, defined in the same file. The wc_mb function for this character set is my_wc_mb_filename, once again located in the same file. Here’s the contents of that function:

static int
my_wc_mb_filename(const CHARSET_INFO *cs __attribute__((unused)),
                  my_wc_t wc, uchar *s, uchar *e)
{ 
  int code;
  char hex[]= "0123456789abcdef";
  if (wc < 128 && filename_safe_char[wc])
  { 
    *s= (uchar) wc;
    return 1;
  }

  if (s + 3 > e)
    return MY_CS_TOOSMALL3;

  *s++= MY_FILENAME_ESCAPE;
  if ((wc >= 0x00C0 && wc <= 0x05FF && (code= uni_0C00_05FF[wc - 0x00C0])) ||
      (wc >= 0x1E00 && wc <= 0x1FFF && (code= uni_1E00_1FFF[wc - 0x1E00])) ||
      (wc >= 0x2160 && wc <= 0x217F && (code= uni_2160_217F[wc - 0x2160])) ||
      (wc >= 0x24B0 && wc <= 0x24EF && (code= uni_24B0_24EF[wc - 0x24B0])) ||
      (wc >= 0xFF20 && wc <= 0xFF5F && (code= uni_FF20_FF5F[wc - 0xFF20])))
  { 
    *s++= (code / 80) + 0x30;
    *s++= (code % 80) + 0x30;
    return 3;
  }

  /* Non letter */
  if (s + 5 > e)
    return MY_CS_TOOSMALL5;

  *s++= hex[(wc >> 12) & 15];
  *s++= hex[(wc >> 8) & 15];
  *s++= hex[(wc >> 4) & 15];
  *s++= hex[(wc)      & 15];
  return 5;
}

There it is, after all the hunting, the actual code that does the conversion from À to @0G. Let’s see how it works. There are numerous arrays of code points defined in strings/ctype-utf8.c (uni_0C00_05FF, uni_1E00_1FFF, et cetera); they look like this:

/* 00C0-05FF */
static uint16 uni_0C00_05FF[1344]=
{
  0x0017,0x0018,0x0019,0x001A,0x001B,0x001C,0x001D,0x001E,
  0x001F,0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,
  0x0027,0x0028,0x0029,0x002A,0x0067,0x0068,0x0069,0x0000,
  0x006B,0x006C,0x006D,0x006E,0x006F,0x0070,0x0071,0x008A,
  0x0037,0x0038,0x0039,0x003A,0x003B,0x003C,0x003D,0x003E,
  0x003F,0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,
  0x0047,0x0048,0x0049,0x004A,0x0087,0x0088,0x0089,0x0000,
  0x008B,0x008C,0x008D,0x008E,0x008F,0x0090,0x0091,0x0092,
...

(You may notice that the uni_0C00_05FF is misnamed. It should be uni_00C0_05FF, which is even what the comment above its definition says! Oops…) The uni_0C00_05FF array covers 11 Unicode blocks, including part of Latin-1 Supplement, IPA Extensions, Cyrillic, and Hebrew, among others. Only part of the Latin-1 Supplement block is handled by my_wc_mb_filename, and the entirety of that is shown above, 0x00C0 through 0x00FF, or elements 0 through 63 (since 0x00FF – 0x00C0 = 63). There’s not really a consistent pattern to the contents of the Latin-1 Supplement, which is one reason this array has to exist in the source as a static array with over a thousand elements. Within the second half of the Latin-1 Supplement block, the portion starting at 0x00C0 handled by this function, there are some uppercase vowels with consonants and math operators interspersed, and then the lowercase versions of the same; in other words, the uppercase letters are grouped, and the lowercase letters follow in a separate group, but there’s other non-letter stuff in there, too. The snippet above of the uni_0C00_05FF array consists of the UCS-2 values for G through Z and g through z, with 0x30 subtracted from them, with varying offsets of 80, skipping various non-letter characters. (One such character is 0x00D7, MULTIPLICATION SIGN, element 23 in the array, at the end of the 3rd line above. Such characters exist in filenames using their full unicode representation. In this case, a table named “×” —the MULTIPLICATION SIGN, not the letter “x”!— gets a .frm named @00d7.frm.) In the Latin Extended-A Block (0x0100 to 0x017F), by contrast, each uppercase letter is directly followed by its lowercase equivalent (Ā, ā, Ă, ă, et cetera). We can see this reflected in another snippet of uni_0C00_05FF directly following the one shown above:

static uint16 uni_0C00_05FF[1344]=
{
...
  0x0073,0x0093,0x0074,0x0094,0x0075,0x0095,0x0076,0x0096,
  0x0077,0x0097,0x0078,0x0098,0x0079,0x0099,0x007A,0x009A,
  0x00B7,0x00D7,0x00B8,0x00D8,0x00B9,0x00D9,0x00BA,0x00DA,
  0x00BB,0x00DB,0x00BC,0x00DC,0x00BD,0x00DD,0x00BE,0x00DE,
...

There’s a much clearer pattern there. Each value is followed by itself + 0x20 (decimal 32), which is the distance between an uppercase letter and its lowercase equivalent in Basic Latin (ASCII!):

mysql> select ord('A');
+----------+
| ord('A') |
+----------+
|       65 |
+----------+
1 row in set (0.01 sec)

mysql> select ord('a');
+----------+
| ord('a') |
+----------+
|       97 |
+----------+
1 row in set (0.00 sec)

mysql> select ord('a') - ord('A');
+---------------------+
| ord('a') - ord('A') |
+---------------------+
|                  32 |
+---------------------+
1 row in set (0.00 sec)

As explained in the manual, this conversion/encoding scheme for “identifier mapping” is designed to preserve case sensitivity on the filesystem. That is, À and à become @0G and @0g respectively. That means that table “À” is equal to table “à” on a case-insensitive filesystem (because @0G is equal to @0g) while they’re not equal on a case-sensitive filesystem. The code to perform the conversion essentialy determines which of the arrays to consult, subtracts the offset of the beginning of the range, does a lookup in the appropriate array, adds '0' (0x30), shifts things around a bit with division and modulus to get everything to fit into the scheme of allowed characters ([0..4][g..z]), and that’s that. Because my brain handles character sets/encodings (and arithmetic, I guess) a lot better at the SQL prompt than elsewhere, I put this together:

mysql> create table c (id int not null auto_increment primary key, c int);
Query OK, 0 rows affected (0.10 sec)

mysql> delimiter //
mysql> DROP PROCEDURE IF EXISTS insert_ucs2;
    -> CREATE PROCEDURE insert_ucs2 (c VARCHAR(255)) BEGIN
    ->         DECLARE i INT DEFAULT 1;
    ->         DECLARE n INT DEFAULT 1;
    ->
    ->         WHILE i < LENGTH(c) AND n <> 0 DO
    ->                 SET n = LOCATE(',', SUBSTRING(c FROM i));
    ->                 IF n = 0 THEN SET n = LENGTH(c)-i+2; END IF;
    ->                 INSERT INTO c (c) VALUES ( CONV(SUBSTRING(c FROM i+2 for n-3), 16, 10) );
    ->                 SET i = i + n;
    ->         END WHILE;
    -> 
    -> END//
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> delimiter ;

mysql> call insert_ucs2('0x0073,0x0093,0x0074,0x0094,0x0075,0x0095,0x0076,0x0096,0x0077,0x0097,0x0078,0x0098,0x0079,0x0099,0x007A,0x009A');
Query OK, 1 row affected (0.04 sec)

That’s a little procedure to parse the lines from the conversion mapping arrays and put them into a very basic table. I used the first few line that correspond to the Latin Extended-A Block. Now, we can see how those actually map to filenames:

mysql> select concat('@',floor(c / 80), char(convert(c % 80 + 0x30 using ucs2))) as v from c;
+------+
| v    |
+------+
| @1S  |
| @1s  |
| @1T  |
| @1t  |
| @1U  |
| @1u  |
| @1V  |
| @1v  |
| @1W  |
| @1w  |
| @1X  |
| @1x  |
| @1Y  |
| @1y  |
| @1Z  |
| @1z  |
+------+
16 rows in set (0.00 sec)
mysql> create table Ā (id int);
Query OK, 0 rows affected (0.03 sec)

$ ls /home/kolbe/mysql/5.6/data/test
@1S.frm  @1S.ibd  c.frm  c.ibd

Looks like we guessed right! Ā is the first character of Latin Extended-A, and MySQL does map it to @1S as expected. So, here’s something even more silly, a re-implementation of (part of) the conversion algorithm using SQL:

$ tail -n +6897 < ./ctype-utf8.c | head -n $((7065 - 6897)) | perl -e 'my $i = 0; while (<STDIN>) { chomp; print "call insert_ucs2("" if $i == 0; s/s//g; s/,$// if $i == 3; print; print "");n" if $i == 3; $i = ($i + 1) %4 }' > inserts.sql

$ mysql < ~/inserts.sql

mysql> select count(*) from c;
+----------+
| count(*) |
+----------+
|     1344 |
+----------+
1 row in set (0.00 sec)

mysql> select 0x05ff - 0x00c0;
+-----------------+
| 0x05ff - 0x00c0 |
+-----------------+
|            1343 |
+-----------------+
1 row in set (0.00 sec)

mysql> create table u (c char(1) character set utf8);
Query OK, 0 rows affected (0.09 sec)

mysql> insert into u values (convert(0x00c0 using ucs2));
Query OK, 1 row affected (0.01 sec)

mysql> insert into u select convert(char(max(ord(convert(c using ucs2)))+1) using ucs2) from u;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

...repeat a few more times...

mysql> select * from u;
+------+
| c    |
+------+
| À    |
| Á    |
| Â    |
| Ã    |
| Ä    |
| Å    |
| Æ    |
| Ç    |
| È    |
| É    |
+------+
10 rows in set (0.00 sec)

mysql> select concat('@',floor(c.c / 80), char(convert(c.c % 80 + 0x30 using ucs2))) as v from c join u on c.id = ord(convert(u.c using ucs2)) - 0x00c0 +1  ;
+------+
| v    |
+------+
| @0G  |
| @0H  |
| @0I  |
| @0J  |
| @0K  |
| @0L  |
| @0M  |
| @0N  |
| @0O  |
| @0P  |
+------+
10 rows in set (0.00 sec)

mysql> truncate u;
Query OK, 0 rows affected (0.08 sec)

mysql> insert into u values ('Ā'),('ā'),(convert(0x00d7 using ucs2));
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from u;
+------+
| c    |
+------+
| Ā    |
| ā    |
| ×    |
+------+
3 rows in set (0.00 sec)

mysql> select concat(
        '@',
        if(
            concat(
                floor(c.c / 80),
                char(convert(c.c % 80 + 0x30 using ucs2))
            ) <> '00'
        ,
            concat(
                floor(c.c / 80),
                char(convert(c.c % 80 + 0x30 using ucs2))
            )
               ,
                       lower( hex(convert(u.c using ucs2)))
               )
       ) as v  
               from c join u
                       on
               c.id = ord(convert(u.c using ucs2)) - 0x00c0 +1
       ;
+-------+
| v     |
+-------+
| @1S   |
| @1s   |
| @00d7 |
+-------+
3 rows in set (0.00 sec)

And, finally, in case you want that as a stored function…

delimiter //
DROP FUNCTION IF EXISTS my_wc_mb_filename//
CREATE FUNCTION my_wc_mb_filename(wc char(1) character set ucs2) RETURNS CHAR(5)
BEGIN
    DECLARE output CHAR(5);
    SELECT concat(
                '@',
                if(
                        concat( 
                                floor(c.c / 80),
                                char(convert(c.c % 80 + 0x30 using ucs2))
                        ) <> '00'
                ,
                        concat( 
                                floor(c.c / 80),
                                char(convert(c.c % 80 + 0x30 using ucs2))
                        )
               ,
                       lower( hex(wc) )
               )
    ) as v  INTO output
    from c
    where 
    id = ord(wc) - 0x00c0 +1
    ;
    RETURN output;
END
//
delimiter ;
mysql> select my_wc_mb_filename('Ý');
+-------------------------+
| my_wc_mb_filename('Ý')  |
+-------------------------+
| @1P                     |
+-------------------------+
1 row in set (0.00 sec)

mysql> select my_wc_mb_filename('÷');
+-------------------------+
| my_wc_mb_filename('÷')  |
+-------------------------+
| @00f7                   |
+-------------------------+
1 row in set (0.00 sec)

It’s worth mentioning that Drizzle does all of this quite differently. (They seem to take some pride in having deliberately torn strconvert out of the source tree entirely.) Instead of the complicated algorithm used by MySQL, Drizzle uses a succinct function that deals in UTF-8. You can find it in drizzled/util/tablename_to_filename.cc. Here’s the part of it that performs the conversion of non-ASCII characters:

    to.push_back('@');
    to.push_back(hexchars[(it >> 4) & 15]);
    to.push_back(hexchars[it & 15]);

Here are the effects:

drizzle> select hex('ữỗỿḵḝ');
+--------------------------------+
| hex('ữỗỿḵḝ')                   |
+--------------------------------+
| E1BBAFE1BB97E1BBBFE1B8B5E1B89D |
+--------------------------------+
1 row in set (0.000418 sec)

drizzle> create table ữỗỿḵḝ (id int);
Query OK, 0 rows affected (0.010721 sec)

...

$ ls /var/lib/drizzle/local/test
db.opt  @e1@bb@af@e1@bb@97@e1@bb@bf@e1@b8@b5@e1@b8@9d.dfe

I’m not trying to make a case that re-implementing this kind of thing in SQL has any practical purpose, but it was a fun way to poke around in the MySQL source code, learn more about how this algorithm works, and to keep my character set and stored functions chops honed up a bit.