Character Sets in MariaDB Xpand

Overview

MariaDB Xpand in SkySQL supports many different character sets. Character sets allow users to store characters from many different languages in the database. Some character sets also support special characters like currency symbols, mathematical symbols, musical symbols, and emoji.

Character sets are configured independently on the server and the client. The server and the client should typically use the same character set, otherwise there can be corruption or loss of data.

Compatibility

The detail provided here applies to:

  • MariaDB Xpand 5.3

  • MariaDB Xpand 6.0

  • MariaDB Xpand 6.1

  • MariaDB SkySQL Distributed Transactions

Supported Character Sets

For the full list of supported character sets, see Character Sets in Xpand.

SHOW CHARACTER SET LIKE 'latin%';

And the supported character sets can also be queried using the system.mysql_character_sets system table:

SELECT * FROM system.mysql_character_sets;
+---------+-----------------------+--------------------+--------+
| Charset | Description           | Default collation  | Maxlen |
+---------+-----------------------+--------------------+--------+
| binary  | Binary pseudo charset | binary             |      1 |
| latin1  | CP1252 West European  | latin1_swedish_ci  |      1 |
| utf8    | UTF-8 Unicode         | utf8_general_ci    |      3 |
| utf8mb4 | UTF-8 Unicode         | utf8mb4_general_ci |      4 |
| koi8r   | KOI8-R Relcom Russian | koi8r_general_ci   |      1 |
| euckr   | EUC-KR Korean         | euckr_korean_ci    |      2 |
+---------+-----------------------+--------------------+--------+

Server Character Set

The server character set is used to determine the default character set of new databases and tables, unless a different character set is specified for a given database or table.

Topology

Default Value

Distributed Transactions

utf8

Checking the Server Character Set

SHOW SESSION VARIABLES
   LIKE 'character_set_server';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_server | utf8  |
+----------------------+-------+

Effect of the Server Character Set

The server character set is used to determine the default character set of new databases and tables, unless a different character set is specified for a given database or table. To demonstrate this effect:

Database's Default Character Set

A database's default character set is used to determine the character set of new tables within the database, unless a different character set is specified for a given table.

Checking a Database's Default Character Set

SELECT DEFAULT_CHARACTER_SET_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME='db1';
+----------------------------+
| DEFAULT_CHARACTER_SET_NAME |
+----------------------------+
| utf8                       |
+----------------------------+

Setting a Database's Default Character Set and Collation

CREATE DATABASE db1
   DEFAULT CHARACTER SET 'utf8';

The new database's default collation will be the default collation for the specified character set.

To set the database's default collation along with the default character set, specify the optional DEFAULT COLLATE database option:

CREATE DATABASE db1
   DEFAULT CHARACTER SET 'utf8'
   DEFAULT COLLATE 'utf8_bin';

Changing a Database's Default Character Set and Collation

ALTER DATABASE db1
   DEFAULT CHARACTER SET 'utf8';

The database's default collation will be the default collation for the specified character set.

To change the database's default collation along with the default character set, specify the optional DEFAULT COLLATE database option:

ALTER DATABASE db1
   DEFAULT CHARACTER SET 'utf8'
   DEFAULT COLLATE 'utf8_bin';

Effect of a Database's Default Character Set

A database's default character set is used to determine the default character set of new tables within the database, unless a different character set is specified for a given table. To demonstrate this effect:

Table's Default Character Set

A table's default character set is used to determine the character set of new columns within the table, unless a different character set is specified for a given column.

Checking a Table's Default Character Set

SHOW CREATE TABLE db1.tab\G
*************************** 1. row ***************************
       Table: tab
Create Table: CREATE TABLE `tab` (
  `id` int(11) not null,
  `str` varchar(50) CHARACTER SET utf8,
  PRIMARY KEY (`id`) /*$ DISTRIBUTE=1 */
) CHARACTER SET utf8 /*$ SLICES=3 */

Setting a Table's Default Character Set

CREATE TABLE db1.tab (
   id int PRIMARY KEY,
   str varchar(50)
)
DEFAULT CHARACTER SET='utf8';

The table's default collation will be the default collation for the specified character set.

To set the table's default collation along with the default character set, specify the optional DEFAULT COLLATE table option:

CREATE TABLE db1.tab (
   id int PRIMARY KEY,
   str varchar(50)
)
DEFAULT CHARACTER SET='utf8'
DEFAULT COLLATE='utf8_bin';

Changing a Table's Default Character Set

ALTER TABLE db1.tab
   DEFAULT CHARACTER SET='utf8';

The table's default collation will be the default collation for the specified character set.

To change the table's default collation along with the default character set, specify the optional DEFAULT COLLATE table option:

ALTER TABLE db1.tab
   DEFAULT CHARACTER SET='utf8'
   DEFAULT COLLATE='utf8_bin';

Effect of a Table's Default Character Set

A table's default character set is used to determine the default character set of new columns within the table, unless a different character set is specified for a given column. To demonstrate this effect:

Column's Character Set

A column's character set is used to determine the character set used by the column to store its character data.

Checking a Column's Character Set

SELECT CHARACTER_SET_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA='db1'
   AND TABLE_NAME='tab'
   AND COLUMN_NAME='str';
+--------------------+
| CHARACTER_SET_NAME |
+--------------------+
| utf8               |
+--------------------+

Setting a Column's Character Set and Collation

CREATE TABLE db1.tab (
   id int PRIMARY KEY,
   str varchar(50) CHARACTER SET 'utf8'
);

The column's collation will be the default collation for the specified character set.

To set the column's collation along with the default character set, specify the optional COLLATE column option:

CREATE TABLE db1.tab (
   id int PRIMARY KEY,
   str varchar(50) CHARACTER SET 'utf8' COLLATE 'utf8_bin'
);

Changing a Column's Character Set and Collation

ALTER TABLE db1.tab
   MODIFY str varchar(50)
   CHARACTER SET 'utf8';

The column's collation will be the default collation for the specified character set.

To set the column's collation along with the default character set, specify the optional COLLATE column option:

ALTER TABLE db1.tab
   MODIFY str varchar(50)
   CHARACTER SET 'utf8'
   COLLATE 'utf8_bin';

Client Character Sets

Character sets are configured independently on the server and the client. The server and the client should typically use the same character set, otherwise there can be corruption or loss of data.

Clients have multiple character sets that should typically use the same character set:

Users should not typically change these system variables manually. Instead, there are methods to change the client character set that ensure that all the relevant client character sets are set to the same value together.

Setting the Client Character Set for an Existing Connection

Setting the Character Set for MariaDB CLI Tools

For MariaDB CLI tools, such as MariaDB Client and MariaDB Import, the default character set depends on the operating system.

Note

When you connect to the Distributed Transactions topology with MariaDB Client, you should set the default character set to utf8.

On Linux, the default character set is determined from the LC_CTYPE of the console environment. This is influenced by the LC_CTYPE, LC_ALL, and LANG environment variables. Note that Linux's utf8 and utf-8 character sets are currently mapped to the utf8 character set, which is an incomplete 3-byte version of the UTF-8 standard. To get access to the full UTF-8 standard, users should manually set the client's character set to utf8mb4 character set.

If the console environment's character set cannot be determined, it will default to latin1.

MariaDB Client and other MariaDB CLI tools also provide the --default-character-set option, which can be used to manually change the client character set.

Precedence

The character set precedence is different in different contexts.

Receiving a Query

When Xpand receives a query, the character set of the statement is determined by the character_set_client system variable.

Executing a Query

When Xpand executes a query, the character set of string literals depends on whether the literal has a character set introducer.

If a string literal contains a character set introducer, Xpand uses the character set specified for the string literal:

SELECT _latin1'this is a latin1 string';
+-------------------------+
| this is a latin1 string |
+-------------------------+
| this is a latin1 string |
+-------------------------+

If a string literal does not contain a character set introducer, Xpand uses the character set specified by the character_set_client system variable.

Xpand currently ignores the character_set_connection system variable.

Sending Query Results

When Xpand sends query results, the character set is determined by the character_set_results system variable.

Creating an Object

When Xpand creates an object (such as a database, table, or column), the character set of the object is determined with the following precedence:

  1. Column default:

    • If the [DEFAULT] CHARACTER SET or [DEFAULT] CHARSET column option are defined, the specified character set is used.

    • If the [DEFAULT] COLLATION column option is defined, the default character set for the specified collation is used.

  2. Table default:

    • If the [DEFAULT] CHARACTER SET or [DEFAULT] CHARSET table option are defined, the specified character set is used.

    • If the [DEFAULT] COLLATION table option is defined, the default character set for the specified collation is used.

  3. Database default:

    • If the [DEFAULT] CHARACTER SET or [DEFAULT] CHARSET database option are defined, the specified character set is used.

    • If the [DEFAULT] COLLATION database option is defined, the default character set for the specified collation is used.

  4. Server-wide default (defined by the character_set_server system variable)

Caveats

Inbound Replication and Character Sets

If you are using replication, you should ensure that your tables in the slave cluster use the same character set as the tables in the master. The best way to ensure that the tables use the right character set is to explicitly define the character set at table creation.

If you are using inbound replication into a SkySQL service, you should ensure that your tables in the SkySQL service use the same character set as the tables in the external primary server. The best way to ensure that the tables use the right character set is to explicitly define the character set at table creation.

For example, you can use the following process to migrate tables to SkySQL prior to setting inbound replication:

  1. On the external primary server, check the table's default character set by executing the SHOW CREATE TABLE statement and looking for the DEFAULT CHARSET table option:

    SHOW CREATE TABLE db1.tab;
    
    *************************** 1. row ***************************
           Table: tab
    Create Table: CREATE TABLE `tab` (
      `id` int(11) NOT NULL,
      `str` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
    
  2. Migrate the table to SkySQL with the proper character set by executing the same CREATE TABLE statement provided from the external primary server, after making adjustments for MariaDB Xpand:

    CREATE TABLE db1.`tab` (
      `id` int(11) NOT NULL,
      `str` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
    

UTF-8 Code Points

UTF8 code points in Xpand are not same as that of MySQL due to the fact that the internal implementation of UTF8 code point validity in Xpand varies from that of MySQL.

SHOW CREATE TABLE

Xpand will display column-level character set information as part of SHOW CREATE TABLE output, even if it does not differ from the character set defined for the table.