July 27, 2014

Multisource Replication: How to resolve the schema name conflicts

MariaDB 10.0 has introduced the functionality to replicate data from more than one Master server. This is what many have been waiting for, especially those who do business intelligence analysis, aggregation and reporting on data coming from different and sometimes related applications.

The way multi-source replication is implemented is extremely simple - for every Master you need what is called a 'connection'. Previously you just had one, the default. Each 'connection' points to a Master and it has, just like regular replication, two threads: IO Thread and SQL Thread. This means that if you have two different Masters using the same Schema name, the two 'connections' associated with the separate Masters would operate on the same Schema on the Slave (specifically the Slave SQL Thread of both connections would work in the same schema). Moreover should the two source schemas contain tables with the same name, the changes will be applied on the very same table by both Slave SQL threads. However this behaviour is probably unwanted in most cases.

Problems arise when you have two or more Master servers that have one or more schema names that conflict. If this is the case, each SQL Thread of each connection will apply changes in the same schema. At the very least there would be a 'logical' mix of tables in the same schema (if none of the tables names conflict). But more likely data consistency would be lost by having two different sources applying different changes to the same tables.

How can we resolve this?

MySQL Replication has the facility to rename the schema that is used on the Slave. This means that if the schema on the Master is named 'customers', you can have the Slave replicating the changes inside a schema with a different name, say 'customers_usa'. The option is called --replicate-rewrite-db.

Would this option work with Multisource replication? (Sort of rhetorical question)

Multisource replication introduces the notation to select the Master connection, so that the --replicate-rewrite-db option can be set on a per-master-connection basis. The option must be set in the configuration file (my.cnf) and it is picked up at instance startup.

If we use as example two Master sources named: master_usa and master_emea, the (multisource) Slave configuration would look like:

[mysqld]
...
...
...
master_usa.replicate-rewrite-db=customer->customer_usa
master_emea.replicate-rewrite-db=customer->customer_emea
...
...

Once you have the Slave up and running, to define the Master connections you must use the notation for Multi-source replication that is:

mysql> CHANGE MASTER 'master_usa' TO MASTER_HOST='10.0.0.101', MASTER_PORT=10101, MASTER_USER='repl';
mysql> CHANGE MASTER 'master_emea' TO MASTER_HOST='10.0.0.102', MASTER_PORT=10102, MASTER_USER='repl';
mysql> START ALL SLAVES;

Keep in mind that Multi-source replication has a slightly modified syntax which is nicely described on our Knowledge Base.

While with simple Replication you have only one Slave worker, with Multi-source one you can have multiples. This means you always need to specify whether you're refer to a specific worker or all of them. This can be done by either using the connection name ('master_usa' and 'master_emea' in our example), the keyword ALL, or the session variable 'default_master_connection'. If everything is working as expected the changes that happen on the 'customer' schema on '10.0.0.101' will end up in a schema named 'customer_usa' on the Slave. Similarly the changes to '10.0.0.102' will go into 'customer_emea' schema on the Slave.

Use Case

An example use case is reporting data from two different geographical databases (sharded servers) that are now, thanks to multi-source replication, consolidated in one place.

If you have the table 'sales' in the 'customer' schema which are both identical on the USA & EMEA Master servers, and now you have a multi-source slave instance with the schemas 'customer_usa' and 'customer_emea', you may want to create a report including data from both geographical datasets.

One way is to create a view over the 'sales' table on both schemas similar to this:

SELECT 'usa' as `Area`, s1.* FROM customer_usa.sales s1
UNION ALL
SELECT 'emea' as `Area`, s2.* FROM customer_emea.sales s2;

(PS: UNION ALL avoids check for duplicates making the UNION faster)

About og_116758641130679290187

Read all posts by og_116758641130679290187