Galera cluster fail during dump or optimize
I'm experiencing an issue with a MariaDB cluster when dumping or optimizing databases. When I run a a "mysqlcheck" or a "mysqldump" on my Mariadb 10.1.10 database server (which runs in a Galera cluster with two other servers) then the tasks stop after a short time and don't show any progress. The entire cluster seems to stall.
For example the mysqldump stops after creating an 14,0 MB (14.760.912 bytes) dump file and doesn't proceed, even after several minutes nothing happens. The mysqlcheck to repair and optimize tables also hangs after checking a few tables.
In both situations the cluster starts to have issues and the only way to get it to work normally again is by taking the server that executed the job offline and also taking another server offline. I then take them back online on by one and the cluster works normally again.
I'm not sure what is causing these problems. I haven't found any errors in the syslog, although during the shutdown of the servers I noticed the following:
Jan 10 20:43:46 france mysqld: 2016-01-10 20:43:46 140096330258176 [Warning] WSREP: TO isolation failed for: 3, schema: mysql, sql: OPTIMIZE TABLE proc. Check wsrep connection state and retry the query.
Jan 10 21:58:47 france mysqld: 2016-01-10 21:58:47 139691511322368 [Warning] WSREP: TO isolation failed for: 3, schema: smf, sql: OPTIMIZE TABLE smf_categories. Check wsrep connection state and retry the query. Jan 10 21:58:47 france mysqld: 2016-01-10 21:58:47 139691511322368 [Warning] Aborted connection 24 to db: 'smf' user: 'maintenance' host: 'localhost' (Unknown error) Jan 10 21:58:47 france mysqld: 2016-01-10 21:58:47 139691509827328 [Warning] WSREP: TO isolation failed for: 3, schema: (null), sql: SELECT 1 FROM mysql.user LIMIT 1. Check wsrep connection state and retry the query.
Any idea what could be the cause of this?
Answer Answered by Tom Worster Worster in this comment.
Sounds like flow control. Jay Janssen has a couple of good articles on the Percona blog.
Short form: if activity in one node causes it to fall behind on applying replicated transactions received from the others, its receive queue builds up and when it reaches a threshold it signals the other nodes to stop sending, which they do by halting new transactions.
You may be able to alleviate this by adjusting the threshold and/or disabling flow control (donor desync) for the duration of the slow task. Jay Janssen has the details.
I've noticed that things happening in mysqld or in other processes can have enough impact to lead to flow control.