Comments - Replication stops after one hour or so
Content reproduced on this site is the property of its respective owners,
and this content is not reviewed in advance by MariaDB. The views, information and opinions
expressed by this content do not necessarily represent those of MariaDB or any other party.
There should always be an error logged to the error log if the slave stops replication. Both in case of error and normal STOP SLAVE.
What do you actually mean with "replication stops"? What does SHOW SLAVE STATUS say? Are the SQL thread and IO thread running?
If everything is running but just slow, you need to investigate what it is that is slow. You mention row-based replication, one thing that can _really_ be slow in row-based replication is updates/deletes to large tables with no primary key, that should be avoided.
Oh deary me, where can i hide my face?! You were right about the tables. We have one without a primary key and about 390000 records (created a long time before i joined the company).
Furthermore the DELETE statement is encapsulated in a transaction, so the Execution-Pointer really stops for an extensive time. I was able to find the problem by looking into the binlogs with "mysqlbinlog". Very useful tool. Thank you for pointing me in the right direction!
SHOW SLAVE STATUS says that SQL and IO Slave-threads are running. Exec_Master_Log_Pos is static (which is bad), but Relay_Log_Space is increasing, as well as Read_Master_Log_Pos, so IO seems to work. Seconds_Behind_Master (SBM) is ticking up one second per second (bad). Slave_IO_State is "Waiting for master to send event".
In the initial state, right after setting up replication and starting the slave, he goes down from like 1700 SBM to zero SBM and the servers are in full sync. And then, after "some amount of time" the slave simply stops executing the changes made to the master. We have large tables (27 million records) but they are well indexed.
With "show full processlist" i see that he is executing "Reading event from the relay log" for the same time as SBM. I also see that the slave is running at 100% (or even more, multicore) CPU. I just checked the binlog AFTER the last executed entry and saw a lot of "at <some number>" with nothing behind (so i guess its a read-query on the master) but then a large block of deletes.
I will setup replication again, wait for it to fail and check that sequence again. Maybe this is the right track...