mariadb - cannot join cluster


Hi,

I have a 3-node MariaDB Galera cluster that was working on Linux. The firewall was stopped but not disabled. After a power failure, the firewall came back up at boot and all nodes had safe_to_bootstrap: 0 at the same SCN (seqno).

I changed safe_to_bootstrap to 1 on one node, and that node started.
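For reference, the safe_to_bootstrap flag lives in grastate.dat in the data directory (here /var/lib/mysql/, going by the base_dir in the log further down). Below is a minimal sketch of reading that file before editing it by hand. The sample content is illustrative only and was not copied from these nodes, and `parse_grastate` is a hypothetical helper name.

```python
from typing import Dict


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file, skipping comments."""
    state: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


# Illustrative content only (assumed, not taken from the nodes in this post).
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    81e2e742-0016-11ea-ae1f-221166a346b3
seqno:   -1
safe_to_bootstrap: 0
"""

state = parse_grastate(SAMPLE)
print(state["safe_to_bootstrap"], state["seqno"])
```

A seqno of -1 is what the file typically contains after an unclean shutdown, which is why the position has to be recovered at startup (the galera_recovery step visible in the unit output further down).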

The other ones didn't start because of the firewall. So I stopped all the firewalls, stopped everything again, and restarted: galera_new_cluster on the 1st node,

then systemctl start mariadb on the other nodes... but the 2nd and 3rd nodes don't start.
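As a side note on choosing the node to bootstrap from: a minimal sketch, assuming the recovered positions have been collected by hand as `uuid:seqno` strings. The two values below are the ones that appear in the log output further down in this post (node 1's --wsrep_start_position and node 2's "Setting GCS initial position"). The helper names are hypothetical.

```python
from typing import Dict, Tuple


def parse_position(pos: str) -> Tuple[str, int]:
    """Split a 'uuid:seqno' string into (uuid, seqno)."""
    uuid, _, seqno = pos.rpartition(":")
    return uuid, int(seqno)


def most_advanced(positions: Dict[str, str]) -> str:
    """Return the node with the highest seqno; all nodes must share one cluster UUID."""
    parsed = {node: parse_position(p) for node, p in positions.items()}
    uuids = {uuid for uuid, _ in parsed.values()}
    if len(uuids) != 1:
        raise ValueError(f"nodes disagree on cluster UUID: {sorted(uuids)}")
    return max(parsed, key=lambda node: parsed[node][1])


positions = {
    # from --wsrep_start_position in node 1's systemctl status output
    "node1": "81e2e742-0016-11ea-ae1f-221166a346b3:38024",
    # from "Setting GCS initial position" in node 2's journal
    "node2": "81e2e742-0016-11ea-ae1f-221166a346b3:37984",
}
print(most_advanced(positions))  # node1
```

The 3rd node's position is not shown anywhere in this post, so it is left out rather than guessed.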

On the 1st node (IP ending with .140), I get timeouts, which is logical since the other nodes don't start (the 2nd node's IP ends with .141):

---

[root@eidlot2-database-1 ]# systemctl status mariadb

● mariadb.service - MariaDB 10.4.9 database server

Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)

Drop-In: /etc/systemd/system/mariadb.service.d

└─migrated-from-my.cnf-settings.conf

Active: active (running) since Mon 2020-01-27 14:23:34 CET; 32min ago

Docs: man:mysqld(8)

https://mariadb.com/kb/en/library/systemd/

Process: 15194 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)

Process: 15068 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)

Process: 15066 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)

Main PID: 15154 (mysqld)

Status: "Taking your SQL requests now..."

CGroup: /system.slice/mariadb.service

└─15154 /usr/sbin/mysqld --wsrep-new-cluster --wsrep_start_position=81e2e742-0016-11ea-ae1f-221166a346b3:38024

Jan 27 14:55:44 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:44 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.141:4567 timed out, no messages seen in PT3S

Jan 27 14:55:46 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:46 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.144:4567 timed out, no messages seen in PT3S

Jan 27 14:55:48 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:48 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.141:4567 timed out, no messages seen in PT3S

Jan 27 14:55:50 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:50 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.144:4567 timed out, no messages seen in PT3S

Jan 27 14:55:52 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:52 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.141:4567 timed out, no messages seen in PT3S

Jan 27 14:55:54 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:54 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.144:4567 timed out, no messages seen in PT3S

Jan 27 14:55:56 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:56 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.141:4567 timed out, no messages seen in PT3S

Jan 27 14:55:58 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:55:58 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.144:4567 timed out, no messages seen in PT3S

Jan 27 14:56:00 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:56:00 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.141:4567 timed out, no messages seen in PT3S

Jan 27 14:56:02 eidlot2-database-1.pass.lan mysqld[15154]: 2020-01-27 14:56:02 0 [Note] WSREP: (37ab1fde, 'tcp:0.0.0.0:4567') connection to peer 00000000 with addr tcp:172.16.57.144:4567 timed out, no messages seen in PT3S


On node 2:

[root@eidlot2-database-2 ]# systemctl status mariadb

● mariadb.service - MariaDB 10.4.9 database server

Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)

Drop-In: /etc/systemd/system/mariadb.service.d

└─migrated-from-my.cnf-settings.conf

Active: failed (Result: exit-code) since Mon 2020-01-27 14:24:55 CET; 33min ago

Docs: man:mysqld(8)

https://mariadb.com/kb/en/library/systemd/

Process: 20219 ExecStart=/usr/sbin/mysqld $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=1/FAILURE)

Process: 20133 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)

Process: 20131 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)

Main PID: 20219 (code=exited, status=1/FAILURE)

Status: "MariaDB server is down"

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: at gcomm/src/pc.cpp:connect():158

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1608: Failed to open channel 'galeratest' at 'gcomm:172.16.57.140,172.16.57.141,172.16.57.144'...tion timed out)

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: gcs connect failed: Connection timed out

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: wsrep::connect(gcomm:172.16.57.140,172.16.57.141,172.16.57.144) failed: 7

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] Aborting

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: Failed to start MariaDB 10.4.9 database server.

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: Unit mariadb.service entered failed state.

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: mariadb.service failed.

Hint: Some lines were ellipsized, use -l to show in full.

[root@eidlot2-database-2 ]# journalctl -xe

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/314572824 bytes) complete.

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (314572824/314572824 bytes) complete.

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 209-37984

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: GCache::RingBuffer unused buffers scan... 0.0% ( 0/61980600 bytes) complete.

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (61980600/61980600 bytes) complete.

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 1/37778 locked buffers

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 61980600/314572800

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.16.57.141; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa =

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: Service thread queue flushed.

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: # Assign initial position for certification: 81e2e742-0016-11ea-ae1f-221166a346b3:37984, protocol version: -1

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: Start replication

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: Connecting with bootstrap option: 0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: Setting GCS initial position to 81e2e742-0016-11ea-ae1f-221166a346b3:37984

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: protonet asio version 0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: Using CRC-32C for message checksums.

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: backend: asio

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: gcomm thread scheduling priority set to other:0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Warning] WSREP: access file(/var/lib/mysqlgvwstate.dat) failed(No such file or directory)

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: restore pc from disk failed

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: GMCast version 0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') listening at tcp:0.0.0.0:4567

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') multicast: , ttl: 1

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: EVS version 1

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: gcomm: connecting to group 'galeratest', peer '172.16.57.140:,172.16.57.141:,172.16.57.144:'

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp:172.16.57.141:4567

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') connection established to 55d53448 tcp:172.16.57.144:4567

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') turning message relay requesting on, nonlive peers:

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') connection established to 37ab1fde tcp:172.16.57.140:4567

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: EVS version upgrade 0 -> 1

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: declaring 37ab1fde at tcp:172.16.57.140:4567 stable

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: declaring 55d53448 at tcp:172.16.57.144:4567 stable

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: PC protocol upgrade 0 -> 1

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:24 0 [Note] WSREP: view(view_id(NON_PRIM,37ab1fde,12) memb {

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 37ab1fde,0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 55d53448,0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 55e1b535,0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: } joined {

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: } left {

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: } partitioned {

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 48b83e85,0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: 4d71f4d4,0

Jan 27 14:24:24 eidlot2-database-2.pass.lan mysqld[20219]: })

Jan 27 14:24:25 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:25 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') connection established to 55d53448 tcp:172.16.57.144:4567

Jan 27 14:24:28 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:28 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') turning message relay requesting off

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [Note] WSREP: (55e1b535, 'tcp:0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp:172.16.57.144:4567

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: at gcomm/src/pc.cpp:connect():158

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1608: Failed to open channel 'galeratest' at 'gcomm:172.16.57.140,172.16.57.141,172.16.57.144': -110 (Connection

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: gcs connect failed: Connection timed out

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] WSREP: wsrep::connect(gcomm:172.16.57.140,172.16.57.141,172.16.57.144) failed: 7

Jan 27 14:24:54 eidlot2-database-2.pass.lan mysqld[20219]: 2020-01-27 14:24:54 0 [ERROR] Aborting

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: Failed to start MariaDB 10.4.9 database server.

-- Subject: Unit mariadb.service has failed

-- Defined-By: systemd

-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

--

-- Unit mariadb.service has failed.

--

-- The result is failed.

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: Unit mariadb.service entered failed state.

Jan 27 14:24:55 eidlot2-database-2.pass.lan systemd[1]: mariadb.service failed.

(IP .144 is the 3rd node, which is still down and not yet started, so it is normal that it gives a timeout -110.)

My question: why does node 2 go to the failed state? If it joins node 1, together they have the majority. Should I also set safe_to_bootstrap to 1 on the 2nd node?
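To make the majority reasoning explicit, here is a minimal sketch of the arithmetic, assuming all three nodes carry equal weight (the Galera default for pc.weight). `has_quorum` is a hypothetical helper. It only illustrates the counting and does not by itself explain the NON_PRIM view in the log above.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if the reachable nodes form a strict majority of the last known cluster size."""
    if not 0 <= reachable <= cluster_size:
        raise ValueError("reachable must be between 0 and cluster_size")
    return 2 * reachable > cluster_size


# For a 3-node cluster: 1 node alone has no majority, 2 or 3 nodes do.
for reachable in range(1, 4):
    print(reachable, has_quorum(reachable, 3))
```

Note that this majority rule applies to partitions of an already running primary component. After a full shutdown, a primary component first has to be bootstrapped before the other nodes can join it.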

Thanks.
