Host Anomaly Detector for MariaDB Xpand
MariaDB Xpand 6.1.1 introduces the Host Anomaly Detector, which can help determine the cause of cluster-wide instability, with an initial focus on diagnosing network issues:
Xpand's Host Anomaly Detector aggregates and analyzes the cluster logs to detect issues
Metrics can be collected by scraping the exposed Prometheus endpoint or by configuring the monitor to export to InfluxDB directly
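As a rough illustration of the scraping path, the sketch below parses a payload in the Prometheus text exposition format using only the standard library. The metric names, the `peer` label, and the sample payload are hypothetical placeholders, not the detector's actual output; the endpoint address is not given here, so the example works on an inline string.

```python
import re

# Sample payload in the Prometheus text exposition format.
# Metric and label names are hypothetical placeholders, not the
# detector's real metric names.
SAMPLE = """\
# HELP example_group_changes_total Group changes seen by this node
# TYPE example_group_changes_total counter
example_group_changes_total 3
example_connect_failures_total{peer="node2"} 7
example_connect_failures_total{peer="node3"} 0
"""

# name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a dict mapping (metric name, sorted label tuple) to a float."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
```

In practice the text would come from an HTTP GET against the monitor's endpoint (for example with `urllib.request.urlopen`), or a Prometheus server would be configured to scrape it directly.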
Available in MariaDB Xpand 6.1 (6.1.1 and later)
Identify potential problems with nodes sooner
Detect when backend network connections repeatedly fail
Detect when particular periodic tasks take too long
Aggregates and Analyzes Cluster Logs
MariaDB Xpand's Host Anomaly Detector functions by aggregating and analyzing the cluster logs.
Aggregating and analyzing the cluster logs adds complexity, but the logs are an excellent source of information about the state of the Xpand cluster that is not easily available through other means, especially when the database itself can't be queried.
For example, consider a cluster that repeatedly loops through group changes because of poor network connectivity and therefore never reaches the point where it can execute queries. Because the Xpand cluster can't be queried in this scenario, some other source of data must be used to obtain its state. The logs remain available even when the database is down, so they can be used entirely independently of the state of the local node or the cluster as a whole.
Additionally, the logs from different nodes can be used to correlate events across the cluster. In the example scenario, the problematic nodes with poor connectivity would fail to connect to the good nodes roughly as often as the good nodes would fail to connect to the problematic nodes. Because the Host Anomaly Detector collects logs from all nodes, it can compare similar events across all logs and identify the node that the error messages have in common.
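The correlation idea can be sketched as a simple count. This is only an illustration of the reasoning above, not the detector's actual algorithm: it assumes the connection failures have already been extracted from the logs as hypothetical (reporting node, unreachable peer) pairs, and the node names are invented.

```python
from collections import Counter

# Hypothetical, already-parsed connection failure events:
# (node that logged the failure, peer it could not reach).
events = [
    ("node1", "node3"),
    ("node2", "node3"),
    ("node3", "node1"),
    ("node3", "node2"),
    ("node4", "node3"),
    ("node3", "node4"),
    ("node1", "node2"),
]


def rank_suspects(failure_events):
    """Count how often each node appears on either side of a failure."""
    involvement = Counter()
    for reporter, peer in failure_events:
        involvement[reporter] += 1
        involvement[peer] += 1
    # Nodes involved in the most failures come first.
    return involvement.most_common()


ranking = rank_suspects(events)
print(ranking[0])  # ('node3', 6)
```

Counting both sides of each failure matters: a node with poor connectivity shows up both as a reporter and as an unreachable peer, so it stands out even when every node logs some errors.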
MariaDB Xpand's Host Anomaly Detector currently tracks the following metrics on each Xpand node:
The amount of memory used (in kilobytes) by the monitor process
The Unix timestamp of the last observed database process restart on the monitor's host
The number of errors reported by this node
The number of alerts reported by this node
The number of crashes reported by this node
The Unix timestamp of the most recent time this node has crashed
The number of group changes this node has experienced
The Unix timestamp of when the most recent group change began
The Unix timestamp of when the most recent group formed
The number of times this node has failed to connect to each other node in the cluster
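One way a consumer might interpret a few of these values is sketched below. The field names are hypothetical stand-ins for the metrics listed above, and the rule that a group change is still in progress when its start timestamp is later than the last group formation is an assumption made for illustration, not documented detector behavior.

```python
from dataclasses import dataclass, field


@dataclass
class NodeSnapshot:
    # Hypothetical field names mirroring the list above; these are not
    # the detector's actual metric names.
    group_changes: int
    last_group_change_began: float  # Unix timestamp
    last_group_formed: float        # Unix timestamp
    connect_failures: dict = field(default_factory=dict)  # peer -> count


def group_change_in_progress(snapshot):
    """Assumption: a change that began after the last group formed is unfinished."""
    return snapshot.last_group_change_began > snapshot.last_group_formed


def worst_peer(snapshot):
    """Return the peer this node failed to reach most often, or None."""
    if not snapshot.connect_failures:
        return None
    return max(snapshot.connect_failures, key=snapshot.connect_failures.get)


snapshot = NodeSnapshot(
    group_changes=12,
    last_group_change_began=1_700_000_300.0,
    last_group_formed=1_700_000_100.0,
    connect_failures={"node2": 1, "node3": 9},
)
print(group_change_in_progress(snapshot), worst_peer(snapshot))
```

With the sample values, the start timestamp is newer than the formation timestamp, so the sketch reports a change in progress and names `node3` as the peer with the most failed connections.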