MariaDB Xpand Monitoring

Overview

MariaDB Xpand provides all of the information necessary for monitoring the deployment. The system database contains low-level information about all portions of the deployment, and proprietary SQL commands return high-level summary information. The following sections describe these features in detail.

As a best practice, Unix sysadmin tools such as top should not be used on Xpand to monitor node status, as their results may not be reliable.

The system variables, system tables, and statements referenced on this page are only available on the Xpand nodes. When using the Xpand Storage Engine topology, you need to connect to an Xpand node to use these features.

Current Sessions

MariaDB Xpand provides a system.sessions system table that can be used to monitor current sessions.

When using the Xpand Storage Engine topology, the system.sessions system table can be accessed by connecting to the Xpand nodes.

The following query will list all sessions in the deployment:

SELECT * FROM system.sessions;

Example output:

*************************** 1. row ***************************
            nodeid: 1
        session_id: 1560577
         source_ip: NULL
       source_port: NULL
          local_ip: NULL
        local_port: NULL
              user: 4106
          database: NULL
         trx_state: closed
   statement_state: finished
               xid: NULL
               cpu: 3
         isolation: REPEATABLE-READ
    last_statement: delete from system.rebalancer_activity_log where
                    finished < subdate(now(), interval @global.
                    rebalancer_activity_log_hours * 60 minute)
   time_in_state_s: 161
           created: 2016-01-11 20:31:41
           heap_id: 288230379201751147
           trx_age: NULL
          trx_mode: autocommit
trx_counter_select: 0
trx_counter_insert: 0
trx_counter_update: 0
trx_counter_delete: 1
     trx_is_writer: 1
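
For example, building on the columns shown in the output above, a query like the following can flag long-running statements. This is a minimal sketch; the 'executing' state value and the 60-second threshold are assumptions to adapt to your workload:

SELECT nodeid, session_id, user, statement_state, time_in_state_s, last_statement
FROM system.sessions
WHERE statement_state = 'executing'
  AND time_in_state_s > 60
ORDER BY time_in_state_s DESC;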

Monitoring CPU Load

MariaDB Xpand provides a SHOW LOAD statement that can be used to monitor CPU load.

When using the Xpand Storage Engine topology, the SHOW LOAD statement can be accessed by connecting to the Xpand nodes.

To display load percentages (0 to 100) for CPU utilization, disk read/write utilization, and buffer cache miss rates, issue the SHOW LOAD statement. CPU core 0 is used for specific tasks in the deployment and is therefore omitted from the load average when calculating the overall load.

Disk read and write values are calculated as a percentage using estimated maximums for solid state drives and might exceed 100% under certain conditions.
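
For example, from a client connected to an Xpand node:

SHOW LOAD;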

Monitoring Space Utilization per Table

MariaDB Xpand provides a system.table_sizes system table that can be used to monitor the size of Xpand tables.

When using the Xpand Storage Engine topology, the system.table_sizes system table can be accessed by connecting to the Xpand nodes.

The system.table_sizes table shows the size in bytes of each table in every database, including indexes, replicas, and row data. The value in the size column represents total storage on disk, rather than just the size of the data.
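
A common starting point is to list the largest tables first. The following is a minimal sketch, assuming the table exposes database and table name columns alongside the documented size column:

SELECT `database`, `table`, size
FROM system.table_sizes
ORDER BY size DESC
LIMIT 10;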

Monitoring Data Rebalancing Activity

MariaDB Xpand continuously monitors how data is distributed in the deployment. When it detects an imbalance, it rebalances data by moving, copying, and splitting as required. Data is also copied for protective purposes, such as when a disk or node is lost.
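
Recent Rebalancer activity is recorded in system.rebalancer_activity_log, the table referenced by the session example above. The following is a minimal sketch for inspecting it, assuming only the finished column that appears in that example:

SELECT *
FROM system.rebalancer_activity_log
ORDER BY finished DESC
LIMIT 20;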

statd

MariaDB Xpand provides a clustrix_statd database that can be used to monitor Xpand statistics. It contains multiple tables, including:

  • statd_history

  • statd_metadata

When using the Xpand Storage Engine topology, the clustrix_statd database can be accessed by connecting to the Xpand nodes.

Statistics are available as either counters or gauges. Counters are cumulative and additive, whereas a gauge captures the value of a statistic at a given point in time and can be used to graph the movement of that statistic.

Querying Statistics with statd

Use the REGEXP operator to query the following statistics. This example queries transactions per second from the prior 24 hours:

USE clustrix_statd;

SELECT timestamp, value, name
FROM clustrix_statd.statd_history
NATURAL JOIN clustrix_statd.statd_metadata
WHERE name REGEXP 'tps$'
  AND timestamp > now() - interval 1 day
ORDER BY timestamp DESC;
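
Because counters are cumulative, graphing one usually means plotting the difference between consecutive samples rather than the raw values. The following is a minimal sketch of that calculation, assuming the two tables share an id key (the documented query above relies on NATURAL JOIN) and that samples are spaced 15 seconds apart; substitute your deployment's actual sampling interval:

-- Per-interval deltas for a counter metric (connections, which the
-- tables below list as a counter).
-- NOTE: the shared id key and the 15-second sample spacing are assumptions.
SELECT cur.timestamp,
       meta.name,
       cur.value - prev.value AS delta
FROM clustrix_statd.statd_metadata meta
JOIN clustrix_statd.statd_history cur
  ON cur.id = meta.id
JOIN clustrix_statd.statd_history prev
  ON prev.id = cur.id
 AND prev.timestamp = cur.timestamp - INTERVAL 15 SECOND
WHERE meta.name REGEXP 'connections$'
  AND cur.timestamp > now() - interval 1 day
ORDER BY cur.timestamp DESC;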

Frequently Used statd Metrics

Frequently used statd metrics are listed below, grouped by category. Each entry gives the metric name, a REGEXP pattern that matches it in statd_metadata, a description, what it measures, and whether to graph it as a counter or a gauge.

Connections

clustrix.stats.connections
    REGEXP: connections
    Description: Total number of connections to the deployment
    Measures: Number of connections
    Graph as: counter

Diagnostic

clustrix.containers.count
    REGEXP: containers.count
    Description: Number of containers in use by Xpand
    Measures: Container count
    Graph as: gauge

clustrix.cluster.nodes_in_quorum
    REGEXP: nodes_in_quorum
    Description: Number of nodes currently online and in the deployment
    Measures: Number of nodes online
    Graph as: gauge

clustrix.stats.bigc_value.node.nn
    REGEXP: bigc_value.node.*
    Description: XID value through which garbage collection will clean up
    Measures: BigC value by node
    Graph as: gauge

clustrix.stats.fc_wait_time.node.nn
    REGEXP: fc_wait_time.*
    Description: Total time all queries spent waiting for flow control (inter-node communication)
    Measures: Time waiting on flow control
    Graph as: counter

clustrix.stats.key_caches_updates.node.nn
    REGEXP: key_caches_updates.node
    Description: Number of times the PD Cache issued update keys. This statistic is sometimes useful for support.
    Measures: PD Cache keys updated
    Graph as: counter

clustrix.stats.layer_merge_bytes.node.nn
    REGEXP: layer_merge_bytes.node.
    Description: Size in bytes of layer merge operations by node
    Measures: Layer merge bytes by node
    Graph as: counter

clustrix.stats.layer_merge_rows.node.nn
    REGEXP: layer_merge_rows.node.
    Description: Number of rows merged by layer merge operations by node
    Measures: Layer merge rows by node
    Graph as: counter

clustrix.stats.pdcache_fetches.node.nn
    REGEXP: pdcache_fetches.node.
    Description: Number of times the PD Cache fetched new PDs
    Measures: PD Cache fetches by node
    Graph as: counter

clustrix.stats.pdcache_loaded.node.nn
    REGEXP: pdcache_loaded.node.
    Description: Number of times the PD Cache loaded new PDs
    Measures: PD Cache loads by node
    Graph as: counter

clustrix.stats.qpc_misses
    REGEXP: qpc_misses
    Description: Number of times a query missed the Query Plan Cache (QPC)
    Measures: Query Plan Cache misses
    Graph as: counter

clustrix.stats.qpc_recompiles
    REGEXP: qpc_recompiles
    Description: Number of times the QPC recompiled queries
    Measures: Query Plan Cache recompiles
    Graph as: counter

clustrix.stats.queries_not_using_indexes
    REGEXP: queries_not_using_indexes
    Description: Number of queries on the deployment not using any index
    Measures: Queries not using indexes
    Graph as: counter

clustrix.stats.walltime_(read|write)_us
    REGEXP: walltime_(read|write)_us
    Description: Total wall (elapsed) time of read or write queries
    Measures: Wall time of read or write queries
    Graph as: counter

Monitoring

clustrix.rebalancer_reprotects
    REGEXP: rebalancer_reprotects$
    Description: Number of Rebalancer reprotect operations
    Measures: Rebalancer reprotects
    Graph as: counter

clustrix.rebalancer_deletes
    REGEXP: rebalancer_deletes$
    Description: Number of Rebalancer delete operations
    Measures: Rebalancer deletes
    Graph as: counter

clustrix.rebalancer_rebalance
    REGEXP: rebalancer_rebalance$
    Description: Number of Rebalancer rebalance operations
    Measures: Rebalancer rebalances
    Graph as: counter

clustrix.rebalancer_redistributes
    REGEXP: rebalancer_redistributes$
    Description: Number of Rebalancer redistribute operations
    Measures: Rebalancer redistributes
    Graph as: counter

clustrix.rebalancer_request
    REGEXP: rebalancer_request$
    Description: Number of Rebalancer requests
    Measures: Rebalancer requests
    Graph as: counter

clustrix.rebalancer_reranks
    REGEXP: rebalancer_reranks$
    Description: Number of Rebalancer rerank operations
    Measures: Rebalancer reranks
    Graph as: counter

clustrix.rebalancer_splits
    REGEXP: rebalancer_splits$
    Description: Number of Rebalancer split operations
    Measures: Rebalancer splits
    Graph as: counter

clustrix.cluster.total_nodes
    REGEXP: cluster.total_nodes
    Description: Total number of nodes configured for the deployment (not just those in quorum)
    Measures: Total number of nodes
    Graph as: gauge

Networking

clustrix.io.network.latency_ms.avg.node.nn.(network interface)
    REGEXP: latency_ms.avg.node.
    Description: Average network latency per node
    Measures: Network latency average by node
    Graph as: gauge

clustrix.io.network.latency_ms.avg
    REGEXP: latency_ms....$
    Description: Network latency average
    Measures: Network latency average
    Graph as: gauge

clustrix.io.network.latency_ms.max
    REGEXP: latency_ms....$
    Description: Network latency maximum
    Measures: Network latency maximum
    Graph as: gauge

clustrix.io.network.latency_ms.min
    REGEXP: latency_ms....$
    Description: Network latency minimum
    Measures: Network latency minimum
    Graph as: gauge

Performance

clustrix.activity.core0.node.nn
    REGEXP: core0.node.*
    Description: TIL CPU core-0 utilization by node
    Measures: Til CPU core-0 by node
    Graph as: gauge

clustrix.activity.til_avg
    REGEXP: til_avg$
    Description: Average core utilization for the deployment, excluding core-0
    Measures: Average Til CPU
    Graph as: gauge

clustrix.activity.til_avg.node.nn
    REGEXP: til_avg.node.*
    Description: Average TIL core (non-core-0) utilization per node in a deployment
    Measures: Average Til CPU by node
    Graph as: gauge

clustrix.activity.til_(min|max|avg).node.nn
    REGEXP: til_(min|max|avg).node.
    Description: TIL CPU core (non-core-0) utilization min/max/avg
    Measures: Til CPU min/max/avg by node
    Graph as: gauge

clustrix.activity.til_max
    REGEXP: til_max$
    Description: Maximum core utilization for the deployment, excluding core-0
    Measures: Max Til CPU
    Graph as: gauge

clustrix.activity.til_max.node.nn
    REGEXP: til_max.node.*
    Description: Maximum TIL core (non-core-0) utilization per node in a deployment
    Measures: Max Til CPU by node
    Graph as: gauge

clustrix.cpu.load_(min|max)
    REGEXP: cpu.load_...$
    Description: Min/max CPU core utilization, database process only
    Measures: Database CPU min/max
    Graph as: gauge

clustrix.cpu.load.node.nn.cpu.0
    REGEXP: load.node.*.cpu.0
    Description: CPU core-0 utilization by node, database process only
    Measures: Database CPU core-0 by node
    Graph as: gauge

clustrix.cpu.load.node.nn.cpu.nn
    REGEXP: load.node.*.cpu.[^0]\d*
    Description: CPU core utilization excluding core-0, by node, database process only
    Measures: Database CPU excluding core-0 by node
    Graph as: gauge

clustrix.lockman.wait_ms.(min|max|avg)
    REGEXP: lockman.wait_ms....$
    Description: Total time all queries spent waiting on locks
    Measures: Time waiting on query locks
    Graph as: gauge

clustrix.qps
    REGEXP: qps$
    Description: Cluster-wide queries per second
    Measures: Cluster QPS
    Graph as: gauge

clustrix.response_time.read_us
    REGEXP: response_time.read_us
    Description: Cluster read query latency (total runtime divided by the number of queries)
    Measures: Cluster read latency
    Graph as: gauge

clustrix.response_time.write_us
    REGEXP: response_time.write_us
    Description: Cluster write query latency (total runtime divided by the number of queries)
    Measures: Cluster write latency
    Graph as: gauge

clustrix.stats.executing_sessions
    REGEXP: executing_sessions
    Description: Number of sessions currently executing
    Measures: Sessions currently executing
    Graph as: gauge

clustrix.stats.qps_total.node.nn
    REGEXP: qps_total.node
    Description: Queries per second by node
    Measures: QPS by node
    Graph as: counter

clustrix.stats.tm_cpu_waittime_us
    REGEXP: tm_cpu_waittime
    Description: Total amount of time queries spent waiting due to the fair scheduler
    Measures: Fair scheduler wait time
    Graph as: counter

clustrix.stats.tps_total.node.nn
    REGEXP: tps_total.node
    Description: Transactions per second by node
    Measures: TPS by node
    Graph as: counter

clustrix.tps
    REGEXP: tps$
    Description: Cluster-wide transactions per second
    Measures: Cluster TPS
    Graph as: gauge

Queries

clustrix.stats.Com_(delete|insert|select|update)
    REGEXP: Com_(delete|insert|select|update)$
    Description: Number of queries, separated by query type
    Measures: Queries by type
    Graph as: counter

clustrix.stats.Com_(delete|insert|select|update)_us
    REGEXP: Com_(delete|insert|select|update)_us
    Description: Amount of run time by query type
    Measures: Time per query type
    Graph as: gauge

clustrix.stats.runtime_read_us
    REGEXP: runtime_read_us
    Description: Total runtime of read queries
    Measures: Runtime of read queries
    Graph as: counter

clustrix.stats.runtime_write_us
    REGEXP: runtime_write_us
    Description: Total runtime of write queries
    Measures: Runtime of write queries
    Graph as: counter

Replication

clustrix.replication.slave.relay_log_bytes.(slave)
    REGEXP: relay_log_bytes[^_]
    Description: Size in bytes of the relay logs for the replication slave
    Measures: Relay log bytes for all slaves
    Graph as: gauge

clustrix.replication.slave.bytes_read.(slave)
    REGEXP: slave.bytes_read
    Description: Bytes read by the replication slave
    Measures: Bytes read by slave
    Graph as: counter

clustrix.replication.slave.events_read.(slave)
    REGEXP: events_read
    Description: Number of events read by the replication slave
    Measures: Events read by slave
    Graph as: counter

clustrix.replication.slave.seconds_behind_master.(slave)
    REGEXP: seconds_behind_master
    Description: Seconds-behind-master value for all replication slaves
    Measures: Slave seconds behind master
    Graph as: gauge

Rows

clustrix.containers.rows_read.node.nn
    REGEXP: rows_read.node..$
    Description: Number of rows read by node at the container level
    Measures: Rows read by node, container level
    Graph as: counter

clustrix.containers.rows_written.node.nn
    REGEXP: rows_written.node..$
    Description: Number of rows written by node at the container level
    Measures: Rows written by node, container level
    Graph as: counter

clustrix.containers.rows_(deleted|inserted|replaced)
    REGEXP: rows_(deleted|inserted|replaced)$
    Description: Number of rows deleted, inserted, and updated at the container level
    Measures: Rows deleted, inserted, updated
    Graph as: counter

clustrix.containers.rows_(read|written)
    REGEXP: containers.rows_(read|written)$
    Description: Number of rows read and written at the container level
    Measures: Rows read and written, container level
    Graph as: counter

clustrix.containers.rows_deleted
    REGEXP: rows_deleted$
    Description: Number of rows deleted at the container level
    Measures: Rows deleted, container level
    Graph as: counter

clustrix.containers.rows_inserted
    REGEXP: rows_inserted$
    Description: Number of rows inserted at the container level
    Measures: Rows inserted, container level
    Graph as: counter

clustrix.containers.rows_read
    REGEXP: containers.rows_read$
    Description: Number of rows read at the container level
    Measures: Rows read, container level
    Graph as: counter

clustrix.containers.rows_replaced
    REGEXP: rows_replaced$
    Description: Number of rows updated at the container level
    Measures: Rows updated, container level
    Graph as: counter

clustrix.containers.rows_written
    REGEXP: rows_written$
    Description: Number of rows written at the container level
    Measures: Rows written, container level
    Graph as: counter

clustrix.stats.tm_rows_output
    REGEXP: tm_rows_output
    Description: Number of rows written by queries
    Measures: Rows written, transaction manager
    Graph as: counter

clustrix.stats.tm_rows_read
    REGEXP: tm_rows_read
    Description: Number of rows read by queries
    Measures: Rows read, transaction manager
    Graph as: counter

Storage

clustrix.capacity.disks.(avg|max|min)_used_percent
    REGEXP: (avg|max|min)_used_percent
    Description: Min/max/avg disk capacity used
    Measures: Percentage of disks in use
    Graph as: gauge

clustrix.io.disk.bytes_read.node.nn.disk.(disk segment)
    REGEXP: disk.bytes_read.node.
    Description: Not applicable for virtual environments (i.e., i3s)
    Measures: Bytes read at the kernel level
    Graph as: counter

clustrix.io.disk.bytes_written.node.nn.disk.(disk segment)
    REGEXP: disk.bytes_written.node.
    Description: Not applicable for virtual environments (i.e., i3s)
    Measures: Bytes written at the kernel level
    Graph as: counter

clustrix.io.disk.pct_utilization.node.nn.disk.(disk segment)
    REGEXP: disk.pct_utilization.node
    Description: Disk utilization metric for individual physical disks hosting the vdev file (e.g., through an md RAID device)
    Measures: Disk utilization
    Graph as: gauge

clustrix.io.disk.read_latency_ms.node.nn.disk.(disk segment)
    REGEXP: read_latency_ms.node.
    Description: Average read latency from IO time over the same period
    Measures: Storage read latency, kernel level by node
    Graph as: gauge

clustrix.io.disk.write_latency_ms.node.nn.disk.(disk segment)
    REGEXP: write_latency_ms.node.
    Description: Average write latency from IO time over the same period
    Measures: Storage write latency, kernel level by node
    Graph as: gauge

clustrix.io.vdevs.bytes_read
    REGEXP: bytes_read$
    Description: Device1 bytes read, deployment level
    Measures: Device1 bytes read, deployment level
    Graph as: counter

clustrix.io.vdev.bytes_read.node.nn.vdev.3
    REGEXP: bytes_read.node.*.vdev.3
    Description: Device1 bytes read by node
    Measures: Device1 bytes read by node
    Graph as: counter

clustrix.io.vdevs.bytes_read_per_sec
    REGEXP: bytes_read_per_sec
    Description: Device1 bytes read per second
    Measures: Device1 bytes read per second
    Graph as: gauge

clustrix.io.vdevs.bytes_written
    REGEXP: bytes_written$
    Description: Device1 bytes written, deployment level
    Measures: Device1 bytes written, deployment level
    Graph as: counter

clustrix.io.vdev.bytes_written.node.nn.vdev.3
    REGEXP: bytes_written.node.*.vdev.3
    Description: Device1 bytes written by node
    Measures: Device1 bytes written by node
    Graph as: counter

clustrix.io.vdevs.bytes_written_per_sec
    REGEXP: bytes_written_per_sec
    Description: Device1 bytes written per second
    Measures: Device1 bytes written per second
    Graph as: gauge

clustrix.io.vdev.read_latency_us.node.nn.vdev.3
    REGEXP: read_latency_us.node.*.vdev.3
    Description: Device1 read latency by node
    Measures: Device1 read latency by node
    Graph as: gauge

clustrix.io.vdev.sync_latency_us.node.nn.vdev.3
    REGEXP: sync_latency_us.node.*.vdev.3
    Description: Device1 sync latency by node
    Measures: Device1 sync latency by node
    Graph as: gauge

clustrix.io.vdev.write_latency_us.node.nn.vdev.3
    REGEXP: write_latency_us.node.*.vdev.3
    Description: Device1 write latency by node
    Measures: Device1 write latency by node
    Graph as: gauge