Email Alerts for MariaDB Xpand

Overview

MariaDB Xpand continuously monitors itself to ensure your deployment is healthy and operating optimally. When it detects a condition that requires attention, Xpand sends an email alert using its Alerter. Each alert has one of four severities (INFO, WARNING, ERROR, or CRITICAL), and Xpand is pre-configured with default thresholds for each.

Configure Email Alerts

MariaDB Xpand can be configured to send email alerts.

Use the following steps to configure the alerts for your system:

  1. Set Identifying Global Variables

    Set these identifying global variables for your database. These are especially important to aid MariaDB Support in troubleshooting.

    SET GLOBAL customer_name = 'customer name';
    SET GLOBAL cluster_name = 'deployment identifier';
    
  2. Configure alerts_parameters for SMTP Server

    The parameters defined in the system.alerts_parameters table control how alerts are formatted and sent.

    MariaDB Xpand requires an SMTP server to send the alert messages. These instructions presume that an SMTP server has already been set up correctly for your environment.

    Set the following SMTP parameters as they apply to your deployment.

    Parameter | Description | Required
    --- | --- | ---
    smtp_server | Hostname of the SMTP server | Yes
    smtp_port | SMTP port for your environment, if different from the default TCP port 25 | Yes
    smtp_username | SMTP username | No
    smtp_password | SMTP password | No
    smtp_security | SMTP security type. Must be SMTPS or TLS. | No

    Follow this syntax to update the parameters shown:

    UPDATE system.alerts_parameters
    SET value = 'your SMTP-specific value'
    WHERE name = 'parameter name';
    
  3. Configure alerts_subscriptions

    Add email addresses of the individual(s) or group(s) who are to receive the alerts to the system.alerts_subscriptions table. You can INSERT, UPDATE, and DELETE from this table using standard SQL.

    To see the current list of alert subscriptions:

    SELECT * FROM system.alerts_subscriptions;
    

    To add a new email address:

    INSERT INTO system.alerts_subscriptions
    VALUES ('desired_email@domain_name.com');
    
  4. RESET Alerter

    Any time that changes are made to the system.alerts_parameters or system.alerts_subscriptions table(s), the Alerter must be RESET. Your changes will not take effect until this is done.

    To reset the Alerter:

    ALTER CLUSTER RESET ALERTER;
    

    This will not cause a group change on your deployment.

    If invalid information is provided, you may encounter the following error:

    ALTER CLUSTER RESET ALERTER;
    
    ERROR 1 (HY000): [64512] Bad configuration for alerts:
    

    Check clustrix.log for more information. Here is an example where the smtp_server parameter was not specified:

    2018-10-11 21:07:51.068524 UTC karma068.example.com clxnode:
    ERROR cluster/alerter.ct:219 prepare_write(): Couldn't write alerter
    config: Bad configuration for alerts: No smtp_server specified
    
  5. Request Alert

    To verify that the configuration works properly, execute this SQL to send a test alert:

    SELECT alert(severity, 'alert text');
    

    If you do not receive the expected email alert, review your configuration.
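
Steps 2 through 5 can be run as one short session. The following is a minimal sketch rather than a definitive procedure: the hostname, port, security type, and email address are placeholders, and passing 3 to alert() assumes the function accepts the numeric severity code for INFO.

```sql
-- Step 2: point the Alerter at an SMTP server (all three values are placeholders).
UPDATE system.alerts_parameters
SET value = 'smtp.example.com'
WHERE name = 'smtp_server';

UPDATE system.alerts_parameters
SET value = '587'
WHERE name = 'smtp_port';

UPDATE system.alerts_parameters
SET value = 'TLS'
WHERE name = 'smtp_security';

-- Confirm the SMTP settings before resetting the Alerter.
SELECT name, value
FROM system.alerts_parameters
WHERE name LIKE 'smtp%';

-- Step 3: subscribe a recipient (the address is a placeholder).
INSERT INTO system.alerts_subscriptions
VALUES ('dba-team@example.com');

-- Step 4: apply the changes.
ALTER CLUSTER RESET ALERTER;

-- Step 5: send a test alert.
-- Assumption: alert() takes the numeric severity code (3 = Informational).
SELECT alert(3, 'Test alert: verifying email configuration');
```

If the reset reports a bad configuration, re-run the SELECT on system.alerts_parameters and compare the values against your SMTP server's requirements.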

Sample Emailed Alerts

The following sample alert emails are similar to those you may receive from your deployment. These alerts also appear in the query.log.

Database Space WARNING

This alert is a WARNING for a deployment with a device1 file that is at least 80% full.

Severity: WARNING
Date: 2018-10-02 18:49:24.177250 UTC
Host: clxdb003
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
Message: Database space is 80% used. Soon user queries will fail.
path=/data/clustrix/device1 device_total=4,247,830,372,352
wal_total=1,073,741,824 device_free=327,733,190,656
temp_total_space=161,061,273,600 system_avail=758,480,666,624
system_total=3,757,962,166,272 total_used=2,999,481,499,648 %=80
user_avail=382,684,449,996 user_total=3,382,165,949,644
cont_type=USER trx_type=USER
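
The figures in this sample can be cross-checked with simple arithmetic. The sketch below assumes that the reported percentage is total_used divided by system_total, and that system_avail is their difference. Both are inferences from the numbers in the sample, not something the alert states.

```sql
-- Values copied from the sample alert above, with thousands separators removed.
SELECT 3757962166272 - 2999481499648 AS system_avail_check,
       ROUND(2999481499648 / 3757962166272 * 100) AS percent_used;
```

The difference is 758,480,666,624, which matches system_avail in the sample. The ratio is roughly 79.8%, which rounds to the 80 reported in the message.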

Backup INFO

This INFO alert shows that the backup has failed.

This particular sample includes additional information that is available for deployments running in AWS.

Severity: INFO
Date: 2018-09-25 23:42:59.798249 UTC
Host: clxdb005
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
EC2 Region: us-west-2a
EC2 Instance ID: i-0882894eb6aa887ac
Message: [SQL] backup-25-09-2018 ERROR 2018-09-25 22:52:02

Read ERROR

This ERROR alert indicates that your system's disk is experiencing hardware failures. Contact Support for suggestions.

Severity: ERROR
Date: 2018-09-09 13:18:25.769801 UTC
Host: clxdb001
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
Message: Error reading 32768 bytes at offset 0x1d7367d0000 of
"/data/clustrix/device1": Input/output error

Additional Information

Alert Severity Codes

Severity Code | Meaning
--- | ---
0 | Critical
1 | Error
2 | Warning
3 | Informational

Alerting Conditions

These are the conditions that MariaDB Xpand monitors and for which alerts are issued. These alerts are predefined within the database (system.alerts_messages) and may not be changed. The severity of these alerts ranges from critical to informational.

Name | Summary | Message
--- | --- | ---
ACTIVATION_FAILED | Activation Failed | Activation of device &device1 failed
DATABASE_SPACE_CRITICAL | Database space critical | Database space is &percent used. User queries will fail, and soon system queries will fail.
DATABASE_SPACE_EXHAUSTED | Database space exhausted | Database space is &percent used. User queries and system queries will now fail.
DATABASE_SPACE_EXTREME | Database space extreme | Database space is &percent used. User queries will now fail.
DATABASE_SPACE_LOW | Database space low | Database space is &percent used. Soon user queries will fail.
DDL_TOO_LONG | DDL lock has been held for too long | The DDL lock has been held for too long. While it is held, all new DDL transactions will block.
DEVICE_DEACTIVATED | Device Deactivated | Deactivating device &device1
DM_READ_ERROR | Device Manager Read Error | Error reading &bytes bytes at offset &offset
HOST_FILE_ERROR | Error writing host files | &error
EXCESSIVE_CLOCK_SKEW | Excessive Clock Skew | Clock skew from nid &node_id to &node_id is &seconds seconds. Is NTP set up and working?
AUTO_RESIZE_FAILED | Failed to automatically resize devices | Not enough room to extend device: node &node_id vdev &number only has &number bytes free, maximum resize is &number
INACCESSIBLE_TABLES | Inaccessible Tables | The following is/are not fully accessible in this cluster: &table_name, &table_name...
INSUFFICIENT_REPROTECT_MEMORY | Insufficient memory for reprotection | Not enough memory to reprotect if another node is lost: &percent memory table usage (without softfailed nodes) is greater than max &percent
INSUFFICIENT_REPROTECT_NODES | Insufficient nodes for reprotection | Not enough nodes to reprotect if another node is lost
INSUFFICIENT_REPROTECT_SPACE | Insufficient space for reprotection | Not enough space to reprotect if another node is lost: &percent usage (without softfailed nodes) is greater than max &percent
LICENSE_INVALID | License is invalid | Invalid license installed
LICENSE_NEAR_EXPIRATION | License is nearing expiration | License will expire at: (&expiration)
LOST_QUORUM | Lost Quorum | Node &node_id lost quorum for group &group_id
MEMORY_TABLE_SPACE_CRITICAL | Memory table space critical | Memory table space is &percent used. User queries will fail, and soon system queries will fail.
MEMORY_TABLE_SPACE_EXHAUSTED | Memory table space exhausted | Memory table space is &percent used. User queries will now fail.
MEMORY_TABLE_SPACE_EXTREME | Memory table space extreme | Memory table space is &percent used. User queries will now fail.
MEMORY_TABLE_SPACE_LOW | Memory table space low | Memory table space is &percent used. Soon user queries may fail.
NEW_GROUP | New Group | Node &node_id has new group &group_id
ZONES_UNSPECIFIED | Node zone unspecified | Zones are configured for some, but not all nodes in this cluster. A zone must be specified for node &node_id
PARTIAL_WRITE_RECOVERED | Partial write recovered | A partial write was detected and recovered. Some space will be unusable unless the node is softfailed, reformatted, and re-added. No immediate action is necessary.
DBSTART_SPACE_PAUSE | Pausing dbstart due to space exhaustion | No space left for system transactions; not resulting continuation, awaiting cp command
PROTECTION_LOST | Protection Lost | Full protection lost for some data; queuing writes for down node; reprotection will begin in &seconds seconds if node has not recovered
PROTECTION_RESTORED | Protection Restored | Full protection restored for all data after &seconds seconds
SLAVE_RESTART | Slave Restart | Restarting mysqlslave &slave_name
SLAVE_STOP | Slave Stopped | Stopped mysqlslave &slave_name on non-transient error: &Error
USER | User Invoked From SQL | &SQL_error
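
To see the definitions that apply to your own deployment, you can query the table directly. This is a sketch that assumes the table has columns named summary and message, as the references to system.alerts_messages.summary and system.alerts_messages.message in this document suggest. If the layout differs, SELECT * is the safer starting point.

```sql
-- Read-only query: the predefined alert definitions cannot be changed.
-- Assumption: summary and message are column names in system.alerts_messages.
SELECT summary, message
FROM system.alerts_messages;
```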

Pre-configured alerts_parameters

These additional entries from the system.alerts_parameters table are pre-configured and shown here for information only.

Some of these parameters include "meta tags" to denote that metadata contents will be substituted in the alert content when that parameter is used. The meta tags are explained in the next section.

Parameter | Value
--- | ---
body_max_chars | 50000
email_body | (multi-line template, shown after this table)
email_encoding | quoted-printable
email_subject | ${alerts_name} [${severity}] ${summary}
smtp_sender | ${alerts_name} CLX Log Alert
subject_max_chars | 100

Value of email_body:

Severity: ${severity}
Date: ${date} ${tz}
Host: ${host}
Cluster: ${cluster_name}
Version: ${version}
OS Version: ${OS_version}
Message: ${message}
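
To check the values currently in effect on your deployment, you can read the table using the same name and value columns that the UPDATE syntax in the configuration steps relies on. A minimal sketch:

```sql
-- List every Alerter parameter, including the SMTP settings and the email templates.
SELECT name, value
FROM system.alerts_parameters
ORDER BY name;
```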

Metadata used in alerts_parameters

The alert parameters sometimes contain metadata that is identified by "meta tags". These meta tags cause real-time information to be substituted within a generated alert.

The following chart shows how each meta tag will be resolved whenever it is used.

Meta tag | Resolves to
--- | ---
{alerts_name} | Concatenation of deployment name and customer name.
{cluster_name} | Name for the deployment from the global cluster_name.
{customer_name} | Name of the customer as identified in the global customer_name.
{date} | The system's current_timestamp.
{group} | ID of the current deployment group.
{host} | Name of host sending the alert.
{message} | Text of the error message from system.alerts_messages.message
{OS_version} | Operating system version.
{severity} | Severity level of the alert: 0 - CRITICAL, 1 - ERROR, 2 - WARNING, 3 - INFO
{summary} | Short form of the error message from system.alerts_messages.summary
{tz} | System time zone from global variable system_time_zone.
{version} | Software version from global variable version.
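
As an illustration of how meta tags resolve, the query below applies plain string replacement to the pre-configured email_subject template. It only mimics the substitution and is not how the Alerter is implemented. The alerts_name value 'Dogfood7 ExampleCo' is hypothetical: the customer name is invented, and the exact format of the concatenation is an assumption.

```sql
-- Mimic meta tag substitution for a DATABASE_SPACE_LOW warning.
-- 'Dogfood7 ExampleCo' is a made-up alerts_name value.
SELECT REPLACE(
         REPLACE(
           REPLACE('${alerts_name} [${severity}] ${summary}',
                   '${alerts_name}', 'Dogfood7 ExampleCo'),
           '${severity}', 'WARNING'),
         '${summary}', 'Database space low') AS resolved_subject;
```

With these inputs the resolved subject reads: Dogfood7 ExampleCo [WARNING] Database space low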