Point-in-time recovery

Point-in-time recovery (PITR) is a feature that allows you to restore a MariaDB instance to a specific point in time. To achieve this, it combines a full base backup with the binary logs that record all changes made to the database after the backup was taken. The operator fully automates this process, covering both archival and restoration up to a specific time, ensuring business continuity and reducing RTO and RPO.

Supported MariaDB versions and topologies

The operator uses mariadb-binlog to replay binary logs; in particular, it filters binlog events by passing a GTID to mariadb-binlog via the --start-position flag. This is only supported by MariaDB server 10.8 and later, so make sure you are using a compatible MariaDB version.

Regarding supported MariaDB topologies, at the moment binary log archiving and point-in-time recovery are only supported by the asynchronous replication topology, which already relies on the binary logs for replication. Galera and standalone topologies will be supported in upcoming releases.

Storage types

Full base backups and binary logs can be stored in the object storage types supported by the operator. For additional details on configuring storage, please refer to the storage types section in the physical backup documentation; the same settings are applicable to the PointInTimeRecovery object.

Configuration

To be able to perform a point-in-time restoration, a PhysicalBackup should be configured as the full base backup. For example, you can configure a nightly backup:

apiVersion: enterprise.mariadb.com/v1alpha1
kind: PhysicalBackup
metadata:
  name: physicalbackup-daily
spec:
  mariaDbRef:
    name: mariadb-repl
  schedule:
    cron: "0 0 * * *"
    suspend: false
    immediate: true
  compression: bzip2
  maxRetention: 720h 
  storage:
    s3:
      bucket: physicalbackups
      prefix: mariadb
      endpoint: minio.minio.svc.cluster.local:9000
      region: us-east-1
      accessKeyIdSecretKeyRef:
        name: minio
        key: access-key-id
      secretAccessKeySecretKeyRef:
        name: minio
        key: secret-access-key
      tls:
        enabled: true
        caSecretKeyRef:
          name: minio-ca
          key: ca.crt

Refer to the full base backup section for additional details on how to configure the full base backup.

The next step is to configure the aspects common to both binary log archiving and point-in-time restoration by defining a PointInTimeRecovery object:

  • physicalBackupRef: It is a reference to the PhysicalBackup resource used as full base backup. See full base backup.

  • storage: Object storage configuration for binary logs. See storage types.

  • compression: Algorithm to be used for compressing binary logs. It is disabled by default. See compression.

  • archiveTimeout: Maximum duration for the binary log archival. If exceeded, the agent will return an error and the archival will be retried in the next archive cycle. Defaults to 1h.

  • archiveInterval: Interval at which the binary logs will be archived. Defaults to 10m. See archival for additional details.

  • maxParallel: Maximum number of workers that can be used for parallel binary log archival and restoration. Defaults to 1. See parallelization.

  • maxRetention: Maximum retention duration for binary logs. By default, binary logs are not automatically deleted. See retention policy.

  • strictMode: Controls the behavior when a point-in-time restoration cannot reach the exact target time. It is disabled by default. See strict mode.
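Putting these fields together, a PointInTimeRecovery object could look like the following sketch; the storage settings mirror the earlier PhysicalBackup example, and all values are illustrative:

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: PointInTimeRecovery
metadata:
  name: pitr
spec:
  # Full base backup used as the starting point for replaying binlogs
  physicalBackupRef:
    name: physicalbackup-daily
  # Object storage for the archived binary logs
  storage:
    s3:
      bucket: binlogs
      prefix: mariadb
      endpoint: minio.minio.svc.cluster.local:9000
      region: us-east-1
      accessKeyIdSecretKeyRef:
        name: minio
        key: access-key-id
      secretAccessKeySecretKeyRef:
        name: minio
        key: secret-access-key
  compression: gzip
  archiveTimeout: 1h
  archiveInterval: 10m
  maxParallel: 4
  maxRetention: 720h
  strictMode: true
```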

With this configuration in place, you can enable binary log archival in a MariaDB instance by setting a reference to the PointInTimeRecovery object:
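A minimal sketch of such a reference, assuming a PointInTimeRecovery object named pitr (the replication settings shown are illustrative):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-repl
spec:
  replicas: 3
  replication:
    enabled: true
  # Enables binary log archival using the PointInTimeRecovery configuration
  pointInTimeRecoveryRef:
    name: pitr
```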

Once a full base backup has been completed and the binary logs have been archived, you can perform a point-in-time restoration. For example, you can create a new MariaDB instance with the following configuration:
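As a sketch, assuming the PointInTimeRecovery object is named pitr and the timestamp used is recoverable (both values illustrative):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-restored
spec:
  replicas: 3
  replication:
    enabled: true
  bootstrapFrom:
    pointInTimeRecoveryRef:
      name: pitr
    # Desired point in time to restore to, in RFC3339 format
    targetRecoveryTime: "2026-02-27T20:10:42Z"
```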

Refer to the point-in-time restoration section for additional details.

Full base backup

To enable point-in-time recovery, a PhysicalBackup resource should be configured as full base backup. The backup should be a complete snapshot of the database at a specific point in time, and it will serve as the starting point for replaying the binary logs. Any of the supported backup strategies can be used as full base backup, as all of them provide a consistent snapshot of the database and a starting GTID position.

It is very important to note that a full physical backup must be completed before a point-in-time restoration can be performed. This is something that the operator accounts for when computing the last recoverable time.

To further extend the last recoverable time, it is recommended to take a physical backup after the primary Pod has changed. This can be automated by setting schedule.onPrimaryChange, as documented in the physical backup docs:
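A sketch of such a schedule, extending the earlier PhysicalBackup example (the onPrimaryChange placement follows the physical backup docs):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: PhysicalBackup
metadata:
  name: physicalbackup-daily
spec:
  mariaDbRef:
    name: mariadb-repl
  schedule:
    cron: "0 0 * * *"
    # Take an additional backup whenever the primary Pod changes
    onPrimaryChange: true
```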

Alternatively, you can schedule an on-demand physical backup or rely on the cron scheduling to do so.

The backup taken in the new primary will establish a baseline for a new binlog timeline, which will be expanded when new binary logs are archived.

Archival

The mariadb-enterprise-operator sidecar agent will periodically check for new binary logs and archive them to the configured object storage. The archival process is controlled by the archiveInterval and archiveTimeout settings in the PointInTimeRecovery configuration, which determine how often the archival process runs and how long it can take before it is considered failed.

The archival process is performed on the primary Pod in the asynchronous replication topology. You can check the logs of the agent sidecar container, the Kubernetes events, and the status of the MariaDB objects to monitor the current status of the archival process.

There are a couple of important considerations regarding binary log archival:

  • The archival process should start from a clean state, which means that the object storage should be empty at the time of the first archival.

  • It is not recommended to set archiveInterval to a very low value (< 1m), as it can lead to increased load on the database Pod and the storage system.

  • If the archival process fails (e.g., due to network issues or storage unavailability), it will be retried in the next archive cycle.

  • If the binlog_expire_logs_seconds server variable is configured, it should be set to a value higher than the archiveInterval to prevent automatic deletion of binary logs before they are archived.

  • Manually executing the PURGE BINARY LOGS command on the database is not recommended, as it can lead to inconsistencies between the database and the archived binary logs.

  • Manually executing the FLUSH BINARY LOGS command on the database is compatible with the archival process: it will force the active binary log to be closed, and the log will be archived by the agent in the next archive cycle.

Binary log size

The server has a default max_binlog_size of 1GB, which means that a new binary log file will be created once the current one reaches that size. This is a sensible default for most cases, but it can be adjusted based on the data volume in order to enable faster archival, and therefore a reduced RPO:

| Environment | Recommended Size | Rationale |
| --- | --- | --- |
| Low Traffic | 128MB | Keeps file size minimal for slow-growing logs. |
| Standard | 256MB | Balances rotation frequency with server overhead. |
| High Throughput | 512MB - 1GB | Reduces the contention caused by frequent rotations in write-heavy environments. |

The smaller the binlog file size, the more frequently the files will be rotated and archived, which can lead to increased load on the database Pod and the storage system. On the other hand, setting a very high binlog file size can lead to longer archival times and increased RPO.

Refer to the configuration documentation for instructions on how to set the max_binlog_size server variable in the MariaDB instance.
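As a sketch — assuming the MariaDB resource exposes a myCnf field for passing server configuration, which is an assumption to verify against the configuration documentation — a 256MB limit could look like this:

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-repl
spec:
  # Inline server configuration; adjusts the binlog rotation size
  myCnf: |
    [mariadb]
    max_binlog_size=256M
```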

Compression

In order to reduce storage usage and save bandwidth during archival and restoration, the operator supports compressing the binary log files. Compression is enabled by setting the compression field in the PointInTimeRecovery configuration:
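For example, a sketch enabling gzip compression (other fields omitted):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: PointInTimeRecovery
metadata:
  name: pitr
spec:
  # Binary logs will be compressed with gzip before upload
  compression: gzip
```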

The supported compression algorithms are:

  • bzip2: Good compression ratio, but slower compression/decompression speed compared to gzip.

  • gzip: Good compression/decompression speed, but worse compression ratio compared to bzip2.

  • none: No compression.

Compression is disabled by default, and there are some important considerations before enabling it:

  • Compression is immutable: once binary logs have been archived with a specific algorithm, it cannot be changed. This also applies to restoration, where the same compression algorithm must be configured as the one used for archival.

  • Although it saves storage space and bandwidth, the restoration process may take longer when compression is enabled, leading to an increased RTO. This can be mitigated by enabling parallelization.

Server-Side Encryption with Customer-Provided Keys (SSE-C) for S3

When using S3-compatible storage, you can enable server-side encryption using your own encryption key (SSE-C) by providing a reference to a Secret containing a 32-byte (256-bit) key encoded in base64:
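A sketch of such a Secret; the Secret name and key below are illustrative, and the exact field used to reference it from the S3 configuration depends on your operator version:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: sse-customer-key
type: Opaque
stringData:
  # Must be exactly 32 bytes (256 bits); Kubernetes stores it base64-encoded.
  # Illustrative value only — generate your own random key.
  key: 0123456789abcdef0123456789abcdef
```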


When replaying SSE-C encrypted binary logs via bootstrapFrom, the same key must be provided in the S3 configuration.

Parallelization

Several tasks during both the archival and restoration processes can take a significant amount of time, especially when managing large data volumes. These tasks include compressing and uploading binary logs during archival, and downloading and decompressing binary logs during restoration. This can lead to longer archival and restoration times, which can impact the RTO.

To mitigate this, the operator supports parallelization of these tasks by using multiple workers. The maximum number of workers can be configured via the maxParallel field in the PointInTimeRecovery configuration:
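For example, a sketch configuring 4 workers (other fields omitted):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: PointInTimeRecovery
metadata:
  name: pitr
spec:
  # Up to 4 binary logs processed in parallel
  maxParallel: 4
```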

This will create up to 4 workers, each of them responsible for the operations related to a single binary log, which means that up to 4 binary logs can be processed in parallel. This can significantly reduce archival and restoration times, especially when compression is enabled.

Parallelization is disabled by default (maxParallel: 1), and there are some important considerations to be taken into account when enabling it:

  • During archival, the workers will be spawned in the agent sidecar container, sharing storage with the primary database Pod. Using an elevated number of workers can exhaust the IOPS and/or CPU resources of the primary Pod, which can impact the performance of the database.

  • During both archival and restoration, using an elevated number of workers can saturate the network bandwidth when pulling/pushing multiple binary logs in parallel, something that can degrade the performance of the database.

Retention policy

Binary logs can grow significantly in size, especially in write-heavy environments, which can lead to increased storage costs. To mitigate this, the operator supports automatic purging of binary logs based on a retention policy defined by the maxRetention field in the PointInTimeRecovery configuration:
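For example, a sketch retaining binary logs for 30 days (other fields omitted):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: PointInTimeRecovery
metadata:
  name: pitr
spec:
  # Binary logs older than 30 days are purged after each archival cycle
  maxRetention: 720h
```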

The binary logs that exceed the defined retention will be automatically deleted from the object storage after each archival cycle.

By default, binary logs are never purged from object storage, and there are a few considerations regarding configuring a retention policy:

  • The date of the last event in a binary log is used to determine its age, and therefore whether it should be purged or not.

  • The maxRetention field should not be set to a value lower than the archiveInterval, as it can lead to situations where binary logs are purged before they can be archived.

Binlog inventory

The operator maintains an inventory of the archived binary logs in an index.yaml file located at the root of the configured object storage. This file contains a list of all the archived binary logs for each server, along with their GTIDs and other metadata used internally.

This file is used internally by the operator to keep track of the archived binary logs, and it is updated after each successful archival. It should not be modified manually, as doing so can lead to inconsistencies between the actual archived binary logs and the inventory.

When it comes to point-in-time restoration, this file serves as a source of truth to compute the binlog timeline and the last recoverable time.

Binlog timeline and last recoverable time

Taking into account the GTID of the last completed physical backup and the archived binlogs in the inventory, the operator computes a timeline of binary logs that can be replayed, along with the corresponding last recoverable time. The last recoverable time is the latest timestamp that the MariaDB instance can be restored to. This information is crucial for understanding the RPO of the system and for making informed decisions during a recovery process.

You can easily check the last recoverable time by looking at the status of the PointInTimeRecovery object.

Then, you may provide exactly this timestamp, or an earlier one, as target recovery time when bootstrapping a new MariaDB instance, as described in the point-in-time restoration section.

Point-in-time restoration

In order to perform a point-in-time restoration, you can create a new MariaDB instance with a reference to the PointInTimeRecovery object in the bootstrapFrom field, along with the targetRecoveryTime field indicating the desired point-in-time to restore to.

Before setting the targetRecoveryTime, it is recommended to check the last recoverable time in the PointInTimeRecovery object. The bootstrapFrom field supports the following settings:

  • pointInTimeRecoveryRef: Reference to the PointInTimeRecovery object that contains the configuration for the point-in-time recovery.

  • targetRecoveryTime: The desired point in time to restore to. It should be in RFC3339 format. If not provided, the current time will be used as target recovery time, which means restoring up to the last recoverable time.

  • restoreJob: Compute resources and metadata configuration for the restoration job. To reduce RTO, it is recommended to properly tune compute resources.

  • logLevel: Log level for the operator container, part of the restoration job.

The restoration process will match the closest physical backup before or at the targetRecoveryTime, and then it will replay the archived binary logs from the backup GTID position up until the targetRecoveryTime.

The restoration process includes the following steps:

  1. Perform a rolling restore of the full base backup, one Pod at a time.

  2. Configure replication in the MariaDB instance.

  3. Get the base backup GTID, to be used as the starting point for replaying the binary logs.

  4. Schedule the point-in-time restoration job, which will:

    1. Build the binlog timeline based on the base backup GTID and the archived binary log inventory.

    2. Pull the binary logs in the timeline into a staging area.

    3. Replay the binary logs using mariadb-binlog from the GTID position of the base backup up to the targetRecoveryTime.

After the restoration process has completed, status conditions will be available for you to inspect its outcome.

Strict mode

The strict mode controls whether the target recovery time provided during the bootstrap process should be strictly met or not. This is configured via the strictMode field in the PointInTimeRecovery configuration, and it is disabled by default:
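For example, a sketch enabling strict mode (other fields omitted):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: PointInTimeRecovery
metadata:
  name: pitr
spec:
  # Fail early if the target recovery time cannot be met exactly
  strictMode: true
```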

When strict mode is enabled (recommended), if the target recovery time cannot be met, the initialization process will return an error early, and the MariaDB instance will not be created. This can happen, for example, if the target recovery time is later than the last recoverable time. Let's assume strict mode is enabled and the last recoverable time is 2026-02-27T20:10:42Z.

If we attempt to provision the following MariaDB instance:
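A sketch of such an instance, assuming the PointInTimeRecovery object is named pitr:

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-restored
spec:
  bootstrapFrom:
    pointInTimeRecoveryRef:
      name: pitr
    # Later than the last recoverable time, so strict mode will reject it
    targetRecoveryTime: "2026-02-28T20:10:42Z"
```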

An error will be returned, as the target recovery time 2026-02-28T20:10:42Z is later than the last recoverable time 2026-02-27T20:10:42Z.

When strict mode is disabled (default) and the target recovery time cannot be met, the MariaDB provisioning will proceed and the last recoverable time will be used instead. In this scenario, if we attempt to create the same MariaDB instance as before, it will be successfully provisioned, but with a recovery time of 2026-02-27T20:10:42Z (the last recoverable time) instead of the requested 2026-02-28T20:10:42Z.

It is important to note that the last recoverable time is stored in the status field of the PointInTimeRecovery object. Therefore, if this object is deleted and recreated, the last recoverable time metadata will be lost, and it will not be available until it is recomputed. When it comes to restoration, this implies that the error will be returned later in the process, when computing the binary log timeline, but the strict mode behavior still applies.

Staging storage

The operator uses a staging area to temporarily store the binary logs during the restoration process. By default, the staging area is an emptyDir volume attached to the restoration job, which means that the binary logs are kept on the storage of the node where the job has been scheduled. This may not be suitable for large binary logs, as it can exhaust the node's storage, causing the restoration process to fail and potentially impacting other workloads running on the same node.

You can configure an alternative staging area via the stagingStorage field under the bootstrapFrom section of the MariaDB resource:
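A sketch, assuming stagingStorage accepts a persistentVolumeClaim spec (the size and access mode shown are illustrative):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-restored
spec:
  bootstrapFrom:
    pointInTimeRecoveryRef:
      name: pitr
    targetRecoveryTime: "2026-02-27T20:10:42Z"
    # PVC-backed staging area instead of the default emptyDir
    stagingStorage:
      persistentVolumeClaim:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
```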

This will provision a PVC and attach it to the restoration job to be used as staging area.

Limitations

  • A PointInTimeRecovery object can only be referenced by a single MariaDB object via the pointInTimeRecoveryRef field.

  • A combination of object storage bucket + prefix can only be used by a single MariaDB instance to archive binary logs.

Troubleshooting

The operator tracks the current archival status under the MariaDB status subresource. This status is updated after each archival cycle, and it contains metadata about the binary logs that have been archived, along with other useful information for troubleshooting.

Additionally, also under the status subresource, the operator sets status conditions whenever a specific state of the binlog archival or point-in-time restoration process is reached.

The operator also emits Kubernetes events during both the archival and restoration processes, to report outstanding events or errors.

Common errors

Unable to start archival process

An error will be reported if the archival process is configured pointing to a non-empty object storage, as the operator expects to start from a clean state.

To solve this, you can update the PointInTimeRecovery configuration to point to another object storage bucket or prefix that is empty:
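For example, a sketch pointing the archival at an empty prefix (bucket and prefix values illustrative, other fields omitted):

```yaml
apiVersion: enterprise.mariadb.com/v1alpha1
kind: PointInTimeRecovery
metadata:
  name: pitr
spec:
  storage:
    s3:
      bucket: binlogs
      # Must be empty at the time of the first archival
      prefix: mariadb-new
```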

After updating the PointInTimeRecovery configuration, the error will be cleared in the next archival cycle, and a new archival operation will be attempted.

Alternatively, you can also consider deleting the existing binary logs and the index.yaml inventory file, but only after double-checking that they are not needed for recovery.

Target recovery time is after latest recoverable time

This error is returned during the MariaDB init process, when the targetRecoveryTime provided in bootstrapFrom is later than the last recoverable time reported by the PointInTimeRecovery status.

For example, if you have configured the bootstrapFrom.targetRecoveryTime field with the value 2026-02-28T20:10:42Z, and the last recoverable time is 2026-02-27T20:10:42Z, this error will be returned.

There are two ways to solve this issue:

  • Update the targetRecoveryTime in the MariaDB resource to be earlier than or equal to the last recoverable time, which in this case is 2026-02-27T20:10:42Z.

  • Disable strictMode in the PointInTimeRecovery configuration, allowing to restore up until the latest recoverable time, in this case 2026-02-27T20:10:42Z.

Invalid binary log timeline: error getting binlog timeline between GTID and target time: timeline did not reach target time

This error is returned when computing the binary log timeline during the restoration process, and it means that the operator could not build a timeline that reaches the targetRecoveryTime provided in the bootstrapFrom field of the MariaDB resource.

For example, suppose the last event in your archived binary log inventory has a timestamp of 2026-02-27T16:04:15Z, while your targetRecoveryTime is 2026-02-28T20:10:42Z: the timeline cannot reach the target time, and the restoration job will fail with this error.

There are two ways to solve this issue:

  • Update the targetRecoveryTime in the MariaDB resource to be earlier than or equal to the last recoverable time, which in this case is 2026-02-27T16:04:15Z.

  • Disable strictMode in the PointInTimeRecovery configuration, allowing to restore up until the latest recoverable time, in this case 2026-02-27T16:04:15Z.
