MaxScale 23.08 LDI Filter

LDI Filter

The ldi (LOAD DATA INFILE) filter was introduced in MaxScale 23.08.0 and it extends the MariaDB LOAD DATA INFILE syntax to support loading data from any object storage that supports the S3 API. This includes cloud offerings like AWS S3 and Google Cloud Storage as well as locally run services like Minio.

If the filename starts with either S3:// or gs://, the path is interpreted as a S3 object file. The prefix is case-insensitive. For example, the following command would load the file my-data.csv from the bucket my-bucket into the table t1.

LOAD DATA INFILE 'S3://my-bucket/my-data.csv' INTO TABLE t1
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

How to Upload Data

Here is a minimal configuration for the filter that can be used to load data from AWS S3:

[LDI-Filter]
type=filter
module=ldi
host=s3.amazonaws.com
region=us-east-1

The first step is to move the file to be loaded into the same region that MaxScale and the MariaDB servers are in. One factor in the speed of the upload is the network latency and minimizing it by moving the source and the destination closer improves the data loading speed.

The next step is to connect to MaxScale and prepare the session for an upload by providing the service account access and secret keys.

SET @maxscale.ldi.s3_key='<my-access-key>', @maxscale.ldi.s3_secret='<my-secret-key>';

Once the credentials are configured, the data loading can be started:

LOAD DATA INFILE 'S3://my-bucket/my-data.csv' INTO TABLE t1;

Data Uploads with MariaDB Xpand

For MariaDB Xpand server, the upload is done using xpand_import. In this case, xpand_import must be installed locally on the MaxScale server and must be in the executable path of the maxscale user.

In addition, the @maxscale.ldi.import_user and @maxscale.ldi.import_password variables must be set to the username and password that are used to load data into the Xpand cluster.

SET @maxscale.ldi.import_user='<user>', @maxscale.ldi.import_password='<password>';

If a LOAD DATA LOCAL INFILE command is executed with an Xpand cluster, the data is redirected into xpand_import instead of directly to the Xpand nodes. This will speed up data imports into Xpand. For this mode, only the @maxscale.ldi.import_user and @maxscale.ldi.import_password variables must be set, the other S3 related variables are ignored.

If xpand_import is not installed locally, the LOAD DATA INFILE and LOAD DATA LOCAL INFILE commands will not use xpand_import and use the normal LOAD DATA LOCAL INFILE that's used with MariaDB behavior.

Common Problems With Data Loading

Missing Files

If you are using self-hosted object storage programs like Minio, a common problem is that they do not necessarily support the newer virtual-hosted-style requests that is used by AWS. This usually manifests as an error either about a missing file or a missing bucket.

If the host parameter is set to a hostname, it's assumed that the object storage supports the newer virtual-hosted-style requests. If this not the case, the filter must be configured with protocol_version=1.

Conversely, if the host parameter is set to a plain IP address, it is assumed that it does not support the newer virtual-hosted-style request. If the host does support it, the filter must be configured with protocol_version=2.

Configuration Parameters

key

  • Type: string
  • Mandatory: No
  • Dynamic: Yes

The S3 access key used to perform all requests to it.

This must be either configured in the MaxScale configuration file or set with SET @maxscale.ldi.s3_key='<key>' before starting the data load.

secret

  • Type: string
  • Mandatory: No
  • Dynamic: Yes

The S3 secret key used to perform all requests to it.

This must be either configured in the MaxScale configuration file or set with SET @maxscale.ldi.s3_secret='<secret>' before starting the data load.

region

  • Type: string
  • Mandatory: No
  • Dynamic: Yes
  • Default: us-east-1

The S3 region where the data is located.

The value can be overridden with SET @maxscale.ldi.s3_region='<region>' before starting the data load.

host

  • Type: string
  • Mandatory: No
  • Dynamic: Yes
  • Default: s3.amazonaws.com

The location of the S3 object storage. By default the original AWS S3 host is used. The corresponding value for Google Cloud Storage is storage.googleapis.com.

The value can be overridden with SET @maxscale.ldi.s3_host='<host>' before starting the data load.

port

  • Type: integer
  • Mandatory: No
  • Dynamic: Yes
  • Default: 0

The port on which the S3 object storage is listening. If unset or set to the value of 0, the default S3 port is used.

The value can be overridden with SET @maxscale.ldi.s3_port=<port> before starting the data load. Note that unlike the other values, the value for this variable must be an SQL integer and not an SQL string.

no_verify

  • Type: boolean
  • Mandatory: No
  • Dynamic: Yes
  • Default: false

If set to true, TLS certificate verification for the object storage is skipped.

use_http

  • Type: boolean
  • Mandatory: No
  • Dynamic: Yes
  • Default: false

If set to true, communication with the object storage is done unencrypted using HTTP instead of HTTPS.

protocol_version

  • Type: integer
  • Mandatory: No
  • Dynamic: Yes
  • Default: 0
  • Values: 0, 1, 2

Which protocol version to use. By default the protocol version is derived from the value of host but this automatic protocol version deduction will not always produce the correct result. For the legacy path-style requests used by older S3 storage buckets, the value must be set to 1. All new buckets use the protocol version 2.

For object storage programs like Minio, the value must be set to 1 as the bucket name cannot be resolved via the subdomain like it is done for object stores in the cloud.

import_user

  • Type: string
  • Mandatory: No
  • Dynamic: Yes

The Xpand user that will be used to import the data. This parameter must be defined if the data is being uploaded to an Xpand cluster.

The value can be overridden with SET @maxscale.ldi.import_user='<user>' before starting the data load.

import_password

  • Type: password
  • Mandatory: No
  • Dynamic: Yes

The password for the Xpand user that will be used to import the data. This parameter must be defined if the data is being uploaded to an Xpand cluster. The password can be encrypted with maxpasswd before use.

The value can be overridden with SET @maxscale.ldi.import_password='<password>' before starting the data load.

Comments

Comments loading...
Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.