Using the S3 Storage Engine - Source

Revision: 125994
User: Michael Widenius
Date: 2023-05-22 16:46

<<include slug="s3-status">>
<<toc>>

The [[s3-storage-engine|S3 storage engine]] is read only and allows one to archive MariaDB
tables in Amazon S3, or any third-party public or private cloud that
implements S3 API (of which there are many), but still have them
accessible for reading in MariaDB.

== Installing the Plugin
As of MariaDB 10.5.7, the S3 storage engine is currently [[release-criteria|gamma maturity]], so the following step can be omitted.

On earlier releases, when it was [[release-criteria|alpha maturity]], it will not load by default on a stable release of the server due to the default value of the [[server-system-variables/#plugin_maturity|plugin_maturity]] variable. Set to ##alpha## (or below) in your config file to permit installation of the plugin:

<<code>>
[mysqld]
plugin-maturity = alpha
<</code>>
and restart the server.

Now [[plugin-overview/#installing-a-plugin|install the plugin library]], for example:
<<sql>>
INSTALL SONAME 'ha_s3';
<</sql>>

If the library is not available, for example: 
<<code>>
INSTALL SONAME 'ha_s3';
ERROR 1126 (HY000): Can't open shared library '/var/lib/mysql/lib64/mysql/plugin/ha_s3.so' 
  (errno: 13, cannot open shared object file: No such file or directory)
<</code>>

you may need to install a separate package for the S3 storage engine, for example: 
<<sql>>
shell> yum install MariaDB-s3-engine
<</sql>>

== Moving Data to S3
To move data from an existing table to S3, one can run:
<<code>>
ALTER TABLE old_table ENGINE=S3 COMPRESSION_ALGORITHM=zlib
<</code>>

To get data back to a 'normal' table one can do:

<<code>>
ALTER TABLE s3_table ENGINE=INNODB
<</code>>

== New Options for [[alter-table|ALTER TABLE]]

* **##S3_BLOCK_SIZE## :** Set to 4M as default. This is the block size for all index and data pages stored in S3.
* **##COMPRESSION_ALGORITHM## :** Set to 'none' as default. Which compression algorithm to use for block stored in S3.  Options are: ##none## or ##zlib##.

[[alter-table|ALTER TABLE]] can be used on S3 tables as normal to add columns or change column definitions.

== mysqld Startup Options for S3

To be able to use S3 for storage one ***must*** define how to access S3 and where data are stored in S3:

* **[[s3-storage-engine-system-variables#s3_access_key|s3_access_key]]:** The AWS access key to access your data
* **[[s3-storage-engine-system-variables#s3_secret_key|s3_secret_key]]:** The AWS secret key to access your data
* **[[s3-storage-engine-system-variables#s3_bucket|s3_bucket]]: ** The AWS bucket where your data should be stored. All MariaDB table data is stored in this bucket.
* **[[s3-storage-engine-system-variables#s3_region|s3_region]]: ** The AWS region where your data should be stored.

If you are using an S3 service that is using HTTP to connect (like [[https://min.io/]]) you also need the set the following variables:
* **[[s3-storage-engine-system-variables#s3_port|s3_port]]:** Port number to connect to (0 means use default)
* **[[s3-storage-engine-system-variables#s3_use_http|s3_use_http]]:** If true, force use of HTTP protocol

If you are going to use a primary-replica setup, you should look at the following variables:
* **[[s3-storage-engine-system-variables#s3-replicate-alter-as-create-select|s3_replicate_alter_as_create_select]]: ** When converting an S3 table to local table, log all rows in binary log. Defaults to <<code>>TRUE<</code>>. This allows the replica to replicate <<code>>CREATE TABLE .. SELECT FROM s3_table<</code>> even it the replica doesn't have access to the original <<code>>s3_table<</code>>.
* **[[s3-storage-engine-system-variables#s3-slave-ignore-updates|s3_slave_ignore_updates]]: ** Should be set if primary and replica share the same S3 instance. This tells the replica that it can ignore any updates to the S3 tables as they are already applied on the primary. Defaults to <<code>>FALSE<</code>>.

The above defaults assume that the primary and replica don't share the same S3 instance.

Other, less critical options, are:
* **[[s3-storage-engine-system-variables#s3_host_name|s3_host_name]]:** Hostname for the S3 service. "s3.amazonaws.com", Amazon S3 service, by default.
* **[[s3-storage-engine-system-variables#s3_protocol_version|s3_protocol_version]]:**  Protocol used to communication with S3. One of "Auto", "Amazon" or "Original" where "Auto" is the default. If you get errors like "8 Access Denied" when you are connecting to another service provider, then try to change this option.  The reason for this variable is that Amazon has changed some parts of the S3 protocol since they originally introduced it but other service providers are still using the original protocol.
* **[[s3-storage-engine-system-variables#s3_block_size|s3_block_size]]:** Set to 4M as default.  This is the default block size for a table, if not specified in [[create-table|CREATE TABLE]].
* **[[s3-storage-engine-system-variables#s3_pagecache_buffer_size|s3_pagecache_buffer_size]]:** Default 128M. The size of the buffer used for data and index blocks for S3 tables. Increase this to get better index handling (for all reads and multiple writes) to as much as you can afford.

Last some options you probably don't have to ever touch:

* **[[s3-storage-engine-system-variables#s3_pagecache_age_threshold|s3_pagecache_age_threshold]] : ** Default 300: This characterizes the number of hits a hot block has to be untouched until it is considered aged enough to be downgraded to a warm block. This specifies the percentage ratio of that number of hits to the total number of blocks in the page cache.
* **[[s3-storage-engine-system-variables#s3_pagecache_division_limit|s3_pagecache_division_limit]]: ** Default 100. The minimum percentage of warm blocks in key cache.
* **[[s3-storage-engine-system-variables#s3_pagecache_file_hash_size|s3_pagecache_file_hash_size]]: ** Default 512. Number of hash buckets for open files.  If you have a lot of S3 files open you should increase this for faster flush of changes. A good value is probably 1/10 of number of possible open S3 files.
* **[[s3-storage-engine-system-variables/#s3_debug|s3_debug]]: ** Default 0. Generates a trace file from libmarias3 on stderr (mysqld.err) for debugging  the S3 protocol.

== Typical my.cnf Entry for connecting to Amazon S3 service

<<code>>
[mariadb]
s3=ON
s3-bucket=mariadb
s3-access-key=xxxx
s3-secret-key=xxx
s3-region=eu-north-1
s3-host-name=s3.amazonaws.com
# The following is useful if you want to use minio as a S3 server. (https://min.io/)
#s3-port=9000
#s3-use-http=ON

# Primary and replica share same S3 tables.
s3-slave-ignore-updates=1

[aria_s3_copy]
s3-bucket=mariadb
s3-access-key=xxxx
s3-secret-key=xxx
s3-region=eu-north-1
s3-host-name=s3.amazonaws.com
# The following is useful if you want to use minio as a S3 server. (https://min.io/)
#s3-port=9000
#s3-use-http=ON
<</code>>

== Typical my.cnf entry for connecting to a [[https://min.io|minio]] S3 server

<<code>>
[mariadb]
s3=ON
s3-host-name="127.0.0.1"
s3-bucket=storage-engine
s3-access-key=minio
s3-secret-key=minioadmin
s3-port=9000
s3-use-http=ON

[aria_s3_copy]
s3=ON
s3-host-name="127.0.0.1"
s3-bucket=storage-engine
s3-access-key=minio
s3-secret-key=minioadmin
s3-port=9000
s3-use-http=ON
<</code>>

== Typical Usage Case for S3 Tables

The typical use case would be that there exists tables that after some
time would become fairly inactive, but are still important so that they
can not be removed. In that case, an option is to move such a table
to an archiving service, which is accessible through an S3 API.

Notice that S3 means the Cloud Object Storage API defined by Amazon
AWS. Often the whole of Amazon’s Cloud Object Storage is referred to
as S3. In the context of the S3 archive storage engine, it refers to
the API itself that defines how to store objects in a cloud service,
being it Amazon’s or someone else’s. OpenStack for example provides an
S3 API for storing objects.

The main benefit of storing things in an S3 compatible storage is that the
cost of storage is much cheaper than many other alternatives. Many S3
implementations also provide reliable long-term storage.

== Operations Allowed on S3 Tables

* [[alter-table|ALTER TABLE]] S3 supports all types, keys and other options that are supported by the [[aria-storage-engine|Aria]] engine. One can also perform [[alter-table|ALTER TABLE]] on an S3 table to add or modify columns etc.
* [[drop-table|DROP TABLE]]
* [[select|SELECT]]  Any SELECT operations you can perform on a normal table should work with an S3 table.
* [[show-tables|SHOW TABLES]] will show all tables that exist in the current defined S3 location.
* S3 tables can be part of [[partitions-files|partitions]]. See Discovery below.

== Discovery

The S3 storage engine supports full [[table-discovery|MariaDB discovery]]. This means that if
you have the S3 storage engine enabled and properly configured, the
table stored in S3 will automatically be discovered when it's accessed with
[[show-tables|SHOW TABLES]], [[select|SELECT]] or any other operation that
tries to access it. In the case of SELECT, the .frm file from S3 will
be copied to the local storage to speed up future accesses.

When an S3 table is opened for the first time (it's not in the table cache)
and there is a local .frm file, the S3 engine will check if it's still
relevant, and if not, update or delete the .frm file.

This means that if the table definition changes on S3 and it's in the
local cache, one has to execute ##[[flush|FLUSH TABLES]]## to
get MariaDB to notice the change and update the .frm file.

If partitioning S3 tables are used, the partition definitions will also be stored on S3 storage and will be discovered by other servers.

Discovery of S3 tables is not done for tables in the [[the-mysql-database-tables|mysql databases]] to make mysqld boot faster and more securely.

== Replication

S3 works with [[replication-overview|replication]]. One can use replication in two different scenarios:
* The primary and replica share the same S3 storage. In this case the primary will make all changes to the S3 data and the replica will ignore any changes in the replication stream to S3 data . This scenario is achieved by setting [[s3-storage-engine-system-variables#s3-slave-ignore-updates|s3_slave_ignore_updates]] to 1.
* The primary and replica don't share the same S3 storage or the replica uses another storage engine for the S3 tables. This scenario is achieved by setting [[s3-storage-engine-system-variables#s3-slave-ignore-updates|s3_slave_ignore_updates]] to 0.

== aria_s3_copy

[[aria_s3_copy]] is an external tool that one can use to copy [[aria-storage-engine|Aria]] tables to and from S3.  Use ##aria_s3_copy --help## to get the options of how to use it.

== mariadb-dump

* [[mariadb-dump]] will by default ignore S3 tables. If ##mariadb-dump## is run with the ##--copy-s3-tables## option, the resulting file will contain a CREATE statement for a similar [[aria-storage-engine|Aria]] table, followed by the table data and ending with an ##ALTER TABLE xxx ENGINE=S3##.

== ANALYZE TABLE

As of MariaDB 10.5.14, [[analyze-table|ANALYZE TABLE]] is supported for S3 tables.
As the S3 tables are read-only, a normal ##ANALYZE TABLE## will not do anything. However
using ##ANALYZE TABLE table_name PERSISTENT FOR...## will now work.

== CHECK TABLE
As of MariaDB 10.5.14, [[check-table|CHECK TABLE]] will work.  As S3 tables are read only
it is very unlikely that they can become corrupted.   The only known way an S3 table could be corrupted if either the original table copied to S3 was corrupted or the process of copying the original table to S3 was somehow interrupted.

== Current Limitations
* [[mysql-test-run]] doesn't by default test the S3 engine as we can't embed AWS keys into mysql-test-run.
* Replicas should not access S3 tables while they are ALTERed! This is because there is no locking implemented to S3 between servers. However, after a table (either the original S3 table or the partitioned S3 table) is changed on the primary, the replica will notice this on the next access and update its local definition.

=== Limitations in [[alter-table|ALTER .. PARTITION]]

All [[alter-table|ALTER PARTITION]] operations are supported on S3 partitioning tables except:
* REBUILD PARTITION
* TRUNCATE PARTITION
* REORGANIZE PARTITION

== Performance Considerations

Depending on your connection speed to your S3 provider, there can be some notable slowdowns in some
operations.

=== Discovery

As S3 is supporting discovery (automatically making tables available that are in S3) this can cause some
small performance problems if the S3 engine is enabled. Partitioning S3 tables also support discovery.

* CREATE TABLE is a bit slower as the S3 engine has to check if the to-be-created table is already S3.
* Queries on information_schema tables are slower as S3 has to check if there is new tables in S3.
* DROP of non existing tables are slower as S3 has to check if the table is in S3.

There are no performance degradation's when accessing existing tables on the server. Accessing the S3
table the first time will copy the .frm file from S3 to the local disk, speeding up future accesses to the table.

=== Caching

* Accessing a table on S3 can take some time , especially if you are using big packets ([[s3-storage-engine-system-variables#s3_block_size|s3_block_size]]).  However the second access to the same data should be fast as it's then cached in the S3 page cache.

=== Things to Try to Increase Performance

If you have performance problems with the S3 engine, here are some things you can try:
* Decreasing [[s3-storage-engine-system-variables#s3_block_size|s3_block_size]]. This can be done both globally and per table.
* Use COMPRESSION_ALGORITHM=zlib when creating the table. This will decrease the amount of data transferred from S3 to the local cache.
* Increasing the size of the s3 page cache: [[s3-storage-engine-system-variables#s3_pagecache_buffer_size|s3_pagecache_buffer_size]]

Try also to execute the query twice to check if the problem is that the data was not properly cached.  When data is cached locally the performance should be excellent.

== Future Development Ideas

* Store aws keys and region in the mysql.servers table (as [[spider|Spider]] and [[federatedx|FederatedX]]). This will allow one to have different tables on different S3 servers.
* Store s3 bucket, access_key and secret key in a cache to better be able to better to reuse connections. This would save some memory and make some S3 accesses a bit faster as we could reuse old connections.

== Troubleshooting S3 on SELinux

If you get errors such as:
<<code>>
ERROR 3 (HY000): Got error from put_object(bubu/produkt/frm): 5 Couldn't connect to server
<</code>>
one reason could be that your system doesn't allow MariaDB to connect to ports other than 3306.
To procedure to enable other ports is the following:

Search for the ports allowed for MariaDB:

<<code>>
$ sudo semanage port -l | grep mysqd_port_t
mysqld_port_t                tcp   1186, 3306, 63132-63164
<</code>>
Say you want to allow MariaDB to connect to port 32768:

<<code>>
$ sudo semanage port -a -t mysqld_port_t -p tcp 32768
<</code>>

You can verify that the new port, 32768, is now allowed for MariaDB:

<<code>>
$ sudo semanage port -l | grep mysqd_port_t
mysqld_port_t                tcp   32768,1186, 3306, 63132-63164
<</code>>

== See Also
* [[s3-storage-engine-internals|S3 storage engine internals]]

Content reproduced on this site is the property of its respective owners, and this content is not reviewed in advance by MariaDB. The views, information and opinions expressed by this content do not necessarily represent those of MariaDB or any other party.

Using the S3 Storage Engine - Source

Products

Services

Pricing

Resources

About Us

Download MariaDB