Cache
Cache
This filter was introduced in MariaDB MaxScale 2.1.
Overview
The cache filter is a simple cache that is capable of caching the result of SELECTs, so that subsequent identical SELECTs are served directly by MaxScale, without the queries being routed to any server.
Note that the cache is still experimental and that non-backward compatible changes may be made.
Note that installing the cache causes all statements to be parsed. The implication of that is that unless statements already need to be parsed, e.g. due to the presence of another filter or the chosen router, then adding the cache will not necessarily improve the performance, but may decrease it.
Limitations
All of these limitations may be addressed in forthcoming releases.
Invalidation
Currently there is no cache invalidation, apart from time-to-live.
Prepared Statements
Resultsets of prepared statements are not cached.
Transactions
The cache will be used and populated only if there is no on-going
transaction or if an on-going transaction is explicitly read-only (that is,
START TRANSACTION READ ONLY
).
Variables
The cache key is effectively the entire SELECT statement. However, the value of any variables used in the select is not considered. For instance, if a variable is used in the WHERE clause of the select, a subsequent identical select will return the wrong result, if the value of the variable has been changed in between.
MySQL [testdb]> create table tbl (a int, b int); MySQL [testdb]> insert into tbl values (1, 2), (3, 4); MySQL [testdb]> set @var=2; MySQL [testdb]> select a from tbl where b=@var; +------+ | a | +------+ | 1 | +------+ MySQL [testdb]> set @var=4; MySQL [testdb]> select a from tbl where b=@var; +------+ | a | +------+ | 1 | +------+
In the second case, the correct answer would have been 3
and not 1
.
Security
The cache is not aware of grants.
The implication is that unless the cache has been explicitly configured who the caching should apply to, the presence of the cache may provide a user with access to data he should not have access to.
Please read the section Security for more detailed information.
Configuration
The cache is simple to add to any existing service. However, some experimentation may be required in order to find the configuration settings that provide the maximum benefit.
[Cache] type=filter module=cache hard_ttl=30 soft_ttl=20 storage=... storage_options=... rules=... ... [Cached Routing Service] type=service ... filters=Cache
Each configured cache filter uses a storage of its own. That is, if there are two services, each configured with a specific cache filter, then, even if queries target the very same servers the cached data will not be shared.
Two services can use the same cache filter, but then either the services should use the very same servers or a completely different set of servers, where the used table names are different. Otherwise there can be unintended sharing.
Filter Parameters
The cache filter has one mandatory parameter - storage
- and a few
optional ones. Note that it is advisable to specify max_size
to prevent
the cache from using up all memory there is, in case there is very litte
overlap among the queries.
storage
The name of the module that provides the storage for the cache. That
module will be loaded and provided with the value of storage_options
as
argument. For instance:
storage=storage_inmemory
See Storage for what storage modules are available.
storage_options
A comma separated list of arguments to be provided to the storage module,
specified in storage
, when it is loaded. Note that the needed arguments
depend upon the specific module. For instance,
storage_options=storage_specific_option1=value1,storage_specific_option2=value2
hard_ttl
Hard time to live; the maximum amount of time - in seconds - the cached
result is used before it is discarded and the result is fetched from the
backend (and cached). See also soft_ttl
below.
hard_ttl=60
The default value is 0
, which means no limit.
soft_ttl
Soft time to live; the amount of time - in seconds - the cached result is
used before it is refreshed from the server. When soft_ttl
has passed, the
result will be refreshed when the first client requests the value.
However, as long as hard_ttl
has not passed, all other clients requesting
the same value will use the result from the cache while it is being fetched
from the backend. That is, as long as soft_ttl
but not hard_ttl
has passed,
even if several clients request the same value at the same time, there will be
just one request to the backend.
soft_ttl=60
The default value is 0
, which means no limit. If the value of soft_ttl
is
larger than hard_ttl
it will be adjusted down to the same value.
max_resultset_rows
Specifies the maximum number of rows a resultset can have in order to be stored in the cache. A resultset larger than this, will not be stored.
max_resultset_rows=1000
The default value is 0
, which means no limit.
max_resultset_size
Specifies the maximum size of a resultset, for it to be stored in the cache. A resultset larger than this, will not be stored. The size can be specified as described here.
max_resultset_size=128Ki
The default value is 0
, which means no limit.
Note that the value of max_resultset_size
should not be larger than the
value of max_size
.
max_count
The maximum number of items the cache may contain. If the limit has been reached and a new item should be stored, then an older item will be evicted.
Note that if cached_data
is thread_specific
then this limit will be
applied to each cache separately. That is, if a thread specific cache
is used, then the total number of cached items is #threads * the value
of max_count
.
max_count=1000
The default value is 0
, which means no limit.
max_size
The maximum size the cache may occupy. If the limit has been reached and a new item should be stored, then some older item(s) will be evicted to make space. The size can be specified as described here.
Note that if cached_data
is thread_specific
then this limit will be
applied to each cache separately. That is, if a thread specific cache
is used, then the total size is #threads * the value of max_size
.
max_size=100Mi
The default value is 0
, which means no limit.
rules
Specifies the path of the file where the caching rules are stored. A relative path is interpreted relative to the data directory of MariaDB MaxScale.
rules=/path/to/rules-file
cached_data
An enumeration option specifying how data is shared between threads. The allowed values are:
shared
: The cached data is shared between threads. On the one hand it implies that there will be synchronization between threads, on the other hand that all threads will use data fetched by any thread.thread_specific
: The cached data is specific to a thread. On the one hand it implies that no synchonization is needed between threads, on the other hand that the very same data may be fetched and stored multiple times.
cached_data=thread_specific
Default is shared
. See max_count
and max_size
what implication changing
this setting to thread_specific
has.
debug
An integer value, using which the level of debug logging made by the cache can be controlled. The value is actually a bitfield with different bits denoting different logging.
0
(0b00000
) No logging is made.1
(0b00001
) A matching rule is logged.2
(0b00010
) A non-matching rule is logged.4
(0b00100
) A decision to use data from the cache is logged.8
(0b01000
) A decision not to use data from the cache is logged.16
(0b10000
) Higher level decisions are logged.
Default is 0
. To log everything, give debug
a value of 31
.
debug=31
Rules
The caching rules are expressed as a JSON object.
There are two decisions to be made regarding the caching; in what circumstances should data be stored to the cache and in what circumstances should the data in the cache be used.
In the JSON object this is visible as follows:
{ store: [ ... ], use: [ ... ] }
The store
field specifies in what circumstances data should be stored to
the cache and the use
field specifies in what circumstances the data in
the cache should be used. In both cases, the value is a JSON array containg
objects.
When to Store
By default, if no rules file have been provided or if the store
field is
missing from the object, the results of all queries will be stored to the
cache, subject to max_resultset_rows
and max_resultset_size
cache filter
parameters.
By providing a store
field in the JSON object, the decision whether to
store the result of a particular query to the cache can be controlled in
a more detailed manner. The decision to cache the results of a query can
depend upon
- the database,
- the table,
- the column, or
- the query itself.
Each entry in the store
array is an object containing three fields,
{ "attribute": <string>, "op": <string> "value": <string> }
where,
* the attribute can be database
, table
, column
or query
,
* the op can be =
, !=
, like
or unlike
, and
* the value a string.
If op is =
or !=
then value is used as a string; if it is like
or unlike
, then value is interpreted as a pcre2 regular expression.
Note though that if attribute is database
, table
or column
, then
the string is interpreted as a name, where a dot .
denotes qualification
or scoping.
The objects in the store
array are processed in order. If the result
of a comparison is true, no further processing will be made and the
result of the query in question will be stored to the cache.
If the result of the comparison is false, then the next object is processed. The process continues until the array is exhausted. If there is no match, then the result of the query is not stored to the cache.
Note that as the query itself is used as the key, although the following queries
select * from db1.tbl
and
use db1; select * from tbl
target the same table and produce the same results, they will be cached separately. The same holds for queries like
select * from tbl where a = 2 and b = 3;
and
select * from tbl where b = 3 and a = 2;
as well. Although they conceptually are identical, there will be two cache entries.
Qualified Names
When using =
or !=
in the rule object in conjunction with database
,
table
and column
, the provided string is interpreted as a name, that is,
dot (.
) denotes qualification or scope.
In practice that means that if attribute is database
then value may
not contain a dot, if attribute is table
then value may contain one
dot, used for separating the database and table names respectively, and
if attribute is column
then value may contain one or two dots, used
for separating table and column names, or database, table and column names.
Note that if a qualified name is used as a value, then all parts of the
name must be available for a match. Currently Maria DB MaxScale may not
always be capable of deducing in what table a particular column is. If
that is the case, then a value like tbl.field
may not necessarily
be a match even if the field is field
and the table actually is tbl
.
Implication of the default database.
If the rules concerns the database
, then only if the statement refers
to no specific database, will the default database be considered.
Regexp Matching
The string used for matching the regular expression contains as much information as there is available. For instance, in a situation like
use somedb; select fld from tbl;
the string matched against the regular expression will be somedb.tbl.fld
.
Examples
Cache all queries targeting a particular database.
{ "store": [ { "attribute": "database", "op": "=", "value": "db1" } ] }
Cache all queries not targeting a particular table
{ "store": [ { "attribute": "table", "op": "!=", "value": "tbl1" } ] }
That will exclude queries targeting table tbl1 irrespective of which database it is in. To exclude a table in a particular database, specify the table name using a qualified name.
{ "store": [ { "attribute": "table", "op": "!=", "value": "db1.tbl1" } ] }
Cache all queries containing a WHERE clause
{ "store": [ { "attribute": "query", "op": "like", "value": ".*WHERE.*" } ] }
Note that that will actually cause all queries that contain WHERE anywhere, to be cached.
When to Use
By default, if no rules file have been provided or if the use
field is
missing from the object, all users may be returned data from the cache.
By providing a use
field in the JSON object, the decision whether to use
data from the cache can be controlled in a more detailed manner. The decision
to use data from the cache can depend upon
- the user.
Each entry in the use
array is an object containing three fields,
{ "attribute": <string>, "op": <string> "value": <string> }
where,
* the attribute can be user
,
* the op can be =
, !=
, like
or unlike
, and
* the value a string.
If op is =
or !=
then value is interpreted as a MariaDB account
string, that is, %
means indicates wildcard, but if op is like
or
unlike
it is simply assumed value is a pcre2 regular expression.
For instance, the following are equivalent:
{ "attribute": "user", "op": "=", "value": "'bob'@'%'" } { "attribute": "user", "op": "like", "value": "bob@.*" }
Note that if op is =
or !=
then the usual assumptions apply,
that is, a value of bob
is equivalent with 'bob'@'%'
. If like
or unlike is used, then no assumptions apply, but the string is
used verbatim as a regular expression.
The objects in the use
array are processed in order. If the result
of a comparison is true, no further processing will be made and the
data in the cache will be used, subject to the value of ttl
.
If the result of the comparison is false, then the next object is processed. The process continues until the array is exhausted. If there is no match, then data in the cache will not be used.
Note that use
is relevant only if the query is subject to caching,
that is, if all queries are cached or if a query matches a particular
rule in the store
array.
Examples
Use data from the cache for all users except admin
(actually 'admin'@'%'
),
regardless of what host the admin
user comes from.
{ "use": [ { "attribute": "user", "op": "!=", "value": "admin" } ] }
Security
As the cache is not aware of grants, unless the cache has been explicitly configured who the caching should apply to, the presence of the cache may provide a user with access to data he should not have access to.
Suppose there is a table access
that the user alice has access to,
but the user bob does not. If bob tries to access the table, he will
get an error as reply:
MySQL [testdb]> select * from access; ERROR 1142 (42000): SELECT command denied to user 'bob'@'localhost' for table 'access'
If we now setup caching for the table, using the simplest possible rules file, bob will get access to data from the table, provided he executes a select identical with one alice has executed.
For instance, suppose the rules look as follows:
{ "store": [ { "attribute": "table", "op": "=", "value": "access" } ] }
If alice now queries the table, she will get the result, which also will be cached:
MySQL [testdb]> select * from access; +------+------+ | a | b | +------+------+ | 47 | 11 | +------+------+
If bob now executes the very same query, and the result is still in the cache, it will be returned to him.
MySQL [testdb]> select current_user(); +----------------+ | current_user() | +----------------+ | bob@127.0.0.1 | +----------------+ 1 row in set (0.00 sec) MySQL [testdb]> select * from access; +------+------+ | a | b | +------+------+ | 47 | 11 | +------+------+
That can be prevented, by explicitly declaring in the rules that the caching should be applied to alice only.
{ "store": [ { "attribute": "table", "op": "=", "value": "access" } ], "use": [ { "attribute": "user", "op": "=", "value": "'alice'@'%'" } ] }
With these rules in place, bob is again denied access, since queries
targeting the table access
will in his case not be served from the cache.
Storage
storage_inmemory
This simple storage module uses the standard memory allocator for storing the cached data.
storage=storage_inmemory
storage_rocksdb
This storage module uses RocksDB database for storing the cached data. The directory where the RocksDB database will be created is by default created into the MaxScale cache directory, which usually is not on a RAM disk. For maximum performance, you may want to explicitly place the RocksDB database on a RAM disk.
storage=storage_rocksdb
Parameters
cache_directory
Specifies the directory under which the filter instance specific RocksDB databases will be placed. Note that at startup, each RocksDB database will be deleted and recreated. That is, cache content will not be retained across MaxScale restarts.
storage_options=cache_directory=/mnt/maxscale-cache
With the above setting a directory /mnt/macscale-cache/storage_rocksdb
will
created, under which the actual instance specific cache directories are created.
collect_statistics
Specifies whether RocksDB should collect statistics that later can be queried
using maxadmin
. It should be noted, though, that collecting RocksDB statistics
is not without a cost. From the RocksDB Documentation
The overhead of statistics is usually small but non-negligible. We usually observe an overhead of 5%-10%.
The value is a boolean and the default is false
.
storage_options=collect_statistics=true
Example
In the following we define a cache MyCache that uses the cache storage module
storage_inmemory
and whose soft ttl is 30
seconds and whose hard ttl is
45
seconds. The cached data is shared between all threads and the maximum size
of the cached data is 50
mebibytes. The rules for the cache are in the file
cache_rules.json
.
Configuration
[MyCache] type=filter module=cache storage=storage_inmemory soft_ttl=30 hard_ttl=45 cached_data=shared max_size=50Mi rules=cache_rules.json [MyService] type=service ... filters=MyCache
cache_rules.json
The rules specify that the data of the table sbtest
should be cached.
{ "store": [ { "attribute": "table", "op": "=", "value": "sbtest" } ] }