MySQL "Wishlist" Session from an online travel agency

Currently critical issues

DROP TABLE is slow

Dropping a partitioned table that uses the compressed InnoDB storage engine can be very slow in some cases - sometimes the drop takes several hours, logjamming replication. This is currently a showstopper for rolling out a large MyISAM-to-InnoDB conversion that we consider important for scaling further.

Scaling replication

This year we experienced, for the first time, replication speed problems in the SQL slave thread. This is apparently due to our code having been written for MyISAM originally, and hence not batching writes into transactions in the first place, plus the added disk write latency of a filer as the target disk.

Monty stated that MariaDB is about to contain an interesting patch regarding Group Commit, which leverages the fact that all items in one Group Commit must be logically independent and hence could in theory be applied in parallel. The goal of the patch is to turn this theory into practice.

UTF-8 issues, MEMORY engine limitations

Our UTF-8 conversion is currently stalled because MEMORY tables are fixed length, and they are what query plans with 'using temporary' use. A single VARCHAR(255) CHARSET utf8 column then occupies 765 bytes (255 characters × 3 bytes) per row whenever the query goes 'using temporary'. -- Kris, Eric

Variable-length MEMORY tables have existed, with limitations; finishing this needs sponsorship. -- Monty
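
To illustrate the blow-up Kris and Eric describe (table and column names here are made up): internal 'using temporary' tables use the fixed-length MEMORY representation, which reserves the worst case for every character.

CREATE TABLE t_utf8_demo (
  name VARCHAR(255) CHARACTER SET utf8    -- up to 3 bytes per character
) ENGINE=MEMORY;
-- In the fixed-length MEMORY representation this one column reserves
-- 255 characters * 3 bytes = 765 bytes in every row, used or not.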

The "Adminlogging problem"

Many of our writes need to track the previous values so that they can be shipped to an auditing database on a different machine. Many ways to solve that problem exist; most of them suck.

Use the RBR binlog and parse its output: we are currently parsing "mysqlbinlog -vv" output. A binlog parsing library exposed to the usual scripting languages would make that much easier. Also, the binlog does not contain all data needed for an audit log (application-level user information needs to be embedded somehow).

We can log the statement, including all comments in it, once, and have each row event point back to the statement that caused the row to be changed. You'd have the change from the RBR record, and could get the higher-level app user from a comment in the statement. -- Monty

I like. Binlog is written anyway, so no additional overhead. -- Kris
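
A rough sketch of what such an annotated statement could look like (table, column, and comment fields are made up); the RBR row events would carry the before/after values, and the comment would carry the application-level user:

/* app_user='jdoe', request_id='4711' */
UPDATE bookings SET status = 'cancelled' WHERE booking_id = 42;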

Another solution would be triggers.

They suck. Executed synchronously, slow, hard to debug, prone to breakage. Also higher level app data missing. -- Kris

Specialized statements? "SELECT FROM DELETE/SELECT FROM UPDATE": Write statements returning a result set of old values. Actually INSERT FROM DELETE/INSERT FROM UPDATE would be needed to make it transactional. -- Yves, Ruud

Please spec this. -- Monty
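
A first stab at such a spec might look like this (purely hypothetical syntax, nothing like it exists today; table and column names are made up): the delete and the capture of the old rows happen in one transactional statement.

INSERT INTO audit_log (booking_id, old_status, deleted_at)
  SELECT booking_id, status, NOW()
  FROM DELETE FROM bookings WHERE expires_at < NOW();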

MySQL 5.5 provides an Audit plugin API for this. Check it out, and write a UDP sender into eventlogging as such a plugin. It is easy, and does exactly what we want. Also, this is the least ugly way to solve this. -- Kris

More replication wishes

Replication synchronisation (Global Unique Id)

At the moment we build our replication hierarchies at most two levels deep, and use a host-language wrapper around SHOW MASTER STATUS and SELECT MASTER_POS_WAIT() to create waypoints where needed. This does not work across more than one hop (two levels), as master binlog positions do not translate past one slave.
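
For reference, the waypoint dance we do today looks roughly like this (file name and position are placeholders):

-- On the master, inside the wrapper: note the current binlog coordinates.
SHOW MASTER STATUS;    -- e.g. File: mysql-bin.000123, Position: 4567
-- On the direct slave: block until that position has been applied, with a 30s timeout.
SELECT MASTER_POS_WAIT('mysql-bin.000123', 4567, 30);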

A mechanism that allows us to create such waypoints across any number of hops would be useful from a correctness perspective. It would also allow us more flexibility in how we build our replication hierarchies.

Large table upload causes replication lag

Some events are known to be independent, and can cause large replication lag due to their sheer size. For example, we sometimes upload large new tables to a master, and the data download then delays normal processing on all slaves. It would be nice to have a mechanism that uses additional connections to download such data out-of-band.

Please note that such a mechanism should not change the security model of replication too much (direction of connect from slave to master, port usage, etc).

Inverse Sharding

We shard data. In some cases, for some users who care less about speed and more about reporting, it would be convenient to have a way to unshard data. Is FederatedX a solution? Can we have replication such that a slave can be slave to multiple masters, provided the data sets of the masters are non-overlapping? -- Eric

Let's add names, DEFAULT being the optional default name. START SLAVE IO_THREAD DEFAULT, CHANGE DEFAULT MASTER TO ... and so on. -- Monty
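
Spelled out, that sketch might look like this (hypothetical syntax and host names; nothing of it is implemented as written):

CHANGE DEFAULT MASTER TO MASTER_HOST = 'shard-a.example.com', MASTER_USER = 'repl';
CHANGE MASTER 'shard_b' TO MASTER_HOST = 'shard-b.example.com', MASTER_USER = 'repl';
START SLAVE IO_THREAD DEFAULT;
START SLAVE IO_THREAD 'shard_b';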

Making slaves in a non-sucky way

Right now we mylvmbackup a slave, restore it to a new box, recover the InnoDB data and connect it to the master. We'd like to have a command that can be issued against an empty database, will download the data from a master (or a co-slave), and then attach the new slave to the master (using the configuration provided on the command line or by the co-slave). At the SQL level. With progress report. -- Nicolai

And please have the data move at disk speed, no index creation, ship binary index data from the co-slave or master. This can be fast! -- Kris

Stable snapshot

"FLUSH TABLES WITH READ LOCK" can still break, if a checkpoint occurs under the lock.

Can we have FLUSH TABLES WITH READ LOCK AND NO CHECKPOINT and a FORCE CHECKPOINT for InnoDB, or have FLUSH TABLES force this checkpoint when applied to InnoDB? -- Kris, SimonM

Server restart required (Replication)

Nearly every replication config change requires a server restart to be picked up. Please fix!

RBR conflict resolution SP

In RBR, when there is any kind of RBR problem, call an SP. The SP gets all required data made available somehow and can define a resolution policy. Four outcomes: choose the old row, choose the new row, fail (error), or synthesize (merge) a new row.

Compressed binlog

Write the binlog optionally in a compressed fashion at the event level.

Async binlog shipping

On binlog rotation, run an SP asynchronously. Note that the execution of, for example, a shell command through a UDF must not lock anything! Provide sufficient environmental data to the SP (old and new binlog names, binlog positions for the old and new log, etc).

Use tables, not files

Get rid of master.info, relay.info. These should be InnoDB tables in mysql.*.

Less critical wishes

Async query API so that we can map/reduce at the client level

Provide an API in mysqlclient.so so that we can write application-side map/reduce actions that query shards in parallel.

Make Indexes first level objects

An index should be able to exist as an object that is not tied to a physical table. That would allow for indexes that are global to a partitioned table, and for functional indexes or indexes on virtual columns.

ALTER TABLE progress report

ALTER TABLE can take a long time. We know how to go into $datadir and find the secret files and observe them, but we'd rather see progress in SHOW PROCESSLIST.

done while you wait (*closes emacs*). -- Monty

Fast ALTER TABLE

Can we have proper fast ALTER TABLE? That would include online index drop/create, and versioned schema plus version numbers at the page level or the row level. Rows would be upgraded to the most recent schema version not at ALTER TABLE time, but whenever their row/page is being dirtied for whatever reason. -- Kris, Herald.

There is the facebook hack for online ALTER TABLE. -- SkySQL

Yes. We know. It is as much part of the problem space as it is part of the solution space. -- Kris, Herald

Nested transactions / In TXN flag

"I would like to have nested transactions. They would make some things I need to solve a lot easier (layering in code, lower layers with their own TXNs being called by upper layers with or without a TXN going on)." -- Yves

"I would like to see a viable definition of nested transactions. I would like to know what it means when an inner TXN does a COMMIT and then the outer TXN does a ROLLBACK. Are there any other ways to solve your layering problem?" -- Kris

Another group asked about solutions for the same layering problem, but in a different way: "We would like to have a flag or status variable that tells code if it currently is inside a transaction, and optionally, if that is an implied (autocommit = 0) or explicit (autocommit = 1 and BEGIN WORK was issued) transaction."
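
One partial workaround that exists today for the layering problem is savepoints (real SQL, assuming InnoDB; the table name is made up): a lower layer can open a savepoint instead of its own transaction, so its work can be undone without touching the caller's transaction.

START TRANSACTION;                  -- upper layer opens the real transaction
SAVEPOINT lower_layer;              -- lower layer "begins" its own unit of work
UPDATE bookings SET status = 'held' WHERE booking_id = 42;
-- on failure the lower layer would issue: ROLLBACK TO SAVEPOINT lower_layer;
RELEASE SAVEPOINT lower_layer;      -- success: fold the work into the outer transaction
COMMIT;                             -- the upper layer decides the final outcome

This still does not answer the inner-COMMIT/outer-ROLLBACK question raised above, nor does it give the lower layer a way to know whether a transaction is already open.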

Compressed MyISAM/Aria

We are using MyISAM in some places for mostly append-only tables and very large sizes (20 TB and more target size). We are considering Aria for this, if Aria is comparable in space efficiency to MyISAM, and has nicer recovery after an unplanned shutdown.

The MyISAM tables we use are compressed with myisampack/myisamchk once they are done being written to. This process is currently cumbersome, as it involves a FLUSH TABLES and external commands being run outside of the DB server.

Having an SQL command that online packs MyISAM/Aria tables would be useful, as this would handle open file handles and other stuff more nicely, and also can be controlled from the client using only SQL.

Having ARCHIVE-engine-like row-level compression would be even nicer, but will probably create all kinds of problems when writes happen to already compressed rows.

For our current DWH needs just a COMPRESS TABLE/UNCOMPRESS TABLE command would be sufficient, and is probably a lot less invasive.
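
Something along these lines would do (hypothetical commands, for discussion only; the table name is made up), replacing the current FLUSH TABLES plus external myisampack/myisamchk dance with server-side statements:

COMPRESS TABLE clicks_20101231;      -- pack the finished table in place, handling open file handles
UNCOMPRESS TABLE clicks_20101231;    -- make the table writable again, should that ever be needed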

GET_LOCK() improvements

"At the moment GET_LOCK() in limited to one lock per connection. Acquiring a second lock while holding one releases the initial lock. This has been done in order to avoid deadlocking problems, we presume.

Having a deadlock detection system in place, and being able to pick up multiple locks at once or gradually would be useful." -- Yves

There are voices (mine, for one) saying that if you are using GET_LOCK() instead of InnoDB row locks, you are doing it wrong. Using GET_LOCK() more is a step in the wrong direction, technology-wise. -- Kris
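
For reference, the current one-lock-per-connection behaviour, and the row-lock alternative meant above (the jobs table is made up):

SELECT GET_LOCK('queue_a', 10);     -- returns 1, lock acquired
SELECT GET_LOCK('queue_b', 10);     -- returns 1, but 'queue_a' is silently released

-- The InnoDB alternative: lock exactly the rows being worked on.
START TRANSACTION;
SELECT * FROM jobs WHERE job_id = 42 FOR UPDATE;   -- row lock, held until COMMIT/ROLLBACK
COMMIT;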

Basic DWH improvements

"Hash Join would help in many places." -- Yves "Hash Joins would be useful." -- Debs

"Way ahead of you." -- Monty

"Also subselects..." -- Yves, Debs

"We hear you." -- Monty

"Some large join take ages to execute. Probably due to tmp tables to disk." -- Liz

Probably better off with Hash Joins? Show query examples, please. -- Monty

Better EXPLAIN, better stats

Better EXPLAIN, more statistics, display real I/O data for queries, list causes for WAIT.

PERFORMANCE_SCHEMA, MySQL 5.5 stuff plus addons. WL Item for better EXPLAIN exists. -- Monty

Please link this item here. -- Kris

"Better statistics for queries." "For example log query skeletons a la mysqldumpslow, but live from the parser." -- Monty

Yes, please. Remember that we have tables with time-tailed names (_YYYYMMDD), variable-length WHERE ... IN (...) lists, and variable-length extended INSERT statements. -- Yves

Optimizer woes

"Unquoted numbers cause weird query plans." -- Ruud

Yes, they have to, thanks to numeric casting in SQL. This is not really solvable. -- Monty

In MongoDB (gasp!) it can be arranged that a query using a full scan instead of an index is an error. MySQL can also log queries not using an index in the slow log, which is good. But can we optionally make such implied casts cause an error? That would make that SQL easier to find in development. -- Kris
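
What exists today is slow-log based; the wish is for an optional hard error. Roughly (the first two variables are real, the last one is purely hypothetical):

SET GLOBAL slow_query_log = ON;
SET GLOBAL log_queries_not_using_indexes = ON;   -- full scans end up in the slow log
-- Hypothetical development-only switch, not implemented anywhere:
-- SET GLOBAL error_on_implicit_cast = ON;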

Using memcached as a backend

Using memcache as a backend. -- Liz

done. -- Monty

Moving ibd files physically

For individual table restore and for shipment of precalculated/preloaded data it would be useful to detach ibd files, ship them and reattach them elsewhere, in binary form, at disk speed, no index creation necessary.

Possible with InnoDB and XtraDB as the donor, xtrabackup as the program and a properly configured XtraDB as the receiver. -- Kris

InnoDB should be able to do this natively, via ALTER TABLE ... DETACH/ATTACH. -- Everybody else.

More data load issues

SELECT INTO OUTFILE LOCAL? We do have LOAD DATA LOCAL INFILE.

Yes, but as implemented it is a security risk, also messes with firewall config due to the way it connects. -- Kris

Does not. -- Monty

Ok, I will write this up in a better way. -- Kris

Also, please try mysqldumper. -- SkySQL

Server restart required (InnoDB)

About every second InnoDB config change requires a server restart to be picked up. Please fix!

INFORMATION_SCHEMA

Faster I_S. Consider a server with one million tables. Make I_S not suck on this.

GRANTS, ROLES, LDAP

We would like to be able to define roles (e.g. developer, webuser, admin) and then assign a set of roles to users. One role is the default role picked up on login; the others can be picked up and dropped with a command.

Example (Kris): Define role developer, developer has access to ALL on schema x.*. Define role dba, dba has ALL on *.* with grant option. Define user debs with default role developer, additional role dba.

Debs logs in (username, password from LDAP, possibly also role info from LDAP (optional)). Debs has developer rights. Debs issues TAKE ROLE DBA or something. She now has DBA privileges. Debs issues DROP ROLE DBA. She now has Developer privileges.

A change of a role definition should be picked up when a role is picked up. For example role DBA changes privs. Debs has been logged in for 3 days as Developer. When she TAKEs ROLE DBA, the new DBA definition is picked up.

A change of role lists should be picked up at login.

The database should be able to work fine with one million users, one million roles. Do not keep all data in memory, load the required info on login into a cache (size changeable without restart). On login, load the requested data, if it is not in cache already.

Example:

SET GLOBAL USERNAME_CACHE = 1000000;  -- yes, we pay that memory overhead
SET GLOBAL ROLE_CACHE = 1000000;      -- here as well

As users log in, these caches fill up. If we SET GLOBAL ROLE_CACHE = 1000, it works as an LRU for the 1000 most recently used roles.

Example (Simon):

Initial thoughts are:

CREATE ROLE ROLE_NAME                 -- makes the role exist;
GRANT .... ON db.table TO ROLE_NAME;  -- add grants to a "role", similar to grants currently given to a user.

GRANT ROLE ROLE_NAME1 TO 'USER'@'...' [IDENTIFIED BY [PASSWORD] 'xxxxxx'];  -- add role's grants to user
GRANT ROLE ROLE_NAME2 TO 'USER'@'...';  -- add role's grants to user

with corresponding revokes:

REVOKE ROLE ROLE_NAME1 FROM 'USER'@'..'; -- remove role's grants from a user
REVOKE ... ON db.table FROM ROLE_NAME1;  -- remove grants from a role

and probably something like

SHOW ROLES;            -- list all roles known
SHOW ROLE ROLE_NAME1;  -- show the "grants" for a role

SHOW GRANTS FOR 'user'@'location' -- should include the 'roles' the user has been granted.

Note: it would also be nice to have something like the equivalent of mk-show-grants, which shows ALL grants for ALL users. We use this all the time, as there's no SQL equivalent at the moment.
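
The manual equivalent of mk-show-grants is to generate one SHOW GRANTS statement per account and run each of them:

-- Generate one SHOW GRANTS statement per account...
SELECT CONCAT('SHOW GRANTS FOR ''', user, '''@''', host, ''';') FROM mysql.user;
-- ...then execute each generated statement and collect the output.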

Views

Views are parsed for every access, never cached. -- Monty, Ruud

TABLESPACE

Proper tablespace management a la Oracle. -- SimonM

Why? Please state benefits, spec it. -- Monty

Simon replied:

My thoughts: we have database instances with data ranging from 20G to 2TB, and some archive DB servers with a target size of up to 40TB of data.

Managing this with InnoDB as one large "blob" (one or more ibdata files) is not comfortable. innodb_file_per_table splits this blob over multiple files of different sizes. Possibly more efficient would be multiple "reasonably fixed" sized files [the tablespaces], fewer in number than a per-table count.

  • An instance contains one or more tablespaces.
  • A tablespace contains one or more datafiles (say 10G-50G depending on instance, DBA configurable)
  • It should be possible to increase the size of a datafile
  • It should be possible to assign a database or table to a specific tablespace
  • It should also be possible to move tables out of a datafile (prior to perhaps dropping the file).
  • Practicalities: make it possible to easily export/import a tablespace (with its contents) from one instance to another.

It's true this does sound like quite a lot of extra work. Benefits are:

  • easier to move around the files between instances
  • space freed up by DELETEs or UPDATEs would be available to other tables (not the case with file_per_table unless you rebuild the table with ALTER TABLE)
  • locating specific tablespaces on a specific location means that you can tune for better performance: e.g. move different tablespaces onto different disks/filesystems/LUNs. (this is hard to do now cleanly)

So most of this is to help with management of the on-disk/file data operationally.

If NOT using XFS with O_DIRECT, there is an in-memory write lock on the inode; that is, the filesystem allows only one write at a time to a given file. Tablespaces and datafiles are a way to generate concurrency here if XFS+O_DIRECT is not an option.

Tablespaces, if done right, can provide LVM services to a storage engine in a way independent of the OS.

Tablespaces, if done right, can expose an API internally, that all storage engines can use instead of talking to the filesystem directly. That would unify the actual storage management of all storage engines.

Get rid of MyISAM in mysql.*

You should be able to use any general storage engine in mysql.*, e.g. InnoDB.

Get rid of log tables in mysql.*

The mysql.* schema is a config repository. It was a mistake to place potential log tables in here, e.g. the general log and slow query log tables.

Timezone issues

Currently, timezones are part of the session, not part of the column. Some data types take timezone data from the session into account (broken!); other data types never do.

In general, this needs to be treated either the Drizzle way (make all charsets utf8, make all time data UTC) or the MySQL string way (store the charset with each string, store the TZ with each date or time value, and provide blob types as well as non-TZ types for strings and date/time values respectively).

Essentially this is a request to create DATE_TZ, TIME_TZ, DATETIME_TZ and TIMESTAMP_TZ, relegate the timezone in a session to the same function as the charset in a session ('this is what the client uses'), and convert accordingly.
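
As a sketch of what the requested types could look like in a table definition (hypothetical types; the table is made up):

CREATE TABLE departures (
  flight_no     VARCHAR(8),
  scheduled_at  DATETIME_TZ,   -- hypothetical: stored with its timezone, converted to the session TZ on read
  created_at    TIMESTAMP      -- existing behaviour, for comparison
);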

Microtime

Make these date and time types actually store microsecond resolution. Thanks.

Dynamic ENUM

Create an ENUM-like data type that does string compression (at scale). For example, create a type LOOKUP(255) CHARSET utf8. We store browser user-agent strings in it; internally you store BIGINT UNSIGNED numbers and create a lookup structure so that each occurrence of a string is physically stored only once. Convert from VARCHAR to LOOKUP using ALTER TABLE.
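
A sketch of how such a type might be declared and migrated to (hypothetical LOOKUP type; the table is made up):

CREATE TABLE page_hits (
  hit_time    DATETIME,
  user_agent  VARCHAR(255) CHARACTER SET utf8
);
-- Hypothetical conversion: each distinct string is stored once in a dictionary,
-- and the column internally becomes a BIGINT UNSIGNED reference into it.
ALTER TABLE page_hits MODIFY user_agent LOOKUP(255) CHARACTER SET utf8;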

Multiple Buffer pool

Multiple InnoDB buffer pools, but with the size of each pool, and which pool a page goes to, configured by the DBA at the table level (instead of by random/hash as in 5.5).

InnoDB temp tables

Discussed internally: would it not be possible to use InnoDB temporary tables or something similar for this, perhaps in an explicit tempdb (as in Sybase)?

That is: tables are always temporary, cleaned up on server restart, writes to tables do not need to be (should not be) flushed.

That would make the MEMORY engine thing from above irrelevant.

Providing Input to Monty Program

Running an instance with profiling on

Monty requested that we set up a slave with full profiling enabled, and if possible even divert a minimal amount of production load to it. The profiling data would be valuable to his team if made accessible, as it contains information from real-world load.

It would also be useful if the box ran oprofile on a regular basis (with symbols installed), as this provides additional information.

Other engines

We should try out other engines for fast inserts/DWH uses: Infobright, InfiniDB, PBXT, ...
