MySQL "Wishlist" Session from an online travel agency
Currently critical issues
DROP TABLE is slow
When dropping a partitioned table that uses the compressed InnoDB storage engine, we have experienced this to be very slow in some cases - sometimes the drop takes several hours, logjamming replication. This is currently a showstopper for rolling out a large MyISAM-to-InnoDB conversion that we consider important to be able to scale further.
We experienced replication speed problems in the sql_slave thread this year, for the first time. This is apparently due to our code having been written for MyISAM originally, and hence not using transactions to batch writes enough in the first place, plus added disk write latency from a filer as the target disk.
Monty stated that MariaDB is about to contain an interesting patch wrt Group Commit, which leverages the fact that all items in one group commit must be logically independent and hence could in theory be applied in parallel. The goal of the patch is to make this theory a practicality.
UTF-8 issues, MEMORY engine limitations
The UTF-8 rollout is currently stalled because MEMORY tables are fixed length, and they are used in query plans that go 'using temporary'. A single VARCHAR(255) CHARSET utf8 column uses 765 bytes per row whenever it goes 'using temporary'.
-- Kris, Eric
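For reference, the arithmetic behind the 765 bytes, as a small illustrative sketch (MySQL's utf8 reserves up to 3 bytes per character, and a MEMORY temporary table stores every VARCHAR at its full fixed width):

```python
# Fixed-width footprint of a VARCHAR column once a query plan spills to a
# MEMORY temporary table: VARCHAR(n) is padded to n times the character
# set's maximum bytes per character.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

def memory_tmp_width(n_chars: int, charset: str) -> int:
    """Bytes one VARCHAR(n_chars) column occupies per row in a MEMORY table."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]

print(memory_tmp_width(255, "utf8"))    # 765 bytes per row, for one column
print(memory_tmp_width(255, "latin1"))  # 255
```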
Variable length MEMORY tables have existed, with limits; the work needs sponsorship.
The "Adminlogging problem"
Many of our writes need to track the previous values so that they can be shipped to an auditing database elsewhere, on a different machine. Many ways to solve that problem exist; most suck.
Using the RBR binlog, parsing its output: we are currently parsing "mysqlbinlog -vv" output. A binlog parsing library exposed to the usual scripting languages would make that much easier. Also, the binlog does not contain all data needed for an audit log (application level user information needs to be embedded somehow).
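A minimal sketch of the kind of scraping we do today against "mysqlbinlog -vv" output (the "### ..." pseudo-SQL comments it emits for decoded row events); the function and regex names are ours, and note that column names are lost in this output, only @N ordinals survive:

```python
import re

# Row events decoded by `mysqlbinlog -vv` appear as "###"-prefixed comments:
#   ### UPDATE `db`.`t`
#   ### WHERE
#   ###   @1=42 /* INT meta=0 nullable=0 is_null=0 */
#   ### SET
#   ###   @1=43 /* ... */
ROW_EVENT = re.compile(r"^### (INSERT INTO|UPDATE|DELETE FROM) (`\w+`\.`\w+`)")
COLUMN = re.compile(r"^###\s+@(\d+)=(.*?)(?:\s+/\*.*\*/)?$")

def parse_verbose_binlog(lines):
    """Yield (op, table, {'WHERE': {...}, 'SET': {...}}) per decoded row event."""
    op = table = section = values = None
    for line in lines:
        m = ROW_EVENT.match(line)
        if m:
            if op:
                yield (op, table, values)
            op, table = m.group(1), m.group(2)
            # INSERT rows arrive under SET, DELETE rows under WHERE
            section = "SET" if op == "INSERT INTO" else "WHERE"
            values = {"WHERE": {}, "SET": {}}
        elif line.startswith("### WHERE"):
            section = "WHERE"
        elif line.startswith("### SET"):
            section = "SET"
        else:
            m = COLUMN.match(line)
            if m and op:
                values[section][int(m.group(1))] = m.group(2)
    if op:
        yield (op, table, values)

sample = [
    "### UPDATE `shop`.`orders`",
    "### WHERE",
    "###   @1=42 /* INT meta=0 nullable=0 is_null=0 */",
    "### SET",
    "###   @1=43 /* INT meta=0 nullable=0 is_null=0 */",
]
print(list(parse_verbose_binlog(sample)))
```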
We could log the statement, including all its comments, once, and have each row point back to the statement that caused the row to be changed. You'd have the change from the RBR record, and could get the higher level app user from a comment in the statement. - Monty
I like. Binlog is written anyway, so no additional overhead.
Another solution would be triggers.
They suck. Executed synchronously, slow, hard to debug, prone to breakage. Also higher level app data missing.
Specialized statements? "SELECT FROM DELETE / SELECT FROM UPDATE": write statements returning a result set of the old values. Actually, INSERT FROM DELETE / INSERT FROM UPDATE would be needed to make it transactional.
-- Yves, Ruud
Please spec this.
MySQL 5.5 provides an Audit plugin API for this. Check it out, and write a UDP sender into eventlogging as such a plugin. It is easy, and does exactly what we want. Also, this is the least ugly way to solve this.
More replication wishes
Replication synchronisation (Global Unique Id)
At the moment we build our replication hierarchies at most two levels deep, and use a host-language wrapper around SHOW MASTER STATUS and SELECT MASTER_POS_WAIT() to create waypoints where needed. This does not work across more than one hop (two levels), as master binlog positions do not translate past one slave.
A mechanism that allows creating such waypoints across any number of hops would be useful from a correctness perspective. It would also allow us more flexibility in how we build our replication hierarchies.
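The current single-hop waypoint wrapper can be sketched like this (the `master_query`/`slave_query` callables are hypothetical stand-ins for whatever client library executes the SQL):

```python
# Hypothetical sketch of the host-language waypoint wrapper described above.
# `master_query` and `slave_query` are assumed callables that run one SQL
# statement against the master resp. the slave and return the first row.
def create_waypoint(master_query, slave_query, timeout_s=60):
    """Block until the slave has applied everything the master has written."""
    binlog_file, binlog_pos = master_query("SHOW MASTER STATUS")[:2]
    (result,) = slave_query(
        "SELECT MASTER_POS_WAIT('%s', %d, %d)"
        % (binlog_file, binlog_pos, timeout_s))
    if result is None or result < 0:  # NULL: SQL thread stopped; -1: timed out
        raise RuntimeError("waypoint not reached: %r" % (result,))
    return result

# Stubbed demo: the "slave" reports it waited for 7 events.
print(create_waypoint(lambda sql: ("mysql-bin.000042", 1234),
                      lambda sql: (7,)))
```

Because the returned binlog coordinates are only meaningful on direct slaves of that master, this scheme cannot be chained past one hop, which is exactly the limitation above.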
Large table upload causes replication lag
Some events are known to be independent, and can cause large replication lag due to their sheer size. For example, sometimes we upload large new tables to a master, and the data download then delays normal processing on all slaves. It would be nice to have a mechanism that leverages additional connections to download such data out of band.
Please note that such a mechanism should not change the security model of replication too much (direction of connect from slave to master, port usage, etc).
We shard data. In some cases, for some users who care less about speed and more about reporting, it would be convenient to have a way to unshard data. Is FederatedX a solution? Can we have replication so that a slave can be slave to multiple masters, provided the data sets of the masters are non-overlapping?
Let's add names, with DEFAULT being the optional default name: START SLAVE IO_THREAD DEFAULT, CHANGE DEFAULT MASTER TO ... and so on.
Making slaves in a non-sucky way
Right now we mylvmbackup a slave, restore it to a new box, recover the InnoDB data and connect it to the master. We'd like a command that can be issued to an empty database, will download the data from a master (or a co-slave), and then attach the new slave to the master (using the configuration provided on the command line or by the co-slave). At the SQL level. With progress report.
And please have the data move at disk speed: no index creation, ship binary index data from the co-slave or master. This can be fast!
"FLUSH TABLES WITH READ LOCK" can still break, if a checkpoint occurs under the lock.
Can we have FLUSH TABLES WITH READ LOCK AND NO CHECKPOINT and a FORCE CHECKPOINT for InnoDB, or have FLUSH TABLES force this checkpoint when applied to InnoDB?
-- Kris, SimonM
Server restart required (Replication)
Almost every replication config change requires a server restart to be picked up. Please fix!
RBR conflict resolution SP
In RBR, when there is any kind of RBR problem, call an SP. The SP gets all required data made available somehow, and can define a resolution policy. Four outcomes: choose the old row, choose the new row, fail (error), or synthesize (merge) a new row.
Optionally write the binlog in a compressed fashion, at the event level.
Async binlog shipping
On binlog rotation, run an SP asynchronously. Note that the execution of, for example, a shell command through a UDF must not lock anything! Provide sufficient environmental data to the SP (old and new binlog names, binlog positions for the old and new log, etc.).
Use tables, not files
Get rid of master.info and relay-log.info. These should be InnoDB tables in mysql.*.
Less critical wishes
Async query API so that we can map/reduce at the client level
Provide an API in libmysqlclient.so so that we can write application-side map/reduce query-shards-in-parallel actions.
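What we build by hand today, sketched with a thread pool standing in for the missing async client API (`run_query` is an assumed per-shard query function, stubbed here for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Fan one query out to all shards in parallel (map), then combine the
# partial results in the application (reduce). With a real async client API
# this would not need one blocking thread per shard.
def map_reduce(shards, run_query, sql, reduce_fn):
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = pool.map(lambda shard: run_query(shard, sql), shards)
        return reduce_fn(results)

# Stub: pretend each shard returns its partial COUNT(*).
counts = {"shard1": 10, "shard2": 32}
total = map_reduce(list(counts), lambda shard, sql: counts[shard],
                   "SELECT COUNT(*) FROM bookings", sum)
print(total)  # 42
```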
Make Indexes first level objects
An index should be able to exist as an object that is not tied to a physical table. That would allow for indexes that are global to a partitioned table, and for functional indexes or indexes on virtual columns.
ALTER TABLE progress report
ALTER TABLE can take a long time. We know how to go into $datadir and find the secret files and observe them, but we'd rather see progress in SHOW PROCESSLIST.
done while you wait (*closes emacs*).
Fast ALTER TABLE
Can we have proper fast ALTER TABLE? That would include online index drop/create, and a versioned schema with version numbers at the page level or the row level. Rows would be upgraded to the most recent schema version not at ALTER TABLE time, but whenever their row/page is dirtied for whatever reason.
-- Kris, Herald.
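The lazy-upgrade idea above, as a toy model (names like `touch` and the migration registry are ours for illustration, not any engine's API):

```python
# Each row remembers the schema version it was written under, and is
# migrated to the latest version only when it is dirtied again, never in
# bulk at ALTER TABLE time.
MIGRATIONS = {}  # from_version -> function(row_dict) -> row_dict
LATEST = 2

def migration(from_version):
    def register(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return register

@migration(1)
def add_currency(row):
    # v1 -> v2: the effect of a hypothetical ALTER TABLE ... ADD currency
    row["currency"] = "EUR"
    return row

def touch(row, version):
    """Called whenever a row is dirtied; upgrades it to LATEST lazily."""
    while version < LATEST:
        row = MIGRATIONS[version](row)
        version += 1
    return row, version

row, v = touch({"price": 100}, 1)
print(row, v)  # the old row gains the new column only when touched
```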
There is the facebook hack for online ALTER TABLE.
Yes. We know. It is as much part of the problem space as it is part of the solution space.
-- Kris, Herald
Nested transactions / In TXN flag
"I would like to have nested transactions. They would make some things I need to solve a lot easier (layering in code, lower layers with their own TXNs being called by upper layers with or without a TXN going on)."
"I would like to see a viable definition of nested transactions. I would like to know what it means when an inner TXN does a COMMIT and then the outer TXN does a ROLLBACK. Are there any other ways to solve your layering problem?"
Another group asked about solutions for the same layering problem, but in a different way: "We would like to have a flag or status variable that tells code if it currently is inside a transaction, and optionally, if that is an implied (autocommit = 0) or explicit (autocommit = 1 and BEGIN WORK was issued) transaction."
We are using MyISAM in some places for mostly append-only tables of very large size (20 TB and more target size). We are considering Aria for this, if Aria is comparable to MyISAM in space efficiency and has nicer recovery after an unplanned shutdown.
The MyISAM tables we use are compressed with myisampack/myisamchk once they are finished being written to. This process is currently cumbersome, as it involves a FLUSH TABLES and external commands being run outside of the DB server.
Having an SQL command that packs MyISAM/Aria tables online would be useful, as it would handle open file handles and other stuff more nicely, and could be controlled from the client using only SQL.
Having ARCHIVE-engine-like row-level compression would be even nicer, but will probably create all kinds of problems when writes happen to already compressed rows.
For our current DWH needs just a COMPRESS TABLE/UNCOMPRESS TABLE command would be sufficient, and is probably a lot less invasive.
"At the moment GET_LOCK() is limited to one lock per connection. Acquiring a second lock while holding one releases the initial lock. This has been done in order to avoid deadlocking problems, we presume.
Having a deadlock detection system in place, and being able to pick up multiple locks at once or gradually, would be useful."
There are voices (mine, for one) which state that if you are using GET_LOCK() instead of InnoDB row locks, then you are doing it wrong. Using GET_LOCK() more is a step in the wrong direction, technology-wise.
Basic DWH improvements
"Hash Join would help in many places."
"Hash Joins would be useful."
"Way ahead of you."
-- Yves, Debs
"We hear you."
"Some large joins take ages to execute. Probably due to tmp tables going to disk."
Probably better off with hash joins? Show query examples, please.
Better EXPLAIN, better stats
Better EXPLAIN, more statistics, display real I/O data for queries, list causes for WAIT.
PERFORMANCE_SCHEMA, MySQL 5.5 stuff plus addons.
A WL item for better EXPLAIN exists.
Please link that item here.
"Better statistics for queries." "For example log query skeletons a la mysqldumpslow, but live from the parser."
Yes, please. Remember that we have tables with time-tailed names (_YYYYMMDD), variable length WHERE ... IN (...) lists, and variable length extended INSERT statements.
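A sketch of the normalization we have in mind, covering exactly those specifics; the regexes are illustrative, not a complete SQL tokenizer:

```python
import re

# Collapse a concrete query into its "skeleton" a la mysqldumpslow:
# literals become ?, time-tailed table names become _YYYYMMDD, and
# variable-length IN (...) and extended-INSERT value lists collapse so
# that all variants map to one skeleton.
def skeleton(sql: str) -> str:
    s = re.sub(r"_\d{8}\b", "_YYYYMMDD", sql)                   # time-tailed tables
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)                    # string literals
    s = re.sub(r"\b\d+\b", "?", s)                              # numeric literals
    s = re.sub(r"\bIN\s*\([^)]*\)", "IN (...)", s, flags=re.I)  # IN lists
    s = re.sub(r"\bVALUES\s*\(.*\)", "VALUES (...)", s,         # extended inserts
               flags=re.I | re.S)
    return s

print(skeleton("SELECT * FROM clicks_20110217 WHERE id IN (1, 2, 3)"))
# SELECT * FROM clicks_YYYYMMDD WHERE id IN (...)
print(skeleton("INSERT INTO t VALUES (1,'a'),(2,'b')"))
# INSERT INTO t VALUES (...)
```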
"Unquoted numbers cause weird query plans."
Yes, they have to, thanks to numeric casting rules in SQL. This is not really solvable.
In MongoDB (gasp!) it can be arranged that a query using a full scan instead of an index is an error. MySQL can also log queries not using an index in the slow log, which is good. But can we optionally make such implied casts cause an error? That would make the offending SQL easier to find in development.
Using memcached as a backend
Moving ibd files physically
For individual table restore and for shipment of precalculated/preloaded data, it would be useful to detach ibd files, ship them, and reattach them elsewhere, in binary form, at disk speed, with no index creation necessary.
Possible with InnoDB or XtraDB as the donor, xtrabackup as the program, and a properly configured XtraDB as the receiver.
InnoDB should be able to do this natively, via ALTER TABLE ... DETACH/ATTACH.
-- Everybody else.
More data load issues
SELECT INTO OUTFILE LOCAL? We do have LOAD DATA INFILE LOCAL.
Yes, but as implemented it is a security risk, and it also messes with firewall config due to the way it connects.
Ok, I will write this up in a better way.
Also, please try mysqldumper.
Server restart required (InnoDB)
About every second InnoDB config change requires a server restart to be picked up. Please fix!
Faster I_S. Consider a server with one million tables. Make I_S not suck on this.
GRANTS, ROLES, LDAP
We would like to be able to define roles (e.g. developer, webuser, admin) and then assign a set of roles to users. One role is the default role picked up on login; the others can be picked up and dropped with a command.
Example (Kris): Define role developer, developer has access to ALL on schema x.*. Define role dba, dba has ALL on *.* with grant option. Define user debs with default role developer, additional role dba.
Debs logs in (username and password from LDAP, possibly also role info from LDAP, optionally). Debs has developer rights. Debs issues TAKE ROLE DBA or something like it. She now has DBA privileges. Debs issues DROP ROLE DBA. She now has developer privileges.
A change to a role definition should be picked up when the role is picked up. For example: the role DBA changes privileges while Debs has been logged in for 3 days as developer. When she TAKEs ROLE DBA, the new DBA definition is picked up.
A change of role lists should be picked up at login.
The database should be able to work fine with one million users and one million roles. Do not keep all data in memory; instead, load the required info on login into a cache (size changeable without restart). On login, load the requested data if it is not already in the cache.
Example: SET GLOBAL USERNAME_CACHE = 1000000;
-- yes, we pay that memory overhead
SET GLOBAL ROLE_CACHE = 1000000;
-- here as well
As users log in, these caches fill up. If we SET GLOBAL ROLE_CACHE = 1000, this works as an LRU for the 1000 most recently used roles.
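The intended cache behaviour, modeled as a toy LRU (`load_role` stands in for reading a role's grants from mysql.*; all names here are ours, not existing server variables or APIs):

```python
from collections import OrderedDict

# Toy model of the proposed ROLE_CACHE: role definitions are loaded on
# first use, and the least recently used entry is evicted once the
# configurable cache size is exceeded.
class RoleCache:
    def __init__(self, size, load_role):
        self.size, self.load_role = size, load_role
        self.cache = OrderedDict()  # role name -> grants, oldest first

    def resize(self, size):
        """Models SET GLOBAL ROLE_CACHE = size: no restart, evict overflow."""
        self.size = size
        while len(self.cache) > self.size:
            self.cache.popitem(last=False)

    def get(self, role):
        if role in self.cache:
            self.cache.move_to_end(role)  # mark as most recently used
        else:
            self.cache[role] = self.load_role(role)
            if len(self.cache) > self.size:
                self.cache.popitem(last=False)  # evict the LRU role
        return self.cache[role]

cache = RoleCache(2, load_role=lambda r: {"role": r, "grants": []})
cache.get("developer"); cache.get("dba"); cache.get("webuser")
print(list(cache.cache))  # 'developer' has been evicted
```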
Initial thoughts are:
CREATE ROLE ROLE_NAME; -- makes the role exist
GRANT ... ON db.table TO ROLE_NAME; -- add grants to a "role", similar to grants currently given to a user
GRANT ROLE ROLE_NAME1 TO 'USER'@'...' [IDENTIFIED BY [PASSWORD] 'xxxxxx']; -- add the role's grants to a user
GRANT ROLE ROLE_NAME2 TO 'USER'@'...'; -- add the role's grants to a user
with corresponding revokes:
REVOKE ROLE ROLE_NAME1 FROM 'USER'@'..'; -- remove the role's grants from a user
REVOKE ... ON db.table FROM ROLE_NAME1; -- remove grants from a role
and probably something like
SHOW ROLES; -- list all roles known
SHOW ROLE ROLE_NAME1; -- show the "grants" for a role
SHOW GRANTS FOR 'user'@'location'; -- should include the 'roles' the user has been granted
Note: it would also be nice if there were something like the equivalent of mk-show-grants, which shows ALL grants for ALL users. We use this all the time, as there is no SQL equivalent at the moment.
Views are parsed for every access, never cached.
-- Monty, Ruud
Proper tablespace management a la Oracle.
Why? Please state benefits, spec it.
My thoughts: we have database instances with data ranging from 20G to 2TB, and some archive DB servers with up to 40TB of target data size.
Managing this with InnoDB as one large "blob" (one or more ibdata files) is not comfortable. innodb_file_per_table splits this blob over multiple differently sized files. Possibly more efficient is to have multiple reasonably fixed-size files [the tablespaces] (a smaller number than a per-table count).
- An instance contains one or more tablespaces.
- A tablespace contains one or more datafiles (say 10G-50G depending on instance, DBA configurable)
- It should be possible to increase the size of a datafile
- It should be possible to assign a database or table to a specific tablespace
- It should also be possible to move tables out of a datafile (prior to perhaps dropping the file).
- Practicalities: make it possible to easily export/import a tablespace (with its contents) from one instance to another.
It's true this does sound like quite a lot of extra work. Benefits are:
- easier to move around the files between instances
- space freed up by DELETEs or UPDATEs would be available to other tables (not the case with file_per_table unless you rebuild the table with ALTER TABLE)
- locating specific tablespaces on a specific location means that you can tune for better performance: e.g. move different tablespaces onto different disks/filesystems/LUNs. (this is hard to do now cleanly)
So most of this is to help with management of the on-disk/file data operationally.
If NOT using XFS with O_DIRECT, there is an in-memory write lock on the inode; that is, the filesystem allows only one write at a time to any single file. Tablespaces and datafiles are a way to generate concurrency here if XFS+O_DIRECT is not an option.
Tablespaces, if done right, can provide LVM services to a storage engine in a way independent of the OS.
Tablespaces, if done right, can expose an API internally, that all storage engines can use instead of talking to the filesystem directly. That would unify the actual storage management of all storage engines.
Get rid of MyISAM in mysql.*
You should be able to use any general storage engine in mysql.*, e.g. InnoDB.
Get rid of log tables in mysql.*
The mysql.* schema is a config repository. It was a mistake to place potential log tables in here, e.g. the general log and slow query log tables.
Currently, timezones are part of the session, not part of the column. Some data types take timezone data from the session into account (broken!); other data types never do.
In general, this needs to be treated either the Drizzle way (make all charsets utf8, make all time data UTC), or the MySQL string way (store the charset with each string, store the TZ with each date or time, and provide blob types as well as non-TZ types for string resp. date/time types).
Essentially this is a request to create DATE_TZ, TIME_TZ, DATETIME_TZ and TIMESTAMP_TZ, relegate the timezone in a session to the same function as the charset in a session ('this is what the client uses'), and convert accordingly.
Make these date and time types actually store microsecond resolution, too. Thanks.
Create an ENUM-like data type that does string compression (at scale). For example, create a type LOOKUP(255) CHARSET utf8. We store browser user-agent strings in it; internally, you store bigint unsigned numbers and create a lookup structure so that each occurrence of a string is physically stored only once. Convert from VARCHAR to LOOKUP using ALTER TABLE.
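A toy model of the proposed LOOKUP type's storage side (all names are hypothetical; this only illustrates the store-once/reference-by-id idea):

```python
# Each distinct string is stored exactly once; rows hold only a small
# integer id. This pays off for low-cardinality columns such as browser
# user-agent strings, which repeat millions of times.
class LookupColumn:
    def __init__(self):
        self.ids = {}      # string -> id
        self.strings = []  # id -> string, each stored once

    def store(self, value: str) -> int:
        """Return the id for value, interning it on first sight."""
        if value not in self.ids:
            self.ids[value] = len(self.strings)
            self.strings.append(value)
        return self.ids[value]

    def fetch(self, row_id: int) -> str:
        return self.strings[row_id]

col = LookupColumn()
rows = [col.store(ua) for ua in ["Mozilla/5.0", "curl/7.21", "Mozilla/5.0"]]
print(rows)              # the repeated string maps to the same id
print(len(col.strings))  # only two distinct strings are stored
```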
Multiple buffer pools
InnoDB multiple buffer pools, but with the size of each pool, and which pool a page goes to, configured by the DBA at the table level (instead of by random/hash as in 5.5).
InnoDB temp tables
Discussed internally: would it not be possible to use InnoDB temporary tables or something similar for this, perhaps in an explicit tempdb (like Sybase)?
That is: tables there are always temporary, cleaned up on server restart, and writes to these tables do not need to be (should not be) flushed.
That would make the MEMORY engine thing from above irrelevant.
Providing Input to Monty Program
Running an instance with profiling on
Monty requested that we set up a slave with full profiling enabled, and if possible even divert a minimal amount of production load to it. The profiling data will be useful if made accessible to his team, as it contains useful information from real-world load.
Also, it would be useful if the box ran oprofile on a regular basis (with symbols installed), as this provides additional useful information.
We should try out other engines for fast inserts/DWH uses: Infobright, InfiniDB, PBXT, ...