1 of 1

Data Sampling: Techniques for Efficiently Finding a Random Row

Fetching random rows from a table (beyond ORDER BY RAND())

The problem

One would like to do "SELECT ... ORDER BY RAND() LIMIT 10" to get 10 rows at random. But this is slow. The optimizer does

Fetch all the rows -- this is costly
Append to the rows
Sort the rows -- also costly
Pick the first 10.

All the algorithms given below are "fast", but most introduce flaws:

Bias -- some rows are more like to be fetched than others.
Repetitions -- If two random sets contain the same row, they are likely to contain other dups.
Sometimes failing to fetch the desired number of rows.

"Fast" means avoiding reading all the rows. There are many techniques that require a full table scan, or at least an index scan. They are not acceptable for this list. There is even a technique that averages half a scan; it is relegated to a footnote.

Metrics

Here's a way to measure performance without having a big table.

If some of the "Handler" numbers look like the number of rows in the table, then there was a table scan.

None of the queries presented here need a full table (or index) scan. Each has a time proportional to the number of rows returned.

Virtually all published algorithms involve a table scan. The previously published version of this blog had, embarrassingly, several algorithms that had table scans.

Sometimes the scan can be avoided via a subquery. For example, the first of these will do a table scan; the second will not.

Case: Consecutive AUTO_INCREMENT without gaps, 1 row returned

Requirement: id
Requirement: No gaps in id

(Of course, you might be able to simplify this. For example, min_id is likely to be 1. Or precalculate limits into @min and @max.)

Case: Consecutive AUTO_INCREMENT without gaps, 10 rows

Requirement: AUTO_INCREMENT id
Requirement: No gaps in id
Flaw: Sometimes delivers fewer than 10 rows

The FLOOR expression could lead to duplicates, hence the inflated inner LIMIT. There could (rarely) be so many duplicates that the inflated LIMIT leads to fewer than the desired 10 different rows. One approach to that Flaw is to rerun the query if it delivers too few rows.

A variant:

Again, ugly but fast, regardless of table size.

Case: AUTO_INCREMENT with gaps, 1 or more rows returned

Requirement: AUTO_INCREMENT, possibly with gaps due to DELETEs, etc
Flaw: Only semi-random (rows do not have an equal chance of being picked), but it does partially compensate for the gaps
Flaw: The first and last few rows of the table are less likely to be delivered.

This gets 50 "consecutive" ids (possibly with gaps), then delivers a random 10 of them.

Yes, it is complex, but yes, it is fast, regardless of the table size.

Case: Extra FLOAT column for randomizing

(Unfinished: need to check these.)

Assuming rnd is a FLOAT (or DOUBLE) populated with RAND() and INDEXed:

Requirement: extra, indexed, FLOAT column
Flaw: Fetches 10 adjacent rows (according to rnd), hence not good randomness
Flaw: Near 'end' of table, can't find 10 rows.

These two variants attempt to resolve the end-of-table flaw:

Case: UUID or MD5 column

Requirement: UUID/GUID/MD5/SHA1 column exists and is indexed.
Similar code/benefits/flaws to AUTO_INCREMENT with gaps.
Needs 7 random HEX digits:

can be used as a start for adapting a gapped AUTO_INCREMENT case. If the field is BINARY instead of hex, then

Data Sampling: Techniques for Efficiently Finding a Random Row

Fetching random rows from a table (beyond ORDER BY RAND())

The problem

One would like to do "SELECT ... ORDER BY RAND() LIMIT 10" to get 10 rows at random. But this is slow. The optimizer does

Fetch all the rows -- this is costly
Append to the rows
Sort the rows -- also costly
Pick the first 10.

All the algorithms given below are "fast", but most introduce flaws:

Bias -- some rows are more like to be fetched than others.
Repetitions -- If two random sets contain the same row, they are likely to contain other dups.
Sometimes failing to fetch the desired number of rows.

Metrics

Here's a way to measure performance without having a big table.

If some of the "Handler" numbers look like the number of rows in the table, then there was a table scan.

None of the queries presented here need a full table (or index) scan. Each has a time proportional to the number of rows returned.

Virtually all published algorithms involve a table scan. The previously published version of this blog had, embarrassingly, several algorithms that had table scans.

Sometimes the scan can be avoided via a subquery. For example, the first of these will do a table scan; the second will not.

Case: Consecutive AUTO_INCREMENT without gaps, 1 row returned

Requirement: id
Requirement: No gaps in id

(Of course, you might be able to simplify this. For example, min_id is likely to be 1. Or precalculate limits into @min and @max.)

Case: Consecutive AUTO_INCREMENT without gaps, 10 rows

Requirement: AUTO_INCREMENT id
Requirement: No gaps in id
Flaw: Sometimes delivers fewer than 10 rows

A variant:

Again, ugly but fast, regardless of table size.

Case: AUTO_INCREMENT with gaps, 1 or more rows returned

Requirement: AUTO_INCREMENT, possibly with gaps due to DELETEs, etc
Flaw: Only semi-random (rows do not have an equal chance of being picked), but it does partially compensate for the gaps
Flaw: The first and last few rows of the table are less likely to be delivered.

This gets 50 "consecutive" ids (possibly with gaps), then delivers a random 10 of them.

Yes, it is complex, but yes, it is fast, regardless of the table size.

Case: Extra FLOAT column for randomizing

(Unfinished: need to check these.)

Assuming rnd is a FLOAT (or DOUBLE) populated with RAND() and INDEXed:

Requirement: extra, indexed, FLOAT column
Flaw: Fetches 10 adjacent rows (according to rnd), hence not good randomness
Flaw: Near 'end' of table, can't find 10 rows.

These two variants attempt to resolve the end-of-table flaw:

Case: UUID or MD5 column

Requirement: UUID/GUID/MD5/SHA1 column exists and is indexed.
Similar code/benefits/flaws to AUTO_INCREMENT with gaps.
Needs 7 random HEX digits:

can be used as a start for adapting a gapped AUTO_INCREMENT case. If the field is BINARY instead of hex, then

Data Sampling: Techniques for Efficiently Finding a Random Row

Fetching random rows from a table (beyond ORDER BY RAND())

The problem

Metrics

Case: Consecutive AUTO_INCREMENT without gaps, 1 row returned

Case: Consecutive AUTO_INCREMENT without gaps, 10 rows

Case: AUTO_INCREMENT with gaps, 1 or more rows returned

Case: Extra FLOAT column for randomizing

Case: UUID or MD5 column

See also

Data Sampling: Techniques for Efficiently Finding a Random Row

Fetching random rows from a table (beyond ORDER BY RAND())

The problem

Metrics

Case: Consecutive AUTO_INCREMENT without gaps, 1 row returned

Case: Consecutive AUTO_INCREMENT without gaps, 10 rows

Case: AUTO_INCREMENT with gaps, 1 or more rows returned

Case: Extra FLOAT column for randomizing

Case: UUID or MD5 column

See also