MariaDB starting with 10.0.5

Starting with MariaDB 10.0.5, MariaDB switched to the PCRE regular expression library for enhanced regular expressions.

PCRE Regular expression enhancements

MariaDB 10.0.5 switched to the PCRE regular expression library, which significantly improved the power of the REGEXP/RLIKE operator.

The switch to PCRE added a number of features, including recursive patterns, named capture, look-ahead and look-behind assertions, non-capturing groups, non-greedy quantifiers, Unicode character properties, extended syntax for characters and character classes, multi-line matching, and many other.

Additionally, MariaDB 10.0.5 introduced three new functions that work with regular expressions: REGEXP_REPLACE(), REGEXP_INSTR() and REGEXP_SUBSTR().

Also, REGEXP/RLIKE, and the new functions, now work correctly with all multi-byte character sets supported by MariaDB, including East-Asian character sets (big5, gb2313, gbk, eucjp, eucjpms, cp932, ujis, euckr), and Unicode character sets (utf8, utf8mb4, ucs2, utf16, utf16le, utf32). In earlier versions of MariaDB (and all MySQL versions) REGEXP/RLIKE works correctly only with 8-bit character sets.

New regular expression functions

See the individual articles for more details and examples.

PCRE syntax

In most cases PCRE is backward compatible with the old POSIX 1003.2 compliant regexp library (see Regular Expressions Overview), so you won't need to change your applications that use SQL queries with the REGEXP/RLIKE predicate.

This section briefly describes the most important extended PCRE features. For more details please refer to the documentation on the PCRE site, or to the documentation which is bundled in the /pcre/doc/html/ directory of a MariaDB sources distribution. The pages pcresyntax.html and pcrepattern.html should be a good start. Regular-Expressions.Info is another good resource to learn about PCRE and regular expressions generally.

Special characters

PCRE supports the following escape sequences to match special characters:

SequenceDescription
\a0x07 (BEL)
\cx"control-x", where x is any ASCII character
\e0x1B (escape)
\f0x0C (form feed)
\n0x0A (newline)
\r0x0D (carriage return)
\t0x09 (TAB)
\dddcharacter with octal code ddd
\xhhcharacter with hex code hh
\x{hhh..}character with hex code hhh..

Note, the backslash characters (here, and in all examples in the sections below) must be escaped with another backslash, unless you're using the SQL_MODE NO_BACKSLASH_ESCAPES.

This example tests if a character has hex code 0x61:

SELECT 'a' RLIKE '\\x{61}';
-> 1

Character classes

PCRE supports the standard POSIX character classes such as alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, with the following additional classes:

ClassDescription
asciiany ASCII character (0x00..0x7F)
wordany "word" character (a letter, a digit, or an underscore)

This example checks if the string consists of ASCII characters only:

SELECT 'abc' RLIKE '^[[:ascii:]]+$';
-> 1

Generic character types

Generic character types complement the POSIX character classes and serve to simplify writing patterns:

ClassDescription
\da decimal digit (same as [:digit:])
\Da character that is not a decimal digit
\ha horizontal white space character
\Ha character that is not a horizontal white space character
\Na character that is not a new line
\Ra newline sequence
\sa white space character
\Sa character that is not a white space character
\va vertical white space character
\Va character that is not a vertical white space character
\wa "word" character (same as [:word:])
\Wa "non-word" character

This example checks if the string consists of "word" characters only:

SELECT 'abc' RLIKE '^\\w+$';
-> 1

Unicode character properties

\p{xx} is a character with the xx property, and \P{xx} is a character without the xx property.

The property names represented by xx above are limited to the Unicode script names, the general category properties, and "Any", which matches any character (including newline). Those that are not part of an identified script are lumped together as "Common".

General category properties for \p and \P

PropertyDescription
COther
CcControl
CfFormat
CnUnassigned
CoPrivate use
CsSurrogate
LLetter
LlLower case letter
LmModifier letter
LoOther letter
LtTitle case letter
LuUpper case letter
L&Ll, Lu, or Lt
MMark
McSpacing mark
MeEnclosing mark
MnNon-spacing mark
NNumber
NdDecimal number
NlLetter number
NoOther number
PPunctuation
PcConnector punctuation
PdDash punctuation
PeClose punctuation
PfFinal punctuation
PiInitial punctuation
PoOther punctuation
PsOpen punctuation
SSymbol
ScCurrency symbol
SkModifier symbol
SmMathematical symbol
SoOther symbol
ZSeparator
ZlLine separator
ZpParagraph separator
ZsSpace separator

This example checks if the string consists only of characters with property N (number):

SELECT '1¼①' RLIKE '^\\p{N}+$';
-> 1

Special category properties for \p and \P

PropertyDescription
XanAlphanumeric: union of properties L and N
XpsPOSIX space: property Z or tab, NL, VT, FF, CR
XspPerl space: property Z or tab, NL, FF, CR
XucA character than can be represented by a Universal Character Name
XwdPerl word: property Xan or underscore

The property Xuc matches any character that can be represented by a Universal Character Name (in C++ and other programming languages). These include $, @, `, and all characters with Unicode code points greater than U+00A0, excluding the surrogates U+D800..U+DFFF.

Script names for \p and \P

Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.

This example checks if the string consists only of Greek characters:

SELECT 'ΣΦΩ' RLIKE '^\\p{Greek}+$';
-> 1

Extended Unicode grapheme sequence

The \X escape sequence matches a character sequence that makes an "extended grapheme cluster", i.e. a composite character that consists of multiple Unicode code points.

One of the examples of a composite character can be a letter followed by non-spacing accent marks. This example demonstrates that U+0045 LATIN CAPITAL LETTER E followed by U+0302 COMBINING CIRCUMFLEX ACCENT followed by U+0323 COMBINING DOT BELOW together form an extended grapheme cluster:

SELECT _ucs2 0x004503020323 RLIKE '^\\X$';
-> 1

See the PCRE documentation for the other types of extended grapheme clusters.

Simple assertions

An assertion specifies a certain condition that must match at a particular point, but without consuming characters from the subject string. In addition to the standard POSIX simple assertions ^ (that matches at the beginning of a line) and $ (that matches at the end of a line), PCRE supports a number of other assertions:

AssertionDescription
\bmatches at a word boundary
\Bmatches when not at a word boundary
\Amatches at the start of the subject
\Zmatches at the end of the subject, also matches before a newline at the end of the subject
\zmatches only at the end of the subject
\Gmatches at the first matching position in the subject

This example cuts a word that consists only of 3 characters from a string:

SELECT REGEXP_SUBSTR('---abcd---xyz---', '\\b\\w{3}\\b');
-> xyz

Notice that the two \b assertions checked the word boundaries but did not get into the matching pattern.

The \b assertions work well in the beginning and the end of the subject string:

SELECT REGEXP_SUBSTR('xyz', '\\b\\w{3}\\b');
-> xyz

By default, the ^ and $ assertions have the same meaning with \A, \Z, and \z. However, the meanings of ^ and $ can change in multiline mode (see below). By contrast, the meanings of \A, \Z, and \z are always the same; they are independent of the multiline mode.

Option setting

A number of options that control the default match behavior can be changed within the pattern by a sequence of option letters enclosed between (? and ).

OptionDescription
(?i)case insensitive match
(?m)multiline mode
(?s)dotall mode (dot matches newline characters)
(?x)extended (ignore white space)
(?U)ungreedy (lazy) match
(?J)allow named subpatterns with duplicate names
(?X)extra PCRE functionality (e.g. force error on unknown escaped character)
(?-...)unset option(s)

For example, (?im) sets case insensitive multiline matching.

A hyphen followed by the option letters unset the options. For example, (?-im) means case sensitive single line match.

A combined setting and unsetting is also possible, e.g. (?im-sx).

If an option is set outside of subpattern parentheses, the option applies to the remainder of the pattern that follows the option. If an option is set inside a subpattern, it applies to the part of this subpattern that follows the option.

In this example the pattern (?i)m((?-i)aria)db matches the words MariaDB, Mariadb, mariadb, but not MARIADB:

SELECT 'MariaDB' RLIKE '(?i)m((?-i)aria)db';
-> 1

SELECT 'mariaDB' RLIKE '(?i)m((?-i)aria)db';
-> 1

SELECT 'Mariadb' RLIKE '(?i)m((?-i)aria)db';
-> 1

SELECT 'MARIADB' RLIKE '(?i)m((?-i)aria)db';
-> 0

This example demonstrates that the (?x) option makes the regexp engine ignore all white spaces in the pattern (other than in a class).

SELECT 'ab' RLIKE '(?x)a b';
-> 1

Note, putting spaces into a pattern in combination with the (?x) option can be useful to split different logical parts of a complex pattern, to make it more readable.

Multiline matching

Multiline matching changes the meaning of ^ and $ from "the beginning of the subject string" and "the end of the subject string" to "the beginning of any line in the subject string" and "the end of any line in the subject string" respectively.

This example checks if the subject string contains two consequent lines that fully consist of digits:

SELECT 'abc\n123\n456\nxyz\n' RLIKE '(?m)^\\d+\\R\\d+$';
-> 1

Notice the (?m) option in the beginning of the pattern, which switches to the multiline matching mode.

Newline conventions

PCRE supports five line break conventions:

  • CR (\r) - a single carriage return character
  • LF (\n) - a single linefeed character
  • CRLF (\r\n) - a carriage return followed by a linefeed
  • any of the previous three
  • any Unicode newline sequence

By default, the newline convention is set to any Unicode newline sequence, which includes:

SequenceDescription
LF(U+000A, carriage return)
CR(U+000D, carriage return)
CRLF(a carriage return followed by a linefeed)
VT(U+000B, vertical tab)
FF(U+000C, form feed)
NEL(U+0085, next line)
LS(U+2028, line separator)
PS(U+2029, paragraph separator)

The newline convention can be set by starting a pattern with one of the following sequences:

SequenceDescription
(?CR)carriage return
(?LF)linefeed
(?CRLF)carriage return followed by linefeed
(?ANYCRLF)any of the previous three
(?ANY)all Unicode newline sequences

The newline conversion affects the ^ and $ assertions, the interpretation of the dot metacharacter, and the behavior of \N.

Note, the new line convention does not affect the meaning of \R.

This example demonstrates that the dot metacharacter matches \n, because it is not a newline sequence anymore:

SELECT 'a\nb' RLIKE '(*CR)a.b';
-> 1

Newline sequences

By default, the escape sequence \R matches any Unicode newline sequences.

The meaning of \R can be set by starting a pattern with one of the following sequences:

SequenceDescription
(?BSR_ANYCRLF)any of CR, LF or CRLF
(?BSR_UNICODE)any Unicode newline sequence

Comments

It's possible to include comments inside a pattern. Comments do not participate in the pattern matching. Comments start at the (?# sequence and continue up to the next closing parenthesis:

SELECT 'ab12' RLIKE 'ab(?#expect digits)12';
-> 1

Quoting

POSIX uses the backslash to remove a special meaning from a character. PCRE introduces a syntax to remove special meaning from a sequence of characters. The characters inside \Q ... \E are treated literally, without their special meaning.

This example checks if the string matches a dollar sign followed by a parenthesized name (a variable reference in some languages):

SELECT '$(abc)' RLIKE '^\\Q$(\\E\\w+\\Q)\\E$';
-> 1

Note that the leftmost dollar sign and the parentheses are used literally, while the rightmost dollar sign is still used to match the end of the string.

Resetting the match start

The escape sequence \K causes any previously matched characters to be excluded from the final matched sequence. For example, the pattern: (foo)\Kbar matches foobar, but reports that it has matched bar. This feature is similar to a look-behind assertion. However, in this case, the part of the subject before the real match does not have to be of fixed length:

SELECT REGEXP_SUBSTR('aaa123', '[a-z]*\\K[0-9]*');
-> 123

Non-capturing groups

The question mark and the colon after the opening parenthesis create a non-capturing group: (?:...).

This example removes an optional article from a word, for example for better sorting of the results.

SELECT REGEXP_REPLACE('The King','(?:the|an|a)[^a-z]([a-z]+)','\\1');
-> King

Note that the articles are listed inside the left parentheses using the alternation operator | but they do not produce a captured subpattern, so the word followed by the article is referenced by '
1'
in the third argument to the function. Using non-capturing groups can be useful to save numbers on the sup-patterns that won't be used in the third argument of REGEXP_REPLACE(), as well as for performance purposes.

Non-greedy quantifiers

By default, the repetition quantifiers ?, *, + and {n,m} are "greedy", that is, they try to match as much as possible. Adding a question mark after a repetition quantifier makes it "non-greedy", so the pattern matches the minimum number of times possible.

This example cuts C comments from a line:

SELECT REGEXP_REPLACE('/* Comment1 */ i+= 1; /* Comment2 */', '/[*].*?[*]/','');
->  i+= 1;

The pattern without the non-greedy flag to the quantifier /[*].*[*]/ would match the entire string between the leftmost /* and the rightmost */.

Atomic groups

A sequence inside (?>...) makes an atomic group. Backtracking inside an atomic group is prevented once it has matched; however, backtracking past to the previous items works normally.

Consider the pattern \d+foo applied to the subject string 123bar. Once the engine scans 123 and fails on the letter b, it would normally backtrack to 2 and try to match again, then fail and backtrack to 1 and try to match and fail again, and finally fail the entire pattern. In case of an atomic group (?>\d+)foo with the same subject string 123bar, the engine gives up immediately after the first failure to match foo. An atomic group with a quantifier can match all or nothing.

Atomic groups produce faster false results (i.e. in case when a long subject string does not match the pattern), because the regexp engine saves performance on backtracking. However, don't hurry to put everything into atomic groups. This example demonstrates the difference between atomic and non-atomic match:

SELECT 'abcc' RLIKE 'a(?>bc|b)c' AS atomic1;
-> 1

SELECT 'abc' RLIKE 'a(?>bc|b)c' AS atomic2;
-> 0
 
SELECT 'abcc' RLIKE 'a(bc|b)c' AS non_atomic1;
-> 1

SELECT 'abc' RLIKE 'a(bc|b)c' AS non_atomic2;
-> 1

The non-atomic pattern matches both abbc and abc, while the atomic pattern matches abbc only.

The atomic group (?>bc|b) in the above example can be "translated" as "if there is bc, then don't try to match as b". So b can match only if bc is not found.

Atomic groups are not capturing. To make an atomic group capturing, put it into parentheses:

SELECT REGEXP_REPLACE('abcc','a((?>bc|b))c','\\1');
-> bc

Possessive quantifiers

An atomic group which ends with a quantifier can be rewritten using a so called "possessive quantifier" syntax by putting an additional + sign following the quantifier.

The pattern (?>\d+)foo from the previous section's example can be rewritten as \d++foo.

Absolute and relative numeric backreferences

Backreferences match the same text as previously matched by a capturing group. Backreferences can be written using:

  • a backslash followed by a digit
  • the \g escape sequence followed by a positive or negative number
  • the \g escape sequence followed by a positive or negative number enclosed in braces

The following backreferences are identical and refer to the first capturing group:

  • \1
  • \g1
  • \g{1}

This example demonstrates a pattern that matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility":

SELECT 'sense and sensibility' RLIKE '(sens|respons)e and \\1ibility';
-> 1

This example removes doubled words that can unintentionally creep in when you edit a text in a text editor:

SELECT REGEXP_REPLACE('using using the the regexp regexp',
 '\\b(\\w+)\\s+\\1\\b','\\1');
-> using the regexp

Note that all double words were removed, in the beginning, in the middle and in the end of the subject string.

A negative number in a \g sequence means a relative reference. Relative references can be helpful in long patterns, and also in patterns that are created by joining fragments together that contain references within themselves. The sequence \g{-1} is a reference to the most recently started capturing subpattern before \g.

In this example \g{-1} is equivalent to \2:

SELECT 'abc123def123' RLIKE '(abc(123)def)\\g{-1}';     
-> 1

SELECT 'abc123def123' RLIKE '(abc(123)def)\\2';
-> 1

Named subpatterns and backreferences

Using numeric backreferences for capturing groups can be hard to track in a complicated regular expression. Also, the numbers can change if an expression is modified. To overcome these difficulties, PCRE supports named subpatterns.

A subpattern can be named in one of three ways: (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing subpatterns from other parts of the pattern, can be made by name as well as by number.

Backreferences to a named subpattern can be written using the .NET syntax \k{name}, the Perl syntax \k<name> or \k'name' or \g{name}, or the Python syntax (?P=name).

This example tests if the string is a correct HTML tag:

SELECT '<a href="../">Up</a>' RLIKE '<(?<tag>[a-z][a-z0-9]*)[^>]*>[^<]*</(?P=tag)>';
-> 1

Positive and negative look-ahead and look-behind assertions

Look-ahead and look-behind assertions serve to specify the context for the searched regular expression pattern. Note that the assertions only check the context, they do not capture anything themselves!

This example finds the letter which is not followed by another letter (negative look-ahead):

SELECT REGEXP_SUBSTR('ab1','[a-z](?![a-z])');
-> b

This example finds the letter which is followed by a digit (positive look-ahead):

SELECT REGEXP_SUBSTR('ab1','[a-z](?=[0-9])');
-> b

This example finds the letter which does not follow a digit character (negative look-behind):

SELECT REGEXP_SUBSTR('1ab','(?<![0-9])[a-z]');
-> b

This example finds the letter which follows another letter character (positive look-behind):

SELECT REGEXP_SUBSTR('1ab','(?<=[a-z])[a-z]');
-> b

Note that look-ahead and look-behind assertions can only be of fixed length; you cannot have repetition operators or alternations with different lengths:

SELECT 'aaa' RLIKE '(?<=(a|bc))a';
ERROR 1139 (42000): Got error 'lookbehind assertion is not fixed length at offset 10' from regexp

Subroutine reference and recursive patterns

PCRE supports a special syntax to recourse the entire pattern or its individual subpatterns:

SyntaxDescription
(?R)Recourse the entire pattern
(?n)call subpattern by absolute number
(?+n)call subpattern by relative number
(?-n)call subpattern by relative number
(?&name)call subpattern by name (Perl)
(?P>name)call subpattern by name (Python)
\g<name>call subpattern by name (Oniguruma)
\g'name'call subpattern by name (Oniguruma)
\g<n>call subpattern by absolute number (Oniguruma)
\g'n'call subpattern by absolute number (Oniguruma)
\g<+n>call subpattern by relative number
\g<-n>call subpattern by relative number
\g'+n'call subpattern by relative number
\g'-n'call subpattern by relative number

This example checks for a correct additive arithmetic expression consisting of numbers, unary plus and minus, binary plus and minus, and parentheses:

SELECT '1+2-3+(+(4-1)+(-2)+(+1))' RLIKE  '^(([+-]?(\\d+|[(](?1)[)]))(([+-](?1))*))$';
-> 1

The recursion is done using (?1) to call for the first parenthesized subpattern, which includes everything except the leading ^ and the trailing $.

The regular expression in the above example implements the following BNF grammar:

  1. <expression> ::= <term> [(<sign> <term>)...]
  2. <term> ::= [ <sign> ] <primary>
  3. <primary> ::= <number> | <left paren> <expression> <right paren>
  4. <sign> ::= <plus sign> | <minus sign>

Defining subpatterns for use by reference

Use the (?(DEFINE)...) syntax to define subpatterns that can be referenced from elsewhere.

This example defines a subpattern with the name letters that matches one or more letters, which is further reused two times:

SELECT 'abc123xyz' RLIKE '^(?(DEFINE)(?<letters>[a-z]+))(?&letters)[0-9]+(?&letters)$';
-> 1

The above example can also be rewritten to define the digit part as a subpattern as well:

SELECT 'abc123xyz' RLIKE
 '^(?(DEFINE)(?<letters>[a-z]+)(?<digits>[0-9]+))(?&letters)(?&digits)(?&letters)$';
-> 1

Conditional subpatterns

There are two forms of conditional subpatterns:

(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)

The yes-pattern is used if the condition is satisfied, otherwise the no-pattern (if any) is used.

Conditions with subpattern references

If a condition consists of a number, it makes a condition with a subpattern reference. Such a condition is true if a capturing subpattern corresponding to the number has previously matched.

This example finds an optionally parenthesized number in a string:

SELECT REGEXP_SUBSTR('a(123)b', '([(])?[0-9]+(?(1)[)])');
-> (123)

The ([(])? part makes a capturing subpattern that matches an optional opening parenthesis; the [0-9]+ part matches a number, and the (?(1)[)]) part matches a closing parenthesis, but only if the opening parenthesis has been previously found.

Other kinds of conditions

The other possible condition kinds are: recursion references and assertions. See the PCRE documentation for details.

Matching zero bytes (0x00)

PCRE correctly works with zero bytes in the subject strings:

SELECT 'a\0b' RLIKE '^a.b$';
-> 1

Zero bytes, however, are not supported literally in the pattern strings and should be escaped using the \xhh or \x{hh} syntax:

SELECT 'a\0b' RLIKE '^a\\x{00}b$';
-> 1

Other PCRE features

PCRE provides other extended features that were not covered in this document, such as duplicate subpattern numbers, backtracking control, breaking utf-8 sequences into individual bytes, setting the match limit, setting the recursion limit, optimization control, recursion conditions, assertion conditions and more types of extended grapheme clusters. Please refer to the PCRE documentation for details.

Enhanced regex was implemented as a GSoC 2013 project by Sudheera Palihakkara.

See also

Comments

Comments loading...