Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Learn about regular expression functions in MariaDB Server. This section details SQL functions for powerful pattern matching and manipulation of string data using regular expressions.
Get an overview of regex usage. This page introduces the pattern matching capabilities and common metacharacters used in MariaDB regular expressions.
Regular Expressions allow MariaDB to perform complex pattern matching on a string. In many cases, the simple pattern matching provided by LIKE is sufficient. LIKE performs two kinds of matches:
_ - the underscore, matching a single character
% - the percentage sign, matching any number of characters.
In other cases you may need more control over the returned matches, and will need to use regular expressions.
Regular expression matches are performed with the function. RLIKE is a synonym for REGEXP.
Comparisons are performed on the byte value, so characters that are treated as equivalent by a collation, but do not have the same byte-value, such as accented characters, could evaluate as unequal.
Without any special characters, a regular expression match is true if the characters match. The match is case-insensitive, except in the case of BINARY strings.
Note that the word being matched must match the whole pattern:
The first returns true because the pattern "Mari" exists in the expression "Maria". When the order is reversed, the result is false, as the pattern "Maria" does not exist in the expression "Mari"
A match can be performed against more than one word with the | character. For example:
The above examples introduce the syntax, but are not very useful on their own. It's the special characters that give regular expressions their power.
^ matches the beginning of a string (inside square brackets it can also mean NOT - see below):
$ matches the end of a string:
. matches any single character:
x* matches zero or more of a character x. In the examples below, it's the r character.
x+ matches one or more of a character x. In the examples below, it's the r character.
x? matches zero or one of a character x. In the examples below, it's the r character.
(xyz) - combine a sequence, for example (xyz)+ or (xyz)*
x{n} and x{m,n}
This notation is used to match many instances of the x. In the case of x{n} the match must be exactly that many times. In the case of x{m,n}, the match can occur from m to n times. For example, to match zero or one instance of the string ari (which is identical to (ari)?), the following can be used:
[xy] groups characters for matching purposes. For example, to match either the p or the r character:
The square brackets also permit a range match, for example, to match any character from a-z, [a-z] is used. Numeric ranges are also permitted.
The following does not match, as r falls outside of the range a-p.
^
The ^ character means does NOT match, for example:
The [ and ] characters on their own can be literally matched inside a [] block, without escaping, as long as they immediately match the opening bracket:
Incorrect order, so no match:
The - character can also be matched in the same way:
The :<: and :>: patterns match the beginning and the end of a word respectively. For example:
There are a number of shortcuts to match particular preset character classes. These are matched with the [:character_class:] pattern (inside a [] set). The following character classes exist:
For example:
Remember that matches are by default case-insensitive, unless a binary string is used, so the following example, specifically looking for an uppercase, counter-intuitively matches a lowercase character:
There are also number of shortcuts to match particular preset character names. These are matched with the [.character.] pattern (inside a [] set). The following character classes exist:
For example:
The true power of regular expressions is unleashed when the above is combined, to form more complex examples. Regular expression's reputation for complexity stems from the seeming complexity of multiple combined regular expressions, when in reality, it's simply a matter of understanding the characters and how they apply:
The first example fails to match, as while the Ma matches, either i or r only matches once before the ia characters at the end.
This example matches, as either i or r match exactly twice after the Ma, in this case one r and one i.
With the large number of special characters, care needs to be taken to properly escape characters. Two backslash characters, `` (one for the MariaDB parser, one for the regex library), are required to properly escape a character. For example:
To match the literal (Ma:
To match r+: The first two examples are incorrect, as they match r one or more times, not r+:
This page is licensed: CC BY-SA / Gnu FDL
lower
Lowercase alphabetic
Graphic or space characters
punct
Punctuation
space
Space, tab, newline, and carriage return
upper
Uppercase alphabetic
xdigit
Hexadecimal digit
006
BEL
007
alert
007
BS
010
backspace
'\b'
HT
011
tab
'\t'
LF
012
newline
'\n'
VT
013
vertical-tab
'\v'
FF
014
form-feed
'\f'
CR
015
carriage-return
'\r'
SO
016
SI
017
DLE
020
DC1
021
DC2
022
DC3
023
DC4
024
NAK
025
SYN
026
ETB
027
CAN
030
EM
031
SUB
032
ESC
033
IS4
034
FS
034
IS3
035
GS
035
IS2
036
RS
036
IS1
037
US
037
space
' '
exclamation-mark
'!'
quotation-mark
'"'
number-sign
'#'
dollar-sign
'$'
percent-sign
'%'
ampersand
'&'
apostrophe
'''
left-parenthesis
'('
right-parenthesis
')'
asterisk
'*'
plus-sign
'+'
comma
','
hyphen
'-'
hyphen-minus
'-'
period
'.'
full-stop
'.'
slash
'/'
solidus
'/'
zero
'0'
one
'1'
two
'2'
three
'3'
four
'4'
five
'5'
six
'6'
seven
'7'
eight
'8'
nine
'9'
colon
':'
semicolon
';'
less-than-sign
'<'
equals-sign
'='
greater-than-sign
'>'
question-mark
'?'
commercial-at
'@'
left-square-bracket
'['
backslash
''
reverse-solidus
''
right-square-bracket
']'
circumflex
'^'
circumflex-accent
'^'
underscore
'_'
low-line
'_'
grave-accent
'`'
left-brace
'{'
left-curly-bracket
'{'
vertical-line
'
right-brace
'}'
right-curly-bracket
'}'
tilde
''
DEL
177
alnum
Alphanumeric
alpha
Alphabetic
blank
Whitespace
cntrl
Control characters
digit
Digits
graph
Graphic characters
NUL
0
SOH
001
STX
002
ETX
003
EOT
004
ENQ
005
ACK
SELECT 'Maria' REGEXP 'Maria';
+------------------------+
| 'Maria' REGEXP 'Maria' |
+------------------------+
| 1 |
+------------------------+
SELECT 'Maria' REGEXP 'maria';
+------------------------+
| 'Maria' REGEXP 'maria' |
+------------------------+
| 1 |
+------------------------+
SELECT BINARY 'Maria' REGEXP 'maria';
+-------------------------------+
| BINARY 'Maria' REGEXP 'maria' |
+-------------------------------+
| 0 |
+-------------------------------+SELECT 'Maria' REGEXP 'Mari';
+-----------------------+
| 'Maria' REGEXP 'Mari' |
+-----------------------+
| 1 |
+-----------------------+
SELECT 'Mari' REGEXP 'Maria';
+-----------------------+
| 'Mari' REGEXP 'Maria' |
+-----------------------+
| 0 |
+-----------------------+SELECT 'Maria' REGEXP 'Monty|Maria';
+------------------------------+
| 'Maria' REGEXP 'Monty|Maria' |
+------------------------------+
| 1 |
+------------------------------+SELECT 'Maria' REGEXP '^Ma';
+----------------------+
| 'Maria' REGEXP '^Ma' |
+----------------------+
| 1 |
+----------------------+SELECT 'Maria' REGEXP 'ia$';
+----------------------+
| 'Maria' REGEXP 'ia$' |
+----------------------+
| 1 |
+----------------------+SELECT 'Maria' REGEXP 'Ma.ia';
+------------------------+
| 'Maria' REGEXP 'Ma.ia' |
+------------------------+
| 1 |
+------------------------+
SELECT 'Maria' REGEXP 'Ma..ia';
+-------------------------+
| 'Maria' REGEXP 'Ma..ia' |
+-------------------------+
| 0 |
+-------------------------+SELECT 'Maria' REGEXP 'Mar*ia';
+-------------------------+
| 'Maria' REGEXP 'Mar*ia' |
+-------------------------+
| 1 |
+-------------------------+
SELECT 'Maia' REGEXP 'Mar*ia';
+------------------------+
| 'Maia' REGEXP 'Mar*ia' |
+------------------------+
| 1 |
+------------------------+
SELECT 'Marrria' REGEXP 'Mar*ia';
+---------------------------+
| 'Marrria' REGEXP 'Mar*ia' |
+---------------------------+
| 1 |
+---------------------------+SELECT 'Maria' REGEXP 'Mar+ia';
+-------------------------+
| 'Maria' REGEXP 'Mar+ia' |
+-------------------------+
| 1 |
+-------------------------+
SELECT 'Maia' REGEXP 'Mar+ia';
+------------------------+
| 'Maia' REGEXP 'Mar+ia' |
+------------------------+
| 0 |
+------------------------+
SELECT 'Marrria' REGEXP 'Mar+ia';
+---------------------------+
| 'Marrria' REGEXP 'Mar+ia' |
+---------------------------+
| 1 |
+---------------------------+SELECT 'Maria' REGEXP 'Mar?ia';
+-------------------------+
| 'Maria' REGEXP 'Mar?ia' |
+-------------------------+
| 1 |
+-------------------------+
SELECT 'Maia' REGEXP 'Mar?ia';
+------------------------+
| 'Maia' REGEXP 'Mar?ia' |
+------------------------+
| 1 |
+------------------------+
SELECT 'Marrria' REGEXP 'Mar?ia';
+---------------------------+
| 'Marrria' REGEXP 'Mar?ia' |
+---------------------------+
| 0 |
+---------------------------+SELECT 'Maria' REGEXP '(ari)+';
+-------------------------+
| 'Maria' REGEXP '(ari)+' |
+-------------------------+
| 1 |
+-------------------------+SELECT 'Maria' REGEXP '(ari){0,1}';
+-----------------------------+
| 'Maria' REGEXP '(ari){0,1}' |
+-----------------------------+
| 1 |
+-----------------------------+SELECT 'Maria' REGEXP 'Ma[pr]ia';
+---------------------------+
| 'Maria' REGEXP 'Ma[pr]ia' |
+---------------------------+
| 1 |
+---------------------------+SELECT 'Maria' REGEXP 'Ma[a-z]ia';
+----------------------------+
| 'Maria' REGEXP 'Ma[a-z]ia' |
+----------------------------+
| 1 |
+----------------------------+SELECT 'Maria' REGEXP 'Ma[a-p]ia';
+----------------------------+
| 'Maria' REGEXP 'Ma[a-p]ia' |
+----------------------------+
| 0 |
+----------------------------+SELECT 'Maria' REGEXP 'Ma[^p]ia';
+---------------------------+
| 'Maria' REGEXP 'Ma[^p]ia' |
+---------------------------+
| 1 |
+---------------------------+
SELECT 'Maria' REGEXP 'Ma[^r]ia';
+---------------------------+
| 'Maria' REGEXP 'Ma[^r]ia' |
+---------------------------+
| 0 |
+---------------------------+SELECT '[Maria' REGEXP '[[]';
+-----------------------+
| '[Maria' REGEXP '[[]' |
+-----------------------+
| 1 |
+-----------------------+
SELECT '[Maria' REGEXP '[]]';
+-----------------------+
| '[Maria' REGEXP '[]]' |
+-----------------------+
| 0 |
+-----------------------+
SELECT ']Maria' REGEXP '[]]';
+-----------------------+
| ']Maria' REGEXP '[]]' |
+-----------------------+
| 1 |
+-----------------------+
SELECT ']Maria' REGEXP '[]a]';
+------------------------+
| ']Maria' REGEXP '[]a]' |
+------------------------+
| 1 |
+------------------------+SELECT ']Maria' REGEXP '[a]]';
+------------------------+
| ']Maria' REGEXP '[a]]' |
+------------------------+
| 0 |
+------------------------+SELECT '-Maria' REGEXP '[1-10]';
+--------------------------+
| '-Maria' REGEXP '[1-10]' |
+--------------------------+
| 0 |
+--------------------------+
SELECT '-Maria' REGEXP '[-1-10]';
+---------------------------+
| '-Maria' REGEXP '[-1-10]' |
+---------------------------+
| 1 |
+---------------------------+SELECT 'How do I upgrade MariaDB?' REGEXP '[[:<:]]MariaDB[[:>:]]';
+------------------------------------------------------------+
| 'How do I upgrade MariaDB?' REGEXP '[[:<:]]MariaDB[[:>:]]' |
+------------------------------------------------------------+
| 1 |
+------------------------------------------------------------+
SELECT 'How do I upgrade MariaDB?' REGEXP '[[:<:]]Maria[[:>:]]';
+----------------------------------------------------------+
| 'How do I upgrade MariaDB?' REGEXP '[[:<:]]Maria[[:>:]]' |
+----------------------------------------------------------+
| 0 |
+----------------------------------------------------------+SELECT 'Maria' REGEXP 'Mar[[:alnum:]]*';
+--------------------------------+
| 'Maria' REGEXP 'Mar[:alnum:]*' |
+--------------------------------+
| 1 |
+--------------------------------+SELECT 'Mari' REGEXP 'Mar[[:upper:]]+';
+---------------------------------+
| 'Mari' REGEXP 'Mar[[:upper:]]+' |
+---------------------------------+
| 1 |
+---------------------------------+
SELECT BINARY 'Mari' REGEXP 'Mar[[:upper:]]+';
+----------------------------------------+
| BINARY 'Mari' REGEXP 'Mar[[:upper:]]+' |
+----------------------------------------+
| 0 |
+----------------------------------------+SELECT '|' REGEXP '[[.vertical-line.]]';
+----------------------------------+
| '|' REGEXP '[[.vertical-line.]]' |
+----------------------------------+
| 1 |
+----------------------------------+SELECT 'Maria' REGEXP 'Ma[ir]{2}ia';
+------------------------------+
| 'Maria' REGEXP 'Ma[ir]{2}ia' |
+------------------------------+
| 0 |
+------------------------------+SELECT 'Maria' REGEXP 'Ma[ir]{2}';
+----------------------------+
| 'Maria' REGEXP 'Ma[ir]{2}' |
+----------------------------+
| 1 |
+----------------------------+SELECT '(Maria)' REGEXP '(Ma';
ERROR 1139 (42000): Got error 'parentheses not balanced' from regexp
SELECT '(Maria)' REGEXP '\(Ma';
ERROR 1139 (42000): Got error 'parentheses not balanced' from regexp
SELECT '(Maria)' REGEXP '\\(Ma';
+--------------------------+
| '(Maria)' REGEXP '\\(Ma' |
+--------------------------+
| 1 |
+--------------------------+SELECT 'Mar+ia' REGEXP 'r+';
+----------------------+
| 'Mar+ia' REGEXP 'r+' |
+----------------------+
| 1 |
+----------------------+
SELECT 'Maria' REGEXP 'r+';
+---------------------+
| 'Maria' REGEXP 'r+' |
+---------------------+
| 1 |
+---------------------+
SELECT 'Maria' REGEXP 'r\\+';
+-----------------------+
| 'Maria' REGEXP 'r\\+' |
+-----------------------+
| 0 |
+-----------------------+
SELECT 'Maria' REGEXP 'r+';
+---------------------+
| 'Maria' REGEXP 'r+' |
+---------------------+
| 1 |
+---------------------+This page is licensed: CC BY-SA / Gnu FDL
expr REGEXP pat, expr RLIKE patReturn the substring matching a regex. This function extracts the actual part of the string that matches the given pattern.
REGEXP_SUBSTR(subject,pattern)Returns the part of the string subject that matches the regular expression pattern, or an empty string if pattern was not found.
The function follows the case sensitivity rules of the effective . Matching is performed case insensitively for case insensitive collations, and case sensitively for case sensitive collations and for binary data.
The collation case sensitivity can be overwritten using the (?i) and (?-i) PCRE flags.
MariaDB uses the library for enhanced regular expression performance, and REGEXP_SUBSTR was introduced as part of this enhancement.
The variable addresses the remaining compatibilities between PCRE and the old regex library.
This page is licensed: CC BY-SA / Gnu FDL
subjectpatternreplacesubjectThe replace string can have backreferences to the subexpressions in the form \N, where N is a number from 1 to 9.
The function follows the case sensitivity rules of the effective collation. Matching is performed case insensitively for case insensitive collations, and case sensitively for case sensitive collations and for binary data.
The collation case sensitivity can be overwritten using the (?i) and (?-i) PCRE flags.
MariaDB uses the PCRE regular expression library for enhanced regular expression performance, and REGEXP_REPLACE was introduced as part of this enhancement.
The default_regex_flags variable addresses the remaining compatibilities between PCRE and the old regex library.
Backreferences to the subexpressions in the form , where N is a number from 1 to 9:
Case insensitive and case sensitive matches:
Overwriting the collation case sensitivity using the (?i) and (?-i) PCRE flags.
This page is licensed: CC BY-SA / Gnu FDL
REGEXP_REPLACE(subject, pattern, replace)SELECT REGEXP_SUBSTR('ab12cd','[0-9]+');
-> 12
SELECT REGEXP_SUBSTR(
'See https://mariadb.org/en/foundation/ for details',
'https?://[^/]*');
-> https://mariadb.orgTest if a string matches a regex. This operator returns 1 if the pattern is found in the string, and 0 otherwise.
Performs a pattern match of a string expression expr against a patternpat. The pattern can be an extended regular expression. See for details on the syntax for
regular expressions (see also ).
Returns 1 if expr matches pat or 0 if it doesn't match. If either expr or pat are NULL, the result is NULL.
The negative form also exists, as an alias for NOT (string REGEXP pattern). RLIKE and NOT RLIKE are synonyms for REGEXP and NOT REGEXP, originally provided for mSQL compatibility.
The pattern need not be a literal string. For example, it can be specified as a string expression or table column.
Note: Because MariaDB uses the C escape syntax in strings (for example, "\n" to represent the newline character), you must double any "" that you use in your REGEXP strings.
REGEXP is not case sensitive, except when used with binary strings.
The variable addresses the remaining compatibilities between PCRE and the old regex library.
MariaDB uses the variable to address the remaining compatibilities between PCRE and the old regex library.
The default behavior (multiline match is off)
Enabling the multiline option using the PCRE option syntax:
Enabling the multiline option using default_regex_flags
This page is licensed: GPLv2, originally from
SELECT REGEXP_REPLACE('ab12cd','[0-9]','') AS remove_digits;
-> abcd
SELECT REGEXP_REPLACE('<html><head><title>title</title><body>body</body></htm>', '<.+?>',' ')
AS strip_html;
-> title bodySELECT REGEXP_REPLACE('James Bond','^(.*) (.*)$','\\2, \\1') AS reorder_name;
-> Bond, JamesSELECT REGEXP_REPLACE('ABC','b','-') AS case_insensitive;
-> A-C
SELECT REGEXP_REPLACE('ABC' COLLATE utf8_bin,'b','-') AS case_sensitive;
-> ABC
SELECT REGEXP_REPLACE(BINARY 'ABC','b','-') AS binary_data;
-> ABCSELECT REGEXP_REPLACE('ABC','(?-i)b','-') AS force_case_sensitive;
-> ABC
SELECT REGEXP_REPLACE(BINARY 'ABC','(?i)b','-') AS force_case_insensitive;
-> A-Cexpr REGEXP pat, expr RLIKE patSELECT 'Monty!' REGEXP 'm%y%%';
+-------------------------+
| 'Monty!' REGEXP 'm%y%%' |
+-------------------------+
| 0 |
+-------------------------+
SELECT 'Monty!' REGEXP '.*';
+----------------------+
| 'Monty!' REGEXP '.*' |
+----------------------+
| 1 |
+----------------------+
SELECT 'new*\n*line' REGEXP 'new\\*.\\*line';
+---------------------------------------+
| 'new*\n*line' REGEXP 'new\\*.\\*line' |
+---------------------------------------+
| 1 |
+---------------------------------------+
SELECT 'a' REGEXP 'A', 'a' REGEXP BINARY 'A';
+----------------+-----------------------+
| 'a' REGEXP 'A' | 'a' REGEXP BINARY 'A' |
+----------------+-----------------------+
| 1 | 0 |
+----------------+-----------------------+
SELECT 'a' REGEXP '^[a-d]';
+---------------------+
| 'a' REGEXP '^[a-d]' |
+---------------------+
| 1 |
+---------------------+SELECT 'a\nb\nc' RLIKE '^b$';
+---------------------------+
| '(?m)a\nb\nc' RLIKE '^b$' |
+---------------------------+
| 0 |
+---------------------------+SELECT 'a\nb\nc' RLIKE '(?m)^b$';
+---------------------------+
| 'a\nb\nc' RLIKE '(?m)^b$' |
+---------------------------+
| 1 |
+---------------------------+SET default_regex_flags='MULTILINE';
SELECT 'a\nb\nc' RLIKE '^b$';
+-----------------------+
| 'a\nb\nc' RLIKE '^b$' |
+-----------------------+
| 1 |
+-----------------------+Understand MariaDB's regex support. This concept page explains the PCRE library integration, detailing supported syntax, character classes, and special characters.
10.34
MariaDB uses the PCRE library, which significantly improves the power of the operator.
The switch to PCRE added a number of features, including recursive patterns, named capture, look-ahead and look-behind assertions, non-capturing groups, non-greedy quantifiers, Unicode character properties, extended syntax for characters and character classes, multi-line matching, and many other.
These functions work with regular expressions: , , and .
Also, REGEXP/RLIKE, and the new functions, work correctly with all multi-byte supported by MariaDB, including East-Asian character sets (big5, gb2313, gbk, eucjp, eucjpms, cp932, ujis, euckr), and Unicode character sets (utf8, utf8mb4, ucs2, utf16, utf16le, utf32).
- Replaces all occurrences of a pattern.
- Position of the first appearance of a regex.
- Returns the matching part of a string.
See the individual articles for more details and examples.
In most cases PCRE is backward compatible with the old POSIX 1003.2 compliant regexp library (see ), so you won't need to change your applications that use SQL queries with the REGEXP/RLIKE predicate.
This section briefly describes the most important extended PCRE features. For more details, please refer to the documentation on the , or to the documentation which is bundled in the /pcre/doc/html/ directory of a MariaDB sources distribution. The pages pcresyntax.html and pcrepattern.html should be a good start. is another good resource to learn about PCRE and regular expressions generally.
PCRE supports the following escape sequences to match special characters:
Note, the backslash characters (here, and in all examples in the sections below) must be escaped with another backslash, unless you're using the NO_BACKSLASH_ESCAPES.
This example tests if a character has hex code 0x61:
PCRE supports the standard POSIX character classes such as alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, with the following additional classes:
This example checks if the string consists of ASCII characters only:
Generic character types complement the POSIX character classes and serve to simplify writing patterns:
This example checks if the string consists of "word" characters only:
\p{xx} is a character with the xx property, and \P{xx} is a character without the xx property.
The property names represented by xx above are limited to the Unicode script names, the general category properties, and "Any", which matches any character (including newline). Those that are not part of an identified script are lumped together as "Common".
General Category Properties For \p and \P
This example checks if the string consists only of characters with property N (number):
Special Category Properties For \p and \P
The property Xuc matches any character that can be represented by a Universal Character Name (in C++ and other programming languages). These include $, @, ```, and all characters with Unicode code points greater than U+00A0, excluding the surrogates U+D800..U+DFFF.
Script Names For \p and \P
Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.
This example checks if the string consists only of Greek characters:
The \X escape sequence matches a character sequence that makes an "extended grapheme cluster", i.e. a composite character that consists of multiple Unicode code points.
One of the examples of a composite character can be a letter followed by non-spacing accent marks. This example demonstrates that U+0045 LATIN CAPITAL LETTER E followed by U+0302 COMBINING CIRCUMFLEX ACCENT followed by U+0323 COMBINING DOT BELOW together form an extended grapheme cluster:
See the for the other types of extended grapheme clusters.
An assertion specifies a certain condition that must match at a particular point, but without consuming characters from the subject string. In addition to the standard POSIX simple assertions ^ (that matches at the beginning of a line) and $ (that matches at the end of a line), PCRE supports a number of other assertions:
This example cuts a word that consists only of 3 characters from a string:
Notice that the two \b assertions checked the word boundaries but did not get into the matching pattern.
The \b assertions work well in the beginning and the end of the subject string:
By default, the ^ and $ assertions have the same meaning with \A, \Z, and \z. However, the meanings of ^ and $ can change in multiline mode (see below). By contrast, the meanings of \A, \Z, and \z are always the same; they are independent of the multiline mode.
A number of options that control the default match behavior can be changed within the pattern by a sequence of option letters enclosed between (? and ).
For example, (?im) sets case insensitive multiline matching.
A hyphen followed by the option letters unset the options. For example, (?-im) means case sensitive single line match.
A combined setting and unsetting is also possible, e.g. (?im-sx).
If an option is set outside of subpattern parentheses, the option applies to the remainder of the pattern that follows the option. If an option is set inside a subpattern, it applies to the part of this subpattern that follows the option.
In this example the pattern (?i)m((?-i)aria)db matches the words MariaDB, Mariadb, mariadb, but not MARIADB:
This example demonstrates that the (?x) option makes the regexp engine ignore all white spaces in the pattern (other than in a class).
Note, putting spaces into a pattern in combination with the (?x) option can be useful to split different logical parts of a complex pattern, to make it more readable.
Multiline matching changes the meaning of ^ and $ from "the beginning of the subject string" and "the end of the subject string" to "the beginning of any line in the subject string" and "the end of any line in the subject string" respectively.
This example checks if the subject string contains two consequent lines that fully consist of digits:
Notice the (?m) option in the beginning of the pattern, which switches to the multiline matching mode.
PCRE supports five line break conventions:
CR (\r) - a single carriage return character
LF (\n) - a single linefeed character
CRLF (\r\n) - a carriage return followed by a linefeed
By default, the newline convention is set to any Unicode newline sequence, which includes:
The newline convention can be set by starting a pattern with one of the following sequences:
The newline conversion affects the ^ and $ assertions, the interpretation of the dot metacharacter, and the behavior of .
Note, the new line convention does not affect the meaning of .
This example demonstrates that the dot metacharacter matches , because it is not a newline sequence anymore:
By default, the escape sequence matches any Unicode newline sequences.
The meaning of can be set by starting a pattern with one of the following sequences:
It's possible to include comments inside a pattern. Comments do not participate in the pattern matching. Comments start at the (?
POSIX uses the backslash to remove a special meaning from a character. PCRE introduces a syntax to remove special meaning from a sequence of characters. The characters inside \Q ... \E are treated literally, without their special meaning.
This example checks if the string matches a dollar sign followed by a parenthesized name (a variable reference in some languages):
Note that the leftmost dollar sign and the parentheses are used literally, while the rightmost dollar sign is still used to match the end of the string.
The escape sequence \K causes any previously matched characters to be excluded from the final matched sequence. For example, the pattern: (foo)\Kbar matches foobar, but reports that it has matched bar. This feature is similar to a look-behind assertion. However, in this case, the part of the subject before the real match does not have to be of fixed length:
The question mark and the colon after the opening parenthesis create a non-capturing group: (?:...).
This example removes an optional article from a word, for example for better sorting of the results.
Note that the articles are listed inside the left parentheses using the alternation operator | but they do not produce a captured subpattern, so the word followed by the article is referenced by '1' in the third argument to the function. Using non-capturing groups can be useful to save numbers on the sup-patterns that won't be used in the third argument of , as well as for performance purposes.
By default, the repetition quantifiers ?, *, + and {n,m} are "greedy", that is, they try to match as much as possible. Adding a question mark after a repetition quantifier makes it "non-greedy", so the pattern matches the minimum number of times possible.
This example cuts C comments from a line:
The pattern without the non-greedy flag to the quantifier /[*].*[*]/ would match the entire string between the leftmost /* and the rightmost */.
A sequence inside (?>...) makes an atomic group. Backtracking inside an atomic group is prevented once it has matched; however, backtracking past to the previous items works normally.
Consider the pattern \d+foo applied to the subject string 123bar. Once the engine scans 123 and fails on the letter b, it would normally backtrack to 2 and try to match again, then fail and backtrack to 1 and try to match and fail again, and finally fail the entire pattern. In case of an atomic group (?>\d+)foo with the same subject string 123bar, the engine gives up immediately after the first failure to match foo. An atomic group with a quantifier can match all or nothing.
Atomic groups produce faster false results (i.e. in case when a long subject string does not match the pattern), because the regexp engine saves performance on backtracking. However, don't hurry to put everything into atomic groups. This example demonstrates the difference between atomic and non-atomic match:
The non-atomic pattern matches both abbc and abc, while the atomic pattern matches abbc only.
The atomic group (?>bc|b) in the above example can be "translated" as "if there is bc, then don't try to match as b". So b can match only if bc is not found.
Atomic groups are not capturing. To make an atomic group capturing, put it into parentheses:
An atomic group which ends with a quantifier can be rewritten using a so called "possessive quantifier" syntax by putting an additional + sign following the quantifier.
The pattern (?>\d+)foo from the previous section's example can be rewritten as \d++foo.
Backreferences match the same text as previously matched by a capturing group. Backreferences can be written using:
a backslash followed by a digit
the \g escape sequence followed by a positive or negative number
the \g escape sequence followed by a positive or negative number enclosed in braces
The following backreferences are identical and refer to the first capturing group:
\1
\g1
\g{1}
This example demonstrates a pattern that matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility":
This example removes doubled words that can unintentionally creep in when you edit a text in a text editor:
Note that all double words were removed, in the beginning, in the middle and in the end of the subject string.
A negative number in a \g sequence means a relative reference. Relative references can be helpful in long patterns, and also in patterns that are created by joining fragments together that contain references within themselves. The sequence \g{-1} is a reference to the most recently started capturing subpattern before \g.
In this example \g{-1} is equivalent to \2:
Using numeric backreferences for capturing groups can be hard to track in a complicated regular expression. Also, the numbers can change if an expression is modified. To overcome these difficulties, PCRE supports named subpatterns.
A subpattern can be named in one of three ways: (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing subpatterns from other parts of the pattern, can be made by name as well as by number.
Backreferences to a named subpattern can be written using the .NET syntax \k{name}, the Perl syntax \k<name> or \k'name' or \g{name}, or the Python syntax (?P=name).
This example tests if the string is a correct HTML tag:
Look-ahead and look-behind assertions serve to specify the context for the searched regular expression pattern. Note that the assertions only check the context, they do not capture anything themselves!
This example finds the letter which is not followed by another letter (negative look-ahead):
This example finds the letter which is followed by a digit (positive look-ahead):
This example finds the letter which does not follow a digit character (negative look-behind):
This example finds the letter which follows another letter character (positive look-behind):
Note that look-behind assertions can only be of fixed length; you cannot have repetition operators or alternations with different lengths:
PCRE supports a special syntax to recourse the entire pattern or its individual subpatterns:
This example checks for a correct additive arithmetic expression consisting of numbers, unary plus and minus, binary plus and minus, and parentheses:
The recursion is done using (?1) to call for the first parenthesized subpattern, which includes everything except the leading ^ and the trailing $.
The regular expression in the above example implements the following BNF grammar:
<expression> ::= <term> [(<sign> <term>)...]
<term> ::= [ <sign> ] <primary>
<primary> ::= <number> | <left paren> <expression> <right paren>
Use the (?(DEFINE)...) syntax to define subpatterns that can be referenced from elsewhere.
This example defines a subpattern with the name letters that matches one or more letters, which is further reused two times:
The above example can also be rewritten to define the digit part as a subpattern as well:
There are two forms of conditional subpatterns:
The yes-pattern is used if the condition is satisfied, otherwise the no-pattern (if any) is used.
Conditions With Subpattern References
If a condition consists of a number, it makes a condition with a subpattern reference. Such a condition is true if a capturing subpattern corresponding to the number has previously matched.
This example finds an optionally parenthesized number in a string:
The ([(])? part makes a capturing subpattern that matches an optional opening parenthesis; the [0-9]+ part matches a number, and the (?(1)[)]) part matches a closing parenthesis, but only if the opening parenthesis has been previously found.
Other Kinds of Conditions
The other possible condition kinds are: recursion references and assertions. See the for details.
PCRE correctly works with zero bytes in the subject strings:
Zero bytes, however, are not supported literally in the pattern strings and should be escaped using the \xhh or \x{hh} syntax:
PCRE provides other extended features that were not covered in this document, such as duplicate subpattern numbers, backtracking control, breaking utf-8 sequences into individual bytes, setting the match limit, setting the recursion limit, optimization control, recursion conditions, assertion conditions and more types of extended grapheme clusters. Please refer to the for details.
Enhanced regex was implemented as a GSoC 2013 project by Sudheera Palihakkara.
The variable was introduced to address the remaining incompatibilities between PCRE and the old regex library. Here are some examples of its usage:
The default behaviour (multiline match is off)
Enabling the multiline option using the PCRE option syntax:
Enabling the miltiline option using default_regex_flags
This page is licensed: CC BY-SA / Gnu FDL
0x09 (TAB)
\ddd
character with octal code ddd
\xhh
character with hex code hh
\x{hhh..}
character with hex code hhh..
\s
a white space character
\S
a character that is not a white space character
\v
a vertical white space character
\V
a character that is not a vertical white space character
\w
a "word" character (same as [:word:])
\W
a "non-word" character
L
Letter
Ll
Lower case letter
Lm
Modifier letter
Lo
Other letter
Lt
Title case letter
Lu
Upper case letter
L&
Ll, Lu, or Lt
M
Mark
Mc
Spacing mark
Me
Enclosing mark
Mn
Non-spacing mark
N
Number
Nd
Decimal number
Nl
Letter number
No
Other number
P
Punctuation
Pc
Connector punctuation
Pd
Dash punctuation
Pe
Close punctuation
Pf
Final punctuation
Pi
Initial punctuation
Po
Other punctuation
Ps
Open punctuation
S
Symbol
Sc
Currency symbol
Sk
Modifier symbol
Sm
Mathematical symbol
So
Other symbol
Z
Separator
Zl
Line separator
Zp
Paragraph separator
Zs
Space separator
(?X)
extra PCRE functionality (e.g. force error on unknown escaped character)
(?-...)
unset option(s)
any Unicode newline sequence
LS
(U+2028, line separator)
PS
(U+2029, paragraph separator)
\g
call subpattern by name (Oniguruma)
\g'name'
call subpattern by name (Oniguruma)
\g
call subpattern by absolute number (Oniguruma)
\g'n'
call subpattern by absolute number (Oniguruma)
\g<+n>
call subpattern by relative number
\g<-n>
call subpattern by relative number
\g'+n'
call subpattern by relative number
\g'-n'
call subpattern by relative number
<sign> ::= <plus sign> | <minus sign>Stable
PCRE 8.43
Stable
PCRE 8.42
, ,
Stable
PCRE 8.41
, ,
Stable
PCRE 8.40
, ,
Stable
PCRE 8.39
,
Stable
PCRE 8.38
,
Stable
PCRE 8.37
,
Stable
PCRE 8.36
,
Stable
PCRE 8.35
,
Stable
PCRE 8.34
Stable
\a
0x07 (BEL)
\cx
"control-x", where x is any ASCII character
\e
0x1B (escape)
\f
0x0C (form feed)
0x0A (newline)
0x0D (carriage return)
ascii
any ASCII character (0x00..0x7F)
word
any "word" character (a letter, a digit, or an underscore)
\d
a decimal digit (same as [:digit:])
\D
a character that is not a decimal digit
\h
a horizontal white space character
\H
a character that is not a horizontal white space character
a character that is not a new line
a newline sequence
C
Other
Cc
Control
Cf
Format
Cn
Unassigned
Co
Private use
Cs
Surrogate
Xan
Alphanumeric: union of properties L and N
Xps
POSIX space: property Z or tab, NL, VT, FF, CR
Xsp
Perl space: property Z or tab, NL, FF, CR
Xuc
A character than can be represented by a Universal Character Name
Xwd
Perl word: property Xan or underscore
\b
matches at a word boundary
\B
matches when not at a word boundary
\A
matches at the start of the subject
\Z
matches at the end of the subject, also matches before a newline at the end of the subject
\z
matches only at the end of the subject
\G
matches at the first matching position in the subject
(?i)
case insensitive match
(?m)
multiline mode
(?s)
dotall mode (dot matches newline characters)
(?x)
extended (ignore white space)
(?U)
ungreedy (lazy) match
(?J)
allow named subpatterns with duplicate names
LF
(U+000A, carriage return)
CR
(U+000D, carriage return)
CRLF
(a carriage return followed by a linefeed)
VT
(U+000B, vertical tab)
FF
(U+000C, form feed)
NEL
(U+0085, next line)
(*CR)
carriage return
(*LF)
linefeed
(*CRLF)
carriage return followed by linefeed
(*ANYCRLF)
any of the previous three
(*ANY)
all Unicode newline sequences
(*BSR_ANYCRLF)
any of CR, LF or CRLF
(*BSR_UNICODE)
any Unicode newline sequence
(?R)
Recourse the entire pattern
(?n)
call subpattern by absolute number
(?+n)
call subpattern by relative number
(?-n)
call subpattern by relative number
(?&name)
call subpattern by name (Perl)
(?P>name)
call subpattern by name (Python)
SELECT 'a' RLIKE '\\x{61}';
-> 1SELECT 'abc' RLIKE '^[[:ascii:]]+$';
-> 1SELECT 'abc' RLIKE '^\\w+$';
-> 1SELECT '1¼①' RLIKE '^\\p{N}+$';
-> 1SELECT 'ΣΦΩ' RLIKE '^\\p{Greek}+$';
-> 1SELECT _ucs2 0x004503020323 RLIKE '^\\X$';
-> 1SELECT REGEXP_SUBSTR('---abcd---xyz---', '\\b\\w{3}\\b');
-> xyzSELECT REGEXP_SUBSTR('xyz', '\\b\\w{3}\\b');
-> xyzSELECT 'MariaDB' RLIKE '(?i)m((?-i)aria)db';
-> 1
SELECT 'mariaDB' RLIKE '(?i)m((?-i)aria)db';
-> 1
SELECT 'Mariadb' RLIKE '(?i)m((?-i)aria)db';
-> 1
SELECT 'MARIADB' RLIKE '(?i)m((?-i)aria)db';
-> 0SELECT 'ab' RLIKE '(?x)a b';
-> 1SELECT 'abc\n123\n456\nxyz\n' RLIKE '(?m)^\\d+\\R\\d+$';
-> 1SELECT 'a\nb' RLIKE '(*CR)a.b';
-> 1SELECT 'ab12' RLIKE 'ab(?#expect digits)12';
-> 1SELECT '$(abc)' RLIKE '^\\Q$(\\E\\w+\\Q)\\E$';
-> 1SELECT REGEXP_SUBSTR('aaa123', '[a-z]*\\K[0-9]*');
-> 123SELECT REGEXP_REPLACE('The King','(?:the|an|a)[^a-z]([a-z]+)','\\1');
-> KingSELECT REGEXP_REPLACE('/* Comment1 */ i+= 1; /* Comment2 */', '/[*].*?[*]/','');
-> i+= 1;SELECT 'abcc' RLIKE 'a(?>bc|b)c' AS atomic1;
-> 1
SELECT 'abc' RLIKE 'a(?>bc|b)c' AS atomic2;
-> 0
SELECT 'abcc' RLIKE 'a(bc|b)c' AS non_atomic1;
-> 1
SELECT 'abc' RLIKE 'a(bc|b)c' AS non_atomic2;
-> 1SELECT REGEXP_REPLACE('abcc','a((?>bc|b))c','\\1');
-> bcSELECT 'sense and sensibility' RLIKE '(sens|respons)e and \\1ibility';
-> 1SELECT REGEXP_REPLACE('using using the the regexp regexp',
'\\b(\\w+)\\s+\\1\\b','\\1');
-> USING the regexpSELECT 'abc123def123' RLIKE '(abc(123)def)\\g{-1}';
-> 1
SELECT 'abc123def123' RLIKE '(abc(123)def)\\2';
-> 1SELECT '<a href="../">Up</a>' RLIKE '<(?<tag>[a-z][a-z0-9]*)[^>]*>[^<]*</(?P=tag)>';
-> 1SELECT REGEXP_SUBSTR('ab1','[a-z](?![a-z])');
-> bSELECT REGEXP_SUBSTR('ab1','[a-z](?=[0-9])');
-> bSELECT REGEXP_SUBSTR('1ab','(?<![0-9])[a-z]');
-> bSELECT REGEXP_SUBSTR('1ab','(?<=[a-z])[a-z]');
-> bSELECT 'aaa' RLIKE '(?<=(a|bc))a';
ERROR 1139 (42000): Got error 'lookbehind assertion is not fixed length at offset 10' from regexpSELECT '1+2-3+(+(4-1)+(-2)+(+1))' RLIKE '^(([+-]?(\\d+|[(](?1)[)]))(([+-](?1))*))$';
-> 1SELECT 'abc123xyz' RLIKE '^(?(DEFINE)(?<letters>[a-z]+))(?&letters)[0-9]+(?&letters)$';
-> 1SELECT 'abc123xyz' RLIKE
'^(?(DEFINE)(?<letters>[a-z]+)(?<digits>[0-9]+))(?&letters)(?&digits)(?&letters)$';
-> 1(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)SELECT REGEXP_SUBSTR('a(123)b', '([(])?[0-9]+(?(1)[)])');
-> (123)SELECT 'a\0b' RLIKE '^a.b$';
-> 1SELECT 'a\0b' RLIKE '^a\\x{00}b$';
-> 1SELECT 'a\nb\nc' RLIKE '^b$';
+---------------------------+
| '(?m)a\nb\nc' RLIKE '^b$' |
+---------------------------+
| 0 |
+---------------------------+SELECT 'a\nb\nc' RLIKE '(?m)^b$';
+---------------------------+
| 'a\nb\nc' RLIKE '(?m)^b$' |
+---------------------------+
| 1 |
+---------------------------+SET default_regex_flags='MULTILINE';
SELECT 'a\nb\nc' RLIKE '^b$';
+-----------------------+
| 'a\nb\nc' RLIKE '^b$' |
+-----------------------+
| 1 |
+-----------------------+Return the index of a regex match. This function finds the starting position of the first substring that matches the given pattern.
REGEXP_INSTR(subject, pattern)Returns the position of the first occurrence of the regular expression pattern in the string subject, or 0 if pattern was not found.
The positions start with 1 and are measured in characters (i.e. not in bytes), which is important for multi-byte character sets. You can cast a multi-byte character set to BINARY to get offsets in bytes.
The function follows the case sensitivity rules of the effective collation. Matching is performed case insensitively for case insensitive collations, and case sensitively for case sensitive collations and for binary data.
The collation case sensitivity can be overwritten using the (?i) and (?-i) PCRE flags.
MariaDB uses the library for enhanced regular expression performance, and REGEXP_INSTR was introduced as part of this enhancement.
Casting a multi-byte character set as BINARY to get offsets in bytes:
Case sensitivity:
This page is licensed: CC BY-SA / Gnu FDL
SELECT REGEXP_INSTR('abc','b');
-> 2
SELECT REGEXP_INSTR('abc','x');
-> 0
SELECT REGEXP_INSTR('BJÖRN','N');
-> 5SELECT REGEXP_INSTR(BINARY 'BJÖRN','N') AS cast_utf8_to_binary;
-> 6SELECT REGEXP_INSTR('ABC','b');
-> 2
SELECT REGEXP_INSTR('ABC' COLLATE utf8_bin,'b');
-> 0
SELECT REGEXP_INSTR(BINARY'ABC','b');
-> 0
SELECT REGEXP_INSTR('ABC','(?-i)b');
-> 0
SELECT REGEXP_INSTR('ABC' COLLATE utf8_bin,'(?i)b');
-> 2