[section Static Regexes]

[h2 Overview]

The feature that really sets xpressive apart from other C/C++ regular
expression libraries is the ability to author a regular expression using C++
expressions. xpressive achieves this through operator overloading, using a
technique called ['expression templates] to embed a mini-language dedicated
to pattern matching within C++. These "static regexes" have many advantages
over their string-based brethren. In particular, static regexes:

* are syntax-checked at compile-time; they will never fail at run-time due to
  a syntax error.
* can naturally refer to other C++ data and code, including other regexes,
  making it possible to build grammars out of regular expressions and bind
  user-defined actions that execute when parts of your regex match.
* are statically bound for better inlining and optimization. Static regexes
  require no state tables, virtual functions, byte-code or calls through
  function pointers that cannot be resolved at compile time.
* are not limited to searching for patterns in strings. You can declare a
  static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by
the rules for legal C++ expressions. Unfortunately, that means that
"classic" regular expression syntax cannot always be mapped cleanly into
C++. Rather, we map the regex ['constructs], picking new syntax that is
legal C++.

[h2 Construction and Assignment]

You create a static regex by assigning one to an object of type _basic_regex_.
For instance, the following defines a regex that can be used to find patterns
in objects of type `std::string`:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;


Assignment works the same way: assigning a static regex to an existing
_basic_regex_ object replaces the pattern it holds.


[h2 Character and String Literals]

In static regexes, character and string literals match themselves. For
instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and
`'.'` respectively. Don't be confused by the fact that [^$] and [^.] are
meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one
operand is not a literal. For instance, the following are ['not] valid
regexes:

    sregex re1 = 'a' >> 'b';         // ERROR!
    sregex re2 = +'a';               // ERROR!

The two operands to the binary `>>` operator are both literals, and the
operand of the unary `+` operator is also a literal, so these statements
will call the native C++ binary right-shift and unary plus operators,
respectively. That's not what we want. To get operator overloading to kick
in, at least one operand must be a user-defined type. We can use xpressive's
`as_xpr()` helper function to "taint" an expression with regex-ness, forcing
operator overloading to find the correct operators. The two regexes above
should be written as:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK


[h2 Sequencing and Alternation]

As you've probably already noticed, sub-expressions in static regexes must
be separated by the sequencing operator, `>>`. You can read this operator as
"followed by".

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the `|` operator. You can
read this operator as "or". For example:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );


[h2 Grouping and Captures]

In Perl, parentheses `()` have special meaning. They group, but as a
side-effect they also create back\-references like [^$1] and [^$2]. In C++,
parentheses only group \-\- there is no way to give them side\-effects. To
get the same effect, we use the special `s1`, `s2`, etc. tokens. Assigning
to one creates a back-reference. You can then use the back-reference later
in your expression, like using [^\1] and [^\2] in Perl. For example,
consider the following regex, which finds matching HTML tags:

    "<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to `s1`, and then you
use `s1` later in the pattern to find the matching end tag.

[tip [*Grouping without capturing a back-reference] \n\n In
xpressive, if you just want grouping without capturing a back-reference, you
can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)]
non-capturing grouping construct.]


[h2 Case-Insensitivity and Internationalization]

Perl lets you make part of your regular expression case-insensitive by using
the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity
pattern modifier, called `icase`. You can use it as follows:

    sregex re = "this" >> icase( "that" );

In this regular expression, `"this"` will be matched exactly, but `"that"`
will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of
internationalization: how should case-insensitive character comparisons be
evaluated? Also, many character classes are locale-specific. Which
characters are matched by `digit` and which are matched by `alpha`? The
answer depends on the `std::locale` object the regular expression object is
using. By default, all regular expression objects use the global locale. You
can override the default by using the `imbue()` pattern modifier, as
follows:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate `alpha` and `digit` according to
`my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits
Localization and Regex Traits] for more information about how to customize
the behavior of your regexes.


[h2 Static xpressive Syntax Cheat Sheet]

The table below lists the familiar regex constructs and their equivalents in
static xpressive.

[table Perl syntax vs. Static xpressive syntax
[[Perl] [Static xpressive] [Meaning]]
[[[^.]] [`_`] [any character (assuming Perl's /s modifier).]]
[[[^ab]] [`a >> b`] [sequencing of [^a] and [^b] sub-expressions.]]
[[[^a|b]] [`a | b`] [alternation of [^a] and [^b] sub-expressions.]]
[[[^(a)]] [`(s1= a)`] [group and capture a back-reference.]]
[[[^(?:a)]] [`(a)`] [group and do not capture a back-reference.]]
[[[^\1]] [`s1`] [a previously captured back-reference.]]
[[[^a*]] [`*a`] [zero or more times, greedy.]]
[[[^a+]] [`+a`] [one or more times, greedy.]]
[[[^a?]] [`!a`] [zero or one time, greedy.]]
[[[^a{n,m}]] [`repeat<n,m>(a)`] [between [^n] and [^m] times, greedy.]]
[[[^a*?]] [`-*a`] [zero or more times, non-greedy.]]
[[[^a+?]] [`-+a`] [one or more times, non-greedy.]]
[[[^a??]] [`-!a`] [zero or one time, non-greedy.]]
[[[^a{n,m}?]] [`-repeat<n,m>(a)`] [between [^n] and [^m] times, non-greedy.]]
[[[^^]] [`bos`] [beginning of sequence assertion.]]
[[[^$]] [`eos`] [end of sequence assertion.]]
[[[^\b]] [`_b`] [word boundary assertion.]]
[[[^\B]] [`~_b`] [not word boundary assertion.]]
[[[^\\n]] [`_n`] [literal newline.]]
[[[^.]] [`~_n`] [any character except a literal newline (without Perl's /s modifier).]]
[[[^\\r?\\n|\\r]] [`_ln`] [logical newline.]]
[[[^\[^\\r\\n\]]] [`~_ln`] [any single character not a logical newline.]]
[[[^\w]] [`_w`] [a word character, equivalent to set\[alnum | '_'\].]]
[[[^\W]] [`~_w`] [not a word character, equivalent to ~set\[alnum | '_'\].]]
[[[^\d]] [`_d`] [a digit character.]]
[[[^\D]] [`~_d`] [not a digit character.]]
[[[^\s]] [`_s`] [a space character.]]
[[[^\S]] [`~_s`] [not a space character.]]
[[[^\[:alnum:\]]] [`alnum`] [an alpha-numeric character.]]
[[[^\[:alpha:\]]] [`alpha`] [an alphabetic character.]]
[[[^\[:blank:\]]] [`blank`] [a horizontal white-space character.]]
[[[^\[:cntrl:\]]] [`cntrl`] [a control character.]]
[[[^\[:digit:\]]] [`digit`] [a digit character.]]
[[[^\[:graph:\]]] [`graph`] [a graphable character.]]
[[[^\[:lower:\]]] [`lower`] [a lower-case character.]]
[[[^\[:print:\]]] [`print`] [a printing character.]]
[[[^\[:punct:\]]] [`punct`] [a punctuation character.]]
[[[^\[:space:\]]] [`space`] [a white-space character.]]
[[[^\[:upper:\]]] [`upper`] [an upper-case character.]]
[[[^\[:xdigit:\]]] [`xdigit`] [a hexadecimal digit character.]]
[[[^\[0-9\]]] [`range('0','9')`] [characters in range `'0'` through `'9'`.]]
[[[^\[abc\]]] [`as_xpr('a') | 'b' | 'c'`] [characters `'a'`, `'b'`, or `'c'`.]]
[[[^\[abc\]]] [`(set= 'a','b','c')`] [['same as above]]]
[[[^\[0-9abc\]]] [`set[ range('0','9') | 'a' | 'b' | 'c' ]`] [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
[[[^\[0-9abc\]]] [`set[ range('0','9') | (set= 'a','b','c') ]`] [['same as above]]]
[[[^\[^abc\]]] [`~(set= 'a','b','c')`] [not characters `'a'`, `'b'`, or `'c'`.]]
[[[^(?i:['stuff])]] [`icase(`[^['stuff]]`)`] [match ['stuff] disregarding case.]]
[[[^(?>['stuff])]] [`keep(`[^['stuff]]`)`] [independent sub-expression, match ['stuff] and turn off backtracking.]]
[[[^(?=['stuff])]] [`before(`[^['stuff]]`)`] [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
[[[^(?!['stuff])]] [`~before(`[^['stuff]]`)`] [negative look-ahead assertion, match if not before ['stuff].]]
[[[^(?<=['stuff])]] [`after(`[^['stuff]]`)`] [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
[[[^(?<!['stuff])]] [`~after(`[^['stuff]]`)`] [negative look-behind assertion, match if not after ['stuff]. (['stuff] must be constant-width.)]]
]
\n


[endsect]