source: downloads/boost_1_34_1/libs/xpressive/doc/static_regexes.qbk @ 69

Last change on this file since 69 was 29, checked in by landauf, 17 years ago

updated boost from 1_33_1 to 1_34_1

File size: 11.8 KB
[section Static Regexes]

[h2 Overview]

The feature that really sets xpressive apart from other C/C++ regular
expression libraries is the ability to author a regular expression using C++
expressions. xpressive achieves this through operator overloading, using a
technique called ['expression templates] to embed a mini-language dedicated
to pattern matching within C++. These "static regexes" have many advantages
over their string-based brethren. In particular, static regexes:

* are syntax-checked at compile-time; they will never fail at run-time due to
  a syntax error.
* can naturally refer to other C++ data and code, including other regexes,
  making it possible to build grammars out of regular expressions and bind
  user-defined actions that execute when parts of your regex match.
* are statically bound for better inlining and optimization. Static regexes
  require no state tables, virtual functions, byte-code or calls through
  function pointers that cannot be resolved at compile time.
* are not limited to searching for patterns in strings. You can declare a
  static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by
the rules for legal C++ expressions. Unfortunately, that means that
"classic" regular expression syntax cannot always be mapped cleanly into
C++. Rather, we map the regex ['constructs], picking new syntax that is
legal C++.

[h2 Construction and Assignment]

You create a static regex by assigning one to an object of type _basic_regex_.
For instance, the following defines a regex that can be used to find patterns
in objects of type `std::string`:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

[h2 Character and String Literals]

In static regexes, character and string literals match themselves. For
instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and
`'.'` respectively. Don't be confused by the fact that [^$] and [^.] are
meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one
operand is not a literal. For instance, the following are ['not] valid
regexes:

    sregex re1 = 'a' >> 'b';         // ERROR!
    sregex re2 = +'a';               // ERROR!

The two operands to the binary `>>` operator are both literals, and the
operand of the unary `+` operator is also a literal, so these statements
will call the native C++ binary right-shift and unary plus operators,
respectively. That's not what we want. To get operator overloading to kick
in, at least one operand must be a user-defined type. We can use xpressive's
`as_xpr()` helper function to "taint" an expression with regex-ness, forcing
operator overloading to find the correct operators. The two regexes above
should be written as:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK

[h2 Sequencing and Alternation]

As you've probably already noticed, sub-expressions in static regexes must
be separated by the sequencing operator, `>>`. You can read this operator as
"followed by".

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the `|` operator. You can
read this operator as "or". For example:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );

[h2 Grouping and Captures]

In Perl, parentheses `()` have special meaning. They group, but as a
side-effect they also create back\-references like [^$1] and [^$2]. In C++,
parentheses only group \-\- there is no way to give them side\-effects. To
get the same effect, we use the special `s1`, `s2`, etc. tokens. Assigning
to one creates a back-reference. You can then use the back-reference later
in your expression, like using [^\1] and [^\2] in Perl. For example,
consider the following regex, which finds matching HTML tags:

    "<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to `s1`, and then you
use `s1` later in the pattern to find the matching end tag.

[tip [*Grouping without capturing a back-reference] \n\n In
xpressive, if you just want grouping without capturing a back-reference, you
can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)]
non-capturing grouping construct.]

[h2 Case-Insensitivity and Internationalization]

Perl lets you make part of your regular expression case-insensitive by using
the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity
pattern modifier, called `icase`. You can use it as follows:

    sregex re = "this" >> icase( "that" );

In this regular expression, `"this"` will be matched exactly, but `"that"`
will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of
internationalization: how should case-insensitive character comparisons be
evaluated? Also, many character classes are locale-specific. Which
characters are matched by `digit` and which are matched by `alpha`? The
answer depends on the `std::locale` object the regular expression object is
using. By default, all regular expression objects use the global locale. You
can override the default by using the `imbue()` pattern modifier, as
follows:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate `alpha` and `digit` according to
`my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits
Localization and Regex Traits] for more information about how to customize
the behavior of your regexes.

[h2 Static xpressive Syntax Cheat Sheet]

The table below lists the familiar regex constructs and their equivalents in
static xpressive.

[table Perl syntax vs. Static xpressive syntax
    [[Perl]               [Static xpressive]                              [Meaning]]
    [[[^.]]               [`_`]                                           [any character (assuming Perl's /s modifier).]]
    [[[^ab]]              [`a >> b`]                                      [sequencing of [^a] and [^b] sub-expressions.]]
    [[[^a|b]]             [`a | b`]                                       [alternation of [^a] and [^b] sub-expressions.]]
    [[[^(a)]]             [`(s1= a)`]                                     [group and capture a back-reference.]]
    [[[^(?:a)]]           [`(a)`]                                         [group and do not capture a back-reference.]]
    [[[^\1]]              [`s1`]                                          [a previously captured back-reference.]]
    [[[^a*]]              [`*a`]                                          [zero or more times, greedy.]]
    [[[^a+]]              [`+a`]                                          [one or more times, greedy.]]
    [[[^a?]]              [`!a`]                                          [zero or one time, greedy.]]
    [[[^a{n,m}]]          [`repeat<n,m>(a)`]                              [between [^n] and [^m] times, greedy.]]
    [[[^a*?]]             [`-*a`]                                         [zero or more times, non-greedy.]]
    [[[^a+?]]             [`-+a`]                                         [one or more times, non-greedy.]]
    [[[^a??]]             [`-!a`]                                         [zero or one time, non-greedy.]]
    [[[^a{n,m}?]]         [`-repeat<n,m>(a)`]                             [between [^n] and [^m] times, non-greedy.]]
    [[[^^]]               [`bos`]                                         [beginning of sequence assertion.]]
    [[[^$]]               [`eos`]                                         [end of sequence assertion.]]
    [[[^\b]]              [`_b`]                                          [word boundary assertion.]]
    [[[^\B]]              [`~_b`]                                         [not word boundary assertion.]]
    [[[^\\n]]             [`_n`]                                          [literal newline.]]
    [[[^.]]               [`~_n`]                                         [any character except a literal newline (without Perl's /s modifier).]]
    [[[^\\r?\\n|\\r]]     [`_ln`]                                         [logical newline.]]
    [[[^\[^\\r\\n\]]]     [`~_ln`]                                        [any single character not a logical newline.]]
    [[[^\w]]              [`_w`]                                          [a word character, equivalent to set\[alnum | '_'\].]]
    [[[^\W]]              [`~_w`]                                         [not a word character, equivalent to ~set\[alnum | '_'\].]]
    [[[^\d]]              [`_d`]                                          [a digit character.]]
    [[[^\D]]              [`~_d`]                                         [not a digit character.]]
    [[[^\s]]              [`_s`]                                          [a space character.]]
    [[[^\S]]              [`~_s`]                                         [not a space character.]]
    [[[^\[:alnum:\]]]     [`alnum`]                                       [an alpha-numeric character.]]
    [[[^\[:alpha:\]]]     [`alpha`]                                       [an alphabetic character.]]
    [[[^\[:blank:\]]]     [`blank`]                                       [a horizontal white-space character.]]
    [[[^\[:cntrl:\]]]     [`cntrl`]                                       [a control character.]]
    [[[^\[:digit:\]]]     [`digit`]                                       [a digit character.]]
    [[[^\[:graph:\]]]     [`graph`]                                       [a graphable character.]]
    [[[^\[:lower:\]]]     [`lower`]                                       [a lower-case character.]]
    [[[^\[:print:\]]]     [`print`]                                       [a printing character.]]
    [[[^\[:punct:\]]]     [`punct`]                                       [a punctuation character.]]
    [[[^\[:space:\]]]     [`space`]                                       [a white-space character.]]
    [[[^\[:upper:\]]]     [`upper`]                                       [an upper-case character.]]
    [[[^\[:xdigit:\]]]    [`xdigit`]                                      [a hexadecimal digit character.]]
    [[[^\[0-9\]]]         [`range('0','9')`]                              [characters in range `'0'` through `'9'`.]]
    [[[^\[abc\]]]         [`as_xpr('a') | 'b' |'c'`]                      [characters `'a'`, `'b'`, or `'c'`.]]
    [[[^\[abc\]]]         [`(set= 'a','b','c')`]                          [['same as above]]]
    [[[^\[0-9abc\]]]      [`set[ range('0','9') | 'a' | 'b' | 'c' ]`]     [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
    [[[^\[0-9abc\]]]      [`set[ range('0','9') | (set= 'a','b','c') ]`]  [['same as above]]]
    [[[^\[^abc\]]]        [`~(set= 'a','b','c')`]                         [not characters `'a'`, `'b'`, or `'c'`.]]
    [[[^(?i:['stuff])]]   [`icase(`[^['stuff]]`)`]                        [match ['stuff] disregarding case.]]
    [[[^(?>['stuff])]]    [`keep(`[^['stuff]]`)`]                         [independent sub-expression, match ['stuff] and turn off backtracking.]]
    [[[^(?=['stuff])]]    [`before(`[^['stuff]]`)`]                       [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
    [[[^(?!['stuff])]]    [`~before(`[^['stuff]]`)`]                      [negative look-ahead assertion, match if not before ['stuff].]]
    [[[^(?<=['stuff])]]   [`after(`[^['stuff]]`)`]                        [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
    [[[^(?<!['stuff])]]   [`~after(`[^['stuff]]`)`]                       [negative look-behind assertion, match if not after ['stuff]. (['stuff] must be constant-width.)]]
]
\n

[endsect]