static_regexes.qbk @ 69

Last change on this file since 69 was 29, checked in by landauf, 17 years ago
updated boost from 1_33_1 to 1_34_1
File size: 11.8 KB

1	[section Static Regexes]
2
3	[h2 Overview]
4
5	The feature that really sets xpressive apart from other C/C++ regular
6	expression libraries is the ability to author a regular expression using C++
7	expressions. xpressive achieves this through operator overloading, using a
8	technique called ['expression templates] to embed a mini-language dedicated
9	to pattern matching within C++. These "static regexes" have many advantages
10	over their string-based brethren. In particular, static regexes:
11
12	* are syntax-checked at compile-time; they will never fail at run-time due to
13	a syntax error.
14	* can naturally refer to other C++ data and code, including other regexes,
15	making it possible to build grammars out of regular expressions and bind
16	user-defined actions that execute when parts of your regex match.
17	* are statically bound for better inlining and optimization. Static regexes
18	require no state tables, virtual functions, byte-code or calls through
19	function pointers that cannot be resolved at compile time.
20	* are not limited to searching for patterns in strings. You can declare a
21	static regex that finds patterns in an array of integers, for instance.
22
23	Since we compose static regexes using C++ expressions, we are constrained by
24	the rules for legal C++ expressions. Unfortunately, that means that
25	"classic" regular expression syntax cannot always be mapped cleanly into
26	C++. Rather, we map the regex ['constructs], picking new syntax that is
27	legal C++.
28
29	[h2 Construction and Assignment]
30
31	You create a static regex by assigning one to an object of type _basic_regex_.
32	For instance, the following defines a regex that can be used to find patterns
33	in objects of type `std::string`:
34
35	    sregex re = '$' >> +_d >> '.' >> _d >> _d;
36
37	Assignment works similarly: you can assign a static regex to an existing regex object.
38
39	[h2 Character and String Literals]
40
41	In static regexes, character and string literals match themselves. For
42	instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and
43	`'.'` respectively. Don't be confused by the fact that [^$] and [^.] are
44	meta-characters in Perl. In xpressive, literals always represent themselves.
45
46	When using literals in static regexes, you must take care that at least one
47	operand is not a literal. For instance, the following are ['not] valid
48	regexes:
49
50	    sregex re1 = 'a' >> 'b'; // ERROR!
51	    sregex re2 = +'a'; // ERROR!
52
53	The two operands to the binary `>>` operator are both literals, and the
54	operand of the unary `+` operator is also a literal, so these statements
55	will call the native C++ binary right-shift and unary plus operators,
56	respectively. That's not what we want. To get operator overloading to kick
57	in, at least one operand must be a user-defined type. We can use xpressive's
58	`as_xpr()` helper function to "taint" an expression with regex-ness, forcing
59	operator overloading to find the correct operators. The two regexes above
60	should be written as:
61
62	    sregex re1 = as_xpr('a') >> 'b'; // OK
63	    sregex re2 = +as_xpr('a'); // OK
64
65	[h2 Sequencing and Alternation]
66
67	As you've probably already noticed, sub-expressions in static regexes must
68	be separated by the sequencing operator, `>>`. You can read this operator as
69	"followed by".
70
71	    // Match an 'a' followed by a digit
72	    sregex re = 'a' >> _d;
73
74	Alternation works just as it does in Perl with the `|` operator. You can
75	read this operator as "or". For example:
76
77	    // match a digit character or a word character one or more times
78	    sregex re = +( _d | _w );
79
80	[h2 Grouping and Captures]
81
82	In Perl, parentheses `()` have special meaning. They group, but as a
83	side-effect they also create back\-references like [^$1] and [^$2]. In C++,
84	parentheses only group \-\- there is no way to give them side\-effects. To
85	get the same effect, we use the special `s1`, `s2`, etc. tokens. Assigning
86	to one creates a back-reference. You can then use the back-reference later
87	in your expression, like using [^\1] and [^\2] in Perl. For example,
88	consider the following regex, which finds matching HTML tags:
89
90	    "<(\\w+)>.*?</\\1>"
91
92	In static xpressive, this would be:
93
94	    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'
95
96	Notice how you capture a back-reference by assigning to `s1`, and then you
97	use `s1` later in the pattern to find the matching end tag.
98
99	[tip [*Grouping without capturing a back-reference] \n\n In
100	xpressive, if you just want grouping without capturing a back-reference, you
101	can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)]
102	non-capturing grouping construct.]
103
104	[h2 Case-Insensitivity and Internationalization]
105
106	Perl lets you make part of your regular expression case-insensitive by using
107	the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity
108	pattern modifier, called `icase`. You can use it as follows:
109
110	    sregex re = "this" >> icase( "that" );
111
112	In this regular expression, `"this"` will be matched exactly, but `"that"`
113	will be matched irrespective of case.
114
115	Case-insensitive regular expressions raise the issue of
116	internationalization: how should case-insensitive character comparisons be
117	evaluated? Also, many character classes are locale-specific. Which
118	characters are matched by `digit` and which are matched by `alpha`? The
119	answer depends on the `std::locale` object the regular expression object is
120	using. By default, all regular expression objects use the global locale. You
121	can override the default by using the `imbue()` pattern modifier, as
122	follows:
123
124	    std::locale my_locale = /* initialize a std::locale object */;
125	    sregex re = imbue( my_locale )( +alpha >> +digit );
126
127	This regular expression will evaluate `alpha` and `digit` according to
128	`my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits
129	Localization and Regex Traits] for more information about how to customize
130	the behavior of your regexes.
131
132	[h2 Static xpressive Syntax Cheat Sheet]
133
134	The table below lists the familiar regex constructs and their equivalents in
135	static xpressive.
136
137	[table Perl syntax vs. Static xpressive syntax
138	[[Perl] [Static xpressive] [Meaning]]
139	[[[^.]] [`_`] [any character (assuming Perl's /s modifier).]]
140	[[[^ab]] [`a >> b`] [sequencing of [^a] and [^b] sub-expressions.]]
141	[[[^a|b]] [`a | b`] [alternation of [^a] and [^b] sub-expressions.]]
142	[[[^(a)]] [`(s1= a)`] [group and capture a back-reference.]]
143	[[[^(?:a)]] [`(a)`] [group and do not capture a back-reference.]]
144	[[[^\1]] [`s1`] [a previously captured back-reference.]]
145	[[[^a*]] [`*a`] [zero or more times, greedy.]]
146	[[[^a+]] [`+a`] [one or more times, greedy.]]
147	[[[^a?]] [`!a`] [zero or one time, greedy.]]
148	[[[^a{n,m}]] [`repeat<n,m>(a)`] [between [^n] and [^m] times, greedy.]]
149	[[[^a*?]] [`-*a`] [zero or more times, non-greedy.]]
150	[[[^a+?]] [`-+a`] [one or more times, non-greedy.]]
151	[[[^a??]] [`-!a`] [zero or one time, non-greedy.]]
152	[[[^a{n,m}?]] [`-repeat<n,m>(a)`] [between [^n] and [^m] times, non-greedy.]]
153	[[[^^]] [`bos`] [beginning of sequence assertion.]]
154	[[[^$]] [`eos`] [end of sequence assertion.]]
155	[[[^\b]] [`_b`] [word boundary assertion.]]
156	[[[^\B]] [`~_b`] [not word boundary assertion.]]
157	[[[^\\n]] [`_n`] [literal newline.]]
158	[[[^.]] [`~_n`] [any character except a literal newline (without Perl's /s modifier).]]
159	[[[^\\r?\\n|\\r]] [`_ln`] [logical newline.]]
160	[[[^\[^\\r\\n\]]] [`~_ln`] [any single character not a logical newline.]]
161	[[[^\w]] [`_w`] [a word character, equivalent to set\[alnum | '_'\].]]
162	[[[^\W]] [`~_w`] [not a word character, equivalent to ~set\[alnum | '_'\].]]
163	[[[^\d]] [`_d`] [a digit character.]]
164	[[[^\D]] [`~_d`] [not a digit character.]]
165	[[[^\s]] [`_s`] [a space character.]]
166	[[[^\S]] [`~_s`] [not a space character.]]
167	[[[^\[:alnum:\]]] [`alnum`] [an alpha-numeric character.]]
168	[[[^\[:alpha:\]]] [`alpha`] [an alphabetic character.]]
169	[[[^\[:blank:\]]] [`blank`] [a horizontal white-space character.]]
170	[[[^\[:cntrl:\]]] [`cntrl`] [a control character.]]
171	[[[^\[:digit:\]]] [`digit`] [a digit character.]]
172	[[[^\[:graph:\]]] [`graph`] [a graphable character.]]
173	[[[^\[:lower:\]]] [`lower`] [a lower-case character.]]
174	[[[^\[:print:\]]] [`print`] [a printing character.]]
175	[[[^\[:punct:\]]] [`punct`] [a punctuation character.]]
176	[[[^\[:space:\]]] [`space`] [a white-space character.]]
177	[[[^\[:upper:\]]] [`upper`] [an upper-case character.]]
178	[[[^\[:xdigit:\]]] [`xdigit`] [a hexadecimal digit character.]]
179	[[[^\[0-9\]]] [`range('0','9')`] [characters in range `'0'` through `'9'`.]]
180	[[[^\[abc\]]] [`as_xpr('a') | 'b' | 'c'`] [characters `'a'`, `'b'`, or `'c'`.]]
181	[[[^\[abc\]]] [`(set= 'a','b','c')`] [['same as above]]]
182	[[[^\[0-9abc\]]] [`set[ range('0','9') | 'a' | 'b' | 'c' ]`] [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
183	[[[^\[0-9abc\]]] [`set[ range('0','9') | (set= 'a','b','c') ]`] [['same as above]]]
184	[[[^\[^abc\]]] [`~(set= 'a','b','c')`] [not characters `'a'`, `'b'`, or `'c'`.]]
185	[[[^(?i:['stuff])]] [`icase(`[^['stuff]]`)`] [match ['stuff] disregarding case.]]
186	[[[^(?>['stuff])]] [`keep(`[^['stuff]]`)`] [independent sub-expression, match ['stuff] and turn off backtracking.]]
187	[[[^(?=['stuff])]] [`before(`[^['stuff]]`)`] [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
188	[[[^(?!['stuff])]] [`~before(`[^['stuff]]`)`] [negative look-ahead assertion, match if not before ['stuff].]]
189	[[[^(?<=['stuff])]] [`after(`[^['stuff]]`)`] [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
190	[[[^(?<!['stuff])]] [`~after(`[^['stuff]]`)`] [negative look-behind assertion, match if not after ['stuff]. (['stuff] must be constant-width.)]]
191	]
192	\n
193
194	[endsect]
