Context Navigation

grammars.qbk @ 33

Last change on this file since 33 was 29, checked in by landauf, 16 years ago
updated boost from 1_33_1 to 1_34_1
File size: 12.8 KB

Line
1	[section Grammars and Nested Matches]
2
3	[h2 Overview]
4
5	One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++
6	code and data from within the regex. This enables programming idioms that are not possible with other regular
7	expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you
8	to build grammars out of regular expressions. This section describes how to embed one regex in another by value
9	and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results
10	after a successful parse.
11
12	[h2 Embedding a Regex by Value]
13
14	The _basic_regex_ object has value semantics. When a regex object appears on the right-hand side in the definition
15	of another regex, it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by
16	the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex
17	participates fully in the match, back-tracking as needed to make the match succeed.
18
19	Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with
20	xpressive as follows:
21
22	find_dialog dlg;
23	if( dialog_ok == dlg.do_modal() )
24	{
25	std::string pattern = dlg.get_text(); // the pattern the user entered
26	bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option?
27
28	sregex re = sregex::compile( pattern ); // try to compile the pattern
29
30	if( whole_word )
31	{
32	// wrap the regex in begin-word / end-word assertions
33	re = bow >> re >> eow;
34	}
35
36	// ... use re ...
37	}
38
39	Look closely at this line:
40
41	// wrap the regex in begin-word / end-word assertions
42	re = bow >> re >> eow;
43
44	This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to
45	the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might
46	expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions.
47
48	[note Note that `re = bow >> re >> eow` does ['not] define a recursive regular expression, since regex
49	objects embed by value by default. The next section shows how to define a recursive regular expression by
50	embedding a regex by reference.]
51
52	[h2 Embedding a Regex by Reference]
53
54	If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex
55	by value is not enough. You need to be able to make your regular expressions self-referential. Most regular
56	expression engines don't give you that power, but xpressive does.
57
58	[tip The theoretical computer scientists out there will correctly point out that a self-referential
59	regular expression is not "regular", so in the strict sense, xpressive isn't really a ['regular] expression engine
60	at all. But as Larry Wall once said, "the term '''[regular expression]''' has grown with the capabilities of our
61	pattern matching engines, so I'm not going to try to fight linguistic necessity here."]
62
63	Consider the following code, which uses the `by_ref()` helper to define a recursive regular expression that
64	matches balanced, nested parentheses:
65
66	sregex parentheses;
67	parentheses // A balanced set of parentheses ...
68	= '(' // is an opening parenthesis ...
69	>> // followed by ...
70	*( // zero or more ...
71	keep( +~(set='(',')') ) // of a bunch of things that are not parentheses ...
72	\| // or ...
73	by_ref(parentheses) // a balanced set of parentheses
74	) // (ooh, recursion!) ...
75	>> // followed by ...
76	')' // a closing parenthesis
77	;
78
79	Matching balanced, nested tags is an important text processing task, and it is one that "classic" regular
80	expressions cannot do. The `by_ref()` helper makes it possible. It allows one regex object to be embedded
81	in another ['by reference]. Since the right-hand side holds `parentheses` by reference, assigning the right-hand
82	side back to `parentheses` creates a cycle, which will execute recursively.
83
84	[h2 Building a Grammar]
85
86	Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of
87	fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have
88	a look at the text-book grammar example: the humble calculator.
89
90	sregex group, factor, term, expression;
91
92	group = '(' >> by_ref(expression) >> ')';
93	factor = +_d \| group;
94	term = factor >> (('' >> factor) \| ('/' >> factor));
95	expression = term >> *(('+' >> term) \| ('-' >> term));
96
97	The regex `expression` defined above does something rather remarkable for a regular expression: it matches
98	mathematical expressions. For example, if the input string were `"foo 9*(10+3) bar"`, this pattern would
99	match `"9*(10+3)"`. It only matches well-formed mathematical expressions, where the parentheses are
100	balanced and the infix operators have two arguments each. Don't try this with just any regular expression
101	engine!
102
103	[note There is no way for a dynamic regex to refer to other regexes, so they can only
104	be used as terminals in a grammar. Use static regexes for non-terminal grammar rules.]
105
106	Let's take a closer look at this regular expression grammar. Notice that it is cyclic: `expression` is
107	implemented in terms of `term`, which is implemented in terms of `factor`, which is implemented in terms
108	of `group`, which is implemented in terms of `expression`, closing the loop. In general, the way to define
109	a cyclic grammar is to forward-declare the regex objects and embed by reference those regular expressions
110	that have not yet been initialized. In the above grammar, there is only one place where we need to reference
111	a regex object that has not yet been initialized: the definition of `group`. In that place, we use
112	`by_ref()` to embed `expression` by reference. In all other places, it is sufficient to embed the other
113	regex objects by value, since they have already been initialized and their values will not change.
114
115	[tip [*Embed by value if possible]
116	\n\n
117	In general, prefer embedding regular expressions by value rather than by reference. It
118	involves one less indirection, making your patterns match a little faster. Besides, value semantics
119	are simpler and will make your grammars easier to reason about. Don't worry about the expense of "copying"
120	a regex. Each regex object shares its implementation with all of its copies.]
121
122	[h2 Cyclic Patterns, Copying and Memory Management, Oh My!]
123
124	The calculator example above raises a number of very complicated memory-management issues. Each of the
125	four regex objects refer to each other, some directly and some indirectly, some by value and some by
126	reference. What if we were to return one of them from a function and let the others go out of scope?
127	What becomes of the references? The answer is that the regex objects are internally reference counted,
128	such that they keep their referenced regex objects alive as long as they need them. So passing a regex
129	object by value is never a problem, even if it refers to other regex objects that have gone out of scope.
130
131	Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclic
132	references. If regex objects are reference counted, what happens to cycles like the one created in the
133	calculator example? Are they leaked? The answer is no, they are not leaked. The _basic_regex_ object has some tricky
134	reference tracking code that ensures that even cyclic regex grammars are cleaned up when the last external
135	reference goes away. So don't worry about it. Create cyclic grammars, pass your regex objects around and
136	copy them all you want. It is fast and efficient and guaranteed not to leak or result in dangling references.
137
138	[h2 Nested Regexes and Sub-Match Scoping]
139
140	Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write
141	to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the
142	sub-matches written by the outer regex. For example, what does this do?
143
144	sregex inner = sregex::compile( "(.)\\1" );
145	sregex outer = (s1= _) >> inner >> s1;
146
147	The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer
148	regex. The problem is particularly acute when the inner regex is accepted from the user as input. The
149	author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is
150	clearly not acceptable.
151
152	Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matches
153	belong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector to
154	play with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, for
155	example, the regex `outer` defined above would match `"ABBA"`, as it should.
156
157	[h2 Nested Results]
158
159	If nested regexes have their own sub-matches, there should be a way to access them after a successful
160	match. In fact, there is. After a _regex_match_ or _regex_search_, the _match_results_ struct behaves
161	like the head of a tree of nested results. The _match_results_ class provides a `nested_results()`
162	member function that returns an ordered sequence of _match_results_ structures, representing the
163	results of the nested regexes. The order of the nested results is the same as the order in which
164	the nested regex objects matched.
165
166	Take as an example the regex for balanced, nested parentheses we saw earlier:
167
168	sregex parentheses;
169	parentheses = '(' >> *( keep( +~(set='(',')') ) \| by_ref(parentheses) ) >> ')';
170
171	smatch what;
172	std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" );
173
174	if( regex_search( str, what, parentheses ) )
175	{
176	// display the whole match
177	std::cout << what[0] << '\n';
178
179	// display the nested results
180	std::for_each(
181	what.nested_results().begin(),
182	what.nested_results().end(),
183	output_nested_results() );
184	}
185
186	This program displays the following:
187
188	[pre
189	( a(b)c (c(e)f (g)h )i (j)6 )
190	(b)
191	(c(e)f (g)h )
192	(e)
193	(g)
194	(j)
195	]
196
197	Here you can see how the results are nested and that they are stored in the order in which they
198	are found.
199
200	[tip See the definition of [link boost_xpressive.user_s_guide.examples.display_a_tree_of_nested_results output_nested_results] in the
201	[link boost_xpressive.user_s_guide.examples Examples] section.]
202
203	[h2 Filtering Nested Results]
204
205	Sometimes a regex will have several nested regex objects, and you want to know which result corresponds
206	to which regex object. That's where `basic_regex<>::regex_id()` and `match_results<>::regex_id()`
207	come in handy. When iterating over the nested results, you can compare the regex id from the results to
208	the id of the regex object you're interested in.
209
210	To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just the
211	results that correspond to a certain nested regex. It is called `regex_id_filter_predicate`, and it is
212	intended to be used with _iterator_. You can use it as follows:
213
214	sregex name = +alpha;
215	sregex integer = +_d;
216	sregex re = ( _s >> ( name \| integer ) );
217
218	smatch what;
219	std::string str( "marsha 123 jan 456 cindy 789" );
220
221	if( regex_match( str, what, re ) )
222	{
223	smatch::nested_results_type::const_iterator begin = what.nested_results().begin();
224	smatch::nested_results_type::const_iterator end = what.nested_results().end();
225
226	// declare filter predicates to select just the names or the integers
227	sregex_id_filter_predicate name_id( name.regex_id() );
228	sregex_id_filter_predicate integer_id( integer.regex_id() );
229
230	// iterate over only the results from the name regex
231	std::for_each(
232	boost::make_filter_iterator( name_id, begin, end ),
233	boost::make_filter_iterator( name_id, end, end ),
234	output_result
235	);
236
237	std::cout << '\n';
238
239	// iterate over only the results from the integer regex
240	std::for_each(
241	boost::make_filter_iterator( integer_id, begin, end ),
242	boost::make_filter_iterator( integer_id, end, end ),
243	output_result
244	);
245	}
246
247	where `output_results` is a simple function that takes a `smatch` and displays the full match.
248	Notice how we use the `regex_id_filter_predicate` together with `basic_regex<>::regex_id()` and
249	`boost::make_filter_iterator()` from the _iterator_ to select only those results
250	corresponding to a particular nested regex. This program displays the following:
251
252	[pre
253	marsha
254	jan
255	cindy
256	123
257	456
258	789
259	]
260
261	[endsect]

Note: See TracBrowser for help on using the repository browser.

Download in other formats:

Original Format