1 | [section Grammars and Nested Matches] |
---|
2 | |
---|
3 | [h2 Overview] |
---|
4 | |
---|
5 | One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++ |
---|
6 | code and data from within the regex. This enables programming idioms that are not possible with other regular |
---|
7 | expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you |
---|
8 | to build grammars out of regular expressions. This section describes how to embed one regex in another by value |
---|
9 | and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results |
---|
10 | after a successful parse. |
---|
11 | |
---|
12 | [h2 Embedding a Regex by Value] |
---|
13 | |
---|
14 | The _basic_regex_ object has value semantics. When a regex object appears on the right-hand side in the definition |
---|
15 | of another regex, it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by |
---|
16 | the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex |
---|
17 | participates fully in the match, back-tracking as needed to make the match succeed. |
---|
18 | |
---|
19 | Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with |
---|
20 | xpressive as follows: |
---|
21 | |
---|
22 | find_dialog dlg; |
---|
23 | if( dialog_ok == dlg.do_modal() ) |
---|
24 | { |
---|
25 | std::string pattern = dlg.get_text(); // the pattern the user entered |
---|
26 | bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option? |
---|
27 | |
---|
28 | sregex re = sregex::compile( pattern ); // try to compile the pattern |
---|
29 | |
---|
30 | if( whole_word ) |
---|
31 | { |
---|
32 | // wrap the regex in begin-word / end-word assertions |
---|
33 | re = bow >> re >> eow; |
---|
34 | } |
---|
35 | |
---|
36 | // ... use re ... |
---|
37 | } |
---|
38 | |
---|
39 | Look closely at this line: |
---|
40 | |
---|
41 | // wrap the regex in begin-word / end-word assertions |
---|
42 | re = bow >> re >> eow; |
---|
43 | |
---|
44 | This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to |
---|
45 | the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might |
---|
46 | expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions. |
---|
47 | |
---|
48 | [note Note that `re = bow >> re >> eow` does ['not] define a recursive regular expression, since regex |
---|
49 | objects embed by value by default. The next section shows how to define a recursive regular expression by |
---|
50 | embedding a regex by reference.] |
---|
51 | |
---|
52 | [h2 Embedding a Regex by Reference] |
---|
53 | |
---|
54 | If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex |
---|
55 | by value is not enough. You need to be able to make your regular expressions self-referential. Most regular |
---|
56 | expression engines don't give you that power, but xpressive does. |
---|
57 | |
---|
58 | [tip The theoretical computer scientists out there will correctly point out that a self-referential |
---|
59 | regular expression is not "regular", so in the strict sense, xpressive isn't really a ['regular] expression engine |
---|
60 | at all. But as Larry Wall once said, "the term '''[regular expression]''' has grown with the capabilities of our |
---|
61 | pattern matching engines, so I'm not going to try to fight linguistic necessity here."] |
---|
62 | |
---|
63 | Consider the following code, which uses the `by_ref()` helper to define a recursive regular expression that |
---|
64 | matches balanced, nested parentheses: |
---|
65 | |
---|
66 | sregex parentheses; |
---|
67 | parentheses // A balanced set of parentheses ... |
---|
68 | = '(' // is an opening parenthesis ... |
---|
69 | >> // followed by ... |
---|
70 | *( // zero or more ... |
---|
71 | keep( +~(set='(',')') ) // of a bunch of things that are not parentheses ... |
---|
72 | | // or ... |
---|
73 | by_ref(parentheses) // a balanced set of parentheses |
---|
74 | ) // (ooh, recursion!) ... |
---|
75 | >> // followed by ... |
---|
76 | ')' // a closing parenthesis |
---|
77 | ; |
---|
78 | |
---|
79 | Matching balanced, nested tags is an important text processing task, and it is one that "classic" regular |
---|
80 | expressions cannot do. The `by_ref()` helper makes it possible. It allows one regex object to be embedded |
---|
81 | in another ['by reference]. Since the right-hand side holds `parentheses` by reference, assigning the right-hand |
---|
82 | side back to `parentheses` creates a cycle, which will execute recursively. |
---|
83 | |
---|
84 | [h2 Building a Grammar] |
---|
85 | |
---|
86 | Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of |
---|
87 | fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have |
---|
88 | a look at the text-book grammar example: the humble calculator. |
---|
89 | |
---|
90 | sregex group, factor, term, expression; |
---|
91 | |
---|
92 | group = '(' >> by_ref(expression) >> ')'; |
---|
93 | factor = +_d | group; |
---|
94 | term = factor >> *(('*' >> factor) | ('/' >> factor)); |
---|
95 | expression = term >> *(('+' >> term) | ('-' >> term)); |
---|
96 | |
---|
97 | The regex `expression` defined above does something rather remarkable for a regular expression: it matches |
---|
98 | mathematical expressions. For example, if the input string were `"foo 9*(10+3) bar"`, this pattern would |
---|
99 | match `"9*(10+3)"`. It only matches well-formed mathematical expressions, where the parentheses are |
---|
100 | balanced and the infix operators have two arguments each. Don't try this with just any regular expression |
---|
101 | engine! |
---|
102 | |
---|
103 | [note There is no way for a dynamic regex to refer to other regexes, so they can only |
---|
104 | be used as terminals in a grammar. Use static regexes for non-terminal grammar rules.] |
---|
105 | |
---|
106 | Let's take a closer look at this regular expression grammar. Notice that it is cyclic: `expression` is |
---|
107 | implemented in terms of `term`, which is implemented in terms of `factor`, which is implemented in terms |
---|
108 | of `group`, which is implemented in terms of `expression`, closing the loop. In general, the way to define |
---|
109 | a cyclic grammar is to forward-declare the regex objects and embed by reference those regular expressions |
---|
110 | that have not yet been initialized. In the above grammar, there is only one place where we need to reference |
---|
111 | a regex object that has not yet been initialized: the definition of `group`. In that place, we use |
---|
112 | `by_ref()` to embed `expression` by reference. In all other places, it is sufficient to embed the other |
---|
113 | regex objects by value, since they have already been initialized and their values will not change. |
---|
114 | |
---|
115 | [tip [*Embed by value if possible] |
---|
116 | \n\n |
---|
117 | In general, prefer embedding regular expressions by value rather than by reference. It |
---|
118 | involves one less indirection, making your patterns match a little faster. Besides, value semantics |
---|
119 | are simpler and will make your grammars easier to reason about. Don't worry about the expense of "copying" |
---|
120 | a regex. Each regex object shares its implementation with all of its copies.] |
---|
121 | |
---|
122 | [h2 Cyclic Patterns, Copying and Memory Management, Oh My!] |
---|
123 | |
---|
124 | The calculator example above raises a number of very complicated memory-management issues. Each of the |
---|
125 | four regex objects refer to each other, some directly and some indirectly, some by value and some by |
---|
126 | reference. What if we were to return one of them from a function and let the others go out of scope? |
---|
127 | What becomes of the references? The answer is that the regex objects are internally reference counted, |
---|
128 | such that they keep their referenced regex objects alive as long as they need them. So passing a regex |
---|
129 | object by value is never a problem, even if it refers to other regex objects that have gone out of scope. |
---|
130 | |
---|
131 | Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclic |
---|
132 | references. If regex objects are reference counted, what happens to cycles like the one created in the |
---|
133 | calculator example? Are they leaked? The answer is no, they are not leaked. The _basic_regex_ object has some tricky |
---|
134 | reference tracking code that ensures that even cyclic regex grammars are cleaned up when the last external |
---|
135 | reference goes away. So don't worry about it. Create cyclic grammars, pass your regex objects around and |
---|
136 | copy them all you want. It is fast and efficient and guaranteed not to leak or result in dangling references. |
---|
137 | |
---|
138 | [h2 Nested Regexes and Sub-Match Scoping] |
---|
139 | |
---|
140 | Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write |
---|
141 | to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the |
---|
142 | sub-matches written by the outer regex. For example, what does this do? |
---|
143 | |
---|
144 | sregex inner = sregex::compile( "(.)\\1" ); |
---|
145 | sregex outer = (s1= _) >> inner >> s1; |
---|
146 | |
---|
147 | The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer |
---|
148 | regex. The problem is particularly acute when the inner regex is accepted from the user as input. The |
---|
149 | author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is |
---|
150 | clearly not acceptable. |
---|
151 | |
---|
152 | Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matches |
---|
153 | belong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector to |
---|
154 | play with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, for |
---|
155 | example, the regex `outer` defined above would match `"ABBA"`, as it should. |
---|
156 | |
---|
157 | [h2 Nested Results] |
---|
158 | |
---|
159 | If nested regexes have their own sub-matches, there should be a way to access them after a successful |
---|
160 | match. In fact, there is. After a _regex_match_ or _regex_search_, the _match_results_ struct behaves |
---|
161 | like the head of a tree of nested results. The _match_results_ class provides a `nested_results()` |
---|
162 | member function that returns an ordered sequence of _match_results_ structures, representing the |
---|
163 | results of the nested regexes. The order of the nested results is the same as the order in which |
---|
164 | the nested regex objects matched. |
---|
165 | |
---|
166 | Take as an example the regex for balanced, nested parentheses we saw earlier: |
---|
167 | |
---|
168 | sregex parentheses; |
---|
169 | parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')'; |
---|
170 | |
---|
171 | smatch what; |
---|
172 | std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" ); |
---|
173 | |
---|
174 | if( regex_search( str, what, parentheses ) ) |
---|
175 | { |
---|
176 | // display the whole match |
---|
177 | std::cout << what[0] << '\n'; |
---|
178 | |
---|
179 | // display the nested results |
---|
180 | std::for_each( |
---|
181 | what.nested_results().begin(), |
---|
182 | what.nested_results().end(), |
---|
183 | output_nested_results() ); |
---|
184 | } |
---|
185 | |
---|
186 | This program displays the following: |
---|
187 | |
---|
188 | [pre |
---|
189 | ( a(b)c (c(e)f (g)h )i (j)6 ) |
---|
190 | (b) |
---|
191 | (c(e)f (g)h ) |
---|
192 | (e) |
---|
193 | (g) |
---|
194 | (j) |
---|
195 | ] |
---|
196 | |
---|
197 | Here you can see how the results are nested and that they are stored in the order in which they |
---|
198 | are found. |
---|
199 | |
---|
200 | [tip See the definition of [link boost_xpressive.user_s_guide.examples.display_a_tree_of_nested_results output_nested_results] in the |
---|
201 | [link boost_xpressive.user_s_guide.examples Examples] section.] |
---|
202 | |
---|
203 | [h2 Filtering Nested Results] |
---|
204 | |
---|
205 | Sometimes a regex will have several nested regex objects, and you want to know which result corresponds |
---|
206 | to which regex object. That's where `basic_regex<>::regex_id()` and `match_results<>::regex_id()` |
---|
207 | come in handy. When iterating over the nested results, you can compare the regex id from the results to |
---|
208 | the id of the regex object you're interested in. |
---|
209 | |
---|
210 | To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just the |
---|
211 | results that correspond to a certain nested regex. It is called `regex_id_filter_predicate`, and it is |
---|
212 | intended to be used with _iterator_. You can use it as follows: |
---|
213 | |
---|
214 | sregex name = +alpha; |
---|
215 | sregex integer = +_d; |
---|
216 | sregex re = *( *_s >> ( name | integer ) ); |
---|
217 | |
---|
218 | smatch what; |
---|
219 | std::string str( "marsha 123 jan 456 cindy 789" ); |
---|
220 | |
---|
221 | if( regex_match( str, what, re ) ) |
---|
222 | { |
---|
223 | smatch::nested_results_type::const_iterator begin = what.nested_results().begin(); |
---|
224 | smatch::nested_results_type::const_iterator end = what.nested_results().end(); |
---|
225 | |
---|
226 | // declare filter predicates to select just the names or the integers |
---|
227 | sregex_id_filter_predicate name_id( name.regex_id() ); |
---|
228 | sregex_id_filter_predicate integer_id( integer.regex_id() ); |
---|
229 | |
---|
230 | // iterate over only the results from the name regex |
---|
231 | std::for_each( |
---|
232 | boost::make_filter_iterator( name_id, begin, end ), |
---|
233 | boost::make_filter_iterator( name_id, end, end ), |
---|
234 | output_result |
---|
235 | ); |
---|
236 | |
---|
237 | std::cout << '\n'; |
---|
238 | |
---|
239 | // iterate over only the results from the integer regex |
---|
240 | std::for_each( |
---|
241 | boost::make_filter_iterator( integer_id, begin, end ), |
---|
242 | boost::make_filter_iterator( integer_id, end, end ), |
---|
243 | output_result |
---|
244 | ); |
---|
245 | } |
---|
246 | |
---|
247 | where `output_results` is a simple function that takes a `smatch` and displays the full match. |
---|
248 | Notice how we use the `regex_id_filter_predicate` together with `basic_regex<>::regex_id()` and |
---|
249 | `boost::make_filter_iterator()` from the _iterator_ to select only those results |
---|
250 | corresponding to a particular nested regex. This program displays the following: |
---|
251 | |
---|
252 | [pre |
---|
253 | marsha |
---|
254 | jan |
---|
255 | cindy |
---|
256 | 123 |
---|
257 | 456 |
---|
258 | 789 |
---|
259 | ] |
---|
260 | |
---|
261 | [endsect] |
---|