1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
---|
2 | <html> |
---|
3 | <head> |
---|
4 | <title>Boost.Regex: Introduction</title> |
---|
5 | <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> |
---|
6 | <link rel="stylesheet" type="text/css" href="../../../boost.css"> |
---|
7 | </head> |
---|
8 | <body> |
---|
9 | <P> |
---|
10 | <TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0"> |
---|
11 | <TR> |
---|
12 | <td valign="top" width="300"> |
---|
13 | <h3><a href="../../../index.htm"><img height="86" width="277" alt="C++ Boost" src="../../../boost.png" border="0"></a></h3> |
---|
14 | </td> |
---|
15 | <TD width="353"> |
---|
16 | <H1 align="center">Boost.Regex</H1> |
---|
17 | <H2 align="center">Introduction</H2> |
---|
18 | </TD> |
---|
19 | <td width="50"> |
---|
20 | <h3><a href="index.html"><img height="45" width="43" alt="Boost.Regex Index" src="uarrow.gif" border="0"></a></h3> |
---|
21 | </td> |
---|
22 | </TR> |
---|
23 | </TABLE> |
---|
24 | </P> |
---|
25 | <HR> |
---|
26 | <p></p> |
---|
27 | <P>Regular expressions are a form of pattern-matching that is often used in text |
---|
28 | processing; many users will be familiar with the Unix utilities <I>grep</I>, <I>sed</I> |
---|
29 | and <I>awk</I>, and the programming language <I>Perl</I>, each of which makes |
---|
30 | extensive use of regular expressions. Traditionally C++ users have been limited |
---|
31 | to the POSIX C APIs for manipulating regular expressions, and while regex++ |
---|
32 | does provide these APIs, they do not represent the best way to use the |
---|
33 | library. For example, regex++ can cope with wide character strings, or search |
---|
34 | and replace operations (in a manner analogous to either sed or Perl), something |
---|
35 | that traditional C libraries cannot do.</P> |
---|
36 | <P>The class <A href="basic_regex.html">boost::basic_regex</A> is the key class in |
---|
37 | this library; it represents a "machine readable" regular expression, and is |
---|
38 | very closely modeled on std::basic_string; think of it as a string plus the |
---|
39 | actual state-machine required by the regular expression algorithms. Like |
---|
40 | std::basic_string there are two typedefs that are almost always the means by |
---|
41 | which this class is referenced:</P> |
---|
42 | <pre><B>namespace </B>boost{ |
---|
43 | |
---|
44 | <B>template</B> <<B>class</B> charT, |
---|
45 | <B> class</B> traits = regex_traits<charT> > |
---|
46 | <B>class</B> basic_regex; |
---|
47 | |
---|
48 | <B>typedef</B> basic_regex<<B>char</B>> regex; |
---|
49 | <B>typedef</B> basic_regex<<B>wchar_t</B>> wregex; |
---|
50 | |
---|
51 | }</pre> |
---|
52 | <P>To see how this library can be used, imagine that we are writing a credit card |
---|
53 | processing application. Credit card numbers generally come as a string of |
---|
54 | 16 digits, split into groups of 4 digits that are separated by either a space |
---|
55 | or a hyphen. Before storing a credit card number in a database (not necessarily |
---|
56 | something your customers will appreciate!), we may want to verify that the |
---|
57 | number is in the correct format. To match any digit we could use the regular |
---|
58 | expression [0-9]; however, ranges of characters like this are actually locale |
---|
59 | dependent. Instead we should use the POSIX standard form [[:digit:]], or the |
---|
60 | regex++ and Perl shorthand for this, \d (note that many older libraries tended |
---|
61 | to be hard-coded to the C locale, so this was not an issue for them). |
---|
62 | That leaves us with the following regular expression to validate credit card |
---|
63 | number formats:</P> |
---|
64 | <PRE>(\d{4}[- ]){3}\d{4}</PRE> |
---|
65 | <P>Here the parentheses act to group (and mark for future reference) |
---|
66 | sub-expressions, and the {4} means "repeat exactly 4 times". This is an example |
---|
67 | of the extended regular expression syntax used by Perl, awk and egrep. Regex++ |
---|
68 | also supports the older "basic" syntax used by sed and grep, but this is |
---|
69 | generally less useful, unless you already have some basic regular expressions |
---|
70 | that you need to reuse.</P> |
---|
71 | <P>Now let's take that expression and place it in some C++ code to validate the |
---|
72 | format of a credit card number:</P> |
---|
73 | <PRE><B>bool</B> validate_card_format(<B>const</B> std::string& s) |
---|
74 | { |
---|
75 | <B>static</B> <B>const</B> <A href="basic_regex.html">boost::regex</A> e("(\\d{4}[- ]){3}\\d{4}"); |
---|
76 | <B>return</B> <A href="regex_match.html">regex_match</A>(s, e); |
---|
77 | }</PRE> |
---|
78 | <P>Note how we had to add some extra escapes to the expression: the escape is |
---|
79 | seen once by the C++ compiler before it is seen by the |
---|
80 | regular expression engine, so escapes in regular expressions have to |
---|
81 | be doubled up when embedding them in C/C++ code. Also note that all the |
---|
82 | examples assume that your compiler supports Koenig lookup; if yours doesn't |
---|
83 | (for example VC6), then you will have to add some boost:: prefixes to some of |
---|
84 | the function calls in the examples.</P> |
---|
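<P>As a rough illustration of the escaping rule, the sketch below writes the same expression twice: once with doubled backslashes, and once as a C++11 raw string literal, which passes backslashes through untouched. It uses std::regex as a stand-in so that it compiles without Boost; that substitution is an assumption of this sketch, and boost::regex with boost::regex_match is used in the same way. The names validate_escaped, validate_raw and check_validation are hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex in this sketch.

// Escapes doubled, because the compiler consumes one level of backslashes.
bool validate_escaped(const std::string& s)
{
    static const std::regex e("(\\d{4}[- ]){3}\\d{4}");
    return std::regex_match(s, e);
}

// The same expression as a raw string literal (C++11): no doubling needed.
bool validate_raw(const std::string& s)
{
    static const std::regex e(R"((\d{4}[- ]){3}\d{4})");
    return std::regex_match(s, e);
}

// Hypothetical self-check showing which inputs are accepted.
void check_validation()
{
    assert(validate_escaped("1234-5678-9012-3456"));
    assert(validate_raw("1234 5678 9012 3456"));
    assert(!validate_raw("1234567890123456"));      // separators are required
    assert(!validate_raw("x1234-5678-9012-3456"));  // whole input must match
}
```

<P>Both functions accept and reject exactly the same inputs; the raw literal only changes how the expression is spelled in the source file.</P>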
85 | <P>Those of you who are familiar with credit card processing will have realized |
---|
86 | that while the format used above is suitable for human readable card numbers, |
---|
87 | it does not represent the format required by online credit card systems; these |
---|
88 | require the number as a string of 16 (or possibly 15) digits, without any |
---|
89 | intervening spaces. What we need is a means to convert easily between the two |
---|
90 | formats, and this is where search and replace comes in. Those who are familiar |
---|
91 | with the utilities <I>sed</I> and <I>Perl</I> will already be ahead here; we |
---|
92 | need two strings - one a regular expression, the other a "<A href="format_syntax.html">format |
---|
93 | string</A>" that provides a description of the text to replace the match |
---|
94 | with. In regex++ this search and replace operation is performed with the |
---|
95 | algorithm <A href="regex_replace.html">regex_replace</A>; for our credit card |
---|
96 | example we can write two functions like this to provide the format |
---|
97 | conversions:</P> |
---|
98 | <PRE><I>// match any format with the regular expression: |
---|
99 | </I><B>const</B> boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"); |
---|
100 | <B>const</B> std::string machine_format("\\1\\2\\3\\4"); |
---|
101 | <B>const</B> std::string human_format("\\1-\\2-\\3-\\4"); |
---|
102 | |
---|
103 | std::string machine_readable_card_number(<B>const</B> std::string s) |
---|
104 | { |
---|
105 | <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, machine_format, boost::match_default | boost::format_sed); |
---|
106 | } |
---|
107 | |
---|
108 | std::string human_readable_card_number(<B>const</B> std::string s) |
---|
109 | { |
---|
110 | <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, human_format, boost::match_default | boost::format_sed); |
---|
111 | }</PRE> |
---|
112 | <P>Here we've used marked sub-expressions in the regular expression to split out |
---|
113 | the four parts of the card number as separate fields; the format string then |
---|
114 | uses the sed-like syntax to replace the matched text with the reformatted |
---|
115 | version.</P> |
---|
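<P>To make the behaviour of the two conversions concrete, here is a minimal sketch with some checked inputs. It again assumes std::regex as a stand-in for boost::regex; because the default std::regex grammar has no \A and \z, the sketch anchors with ^ and $ instead. The names card_number, to_machine, to_human and check_conversions are hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex; ^ and $ replace \A and \z.
const std::regex card_number(
    "^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");

std::string to_machine(const std::string& s)
{
    // format_sed selects sed-style \1 ... \9 references in the format string.
    return std::regex_replace(s, card_number, "\\1\\2\\3\\4",
                              std::regex_constants::format_sed);
}

std::string to_human(const std::string& s)
{
    return std::regex_replace(s, card_number, "\\1-\\2-\\3-\\4",
                              std::regex_constants::format_sed);
}

// Hypothetical self-check of the two conversions.
void check_conversions()
{
    assert(to_machine("1234-5678-9012-3456") == "1234567890123456");
    assert(to_human("1234567890123456") == "1234-5678-9012-3456");
    assert(to_human("123456789012345") == "123-4567-8901-2345");  // 15 digits
    assert(to_human("not a card") == "not a card");  // no match: text is copied
}
```

<P>Note that text which does not match the expression is copied to the output unchanged, so a caller that needs to reject bad input should validate it first.</P>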
116 | <P>In the examples above, we haven't directly manipulated the results of a regular |
---|
117 | expression match; in general, however, the result of a match contains a number of |
---|
118 | sub-expression matches in addition to the overall match. When the library needs |
---|
119 | to report a regular expression match it does so using an instance of the class <A href="match_results.html"> |
---|
120 | match_results</A>; as before, there are typedefs of this class for the most |
---|
121 | common cases: |
---|
122 | </P> |
---|
123 | <PRE><B>namespace </B>boost{ |
---|
124 | <B>typedef</B> match_results<<B>const</B> <B>char</B>*> cmatch; |
---|
125 | <B>typedef</B> match_results<<B>const</B> <B>wchar_t</B>*> wcmatch; |
---|
126 | <STRONG>typedef</STRONG> match_results<std::string::const_iterator> smatch; |
---|
127 | <STRONG>typedef</STRONG> match_results<std::wstring::const_iterator> wsmatch; |
---|
128 | }</PRE> |
---|
129 | <P>The algorithms <A href="regex_search.html">regex_search</A> and <A href="regex_match.html">regex_match</A> |
---|
130 | make use of match_results to report what matched; the difference between these |
---|
131 | algorithms is that <A href="regex_match.html">regex_match</A> will only find |
---|
132 | matches that consume <EM>all</EM> of the input text, whereas <A href="regex_search.html"> |
---|
133 | regex_search</A> will <EM>search</EM> for a match anywhere within the text |
---|
134 | being matched.</P> |
---|
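<P>The difference between the two algorithms, and the way sub-expression matches are read back, can be sketched as follows. The sketch assumes std::regex and std::smatch as stand-ins for boost::regex and boost::smatch; check_match_vs_search is a hypothetical name.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex / std::smatch stand in for boost::regex / boost::smatch.
void check_match_vs_search()
{
    const std::regex e("(\\d{4})[- ](\\d{4})");
    const std::string text = "ref 1234-5678 ok";
    std::smatch what;

    // regex_match must consume the whole input, so the extra words defeat it.
    assert(!std::regex_match(text, what, e));

    // regex_search looks for the expression anywhere in the input.
    assert(std::regex_search(text, what, e));
    assert(what[0] == "1234-5678");   // what[0] is the overall match
    assert(what[1] == "1234");        // what[1] and what[2] are the
    assert(what[2] == "5678");        // marked sub-expressions
    assert(what.prefix() == "ref ");  // text preceding the match
}
```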
135 | <P>Note that these algorithms are not restricted to searching regular C-strings; |
---|
136 | any bidirectional iterator type can be searched, allowing for the possibility |
---|
137 | of seamlessly searching almost any kind of data. |
---|
138 | </P> |
---|
139 | <P>For search and replace operations, in addition to the algorithm <A href="regex_replace.html"> |
---|
140 | regex_replace</A> that we have already seen, the <A href="match_results.html">match_results</A> |
---|
141 | class has a format member that takes the result of a match and a format string, |
---|
142 | and produces a new string by merging the two.</P> |
---|
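<P>A minimal sketch of the format member, assuming std::regex and std::smatch as stand-ins for the Boost types: the default format syntax shown here refers to sub-expressions as $1, $2 and so on. The function name last_four and the wording of the format string are hypothetical.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assumption: std::regex / std::smatch stand in for boost::regex / boost::smatch.
// Returns a short description built from the first card number found in s,
// or an empty string if there is none.
std::string last_four(const std::string& s)
{
    static const std::regex e("(\\d{4})[- ](\\d{4})[- ](\\d{4})[- ](\\d{4})");
    std::smatch what;
    if (!std::regex_search(s, what, e))
        return std::string();
    // format() merges the match with the format string: $4 is sub-expression 4.
    return what.format("last four digits: $4");
}

// Hypothetical self-check.
void check_format()
{
    assert(last_four("card 1234-5678-9012-3456.") == "last four digits: 3456");
    assert(last_four("no card here").empty());
}
```

<P>Unlike regex_replace, format produces only the formatted text for one match; the surrounding input is not copied.</P>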
143 | <P>For iterating through all occurrences of an expression within a text, there are |
---|
144 | two iterator types: <A href="regex_iterator.html">regex_iterator</A> will |
---|
145 | enumerate over the <A href="match_results.html">match_results</A> objects |
---|
146 | found, while <A href="regex_token_iterator.html">regex_token_iterator</A> will |
---|
147 | enumerate a series of strings (similar to Perl-style split operations).</P> |
---|
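<P>The two iterator types can be sketched side by side. As before, the sketch assumes the std:: equivalents (std::sregex_iterator and std::sregex_token_iterator) in place of the Boost types; all_numbers, split_fields and check_iterators are hypothetical names.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Assumption: the std:: iterator types stand in for their Boost counterparts.

// regex_iterator: each dereference yields a match_results object.
std::vector<std::string> all_numbers(const std::string& text)
{
    static const std::regex number("\\d+");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), number), end;
         it != end; ++it)
        out.push_back(it->str());
    return out;
}

// regex_token_iterator with index -1: yields the text between matches,
// which gives a split operation.
std::vector<std::string> split_fields(const std::string& text)
{
    static const std::regex separator("[- ]");
    std::sregex_token_iterator first(text.begin(), text.end(), separator, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}

// Hypothetical self-check.
void check_iterators()
{
    const std::vector<std::string> n = all_numbers("a1 b22 c333");
    assert(n.size() == 3 && n[0] == "1" && n[1] == "22" && n[2] == "333");

    const std::vector<std::string> f = split_fields("1234-5678 9012-3456");
    assert(f.size() == 4 && f[0] == "1234" && f[3] == "3456");
}
```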
148 | <P>For those who dislike templates, there is a high level wrapper class RegEx |
---|
149 | that is an encapsulation of the lower level template code - it provides a |
---|
150 | simplified interface for those who don't need the full power of the library, |
---|
151 | and supports only narrow characters, and the "extended" regular expression |
---|
152 | syntax. This class is now deprecated as it does not form part of the regular |
---|
153 | expressions C++ standard library proposal. |
---|
154 | </P> |
---|
155 | <P>The <A href="posix_api.html">POSIX API</A> functions regcomp, regexec, regfree |
---|
156 | and regerror are available in both narrow character and Unicode versions, and |
---|
157 | are provided for those who need compatibility with these APIs. |
---|
158 | </P> |
---|
159 | <P>Finally, note that the library now has run-time <A href="localisation.html">localization</A> |
---|
160 | support, and recognizes the full POSIX regular expression syntax - including |
---|
161 | advanced features like multi-character collating elements and equivalence |
---|
162 | classes - as well as providing compatibility with other regular expression |
---|
163 | libraries including GNU and BSD4 regex packages, and to a more limited extent |
---|
164 | Perl 5. |
---|
165 | </P> |
---|
166 | <P> |
---|
167 | <HR> |
---|
168 | <P></P> |
---|
169 | <p>Revised 24 Oct 2003</p> |
---|
173 | <p><i>© Copyright John Maddock 1998-2003</i></p> |
---|
176 | <P><I>Use, modification and distribution are subject to the Boost Software License, |
---|
177 | Version 1.0. (See accompanying file <A href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</A> |
---|
178 | or copy at <A href="http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</A>)</I></P> |
---|
179 | </body> |
---|
180 | </html> |
---|
181 | |
---|