C++ Boost

Boost.Regex

Introduction

Boost.Regex Index


Regular expressions are a form of pattern-matching that are often used in text processing; many users will be familiar with the Unix utilities grep, sed and awk, and the programming language Perl, each of which make extensive use of regular expressions. Traditionally C++ users have been limited to the POSIX C API's for manipulating regular expressions, and while regex++ does provide these API's, they do not represent the best way to use the library. For example regex++ can cope with wide character strings, or search and replace operations (in a manner analogous to either sed or Perl), something that traditional C libraries can not do.

The class boost::basic_regex is the key class in this library; it represents a "machine readable" regular expression, and is very closely modeled on std::basic_string, think of it as a string plus the actual state-machine required by the regular expression algorithms. Like std::basic_string there are two typedefs that are almost always the means by which this class is referenced:

namespace boost{

template <class charT, 
          class traits = regex_traits<charT> >
class basic_regex;

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

}

To see how this library can be used, imagine that we are writing a credit card processing application. Credit card numbers generally come as a string of 16-digits, separated into groups of 4-digits, and separated by either a space or a hyphen. Before storing a credit card number in a database (not necessarily something your customers will appreciate!), we may want to verify that the number is in the correct format. To match any digit we could use the regular expression [0-9], however ranges of characters like this are actually locale dependent. Instead we should use the POSIX standard form [[:digit:]], or the regex++ and Perl shorthand for this \d (note that many older libraries tended to be hard-coded to the C-locale, consequently this was not an issue for them). That leaves us with the following regular expression to validate credit card number formats:

(\d{4}[- ]){3}\d{4}

Here the parenthesis act to group (and mark for future reference) sub-expressions, and the {4} means "repeat exactly 4 times". This is an example of the extended regular expression syntax used by Perl, awk and egrep. Regex++ also supports the older "basic" syntax used by sed and grep, but this is generally less useful, unless you already have some basic regular expressions that you need to reuse.

Now let's take that expression and place it in some C++ code to validate the format of a credit card number:

bool validate_card_format(const std::string& s)
{
   static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
   return regex_match(s, e);
}

Note how we had to add some extra escapes to the expression: remember that the escape is seen once by the C++ compiler, before it gets to be seen by the regular expression engine, consequently escapes in regular expressions have to be doubled up when embedding them in C/C++ code. Also note that all the examples assume that your compiler supports Koenig lookup, if yours doesn't (for example VC6), then you will have to add some boost:: prefixes to some of the function calls in the examples.

Those of you who are familiar with credit card processing, will have realized that while the format used above is suitable for human readable card numbers, it does not represent the format required by online credit card systems; these require the number as a string of 16 (or possibly 15) digits, without any intervening spaces. What we need is a means to convert easily between the two formats, and this is where search and replace comes in. Those who are familiar with the utilities sed and Perl will already be ahead here; we need two strings - one a regular expression - the other a "format string" that provides a description of the text to replace the match with. In regex++ this search and replace operation is performed with the algorithm regex_replace, for our credit card example we can write two algorithms like this to provide the format conversions:

// match any format with the regular expression:
const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

std::string machine_readable_card_number(const std::string s)
{
   return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
}

std::string human_readable_card_number(const std::string s)
{
   return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
}

Here we've used marked sub-expressions in the regular expression to split out the four parts of the card number as separate fields, the format string then uses the sed-like syntax to replace the matched text with the reformatted version.

In the examples above, we haven't directly manipulated the results of a regular expression match, however in general the result of a match contains a number of sub-expression matches in addition to the overall match. When the library needs to report a regular expression match it does so using an instance of the class match_results, as before there are typedefs of this class for the most common cases:

namespace boost{
typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<std::string::const_iterator> smatch;
typedef match_results<std::wstring::const_iterator> wsmatch; 
}

The algorithms regex_search and regex_match make use of match_results to report what matched; the difference between these algorithms is that regex_match will only find matches that consume all of the input text, where as regex_search will search for a match anywhere within the text being matched.

Note that these algorithms are not restricted to searching regular C-strings, any bidirectional iterator type can be searched, allowing for the possibility of seamlessly searching almost any kind of data.

For search and replace operations, in addition to the algorithm regex_replace that we have already seen, the match_results class has a format member that takes the result of a match and a format string, and produces a new string by merging the two.

For iterating through all occurences of an expression within a text, there are two iterator types: regex_iterator will enumerate over the match_results objects found, while regex_token_iterator will enumerate a series of strings (similar to perl style split operations).

For those that dislike templates, there is a high level wrapper class RegEx that is an encapsulation of the lower level template code - it provides a simplified interface for those that don't need the full power of the library, and supports only narrow characters, and the "extended" regular expression syntax. This class is now deprecated as it does not form part of the regular expressions C++ standard library proposal.

The POSIX API functions: regcomp, regexec, regfree and regerror, are available in both narrow character and Unicode versions, and are provided for those who need compatibility with these API's.

Finally, note that the library now has run-time localization support, and recognizes the full POSIX regular expression syntax - including advanced features like multi-character collating elements and equivalence classes - as well as providing compatibility with other regular expression libraries including GNU and BSD4 regex packages, and to a more limited extent Perl 5.


Revised 24 Oct 2003

© Copyright John Maddock 1998- 2003

Use, modification and distribution are subject to the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)