source: downloads/boost_1_34_1/libs/xpressive/doc/tokenization.qbk @ 33

Last change on this file since 33 was 29, checked in by landauf, 16 years ago

updated boost from 1_33_1 to 1_34_1

[section String Splitting and Tokenization]

_regex_token_iterator_ is the Ginsu knife of the text manipulation world. It slices! It dices! This section describes
how to use the highly-configurable _regex_token_iterator_ to chop up input sequences.

[h2 Overview]

You initialize a _regex_token_iterator_ with an input sequence, a regex, and some optional configuration parameters.
The _regex_token_iterator_ will use _regex_search_ to find the first place in the sequence that the regex matches. When
dereferenced, the _regex_token_iterator_ returns a ['token] in the form of a `std::basic_string<>`. Which string it returns
depends on the configuration parameters. By default it returns a string corresponding to the full match, but it could also
return a string corresponding to a particular marked sub-expression, or even the part of the sequence that ['didn't] match.
When you increment the _regex_token_iterator_, it will move to the next token. Which token is next depends on the configuration
parameters. It could simply be a different marked sub-expression in the current match, or it could be part or all of the
next match. Or it could be the part that ['didn't] match.

As you can see, _regex_token_iterator_ can do a lot. That makes it hard to describe, but some examples should make it clear.
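
The dereference-and-increment cycle described above can be sketched with an explicit loop. This is only a stand-in, not xpressive code: it uses the standard library's `std::sregex_token_iterator` so that it builds without Boost, on the assumption that its last constructor argument follows the same conventions (0 for the full match, a positive number for a marked sub-expression, -1 for the parts that did not match). The helper name `collect_tokens` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not part of xpressive: walks a token iterator by hand
// and returns every token it yields. `which` selects what counts as a token:
// 0 = full match, N > 0 = N-th marked sub-expression, -1 = unmatched parts.
std::vector<std::string> collect_tokens(const std::string& input,
                                        const std::regex& re,
                                        int which)
{
    std::vector<std::string> tokens;
    std::sregex_token_iterator it(input.begin(), input.end(), re, which), end;
    for (; it != end; ++it)            // incrementing moves to the next token
    {
        tokens.push_back(it->str());   // dereferencing yields the current token
    }
    return tokens;
}
```

With a date regex that has three marked sub-expressions, passing 3 would collect only the years, while passing -1 would collect the text between the dates.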

[h2 Example 1: Simple Tokenization]

This example uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words.

    // assumes <boost/xpressive/xpressive.hpp>, <iostream>, <iterator>, <algorithm> and <string>
    // are included, and that "using namespace boost::xpressive;" is in effect
    std::string input("This is his face");
    sregex re = +_w;                      // find a word

    // iterate over all the words in the input
    sregex_token_iterator begin( input.begin(), input.end(), re ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 2: Simple Tokenization, Reloaded]

This example also uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words,
but it uses the regex as a delimiter. When we pass a `-1` as the last parameter to the _regex_token_iterator_
constructor, it instructs the token iterator to consider as tokens those parts of the input that ['didn't]
match the regex.

    std::string input("This is his face");
    sregex re = +_s;                      // find white space

    // iterate over all non-white space in the input. Note the -1 below:
    sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]
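
Treating the regex as a delimiter is the basis of a general-purpose split routine. The sketch below is a stand-in rather than xpressive code: it uses the standard library's `std::sregex_token_iterator`, assuming that -1 has the same meaning there, and the helper name `split` is hypothetical. One edge case shows up in the standard-library iterator: input that begins with a delimiter yields an empty first token, while a trailing delimiter yields no extra token. Whether xpressive behaves identically in that case is an assumption worth checking before relying on it.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the pieces of `input` that lie between
// matches of `delimiter`. The -1 selects the parts that did NOT match.
std::vector<std::string> split(const std::string& input,
                               const std::regex& delimiter)
{
    std::sregex_token_iterator begin(input.begin(), input.end(), delimiter, -1), end;
    // Each token converts to std::string as the vector is filled.
    return std::vector<std::string>(begin, end);
}
```

Collecting into a container instead of streaming to `std::cout` makes it easy to inspect or filter the tokens afterwards, for example to drop an empty leading token.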

[h2 Example 3: Simple Tokenization, Revolutions]

This example uses _regex_token_iterator_ to chop a sequence containing a bunch of dates into a series of
tokens consisting of just the years. When we pass a positive integer [^['N]] as the last parameter to the
_regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens only the [^['N]]-th
marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over all the years in the input. Note the 3 below, corresponding to the 3rd sub-expression:
    sregex_token_iterator begin( input.begin(), input.end(), re, 3 ), end;

    // write all the years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
2003
1999
1981
]

[h2 Example 4: Not-So-Simple Tokenization]

This example is like the previous one, except that instead of tokenizing just the years, this program
turns the days, months and years into tokens. When we pass an array of integers [^['{I,J,...}]] as the last
parameter to the _regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens the
[^['I]]-th, [^['J]]-th, etc. marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over the days, months and years in the input
    int const sub_matches[] = { 2, 1, 3 }; // day, month, year
    sregex_token_iterator begin( input.begin(), input.end(), re, sub_matches ), end;

    // write all the days, months and years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
02
01
2003
23
04
1999
13
11
1981
]

The `sub_matches` array instructs the _regex_token_iterator_ to first take the value of the 2nd sub-match, then
the 1st sub-match, and finally the 3rd. Incrementing the iterator again instructs it to use _regex_search_ again
to find the next match. At that point, the process repeats -- the token iterator takes the value of the 2nd
sub-match, then the 1st, et cetera.
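
Because the tokens arrive in repeating groups of three, a caller can reassemble them into one record per match. The sketch below illustrates that idea with the standard library's `std::sregex_token_iterator` as a stand-in for xpressive, assuming it treats an array of sub-match indices the same way. The `Date` struct and the `parse_dates` helper are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type for illustration.
struct Date
{
    std::string day, month, year;
};

// Hypothetical helper: reads the tokens in the repeating order
// day, month, year and packs each group of three into a Date.
std::vector<Date> parse_dates(const std::string& input)
{
    const std::regex re("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date
    const int sub_matches[] = { 2, 1, 3 };             // day, month, year

    std::vector<Date> dates;
    std::sregex_token_iterator it(input.begin(), input.end(), re, sub_matches), end;
    while (it != end)
    {
        // Every match yields exactly three tokens, one per array entry.
        Date d;
        d.day   = it->str(); ++it; // 2nd sub-match
        d.month = it->str(); ++it; // 1st sub-match
        d.year  = it->str(); ++it; // 3rd sub-match; then on to the next match, if any
        dates.push_back(d);
    }
    return dates;
}
```

Applied to the input string from this example, the helper would return three records whose fields appear in the same order as the output listed above.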

[endsect]