1 | <html> |
---|
2 | <head> |
---|
3 | <title>Character Sets</title> |
---|
4 | <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> |
---|
5 | <link rel="stylesheet" href="theme/style.css" type="text/css"> |
---|
6 | </head> |
---|
7 | |
---|
8 | <body> |
---|
9 | <table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2"> |
---|
10 | <tr> |
---|
11 | <td width="10"> |
---|
12 | </td> |
---|
13 | <td width="85%"> |
---|
14 | <font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>Character Sets</b></font> |
---|
15 | </td> |
---|
16 | <td width="112"><a href="http://spirit.sf.net"><img src="theme/spirit.gif" width="112" height="48" align="right" border="0"></a></td> |
---|
17 | </tr> |
---|
18 | </table> |
---|
19 | <br> |
---|
20 | <table border="0"> |
---|
21 | <tr> |
---|
22 | <td width="10"></td> |
---|
23 | <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td> |
---|
24 | <td width="30"><a href="loops.html"><img src="theme/l_arr.gif" border="0"></a></td> |
---|
25 | <td width="30"><a href="confix.html"><img src="theme/r_arr.gif" border="0"></a></td> |
---|
26 | </tr> |
---|
27 | </table> |
---|
28 | <p>The character set <tt>chset</tt> matches a set of characters over a finite |
---|
29 | range bounded by the limits of its template parameter <tt>CharT</tt>. This class |
---|
30 | is an optimization of a parser that acts on a set of single characters. The |
---|
31 | template class is parameterized by the character type <tt>CharT</tt> and can |
---|
32 | work efficiently with 8, 16, 32 and even 64 bit characters.</p> |
---|
33 | <pre><span class=identifier> </span><span class=keyword>template </span><span class=special><</span><span class=keyword>typename </span><span class=identifier>CharT </span><span class=special>= </span><span class=keyword>char</span><span class=special>> |
---|
34 | </span><span class=keyword>class </span><span class=identifier>chset</span><span class=special>;</span></pre> |
---|
35 | <p>The <tt>chset</tt> is constructed from literals (e.g. <tt>'x'</tt>), <tt>ch_p</tt> |
---|
36 | or <tt>chlit<></tt>, <tt>range_p</tt> or <tt>range<></tt>, <tt>anychar_p</tt> |
---|
37 | and <tt>nothing_p</tt> (see <a href="primitives.html">primitives</a>) or copy-constructed |
---|
38 | from another <tt>chset</tt>. The <tt>chset</tt> class uses a copy-on-write scheme |
---|
39 | that enables instances to be passed along easily by value.</p> |
---|
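<p>The copy-on-write idea can be illustrated with a small standalone sketch. It is
not Spirit's implementation: the class <tt>cow_char_set</tt> and its member names are
hypothetical, it assumes 8 bit characters, and it is not thread-safe. Copies share one
table until one of them is modified.</p>

```cpp
#include <bitset>
#include <cassert>
#include <memory>

// Hypothetical illustration of copy-on-write; not Spirit's chset class.
class cow_char_set {
public:
    cow_char_set() : bits_(std::make_shared<std::bitset<256>>()) {}

    // The implicit copy constructor only copies the shared pointer,
    // so passing a cow_char_set by value is cheap.

    void set(unsigned char ch) {
        detach();            // take a private copy before writing
        bits_->set(ch);
    }

    bool test(unsigned char ch) const { return bits_->test(ch); }

    bool shares_storage_with(const cow_char_set& other) const {
        return bits_ == other.bits_;
    }

private:
    // Clone the table only if another instance still refers to it.
    void detach() {
        if (bits_.use_count() > 1)
            bits_ = std::make_shared<std::bitset<256>>(*bits_);
    }

    std::shared_ptr<std::bitset<256>> bits_;
};

// Small self-check showing when the storage is shared and when it is split.
inline void cow_char_set_example() {
    cow_char_set a;
    a.set('x');
    cow_char_set b = a;                 // shares the table with a
    assert(a.shares_storage_with(b));
    b.set('y');                         // b detaches; a is unchanged
    assert(!a.shares_storage_with(b));
    assert(!a.test('y') && b.test('x') && b.test('y'));
}
```

<p>In this sketch a copy costs one reference-count update, and the 256 bit table is
duplicated only on the first write to a shared instance.</p>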
40 | <table width="80%" border="0" align="center"> |
---|
41 | <tr> |
---|
42 | <td class="note_box"><img src="theme/lens.gif" width="15" height="16"> <b>Sparse |
---|
43 | bit vectors</b><br> |
---|
44 | <br> |
---|
45 | To accommodate 16, 32 and 64 bit characters, the <tt>chset</tt> class |
---|
46 | statically switches from a <tt>std::bitset</tt> implementation when the |
---|
47 | character type is not greater than 8 bits, to a sparse bit/boolean set which |
---|
48 | uses a sorted vector of disjoint ranges (<tt>range_run</tt>). The set is |
---|
49 | constructed from ranges such that adjacent or overlapping ranges are coalesced.<br> |
---|
50 | <br> |
---|
51 | range_runs are very space-economical in situations where there are lots |
---|
52 | of ranges and a few individual disjoint values. Searching is O(log n) where |
---|
53 | n is the number of ranges.</td> |
---|
54 | </tr> |
---|
55 | </table> |
---|
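<p>The sorted vector of disjoint ranges described in the note can be sketched as
follows. This is an illustration under assumptions, not the library's
<tt>range_run</tt>: the names <tt>char_range</tt> and <tt>sparse_char_set</tt> are
hypothetical. Insertion coalesces overlapping or adjacent ranges, and lookup is a
binary search over the ranges.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; not Spirit's actual range_run class.
template <typename CharT>
struct char_range {
    CharT first;  // inclusive lower bound
    CharT last;   // inclusive upper bound
};

template <typename CharT>
class sparse_char_set {
public:
    // Insert [first, last], merging it with overlapping or adjacent ranges.
    void set(CharT first, CharT last) {
        assert(first <= last);
        // Find the first stored range that overlaps or is adjacent to `first`.
        auto it = std::lower_bound(
            runs_.begin(), runs_.end(), first,
            [](const char_range<CharT>& r, CharT value) {
                // True while r ends before value with a gap of at least one.
                return r.last < value && r.last + 1 < value;
            });
        // Absorb every following range that overlaps or touches the new one.
        auto merged_end = it;
        while (merged_end != runs_.end() &&
               (merged_end->first <= last || merged_end->first - 1 == last)) {
            first = std::min(first, merged_end->first);
            last = std::max(last, merged_end->last);
            ++merged_end;
        }
        it = runs_.erase(it, merged_end);
        runs_.insert(it, char_range<CharT>{first, last});
    }

    // Binary search: O(log n) in the number of stored ranges.
    bool test(CharT ch) const {
        auto it = std::lower_bound(
            runs_.begin(), runs_.end(), ch,
            [](const char_range<CharT>& r, CharT value) {
                return r.last < value;
            });
        return it != runs_.end() && it->first <= ch;
    }

    std::size_t range_count() const { return runs_.size(); }

private:
    std::vector<char_range<CharT>> runs_;  // sorted, disjoint, non-adjacent
};

// Small self-check: adjacent ranges are stored as a single range.
inline void sparse_char_set_example() {
    sparse_char_set<wchar_t> s;
    s.set(L'a', L'f');
    s.set(L'g', L'z');   // adjacent to a-f, so the two become a-z
    s.set(L'0', L'9');
    assert(s.range_count() == 2);
    assert(s.test(L'q') && s.test(L'5') && !s.test(L'A'));
}
```

<p>Storage in this sketch grows with the number of ranges rather than with the size
of the character type, which is why the approach suits wide characters.</p>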
56 | <p> Examples:<br> |
---|
57 | </p> |
---|
58 | <pre><span class=identifier> </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s1</span><span class=special>(</span><span class=literal>'x'</span><span class=special>); |
---|
59 | </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s2</span><span class=special>(</span><span class=identifier>anychar_p </span><span class=special>- </span><span class=identifier>s1</span><span class=special>);</span></pre> |
---|
60 | <p>Optionally, character sets may also be constructed using a definition string |
---|
61 | following a syntax that resembles POSIX-style regular expression character sets, |
---|
62 | except that double quotes delimit the set elements instead of square brackets |
---|
63 | and there is no special negation <tt>^</tt> character.</p> |
---|
64 | <pre> <span class=identifier>range </span><span class=special>= </span><span class=identifier>anychar_p </span><span class=special>>> </span><span class=literal>'-' </span><span class=special>>> </span><span class=identifier>anychar_p</span><span class=special>; |
---|
65 | </span><span class=identifier>set </span><span class=special>= *(</span><span class=identifier>range </span><span class=special>| </span><span class=identifier>anychar_p</span><span class=special>);</span></pre> |
---|
66 | <p>Since we are defining the set using a C string, the usual C/C++ literal string |
---|
67 | syntax rules apply. Examples:<br> |
---|
68 | </p> |
---|
69 | <pre> <span class=identifier>chset</span><span class=special><> </span><span class=identifier>s1</span><span class=special>(</span><span class=string>"a-zA-Z"</span><span class=special>); </span><span class=comment>// alphabetic characters |
---|
70 | </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s2</span><span class=special>(</span><span class=string>"0-9a-fA-F"</span><span class=special>); </span><span class=comment>// hexadecimal characters |
---|
71 | </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s3</span><span class=special>(</span><span class=string>"actgACTG"</span><span class=special>); </span><span class=comment>// DNA identifiers |
---|
72 | </span><span class=identifier>chset</span><span class=special><> </span><span class=identifier>s4</span><span class=special>(</span><span class=string>"\x7f\x7e"</span><span class=special>); </span><span class=comment>// Hexadecimal 0x7F and 0x7E</span></pre> |
---|
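<p>To make the definition string syntax concrete, here is a minimal sketch that
expands such a string into a 256 bit table for 8 bit characters. The function
<tt>make_char_set</tt> is a hypothetical helper, not part of Spirit, and its handling
of a <tt>-</tt> at the start or end of the string is an assumption of this sketch.</p>

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Spirit: expand a definition string such
// as "a-zA-Z" into a membership table for 8 bit characters.
inline std::bitset<256> make_char_set(const std::string& definition) {
    std::bitset<256> bits;
    std::size_t i = 0;
    while (i < definition.size()) {
        unsigned char first = static_cast<unsigned char>(definition[i]);
        // "x-y" with a character on both sides of '-' denotes a range.
        if (i + 2 < definition.size() && definition[i + 1] == '-') {
            unsigned char last = static_cast<unsigned char>(definition[i + 2]);
            assert(first <= last);
            for (unsigned c = first; c <= last; ++c)
                bits.set(c);
            i += 3;
        } else {
            // A single character; in this sketch a '-' that does not sit
            // between two characters is taken literally.
            bits.set(first);
            ++i;
        }
    }
    return bits;
}

// Small self-check using definition strings like the ones shown above.
inline void make_char_set_example() {
    std::bitset<256> hex = make_char_set("0-9a-fA-F");
    assert(hex.count() == 22);          // 10 digits + 6 lower + 6 upper
    assert(hex.test('c') && !hex.test('g'));
    assert(make_char_set("actgACTG").count() == 8);
}
```

<p>Escapes such as <tt>"\x7f\x7e"</tt> need no special handling here, because the
compiler has already translated them into single characters before the string
reaches the function.</p>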
73 | <p>The standard Spirit set operators apply (see <a href="operators.html">operators</a>) |
---|
74 | plus an additional character-set-specific inverse (negation <tt>~</tt>) operator:</p> |
---|
75 | |
---|
76 | <table width="90%" border="0" align="center"> |
---|
77 | <tr> |
---|
78 | <td class="table_title" colspan="2">Character set operators</td> |
---|
79 | </tr> |
---|
80 | <tr> |
---|
81 | <td class="table_cells" width="28%"><b>~a</b></td> |
---|
82 | <td class="table_cells" width="72%">Set inverse</td> |
---|
83 | </tr> |
---|
84 | <tr> |
---|
85 | <td class="table_cells" width="28%"><b>a | b</b></td> |
---|
86 | <td class="table_cells" width="72%">Set union</td> |
---|
87 | </tr> |
---|
88 | <tr> |
---|
89 | <td class="table_cells" width="28%"><b>a & b</b></td> |
---|
90 | <td class="table_cells" width="72%">Set intersection</td> |
---|
91 | </tr> |
---|
92 | <tr> |
---|
93 | <td class="table_cells" width="28%"><b>a - b</b></td> |
---|
94 | <td class="table_cells" width="72%">Set difference</td> |
---|
95 | </tr> |
---|
96 | <tr> |
---|
97 | <td class="table_cells" width="28%"><b>a ^ b</b></td> |
---|
98 | <td class="table_cells" width="72%">Set xor</td> |
---|
99 | </tr> |
---|
100 | </table> |
---|
109 | <p>where operands a and b are both <tt>chset</tt>s or one of the operands is either |
---|
110 | a literal character, <tt>ch_p</tt> or <tt>chlit</tt>, <tt>range_p</tt> or <tt>range</tt>, |
---|
111 | <tt>anychar_p</tt> or <tt>nothing_p</tt>. Special optimized overloads are provided |
---|
112 | for <tt>anychar_p</tt> and <tt>nothing_p</tt> operands. A <tt>nothing_p</tt> |
---|
113 | operand is converted to an empty set, while an <tt>anychar_p</tt> operand is |
---|
114 | converted to a set having elements of the full range of the character type used |
---|
115 | (e.g. 0-255 for unsigned 8 bit chars).</p> |
---|
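<p>The meaning of the set operators can be modeled on a plain <tt>std::bitset</tt>
for 8 bit characters. This is only a model of the semantics, not the <tt>chset</tt>
interface: <tt>char_set8</tt> and the free functions below are hypothetical names.
The empty and full sets play the roles of <tt>nothing_p</tt> and <tt>anychar_p</tt>
operands.</p>

```cpp
#include <bitset>
#include <cassert>

// Hypothetical stand-in for an 8 bit character set; not Spirit's chset.
using char_set8 = std::bitset<256>;

// Set containing every character in [first, last].
inline char_set8 range8(unsigned char first, unsigned char last) {
    char_set8 s;
    for (unsigned c = first; c <= last; ++c)
        s.set(c);
    return s;
}

inline char_set8 set_inverse(const char_set8& a) { return ~a; }                                   // ~a
inline char_set8 set_union(const char_set8& a, const char_set8& b) { return a | b; }              // a | b
inline char_set8 set_intersection(const char_set8& a, const char_set8& b) { return a & b; }       // a & b
inline char_set8 set_difference(const char_set8& a, const char_set8& b) { return a & ~b; }        // a - b
inline char_set8 set_xor(const char_set8& a, const char_set8& b) { return a ^ b; }                // a ^ b

inline char_set8 empty_set() { return char_set8(); }        // counterpart of nothing_p
inline char_set8 full_set() { return char_set8().set(); }   // counterpart of anychar_p

// Small self-check of the identities the operator table implies.
inline void char_set8_example() {
    char_set8 lower = range8('a', 'z');
    assert(set_difference(lower, range8('a', 'm')) == range8('n', 'z'));
    assert(set_difference(full_set(), lower) == set_inverse(lower));
    assert(set_inverse(full_set()) == empty_set());
}
```

<p>Set difference has no built-in operator on <tt>std::bitset</tt>, so the sketch
expresses <tt>a - b</tt> as <tt>a & ~b</tt>.</p>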
116 | <p>A special case is <tt>~anychar_p</tt> which yields <tt>nothing_p</tt>, but |
---|
117 | <tt>~nothing_p</tt> is illegal. Inversion of <tt>anychar_p</tt> is asymmetrical, |
---|
118 | a one-way trip comparable to converting <tt>T*</tt> to a <tt>void*</tt>.</p> |
---|
119 | <table width="90%" border="0" align="center"> |
---|
120 | <tr> |
---|
121 | <td class="table_title" colspan="2">Special conversions</td> |
---|
122 | </tr> |
---|
123 | <tr> |
---|
124 | <td class="table_cells" width="28%"><b>chset<CharT>(nothing_p)</b></td> |
---|
125 | <td class="table_cells" width="72%">empty set</td> |
---|
126 | </tr> |
---|
127 | <tr> |
---|
128 | <td class="table_cells" width="28%"><b>chset<CharT>(anychar_p)</b></td> |
---|
129 | <td class="table_cells" width="72%">full range of CharT (e.g. 0-255 for unsigned |
---|
130 | 8 bit chars)</td> |
---|
131 | </tr> |
---|
132 | <tr> |
---|
133 | <td class="table_cells" width="28%"><b>~anychar_p</b></td> |
---|
134 | <td class="table_cells" width="72%">nothing_p</td> |
---|
135 | </tr> |
---|
136 | <tr> |
---|
137 | <td class="table_cells" width="28%"><b>~nothing_p</b></td> |
---|
138 | <td class="table_cells" width="72%">illegal</td> |
---|
139 | </tr> |
---|
140 | </table> |
---|
141 | |
---|
142 | <p></p><table border="0"> |
---|
143 | <tr> |
---|
144 | <td width="10"></td> |
---|
145 | <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td> |
---|
146 | <td width="30"><a href="loops.html"><img src="theme/l_arr.gif" border="0"></a></td> |
---|
147 | <td width="30"><a href="confix.html"><img src="theme/r_arr.gif" border="0"></a></td> |
---|
148 | </tr> |
---|
149 | </table> |
---|
150 | <br> |
---|
151 | <hr size="1"> |
---|
152 | <p class="copyright">Copyright © 1998-2003 Joel de Guzman<br> |
---|
153 | <br> |
---|
154 | <font size="2">Use, modification and distribution is subject to the Boost Software |
---|
155 | License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at |
---|
156 | http://www.boost.org/LICENSE_1_0.txt) </font> </p> |
---|
157 | </body> |
---|
158 | </html> |
---|