1 | <?xml version="1.0" standalone="yes"?> |
---|
2 | <!DOCTYPE library PUBLIC "-//Boost//DTD BoostBook XML V1.0//EN" |
---|
3 | "http://www.boost.org/tools/boostbook/dtd/boostbook.dtd" |
---|
4 | [ |
---|
5 | <!ENTITY % entities SYSTEM "program_options.ent" > |
---|
6 | %entities; |
---|
7 | ]> |
---|
8 | <section id="program_options.design"> |
---|
9 | <title>Design Discussion</title> |
---|
10 | |
---|
11 | <para>This section focuses on some of the design questions. |
---|
12 | </para> |
---|
13 | |
---|
14 | <section id="program_options.design.unicode"> |
---|
15 | |
---|
16 | <title>Unicode Support</title> |
---|
17 | |
---|
18 | <para>Unicode support was one of the features specifically requested |
---|
19 | during the formal review. Throughout this document "Unicode support" is |
---|
20 | a synonym for "wchar_t" support, assuming that "wchar_t" always uses |
---|
21 | Unicode encoding. Also, when talking about "ascii" (in lowercase) we'll |
---|
22 | not mean strict 7-bit ASCII encoding, but rather "char" strings in local |
---|
23 | 8-bit encoding. |
---|
24 | </para> |
---|
25 | |
---|
26 | <para> |
---|
27 | Generally, "Unicode support" can mean |
---|
28 | many things, but for the program_options library it means that: |
---|
29 | |
---|
30 | <itemizedlist> |
---|
31 | <listitem> |
---|
32 | <para>Each parser should accept either <code>char*</code> |
---|
33 | or <code>wchar_t*</code>, correctly split the input into option |
---|
34 | names and option values and return the data. |
---|
35 | </para> |
---|
36 | </listitem> |
---|
37 | <listitem> |
---|
38 | <para>For each option, it should be possible to specify whether the conversion |
---|
39 | from string to value uses ascii or Unicode. |
---|
40 | </para> |
---|
41 | </listitem> |
---|
42 | <listitem> |
---|
43 | <para>The library guarantees that: |
---|
44 | <itemizedlist> |
---|
45 | <listitem> |
---|
46 | <para>ascii input is passed to an ascii value without change |
---|
47 | </para> |
---|
48 | </listitem> |
---|
49 | <listitem> |
---|
50 | <para>Unicode input is passed to a Unicode value without change</para> |
---|
51 | </listitem> |
---|
52 | <listitem> |
---|
53 | <para>ascii input passed to a Unicode value, and Unicode input |
---|
54 | passed to an ascii value, will be converted using a codecvt |
---|
55 | facet (which may be specified by the |
---|
56 | user). |
---|
57 | </para> |
---|
58 | </listitem> |
---|
59 | </itemizedlist> |
---|
60 | </para> |
---|
61 | </listitem> |
---|
62 | </itemizedlist> |
---|
63 | </para> |
---|
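The mixed case in the last guarantee (ascii input feeding a Unicode value) can be pictured with the standard codecvt facet. The following is only a minimal sketch, not the library's implementation: the helper name widen_with_facet is hypothetical, and it assumes the facet taken from the supplied locale understands the input's 8-bit encoding. With the classic "C" locale that is only safe to assume for 7-bit input.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of program_options): convert a char string
// to a wchar_t string using the codecvt facet of the given locale.
std::wstring widen_with_facet(const std::string& in,
                              const std::locale& loc = std::locale::classic())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    // One input byte never produces more than one wchar_t.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result r =
        cvt.in(state,
               in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::noconv)
        return std::wstring(in.begin(), in.end()); // facet asked for a plain copy
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("codecvt could not convert the input");

    out.resize(static_cast<std::wstring::size_type>(to_next - &out[0]));
    return out;
}
```

A user-specified facet would be supplied by passing a locale that carries it; the reverse direction (Unicode input to an ascii value) would use the facet's out() member in the same way.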
64 | |
---|
65 | <para>The important point is that it's possible to have some "ascii |
---|
66 | options" together with "Unicode options". There are two reasons for |
---|
67 | this. First, for a given type you might not have the code to extract the |
---|
68 | value from a Unicode string, and it's not good to require that such code be written. |
---|
69 | Second, imagine a reusable library which has some options and exposes |
---|
70 | the options description in its interface. If <emphasis>all</emphasis> |
---|
71 | options are either ascii or Unicode, and the library does not use any |
---|
72 | Unicode strings, then the author will likely use ascii options, which |
---|
73 | would make the library unusable inside Unicode |
---|
74 | applications. Essentially, it would be necessary to provide two versions |
---|
75 | of the library -- ascii and Unicode. |
---|
76 | </para> |
---|
77 | |
---|
78 | <para>Another important point is that ascii strings are passed through |
---|
79 | without modification. In other words, it's not possible to just convert |
---|
80 | ascii to Unicode and process the Unicode further. The problem is that the |
---|
81 | default conversion mechanism -- the <code>codecvt</code> facet -- might |
---|
82 | not work with 8-bit input without additional setup. |
---|
83 | </para> |
---|
84 | |
---|
85 | <para>The Unicode support outlined above is not complete. For example, we |
---|
86 | don't plan to allow Unicode in option names. Unicode support is hard and |
---|
87 | requires a Boost-wide solution. Even comparing two arbitrary Unicode |
---|
88 | strings is non-trivial. Finally, using Unicode in option names is |
---|
89 | related to internationalization, which has its own |
---|
90 | complexities. E.g. if option names depend on the current locale, then all |
---|
91 | program parts and other parts which use the names must be |
---|
92 | internationalized too. |
---|
93 | </para> |
---|
94 | |
---|
95 | <para>The primary question in implementing the Unicode support is whether |
---|
96 | to use templates and <code>std::basic_string</code> or to use some |
---|
97 | internal encoding and convert between internal and external encodings on |
---|
98 | the interface boundaries. |
---|
99 | </para> |
---|
100 | |
---|
101 | <para>The choice, mostly, is between code size and execution |
---|
102 | speed. A templated solution would either link library code into every |
---|
103 | application that uses the library (thereby making a shared library |
---|
104 | impossible), or provide explicit instantiations in the shared library |
---|
105 | (increasing its size). The solution based on an internal encoding would |
---|
106 | necessarily make conversions in a number of places and would be somewhat slower. |
---|
107 | Since speed is generally not an issue for this library, the second |
---|
108 | solution looks more attractive, but we'll take a closer look at |
---|
109 | individual components. |
---|
110 | </para> |
---|
111 | |
---|
112 | <para>For the parsers component, we have three choices: |
---|
113 | <itemizedlist> |
---|
114 | <listitem> |
---|
115 | <para>Use a fully templated implementation: given a string of a |
---|
116 | certain type, a parser will return a &parsed_options; instance |
---|
117 | with strings of the same type (i.e. the &parsed_options; class |
---|
118 | will be templated).</para> |
---|
119 | </listitem> |
---|
120 | <listitem> |
---|
121 | <para>Use internal encoding: same as above, but strings will be converted to and |
---|
122 | from the internal encoding.</para> |
---|
123 | </listitem> |
---|
124 | <listitem> |
---|
125 | <para>Use and partly expose the internal encoding: same as above, |
---|
126 | but the strings in the &parsed_options; instance will be in the |
---|
127 | internal encoding. This might avoid a conversion if the |
---|
128 | &parsed_options; instance is passed directly to other components, |
---|
129 | but can also be dangerous or confusing for the user. |
---|
130 | </para> |
---|
131 | </listitem> |
---|
132 | </itemizedlist> |
---|
133 | </para> |
---|
134 | |
---|
135 | <para>The second solution appears to be the best -- it does not increase |
---|
136 | the code size much and is cleaner than the third. To avoid extra |
---|
137 | conversions, the Unicode version of &parsed_options; can also store |
---|
138 | strings in the internal encoding. |
---|
139 | </para> |
---|
140 | |
---|
141 | <para>For the options descriptions component, we don't have much |
---|
142 | choice. Since we want to allow a mix of ascii and Unicode options, rather |
---|
143 | than requiring that all options use ascii or all use Unicode, the |
---|
144 | interface of the &value_semantic; must work with both. The only way is |
---|
145 | to pass an additional flag telling whether the strings use ascii or the internal encoding. |
---|
146 | The instance of &value_semantic; can then convert into some |
---|
147 | other encoding if needed. |
---|
148 | </para> |
---|
149 | |
---|
150 | <para>For the storage component, the only affected function is &store;. |
---|
151 | For Unicode input, the &store; function should convert the value to the |
---|
152 | internal encoding. It should also inform the &value_semantic; class |
---|
153 | about the encoding used. |
---|
154 | </para> |
---|
155 | |
---|
156 | <para>Finally, what internal encoding should we use? The |
---|
157 | alternatives are: |
---|
158 | <code>std::wstring</code> (using UCS-4 encoding) and |
---|
159 | <code>std::string</code> (using UTF-8 encoding). The differences between |
---|
160 | the alternatives are: |
---|
161 | <itemizedlist> |
---|
162 | <listitem> |
---|
163 | <para>Speed: UTF-8 is a bit slower</para> |
---|
164 | </listitem> |
---|
165 | <listitem> |
---|
166 | <para>Space: UTF-8 takes less space when input is ascii</para> |
---|
167 | </listitem> |
---|
168 | <listitem> |
---|
169 | <para>Code size: UTF-8 requires additional conversion code. However, |
---|
170 | it allows one to use existing parsers without converting them to |
---|
171 | <code>std::wstring</code> and such conversion is likely to create a |
---|
172 | number of new instantiations. |
---|
173 | </para> |
---|
174 | </listitem> |
---|
175 | |
---|
176 | </itemizedlist> |
---|
177 | There's no clear leader, but the last point seems important, so UTF-8 |
---|
178 | will be used. |
---|
179 | </para> |
---|
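To make the space and code-size points concrete, here is a minimal sketch of the kind of conversion code a UTF-8 internal encoding needs. The names append_utf8 and to_utf8 are hypothetical and are not taken from the library; the sketch assumes its input is a sequence of valid Unicode code points and does not reject surrogates or values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: append one code point to a UTF-8 encoded string.
// No validation is performed; the caller is assumed to pass valid code points.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        // 7-bit characters are stored unchanged, one byte each.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole sequence of code points.
std::string to_utf8(const std::vector<unsigned long>& code_points)
{
    std::string out;
    for (std::size_t i = 0; i < code_points.size(); ++i)
        append_utf8(out, code_points[i]);
    return out;
}
```

For a token consisting only of 7-bit characters the UTF-8 form has one byte per character, while a fixed-width 4-byte representation would use four bytes per character.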
180 | |
---|
181 | <para>Choosing the UTF-8 encoding allows the use of existing parsers, |
---|
182 | because 7-bit ascii characters retain their values in UTF-8, |
---|
183 | so searching for 7-bit strings is simple. However, there are |
---|
184 | two subtle issues: |
---|
185 | <itemizedlist> |
---|
186 | <listitem> |
---|
187 | <para>We need to assume the character literals use ascii encoding |
---|
188 | and that inputs use Unicode encoding.</para> |
---|
189 | </listitem> |
---|
190 | <listitem> |
---|
191 | <para>A Unicode character (say '=') can be followed by a 'composing |
---|
192 | character' and the combination is not the same as just '=', so a |
---|
193 | simple search for '=' might find the wrong character. |
---|
194 | </para> |
---|
195 | </listitem> |
---|
196 | </itemizedlist> |
---|
197 | Neither of these issues appears to be critical in practice, since ascii is |
---|
198 | an almost universal encoding and since composing characters following '=' (and |
---|
199 | other characters with special meaning to the library) are not likely to appear. |
---|
200 | </para> |
---|
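Both observations can be illustrated with a small sketch. It is not the library's parser: split_at_equals and starts_with_combining_mark are hypothetical helpers, and the second one only recognizes the U+0300..U+036F block of combining marks. The byte-level search is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for '='.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: split a UTF-8 token at the first '=' byte.
// Bytes of multi-byte UTF-8 sequences are all >= 0x80, so a plain
// byte search cannot match inside another character.
std::pair<std::string, std::string> split_at_equals(const std::string& utf8_token)
{
    std::string::size_type pos = utf8_token.find('=');
    if (pos == std::string::npos)
        return std::make_pair(utf8_token, std::string());
    return std::make_pair(utf8_token.substr(0, pos), utf8_token.substr(pos + 1));
}

// Hypothetical helper: does the string begin with a combining mark from
// U+0300..U+036F (encoded as CC 80..CC BF and CD 80..CD AF)? Other
// combining ranges are deliberately ignored in this sketch.
bool starts_with_combining_mark(const std::string& s)
{
    if (s.size() < 2)
        return false;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    const unsigned char b1 = static_cast<unsigned char>(s[1]);
    if (b0 == 0xCC)
        return b1 >= 0x80 && b1 <= 0xBF;
    if (b0 == 0xCD)
        return b1 >= 0x80 && b1 <= 0xAF;
    return false;
}
```

If the value part returned by the split starts with a combining mark, the '=' that was found is the second issue described above: the user-visible character was '=' plus the mark, not a bare '='.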
201 | |
---|
202 | </section> |
---|
203 | |
---|
204 | |
---|
205 | </section> |
---|
206 | |
---|
207 | <!-- |
---|
208 | Local Variables: |
---|
209 | mode: xml |
---|
210 | sgml-indent-data: t |
---|
211 | sgml-parent-document: ("program_options.xml" "section") |
---|
212 | sgml-set-face: t |
---|
213 | End: |
---|
214 | --> |
---|