<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
  <meta http-equiv="Content-Language" content="en-us">
  <meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
  <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
  <meta name="ProgId" content="FrontPage.Editor.Document">
  <link rel="stylesheet" type="text/css" href="../../../boost.css">

  <title>The Boost Statechart Library - Performance</title>
</head>

<body link="#0000FF" vlink="#800080">
  <table border="0" cellpadding="7" cellspacing="0" width="100%" summary=
  "header">
    <tr>
      <td valign="top" width="300">
        <h3><a href="../../../index.htm"><img alt="C++ Boost" src=
        "../../../boost.png" border="0" width="277" height="86"></a></h3>
      </td>

      <td valign="top">
        <h1 align="center">The Boost Statechart Library</h1>

        <h2 align="center">Performance</h2>
      </td>
    </tr>
  </table>
  <hr>

  <dl class="index">
    <dt><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability
    tradeoffs</a></dt>

    <dt><a href="#MemoryManagementCustomization">Memory management
    customization</a></dt>

    <dt><a href="#RttiCustomization">RTTI customization</a></dt>

    <dt><a href="#ResourceUsage">Resource usage</a></dt>
  </dl>

  <h2><a name="SpeedVersusScalabilityTradeoffs" id=
  "SpeedVersusScalabilityTradeoffs">Speed versus scalability
  tradeoffs</a></h2>
  <p>Quite a bit of effort has gone into making the library fast for small,
  simple machines <b>and</b> scalable at the same time (this applies only to
  <code>state_machine&lt;&gt;</code>; there is still some room for optimizing
  <code>fifo_scheduler&lt;&gt;</code>, especially for multi-threaded builds).
  While I believe it should perform reasonably in most applications, the
  scalability does not come for free. Small, carefully handcrafted state
  machines will thus easily outperform equivalent Boost.Statechart machines.
  To get a picture of how big the gap is, I implemented a simple benchmark in
  the BitMachine example. The Handcrafted example is a handcrafted variant of
  the 1-bit-BitMachine implementing the same benchmark.</p>

  <p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
  scenario:</p>

  <ul>
    <li>For both machines, exactly one object of the only event type is
    allocated before starting the test. This same object is then sent to the
    machines over and over</li>

    <li>The Handcrafted machine employs GOF-visitor double dispatch. The
    states are preallocated so that event dispatch &amp; transition amount
    to nothing more than two virtual calls and one pointer assignment</li>
  </ul>
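  <p>The handcrafted dispatch scheme just described could be sketched roughly
  as follows. This is my own minimal reconstruction, not the actual code of
  the Handcrafted example; all class names are made up for illustration:</p>

```cpp
#include <cassert>

class State;

// First virtual call: selects the concrete event type.
class Event {
public:
    virtual ~Event() {}
    virtual State* applyTo(State& s) const = 0;
};

class FlipEvent;

// Second virtual call: selects the active state's reaction to the
// (now statically known) event type.
class State {
public:
    virtual ~State() {}
    virtual State* react(const FlipEvent&) = 0;
};

class FlipEvent : public Event {
public:
    virtual State* applyTo(State& s) const { return s.react(*this); }
};

// Preallocated states: a transition is a mere pointer assignment.
class Off : public State { public: virtual State* react(const FlipEvent&); };
class On  : public State { public: virtual State* react(const FlipEvent&); };

Off offState;
On onState;

State* Off::react(const FlipEvent&) { return &onState; }
State* On::react(const FlipEvent&)  { return &offState; }

class Machine {
public:
    Machine() : current_(&offState) {}
    // Per event: two virtual calls plus one pointer assignment.
    void process(const Event& e) { current_ = e.applyTo(*current_); }
    const State* current() const { return current_; }
private:
    State* current_;
};
```

  <p>With the states preallocated and only a single event type in play, an
  optimizer can reduce <code>process()</code> to almost nothing, which goes a
  long way toward explaining the low handcrafted numbers below.</p>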

  <p>The benchmarks, running on a 3.2GHz Intel Pentium 4, produced the
  following dispatch and transition times per event:</p>

  <ul>
    <li>Handcrafted:

      <ul>
        <li>MSVC7.1: 10ns</li>

        <li>GCC3.4.2: 20ns</li>

        <li>Intel9.0: 20ns</li>
      </ul>
    </li>

    <li>1-bit-BitMachine with customized memory management:

      <ul>
        <li>MSVC7.1: 150ns</li>

        <li>GCC3.4.2: 320ns</li>

        <li>Intel9.0: 170ns</li>
      </ul>
    </li>
  </ul>

  <p>Although this is a big difference, I still think it will not be
  noticeable in most real-world applications. No matter whether an
  application uses handcrafted or Boost.Statechart machines, it will...</p>

  <ul>
    <li>almost never run into a situation where a state machine is swamped
    with as many events as in the benchmarks. A state machine will almost
    always spend a good deal of time waiting for events (which typically come
    from a human operator, from machinery, or from electronic devices over
    often comparatively slow I/O channels). Parsers are just about the only
    application of FSMs where this is not the case. However, parser FSMs are
    usually not directly specified on the state machine level but on a higher
    one that is better suited for the task. Examples of such higher levels
    are Boost.Regex, Boost.Spirit, XML Schemas, etc. Moreover, the nature of
    parsers allows for a number of optimizations that are not possible in a
    general-purpose FSM framework.<br>
    Bottom line: while it is possible to implement a parser with this
    library, it is almost never advisable to do so because other approaches
    lead to better-performing and more expressive code</li>

    <li>often run state machines in their own threads. This adds considerable
    locking and thread-switching overhead. Performance tests with the
    PingPong example, where two asynchronous state machines exchange events,
    gave the following times to process one event and perform the resulting
    in-state reaction (using the library with
    <code>boost::fast_pool_allocator&lt;&gt;</code>):

      <ul>
        <li>Single-threaded (no locking and waiting): 840ns / 840ns</li>

        <li>Multi-threaded with one thread (the scheduler uses mutex locking
        but never has to wait for events): 6500ns / 4800ns</li>

        <li>Multi-threaded with two threads (both schedulers use mutex
        locking and exactly one always waits for an event): 14000ns /
        7000ns</li>
      </ul>

      <p>As mentioned above, there definitely is some room to improve the
      timings for the asynchronous machines. Moreover, these are very crude
      benchmarks, designed to show the overhead of locking and thread context
      switching. The overhead in a real-world application will typically be
      smaller, and other operating systems can certainly do better in this
      area. However, I strongly believe that on most platforms the threading
      overhead is usually larger than the time that Boost.Statechart spends
      on event dispatch and transition. Handcrafted machines will inevitably
      have the same overhead, making raw single-threaded dispatch and
      transition speed much less important</p>
    </li>

    <li>almost always allocate events with <code>new</code> and destroy them
    after consumption. This will add a few cycles, even if event memory
    management is customized</li>

    <li>often use state machines that employ orthogonal states and other
    advanced features. This forces the handcrafted machines to use more
    elaborate and thus more time-consuming bookkeeping</li>
  </ul>

  <p>Therefore, in real-world applications event dispatch and transition does
  not normally constitute a bottleneck, and the relative gap between
  handcrafted and Boost.Statechart machines also becomes much smaller than in
  the worst-case scenario.</p>

  <h3>Detailed performance data</h3>

  <p>In an effort to identify the main performance bottleneck, the example
  program "Performance" has been written. It measures the time that is spent
  to process one event in different BitMachine variants. In contrast to the
  BitMachine example, which uses only transitions, Performance uses a varying
  number of in-state reactions together with transitions. The only difference
  between in-state reactions and transitions is that the former neither enter
  nor exit any states. Apart from that, the same amount of code needs to be
  run to dispatch an event and execute the resulting action.</p>

  <p>The following diagrams show the average time the library spends to
  process one event, depending on the percentage of in-state reactions
  employed. 0% means that all triggered reactions are transitions; 100% means
  that all triggered reactions are in-state reactions. I draw the following
  conclusions from these measurements:</p>

  <ul>
    <li>The fairly linear course of the curves suggests that the measurements
    with a 100% in-state reaction ratio are accurate and not merely a product
    of optimizations in the compiler. Such optimizations might have been
    possible due to the fact that in the 100% case it is known at
    compile time that the current state will never change</li>

    <li>The data points with a 100% in-state reaction ratio and speed-optimized
    RTTI show that modern compilers seem to inline the complex-looking
    dispatch code so aggressively that dispatch is reduced to little more
    than what it actually is: one virtual function call followed by a linear
    search for a suitable reaction. For instance, in the case of the 1-bit
    BitMachine, Intel9.0 seems to produce dispatch code that is as
    efficient as the two virtual function calls in the Handcrafted
    machine</li>

    <li>On all compilers and in all variants the time spent in event dispatch
    is dwarfed by the time spent to exit the current state and enter the
    target state. It is worth noting that BitMachine is a flat and
    non-orthogonal state machine, representing a close-to-worst case.
    Real-world machines will often exit and enter multiple states during a
    transition, which further dwarfs pure dispatch time. This makes the
    implementation of constant-time dispatch (as requested by a few people
    during formal review) an undertaking with little merit. Instead, future
    optimization effort will concentrate on state entry and
    state exit</li>

    <li>Intel9.0 seems to have problems optimizing/inlining code as soon as
    the amount of code grows beyond a certain threshold. Unlike with the other
    two compilers, I needed to compile the tests for the 1-, 2-, 3- and 4-bit
    BitMachines into separate executables to get good performance. Even then,
    performance was still poor for the 4-bit BitMachine, and it was much
    worse when I compiled all 4 tests into the same executable. This surely
    looks like a bug in the compiler</li>
  </ul>

  <h4>Out of the box</h4>

  <p>The library is used as is, without any optimizations/modifications.</p>

  <p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0"
  width="371" height="284"><img alt="PerformanceNormal2" src=
  "PerformanceNormal2.gif" border="0" width="371" height="284"><img alt=
  "PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371"
  height="284"><img alt="PerformanceNormal4" src="PerformanceNormal4.gif"
  border="0" width="371" height="284"></p>

  <h4>Native RTTI</h4>

  <p>The library is used with <code><a href=
  "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code>
  defined.</p>

  <p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0"
  width="371" height="284"><img alt="PerformanceNative2" src=
  "PerformanceNative2.gif" border="0" width="371" height="284"><img alt=
  "PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371"
  height="284"><img alt="PerformanceNative4" src="PerformanceNative4.gif"
  border="0" width="371" height="284"></p>

  <h4>Customized memory management</h4>

  <p>The library is used with customized memory management
  (<code>boost::fast_pool_allocator</code>).</p>

  <p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0"
  width="371" height="284"><img alt="PerformanceCustom2" src=
  "PerformanceCustom2.gif" border="0" width="371" height="284"><img alt=
  "PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371"
  height="284"><img alt="PerformanceCustom4" src="PerformanceCustom4.gif"
  border="0" width="371" height="284"></p>

  <h3><a name="DoubleDispatch" id="DoubleDispatch">Double dispatch</a></h3>

  <p>At the heart of every state machine lies an implementation of double
  dispatch. This is due to the fact that the incoming event <b>and</b> the
  active state together define exactly which <a href=
  "definitions.html#Reaction">reaction</a> the state machine will produce.
  For each event dispatch, one virtual call is followed by a linear search
  for the appropriate reaction, using one RTTI comparison per reaction. The
  following alternatives were considered but rejected:</p>

  <ul>
    <li><a href=
    "http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
    visitor</a>: This double-dispatch variant satisfies all scalability
    requirements but performs badly due to costly inheritance tree
    cross-casts. Moreover, a state must store one v-pointer for <b>each</b>
    reaction, which slows down construction and makes memory management
    customization inefficient. In addition, C++ RTTI must inevitably be
    turned on, with negative effects on executable size. Boost.Statechart
    originally employed acyclic visitor and was about 4 times slower than it
    is now (MSVC7.1 on Intel Pentium M). The dispatch speed might be better
    on other platforms but the other negative effects will remain</li>

    <li><a href="http://en.wikipedia.org/wiki/Visitor_pattern">GOF
    Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
    depend upon all events. That is, whenever a new event is added there is
    no way around recompiling the whole state machine. This is contrary to
    the scalability requirements</li>

    <li>Single two-dimensional array of function pointers: To satisfy
    requirement 6, it should be possible to spread a single state machine
    over several translation units. This, however, means that the dispatch
    table must be filled at runtime and the different translation units must
    somehow make themselves "known", so that their part of the state machine
    can be added to the table. There simply is no way to do this
    automatically <b>and</b> portably. The only portable way that a state
    machine distributed over several translation units could employ
    table-based double dispatch relies on the user: the programmer(s) would
    somehow have to <b>manually</b> tie together the various pieces of the
    state machine. Not only does this scale badly, but it is also quite
    error-prone</li>
  </ul>
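
  <p>For illustration, the chosen scheme (one virtual call followed by a
  linear search, with one identity comparison per reaction) might be sketched
  like this. This is a deliberately simplified model of mine; the library's
  real reaction lists and internal RTTI mechanism are more involved:</p>

```cpp
#include <cassert>
#include <cstddef>

// A lightweight per-type identity, similar in spirit to the library's
// speed-optimized internal RTTI: one unique tag address per event type.
typedef const void* EventId;

template <class E>
EventId idOf() { static const char tag = 0; return &tag; }

struct Event {
    virtual ~Event() {}
    virtual EventId id() const = 0;
};

struct EventA : Event { virtual EventId id() const { return idOf<EventA>(); } };
struct EventB : Event { virtual EventId id() const { return idOf<EventB>(); } };

struct State;
typedef State* (*ReactionFn)(State&);

// One (trigger, handler) pair per reaction a state supports.
struct Reaction { EventId trigger; ReactionFn handler; };

struct State {
    virtual ~State() {}
    // The single virtual call of a dispatch: obtain the reaction list.
    virtual const Reaction* reactions(std::size_t& count) = 0;
};

struct S1; struct S2;
State* toS2(State&);
State* toS1(State&);

struct S1 : State {
    virtual const Reaction* reactions(std::size_t& count) {
        static const Reaction r[] = { { idOf<EventA>(), &toS2 } };
        count = 1; return r;
    }
};

struct S2 : State {
    virtual const Reaction* reactions(std::size_t& count) {
        static const Reaction r[] = { { idOf<EventB>(), &toS1 } };
        count = 1; return r;
    }
};

S1 s1; S2 s2;
State* toS2(State&) { return &s2; }
State* toS1(State&) { return &s1; }

// Linear search: one id comparison per visited reaction; events
// without a matching reaction are simply discarded.
State* dispatch(State& current, const Event& e) {
    std::size_t n = 0;
    const Reaction* table = current.reactions(n);
    for (std::size_t i = 0; i < n; ++i)
        if (table[i].trigger == e.id())
            return table[i].handler(current);
    return &current;
}
```

  <p>Note how adding a new event type requires no change to existing states
  or to the dispatcher, which is exactly the scalability property the
  rejected alternatives lack.</p>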

  <h2><a name="MemoryManagementCustomization" id=
  "MemoryManagementCustomization">Memory management customization</a></h2>

  <p>Out of the box, everything (event objects, state objects, internal data,
  etc.) is allocated through <code>std::allocator&lt; void &gt;</code> (the
  default for the Allocator template parameter). This should be satisfactory
  for applications meeting the following prerequisites:</p>

  <ul>
    <li>There are no deterministic reaction time (hard real-time)
    requirements</li>

    <li>The application will never run long enough for heap fragmentation to
    become a problem. This is of course an issue for all long-running
    programs, not only the ones employing this library. However, it should be
    noted that fragmentation problems could show up earlier than with
    traditional FSM frameworks</li>
  </ul>

  <p>Should an application not meet these prerequisites, Boost.Statechart's
  memory management can be customized as follows:</p>

  <ul>
    <li>By passing a model of the standard Allocator concept to the class
    templates that support a corresponding parameter
    (<code>event&lt;&gt;</code>, <code>state_machine&lt;&gt;</code>,
    <code>asynchronous_state_machine&lt;&gt;</code>,
    <code>fifo_scheduler&lt;&gt;</code>, <code>fifo_worker&lt;&gt;</code>).
    This redirects all allocations to the passed custom allocator and should
    satisfy the needs of just about any project</li>

    <li>Additionally, it is possible to <b>separately</b> customize
    <b>state</b> memory management by overloading <code>operator new()</code>
    and <code>operator delete()</code> for all state classes, but this is
    probably only useful under rare circumstances</li>
  </ul>
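
  <p>The second bullet, separately customizing state memory management, could
  look roughly like this. The tiny fixed-capacity pool below is hand-rolled
  purely for this sketch, and the state class <code>Idle</code> is
  hypothetical; a production application would more likely plug in something
  like <code>boost::fast_pool_allocator</code>:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// A trivial fixed-capacity free-list pool for blocks of one size.
// Illustrative only: not thread-safe and it never grows.
template <std::size_t BlockSize, std::size_t Capacity>
class FixedPool {
public:
    FixedPool() : free_(0) {
        // Thread all blocks into the free list.
        for (std::size_t i = 0; i < Capacity; ++i) {
            char* block = storage_ + i * BlockSize;
            *reinterpret_cast<void**>(block) = free_;
            free_ = block;
        }
    }
    void* allocate() {
        if (!free_) throw std::bad_alloc();
        void* p = free_;
        free_ = *static_cast<void**>(p);
        return p;
    }
    void deallocate(void* p) {
        *static_cast<void**>(p) = free_;
        free_ = p;
    }
private:
    char storage_[BlockSize * Capacity];
    void* free_;
};

// A hypothetical state class that manages its own storage. The block
// size is chosen generously so a free-list pointer and all members fit.
class Idle {
public:
    static void* operator new(std::size_t size) {
        assert(size <= 32); // the pool serves 32-byte blocks
        return pool().allocate();
    }
    static void operator delete(void* p) {
        if (p) pool().deallocate(p);
    }
    int counter;
private:
    static FixedPool<32, 16>& pool() {
        static FixedPool<32, 16> p;
        return p;
    }
};
```

  <p>Because each state class can get a pool of exactly the right block size,
  allocation and deallocation degenerate to a couple of pointer
  operations.</p>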

  <h2><a name="RttiCustomization" id="RttiCustomization">RTTI
  customization</a></h2>

  <p>RTTI is used for event dispatch and
  <code>state_downcast&lt;&gt;()</code>. Currently, there are exactly two
  options:</p>

  <ol>
    <li>By default, a speed-optimized internal implementation is
    employed</li>

    <li>The library can be instructed to use native C++ RTTI instead by
    defining <code><a href=
    "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
  </ol>
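
  <p>Option 2 is selected with a preprocessor definition that must be visible
  before any library header is included (and, as with other configuration
  macros, it should be defined consistently across all translation units of
  the program), for example:</p>

```cpp
// Either pass -DBOOST_STATECHART_USE_NATIVE_RTTI on the compiler
// command line, or define it centrally before any library include:
#define BOOST_STATECHART_USE_NATIVE_RTTI
#include <boost/statechart/state_machine.hpp>
```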

  <p>There are two reasons to favor option 2:</p>

  <ul>
    <li>When a state machine (or parts of it) is compiled into a DLL,
    problems could arise from the use of the internal RTTI mechanism (see the
    FAQ item "<a href="faq.html#Dll">How can I compile a state machine into a
    dynamic link library (DLL)?</a>"). Using option 2 is one way to work
    around such problems (on some platforms, it seems to be the only
    way)</li>

    <li>State and event objects need to store one pointer less, meaning that
    in the best case the memory footprint of a state machine object could
    shrink by 15% (an empty event is typically 30% smaller, which can be an
    advantage when there are bursts of events rather than a steady flow).
    However, on most platforms executable size grows when C++ RTTI is turned
    on. So, given the small per-machine-object savings, this only makes sense
    in applications where both of the following conditions hold:

      <ul>
        <li>Event dispatch will never become a bottleneck</li>

        <li>There is a need to reduce the memory allocated at runtime (at the
        cost of a larger executable)</li>
      </ul>

      <p>Obvious candidates are embedded systems where the executable resides
      in ROM. Other candidates are applications running a large number of
      identical state machines, where this measure could even reduce the
      <b>overall</b> memory footprint</p>
    </li>
  </ul>

  <h2><a name="ResourceUsage" id="ResourceUsage">Resource usage</a></h2>

  <h3>Memory</h3>

  <p>On a 32-bit box, one empty active state typically needs less than 50
  bytes of memory. Even <b>very</b> complex machines will usually have fewer
  than 20 simultaneously active states, so just about every machine should
  run with less than one kilobyte of memory (not counting event queues).
  Obviously, the per-machine memory footprint is offset by whatever
  state-local members the user adds.</p>

  <h3>Processor cycles</h3>

  <p>The following ranking should give a rough picture of how many cycles
  each feature consumes:</p>

  <ol>
    <li><code>state_cast&lt;&gt;()</code>: By far the most cycle-consuming
    feature. Searches linearly for a suitable state, using one
    <code>dynamic_cast</code> per visited state</li>

    <li>State entry and exit: Profiling of the fully optimized
    1-bit-BitMachine suggested that roughly three quarters of the total event
    processing time is spent destructing the exited state and constructing
    the entered state. Obviously, transitions where the <a href=
    "definitions.html#InnermostCommonContext">innermost common context</a> is
    "far" from the leaf states and/or with lots of orthogonal states can
    easily cause the destruction and construction of quite a few states,
    leading to a significant amount of time spent on a transition</li>

    <li><code>state_downcast&lt;&gt;()</code>: Searches linearly for the
    requested state, using one virtual call and one RTTI comparison per
    visited state</li>

    <li>Deep history: For all innermost states inside a state passing either
    <code>has_deep_history</code> or <code>has_full_history</code> to its
    state base class, a binary search through the (usually small) history map
    must be performed on each exit. History slot allocation is performed
    exactly once, at first exit</li>

    <li>Shallow history: For all direct inner states of a state passing
    either <code>has_shallow_history</code> or <code>has_full_history</code>
    to its state base class, a binary search through the (usually small)
    history map must be performed on each exit. History slot allocation is
    performed exactly once, at first exit</li>

    <li>Event dispatch: One virtual call followed by a linear search for a
    suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
    comparison per visited reaction</li>

    <li>Orthogonal states: One additional virtual call for each exited state
    <b>if</b> there is more than one active leaf state before a transition.
    It should also be noted that the worst-case event dispatch time is
    multiplied in the presence of orthogonal states. For example, if two
    orthogonal leaf states are added to a given state configuration, the
    worst-case time is tripled</li>
  </ol>
  <hr>

  <p><a href="http://validator.w3.org/check?uri=referer"><img border="0" src=
  "http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01 Transitional"
  height="31" width="88"></a></p>

  <p>Revised
  <!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->03 December, 2006<!--webbot bot="Timestamp" endspan i-checksum="38512" --></p>

  <p><i>Copyright &copy; 2003-<!--webbot bot="Timestamp" s-type="EDITED" s-format="%Y" startspan -->2006<!--webbot bot="Timestamp" endspan i-checksum="770" -->
  <a href="contact.html">Andreas Huber D&ouml;nni</a></i></p>

  <p><i>Distributed under the Boost Software License, Version 1.0. (See
  accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
  copy at <a href=
  "http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>
</body>
</html>