| 1 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> |
---|
| 2 | <html> |
---|
| 3 | <head> |
---|
| 4 | |
---|
| 5 | <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/> |
---|
| 6 | <title>Ogg Vorbis Documentation</title> |
---|
| 7 | |
---|
| 8 | <style type="text/css"> |
---|
| 9 | body { |
---|
| 10 | margin: 0 18px 0 18px; |
---|
| 11 | padding-bottom: 30px; |
---|
| 12 | font-family: Verdana, Arial, Helvetica, sans-serif; |
---|
| 13 | color: #333333; |
---|
| 14 | font-size: .8em; |
---|
| 15 | } |
---|
| 16 | |
---|
| 17 | a { |
---|
| 18 | color: #3366cc; |
---|
| 19 | } |
---|
| 20 | |
---|
| 21 | img { |
---|
| 22 | border: 0; |
---|
| 23 | } |
---|
| 24 | |
---|
| 25 | #xiphlogo { |
---|
| 26 | margin: 30px 0 16px 0; |
---|
| 27 | } |
---|
| 28 | |
---|
| 29 | #content p { |
---|
| 30 | line-height: 1.4; |
---|
| 31 | } |
---|
| 32 | |
---|
| 33 | h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a { |
---|
| 34 | font-weight: bold; |
---|
| 35 | color: #ff9900; |
---|
| 36 | margin: 1.3em 0 8px 0; |
---|
| 37 | } |
---|
| 38 | |
---|
| 39 | h1 { |
---|
| 40 | font-size: 1.3em; |
---|
| 41 | } |
---|
| 42 | |
---|
| 43 | h2 { |
---|
| 44 | font-size: 1.2em; |
---|
| 45 | } |
---|
| 46 | |
---|
| 47 | h3 { |
---|
| 48 | font-size: 1.1em; |
---|
| 49 | } |
---|
| 50 | |
---|
| 51 | li { |
---|
| 52 | line-height: 1.4; |
---|
| 53 | } |
---|
| 54 | |
---|
| 55 | #copyright { |
---|
| 56 | margin-top: 30px; |
---|
| 57 | line-height: 1.5em; |
---|
| 58 | text-align: center; |
---|
| 59 | font-size: .8em; |
---|
| 60 | color: #888888; |
---|
| 61 | clear: both; |
---|
| 62 | } |
---|
| 63 | </style> |
---|
| 64 | |
---|
| 65 | </head> |
---|
| 66 | |
---|
| 67 | <body> |
---|
| 68 | |
---|
| 69 | <div id="xiphlogo"> |
---|
| 70 | <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.org"/></a> |
---|
| 71 | </div> |
---|
| 72 | |
---|
| 73 | <h1>Ogg Vorbis stereo-specific channel coupling discussion</h1> |
---|
| 74 | |
---|
| 75 | <h2>Abstract</h2> |
---|
| 76 | |
---|
| 77 | <p>The Vorbis audio CODEC provides channel coupling |
---|
| 78 | mechanisms designed to reduce effective bitrate by both eliminating |
---|
| 79 | interchannel redundancy and eliminating stereo image information |
---|
| 80 | labeled inaudible or undesirable according to spatial psychoacoustic |
---|
| 81 | models. This document describes the mechanical coupling |
---|
| 82 | mechanisms available within the Vorbis specification, as well as the |
---|
| 83 | specific stereo coupling models used by the reference |
---|
| 84 | <tt>libvorbis</tt> codec provided by xiph.org.</p> |
---|
| 85 | |
---|
| 86 | <h2>Mechanisms</h2> |
---|
| 87 | |
---|
| 88 | <p>In encoder release beta 4 and earlier, Vorbis supported multiple |
---|
| 89 | channel encoding, but the channels were encoded entirely separately |
---|
| 90 | with no cross-analysis or redundancy elimination between channels. |
---|
| 91 | This multichannel strategy is very similar to mp3's <em>dual |
---|
| 92 | stereo</em> mode and Vorbis uses the same name for its analogous |
---|
| 93 | uncoupled multichannel modes.</p> |
---|
| 94 | |
---|
| 95 | <p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and |
---|
| 96 | later implement, a coupled channel strategy. Vorbis has two specific |
---|
| 97 | mechanisms that may be used alone or in conjunction to implement |
---|
| 98 | channel coupling. The first is <em>channel interleaving</em> via |
---|
| 99 | residue backend type 2, and the second is <em>square polar |
---|
| 100 | mapping</em>. These two general mechanisms are particularly well |
---|
| 101 | suited to coupling due to the structure of Vorbis encoding, as we'll |
---|
| 102 | explore below, and using both we can implement totally |
---|
| 103 | <em>lossless stereo image coupling</em> [bit-for-bit decode-identical |
---|
| 104 | to uncoupled modes], as well as various lossy models that seek to |
---|
| 105 | eliminate inaudible or unimportant aspects of the stereo image in |
---|
| 106 | order to enhance bitrate. The exact coupling implementation is |
---|
| 107 | generalized to allow the encoder a great deal of flexibility in |
---|
| 108 | implementation of a stereo or surround model without requiring any |
---|
| 109 | significant complexity increase over the combinatorially simpler |
---|
| 110 | mid/side joint stereo of mp3 and other current audio codecs.</p> |
---|
| 111 | |
---|
| 112 | <p>A particular Vorbis bitstream may apply channel coupling directly to |
---|
| 113 | more than a pair of channels; polar mapping is hierarchical such that |
---|
| 114 | polar coupling may be extrapolated to an arbitrary number of channels |
---|
| 115 | and is not restricted to only stereo, quadraphonics, ambisonics or 5.1 |
---|
| 116 | surround. However, the scope of this document restricts itself to the |
---|
| 117 | stereo coupling case.</p> |
---|
| 118 | |
---|
| 119 | <h3>Square Polar Mapping</h3> |
---|
| 120 | |
---|
| 121 | <h4>maximal correlation</h4> |
---|
| 122 | |
---|
| 123 | <p>Recall that the basic structure of a Vorbis I stream first generates |
---|
| 124 | from input audio a spectral 'floor' function that serves as an |
---|
| 125 | MDCT-domain whitening filter. This floor is meant to represent the |
---|
| 126 | rough envelope of the frequency spectrum, using whatever metric the |
---|
| 127 | encoder cares to define. This floor is subtracted from the log |
---|
| 128 | frequency spectrum, effectively normalizing the spectrum by frequency. |
---|
| 129 | Each input channel is associated with a unique floor function.</p> |
---|
| 130 | |
---|
| 131 | <p>The basic idea behind any stereo coupling is that the left and right |
---|
| 132 | channels usually correlate. This correlation is even stronger if one |
---|
| 133 | first accounts for energy differences in any given frequency band |
---|
| 134 | across left and right; think for example of individual instruments |
---|
| 135 | mixed into different portions of the stereo image, or a stereo |
---|
| 136 | recording with a dominant feature not perfectly in the center. The |
---|
| 137 | floor functions, each specific to a channel, provide the perfect means |
---|
| 138 | of normalizing left and right energies across the spectrum to maximize |
---|
| 139 | correlation before coupling. This feature of the Vorbis format is not |
---|
| 140 | a convenient accident.</p> |
---|
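<p>As a rough illustration of the normalization described above, the sketch below subtracts a per-channel floor from a log-domain spectrum. The function names, the dB units, and the sample values are assumptions made for this example and are not taken from <tt>libvorbis</tt>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Subtract a per-channel floor curve from a log-magnitude spectrum.
 * All values are in dB (an assumption of this sketch); the result is
 * the whitened residue for that channel. */
static void remove_floor(const float *spectrum_db, const float *floor_db,
                         size_t n, float *residue_db)
{
    for (size_t i = 0; i < n; i++)
        residue_db[i] = spectrum_db[i] - floor_db[i];
}

/* Return 1 if two residue vectors match element for element, else 0. */
static int residues_match(const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

<p>If the left channel is simply the right channel 6 dB louder, and each floor tracks its own channel, both calls to <tt>remove_floor</tt> yield the same residue; that is the sense in which per-channel floors maximize correlation before coupling.</p>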
| 141 | |
---|
| 142 | <p>Because we strive to maximally correlate the left and right channels |
---|
| 143 | and generally succeed in doing so, left and right residue is typically |
---|
| 144 | nearly identical. We could use channel interleaving (discussed below) |
---|
| 145 | alone to efficiently remove the redundancy between the left and right |
---|
| 146 | channels as a side effect of entropy encoding, but a polar |
---|
| 147 | representation gives benefits when left/right correlation is |
---|
| 148 | strong.</p> |
---|
| 149 | |
---|
| 150 | <h4>point and diffuse imaging</h4> |
---|
| 151 | |
---|
| 152 | <p>The first advantage of a polar representation is that it effectively |
---|
| 153 | separates the spatial audio information into a 'point image' |
---|
| 154 | (magnitude) at a given frequency and located somewhere in the sound |
---|
| 155 | field, and a 'diffuse image' (angle) that fills a large amount of |
---|
| 156 | space simultaneously. Even if we preserve only the magnitude (point) |
---|
| 157 | data, a detailed and carefully chosen floor function in each channel |
---|
| 158 | provides us with a free, fine-grained, frequency relative intensity |
---|
| 159 | stereo*. Angle information represents diffuse sound fields, such as |
---|
| 160 | reverberation that fills the entire space simultaneously.</p> |
---|
| 161 | |
---|
| 162 | <p>*<em>Because the Vorbis model supports a number of different possible |
---|
| 163 | stereo models and these models may be mixed, we do not use the term |
---|
| 164 | 'intensity stereo' talking about Vorbis; instead we use the terms |
---|
| 165 | 'point stereo', 'phase stereo' and subcategories of each.</em></p> |
---|
| 166 | |
---|
| 167 | <p>The majority of a stereo image is representable by polar magnitude |
---|
| 168 | alone, as strong sounds tend to be produced at near-point sources; |
---|
| 169 | even non-diffuse, fast, sharp echoes track very accurately using |
---|
| 170 | magnitude representation almost alone (for those experimenting with |
---|
| 171 | Vorbis tuning, this strategy works much better with the precise, |
---|
| 172 | piecewise control of floor 1; the continuous approximation of floor 0 |
---|
| 173 | results in unstable imaging). Reverberation and diffuse sounds tend |
---|
| 174 | to contain less energy and be psychoacoustically dominated by the |
---|
| 175 | point sources embedded in them. Thus, we again tend to concentrate |
---|
| 176 | more represented energy into a predictably smaller number of values. |
---|
| 177 | Separating representation of point and diffuse imaging also allows us |
---|
| 178 | to model and manipulate point and diffuse qualities separately.</p> |
---|
| 179 | |
---|
| 180 | <h4>controlling bit leakage and symbol crosstalk</h4> |
---|
| 181 | |
---|
| 182 | <p>Because polar |
---|
| 183 | representation concentrates represented energy into fewer large |
---|
| 184 | values, we reduce bit 'leakage' during cascading (multistage VQ |
---|
| 185 | encoding) as a secondary benefit. A single large, monolithic VQ |
---|
| 186 | codebook is more efficient than a cascaded book due to entropy |
---|
| 187 | 'crosstalk' among symbols between different stages of a multistage cascade. |
---|
| 188 | Polar representation is a way of further concentrating entropy into |
---|
| 189 | predictable locations so that codebook design can take steps to |
---|
| 190 | improve multistage codebook efficiency. It also allows us to cascade |
---|
| 191 | various elements of the stereo image independently.</p> |
---|
| 192 | |
---|
| 193 | <h4>eliminating trigonometry and rounding</h4> |
---|
| 194 | |
---|
| 195 | <p>Rounding and computational complexity are potential problems with a |
---|
| 196 | polar representation. As our encoding process involves quantization, |
---|
| 197 | mixing a polar representation and quantization makes it potentially |
---|
| 198 | impossible, depending on implementation, to construct a coupled stereo |
---|
| 199 | mechanism that results in bit-identical decompressed output compared |
---|
| 200 | to an uncoupled encoding should the encoder desire it.</p> |
---|
| 201 | |
---|
| 202 | <p>Vorbis uses a mapping that preserves the most useful qualities of |
---|
| 203 | polar representation, relies only on addition/subtraction (during |
---|
| 204 | decode; high quality encoding still requires some trig), and makes it |
---|
| 205 | trivial before or after quantization to represent an angle/magnitude |
---|
| 206 | through a one-to-one mapping from possible left/right value |
---|
| 207 | permutations. We do this by basing our polar representation on the |
---|
| 208 | unit square rather than the unit-circle.</p> |
---|
| 209 | |
---|
| 210 | <p>Given a magnitude and angle, we recover left and right using the |
---|
| 211 | following function (note that A/B may be left/right or right/left |
---|
| 212 | depending on the coupling definition used by the encoder):</p> |
---|
| 213 | |
---|
| 214 | <pre> |
---|
| 215 | if(magnitude>0){ /* positive magnitude */ |
---|
| 216 | if(angle>0){ |
---|
| 217 | A=magnitude; |
---|
| 218 | B=magnitude-angle; |
---|
| 219 | }else{ |
---|
| 220 | B=magnitude; |
---|
| 221 | A=magnitude+angle; |
---|
| 222 | } |
---|
| 223 | }else{ /* zero or negative magnitude: angle sign is mirrored */ |
---|
| 224 | if(angle>0){ |
---|
| 225 | A=magnitude; |
---|
| 226 | B=magnitude+angle; |
---|
| 227 | }else{ |
---|
| 228 | B=magnitude; |
---|
| 229 | A=magnitude-angle; |
---|
| 230 | } |
---|
| 231 | } |
---|
| 232 | </pre> |
---|
| 233 | |
---|
| 234 | <p>The function is antisymmetric for positive and negative magnitudes in |
---|
| 235 | order to eliminate a redundant value when quantizing. For example, if |
---|
| 236 | we're quantizing to integer values, we can visualize a magnitude of 5 |
---|
| 237 | and an angle of -2 as follows:</p> |
---|
| 238 | |
---|
| 239 | <p><img src="squarepolar.png" alt="square polar"/></p> |
---|
| 240 | |
---|
| 241 | <p>This representation loses or replicates no values; if A and B |
---|
| 242 | each range over the integers -5 through 5, the number of possible Cartesian |
---|
| 243 | permutations is 121. Represented in square polar notation, the |
---|
| 244 | possible values are:</p> |
---|
| 245 | |
---|
| 246 | <pre> |
---|
| 247 | 0, 0 |
---|
| 248 | |
---|
| 249 | -1,-2 -1,-1 -1, 0 -1, 1 |
---|
| 250 | |
---|
| 251 | 1,-2 1,-1 1, 0 1, 1 |
---|
| 252 | |
---|
| 253 | -2,-4 -2,-3 -2,-2 -2,-1 -2, 0 -2, 1 -2, 2 -2, 3 |
---|
| 254 | |
---|
| 255 | 2,-4 2,-3 ... following the pattern ... |
---|
| 256 | |
---|
| 257 | ... 5, 1 5, 2 5, 3 5, 4 5, 5 5, 6 5, 7 5, 8 5, 9 |
---|
| 258 | |
---|
| 259 | </pre> |
---|
| 260 | |
---|
| 261 | <p>...for a grand total of 121 possible values, the same number as in |
---|
| 262 | Cartesian representation (note that, for example, <tt>5,-10</tt> is |
---|
| 263 | the same as <tt>-5,10</tt>, so there's no reason to represent |
---|
| 264 | both. 2,10 cannot happen, and there's no reason to account for it.) |
---|
| 265 | It's also obvious that this mapping is exactly reversible.</p> |
---|
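<p>The reversibility claim can be checked mechanically. The sketch below pairs the decode function given above with one possible forward mapping; <tt>sp_encode</tt>, <tt>sp_decode</tt> and <tt>sp_count_roundtrips</tt> are names invented for this example, and the forward mapping is an assumption chosen to be consistent with the decode rule rather than a quotation of any encoder.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Decode: square polar (magnitude, angle) -> (A, B), following the
 * decode rule given in the text. */
static void sp_decode(int mag, int ang, int *A, int *B)
{
    if (mag > 0) {
        if (ang > 0) { *A = mag; *B = mag - ang; }
        else         { *B = mag; *A = mag + ang; }
    } else {
        if (ang > 0) { *A = mag; *B = mag + ang; }
        else         { *B = mag; *A = mag - ang; }
    }
}

/* Encode: one possible forward mapping (an assumption of this sketch).
 * The value with the larger absolute value becomes the magnitude; ties
 * go to B. The angle is the signed distance travelled along the edge
 * of the square. */
static void sp_encode(int A, int B, int *mag, int *ang)
{
    if (abs(A) > abs(B)) {
        *mag = A;
        *ang = (A > 0) ? A - B : B - A;
    } else {
        *mag = B;
        *ang = (B > 0) ? A - B : B - A;
    }
}

/* Count the (A, B) pairs in [-range, range] x [-range, range] that
 * come back unchanged after encode followed by decode. */
static int sp_count_roundtrips(int range)
{
    int count = 0;
    for (int A = -range; A <= range; A++) {
        for (int B = -range; B <= range; B++) {
            int mag, ang, A2, B2;
            sp_encode(A, B, &mag, &ang);
            sp_decode(mag, ang, &A2, &B2);
            if (A2 == A && B2 == B)
                count++;
        }
    }
    return count;
}
```

<p>With a range of 5, every one of the 121 Cartesian pairs survives the round trip, and the pair A=3, B=5 encodes to the magnitude 5, angle -2 example used above.</p>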
| 266 | |
---|
| 267 | <h3>Channel interleaving</h3> |
---|
| 268 | |
---|
| 269 | <p>We can remap an A/B vector using polar mapping into a magnitude/angle |
---|
| 270 | vector, and it's clear that, in general, this concentrates energy in |
---|
| 271 | the magnitude vector and reduces the amount of information to encode |
---|
| 272 | in the angle vector. Encoding these vectors independently with |
---|
| 273 | residue backend #0 or residue backend #1 will result in bitrate |
---|
| 274 | savings. However, there are still implicit correlations between the |
---|
| 275 | magnitude and angle vectors. The most obvious is that the amplitude |
---|
| 276 | of the angle is bounded by its corresponding magnitude value.</p> |
---|
| 277 | |
---|
| 278 | <p>Entropy coding the results, then, further benefits from the entropy |
---|
| 279 | model being able to compress magnitude and angle simultaneously. For |
---|
| 280 | this reason, Vorbis implements residue backend #2 which pre-interleaves |
---|
| 281 | a number of input vectors (in the stereo case, two, A and B) into a |
---|
| 282 | single output vector (with the elements in the order of |
---|
| 283 | A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus |
---|
| 284 | each vector to be coded by the vector quantization backend consists of |
---|
| 285 | matching magnitude and angle values.</p> |
---|
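<p>The interleaving order described above can be sketched as follows. The function name and signature are hypothetical and only illustrate the element ordering; they do not reproduce the residue backend's actual interface.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Interleave `ch` input vectors of length `n` into one output vector of
 * length ch*n, in the order v0[0], v1[0], ..., v0[1], v1[1], ...
 * For stereo (ch == 2) this yields A_0, B_0, A_1, B_1, ... */
static void interleave_vectors(const int *const *vecs, size_t ch, size_t n,
                               int *out)
{
    for (size_t i = 0; i < n; i++)
        for (size_t c = 0; c < ch; c++)
            out[i * ch + c] = vecs[c][i];
}
```

<p>When the two inputs are the magnitude and angle vectors, each adjacent pair in the output holds a magnitude and its matching angle, so a codebook spanning an even number of elements always sees them together.</p>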
| 286 | |
---|
| 287 | <p>The astute reader, at this point, will notice that in the theoretical |
---|
| 288 | case in which we can use monolithic codebooks of arbitrarily large |
---|
| 289 | size, we can directly interleave and encode left and right without |
---|
| 290 | polar mapping; in fact, the polar mapping does not appear to lend any |
---|
| 291 | benefit whatsoever to the efficiency of the entropy coding. In fact, |
---|
| 292 | it is perfectly possible and reasonable to build a Vorbis encoder that |
---|
| 293 | dispenses with polar mapping entirely and merely interleaves the |
---|
| 294 | channels. Libvorbis based encoders may configure such an encoding and |
---|
| 295 | it will work as intended.</p> |
---|
| 296 | |
---|
| 297 | <p>However, when we leave the ideal/theoretical domain, we notice that |
---|
| 298 | polar mapping does give additional practical benefits, as discussed in |
---|
| 299 | the above section on polar mapping and summarized again here:</p> |
---|
| 300 | |
---|
| 301 | <ul> |
---|
| 302 | <li>Polar mapping aids in controlling entropy 'leakage' between stages |
---|
| 303 | of a cascaded codebook.</li> |
---|
| 304 | <li>Polar mapping separates the stereo image |
---|
| 305 | into point and diffuse components which may be analyzed and handled |
---|
| 306 | differently.</li> |
---|
| 307 | </ul> |
---|
| 308 | |
---|
| 309 | <h2>Stereo Models</h2> |
---|
| 310 | |
---|
| 311 | <h3>Dual Stereo</h3> |
---|
| 312 | |
---|
| 313 | <p>Dual stereo refers to stereo encoding where the channels are entirely |
---|
| 314 | separate; they are analyzed and encoded as entirely distinct entities. |
---|
| 315 | This terminology is familiar from mp3.</p> |
---|
| 316 | |
---|
| 317 | <h3>Lossless Stereo</h3> |
---|
| 318 | |
---|
| 319 | <p>Using polar mapping and/or channel interleaving, it's possible to |
---|
| 320 | couple Vorbis channels losslessly, that is, construct a stereo |
---|
| 321 | coupling encoding that both saves space and decodes |
---|
| 322 | bit-identically to dual stereo. OggEnc 1.0 and later use this |
---|
| 323 | mode in all high-bitrate encoding.</p> |
---|
| 324 | |
---|
| 325 | <p>Overall, this stereo mode is overkill; however, it offers a safe |
---|
| 326 | alternative to users concerned about the slightest possible |
---|
| 327 | degradation to the stereo image or archival quality audio.</p> |
---|
| 328 | |
---|
| 329 | <h3>Phase Stereo</h3> |
---|
| 330 | |
---|
| 331 | <p>Phase stereo is the least aggressive means of gracefully dropping |
---|
| 332 | resolution from the stereo image; it affects only diffuse imaging.</p> |
---|
| 333 | |
---|
| 334 | <p>It's often quoted that the human ear is deaf to signal phase above |
---|
| 335 | about 4kHz; this is nearly true and a passable rule of thumb, but it |
---|
| 336 | can be demonstrated that even an average user can tell the difference |
---|
| 337 | between high frequency in-phase and out-of-phase noise. Obviously |
---|
| 338 | then, the statement is not entirely true. However, it's also the case |
---|
| 339 | that one must resort to a demonstration nearly this extreme to |
---|
| 340 | find a counterexample.</p> |
---|
| 341 | |
---|
| 342 | <p>'Phase stereo' is simply a more aggressive quantization of the polar |
---|
| 343 | angle vector; above 4kHz it's generally quite safe to quantize noise |
---|
| 344 | and noisy elements to only a handful of allowed phases, or to thin the |
---|
| 345 | phase with respect to the magnitude. The phases of high amplitude |
---|
| 346 | pure tones may or may not be preserved more carefully (they are |
---|
| 347 | relatively rare and L/R tend to be in phase, so there is generally |
---|
| 348 | little reason not to spend a few more bits on them).</p> |
---|
| 349 | |
---|
| 350 | <h4>example: eight phase stereo</h4> |
---|
| 351 | |
---|
| 352 | <p>Vorbis may implement phase stereo coupling by preserving the entirety |
---|
| 353 | of the magnitude vector (essential to fine amplitude and energy |
---|
| 354 | resolution overall) and quantizing the angle vector to one of only |
---|
| 355 | four possible values. Given that the magnitude vector may be positive |
---|
| 356 | or negative, this results in left and right phase having eight |
---|
| 357 | possible permutations, thus 'eight phase stereo':</p> |
---|
| 358 | |
---|
| 359 | <p><img src="eightphase.png" alt="eight phase"/></p> |
---|
| 360 | |
---|
| 361 | <p>Left and right may be in phase (positive or negative), the most common |
---|
| 362 | case by far, or out of phase by 90 or 180 degrees.</p> |
---|
| 363 | |
---|
| 364 | <h4>example: four phase stereo</h4> |
---|
| 365 | |
---|
| 366 | <p>Similarly, four phase stereo takes the quantization one step further; |
---|
| 367 | it allows only in-phase and 180 degree out-of-phase signals:</p> |
---|
| 368 | |
---|
| 369 | <p><img src="fourphase.png" alt="four phase"/></p> |
---|
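<p>One way to read the two examples above in terms of the square polar mapping is to snap each angle to the nearest allowed multiple of the magnitude. The sketch below is an interpretation, not the reference encoder's method: <tt>quantize_phase</tt>, its <tt>levels</tt> parameter, and the round-to-nearest rule are all assumptions. It also relies on the equivalence noted earlier (<tt>5,-10</tt> is the same as <tt>-5,10</tt>), so when rounding lands on an angle of +2|magnitude| it flips the magnitude sign to stay within the canonical angle range.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Integer division rounding toward negative infinity; b must be > 0. */
static int floor_div(int a, int b)
{
    int q = a / b;
    if (a % b != 0 && a < 0)
        q--;
    return q;
}

/* Quantize the angle of a square polar pair in place.
 * levels == 4: angles -2|m|, -|m|, 0, |m|  (eight phase stereo)
 * levels == 2: angles -2|m|, 0             (four phase stereo)
 * levels == 1: angle forced to 0           (no out-of-phase signal)
 * Other values of `levels` are not supported by this sketch. */
static void quantize_phase(int *mag, int *ang, int levels)
{
    int m = abs(*mag);
    if (m == 0 || levels <= 1) {
        *ang = 0;
        return;
    }
    int step = 4 * m / levels;  /* |m| for 4 levels, 2|m| for 2 levels */
    int q = floor_div(2 * (*ang) + step, 2 * step) * step; /* nearest multiple */
    if (q >= 2 * m) {
        /* (m, +2|m|) names the same point as (-m, -2|m|); use the latter. */
        q = -2 * m;
        *mag = -*mag;
    }
    *ang = q;
}
```

<p>For example, magnitude 5 with angle 9 (A=5, B=-4) becomes magnitude -5 with angle -10, which decodes to A=5, B=-5, the nearest 180 degree out-of-phase point.</p>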
| 370 | |
---|
| 371 | <h3>Point Stereo</h3> |
---|
| 372 | |
---|
| 373 | <p>Point stereo eliminates the possibility of out-of-phase signal |
---|
| 374 | entirely. Any diffuse quality to a sound source tends to collapse |
---|
| 375 | inward to a point somewhere within the stereo image. A practical |
---|
| 376 | example would be balanced reverberations within a large, live space; |
---|
| 377 | normally the sound is diffuse and soft, giving a sonic impression of |
---|
| 378 | volume. In point-stereo, the reverberations would still exist, but |
---|
| 379 | sound fairly firmly centered within the image (assuming the |
---|
| 380 | reverberation was centered overall; if the reverberation is stronger |
---|
| 381 | to the left, then the point of localization in point stereo would be |
---|
| 382 | to the left). This effect is most noticeable at low and mid |
---|
| 383 | frequencies and using headphones (which grant perfect stereo |
---|
| 384 | separation). Point stereo is a graceful but generally easy to |
---|
| 385 | detect degradation to the sound quality and is thus used in frequency |
---|
| 386 | ranges where it is least noticeable.</p> |
---|
| 387 | |
---|
| 388 | <h3>Mixed Stereo</h3> |
---|
| 389 | |
---|
| 390 | <p>Mixed stereo is the simultaneous use of more than one of the above |
---|
| 391 | stereo encoding models, generally using more aggressive modes in |
---|
| 392 | higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p> |
---|
| 393 | |
---|
| 394 | <p>It is also the case that near-DC frequencies should be encoded using |
---|
| 395 | lossless coupling to avoid frame blocking artifacts.</p> |
---|
| 396 | |
---|
| 397 | <h3>Vorbis Stereo Modes</h3> |
---|
| 398 | |
---|
| 399 | <p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes |
---|
| 400 | constructed out of lossless and point stereo. Phase stereo was used |
---|
| 401 | in the rc2 encoder, but is not currently used for simplicity's sake. It |
---|
| 402 | will likely be re-added to the stereo model in the future.</p> |
---|
| 403 | |
---|
| 404 | <div id="copyright"> |
---|
| 405 | The Xiph Fish Logo is a |
---|
| 406 | trademark (™) of Xiph.Org.<br/> |
---|
| 407 | |
---|
| 408 | These pages © 1994 - 2005 Xiph.Org. All rights reserved. |
---|
| 409 | </div> |
---|
| 410 | |
---|
| 411 | </body> |
---|
| 412 | </html> |
---|
| 413 | |
---|
| 414 | |
---|
| 415 | |
---|
| 416 | |
---|
| 417 | |
---|
| 418 | |
---|