C++ Boost

UTF-8 Codecvt Facet

template<
    typename InternType = wchar_t, 
    typename ExternType = char
> utf8_codecvt_facet

Rationale

UTF-8 is a method of encoding Unicode text for environments where data is stored as 8-bit characters and where some ASCII characters are considered special (e.g. in Unix filenames) and tend to appear more commonly than other characters. While UTF-8 is convenient and efficient for storing data on filesystems, it was not meant to be manipulated in memory by applications. Some applications (such as Unix's 'cat') can simply ignore the encoding of data, but others should convert from UTF-8 to UCS-4 (the more canonical representation of Unicode) when reading from a file, and reverse the process when writing to a file.
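To make the UTF-8 to UCS-4 direction concrete, the following is a minimal sketch of the decoding step. The function name decode_utf8 is hypothetical and is not part of this library; the sketch also assumes well-formed input apart from basic structural checks, and does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of utf8_codecvt_facet: decode a string of
// UTF-8 octets into a sequence of UCS-4 code points.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t trailing = 0;  // number of continuation octets expected

        // The lead octet encodes both the sequence length and the high bits.
        if (lead < 0x80)                { cp = lead;        trailing = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; }
        else throw std::runtime_error("invalid UTF-8 lead octet");

        assert(i < bytes.size());
        if (trailing > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation octet has the form 10xxxxxx and adds six bits.
        for (std::size_t k = 1; k <= trailing; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation octet");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += trailing + 1;
    }
    return out;
}
```

For instance, the octets 0xC3 0xA9 decode to the single code point U+00E9, while a plain ASCII octet such as 0x41 maps to itself. This is why ASCII-heavy text stays compact in UTF-8.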

The C++ Standard IOStreams library provides the std::codecvt facet to handle precisely these cases. On reading from or writing to a file, std::basic_filebuf can call out to the codecvt facet to convert data between the external format (e.g. UTF-8) and the internal format (e.g. UCS-4). utf8_codecvt_facet is a specialization of std::codecvt designed specifically to translate between UTF-8 and UCS-4.
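As an illustration of the call a file buffer makes on input, the sketch below invokes the in() member of whatever std::codecvt<wchar_t, char, std::mbstate_t> facet a given locale carries. The function name to_internal is hypothetical, and the sketch converts a whole string in one call, whereas a real std::basic_filebuf works buffer by buffer and carries the conversion state across calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert external (char) data to internal (wchar_t)
// data using the codecvt facet installed in the locale 'loc'.
std::wstring to_internal(const std::string& external, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    if (external.empty()) return std::wstring();

    const facet_type& cvt = std::use_facet<facet_type>(loc);
    std::mbstate_t state = std::mbstate_t();  // initial conversion state

    // One internal character per external octet is an upper bound here.
    std::wstring internal(external.size(), L'\0');
    const char* from_begin = external.data();
    const char* from_end   = from_begin + external.size();
    const char* from_next  = from_begin;
    wchar_t* to_begin = &internal[0];
    wchar_t* to_next  = to_begin;

    const std::codecvt_base::result r =
        cvt.in(state, from_begin, from_end, from_next,
               to_begin, to_begin + internal.size(), to_next);

    if (r == std::codecvt_base::error)
        throw std::runtime_error("codecvt conversion failed");
    if (r == std::codecvt_base::noconv)  // facet reports no conversion needed
        return std::wstring(external.begin(), external.end());

    // On 'ok' or 'partial', keep only the characters actually produced.
    assert(to_next >= to_begin);
    internal.resize(static_cast<std::size_t>(to_next - to_begin));
    return internal;
}
```

Passing a locale that carries a UTF-8 facet makes the same call perform UTF-8 decoding. With std::locale::classic(), only plain ASCII input can be expected to convert portably.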

Template Parameters

Parameter    Description                                              Default
InternType   The internal type used to represent UCS-4 characters.    wchar_t
ExternType   The external type used to represent UTF-8 octets.        char

Requirements

utf8_codecvt_facet defaults to using char as its external data type and wchar_t as its internal data type, but on some architectures wchar_t is not large enough to hold UCS-4 characters. In order to use another internal type, you must also specialize std::codecvt to handle your internal and external types. (std::codecvt<wchar_t,char,std::mbstate_t> is required to be supplied by any standard-conforming compiler.)
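Whether wchar_t is wide enough on a given platform can be checked at compile time. The sketch below uses a hypothetical helper, can_hold_ucs4, and assumes that 21 value bits suffice, which covers the current Unicode range up to U+10FFFF; the original 31-bit UCS-4 range would need a stricter threshold. It requires C++11.

```cpp
#include <limits>

// Hypothetical helper: true if T has enough value bits to represent every
// code point up to U+10FFFF (assumed threshold: 21 bits).
template <typename T>
constexpr bool can_hold_ucs4() {
    return std::numeric_limits<T>::digits >= 21;
}

// Reports at compile time whether the default InternType is adequate here.
constexpr bool wchar_t_is_adequate = can_hold_ucs4<wchar_t>();
```

On platforms where wchar_t is 16 bits wide, wchar_t_is_adequate is false, and a wider internal type would be needed together with a matching std::codecvt specialization as described above.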

Example Use

The following is a simple example of using this facet:
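  #include <algorithm>  // std::copy
  #include <fstream>    // std::wofstream, std::wifstream
  #include <iterator>   // std::ostream_iterator
  #include <locale>     // std::locale
  #include <vector>     // std::vector
  //...
  // My encoding type
  typedef wchar_t ucs4_t;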

  std::locale old_locale;
  std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);

  // Set a New global locale
  std::locale::global(utf8_locale);

  // Send the UCS-4 data out, converting to UTF-8
  {
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(),ucs4_data.end(),
          std::ostream_iterator<ucs4_t,ucs4_t>(ofs));
  }

  // Read the UTF-8 data back in, converting to UCS-4 on the way in
  std::vector<ucs4_t> from_file;
  {
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ifs >> std::noskipws;  // do not drop whitespace characters in the data
    ucs4_t item = 0;
    while (ifs >> item) from_file.push_back(item);
  }
  //...

History

This code was originally written as an iterator adaptor over containers for use with UTF-8 encoded strings in memory. Dietmar Kuehl suggested that it would be better provided as a codecvt facet.

Resources



Copyright © 2001 Ronald Garcia, Indiana University (garcia@osl.iu.edu)
Andrew Lumsdaine, Indiana University (lums@osl.iu.edu)

© Copyright Robert Ramey 2002-2004. Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)