Opened 14 years ago
Last modified 14 years ago
#379 new defect
International characters in paths
Reported by: | rgrieder | Owned by: | nobody |
---|---|---|---|
Priority: | minor | Milestone: | Version 0.1 Codename: Arcturus |
Component: | GeneralFramework | Version: | 0.0.4 |
Keywords: | unicode utf western 1252 codepage cegui | Cc: | |
Referenced By: | References: |
Description (last modified by rgrieder)
When starting Orxonox in a directory like 'ásdf' on Windows 7, the CEGUI logger will not accept the logging file, leading to an exception.
We need to investigate whether this is just a communication problem between Orxonox and CEGUI or whether we have serious issues with international characters in paths.
EDIT
It turns out that it was mostly a problem in the CEGUI::DefaultLogger. However, that's not all, so I have to make a little detour (for Windows only!):
On Windows, characters in the 8-bit (non-wide) API are encoded using the Microsoft codepage currently in use, which can differ from system to system. Most codepages are simply the 128 ASCII characters (7 bit) extended by another 128 characters to support whatever is needed. On systems in the US and Western Europe, codepage 1252 is the standard.
CEGUI on the other hand uses UTF-32 (4 bytes per character) for its strings and converts them to UTF-8 when calling c_str(). That is of course different from the 1252 Western codepage used by Windows, so whatever we get from CEGUI might not be directly usable with the Windows API.
That's why for all the Windows API functions related to strings, there is a second function with a 'W' suffix (e.g. CreateFileW) that accepts wchar_t. However, the usual size of that type is 4 bytes (UNIX), whereas Microsoft decided to go for 2 bytes and UTF-16 encoding.
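To make the size difference concrete, here is a minimal sketch of turning UTF-32 code points into UTF-16 code units, which is the form a 2-byte wchar_t string on Windows would hold. The function name utf32ToUtf16 is made up for this illustration and is not part of Orxonox or CEGUI; the sketch assumes valid input and does no error checking.

```cpp
#include <string>

// Hypothetical helper: convert UTF-32 code points to UTF-16 code units.
// Assumes every input value is a valid Unicode scalar value.
std::u16string utf32ToUtf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in)
    {
        if (cp < 0x10000)
        {
            // Fits into a single 16-bit unit.
            out += static_cast<char16_t>(cp);
        }
        else
        {
            // Needs a surrogate pair: 20 bits split into 10 + 10.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

A character like 'á' (U+00E1) stays a single unit, while anything above U+FFFF takes two units, so the unit count of the result is not always the character count.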
That's exactly where the bug occurred: CEGUI converted the file name to UTF-8 and fed that to ofstream::open, which in turn interpreted the bytes as a codepage 1252 character sequence.
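The mismatch can be reproduced with a few lines. The sketch below encodes a code point as UTF-8; encodeUtf8 is a name invented for this example, not a CEGUI function, and it assumes the input is a valid code point.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid code point (no range or surrogate checks).
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 'á' (U+00E1) this yields the two bytes 0xC3 0xA1. Read as codepage 1252, those are the two characters 'Ã' and '¡', so a directory named 'ásdf' is looked up as 'Ã¡sdf', which does not exist.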
There is one more subtle detail left: How does CEGUI::String convert from 1252 to UTF-32 when assigning our std::string to it? Simple: according to the documentation, the characters are interpreted as unencoded 8-bit values. So a simple cast from 8 bit to 32 bit values is done.
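And how on earth could that ever be correct (it actually was)? It turns out that codepage 1252 is mostly identical to Unicode for the first 256 characters: apart from the range 0x80 to 0x9F, each 1252 byte value equals its Unicode code point, so the plain cast yields correct UTF-32 for characters like 'á'.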
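As a sketch of what such a cast amounts to (this is not CEGUI's actual code, and both function names are invented here), each byte is simply widened to a 32-bit value. The second function is a conservative check for when that widening happens to be right for codepage 1252 input.

```cpp
#include <string>

// Hypothetical illustration: widen every byte to one 32-bit value, no decoding.
std::u32string widenBytes(const std::string& bytes)
{
    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out += static_cast<char32_t>(b);
    return out;
}

// Conservative check: outside 0x80..0x9F a codepage 1252 byte has the same
// value as its Unicode code point. Inside that range most bytes map
// elsewhere (for example 0x80 is the euro sign, U+20AC).
bool cp1252MatchesUnicode(unsigned char b)
{
    return b < 0x80 || b > 0x9F;
}
```

The 1252 byte for 'á' is 0xE1, and U+00E1 is also 'á', so the path from the ticket survives the cast. A euro sign in the path would not.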
TODO
Not every user will have the 1252 codepage and therefore a lot of things can go wrong. We somehow have to deal with this.
On the other hand, the CEGUI problem that this ticket was issued for is just a bug and not general behaviour. CEGUI 0.6.2 might still have the issue, but since it only concerns Windows, where we use CEGUI 0.7.5, we're safe.
The other TODO is making a correct conversion from UTF-8 (standard Linux encoding if I'm not wrong) to CEGUI::String, because assigning a std::string is currently just a simple cast and not a decoding.
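For comparison with the plain cast, a real decoding step could look roughly like the following. This is only a sketch under stated assumptions: decodeUtf8 is a made-up name, the function does not reject overlong encodings or surrogate code points, and how the result would be handed to CEGUI::String is left open.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Throws on invalid lead bytes, bad continuation bytes and truncated input.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        out += cp;
        i += extra + 1;
    }
    return out;
}
```

With this, the UTF-8 bytes 0xC3 0xA1 decode to the single code point U+00E1, whereas the plain cast would produce the two values 0xC3 and 0xA1.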
Change History (2)
comment:1 Changed 14 years ago by youngk
comment:2 Changed 14 years ago by rgrieder
- Description modified (diff)
- Keywords utf western 1252 codepage cegui added
- Priority changed from critical to minor
If it turns out that we actually have serious problems when handling international characters in Orxonox, one might also think about whitespace handling in paths. Just a thought.