Re: Conversion from UTF32 to UTF8 for review

From: "Daniel T." <daniel_t@earthlink.net>
Newsgroups: comp.lang.c++,microsoft.public.vc.mfc
Date: Mon, 31 May 2010 12:35:30 -0400
Message-ID: <daniel_t-B347BC.12353031052010@70-3-168-216.pools.spcsdns.net>

Peter Olcott <NoSpam@OCR4Screen.com> wrote:

I used the two tables from this link as the basis for my design:
http://en.wikipedia.org/wiki/UTF-8

I suggest you use http://unicode.org/ for your source. Why use a
secondary source when the primary source is easily available?

I would like this reviewed for algorithm correctness:

Surely your tests have already shown whether the algorithm is correct.

void UnicodeEncodingConversion::
toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) {
   uint8_t Byte;
   uint32_t CodePoint;
   UTF8.reserve(UTF32.size() * 4); // worst case
   for (uint32_t N = 0; N < UTF32.size(); N++) {
     CodePoint = UTF32[N];

I suggest you use an iterator instead of an integer for the loop. That
way you won't need the extraneous variable.
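
As a rough illustration of the iterator form, here is a minimal sketch. The
function utf8Length is hypothetical (it is not in the posted code); it only
counts the bytes each code point would need, so the loop shape is easy to
see. Taking the vector by const reference is also an assumption about what
the caller wants.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the code under review: counts the UTF-8
// bytes needed for a UTF-32 sequence. The iterator replaces the index N.
std::size_t utf8Length(const std::vector<uint32_t>& UTF32) {
    std::size_t bytes = 0;
    for (std::vector<uint32_t>::const_iterator it = UTF32.begin();
         it != UTF32.end(); ++it) {
        const uint32_t CodePoint = *it;
        if (CodePoint <= 0x7F)        bytes += 1;
        else if (CodePoint <= 0x7FF)  bytes += 2;
        else if (CodePoint <= 0xFFFF) bytes += 3;
        else                          bytes += 4; // no range check in this sketch
    }
    return bytes;
}
```

A count like this could also replace the worst-case reserve() with an exact
one, at the cost of a second pass over the input.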

     if (CodePoint <= 0x7F) {
       Byte = CodePoint;
       UTF8.push_back(Byte);
     }
     else if (CodePoint <= 0x7FF) {
       Byte = 0xC0 | (CodePoint >> 6);
       UTF8.push_back(Byte);
       Byte = 0x80 | (CodePoint & 0x3F);
       UTF8.push_back(Byte);
     }
     else if (CodePoint <= 0xFFFF) {
       Byte = 0xE0 | (CodePoint >> 12);
       UTF8.push_back(Byte);
       Byte = 0x80 | ((CodePoint >> 6) & 0x3F);
       UTF8.push_back(Byte);
       Byte = 0x80 | (CodePoint & 0x3F);
       UTF8.push_back(Byte);
     }
     else if (CodePoint <= 0x10FFFF) {

The codes 10FFFE and 10FFFF are guaranteed not to be Unicode
characters...
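
For what it's worth, those two are part of a larger set of 66
noncharacters: FDD0 through FDEF, plus the last two code points of every
plane. A minimal sketch of how a check might look follows. The helper names
isNoncharacter and isSurrogate are hypothetical, and whether the encoder
should reject noncharacters at all is a policy choice, since they are still
code points within 0..10FFFF. Surrogates (D800 through DFFF) are a separate
case: they are not valid in UTF-32 and have no well-formed UTF-8 encoding.

```cpp
#include <cstdint>

// Hypothetical helpers, not part of the code under review.

// True for the 66 noncharacters: U+FDD0..U+FDEF plus the last two code
// points of each plane (U+nFFFE and U+nFFFF for planes 0 through 16).
bool isNoncharacter(uint32_t cp) {
    if (cp > 0x10FFFF) return false;               // not a code point at all
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
    return (cp & 0xFFFE) == 0xFFFE;                // low 16 bits are FFFE or FFFF
}

// True for the UTF-16 surrogate range, which must not be encoded as UTF-8.
bool isSurrogate(uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

Either test could be placed ahead of the range checks in the loop if the
caller decides such input should be refused.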

       Byte = 0xF0 | (CodePoint >> 18);
       UTF8.push_back(Byte);
       Byte = 0x80 | ((CodePoint >> 12) & 0x3F);
       UTF8.push_back(Byte);
       Byte = 0x80 | ((CodePoint >> 6) & 0x3F);
       UTF8.push_back(Byte);
       Byte = 0x80 | (CodePoint & 0x3F);
       UTF8.push_back(Byte);
     }
     else
       printf("%u is outside of the Unicode range!\n", static_cast<unsigned>(CodePoint));

Throw is more appropriate here.

}
}
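
To make the throw suggestion concrete, here is a sketch of the same
conversion with the printf replaced by an exception. Several things here
are assumptions rather than anything from the original post: the function
is written as a free function instead of a member of
UnicodeEncodingConversion, the input is taken by const reference, and
std::range_error is just one plausible exception type.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <vector>

// Sketch only: free-function variant of the posted toUTF8.
// Throws std::range_error (an assumed choice) for values above 0x10FFFF.
void toUTF8(const std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) {
    UTF8.reserve(UTF8.size() + UTF32.size() * 4); // worst case
    for (std::vector<uint32_t>::const_iterator it = UTF32.begin();
         it != UTF32.end(); ++it) {
        const uint32_t cp = *it;
        if (cp <= 0x7F) {
            UTF8.push_back(static_cast<uint8_t>(cp));
        } else if (cp <= 0x7FF) {
            UTF8.push_back(static_cast<uint8_t>(0xC0 | (cp >> 6)));
            UTF8.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp <= 0xFFFF) {
            UTF8.push_back(static_cast<uint8_t>(0xE0 | (cp >> 12)));
            UTF8.push_back(static_cast<uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            UTF8.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp <= 0x10FFFF) {
            UTF8.push_back(static_cast<uint8_t>(0xF0 | (cp >> 18)));
            UTF8.push_back(static_cast<uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
            UTF8.push_back(static_cast<uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            UTF8.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
        } else {
            std::ostringstream msg;
            msg << "code point 0x" << std::hex << cp
                << " is outside of the Unicode range";
            throw std::range_error(msg.str());
        }
    }
}
```

Note that if the exception is thrown partway through, UTF8 still holds the
bytes emitted so far. A caller that needs all-or-nothing behavior could
convert into a temporary vector and swap it in on success.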