Re: Conversion from UTF32 to UTF8 for review

From:
Peter Olcott <NoSpam@OCR4Screen.com>
Newsgroups:
microsoft.public.vc.mfc
Date:
Mon, 31 May 2010 12:17:00 -0500
Message-ID:
<j82dnZRTz8uTcp7RnZ2dnUVZ_omdnZ2d@giganews.com>
On 5/31/2010 11:32 AM, Joseph M. Newcomer wrote:

See below,...
On Mon, 31 May 2010 10:27:12 -0500, Peter Olcott <NoSpam@OCR4Screen.com> wrote:

I used the two tables from this link as the basis for my design:
http://en.wikipedia.org/wiki/UTF-8

I would like this reviewed for algorithm correctness:

void UnicodeEncodingConversion::
toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) {
uint8_t Byte;
uint32_t CodePoint;
   UTF8.reserve(UTF32.size() * 4); // worst case

****
Note that this will call malloc(), which will involve setting a lock, then searching for a
block to allocate, then releasing the lock. Since you have been a fanatic about
performance, why is it you put a very expensive operation like 'reserve' in your code?


Duh, because it minimizes the cost of push_back(). I don't try to write
the fastest possible code. I try to write the fastest possible code
within the higher priority of the most readable and maintainable code.

While it is perfectly reasonable, it seems inconsistent with your previously-stated goals.

I note that you don't allow space for the terminal NUL character. This means to get a
properly terminated string, the caller has to make a copy of the result vector, every
time, which puts a performance burden on the caller, which should be counted in the total
cost.
****


I think that the best way around this is to use std::string to store the
UTF8 output. std::string manages its own NUL terminator (c_str() is
guaranteed to return a terminated string), so I don't need to reserve an
extra byte for it.
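
A minimal sketch of that approach (the function name is illustrative, not
part of the posted code):

#include <cstdint>
#include <string>
#include <vector>

// Sketch only: std::string keeps its own terminating NUL, so reserve()
// needs no extra byte for it, and c_str() is always properly terminated.
std::string makeUTF8Buffer(const std::vector<uint32_t> & UTF32) {
   std::string UTF8;
   UTF8.reserve(UTF32.size() * 4);   // worst case: 4 bytes per code point
   // ... append each encoded byte with UTF8 += (char)Byte; ...
   return UTF8;                      // UTF8.c_str() is NUL-terminated
}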

   for (uint32_t N = 0; N < UTF32.size(); N++) {
     CodePoint = UTF32[N];

     if (CodePoint <= 0x7F) {
       Byte = CodePoint;
       UTF8.push_back(Byte);

*****
And why, if you are so concerned with performance, are you using a potentially expensive
operation like push_back here? Even with the reserve(), this is not the cheapest way to
actually accomplish your goal of incredible performance!


Already stated above.

****

     }
     else if (CodePoint <= 0x7FF) {
       Byte = 0xC0 | (CodePoint >> 6);

****
So suppose the code point is somewhere between 0x7F (0000 0111 1111) and 0x7FF (0111 1111
1111).

The encoding of 0000 0yyy yyxx xxxx should come out as 110y yyyy 10xx xxxx.


// input: yyyxxxxxxxx
// output: 110yyyxx 10xxxxxx
It should be correct.
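
For example, tracing U+00E9 through the two statements here:

   0xC0 | (0xE9 >> 6)   = 0xC0 | 0x03 = 0xC3
   0x80 | (0xE9 & 0x3F) = 0x80 | 0x29 = 0xA9

which is C3 A9, the standard UTF-8 encoding of U+00E9.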

****

       UTF8.push_back(Byte);

****
Potentially expensive; not the fastest performer.
****

       Byte = 0x80 | (CodePoint & 0x3F);
       UTF8.push_back(Byte);

****
Another potentially slow operation

It is the fastest possible operation within my design constraints.

*****

     }
     else if (CodePoint <= 0xFFFF) {
       Byte = 0xE0 | (CodePoint >> 12);
       UTF8.push_back(Byte);
       Byte = 0x80 | ((CodePoint >> 6) & 0x3F);
       UTF8.push_back(Byte);
       Byte = 0x80 | (CodePoint & 0x3F);
       UTF8.push_back(Byte);
     }
     else if (CodePoint <= 0x10FFFF) {
       Byte = 0xF0 | (CodePoint >> 18);
       UTF8.push_back(Byte);
       Byte = 0x80 | ((CodePoint >> 12) & 0x3F);
       UTF8.push_back(Byte);
       Byte = 0x80 | ((CodePoint >> 6) & 0x3F);
       UTF8.push_back(Byte);
       Byte = 0x80 | (CodePoint & 0x3F);
       UTF8.push_back(Byte);
     }
     else
       printf("%u is outside of the Unicode range!\n", CodePoint);

*****
Typical bad code a beginner would write; what is going to happen if I include this
subroutine in a Windows program? Nothing. In the Old Days, the printf used to take an
access fault. Why would anyone writing a general-purpose routine assume that (a) there is
a console or (b) issuing a printed message to the user would be in the least meaningful?

No, the CORRECT way to write such code is to either throw an exception (if you are in C++,
which you clearly are) or return a value indicating the error. For example, in C, one
approach would be to return an int value: if positive, it is the number of
characters in the converted UTF-8 array (which I notice you fail to properly NUL-terminate
as a string!); if negative, it is the index in the input vector where an erroneous
character was encountered, and the integrity of the output conversion is not guaranteed.

If you had pretensions to make this code fast, you would require that the caller provide a
pre-allocated buffer, and you would pass in the BYTE * pointer and the maximum length, and
you would, upon completion, put a NUL character at the end of the buffer. At NO POINT
would you issue a MessageBox, printf, or other totally useless user interaction; instead,
you would put the burden of the user interaction (if any) on the caller.
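
In concrete terms, such an interface might look like this (the names are
illustrative, not from the posted code, and the negative-return encoding is
just one possible convention):

#include <cstddef>
#include <cstdint>

// Returns the number of UTF-8 bytes written (>= 0), or -(i + 1) where i is
// the index of the first invalid code point.  The caller must supply a
// buffer of at least count * 4 + 1 bytes; the output is NUL-terminated.
int ConvertUTF32toUTF8(const uint32_t * in, size_t count, uint8_t * out)
{
   uint8_t * p = out;
   for (size_t i = 0; i < count; i++) {
      uint32_t c = in[i];
      if (c <= 0x7F) {
         *p++ = (uint8_t)c;
      } else if (c <= 0x7FF) {
         *p++ = (uint8_t)(0xC0 | (c >> 6));
         *p++ = (uint8_t)(0x80 | (c & 0x3F));
      } else if (c <= 0xFFFF) {
         *p++ = (uint8_t)(0xE0 | (c >> 12));
         *p++ = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
         *p++ = (uint8_t)(0x80 | (c & 0x3F));
      } else if (c <= 0x10FFFF) {
         *p++ = (uint8_t)(0xF0 | (c >> 18));
         *p++ = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
         *p++ = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
         *p++ = (uint8_t)(0x80 | (c & 0x3F));
      } else {
         return -(int)i - 1;          // error: report where, instead of printf
      }
   }
   *p = 0;                            // terminate as a C string
   return (int)(p - out);
}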

This is very amateur code. While it appears to be correct, it makes far too many
assumptions about its operating environment (a console app, limiting its usefulness) and
presumes that it has the right to interact with the user (a really, really, REALLY bad
assumption).

Also, given that you were making claims about its high performance, it is hard to believe
that any code that does a reserve() call or uses push_back is representative of the most
efficient code possible.

If I were grading this, I'd give it an F just because of the printf (even if we were
working under an assumption of pure console apps, this is completely tasteless), and
because this is a Windows/MFC group, I'd give it an F because it won't work in a Windows
app or an MFC app; and if it were being graded on performance, I'd give it a D- at best,
because of the expensive reserve() in it (note that while push_back is faster if there is
reserved space, it is still more expensive than just storing in an array and incrementing
a pointer!)


(1) I am only asking whether this code consistently produces a correct result.

(2) The real reason that you would give me an "F" is because I have
pointed out serious errors that you have made (such as gross
exaggerations) in an open forum, and this could hurt your credibility.

And, if generality were a consideration, I'd suggest doing what many API calls do: if
a NULL buffer pointer is passed in, store nothing but return the length of the
conversion buffer needed. This would interfere with the performance (a lot of NULL-pointer
tests), but not as much as push_back does when compared to a simple index. And while it
means two calls plus typically an allocation, it lets the caller make the decision about
where the overheads are going to be. A caller concerned about raw performance would do
exactly one allocation and re-use the buffer on each call, as long as the output buffer
was guaranteed to be at least 4 * length_of_UTF32_string + 1. In fact, this is a common
trick:


I am not that concerned with performance. The number one priority is
readability and maintainability. Performance is secondary to this.

BYTE buffer[SOME_MAX_SIZE];           // stack buffer handles the common case
BYTE * p = buffer;

if(size_needed() > SOME_MAX_SIZE)
     p = new BYTE[size_needed()];     // fall back to the heap only when needed

...do computation using p

if(p != buffer)
    delete [] p;                      // free only if the heap path was taken
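
Putting the sizing rule above together with a reusable buffer, the caller
side might look like this (a sketch only; ConvertUTF32toUTF8 is the
hypothetical int-returning converter sketched earlier in this post):

#include <cstddef>
#include <cstdint>
#include <vector>

int ConvertUTF32toUTF8(const uint32_t * in, size_t count, uint8_t * out);

void ConvertMany(const std::vector<std::vector<uint32_t> > & inputs)
{
   std::vector<uint8_t> scratch;                 // reused across every call
   for (size_t i = 0; i < inputs.size(); i++) {
      const std::vector<uint32_t> & in = inputs[i];
      scratch.resize(in.size() * 4 + 1);         // grows when needed, capacity is kept
      int written = ConvertUTF32toUTF8(in.empty() ? 0 : &in[0],
                                       in.size(), &scratch[0]);
      (void)written;                             // ... use scratch[0 .. written) here ...
   }
}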

I once used this trick to get something like four orders of magnitude performance out of
an inner loop. The programmer protested that "I have to do malloc each time because I
don't know until I get there how big the buffer has to be", and ended up doing 2,000,000
malloc calls and 2,000,000 free calls. By using this trick, I reduced it to 2 malloc


std::vector with a 2.0 memory growth factor would have eliminated this.
For the code that I write, I consider all user-specified dynamic memory
allocation to be bad. It leads to errors that can too easily be
prevented by not doing dynamic memory allocation in user code.

calls and 2 free calls. The code ran over 10,000 times faster (I'm talking overall
execution, which included a lot of other computations; I reduced the number of allocator
calls by SIX orders of magnitude)! This was the result of doing actual performance
measurement. However, your code as presented does not allow the caller to implement this
option, so the worst-case performance cannot be improved at all.

From the bits viewpoint, it is probably correct (I didn't work out all the shifts and
masks, figuring you probably got those right) but from a design and architecture
viewpoint, it is truly awful code given the stated goals. Besides limiting it to console


That statement could be libelous. Cease and desist any such commentary.

It is by no means truly awful code; anyone who knows code well would
know this. Prospective future employers might not know code well enough,
and might believe what you said.

apps, it makes post-deployment support nearly impossible, because the code cannot build in
useful diagnostics, handle failures, etc.


It is a rough draft to see if the algorithm is correct. Even as a rough
draft the code is of high quality. It has not yet been made into fully
robust production code.

And what do you think the correct response to an embedded 0 character should be? Since
you don't put a NUL terminator on the output, it makes doing a multi-sz style string
rather difficult.

                    joe

I think that the best alternative here is to simply use std::string and
its c_str() member function.

****

   }
}
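
To answer the correctness question directly, a small spot check can be
compiled against the class above (this harness is not part of the posted
code; it assumes UnicodeEncodingConversion is default-constructible and its
declaration is visible, and the expected bytes are the standard UTF-8
encodings of the listed code points):

#include <cassert>
#include <cstdint>
#include <vector>

int main() {
   std::vector<uint32_t> in;
   in.push_back(0x41);      // U+0041  -> 41
   in.push_back(0xE9);      // U+00E9  -> C3 A9
   in.push_back(0x20AC);    // U+20AC  -> E2 82 AC
   in.push_back(0x1F600);   // U+1F600 -> F0 9F 98 80

   uint8_t raw[] = { 0x41, 0xC3, 0xA9, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x98, 0x80 };
   std::vector<uint8_t> expected(raw, raw + sizeof raw);

   std::vector<uint8_t> out;
   UnicodeEncodingConversion converter;
   converter.toUTF8(in, out);

   assert(out == expected);
   return 0;
}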

Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
