Re: New utf8string design may make UTF-8 the superior encoding

From:

Peter Olcott <NoSpam@OCR4Screen.com>

Newsgroups:

comp.lang.c++,microsoft.public.vc.mfc

Date:

Mon, 17 May 2010 22:35:22 -0500

Message-ID:

<hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d@giganews.com>

On 5/17/2010 4:58 PM, Joshua Maurice wrote:

On May 17, 12:25 pm, I V<ivle...@gmail.com> wrote:

On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:

Do you know of any faster way to validate and divide a UTF-8 sequence
into its constituent code point parts than a regular expression
implemented as a finite state machine? (please don't cite a software
package, I am only interested in the underlying methodology).

A finite state machine sounds like a good plan, but I'd be a bit
surprised if a regular expression was faster than a state machine
specifically written to parse UTF-8. Aside from the unnecessary
generality of regular expressions (I don't really know if that would
actually make them slower in this case), I would guess a regular
expression engine wouldn't take advantage of the way that UTF-8 encodes
the meaning of each byte (single-byte codepoint, first byte of multi-byte
code-point, or continuation of a multi-byte codepoint) in the most-
significant two bits of the byte.

This sounds a little overkill to me, all of this talk of regular
expressions, finite state machines, etc.

Can't you just do something like the following? I understand that it
is a finite state machine in fact, but it uses no frameworks, no
regular expressions, etc. I'd expect that this is pretty good in terms
of speed and readability. It would be quite simple to add some code
using bit operations to convert from the utf8 array to Unicode code
points.

The finite state machine's detailed design is now completed. Its state
transition matrix only takes 2048 bytes. It will be faster than any
other possible method.

The finite state machine is semantically identical to the regular
expression. I am somewhat enamored with DFA recognizers. I love them. I
will post the source code when it is completed.

A few years ago I beat both Microsoft and Borland's std::string by as
much as more than 100-fold. I posted the code this this group. It was
called FastString if anyone cares to look this up.

//COMPLETELY UNTESTED
bool validate_utf8(unsigned char * utf8str_start, unsigned char *
utf8str_end)
{
   for (unsigned char * x = utf8str_start; x != utf8str_end; )
   {
     if ((*x& 0x80) == 0)
     {
       ++x;
     }
     else if ((*x& (0x80 + 0x40 + 0x20)) == (0x80 + 0x40))
     {
       if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
         return false;
       ++x;
     }
     else if ((*x& (0x80 + 0x40 + 0x20 + 0x10)) == (0x80 + 0x40 +
0x20))
     {
       if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
         return false;
       if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
         return false;
       ++x;
     }
     else if ((*x& (0x80 + 0x40 + 0x20 + 0x10 + 0x08)) == (0x80 + 0x40
+ 0x20 + 0x10))
     {
       if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
         return false;
       if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
         return false;
       if (++x == utf8str_end || (*x& (0x80 + 0x40)) != (0x80))
         return false;
       ++x;
     } else
       return false;
   }
   return true;
}