Re: New utf8string design may make UTF-8 the superior encoding

From: Joshua Maurice <joshuamaurice@gmail.com>
Newsgroups: comp.lang.c++,microsoft.public.vc.mfc
Date: Mon, 17 May 2010 14:58:27 -0700 (PDT)
Message-ID: <0c8e10e8-88c6-4454-8b6d-b11746bc9162@34g2000prs.googlegroups.com>

On May 17, 12:25 pm, I V <ivle...@gmail.com> wrote:

> On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
>
> > Do you know of any faster way to validate and divide a UTF-8 sequence
> > into its constituent code point parts than a regular expression
> > implemented as a finite state machine? (please don't cite a software
> > package, I am only interested in the underlying methodology).
>
> A finite state machine sounds like a good plan, but I'd be a bit
> surprised if a regular expression was faster than a state machine
> specifically written to parse UTF-8. Aside from the unnecessary
> generality of regular expressions (I don't really know if that would
> actually make them slower in this case), I would guess a regular
> expression engine wouldn't take advantage of the way that UTF-8 encodes
> the meaning of each byte (single-byte codepoint, first byte of multi-byte
> code-point, or continuation of a multi-byte codepoint) in the
> most-significant two bits of the byte.

All of this talk of regular expressions, finite state machines, etc.
sounds like a bit of overkill to me.

Can't you just do something like the following? I understand that it
is in fact a finite state machine, but it uses no frameworks, no
regular expressions, etc. I'd expect it to be pretty good in terms of
speed and readability. It would be quite simple to add some bit
operations to convert the UTF-8 byte array to Unicode code points.

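// COMPLETELY UNTESTED
// Checks only the lead/continuation byte structure; it does not reject
// overlong encodings, surrogates, or values above U+10FFFF.
bool validate_utf8(const unsigned char * utf8str_start, const unsigned char * utf8str_end)
{
  for (const unsigned char * x = utf8str_start; x != utf8str_end; )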
  {
    if ((*x & 0x80) == 0)
    {
      ++x;
    }
    else if ((*x & (0x80 + 0x40 + 0x20)) == (0x80 + 0x40))
    {
      if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
        return false;
      ++x;
    }
    else if ((*x & (0x80 + 0x40 + 0x20 + 0x10)) == (0x80 + 0x40 + 0x20))
    {
      if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
        return false;
      if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
        return false;
      ++x;
    }
    else if ((*x & (0x80 + 0x40 + 0x20 + 0x10 + 0x08)) == (0x80 + 0x40 + 0x20 + 0x10))
    {
      if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
        return false;
      if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
        return false;
      if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
        return false;
      ++x;
    } else
      return false;
  }
  return true;
}
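
For the conversion to code points mentioned above, here is a rough sketch
of what the bit operations might look like. The name decode_utf8 and the
std::vector output parameter are my own choices for illustration, not
anything standard. Unlike the validator above, this version also rejects
overlong encodings, surrogates (U+D800 to U+DFFF), and values above
U+10FFFF. Treat it as a starting point rather than a vetted implementation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (name and signature are illustrative assumptions).
// Decodes [first, last) into code points appended to `out`.
// Returns false at the first malformed sequence; `out` then holds only
// the code points decoded before the error.
bool decode_utf8(const unsigned char* first, const unsigned char* last,
                 std::vector<std::uint32_t>& out)
{
  while (first != last)
  {
    const unsigned char lead = *first++;
    std::uint32_t cp;      // code point being assembled
    int trailing;          // number of continuation bytes expected
    std::uint32_t min_cp;  // smallest value a sequence of this length may encode

    if (lead < 0x80)                { cp = lead;        trailing = 0; min_cp = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; min_cp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; min_cp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; min_cp = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte

    for (; trailing > 0; --trailing)
    {
      if (first == last || (*first & 0xC0) != 0x80)
        return false;  // truncated input or missing continuation byte
      // Each continuation byte contributes its low six bits.
      cp = (cp << 6) | (*first++ & 0x3Fu);
    }

    if (cp < min_cp) return false;                   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
    out.push_back(cp);
  }
  return true;
}
```

Called as decode_utf8(buf, buf + len, out), it returns true and fills out
with one entry per code point when the input is well formed. The min_cp
check is what catches sequences such as 0xC0 0x80, which have the right
byte structure but encode a value that fits in a shorter form.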