Re: New utf8string design may make UTF-8 the superior encoding

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++,microsoft.public.vc.mfc

Date:

Wed, 19 May 2010 10:23:01 -0700 (PDT)

Message-ID:

<0312b6a1-1cb7-45fe-8d7c-c1aa76d13298@v37g2000vbv.googlegroups.com>

On May 19, 4:14 pm, Peter Olcott <NoS...@OCR4Screen.com> wrote:

On 5/19/2010 5:39 AM, James Kanze wrote:

On May 18, 8:17 pm, Peter Olcott<NoS...@OCR4Screen.com> wrote:

On 5/18/2010 9:34 AM, James Kanze wrote:

On 17 May, 14:08, Peter Olcott<NoS...@OCR4Screen.com> wrote:

On 5/17/2010 1:35 AM, Mihai N. wrote:

a regular expression implemented as a finite state
machine is the fastest and simplest possible way of every
way that can possibly exist to validate a UTF-8 sequence
and divide it into its constituent parts.

It all depends on the formal specification; one of the
characteristics of UTF-8 is that you don't have to look at
every character to find the length of a sequence. And
a regular expression generally will have to look at every
character.

Validation and translation to UTF-32 concurrently can not be
done faster than a DFA recognizer, in fact it must always be
slower.

UTF-8 was designed intentionally in a way that it doesn't
require a complete DFA to handle, but can be handled faster.
Complete DFA's are usually slower than caluculations on modern
processors, since they require memory accesses, and memory is
often the limiting factor.

In fact, there is no "must always be slower". There are too
many variables involved to be able to make such statements.

This is the essence of my optimal design, try and show one
that is faster for matching 50 keywords.

I have to see yours, first. (I posted mine. I won't claim that
nothing is faster, but it does get the job done pretty quickly.)

Looking up an ActionCode switch statement value based on
a state_transition_matrix that is indexed by current_state and
current_input_byte.

unsigned char States[8][256];

The above is the state transition matrix for validating and
translating UTF-8 to UTF-32.

Plus the action table. And executing the action. And... Hand
waving is always easier (and faster) than real code.

And I've implemented DFA's before. I've actually written code
which generates a DFA from a regular expression. You don't have
to explain the obvious to me.

The above design completely proves my point to everyone with
sufficent knowledge of UTF-8 and DFA state transition
matrices.

Design never proves anything with regards to speed, at least as
long as the big-O is the same.

I will also add that I only have twelve ActionCodes including
InvalidByteError and the OutOfdata sentinel.

And how do you act on an ActionCode? Switch statements and
indirect jumps are very, very slow on some machines (but not on
others).

--
James Kanze