On May 17, 12:25 pm, I V<ivle...@gmail.com> wrote:
On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
Do you know of any faster way to validate and divide a UTF-8 sequence
into its constituent code point parts than a regular expression
implemented as a finite state machine? (please don't cite a software
package, I am only interested in the underlying methodology).
A finite state machine sounds like a good plan, but I'd be a bit
surprised if a regular expression was faster than a state machine
specifically written to parse UTF-8. Aside from the unnecessary
generality of regular expressions (I don't really know if that would
actually make them slower in this case), I would guess a regular
expression engine wouldn't take advantage of the way that UTF-8 encodes
the meaning of each byte (single-byte code point, first byte of a multi-byte
code point, or continuation of a multi-byte code point) in the most-
significant two bits of the byte.
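
To make the point about the high bits concrete, here is a minimal sketch in Python. The function name classify_byte is made up for illustration and does not come from any post in this thread. It only looks at the top bits of a byte, so it does not reject lead bytes that can never appear in valid UTF-8 (0xC0, 0xC1, 0xF5 to 0xFF).

```python
def classify_byte(b: int) -> str:
    """Classify one UTF-8 byte by its most-significant bits (hypothetical helper)."""
    if (b & 0x80) == 0x00:      # 0xxxxxxx: single-byte (ASCII) code point
        return "single"
    if (b & 0xC0) == 0x80:      # 10xxxxxx: continuation byte
        return "continuation"
    return "lead"               # 11xxxxxx: first byte of a multi-byte sequence


# The three bytes of the euro sign U+20AC are E2 82 AC.
print([classify_byte(b) for b in b"A\xe2\x82\xac"])
```

A general-purpose regular expression engine would normally compare each byte against character ranges rather than use this bit test directly.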
This sounds like a bit of overkill to me, all of this talk of regular
expressions, finite state machines, etc.
Can't you just do something like the following? I understand that it
is a finite state machine in fact, but it uses no frameworks, no
regular expressions, etc. I'd expect that this is pretty good in terms
of speed and readability. It would be quite simple to add some code
using bit operations to convert from the utf8 array to Unicode code
points.
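
The code from that post is not reproduced here. As a stand-in, the following is a rough sketch of what a hand-written validator and decoder along those lines might look like, in Python for brevity. The name decode_utf8 and the choice to raise ValueError are assumptions for illustration, not the poster's code. It rejects overlong forms, surrogates (U+D800 to U+DFFF) and values above U+10FFFF.

```python
def decode_utf8(data: bytes) -> list:
    """Validate UTF-8 and return its code points (illustrative sketch only)."""
    code_points = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # Pick sequence length, payload bits of the lead byte, and the
        # smallest code point that legitimately needs that length.
        if b0 < 0x80:
            length, cp, minimum = 1, b0, 0x0
        elif 0xC2 <= b0 <= 0xDF:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, length):
            b = data[i + k]
            if (b & 0xC0) != 0x80:          # must be 10xxxxxx
                raise ValueError(f"bad continuation byte at offset {i + k}")
            cp = (cp << 6) | (b & 0x3F)     # append six payload bits
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"overlong, surrogate, or out-of-range at offset {i}")
        code_points.append(cp)
        i += length
    return code_points


print([hex(cp) for cp in decode_utf8("a\u20ac".encode("utf-8"))])
```

The lead byte selects the "state" (how many continuation bytes are still expected), so this is still a finite state machine, just written as ordinary control flow.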
The finite state machine's detailed design is now completed. Its state
transition matrix only takes 2048 bytes. It will be faster than any
other possible method.
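
The post does not say how the 2048-byte matrix is laid out. One layout that happens to give exactly that size is 8 states times 256 byte values at one byte per entry, with the error state encoded as a value that needs no row of its own. The sketch below (Python, all names hypothetical) is an illustration under that assumption and is not the posted design.

```python
# States: START = between code points; C1/C2/C3 = that many plain
# continuation bytes still expected; E0/ED/F0/F4 = just saw that lead byte,
# whose next byte has a restricted range. ERROR is a sink with no table row.
START, C1, C2, C3, E0, ED, F0, F4, ERROR = range(9)


def build_table():
    table = [bytearray([ERROR]) * 256 for _ in range(8)]

    def fill(state, lo, hi, target):
        for b in range(lo, hi + 1):
            table[state][b] = target

    fill(START, 0x00, 0x7F, START)
    fill(START, 0xC2, 0xDF, C1)
    fill(START, 0xE0, 0xE0, E0)
    fill(START, 0xE1, 0xEC, C2)
    fill(START, 0xED, 0xED, ED)
    fill(START, 0xEE, 0xEF, C2)
    fill(START, 0xF0, 0xF0, F0)
    fill(START, 0xF1, 0xF3, C3)
    fill(START, 0xF4, 0xF4, F4)
    fill(C1, 0x80, 0xBF, START)
    fill(C2, 0x80, 0xBF, C1)
    fill(C3, 0x80, 0xBF, C2)
    fill(E0, 0xA0, 0xBF, C1)   # excludes overlong 3-byte forms
    fill(ED, 0x80, 0x9F, C1)   # excludes surrogates U+D800..U+DFFF
    fill(F0, 0x90, 0xBF, C2)   # excludes overlong 4-byte forms
    fill(F4, 0x80, 0x8F, C2)   # excludes values above U+10FFFF
    return table


TABLE = build_table()


def split_utf8(data: bytes):
    """Return the byte sequence of each code point, or None if invalid."""
    state, start, parts = START, 0, []
    for i, b in enumerate(data):
        state = TABLE[state][b]
        if state == ERROR:
            return None
        if state == START:          # a code point just ended
            parts.append(data[start:i + 1])
            start = i + 1
    return parts if state == START else None


print(sum(len(row) for row in TABLE))
print(split_utf8("a\u20ac".encode("utf-8")))
```

The inner loop is one table lookup per input byte. Whether that beats a hand-written decoder in practice depends on the language, compiler and cache behaviour, so it would need measuring.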
expression. I am somewhat enamored with DFA recognizers. I love them. I
will post the source code when it is completed.
much as more than 100-fold. I posted the code to this group. It was