Re: Designing a Finite State Machine DFA Recognizer for UTF-8

From:
Peter Olcott <NoSpam@OCR4Screen.com>
Newsgroups:
microsoft.public.vc.mfc
Date:
Sun, 23 May 2010 22:33:04 -0500
Message-ID:
<C-idnWXj16LsbmTWnZ2dnUVZ_hGdnZ2d@giganews.com>
I don't want to take the time to reply to all of this point by point.
You did not seem to be rude, so I will at least reply.

The link that you provided seems, superficially, to be very high quality
work. I would guess that this guy's code may beat my code by a little
bit, something like 50% faster.

On 5/23/2010 7:24 PM, Hector Santos wrote:

Peter Olcott wrote:

****
     (4) The case-pair decoder, with the current state and character pair

That is (1) above.

****
No, it is not; it is much flatter and easier to write. I'm sorry you can't tell the
difference, which is actually pretty significant! There is NO "case statement within case
statement" in the model I showed you, only a very flat, single switch statement. Isn't it
obvious by inspection?
                 joe
****
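
To make the difference concrete, here is a small sketch of my own (not
Joe's actual code, and reduced to ASCII plus 2-byte sequences only) of
the flat, single-switch idea: the current state and the byte's class are
packed into one value and dispatched in one switch, with no nesting.

// A small sketch of my own (not Joe's actual code), reduced to ASCII
// plus 2-byte sequences only, of the flat, single-switch idea: the
// current state and the byte's class are packed into one value and
// dispatched in ONE switch, with no switch nested inside another.
#include <cstdint>

enum State { START = 0, NEED_ONE = 1, REJECT = 2 };
enum Class { ASCII = 0, CONT = 1, LEAD2 = 2, OTHER = 3 };

static Class classify(uint8_t byte) {
    if (byte < 0x80)           return ASCII;   // 0xxxxxxx
    if ((byte & 0xC0) == 0x80) return CONT;    // 10xxxxxx
    if ((byte & 0xE0) == 0xC0) return LEAD2;   // 110xxxxx
    return OTHER;                              // lengths not handled here
}

// One transition of the recognizer: returns the next state.
static State step(State state, uint8_t byte) {
    switch ((state << 2) | classify(byte)) {   // pack the (state, class) pair
        case (START    << 2) | ASCII: return START;     // complete code point
        case (START    << 2) | LEAD2: return NEED_ONE;  // expect one trailer
        case (NEED_ONE << 2) | CONT:  return START;     // sequence finished
        default:                      return REJECT;    // invalid input
    }
}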


Although I don't remember it at all, I can imagine it from the details
that you just provided. For small DFAs it may be faster than the method
that I propose; for larger ones it may be slower. I see no way around it
taking more memory than the one that I propose.

****


I think you two are mixing apples and oranges.

And Peter, just because you are limited in your understanding of prior
art does not mean that when you get an AH-HA it becomes "My IDEA,"
"My Proposal," etc.

What you speak of is very common in computer methods, just as others
have provided examples to illustrate the ideas to you, or as you have
read about on some web site, like this one, which shows a very elegant
and sweet state-machine DFA UTF-8 decoder:

       http://bjoern.hoehrmann.de/utf-8/decoder/dfa

and this work includes some empirical speed breakdowns. There are no
switch statements. And if you research the earlier work he did, you
can see the evolution he took to reach this point.

Please note there is no std::vector<> object involved. Pure
primitives.

Not actually. The total amount of space for the method that you proposed would be much larger, because it would either have to repeat the code in my twelve ActionCodes very many times or take the overhead of function calls, which would negate the speed improvement. One can know very good estimates of actual relative performance in advance.


But the point is that it's just a matter of CODE, not really the data.
What is important here is the MAPPING; from the MAPPING, the generated
function can be written many ways, and LANGUAGE is a big factor here.
Again, the idea of higher- vs. lower-level functional prototyping.

For example, a language such as APL is specifically designed for
vector/matrix transformation operations. The CRAY offers an API that is
specifically designed to be optimized for the machine's vectorized I/O.

I have not found that to be the case. Nothing may be 100% completely definitive, but these performance estimates are accurate enough to matter very much.


But what are you really measuring? The technology or the method to
code it (implementation)?

Pointing to ANY two instances of code and saying "this instance
has better performance than that instance" is a meaningless
activity. Only by MEASURING the two instances and saying "According
to the statistical performance data, code A has an average time
of KA units of time with a standard deviation of SDA units, and
code B has an average time of KB units of time with a standard
deviation of SDB units" can you say anything meaningful about
the execution of the code. Anything else is a complete waste
of time.


I understand it and reject it as false.


But you can only say that because of your limited understanding of how
to implement a concept. I can understand that one takes a basic approach
to eliminate factors and arrive at the minimum work requirements, but
the way you are doing it DOES NOT SHOW that you have done this.

For example, you INSIST on using std::vector as if it were the MOST
COMMON approach, and therefore a concept you can factor out of the
measurement, since its performance OVERHEAD would be a constant for all
implementations.

The point is that what you think is BETTER may not be better in the
minds of others. It is this idea that you have the most trouble with,
because you have an ego that presumes it is higher than the rest, and
thus you believe you have an unmatched level of intelligence.

Yes, all that stuff can potentially throw off the estimates. I don't think that it is all that difficult to see which sequences of instructions cannot be correctly executed in parallel.


There you go again. If you were going to ignore all these issues and
believe that what you produce is the common construct for everyone, that
would be one thing, but you have not even done that.

Stop making your silly claims and produce some REAL data we can look at!


I will produce working code very soon. For data to look at, I need a large file of UTF-8 Chinese. The source code will measure and report its own performance. It will take a UTF-8 file as command-line input and produce a UTF-32 file as output. It will time the conversion duration using clock().
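
For what it's worth, the harness described above is straightforward. A
minimal sketch of my own follows; convert() here is only a naive
placeholder for whatever UTF-8 to UTF-32 decoder is actually under test.

// Minimal sketch (my own outline) of the test harness described above.
// convert() is a naive placeholder for the decoder under test.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

// Naive reference conversion, for illustration only (minimal validation).
static std::vector<uint32_t> convert(const std::vector<uint8_t>& utf8) {
    std::vector<uint32_t> out;
    for (size_t i = 0; i < utf8.size(); ) {
        uint8_t b = utf8[i];
        uint32_t cp;
        size_t len;
        if      (b < 0x80)           { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else                         { cp = b & 0x07; len = 4; }
        for (size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (utf8[i + k] & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s input.utf8 output.utf32\n", argv[0]);
        return 1;
    }

    // Read the whole UTF-8 input file into memory.
    std::FILE* in = std::fopen(argv[1], "rb");
    if (!in) { std::perror(argv[1]); return 1; }
    std::vector<uint8_t> utf8;
    int c;
    while ((c = std::fgetc(in)) != EOF) utf8.push_back((uint8_t)c);
    std::fclose(in);

    // Time only the conversion itself, using clock() as described above.
    std::clock_t start = std::clock();
    std::vector<uint32_t> utf32 = convert(utf8);
    std::clock_t stop = std::clock();
    std::printf("converted %lu bytes in %.3f seconds\n",
                (unsigned long)utf8.size(),
                double(stop - start) / CLOCKS_PER_SEC);

    // Write the UTF-32 output in native byte order.
    std::FILE* out = std::fopen(argv[2], "wb");
    if (!out) { std::perror(argv[2]); return 1; }
    std::fwrite(utf32.data(), sizeof(uint32_t), utf32.size(), out);
    std::fclose(out);
    return 0;
}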


The 27th? Or do you need an extension? You know, normally, when a
programmer gives me an estimate, I multiply it by 3. So I'll give you
three weeks.

Also, you should include ICU (the iconv.exe tool) among your tests and
comparisons, along with the API available in the Windows 2008/Vista SDK
for Unicode and UTF-XX conversion, as well as other common API
libraries, because at the END of the DAY any comparison will be
measured against what people use today in their environment.
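
For reference, one Windows SDK API for this is MultiByteToWideChar with
CP_UTF8. The sketch below is my own and only a rough baseline: that call
yields UTF-16, so surrogate pairs still have to be folded into UTF-32
code points afterwards.

// Rough sketch (my own) of a Windows SDK baseline: MultiByteToWideChar
// with CP_UTF8 yields UTF-16, so surrogate pairs still have to be
// folded into UTF-32 code points afterwards.
#include <windows.h>
#include <cstdint>
#include <vector>

std::vector<uint32_t> winapi_utf8_to_utf32(const char* utf8, int nbytes) {
    std::vector<uint32_t> utf32;

    // First call: ask how many UTF-16 units are required.
    int n16 = MultiByteToWideChar(CP_UTF8, 0, utf8, nbytes, NULL, 0);
    if (n16 <= 0) return utf32;
    std::vector<wchar_t> utf16(n16);
    MultiByteToWideChar(CP_UTF8, 0, utf8, nbytes, &utf16[0], n16);

    // Second pass: fold surrogate pairs into single code points.
    for (int i = 0; i < n16; ++i) {
        uint32_t u = utf16[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n16) {   // high surrogate
            uint32_t lo = utf16[++i];                      // low surrogate
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
        utf32.push_back(u);
    }
    return utf32;
}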

Unless there are real significant gains, I mean outstandingly
significant, the worth of any apparent improvement done by you that
wasn't already done will be ignored.

Most experts cannot look at the generated machine code and /a priori/ predict behavior,
but you seem to claim you can do this just by looking at C/C++ source! Get a grip on life,
man, and realize you haven't a CLUE how long it takes a loop to execute on a REAL machine
simply by looking at the C/C++ source (or even at the generated machine code).


No, I don't bother to do this. It is most often the case that when one loop has triple the assembly language instructions (accomplishing the same functional end result), it takes longer than the one with fewer. Because of all the things you cited, I would tend to agree that estimating how much longer it takes would be fruitless; only measurement can tell this.


Again, look at the link above, and also find his original version and
the evolution. I believe he made a good point that there were no
performance gains other than that it looks more elegant. I'll show it
here.

In the first version he had two vectors (tables), but he merged them,
and the 2nd version's decode() function was:

void decode(uint8_t* s, uint32_t* unip) {
   uint8_t byte, stat, data;
   uint32_t unic;

   stat = 9;
   unic = 0;

   while ((byte = *s++)) {
     data = d_combined[byte];
     stat = d_combined[256 + (stat << 4) + (data >> 4)];
     byte ^= (data << 4);
     unic = (unic << 6) | byte;

     if (!stat) {
       *unip = unic;
       unic = 0;
     }
   }
}

which he reduced to:

uint32_t inline
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
   uint32_t type = utf8d[byte];

   *codep = (*state != UTF8_ACCEPT) ?
     (byte & 0x3fu) | (*codep << 6) :
     (0xff >> type) & (byte);

   *state = utf8d[256 + *state*16 + type];
   return *state;
}
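
To put the reduced version in context, it is meant to be driven one byte
at a time by the caller. The loop below is my own sketch of such a
caller, assuming the utf8d table and the UTF8_ACCEPT / UTF8_REJECT
constants defined on the linked page.

// Sketch of a caller for the reduced decode() above (my own outline),
// assuming the utf8d table and the UTF8_ACCEPT / UTF8_REJECT constants
// from the linked page.
void to_codepoints(const uint8_t* s, size_t n, uint32_t* out, size_t* outlen) {
   uint32_t state = UTF8_ACCEPT, codep = 0;
   size_t k = 0;
   for (size_t i = 0; i < n; ++i) {
     decode(&state, &codep, s[i]);
     if (state == UTF8_ACCEPT)        // a complete code point was assembled
       out[k++] = codep;
     else if (state == UTF8_REJECT)   // invalid sequence: restart the DFA
       state = UTF8_ACCEPT;
   }
   *outlen = k;
}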

I would bet that 90% of the time, in the case of compiler-generated code (not hand-tweaked assembly language), if one loop was coded to produce the same functional result as another and has triple the assembly language instructions, it will take more time than the one with fewer instructions.


DUH!

But that is a general idea and has nothing to do with what you are
doing. Again, this is more about WHAT YOU, PETER OLCOTT, are capable
of, not what was already possible and already done.

Again, you want to use std::vector; others will not use it, or more
likely will expose it for usage as a wrapper over the primitives. In
other words, you will have overloads.

I am not talking about code that was specifically designed to form a counter-example to this claim. I am talking about typical code as it is written.


Exactly! But the typical code, in your own mind, for how to do the XYZ
idea is very limited. That's one of the major beefs with your outrageous
statements. It's this way or NO way.

You couldn't predict that the OS would not swap out the pages of my data, yet I could and did make this accurate prediction. I knew that the OS had far more RAM than it needed, and thus had no good reason to swap out this data.


Pedro, that is really an inaccurate rationalization developed over the
last 3-4 months, when the facts were that you lacked understanding of
the virtual memory concept, and it seems you are still in denial.

Let's try it this way; it's like this in simple terms:

    YOU ARE FREE UNTIL THE OS SEES YOU NO LONGER HAVE ROOM
    TO BE FREE.

You were told that you have to use SPECIAL API functions to give the OS
the "hint" that you wish certain memory blocks to be non-pageable, and,
for special cases, there are APIs that apply when you have special
multi-core hardware, i.e. NUMA.
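
On Windows, one such "hint" is VirtualLock, which asks that a region
stay resident. The sketch below is my own example, not code from this
thread, and the sizes and names in it are only illustrative.

// Minimal sketch (my own example, not code from this thread) of such a
// "hint" on Windows: VirtualLock asks that the pages stay resident.
#include <windows.h>
#include <stdio.h>

int main(void) {
    const SIZE_T size = 64 * 1024;   // e.g. 64 KB of table data
    void* table = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                               PAGE_READWRITE);
    if (!table) { fprintf(stderr, "VirtualAlloc failed\n"); return 1; }

    // Request that these pages be kept in physical memory. This is a
    // request, not a guarantee, and it is bounded by the working-set size.
    if (!VirtualLock(table, size))
        fprintf(stderr, "VirtualLock failed: %lu\n", GetLastError());

    /* ... use the table ... */

    VirtualUnlock(table, size);
    VirtualFree(table, 0, MEM_RELEASE);
    return 0;
}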

But in the "typical code" sense as you speak of, using standard memory
allocation (whether you do it explicitly or not such as with a class,
does not matter), will not provide you this FREEDOM and that is what
you were MIXING up. You wanted freedom and believe that if you didn't
pressure the system than there was NO reason to lose this freedom.

So if you are testing and have not put any load (pressure) on your
system, the odds are very good you will see an illusion of this
unrestricted non-paging freedom.

But like most systems of the world, once the "chaos" begins, you will
tend to lose the freedom you thought to be absolute - it's not.

So when you talk about "typical coding," well, using the special API is
not typical coding. Most people will allow the OS memory management
to deal with application code and data needs.

If you want to make claims about the performance of a particular
implementation you have in mind, you need to specify what the
boundary conditions and presumptions are for the process environment.

One example, you have to stipulate:

    These test results are based on the idea that the data
    tables are non-pageable memory blocks and that data
    locality overhead differentials are not a factor, or are
    common and practically negligible. In other words,
    memory I/O is the same.

    These test results are based on the compilation using
    XYZ compiler version #.# with the following optimization
    switches.

    These test results assume that no USER LEVEL memory
    allocation is done and that data arrays are implicitly
    managed by the data array objects. In this case,
    std::vector is used for the data arrays.

And so on.

The beef I have is that these are such trivial ideas. What are you
trying to prove? Is your claim or proposal basically:

      "A method using std::vector for a UTF-8 to UTF-32 encoder
       which offer performances equal or better as the most
       acceptable methods today."

Whatever those XYZ methods are, one would venture it will be the method
the person needing this stuff is already using, be it some 3rd party
library, the platform OS API if any, or his own. He might be interested
in your method IFF he feels the method he is using is currently slow,
but acceptable until something betters it.

So he might try your method, which means switching his data to
std::vector.

But the purist will look at that and say:

     "If it's based on std::vector, then I can easily convert it
      using C-based array primitives."

The odds are very good that will happen if they don't want to use the
STL library. And he might do a quick test and compare to see what
"secret" (and there isn't any) using std::vector provides in speed that
is now lost in a more primitive functional prototype version. Even if
he sees a slight performance gain (and he won't, unless he coded it
wrong), he might ignore that gain in order to get the "productivity"
gain of using an interface layer better suited to him.

You need to understand that MANY practical engineers think in this
way. If they are going to use a new technology, it had better be worth
the pains of rewriting code in order to fit it in.

--
HLS
