According to Peter Duniho <NpOeStPeAdM@NnOwSlPiAnMk.com>:
My understanding is that you can tell from a single byte in UTF-8
whether it's the end of a character or not. But to identify the
beginning of a character, you need to look for the end of the
_previous_ character.
No, that's not how it works in UTF-8:
-- code points which encode into a single byte yield byte values between
0 and 127 (inclusive);
-- other code points become a sequence of bytes:
** first byte has value between 192 and 247 (inclusive)
** subsequent bytes (one to three extra bytes) have value between
128 and 191 (inclusive)
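As a minimal sketch of those ranges (in C, with a hypothetical helper
name), any single byte can be classified on its own, without looking at
its neighbours:

    #include <stdio.h>

    /* Hypothetical helper: classify one UTF-8 byte by the ranges listed above. */
    static const char *utf8_byte_kind(unsigned char b)
    {
        if (b <= 127)
            return "single-byte code point (ASCII)";
        if (b >= 128 && b <= 191)
            return "continuation byte of a multi-byte sequence";
        if (b >= 192 && b <= 247)
            return "first byte of a multi-byte sequence";
        return "never appears in UTF-8 as currently used";
    }

    int main(void)
    {
        /* 'A' is a single byte; U+00E9 (e with acute accent) is 0xC3 0xA9. */
        unsigned char sample[] = { 'A', 0xC3, 0xA9 };
        for (size_t i = 0; i < sizeof sample; i++)
            printf("0x%02X: %s\n", sample[i], utf8_byte_kind(sample[i]));
        return 0;
    }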
The first byte of a multi-byte sequence also encodes how many extra
bytes follow it. With Unicode as currently defined, no code point
requires more than four bytes: valid code points are in the 0..1114111
range, while allocated code points use about 10% of that range (so
there is still quite some room). Four-byte UTF-8 sequences are good up
to code point 2097151. If a future Unicode version extends the range,
the UTF-8 scheme can be extended to 6-byte encodings, and the first
byte may then assume values 192 to 253. It is a feature of UTF-8 that
byte values 254 and 255 never appear anywhere (this is exploited for
BOM handling, so that UTF-8 and UTF-16 can be told apart unambiguously).
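As a hedged sketch of how the first byte encodes the length of the
sequence (utf8_extra_bytes is a hypothetical name; the 5- and 6-byte
forms belong to the extended scheme described above, not to Unicode as
currently defined):

    /* Hypothetical sketch: number of continuation bytes implied by a lead
     * byte, following the bit patterns of the (extended) UTF-8 scheme. */
    static int utf8_extra_bytes(unsigned char lead)
    {
        if (lead < 0xC0) return -1;   /* 0..191: not a lead byte */
        if (lead < 0xE0) return 1;    /* 110xxxxx: one extra byte */
        if (lead < 0xF0) return 2;    /* 1110xxxx: two extra bytes */
        if (lead < 0xF8) return 3;    /* 11110xxx: three extra bytes */
        if (lead < 0xFC) return 4;    /* 111110xx: 5-byte form, beyond current Unicode */
        if (lead < 0xFE) return 5;    /* 1111110x: 6-byte form, beyond current Unicode */
        return -1;                    /* 254 and 255 never appear in UTF-8 */
    }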
Anyway, the ending byte of the UTF-8 encoding of a code point is not
specially marked; but _starting_ bytes are easy to detect. Hence it is
easy to know whether you are at the start of a code point, or whether
you should go back at least one byte.
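For instance, backing up to the start of the code point containing a
given byte only requires skipping continuation bytes; a minimal sketch
(hypothetical function name):

    #include <stddef.h>

    /* Hypothetical sketch: step back over continuation bytes (top bits 10,
     * i.e. values 128..191) until a byte that starts a code point is found. */
    static size_t utf8_code_point_start(const unsigned char *buf, size_t pos)
    {
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;                    /* continuation byte: keep going back */
        return pos;                   /* buf[pos] now begins a code point */
    }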
Exactly.
A character starts with a byte whose top bits are not 10 (in binary),
i.e. a byte outside the 128..191 range. Those are pretty easy to spot.
A problem well stated is a problem half solved. -- Charles F. Kettering