Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

From:
Lew <lewbloch@gmail.com>
Newsgroups:
comp.lang.java.programmer
Date:
Tue, 10 Jul 2012 14:17:59 -0700 (PDT)
Message-ID:
<d18b8ea9-1ec7-4098-9b77-eff3500bc14f@googlegroups.com>
Daniele Futtorovic wrote:

lbrt chx _ gemale allegedly wrote:
>
>>> How can you get the number of bytes you "get()"?
>
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
> ~
> What about files which (authors) claim to be UTF-8 encoded but aren't, and/or get somehow corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of html pages

> ~
> Sometimes you must double-check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems
> ~
>> So you could just build a map char -> size /a priori/.
> ~
> ...
> ~
>> But really, what's the use? ...
> ~
> to you there is none, but I am trying to pinpoint the closest I possibly can:
> ~
> .onMalformedInput(CodingErrorAction.REPORT);
> .onUnmappableCharacter(CodingErrorAction.REPORT);
> ~
> errors
> ~
> There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also I have found that quite a few files get corrupted while in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error-correction measures
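
[For what it's worth: the size-per-code-point mapping is fixed by the UTF-8 encoding rules, so you can compute it directly instead of building a table. A quick illustrative sketch, not from the thread; the class and method names are my own:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Sizes {
    /** Number of bytes UTF-8 uses to encode a single code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80)    return 1;  // U+0000..U+007F (ASCII)
        if (codePoint < 0x800)   return 2;  // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // rest of the BMP
        return 4;                           // supplementary planes
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2), U+4E2D (3), U+1F600 (4) => 10 bytes total
        String s = "a\u00e9\u4e2d\ud83d\ude00";
        int predicted = s.codePoints().map(Utf8Sizes::utf8Length).sum();
        int actual = s.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(predicted + " == " + actual); // 10 == 10
    }
}
```

Note this only tells you what the sizes *should* be for intact data; as discussed below, it cannot detect corruption.]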

 
And what's that knowledge about the mapping size going to tell you?

Assume the file is corrupted. Then you can't know the original character (since it's corrupted). Hence even if you know to how many bytes each character maps, you can't tell whether the size you're seeing is wrong or right.

At least that's how it seems to me.

Even the malformedness is no reliable indicator. Your data might get corrupted and the outcome be well-formed, as far as the character encoding is concerned.
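
[To illustrate that point: a single corrupted byte can still be perfectly valid UTF-8, so even a strict decoder configured with CodingErrorAction.REPORT sails right through it. A small sketch of my own, not code from the thread:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SilentCorruption {
    public static void main(String[] args) throws CharacterCodingException {
        byte[] bytes = "cat".getBytes(StandardCharsets.UTF_8);
        bytes[0] = 'b';  // one byte corrupted "in transit" -- still valid UTF-8

        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        // No exception is thrown; the corrupted data decodes cleanly.
        System.out.println(strict.decode(ByteBuffer.wrap(bytes))); // prints "bat"
    }
}
```
]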
 
I have to agree with Lew. Only the transmission layer can reliably
tackle this problem. Just pass a checksum and be done with it.
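
[A sketch of that checksum approach using the JDK's java.util.zip.CRC32; the class and helper names here are my own illustration, not anything proposed in the thread:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    /** CRC-32 of the raw bytes, computed before sending and again on receipt. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "Hello, UTF-8 world".getBytes(StandardCharsets.UTF_8);
        long expected = crc32(original);

        byte[] corrupted = original.clone();
        corrupted[3] ^= 0x01;  // flip one bit "in transit"

        // The mismatch reveals the corruption that encoding checks would miss.
        System.out.println(crc32(corrupted) == expected); // prints "false"
    }
}
```

The sender transmits the checksum alongside the file; the receiver recomputes it and compares. Unlike inspecting byte sizes, this catches corruption even when the damaged bytes still form valid UTF-8.]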


Even the file being corrupt has no bearing on the correctness of the Java
code. The file itself may actually be corrupt and the Java code still
work perfectly.

--
Lew
