Re: number of bytes for each (uni)code point while using utf-8 as
encoding ...
Daniele Futtorovic wrote:
lbrt chx _ gemale allegedly wrote:
>
>>> How can you get the number of bytes you "get()"?
>
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
> ~
> What about files which (their authors) claim to be UTF-8 encoded but
> aren't, and/or get somehow corrupted in transit? There are quite a few
> "monkeys" (us) messing with the metadata headers of HTML pages
> ~
> Sometimes you must double check every file you keep in a text bank/corpus,
> because, through associations, one mistake may propagate and create
> other kinds of problems
> ~
>> So you could just build a map char -> size /a priori/.
> ~
> ...
> ~
>> But really, what's the use? ...
> ~
> To you there is none, but I am trying to pinpoint as closely as I
> possibly can:
> ~
> .onMalformedInput(CodingErrorAction.REPORT);
> .onUnmappableCharacter(CodingErrorAction.REPORT);
> ~
> errors
> ~
> There should be a way to get sizes as you get UTF-8 encoded sequences
> from a file. Also, I have found that quite a few files get corrupted in
> transmission, and sometimes I wonder how safe that naive mapping you
> mention is, since those file formats don't have any kind of built-in
> error correction measures
And what's that knowledge about the mapping size going to tell you?
Assume the file is corrupted. Then you can't know the original character
(since it's corrupted). Hence even if you know to how many bytes each
character maps, you can't tell whether the size you're seeing is wrong
or right.
At least that's how it seems to me.
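For reference, the a priori mapping mentioned above doesn't even need a table, since the UTF-8 length follows directly from the code point's value. This is a minimal sketch; the class and method names (Utf8Size, utf8Length) are made up for illustration. Note that it works on code points rather than Java chars, because characters outside the BMP take two chars in a String.

```java
class Utf8Size {
    /** Returns how many bytes UTF-8 needs to encode one Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            // Lone surrogates are not valid in well-formed UTF-8.
            throw new IllegalArgumentException("surrogates have no UTF-8 encoding");
        }
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        // 'a', e-acute, euro sign, and U+1F600 (a surrogate pair in Java)
        String text = "a\u00e9\u20ac\uD83D\uDE00";
        text.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp)));
    }
}
```

The four sample code points take 1, 2, 3 and 4 bytes respectively. That tells you what a correct encoding looks like, but nothing about whether the bytes you received are the ones that were sent.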
Even the malformedness is no reliable indicator. Your data might get
corrupted and the outcome be well-formed, as far as the character
encoding is concerned.
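To illustrate the point, here is a small sketch (the class name SilentCorruption and the helper isWellFormedUtf8 are invented for the example) that uses the same CodingErrorAction.REPORT settings quoted above. One flipped bit turns one valid character into another valid character and goes unreported, while a different flipped bit in the same byte is caught.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SilentCorruption {
    /** True if a strict (REPORT) UTF-8 decoder accepts the bytes. */
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00E9 (e-acute) encodes as the two bytes 0xC3 0xA9.
        byte[] original = "\u00e9".getBytes(StandardCharsets.UTF_8);

        byte[] silent = original.clone();
        silent[1] ^= 0x01; // 0xA9 -> 0xA8: still a continuation byte

        byte[] loud = original.clone();
        loud[1] ^= 0x40;   // 0xA9 -> 0xE9: no longer a continuation byte

        System.out.println("silent well-formed: " + isWellFormedUtf8(silent));
        System.out.println("silent decodes to U+00E8: "
                + new String(silent, StandardCharsets.UTF_8).equals("\u00e8"));
        System.out.println("loud well-formed: " + isWellFormedUtf8(loud));
    }
}
```

The "silent" case decodes without complaint to e-grave instead of e-acute, so the decoder's error reporting only catches the subset of corruptions that happen to break the encoding rules.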
I have to agree with Lew. Only the transmission layer can reliably
tackle this problem. Just pass a checksum and be done with it.
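A checksum comparison could look roughly like the sketch below. It assumes the sender publishes a CRC-32 of the file alongside it; the class name TransferCheck and the sample bytes are made up, and a cryptographic digest via java.security.MessageDigest would be the choice if tampering rather than accidental damage were the concern.

```java
import java.util.zip.CRC32;

class TransferCheck {
    /** CRC-32 of the given bytes, as computed by java.util.zip.CRC32. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = {(byte) 0xC3, (byte) 0xA9};     // U+00E9 in UTF-8
        byte[] received = {(byte) 0xC3, (byte) 0xA8}; // one bit flipped, still valid UTF-8

        long expected = crc32(sent); // hypothetical: value shipped with the file
        long actual = crc32(received);

        System.out.println(expected == actual
                ? "checksums match"
                : "checksum mismatch: file was altered in transit");
    }
}
```

Here the corruption is caught even though the received bytes are perfectly well-formed UTF-8, which is exactly the case the decoder cannot see.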
Even the file being corrupt has no bearing on the correctness of the Java
code. The file itself may actually be corrupt while the Java code is
working perfectly.
--
Lew