Re: number of bytes for each (uni)code point while using utf-8 as
 encoding ...
 
On Tuesday, July 10, 2012 12:45:07 PM UTC-7, (unknown) wrote:
> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
 
> >  How can you get the number of bytes you "get()"?
 
> Well, UTF-8 always encodes the same char to the same (number of) bytes,
> doesn't it?
~ 
 What about files that their authors claim are UTF-8 encoded but aren't, and/or that somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
~ 
 Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
~ 
> So you could just build a map char -> size /a priori/.
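 For what it's worth, the "a priori" map doesn't even need to be a table: for well-formed input the size depends only on the numeric range of the code point. A minimal sketch, where the class and method names (Utf8Sizes, utf8Length) are made up for illustration and lone surrogates are assumed not to occur:

```java
import java.nio.charset.StandardCharsets;

class Utf8Sizes {
    // Number of bytes UTF-8 uses for one Unicode code point.
    // Assumes a scalar value (not a lone surrogate); the size
    // depends only on the numeric range of the code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            // Cross-check the range rule against the JDK's own encoder.
            int actual = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X -> %d byte(s), JDK says %d%n",
                    cp, utf8Length(cp), actual);
        }
    }
}
```

 This only answers "how many bytes would this code point take"; it says nothing about whether the bytes actually in the file are valid.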
~ 
 ...
~ 
> But really, what's the use? ...
~ 
 To you there is none, but I am trying to pinpoint as closely as I possibly can:
~ 
  .onMalformedInput(CodingErrorAction.REPORT);
  .onUnmappableCharacter(CodingErrorAction.REPORT);
~ 
 errors
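 One way this could be done, as far as I can tell, is to drive the decoder by hand instead of going through a Reader: with REPORT set, decode() returns an error CoderResult and leaves the input buffer positioned at the start of the offending bytes. A rough sketch under that assumption; Utf8Check and firstBadOffset are hypothetical names, and the whole input is assumed to fit in a byte[]:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns the byte offset of the first malformed or unmappable
    // sequence, or -1 if the whole array decodes cleanly as UTF-8.
    static int firstBadOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = true: a truncated trailing sequence counts as malformed.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error the input position sits at the start of the bad bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: discard the decoded chars and continue
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', 'b', 'c', (byte) 0xFF, 'd'};
        System.out.println("first bad offset: " + firstBadOffset(bad));
    }
}
```

 The decoded characters are thrown away here because only the offset is of interest; result.length() would additionally give the number of bad bytes.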
~ 
 There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error-correction measures.
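 As a rough illustration of getting the sizes while reading, the lead byte of each UTF-8 sequence already announces how long the sequence is. The sketch below (Utf8Scan and its methods are made-up names) walks a byte array and collects those lengths; it is only an assumption-laden sketch, since it does not validate the continuation bytes and simply stops at the first byte that cannot start a sequence:

```java
import java.util.ArrayList;
import java.util.List;

class Utf8Scan {
    // Length of a UTF-8 sequence as announced by its lead byte,
    // or -1 if the byte cannot start a sequence.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;                // 0xxxxxxx
        if (b >= 0xC2 && b <= 0xDF) return 2;  // 110xxxxx (C0, C1 would be overlong)
        if (b >= 0xE0 && b <= 0xEF) return 3;  // 1110xxxx
        if (b >= 0xF0 && b <= 0xF4) return 4;  // 11110xxx (F5 and up exceed U+10FFFF)
        return -1;                             // continuation byte or invalid lead
    }

    // Sizes of consecutive sequences; stops at the first byte that
    // cannot lead a sequence or at a sequence cut off by end of data.
    static List<Integer> sequenceSizes(byte[] data) {
        List<Integer> sizes = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            int n = sequenceLength(data[i]);
            if (n < 0 || i + n > data.length) break;
            sizes.add(n);
            i += n;
        }
        return sizes;
    }
}
```

 Continuation bytes are not checked here, so this complements strict decoding rather than replacing it.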

It isn't the job of the file format to correct errors but of the transmission protocol.
Are you saying "quite a few files get corrupted" when reading directly from disk
or over some other wire protocol? If it's from disk, I'd blame the disk drive, not
Java.
You aren't going to fix a bad disk with good programming.
-- 
Lew