Re: Binary or text file

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

12 May 2007 02:18:04 -0700

Message-ID:

<1178961484.711558.105040@p77g2000hsh.googlegroups.com>

On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...

In practice, today, ASCII is pretty much nonexistent; most text
is in some other encoding.

Really? Most text files I see don't have any characters beyond the
ASCII set, which would make them ASCII.

Really. You must live a very parochial life. I find accented
characters pretty regularly in my files (including in C++ source
files). And ASCII doesn't have any accented characters.

You're reading this thread; there are non-ASCII characters in
the messages in it. (Check out my signature, for example.)
Practically, if you're connected to the network, you can forget
about ASCII; you have to be able to handle a large number of
different character encodings.

.... A file in UTF-32LE, for example,
with English text, will have close to 3/4 of the bytes 0. You
can still try some heuristics: if you have a file with 1 byte
non-0, then three 0's, and that pattern repeats, with few
exceptions, there's a very good chance that it is UTF-32LE. But
it's more complicated (and overall, less reliable) than back in
the days when everything was ASCII.
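
The byte-pattern heuristic described above could be sketched roughly as follows. This is only an illustration, not a definitive detector: the function name `looksLikeUtf32Le` and the 0.9 threshold are assumptions made for the example, and the test only recognizes text whose code points fit in one byte (ASCII and Latin-1 range), which is the case the paragraph describes.

```cpp
#include <cstddef>
#include <string>

// Heuristic sketch (hypothetical helper, not from any library): returns true
// if most 4-byte groups in the buffer look like "one non-zero byte followed
// by three zero bytes", the pattern English text produces in UTF-32LE.
// The default threshold of 0.9 is an arbitrary assumption.
bool looksLikeUtf32Le(const std::string& bytes, double threshold = 0.9)
{
    // Only complete 4-byte groups are examined; a trailing partial group
    // (for example from a truncated sample) is ignored.
    const std::size_t groups = bytes.size() / 4;
    if (groups == 0) {
        return false;   // not enough data to say anything
    }

    std::size_t matching = 0;
    for (std::size_t i = 0; i < groups; ++i) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[4 * i]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[4 * i + 1]);
        const unsigned char b2 = static_cast<unsigned char>(bytes[4 * i + 2]);
        const unsigned char b3 = static_cast<unsigned char>(bytes[4 * i + 3]);
        if (b0 != 0 && b1 == 0 && b2 == 0 && b3 == 0) {
            ++matching;
        }
    }

    // "With few exceptions": accept if the fraction of matching groups
    // reaches the threshold.
    return static_cast<double>(matching) / groups >= threshold;
}
```

Called on the start of a file read in binary mode, this returns true for English text stored as UTF-32LE and false for the same text stored as plain single-byte characters. Text in other scripts would use the second or third byte of each group, so a false result does not rule out UTF-32LE.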

I have yet to see a UTF-32LE file in the wild.

I haven't either, but I know that they exist. I've also created
a few for test purposes.

Even the UTF-16 files I've seen are few and far between.

Curious. From what I understand, UTF-16 is the standard
encoding under Windows. And machines running Windows aren't
exactly "few and far between".

I'd like to believe that utf-8
will become the default text format

I would too, but given the legacy that has to be taken into
account, I don't realistically expect it to happen any time
soon.

and there are a few tests to
determine the likelihood of a file being utf-8 (and no, it's probably
not a BOM at the beginning of the file).

Actually, UTF-8 isn't that difficult. If the first 500 or so
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.
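
A check along those lines might look something like the sketch below. The name `prefixIsValidUtf8` and the default limit of 512 bytes are assumptions for the example (the limit stands in for "500 or so"). The sketch rejects stray continuation bytes, invalid lead bytes, overlong forms, surrogates and values above U+10FFFF. A sequence that starts inside the limit is allowed to finish past it, so the cut-off does not itself look like an error.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Sketch (hypothetical helper): returns true if no illegal UTF-8 sequence
// starts within the first `limit` bytes of the buffer.
bool prefixIsValidUtf8(const std::string& bytes, std::size_t limit = 512)
{
    const std::size_t end = std::min(bytes.size(), limit);
    std::size_t i = 0;
    while (i < end) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t extra = 0;      // continuation bytes expected
        unsigned long cp = 0;       // code point being decoded
        unsigned long minCp = 0;    // smallest value legal for this length
        if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; minCp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minCp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; minCp = 0x10000;
        } else {
            return false;           // stray continuation or invalid lead byte
        }

        if (i + extra >= bytes.size()) {
            return false;           // data ends in the middle of a sequence
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;       // expected a 10xxxxxx continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, UTF-16 surrogates and out-of-range values.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += extra + 1;
    }
    return true;
}
```

Note that pure ASCII input always passes, since ASCII is a subset of UTF-8. The test is only informative when the sample contains bytes above 0x7F: text in ISO 8859-1, for instance, fails almost immediately, because an accented letter is a lone byte that is seldom followed by a valid continuation byte.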

--
James Kanze (Gabi Software) email: james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34