Re: Binary or text file
On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:
On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...
In practice, today, ASCII is pretty much nonexistent; most text
is in some other encoding.
Really? Most text files I see don't have any characters beyond the
ASCII set, which would make them ASCII.
Really. You must live a very parochial life. I find accented
characters pretty regularly in my files (including in C++ source
files). And ASCII doesn't have any accented characters.
You're reading this thread; there are non-ASCII characters in
the messages in it. (Check out my signature, for example.)
Practically, if you're connected to the network, you can forget
about ASCII; you have to be able to handle a large number of
different character encodings.
.... A file in UTF-32LE, for example,
with English text, will have close to 3/4 of the bytes 0. You
can still try some heuristics: if you have a file with 1 byte
non-0, then three 0's, and that pattern repeats, with few
exceptions, there's a very good chance that it is UTF-32LE. But
it's more complicated (and globally, less reliable) than back in
the days when everything was ASCII.
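As a rough illustration of the heuristic just described, here is a minimal C++ sketch. The function name looksLikeUtf32le and the 0.9 threshold are assumptions made up for the example, not anything from the thread; it only counts 4-byte groups of the form "one non-zero byte, then three zero bytes", so it fits mostly-English text and nothing more.

```cpp
#include <cstddef>
#include <string>

// Heuristic only: returns true when most complete 4-byte groups look like
// "one non-zero byte followed by three zero bytes", the pattern that
// English text produces in UTF-32LE.
// The default threshold of 0.9 is an arbitrary assumption for this sketch.
bool looksLikeUtf32le(const std::string& bytes, double threshold = 0.9)
{
    const std::size_t groups = bytes.size() / 4;   // trailing bytes are ignored
    if (groups == 0) {
        return false;                              // too short to judge
    }
    std::size_t matching = 0;
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t i = 4 * g;
        if (bytes[i] != '\0'
                && bytes[i + 1] == '\0'
                && bytes[i + 2] == '\0'
                && bytes[i + 3] == '\0') {
            ++matching;
        }
    }
    return static_cast<double>(matching)
        >= threshold * static_cast<double>(groups);
}
```

Text with many characters above U+00FF would need a looser test (for example, only requiring the top two bytes of each group to be small), and a file in UTF-32BE would need the byte positions mirrored.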
I have yet to see a UTF-32LE file in the wild.
I haven't either, but I know that they exist. I've also created
a few for test purposes.
Even the UTF-16 files I've seen are few and far between.
Curious. From what I understand, UTF-16 is the standard
encoding under Windows. And machines running Windows aren't
exactly "few and far between".
I'd like to believe that utf-8
will become the default text format
I would too, but given the legacy that has to be taken into
account, I don't realistically expect it to happen any time
soon.
and there are a few tests to
determine the likelihood of a file being utf-8 (and no, it's probably
not a BOM at the beginning of the file).
Actually, UTF-8 isn't that difficult. If the first 500 or so
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.
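The check above might be sketched in C++ along the following lines. The name looksLikeUtf8 and the default limit of 500 are illustrative assumptions; the function scans the start of the data and rejects stray continuation bytes, overlong forms, surrogates and values above U+10FFFF. A sequence cut off by the sample limit is accepted, while one cut off by the real end of the data is not.

```cpp
#include <cstddef>
#include <string>

// Returns true if the first `limit` bytes contain no illegal UTF-8 sequence.
// Sketch only: the 500-byte default is an assumption taken from the
// "500 or so" figure in the post, not a tuned value.
bool looksLikeUtf8(const std::string& bytes, std::size_t limit = 500)
{
    const std::size_t n = bytes.size() < limit ? bytes.size() : limit;
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80) {                 // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t len = 0;            // total length of the sequence
        unsigned long cp = 0;           // decoded code point
        unsigned long min = 0;          // smallest value legal for this length
        if (b >= 0xC2 && b <= 0xDF) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return false;               // stray continuation, 0xC0/0xC1, 0xF5..0xFF
        }
        for (std::size_t k = 1; k < len; ++k) {
            if (i + k >= n) {
                // Cut off by the sample limit: acceptable.
                // Cut off by the real end of the data: illegal.
                return n < bytes.size();
            }
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;           // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;               // overlong, out of range, or surrogate
        }
        i += len;
    }
    return true;
}
```

Text in ISO 8859-1 with accented characters typically fails this test at the first accent, since a lone byte such as 0xE9 followed by an ASCII letter is not a legal UTF-8 sequence. Pure ASCII passes trivially, so a positive result only says the data is compatible with UTF-8.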
--
James Kanze (Gabi Software) email: james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34