Re: Reading an array from file?

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Fri, 7 Aug 2009 01:32:13 -0700 (PDT)

Message-ID:

<3ca0c757-cb5a-46ae-ab91-9e4aa27d18f1@q14g2000vbi.googlegroups.com>

On Aug 6, 8:27 pm, Jerry Coffin <jerryvcof...@yahoo.com> wrote:

In article <693fed3c-761e-4429-b6b0-9a6f77a52748
@c14g2000yqm.googlegroups.com>, james.ka...@gmail.com says...

[ ... ]

Well, there's a certain level where everything is just
bytes. But I was under the impression that Windows used
UTF-16 for text at the system level, and that files could
(and text files generally did) contain UTF-16---i.e. 16 bit
entities. (And under Windows on a PC, a byte is 8 bits.)

They can, but they far more often contain something like ISO
8859.

In the end, the OS is mostly agnostic about the content of
text files. As you'd expect, it includes some utilities that
know how to work with text files, and most of those can work
with files containing either 8-bit or 16-bit entities, and
even guess which a particular file contains (though the guess
isn't always right).

On the other hand, now that you mention it... When I ported
some of my file handling classes to Windows, filenames for
CreateFile were always LPCTSTR, whatever that is (but a
narrow character string literal converts implicitly to it,
as does the results of std::string.c_str()), which makes me
wonder why people argue that std::fstream must have a form
which takes a wchar_t string as filename argument.

Just FWIW, LPCTSTR is something like long pointer to const
text string (where 'text' means char's or wchar_t's depending
on whether _UNICODE was defined or not when compiling).

In other words, you don't know what you're getting. That sounds
like the worst of both worlds.

If you don't have _UNICODE defined, CreateFile will accept a
char *. If you do define it, CreateFile accepts a wchar_t *.

In reality, most of the functions in Windows that take strings
come in two flavors: an 'A' version and a 'W' version, so the
headers look something like this:

HANDLE CreateFileW(wchar_t const *, /* ... */);
HANDLE CreateFileA(char const *, /* ... */);

#ifdef _UNICODE
#define CreateFile CreateFileW
#else
#define CreateFile CreateFileA
#endif

Hopefully, they do use an inline function in the #ifdef, and not
a macro.

The 'A' version, however, is a small stub that converts the
string from the current code page to UTF-16, and then (in
essence) feeds that result to the 'W' version. That can lead
to a problem if you use the 'A' version -- if your current
code page doesn't contain a character corresponding to a
character in the file name, you may not be able to create that
file name with the 'A' version at all.

Hopefully, they have a code page for UTF-8.

And what happens with the name when it is actually passed to the
file system? Most file systems I have mounted won't support
UTF-16 in filenames---the file system will read it as a NTMB
string, and stop at the first byte with 0. (Also, the file
servers are often big endian, and not little endian.) I'm
pretty sure that NFS doesn't support UTF-16 in the protocol, and
I don't think SMB does either.

The 'W' version lets you specify UTF-16 characters directly,
so it can specify any file name that can exist -- but
fstream::fstream and fstream::open usually act as wrappers for
the 'A' version.

Of course, you _could_ work around this without changing the
fstream interface -- for example, you could write it to expect
a UTF-8 string, convert it to UTF-16, and then pass the result
to CreateFileW -- but I don't know of anybody who does so. As
I recall, there are also some characters that can't be encoded
as UTF-8, so even that wouldn't be a perfect solution, though
it would usually be adequate.

UTF-8 can encode anything in Unicode. And more; basically, in
it's most abstract form, it's just a means of encoding 32 bit
values as sequences of 8 bit bytes, and can handle an 32 bit
value. (The Unicode definition of UTF-8 does introduce some
restrictions---I don't think encodings of surrogates are
allowed, for example, and codes Unicode forbids, like 0xFFFF,
certainly aren't. But in the basic original UTF-8, there's no
problem with those either.)

According to the documentation, WriteFile and ReadFile take
what I assume to be a void* (LPCVOID or LPVOID), which
doesn't say much one way or the other, but the length
argument is specified as "number of bytes".

Right -- the OS just passes this data through transparently.
Fundamentally it's about like write() on Unix -- it just deals
with a stream of bytes; any other structure is entirely up to
you and what you choose to write and how you choose to
interpret data you read.

In other words, there is no transfer of 16 bit entities. It's
up to the writer to write it as bytes, and the reader to read it
as bytes, and the two to agree how to do so. (In practice, of
course, if the two are both on the same machine, this won't be a
problem. But in practice, in the places I've worked, most of
the files on the PC's have been remote mounted on a Sparc, which
is big-endian.)

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orient=E9e objet/
Beratung in objektorientierter Datenverarbeitung
9 place S=E9mard, 78210 St.-Cyr-l'=C9cole, France, +33 (0)1 30 23 00 34