Re: Reading an array from file?
On Aug 7, 4:00 pm, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
In article <3ca0c757-cb5a-46ae-ab91-9e4aa27d18f1
@q14g2000vbi.googlegroups.com>, james.ka...@gmail.com says...
On Aug 6, 8:27 pm, Jerry Coffin <jerryvcof...@yahoo.com> wrote:
[ ... ]
Just FWIW, LPCTSTR is something like long pointer to const
text string (where 'text' means char's or wchar_t's
depending on whether _UNICODE was defined or not when
compiling).
In other words, you don't know what you're getting. That
sounds like the worst of both worlds.
I can't say I've ever run into a situation where I didn't get
what I wanted or didn't know what I was going to get.
A library with inline functions or template code?
More generally, how do you ensure that all components of an
application are compiled with the same value for _UNICODE?
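
Just to illustrate the sort of trap I have in mind (the names here
are invented): a struct whose layout depends on _UNICODE.

    #include <windows.h>
    #include <tchar.h>

    struct Entry
    {
        TCHAR buffer[64];   // 64 bytes in an ANSI build, 128 in a
                            // Unicode build (TCHAR is char or wchar_t)
        int id;
    };

    // Defined in a library compiled with one setting; if client code
    // is compiled with the other, the two sides silently disagree
    // about sizeof(Entry) and about what buffer contains, and no
    // compiler or linker is required to diagnose it.
    void logEntry(Entry const& entry);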
At the same time, for _most_ new development, I'd ignore all
that and use the "W" versions of functions directly. Those are
really Windows' native functions, and they're always a bit faster,
require less storage, and have at least the same capabilities
as the "A" versions of the same (and sometimes more).
That sounds reasonable.
[ ... ]
The 'A' version, however, is a small stub that converts
the string from the current code page to UTF-16, and then
(in essence) feeds that result to the 'W' version. That
can lead to a problem if you use the 'A' version -- if
your current code page doesn't contain a character
corresponding to a character in the file name, you may not
be able to create that file name with the 'A' version at
all.
Hopefully, they have a code page for UTF-8.
Yes, thankfully, they do.
What about Alf's claim that it doesn't really work?
More generally, if you're going to do this sort of thing, you
need to offer a bit more flexibility. Filenames can come from
many different sources, and depending on the origin, the
encoding may not be the same.
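
For example, when the name is known to arrive in UTF-8, you can make
the conversion explicit rather than trusting the process's ANSI code
page. Just a sketch; createFileUtf8 is my own invention, and the
error handling is minimal:

    #include <windows.h>
    #include <string>
    #include <vector>

    // Convert an explicitly UTF-8 name to UTF-16 and call the "W"
    // function directly.
    HANDLE createFileUtf8(std::string const& utf8Name)
    {
        int len = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8Name.c_str(), -1, NULL, 0);
        if (len == 0)
            return INVALID_HANDLE_VALUE;        // not valid UTF-8
        std::vector<wchar_t> wide(len);
        ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                              utf8Name.c_str(), -1, &wide[0], len);
        return ::CreateFileW(&wide[0], GENERIC_WRITE, 0, NULL,
                             CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                             NULL);
    }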
[ ... ]
And what happens with the name when it is actually passed to
the file system? Most file systems I have mounted won't
support UTF-16 in filenames---the file system will read it
as an NTMBS (null-terminated multibyte string), and stop at
the first zero byte. (Also,
the file servers are often big endian, and not little
endian.) I'm pretty sure that NFS doesn't support UTF-16 in
the protocol, and I don't think SMB does either.
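
The problem in concrete terms (a little demo, assuming a little
endian PC):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        wchar_t const name[] = L"AB";
        unsigned char const* bytes
            = reinterpret_cast<unsigned char const*>(name);
        // Prints "41 00 42 00 00 00": every ASCII character encodes
        // with a zero byte, so a byte oriented (NTMBS) interface
        // stops after the first 'A'.
        for (std::size_t i = 0; i != sizeof name; ++i)
            std::printf("%02x ", static_cast<unsigned>(bytes[i]));
        std::printf("\n");
        return 0;
    }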
This is one of the places that I think the GUI way of doing
things is helpful -- you're normally giving the user a list
of files from the server, and then passing the server back a
name picked from the list.
Not when you're creating new files. And most of my programs
don't run under a GUI; they're servers, which run 24 hours a
day. Of course, they don't run under Windows either, so the
question is moot:-). But the question remains---picking up the
name from a GUI is fine for interactive programs, but a lot of
programs aren't interactive.
[ ... ]
UTF-8 can encode anything in Unicode. And more; basically,
in its most abstract form, it's just a means of encoding 31
bit values as sequences of 8 bit bytes, and can handle any
such value. (The Unicode definition of UTF-8 does introduce
some restrictions---I don't think encodings of surrogates
are allowed, for example, and codes Unicode forbids, like
0xFFFF, certainly aren't. But in the basic original UTF-8,
there's no problem with those either.)
I think we're mostly dealing with a difference in how
terminology is being used,
UTF-8 really does have two commonly accepted meanings. The
original UTF-8 was just a means of formatting 16, and later 31
bit entities as bytes, and could handle any value that could be
represented in 31 bits. The Unicode definition clearly
restricts it somewhat, but their site is down right now, so I
can't see exactly how. If nothing else, they only allow values
in the range 0-0x10FFFF (which means that the longest sequence
is only 4 bytes, rather than 6), but I'm sure that there are
other restrictions as well.
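
The scheme itself is simple enough to show. This sketch handles the
Unicode-restricted range; the original scheme just continued the same
pattern up to 6 bytes, for the full 31 bit range. (And like the
original, it doesn't bother rejecting surrogates---the Unicode
definition would have to.)

    #include <string>

    std::string toUtf8(unsigned long cp)
    {
        std::string result;
        if (cp < 0x80) {
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                            // up to 0x10FFFF
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return result;
    }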
but I also think it's more or less irrelevant -- as long as
you use UTF-8, you'll almost certainly be able to represent
any file name there is.
Yes.
[ ... ]
In other words, there is no transfer of 16 bit entities.
It's up to the writer to write it as bytes, and the reader
to read it as bytes, and the two to agree how to do so. (In
practice, of course, if the two are both on the same
machine, this won't be a problem. But in practice, in the
places I've worked, most of the files on the PC's have been
remote mounted on a Sparc, which is big-endian.)
As long as the files are only being used on PCs, and stored on
SPARCs, that shouldn't matter. Just to act as a file server,
all it has to do is ensure that the stream of bytes that was
sent to it matches the stream of bytes it plays back.
And that the file name matches, somehow. But typically, this
isn't the case---I regularly share files between systems, and
this seems to be the case for everyone where I work.
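
In code, the point about there being no transfer of 16 bit entities
looks something like this (a sketch; the names are mine). The writer
picks a byte order---big endian here, network style---and the reader
reassembles with the same convention, so the hardware's endianness
never enters into it:

    #include <istream>
    #include <ostream>

    void writeUnit(std::ostream& dest, unsigned short unit)
    {
        dest.put(static_cast<char>((unit >> 8) & 0xFF)); // high byte
        dest.put(static_cast<char>(unit & 0xFF));        // low byte
    }

    unsigned short readUnit(std::istream& source)
    {
        unsigned short hi = source.get() & 0xFF;
        unsigned short lo = source.get() & 0xFF;
        return static_cast<unsigned short>((hi << 8) | lo);
    }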
We're on fairly familiar ground here though -- Windows being
involved doesn't really change anything. If you're writing a
file of Unicode text, putting a BOM at the beginning should be
enough to let anything that "knows" Unicode read it. If the
file needs to contain anything much more complex, you probably
want to use some standardized encoding format like ASN.1 or
XDR. Choosing between those is usually pretty easy as well:
you use XDR when you can, and ASN.1 if you have to (e.g. to
exchange data with something that only understands ASN.1, or
if you really need the data to be self-describing).
I agree that standard (and simple) solutions exist. Putting a
BOM at the start of a text file allows immediate identification
of the encoding format. But how many editors do you know
that actually do this? (For non-text files, of course, you have to
define a format, and the defined formats do tend to work
everywhere. Although there's still the question of what to do
if you have a filename embedded in an otherwise non-text
format.)
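
For what it's worth, the convention itself is trivial to implement (a
sketch, with invented names; the streams are assumed to be opened in
binary mode):

    #include <istream>
    #include <ostream>
    #include <string>

    void writeUtf8Bom(std::ostream& dest)
    {
        dest.write("\xEF\xBB\xBF", 3);      // U+FEFF, encoded as UTF-8
    }

    std::string detectEncoding(std::istream& source)
    {
        int b0 = source.get();
        int b1 = source.get();
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xEF && b1 == 0xBB && source.get() == 0xBF)
            return "UTF-8";
        return "unknown";       // no BOM: you're reduced to guessing
    }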
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34