Re: Is this String class properly implemented?

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Fri, 8 May 2009 06:48:35 -0700 (PDT)
Message-ID: <b519f5c5-1f8a-439e-859e-37a8045691a5@b1g2000vbc.googlegroups.com>
On May 8, 3:02 am, "Tony" <t...@my.net> wrote:

James Kanze wrote:

On May 2, 12:10 pm, "Tony" <t...@my.net> wrote:

James Kanze wrote:

On Apr 29, 9:18 am, "Tony" <t...@my.net> wrote:

James Kanze wrote:

7-bit ASCII is your friend. OK, not *your* friend maybe,
but mine for sure!


7-bit ASCII is dead, as far as I can tell. Certainly none
of the machines I use use it.


It's an application-specific thing, not a machine-specific
thing.


That's true to a point---an application can even use EBCDIC,
internally, on any of these machines. In practice, however,
anything that leaves the program (files, printer output,
screen output) will be interpreted by other programs, and an
application will only be usable if it conforms to what these
programs expect.


But there is a huge volume of programs that can and do use
just ASCII text.


There is a huge volume of programs that can and do use no text.
However, I don't know of any program today that uses text in
ASCII; text is used to communicate with human beings, and ASCII
isn't sufficient for that.

I gave the example of development tools: parsers, etc.


Except that the examples are false. C/C++/Java and Ada require
Unicode. Practically everything on the network is UTF-8.
Basically, except for some historical tools, ASCII is dead.
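
For what it's worth, producing UTF-8 is straightforward. A rough
sketch of an encoder, assuming the caller passes a valid code
point (the function name is arbitrary):

    #include <string>

    // Sketch: encode one Unicode code point (<= U+10FFFF) as UTF-8.
    // No checking for surrogates or out-of-range values.
    std::string toUtf8(unsigned long cp)
    {
        std::string result;
        if (cp < 0x80) {
            result += static_cast<char>(cp);
        } else if (cp < 0x800) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return result;
    }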

Sure, the web isn't just ASCII, but that is just an
application domain. If that is the target, then I'll use
UnicodeString instead of ASCIIString. I certainly don't want
all the overhead and complexity of Unicode in ASCIIString
though. It has too many valid uses to be bothered with
a mountain of unnecessary stuff by being subsumed into the
"one size fits all" monstrosity.


As long as you're the only person using your code, you can do
what you want.

Which isn't necessarily a trivial requirement.


On that we agree 100%! That's the rationale for keeping
ASCIIString unaberrated.
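
Roughly the kind of invariant I have in mind (just a sketch; the
class name, the stored representation and the failure policy are
all placeholders):

    #include <stdexcept>
    #include <string>

    // Sketch: a string class whose invariant is "7-bit ASCII only".
    class ASCIIString
    {
    public:
        explicit ASCIIString(std::string const& s)
            : data_(s)
        {
            for (std::string::size_type i = 0; i != data_.size(); ++i) {
                if (static_cast<unsigned char>(data_[i]) > 127) {
                    throw std::invalid_argument(
                        "ASCIIString: non-ASCII byte");
                }
            }
        }
        std::string const& str() const { return data_; }

    private:
        std::string data_;
    };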


I understand the rationale.

When I spoke of the encodings used on my machines, I was
referring very precisely to those machines, when I'm logged
into them, with the environment I set up. Neither pure
ASCII nor EBCDIC is an option, but there are a lot of other
possibilities. Screen output depends on the font being used
(which as far as I know can't be determined directly by a
command line program running in an xterm), printer output
depends on what is installed and configured on the printer
(or in some cases, the spooling system), and file output
depends on the program which later reads the file---which
may differ depending on the program, and what they do with
the data. (A lot of programs in the Unix world will use
$LC_CTYPE to determine the encoding---which means that if
you and I read the same file, using the same program, we may
end up with different results.)


I don't get what you mean: an ASCII text file is still an
ASCII text file no matter what font the user chooses in
Notepad, e.g.


First, there is no such thing as an ASCII text file. For that
matter, under Unix, there is no such thing as a text file. A
file is a sequence of bytes. How those bytes are interpreted
depends on the application. Most Unix tools expect text, in an
encoding which depends on the environment ($LC_CTYPE, etc.).
Most Unix tools delegate display to X, passing the bytes on to
the window manager "as is". And all Unix tools delegate to the
spooling system or the printer for printing, again, passing the
bytes on "as is" (more or less---the spooling system often has
some code translation in it). None of these take into
consideration what you meant when you wrote the file.
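
A concrete illustration of how the reading program decides, on a
typical Unix: it asks the locale, which comes from the
environment. This is POSIX, not standard C++, and the output
depends entirely on who runs it:

    #include <clocale>
    #include <iostream>
    #include <langinfo.h>       // POSIX, not standard C++

    int main()
    {
        // Adopt whatever locale the environment specifies
        // ($LC_ALL, $LC_CTYPE or $LANG).
        std::setlocale(LC_CTYPE, "");
        // Prints e.g. "UTF-8", "ISO-8859-1", "ANSI_X3.4-1968"...
        std::cout << nl_langinfo(CODESET) << '\n';
        return 0;
    }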

Internally, the program is still working with ASCII strings,
assuming English is the language (PURE English that recognizes
only 26 letters, that is).


Pure English has accented characters in some words (at least
according to Merriam Webster, for American English). Pure
English distinguishes between opening and closing quotes, both
single and double. Real English distinguishes between a hyphen,
an en dash and an em dash.

But that's all irrelevant, because in the end, you're writing
bytes, and you have to establish some sort of agreement between
what you mean by them, and what the programs reading the data
mean. (*If* we could get by with only the characters in
traditional ASCII, it would be nice, because for historical
reasons, most of the other encodings encountered encode those
characters identically. Realistically, however, any program
dealing with text has to support more, or nobody will use it.)
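
To make the "just bytes" point concrete, a sketch (the file name
is arbitrary): the two bytes written below are "é" to a reader
assuming UTF-8 and "Ã©" to one assuming ISO 8859-1, and nothing
in the file says which was meant.

    #include <fstream>

    int main()
    {
        std::ofstream out("sample.txt", std::ios::binary);
        out.put(static_cast<char>(0xC3));   // 0xC3 0xA9 is U+00E9 in UTF-8,
        out.put(static_cast<char>(0xA9));   // but "Ã©" read as ISO 8859-1
        return 0;
    }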

Nor does it matter that the platform is Wintel where "behind
the scenes" the OS is all UTF-16.

My (very ancient)


(Aside Trivia: The "failure" of Sun has been attributed in
part to the unwillingness to move to x86 while "the industry"
went there. Very ancient indeed!).


Where did you get that bullshit? Sun does sell x86-based machines
(using AMD chips). And IBM and HP are quite successful with
their lines of non-x86 processors. (IMHO, where Sun went wrong
was in abandoning its traditional hardware market, and moving
into software adventures like Java.)

Sparcs use ISO 8859-1, my Linux boxes UTF-8, and Windows
UTF-16LE.

The reason is simple, of course: 7-bit ASCII doesn't suffice
(nor does ISO 8859-1, for that matter) for any known
language.


The application domain you reference is: Operating System.
Quite different from a CSV text-file parser.


I'm not referencing any application domain in particular.
Practically all of the Unix applications I know take the
encoding from the environment; those that don't do so use UTF-8 (the
more recent ones, anyway). All of the Windows applications I
know use UTF-16LE.

Do you think anyone would use MS Office or Open Office if they
only supported ASCII?

Your statement could be misleading even if you didn't intend
it to be. The "any known language... blah, blah" is a
generalization that fits the real world,


Yes. That's where I live and work. In the real world. I
produce programs that other people use. (In practice, my
programs don't usually deal with text, except maybe to pass it
through, so I'm not confronted with the problem that often. But
often enough to be aware of it.)

but software programs eventually are just "zeros and ones".


Not really. Programs assign semantics to those ones and zeros.
Even at the hardware level---a float and an int may contain the
same number of bits, but the code uses different instructions
with them. Programs interpret the data.
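
A small illustration, assuming the usual 32 bit IEEE float and
32 bit unsigned int:

    #include <cstring>
    #include <iostream>

    int main()
    {
        float        f = 1.0f;
        unsigned int i;
        // Copy the same 32 bits into an object of a different type:
        // the bits haven't changed, only the interpretation has.
        std::memcpy(&i, &f, sizeof f);
        std::cout << f << " viewed as an integer: "
                  << std::hex << i << '\n';   // 3f800000 on IEEE machines
        return 0;
    }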

Which brings us back to my point above---you don't generally
control how other programs are going to interpret the data you
write.

The above from you is an odd perspective noting that in
another thread you were trying to shoehorn something with,
logically, magnitude and direction into a signed integral
type.


Sorry, I don't know what you're talking about.

Um, how about the C++ programming language!


C++ accepts ISO/IEC 10646 in comments, string and character
literals, and symbol names.
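
What that means in practice, as a sketch (whether a given
compiler accepts extended identifiers, and how it encodes the
literal, varies):

    #include <iostream>

    int main()
    {
        // \u00E9 is the universal-character-name for U+00E9 (LATIN
        // SMALL LETTER E WITH ACUTE); how it ends up encoded in the
        // narrow literal is implementation-defined.  The same \uXXXX
        // form is also allowed in identifiers, and comments may
        // contain any characters at all.
        char const* s = "caf\u00E9";
        std::cout << s << '\n';
        return 0;
    }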


That's a good expansion point. Let's look at the constituents...

Comments and Symbols: If you want to program in French or
7-bit kanji (The Matrix?), have at it.


I've already had to deal with C code whose symbols were in Kanji. That
would have been toward the end of the 1980s. And I haven't seen
a program in the last ten years which didn't use symbols and
have comments in either French or German.

I guarantee you that I'll never ever use/need 10646 comments
or symbols.


Fine. If you write a compiler, and you're the only person to
use it, you can do whatever you want. But there's no sense in
talking about it here, since it has no relevance in the real
world.

I'll be nice and call it a simplifying assumption but it's
really a "no brainer".

Literals: Not a problem for me, and can be worked around for
others (put it in a file or something: make it data, because that's
what it is. Programming in French is hard).


No it's not. (Actually, the most difficult language to program
in is English, because so many useful words are reserved as key
words. When I moved to C++ from C, I got hit several times, in
code written in English, by things like variables named
class. Never had that problem with the French classe, nor the German
Klasse.)
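
The sort of thing I mean: perfectly legal C, but it won't get
past a C++ compiler.

    /* Legal C: "class" is just an identifier there.  In C++ it is a
       keyword, so this translation unit no longer compiles. */
    int main(void)
    {
        int class = 3;
        return class;
    }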

Major advantage for me in programming: English is my primary
language!


It's one of my primary languages as well. Not the only one,
obviously, but one of them.

(Curb all the jokes please! ;P). Trying to extend programming
(as I know it) to other languages is not my goal. It may be
someone else's proverbial "noble" goal.

[snip... must one indicate snips?]

Of course, I'm talking here about real programs, designed to
be used in production environments. If your goal is just a
Sudoku solver, then 7-bit ASCII is fine.


Of course compilers and other software development tools
are just toys. The English alphabet has 26 characters. No
more, no less.


C, C++, Java and Ada all accept the Unicode character set,
in one form or another.


There's that operating system example again; it doesn't apply
to most application development.


That has nothing to do with the operating system. Read the
language standards.

(Ada, and maybe Java, limit it to the first BMP.) I would
think that this is pretty much the case for any modern
programming language.


You are conflating programming languages with the data that
they manipulate.


No. Do you know any of the languages in question? All of them
clearly require support for at least the first BMP of Unicode in
the compiler. You may not use that possibility---a lot of
people don't---but it's a fundamental part of the language.
(FWIW: I think that C++ was the first to do so.)

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
