Re: The D Programming Language

From: "James Kanze" <james.kanze@gmail.com>
Newsgroups: comp.lang.c++.moderated
Date: 15 Dec 2006 08:23:33 -0500
Message-ID: <1166175368.870279.87960@t46g2000cwa.googlegroups.com>

Al wrote:

> James Kanze wrote:
> <snip>
>
> > It's nice to know that string literals aren't constants. (Sort
> > of reminds me of Fortran IV, where constants passed to a
> > function could be modified by the function, so a different
> > constant would be passed the next time.) If you look at Niklas'
> > code, you'll also see how you can get things like:
> >     String s = "Hello, World!" ;
> >     s.lastIndexOf( 'H' )
> > throwing an ArrayIndexOutOfBoundsException.
> >
> > Of course, this was also the case in the original C. Maybe Java
> > got its ideas about how a string literal should behave from
> > there. Thank goodness we've made some progress in this respect
> > in C++ (and in C90---even the C standards committee thought that
> > modifying constants was taking empowerment of the programmer a
> > bit too far).

> Well, there are two issues, which are distinct:
>
> A) (String) Literals being unique (single instance).
> B) (String) Literals being constant (immutable).

Formally, yes. Practically, strings are values, so identity
isn't important, which means that if the strings are constant,
whether identical strings are a single instance or not is
irrelevant. (There are exceptions to this, of course. When
optimizing, it is sometimes useful to require a single instance
for all identical strings, in order to just compare pointers,
rather than comparing all of the characters.)

> If I understand correctly, A is done to minimize redundant memory
> consumption.

Not only. Depending on how and where it is done, it can be used
to reduce total memory consumption, reduce dynamic allocation
(which can be expensive in terms of run-time) or to simplify
comparisons---if you know that two strings with the same value
must be at the same address, you can just compare pointers.
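
To make the pointer-comparison point concrete, here is a minimal C++ sketch of an intern table. The names InternTable, intern and sameInterned are hypothetical, invented for this illustration; they are not from any library mentioned in this thread. The sketch relies on the standard guarantee that pointers to elements of a std::unordered_set remain valid when the container rehashes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical helper: keeps exactly one stored copy of each distinct
// string value handed to it.
class InternTable {
public:
    // Returns a pointer to the single stored instance equal to `value`.
    // insert() leaves the set unchanged if an equal element already
    // exists, and element addresses survive rehashing, so the returned
    // pointer is stable for the lifetime of the table.
    std::string const* intern(std::string const& value) {
        return &*pool_.insert(value).first;
    }

    std::size_t size() const { return pool_.size(); }

private:
    std::unordered_set<std::string> pool_;
};

// For two strings interned in the same table, equal values imply equal
// addresses, so one pointer comparison replaces a character-by-character
// comparison.
inline bool sameInterned(std::string const* a, std::string const* b) {
    assert(a != nullptr && b != nullptr);
    return a == b;
}
```

The shortcut is only valid for pointers obtained from the same table, and only as long as nothing can modify the stored strings. That is why the table hands out pointers to const.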

> I agree that /if/ A is true (in any given language), then B
> /should/ be true.

By definition, B should be true. A literal is a compile time
constant. The only exceptions I'm aware of were early versions
of Fortran and C---and now Java. Both Fortran and C corrected
this defect very early in their existence. Java seems to have
added it; it wasn't present in the earliest implementations
(which didn't have reflection).

> However, if A is false, then B is not necessary.

I disagree. If I see a numeric constant 42 in the source code,
I should be able to count on its value being 42. And if I see a
string literal "abc", I should be able to count on its value
being "abc". Constants should not be variables, and vice versa.

> In my opinion, A is Premature Optimization™ that puts unfortunate
> constraints on the language.

It has nothing to do with optimization. It's a question of
readability. How would you like it if the expression "i += 1"
added 2 to i? And how is that any different from the expression
`System.out.println( "Hello" )' printing "Good bye"?

> How many identical string literals does a program have, on
> average? I would say very few, if the code is well-written. If
> the program is dynamically localizable (as is often the case),
> probably /none/.

I don't know. "WHERE" tends to occur a lot in SQL requests
(with what precedes and follows it being variable). And I would
strongly recommend NOT replacing "WHERE" with "OÙ" or "WO", just
because you are in a French or German locale. An HTTP client will
doubtlessly want to use "GET" (but that use is more likely to be
localized in one place in the program). And the logging macros
are full of __FILE__, which expands to the same string literal
throughout the file.

Not that that's relevant to anything. (Except maybe the
expansion of __FILE__, which could increase the size of the
executable noticeably if the identical instances aren't merged.)

> Furthermore, if I understand correctly:
>
> In C++, A is true* and B is true**.
>
> * Or at least, probably, since the compiler will likely optimize it.
> ** Except char pointers decay to non-const.

A is unspecified. B is formally true, in that any attempt to
modify a string literal is undefined behavior. Because early C
guaranteed that string literals could be modified, and that each
instance was a separate object, many C++ compilers still support
this (often only with certain compiler options).

Note that the fact that the pointer can be implicitly converted
to non-const, at least in some very frequent cases, does not
authorize modification. It's an intentional hack to support
previously existing practice.

> In Java, A is true*** and B true****.
> *** At least those created at compile-time.
> **** Except that reflection can be used to bypass it.

If it isn't created at compile-time, it isn't a string literal,
either in Java or C++. And if there's anything in the language
which allows you to modify a literal, that's a serious defect.

In the case of Java, the problem concerning literals may be the
most shocking, externally, but the fact that you can modify a
String after having passed it to another subsystem is far more
serious, since it undermines many of Java's security measures.

> So I would conclude that ideally, a modern language should make string
> literals:
>
> A) Per-instance (or CoW).
> B) Mutable.

A literal should never be mutable. Modifying a literal is on
the same level as other self-modifying code.

> If this is not possible, then at least:
>
> A) Unique.
> B) Const.
>
> The worst possible case is:
>
> A) Unique.
> B) Mutable.
>
> Depending on how you interpret the caveats, I would argue that
> both Java /and/ C++ are in the third category, which is not
> good.

The modification of literals is a fun exercise, to demonstrate
the problem. (G++ puts string literals in write protected
memory, so they can't be modified. Period. Sun CC will do so
too, with the right options.) But it's only one aspect of the
problem; the real problem is modifying something that the author
of the code thinks cannot be modified.

In C++, this is most often a result of unintentional
aliasing---just because you have a std::string const& doesn't
mean that the string value will not change. In C++, however,
this is so frequently a problem that it is pretty well
understood; most C++ programmers know that if you need to be
sure that something doesn't change, you make a deep copy of
it---you use pass by value.

Java has similar problems, in that you don't always know when
objects are shared, and when they aren't. This is normally only
a problem with objects which have value semantics---if identity
is relevant to the object's semantics, then obviously, you know
which objects are shared, and which aren't, by design. The
normal solution to this is to make value objects immutable.
(For a good example of what happens when you don't, consider the
return value of javax.swing.JComponent.getPreferredSize(), which
returns a mutable value object. What happens if you modify it?
Depending on the code you've previously executed, and the layout
manager installed, you may or may not modify the preferred size
of the component; it's anybody's guess.)

And of course, the problem with String here is that we have a
means of modifying an object which has been carefully designed
to be immutable, and which must be immutable, for security
reasons. In practice, you can probably force uniqueness by
something like:

    StringBuffer tmp = new StringBuffer( " " ) ;
    tmp.append( s ) ;
    s = tmp.substring( 1 ) ;

but 1) I don't think it's formally guaranteed, and 2) I've never
seen the necessity of this sort of hack documented.
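
For the C++ aliasing case mentioned above, a small sketch may help. The function names below are hypothetical, chosen only for this illustration. The first function takes its argument by reference to const, the second by value, and the self-check shows that they behave differently when the caller passes the same object for both parameters.

```cpp
#include <cassert>
#include <string>

// Hypothetical function: appends `suffix` to `target` twice. Its author
// probably assumes that `suffix` cannot change during the call, because
// it is a reference to const.
inline void appendTwiceByRef(std::string& target, std::string const& suffix) {
    target += suffix;   // if suffix aliases target, suffix changes here too
    target += suffix;
}

// Same operation with the parameter passed by value: the function works
// on its own copy, which no alias can modify.
inline void appendTwiceByValue(std::string& target, std::string suffix) {
    target += suffix;
    target += suffix;
}

// Small self-check showing the difference.
inline void demonstrateAliasing() {
    std::string a = "ab";
    appendTwiceByRef(a, a);      // suffix aliases target
    assert(a == "abababab");     // the second append saw the modified value

    std::string b = "ab";
    appendTwiceByValue(b, b);    // suffix is an independent copy
    assert(b == "ababab");
}
```

The const in std::string const& only restricts what the function may do through that particular reference; it promises nothing about other paths to the same object.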

And I repeat, the possibility of modifying a string *after*
having passed it to a library function is a serious security
hole. I'm very surprised that Java lets this one through.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated. First time posters: Do this! ]