Re: char and strict aliasing

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Fri, 18 Jul 2008 05:22:31 -0700 (PDT)
Message-ID: <81890b92-08c8-4628-bafc-9d0088172b04@q24g2000prf.googlegroups.com>

On Jul 18, 8:12 am, Paul Brettschneider <paul.brettschnei...@yahoo.fr>
wrote:

Hello Alexandre and James, thanks for your reply!
James Kanze wrote:

On Jul 17, 10:10 pm, Paul Brettschneider

[...]

Apparently the pointer data is reloaded after every store. I
guess this is due to the aliasing rules for char types: for
some strange reason data might point to itself and to be
correct it has to be reloaded after every store.

Indeed replacing the char for an int gives the same code for f
and f2. IMO this is a bad language decision: It's highly
inconsistent.

It's a pragmatic compromise. Low level software (think of the
implementation of memcpy or a garbage collector) must be able to
access the raw memory underlying the objects; at this level, the
compiler really should consider all pointers as possible aliases
to anything.
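
A minimal sketch of the kind of raw-memory access described above,
and of why a store through a character pointer may force a pointer to
be reloaded. The names copy_bytes, buffer and retarget are
hypothetical and do not come from the thread; buffer only stands in
for the class under discussion.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: a memcpy-like byte-wise copy. Reading and
// writing any object through unsigned char* is allowed by the
// aliasing rules, so the compiler must assume d[i] can touch anything.
void copy_bytes(void* dst, const void* src, std::size_t n)
{
    assert(dst != 0 && src != 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i != n; ++i)
        d[i] = s[i];
}

// Hypothetical stand-in for a class holding a char* member.
struct buffer {
    char* data;
};

// Overwrites the bytes of b.data itself with the bytes of new_target.
// Afterwards b.data compares equal to new_target, although nothing
// was ever assigned to b.data by name.
void retarget(buffer& b, char* new_target)
{
    copy_bytes(&b.data, &new_target, sizeof new_target);
}
```

Since stores through a character pointer can legally change a pointer
object in this way, a compiler that cannot prove otherwise has to
reload the member after each such store.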

I understand that. But I would expect programmers of low level
code like garbage collectors to understand aliasing and be
able to explicitly tell the compiler when aliasing is
possible.

The language standard doesn't provide any real means of telling
the compiler anything, outside of the language. And it has
pretty much been a principle of the language not to provide such
means.

Of course some old weird code might break. OTOH C++ breaks old
C code anyway...

:-).

The real problem in C++ is C compatibility. This is one of the
most fundamental parts of the basic object model, shared with C.
Modern C keeps the rule to avoid breaking older C, and C++ keeps
it to avoid breaking C compatibility. (Historically, C didn't
have such a rule. But the compilers at the time didn't do
enough alias checking to make it worthwhile. When C was being
standardized, in the late 1980's, it was becoming an issue, with
different compilers taking different positions. When I said
"pragmatic compromise", I really meant it: the C committee did
not want to innovate, introduce new keywords, etc., and worked
out a solution which guaranteed that most of the low level code
still worked, and that most of the optimizations---the people
most concerned with optimization are usually using float and
double---also worked, without making any fundamental additions
to the language.)

Optimization requires aliasing to be
restricted as much as possible, and in application code, of
course, there should practically never be any such aliasing. The
C++ solution (inherited from C) is to allow char* and unsigned
char* (in C, any character type, signed char* included) to alias anything,
since that covers most of the low level needs, and to restrict
the aliasing for other types. In practice, even this turned out
to be insufficient for optimization purposes, and C99 introduced
restrict.

Normally, I would expect a compiler to offer options to control
this: one to request it to ignore the types in possible aliasing
analysis (because there is code around which counts on e.g.
looking at a double through an unsigned short*), and another to
state that even char* won't alias another type (which is
non-conforming, but harmless if you don't need the feature).

Exactly.
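
For code that wants to look at a double as a sequence of unsigned
short, one portable alternative that does not depend on any compiler
option is to copy the bytes rather than cast the pointer. The sketch
below assumes sizeof(double) is an exact multiple of
sizeof(unsigned short); kWords, split_double, join_double and
round_trip_demo are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Number of unsigned short words needed to hold one double
// (assumption: the division is exact, checked just below).
const std::size_t kWords = sizeof(double) / sizeof(unsigned short);

// C++98-style compile-time check: the array size is -1, and the
// typedef ill-formed, if the assumption above does not hold.
typedef char double_size_check[
    (sizeof(double) % sizeof(unsigned short) == 0) ? 1 : -1];

// Copies the object representation of value into words.
void split_double(double value, unsigned short (&words)[kWords])
{
    std::memcpy(words, &value, sizeof value);
}

// Rebuilds a double from words previously filled by split_double.
double join_double(const unsigned short (&words)[kWords])
{
    double value;
    std::memcpy(&value, words, sizeof value);
    return value;
}

// Usage sketch: a value survives the round trip unchanged.
void round_trip_demo()
{
    unsigned short words[kWords];
    split_double(1.5, words);
    assert(join_double(words) == 1.5);
}
```

The order and meaning of the individual words remain
implementation-specific; only the round trip is portable.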

The best solution I've seen here is Modula-3, which had "safe"
modules (the default), and "unsafe" modules (explicitly declared
as such). In a safe module, the only pointers which could exist
were to dynamic objects, and a pointer to T could only point to
a T, or to something derived from a T, and all pointers were
garbage collected. In an unsafe module, practically anything
was allowed. Given the C++ object model, you'd probably have to
loosen the restrictions in "safe" modules somewhat, but I see
nothing wrong with saying that the compiler can assume no
cross-type aliasing except in unsafe modules. (If we ever get
modules, maybe we could arrange for three levels: "safe", with
guaranteed garbage collection, and pointers only allowed to
dynamically allocated objects, the default level, which could
correspond to the current situation, but possibly with no
support for cross-type aliasing, even when char* is involved,
and no reinterpret_cast, and "unsafe", where anything goes, and
the compiler must assume you've used every dirty trick
imaginable.)

[...]

Also the restrict keyword didn't help: g++ doesn't like it.

In C++ code. It works fine in C code, at least if you specify
-std=c99. Which is correct: it is part of C99, but not C90, nor
C++98 or C++03. And I've not heard that it will be adopted into
the next C++ standard; when all is said and done, it's really
just an additional source of undefined behavior.

Long term, of course, it won't be necessary. Compilers are
getting better and better at inter-module optimization, and
there are already compilers (maybe only experimental) which can
detect the lack of aliasing across compilation unit boundaries,
and do this optimization, dependent on whether there actually is
aliasing or not. But for most users, that's probably "very long
term", rather than just "long term".

It's not legal C++. I would expect most C++ compilers to
support it, however, but only as an extension. So you'd
lose it if you turn extensions off (-std=c++98 or -ansi
with g++). I thought that this was the case with g++, but
I've never had the occasion to verify it.

My editor recognises it as a reserved word, but g++ doesn't like
it - at least not without some command line argument.

It's a keyword in C99. If I were writing a compiler, I'd at
least warn if you used it otherwise (e.g. as the name of a
variable). Whether C++ adopts it officially or not, I imagine
that most C++ compilers will eventually support it as an
extension.
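
As a sketch of what such an extension looks like in practice: GCC and
Clang accept the spelling __restrict__ in C++ code, and MSVC accepts
__restrict. The macro NOALIAS and the functions add_scaled and
add_scaled_demo below are hypothetical names chosen for illustration;
on any other compiler the macro expands to nothing, which is an
assumption that costs only the optimization.

```cpp
#include <cassert>
#include <cstddef>

// Map a hypothetical NOALIAS macro onto the compiler's own spelling
// of the non-standard restrict qualifier, if one is known.
#if defined(__GNUC__) || defined(__clang__)
#  define NOALIAS __restrict__
#elif defined(_MSC_VER)
#  define NOALIAS __restrict
#else
#  define NOALIAS
#endif

// The qualifier is a promise by the caller that out and in do not
// overlap. Breaking that promise is undefined behavior.
void add_scaled(float* NOALIAS out, const float* NOALIAS in,
                float k, std::size_t n)
{
    for (std::size_t i = 0; i != n; ++i)
        out[i] += k * in[i];
}

// Usage sketch with two distinct, non-overlapping arrays.
void add_scaled_demo()
{
    float in[3] = { 1.0f, 2.0f, 3.0f };
    float out[3] = { 1.0f, 1.0f, 1.0f };
    add_scaled(out, in, 2.0f, 3);
    assert(out[0] == 3.0f && out[1] == 5.0f && out[2] == 7.0f);
}
```

The compiler does not check the promise, so the qualifier moves the
burden of proving the absence of aliasing from the optimizer to the
programmer.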

As a last measure I tried a wrapper class:

typedef class my_char {
        char data;
public:
        my_char() { }
        my_char(char c) { data = c; }
        char operator=(char c) { return data = c; }
        char operator=(my_char c) { return data = c.data; }
        operator char() { return data; }
} T;

Amazingly, this produces byte by byte the same code as using a
simple char. g++ cannot be right about this one: Does "class
{ char x; }" really have the same aliasing rules as "char"?

You'd have to show us the actual code you used. my_char*
cannot be used to access a pointer, so it should work.

Exactly the same code as above, but with the other typedef:

typedef class my_char {
        char data;
public:
        my_char() { }
        my_char(char c) { data = c; }
        char operator=(char c) { return data = c; }
        char operator=(my_char c) { return data = c.data; }
        operator char() { return data; }
} T;

class test {
        T *data;
public:
        void f(T, T, T);
        void f2(T, T, T);
};

void test::f(T a, T b, T c)
{
        data[3] = a;
        data[4] = b;
        data[5] = c;
}

void test::f2(T a, T b, T c)
{
        T *d = data;
        d[3] = a;
        d[4] = b;
        d[5] = c;
}

Gives byte by byte the same code as with "typedef char T;". Of
course I'm not sure that you can call this a bug since after
all the code is correct, it's just not as efficient as it
could be. Using stronger aliasing rules you're always on the
safe side. Still it makes me wonder where the aliasing rules
are implemented in g++? You can even change the wrapper class
to (note the negations):

Formally, the optimization is legal here. Practically, g++
probably determines that it is dealing with char's (optimizing
out the wrapper) before it applies the aliasing analysis. As
you say, it could be better, since this causes a possible
optimization to be missed, but it is certainly legal. Or
possibly it simply "pessimizes" aliasing analysis anytime it
sees a char in the expression. (This is probably the simplest
way of handling the C++ requirements.)

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34