Re: Efficient use of C++ Strings: Request for Comments

From: Dizzy <dizzy@roedu.net>
Newsgroups: comp.lang.c++.moderated
Date: Wed, 31 Jan 2007 08:12:36 CST
Message-ID: <45c073f9$0$49202$14726298@news.sunsite.dk>

Scott McKellar wrote:

> Having put up some web pages about how to use C++ strings
> efficiently, I hereby invite my betters to rip them to shreds:
>
> http://home.swbell.net/mck9/effstr/

OK, my comments about your "rules":

1. Allocate strings statically, not on the stack

You can't be serious about this. One needs strings all over the place, and
you can't possibly allocate a static version of every string needed during
the lifetime of the program (managing all of those strings would be crazy).
Also consider the destruction order of statics: if the destructor of
another static object needs a static string, things can go very badly when
that string has already been destroyed.

Also, whether static allocation is a speedup at all depends on the actual
implementation. Not every implementation puts every puny std::string on
the heap; I can easily imagine a std::string implementation that embeds a
character array to store strings up to some predefined length (small
strings, but ones that may occur many times in a program). You say one
cannot base code on a particular implementation, but I say it depends on
what you want to achieve. If your objective is to optimize some code, then
clearly that optimization is PER implementation (you profile each specific
executable/implementation and draw conclusions from that), as the code may
behave completely differently on one implementation than on another. This
tip is one such implementation-dependent thing.
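
Whether a given library embeds small strings can be probed with a rough
sketch like the one below. The helper name storedInline is hypothetical,
and the check (does the character buffer lie inside the string object
itself?) is a heuristic about one implementation, not something the
standard promises.

```cpp
#include <functional>
#include <string>

// Hypothetical helper: returns true if the character buffer of `s` lives
// inside the std::string object itself (i.e. no separate heap block).
// This is a heuristic probe of one implementation, not portable behaviour.
bool storedInline(const std::string& s) {
    const char* objBegin = reinterpret_cast<const char*>(&s);
    const char* objEnd = objBegin + sizeof(std::string);
    const char* buf = s.data();
    // std::less gives a well-defined ordering even for unrelated pointers.
    std::less<const char*> before;
    return !before(buf, objBegin) && before(buf, objEnd);
}
```

A 1000-character string should report false everywhere; what a short
string such as "abc" reports is exactly the per-implementation detail
discussed above.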

In conclusion, I wouldn't advise people to use static strings instead of
stack ones. Instead, if their profiler points to a CPU problem because the
string implementation isn't fast enough for small strings, I would use
another string implementation, written by me or by someone else (possibly
built on the available std::string and relaying some of the operations to
it).

Your tip does make sense for character string literals, though: to avoid
unnecessary character pointer interfaces, one can declare static constant
std::string objects instead of using string literals (see also tip 3).
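
A minimal sketch of that idea, with hypothetical function names
(countOccurrences, countSlashes): the interface takes only std::string, and
the literal is wrapped once in a function-local static constant rather than
converted to a temporary on every call.

```cpp
#include <cstddef>
#include <string>

// Hypothetical function with a std::string-only interface.
std::size_t countOccurrences(const std::string& text,
                             const std::string& token) {
    if (token.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = text.find(token); pos != std::string::npos;
         pos = text.find(token, pos + token.size())) {
        ++count;  // counts non-overlapping matches
    }
    return count;
}

std::size_t countSlashes(const std::string& path) {
    // Constructed once, on the first call, instead of building a temporary
    // std::string from the literal "/" on every call.
    static const std::string slash("/");
    return countOccurrences(path, slash);
}
```

The destruction-order caveat from above still applies if countSlashes()
could be called from the destructor of some other static object.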

2. Don't pass strings by value

I generally agree with this (and not only for strings: any non-built-in
object should be passed by reference to const where possible).
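
To make the difference visible, here is a small sketch with a hypothetical
wrapper class (CountedString) whose copy constructor counts its calls.
Passing an existing object by value has to copy it; passing by reference
to const does not.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper that counts how often it is copied.
struct CountedString {
    std::string value;
    static int copies;

    explicit CountedString(const std::string& v) : value(v) {}
    CountedString(const CountedString& other) : value(other.value) {
        ++copies;
    }
};
int CountedString::copies = 0;

// By value: the argument is copy-constructed from the caller's object.
std::size_t lengthByValue(CountedString s) { return s.value.size(); }

// By reference to const: no copy is made.
std::size_t lengthByConstRef(const CountedString& s) {
    return s.value.size();
}
```

Calling lengthByValue() with a named object raises CountedString::copies
by one, while lengthByConstRef() leaves it unchanged.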

3. Provide overloaded functions that accept character pointers instead of
strings

I guess this may help if the functions are often called with character
pointers instead of std::string. However, a program that uses std::string a
lot (and I can't think of many reasons why most C++ programs shouldn't) may
not need any internal character pointer interfaces, as it can just pass
around references to std::string.

Character pointer interfaces also have the drawback that you lose some
of the metadata stored in std::string (like its size). If the character
pointer function needs the size of the string (directly or indirectly), it
has to run an O(n) operation (i.e. strlen) on the character string instead
of calling the cached std::string::size() method.
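
A small sketch of the lost metadata, using two hypothetical helper
functions. With an embedded '\0' in the string, the pointer version does
not just take longer, it reports a different length.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Character pointer interface: the length has to be rediscovered by
// scanning for the terminating '\0', which is O(n).
std::size_t lengthViaPointer(const char* s) { return std::strlen(s); }

// std::string interface: size() reads the stored length.
std::size_t lengthViaString(const std::string& s) { return s.size(); }
```

For std::string s("abc\0def", 7), lengthViaString(s) gives 7, while
lengthViaPointer(s.c_str()) stops at the embedded terminator and gives 3.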

So usually I tend to avoid having character pointers and just receive
references to const std::string.

4. Don't return strings by value

Maybe you are right for a very specific test case, where returning the
string on a specific std::string implementation is the CPU killer, but I
really believe this tip shouldn't be applied in general (only in such a
particular case). Why? Because the CPU cost of returning a string by
value:
- is usually eliminated by RVO in my programs, as I benchmarked it (write a
test function that returns by value some object of yours whose copy
constructor prints a message, and you will see that compilers optimize
away any copy; at least g++ 4.1.x did so in my tests)
- depends on the implementation (a reference-counted implementation such as
gcc's libstdc++ just increments a counter)
- will surely disappear with the C++0x standard library, because with
rvalue references such returned temporaries won't be copied unnecessarily;
only some pointers will be copied
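
The RVO test described in the first point could look roughly like this.
The class name Tracer and the helper functions are hypothetical, and the
copy constructor counts calls instead of printing a message so that the
result is easy to check.

```cpp
#include <string>

// Hypothetical test class: counts copy constructions instead of printing.
struct Tracer {
    std::string payload;
    static int copies;

    explicit Tracer(const std::string& p) : payload(p) {}
    Tracer(const Tracer& other) : payload(other.payload) { ++copies; }
};
int Tracer::copies = 0;

// Returns a temporary by value; the compiler may construct it directly in
// the caller's object (RVO) and skip the copy constructor entirely.
Tracer makeTracer() { return Tracer("some fairly long payload text"); }

// Number of copy constructions observed for one by-value return.
int copiesForOneReturn() {
    Tracer::copies = 0;
    Tracer t = makeTracer();
    (void)t;  // silence "unused variable" warnings
    return Tracer::copies;
}
```

With copy elision in effect, copiesForOneReturn() reports 0. A build with
elision switched off (for example g++ with -fno-elide-constructors in a
pre-C++17 mode) would report a higher number.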

Because avoiding return by value tends to lead to worse code than
returning by value (design-wise), I wouldn't avoid it, especially since
in the future any CPU overhead will be eliminated on all implementations
by rvalue reference semantics.

Example of messy code (IMO of course):

// avoid returning by value
void buildRoot(std::string& str);

MyClass::MyClass()
:m_root(), m_memb2(arg1, arg2), m_memb3(arg3, arg4)
{
         buildRoot(m_root);
}

Compared with

// return by value
std::string buildRoot();

MyClass::MyClass()
:m_root(buildRoot()), m_memb2(arg1, arg2), m_memb3(arg3, arg4)
{}

I consider the second version much better, especially considering things
such as exceptions: buildRoot() may throw instead of returning in error
cases, and in those cases it's pointless to construct m_memb2 and m_memb3,
because their construction might be costly. Not to mention that m_memb2
might need m_root as its constructor argument, and then what do you do to
solve this? Would you add a default constructor to m_memb2 to delay its
initialization? (That technique is messy too, not to mention that you would
be modifying the design of such a class for the sake of some strange
optimization that is unrelated to the design of m_memb2.)
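
The exception argument can be checked with a sketch along these lines. All
names are hypothetical; Costly stands in for an expensive member type, and
both builders always throw to model the error case.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical costly member: counts how many times it gets constructed.
struct Costly {
    static int constructed;
    Costly() { ++constructed; }
};
int Costly::constructed = 0;

// Both builders fail, to model the error case.
std::string buildRootByValue() { throw std::runtime_error("no root"); }
void buildRootInto(std::string&) { throw std::runtime_error("no root"); }

struct ByValue {
    std::string m_root;  // declared first, so initialized first
    Costly m_memb2;
    ByValue() : m_root(buildRootByValue()), m_memb2() {}
};

struct OutParam {
    std::string m_root;
    Costly m_memb2;
    OutParam() : m_root(), m_memb2() { buildRootInto(m_root); }
};

// Returns how many Costly objects were built before construction failed.
template <class T>
int costlyBuiltBeforeFailure() {
    Costly::constructed = 0;
    try {
        T object;
        (void)object;
    } catch (const std::runtime_error&) {
        // construction failed, as intended for this experiment
    }
    return Costly::constructed;
}
```

costlyBuiltBeforeFailure<ByValue>() gives 0, because the throw happens
before m_memb2 is initialized, while costlyBuiltBeforeFailure<OutParam>()
gives 1: the costly member is built and then torn down again for nothing.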

5. Don't use string::operator+()

I guess this comes from one of the optimization techniques, to replace code
such as:
std::string str3(str1 + "text1");
with code:
std::string str3(str1);
str3 += "text1";

This avoids the possible overhead of creating temporaries, especially
when you have more than one "+" in the expression. Again, this will
generally be a non-issue with rvalue references in the future, so don't
stress too much about optimizing it unless your profiling clearly shows it
to be a killer. But because I don't have a design issue with the optimized
version versus the non-optimized code (as I have with the example code in
point 4), I guess it's OK to have this as a general tip.
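
For an expression with more than one "+", the two styles might be compared
with a sketch like this (function names are hypothetical). Both produce
the same string; the second grows a single object in place.

```cpp
#include <string>

// Chained operator+: each "+" may produce an intermediate temporary.
std::string greetWithPlus(const std::string& name) {
    return "Hello, " + name + "! Welcome back.";
}

// Same result built in place: one string, grown with +=.
std::string greetWithAppend(const std::string& name) {
    std::string result("Hello, ");
    result += name;
    result += "! Welcome back.";
    return result;
}
```

Whether the second version is measurably faster is something only a
profiler run on the actual implementation can tell.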

6. Don't use string::substr()

This probably follows from point 4, but as I consider point 4 invalid I
don't have a problem with using substr(). However, there is a difference
between using substr() and "abusing" it.
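
One possible reading of "abusing" (an assumption, since no example is
given above) is creating a substring only to compare it and throw it away.
A sketch with hypothetical function names:

```cpp
#include <string>

// "Abusing" substr(): builds a temporary string just to compare it.
bool startsWithSubstr(const std::string& s, const std::string& prefix) {
    return s.substr(0, prefix.size()) == prefix;
}

// Same check via compare(), which compares in place without a temporary.
bool startsWithCompare(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}
```

When the substring itself is the result you need, substr() remains the
natural tool.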

7. Preallocate space for large strings.

I'm not sure what you mean by this. You probably have some specific sample
code in mind; it would help if you cared to show it.
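
If the tip is about std::string::reserve() (an assumption on my part), the
sample code might be something like this sketch. The function name
joinLines is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins lines with '\n', reserving the final size up
// front so that the appends do not have to reallocate.
std::string joinLines(const std::vector<std::string>& lines) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        total += lines[i].size() + 1;  // +1 for the newline
    }

    std::string result;
    result.reserve(total);
    for (std::size_t i = 0; i < lines.size(); ++i) {
        result += lines[i];
        result += '\n';
    }
    return result;
}
```

This needs a first pass to compute the size, so it pays off mainly when
the result is large and would otherwise grow through many reallocations.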

8. Consider using C-style character arrays.

I would actually propose the exact reverse: consider using only
std::string (per my recommendations for tips 1 and 3). What's wrong with
passing a reference to const everywhere you need a string?

9. Prefer initialization to assignment.

Completely agree (in general, not just for std::string). This goes with
the general advice to declare (local) variables only where you can
initialize them.

10. Use string::empty() to test for an empty string.

Although I couldn't find the time complexity specification for
std::string::size() (and std::string::empty()), I would expect both to be
constant time (the same cannot be said of std::list::size(), for obvious
reasons). In general I too think people should use empty() to check whether
a string is empty: it keeps working well if the code later switches to
std::list for some reason, and I think the code is more explicit, which is
a good thing(tm) :)
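
A short sketch of why empty() carries over nicely: the hypothetical
template below works unchanged for std::string and std::list, and never
asks for a size that it does not need.

```cpp
#include <list>
#include <string>

// Works unchanged for std::string, std::list and other standard
// containers, since all of them provide empty().
template <class Container>
bool hasContent(const Container& c) {
    return !c.empty();
}
```

The same call, hasContent(x), reads the same whether x is a std::string or
a std::list<int>.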

------

In your article you say that most of the problems come either from the
initialization of small strings or from unnecessary copy operations. I
think both can be eliminated by a string implementation that internally
optimizes for these cases. As such, I would rather advise people to pick a
string implementation that fits their needs (and use it well) than advise
them on how to misuse an implementation that is wrong for their needs.

--
Dizzy
http://dizzy.roedu.net
