Sat, 29 Aug 2009 23:08:54 +0200
VC9 SP1 contains optimizations to avoid incrementing and decrementing
shared_ptr's refcount (e.g. when reallocating a vector<shared_ptr<T>>).
VC10's move semantics extend this even further. Also, VC10's
make_shared<T>() eliminates the overhead of having two dynamic memory
allocations; it's able to put the object and its reference count control
block in the same chunk of memory, which gives you efficiency very close
to that of intrusive reference counting.
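
To make the single-allocation point concrete, here is a minimal sketch, assuming a C++11 or later compiler (so not VC8 or VC9). It uses std::allocate_shared, the allocator-aware sibling of make_shared, so that the allocations can be counted. CountingAllocator, Widget, g_allocations and the two helper functions are hypothetical names made up for this illustration, and the counts are what common standard library implementations do rather than a hard guarantee.

```cpp
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical counter: number of allocate() calls made through CountingAllocator.
static int g_allocations = 0;

// Minimal allocator that counts allocations and forwards to std::allocator.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
    }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

struct Widget { int value = 42; };

// The object is allocated first; shared_ptr then allocates its control block separately.
int allocations_separate() {
    g_allocations = 0;
    CountingAllocator<Widget> alloc;
    Widget* raw = alloc.allocate(1);                 // allocation 1: the object
    new (raw) Widget();
    auto deleter = [](Widget* w) {
        w->~Widget();
        CountingAllocator<Widget>().deallocate(w, 1);
    };
    std::shared_ptr<Widget> p(raw, deleter, alloc);  // allocation 2: the control block
    return g_allocations;
}

// Object and control block are placed in one chunk of memory.
int allocations_combined() {
    g_allocations = 0;
    auto p = std::allocate_shared<Widget>(CountingAllocator<Widget>());
    return g_allocations;
}
```

On the implementations I know of, allocations_separate() reports two allocations and allocations_combined() reports one, which is the saving described above.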

I mainly use VC8 SP1 (VS2005) right now. This is a deliberate choice: I had
too many stability problems with VC9, even with SP1, and it brings no changes
that are really useful to me; also, VC8 is more widely used by the other
members of my projects.

Of course, shared_ptr uses interlocked operations for threadsafe
refcounting. Naive intrusive refcounts that use ordinary increments and
decrements will obviously be faster, but restricts ownership to a single

This was not really the question I asked, and my code was only a quick first
attempt at testing this kind of feature, but thank you anyway for pointing
this out: I'm aware of this race condition problem, but I don't know it very
well.

You seem to know a lot about this subject, so maybe you can help me with this
as well?
I modified my code to use the Interlocked functions (see the bottom of this
message). But I'm still not sure that it is really thread-safe: another thread
can probably change the reference count between InterlockedCompareExchange and
InterlockedDecrement in my Release function. How can I be sure to avoid this
problem?
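
For reference, the pattern usually suggested for this kind of Release is to perform one atomic decrement and decide from the value that this same operation returns, so there is no gap between a "test" step and a "decrement" step. Below is a minimal sketch of that idea, assuming a C++11 compiler and std::atomic instead of the Win32 Interlocked functions (with Win32, the equivalent would be testing the value returned by InterlockedDecrement). RefCounted and Tracked are hypothetical names, not my actual classes.

```cpp
#include <atomic>

// Hypothetical intrusive reference-counted base class (illustration only).
class RefCounted {
public:
    RefCounted() : refs_(0) {}

    // Atomically add one reference and return the new count.
    unsigned long AddRef() { return refs_.fetch_add(1) + 1; }

    // Atomically drop one reference. The decision to destroy is based only on
    // the value produced by this single atomic operation, so exactly one
    // caller can ever observe the count reaching zero.
    unsigned long Release() {
        const unsigned long remaining = refs_.fetch_sub(1) - 1;
        if (remaining == 0)
            delete this;
        return remaining;
    }

protected:
    // Protected: destruction must only happen through Release().
    virtual ~RefCounted() {}

private:
    std::atomic<unsigned long> refs_;
};

// Hypothetical derived type that records its own destruction in a flag.
struct Tracked : RefCounted {
    explicit Tracked(bool* destroyed) : destroyed_(destroyed) {}
    ~Tracked() override { *destroyed_ = true; }
    bool* destroyed_;
};
```

A caller would do something like: create a Tracked with new, call AddRef() once per owner, and call Release() once per owner; the object is deleted by whichever Release() call returns 0.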

About performance: I looked at the generated assembly and verified that the
actual code for these Interlocked functions is only about 5 or 6 machine
instructions, like this:

7C809806 8B 4C 24 04 mov ecx,dword ptr [esp+4]
7C80980A B8 01 00 00 00 mov eax,1
7C80980F F0 0F C1 01 lock xadd dword ptr [ecx],eax
7C809813 40 inc eax
7C809814 C2 04 00 ret 4

Even considering that there may (probably quite rarely) be some waiting while
the memory is locked by other threads, do you really think that I can have
significant overhead problems with that?
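
To relate the listing above to source code: it is an atomic fetch-and-add of 1 (the lock xadd, which leaves the old value in eax) followed by an ordinary increment of that returned value, so the function returns the new count. A minimal portable sketch of the same operation, assuming C++11 std::atomic; increment_and_fetch is a hypothetical helper name, and the exact instructions a compiler emits for it are not guaranteed.

```cpp
#include <atomic>

// Hypothetical helper mirroring the listing: atomically add 1 to the counter
// and return the resulting (new) value.
long increment_and_fetch(std::atomic<long>& counter) {
    const long previous = counter.fetch_add(1);  // atomic step (lock xadd in the listing)
    return previous + 1;                         // plain arithmetic on the returned value (inc eax)
}
```

Only the fetch_add touches shared memory; the final addition works on a local copy, which is why the result stays correct even if other threads modify the counter right afterwards.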



/** Minimal \p IUnknown implementation for being able to only take advantage of its
 * reference count feature and to allow building smart pointers using \p [...]
 */
class CUnknownStub : public IUnknown
{
    volatile ULONG m_cRef; //!< Reference counter

public:
    /// Minimal interface query
    virtual STDMETHODIMP QueryInterface(REFIID riid, void **ppv);

    /// Minimal reference increment implementation
    virtual ULONG STDMETHODCALLTYPE AddRef() {
        return InterlockedIncrement(reinterpret_cast<volatile LONG *>(&m_cRef)); }

    /// Minimal reference decrement implementation (with object destruction when reaching zero)
    virtual ULONG STDMETHODCALLTYPE Release();

    /// Constructor
    CUnknownStub() :
        m_cRef(0) // initialize reference count
    { }

protected:
    /// Destructor (protected because it must never be externally called)
    virtual ~CUnknownStub() {
        /* empty */ }
};

/// Minimal interface query
STDMETHODIMP CUnknownStub::QueryInterface(REFIID riid, void **ppv)
{
    if (!ppv)
        return E_POINTER;
    if (riid == IID_IUnknown) {
        *ppv = (LPVOID) this;
        AddRef(); // AddRef
        return S_OK;
    }
    *ppv = NULL;
    return E_NOINTERFACE;
}


/// Minimal reference decrement implementation (with object destruction when reaching zero)
ULONG STDMETHODCALLTYPE CUnknownStub::Release()
{
    if (
        InterlockedCompareExchange(
            reinterpret_cast<volatile LONG *>(&m_cRef), DESTRUCTOR_REFCOUNT, 1) == 1
    ) {
        delete this;
        return 0;
    }
    return InterlockedDecrement(reinterpret_cast<volatile LONG *>(&m_cRef));
}

