Re: C++ programming challenge

From: James Kanze <james.kanze@gmail.com>
Newsgroups: comp.lang.c++
Date: Mon, 15 Jun 2009 02:59:24 -0700 (PDT)
Message-ID: <c4d9db20-fc26-4ed3-ba81-7c9309c4e205@h28g2000yqd.googlegroups.com>

On Jun 14, 4:30 am, Jerry Coffin <jcof...@taeus.com> wrote:

In article <12ba30e3-b04b-4c32-9625-
7bd26cb8c...@g20g2000vba.googlegroups.com>, webmas...@jansson.net
says...

I have created a small programming challenge for those of
you who are interested in challenging your Standard C++
programming skills. The challenge is about counting
character frequency in large texts, perhaps useful for spam
filtering or classical crypto analysis. You can read more
about it here:

Yet another bit of code:

#include <iostream>
#include <fstream>
#include <vector>
#include <numeric>
#include <iomanip>
#include <iterator>   // std::istream_iterator
#include <cctype>     // tolower, isalpha

class counter {
    std::vector<double> counts;
    long total;
public:
    static const int num = 'z'-'a'+1;

    counter() : counts(num), total(0) {}

    counter operator+(char c) {
        char ch = tolower(c);
        if (isalpha(ch)) {
            ++total;
            ++counts[ch-'a'];
        }
        return *this;
    }

    friend std::ostream &operator<<(std::ostream &os, counter const &c) {
        for (int i=0; i<num; ++i)
            if (isalpha('a'+i))
// The single most complex part of the program is getting the
// output formatted as your web site shows it!
                os << char('a'+ i) << " "
                   << std::fixed
                   << std::setw(6)
                   << std::setprecision(3)
                   << c.counts[i]/c.total*100.0 << "%\n";
        return os;
    }
};

int main(int argc, char **argv) {
    std::ifstream input(argv[1]);

    // accumulate returns by value; a non-const reference cannot bind to it.
    counter counts = std::accumulate(
        std::istream_iterator<char>(input),
        std::istream_iterator<char>(),
        counter());

    std::cout << counts;
    return 0;
}

I suppose it's open to argument whether I might be abusing
std::accumulate and/or operator overloading. I (obviously)
don't really think so, but I'll admit that especially in the
latter case there's room for question.

I think it's a good example of the sort of thing accumulate
should be usable for. Note, however, that accumulate will
normally copy your counter twice for each element, which isn't
going to help performance much. In the past, I've used a
somewhat hacky solution to avoid this:

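    class Counter
    {
        // as above, plus an operator+=( char ) that updates the
        // counts in place, and in addition...
    public:
        class Dummy {} ;
        class AddOp
        {
        public:
            // Updates dest in place; the Dummy result carries no state.
            Dummy operator()( Counter& dest, char ch ) const
            {
                dest += ch ;
                return Dummy() ;
            }
        } ;
        // accumulate assigns the Dummy result back to its
        // accumulator; make that assignment a no-op.
        Counter& operator=( Dummy const& )
        {
            return *this ;
        }
    } ;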

Then invoke accumulate with Counter::AddOp() as the fourth
argument. It's horrible, but with my SHA-1 accumulator, it
made more than an order of magnitude of difference in
performance, and my SHA-1 accumulator was probably a lot
cheaper to copy than your Counter class.
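
For reference, the trick above can be filled out into something that compiles on its own. This is only a sketch under assumptions: operator+=, the count() and total() accessors, the static copies counter, and the count_letters helper are all hypothetical names added for illustration, and the counts are kept in long long rather than double. One further assumption worth flagging: since C++20, accumulate passes std::move(acc) to the operation, so the sketch adds a second overload taking Counter&& alongside the Counter& one.

```cpp
#include <cassert>
#include <cctype>
#include <iterator>
#include <numeric>
#include <string>
#include <vector>

class Counter {
    std::vector<long long> counts_;
    long long total_;
public:
    static const int num = 'z' - 'a' + 1;
    static int copies;   // hypothetical instrumentation: copy constructions

    Counter() : counts_(num), total_(0) {}
    Counter(Counter const& other)
        : counts_(other.counts_), total_(other.total_) { ++copies; }

    // In-place update; no Counter is copied here.
    Counter& operator+=(char c) {
        unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalpha(uc)) {
            int idx = std::tolower(uc) - 'a';
            if (idx >= 0 && idx < num) {
                ++counts_[idx];
                ++total_;
            }
        }
        return *this;
    }

    struct Dummy {};

    struct AddOp {
        // Before C++20, accumulate passes its accumulator as an lvalue.
        Dummy operator()(Counter& dest, char ch) const {
            dest += ch;
            return Dummy();
        }
        // From C++20 on, accumulate passes std::move(acc) instead.
        Dummy operator()(Counter&& dest, char ch) const {
            dest += ch;
            return Dummy();
        }
    };

    // accumulate assigns the Dummy back to its accumulator: a no-op.
    Counter& operator=(Dummy const&) { return *this; }

    long long total() const { return total_; }
    long long count(char letter) const {
        int idx = std::tolower(static_cast<unsigned char>(letter)) - 'a';
        assert(idx >= 0 && idx < num);
        return counts_[idx];
    }
};

int Counter::copies = 0;

// Hypothetical helper: counts the letters of text via accumulate and AddOp.
Counter count_letters(std::string const& text) {
    return std::accumulate(text.begin(), text.end(),
                           Counter(), Counter::AddOp());
}
```

With this, Counter::copies should stay at a small constant (the copy made when accumulate returns its result) no matter how long the input is, whereas the operator+ version pays for copies on every element.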

Also, I'd avoid floating point for the accumulation; long long
seems preferable (but since we have no idea what the largest
file size to be handled is, who knows).
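
To make the floating point concern concrete: an IEEE 754 double represents every integer exactly only up to 2^53, after which adding 1 can be silently lost. That is far beyond most real files, so the following is a sketch of the principle rather than a practical failure case. The function names percentage and increment_is_lost are hypothetical; the idea is to keep the counts as long long and convert to double only when formatting the output.

```cpp
#include <cassert>

// Hypothetical helper: share of count in total, as a percentage.
// The counts stay integral; the conversion happens only here.
double percentage(long long count, long long total) {
    assert(total > 0);
    return 100.0 * static_cast<double>(count) / static_cast<double>(total);
}

// Returns true if adding 1 to value, stored back into a double,
// leaves it unchanged. The volatile forces the store to double width.
bool increment_is_lost(double value) {
    volatile double next = value + 1.0;
    return next == value;
}
```

Assuming IEEE 754 doubles, increment_is_lost(9007199254740992.0) (that is, 2^53) is true, while a long long counter at the same value still increments exactly.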

I don't have the right sort of machine handy to test it with,
but I believe this should work with something like EBCDIC.
Making it work correctly with almost any Unicode encoding
would take substantial modifications.

It will work for EBCDIC, but it won't work for ISO 8859-1 (which
is probably the most widespread single byte encoding).
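
The ISO 8859-1 problem is presumably twofold. Where plain char is signed, an accented letter such as 0xE9 is negative, and passing a negative value (other than EOF) to tolower or isalpha is undefined behavior. And in a Latin-1 locale, isalpha is true for such a character, so ch-'a' lands outside the table. A minimal sketch of a guard, assuming a table of 'z'-'a'+1 slots as in the code above; letter_index and num_letters are hypothetical names:

```cpp
#include <cassert>
#include <cctype>

const int num_letters = 'z' - 'a' + 1;

// Hypothetical helper: index into a table of num_letters slots,
// or -1 if the character should not be counted.
int letter_index(char c) {
    // Convert first, so the <cctype> functions never see a negative value.
    unsigned char uc = static_cast<unsigned char>(c);
    if (!std::isalpha(uc))
        return -1;
    int idx = std::tolower(uc) - 'a';
    // Letters outside 'a'..'z' (accented ones, say) fall outside the table.
    int result = (idx >= 0 && idx < num_letters) ? idx : -1;
    assert(result >= -1 && result < num_letters);
    return result;
}
```

This only keeps the program from misbehaving; it ignores accented letters rather than counting them. Whether that is the right behavior depends on what the program is supposed to do.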

Which may or may not be a problem. We still don't know what the
program we're supposed to write is supposed to do.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34