Re: StringBuilder Difficulties

From:

blmblm@myrealbox.com <blmblm.myrealbox@gmail.com>

Newsgroups:

comp.lang.java.programmer

Date:

6 Jul 2011 16:59:04 GMT

Message-ID:

<97jiioFr6pU2@mid.individual.net>

In article <rma717948c79cs9nnupdqjrhuu965bag76@4ax.com>,
Gene Wirchenko <genew@ocis.net> wrote:

On 5 Jul 2011 21:16:44 GMT, blmblm@myrealbox.com
<blmblm.myrealbox@gmail.com> wrote:

In article <jsu6171vmasu6jfvsbroa052n8u7ukarr3@4ax.com>,
Gene Wirchenko <genew@ocis.net> wrote:

On 5 Jul 2011 19:12:39 GMT, blmblm@myrealbox.com
<blmblm.myrealbox@gmail.com> wrote:

[snip]

What I found is that HashSet was noticeably faster on all the
systems where I ran the benchmarks. Unless you need for the set
to be sorted (and it's not apparent from your code that you do),
why not .... ? (I'm curious too about why you chose TreeSet in
the first place. ? )

1) I can output the set in order without having to do anything else.
My real program has a lot of debugging info dumping. (Read as "checks
that I have not done something wrong".)

Ah. Well, yes, then you probably do need a SortedSet, though
considering that you initially build the set from a string that's in
order, maybe you could use that (the string) instead. Just sayin',
maybe.

I set the string value that way so that the binary search would
work.

Ah, okay.

2) When I read "hash", I think "collision", and I get nervous.
Nothing I read reassured me that that could not happen.

Why would that make you nervous? If you're worried about
correctness, um, as far as I know hash tables are supposed to
deal with collisions in some way that preserves the overall "map"
semantics. Performance may suffer if there are a lot of collisions,
but -- benchmark on the system(s) of interest and check?

I would need to know what the behaviour is supposed to be.

Maybe there's something I'm not understanding, but as far as I
know hash tables (including the Java HashMap class that, according
to the API, is used by HashSet) are required to do something that
allows storing multiple distinct keys that have the same hash code
("collisions"). The implementation that comes to mind is one that
has some sort of list for each possible hashcode value. Obviously(?)
performance suffers if very many of these lists have more than one
or a few elements, but correctness (in the sense of considering
keys to be equal based only on what equals() returns and not on
what hashCode() returns) is preserved.

I don't think I'm explaining this very well. Maybe the following
short test program will do better -- it's a totally contrived example
of using a HashSet to store elements that are distinct but have the
same hash code:

import java.util.*;

public class HashSetTest {

    public static class Foo {
        public final int n;

        public Foo(int n) {
            this.n = n;
        }

        @Override
        public int hashCode() {
            return 0;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof Foo) {
                return n == ((Foo) obj).n;
            }
            else {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Set<Foo> s = new HashSet<Foo>();
        // add what should be ten distinct elements
        for (int i = 0; i < 10; ++i) {
            s.add(new Foo(i));
        }
        // add what should be duplicate elements
        for (int i = 0; i < 10; ++i) {
            s.add(new Foo(i));
        }
        // iterate over all elements of s -- this "foreach" syntax
        // is possible AFAIK with all classes that implement Collection
        for (Foo foo : s) {
            // note use of C-like printf
            System.out.printf("foo with value %d, hashcode %d\n",
                    foo.n, foo.hashCode());
        }
    }
}

When I run this I get 10 lines of output, indicating to me that
whether two elements are considered duplicates depends on equality
as determined by the equals() method rather than on hashcode.

Or maybe I'm once again totally misunderstanding you, and your actual
concern is something else ....

I
have something that works for now. Optimisation comes after getting
it working.

Well, yeah, sure, but this rather seems at odds with your having
done benchmarking of various ways of looking up a character in a
set of characters.

Anyway if at some point it seems that performance of your lookup
isn't good enough, it might be worthwhile to explore HashSet, if
you can think of some way to get the debugging information you need
that doesn't depend on being able to retrieve elements from the set
in sorted order. (E.g., maybe you could do a one-time operation
that retrieves all the elements, sorts them, and saves the result.)

Or, as supercali* suggests in another post, you could use a
LinkedHashSet. Replace HashSet with LinkedHashSet in the above test
program to see the difference. This would require you to build your
set from elements in the right order, but assuming building the set
is a one-time thing, it should be easy enough to sort the elements
before adding them, or so I would think.

3) I had to pick something. If it works, I can change it later. If
it does not, I have not solved my problem yet. The former is safer.

Sure. Then again, if you were only concerned about getting something
that works, why try various alternatives .... <shrug>

1) Just in case there was a BIG difference.

In my experiments HashSet was about twice as fast as TreeSet. Of
course if you needed the sortedness, HashSet is off the table, but
LinkedHashSet (which I did not know about!) seems like it would do
what you need and is about as fast as HashSet on the systems where
I tried it.

The code using java.util.regex.* wasn't as fast on the old system
where I did my initial experiments, but on a newer system it seems
to give performance comparable to (Linked)HashSet.

Just sayin'.

2) To learn more
about Java. At some point though, I have to do something useful with
it.

Sure. I'm dimly aware that I'm probably flogging a horse that has long
since shuffled off this mortal coil.

My preprocessor is shaping up nicely. I have to write the symbol
define command code, handle the include command, and handle output to
a file. That is about it. Which means that there are probably only
about three other things that I will come up.

And then you can go back to coding in a language you like better? :-)

--
B. L. Massingill
ObDisclaimer: I don't speak for my employers; they return the favor.