Re: Collection interfaces (Was: Creating a byte[] of long size)

ClassCastException <zjkg3d9gj56@gmail.invalid>
Sat, 10 Jul 2010 05:32:00 +0000 (UTC)
On Fri, 09 Jul 2010 16:54:53 -0400, Eric Sosman wrote:

On 7/9/2010 4:06 PM, Daniel Pitts wrote:

interface Hasher<T> {
    long hash(T value);
}

     A 64-bit hashCode() would be of little use until you got to
more than 2^32 hash buckets. Just saying.

Gets us back to the original topic. :-)

interface Equivalence<T> {
    boolean equal(T left, T right);
}

     I don't get it: Why not just use equals()? I guess a class
could choose not to implement Equivalence at all (and thus make itself
unusable in whatever framework relies on Equivalence), but is that an
advantage? Also, you could get a compile-time error instead of a
run-time `false' for trying to call equal() on references of dissimilar
classes; again, where's the benefit?

Then, all the appropriate Collection code could use those interfaces.
There should also be the obvious default implementations.

     It might be helpful to give some examples of the "appropriate"
uses, and of the "obvious" defaults. For example, how does a HashMap
make use of a key that implements Hasher? Does it reflect on each key
it's given and make a run-time choice between using hash() and
hashCode()? I don't get it ...

Note that those interfaces specify methods with an "extra" parameter
each. They're like Comparator versus compareTo/Comparable.

The purpose is clear: so a HashMap could be given, optionally, a
Hasher<K> to use in place of the keys' own hashCode methods and an
Equivalence<K> to use in place of the keys' own equals methods.

One obvious benefit is that you get rid of IdentityHashMap by folding
that functionality into plain HashMap. Instead of a separate class, you'd
get an identity hash map with

new HashMap<K,V>(new Hasher<K>() {
        public long hash (K x) {
            return System.identityHashCode(x);
        }
    }, new Equivalence<K>() {
        public boolean equal (K x, K y) {
            return x == y;
        }
    })
or with canned instances of IdentityHasher and IdentityEquivalence
provided by the library.
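None of this exists in the library today, so as a sketch only: the canned instances might look like the following, with Hasher and Equivalence as defined upthread (the class names and the use of plain final classes are my assumption).

```java
// Hypothetical interfaces from upthread; not part of the JDK.
interface Hasher<T> {
    long hash(T value);
}

interface Equivalence<T> {
    boolean equal(T left, T right);
}

// Canned identity-based implementations, as suggested above.
final class IdentityHasher<T> implements Hasher<T> {
    public long hash(T value) {
        // Same hash an IdentityHashMap effectively uses today.
        return System.identityHashCode(value);
    }
}

final class IdentityEquivalence<T> implements Equivalence<T> {
    public boolean equal(T left, T right) {
        // Reference identity, not equals().
        return left == right;
    }
}
```

Since both classes are stateless, a real library would probably expose them as shared singletons rather than have callers allocate fresh instances.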

With this, you would also be able to get identity WeakHashMaps and the
like; by separating the "how strong is the reference" aspect into one
class and the "how is identity decided" aspect into another, you avoid a
combinatorial explosion and possible lacunae of capability (right now we
have no WeakIdentityHashMap, in particular).

You'd also be able to reduce some of the clumsier uses of HashMap to
HashSet. Picture a

class Record {
    public final int id;
    public final String name;
    public final String address;
}

with the obvious equality semantics (all fields equal) and the obvious
constructor.

Now throw in an Equivalence and a Hasher that use only the record's id.

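That pair might look like this (a sketch, again using the hypothetical Hasher and Equivalence interfaces from upthread; Record is repeated here, with the implied constructor, just to keep the example self-contained):

```java
// Hypothetical interfaces from upthread.
interface Hasher<T> {
    long hash(T value);
}

interface Equivalence<T> {
    boolean equal(T left, T right);
}

class Record {
    public final int id;
    public final String name;
    public final String address;

    public Record(int id, String name, String address) {
        this.id = id;
        this.name = name;
        this.address = address;
    }
}

// Treat two Records as interchangeable whenever their ids match,
// regardless of name and address.
class RecordIdHasher implements Hasher<Record> {
    public long hash(Record r) {
        return r.id;
    }
}

class RecordIdEquivalence implements Equivalence<Record> {
    public boolean equal(Record a, Record b) {
        return a.id == b.id;
    }
}
```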
So maybe you keep a change log for an individual person as a
List<Record>, chronological:

id 0001
name Jane Herman
address 1600 Pennsylvania Avenue

id 0001
name Jane Herman
address 18 Wisteria Lane

id 0001
name Jane Thurston
address 18 Wisteria Lane

OK, so she got voted out of office, then got married, or something like
that.

Of course you might want to throw a jumble of records in a Set and have
each distinct version above count as a separate element.

But you might also want a record of the current state of affairs. Given a
HashSet implementation that can use a supplied Hasher and Equivalence the
way TreeSet can use an externally-supplied Comparator, and that also has
the semantics that adding an element that equals an already-present
element replaces that element with the new one, you can update the 0001
record simply by putting a more recent one into this set -- if it already
has a 0001 record, the provided Hasher and Equivalence will lead to the
new one replacing that one.

So in some contexts you can treat records identically only if they're
actually identical; in others if they have the same id; all without
monkeying with an explicit id-to-record HashMap or suchlike.

Another way to achieve this last, though, is a KeyExtractor<T>
interface, implemented in this case to return the id field of a Record,
plus a HashSet implementation that uses the supplied KeyExtractor if one
is given at construction and otherwise uses the object itself as the key
in its internal HashMap. This is actually closer to the conceptual truth
of what you're doing in a case like this: keying on the id field in a
particular HashSet. The implementation would be something like

public class HashSet<T> {
    private HashMap<Object,T> data = new HashMap<Object,T>();
    private KeyExtractor<T> ke = new KeyExtractor<T>() {
        public Object getKey (T val) {
            return val;
        }
    };

    public T put (T newElem) {
        Object key = ke.getKey(newElem);
        T oldElem = data.get(key);
        data.put(key, newElem);
        return oldElem;
    }
}
whereas the Hasher/Equivalence version would just pass the Hasher and
Equivalence to the HashMap constructor when initializing data, and put
would have no key local, just newElem.

The really interesting thing is that we don't need to wait for any
hypothetical future Sun (Oracle?) update to do some of this; KeyExtractor
and the above variation of HashSet can be implemented now. The latter
might better be called RecordMap, since in the typical case it acts as a
map from key fields of records of some sort to whole records, and in
fact you probably also want to do lookups of whole records by just the
keys. And you might sometimes want to hold the records via weak or soft
references, e.g. to make it a cache. In that case you want to allow
specifying two more things: a type of reference to use (enum
ReferenceType { STRONG, SOFT, WEAK } with default STRONG) and an optional
ExpensiveGetter that defaults to returning null but can be replaced with
one whose expensiveGet() does something like, say, retrieve disk records.
Then get() calls expensiveGet() on not-found and, if expensiveGet()
doesn't throw or return null, does a put() before returning the result.
You can throw in another type parameter, too:

public class RecordMap<K, V, E extends Exception> {
    private ExpensiveGetter<K,V,E> eg = new ExpensiveGetter<K,V,E>() {
        public V expensiveGet (K key) throws E {
            return null;
        }
    };

    private HashMap<K,Object> data = new HashMap<K,Object>();

    public enum ReferenceType {
        STRONG {
            public Object wrap (Object o) {
                return o;
            }
        },
        SOFT {
            public Object wrap (Object o) {
                return new SoftReference<Object>(o);
            }
        },
        WEAK {
            public Object wrap (Object o) {
                return new WeakReference<Object>(o);
            }
        };

        public abstract Object wrap (Object o);
    }

    private ReferenceType referenceType = ReferenceType.STRONG;

    public V get (K key) throws E {
        Object result = data.get(key);
        if (result instanceof Reference) result = ((Reference<?>) result).get();
        if (result != null) return (V) result;
        result = eg.expensiveGet(key);
        if (result == null) return null;
        put(key, (V) result);
        return (V) result;
    }

    public void put (K key, V val) {
        data.put(key, referenceType.wrap(val));
    }
}

This is a bit messy, but it's just a quick draft. (Note E has to be
bounded by Exception for the throws clauses to compile.) It doesn't
actually implement Map because it doesn't quite fit the Map contract in
a few places (and making it do so would be difficult, particularly since
get seems to have to be able to throw exceptions). You might want to
change ExpensiveGetter to a more general BackingSource that provides
both get and put methods; puts would write through to the real backing
store as well as to the RecordMap in memory, making a RecordMap with a
non-default BackingSource a cache backed by something in a two-way
fashion.
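A minimal sketch of that interface (the name BackingSource comes from the paragraph above; its exact shape, and the in-memory implementation, are just this post's guess):

```java
// Hypothetical two-way backing store for a RecordMap-style cache:
// get() is called on a cache miss, put() writes through on every cache put.
interface BackingSource<K, V, E extends Exception> {
    V get(K key) throws E;
    void put(K key, V val) throws E;
}

// Trivial in-memory implementation, for illustration only; a real one
// might hit a file or a database and declare a checked exception type.
class MapBackedSource<K, V> implements BackingSource<K, V, RuntimeException> {
    private final java.util.HashMap<K, V> store = new java.util.HashMap<K, V>();

    public V get(K key) {
        return store.get(key);
    }

    public void put(K key, V val) {
        store.put(key, val);
    }
}
```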

I may be a bit rusty on the syntax of giving enum constants behavior,
too. Clearly in this case that's the right thing to do from an OO
perspective, rather than having a switch statement in the put method that
could get out of sync if someone decided to add PHANTOM to the thing for
whatever reason, or a future JDK added more Reference types that
influenced GC policy in as-yet-unforeseen ways.
