Re: Efficient Integer Comparison Algorithm

From:

Robert Klemme <shortcutter@googlemail.com>

Newsgroups:

comp.lang.java.programmer

Date:

Tue, 04 Mar 2008 19:37:18 +0100

Message-ID:

<635j6vF24hecrU1@mid.individual.net>

On 04.03.2008 18:27, JThangiah wrote:

I need some help to find an efficient range or pair comparison
algorithm for integers.
My requirement is to link records in two different databases with
different set of records. Both have a pair of values called co-
ordinates which are similar and can be used to match the records.

In other words: you consider records equivalent that have the exact same
values for both coordinates. What are the properties of your coords?
Are there positive and / or negative values? How many are there?

My Solution:
I am going to pull a set of records from both databases and sort them
based on these co-ordinates. I will put them in two lists so that the
indexes for the records to be compared in both lists are the same.

How can you make sure that indexes are identical? If you just pull an
ordered result from the database you cannot expect that (4,-5) is at the
same position in both results. Or is there another precondition you did
not mention so far?

I will then do a comparison of both lists to find the exact match.
If I find the match, I need to record the matched records elsewhere to
create an XML file with those results.

You do not mention database products that you are going to use. If at
all possible I would use a database link (Oracle) or a similar feature
of another product and let the DB do the work, by joining the local and
the remote table. It's at least worth a try.

Database extensions for spatial data are another set of features that
might help here.

Constraints:
Both the databases are so large that I have to find the most efficient
algorithm and datastructure to do the job.

How large is "large"? If we're talking about a few million records then
that can be easily handled inside the JVM - just pull all the data into
an appropriate data structure (see below) and go from there.

The execution will be done in a batch job.
The records pulled from both databases to form the lists are unique
and have no duplicates but each have a separate unique identifier.
There may be excess data that cannot be compared in either of the
databases.

Questions:
1. Is list the best data-structure for this since speed is my main
concern. Or will the autoboxing of integers cause an overhead?

If you use a List you need to have entries ordered, otherwise you risk
O(n*n) which is unnecessary.

2. Are there any good range-comparison or pair-comparison algorithms
for integers that I can use?

ad 1 and 2: why not create a pair type that carries two ints plus
additional information and can be stuffed into a HashSet?

Alternative: query both databases ordered by both coordinates and
compare records individually. You can then decide which ResultSet you
need to step forward based on the current comparison.

3. Can I make use of concurrency algorithms for this?

Yes, certainly. You could, for example, partition your data set into
smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
number of threads which each processes one chunk. You only need to make
sure that you only have one thread that will generate the output XML.

I have the feeling that at the moment the problem is underspecified.
Missing bits I can see so far:

1. more about the nature of the data
2. database products involved
3. volume of data

Kind regards

robert