Re: advice on loading and searching large map in memory

From:

Lew <noone@lewscanon.com>

Newsgroups:

comp.lang.java.programmer

Date:

Sun, 20 Feb 2011 12:06:06 -0500

Message-ID:

<ijrhlm$dci$1@news.albasani.net>

On 02/20/2011 11:34 AM, Arved Sandstrom wrote:

On 11-02-19 09:43 PM, eunever32@yahoo.co.uk wrote:

Hi

We have a requirement to query across two disparate systems. Both
systems are read-only so no need for updates and once loaded and no
need to check for updates. I would plan to reload the data afresh each
day. Records on both systems map one-one and each has 7million
records.

The first system is legacy and I am reluctant to redevelop (C code).
The second is standard Java/tomcat/SQL

The non-relational query can return up to 1000 records.

This could therefore result in 1000 queries to the relational system
(just one table) before returning to the user.

To avoid 1000 relational queries I was planning to "cache" the entire
relational table in memory. I was planning to have a web service which
would load the entire relational table into memory. The web service,
running in a separate tomcat could then be queried 1000 times or maybe
get a single request with 1000 values and return all results in one
go. Having a separate tomcat process would help to isolate any memory
issues eg JVM heap size.

[ SNIP ]

Just to be clear on this, there are N records in each system, one legacy
(non-relational) and one a RDBMS, and N ~ 7 million. There is also some kind
of key for a 1:1 mapping between records in the two systems, but the data in
each record is different. Correct?

Let me ask a few questions:

1. You've obviously got known criteria for each legacy query (that as you say
might return up to 1000 records, let's call this number P). I take it that's
it been asked, and ruled out, that there are any criteria at this point that
would also identify the correct P matching records in the RDBMS table. So you
need to get the "keys" first. Correct?

2. Have you considered an IN operator in a WHERE condition? We're looking to

EXISTS is another alternative.

avoid P separate queries to the RDBMS here, because that's a serious
performance hit, and make it one SELECT instead. If there are practical

Again, no one has measured performance so we don't even know if it's a
performance hit at all, much less a "serious" one.

limitations on how many values you can jam into the inlist, you can chunk it
up rather coarsely, and still have very many fewer queries to get your data
than you'd have had otherwise...as in less than 10 for P<=1000. Although with
most databases I suspect you don't have to chunk at all.

Any serious objections to that experiment?

As an experiment, no. As a production decision, yes. All this complication
is for a so-called "problem" that has not yet been demonstrated to exist.

--
Lew
Honi soit qui mal y pense.