Re: Any suggestions for handling data of huge dimension in Java?

From: Eric Sosman <esosman@ieee-dot-org.invalid>
Newsgroups: comp.lang.java.programmer
Date: Fri, 25 Mar 2011 22:18:34 -0400
Message-ID: <imjige$m48$1@dont-email.me>
On 3/25/2011 6:09 AM, Simon Ng wrote:

> On Thursday, March 24, 2011 7:46:07 PM UTC+8, Eric Sosman wrote:
>
>>      Find a way to "classify" incrementally, so you don't need to hold
>> all 250,000,000 key/value pairs in memory at the same time. Sorry I
>> can't be more specific, but I'm unable to guess what your data looks
>> like or what "classify" means.


(reformatted for readability)

> Hi Eric Sosman, thank you for the suggestion; I think classifying
> incrementally is a good idea. Actually I am doing web page language
> identification; my data is the N-grams extracted from web pages. My
> aim is to identify the language a web page belongs to, and it is
> assumed that each web page is written in one language only. So each
> column of the matrix represents one of the N-grams, and each row is a
> web page to be identified. For the details, maybe you can read this
> paper [...]

     I have not read the paper, so my remarks may be off the mark.
But two things occur to me: First, your "5000*50*1000" matrix will
probably be quite sparse, and all this talk of a quarter-billion
cells is probably a red herring. Second, it's not clear that you
need more than one row at a time. If you've already developed a
decision procedure, you can read one web page, say "Ici on parle
fran?ais" or "Ja, hier sind wir alles Deutsch" and move along to
the next page, forgetting the first. If you're trying to accumulate
a training set of some kind, you can process all the French pages
and output their statistics, then all the German pages and output
theirs, and so on, retaining just one set of statistics at a time.
In neither case can I see any need to have all the web pages in
memory at the same time.
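
     Just to sketch the "one page at a time" loop (and this is only an
illustration of mine, not anything from the paper: I'm assuming the
extracted text of each page arrives one per line of a flat file, and
classify() is a stand-in for whatever decision procedure you already
have):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/*
 * Sketch only.  Assumes, purely for illustration, that each line of
 * the input file holds the extracted text of one web page, and that
 * classify() stands in for an existing decision procedure.
 */
public class StreamingClassifier {

    /** Sparse trigram counts for one page -- the only per-page state kept. */
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    static String classify(Map<String, Integer> counts) {
        return "und";   // placeholder: compare against per-language profiles here
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String page;
            while ((page = in.readLine()) != null) {
                System.out.println(classify(trigramCounts(page)));
                // the counts map becomes garbage after this call; nothing
                // accumulates, so memory is bounded by the largest page
            }
        }
    }
}

The only long-lived state is whatever classify() keeps (a profile per
language), which is tiny next to a quarter-billion-cell matrix.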

     (Curmudgeonly rant:) In the Old Days when a good-sized computer
had a quarter-meg of RAM and only supercomputers boasted a megabyte,
we did this kind of processing all the time. Slurp in the small
amount of data you could hold, do a "local context" computation,
spit out the result, repeat. The art was in arranging those local
context computations so they'd do something useful without ever
needing access to the entire data set at once. That art seems to
have withered somewhat, possibly because memories have grown so
much larger that the art's exercise is no longer bread-and-butter.
It's not too great an exaggeration to say that today's programmers
can think of HashMap and RDBMS -- and nothing in between.

     And yet ... There certainly are data sets that are too large
for the greatest amount of RAM you can afford (proof: If this were
not so, disks would be about RAM-sized). And an RDBMS, admirable
though it is for many purposes, is several times slower than a disk,
several orders of magnitude slower than a CPU. If you can think of
a way to process your twenty-terabyte data set in four or five or
even ten sequential passes, you can get the job done in far less time
and for far less money than you'd spend on a fancy message-passing
distributed-computation network with fifty 512-GB processors and
their fifty Larry-has-his-eye-on-a-bigger-yacht Oracle licenses.
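
     For instance (same caveat as before: the one-page-per-line layout
and the pruning threshold are assumptions I'm inventing for the sketch),
two sequential passes are enough to turn raw pages into sparse feature
rows without ever materializing the full matrix:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/*
 * Two sequential passes over a file of pages.  Pass 1 keeps one counter
 * per distinct trigram; pass 2 re-reads the pages and writes sparse
 * presence features for the trigrams that survived pruning.  The full
 * page-by-trigram matrix never exists in memory.
 */
public class TwoPassFeatures {

    static Set<String> trigramsOf(String text) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) throws IOException {
        final int minDocs = 5;                 // arbitrary pruning threshold
        Map<String, Integer> docFreq = new HashMap<>();

        // Pass 1: document frequency of each trigram.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String page;
            while ((page = in.readLine()) != null) {
                for (String g : trigramsOf(page)) {
                    docFreq.merge(g, 1, Integer::sum);
                }
            }
        }
        docFreq.values().removeIf(df -> df < minDocs);

        // Pass 2: one sparse row per page, restricted to the kept vocabulary.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(args[1])))) {
            String page;
            while ((page = in.readLine()) != null) {
                StringBuilder row = new StringBuilder();
                for (String g : trigramsOf(page)) {
                    if (docFreq.containsKey(g)) {
                        row.append(g).append(":1 ");
                    }
                }
                out.println(row.toString().trim());
            }
        }
    }
}

The only thing carried from pass to pass is the pruned vocabulary --
one integer per distinct N-gram -- nothing proportional to the number
of pages.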

     Knuth's "Sorting and Searching" volume spends a good deal of
time describing ways to sort big data sets with limited memory but
large-capacity magnetic tapes. "Tapes," people sneer, "Ugh! How
last-millennium!" Yet the ideas remain valuable, even if you never
write a serious sort: You might get some ideas about how to deal
with larger data sets than you can stuff into a puny HashMap.
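
     As a taste of what those tape-era ideas look like today, here is a
bare-bones external merge sort, with temp files standing in for the
tapes (the line-oriented records and the run size are my own arbitrary
choices for the sketch, not anything prescribed by Knuth):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/*
 * Phase 1: read as many lines as fit comfortably in memory, sort them,
 * spill the sorted run to a temp file, repeat.  Phase 2: merge all the
 * runs with a priority queue holding one line per run.
 */
public class ExternalSort {

    static final int RUN_SIZE = 1_000_000;   // lines per in-memory run

    public static void main(String[] args) throws IOException {
        List<Path> runs = new ArrayList<>();

        // Phase 1: produce sorted runs.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> buf = new ArrayList<>(RUN_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == RUN_SIZE) {
                    runs.add(spill(buf));
                    buf.clear();
                }
            }
            if (!buf.isEmpty()) runs.add(spill(buf));
        }

        // Phase 2: k-way merge of the runs.
        PriorityQueue<RunHead> heap = new PriorityQueue<>();
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String first = r.readLine();
            if (first != null) heap.add(new RunHead(first, r));
            else r.close();
        }
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(args[1])))) {
            while (!heap.isEmpty()) {
                RunHead head = heap.poll();
                out.println(head.line);
                String next = head.reader.readLine();
                if (next != null) heap.add(new RunHead(next, head.reader));
                else head.reader.close();
            }
        }
    }

    static Path spill(List<String> buf) throws IOException {
        Collections.sort(buf);
        Path tmp = Files.createTempFile("run", ".txt");
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(tmp))) {
            for (String s : buf) w.println(s);
        }
        return tmp;
    }

    /** One run's current line plus its reader, ordered by the line. */
    static class RunHead implements Comparable<RunHead> {
        final String line;
        final BufferedReader reader;
        RunHead(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }
        public int compareTo(RunHead o) { return line.compareTo(o.line); }
    }
}

Memory use is bounded by RUN_SIZE lines plus one line per run, no
matter how large the input grows; once your N-gram records are sorted,
all the "group by N-gram" or "group by language" work becomes a single
sequential pass.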

--
Eric Sosman
esosman@ieee-dot-org.invalid
