Re: Reading from very large file

From: Tom Anderson <twic@urchin.earth.li>
Newsgroups: comp.lang.java.programmer
Date: Sun, 9 May 2010 12:16:41 +0100
Message-ID: <alpine.DEB.1.10.1005091125520.31162@urchin.earth.li>

On Sun, 9 May 2010, Hakan wrote:

Sorry about the mistake, but the file is actually 13 GB. I can read to a
character array buffering about 30 million characters before the heap space
is overflowed. This is still only a part of the file.

The sscce site is down and not accessible when I tried. What I have been
doing so far is something like this in rough code:

Rough code is really not that useful - you're having a problem because
something in your code is wrong, which means that something in your
*understanding* of the code is wrong. Telling us about your understanding
of the code is therefore not very useful. Why can't you copy and paste
your actual code?

static int nchars=27000000;
int startpos=0;
File readfile="../x.txt";
FileReader frd=new File;
String searchs="20020701";
char[] arr=new char[nchars];

while (more dates to search for)
{
frd=new FileReader(readfile); /*reopen file
frd.skip(startpos); /*move to file pointer where final place of last date was found

I suspect the above line is the problem.

A FileReader works in characters, not bytes. A character may occupy a
variable number of bytes (in some encodings, and so in general), and thus
skipping a given number of characters doesn't correspond to skipping any
fixed number of bytes. Thus, FileReader.skip can't be implemented
efficiently on top of the low-level seek() system call. Instead, it has to
read through the contents of the file, counting characters until it's
skipped the right number. So, every time you make this call, you're
re-reading all of the file you've read so far.
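
To make the character/byte mismatch concrete, here is a minimal sketch. The class name SkipDemo and both helper methods are made up for illustration, and UTF-8 is assumed as the encoding. It shows that the first two characters of a short string occupy three bytes, so a reader asked to skip two characters has to decode the bytes to find out where to stop.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original poster's code.
class SkipDemo {

    // Number of bytes the first `chars` characters of `text` occupy in UTF-8.
    static int utf8BytesForPrefix(String text, int chars) {
        return text.substring(0, chars).getBytes(StandardCharsets.UTF_8).length;
    }

    // Skips `chars` characters through a Reader, then returns the next
    // character (or -1 at end of input). The reader must decode every
    // skipped byte, because it cannot know the byte offset in advance.
    static int charAfterSkip(byte[] data, long chars) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            long remaining = chars;
            while (remaining > 0) {
                long skipped = r.skip(remaining); // may skip fewer than asked
                if (skipped <= 0) {
                    break; // end of input reached
                }
                remaining -= skipped;
            }
            return r.read();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "a\u00e9b"; // 'a', e-acute, 'b'
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes for first 2 chars: " + utf8BytesForPrefix(text, 2));
        System.out.println("char after skipping 2:   " + (char) charAfterSkip(data, 2));
    }
}
```

With a single-byte encoding the two counts would coincide, but a Reader cannot rely on that in general, which is why skipping costs roughly as much as reading.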

frd.read(arr,0,nchars); /*10
find number of date occurrences in arr with pattern matching
update searchs (first time to "20020702" and so on
startpos=startpos+(last place of pattern match)
output result for this date
}

All in all, this tends to take one to two minutes per run of the loop. What
I would like to do is either a) preprocess the file so that I get an input
file where only numbers are present, or b) change the read call at label 10
so that it reads only numbers instead of all the following characters.

No, you don't want to do either of those things. You want to avoid the
real problem, which is re-reading the file every trip round the loop.

You're massively overcomplicating this problem. All you need to do is set
up the FileReader - once, and with suitable buffering - then read
characters from it, looking for strings which look like dates. You can do
this in exactly one pass of the file, and less than 30 lines of code.

I know that because I just wrote, in ten minutes, a program that does it.
Download the class file from here:

http://urchin.earth.li/~twic/tmp/DateScanner.class

And run it like:

java DateScanner name-of-file.txt

It doesn't do the full sequential processing of dates that you want to do,
but it does report every date it finds, and its position. To suppress that
output, run it like this:

java -Dquiet=true DateScanner name-of-file.txt

How long does it take to process your file?
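
For readers without access to the class file, the single-pass approach described above might look roughly like the sketch below. This is not the source of DateScanner.class, which was not posted. The class name DateCountSketch, the assumption that a "date" is a run of exactly eight consecutive digits (as in "20020701"), and the choice to count occurrences per date rather than report positions are all illustrative guesses.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a one-pass scan; not the DateScanner program itself.
class DateCountSketch {

    // Assumption: a date is a run of exactly this many consecutive digits.
    static final int DATE_LENGTH = 8;

    // Reads the input once, front to back, and counts each 8-digit run.
    static Map<String, Integer> countDates(Reader in) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringBuilder run = new StringBuilder(DATE_LENGTH);
        int digits = 0; // length of the current digit run
        int c;
        do {
            c = in.read();
            if (c >= '0' && c <= '9') {
                // Keep only the first DATE_LENGTH digits; longer runs are
                // rejected below, so there is no need to store them.
                if (digits < DATE_LENGTH) {
                    run.append((char) c);
                }
                digits++;
            } else {
                // A non-digit (or end of input, c == -1) closes the run.
                if (digits == DATE_LENGTH) {
                    counts.merge(run.toString(), 1, Integer::sum);
                }
                run.setLength(0);
                digits = 0;
            }
        } while (c != -1);
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // The file is opened once and wrapped in a buffer, so there is no
        // skip() and no re-reading. FileReader uses the default charset.
        try (Reader in = new BufferedReader(new FileReader(args[0]), 1 << 16)) {
            countDates(in).forEach((date, n) -> System.out.println(date + " " + n));
        }
    }
}
```

It would be run as `java DateCountSketch name-of-file.txt`. Memory use depends on the number of distinct dates rather than on the size of the file, since only one character is held at a time beyond the reader's buffer.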

tom

--
All roads lead unto death row; who knows what's after?