Re: Reading from very large file
Hakan wrote:
I'd like to read only numbers from an extremely big file containing
both characters and digits. It turns out that a) reading each
character with a RandomAccessFile is too slow
markspace wrote:
I think a tightly scoped SSCCE is needed here. "Extremely big" and
"too slow" are such vague and relative terms that there's not really
much we can do if we don't know what sort of performance target we're
trying to hit.
An SSCCE with the access times you are seeing, plus your desired
performance improvement, would be best.
Hakan wrote:
The text file has a size in the range of 13.7 MB. No matter what access
times I have on an individual read, it will take immense amounts of time
unless I find the smartest way to preprocess it and filter out all
non-digits. Thanks.
First, 13.7 MB isn't so terribly large. Second, markspace specifically asked
for hard numbers and pointed out that adjectives like "extremely big" are not
terribly meaningful, yet you ignored that advice and the request and simply
provided another vague adjective, "immense", without any indication of what
your target performance is. Third, he asked for an SSCCE, which you also
ignored completely.
Given all that, you make it impossible to help you, but let me try anyway.
I'm just a great guy that way.
But you're still going to have to provide an SSCCE. Read
<http://sscce.org/>
to learn about that.
You mentioned that "reading each character with a RandomAccessFile is too
slow". OK, then don't do that! Stream the data in, using a large block size
for the read, for example, using
<http://java.sun.com/javase/6/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)>
to establish the stream.
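
A minimal sketch of what I mean, not a drop-in solution. The class name, the
64 KiB buffer size and the 8192-char block are my own assumptions; measure
before you settle on any of them. It uses try-with-resources, so it assumes
Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class BufferedDigitCount {
    // Hypothetical buffer size; tune it against real measurements.
    static final int BUFFER_SIZE = 1 << 16;

    /** Counts the ASCII digits '0'..'9' available from the given reader. */
    static long countDigits(Reader source) throws IOException {
        long digits = 0;
        char[] block = new char[8192];
        // The two-argument constructor sets the size of the internal buffer.
        try (BufferedReader in = new BufferedReader(source, BUFFER_SIZE)) {
            int n;
            // Pull characters a block at a time instead of one by one.
            while ((n = in.read(block, 0, block.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (block[i] >= '0' && block[i] <= '9') {
                        digits++;
                    }
                }
            }
        }
        return digits;
    }
}
```

You would call it with something like
countDigits(new FileReader("data.txt")), where the file name is a placeholder.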
At that point your search for digits is nearly all memory-bound. On most
modern systems you should be able to fit the entire 13.7 MB in memory at once,
eliminating I/O as a limiting factor.
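
Loading the whole file might look roughly like the sketch below. It assumes
Java 7 or later for java.nio.file.Files, and it assumes a single-byte,
ASCII-compatible encoding, which you have not told us; ISO-8859-1 is used only
because it maps every byte to exactly one char. The class and method names are
made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WholeFileInMemory {
    /**
     * Reads the entire file with one call and decodes it as ISO-8859-1.
     * The encoding is an assumption; substitute the file's real one.
     */
    static String load(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

After that single read, everything else works on the in-memory String and
never touches the disk again.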
Now you just need an efficient algorithm. Perhaps a state machine that scans
your 13.7 MB in-memory buffer and spits out sequences of digits to a handler,
somewhat the way XML SAX parsers handle searches for tags, would be useful.
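
Here is one way such a state machine could look, as a sketch only. The
DigitHandler interface and all other names are hypothetical, and I am assuming
that "numbers" means unbroken runs of the ASCII digits 0 through 9; signs,
decimal points and exponents would need extra states.

```java
import java.util.ArrayList;
import java.util.List;

class DigitScanner {
    /** Hypothetical callback, loosely analogous to a SAX content handler. */
    interface DigitHandler {
        void onDigits(String run);
    }

    private enum State { OUTSIDE, IN_DIGITS }

    /** Scans the text once and reports each maximal run of ASCII digits. */
    static void scan(CharSequence text, DigitHandler handler) {
        State state = State.OUTSIDE;
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            if (state == State.OUTSIDE && digit) {
                // A run of digits begins here.
                state = State.IN_DIGITS;
                start = i;
            } else if (state == State.IN_DIGITS && !digit) {
                // The run ended just before this character.
                handler.onDigits(text.subSequence(start, i).toString());
                state = State.OUTSIDE;
            }
        }
        // Flush a run that extends to the end of the input.
        if (state == State.IN_DIGITS) {
            handler.onDigits(text.subSequence(start, text.length()).toString());
        }
    }

    /** Convenience wrapper that collects every run into a list. */
    static List<String> collect(CharSequence text) {
        final List<String> runs = new ArrayList<String>();
        scan(text, new DigitHandler() {
            public void onDigits(String run) {
                runs.add(run);
            }
        });
        return runs;
    }
}
```

It makes a single pass with no backtracking, so the cost grows linearly with
the size of the buffer.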
Now for the best piece of advice when asking for help from Usenet:
<http://sscce.org/>
<http://sscce.org/>
<http://sscce.org/>
--
Lew