Re: Virtual files
On Wed, 14 May 2008, Andreas Leitgeb wrote:
Roedy Green <see_website@mindprod.com.invalid> wrote:
I have written quite a bit of code that loads an entire file into RAM
then does indexOf, substring, regexes etc, working on the giant
String, often creating a new giant String and writing it out.
Don't know of such a library, but depending on the exact actions,
such a library may be impossible to write.
A regex quite surely needs its "haystack" all in memory, so unless you
can be sure that your particular regex will never match a string longer
than some "n", you'd have no choice but to load the whole file at once.
Fine, so constrain the length of a match. You lose generality, but i bet
it makes no practical difference. How often do you search text for a
match >100 characters long, let alone >1 000 000 000?
If instead you can exclude the possibility that a match ever exceeds,
say, 10000 chars, then you can apply it to 20000-char blocks, keep the
upper 10000, load another 10000 from the file, re-apply the regexp
...
Something like that. If you were clever at block boundaries, you might be
able to work through the blocks quicker than that.
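Concretely, the sliding-block scan might look something like this - a
sketch only, with made-up names, assuming no match is ever longer than
MAX_MATCH chars:

import java.io.BufferedReader ;
import java.io.FileReader ;
import java.io.IOException ;
import java.util.regex.Matcher ;
import java.util.regex.Pattern ;

// Scans a file in 2*MAX_MATCH-char blocks, keeping the top MAX_MATCH
// chars as overlap so no match shorter than MAX_MATCH straddles a join.
public class BlockScanner {
    static final int MAX_MATCH = 10000 ;       // assumed bound on match length
    static final int BLOCK = 2 * MAX_MATCH ;

    public static void scan(String fname, Pattern p) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(fname)) ;
        try {
            char[] buf = new char[BLOCK] ;
            int filled = 0 ;
            long offset = 0 ;                  // file position of buf[0]
            boolean atEof = false ;
            while (!atEof) {
                int n ;
                while (filled < BLOCK && (n = in.read(buf, filled, BLOCK - filled)) > 0)
                    filled += n ;
                atEof = filled < BLOCK ;
                Matcher m = p.matcher(new String(buf, 0, filled)) ;
                while (m.find()) {
                    // matches starting in the overlap get rescanned next
                    // block, so only report them when there is no next block
                    if (atEof || m.start() < filled - MAX_MATCH)
                        System.out.println((offset + m.start()) + ": " + m.group()) ;
                }
                if (!atEof) {                  // slide: keep the upper half
                    System.arraycopy(buf, BLOCK - MAX_MATCH, buf, 0, MAX_MATCH) ;
                    offset += BLOCK - MAX_MATCH ;
                    filled = MAX_MATCH ;
                }
            }
        } finally {
            in.close() ;
        }
    }
}

The overlap means every reported match has a full MAX_MATCH chars of
input available after its start, so nothing gets truncated at a join.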
Back in the olden days, when memory was small (and made of magnetic
doughnuts) and data lived on tapes, a lot of work was done on algorithms
which didn't need to have their whole dataset in memory, and did the
minimal amount of IO, and particularly non-sequential IO, to do their
jobs. Particularly algorithms related to data-crunching, as you might well
want to do with a computer of that period, databases and whatnot, so
plenty of sorting and searching work. I bet there's stuff that could be
applied to this case. Sadly, i think a lot of this knowledge has been lost
- it's all written down, of course, but there isn't a living community of
practice which would help one make use of it.
The other operations are probably less of a problem in principle.
Another thing is that for substrings you perhaps do not always need to
create new strings separately in memory just to write them out.
OutputStreamWriter has a version of "write" that takes a String and two
ints (offset, length) to write out only a substring. I don't know if
that just obtains a .substring in memory, or if it uses the data
directly from the given String.
Good idea. I have this vague memory of String doing this internally
anyway, possibly in some ancient version of java - there was a thing about
getting memory leaks where you take a 10-char substring of a 10 000
000-char string, drop the big string, but the char[] doesn't get collected
because the little string is holding on to it. Maybe that was a dream.
But a CraftyStringBuffer object that worked by maintaining indices into an
existing buffer, and coalescing them or converting them to strings when it
saved memory (ie when they were short) would be cool.
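Something like this, perhaps - entirely hypothetical, minus the
coalescing. As far as i know, the stock write(String, int, int) copies
the chars out with String.getChars rather than taking a substring, so
writing slices this way shouldn't allocate any giant strings:

import java.io.IOException ;
import java.io.Writer ;
import java.util.ArrayList ;
import java.util.List ;

// Keeps (string, offset, length) triples instead of substring copies;
// coalescing short runs into real strings is left out for brevity.
public class CraftyStringBuffer {
    private static class Slice {
        final String s ; final int off, len ;
        Slice(String s, int off, int len) { this.s = s ; this.off = off ; this.len = len ; }
    }
    private final List<Slice> slices = new ArrayList<Slice>() ;

    // remember a substring of s without calling s.substring()
    public void append(String s, int off, int len) {
        slices.add(new Slice(s, off, len)) ;
    }

    // write the pieces out straight from their backing strings
    public void writeTo(Writer w) throws IOException {
        for (Slice sl : slices)
            w.write(sl.s, sl.off, sl.len) ;
    }
}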
Finally, you can use JNI and have C code memory-map the whole file at
once. But then you won't get a String, but at most a byte[] visible in
Java... Paging the appropriate parts of the file in and out is then
the OS's job.
No need for JNI (your own, at least) - you can memory-map files using NIO:
import java.io.File ;
import java.io.RandomAccessFile ;
import java.nio.MappedByteBuffer ;
import java.nio.channels.FileChannel ;
String fname = ... ;
File f = new File(fname) ;
MappedByteBuffer buf = new RandomAccessFile(f, "r").getChannel()
    .map(FileChannel.MapMode.READ_ONLY, 0L, f.length()) ;
However, i note that Buffers use ints for indexing, so they can't handle
files of 2 GB or more. Doh.
Okay, but the offset and length for the map *are* long, so you can create
a sequence of Buffers which between them tile the entire file. You can
then wrap these in a GignormoBuffer class which looks like a buffer but
uses long indices and routes calls to the appropriate underlying buffer.
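The routing is just arithmetic, something like this (GignormoBuffer
being a made-up name, read-only here, with all error handling omitted):

import java.io.File ;
import java.io.IOException ;
import java.io.RandomAccessFile ;
import java.nio.MappedByteBuffer ;
import java.nio.channels.FileChannel ;

// A long-indexed, read-only view of a file, tiled from 1 GB mappings.
public class GignormoBuffer {
    private static final long TILE = 1L << 30 ;          // 1 GB per tile
    private final MappedByteBuffer[] tiles ;
    private final long length ;

    public GignormoBuffer(File f) throws IOException {
        length = f.length() ;
        FileChannel ch = new RandomAccessFile(f, "r").getChannel() ;
        tiles = new MappedByteBuffer[(int) ((length + TILE - 1) / TILE)] ;
        for (int i = 0 ; i < tiles.length ; i++) {
            long off = i * TILE ;
            tiles[i] = ch.map(FileChannel.MapMode.READ_ONLY, off, Math.min(TILE, length - off)) ;
        }
    }

    public long length() { return length ; }

    // route the long index to a tile and an int offset within it
    public byte get(long index) {
        return tiles[(int) (index / TILE)].get((int) (index % TILE)) ;
    }
}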
Then you write a GignormoCharBuffer which wraps a GignormoBuffer and
translates bytes to chars (and only handles fixed-width 1:k encodings,
eg Latin-1 or UTF-16, so no UTF-8).
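For a 1:1 encoding the translation is just a cast and a mask, eg Latin-1
(a UTF-16 version would read two bytes per char instead); this sits on
the hypothetical GignormoBuffer above:

// chars from bytes for a 1-byte-per-char encoding (Latin-1 here)
public class GignormoCharBuffer {
    private final GignormoBuffer bytes ;

    public GignormoCharBuffer(GignormoBuffer bytes) { this.bytes = bytes ; }

    public long length() { return bytes.length() ; }

    public char charAt(long index) {
        // Latin-1 maps bytes 0x00-0xFF straight onto U+0000-U+00FF
        return (char) (bytes.get(index) & 0xff) ;
    }
}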
Then you rewrite string searching and regexps to use a GignormoCharBuffer
instead of a String. String searching is easy, you just rock some
Boyer-Moore action. Regexps are harder. Your best bet might be to let
GignormoCharBuffer produce CharSequences which are windows onto a 2
gigachar region, then apply java's regexps to those, and add logic to
search a whole buffer by doing it in bits and sliding the window,
including handling the joins (which might be quite hard to do cleverly).
Otherwise, write regexps from the ground up to use long-based offsets
(enjoy!).
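The window part is the easy bit - a CharSequence facade, something like
this (still hypothetical, and the join logic remains your problem):

import java.util.regex.Matcher ;
import java.util.regex.Pattern ;

// An int-indexed CharSequence view of part of a GignormoCharBuffer,
// good enough to hand to java.util.regex.
public class Window implements CharSequence {
    private final GignormoCharBuffer buf ;
    private final long start ;
    private final int len ;

    public Window(GignormoCharBuffer buf, long start, int len) {
        this.buf = buf ; this.start = start ; this.len = len ;
    }
    public int length() { return len ; }
    public char charAt(int i) { return buf.charAt(start + i) ; }
    public CharSequence subSequence(int from, int to) {
        return new Window(buf, start + from, to - from) ;
    }
    public String toString() {                 // Matcher.group() ends up here
        StringBuilder sb = new StringBuilder(len) ;
        for (int i = 0 ; i < len ; i++) sb.append(charAt(i)) ;
        return sb.toString() ;
    }
}

Then p.matcher(new Window(chars, windowStart, windowLen)) works as
usual, with windowStart + m.start() giving the true position in the
file.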
So no, i don't know of a library for doing this. But it doesn't sound
like an insuperable problem - though i suspect Roedy knew that already.
tom
--
Hit to death in the future head