Re: Reading from very large file
In article <1273342288.02@user.newsoffice.de>, Hakan <H.L@softhome.net>
wrote:
> b) a StreamTokenizer did not work, as it has irregular delimiters for
> some reason.
As a longtime fan of StreamTokenizer, I'm puzzled by this. The default
delimiter is whitespace, and whitespaceChars() allows considerable
flexibility in defining your own.
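For instance, here's a minimal sketch of using whitespaceChars() to treat
commas as delimiters (the class and method names are mine, just for
illustration): by default a comma is an ordinary character, so "1,2,3"
parses as a number, a ',' token, a number, and so on; after the call, the
commas are skipped and you get three numbers.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class DelimiterDemo {

    /** Counts number tokens, treating ',' as an extra delimiter. */
    static int countNumbers(Reader in) {
        StreamTokenizer st = new StreamTokenizer(in);
        st.whitespaceChars(',', ','); // commas now act like whitespace
        int count = 0;
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    count++;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumbers(new StringReader("1,2,3"))); // prints 3
    }
}
```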
As for performance, others have suggested a suitable buffer, and I chose
1 MiB. I get correct results for an 87 KiB test file containing 2^14
numbers and a 2.4 MiB dictionary containing no numbers. A 30 MiB file
takes less than two seconds to process.
<console>
$ make run < test.txt
java -cp build/classes cli.TokenizerTest
Count: 16384
83.659 ms
$ make run < /usr/share/dict/words
java -cp build/classes cli.TokenizerTest
Count: 0
230.727 ms
$ make run < classes.jar
java -cp build/classes cli.TokenizerTest
Count: 131906
1689.561 ms
</console>
<code>
package cli;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StreamTokenizer;
/** @author John B. Matthews */
public class TokenizerTest {

    public static void main(String[] args) {
        long start = System.nanoTime();
        tokenize();
        System.out.println(
            (System.nanoTime() - start) / 1000000d + " ms");
    }

    private static void tokenize() {
        // 1 MiB buffer; StreamTokenizer parses numbers by default
        StreamTokenizer tokens = new StreamTokenizer(
            new BufferedReader(
                new InputStreamReader(System.in), 1024 * 1024));
        try {
            int count = 0;
            int token = tokens.nextToken();
            while (token != StreamTokenizer.TT_EOF) {
                if (token == StreamTokenizer.TT_NUMBER) {
                    count++;
                }
                token = tokens.nextToken();
            }
            System.out.println("Count: " + count);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
</code>
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>