Re: Reading from very large file
In article <1273342288.02@user.newsoffice.de>, Hakan <H.L@softhome.net>
wrote:
> b) a StreamTokenizer did not work, as it has irregular delimiters for
> some reason.
As a longtime fan of StreamTokenizer, I'm puzzled by this. The default
delimiter is whitespace, and whitespaceChars() allows considerable
flexibility in defining your own.
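For instance, here's a minimal sketch of using whitespaceChars() to treat
commas as delimiters (the class and method names are mine, just for
illustration): by default a comma is an ordinary character, so "1,2,3"
parses as a number, a ',' token, a number, and so on; after the call, the
commas are skipped and you get three numbers.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class DelimiterDemo {

    /** Counts number tokens, treating ',' as an extra delimiter. */
    static int countNumbers(Reader in) {
        StreamTokenizer st = new StreamTokenizer(in);
        st.whitespaceChars(',', ','); // commas now act like whitespace
        int count = 0;
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    count++;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumbers(new StringReader("1,2,3"))); // prints 3
    }
}
```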
As for performance, others have suggested a suitable buffer, and I chose
1 MiB. I get correct results for an 87 KiB test file containing 2^14
numbers and a 2.4 MiB dictionary containing no numbers. A 30 MiB file
takes less than two seconds to process.
<console>
$ make run < test.txt
java -cp build/classes cli.TokenizerTest
Count: 16384
83.659 ms
$ make run < /usr/share/dict/words
java -cp build/classes cli.TokenizerTest
Count: 0
230.727 ms
$ make run < classes.jar
java -cp build/classes cli.TokenizerTest
Count: 131906
1689.561 ms
</console>
<code>
package cli;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StreamTokenizer;
/** @author John B. Matthews */
public class TokenizerTest {

    public static void main(String[] args) {
        long start = System.nanoTime();
        tokenize();
        System.out.println(
            (System.nanoTime() - start) / 1000000d + " ms");
    }

    private static void tokenize() {
        // 1 MiB buffer; StreamTokenizer parses numbers by default
        StreamTokenizer tokens = new StreamTokenizer(
            new BufferedReader(
                new InputStreamReader(System.in), 1024 * 1024));
        try {
            int count = 0;
            int token = tokens.nextToken();
            while (token != StreamTokenizer.TT_EOF) {
                if (token == StreamTokenizer.TT_NUMBER) {
                    count++;
                }
                token = tokens.nextToken();
            }
            System.out.println("Count: " + count);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
</code>
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>