Hemal Pandya wrote:
Ed Kirwan wrote:
Patricia Shanahan wrote:
[...]
Perhaps using a List would obviate the need for the nested loop?
It would, but it would be a lot more expensive.
[....]
Thanks for that tip, Hemal. I had no idea that Set implementations were
so much more efficient (in this case) than List implementations. The
output from the code below gives:
522393 duplicated words. Using java.util.HashSet, time = 678ms.
522393 duplicated words. Using java.util.TreeSet, time = 1812ms.
522393 duplicated words. Using java.util.ArrayList, time = 157724ms.
522393 duplicated words. Using java.util.LinkedList, time = 251739ms.
import java.util.*;
import java.io.*;

class Test {

    private static final String TEXT_BOOK_NAME = "war-and-peace.txt";

    public static void main(String[] args) {
        try {
            String text = readText(); // Read the whole text into RAM
            countDuplicateWords(text, new HashSet<String>());
            countDuplicateWords(text, new TreeSet<String>());
            countDuplicateWords(text, new ArrayList<String>());
            countDuplicateWords(text, new LinkedList<String>());
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

    private static String readText() throws IOException {
        BufferedReader reader =
            new BufferedReader(new FileReader(TEXT_BOOK_NAME));
        try {
            StringBuilder text = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append(' ');
            }
            return text.toString();
        } finally {
            reader.close(); // Don't leak the file handle
        }
    }

    private static void countDuplicateWords(String text,
                                            Collection<String> words) {
        int numDuplicatedWords = 0;
        long startTime = System.currentTimeMillis();
        for (StringTokenizer i = new StringTokenizer(text);
                i.hasMoreTokens();) {
            String word = i.nextToken();
            // contains() is roughly O(1) for a HashSet, O(log n) for a
            // TreeSet, but O(n) for either List -- hence the timings above.
            if (words.contains(word)) {
                numDuplicatedWords++;
            } else {
                words.add(word);
            }
        }
        long endTime = System.currentTimeMillis();
        System.out.println(numDuplicatedWords + " duplicated words. " +
            "Using " + words.getClass().getName() +
            ", time = " + (endTime - startTime) + "ms.");
    }
}
You could use a HashMap if you wanted to know how many times each word occurred:
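Something along these lines, say (a minimal sketch; the class and method names are just made up for illustration). Each word maps to its running count, so a lookup that returns null means the word hasn't been seen yet:

```java
import java.util.*;

class WordFrequency {

    // Count how many times each word occurs, using a HashMap
    // from word to its running count.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (StringTokenizer i = new StringTokenizer(text);
                i.hasMoreTokens();) {
            String word = i.nextToken();
            Integer count = counts.get(word);
            counts.put(word, count == null ? 1 : count + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            WordFrequency.countWords("the cat and the hat");
        System.out.println(counts); // e.g. "the" maps to 2
    }
}
```

The duplicate total from the earlier program falls out for free: it's just the sum of (count - 1) over all entries.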