Re: byte stream vs char stream buffer
On 5/11/2014 12:34 PM, Robert Klemme wrote:
> 1. You are never transferring those bytes to Java land (i.e. into a byte
> or byte[] which is allocated on the heap) - data stays in native land.
> 2. You are not reading chars and hence also do not do the character
> decoding.
Yes, I knew the data was "mostly ASCII", so I didn't have to do
character decoding. An efficient UTF-8 converter shouldn't be much
more complicated, however.
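To make that concrete, here is a rough sketch of the kind of hand-rolled decoder I have in mind. Utf8Sketch and its methods are hypothetical names made up for illustration, not code from the program below. It is deliberately loose: it replaces malformed input with U+FFFD and does not reject overlong encodings or encoded surrogates, which a strict converter would have to do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only. */
class Utf8Sketch {

    /** Returns the next code point, or -1 at end of stream. */
    static int readCodePoint(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0x80) {
            return b;                    // ASCII fast path; also passes -1 (EOF) through
        }
        int extra;                       // number of continuation bytes expected
        int cp;                          // code point being assembled
        if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return 0xFFFD;              // stray continuation or invalid lead byte
        for (int i = 0; i < extra; i++) {
            int c = in.read();
            if ((c & 0xC0) != 0x80) {
                return 0xFFFD;           // truncated or malformed sequence (includes EOF)
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        return cp <= 0x10FFFF ? cp : 0xFFFD;
    }

    /** Decodes a whole byte array using readCodePoint. */
    static String decode(byte[] bytes) throws IOException {
        InputStream in = new ByteArrayInputStream(bytes);
        StringBuilder sb = new StringBuilder();
        for (int cp; (cp = readCodePoint(in)) != -1; ) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "na\u00efve \u20ac";
        // Round-trip through the JDK encoder and the sketch decoder.
        System.out.println(decode(s.getBytes(StandardCharsets.UTF_8)).equals(s));
    }
}
```

The ASCII case costs one comparison per byte, which is why I'd expect the overhead on mostly-ASCII data to be small; I haven't measured it.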
I appear to be counting how often each word occurs in the file; I'm
not sure why at this point. Some more found code:
    // FileByteBufferInputStream and HashedHistogram are not shown here.
    HashedHistogram histogram = new HashedHistogram();
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the underlying file when done
    try( FileInputStream fins = new FileInputStream( path.toFile() ) )
    {
       FileByteBufferInputStream fbbins =
             new FileByteBufferInputStream( fins );
       int charRead;
       while( (charRead = fbbins.read()) != -1 )
       {
          if( charRead < 128 && !Character.isWhitespace( charRead ) ) {
             // ASCII, non-whitespace: part of the current word
             sb.append( (char) charRead );
          } else if( sb.length() > 0 ) {
             // whitespace or a non-ASCII byte ends the current word
             histogram.add( sb.toString() );
             sb.setLength( 0 );
          }
       }
    }
    // don't lose a final word that runs up to end of file
    if( sb.length() > 0 )
       histogram.add( sb.toString() );
    System.out.println( histogram.size() + " words" );
    Entry<Comparable,Integer>[] entries =
          histogram.getSortedEntries();
    System.out.println( "Bottom words:" );
    for( int i = 0; i < 20 && i < entries.length; i++ )
       System.out.println( entries[i].getKey() + ", "
             + entries[i].getValue() );
    System.out.println( "Top words:" );
    for( int i = entries.length-1; i >= 0 && i > entries.length-41; i-- )
       System.out.println( entries[i].getKey() + ", "
             + entries[i].getValue() );
Kind of ugly, but that's what I have.
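Since FileByteBufferInputStream and HashedHistogram aren't shown in this post, here is a self-contained sketch of the same loop using only JDK classes. It assumes the stream class does little more than read from a memory-mapped buffer, and it uses a plain HashMap in place of the histogram; WordCountSketch and countWords are hypothetical names, and this is not the code I timed.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the classes not shown in the post. */
class WordCountSketch {

    /** Counts ASCII words; whitespace and non-ASCII bytes act as separators. */
    static Map<String, Integer> countWords(Path path) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // Map the whole file; a single mapping is limited to 2 GB.
            MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            StringBuilder sb = new StringBuilder();
            while (buf.hasRemaining()) {
                int b = buf.get() & 0xFF;   // treat the byte as unsigned
                if (b < 128 && !Character.isWhitespace(b)) {
                    sb.append((char) b);
                } else {
                    flush(sb, counts);
                }
            }
            flush(sb, counts);              // word that runs up to end of file
        }
        return counts;
    }

    /** Records the pending word, if any, and clears the builder. */
    private static void flush(StringBuilder sb, Map<String, Integer> counts) {
        if (sb.length() > 0) {
            counts.merge(sb.toString(), 1, Integer::sum);
            sb.setLength(0);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("words", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "the cat and the hat\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(countWords(tmp));
    }
}
```

Map.merge needs Java 8; on Java 7 a get followed by a put does the same job.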