Re: Problem with searching in TreeSet expressions starting with a
On 04.01.2007 12:49, setar wrote:
I store in TreeSet expressions in some natural language. This can be
any natural language. The content of a set is sorted in this language
by language collator:
collator = Collator.getInstance(locale);
Sorting is working correctly. For example in English the letters have
following order: ..., c, d, e, ... so words: cat, dog, ear will be
sorted in the given order.
Sorting is also working correctly in other languages. In non-English
languages there can be additional letters. For example in Polish there
is a '=C4=87' letter (c with an accent). This letters change the order =
other letter. In Polish the order is: ..., c, =C4=87, d, e, .... and fo=
example following words have given order: cena (en. price), =C4=87ma (e=
moth), duma (en. pride), echo (en. echo).
So index is build correctly for any language and it is sorted
Now I want to receive a range of its sorted values which are strings
beginning with a given string.
For English it is easy. If I want to get all expressions beginning with=
a string "cat" I can invoke subSet method with the first argument
equal to this string and the second argument the same as the first but
with last letter replaced with its SUCCESOR. Successors in English are
easy to determine:
But for other languages determining successors of letters isn't easy.
For example in Polish the successor of letter 'c' is '=C4=87' not 'd'. =
could create tables with orders of letters for all languages, but there=
is too many languages... I was looking in Java API for method
retrieving successor of given character but I found nothing. Does
anybody know if there is such a method?
Second way of retrieving expressions beginning with a given string is
invoking subSet with following arguments:
index.subSet("cat", "cat" + Character.MAX_VALUE);
I tested it with few string and everything seemed to work ok. But
recently I tried to return all expressions starting with a word
"cat", not only with a string. So I want to receive string "cat
is running", but I don't want to receive expression "catch". So I
invoked subSet with space after string "cat":
index.subSet("cat ", "cat " + Character.MAX_VALUE);
Unfortunately method returned also the "catch" word. I don't know
why subSet is working in this way.
It may be due to the workings of the Comparator you are using. Try
comparing "cat ", "catch" and "cat " + Character.MAX_VALUE with your
Comparator and see the outcome. After all Character.MAX_VALUE might not =
be a valid Unicode character.
IIRC there is a method in Character that checks whether a char is a
valid character. You could use that to find the max legal character and =
try again with that.
Does anybody know how to correct described ideas or maybe problem can
be solved in other way?
Some thoughts: it may be that you stretch the TreeSet too far. Could be =
that with better access to internals you are better off, so you might
have to craft your own data structure (Tries come to mind).
A two phase approach might work as well: use subSet() as the first line
to reduce the set's size and then iterate through it and use some other
mechanism like a regexp to further narrow down your result.