Re: Patricia trie vs binary search.
On 5/28/12 9:54 PM, Lew wrote:
On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:
On 5/27/12 6:44 PM, Gene Wirchenko wrote:
On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:
[snip]
I tend to use a Deterministic Finite State Automata for this. You can
load the entire English dictionary fairly easily with that scheme.
Yes,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
you use a bit of memory, but unless you're doing this on an embedded
device, its probably not enough memory to be concerned about.
Including all affixes?
I suppose it depends on the particular dictionary, but we're only
talking a few hundred thousand entries, at least with the Moby
word-lists as a base:
Considering how many affixes can be applied to some words, I find
that very questionable:
self:
*ish
*ishly
un*ish
un*ishly
*less
*lessly
un*less
un*lessly
position:
*s
*ed
*al
*ally
re*
re*s
re*ed
*less
mis*
*er
*ers
friend
*s
*ly
*liness
be*
be*ing
be*ed
be*er
be*ers
These are not particularly extreme examples.
It's not a question of how extreme the examples are but how many there are.
Not all words can be legitimately affixed. Many can be affixed by
algorithm, or by bitmaps as to which affixes apply, so you only store
the root, the bitmap and perhaps one more form.
I don't know how much memory expansion you think your factors will
cause, as you only hand wave and say there will be some and act like
it's a problem, but let's say it doubles the size of the dictionary. By
Daniel's experiment upthread, that would bring it to around 8 MiB, let's
round and say 10MiB. Being text and all, that should compress to about 3
MiB or less.
Now I am interested to hear what sort of trouble you assert that 3 MiB
or so of storage will cause.
Actually, the word lists I used to come up with my figures include those
inflections of the words that it is used for.
For example "ishly" and "ness" suffixes:
> grep "ishly\b" * | wc -l
323
Which includes:
wordishly
womanishly
wolfishly
wishly
winterishly
wildishly
whorishly
whoreishly
Most of these entries aren't even in Thunderbirds spell-check dictionary.
> grep "ness\b" * | wc -l
11762
Includes entries such as:
woodlessness
wonderfulness
unmysteriousness
unexperiencedness
undeceptiveness
nonvolatileness
Again, most of these aren't spell-check.
My numbers are accurate, but thanks for questioning them, so I could
further explore and see for myself.