Re: Patricia trie vs binary search.
On 5/29/12 9:14 AM, Gene Wirchenko wrote:
On Mon, 28 May 2012 21:54:39 -0700, Lew <noone@lewscanon.com> wrote:
On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:
On 5/27/12 6:44 PM, Gene Wirchenko wrote:
[snip]
These are not particularly extreme examples.
It's not a question of how extreme the examples are but how many there are.
I mentioned extremity just to point out that such base words are
not that unusual. There are a lot of them.
Not all words can be legitimately affixed. Many can be affixed by algorithm,
or by a bitmap indicating which affixes apply, so you store only the root,
the bitmap, and perhaps one more form.
There are multiple affixes. I suppose they could be combined
into aggregate affixes, e.g. -less + -ness -> -lessness.
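For concreteness, the root-plus-bitmap scheme might look something like the
sketch below. The affix set and flag names are mine, made up purely for
illustration, not any real dictionary format (ispell-style affix files do
something roughly like this, with a far richer rule set):

// Sketch of the "root + affix bitmap" idea discussed above.
// The affixes and flag values are illustrative only.
import java.util.ArrayList;
import java.util.List;

public class AffixedRoot {
    // One bit per suffix; a combination such as -less + -ness is two bits set.
    static final int LESS = 1 << 0;   // hope -> hopeless
    static final int NESS = 1 << 1;   // hopeless -> hopelessness
    static final int FUL  = 1 << 2;   // hope -> hopeful
    static final int LY   = 1 << 3;   // hopeful -> hopefully

    final String root;
    final int affixBits;

    AffixedRoot(String root, int affixBits) {
        this.root = root;
        this.affixBits = affixBits;
    }

    /** Expand the stored root into every surface form its bitmap allows. */
    List<String> expand() {
        List<String> forms = new ArrayList<>();
        forms.add(root);
        if ((affixBits & LESS) != 0) {
            forms.add(root + "less");
            if ((affixBits & NESS) != 0) {
                forms.add(root + "lessness");   // aggregate affix: -less + -ness
            }
        }
        if ((affixBits & FUL) != 0) {
            forms.add(root + "ful");
            if ((affixBits & LY) != 0) {
                forms.add(root + "fully");
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        AffixedRoot hope = new AffixedRoot("hope", LESS | NESS | FUL | LY);
        System.out.println(hope.expand());
        // prints: [hope, hopeless, hopelessness, hopeful, hopefully]
    }
}

Five surface forms stored as one root plus one int, which is the point Lew is
making about not needing to store every inflected form separately.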
I don't know how much memory expansion you think your factors will cause, as
you only hand wave and say there will be some and act like it's a problem, but
let's say it doubles the size of the dictionary. By Daniel's experiment
Let's not make up data. I would like to know how much of the
English language actually is in his dataset.
Agreed, let's not make up data. Using the Moby word-list as a guide, it
includes a significant number of words with those particular suffixes and
prefixes you've mentioned. Granted, it's not a complete word-list, but then
again nothing is. It is "complete enough" for most purposes.
upthread, that would bring it to around 8 MiB, let's round and say 10 MiB.
Being text and all, that should compress to about 3 MiB or less.
Now I am interested to hear what sort of trouble you assert that 3 MiB or so
of storage will cause.
Assuming your facts and then calling me on what you made up?
http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language
is short but interesting reading. Its final paragraph: "This suggests
that there are, at the very least, a quarter of a million distinct
English words, excluding inflections, and words from technical and
regional vocabulary not covered by the OED, or words not yet added to
the published dictionary, of which perhaps 20 per cent are no longer
in current use. If distinct senses were counted, the total would
probably approach three quarters of a million."
Now, add on what they exclude. Is one million words out of line?
It is one thing to have some of a language encoded. It is quite
another to try to handle everything. Exceptional cases tend to be
more difficult to handle.
Are you arguing that a modern system can't handle that number of words?
A modern desktop has more than enough memory to easily handle a quarter
*billion* words, which is 250 times your guessed upper limit.
And that's *without* compression.
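For what it's worth, the arithmetic isn't scary: at a round ten bytes per
word, a million words is on the order of 10 MB of raw text, and even a
quarter billion is a couple of gigabytes before Java's per-String overhead.
A plain sorted array with Arrays.binarySearch copes comfortably with a
million-word list; the Patricia trie buys you prefix matching and shared
storage, not bare feasibility. A sketch, with "words.txt" again standing in
for whatever list you load:

// Load a word list into a sorted array and do O(log n) membership tests.
// "words.txt" is a placeholder: one word per line.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SortedWordLookup {
    public static void main(String[] args) throws IOException {
        List<String> lines =
                Files.readAllLines(Paths.get("words.txt"), StandardCharsets.UTF_8);
        String[] words = lines.toArray(new String[0]);
        Arrays.sort(words);                     // binary search needs sorted input

        long chars = 0;
        for (String w : words) chars += w.length();
        System.out.printf("%d words, %d characters of word text%n",
                words.length, chars);

        // Roughly 20 comparisons for a million words.
        System.out.println(Arrays.binarySearch(words, "hopelessness") >= 0);
    }
}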