Re: Properly encoding "Project Gutenberg 1913 Webster Unabridged Dictionary".

From: Daniel Pitts <googlegroupie@coloraura.com>
Newsgroups: comp.lang.java.programmer
Date: Fri, 21 Sep 2007 15:31:23 -0000
Message-ID: <1190388683.344021.226870@q3g2000prf.googlegroups.com>

On Sep 21, 2:43 am, RedGrittyBrick <redgrittybr...@spamweary.foo>
wrote:

Roedy Green wrote:

On Thu, 20 Sep 2007 05:03:36 -0000, Daniel Pitts
<googlegrou...@coloraura.com> wrote, quoted or indirectly quoted
someone who said :

At this point, I'm not sure if I'd be better off converting their
custom "entities" into the equivalent UTF-8 encoded characters, or if
it would be better to convert all entities and non-standard characters
into some sort of XML encoded entities.
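For the first option, a minimal Java sketch might look like the following. It assumes the custom entities have the form "<name/" and uses a tiny, made-up entity table; both the syntax and the table entries are assumptions for illustration, so check them against the actual files before relying on any of it. Unknown entities are left untouched so nothing is silently lost.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityDecoder {
    // Hypothetical entity table: these names and mappings are illustrative,
    // not taken from the dictionary's real entity list.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("eacute", "\u00E9"); // e with acute accent
        ENTITIES.put("ae", "\u00E6");     // ae ligature
        ENTITIES.put("oelig", "\u0153");  // oe ligature
    }

    // Assumed entity syntax: '<', a name, then '/'.
    private static final Pattern ENTITY = Pattern.compile("<([A-Za-z0-9]+)/");

    /** Replaces every known entity with its character; unknown ones stay as-is. */
    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // keep the unknown entity verbatim
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("caf<eacute/ and <nosuch/ stays put"));
    }
}
```

The String that comes out holds Unicode characters; UTF-8 only enters the picture when the text is written out, for example through an OutputStreamWriter created with the UTF-8 charset.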

Perhaps the way to go is to devise a font that renders these odd
characters correctly. Then the text could be easily manipulated
programmatically with tiny mods to existing software. Then you could
even publish it as a PDF document.

Your problem then becomes political, talking some skilled type
designer into donating her skills in return for some exposure.

The purpose of a dictionary is semantic. The actual glyphs are
comparatively unimportant. The intellectual accomplishment does not lie
mainly in the choice of symbols.

If you want to reproduce the beautiful typography of the original, use
high-quality image scans.

Otherwise I'd translate the glyphs to something semantically or visually
close in the Unicode character set.

I think I'd try for a purely semantic markup in XML. Then create a
stylesheet that would render it in XHTML (say) and which would introduce
glyphs and fonts as close to the original as possible. That way, if
Unicode ever gets extended to include some of the odd characters used in
the original, you only have to amend the stylesheet.

So I'd represent the "double vertical bar" as an attribute of a tag,
e.g. <word spelling="adopted">. The stylesheet could then insert a glyph
visually close to "double vertical bar".
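As a rough sketch of that idea, the Java below runs a small XSLT stylesheet through the JDK's javax.xml.transform API. The <word> element and its spelling attribute come from the suggestion above; the <span class='headword'> output, the choice of U+2016 (DOUBLE VERTICAL LINE) as the stand-in glyph, and the sample entry are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SemanticRender {
    // Illustrative stylesheet: prefixes adopted words with U+2016.
    // If a better glyph turns up later, only this stylesheet changes.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='word'>"
      + "<span class='headword'>"
      + "<xsl:if test=\"@spelling='adopted'\">&#x2016;</xsl:if>"
      + "<xsl:value-of select='.'/>"
      + "</span>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to one XML fragment and returns the markup. */
    static String render(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample entries, not claims about the original text.
        System.out.println(render("<word spelling='adopted'>Abatis</word>"));
        System.out.println(render("<word>Abate</word>"));
    }
}
```

The semantic document never mentions the glyph at all, which is the point: presentation decisions live entirely in the stylesheet.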

In particular, I'd translate markup like "<universbold>" into
<exposition> or <shape-description> or something. I'm pretty sure
Webster didn't compose his dictionary with LaserJet fonts in mind :-)

Heh. He probably was using a BubbleJet :-)

But seriously, I'd like to keep the original intent (the
transcriber's, not necessarily Webster's) and then, in a later stage
of the processing, convert it to a more semantic representation,
probably ignoring the rendering information. My personal use case
only cares about the relationships between words and their parts of
speech. For instance, I'd like to be able to recognize Ran, Run, and
Runs as different forms of the same verb, and Leaf/Leaves as
different inflections of the same noun.
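To make that concrete, here is a minimal sketch of the kind of lookup I have in mind. The class and method names are hypothetical, and the table is filled in by hand here; in practice it would be built from the dictionary's own inflection data. It also assumes each form belongs to exactly one headword, which is a simplification ("leaves" is also a verb form).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class InflectionIndex {
    /** A headword plus its part of speech. */
    static final class Lemma {
        final String headword;
        final String partOfSpeech;
        Lemma(String headword, String partOfSpeech) {
            this.headword = headword;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // Maps each lower-cased surface form to its single lemma (simplification).
    private final Map<String, Lemma> byForm = new HashMap<>();

    private static String key(String form) {
        return form.toLowerCase(Locale.ENGLISH);
    }

    /** Registers a headword and all of its inflected forms. */
    void add(String headword, String partOfSpeech, String... forms) {
        Lemma lemma = new Lemma(headword, partOfSpeech);
        byForm.put(key(headword), lemma);
        for (String form : forms) {
            byForm.put(key(form), lemma);
        }
    }

    /** Returns the lemma for a form, or null if the form is unknown. */
    Lemma lookup(String form) {
        return byForm.get(key(form));
    }

    /** True when both forms are known and were registered under one headword. */
    boolean sameWord(String a, String b) {
        Lemma la = lookup(a);
        Lemma lb = lookup(b);
        return la != null && la == lb;
    }

    public static void main(String[] args) {
        InflectionIndex index = new InflectionIndex();
        index.add("run", "verb", "ran", "runs", "running");
        index.add("leaf", "noun", "leaves");
        System.out.println(index.sameWord("Ran", "Runs"));
        System.out.println(index.sameWord("Leaf", "Leaves"));
        System.out.println(index.sameWord("Ran", "Leaves"));
    }
}
```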

Actually, that's not quite my "ultimate" goal. The ultimate goal is to
create an English imperative sentence parser to use in a text
adventure game. I just figured I might as well do something useful
for the community while I'm at it (in this case, semanticize the
dictionary). It appears, though, that gcide_xml may already have done
what I wanted to do.