Re: Properly encoding "Project Gutenberg 1913 Webster Unabridged Dictionary".

From: Daniel Pitts <googlegroupie@coloraura.com>
Newsgroups: comp.lang.java.programmer
Date: Thu, 20 Sep 2007 20:43:20 -0000
Message-ID: <1190321000.630728.104780@q3g2000prf.googlegroups.com>

On Sep 20, 6:39 am, "Jeff Higgins" <oohigg...@yahoo.com> wrote:

Daniel Pitts wrote:

So, I've spent all day working on this. Funfun...

Back story: Project Gutenberg creates free ebooks from content that is
now in the public domain, including the "1913 Webster Unabridged
Dictionary". The problem with this particular work (pgw050*.txt) is
that it uses a very "odd" character set and an almost-XML markup (it
may be valid SGML, but I wouldn't bank on it).

It's part DOS extended ASCII, plus some proprietary character
codes.

My goal:
I'd like to get this into a form that is easily processed by a
program. I think the best way to do this is to put it into a robust
XML format. This would involve cleaning up the markup so that it is
valid XML, as well as processing some of the character codes into
nicer forms. I've already written a program that reads the
original texts and re-encodes the files as UTF-8, using appropriate
character substitution when possible.
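
A minimal sketch of the kind of re-encoding pass described above, not the
program Daniel actually wrote. The class name WebsterRecoder, the table
SUBSTITUTIONS, and the "[0x..]" placeholder format are hypothetical. The
0xD8 entry follows the table quoted later in this thread; the 0x82 entry
assumes the DOS code page 437 value (e acute) carries over unchanged. A real
table would be built from webfont.asc.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class WebsterRecoder {
    // Hypothetical substitution table: source byte value -> Unicode text.
    private static final Map<Integer, String> SUBSTITUTIONS = new HashMap<>();
    static {
        SUBSTITUTIONS.put(0x82, "\u00E9"); // e acute in code page 437 (assumed to apply here)
        SUBSTITUTIONS.put(0xD8, "\u2016"); // double vertical bar, per the quoted table
    }

    // Decode the raw dictionary bytes into a Java String.
    static String recode(byte[] raw) {
        StringBuilder out = new StringBuilder(raw.length);
        for (byte b : raw) {
            int code = b & 0xFF; // treat the byte as unsigned
            if (code < 0x80) {
                out.append((char) code); // plain ASCII passes through
            } else {
                String replacement = SUBSTITUTIONS.get(code);
                // Unknown codes are kept visible instead of being silently dropped.
                out.append(replacement != null ? replacement
                                               : String.format("[0x%02X]", code));
            }
        }
        return out.toString();
    }

    // Re-encode the decoded text as UTF-8 bytes.
    static byte[] toUtf8(byte[] raw) {
        return recode(raw).getBytes(StandardCharsets.UTF_8);
    }
}
```

Leaving unmapped codes as visible placeholders makes it easy to search the
output for anything the table does not cover yet.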

Whew. After a quick read of webfont.asc and tagset.web I can feel
your pain. I think the main problem here is that the typesetter's /style/
conveys so much information. For instance:

216 d8 =D8 <par/ double vertical bar (short length; the long
               length is the graphics character 186)
               This precedes words marked with a double vertical bar in
               the original dictionary, signifying that the word was
               adopted directly into English without modification of
               the spelling.

For myself, I suppose the question would be: Do I want my
/program/ to understand and/or act upon the fact that character
code 0xd8 signifies the above, or is it strictly for a /human/ reader's
consumption? If the former, an XML tag would probably be appropriate;
if the latter, maybe an appropriate glyph is sufficient.
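
The two options can be sketched side by side. This is an illustration
only: the class DoubleBarDemo and the enum Mode are hypothetical names, the
tag name is borrowed from the <directly-adopted/> suggestion made elsewhere
in this thread, and U+2016 (DOUBLE VERTICAL LINE) is an assumed choice of
glyph. Only code 0xD8 gets special treatment here.

```java
final class DoubleBarDemo {
    // SEMANTIC_TAG: keep the meaning for programs. GLYPH: keep the look for humans.
    enum Mode { SEMANTIC_TAG, GLYPH }

    static String convert(byte[] raw, Mode mode) {
        StringBuilder out = new StringBuilder();
        for (byte b : raw) {
            int code = b & 0xFF; // treat the byte as unsigned
            if (code == 0xD8) {
                out.append(mode == Mode.SEMANTIC_TAG
                        ? "<directly-adopted/>" // machine-readable marker
                        : "\u2016");            // visual double vertical bar
            } else {
                // Sketch only: every other byte is copied as a single char,
                // which is correct for ASCII but not for other extended codes.
                out.append((char) code);
            }
        }
        return out.toString();
    }
}
```

With the tag form a program can query for directly adopted words. With the
glyph form it would have to know that U+2016 carries that meaning.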

Thanks for the reply. My main goal is to retain as much semantic
meaning as possible for the program to understand. So if I understand
your point, I should convert it to XML tags to maintain that
information...

This brings up a related point. In XML, can "&blah;" entities have
semantic meaning associated with them? Or are they only replacements
for otherwise difficult-to-represent characters? That makes the
difference between using &directlyAdopted; and <directly-adopted/>.
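
One way to see the difference is to parse both forms with the standard
JAXP DOM parser. In XML a general entity is a name for replacement text,
and a parser with default settings expands the reference, so the
application normally sees only the resulting characters; an empty element
survives as a node of its own. The sketch below assumes a JDK with default
XML settings. The class name EntityVersusElement, the <entry> wrapper, and
the sample headword "Abaca" are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class EntityVersusElement {
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // The entity is declared in the internal DTD subset and expanded by the
    // parser; the element's text content is just the replacement characters.
    static String entityAsText() {
        String xml = "<!DOCTYPE entry [<!ENTITY directlyAdopted \"&#x2016;\">]>"
                   + "<entry>&directlyAdopted;Abaca</entry>";
        return parse(xml).getDocumentElement().getTextContent();
    }

    // The empty element is the first child node of <entry>, so a program
    // can find it by name.
    static Node elementAsNode() {
        String xml = "<entry><directly-adopted/>Abaca</entry>";
        return parse(xml).getDocumentElement().getFirstChild();
    }
}
```

Here entityAsText() yields the double-bar character followed by "Abaca",
with no trace of the entity name, while elementAsNode() yields an element
node named "directly-adopted".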

<http://www.gutenberg.org/dirs/etext96/pgw050ab.txt>

At this point, I'm not sure whether I'd be better off converting their
custom "entities" into the equivalent UTF-8 encoded characters, or
whether it would be better to convert all entities and non-standard
characters into some sort of XML-encoded entities.
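
For plain characters the two choices are equivalent to an XML parser: a
literal UTF-8 character and a numeric character reference such as &#x2016;
denote the same character. A small sketch of the second option, assuming
ASCII-only output is wanted; the class NumericReferences and the method
toAsciiXml are hypothetical names, and control characters that XML 1.0
forbids are not handled.

```java
final class NumericReferences {
    // Escape XML markup characters and write every non-ASCII code point
    // as a hexadecimal numeric character reference.
    static String toAsciiXml(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp < 0x80) {
                out.append((char) cp); // ASCII is written as-is
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
        });
        return out.toString();
    }
}
```

Numeric references need no DTD, unlike named entities, so the output stays
parseable as a standalone document.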

Anyone have suggestions on what would be the most useful way to go?

Thanks,
Daniel.