Re: Parsing generic XML

From: Owen Jacobson <angrybaldguy@gmail.com>
Newsgroups: comp.lang.java.programmer
Date: Wed, 11 Jun 2008 08:28:45 -0700 (PDT)
Message-ID: <e876a780-5d7d-40c3-a39b-9c9fec4665fe@x35g2000hsb.googlegroups.com>

On Jun 11, 10:40 am, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:

I have some XML, namely PAD files, for which I have no schema, though
I probably could cook one up in a day or two.

Similarly, I have some XHTML I want to screen-scrape, where I really
only care about the <table>, <tr> and <td> elements.

So what I am after is some sort of extremely relaxed schema that will
eat pretty well anything so long as the tags balance.

I tried parsing without any schema at all, and it choked on &nbsp;
entities.

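Entity references (&nbsp; and friends) only have meaning with respect
to a DTD (internal or external) which maps them to entities (e.g., the
no-break space character, U+00A0, in the case of &nbsp;). Apart from the
five predefined references (&lt;, &gt;, &amp;, &apos; and &quot;), XML
documents which contain entity references MUST contain or point to a
declaration somewhere; there's not really any avoiding it.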
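To make that concrete, here is a minimal sketch using the JDK's built-in DOM parser. The class name, the <note> element and the sample strings are made up for illustration. It parses the same content twice: once with &nbsp; declared in an internal DTD subset, and once with no declaration at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Illustrative only: the class name and the <note> document are hypothetical.
class InternalEntityDemo {

    // Parses the given XML with a non-validating DOM parser and
    // returns the text content of the root element.
    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The internal DTD subset declares nbsp, so the reference resolves.
        String declared =
            "<!DOCTYPE note [ <!ENTITY nbsp \"&#160;\"> ]>\n"
            + "<note>a&nbsp;b</note>";
        System.out.println(rootText(declared).length());

        // No declaration anywhere: the parser reports a fatal error.
        try {
            rootText("<note>a&nbsp;b</note>");
        } catch (SAXParseException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Note that validation is off in both cases; an undeclared entity reference is a well-formedness problem, not a validity problem.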

Fortunately, for XHTML that's easy; there's a published DTD.
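One way to use that DTD without fetching it from w3.org on every parse is an EntityResolver that answers the XHTML public identifier locally. The sketch below is an assumption-laden illustration: the class name is hypothetical, and STUB_DTD is a one-line stand-in that declares only nbsp. In real use you would return your own local copy of the published DTD and its entity files instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative only: class name and stub DTD are hypothetical.
class XhtmlResolverDemo {

    // Stand-in for a local copy of the XHTML DTD (assumption: the
    // documents being scraped only need nbsp).
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    // Parses an XHTML string without validation and returns all of its text.
    static String allText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Serve the DTD from memory when the DOCTYPE names an XHTML public ID.
        builder.setEntityResolver((publicId, systemId) -> {
            if (publicId != null && publicId.startsWith("-//W3C//DTD XHTML")) {
                return new InputSource(new StringReader(STUB_DTD));
            }
            return null; // anything else: let the parser resolve it as usual
        });

        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<body><p>a&nbsp;b</p></body></html>";
        System.out.println(allText(page).length());
    }
}
```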

In the case of PAD files you may have to replace the entity references
manually with the characters they stand for (or with numeric character
references such as &#160;), if you can't find a DTD that defines them.
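A rough sketch of that manual replacement, done as a text pass before the document reaches the parser. The class name and the entity table are hypothetical; I don't know which names PAD files actually use, so fill in the table from what you find. Names not in the table (including the predefined &amp; and friends) are left alone.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: class name and table contents are hypothetical.
class EntityPreprocessor {

    // Entity name -> Unicode code point. Extend as needed.
    static final Map<String, Integer> CODE_POINTS =
        Map.of("nbsp", 0xA0, "copy", 0xA9, "reg", 0xAE);

    private static final Pattern NAMED_REF =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Rewrites known named references as numeric character references,
    // which need no declaration.
    static String toNumericRefs(String xml) {
        Matcher m = NAMED_REF.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            String replacement =
                (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericRefs("<p>a&nbsp;b &amp; c&copy;</p>"));
    }
}
```

This is a blunt instrument: it works on raw text, so it will also rewrite matches inside CDATA sections and comments. For a one-off scrape that is probably acceptable.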

Any basic XML parser (JDOM, dom4j, SAX, W3C DOM, et multiple cetera)
should accept any well-formed document if you turn off validation.
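For the table-scraping case, a minimal sketch with the W3C DOM API that ships with the JDK. The class and method names are made up, and the sample markup is invented. No schema or DTD is involved; the parser only requires that the tags balance.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: class and method names are hypothetical.
class TableScraper {

    // Returns the trimmed text of every <td> element, in document order.
    static List<String> cellTexts(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // already the default; spelled out for clarity
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        NodeList cells = doc.getElementsByTagName("td");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            texts.add(cells.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<html><body><table>"
            + "<tr><td>one</td><td> two </td></tr>"
            + "</table><p>ignored</p></body></html>";
        System.out.println(cellTexts(page));
    }
}
```

Everything outside the <td> elements is parsed but simply never looked at, which is about as relaxed as it gets.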

-o