Re: Parsing generic XML
On Jun 11, 10:40 am, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:
> I have some XML, namely PAD files, for which I have no schema, though
> I probably could cook one up in a day or two.
> Similarly, I have some XHTML I want to screen-scrape, where I really
> only care about the <table>, <tr> and <td> elements.
> So what I am after is some sort of extremely relaxed schema that will
> eat pretty well anything so long as the tags balance.
> I tried parsing without any schema at all, and it choked on
> entities.
Entity references (&nbsp; and friends) only have meaning with respect
to a DTD which maps them to entities (e.g., &#160; in the case of
&nbsp;). Apart from the five predefined ones (&amp;, &lt;, &gt;,
&apos; and &quot;), any entity reference in an XML document MUST have
a declaration somewhere; there's not really any avoiding it.
Fortunately, for XHTML that's easy; there's a published DTD.
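
To make the "must have a declaration" point concrete, here is a minimal sketch using the JDK's built-in DOM parser. It does not load the published XHTML DTD; as a simplifying assumption it declares only &nbsp; in an internal DTD subset, which is enough to show the difference. The class and method names (DeclaredEntityDemo, rootText) are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class DeclaredEntityDemo {

    // Parses the given XML text with a non-validating parser and returns
    // the text content of the root element. Parse failures are rethrown
    // as IllegalArgumentException to keep the caller simple.
    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // The internal subset declares nbsp, so the reference resolves.
        String declared =
            "<!DOCTYPE p [ <!ENTITY nbsp \"&#160;\"> ]>"
            + "<p>a&nbsp;b</p>";
        System.out.println(rootText(declared).equals("a\u00A0b"));

        // The same body without the declaration is rejected by the parser.
        try {
            rootText("<p>a&nbsp;b</p>");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The real XHTML DTDs declare the full set of Latin-1, symbol and special entities in the same way, just in external files rather than inline.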
In the case of PAD files you may have to replace the entity references
with numeric character references (or the literal characters) manually,
if you can't find a DTD that defines them.
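
If you go the manual route, something along these lines could do the replacement before the text reaches the parser. It is only a sketch: the class name EntityRewriter and its four-entry table are invented for illustration (I don't know which entity names your PAD files actually use), it assumes Java 9 or later, and being a plain regex pass it will also rewrite references inside CDATA sections or comments.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class EntityRewriter {

    // Deliberately tiny, illustrative table of entity name -> code point.
    // Extend it with whatever names actually occur in your files.
    private static final Map<String, Integer> CODE_POINTS = Map.of(
        "nbsp", 160,
        "copy", 169,
        "reg", 174,
        "eacute", 233);

    // Matches named references such as &nbsp; but not numeric ones
    // such as &#160; because the name must start with a letter.
    private static final Pattern ENTITY_REF =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Rewrites every named reference found in the table as a numeric
    // character reference. Names not in the table (including the five
    // predefined ones such as &amp;) are left untouched.
    static String toNumericReferences(String xml) {
        Matcher m = ENTITY_REF.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            String replacement =
                (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericReferences("<p>a&nbsp;b &amp; &copy;</p>"));
    }
}
```

Numeric character references need no declaration, so the rewritten text can be handed to any parser as-is.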
Any basic XML parser (JDOM, dom4j, SAX, W3C DOM, et multiple cetera)
should accept any well-formed document if you turn off validation.
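
For the screen-scraping case, a non-validating parse with the W3C DOM API might look roughly like this. It is a sketch under assumptions: the input is already well-formed with its entities declared or replaced, the document is parsed without namespace awareness so the cells are matched by the plain tag name "td", and TableCellScraper / cellTexts are names I made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class TableCellScraper {

    // Parses well-formed XML without validation and returns the trimmed
    // text of every <td> element, in document order.
    static List<String> cellTexts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // already the default; stated for clarity
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList cells = doc.getElementsByTagName("td");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                texts.add(cells.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>ignored</p>"
            + "<table><tr><td>one</td><td> two </td></tr></table>"
            + "</body></html>";
        System.out.println(cellTexts(page));
    }
}
```

One thing to watch for: with the JDK's default parser, turning validation off does not necessarily stop it from trying to fetch an external DTD named in a DOCTYPE, so a real XHTML page may still trigger a network request.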
-o