Re: Loading a simple XHTML transitional document into a org.w3c.dom.Document

From:

"Mike Schilling" <mscottschilling@hotmail.com>

Newsgroups:

comp.lang.java.programmer

Date:

Thu, 9 Jul 2009 15:16:19 -0700

Message-ID:

<h35q7m$uh$1@news.eternal-september.org>

Ion Freeman wrote:

Hi!
  I'm just trying to do the simplest thing in the world. Where input
is a java.io.File that contains a transitional XHTML 1.0 file, I do

     DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
     dbf.setNamespaceAware(false);
     db = dbf.newDocumentBuilder();
     Document doc = db.parse(input);

Unfortunately, this tries to pull the DTD from the W3C, and they
didn't like that. So, they give me a 503 error. I tried the
EntityResolver from
http://forums.sun.com/thread.jspa?threadID=5244492, but that just
gives me a MalformedURLException. Either way, my parse fails.

I'm sure that at least tens of thousands of people have written code
to do this, but I can't find a (working) reference online. I think
most of my XML parsing happened when the W3C would just give the DTDs
out -- I understand that they found that unworkable, but I still need
to parse my document.

How should I be doing this?

You should be able to solve this with an entity resolver that returns an
input source containing the right DTD text. They're not that difficult to
construct; just recognize the URL and return a StringReader or
ByteArrayInputStream. Return null for any URL you don't recognize.
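
A minimal sketch of such a resolver follows. The class name LocalDtdResolver,
the parse helper, and the one-line STUB_DTD are all made up for illustration;
in real use you would substitute the full text of the XHTML DTD (and of the
entity files it pulls in) from a local copy. Checked exceptions are wrapped
only to keep the example short.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdResolver implements EntityResolver {
    // System identifier that XHTML 1.0 Transitional documents declare.
    static final String XHTML_TRANSITIONAL =
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    // Placeholder DTD text (an assumption for this sketch, not the real DTD).
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    // Maps each recognized system identifier to the DTD text to serve for it.
    private final Map<String, String> dtdBySystemId = new HashMap<String, String>();

    LocalDtdResolver() {
        dtdBySystemId.put(XHTML_TRANSITIONAL, STUB_DTD);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtdText = dtdBySystemId.get(systemId);
        if (dtdText == null) {
            // Unrecognized URL: null tells the parser to resolve it itself.
            return null;
        }
        InputSource source = new InputSource(new StringReader(dtdText));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses an XML string with the resolver installed on the builder.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(false);
            DocumentBuilder db = dbf.newDocumentBuilder();
            db.setEntityResolver(new LocalDtdResolver());
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }
}
```

The important call is db.setEntityResolver(...) before db.parse(...); with
that in place the parser asks the resolver for the DTD instead of fetching
it from the W3C.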

If you know for a fact that the parser is Xerces (it's the default in Java
1.5 and later), you could try setting the Xerces-specific features to ignore
DTDs. One suggestion is to set
http://xml.org/sax/features/external-parameter-entities to
"false", though we set
"http://apache.org/xml/features/nonvalidating/load-dtd-grammar" and
"http://apache.org/xml/features/nonvalidating/load-external-dtd" to false.
Be sure to call setValidating(false) too, though I'm pretty sure that's the
default anyway.
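
Roughly, the feature-based approach looks like the sketch below. It assumes
the JAXP implementation in use is Xerces or the Xerces-derived parser bundled
with the JDK; other parsers may reject these feature names with a
ParserConfigurationException. The class name NoDtdParser and its helper
methods are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NoDtdParser {
    static final String LOAD_DTD_GRAMMAR =
        "http://apache.org/xml/features/nonvalidating/load-dtd-grammar";
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Builds a non-validating factory that does not load the external DTD.
    static DocumentBuilderFactory newFactory() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(false);
            dbf.setValidating(false);
            // Xerces-specific features; assumed to be supported by the parser.
            dbf.setFeature(LOAD_DTD_GRAMMAR, false);
            dbf.setFeature(LOAD_EXTERNAL_DTD, false);
            return dbf;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Parser does not support these features", e);
        }
    }

    // Parses an XML string without fetching the DTD named in its DOCTYPE.
    static Document parse(String xml) {
        try {
            return newFactory().newDocumentBuilder()
                               .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }
}
```

Because the DTD is never read, entities it defines (such as &nbsp;) are not
declared, so this suits documents that don't rely on them; otherwise the
entity resolver is the safer route.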