On 23 Oct, 03:55, earth_792 <mike_nguy...@hotmail.com> wrote:
Does anyone have any ideas how to convert Html into XML by using
Java?
This depends on what you mean by "HTML". If it's guaranteed to be
well-formed and valid, then it's a simple matter - use an SGML or HTML
parser, then output the DOM as XML.
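For the well-formed case, a minimal sketch using only the JDK's built-in XML parser might look like the following. It assumes the page is already well-formed XML (XHTML-style, every tag closed) with no DOCTYPE and no named entities such as &nbsp;. The class and method names are made up for illustration, and a real SGML or HTML parser would take the place of DocumentBuilder for anything looser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parse well-formed markup into a DOM, then write the DOM back out as XML.
final class WellFormedHtmlToXml {

    static String convert(String markup) {
        try {
            // Step 1: parse the markup into a DOM tree (fails on tag soup).
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(markup)));

            // Step 2: serialize the DOM as XML with an identity transform.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Anything that is not well-formed ends up here.
            throw new IllegalArgumentException("Input is not well-formed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hi</title></head>"
                + "<body><p>Hello<br/>world</p></body></html>";
        System.out.println(convert(page));
    }
}
```

Feeding this something like "<p>text<br></p>" throws, which is exactly the point where the tag-soup problem starts.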
If it's "typical" HTML "tag soup", then this is fundamentally a much
more difficult task. You can't convert with a simple automatic
process; at times you have to infer "what the author meant" rather
than "what they wrote". I suggest reading up on HTML Tidy, which has
a Java port (JTidy) and whose documentation discusses the problems and
their solutions.
If you're trying to embed HTML in RSS (which is usually an XML
protocol) or similar, then you don't even need to "convert HTML to
XML", you just need to escape the markup characters (such as "<", ">"
and "&") as entities, or wrap the HTML in a CDATA section. That's
_much_ easier: you don't even need an HTML parser, just a simple
character-by-character scan and replace.
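The scan-and-replace can be sketched roughly as below. This is only an illustration: the class and method names are invented, and it covers just the characters that matter inside XML element text (attribute values would also need quotes escaped). The CDATA variant has one catch, which is that the HTML itself must not contain "]]>", so the sketch splits that sequence across two sections.

```java
// Hypothetical helper for embedding an HTML fragment as text inside an XML document.
final class XmlText {

    // Character-by-character scan: replace markup-significant characters with entities.
    static String escape(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Alternative: wrap the fragment in CDATA, splitting any "]]>" so it cannot end the section early.
    static String cdata(String html) {
        return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String fragment = "<p>Tom & Jerry</p>";
        System.out.println("<description>" + escape(fragment) + "</description>");
        System.out.println("<description>" + cdata(fragment) + "</description>");
    }
}
```

Either form gives the feed reader the same HTML string back once the XML is parsed, so which one to use is mostly a matter of taste.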
On the whole though, I can't imagine many cases when it really is
necessary to "convert HTML to XML". Just about the only one is loading
legacy web sites into a new XML-based CMS.
If you give us more context, then you might get more relevant advice.
post. Now, I understand what I should do. My initial problem is "to
new presentation). I don't want to cut and paste content from legacy
ones to the new ones. There are thousands of pages. So, I thought if
HTML. :))