Re: parsing xml from a stream

From:

"Mike Schilling" <mscottschilling@hotmail.com>

Newsgroups:

comp.lang.java.programmer

Date:

Wed, 26 Aug 2009 23:36:32 -0700

Message-ID:

<h759hi$np2$1@news.eternal-september.org>

Peter Horlock wrote:

Hi,

I am using apache xmlbeans 2.2 to parse XML from an InputStream
and to create Java Beans from it.

The input is ISO-8859-1 encoded. The first 3 lines, as well as
the last 3 lines, are empty lines, and I can't (currently) change
that. Before, we were using method.getResponseBodyAsString().trim();
and gave the result to xmlbeans - that worked, but resulted in a lot
of warnings in the server logs, as the input can sometimes be pretty
big.
Here's what I am doing now:
InputStream inputStream = method.getResponseBodyAsStream();

XmlOptions xmlOptions = new XmlOptions();
xmlOptions.setCharacterEncoding("ISO-8859-1");
xmlOptions.setLoadStripComments();
xmlOptions.setLoadTrimTextBuffer();
xmlOptions.setLoadStripWhitespace();

org.apache.xmlbeans.SchemaType type =
(org.apache.xmlbeans.SchemaType);
org.apache.xmlbeans.XmlBeans.getContextTypeLoader().parse
( inputStream, type, xmlOptions );

This, however, throws the following error:
[...]
Caused by: java.io.CharConversionException: Malformed UTF-8 character: 0xfc 0x72 0x6b 0x65
        at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:141)
        at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
        at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
        at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3474)
        at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:3958)
        at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
        at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
        at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
        at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
        at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)

------------
When I instead used
method.getResponseBodyAsString().trim();
and created an InputStream based on the trimmed String, then it
worked. So I assume something is wrong with the empty lines at the
beginning and end of the document. How can I get rid of them without
converting the entire stream to a String (e.g. getResponseBodyAsString())?

Write a subclass of FilterInputStream that trims off any leading
whitespace. I suspect the trailing whitespace won't cause any
problems, which is good, because it's harder to recognize.
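
Something along these lines might do as a starting point. It's only a
sketch: the class name is made up, it treats just space, tab, CR and LF
as whitespace (single bytes in ISO-8859-1), and it leans on a
PushbackInputStream so the first non-whitespace byte can be put back.
Trailing whitespace is passed through untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XMLBeans or HttpClient.
class LeadingWhitespaceTrimmingInputStream extends FilterInputStream {
    private boolean leadingSkipped = false;

    LeadingWhitespaceTrimmingInputStream(InputStream in) {
        // One byte of pushback is enough to return the first real byte.
        super(new PushbackInputStream(in, 1));
    }

    // Consume whitespace bytes once, before the first real read.
    private void skipLeadingWhitespace() throws IOException {
        if (leadingSkipped) {
            return;
        }
        leadingSkipped = true;
        PushbackInputStream pushback = (PushbackInputStream) in;
        int b;
        while ((b = pushback.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                pushback.unread(b); // first non-whitespace byte goes back
                return;
            }
        }
    }

    @Override
    public int read() throws IOException {
        skipLeadingWhitespace();
        return super.read();
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        skipLeadingWhitespace();
        return super.read(buf, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        skipLeadingWhitespace();
        return super.skip(n);
    }
}

class TrimDemo {
    public static void main(String[] args) throws IOException {
        byte[] raw = "\n\n\n<doc>x</doc>\n\n\n".getBytes(StandardCharsets.ISO_8859_1);
        InputStream trimmed =
            new LeadingWhitespaceTrimmingInputStream(new ByteArrayInputStream(raw));
        System.out.println((char) trimmed.read()); // first byte after the blank lines
    }
}
```

You would then wrap the stream from method.getResponseBodyAsStream() in
this class before handing it to the parser. I haven't tried it against
XMLBeans, so treat it as untested in that setting.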

This is very odd, though. If the input is ISO-8859-1, and you've told
the parser that it's ISO-8859-1, what the hell is it complaining about
malformed UTF-8 characters for? The blank lines can't be causing it,
because they'd be ASCII characters, which have the same values in
ISO-8859-1 and UTF-8.
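
To make that concrete, here is a small check you could run on the four
bytes from the exception message. The class and method names are my
own invention; it uses only the standard java.nio.charset decoders. It
decodes the bytes as ISO-8859-1, asks a strict UTF-8 decoder whether
they are well-formed, and then does the same for a run of blank lines.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical diagnostic class, for illustration only.
class EncodingCheck {
    // True if the bytes form well-formed UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The bytes quoted in the CharConversionException message.
        byte[] reported = {(byte) 0xfc, 0x72, 0x6b, 0x65};

        // In ISO-8859-1, 0xfc is U+00FC (u with diaeresis), followed by "rke".
        String latin1 = new String(reported, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.equals("\u00fcrke"));
        System.out.println(isValidUtf8(reported));

        // Blank lines are plain ASCII: identical bytes in both encodings.
        String blankLines = "\r\n\r\n\r\n";
        byte[] asLatin1 = blankLines.getBytes(StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = blankLines.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(asLatin1, asUtf8));
        System.out.println(isValidUtf8(asLatin1));
    }
}
```

So the bytes the decoder choked on are ordinary ISO-8859-1 text, and
the blank lines by themselves are valid in either encoding.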