Re: Detect XML document encodings with SAX
Arne Vajhøj wrote:
Lew wrote:
Sebastian wrote:
[snip]
output an encoding of UTF-8, while looking at the file
as they should.
No.
If the XML prolog specifies another encoding than UTF-8,
then it should not return UTF-8.
True, but I'm saying they should specify UTF-8 in the prolog.
XML should be encoded in UTF-8 nearly always.
See?
XML allows for other encodings.
So? You should use UTF-8 nearly always, i.e., unless there's a compelling
reason not to.
And Java XML parsers support it.
For those rare times when you deviate from the usual UTF-8.
So it should always work.
But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
Output usually means System.out.println - that works fine with a parser.
His phrasing wasn't clear to me. That's why I asked for clarification.
I could have guessed, too.
If your problem is with reading the file, then the encoding in the XML declaration
See? You're preaching to the choir.
should suffice to guide the parser. But then why do you talk about methods that
"output an encoding"?
Because he wants to know what it is.
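For what it's worth, here is a minimal sketch of one way to read the declared
encoding. Note that it uses StAX (javax.xml.stream), not SAX, because
XMLStreamReader.getCharacterEncodingScheme() is documented to return the
encoding from the XML declaration, or null if there is none. The class and
method names are made up for illustration, and I am assuming the JDK's
bundled StAX implementation.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not taken from the thread.
class DeclaredEncoding {

    // Returns the encoding named in the XML declaration, or null if none is declared.
    static String declaredEncoding(byte[] xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                // The reader starts at START_DOCUMENT, so the declaration is already read.
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not read XML declaration", e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><a/>"
                .getBytes(Charset.forName("windows-1252"));
        System.out.println(declaredEncoding(doc));
    }
}
```

With SAX itself, org.xml.sax.ext.Locator2.getEncoding() is the usual place to
look, but whether the locator implements Locator2 depends on the parser.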
However, according to
http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
and EUC-JP,
So it looks like you must not accept XML documents with such a
non-standard encoding.
Those who have researched would know that the XML spec does not
limit the encodings at all. An XML processor must support UTF-8
and UTF-16, but is free to support others.
Perhaps the OP's parser doesn't exercise that freedom, judging by the
symptoms.
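One quick way to check that, as a sketch only: feed the parser a small
document that declares windows-1252 and contains bytes that only make sense
in that encoding, then see what comes out of characters(). The class and
method names below are made up, and I am assuming the default SAX parser that
ships with the JDK; the OP's parser may behave differently.

```java
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;

// Hypothetical check, not taken from the thread.
class Cp1252Parse {

    // Parses the bytes with the default SAX parser and returns all character data.
    static String textOf(byte[] xml) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            text.append(ch, start, length);
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // e-acute and the euro sign; the euro sign is byte 0x80 in windows-1252.
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><a>\u00e9\u20ac</a>";
        String parsed = textOf(xml.getBytes(Charset.forName("windows-1252")));
        System.out.println(parsed.equals("\u00e9\u20ac"));
    }
}
```

If the comparison fails or the parse throws, the parser is either ignoring
the declaration or does not support the encoding.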
'sall I'm sayin'.
Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?
Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.
--
Lew