Re: Whitespace problems, xml-parsing
WP wrote:
On 15 Apr, 19:05, WP <mindcoo...@gmail.com> wrote:
On 15 Apr, 17:35, RedGrittyBrick <RedGrittyBr...@SpamWeary.foo> wrote:
WP wrote:
I'm very rusty at Java, and this is the first time I've worked with
XML in any programming language. My problem is that when I parse it
I get a lot of text nodes containing just whitespace, even though I
thought I had set the parser to ignore such whitespace.
Maybe it is because of this ...
"Note that only whitespace which is directly contained within
element content that has an element only content model (see
XML Rec 3.2.1) will be eliminated."
From API reference documentation.
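To illustrate what that note implies, here is a minimal sketch, not a definitive recipe. It assumes a validating parser and a made-up internal DTD that gives <inventory> and <animal> element-only content; the class name, the sample document and the childCount() helper are all invented for the example. With the grammar available, the parser can tell which whitespace sits in element-only content and drop it, while the text inside <name> is left alone.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class IgnoreElementWhitespace {

    // Hypothetical sample: the internal DTD declares element-only content
    // for <inventory> and <animal>, and character data for <name>.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE inventory [\n"
        + "  <!ELEMENT inventory (animal*)>\n"
        + "  <!ELEMENT animal (name)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "]>\n"
        + "<inventory>\n"
        + "  <animal>\n    <name>Fred</name>\n  </animal>\n"
        + "  <animal>\n    <name> Gert </name>\n  </animal>\n"
        + "</inventory>\n";

    static Document parse(String xml, boolean ignoreWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The parser needs the content model, so validation is switched on.
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Fail loudly on validation errors instead of printing to stderr.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXParseException { throw e; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Number of direct child nodes (elements and text) of the root element.
    static int childCount(String xml, boolean ignoreWhitespace) throws Exception {
        return parse(xml, ignoreWhitespace).getDocumentElement()
                .getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("keeping whitespace:  " + childCount(XML, false));
        System.out.println("ignoring whitespace: " + childCount(XML, true));
    }
}
```

Whether the same thing happens when the grammar comes from an XML Schema rather than a DTD is a separate question, and this sketch does not try to answer it.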
I have now turned on validation and made sure the schema is found
properly (I had missed a call to setFeature() on the
DocumentBuilderFactory object). It didn't solve anything, however;
the output is still as in my OP. Then I stumbled upon this:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6545684
so it seems that it worked the way I want in JDK 5 but regressed in
JDK 6, and was then "fixed" even though JDK 5 did wrong. But I'm running
the latest JDK, so maybe that fix was reverted. Sigh.
How do people handle this?
I wrote this when first learning Java + XML a while ago. It looks a
bit lame now, but I think it does what you want. It discards whitespace
used for indentation but retains all whitespace (including leading and
trailing whitespace) within data elements.
-------------------------------- 8< ----------------------------------
import java.io.File;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class ParseXMLbyDOM {

    public static void main(String[] args) {
        String filename = "XML/animals.xml";
        String uri = "file:" + new File(filename).getAbsolutePath();
        Document doc = null;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory
                    .newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            doc = builder.parse(uri);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // doc stays null if parsing failed; doRecursive() copes with that.
        doRecursive(doc, "");
    }

    // Visit every child of the given node, passing down the dotted path.
    private static void doRecursive(Node node, String name) {
        if (node == null)
            return;
        NodeList nodes = node.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n == null)
                continue;
            doNode(n, name);
        }
    }

    private static void doNode(Node node, String name) {
        switch (node.getNodeType()) {
        case Node.ELEMENT_NODE:
            // Extend the dotted path, e.g. "inventory.animal.name".
            String nodeName;
            if (name.length() == 0) {
                nodeName = node.getNodeName();
            } else {
                nodeName = name + "." + node.getNodeName();
            }
            doRecursive(node, nodeName);
            break;
        case Node.TEXT_NODE:
            String text = node.getNodeValue();
            // Skip indentation: empty text, a newline followed only by
            // spaces or tabs, or a lone carriage return.
            if (text.length() == 0 || text.matches("\n[ \t]*")
                    || text.equals("\r")) {
                break;
            }
            String type = "";
            // Text nodes never carry attributes, so look for "type" on
            // the enclosing element instead.
            Node parent = node.getParentNode();
            NamedNodeMap attrs = (parent == null) ? null
                    : parent.getAttributes();
            if (attrs != null) {
                Node attr = attrs.getNamedItem("type");
                if (attr != null) {
                    type = attr.getNodeValue();
                }
            }
            System.out.println(name + "(" + type + ") = '"
                    + text + "'.");
            break;
        default:
            System.out.println("Other node "
                    + node.getNodeType() + " : "
                    + node.getClass());
            break;
        }
    }
}
-------------------------------- 8< ----------------------------------
<inventory>
<animal type="mammal">
<name>Fred</name>
<species>Hippo</species>
<weight units="Kg">1552</weight>
</animal>
<animal type="reptile">
<name>
Gert
AKA Gertrude
the galloping reptile
</name>
<species>Croc</species>
</animal>
</inventory>
-------------------------------- 8< ----------------------------------
inventory.animal.name() = 'Fred'.
inventory.animal.species() = 'Hippo'.
inventory.animal.weight() = '1552'.
inventory.animal.name() = '
Gert
AKA Gertrude
the galloping reptile
'.
inventory.animal.species() = 'Croc'.
-------------------------------- 8< ----------------------------------
I will be reading schemas and files where the content is
unknown beforehand. How am I to know what whitespace is just eye-candy
(indentation) and should be discarded, and what is actual data and
should be kept?
I don't think XML explicitly differentiates between "eye candy"
whitespace and "actual data" whitespace.
If so, you'll have to invent your own heuristics for this.
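One possible heuristic, sketched here purely as an illustration: treat a whitespace-only text node as indentation when its parent also has element children, and keep every other text node untouched. The class and method names are made up for the example. The obvious limitation is genuine mixed content, where a lone space between two inline elements would be thrown away as well.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StripIndentation {

    // Heuristic: a whitespace-only text node whose parent also contains
    // element children is assumed to be indentation and is removed.
    static void strip(Node node) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        // Copy the children first: the NodeList is live and would
        // shrink underneath the loop as nodes are removed.
        List<Node> snapshot = new ArrayList<Node>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            snapshot.add(child);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
            }
        }
        for (Node child : snapshot) {
            if (hasElementChild
                    && child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                strip(child);
            }
        }
    }

    // Parse without any grammar, then apply the heuristic to the tree.
    static Document parseAndStrip(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        strip(doc);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>\n  <animal>\n    <name> Gert </name>\n"
                + "  </animal>\n</inventory>";
        Document doc = parseAndStrip(xml);
        System.out.println("'" + doc.getDocumentElement().getTextContent() + "'");
    }
}
```

Because it works on the finished DOM tree, this needs no DTD or schema, which is also why it can only guess.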
--
RGB