Re: Out of memory error in SAX parsing with validation
On 12/26/2014 5:45 AM, Sebastian wrote:
On 26.12.2014 00:45, Arne Vajhøj wrote:
On 12/25/2014 6:32 PM, Sebastian wrote:
Does anyone here know something about the memory requirements for
validating XML with SAX? I've encountered what I think is a memory leak
with the Xerces version included in JDK 7 and 8.
I'm using a SAX parser (XMLReader) to parse a large XML file.
Using a non-validating parser, I can process a 7 GB file containing 25
million small elements (each having ca. 3 - 5 subelements) with just 64
MB of heap space. With XML validation against a DTD turned on, 1024 MB
do not suffice. I have taken a cursory glance at the heap with
JVisualVM, and see millions of QName instances being created and never
being GC'ed. I suspect this to be at least a part of the problem.
Can anyone enlighten me as to why SAX would require so much memory for
validation? Isn't it enough to know that each element is well-formed?
If you ask it to validate against a DTD then it is obviously not
enough to check for well-formedness.
Sorry, I did mean "valid". At the end of an element, shouldn't the
parser be able to release all resources associated with validating the
current "level", i.e. everything except information about the ancestors
of the next element? After all, a DTD cannot contain constraints like:
"if you have seen element X, no element Y may occur",
which would necessitate retaining information about siblings.
That would have been my expectation as well.
But Xerces seems to work differently.
Maybe time to dust off good old Crimson.
:-)
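To make that expectation concrete, here is a minimal sketch of the bookkeeping it implies. It is not a validator and says nothing about how Xerces or Crimson are implemented; it only shows that a handler which keeps one entry per open element never holds more than the nesting depth, however many siblings go by. The class name OpenElementStack and the helper maxDepthFor are made up for illustration. (ID/IDREF attributes are the exception in DTD validation, since they need document-wide state, but the test DTD here has none.)

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OpenElementStack extends DefaultHandler {
    // Names of the currently open elements, innermost on top.
    private final Deque<String> open = new ArrayDeque<>();
    private int maxDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
        maxDepth = Math.max(maxDepth, open.size());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Whatever was tracked for the finished element is dropped here.
        open.pop();
    }

    public int getMaxDepth() {
        return maxDepth;
    }

    // Parses a root element with n children and returns the deepest stack seen.
    static int maxDepthFor(int n) throws Exception {
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < n; i++) {
            sb.append("<elm>bla bla</elm>");
        }
        sb.append("</root>");
        OpenElementStack handler = new OpenElementStack();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(sb.toString())), handler);
        return handler.getMaxDepth();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(maxDepthFor(10));     // prints 2
        System.out.println(maxDepthFor(10000));  // prints 2
    }
}
```

The stack holds two entries at most for this document shape, regardless of the number of siblings.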
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 107 validating: false -> 3.8 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 305 validating: false -> 3.0 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2285 validating: false -> 2.0 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22085 validating: false -> 2.0 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220085 validating: false -> 2.0 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2200085 validating: false -> 2.0 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22000085 validating: false -> 2.0 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220000085 validating: false -> 1.7 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 107 validating: true -> 1.7 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 305 validating: true -> 1.7 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2285 validating: true -> 1.7 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22085 validating: true -> 1.8 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220085 validating: true -> 2.3 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2200085 validating: true -> 6.4 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22000085 validating: true -> 39.9 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220000085 validating: true -> 606.4 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 107 validating: false -> 2.6 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 305 validating: false -> 2.6 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2285 validating: false -> 2.6 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22085 validating: false -> 2.6 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220085 validating: false -> 2.6 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2200085 validating: false -> 2.6 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22000085 validating: false -> 2.7 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220000085 validating: false -> 3.1 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 107 validating: true -> 3.1 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 305 validating: true -> 3.1 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2285 validating: true -> 3.1 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22085 validating: true -> 3.1 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220085 validating: true -> 3.7 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 2200085 validating: true -> 7.8 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 22000085 validating: true -> 40.8 MB heap
org.apache.xerces.jaxp.SAXParserFactoryImpl XML size: 220000085 validating: true -> 605.3 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 107 validating: false -> 1.4 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 305 validating: false -> 1.4 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 2285 validating: false -> 1.4 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 22085 validating: false -> 1.4 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 220085 validating: false -> 1.4 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 2200085 validating: false -> 1.4 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 22000085 validating: false -> 1.5 MB heap
net.sf.saxon.aelfred.SAXParserFactoryImpl XML size: 220000085 validating: false -> 1.8 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 107 validating: false -> 1.9 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 305 validating: false -> 1.8 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 2285 validating: false -> 1.8 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 22085 validating: false -> 1.8 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 220085 validating: false -> 1.8 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 2200085 validating: false -> 1.8 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 22000085 validating: false -> 1.8 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 220000085 validating: false -> 2.0 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 107 validating: true -> 2.0 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 305 validating: true -> 2.0 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 2285 validating: true -> 2.1 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 22085 validating: true -> 2.0 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 220085 validating: true -> 2.0 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 2200085 validating: true -> 2.0 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 22000085 validating: true -> 2.0 MB heap
org.apache.crimson.jaxp.SAXParserFactoryImpl XML size: 220000085 validating: true -> 2.1 MB heap
(see code below)
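For a rough feel of the growth, the three largest validating Xerces runs above can be turned into retained bytes per element. This is back-of-the-envelope arithmetic only: it assumes the element counts 10^5, 10^6 and 10^7 that the generator loop produces for those file sizes, takes the 1.7 MB of the smallest validating run as baseline, and uses 1 MB = 1,000,000 bytes like the test program. The class and method names are made up for illustration.

```java
public class HeapPerElement {
    // Extra heap retained per element, in bytes, relative to a baseline run.
    static double bytesPerElement(double heapMb, double baselineMb, long elements) {
        return (heapMb - baselineMb) * 1_000_000.0 / elements;
    }

    public static void main(String[] args) {
        double baselineMb = 1.7;                                // validating run on the 107-byte file
        long[] elements = {100_000L, 1_000_000L, 10_000_000L};  // assumed from the generator loop
        double[] heapMb = {6.4, 39.9, 606.4};                   // first validating Xerces run
        for (int i = 0; i < elements.length; i++) {
            System.out.printf("%d elements: about %.0f bytes each%n",
                    elements[i], bytesPerElement(heapMb[i], baselineMb, elements[i]));
        }
    }
}
```

That works out to roughly 40 to 60 bytes retained per element, which would be in the right ballpark for one small object such as a QName being kept per element, though the numbers alone do not prove that.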
Arne
====
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SAXMemoryUsage {
    private static final String FNM = "/work/big.xml";
    private static final String ROOT_ELM = "root";
    private static final String INNER_ELM = "elm";
    private static final int NSIZ = 8;

    // Write a test document with an internal DTD and n inner elements.
    private static void genXML(int n) throws IOException {
        try (PrintWriter pw = new PrintWriter(new FileWriter(FNM))) {
            pw.println("<!DOCTYPE " + ROOT_ELM + " [");
            pw.println("<!ELEMENT " + ROOT_ELM + " (" + INNER_ELM + ")*>");
            pw.println("<!ELEMENT " + INNER_ELM + " (#PCDATA)>");
            pw.println("]>");
            pw.print("<" + ROOT_ELM + ">");
            for (int i = 0; i < n; i++) {
                pw.print(" <" + INNER_ELM + ">bla bla</" + INNER_ELM + ">");
            }
            pw.print("</" + ROOT_ELM + ">");
        }
    }

    // Parse the file once and report the heap still in use when the root element ends.
    private static void testOne(final boolean val)
            throws ParserConfigurationException, SAXException, IOException {
        final SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(val);
        SAXParser sp = spf.newSAXParser();
        XMLReader xr = sp.getXMLReader();
        xr.setContentHandler(new DefaultHandler() {
            @Override
            public void endElement(String namespaceURI, String localName, String rawName)
                    throws SAXException {
                if (rawName.equals(ROOT_ELM)) {
                    // Measure while the parser is still alive, after asking for a GC.
                    System.gc();
                    Runtime rt = Runtime.getRuntime();
                    System.out.printf("%s XML size: %d validating: %b -> %.1f MB heap%n",
                            spf.getClass().getName(),
                            new File(FNM).length(),
                            val,
                            (rt.totalMemory() - rt.freeMemory()) / 1000000.0);
                }
            }
        });
        // Print every problem instead of aborting, so a run always completes.
        xr.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException ex) throws SAXException {
                System.out.println(ex.getMessage());
            }

            @Override
            public void error(SAXParseException ex) throws SAXException {
                System.out.println(ex.getMessage());
            }

            @Override
            public void fatalError(SAXParseException ex) throws SAXException {
                System.out.println(ex.getMessage());
            }
        });
        try (FileReader fr = new FileReader(FNM)) {
            xr.parse(new InputSource(fr));
        }
    }

    // Run the test for 1, 10, ..., 10^(NSIZ-1) inner elements.
    private static void testMany(boolean val)
            throws ParserConfigurationException, SAXException, IOException {
        int n = 1;
        for (int i = 0; i < NSIZ; i++) {
            genXML(n);
            testOne(val);
            n *= 10;
        }
    }

    public static void main(String[] args) throws Exception {
        // Whatever factory the default JAXP lookup finds.
        testMany(false);
        testMany(true);
        System.setProperty("javax.xml.parsers.SAXParserFactory",
                "org.apache.xerces.jaxp.SAXParserFactoryImpl");
        testMany(false);
        testMany(true);
        // Aelfred is only run non-validating.
        System.setProperty("javax.xml.parsers.SAXParserFactory",
                "net.sf.saxon.aelfred.SAXParserFactoryImpl");
        testMany(false);
        System.setProperty("javax.xml.parsers.SAXParserFactory",
                "org.apache.crimson.jaxp.SAXParserFactoryImpl");
        testMany(false);
        testMany(true);
    }
}
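As a footnote to the well-formed versus valid exchange earlier in the thread, here is a small sketch of the difference as SAX reports it. It uses whatever parser the default JAXP lookup returns and an in-memory document, so it says nothing about memory; the class name ValidVsWellFormed and the helper countErrors are made up for illustration. The DTD is the same one the test program generates.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ValidVsWellFormed {
    private static final String DTD =
            "<!DOCTYPE root [<!ELEMENT root (elm)*><!ELEMENT elm (#PCDATA)>]>";

    // Returns how many recoverable errors the parser reports for the given body.
    static int countErrors(String body, boolean validating) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setValidating(validating);
        XMLReader xr = spf.newSAXParser().getXMLReader();
        final int[] errors = {0};
        xr.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException ex) {
                errors[0]++;  // validity violations arrive here; parsing continues
            }
        });
        xr.parse(new InputSource(new StringReader(DTD + body)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        String valid = "<root><elm>bla bla</elm></root>";
        String invalid = "<root><other/></root>";  // well-formed, but not allowed by the DTD
        System.out.println("valid, validating:       " + countErrors(valid, true));
        System.out.println("invalid, non-validating: " + countErrors(invalid, false));
        System.out.println("invalid, validating:     " + countErrors(invalid, true));
    }
}
```

Only the last call reports errors: the non-validating parser accepts the undeclared element because the document is still well-formed.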