XPath querying text node *including* <br/>
Dear all,
I'm trying to extract data from HTML using XPath in Java.
Unfortunately, the text content of a node may contain <br/> tags, which
are dropped from the result instead of showing up as line breaks or
spaces, at least for me ;)
A <p> node may contain this text:
<p>
Test1<br/>
Test2<br/>
Test3
</p>
The XPath query returns this as "Test1Test2Test3", but I need
it as "Test1\nTest2\nTest3" or "Test1 Test2 Test3".
Here's example code (Java 6):
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class Example {
    private static final String html =
        "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";

    public static void main( String[] args ) throws Exception {
        final XPathFactory xPathFactory = XPathFactory.newInstance();

        // The <p> element itself, converted to a string
        XPath xPath = xPathFactory.newXPath();
        String value = (String)xPath.evaluate(
            "//p",
            new InputSource( new StringReader( html ) ),
            XPathConstants.STRING );
        System.out.println( value );

        // The text-node children of <p>, converted to a string
        xPath = xPathFactory.newXPath();
        value = (String)xPath.evaluate(
            "//p/text()",
            new InputSource( new StringReader( html ) ),
            XPathConstants.STRING );
        System.out.println( value );

        // All child nodes of <p>, converted to a string
        xPath = xPathFactory.newXPath();
        value = (String)xPath.evaluate(
            "//p/node()",
            new InputSource( new StringReader( html ) ),
            XPathConstants.STRING );
        System.out.println( value );
    }
}
This code returns:
Test1Test2Test3
Test1
Test1
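As far as I understand, the last two lines print only "Test1" because XPath 1.0 converts a node set to a string by taking the string value of its first node. The only workaround I can think of so far is to ask for a NODESET instead and join the nodes in Java. Here is a rough sketch of that idea, not a finished solution: the class BrAwareText and the helper textWithBreaks are names I made up, it assumes the input is well-formed XML (so <br/>, not <br>), and it only looks at the direct children of the matched element.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BrAwareText {

    // Hypothetical helper (not part of any library): walks the direct
    // children of the nodes matched by parentPath, copies text nodes
    // (trimmed) and emits a newline for every <br/> element.
    static String textWithBreaks( String xml, String parentPath ) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            // NODESET keeps every child node instead of only the first one
            NodeList nodes = (NodeList)xPath.evaluate(
                parentPath + "/node()",
                new InputSource( new StringReader( xml ) ),
                XPathConstants.NODESET );
            StringBuilder sb = new StringBuilder();
            for ( int i = 0; i < nodes.getLength(); i++ ) {
                Node n = nodes.item( i );
                if ( n.getNodeType() == Node.TEXT_NODE ) {
                    sb.append( n.getNodeValue().trim() );
                } else if ( "br".equals( n.getNodeName() ) ) {
                    sb.append( '\n' );
                }
                // Other child elements are ignored in this sketch
            }
            return sb.toString();
        } catch ( XPathExpressionException e ) {
            // Rethrown unchecked to keep the call site short
            throw new IllegalStateException( e );
        }
    }

    public static void main( String[] args ) {
        String html =
            "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";
        System.out.println( textWithBreaks( html, "//p" ) );
    }
}
```

For the example document this prints Test1, Test2 and Test3 on separate lines, but it moves the logic out of the XPath expression and into Java, which is what I would like to avoid.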
Is there any way (an XPath function, etc.) to get the contents returned
as desired directly from the query?
Thank you!