html parsing

From: "Damo_Suzuki" <zumbar@b00mb0x.org>
Newsgroups: comp.lang.java.programmer
Date: 2 Dec 2006 12:56:35 -0800
Message-ID: <1165092995.642688.3440@j44g2000cwa.googlegroups.com>

Hi,
I'm new to this HTML parsing lark. I want to parse a search engine
result HTML page to extract the title, summary and URL of every result.
I've made an attempt at it with the following code:

        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        HTMLEditorKit.Parser parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
        parser.parse(buffer, callback, true);
        StringBuffer text = new StringBuffer();
        StringBuffer snippet = new StringBuffer();

        ElementIterator iterator = new ElementIterator(htmlDoc);
        Element element;
        while ((element = iterator.next()) != null)
        {
            AttributeSet attributes = element.getAttributes();
            Object name = attributes.getAttribute(StyleConstants.NameAttribute);

            if ((name instanceof HTML.Tag) && (name == HTML.Tag.H2))
            {
                // Build up content text as it may be within multiple elements
                //StringBuffer text = new StringBuffer();
                int count = element.getElementCount();
                for (int i = 0; i < count; i++)
                {
                    Element child = element.getElement(i);
                    AttributeSet childAttributes = child.getAttributes();
                    if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
                    {
                        int startOffset = child.getStartOffset();
                        int endOffset = child.getEndOffset();
                        int length = endOffset - startOffset;
                        text.append(htmlDoc.getText(startOffset, length));
                    }
                }
            }

            if (!(name instanceof HTML.Tag) && (name == HTML.Tag.TD))
            {
                element = iterator.next();
            }
            else
            {
                // Build up content text as it may be within multiple elements
                int count = element.getElementCount();
                for (int i = 0; i < count; i++)
                {
                    Element child = element.getElement(i);
                    AttributeSet childAttributes = child.getAttributes();
                    if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
                    {
                        int startOffset = child.getStartOffset();
                        int endOffset = child.getEndOffset();
                        int length = endOffset - startOffset;
                        snippet.append(htmlDoc.getText(startOffset, length));
                    }
                }
            }
        }

        ArrayList result = new ArrayList();
        result.add(text);
        result.add(snippet);
        in.close();
        return result;
    }

Currently it returns an ArrayList with two long strings in it: one
made up of all the titles and one made up of all the rest. The
problem is that the summary and the URLs are in one table, so when I
get the summary I also get the URL together with it.
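
The "all the rest" string is consistent with the TD test in the loop never matching: `name == HTML.Tag.TD` can only hold when `name` is an `HTML.Tag`, which `!(name instanceof HTML.Tag)` rules out, so the `else` branch runs for every element, H2 included. A small sketch of this follows. `TdCheckDemo`, `postedCheck` and `plainCheck` are hypothetical names, and `plainCheck` is only a guess at what the test was meant to do.

```java
import javax.swing.text.html.HTML;

// Hypothetical demo class, not part of the posted code.
class TdCheckDemo {

    // Same shape as the TD test in the posted loop: the two halves
    // contradict each other, so this returns false for any input.
    static boolean postedCheck(Object name) {
        return !(name instanceof HTML.Tag) && (name == HTML.Tag.TD);
    }

    // Assumption: the test was probably meant to detect TD elements.
    static boolean plainCheck(Object name) {
        return name == HTML.Tag.TD;
    }
}
```

With `postedCheck`, even `HTML.Tag.TD` itself is rejected, while `plainCheck` accepts it and rejects other tags such as `HTML.Tag.H2`.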

The HTML of one result looks like this:
<h2 class=r>
<a class=l href="http://www.java.com/" onmousedown="return clk(this.href,'','','res','1','')">
<b>java</b>.com: Hot Games, Cool Apps</a></h2>

<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td class=j><font size=-1>
Get the latest <b>Java</b> Software and explore how <b>Java
</b> technology provides a better digital experience.<br>
<span class=a>www.<b>java</b>.com/ - 16k - </span><nobr>
<a class=fl href="http://66.102.9.104/search?q=cache:gzY4gL02EzEJ:www.java.com/+java&hl=en&gl=ie&ct=clnk&cd=1">Cached</a> -
<a class=fl href="/search?hl=en&lr=&q=related:www.java.com/">
Similar pages</a></nobr></font>
</td>
</tr>
</table>

Does anyone know a better way of doing this, or know how to separate
the URL from the summary?
Any help would be greatly appreciated.
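
One possible direction, sketched here under assumptions rather than offered as the definitive fix, is to skip the `HTMLDocument` and hook an `HTMLEditorKit.ParserCallback` straight into the same `ParserDelegator`, keeping one record per result. The sketch assumes the markup always looks like the sample above: the title sits in `<h2 class=r>`, the URL is the `href` of the link inside that heading, and the summary is the text in `<td class=j>` up to the first `<br>`. `ResultExtractor`, `Result` and `parse` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects one Result per <h2 class=r> block.
class ResultExtractor extends HTMLEditorKit.ParserCallback {

    // One search hit: title text, target URL and summary text.
    static class Result {
        final StringBuilder title = new StringBuilder();
        final StringBuilder summary = new StringBuilder();
        String url = "";
    }

    private final List<Result> results = new ArrayList<Result>();
    private Result current;     // result being filled, or null before the first heading
    private boolean inTitle;    // inside <h2 class=r>
    private boolean inSummary;  // inside <td class=j>, before the first <br>

    // True if the tag carries class=wanted (class names assumed from the sample).
    private static boolean hasClass(MutableAttributeSet a, String wanted) {
        return wanted.equals(a.getAttribute(HTML.Attribute.CLASS));
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.H2 && hasClass(a, "r")) {
            current = new Result();
            results.add(current);
            inTitle = true;
        } else if (t == HTML.Tag.A && inTitle) {
            // The link inside the heading carries the result URL.
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                current.url = href.toString();
            }
        } else if (t == HTML.Tag.TD && current != null && hasClass(a, "j")) {
            inSummary = true;
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // Everything after the first <br> is the display URL and the
        // Cached / Similar links, so stop collecting summary text there.
        if (t == HTML.Tag.BR) {
            inSummary = false;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.H2) {
            inTitle = false;
        } else if (t == HTML.Tag.TD) {
            inSummary = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            current.title.append(data);
        } else if (inSummary) {
            current.summary.append(data);
        }
    }

    // Parses the page and returns the collected results in document order.
    static List<Result> parse(Reader html) throws IOException {
        ResultExtractor callback = new ResultExtractor();
        new ParserDelegator().parse(html, callback, true);
        return callback.results;
    }
}
```

Calling `ResultExtractor.parse(reader)` on the page would then give a list with a separate title, URL and summary for each result instead of two long strings. The text arrives in chunks (one per run between tags), so spacing around `<b>` may need tidying, and the class names would have to be adjusted if the search engine changes its markup.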