
Thursday, February 10, 2011

Screen Scraping Guide for Java

I created an app about a year ago. The application needed to screen scrape a website because the site didn't provide an API to access its data. When I started, I was overwhelmed by how many choices are available for screen scraping in Java, and at the time I decided to go with NekoHTML. It's very powerful, but also complicated and a bit of a nightmare to maintain. So today I decided to look for a better alternative. It turns out that, one year later, I'm just as easily confused by the many Java libraries out there. So I decided to write this guide.

Regex
The simplest and probably the fastest way to do screen scraping in Java. The java.util.regex package is part of the standard library, so you don't need to add another dependency or learn how a new library works.

The problem is that regex is very error prone. There are many reasons for this: the HTML may not comply with standards, may be badly written, or may have mismatched tags. A small change on the website can also break the whole regex, forcing you to start from zero.

To use regex, you also need to be very diligent about whitespace characters, or it's only a matter of time before they bite you.
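For example, here is a minimal sketch using java.util.regex. The markup and the pattern are made up for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrapeSketch {
    public static void main(String[] args) {
        // Hypothetical snippet standing in for a fetched page.
        String html = "<ul><li class=\"price\">$10</li>\n<li class=\"price\">$12</li></ul>";

        // DOTALL makes '.' match newlines too; the non-greedy .*?
        // stops at the first closing tag instead of the last one.
        Pattern p = Pattern.compile("<li class=\"price\">(.*?)</li>", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // prints $10, then $12
        }
    }
}

Notice how fragile this is: if the site adds a second attribute to that li tag, the pattern silently matches nothing.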

NekoHTML
This is one of the best parsers that I know. It is often used by other frameworks, such as HTMLUnit, as the default HTML parser. It builds a DOM representation of the page, which you can then traverse to read each node's content. Since the access is so low level, it's a bit of a pain to maintain, and you still have to step over nodes that don't matter to you: empty text, runs of newlines, break tags, and so on.
To overcome this, you can add a custom filter to remove the unnecessary tags, and use XPath to query for the tags you are interested in, as in the sketch below.
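Here is roughly what that looks like. The filter setup uses NekoHTML's ElementRemover and its filters property as I understand them, and the URL is a placeholder, so treat this as an illustration rather than a drop-in recipe:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.apache.xerces.xni.parser.XMLDocumentFilter;
import org.cyberneko.html.filters.ElementRemover;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoSketch {
    public static void main(String[] args) throws Exception {
        // Custom filter: keep only the tags we care about, drop the noise.
        ElementRemover remover = new ElementRemover();
        remover.acceptElement("html", null);
        remover.acceptElement("body", null);
        remover.acceptElement("a", new String[] { "href" });
        remover.removeElement("script");
        remover.removeElement("style");

        DOMParser parser = new DOMParser();
        parser.setProperty("http://cyberneko.org/html/properties/filters",
                new XMLDocumentFilter[] { remover });
        parser.parse(new InputSource("http://example.com/")); // placeholder URL

        Document doc = parser.getDocument();

        // NekoHTML upper-cases element names by default, so query for "A".
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate("//A/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getNodeValue());
        }
    }
}

With the filter in place, the DOM you traverse only contains the handful of tags you accepted, which is exactly what makes the code so much shorter to maintain.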

HTMLUnit
Besides being a testing framework, HTMLUnit can also be used for screen scraping. You just create a WebClient, and then you can reach the element you are interested in using getHtmlElementById, or query with XPath. From there you can read the data as XML or as text.

It gives you a higher level of abstraction: you don't need to traverse the DOM tree manually. It also comes with built-in cookie management, proxy access, and JavaScript, CSS, and Ajax support. Everything you could ever need from a headless browser.
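Something along these lines; the URL and the element id here are placeholders I made up:

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        try {
            HtmlPage page = webClient.getPage("http://example.com/");

            // Grab one element directly by its id...
            HtmlElement heading = page.getHtmlElementById("title");
            System.out.println(heading.asText());   // element content as plain text

            // ...or query several at once with XPath.
            List<?> anchors = page.getByXPath("//a");
            for (Object o : anchors) {
                HtmlAnchor a = (HtmlAnchor) o;
                System.out.println(a.getHrefAttribute() + " -> " + a.asXml());
            }
        } finally {
            webClient.closeAllWindows(); // shut down the simulated browser
        }
    }
}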

The funny thing is, after looking for something new, I decided to stick with NekoHTML. It's fast, and implementing a custom filter helps me a lot. It also doesn't come with unnecessary baggage such as JavaScript, CSS, and Ajax support, which I don't need for the moment. And of course using XPath has proven very valuable: it cut my previous code to a third, even a quarter in some cases. No more manually navigating through the DOM, and it's more intuitive. I think I finally found the right tool :)



Hope that helps. I'll rewrite this rant some other time to give a better explanation. :)


