
Thursday, February 24, 2011

Saving YouTube videos from Google Chrome

If you remember my post about YouTube FLV cache files, I've been saving my YouTube videos manually. I like this approach because it lets me watch the video first; if I find it interesting, then I save it. While some Firefox plugins let you save YouTube videos, they often force you to re-download the video from the beginning. And if you just download a video without watching it first, sometimes after an hour of waiting you end up with complete garbage!

These days I'm using Google Chrome as my main browser, so I'm facing the same problem in a different environment. Unlike Firefox, Chrome uses the Windows temp directory to store its temporary files, including the FLV files. But it's very defensive: you can't copy the cache file while Chrome is using it, and if you close the tab that's playing the video, Chrome deletes the file immediately, giving you no time to copy it. Damn you Google guys :P

After some googling I finally found a solution. HoboCopy to the rescue! HoboCopy is an open source utility that can copy locked, in-use files on Windows. You can download HoboCopy from here. Using this approach I can copy the protected files. The FLV cache files can be identified easily by their name and size; usually they start with fla and end with a .tmp extension. So using HoboCopy I can copy the file I want as simply as:

C:\HoboCopy.exe C:\temp F:\MyVideos\ fla18E.tmp

But be aware that since HoboCopy is not a regular file copy utility, the syntax is a bit different: as in the example above, you give it a source directory, a destination directory, and then the filename, rather than full source and destination file paths.

Another thing I found out concerns the Firefox case. Remember when Firefox only cached part of my video but I could still play the full video in the browser? It turns out you can get the full FLV videos from the Windows temp directory using this exact same method.

Now I can watch my YouTube videos without worrying about rebuffering and wasting a lot of bandwidth :D

Thursday, February 10, 2011

Screen Scraping Guide for Java

I created an app about one year ago. The application needed to screen scrape a website because it didn't provide an API to access the data. When I started, I was confused by the many choices available for screen scraping in Java, and at that time I decided to go with NekoHTML. It's very powerful, but also complicated and a bit of a nightmare to maintain. So today I decided to look for a better alternative. It turns out that, one year later, I'm just as easily confused by the many Java libraries out there. So I decided to create this guide.

Regex
The simplest and probably the fastest way to do screen scraping in Java. The regex classes are already included in the standard Java library, so you don't need to add another library or learn how it works.

The problem is, regex is very error prone. There are many sources of error: HTML that doesn't comply with the standards, bad implementation, or even unmatched tags. A simple change on the website can also break the whole regex and force you to start from zero.

To use regex, you need to be very diligent about whitespace characters, or it's only a matter of time before it bites you.
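For instance, here is a minimal sketch using the standard java.util.regex classes to pull a page title out of an HTML string. The HTML and the pattern here are made up for illustration, not something from my app:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrapeExample {
    public static void main(String[] args) {
        // Pretend this string came from an HTTP request elsewhere in your code.
        String html = "<html><head><title>  My Page  </title></head><body></body></html>";

        // DOTALL so '.' also matches newlines, CASE_INSENSITIVE because tag case can vary.
        Pattern titlePattern = Pattern.compile("<title>\\s*(.*?)\\s*</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        Matcher m = titlePattern.matcher(html);
        if (m.find()) {
            System.out.println("Title: " + m.group(1)); // prints "My Page"
        }
    }
}

Note how the \s* around the capture group deals with the whitespace problem I mentioned above.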

NekoHTML
This is one of the best parsers that I know. It is often used by other frameworks, such as HTMLUnit, as the default HTML parser. It creates a DOM representation of the page, and you can then traverse the DOM tree to get each node's content. Since this is such low-level access, it's a bit of a pain to maintain, and you still need to traverse nodes that aren't important to you, such as empty text, series of newlines, break tags, etc.
To overcome this you can add a custom filter to remove the unnecessary tags, and use XPath to query the tags that you are interested in, as in the sketch below.
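Roughly, the NekoHTML plus XPath combination looks like this. The URL and the XPath expression are made up; also note that by default NekoHTML reports element names in upper case, which is why the XPath says //TITLE:

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.net.URL;

public class NekoScrapeExample {
    public static void main(String[] args) throws Exception {
        // Parse the (possibly messy) HTML into a regular DOM document.
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new URL("http://example.com/").openStream()));
        Document doc = parser.getDocument();

        // Query the DOM with XPath instead of walking every node by hand.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String title = (String) xpath.evaluate("//TITLE", doc, XPathConstants.STRING);
        System.out.println("Title: " + title);
    }
}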

HTMLUnit
Besides being a framework for testing, HTMLUnit can also be used for screen scraping. You just create a WebClient and then access the element you are interested in using getHtmlElementById, or you can also use XPath. Then you can access the data as XML or text.

It gives you a higher level of abstraction: you don't need to traverse the DOM tree manually. It also gives you built-in cookie management, proxy access, and JavaScript, CSS and Ajax support. Everything you'd ever need from a headless browser.
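To give a rough idea, a minimal HTMLUnit sketch could look like the one below. The URL, the element id and the XPath are made up, and depending on your HTMLUnit version the cleanup call is closeAllWindows() or close():

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.util.List;

public class HtmlUnitScrapeExample {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();

        // Fetch the page; HTMLUnit handles cookies, redirects, JavaScript, etc. for you.
        HtmlPage page = webClient.getPage("http://example.com/");

        // Grab one element by id (the id here is just an example)...
        HtmlElement price = page.getHtmlElementById("price");
        System.out.println(price.asText());

        // ...or grab several elements with XPath and dump them as XML.
        List<?> rows = page.getByXPath("//table[@id='results']//tr");
        for (Object row : rows) {
            System.out.println(((HtmlElement) row).asXml());
        }

        webClient.closeAllWindows(); // or close() in newer versions
    }
}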

The funny thing is, after looking for something new, I decided to stick with NekoHTML. It's fast, and implementing a custom filter helps me a lot. It also doesn't come with unnecessary baggage such as JavaScript, CSS and Ajax support, which I don't need for the moment. And of course using XPath has become very valuable: it cuts my previous code to 1/3, or even 1/4 in some cases. No more manually navigating through the DOM, and it's more intuitive. I think I finally found the right tool :)



Hope that helps. I'll rewrite this rant some other time to give a better explanation. :)