Aug 21 2009

Crawl, automate or test with HtmlUnit and JRuby

I bet you get dizzy with all the web testing tools out there. I've been searching for the one most suitable for my needs, so that I can write my own autonomous automators and crawlers. The main problem I faced was that most of the libraries and GUI-less browsers I tried didn't support JavaScript, which was a real pain because I kept getting stuck on a lot of pages. JavaScript is used extensively by web developers nowadays, so if you want to write code that does interesting things on the web, you will need JavaScript support!

I ended up with HtmlUnit, a GUI-less browser for Java that is easy to use and has good JavaScript support. It's even easier to use when you write your programs in JRuby.

Below I explain how to set the two up together and how to write some very useful bots for your everyday needs.

  • Install jruby
  • Download htmlunit
  • Enable JRuby and include jar files
  • Write some code

Step 1

First of all we have to install JRuby. If you compile JRuby yourself, remember to include it in your classpath.

Mac OS X

You will have to download and install MacPorts (http://www.macports.org/install.php) and then issue the following command:

$ sudo port install jruby

Linux

Use the package manager installed on your system. On distributions that use APT (such as Debian and Ubuntu), run:

$ sudo apt-get install jruby

Windows

Follow the installation guide at http://www.devdaily.com/blog/post/ruby/installing-jruby-on-windows-xp-system/

Step 2

Download htmlunit from http://sourceforge.net/projects/htmlunit/files/
Place the downloaded jars into a folder named lib.

 
tar -zxvf htmlunit-x_x.tar.gz
cd htmlunit-x-x/
mv lib/ path_of_your_choice/
 

Step 3

At the top of the Ruby file you are working on, write the following:

# Require Java so we can use the Java libraries
require 'java'

# Get HtmlUnit (its required libraries are loaded the same way)
require 'htmlunit-2.1.jar'
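
Listing every jar by hand gets tedious and breaks as soon as a version number changes. As a rough sketch, assuming the jars sit in a folder named lib next to your script (the helper name jar_files is made up for illustration), you could require whatever is in that folder:

```ruby
# Hypothetical helper: return the paths of all .jar files in lib_dir, sorted.
def jar_files(lib_dir)
  Dir.glob(File.join(lib_dir, '*.jar')).sort
end

# Only JRuby can load jars, so skip this step on other Ruby implementations.
if RUBY_PLATFORM == 'java'
  require 'java'
  # Assumption: the jars live in ./lib relative to this file.
  lib_dir = File.expand_path('lib', File.dirname(__FILE__))
  jar_files(lib_dir).each { |jar| require jar }
end
```

Because the jars are required by their full path, this variant should not depend on the load path, so the per-jar require lines would no longer be needed.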

Example: Vodafone bill

A simple example that retrieves my mobile phone bill from Vodafone:

voda.rb

# Require Java so we can use the Java libraries
require 'java'

# Get HtmlUnit and all of its required libraries
require 'htmlunit-2.1.jar'
require 'commons-httpclient-3.1.jar'
require 'commons-io-1.4.jar'
require 'commons-logging-1.1.1.jar'
require 'commons-lang-2.4.jar'
require 'commons-codec-1.3.jar'
require 'xercesImpl-2.8.1.jar'
require 'xml-apis-1.0.b2.jar'
require 'jaxen-1.1.1.jar'
require 'commons-collections-3.2.jar'
require 'js-1.7R1.jar'
require 'nekohtml-1.9.7.jar'
require 'sac-1.3.jar'
require 'cssparser-0.9.5.jar'
require 'xalan-2.7.0.jar'
 
# Import the HtmlUnit classes (java_import is the current form of include_class)
java_import 'com.gargoylesoftware.htmlunit.WebClient'
java_import 'com.gargoylesoftware.htmlunit.BrowserVersion'
 
# Function to connect to the Vodafone website
def connect_to_vodafone
  version = BrowserVersion.new("Netscape", "5.0 (Macintosh; en-US)", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14", "1.2", 5.0)
  puts "version:ok"
  wc = WebClient.new(version)
  puts "wc:ok"
  page = wc.getPage("http://www.vodafone.gr/portal/client/cms/viewCmsPage.action?pageId=1032")
  puts "load_page:ok"
  puts "\nLogging in to vodafone.gr ...\n"
  # Find the login form by its action attribute
  forms = page.getForms()
  login_form = nil
  forms.each do |form|
    if form.getActionAttribute().include? "/portal/client/idm/login!login.action"
      login_form = form
    end
  end

  username = login_form.getInputByName("username")
  password = login_form.getInputByName("password")
  button = login_form.getInputByName("Submit")
  # Fill in the login form
  username.setValueAttribute("your_username")
  password.setValueAttribute("your_pass")

  # Submit the form to log in
  mypage = button.click()

  # Load the account page within the logged-in session
  mypage = wc.getPage("https://www.vodafone.gr/portal/client/idm/loadUserProfile.action")
  account_form = mypage.getFormByName("myAccountSelectBill")
  select_drop_down = mypage.getByXPath('//select[@id="billingAccountField"]')[0]
  # Results for the first account
  get_results(select_drop_down.asText(), mypage)
end
 
def get_results(am, page)
  # Collect the data you are interested in
  total_amount = page.getByXPath('//input[@id="payBill_totalOwnedAmount"]')[0].getValueAttribute()
  recent_amount = ""
  duration = ""
  page.getByXPath('//td[@class="main_text pad5"]').each do |td|
    if td.asXml().include?("€")
      recent_amount = td.asText
    end
    if td.asXml().include?("-")
      duration = td.asText
    end
  end

  # Print the collected data
  puts "\nVodafone bill"
  puts "-------------------------------------------"
  puts "A/M: " + am + "\n"
  puts "Total amount: " + total_amount + " €\n"
  puts "Recent bill amount: " + recent_amount.split(' ')[1].split(',').join('.') + " €\n"
  puts "Duration: " + duration + "\n\n"
end
 
connect_to_vodafone
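
The script above assumes that the login form and the amount cell are always found; if the page changes, it stops with a NoMethodError on nil. Here is a minimal sketch of two more defensive helpers. The names find_form_by_action and normalize_amount are made up for this post, and what the amount cell really contains is a guess, so treat this as a starting point rather than a drop-in fix:

```ruby
# Hypothetical helper: return the first form whose action contains fragment.
# Works on anything enumerable whose items respond to getActionAttribute.
# Note: the loop in voda.rb keeps the last match, this keeps the first.
def find_form_by_action(forms, fragment)
  form = forms.find { |f| f.getActionAttribute.include?(fragment) }
  raise ArgumentError, "no form with action containing #{fragment}" if form.nil?
  form
end

# Hypothetical helper: take the second whitespace-separated token, as voda.rb
# does, and swap the decimal comma for a dot. Returns nil when the text has
# no second token instead of failing.
def normalize_amount(text)
  token = text.to_s.split(' ')[1]
  token && token.split(',').join('.')
end
```

In connect_to_vodafone the form loop could then become login_form = find_form_by_action(page.getForms(), "/portal/client/idm/login!login.action"), which fails with a readable message when the form is missing.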

Execution

jruby -Ipath_to_lib_folder voda.rb 2>/dev/null

More examples to come :)