Crawl, automate or test with HtmlUnit and JRuby
I bet all the web testing tools out there make you dizzy. I have been searching for the one best suited to my needs, so I can write my own autonomous automators and crawlers. The main problem I faced was that most of the libraries and GUI-less browsers I tried didn't support JavaScript, which was a real pain because I kept getting stuck on a lot of pages. My impression is that web developers use JavaScript extensively nowadays, so if you want to write code that does interesting things on the web, you will need JavaScript support!
I ended up with HtmlUnit, a GUI-less browser for Java that is easy to use and has good JavaScript support. It is even easier to use when you write your programs in JRuby.
Below I explain how to set the two up together and how to write some very useful bots for your everyday needs.
- Install jruby
- Download htmlunit
- Enable JRuby and include jar files
- Write some code
Step 1
First of all we have to install JRuby. If you compile JRuby yourself, remember to add its bin directory to your PATH so the jruby command can be found.
Mac OS X
You will have to download and install MacPorts (http://www.macports.org/install.php) and then issue the following command:
$ sudo port install jruby
Linux
Use the package manager installed on your system. On distributions that use apt, run:
$ sudo apt-get install jruby
Windows
Follow the guide at http://www.devdaily.com/blog/post/ruby/installing-jruby-on-windows-xp-system/
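Once the install is done, a quick way to confirm that a script really runs under JRuby is to check the platform string. This is only a minimal sketch; the helper name running_on_jruby? is made up for illustration.

```ruby
# Hypothetical helper: true when the script is executed by JRuby.
# JRuby reports "java" as its platform; other Rubies report the OS/CPU.
def running_on_jruby?
  RUBY_PLATFORM == 'java'
end

if running_on_jruby?
  # JRUBY_VERSION is only defined by JRuby, so it is read in this branch only.
  puts "Running on JRuby #{JRUBY_VERSION}, Java libraries can be used"
else
  puts "Not running on JRuby (platform: #{RUBY_PLATFORM})"
end
```

Save it to a file and run it with the jruby command; if it takes the second branch, the script is being run by a different Ruby.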
Step 2
Download htmlunit from http://sourceforge.net/projects/htmlunit/files/
Place the downloaded jars into a folder named lib.
$ tar -zxvf htmlunit-x_x.tar.gz
$ cd htmlunit-x-x/
$ mv lib/ path_of_your_choice/
Step 3
At the top of the Ruby file you are working on, write the following:
# Require Java so we can use the Java libraries
require 'java'
# Get HtmlUnit and all of its required libraries
require 'htmlunit-2.1.jar'
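HtmlUnit depends on quite a few other jars, so instead of listing each one you could require everything in the lib folder from Step 2. The following is a rough sketch, not the only way to do it: the helper names jar_files and require_jars are invented here, and it assumes the folder is called lib and sits in the directory you run the script from.

```ruby
# Hypothetical helper: list every .jar file in lib_dir, sorted by name.
def jar_files(lib_dir)
  Dir.glob(File.join(lib_dir, '*.jar')).sort
end

# Hypothetical helper: require each jar so its Java classes become visible.
# Requiring a .jar only works on JRuby, so other Rubies just get the list back.
def require_jars(lib_dir)
  jars = jar_files(lib_dir)
  if RUBY_PLATFORM == 'java'
    require 'java'
    jars.each { |jar| require jar }
  end
  jars
end

# Assumption: the jars from Step 2 live in ./lib; adjust the path if you moved them.
found = require_jars(File.expand_path('lib', Dir.pwd))
puts "Found #{found.size} jar(s)"
```

The upside is that the script keeps working when you upgrade HtmlUnit and the version numbers in the jar names change. The downside is that every jar in the folder gets loaded, whether you need it or not.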
Example: Vodafone bill
A simple example that retrieves my mobile phone bill from Vodafone:
voda.rb
# Require Java so we can use the Java libraries
require 'java'

# Get HtmlUnit and all of its required libraries
require 'htmlunit-2.1.jar'
require 'commons-httpclient-3.1.jar'
require 'commons-io-1.4.jar'
require 'commons-logging-1.1.1.jar'
require 'commons-lang-2.4.jar'
require 'commons-codec-1.3.jar'
require 'xercesImpl-2.8.1.jar'
require 'xml-apis-1.0.b2.jar'
require 'jaxen-1.1.1.jar'
require 'commons-collections-3.2.jar'
require 'js-1.7R1.jar'
require 'nekohtml-1.9.7.jar'
require 'sac-1.3.jar'
require 'cssparser-0.9.5.jar'
require 'xalan-2.7.0.jar'

# Import the HtmlUnit classes we need
# (java_import replaces the older include_class, which newer JRuby releases no longer provide)
java_import 'com.gargoylesoftware.htmlunit.WebClient'
java_import 'com.gargoylesoftware.htmlunit.BrowserVersion'

# Connect to the Vodafone website, log in and print the bill
def connect_to_vodafone
  version = BrowserVersion.new(
    "Netscape",
    "5.0 (Macintosh; en-US)",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14",
    "1.2",
    5.0
  )
  puts "version:ok"
  wc = WebClient.new(version)
  puts "wc:ok"
  page = wc.getPage("http://www.vodafone.gr/portal/client/cms/viewCmsPage.action?pageId=1032")
  puts "load_page:ok"
  puts "\nLogging in to vodafone.gr ...\n"

  # Find the login form by its action attribute
  login_form = nil
  page.getForms().each do |form|
    if form.getActionAttribute().include?("/portal/client/idm/login!login.action")
      login_form = form
    end
  end
  # Stop early with a readable message if the page layout has changed
  abort "Login form not found" if login_form.nil?

  username = login_form.getInputByName("username")
  password = login_form.getInputByName("password")
  button = login_form.getInputByName("Submit")

  # Fill in the login form (replace the placeholders with your own credentials)
  username.setValueAttribute("your_username")
  password.setValueAttribute("your_pass")
  button.click()

  # Load the profile page that holds the billing data
  mypage = wc.getPage("https://www.vodafone.gr/portal/client/idm/loadUserProfile.action")
  account_form = mypage.getFormByName("myAccountSelectBill")
  select_drop_down = mypage.getByXPath('//select[@id="billingAccountField"]')[0]

  # Results for the 1st account
  get_results(select_drop_down.asText(), mypage)
end

def get_results(am, page)
  # Collect the data you are interested in
  total_amount = page.getByXPath('//input[@id="payBill_totalOwnedAmount"]')[0].getValueAttribute()
  recent_amount = ""
  duration = ""
  page.getByXPath('//td[@class="main_text pad5"]').each do |td|
    # The cell containing a euro sign holds the recent bill amount
    recent_amount = td.asText if td.asXml().include?("€")
    # The cell containing a dash holds the billing period
    duration = td.asText if td.asXml().include?("-")
  end

  # Print the collected data
  puts "\nVodafone bill"
  puts "-------------------------------------------"
  puts "A/M: " + am + "\n"
  puts "Total amount: " + total_amount + " €\n"
  # Take the second word of the cell and turn the decimal comma into a dot
  puts "Recent bill amount: " + recent_amount.split(' ')[1].split(',').join('.') + " €\n"
  puts "Duration: " + duration + "\n\n"
end

connect_to_vodafone
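Two spots in voda.rb break easily when the site changes: the search for the login form and the parsing of the amount text. Here is a hedged sketch of how both could be pulled into small helpers. The names find_form_by_action, normalize_amount and FakeForm are made up for illustration; FakeForm only stands in for an HtmlUnit form so the snippet runs without a browser session. The amount helper assumes the text holds a single number with a decimal comma and no thousands separator.

```ruby
# Hypothetical helper: return the first form whose action contains fragment,
# or raise a readable error instead of failing later on a nil form.
def find_form_by_action(forms, fragment)
  form = forms.find { |f| f.getActionAttribute.to_s.include?(fragment) }
  raise "No form with an action containing #{fragment.inspect}" if form.nil?
  form
end

# Hypothetical helper: take the first number in the text and turn a decimal
# comma into a dot. Returns nil when the text contains no number.
def normalize_amount(text)
  match = text.to_s[/\d+(?:[.,]\d+)?/]
  match && match.tr(',', '.')
end

# Stand-in for an HtmlUnit form; it only answers getActionAttribute.
FakeForm = Struct.new(:action) do
  def getActionAttribute
    action
  end
end

sample_forms = [
  FakeForm.new('/portal/client/search.action'),
  FakeForm.new('/portal/client/idm/login!login.action')
]
puts find_form_by_action(sample_forms, 'login!login.action').action
puts normalize_amount('Amount: 23,45')
```

In the real script you would pass page.getForms() as the first argument, since JRuby lets you iterate Java lists with the usual Ruby collection methods.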
Execution
$ jruby -Ipath_to_lib_folder voda.rb 2>/dev/null
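The -I switch adds the lib folder to Ruby's load path, which is how the require lines find the jars, and 2>/dev/null throws away whatever is written to standard error. If you would rather not type -I every time, something like the following at the top of the script should have a similar effect. It is a sketch only: the helper name add_to_load_path is invented, and it assumes the jars sit in a lib folder next to the script.

```ruby
# Hypothetical helper: put lib_dir at the front of Ruby's load path,
# similar to what the -I switch does from the command line.
def add_to_load_path(lib_dir)
  full_path = File.expand_path(lib_dir)
  $LOAD_PATH.unshift(full_path) unless $LOAD_PATH.include?(full_path)
  full_path
end

# Assumption: the jars live in a "lib" folder next to this script.
lib_dir = add_to_load_path(File.join(File.dirname(__FILE__), 'lib'))
puts "Searching #{lib_dir} for required files"
```

With that in place the script can be started with a plain jruby voda.rb.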
More examples to come