Aug 21 2009

Crawl, automate or test with HtmlUnit and JRuby

I bet you get dizzy with all the web testing tools out there. I've been searching for the one most suitable for my needs, in order to write my own autonomous automators and crawlers. The main problem I was facing was that most of the libraries or GUI-less browsers I tried didn't support JavaScript, and that was a real pain because I kept getting stuck on a lot of pages. JavaScript is used extensively by web developers nowadays, so if you want to write code that does interesting things on the web, you will need JavaScript support!

I ended up with HtmlUnit, a GUI-less browser for Java that is easy to use and has good JavaScript support. It is even easier to use when you write your programs in JRuby.

Below I explain how to set the two up together and write some very useful bots for your everyday needs.

  • Install jruby
  • Download htmlunit
  • Enable JRuby and include jar files
  • Write some code

Step 1

First of all we have to install jruby. If you compile jruby yourself remember to include it in your classpath.

Mac OS X

You will have to download and install MacPorts (http://www.macports.org/install.php) and then issue the following command:

$ sudo port install jruby

Linux

Use the package manager installed on your system. On distributions that use apt, for example, you simply write:

$ sudo apt-get install jruby

Windows

http://www.devdaily.com/blog/post/ruby/installing-jruby-on-windows-xp-system/

Step 2

Download htmlunit from http://sourceforge.net/projects/htmlunit/files/
Place the downloaded jars into a folder named lib.

 
tar -zxvf htmlunit-x_x.tar.gz
cd htmlunit-x-x/
mv lib/ path_of_your_choice/
 

Step 3

At the top of the Ruby file you are working on, write the following:

# Require Java so we can use the Java libraries
require 'java';
 
# Get HTML Unit and all of its required libraries
require 'htmlunit-2.1.jar';

Example: Vodafone bill

A simple example that retrieves the bill for my mobile phone from Vodafone:

voda.rb

# Require Java so we can use the Java libraries
require 'java';
 
# Get HTML Unit and all of its required libraries
require 'htmlunit-2.1.jar';
require 'commons-httpclient-3.1.jar';
require 'commons-io-1.4.jar';
require 'commons-logging-1.1.1.jar';
require 'commons-lang-2.4.jar'
require 'commons-codec-1.3.jar'
require 'xercesImpl-2.8.1.jar'
require 'xml-apis-1.0.b2.jar'
require 'jaxen-1.1.1.jar'
require 'commons-collections-3.2.jar'
require 'js-1.7R1.jar'
require 'nekohtml-1.9.7.jar'
require 'sac-1.3.jar'
require 'cssparser-0.9.5.jar'
require 'xalan-2.7.0.jar'
 
# Include the Web Client class
include_class 'com.gargoylesoftware.htmlunit.WebClient';
include_class 'com.gargoylesoftware.htmlunit.BrowserVersion';
 
# Function to connect to vodafone website
def connect_to_vodafone
  version = BrowserVersion.new( "Netscape", "5.0 (Macintosh; en-US)", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14", "1.2" , 5.0 )
  puts "version:ok"
  wc = WebClient.new(version)
  puts "wc:ok"
  page = wc.getPage("http://www.vodafone.gr/portal/client/cms/viewCmsPage.action?pageId=1032");
  puts "load_page:ok"
  puts "\nLogging in to vodafone.gr ...\n"
  # Find the login form by its action attribute
  forms = page.getForms()
  login_form = nil
  forms.each do |form|
    if form.getActionAttribute().include? "/portal/client/idm/login!login.action"
      login_form = form
    end
  end

  username = login_form.getInputByName("username")
  password = login_form.getInputByName("password")
  button = login_form.getInputByName("Submit")
  # Fill in the login form
  username.setValueAttribute("your_username")
  password.setValueAttribute("your_pass")

  # Submit the form to log in
  mypage = button.click()

  # Load the profile page that lists the billing accounts
  mypage = wc.getPage("https://www.vodafone.gr/portal/client/idm/loadUserProfile.action");
  account_form = mypage.getFormByName("myAccountSelectBill")
  select_drop_down = mypage.getByXPath('//select[@id="billingAccountField"]')[0]
  # Results for the 1st account
  get_results(select_drop_down.asText(),mypage)
end

def get_results(am,page)
  # Collect the data you are interested in
  total_amount = page.getByXPath('//input[@id="payBill_totalOwnedAmount"]')[0].getValueAttribute()
  recent_amount = ""
  duration = ""
  page.getByXPath('//td[@class="main_text pad5"]').each do |td|
    # The cell containing a euro sign holds the recent bill amount
    if td.asXml().include?("€")
      recent_amount = td.asText
    end
    # The cell containing a dash holds the billing period
    if td.asXml().include?("-")
      duration = td.asText
    end
  end

  # Print collection
  puts "\nVodafone bill"
  puts "-------------------------------------------"
  puts "A/M: "+am+"\n"
  puts "Total amount: " + total_amount + " €\n"
  puts "Recent bill amount: " + recent_amount.split(' ')[1].split(',').join('.') + " €\n"
  puts "Duration: " + duration + "\n\n"
end
 
connect_to_vodafone

Execution

jruby -Ipath_to_lib_folder voda.rb 2>/dev/null

More examples to come :)


Jul 10 2009

Creating a CMS using CouchDB, Django and 30 lines of python

I bet you have heard the word on the street about that old Ericsson programming language coming back to life and bringing functional programming to the web. Yes, I am talking about Erlang! The language is used by companies like Facebook, Amazon and of course Ericsson for their network products. It offers features like hot code swapping, lightweight inter-process communication and more. In general it is a great language that fits really well into the "cloud computing" industry.

While exploring the language a while ago, I came across a project called Apache CouchDB, a document-oriented database, or to be more precise a "document store". At first I thought, oh, just another object store, but after a bit of research I fell in love with it. It is wonderful because it lets you store, retrieve and query structured data without having to define a schema, and you can also attach files to each document! Best of all, you can do all of this through a neat web interface! It is written in Erlang, which is a big part of why it offers such speed, stability and distributed features like replication. As I mentioned earlier, you can even query the data using, uhm... yes... JavaScript! Smart.

After I used it for a couple of hours, I thought that it would really be a great Django template store! One could serve all templates in CouchDB and also define template instances using documents that include all the variables a template renders. Isn't this some kind of a CMS? I started a simple implementation and after no more than 30 lines of python there it was! Really simple but also really functional!

Here are the steps!

  1. Install CouchDB
  2. Create a new Django project
  3. Create a new app inside the Django project; I called mine totemplate.
  4. Create the appropriate views and setup the urls.py
  5. Design the database on CouchDB
  6. Enjoy!

Step 1

Installing CouchDB should be really easy for all platforms.

Mac OS X

You will have to download and install MacPorts (http://www.macports.org/install.php) and then issue the following command:

$ sudo port install couchdb

Linux

Here is a link to a tutorial for linux: http://barkingiguana.com/2008/06/28/installing-couchdb-080-on-ubuntu-804.

Windows

If you are on windows you can take a look here: http://wiki.apache.org/couchdb/Installing_on_Windows I haven't tested it myself but it should work fine.

After you have installed couchdb, you will also have to install the python library for it. This is called couchdb-python and is available here:
http://code.google.com/p/couchdb-python/.
For those of you who have easy_install on your machines, issuing:

$ sudo easy_install couchdb-python

should do the job quickly and easily.

Here is the views.py:

# Create your views here.
from django.http import HttpResponse
from django.template import Template, Context
from couchdb import Server
from totemplate.settings import COUCH_SERVER
 
def show(request, resource_id, page_id):
    couch_store = Server(COUCH_SERVER)
    category_name = couch_store['indexers'][resource_id]['category']
    category = couch_store[category_name][page_id]
    template = couch_store['templates'][category['template_id']]
    html = Template(template['body']).render(Context(category))
    return HttpResponse(html)
 
def index(request, resource_id):
    couch_store = Server(COUCH_SERVER)
    indexer = couch_store['indexers'][resource_id]
    t = Template(indexer['template'])
    html = t.render(Context(couch_store))
    return HttpResponse(html)
 
def resources(request):
    couch_store = Server(COUCH_SERVER)
    resources = {"indexers":[]}
    for indexer in couch_store['indexers']:
        # Only indexers that define their own template are listed
        if 'template' in couch_store['indexers'][indexer]:
            resources['indexers'].append(indexer)
    t = Template( couch_store["settings"]["index"]["body"])
    html = t.render(Context(resources))
    return HttpResponse(html)
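
The post does not show the CouchDB documents themselves, so here is a minimal sketch of a layout that the `show` view above would accept, inferred only from the keys it reads. Everything else is an assumption: the sample data and the database name `news_pages` are made up, plain dictionaries stand in for the CouchDB server, and `string.Template` (with `$name` placeholders) stands in for Django's template engine so the sketch runs on its own.

```python
from string import Template

# Plain dicts standing in for CouchDB databases (hypothetical sample data).
couch_store = {
    # One document per template; `body` holds the markup.
    "templates": {
        "article": {"body": "<h1>$title</h1><p>$text</p>"},
    },
    # One document per resource; `category` names the database with its pages.
    "indexers": {
        "news": {"category": "news_pages"},
    },
    # One document per page: the template to use plus the variables it renders.
    "news_pages": {
        "hello": {"template_id": "article", "title": "Hello", "text": "First post"},
    },
}


def show(resource_id, page_id):
    """Mirror the lookups of the `show` view, without Django or CouchDB."""
    category_name = couch_store["indexers"][resource_id]["category"]
    page = couch_store[category_name][page_id]
    template = couch_store["templates"][page["template_id"]]
    # The page document doubles as the rendering context.
    return Template(template["body"]).substitute(page)


html = show("news", "hello")
```

Each page document is a "template instance": it points at a template and carries the variables that template renders, which is the whole idea behind the CMS.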

Jun 12 2009

Caching the result of a python function using memcached and decorators

sarcachem.py

import memcache, time
 
HOST = "127.0.0.1"
PORT = "11211"
MC_CLIENT = memcache.Client(['%s:%s'%(HOST,PORT)], debug=0)
 
class sarcachem:
 
    class helper:
 
        def __init__(self, outer, fun):
            self.outer = outer
            self.fun = fun
 
        def __call__(self, *args, **kwargs):
            # If the cached value does not exist:
            # 1. Try to take the lock atomically
            #    If we got it, calculate and store the value,
            #    then release the lock
            #    If we did not, wait until the lock is released
            # Return the cached value
            key = "%s.%s->(%s,%s)"%(self.outer.salt,
                                   self.fun.__name__,
                                   repr(args),
                                   repr(kwargs))
            key_lock = "00_locked_%s"%(key)

            if MC_CLIENT.get(key) is None:
                # add() only succeeds when the key is absent, so only
                # one caller at a time can take the lock
                if MC_CLIENT.add(key_lock,True):
                    try:
                        result = self.fun(*args, **kwargs)
                        MC_CLIENT.set(key,result,time=self.outer.time)
                        print("Storing: %s: %s"%(key, result))
                    finally:
                        # Release the lock even if the function raised
                        MC_CLIENT.delete(key_lock)
                else:
                    # Someone else is calculating, poll until they finish
                    while MC_CLIENT.get(key_lock):
                        time.sleep(1)

            return MC_CLIENT.get(key)
 
    def __init__(self,time=3,salt="base"):
        """ In this function we set all our decorator's parameters """
        self.time = time
        self.salt = salt
 
    def __call__(self, fun):
        return sarcachem.helper(self, fun)
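
One caveat about the cache key built in `__call__`: memcached keys may not contain whitespace or control characters and are limited to 250 bytes, while `repr(args)` contains spaces as soon as there is more than one argument. A possible workaround, sketched here with a hypothetical helper named `make_key` that is not part of the module above, is to hash the readable key into a fixed-length digest:

```python
import hashlib


def make_key(salt, func_name, args, kwargs):
    """Build a fixed-length, whitespace-free cache key (hypothetical helper)."""
    # Sort keyword arguments so the same call always yields the same key.
    raw = "%s.%s->(%r,%r)" % (salt, func_name, args, sorted(kwargs.items()))
    # An MD5 hex digest is always 32 characters from [0-9a-f].
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


key = make_key("base", "fib", (10, 20), {"mode": "fast"})
same = make_key("base", "fib", (10, 20), {"mode": "fast"})
other = make_key("base", "fib", (10, 21), {"mode": "fast"})
```

The trade-off is that hashed keys are no longer readable when you inspect the cache, so keeping the plain key for simple functions is a reasonable choice too.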

And here is the way you can use it:

from sarcachem import sarcachem
 
@sarcachem(10,__file__)
def fib(number=0):
 
    # Suck my life into the CPUHOLE
    for i in range(0,100000):
        i+100;
    # END OF LIFE SUCKER
 
    if number==0:
        return 0
    elif number==1:
        return 1
    else:
        return int(fib(number-1)) + int(fib(number-2))
 
if __name__=="__main__":
    print fib(100), fib(29)
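
To experiment with the same decorator shape without a running memcached server, here is a minimal sketch that keeps the values in a local dictionary instead. It is not the module above: the name `ttl_cache` is made up, there is no locking, and the cache lives only inside one process.

```python
import time


class ttl_cache:
    """Decorator with parameters, caching results in a dict (hypothetical name)."""

    def __init__(self, seconds=3, salt="base"):
        self.seconds = seconds
        self.salt = salt
        self.store = {}  # key -> (expiry time, value)

    def __call__(self, fun):
        def wrapper(*args, **kwargs):
            key = "%s.%s->(%r,%r)" % (self.salt, fun.__name__,
                                      args, sorted(kwargs.items()))
            now = time.monotonic()
            hit = self.store.get(key)
            # Serve the cached value while it has not expired yet.
            if hit is not None and hit[0] > now:
                return hit[1]
            value = fun(*args, **kwargs)
            self.store[key] = (now + self.seconds, value)
            return value
        return wrapper


calls = []  # records every real invocation, to show the cache at work


@ttl_cache(seconds=60, salt="demo")
def fib(number=0):
    calls.append(number)
    if number < 2:
        return number
    return fib(number - 1) + fib(number - 2)


result = fib(30)
```

Because the recursive calls go through the decorated name, each value of `number` is computed only once while its entry is still fresh.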

Jun 12 2009

Up and running

Welcome to our brand new blog. We decided to start deepcore.gr with our blog for the moment. New stuff coming in the near future.

Enjoy your stay,

The team