Thursday, August 13, 2009

Screen-scraping with Perl

What is screen scraping?

It is the automated retrieval of information from websites that don't actually provide an API for it: e.g. fetch their web page, then locate and extract the particular bit(s) of data from the HTML.

It can be a little dubious, both ethically and legally, to do this. However, there are plenty of times it can be very handy for personal projects.

Don't rely upon it; it's entirely dependent upon the target website not changing its format.

  • Use random, real-world user agent strings rather than the default from LWP::UserAgent. Try looking through your own webserver's logs, build a unique list of the user agent strings found there, and then pick randomly from that list in your scraper application.

  • Try to avoid hitting the target website too much. Implement caching at your end where possible if you think you're likely to generate a lot of traffic.
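The user-agent advice above can be sketched like this; the agent strings here are just placeholder examples, standing in for whatever you harvest from your own logs:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical list harvested from your own webserver logs
my @agents = (
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.19.2 (KHTML, like Gecko) Version/4.0.2 Safari/530.19',
    'Opera/9.64 (X11; Linux i686; U; en) Presto/2.1.1',
);

# Pick one at random for this run, instead of LWP's default
# "libwww-perl/x.xx" string which is easy to block
my $ua = LWP::UserAgent->new(
    agent => $agents[ int rand @agents ],
);
```

You could also call `$ua->agent(...)` before each request if you want a different string per fetch rather than per run.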

Suggested method:

Use something like the Firefox Web Developer plugin to locate the CSS tags matching the content you want to retrieve.

Now, create an XPath query that will match those elements.

Let's go through an example..
Say you have some HTML like this:
<div class="price">$2000</div>
Then your XPath query would be: //div[@class="price"]
(note the @ before class, which marks an attribute test). It's better to start the query with //, which matches the element at any depth rather than spelling out the full path from the root; this means your query is more likely to keep working if the destination site changes its structure a little bit.

XHTML parsers.

Originally I used XML::LibXML, which had a flag you could set that aimed to process potentially-broken XHTML and HTML. However it broke (see RT #44715) a few versions back, and there seems to be no interest from the author in fixing it.

I then looked at HTML::TreeBuilder, which has a robust parser but makes it awkward to locate your content. Then someone helpfully pointed me towards HTML::TreeBuilder::XPath! This module allows you to run XPath queries on top of an HTML::TreeBuilder tree. Hurrah!

You'll need to use LWP::UserAgent to fetch the HTML content, and then pass it into treebuilder. Then you can perform queries like:

foreach my $node ($root->findnodes('//div[@class="carDetails"]')) {
    ...
}

One word of warning: the TreeBuilder engine seems prone to leaking memory unless you're careful to manually tear down the objects after you've finished with them. Read the POD carefully in this regard.
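Putting that together, here's a minimal self-contained sketch. It parses an inline HTML snippet (standing in for a page fetched with LWP::UserAgent; the markup is just a placeholder) and shows the explicit tear-down:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Stand-in for content fetched with LWP::UserAgent
my $html = '<html><body><div class="price">$2000</div></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# XPath query on the parsed tree; @class tests the attribute
my ($node) = $tree->findnodes('//div[@class="price"]');
my $price  = $node->as_text;
print "$price\n";    # prints $2000

# Explicit tear-down; without this the tree can leak memory
$tree->delete;
```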

Newer parser!

This year a new dedicated module appeared, called Web::Scraper.
I haven't used it much myself yet, but it seems promising.

It works with XPath queries or direct CSS selectors, which makes it even easier to set up. However, the use of the module is kind of backwards compared to the HTML::TreeBuilder approach.

First you build the Web::Scraper object and pass in the rules for the items you wish to locate; only after that do you tell it to scrape the target website. On the other hand, you don't need to fetch the page manually yourself, which is nice.

An example:

use Web::Scraper;
use URI;
use feature 'say';

my $scraper = scraper {
    process "div.carDetails > a", link => '@href', description => 'TEXT';
};
my $result = $scraper->scrape( URI->new('') );
say "Link URL = " . $result->{link} . "; link text = " . $result->{description};

Wednesday, August 12, 2009

PAR and Module::ScanDeps vs autobox

PAR is a handy tool for dealing with your Perl application's many dependencies.
You can quickly bundle up the app + all requirements, and then run it on other machines. It's great when you don't feel like building half of CPAN just to run a temporary utility somewhere.

It generally works fairly well, but it relies upon Module::ScanDeps to report what those dependencies are.

And unfortunately I've been totally stumped by it when the autobox module is involved... but only on the Debian Etch (4.0 and 4.5) platform; it works fine on Ubuntu 8.10 and 9.04.

To demonstrate, take the following mini script:

use strict;
use warnings;
use autobox;

print "Hello world!"->length;

sub SCALAR::length {
    return length(shift);
}

Now save that as test.pl, and run this to build a .par file from it:
pp -p -o test.par test.pl

Now we know that the test script uses autobox. So this test.par should contain autobox.pm, right?

$ unzip -l test.par|grep autobox

Wrong. Where did it go?
Attempting to run the resulting .par on another Etch system does, predictably, fail due to the missing autobox module.
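One way to confirm that the dependency scan itself is at fault (rather than the packaging step) is to run the scandeps.pl utility that ships with Module::ScanDeps directly against the script; this is a sketch, assuming the mini script above is saved as test.pl:

```shell
# Ask Module::ScanDeps directly which dependencies it sees;
# -c compiles the script so runtime "use" statements are picked up
scandeps.pl -c test.pl | grep -i autobox
```

On the affected Etch systems I'd expect this to print nothing, matching the empty grep against the .par contents.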

Running the same commands on Ubuntu 9.04 works just fine. Both systems are running the latest versions of PAR, PAR::Packer, and Module::ScanDeps.
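Until the scanner is fixed, a workaround worth trying is to force the missing module into the archive explicitly with pp's -M switch, which adds a module to the bundle regardless of what Module::ScanDeps reports (again assuming the script is saved as test.pl):

```shell
# Force autobox into the archive even though the dependency scan misses it
pp -M autobox -p -o test.par test.pl
```

The same -M trick works for any other module the scan silently drops.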