Thursday, August 13, 2009

Screen-scraping with Perl

What is screen scraping?

It is the automated retrieval of information from websites which don't provide an API for doing so. E.g. fetch their web page, then locate and extract the particular bit(s) of data from the HTML.

It can be a little dubious, both ethically and legally, to do this; however, there are plenty of times when it is very handy for personal projects.

Don't rely upon it; it depends heavily on the target website not changing its format.

Hints:
  • Use random, real-world user agent strings, rather than the default from LWP::UserAgent. Try going through your own webserver's logs to build a de-duplicated list of the user agent strings there, and then pick one at random from that list for each request in your web scraper application.

  • Try to avoid hitting the target website too much. Implement caching at your end where possible if you think you're likely to generate a lot of traffic.
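
As a rough sketch of both hints together, something like the following might do. The user agent strings, the one-hour cache lifetime, and the helper names random_agent and fetch_cached are all made up for illustration; only the LWP::UserAgent calls come from the module itself. The cache here is just an in-memory hash, so it only helps within a single run.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Sample strings only (assumption); build the real list from your own server logs.
my @user_agents = (
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Opera/9.64 (Windows NT 5.1; U; en) Presto/2.1.1',
);

my %cache;              # url => { fetched => epoch seconds, content => html }
my $max_age = 60 * 60;  # assumed lifetime: re-fetch a page at most once an hour

# Pick one user agent string at random.
sub random_agent {
    return $user_agents[ rand @user_agents ];
}

# Return the page content, hitting the network only when the cached copy
# is missing or older than $max_age.
sub fetch_cached {
    my ($url) = @_;

    my $hit = $cache{$url};
    return $hit->{content}
        if $hit && time - $hit->{fetched} < $max_age;

    my $ua = LWP::UserAgent->new( agent => random_agent(), timeout => 30 );
    my $response = $ua->get($url);
    die "Fetch of $url failed: " . $response->status_line . "\n"
        unless $response->is_success;

    $cache{$url} = { fetched => time, content => $response->decoded_content };
    return $cache{$url}{content};
}
```

You would then call fetch_cached('http://cars.example.com/') wherever you need the HTML. For anything long-running, a cache on disk would be the more sensible choice.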


Suggested method:

Use something like the Firefox Web Developer plugin to locate the HTML elements (and their CSS classes or ids) that hold the content you want to retrieve.

Now, create an XPath query that will match those elements.

Let's go through an example.
Say you have some HTML like this:
<div class="price">$2000</div>
Then your XPath query would be: //div[@class="price"]
It's better to start the query with // (which matches the element at any depth in the document) than to spell out the full path from the root, as this means your query is more likely to keep working if the destination site changes its structure a little.

XHTML parsers.

Originally I used XML::LibXML, which had a flag that aimed to process potentially-broken XHTML and HTML. However it broke (see RT #44715) a few versions back, and there seems to be no interest from the author in fixing it.

I then looked at HTML::TreeBuilder, which has a robust parser, but makes it difficult to locate your content. Then someone helpfully pointed me towards HTML::TreeBuilder::XPath! This module allows you to run XPath queries on top of an HTML::TreeBuilder tree. Hurrah!

You'll need to use LWP::UserAgent to fetch the HTML content, and then pass it into treebuilder. Then you can perform queries like:

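foreach my $node ($root->findnodes('//div[@class="carDetails"]')) {
    # Relative query: the price div directly inside this carDetails div
    my $price = $node->findvalue('div[@class="price"]');
    print "$price\n";
}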


One word of warning -- the treebuilder engine seems prone to leak memory unless you're careful to manually tear down the objects (by calling the tree's delete method) after you've finished with them. Read the POD carefully in this regard.
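
Putting the pieces together, a minimal sketch might look like this. The sample markup and the extract_prices helper are invented for illustration, and the HTML is supplied inline rather than fetched, so the example runs without touching a real site. In practice you would pass in the content you got back from LWP::UserAgent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for a fetched page.
my $html = <<'HTML';
<html><body>
  <div class="carDetails"><a href="/cars/1">Red hatchback</a><div class="price">$2000</div></div>
  <div class="carDetails"><a href="/cars/2">Blue sedan</a><div class="price">$3500</div></div>
</body></html>
HTML

# Parse the HTML, collect the price from each carDetails block, then
# tear the tree down explicitly so it does not linger in memory.
sub extract_prices {
    my ($content) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($content);
    $tree->eof;

    my @prices;
    foreach my $node ( $tree->findnodes('//div[@class="carDetails"]') ) {
        push @prices, $node->findvalue('div[@class="price"]');
    }

    $tree->delete;    # manual tear-down, as the POD advises
    return @prices;
}

print "$_\n" for extract_prices($html);
```

Copying the values out into a plain array before calling delete means nothing keeps a reference into the tree afterwards.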

Newer parser!

This year a new dedicated module appeared, called Web::Scraper.
I haven't used it much myself yet, but it seems promising.

It works on XPath queries or direct CSS selectors, which makes it even easier to set up. However, the module works the other way round from the HTML::TreeBuilder approach.

First you build the Web::Scraper object, passing in the rules for the items you wish to locate, and only after that do you tell it to scrape the target website. You don't need to fetch the page yourself, though, which is nice.

An example:

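use strict;
use warnings;
use feature 'say';
use URI;
use Web::Scraper;

# Define the rule first: grab the href and text of the link inside div.carDetails
my $scraper = scraper {
    process "div.carDetails > a", link => '@href', description => 'TEXT';
};
# Only now fetch and scrape the page
my $result = $scraper->scrape( URI->new('http://cars.example.com/') );
say "Link URL = " . $result->{link} . "; link text = " . $result->{description};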
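
To show the XPath and CSS styles side by side, here is a small sketch that scrapes an inline HTML string instead of a URL. The markup and the result keys (prices, cars, name, price) are made up for illustration. A key ending in [] collects every match into a list, and a nested scraper block builds one hash per matching element.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample markup standing in for the target page.
my $html = <<'HTML';
<html><body>
  <div class="carDetails"><a href="/cars/1">Red hatchback</a><div class="price">$2000</div></div>
  <div class="carDetails"><a href="/cars/2">Blue sedan</a><div class="price">$3500</div></div>
</body></html>
HTML

my $scraper = scraper {
    # XPath locator: every price div anywhere in the page
    process '//div[@class="price"]', 'prices[]' => 'TEXT';

    # CSS locator: one nested result per carDetails block
    process 'div.carDetails', 'cars[]' => scraper {
        process 'a',         name  => 'TEXT';
        process 'div.price', price => 'TEXT';
    };
};

# scrape() also accepts the HTML itself as a string
my $result = $scraper->scrape($html);

print "$_->{name}: $_->{price}\n" for @{ $result->{cars} };
```

Because the rules are declared up front, the same $scraper object can be reused for as many pages as you like.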
