Thursday, October 8, 2009

Melbourne GTUG videos

I've uploaded some videos from the Melbourne Google Technology User Group (GTUG) meeting.

They're available at http://www.youtube.com/gtugmelbourne

Oddly, YouTube allows up to 2 gigabytes per video upload, but only a maximum of 10 minutes! Filling both limits works out to about the same bitrate as my practically-uncompressed original DV files. Who would upload something like that?
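A quick back-of-the-envelope check of that comparison. This is only a sketch: it assumes "2 gigabytes" means 2 x 10^9 bytes, and uses DV's nominal video bitrate of 25 Mbit/s (audio and container overhead are not counted).

```perl
use strict;
use warnings;

# Assumptions: 2 GB = 2 * 10**9 bytes; DV video is nominally 25 Mbit/s.
my $max_bytes   = 2 * 10**9;
my $max_seconds = 10 * 60;
my $dv_mbit     = 25;

# Average bitrate needed to fill the size limit within the time limit.
sub mbit_per_sec {
    my ($bytes, $seconds) = @_;
    return $bytes * 8 / $seconds / 1_000_000;
}

printf "Upload limit: %.1f Mbit/s, DV video: %d Mbit/s\n",
    mbit_per_sec($max_bytes, $max_seconds), $dv_mbit;
```

Under those assumptions the upload limit comes to roughly 26.7 Mbit/s, which is indeed in DV territory.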

Tuesday, September 29, 2009

Catalyst 5.71 vs 5.8 performance test

After a discussion on the merits of Catalyst 5.7 vs 5.8 as far as performance goes, I decided to knock up a proper test.

I have two identical virtual machines; on one I installed Catalyst::Runtime 5.71001, and on the other 5.80013 (plus dependencies, of course).

Running the exact same app, I hit them up with Siege for a while; results follow at the end of this post.

If you want to replicate the test or examine my extremely-simple test app, see:
Catalyst performance test app on GitHub (patches gleefully accepted ;)

It's interesting to note the headline figures have 5.71 performing 316 tps, vs 5.80 making only 283 tps.
Memory usage (for this small app) has increased by 4MB, but is presumably shared. I guess I should look into that more.

The same system can serve small static pages from the webserver at about 1900 tps. A real-world application on that system, running Catalyst 5.8, gets 90 tps.

I don't see that performance difference (5.71 vs 5.80) as significant, since most of your time ends up being spent in application code, rather than the Catalyst framework itself.
i.e. if you want to make your code go faster, look at optimising your templating and database queries before you worry about downgrading Catalyst.
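To put the headline figures in perspective, here is a small sketch that turns the two transaction rates into a percentage drop and an approximate time per request. The rates and the concurrency of 10 are taken from the Siege results below; the helper names are made up for illustration.

```perl
use strict;
use warnings;

# Transaction rates and concurrency copied from the Siege results below.
my %tps         = ('5.71' => 315.99, '5.80' => 282.68);
my $concurrency = 10;

# Relative throughput drop going from $old to $new, as a percentage.
sub percent_drop {
    my ($old, $new) = @_;
    return ($old - $new) / $old * 100;
}

# Approximate mean time per request: concurrent clients / requests per second.
sub ms_per_request {
    my ($rate, $clients) = @_;
    return $clients / $rate * 1000;
}

printf "Throughput drop: %.1f%%\n", percent_drop($tps{'5.71'}, $tps{'5.80'});
printf "%s: %.1f ms per request\n", $_, ms_per_request($tps{$_}, $concurrency)
    for sort keys %tps;
```

That comes to a drop of about 10.5%, or under 4 ms extra per request (31.6 ms vs 35.4 ms), which is small next to the time a real application spends in its own code.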

-Toby

------------------= results =----------------------
Running 10 second warmup on 5.7..
Running main test on 5.7..

Transactions: 94796 hits
Availability: 100.00 %
Elapsed time: 300.00 secs
Data transferred: 77.35 MB
Response time: 0.03 secs
Transaction rate: 315.99 trans/sec
Throughput: 0.26 MB/sec
Concurrency: 10.00
Successful transactions: 94796
Failed transactions: 0
Longest transaction: 0.98
Shortest transaction: 0.00

Process size:
101m VIRT, 34m RES



Running 10 second warmup on 5.8..
Running main test on 5.8..

Transactions: 84805 hits
Availability: 100.00 %
Elapsed time: 300.00 secs
Data transferred: 69.20 MB
Response time: 0.04 secs
Transaction rate: 282.68 trans/sec
Throughput: 0.23 MB/sec
Concurrency: 9.99
Successful transactions: 84805
Failed transactions: 0
Longest transaction: 1.07
Shortest transaction: 0.00

Process size:
103m VIRT, 38m RES

Thursday, August 13, 2009

Screen-scraping with Perl

What is screen scraping?

It is automated retrieval of information from websites which don't actually provide an API to do so. E.g. fetch their web page, then locate and extract the particular bit(s) of data from the HTML.

It can be a little dubious, both ethically and legally, to do this - however there are plenty of times it can be very handy for personal projects.

Don't rely upon it; it's quite dependent upon the target website not changing their format.

Hints:
  • Use random, real-world user agent strings, rather than the defaults from LWP::UserAgent. Try hitting up your own webserver's logs, and make a unique list of the user agent strings there, and then randomly iterate through those in your web scraper application.

  • Try to avoid hitting the target website too much. Implement caching at your end where possible if you think you're likely to generate a lot of traffic.
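Both hints can be sketched in a few lines. This is an illustration under stated assumptions rather than a finished scraper: it assumes your access log uses the common "combined" format (user agent as the last quoted field), it swaps in the core module HTTP::Tiny for LWP::UserAgent so that nothing extra needs installing, and the function names are made up.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Pull the distinct user agent strings out of access log lines.
# Assumes the "combined" log format, where the user agent is the last quoted field.
sub unique_user_agents {
    my @lines = @_;
    my %seen;
    return grep { !$seen{$_}++ } map { /"([^"]*)"\s*$/ ? $1 : () } @lines;
}

my %cache;    # url => page body; in-memory only, lost when the script exits

# Fetch a page with a randomly chosen user agent, going to the network
# at most once per URL for the lifetime of the process.
sub fetch_cached {
    my ($url, @agents) = @_;
    return $cache{$url} if exists $cache{$url};
    my $agent    = $agents[ rand @agents ];
    my $response = HTTP::Tiny->new(agent => $agent)->get($url);
    die "Fetch of $url failed: $response->{status}\n" unless $response->{success};
    return $cache{$url} = $response->{content};
}
```

Typical use would be to read your log file into a list, pass it through unique_user_agents() once at startup, and then call fetch_cached($url, @agents) wherever you would otherwise fetch directly. For anything long-running you would want the cache on disk with an expiry time.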


Suggested method:

Use something like the Firefox Web Developer plugin to find the HTML element, and its CSS class, that wraps the content you want to retrieve.

Now, create an XPath query that will locate that element.

Let's go through an example..
Say you have some HTML like this:
<div class="price">$2000</div>
Then your XPath query would be: //div[@class="price"]
It's better to start the query with // (which matches the element at any depth) rather than spelling out the full path from the document root, as this means your query is more likely to keep working if the destination site changes the structure a little bit.

XHTML parsers.

Originally I used XML::LibXML, which had a flag that aimed to process potentially-broken XHTML and HTML. However it broke (see RT #44715) a few versions back, and there seems to be no interest from the author in fixing it.

I then looked at HTML::TreeBuilder, which has a robust parser, but was difficult to locate your content with. Then someone helpfully pointed me towards HTML::TreeBuilder::XPath! This module allows you to do xpath queries on top of an html-treebuilder tree. Hurrah!

You'll need to use LWP::UserAgent to fetch the HTML content, and then pass it into treebuilder. Then you can perform queries like:

foreach my $node ($root->findnodes('//div[@class="carDetails"]')) {
    # Look up the price relative to the current carDetails node
    print $node->findvalue('div[@class="price"]'), "\n";
}


One word of warning -- the treebuilder engine seems prone to leak memory unless you're careful to manually tear-down the objects after you've finished with them. Read the POD carefully in this regard.
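Putting the pieces together, here is a minimal end-to-end sketch, including the tear-down. It assumes HTML::TreeBuilder::XPath is installed, and it parses a hard-coded, made-up snippet instead of fetching a page so that it runs without network access; extract_prices is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup standing in for a fetched page.
my $html = <<'HTML';
<html><body>
  <div class="carDetails"><a href="/cars/1">Red hatchback</a><div class="price">$2000</div></div>
  <div class="carDetails"><a href="/cars/2">Blue sedan</a><div class="price">$3500</div></div>
</body></html>
HTML

# Return the text of every price found inside a carDetails block.
sub extract_prices {
    my ($content) = @_;
    my $root = HTML::TreeBuilder::XPath->new;
    $root->parse($content);
    $root->eof;

    my @prices = map { $_->findvalue('.//div[@class="price"]') }
                 $root->findnodes('//div[@class="carDetails"]');

    $root->delete;    # explicit tear-down, so the tree's memory is released
    return @prices;
}

print "$_\n" for extract_prices($html);
```

In a real scraper, $html would be the decoded content of the response you fetched. Calling delete() on the root once you have copied out the values you need is the tear-down step the POD describes.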

Newer parser!

This year a new dedicated module appeared, called Web::Scraper.
I haven't used it much myself yet, but it seems promising.

It works on XPath queries or direct CSS selectors, which makes it even easier to set up. However, the module is used kind of backwards compared to the HTML::TreeBuilder way.

First you build the Web::Scraper object and pass in the rules for the items you wish to locate, and only after that do you tell it to scrape the target website. However, you don't need to fetch the page manually, which is nice.

An example:

use feature 'say';
use URI;
use Web::Scraper;

# Define the extraction rules first...
my $scraper = scraper {
    process "div.carDetails > a", link => '@href', description => 'TEXT';
};
# ...then fetch and scrape the page in one step
my $result = $scraper->scrape( URI->new('http://cars.example.com/') );
say "Link URL = " . $result->{link} . "; link text = " . $result->{description};

Wednesday, August 12, 2009

PAR and Module::ScanDeps vs autobox

PAR is a handy tool for dealing with your Perl application's many dependencies.
You can quickly bundle up the app + all requirements, and then run it on other machines. It's great when you don't feel like building half of CPAN just to run a temporary utility somewhere.

It generally works fairly well, but it relies upon Module::ScanDeps to report what those dependencies are.

And unfortunately I've been totally stumped by it when the autobox module is involved.. but only on the Debian Etch (4.0 and 4.5) platform; it works fine on Ubuntu 8.10 and 9.04.

To demonstrate, take the following mini script:

#!/usr/bin/perl
use strict;
use warnings;
use autobox;

print "Hello world!"->length;

sub SCALAR::length {
return length(shift);
}


Now run this to build a .par file from it:
pp -p -o test.par test.pl

Now we know that the test script used autobox.. So this test.par should contain autobox.pm and autobox.so, right?

$ unzip -l test.par|grep autobox
$

Wrong. Where did it go?
Attempting to run the resulting par on another etch system does, predictably, fail due to a missing autobox module.

Running the same commands on Ubuntu 9.04 works just fine. Both systems are running the latest version of PAR, PAR::Packer, and Module::ScanDeps.
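One way to narrow this down is to ask Module::ScanDeps directly what it detects, leaving pp out of the picture. A minimal sketch, assuming Module::ScanDeps is installed; missing_modules is a hypothetical helper, and the scan only runs when a script path is passed on the command line.

```perl
use strict;
use warnings;

# Return the wanted module files that are absent from a scan_deps() result.
# The result is a hash keyed by module file name, e.g. 'autobox.pm'.
sub missing_modules {
    my ($deps, @wanted) = @_;
    return grep { !exists $deps->{$_} } @wanted;
}

# Only scan when a script path is given, e.g.: perl check_deps.pl test.pl
if (my $script = shift @ARGV) {
    require Module::ScanDeps;
    my $deps    = Module::ScanDeps::scan_deps(files => [$script], recurse => 1);
    my @missing = missing_modules($deps, 'autobox.pm');
    print @missing ? "Not detected: @missing\n" : "autobox.pm was detected\n";
}
```

Running it on both platforms should show whether the scan itself misses autobox or whether the module gets lost later in packaging. If the scan is at fault, pp's -M option can add a module explicitly (pp -M autobox -p -o test.par test.pl); whether that is enough on Etch is untested here.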

Thursday, July 9, 2009

Quieter PostgreSQL deployment from DBIx::Class

I use DBIx::Class to deploy my SQL schema to the database, rather than hand-crafting SQL.
This allows me to deploy the schema to a variety of databases easily, without worrying about inter-database SQL concerns.

In my unit tests, I create quite a lot of test databases.. PostgreSQL in particular is very noisy during this process, printing a LOT of "NOTICE" messages to the console. I wanted to quieten it down, so that actual test failures/warnings were more obvious.

If you'd like that too, try this in your MyApp::Schema class:


# If you use Moose:
before 'deploy' => sub {
my $schema = shift;
$schema->storage->dbh->do('SET client_min_messages = warning');
};

# If you don't use Moose:
sub deploy {
my $schema = shift;
$schema->storage->dbh->do('SET client_min_messages = warning');
$schema->next::method(@_);
}
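An alternative that avoids touching deploy() at all is DBIx::Class's on_connect_do connection option, which runs the given SQL each time a connection is made. Note the difference: it quietens every statement on that connection, not just the deployment. The DSN, credentials and the MYAPP_DEPLOY environment variable below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical connection details; substitute your own.
my @connect_info = (
    'dbi:Pg:dbname=myapp_test', 'test_user', 'test_pass',
    { on_connect_do => ['SET client_min_messages = warning'] },
);

# Connect only when asked to, so the snippet loads without a database.
if ($ENV{MYAPP_DEPLOY}) {
    require MyApp::Schema;
    my $schema = MyApp::Schema->connect(@connect_info);
    $schema->deploy;
}
```

For unit tests this is arguably the neater choice, since the setting lives with the test's connection details rather than in the schema class itself.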

Thursday, June 11, 2009

Features attract users; Documentation required to keep them

The Perl development community is, contrary to some reports, still very active and bringing out many innovative and handy modules. To keep growing, we need to get more people using Perl who are every-day developers - by which I mean, they consume CPAN modules, rather than create them.

Unfortunately, too often I am seeing users getting turned off Perl because the CPAN modules I direct them to are too hard to figure out. They really want to use the new modules, since they've heard good things about them, but they're not going to go so far as to join IRC and/or read the source code to figure stuff out.

The problem is that many of the modern CPAN modules don't seem very well documented.
This shows up in two ways:
1) API is unstable / Documentation out of date.
Often tutorials were written against older versions of modules, and if users try to run them, they now receive errors or big warnings saying the feature is deprecated. This is confusing and off-putting for a new user. Sometimes even the documentation that ships with the module is out of date too!
2) Documentation is up to date, but hard to find.
One of the biggest complaints I hear about modern CPAN modules is that in order to find the docs for something, you already have to know the answer. This is due to the tendency to use OO and mix-in inheritance, and to document features in the module where the code lives. While that makes sense to a developer, an end user will look at the other end of the inheritance tree, where they expect the feature to be exposed.
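When you are stuck on the second problem, you can at least ask Perl which class in the inheritance tree actually defines a method, and then read that module's documentation. A small sketch using the core mro module; the My::Demo classes are a made-up hierarchy, and classes_defining is a hypothetical helper.

```perl
use strict;
use warnings;
use mro ();

# Return the classes in $class's method resolution order that define
# $method themselves (rather than merely inheriting it).
sub classes_defining {
    my ($class, $method) = @_;
    no strict 'refs';
    return grep { defined &{"${_}::${method}"} } @{ mro::get_linear_isa($class) };
}

# Tiny hypothetical hierarchy to try it on.
{
    package My::Demo::Base;
    sub new   { return bless {}, shift }
    sub fetch { return 'fetched' }

    package My::Demo::Child;
    our @ISA = ('My::Demo::Base');
    sub render { return 'rendered' }
}

print join(', ', classes_defining('My::Demo::Child', 'fetch')), "\n";
```

Here fetch() is called on My::Demo::Child but defined in My::Demo::Base, so the base class is where its documentation would live. Of course, a new user should not need a trick like this, which is rather the point.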

Developers don't like writing documentation, generally. But if we want Perl to increase the number of developers using it, someone will have to do it..
It's not enough to just tell users to "buy the book".. It's a pity that it takes a commercial incentive for good documentation to get written :(

Wednesday, May 27, 2009

How (not) to do inherited tables in DBIx::Class

When writing DBIx::Class schemas for a database that includes several similar tables, it would appear to make sense to use object-oriented programming, and make one table inherit from another.
(Now, this actually makes me think that your database tables aren't correctly normalised, however sometimes you don't have a choice over the DB layout, or you have some reason for using the denormalised layout.)

When using DBIx::Class, do not be tempted to make a Result class that inherits from another Result class. It ends up making a big mess of DBIC's internal structures, and although it seems to work at first, you'll get some weirdness down the track.

Let me give you an example *of what not to do* so you understand.

package My::Schema::People;
use base 'DBIx::Class';
__PACKAGE__->load_components('Core');
__PACKAGE__->table('people'); # Or persons?
__PACKAGE__->add_columns(qw(id age gender));
1;

package My::Schema::Teenagers;
use base 'My::Schema::People';
__PACKAGE__->table('teenagers');
__PACKAGE__->add_columns(qw(allowance angst phone_bill));
1;

package My::Schema::Adults;
use base 'My::Schema::People';
__PACKAGE__->table('adults');
__PACKAGE__->add_columns(qw(salary stress childcare_centre));
1;


The problem is that when you "use base People" you cause ->table() to get called with it set to "people".. Now later you re-call table() in your own class and set it to teenagers, but it's too late - that first call to table() triggers a lot of code inside DBIC which ends up associating the wrong things to your class and to that original table.

I'll now show you a way which does work correctly.


package My::Base::People;
use base 'DBIx::Class';
sub foo {
my ($class, %args) = @_;
$class->load_components('Core');
$class->table($args{table});
$class->add_columns(qw(id age gender));
}
1;

package My::Schema::People;
use base 'My::Base::People';
__PACKAGE__->foo(table => 'people');
1;

package My::Schema::Teenagers;
use base 'My::Base::People';
__PACKAGE__->foo(table => 'teenagers');
__PACKAGE__->add_columns(qw(allowance angst phone_bill));
1;

package My::Schema::Adults;
use base 'My::Base::People';
__PACKAGE__->foo(table => 'adults');
__PACKAGE__->add_columns(qw(salary stress childcare_centre));
1;


I have also thought of creating the base class as a DBIC Component instead, and hooking it into the ->table() call. It'd look slightly neater in the result classes, but wouldn't be as clear for my example and understanding.

i.e. your result classes would look like:

package My::Schema::People;
use base 'DBIx::Class';
__PACKAGE__->load_components(qw(+My::Component::People Core));
__PACKAGE__->table('people');
1;
and your component class would look a bit like
package My::Component::People;
use base 'DBIx::Class';
sub table {
my ($class, $table) = @_;
$class->next::method($table);
$class->add_columns(qw(id age gender));
}
1;

Ironman

I have so many blogs, or microblogs, of one sort or another already..
LiveJournal, Flickr's photostream, Facebook's news page, Twitter, Dreamwidth, Identi.ca.
Why am I starting another one?

I'm entering the Enlightened Perl Ironman challenge. (See link from title)
It seems like a good idea, and it's about time I wrote more about what I do, rather than the inanity of Facebook and Twitter, or the boring details of my life on LiveJournal.