Web scraping in Ruby: why I had to use scrAPI instead of WWW::Mechanize and Hpricot

Thursday evening: so, I had written myself a nice little script using Aaron Patterson’s WWW::Mechanize and why’s Hpricot to extract some data from a popular web-based airport directory.

Hpricot logo

I was warmed up for Hpricot by the promise of XPath and CSS selector support (and a very cool logo, of course). As a long-time XPath user, I started banging out some crisp XPath expressions, until I realized that the XPath support was only partial. I kept trying expressions that should have worked, even falling back to expressions that, according to the Wiki, would work, but differently. Come on guys, either support a standard or plainly ignore it, please! 😡 Because I couldn’t figure out how to integrate why’s fork of the XPath spec into my expressions, I decided to stick with why’s fork of the CSS selectors instead.

Then it was time to execute my code. I had estimated that it would take about two hours to download and parse the approximately 10,000 pages containing the data I was interested in. So I executed my script, detached my screen session and went to bed, trusting that I would find a nice, handy CSV file in the morning.
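The loop itself was nothing special. Here is a minimal sketch of that kind of throttled scrape-and-append loop in plain Ruby, with the fetch supplied by the caller and a made-up extraction step (the field names and regexes are invented for illustration; the real script used WWW::Mechanize and Hpricot selectors):

```ruby
require 'csv'

# Toy extraction step, for illustration only; the real script used
# Hpricot (and later scrAPI) selectors instead of regexes.
def extract_airport(html)
  name = html[%r{<h1>(.*?)</h1>}, 1]
  icao = html[%r{<td class="icao">(.*?)</td>}, 1]
  [name, icao]
end

# Append one CSV row per page, sleeping between requests so the
# remote server isn't hammered. The caller supplies the fetch as a
# block (e.g. wrapping Net::HTTP or Mechanize).
def scrape(urls, out_path, delay: 1)
  CSV.open(out_path, 'a') do |csv|
    urls.each do |url|
      html = yield(url)
      csv << extract_airport(html)
      sleep delay
    end
  end
end
```

Opening the file in append mode (`'a'`) is also what makes it possible to restart a killed script halfway through without losing the rows already written.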

Friday morning, I was disappointed to find that my script had been killed, and I was left wondering what had killed it. I decided to restart the script at the countries starting with the letter b (it had died somewhere halfway through that list). Soon the script was happily appending data to the existing CSV file again.

Disclaimer: why is a much more prolific Ruby coder than I’ll ever be, so please take my comments with a grain of salt. No, actually, rather take them with a few spoonfuls of salt.

Later, I talked about the spontaneous death of the script with Wiebe. Curious, he looked at the memory usage of my script and saw that it was happily munching away at hundreds of megs of memory on our server. And its memory usage was still growing! With crucial server processes at risk of running out of memory, and with me having to build a fence around the vegetable garden to protect it from a bunch of brawling chickens, Wiebe was friendly enough to drop in and take a look at my spaghetti code to see if he could fix the leak. He couldn’t, because the leak didn’t appear to be in my code. I wasn’t the first to be bugged by a leak in Hpricot.

That news didn’t make me very happy, because it implied I had to redo the script using different tools. I knew that Ruby’s WWW::Mechanize had been inspired by the Perl package of the same name, so I started by looking at that. After installing WWW::Mechanize, I explored CPAN’s WWW namespace a bit further and noticed that the Perl crowd also had two other good scrapers at their fingertips: WWW::Extractor and WWW::Scraper. Once again I was reminded that Perl, despite its funky syntax, is still the king of all scripting languages when it comes to the availability of quality modules. 🙁 After a few deep breaths, I set my rusty Perl skills into (slow) motion. Hell, this was supposed to be a quick script; why was it taking so much time? (Yeah, yeah; cue all the jokes about developer incompetence. 😕 )

I was almost trampled by a horde of camels, each with a name more syntactically confusing than the last. Just before I was crushed, I came across a reference to a Ruby scraper with decent support for CSS 3 selectors: scrAPI. Credit for this discovery goes to the documenters of scRUBYt, a featureful scraper layered on top of WWW::Mechanize, who were friendly enough to help their users by including a link to the competition.

It took me some time to rewrite the script using scrAPI, partly because it was hard to find any documentation that was more comprehensive than a few blog posts and a cheat sheet, yet less of a hassle than reading the source. But when Assaf answered my plea by pointing me to the online API docs, I was happy.

Another reason why it was hard to migrate from WWW::Mechanize/Hpricot to scrAPI was that Hpricot starts element offsets in XPath predicates and CSS selectors at zero instead of one, where the standards say they should start. And, of course, I had to rid myself of the weird cross between CSS and XPath selectors.
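For reference, both standards count from one: `li[1]` in XPath and `li:nth-child(1)` in CSS each select the first item. Ruby’s bundled REXML library follows the spec, which makes the contrast with Hpricot’s zero-based offsets easy to see (a small stdlib-only sketch, nothing Hpricot-specific):

```ruby
require 'rexml/document'

doc = REXML::Document.new('<ul><li>first</li><li>second</li></ul>')

# Per the XPath spec, positional predicates are one-based:
# //li[1] selects the *first* li.
first = REXML::XPath.first(doc, '//li[1]').text
# Hpricot treated the same offsets as zero-based, so selectors
# written against it ended up off by one when ported elsewhere.
```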

I was surprised to find that the scrAPI version of the script ran about twice as fast as the Hpricot-based one, and that is including a cumulative sleep() time of almost an hour between requests, added because the speed I saw during testing made me worry about over-exerting their web server. Knowing that one of Hpricot’s most-advertised features is its speed, this was very unexpected, although I have to admit that Hpricot did fill my memory very quickly.

    7 Comments

    1.
      Comment by Ellecer
      On September 16, 2007 at 04:20

      I’m evaluating some of the Ruby screen scraping libraries out there for use at work and this post was quite helpful. I’ll keep the memory consumption in mind when testing hpricot and scrubyt. I’m still unsure what _why actually meant in his last reply to that bug report on this:

    2.
      Comment by albert
      On February 25, 2008 at 16:02

      I was debugging an mechinze scraper i just did, and when i made it forget about history, it seemed to drop to sane memory consumption levels. I just set agent.max_history to 1, since i wasnt needing it anyway.

      And memory went down from 1,5 gigs and rising to a stable 40 megs.

    3.
      Comment by albert
      On February 25, 2008 at 16:04

      Omg, my brain is fried today. Why did I link my webmail there? And its mechanize ofc, not mechinze 😀

    4.
      Comment by Rowan Rodrik
      On March 13, 2008 at 21:07

      Thanks for the note Albert. Maybe, next time, I should give Mechanize another try. 🙂

    5.
      Comment by bubfranks
      On August 6, 2008 at 09:10

      yup, worked for me too, set max_history = 1 and it won’t keep track of every page you visit.

    6.
      Comment by Rowan Rodrik
      On June 21, 2010 at 14:10

      Grumbl. 😐 I’ve just spent over an hour trying to figure out why I couldn’t get my scrAPI script for the old Aihato guest book to work, until I decided to take a look at the source without Firebug. I counted at least three <html> tags. It’s official: I’m disgusted. 😯 It made me download the entire thing with wget to keep it as a warning for little children…

    7.
      Comment by Harry
      On April 15, 2015 at 11:04

      It worked for me.

      Thanks for the post 🙂
