Web scraping has always been, at best, a flaky way of gathering data and, at worst, a legal gray area. With premier sites such as Google and Amazon now offering official web service interfaces to their data, developers can add both respectability and reliability to their applications and drop the fragile HTML parsing.
This change in focus, from using these services through the provided front end to wrapping our own services around them, takes a while to get your head around, but once you 'get it' the possibilities become pretty much endless. For me the need was a pretty straightforward one: each day I ran half a dozen Google queries by hand to keep track of announcements, pages on specific topics, links created back to my articles and, yes I'll admit it, the ranking of my web site, unixdaemon.net.
Some Perl code, the essential SOAP::Lite module, a copy of Google Hacks and half a day of playing around later, I had my first set of results emailed to me.
The application itself isn't anything groundbreaking. When run, it looks in a configuration directory (defined in an application-level config file) for any files that end in '.conf' and parses them, looking for certain directives. The most basic config would look like this:
    # Google Report Mailer config file
    # search term, including any parameters to be passed to Google
    phrase="Perl design patterns"
    # addresses to send the results to.
    recipients=email@example.com
Any line beginning with a '#' is treated as a comment and ignored, as are blank lines and lines consisting only of whitespace. The two configuration options, phrase and recipients, both do what you'd expect. The phrase option can contain anything that you can type into the search box on the Google homepage and is limited only by the rules of the Google API itself. In our example we use double quotes to influence Google's results; they are not mandatory.
The recipients directive is a little more flexible. You can have as many email addresses as you wish on each line, provided they are comma separated. You may also include more than one line of recipients in the file; multiple lines are appended to each other and the results are sent to all the users. For example:
    recipients=firstname.lastname@example.org, email@example.com
    recipients=firstname.lastname@example.org, email@example.com
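The parsing rules described above are simple enough to sketch in a few lines. What follows is a minimal illustration in Python, not the Perl that googlemailer.pl actually uses. It assumes every directive is written as name=value (as the phrase line is), that leading whitespace before a '#' is tolerated, and that the phrase value is kept exactly as written, quotes included. The function name parse_query_config is made up for this example.

```python
def parse_query_config(text):
    """Parse one query config into a phrase and a flat list of recipients."""
    phrase = None
    recipients = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Comments, blank lines and whitespace-only lines are ignored.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # assumption: lines without '=' are silently skipped
        name, value = name.strip(), value.strip()
        if name == "phrase":
            # Kept verbatim, so any double quotes still reach the search.
            phrase = value
        elif name == "recipients":
            # Comma-separated addresses; repeated lines are appended.
            recipients.extend(a.strip() for a in value.split(",") if a.strip())
    return {"phrase": phrase, "recipients": recipients}
```

Feeding it a config with two recipients lines yields a single list containing every address, in the order they appear in the file.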
At the current time these query-level config files accept only one other option, a Google developer's key. While a key is provided in the application configuration file (more on which later), users can also provide their own key in the query config and it will be used for that query. This was done to allow fair use on machines with multiple people: you just create a directory that all the googlemailer users can write to, and they add their own query requests to it with their own keys.
Once these configurations have been picked up, the script iterates through them, retrieving the results from Google and emailing them out, in plain text, to the supplied recipients. A sample message would have a subject line similar to "Google report for 'Perl design patterns'"; a snippet of the message body is shown below:
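The discovery step and the key fallback can be sketched as follows. Again this is Python for illustration only, not the script's own code. The setting name "key" is an assumption (the directive's real name isn't given here), and both function names are hypothetical.

```python
from pathlib import Path


def find_query_configs(config_dir):
    """Return the files in config_dir whose names end in '.conf', sorted by name."""
    return sorted(
        p for p in Path(config_dir).iterdir()
        if p.is_file() and p.name.endswith(".conf")
    )


def effective_key(query_settings, app_settings):
    """Use the key from the query config if present, else the application-level key."""
    # "key" is a placeholder name for the developer key setting.
    return query_settings.get("key") or app_settings["key"]
```

Sorting the file list is a choice made for this sketch so that runs are repeatable; the original script may process the files in any order.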
    Enclosed below is the result set returned for the query:
    Perl design patterns

    Time taken: 0.262195, 92100 total results available

    perl .com: Perl Design Patterns [Jun. 13, 2003]
    Perl Design Patterns . by Phil Crow June 13, 2003 Introduction. Advertisement. In 1995, Design Patterns was published, and during the
    <http://www.perl.com/pub/a/2003/06/13/design1.html>

    perl .com: Perl Design Patterns , Part 3 [Aug. 15, 2003]
    Perl Design Patterns , Part 3. by Phil Crow August 15, 2003... It's easier to say when objects are bad, which they are in these cases: More on Perl Design Patterns : ...
    <http://www.perl.com/pub/a/2003/08/15/design3.html>
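To show how a message like the one above might be put together, here is a rough Python sketch built on the standard library's email package. It is not the code googlemailer.pl uses. The shape of each result (a dict with title, snippet and url), the default sender address and the function name build_report are all assumptions made for the example.

```python
from email.message import EmailMessage


def build_report(phrase, recipients, search_time, total, results,
                 sender="googlemailer@localhost"):
    """Build a plain-text report email for one query."""
    msg = EmailMessage()
    msg["Subject"] = f"Google report for '{phrase}'"
    msg["From"] = sender  # hypothetical default sender
    msg["To"] = ", ".join(recipients)

    lines = [
        "Enclosed below is the result set returned for the query:",
        phrase,
        "",
        f"Time taken: {search_time}, {total} total results available",
        "",
    ]
    for result in results:
        # One block per hit: title, snippet, then the URL in angle brackets.
        lines += [result["title"], result["snippet"], f"<{result['url']}>", ""]

    msg.set_content("\n".join(lines))
    return msg
```

The returned message could then be handed to smtplib.SMTP.send_message, or piped to a local mailer, depending on how the host is set up.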
The Google Mailer application itself isn't very complex, but there are a few steps you need to take before it will run happily. If the script outputs errors during the first few runs by hand, the best thing to do is change the $debug variable to 1, and it will put out a reasonable amount of debug information. If you're nice enough in your email and include both the query config it failed on and the output, I'll even have a look at it in case it's something wrong in the code.
You will need to change two things when you install Google Mailer: the settings in the app.conf file, which are tied to the local machine, and the path to the app.conf file, which is set in the first few lines of the googlemailer.pl script. The external dependencies are equally light; you will only need the following before it will run happily:
The googlemailer archive comes with a command called 'generate_gm_conf.pl' in the bin subdirectory that can be used to aid in configuration file creation. It can be invoked in three different ways.
If invoked with no options, it prints a blank template for a configuration file (including basic comments) to wherever standard out currently points. Use this invocation to create a new query config. If 'generate_gm_conf.pl' is invoked with the '-e' option (short for example), it prints a rather verbose sample configuration with heavy commenting and sample values for you to peruse. This should be enough to show you what the config expects. If the command is invoked with any other option, it displays its usage and exits.
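The three invocation modes boil down to a small dispatch on the argument list. The Python sketch below illustrates that logic only. The template text, the example text, the usage string and the exit statuses are all placeholders invented for this example, not the real output of generate_gm_conf.pl.

```python
# Placeholder texts; the real script's output will differ.
BLANK_TEMPLATE = (
    "# Google Report Mailer config file\n"
    "# search term, including any parameters to be passed to Google\n"
    "phrase=\n"
    "# addresses to send the results to\n"
    "recipients=\n"
)
EXAMPLE_CONFIG = (
    "# Example query config\n"
    'phrase="Perl design patterns"\n'
    "recipients=email@example.com\n"
)
USAGE = "usage: generate_gm_conf.pl [-e]\n"


def generate(args):
    """Return (text_to_print, exit_status) for a list of command-line arguments."""
    if not args:
        return BLANK_TEMPLATE, 0   # no options: blank template
    if args == ["-e"]:
        return EXAMPLE_CONFIG, 0   # -e: verbose example
    return USAGE, 1                # anything else: usage (status is an assumption)
```

A thin wrapper would pass sys.argv[1:] to generate, write the text to standard out and exit with the returned status, which lets the output be redirected straight into a new file in the configuration directory.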
You can find the current version of GoogleMailer, with docs and a sample config file, below. At the moment it's only offered as a manual-install tgz archive, but other package formats are planned to follow: GoogleMailer Archive
Posted on Thu Aug 28 23:14:31 2003 by Dean Wilson