Small Mosaic


Categories:

/books
/career
/codinghorrors
/events
/geekstuff
/justdont
/languages
/languages/bash
/linkshot
/magazines
/meta
/misctech
/movies
/nottech
/operatingsystems
/operatingsystems/linux
/operatingsystems/linux/debian
/operatingsystems/solaris
/perl
/presentations
/programming
/python
/ruby
/security
/security/apache
/security/tools
/serversmells
/services
/services/dns
/sites
/specifications
/sysadmin
/testing
/tools
/tools/commandline
/tools/firefox
/tools/gui
/tools/network
/tools/online
/tools/online/greasemonkey
/tools/puppet
/unixdaemon

Archives:

July 20111
June 20112
May 20113
April 20112
March 20117
January 20111
December 20103
November 20103
August 20101
July 20101
June 20104
May 20102
April 20101
March 20108
February 20101
January 20102
Full Archives

Sun, 04 Jan 2009

Simple Stemming with Perl
Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.
-- Wikipedia article on Stemming

Ever used a website that allowed you to tag content? Ever ended up accidently using slightly different tags? Something like graphs and graphing or blog and blogs? (I hope so, otherwise it's just me...) To spot some of the more obvious overlaps you can stem each of the words and look for a common base. Where one's found there is the possibility of mistaken duplication. For example if you passed hunts, hunted and hunting through a stemmer each would return 'hunt'. If you want to try for yourself there are online stemmers available.

As a more concrete example let's look at the wonderful service del.icio.us. You upload your own bookmarks, tag them with a number of keywords and can then group, sort and search them by your own defined terms. Except I have a habit of tagging articles about similar topics with nearly, but not quite the same tag.

The perl code below shows how easy it is (using Lingua::Stem from CPAN) to run your own data through a stemmer and look for overlaps. There are implementations in most languages (PyStemmer is also very nice) and the wikipedia article is actually a very easy to follow introduction.


#!/usr/bin/perl -w
use strict;
use warnings;
use Lingua::Stem;
use Net::Delicious;

my $del = Net::Delicious->new(
                               {
                                 user => "username",
                                 pswd => "password"
                               }
                             );

my $stemmer = Lingua::Stem->new( -locale => 'EN-UK' );

my %stems;
for my $tag ( $del->tags() ) {
  my $stemmed = $stemmer->stem( $tag->tag );

  push( @{ $stems{$stemmed->[0]} },  $tag->tag );
}

for my $stemmed (sort keys %stems ) {
  # we only care about base words with more than one tag associated
  next unless ( scalar @{ $stems{$stemmed} } > 1);

  print "Possible duplicates -\n";
  print "  --  ";
  print join(" : ", @{ $stems{$stemmed} }), "\n";
}


Like this post? - Digg Me! | Add to del.icio.us! | reddit this!

Posted: 2009/01/04 19:32 | /perl | Permanent link to this entry | This entry and same date


books career codinghorrors events geekstuff justdont magazines meta misctech movies nottech operatingsystems/linux operatingsystems/linux/debian operatingsystems/solaris perl programming python ruby security security/apache security/tools serversmells services/dns sites sysadmin testing tools tools/commandline tools/firefox tools/gui tools/network tools/online tools/online/greasemonkey tools/puppet unixdaemon

Copyright © 2000-2010 Dean Wilson XML feed logo