THIS IS QUITE OLD AND REMAINS FOR PURPOSES OF CODE ARCHEOLOGY

The Product Class Used by CafeFetcher

CafeFetcher is a Mechanize app that snarfs img elements from assorted CafePress sites and turns them into Product objects, which are then manipulated and spit out to help build the rubystuff.com Web site.

Mechanize lets you register classes to be created when the parser encounters certain elements. For CafeFetcher, we want to grab img elements because they happen to carry all the information we need.

(Not all images are desirable, so the main code needs to do a bit of reaping. But we are spared having to cross-reference data from multiple locations in the HTML.)

You can, of course, just grab the image data straight from the parsed HTML; Mechanize offers a number of methods for finding stuff in the post-parse node tree. But, in this case at least, encapsulating the img data with some additional accessor methods and serialization code made the main program simpler.

product.rb

require 'mechanize'
require 'builder'

Builder is used to create XML versions of the data. Nice work by Jim Weirich.

include Builder
class Product

This regular expression is used to extract the product ID from the image URL

   ID_RE = /\/(\d+)_[A-Z]_store/ 

All classes registered with the Mechanize watch_for list must accept a Node object passed to the constructor.

This is a REXML node; in our example here, it has no child nodes, but in other cases it might, which can be handy when trying to grab more complex data. Or not; if your code does not behave as expected, look and see if you are getting nodes with unwanted, or at least unexpected, children.

The basic idea is to get the node, extract the raw data, fix it up, and make it available through more intuitive methods. If the source HTML changes then there’s a chance you can just adjust your derived business objects and leave the main client code untouched.

  def initialize  node   
     @node = node
     @alt =  ''
     @src = ''
     @p_link = ''
     @pid = ''
     @price = ''

The Builder class takes an options hash; here we define the indentation level to use for all XML output.

     @xml_options = { :indent=> 2 }

Not every Node we get is going to be suitable as a Product object. There may be other ways to do this, but this version just tries to populate what attributes it can; client code then selects a subset of the resulting product instances, filtering out instances that do not define all the required properties.

     if @node.attributes[ 'alt' ] 
       @alt =  @node.attributes[ 'alt' ].to_s.strip
     end
     if @node.attributes[ 'src' ] 
       @src =  @node.attributes[ 'src' ].to_s.strip 
     end     
     @pid = get_prod_id @src 

This is a bit of code that used to be in the main CafeFetcher source, and was moved here when I decided to offer this up as an example. The threat of public scrutiny can do wonders for code refactoring. It wasn’t a major issue, but the truth is that client code (usually) has no business asking an object for some data, munging it, and then handing the results back to that same object. That’s the object’s job. (And if you really think an object is incapable, by default, of correctly calculating some derived value, then it is probably better to pass in a proc or block than to ask for the data. Proof is left as an exercise for the reader.)

     get_price
   end
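The aside above about passing in a proc or block, rather than asking for the data, might look something like this. It is only a sketch: Record and derived_id are made-up names for illustration and are not part of CafeFetcher.

```ruby
# Record is a hypothetical stand-in for an object like Product.
class Record
  def initialize(src)
    @src = src
  end

  # Returns a derived ID. With no block, the object's own rule applies;
  # with a block, the caller's rule runs on the raw value, but the
  # calculation still happens inside the object.
  def derived_id
    if block_given?
      yield @src
    else
      @src[/\/(\d+)_[A-Z]_store/, 1]
    end
  end
end

r = Record.new('http://example.com/0/22276700_F_store.jpg')
default_id = r.derived_id                               # object's own rule
custom_id  = r.derived_id { |src| File.basename(src) }  # caller-supplied rule
```

The client never holds on to the raw src value; it only supplies the rule.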

Here’s another handy feature of working with REXML nodes: not only can you deal with child elements, but the parent node is accessible as well.

This method looks to see if the element two levels up (the parent of the node's parent) is a paragraph, and if so, grabs the product price from that element's text value.

   def get_price
     parent = @node.parent.parent
     if parent.expanded_name.to_s.strip == 'p'
       @price = parent.text
     end
   end
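To see the upward navigation on its own, here is a small sketch using plain REXML. The markup is invented for the example (an img wrapped in a link inside a paragraph); the real CafePress HTML may well differ, and price_near is a hypothetical helper name.

```ruby
require 'rexml/document'

# Returns the text of the element two levels above the img node when that
# element is a <p>; otherwise returns nil.
def price_near(img)
  grandparent = img.parent.parent
  return nil unless grandparent.expanded_name.to_s.strip == 'p'
  grandparent.text.to_s.strip
end

# Invented sample markup, not taken from a real CafePress page.
sample = '<div><p>$19.99 <a href="#"><img src="x.jpg" alt="Mug"/></a></p></div>'
doc = REXML::Document.new(sample)
img = REXML::XPath.first(doc, '//img')
price = price_near(img)
```

Note that text gives back only the first text node of the element, which is all we need when the price sits ahead of the link.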

Another bit that once lived in the cold, cruel world of client code:

   def set_full_link base_url
     @p_link = base_url + '.' + self.pid
   end

I wanted to build up sorted lists of products, so a comparison method was needed.

   def <=> other 
     return -1 unless other.class == self.class 
     other.pid  <=> @pid
   end
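A quick sketch of what that comparison does to a sort. Item is a made-up stand-in that copies the same logic, so the snippet runs without Mechanize or a node. Because other comes first in the comparison, sort returns the highest pid first; and since pids are strings, they are compared character by character, not numerically.

```ruby
# Item is a hypothetical stand-in for Product with the same <=> logic.
class Item
  attr_reader :pid

  def initialize(pid)
    @pid = pid
  end

  def <=>(other)
    return -1 unless other.class == self.class
    other.pid <=> @pid
  end
end

# Wraps each pid in an Item, sorts with Item#<=>, and returns the pids.
def sorted_pids(pids)
  pids.map { |p| Item.new(p) }.sort.map(&:pid)
end

ordered = sorted_pids(%w[22276700 11111111 33333333])
```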

More data cooking.

  def get_prod_id img_url 
    #  http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg?r=632521309912518670
    md = ID_RE.match( img_url )
    md && md[1]
  end
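Here is the same extraction run outside the class, against the sample URL from the comment above. The extract_pid name is made up for this sketch; the pattern is a copy of ID_RE.

```ruby
# Copy of the pattern used by Product, repeated so the snippet stands alone.
PID_PATTERN = /\/(\d+)_[A-Z]_store/

# Returns the captured digits, or nil when the URL does not fit the pattern.
def extract_pid(img_url)
  md = PID_PATTERN.match(img_url)
  md && md[1]
end

pid  = extract_pid('http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg?r=632521309912518670')
none = extract_pid('http://example.com/images/logo.gif')
```

An image that does not follow the naming scheme gives back nil, which is one of the things client code can filter on.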

Assorted accessor methods:

  def pid= s 
     @pid = s
   end
   def pid
     @pid 
   end
   def alt
     @alt
   end
   def src
     @src
   end
   def to_s
     @node.to_s
   end

Serialization routines:

   def to_xml
     xml = XmlMarkup.new @xml_options 
     xml.product_image{ |x|
       x.alt @alt
       x.p_link( @p_link )
       x.src( @src )
     }
   end

   def to_rss1_item
     xml = XmlMarkup.new @xml_options 
     xml.item{ |x|
       x.title( @alt, 'rdf:about' => @p_link )
       x.link( @p_link )
       x.description( @alt + " (#{@price})")
       x.image( :item,  'rdf:about' => @src )
     }
   end
   def p_link= p_link 
     @p_link = p_link
   end
   def p_link
     @p_link
   end
end

There’s not a whole lot to this, which may make it a good example of a general technique, if not a particular implementation. The main idea is to encapsulate the data manipulating and reordering inside of business objects derived from one or more nodes of the parse tree. It should help isolate the rest of the code from changes in the source HTML, and make the code easier to read and maintain because the naming and behavior express the intent, not the HTML source.
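For the curious, the whole idea boiled down to a sketch that needs only REXML. ImageInfo, complete?, and usable_images are names invented here, and the sample markup is made up; the real client code and its filtering rules are not shown in this article.

```ruby
require 'rexml/document'

# ImageInfo is a hypothetical, cut-down stand-in for Product.
class ImageInfo
  ID_RE = /\/(\d+)_[A-Z]_store/
  attr_reader :alt, :src, :pid

  # Populates what it can from the node; missing values become ''.
  def initialize(node)
    @alt = node.attributes['alt'].to_s.strip
    @src = node.attributes['src'].to_s.strip
    md   = ID_RE.match(@src)
    @pid = md ? md[1] : ''
  end

  # True only when every required property has a value.
  def complete?
    [@alt, @src, @pid].none?(&:empty?)
  end
end

# Builds an ImageInfo for each img node and keeps only the complete ones.
def usable_images(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//img').map { |n| ImageInfo.new(n) }.select(&:complete?)
end

# Invented sample markup: one product image and one spacer image.
page = '<div>' \
       '<img src="http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg" alt=" Ruby mug "/>' \
       '<img src="/images/spacer.gif" alt=""/>' \
       '</div>'
keepers = usable_images(page)
```

The client asks for complete objects and never looks at attribute names or URL patterns itself.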


Copyright 2005 © James Britt