THIS IS QUITE OLD AND REMAINS FOR PURPOSES OF CODE ARCHEOLOGY

The Product Class Used by CafeFetcher

CafeFetcher is a Mechanize app that snarfs img elements from assorted CafePress sites and turns them into Product objects, which are then manipulated and spit out to help build the rubystuff.com Web site.

Mechanize lets you register classes to be created when the parser encounters certain elements. For CafeFetcher, we want to grab img elements because they happen to carry all the information we need.

(Not all images are desirable, so the main code needs to do a bit of reaping. But we are spared having to cross-reference data from multiple locations in the HTML.)

You can, of course, just grab the image data straight from the parsed HTML; Mechanize offers a number of methods for finding stuff in the post-parse node tree. But, in this case at least, encapsulating the img data with some additional accessor methods and serialization code made the main program simpler.
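To make the "register a class for an element" idea concrete, here is a minimal sketch that uses only REXML. It is not Mechanize's API: the ImgInfo class, the WATCHED hash, and the collect_watched helper are all hypothetical names invented for this illustration, and the markup is made up.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a class registered against img elements.
class ImgInfo
  attr_reader :alt, :src

  # As with the registered classes described above, the constructor takes a node.
  def initialize(node)
    @alt = node.attributes['alt'].to_s.strip
    @src = node.attributes['src'].to_s.strip
  end
end

# Hypothetical registry: element name => class to instantiate.
WATCHED = { 'img' => ImgInfo }

# Walk every element in the tree and build an object for each watched name.
def collect_watched(doc, registry)
  found = Hash.new { |hash, name| hash[name] = [] }
  doc.each_recursive do |element|
    klass = registry[element.name]
    found[element.name] << klass.new(element) if klass
  end
  found
end

html = <<~HTML
  <div>
    <p><a href="#"><img alt=" Ruby mug " src="http://example.com/1_F_store.jpg"/></a></p>
    <p>No image here</p>
  </div>
HTML

doc = REXML::Document.new(html)
images = collect_watched(doc, WATCHED)['img']
puts images.map(&:alt).inspect
```

The main program only ever sees ImgInfo objects; the details of pulling attributes off the node stay inside the class.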

product.rb

require 'mechanize'

require 'builder'

Builder is used to create XML versions of the data. Nice work by Jim Weirich.

include Builder

class Product

This regular expression is used to extract the product ID from the image URL

   ID_RE = /\/(\d+)_[A-Z]_store/

All classes registered with the Mechanize watch_for list must accept a Node object passed to the constructor.

This is a REXML node; in our example here, it has no child nodes, but in other cases it might, which can be handy when trying to grab more complex data. Or not; if your code does not behave as expected, look and see if you are getting nodes with unwanted, or at least unexpected, children.

The basic idea is to get the node, extract the raw data, fix it up, and make it available through more intuitive methods. If the source HTML changes then there’s a chance you can just adjust your derived business objects and leave the main client code untouched.

  def initialize  node

     @node = node

     @alt =  ''

     @src = ''

     @p_link = ''

     @pid = ''

     @price = ''

The Builder class takes an options hash; here we define the indentation level to use for all XML output.

     @xml_options = { :indent=> 2 }

Not every Node we get is going to be suitable as a Product object. There may be other ways to do this, but this version just tries to populate what attributes it can; client code then selects a subset of the resulting product instances, filtering out instances that do not define all the required properties.

     if @node.attributes[ 'alt' ]

       @alt = @node.attributes[ 'alt' ].to_s.strip

     end

     if @node.attributes[ 'src' ]

       @src = @node.attributes[ 'src' ].to_s.strip

     end

     @pid = get_prod_id @src

This is a bit of code that used to be in the main CafeFetcher source, and was moved here when I decided to offer this up as an example. The threat of public scrutiny can do wonders for code refactoring. It wasn’t a major issue, but the truth is that client code (usually) has no business asking an object for some data, munging it, and then handing the results back to that same object. That’s the object’s job. (And if you really think an object is incapable, by default, of correctly calculating some derived value, then it is probably better to pass in a proc or block than to ask for the data. Proof is left as an exercise for the reader.)

     get_price

end
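As a rough illustration of the "pass in a proc or block" remark in the constructor commentary, the sketch below lets a caller override how the ID is derived without ever pulling the raw data out of the object. The Item class and its optional block are hypothetical; the original Product constructor takes only a node.

```ruby
# Hypothetical stand-in for Product, reduced to the src and the derived pid.
class Item
  ID_RE = %r{/(\d+)_[A-Z]_store}

  attr_reader :src, :pid

  # An optional block replaces the default ID calculation.
  def initialize(src, &id_finder)
    @src = src
    @pid = id_finder ? id_finder.call(@src) : @src[ID_RE, 1]
  end
end

url = 'http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg?r=632521309912518670'

default_item = Item.new(url)
# A caller with unusual needs supplies the rule instead of munging src itself.
custom_item = Item.new(url) { |s| s[/r=(\d+)/, 1] }

puts default_item.pid
puts custom_item.pid
```

Either way the object stays responsible for its own derived value; the caller only contributes the rule.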

Here’s another handy feature of working with REXML nodes: not only can you deal with child elements, but the parent node is accessible as well.

This method looks two levels up the tree (the parent of the node's parent) to see whether that element is a paragraph, and if so, grabs the product price from its text value.

   def get_price

     parent = @node.parent.parent

     if parent.expanded_name.to_s.strip == 'p'

       @price = parent.text

end

end
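The traversal in get_price can be tried in isolation with plain REXML. The markup below is a guess at the kind of structure involved (price text in a paragraph that wraps a linked image), not a copy of any real CafePress page, and the strip call is an addition for tidiness.

```ruby
require 'rexml/document'

# Hypothetical markup: the price is text in a paragraph that wraps a linked image.
html = '<p>$17.99 <a href="http://example.com/item"><img alt="Ruby mug" src="mug.jpg"/></a></p>'

doc = REXML::Document.new(html)
img = REXML::XPath.first(doc, '//img')

link      = img.parent         # the <a> element
paragraph = img.parent.parent  # the <p> element, two levels up

# Element#text returns the value of the element's first text child.
price = paragraph.text.strip if paragraph.expanded_name == 'p'

puts link.name
puts price
```

If the image were not wrapped in a link, two levels up would land on some other element and the price would stay unset, which is one reason to check the element name first.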

Another bit that once lived in the cold, cruel world of client code:

   def set_full_link base_url

     @p_link = base_url + '.' + self.pid

end

I wanted to build up sorted lists of products, so a comparison method was needed.

   def <=> other

     return -1 unless other.class == self.class

     other.pid  <=> @pid

end
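Here is a small sketch of what that comparison does to a sort. Entry is a hypothetical stand-in for Product, and mixing in Comparable is an extra that the original class does not do. Because the operands are reversed (other first), larger IDs come out first; the IDs are strings, so the comparison is lexical.

```ruby
# Hypothetical stand-in for Product, reduced to the pid and the comparison.
class Entry
  include Comparable

  attr_reader :pid

  def initialize(pid)
    @pid = pid
  end

  # Same shape as the method above: operands reversed, so larger pids sort first.
  def <=>(other)
    return -1 unless other.class == self.class
    other.pid <=> pid
  end
end

entries = %w[22276700 10000001 30000002].map { |id| Entry.new(id) }
sorted = entries.sort

puts sorted.map(&:pid).inspect  # => ["30000002", "22276700", "10000001"]
```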

More data cooking.

   def get_prod_id img_url

     # Sample URL: http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg?r=632521309912518670

     md = ID_RE.match( img_url )

     # Return the captured digits, or nil when the URL does not match.

     md ? md[1] : nil

end
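The extraction is easy to check outside the class. This sketch reuses the regular expression and the sample URL from the listing; the product_id helper name and the second, non-matching URL are made up for the illustration.

```ruby
ID_RE = %r{/(\d+)_[A-Z]_store}

# Returns the captured digits, or nil when the URL does not match.
def product_id(img_url)
  match = ID_RE.match(img_url)
  match && match[1]
end

sample = 'http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg?r=632521309912518670'
other  = 'http://example.com/logo.gif'

puts product_id(sample).inspect  # => "22276700"
puts product_id(other).inspect   # => nil
```

The nil result for non-product images is what lets client code weed them out later.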

Assorted accessor methods:

   def pid= s

     @pid = s

   end

   def pid

     @pid

   end

   def alt

     @alt

   end

   def src

     @src

   end

   def to_s

     @node.to_s

   end

Serialization routines:

   def to_xml

     xml = XmlMarkup.new @xml_options

     xml.product_image { |x|

       x.alt @alt

       x.p_link( @p_link )

       x.src( @src )

     }

   end

   def to_rss1_item

     xml = XmlMarkup.new @xml_options

     xml.item { |x|

       x.title( @alt, 'rdf:about' => @p_link )

       x.link( @p_link )

       x.description( @alt + " (#{@price})" )

       x.image( :item, 'rdf:about' => @src )

     }

   end

   def p_link= p_link

     @p_link = p_link

end

   def p_link

     @p_link

end

end

There’s not a whole lot to this, which may make it a good example of a general technique, if not a particular implementation. The main idea is to encapsulate the data manipulating and reordering inside of business objects derived from one or more nodes of the parse tree. It should help isolate the rest of the code from changes in the source HTML, and make the code easier to read and maintain because the naming and behavior expresses the intent, not the HTML source.
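The constructor commentary mentions that client code filters out instances lacking the required properties. One way that filtering might look is sketched below. The Candidate struct, the REQUIRED list, and the sample values are all hypothetical, and "required" is assumed to mean that the alt text, image source, product ID, and price are all non-empty.

```ruby
# Hypothetical stand-in for Product with just the fields used for filtering.
Candidate = Struct.new(:alt, :src, :pid, :price)

# Assumed definition of "complete": every listed field is present and non-empty.
REQUIRED = %i[alt src pid price]

def complete?(candidate)
  REQUIRED.all? { |field| !candidate[field].to_s.strip.empty? }
end

candidates = [
  Candidate.new('Ruby mug', 'http://example.com/1_F_store.jpg', '1', '$17.99'),
  Candidate.new('', 'http://example.com/spacer.gif', nil, ''),              # decorative image
  Candidate.new('Ruby shirt', 'http://example.com/2_F_store.jpg', '2', '')  # no price found
]

products = candidates.select { |c| complete?(c) }

puts products.map(&:alt).inspect  # => ["Ruby mug"]
```

Because the check is phrased in terms of the object's own accessors, it would not need to change if the source HTML did.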
