CafeFetcher is a Mechanize app that snarfs img elements from assorted CafePress sites and turns them into Product objects, which are then manipulated and spit out to help build the rubystuff.com Web site.
Mechanize lets you register classes to be created when the parser encounters certain elements. For CafeFetcher, we want to grab img elements because they happen to carry all the information we need.
(Not all images are desirable, so the main code needs to do a bit of reaping. But we are spared having to cross-reference data from multiple locations in the HTML.)
You can, of course, just grab the image data straight from the parsed HTML; Mechanize offers a number of methods for finding stuff in the post-parse node tree. But, in this case at least, encapsulating the img data with some additional accessor methods and serialization code made the main program simpler.
require 'mechanize'
require 'builder'
Builder is used to create XML versions of the data. Nice work by Jim Weirich.
include Builder
class Product
This regular expression is used to extract the product ID from the image URL.
ID_RE = /\/(\d+)_[A-Z]_store/
All classes registered with the Mechanize watch_for list must accept a Node object passed to the constructor.
This is a REXML node; in our example here, it has no child nodes, but in other cases it might, which can be handy when trying to grab more complex data. Or not; if your code does not behave as expected, look and see if you are getting nodes with unwanted, or at least unexpected, children.
The basic idea is to get the node, extract the raw data, fix it up, and make it available through more intuitive methods. If the source HTML changes then there’s a chance you can just adjust your derived business objects and leave the main client code untouched.
def initialize node
@node = node
@alt = ''
@src = ''
@p_link = ''
@pid = ''
@price = ''
The Builder class takes an options hash; here we define the indentation level to use for all XML output.
@xml_options = { :indent=> 2 }
Not every Node we get is going to be suitable as a Product object. There may be other ways to do this, but this version just tries to populate what attributes it can; client code then selects a subset of the resulting product instances, filtering out instances that do not define all the required properties.
if @node.attributes[ 'alt' ]
@alt = @node.attributes[ 'alt' ].to_s.strip
end
if @node.attributes[ 'src' ]
@src = @node.attributes[ 'src' ].to_s.strip
end
@pid = get_prod_id @src
This is a bit of code that used to be in the main CafeFetcher source, and was moved here when I decided to offer this up as an example. The threat of public scrutiny can do wonders for code refactoring. It wasn’t a major issue, but the truth is that client code (usually) has no business asking an object for some data, munging it, and then handing the results back to that same object. That’s the object’s job. (And if you really think an object is incapable, by default, of correctly calculating some derived value, then it is probably better to pass in a proc or block than to ask for the data. Proof is left as an exercise for the reader.)
get_price
end
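The proc-or-block suggestion above can be sketched roughly as follows. This is only an illustration: the Priced class, its raw argument, and the default formatter are hypothetical names invented for the example and are not part of CafeFetcher.

```ruby
# Hypothetical class: the object owns its raw data, and the caller may
# hand in a block that decides how the derived value is computed.
class Priced
  def initialize(raw, &formatter)
    @raw = raw
    # Fall back to a simple default when no block is given.
    @formatter = formatter || ->(s) { s.strip }
  end

  # The derived value is always computed inside the object.
  def price
    @formatter.call(@raw)
  end
end

default_price = Priced.new(' $19.99 ').price
custom_price  = Priced.new(' $19.99 ') { |s| s.strip.delete('$') }.price
puts default_price
puts custom_price
```

The client never sees the raw string; it only supplies the rule for cooking it.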
Here’s another handy feature of working with REXML nodes: not only can you deal with child elements, but the parent node is accessible as well.
This method checks whether the element two levels up (the parent of the node’s parent) is a paragraph and, if so, grabs the product price from that paragraph’s text value.
def get_price
parent = @node.parent.parent
if parent.expanded_name.to_s.strip == 'p'
@price = parent.text
end
end
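As a rough, standalone illustration of walking up from an img node with REXML: the markup below is invented for the example (an img inside a link inside a paragraph that also holds the price text), so the real CafePress pages may well be shaped differently.

```ruby
require 'rexml/document'

# Hypothetical markup, not taken from a real CafePress page.
html = '<p>$19.99 <a href="/item"><img src="/22276700_F_store.jpg" alt="Ruby mug"/></a></p>'
doc  = REXML::Document.new(html)
img  = REXML::XPath.first(doc, '//img')

# img.parent is the <a> element; its parent is the <p> element.
holder = img.parent.parent

# Only treat the text as a price when the enclosing element is a paragraph.
price = holder.expanded_name == 'p' ? holder.text.to_s.strip : ''
puts price
```

Note that text returns only the first text child of the element, which is why the price has to come before the link in this sample markup.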
Another bit that once lived in the cold, cruel world of client code:
def set_full_link base_url
@p_link = base_url + '.' + self.pid
end
I wanted to build up sorted lists of products, so a comparison method was needed.
def <=> other
return -1 unless other.class == self.class
other.pid <=> @pid
end
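To see what this comparison does to a sort, here is a small sketch using a hypothetical Item class that carries only a pid and the same comparison logic. Because the operands are swapped (other.pid comes first), sort yields descending order; and since the pids are strings, the comparison is lexicographic rather than numeric.

```ruby
# Hypothetical stand-in for Product, reduced to the pid and the comparison.
class Item
  attr_reader :pid

  def initialize(pid)
    @pid = pid
  end

  # Same logic as above: objects of other classes sort after this one,
  # and pids compare in reverse.
  def <=>(other)
    return -1 unless other.class == self.class
    other.pid <=> @pid
  end
end

items  = %w[100 300 200].map { |id| Item.new(id) }
sorted = items.sort.map(&:pid)
puts sorted.inspect
```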
More data cooking.
def get_prod_id img_url
# http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg?r=632521309912518670
md = ID_RE.match( img_url )
md ? md[1] : nil
end
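A quick standalone check of the pattern against the sample URL from the comment above; the second URL is a made-up example of an image that does not follow the store naming scheme.

```ruby
ID_RE = /\/(\d+)_[A-Z]_store/

url = 'http://storetn.cafepress.com/nocache/0/22276700_F_store.jpg?r=632521309912518670'
md  = ID_RE.match(url)

# The first capture group holds the digits between the slash and "_F_store".
pid = md ? md[1] : nil
puts pid

# Hypothetical non-product image: no match, so there is no ID to extract.
miss = ID_RE.match('http://example.com/logo.gif')
puts miss.inspect
```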
Assorted accessor methods:
def pid= s
@pid = s
end
def pid
@pid
end
def alt
@alt
end
def src
@src
end
def to_s
@node.to_s
end
Serialization routines:
def to_xml
xml = XmlMarkup.new @xml_options
xml.product_image{ |x|
x.alt @alt
x.p_link( @p_link )
x.src( @src )
}
end
def to_rss1_item
xml = XmlMarkup.new @xml_options
xml.item{ |x|
x.title( @alt, 'rdf:about' => @p_link )
x.link( @p_link )
x.description( @alt + " (#{@price})")
x.image( :item, 'rdf:about' => @src )
}
end
def p_link= p_link
@p_link = p_link
end
def p_link
@p_link
end
end
There’s not a whole lot to this, which may make it a good example of a general technique, if not a particular implementation. The main idea is to encapsulate the data manipulating and reordering inside of business objects derived from one or more nodes of the parse tree. It should help isolate the rest of the code from changes in the source HTML, and make the code easier to read and maintain because the naming and behavior express the intent, not the HTML source.
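The client-side filtering mentioned earlier, where only instances that define every required property are kept, might look something like the sketch below. The Candidate struct and the choice of alt, src, and pid as the required properties are assumptions made for the example; the actual CafeFetcher client code is not shown here.

```ruby
# Hypothetical stand-in for Product with just the fields we filter on.
Candidate = Struct.new(:alt, :src, :pid)

candidates = [
  Candidate.new('Ruby mug', '/22276700_F_store.jpg', '22276700'),
  Candidate.new('', '/spacer.gif', nil),      # no alt text, no product ID
  Candidate.new('Logo', '/logo.gif', nil)     # no product ID
]

# Keep only the candidates where every required property is present.
complete = candidates.reject do |c|
  [c.alt, c.src, c.pid].any? { |v| v.nil? || v.empty? }
end

puts complete.map(&:alt).inspect
```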
Copyright 2005 © James Britt