Please forgive the conversational style of this document.  I am quite jazzed about this new technology and feel that a simple walkthrough is the best way to get you up and running quickly.

            Mike Agar, October 28, 2002

Template Based Scraping

 

A quick overview of screen scraping may be needed to bring us all to the same playing field.  MyHeadlines has historically been based on the generosity of other sites producing syndicated content in a form that is readable by this software.  The standards used to syndicate headlines and other content have been around for quite some time, and MyHeadlines is one of many different applications that make excellent use of this free information.

 

Standards like RSS, RDF, and XML have many different flavors and have been extended, tweaked, and garbled over time.  At the core of all of these standards there is a common dataset from which they diverged.  I make no use of Dublin Core extensions, or namespaces, or Winer’s latest creation.  I use three common elements to describe a source and three simple pieces of data to describe an item (a fourth, the item picture, was added in v4.1.4):

·        Site Name

·        Site Logo

·        Site Slogan

·        Item Title

·        Item Link

·        Item Description

·        Item Picture  (added in v4.1.4)

 

All of the above-mentioned standards publish this list of data elements in a common fashion, and as such I make no distinction between them.  That was the past.  Today we introduce the future:  Template Based Scraping.

 

Screen Scrapers

The existing MyHeadlines Stock Ticker has always made use of screen scraping technology to download the latest ticker information.  It accomplishes this task by pretending to be a normal user and downloading the Yahoo Financial web page.  Then it performs an “Indexed Array”[1] based scrape to capture only the information required.  This works well when you want only a single data element from a site where the content is dynamically generated, but the “Indexed Array” method fails miserably when you are seeking multiple data elements from a single web page.
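
A minimal sketch of the indexed-array idea, written in Python for illustration rather than taken from the MyHeadlines source, might look like this.  It follows the description in footnote [1]: each string in the array must be found, in order, before the data element is read.  The function name, the "end" argument that marks where the value stops, and the sample page are all assumptions made for this example.

```python
def indexed_array_scrape(html, markers, end):
    """Walk through html by matching each marker in order, then return
    the text between the last marker and the end string (hypothetical helper)."""
    pos = 0
    for marker in markers:
        found = html.find(marker, pos)
        if found == -1:
            # One of the index strings is missing, so give up.
            return None
        # Continue searching just past the marker we matched.
        pos = found + len(marker)
    stop = html.find(end, pos)
    if stop == -1:
        return None
    return html[pos:stop]

# Made-up page fragment, not real ticker markup.
page = ("<table><tr><td>Symbol</td><td>Last</td></tr>"
        "<tr><td>ACME</td><td><b>12.34</b></td></tr></table>")
price = indexed_array_scrape(page, ["<table>", "ACME", "<b>"], "</b>")
```

Each call returns one value, which fits the point made above: the approach suits a single data element and becomes awkward when many elements are wanted from the same page.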

 

Faced with the reality that my existing scraper was not up to the task of full scale headline grabbing, I set out to design a scraper that would meet my needs.  I dissected the problem as if I were doing this task by hand.  Imagine I gave you this assignment:

1.      Print out CNN’s front page

2.      Using a yellow highlighter mark all of the headlines

3.      Using a blue highlighter mark all of the associated story descriptions (if there are any)

4.      Using a green highlighter mark the URLs to each of the stories.  (Okay, this can’t be done on paper, so you’ll need to find another way to do this...)

 

Well, you’ve managed 1, 2, and 3 very easily, but step four is not visible in the printed media.  So, redo all four steps, but this time in step 1 use the source HTML for the page instead.  (Right click, view source, blah, blah, blah.  If you can’t figure out what “source” means, stop now, since it is only going to get worse.)

 

You may notice that the output from the print version attempt is quite useful in locating the code fragments within the HTML source.  This exercise is the intuitive method for solving the above problem.  You may also note that the Yellow data matches the MyHeadlines “Item Title”, Blue is the “Item Description”, and Green is the “Item Link”.  This is no accident: what you just produced is the beginning of the Scraper Template.  In an actual scraper template you would have replaced the highlighted items with variable names, and then saved the modified source HTML as the template for CNN.
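
The highlighter exercise can be mimicked in a few lines of Python.  This is only a sketch of the idea, not a MyHeadlines tool: the function name and the sample HTML are invented, and the variable names (title_1, link_1, desc_1) follow the ones used in the example template later in this document.

```python
def make_template(source_html, highlights):
    """Replace each highlighted string with a {variable} placeholder.

    highlights maps a variable name to the exact text that was marked
    in the page source (hypothetical helper for illustration).
    """
    template = source_html
    for variable, text in highlights.items():
        # Replace only the first occurrence, as a highlighter would mark one spot.
        template = template.replace(text, "{" + variable + "}", 1)
    return template

# Made-up story markup standing in for a real front page.
source = ('<a href="/story/1.html">Mayor opens new library</a>'
          '<p>The ribbon was cut on Monday.</p>')
highlights = {
    "title_1": "Mayor opens new library",        # yellow
    "desc_1": "The ribbon was cut on Monday.",   # blue
    "link_1": "/story/1.html",                   # green
}
template = make_template(source, highlights)
```

In practice a template is edited by hand, because you also want to trim away the surrounding HTML that does not help locate the content.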

 

MyHeadlines Scraper Template Syntax

A template always starts out as the original HTML source code for the site you are scraping.  The source is then modified by replacing sections of the code with variable place-holders.  A variable is denoted by curly brackets:

 

Example:  {variable_name}

 

The beginning of a template where a code fragment is replaced by a variable:

 

<html>

  <head>

      <title>{site_name}</title>

  </head>

  <body>

      . . .

 

 

You’ll notice that the variable occupies the space where the desired content is located.  When the template is applied to the site’s HTML source code, the scraper will use pattern-matching algorithms to locate and assign a value for each variable contained within the template.  The results are returned in a PHP array where the keys are the variable names and the elements are the values captured for each variable.
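
One way to picture the matching step is to turn the template into a regular expression.  The sketch below does that in Python; it is an illustration under that assumption, not the MyHeadlines PHP code, and a Python dict stands in for the PHP array.  The function name and the sample page are made up.

```python
import re

def scrape(template, html):
    """Apply a {name}-style template to html and return the captured values."""
    # re.split with a capturing group alternates literal text and names:
    # even positions are literal HTML, odd positions are variable names.
    parts = re.split(r"\{(\w+)\}", template)
    literals, names = parts[0::2], parts[1::2]
    # Each variable becomes a non-greedy wildcard between the escaped literals.
    pattern = "(.*?)".join(re.escape(text) for text in literals)
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return {}
    # A name used more than once keeps the last value captured for it.
    return dict(zip(names, match.groups()))

# Made-up page source for illustration only.
html = "<html>\n  <head>\n      <title>Example Daily News</title>\n  </head>\n  <body>"
template = "<title>{site_name}</title>"
result = scrape(template, html)
```

Note that in this sketch the literal parts of the template must match the page exactly, whitespace included, which is one more reason to keep only short, distinctive fragments of the original HTML.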

Reserved Variable Names:

The following variables are understood by MyHeadlines, and are expected.  The first three ({title_X}, {link_X}, and {desc_X}, as used in the example template below) are required and make up the list of headlines scraped.  Replace the “X” with a number between 0 and 19 in each of these.  This allows for up to 20 stories to be scraped per visit.

 

The site logo, site slogan, and site name variables are not required, but I like to have the site logo most of all.  The slogan and site name are overridden in MyHeadlines anyway, so these are more of a tool for the administrator to gauge the effectiveness of the template.

 

Hints Before We Begin:

There are a few pitfalls that first-time scrapers may wish to avoid.  First, most source HTML is repetitive, and as such is hardly useful.  I employ a technique of using a {dump} variable to hold vast quantities of HTML code that is of little use and does not help identify the content we are searching for.  This will reduce the size of your templates and make them almost readable.  Second, you should watch out for dynamic sections within the HTML source code: a date on a page is only correct in a template on that day!  Make special note of these and other things like Slashdot’s “This page generated by a flock of crazed zebra finches”.  Always replace this dynamic and changing HTML with a dummy variable so that the pattern-matching algorithms don’t get hung up looking for “Oct 8, 2002” within the page.  With these two hints in mind, let’s begin with a local newspaper site that does not produce RSS/RDF/XML content.
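
The second hint is easy to demonstrate with a short Python sketch.  It assumes a simple regex-based matcher (a stand-in for illustration, not the MyHeadlines implementation) and two made-up pages that differ only in their date and headline.

```python
import re

def scrape(template, html):
    """Regex-based stand-in for a template scraper (illustrative only)."""
    parts = re.split(r"\{(\w+)\}", template)
    literals, names = parts[0::2], parts[1::2]
    pattern = "(.*?)".join(re.escape(text) for text in literals)
    match = re.search(pattern, html, re.DOTALL)
    return dict(zip(names, match.groups())) if match else {}

# The same made-up page on two different days.
page_oct8 = "<p>Updated Oct 8, 2002</p><h1>Council approves budget</h1>"
page_oct9 = "<p>Updated Oct 9, 2002</p><h1>Bridge reopens</h1>"

# The date was left in the template as literal HTML.
brittle = "<p>Updated Oct 8, 2002</p><h1>{title_1}</h1>"
# The date was replaced with a dummy variable.
robust = "<p>Updated {dump}</p><h1>{title_1}</h1>"

same_day = scrape(brittle, page_oct8)   # matches on the day it was written
next_day = scrape(brittle, page_oct9)   # the literal date no longer matches
fixed = scrape(robust, page_oct9)       # the dummy variable absorbs the date
```

The brittle template works only on the day it was written; the robust one keeps working because the changing text is swallowed by the dummy variable.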

 

Scraper Template Example

The target site:  http://www.durhamregion.com/dr/info/index.html has 8 stories published on the main page.  Here’s the template for the page.  The key variables are the placeholders in curly brackets; everything else is raw HTML source from the site and is used to position the scraper accordingly.

 

{dump}<title>{site_name}</title>

{dump}</head>

<body marginheight="0" marginwidth="0" topmargin="0" leftmargin="0" rightmargin="0"

bgcolor="#FFFFFF">

 

<table width="100%" border="0" cellspacing="0" cellpadding="0">

<tr>

<td bgcolor="#F0F0F0" width="191"><img src="{image}" width

{dump}src="/images/dr/graphics/info/vote.gif" border="0">

{dump}</table>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_1}>{title_1}</a></div>

{dump}<div class="medblack">{desc_1}</div>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_2}>{title_2}</a></div>

{dump}<div class="medblack">{desc_2}</div>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_3}>{title_3}</a></div>

{dump}<div class="medblack">{desc_3}</div>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_4}>{title_4}</a></div>

{dump}<div class="medblack">{desc_4}</div>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_5}>{title_5}</a></div>

{dump}<div class="medblack">{desc_5}</div>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_6}>{title_6}</a></div>

{dump}<div class="medblack">{desc_6}</div>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_7}>{title_7}</a></div>

{dump}<div class="medblack">{desc_7}</div>

{dump}<td background="/images/dr/graphics/network/widgets/divider.gif"><img src="/images/dr/graphics/pixel.gif" width="5" height="5"></td>

{dump}<div><a style="font-family: Verdana, Helvetica, sans-serif; font-size: 14px; font-weight: bold;" href={link_8}>{title_8}</a></div>

{dump}<div class="medblack">{desc_8}</div>

{dump}

 

The key to success is to keep just enough of the HTML source to allow for unique identifiers to mark the locations of the content you are seeking.  A generous use of “dump” or other dummy variables will increase your ability to read the template and debug changes in the source over time.  Yes, the dump variable returns the remainder of the page from the above template, but since the “dump” name is not a reserved name within MyHeadlines, it is ignored.
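
To show what “ignored” means in practice, here is a small Python sketch that takes a scraped result and keeps only the story variables.  It assumes the item names follow the title_N, link_N, and desc_N pattern seen in the template above; the function name and the sample values are invented, and the real MyHeadlines handling may differ.

```python
import re

# Assumed naming pattern for story variables, taken from the example template.
ITEM_KEY = re.compile(r"^(title|link|desc)_(\d+)$")

def collect_stories(scraped):
    """Group title_N / link_N / desc_N values into one dict per story.

    Any other key, such as a dummy "dump" variable, is skipped.
    """
    stories = {}
    for key, value in scraped.items():
        match = ITEM_KEY.match(key)
        if match:
            field, number = match.group(1), int(match.group(2))
            stories.setdefault(number, {})[field] = value.strip()
    # Return the stories in numeric order.
    return [stories[n] for n in sorted(stories)]

# Made-up scraper output for illustration.
scraped = {
    "dump": "<td>leftover page markup</td>",
    "site_name": "Example Daily News",
    "title_2": "Bridge reopens", "link_2": "/b.html", "desc_2": "Traffic resumes.",
    "title_1": " Council approves budget ", "link_1": "/a.html", "desc_1": "Vote was 5-2.",
}
stories = collect_stories(scraped)
```

Here the dump and site_name entries never reach the story list, so dummy variables can be used as freely as the template needs.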



[1] “Indexed Array” scraping is a scraping technology I developed specifically for this purpose.  It makes use of an array of strings that must be matched before any attempt is made to locate the desired data element.  This method essentially uses the array to index into the page and position the scraper near the desired data point.  This technique is very useful for scraping sites that employ anti-scraping technology and attempt to defeat the efforts of lesser scrapers.