Scraping In The Name Of!

Interpreting The Information

Now that we have our data, we need to do something useful with it. We could use regular expressions to isolate the important bits, but a much more usable approach is to work with the Document Object Model (DOM), a language-independent interface for representing and interacting with objects in HTML and XML documents. Specifically, we will use the PHP Simple HTML DOM Parser from http://simplehtmldom.sourceforge.net/. Download the simple_html_dom.php file, throw it into the directory with your other web files, and let’s get started!

Isolate The Important HTML Elements

We need to determine which DOM elements we want to isolate. To do this, we will use Chrome’s Developer Tools (Firefox and Opera both have very similar tools built in). Simply load the page for your first location, right-click on the temperature and select “Inspect element.”

This opens the developer tools at the bottom of the window, with the selected element highlighted in the DOM tree.

If you mouse over different HTML elements, you will notice Chrome highlights those elements in blue along with some identification information. We can use this process to determine where all the information we want is located.
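Once the inspector has revealed the class names, looking them up in code is mechanical. Here is a minimal sketch of that lookup. It uses PHP's built-in DOMDocument and DOMXPath as a stand-in (not the Simple HTML DOM Parser used in the rest of this post) so it runs without any download. The markup in $sample and the helper firstTextByClass are hypothetical, loosely modelled on the class names that show up later in this post.

```php
<?php
// Hypothetical markup; the real page is far larger and may be structured differently.
$sample = '<div id="wx-main"><div class="wx-featured"><ul>'
        . '<li class="wx-temp">28 F</li>'
        . '<li class="wx-phrase">Partly Cloudy</li>'
        . '</ul></div></div>';

// Hypothetical helper: return the trimmed text of the first element
// carrying the given class, or null when no such element exists.
function firstTextByClass (DOMXPath $xpath, $class)
{
    // XPath has no class selector, so match the class as a whole word
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $nodes = $xpath->query ($query);

    if ($nodes === false || $nodes->length === 0)
    {
        return null;
    }

    return trim ($nodes->item (0)->textContent);
}

// Collect parser warnings quietly instead of printing them
libxml_use_internal_errors (true);

$dom = new DOMDocument ();
$dom->loadHTML ($sample);
$xpath = new DOMXPath ($dom);

$temp = firstTextByClass ($xpath, 'wx-temp');
$phrase = firstTextByClass ($xpath, 'wx-phrase');

echo $temp, ' / ', $phrase, "\n";
```

The class names you see in the inspector go straight into the lookup, which is the same idea as the find('li.wx-temp', 0) calls in Simple HTML DOM.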

Code:
<?php
function scrapeWebsite ($url, &$weather)
{
    // Parse the URL to retrieve the page name and the location code
    $result = preg_match ("/^.*\/weather\/(?P<page>[^\/]+)\/(?P<code>[^\/]+)$/", $url, $matches);

    // If the result from preg_match is not 1, the pattern was not found, so give up
    if ($result !== 1)
    {
        return false;
    }

    $page = $matches['page'];
    $code = $matches['code'];

    // If the code has not yet been added to the container, create it
    if (!isset($weather[$code]))
    {
        $weather[$code] = array ();
    }

    // Initialize a new session and return a cURL handle
    $crl = curl_init ();

    // Set options for cURL
    curl_setopt ($crl, CURLOPT_URL, $url); // The URL to fetch
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1); // Return the transfer as a string
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, 5); // Allow 5 seconds for connecting

    // Execute the given cURL session
    $content = curl_exec ($crl);

    // Close the cURL session
    curl_close ($crl);

    // curl_exec returns false on failure, so do not store a failed fetch
    if ($content === false)
    {
        return false;
    }

    // Store the content in the container using $page as the key
    $weather[$code][$page] = $content;

    return true;
}

$weather = array ();
$display = array ();
$urls = array (
    'http://www.weather.com/weather/right-now/USCO0357',
    'http://www.weather.com/weather/today/USCO0357',
    'http://www.weather.com/weather/right-now/USCO0105',
    'http://www.weather.com/weather/today/USCO0105'
);

foreach ($urls as $url)
{
    scrapeWebsite ($url, $weather);
}

require_once ('./simple_html_dom.php');

foreach ($weather as $code => $page)
{
    // Reset the location
    $location = null;

    foreach ($page as $key => $content)
    {
        // Create DOM from HTML string
        $html = str_get_html ($content);

        // str_get_html returns false for empty or oversized input, so skip that page
        if ($html === false)
        {
            continue;
        }

        if ($location === null)
        {
            $location = $html->find('div.wx-location-title', 0)->find('h1', 0)->plaintext;
            $display[$location] = array ();
            $display[$location][$key] = array ();
        }

        switch ($key)
        {
            case 'right-now' :
                $tmp = array ();

                // Find the div element with id 'wx-main'
                $main = $html->find('div#wx-main', 0);
                $tmp['wind'] = $main->find('div.wx-cc-wind-speed', 0)->plaintext;

                // Find the div element with class 'wx-featured'
                $featured = $main->find('div.wx-featured', 0);
                $tmp['temp'] = $featured->find('li.wx-temp', 0)->plaintext;
                $tmp['phrase'] = $featured->find('li.wx-phrase', 0)->plaintext;
                $tmp['feels'] = $featured->find('li.wx-feels', 0)->plaintext;

                $display[$location][$key] = $tmp;
                break;

            case 'today' :
                $tmp = array ();

                // Find the div element with class 'wx-12hour'
                $container = $html->find('div.wx-12hour', 0);
                $day = $container->find('div.wx-daypart', 0);
                $night = $container->find('div.wx-daypart', 1);

                // Determine if the high for the day has already been observed
                if (strpos($day->class, 'observed') !== false)
                {
                    $text = $day->find('p.wx-observed', 0)->innertext;

                    $result = preg_match ('/^[a-zA-Z\' ]+(?P<temp>-?\d+<sup>[^<]+<\/sup>)(.*?)\bwere (?P<phrase>.*)$/', $text, $matches);

                    if ($result !== 1)
                    {
                        $tmp['high'] = 'N/A';
                        $tmp['high-phrase'] = 'Error getting High';
                    }
                    else
                    {
                        $tmp['high'] = $matches['temp'];
                        $tmp['high-phrase'] = $matches['phrase'];
                    }
                }
                else
                {
                    // Blank out the label span so only the temperature remains
                    $high = $day->find('p.wx-temp', 0);
                    $high->find('span.wx-label', 0)->outertext = '';
                    $tmp['high'] = $high->innertext;
                    $tmp['high-phrase'] = $day->find('p.wx-phrase', 0)->innertext;
                }

                // Blank out the label span and append the unit letter to the degree sign
                $low = $night->find('p.wx-temp', 0);
                $low->find('span.wx-label', 0)->outertext = '';
                $unit = $low->find('sup', 0);
                $unit->innertext = $unit->innertext . 'F';
                $tmp['low'] = $low->innertext;
                $tmp['low-phrase'] = $night->find('p.wx-phrase', 0)->innertext;

                $display[$location][$key] = $tmp;
                break;
        }
    }
}

print_r ($display);

?>
Result:
Array ( 
    [ Silverthorne Weather ] => Array ( 
        [right-now] => Array ( 
            [wind] => 4 mph 
            [temp] => 28 °F 
            [phrase] => Partly Cloudy 
            [feels] => Feels like 23 °F 
        ) 
        [today] => Array ( 
            [high] => 29°F 
            [high-phrase] => Sunny 
            [low] => 16°F 
            [low-phrase] => Snow Shower 
        ) 
    ) 
    [ Denver Weather ] => Array ( 
        [right-now] => Array ( 
            [wind] => 1 mph 
            [temp] => 39 °F 
            [phrase] => Partly Cloudy 
            [feels] => Feels like 39 °F 
        ) 
        [today] => Array ( 
            [high] => 45°F 
            [high-phrase] => Partly Cloudy 
            [low] => 32°F 
            [low-phrase] => Partly Cloudy 
        ) 
    ) 
)

Looks good! Let's go through the regular expression that parses the observed high in greater detail before we move on to creating a page to display the info.

Regex:
  • Beginning of line or string
  • Any character in this class: [a-zA-Z' ], one or more repetitions
  • [temp]: A named capture group. [-?\d+<sup>[^<]+</sup>]
    • - zero or one repetitions
    • Any digit, one or more repetitions
    • The literal string <sup>
    • Any character that is NOT in this class: [<], one or more repetitions
    • The literal string </sup>
  • [2]: A numbered capture group. [.*?] (PHP numbers named groups too, so [temp] is also available as [1])
    • Any character, any number of repetitions, as few as possible
  • \bwere followed by a Space
    • Word boundary: the position between a word character and a non-word character
    • The literal string were
    • Space
  • [phrase]: A named capture group. [.*]
    • Any character, any number of repetitions
  • End of line or string
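Named capture groups such as [temp] and [phrase] end up in $matches under their own names. The URL pattern at the top of scrapeWebsite relies on the same mechanism, and it is short enough to try on its own. Here is a small sketch. The only liberty taken is using '#' as the delimiter so the slashes need no escaping, which should not change what the pattern matches.

```php
<?php
// Same pattern as in scrapeWebsite, written with '#' delimiters
$pattern = '#^.*/weather/(?P<page>[^/]+)/(?P<code>[^/]+)$#';
$url = 'http://www.weather.com/weather/right-now/USCO0357';

$page = null;
$code = null;

// preg_match returns 1 when the pattern matched
if (preg_match ($pattern, $url, $matches) === 1)
{
    $page = $matches['page']; // text captured by (?P<page>...)
    $code = $matches['code']; // text captured by (?P<code>...)
}

echo $page, ' ', $code, "\n";
```

For this URL, $page comes out as right-now and $code as USCO0357, which are exactly the two keys used to file the content in $weather.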

Let's translate that into English!

Regex | English
^[a-zA-Z' ]+ | Starting at the beginning of the string, match one or more letters, apostrophes, and spaces
(?P<temp>-?\d+<sup>[^<]+</sup>) | Match one or more digits, including a negative sign if present, followed by the string '<sup>', followed by one or more characters that are NOT a less-than sign '<', and ending with the string '</sup>'; capture as a group named 'temp'
(.*?)\bwere | Match as few characters as possible until reaching the word 'were' followed by a Space
(?P<phrase>.*)$ | Match any characters until the end of the string and capture as a group named 'phrase'
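To see the pattern in action, here is a small sketch that runs it against a made-up sentence. The wording in $text is an assumption, since the real text on the page is not shown in this post, and parseObserved is a hypothetical helper that mirrors the if/else inside the 'today' case.

```php
<?php
// The pattern from the 'today' case, unchanged
$pattern = '/^[a-zA-Z\' ]+(?P<temp>-?\d+<sup>[^<]+<\/sup>)(.*?)\bwere (?P<phrase>.*)$/';

// Hypothetical input; the actual wording on the page may differ
$text = "Today's high was 29<sup>&deg;F</sup> and conditions were Sunny";

// Hypothetical helper: returns the same keys the scraper fills in
function parseObserved ($pattern, $text)
{
    if (preg_match ($pattern, $text, $matches) !== 1)
    {
        // Same fallback values as in the scraper
        return array ('high' => 'N/A', 'high-phrase' => 'Error getting High');
    }

    return array ('high' => $matches['temp'], 'high-phrase' => $matches['phrase']);
}

$parsed = parseObserved ($pattern, $text);
$failed = parseObserved ($pattern, 'No temperature here');

print_r ($parsed);
print_r ($failed);
```

The first call captures the temperature together with its <sup> markup, which is why the high renders nicely in a browser. The second call has no digits to match, so it falls back to the 'N/A' values.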

Now let's make a cool display for all this sweet scraped data!


