Scraping In The Name Of!

Interpreting The Information

Now that we have our data, we need to do something useful with it. We could use regular expressions to isolate the important bits, but a more usable approach is to work with the Document Object Model (DOM), a language-independent interface for representing and interacting with the objects in HTML and XML documents. Specifically, we will use the PHP Simple HTML DOM Parser from http://simplehtmldom.sourceforge.net/. Download the simple_html_dom.php file, put it in the same directory as your other web files, and let’s get started!

Isolate The Important HTML Elements

We need to determine which DOM elements we want to isolate. To do this, we will use Chrome’s Developer Tools (Firefox and Opera both have very similar tools built in). Simply load the page for your first location, right-click on the temperature and select “Inspect element.”

This opens Developer Tools at the bottom of the window, showing the element within the DOM tree.

If you mouse over different HTML elements, you will notice Chrome highlights those elements in blue along with some identification information. We can use this process to determine where all the information we want is located.
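Once the inspector has told you an element's tag, id, and class, pulling its text out of the markup is a one-line query. The sketch below is only an illustration: it uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM so that it runs without any download, and the $sample markup is a hypothetical, simplified stand-in for the real page.

```php
<?php
    // Hypothetical markup, shaped like the elements the inspector reveals
    $sample = '<div id="wx-main"><div class="wx-featured"><ul>'
            . '<li class="wx-temp">28 F</li>'
            . '<li class="wx-phrase">Partly Cloudy</li>'
            . '</ul></div></div>';

    // Build a DOM tree from the HTML string
    $doc = new DOMDocument ();
    $doc->loadHTML ($sample);

    // Find li elements whose class attribute is exactly 'wx-temp' inside div#wx-main
    $xpath = new DOMXPath ($doc);
    $nodes = $xpath->query ('//div[@id="wx-main"]//li[@class="wx-temp"]');

    // Guard against a missing element before reading its text
    $temp = ($nodes->length > 0) ? trim ($nodes->item (0)->textContent) : null;

    echo $temp, "\n";
```

With Simple HTML DOM, the same lookup is written with a CSS-style selector, as in the find('li.wx-temp', 0) calls in the full listing.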

Code:

<?php
    function scrapeWebsite ($url, &$weather)
    {
        // Parse the URL to retrieve the city name and page
        $result = preg_match ("/^.*\/weather\/(?P<page>[^\/]+)\/(?P<code>[^\/]+)$/", $url, $matches);
 
        // preg_match returns 1 on a match; anything else means the URL did not fit the pattern, so bail out
        if ($result !== 1)
        {
            return false;
        }
        else
        {
            $page = $matches['page'];
            $code = $matches['code'];
        }
 
        // If the code has not yet been added to the container, create it
        if (!isset($weather[$code]))
        {
            $weather[$code] = array ();
        }
 
        // Initialize a new session and return a cURL handle
        $crl = curl_init ();
 
        // Set options for cURL
        curl_setopt ($crl, CURLOPT_URL, $url); // The URL to fetch
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1); // Return the transfer as a string
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, 5); // Allow 5 seconds for connecting
 
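        // Execute the given cURL session
        $content = curl_exec ($crl);
 
        // Close the cURL session
        curl_close ($crl);
 
        // If the request failed, do not store anything for this page
        if ($content === false)
        {
            return false;
        }
 
        // Store the content in the container using $page as the key
        $weather[$code][$page] = $content;
 
        return true;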
    }
 
    $weather = array ();
    $display = array ();
    $urls = array ('http://www.weather.com/weather/right-now/USCO0357', 'http://www.weather.com/weather/today/USCO0357', 'http://www.weather.com/weather/right-now/USCO0105', 'http://www.weather.com/weather/today/USCO0105');
 
    foreach ($urls as $url)
    {
        scrapeWebsite ($url, $weather);
    }
 
    require_once ('./simple_html_dom.php');
 
    foreach ($weather as $code => $page)
    {
        // Reset the location
        $location = null;
 
        foreach ($page as $key => $content)
        {
            // Create DOM from HTML string
            $html = str_get_html ($content);
 
            // Skip this page if the HTML could not be parsed
            if (!$html)
            {
                continue;
            }
 
            if ($location === null)
            {
                $location = $html->find('div.wx-location-title', 0)->find('h1', 0)->plaintext;
                $display[$location] = array ();
                $display[$location][$key] = array ();
            }
 
            switch ($key)
            {
                case 'right-now' :
                    $tmp = array ();
 
                    // Find the div element with id 'wx-main'
                    $main = $html->find('div#wx-main', 0);
                    $tmp['wind'] = $main->find('div.wx-cc-wind-speed', 0)->plaintext;
 
                    // Find the div element with class 'wx-featured'
                    $featured = $main->find('div.wx-featured', 0);
                    $tmp['temp'] = $featured->find('li.wx-temp', 0)->plaintext;
                    $tmp['phrase'] = $featured->find('li.wx-phrase', 0)->plaintext;
                    $tmp['feels'] = $featured->find('li.wx-feels', 0)->plaintext;
 
                    $display[$location][$key] = $tmp;
                break;
                case 'today' :
                    $tmp = array ();
 
                    // Find the div element with class 'wx-12hour'
                    $container = $html->find('div.wx-12hour', 0);
                    $day = $container->find('div.wx-daypart', 0);
                    $night = $container->find('div.wx-daypart', 1);
 
                    // Determine if the high for the day has already been observed
                    if (strpos($day->class, 'observed') !== false)
                    {
                        $text = $day->find('p.wx-observed', 0)->innertext;
 
                        $result = preg_match ('/^[a-zA-Z\' ]+(?P<temp>-?\d+<sup>[^<]+<\/sup>)(.*?)\bwere (?P<phrase>.*)$/', $text, $matches);
 
                        if ($result !== 1)
                        {
                            $tmp['high'] = 'N/A';
                            $tmp['high-phrase'] = 'Error getting High';
                        }
                        else
                        {
                            $tmp['high'] = $matches['temp'];
                            $tmp['high-phrase'] = $matches['phrase'];
                        }
                    }
                    else
                    {
                        $high = $day->find('p.wx-temp', 0);
                        $high->find('span.wx-label', 0)->outertext = '';
                        $tmp['high'] = $high->innertext;
                        $tmp['high-phrase'] = $day->find('p.wx-phrase', 0)->innertext;
                    }
 
                    $low = $night->find('p.wx-temp', 0);
                    $low->find('span.wx-label', 0)->outertext = '';
                    $unit = $low->find('sup', 0);
                    $unit->innertext = $unit->innertext . 'F';
                    $tmp['low'] = $low->innertext;
                    $tmp['low-phrase'] = $night->find('p.wx-phrase', 0)->innertext;
 
                    $display[$location][$key] = $tmp;
 
                break;
            }
        }
    }
 
    print_r ($display);
 
?>

Result:

Array ( 
    [ Silverthorne Weather ] => Array ( 
        [right-now] => Array ( 
            [wind] => 4 mph 
            [temp] => 28 °F 
            [phrase] => Partly Cloudy 
            [feels] => Feels like 23 °F 
        ) 
        [today] => Array ( 
            [high] => 29°F 
            [high-phrase] => Sunny 
            [low] => 16°F 
            [low-phrase] => Snow Shower 
        ) 
    ) 
    [ Denver Weather ] => Array ( 
        [right-now] => Array ( 
            [wind] => 1 mph 
            [temp] => 39 °F 
            [phrase] => Partly Cloudy 
            [feels] => Feels like 39 °F 
        ) 
        [today] => Array ( 
            [high] => 45°F 
            [high-phrase] => Partly Cloudy 
            [low] => 32°F 
            [low-phrase] => Partly Cloudy 
        ) 
    ) 
)

Looks good! Let's go through the regular expression in greater detail before we move on to creating a page to display the info.

Regex:

Beginning of line or string
Any character in this class: [a-zA-Z' ], one or more repetitions
[temp]: A named capture group. [-?\d+<sup>[^<]+</sup>]
- -?\d+<sup>[^<]+</sup>
  - A literal -, zero or one repetitions
  - Any digit, one or more repetitions
  - <
  - s
  - u
  - p
  - >
  - Any character that is NOT in this class: [<], one or more repetitions
  - <
  - /
  - s
  - u
  - p
  - >
[1]: A numbered capture group. [.*?]
- Any character, any number of repetitions, as few as possible
\bwere followed by a space
- A word boundary (the position at the start or end of a word)
- w
- e
- r
- e
- Space
[phrase]: A named capture group. [.*]
- Any character, any number of repetitions
End of line or string

Let's translate that into English!

Regex	English
^[a-zA-Z' ]+	Starting at the beginning of the string, match one or more letters, apostrophes, or spaces
(?P<temp>-?\d+<sup>[^<]+</sup>)	Match an optional minus sign followed by one or more digits, then the string '<sup>', then one or more characters that are NOT a less-than sign '<', then the string '</sup>', and capture the whole match as a group named 'temp'
(.*?)\bwere 	Match as few characters as possible until reaching the word 'were' followed by a space
(?P<phrase>.*)$	Match any characters until the end of the string and capture them as a group named 'phrase'
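To see the pattern at work, here is a minimal sketch that runs it against a made-up string. The wording of $text is an assumption about what the "observed" paragraph might contain, not text copied from the site; only the pattern itself comes from the script above.

```php
<?php
    // Hypothetical inner HTML of the "observed" paragraph (assumed wording)
    $text = "Today's high was 29<sup>&deg;</sup> and conditions were Sunny";

    // Same pattern as in the scraper; the forward slash in </sup> is escaped
    // because / is also the pattern delimiter
    $pattern = '/^[a-zA-Z\' ]+(?P<temp>-?\d+<sup>[^<]+<\/sup>)(.*?)\bwere (?P<phrase>.*)$/';

    $result = preg_match ($pattern, $text, $matches);

    if ($result === 1)
    {
        // Named groups are available by name in $matches
        echo $matches['temp'], "\n";
        echo $matches['phrase'], "\n";
    }
    else
    {
        echo "No match\n";
    }
```

For this sample, the 'temp' group holds the number together with its sup tag, and the 'phrase' group holds everything after "were ". If the page's wording changes so that the pattern no longer fits, preg_match returns 0 and the scraper falls back to its 'N/A' values.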

Now let's make a cool display for all this sweet scraped data!
