Scraping In The Name Of!

Extracting The Data

We will be using the Client URL Library (cURL) to do our web scraping. You can view the documentation at http://php.net/manual/en/book.curl.php. This library allows us to communicate with many different types of servers using many different protocols. We are interested in connecting to weather.com's Apache web server via HTTP, and cURL supports both, so we are golden!
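Since cURL is a PHP extension that may not be enabled on every installation, it's worth checking for it before going further. A minimal sketch, assuming nothing beyond the standard extension_loaded() and phpversion() functions:

```php
<?php
// Check that the cURL extension is available before attempting to scrape.
if (extension_loaded('curl'))
{
    // phpversion('curl') returns the extension's version string.
    echo "cURL extension version: " . phpversion('curl') . "\n";
}
else
{
    echo "The cURL extension is not installed or enabled.\n";
}
```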

Now that we've found our method of extraction, let's begin writing the code to do the work. Let's begin by writing the code necessary to extract information for current conditions of our first location, Silverthorne:

Code:
  <?php
  $url = "http://www.weather.com/weather/right-now/Silverthorne+CO+USCO0357";
  $crl = curl_init ();

  curl_setopt ($crl, CURLOPT_URL, $url);
  curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, 5);

  $content = curl_exec ($crl);
  curl_close ($crl);
  ?>
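One caveat before moving on: curl_exec() can fail. With CURLOPT_RETURNTRANSFER set, it returns the response body as a string on success and false on failure, so a more cautious version of the snippet above might check for that. The error handling here is a sketch added for illustration, not part of the original code:

```php
<?php
$url = "http://www.weather.com/weather/right-now/Silverthorne+CO+USCO0357";
$crl = curl_init ();

curl_setopt ($crl, CURLOPT_URL, $url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, 5);

$content = curl_exec ($crl);

// With CURLOPT_RETURNTRANSFER, curl_exec() returns a string on success
// and false on failure; curl_error() describes what went wrong.
if ($content === false)
{
    echo "Request failed: " . curl_error ($crl) . "\n";
}

curl_close ($crl);
```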

If you print out the variable $content, you will see the HTML from the URL you loaded. Since we will have to run this same code for all the pages from which we want to extract information, we should put it in a function, passing the URL as a parameter:

Code:
  <?php
  function scrapeWebsite ($url)
  {
      $crl = curl_init ();

      curl_setopt ($crl, CURLOPT_URL, $url);
      curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
      curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, 5);

      $content = curl_exec ($crl);
      curl_close ($crl);

      return $content;
  }
  ?>

Now we have a function ready to scrape any pages we want. Next, we need to pass each of the URLs to our function and store the scraped data so we can interpret and manipulate it. We could build an array to store the data and pass in each URL as follows:

Code:
  1. <?php
  2. function scrapeWebsite ($url)
  3. {
  4. $crl = curl_init ();
  5.  
  6. curl_setopt ($crl, CURLOPT_URL, $url);
  7. curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
  8. curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, 5);
  9.  
  10. $content = curl_exec ($crl);
  11. curl_close ($crl);
  12.  
  13. return $content;
  14. }
  15.  
  16. $weather = array ();
  17. $weather['silverthorne'] = array (
  18. 'current' => array (
  19. 'url' => 'http://www.weather.com/weather/right-now/Silverthorne+CO+USCO0357'
  20. 'content' => ""
  21. ),
  22. 'forecast' => array (
  23. 'url' => 'http://www.weather.com/weather/today/Silverthorne+CO+USCO0357:1:US'
  24. 'content' => ""
  25. )
  26. );
  27.  
  28. $weather['silverthorne']['current']['content'] = scrapeWebsite($weather['silverthorne']['current']['url']);
  29. $weather['silverthorne']['forecast']['content'] = scrapeWebsite($weather['silverthorne']['forecast']['url']);
  30. ?>

This would work, and we could add as many cities as we'd like, but it would also mean some manual work every time we wanted to add or delete a city. Instead, let's make the software do the work for us! Let's modify our function so all we need to do is pass it a URL and our storage array and it will do the rest. Warning, fancy thinking ahead!

If you look at our URLs, each city has a code associated with it. For example, Silverthorne’s code is USCO0357. Weather.com accepts this code alone in its URL (as opposed to the entire “Silverthorne+CO+USCO0357:1:US” string), making it possible for us to create a standard pattern for our URLs. Let’s split up the URL for current conditions in Silverthorne to recreate an array similar to how we planned our original one:

http://www.weather.com/weather/  right-now  /  USCO0357
                                   $page         $code
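Going the other way, a URL in this pattern can be assembled from its two parts with sprintf(). The helper name below (buildWeatherUrl) is just for illustration and is not part of the tutorial's code:

```php
<?php
// Hypothetical helper: build a URL following the
// http://www.weather.com/weather/$page/$code pattern.
function buildWeatherUrl ($page, $code)
{
    return sprintf ("http://www.weather.com/weather/%s/%s", $page, $code);
}

echo buildWeatherUrl ("right-now", "USCO0357") . "\n";
// http://www.weather.com/weather/right-now/USCO0357
```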

So our array will have an entry for each location code, and each location code will have an entry for each page from which we want data. We will use regular expressions to parse each URL for the $page and $code. You can learn more about regular expressions in PHP at http://php.net/manual/en/book.pcre.php.

Please note we MUST escape slashes '/' with backslashes '\' inside our pattern when using preg_match(), since we are using the slash '/' as the delimiter that marks the beginning and end of the pattern.

Code:
  <?php
  function scrapeWebsite ($url, &$weather)
  {
      // Parse the URL to retrieve the page name and location code
      $result = preg_match ("/^.*\/weather\/(?P<page>[^\/]+)\/(?P<code>[^\/]+)$/", $url, $matches);

      // If the result from preg_match is not 1, the pattern was not found so return
      if ($result !== 1)
      {
          return false;
      }
      else
      {
          $page = $matches['page'];
          $code = $matches['code'];
      }

      // If the code has not yet been added to the container, create it
      if (!isset($weather[$code]))
      {
          $weather[$code] = array ();
      }

      // Initialize a new session and return a cURL handle
      $crl = curl_init ();

      // Set options for cURL
      curl_setopt ($crl, CURLOPT_URL, $url); // The URL to fetch
      curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1); // Return the transfer as a string
      curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, 5); // Allow 5 seconds for connecting

      // Execute the given cURL session
      $content = curl_exec ($crl);

      // Store the content in the container using $page as the key
      $weather[$code][$page] = $content;

      // Close the cURL session
      curl_close ($crl);
  }

  $weather = array ();
  $urls = array (
      'http://www.weather.com/weather/right-now/USCO0357',
      'http://www.weather.com/weather/today/USCO0357',
      'http://www.weather.com/weather/right-now/USCO0105',
      'http://www.weather.com/weather/today/USCO0105'
  );

  foreach ($urls as $url)
  {
      scrapeWebsite ($url, $weather);
  }
  ?>

If you print out the contents of the $weather array now, it will show the HTML for each of the four pages we supplied.

Please note, your URLs MUST match the pattern http://www.weather.com/weather/$page/$code for this to work.

Result:
Array (
        [USCO0357] => Array (
                [right-now] => (HTML)
                [today] => (HTML)
        )
        [USCO0105] => Array (
                [right-now] => (HTML)
                [today] => (HTML)
        )
)
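A dump like the one above can be produced with print_r($weather). Since each entry holds a full page of HTML, a sketch that prints only the structure and content sizes can be easier to read; the $weather array here is hand-filled with placeholder strings standing in for the scraped HTML:

```php
<?php
// A hand-filled stand-in for the $weather array built by scrapeWebsite(),
// with placeholder strings where the scraped HTML would be.
$weather = array (
    'USCO0357' => array ('right-now' => '<html>...</html>', 'today' => '<html>...</html>'),
    'USCO0105' => array ('right-now' => '<html>...</html>', 'today' => '<html>...</html>')
);

// Print one line per location/page pair instead of dumping the raw HTML.
foreach ($weather as $code => $pages)
{
    foreach ($pages as $page => $content)
    {
        printf ("[%s][%s] => %d bytes of HTML\n", $code, $page, strlen ($content));
    }
}
```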

Success! The beauty of doing it this way is that we can add as many locations as we'd like, as well as any pages we want (5-day forecast, 10-day forecast, etc.). As we have done in the past, let's look at the regular expression in greater detail.

Regex:
  • ^.*
    • Beginning of the string
    • Any character, any number of repetitions
  • /weather/
    • The literal string "/weather/"
  • (?P<page>[^/]+): A named capture group 'page'
    • Any character that is NOT in this class: [/], one or more repetitions
  • /
    • A literal slash '/'
  • (?P<code>[^/]+)$: A named capture group 'code'
    • Any character that is NOT in this class: [/], one or more repetitions
    • End of the string

This regex is pretty simple but let’s translate it to English to be absolutely clear what is happening.

Regex               English
^.*/weather/        Starting at the beginning of the string, match any characters until reaching a slash '/' followed by the string "weather" and another slash '/'
(?P<page>[^/]+)/    Match any characters that are NOT a slash '/' until reaching a slash '/' and capture them as a group named 'page'
(?P<code>[^/]+)$    Match any characters that are NOT a slash '/' until reaching the end of the string and capture them as a group named 'code'
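To double-check the pattern, you can run it against a sample URL on its own. preg_match() fills $matches with the named groups (PHP also stores the same values under numeric indices):

```php
<?php
$url = "http://www.weather.com/weather/right-now/USCO0357";

// Named groups are available in $matches under their names.
$result = preg_match ("/^.*\/weather\/(?P<page>[^\/]+)\/(?P<code>[^\/]+)$/", $url, $matches);

if ($result === 1)
{
    echo $matches['page'] . "\n"; // right-now
    echo $matches['code'] . "\n"; // USCO0357
}
```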

Now let’s do something useful with that data!


