Introduction To Web Scraping With Python: Part 2


Icheka Ozuru

Posted on January 15, 2021


“If programming is magic, then web scraping is surely a form of wizardry.”

Web Scraping with Python, Ryan Mitchell.

This is article 2 of 4. If you have not read article 1, or if you do not have foundational knowledge of web scraping with Python, please read the first article before continuing.

In the first article of the series, I gave an introduction to the meaning, applications and basic principles of web scraping. I also introduced the Python programming language as a robust general-purpose language that is very well suited for data retrieval and handling tasks like web scraping. I briefly discussed the BeautifulSoup library and shared a few illustrative scripts that demonstrate its basic syntax. In addition, I explained the basic mechanics of networking -- creating requests and receiving responses -- as well as the foundational principles behind server design.

In this article, I am going to take things up a notch or several. I will provide a more in-depth analysis of the urllib and BeautifulSoup libraries (their syntax, methods and properties), the errors and exceptions that can arise during data retrieval, and robust error-handling approaches to designing and implementing scrapers. I will go through easy-to-follow steps to creating reliable web scrapers, as well as how to store the data they retrieve for future analysis. Throughout the article, I will share code written to implement real-world solutions to practical problems. At the end of this article, assuming that you have read and completely understood the previous one (read article 1 here), you will be able to tackle real-world problems with simple scrapers that you can depend on to retrieve and store your data for you.

The First Rule Of Web Scraping

Always look for patterns and target specific attributes.

The First Rule, and possibly the most important, is to always inspect the data (e.g. a web document) for patterns that you can target using attributes in the markup, and to write scrapers that target those attributes.

The first URL we will scrape for data is a simple online store. The web document contains information about all available products (product name, price in USD, description, and rating). We will begin by learning to quickly extract information about patterns in markup. Then we will learn to design our scraper to extract the needed information based on the attributes our patterns offer.

A quick and easy way to view the markup for a page is to request the document, parse it using BeautifulSoup, and use BeautifulSoup's prettify() method to indent it so that it is easy for humans to read.

# 1.py
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://webscraper.io/test-sites/e-commerce/allinone')
bsoup = BeautifulSoup(html.read(), 'html.parser')
print(bsoup.prettify())

This script should give output similar to:

<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- Anti-flicker snippet (recommended)  -->
  <style>
   .async-hide {
        opacity: 0 !important
    }
  </style>
  <script>
   (function (a, s, y, n, c, h, i, d, e) {
        s.className += ' ' + y;
        h.start = 1 * new Date;
        h.end = i = function () {
            s.className = s.className.replace(RegExp(' ?' + y), '')
        };
        (a[n] = a[n] || []).hide = h;
        setTimeout(function () {
            i();
            h.end = null
        }, c);
        h.timeout = c;
    })(window, document.documentElement, 'async-hide', 'dataLayer', 4000,
        {'GTM-NVFPDWB': true});
  </script>
  <!-- Google Tag Manager -->
  <script>
   (function (w, d, s, l, i) {
        w[l] = w[l] || [];
        w[l].push({
            'gtm.start':
                new Date().getTime(), event: 'gtm.js'
        });
        var f = d.getElementsByTagName(s)[0],
            j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
        j.async = true;
        j.src =
            'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
        f.parentNode.insertBefore(j, f);
    })(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');
  </script>
  <!-- End Google Tag Manager -->
  <title>
   Web Scraper Test Sites
  </title>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords">
   <meta content="The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to download, no coding needed." name="description">
    <link href="/favicon.png" rel="icon" sizes="128x128"/>
    <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
    <link href="/css/app.css?id=e4de8db16b64e604046e" rel="stylesheet"/>
    <link href="https://webscraper.io/test-sites/e-commerce/allinone" rel="canonical"/>
    <link href="/img/logo-icon.png" rel="apple-touch-icon"/>
    <script defer="" src="/js/app.js?id=e64f07a3ce1d466cf04c">
    </script>
   </meta>
  </meta>
 </head>
<!--- content truncated due to length -->

This might not make much sense unless you come from a web development background or understand HTML. However, we do not need to concern ourselves with this section of the webpage just yet; the juicy parts come much later.
Notice how prettify() 'prettifies' the HTML, indenting it nicely and making it much easier to follow, understand, and search for patterns. Without the prettify() function, we would have a mishmash of tags and words that might not make much sense to us at all:

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\n\t\t\t<!-- Anti-flicker snippet (recommended)  -->\n<style>.async-hide {\n\t\topacity: 0 !important\n\t} </style>\n<script>(function (a, s, y, n, c, h, i, d, e) {\n\t\ts.className += \' \' + y;\n\t\th.start = 1 * new Date;\n\t\th.end = i = function () {\n\t\t\ts.className = s.className.replace(RegExp(\' ?\' + y), \'\')\n\t\t};\n\t\t(a[n] = a[n] || []).hide = h;\n\t\tsetTimeout(function () {\n\t\t\ti();\n\t\t\th.end = null\n\t\t}, c);\n\t\th.timeout = c;\n\t})(window, document.documentElement, \'async-hide\', \'dataLayer\', 4000,\n\t\t{\'GTM-NVFPDWB\': true});</script> ...'
(raw output truncated due to length)

If you run the first script (and you should follow the exercises as you read!) you will be able to scroll down and see something similar to:

      <h2>
       Top items being scraped right now
      </h2>
       <div class="col-sm-4 col-lg-4 col-md-4">
        <div class="thumbnail">
         <img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/>
         <div class="caption">
          <h4 class="pull-right price">
           $1260.13
          </h4>
          <h4>
           <a class="title" href="/test-sites/e-commerce/allinone/product/616" title='Apple MacBook Air 13"'>
            Apple MacBook Ai...
           </a>
          </h4>
          <p class="description">
           Apple MacBook Air 13", i5 1.8GHz, 8GB, 256GB SSD, Intel HD 6000, ENG
          </p>
         </div>
         <div class="ratings">
          <p class="pull-right">
           8 reviews
          </p>
          <p data-rating="4">
           <span class="glyphicon glyphicon-star">
           </span>
           <span class="glyphicon glyphicon-star">
           </span>
           <span class="glyphicon glyphicon-star">
           </span>
           <span class="glyphicon glyphicon-star">
           </span>
          </p>
         </div>
        </div>

The header for this section of the webpage (designated "h2") clearly tells us: "Top items being scraped right now". This is obviously a play on "Top items being searched right now" or something similar. What is the takeaway here? Websites and web documents often have "pointers" or cues that can tell us where to find the data we are looking for. Seeing "Top items being scraped right now" in an online store is a cue as good as any that says: "Here come the products!".

Another Serving of BeautifulSoup
So far we have gotten by with very limited vocabulary, and that's fine. However, the BeautifulSoup object provides a whole suite of methods and properties that make sifting through markup as easy as tasting soup. I will gradually introduce the most important of these attributes.

How might we extract the name of every character in http://www.pythonscraping.com/pages/warandpeace.html?
"Oh. I wouldn't know", you might say.
And you wouldn't be wrong. You might not always know right away how to extract the data you need, but you can certainly analyze the markup of an HTML document and find a reliable pattern. While this is entirely possible using the first script above, there is another, faster, easier way that spares you the strain of staring intently at the default white screen of the Python IDLE editor: the developer console that comes pre-packaged with most modern browsers. Most likely, in the browser you are reading this article with, this markup inspector can be accessed by right-clicking the text or media that you want to inspect and clicking "Inspect". (Most browsers also support the keyboard shortcut CTRL+SHIFT+I, or CMD+SHIFT+I on Macs.)

So, there are two ways for you to inspect the markup of the War and Peace webpage:

  1. Using BeautifulSoup's prettify() method.
  2. Using the "Inspect" option of your browser's Developer Tools console.
<h1>
   War and Peace
  </h1>
  <h2>
   Chapter 1
  </h2>
  <div id="text">
   "
   <span class="red">
    Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
   </span>
   "
   <p>
   </p>
   It was in July, 1805, and the speaker was the well-known
   <span class="green">
    Anna
Pavlovna Scherer
   </span>
   , maid of honor and favorite of the
   <span class="green">
    Empress Marya
Fedorovna
   </span>

Carefully analyzing the War and Peace markup might reveal an interesting pattern:

  1. Quoted text is always found within a pair of span tags with a class attribute of "red".
  2. Character names are always found within a pair of span tags with a class attribute of "green".

Didn't notice it? Look again.

In fact, if you simply navigate to the URL in your browser, the rendered page itself makes the pattern plain.

You will easily spot the pattern: quoted text in red, character names in green.

Now that you have spotted a pattern in your data, it is time to design an algorithm for extracting the information you want.

Algorithms

What Are Algorithms?

And, more importantly, can we avoid using them?

Because they are a cross-breed of unfriendly Mathematics and hardcore computer programming, right?
Right?
Wrong.

While many students of programming fear algorithms, it is quite impossible to do any form of impactful programming without an implementation of some algorithm.

Why?
The reason lies in the fact that, underneath all the hype computer systems get from the media (such as the hype around self-driving cars), computer programs (and computers generally) are really dumb things! Sometime in the future, our programs might evolve to the point where they are able to parallel human intelligence, but that future is still so far away. We can catch glimpses of such a future in movies like The Matrix, but this is about all we can do.
In order for a computer program to solve a problem, it has to follow a series of steps and logical thought-flows. Such a series of steps or flows of thought, leading a computer program from the beginning of its execution to its arrival at a solution (even the wrong solution), is known as an algorithm.

Algorithms: Where Computer Science meets Mathematics?
No. Algorithms aren't the product of a collision between Mathematics and any form of Computer Science. That is not how it works, and the misconceptions that exist around the A-word are largely to blame for the fear many students of computer programming have of algorithms.
As I tried to explain earlier, an algorithm is a logical sequence of steps to take (aka subtasks to complete) in order to arrive at the solution to a problem.
You may think of algorithms as a "trail" of rewards that you lay on the ground in order to coax a pet to follow you.

If you need a more extensive introduction to algorithms, I recommend this article at HowStuffWorks.
For professional-level in-depth study of computer algorithms, I recommend the following books, in increasing order of difficulty:

  1. Fundamentals of Computer Algorithms, Ellis Horowitz and Sartaj Sahni.
  2. Computational Thinking - First Algorithms, Then Code, Fabrizio Luccio and Paolo Ferragina.
  3. Graphs, Networks and Algorithms (Algorithms and Computation in Mathematics), Dieter Jungnickel.
  4. The Art Of Computer Programming, Donald Knuth.

Why Are Algorithms Important To Productive Web Scraping?
First, programming is, more than anything else, a tool for problem-solving. And web scraping happens to solve a very ubiquitous, very troublesome problem: how to extract structured data, in an automated and timely manner, from otherwise inaccessible data sources.
The experienced programmer understands that "coding" is only about a fifth of the job. Programming is a skill, a tool, for solving problems. There is no other purpose. This reason is sufficient.
In the case of War and Peace, our problem is extracting the name of every character in the data source. The problem is compounded by the fact that several different forms of data exist in the webpage. How does one extract data from data? By finding patterns that enable us to define a structure, an expectation of sanity, amidst the larger data.
This, by itself, is one step of the algorithm.

We do not need to delve into the intricacies of computational algorithm design to solve our problem.
We can "break" down our task into subtasks based on the pattern we have found for the information we need:

  1. Request the webpage and store the response.
  2. Parse the response and create a BeautifulSoup object of it.
  3. Find all occurrences of text between span tags that have the class="green" attribute and store them.
  4. Iterate through the occurrences, extracting the text for each span occurrence and printing it.

Congratulations! You have just written your first(?) web scraping algorithm!!

Implementing The Algorithm
Now that we have a thorough understanding of our problem, its peculiarities and requirements, as well as an algorithm for arriving at a solution, we are finally ready to write some code:

# 2.py
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com/pages/warandpeace.html"

html = urlopen(url)
bsoup = BeautifulSoup(html.read(), 'html.parser')

characters = bsoup.findAll('span', {'class': 'green'})
for character in characters:
    print(character.get_text())

Executing this script will give output similar to:

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna

Let's work through our scraper and analyze what each line does.
After importing urlopen and BeautifulSoup from their modules (a module is like a package that contains functions that perform related operations), we assign the string value of the War and Peace URL to a variable (url). We do this because we want to be able to simply swap URLs in future, modify the scraper a little bit and re-use it for a similar scraping task.
Next, we request the URL with urlopen() and create our bsoup object by parsing the network response.
Then things start to get interesting. The findAll() method locates all occurrences of tags matching our specification (tag name: span, class attribute: green) and collects them in a list (characters).
Having located all of our targets, we simply iterate through each element of our characters list and extract the text between the span tags using the get_text() method.

Don't let all this talk of methods confuse you. A method is just a function that belongs to an object, and a function is just a bit of code (a snippet, in programming jargon) that performs a specific action.
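For example, here is a quick sketch of the difference:

# len() is a standalone function; upper() is a method belonging to string objects
name = "Anna Pavlovna"
print(len(name))     # calling a function with one argument
print(name.upper())  # calling a method on the string object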

Easy as pie, right?

In fact, we can take our scraper a bit further and instruct it to print serial numbers for the character names, as well as a count of character names. Modifying our algorithm shouldn't be too much hassle.

  1. Request the webpage and store the response.
  2. Parse the response and create a BeautifulSoup object of it.
  3. Find all occurrences of text between span tags that have the class="green" attribute and store them.
  4. Iterate through the occurrences, extracting the text for each span occurrence and printing it with a serial number attached.
  5. Print a count of all occurrences of character names.

We can now modify our scraper thus:

# 3.py

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com/pages/warandpeace.html"

html = urlopen(url)
bsoup = BeautifulSoup(html.read(), 'html.parser')
characters = bsoup.findAll('span', {'class': 'green'})

for i in range(len(characters)):
    print("{}: {}".format(i + 1, characters[i].get_text()))

print("========")
print("The count is: {}".format(len(characters)))

(If you have problems understanding the complicated print statement above, you might need to study Python's basic syntax and data structures. There are recommendations for publications at the end of this article.)
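By the way, Python's built-in enumerate() function offers a slightly cleaner way to write the same loop, if you prefer:

...
# an equivalent loop using enumerate() instead of range(len(...))
for i, character in enumerate(characters, start=1):
    print("{}: {}".format(i, character.get_text()))
...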

The find() and findAll() Methods: Maximizing Your Ability To Extract Information
In your work scraping the web for data, the two functions you are likely to use again and again are the find() and findAll() BeautifulSoup methods. Therefore, a thorough understanding of the methods, their parameters and usage is important to maximizing your ability to extract the information you need from web sources.

The findAll() method
The findAll() method is a very special method because it's one you are likely to use in all of your scrapers.
The findAll() function takes six (6) arguments (you'll see what arguments are in a bit, so don't worry if you don't understand the word):

bsoup.findAll(tag, attributes, recursive, text, limit, keywords)
A brief aside: Arguments

Arguments are simply values that you pass to functions when you call them, so that the function can use your values to do something. In the case of our web scraper, here is a (non-exhaustive) list of functions that we called with arguments:

# 1
urlopen(url)
# 2
BeautifulSoup(html.read(), 'html.parser')
# 3
bsoup.findAll('span', {'class': 'green'})
# 4
len(characters)
# 5
"{}: {}".format(i + 1, characters[i].get_text())
# 6
"The count is: {}".format(len(characters))
# 7
print("========")
# 8
print("The count is: {}".format(len(characters)))

And here's a list of functions we called without any arguments:

# 1
read()
# 2
get_text()

You must have noticed that the functions we called with arguments have values, separated by commas, within their parentheses. On the other hand, the functions that we called without arguments have nothing between their parentheses.
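To make the idea concrete, here is a toy function of our own invention; greeting and name are its parameters, and the values we supply at the call site are the arguments:

# a made-up function to illustrate passing arguments
def greet(greeting, name):
    print("{}, {}!".format(greeting, name))

greet("Hello", "Anna")  # called with two arguments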

Detailed information about arguments can be found in the recommended publications for learning Python at the end of this article.

The first and second arguments of the findAll() method are the ones you are most likely to use. They are the tag and attribute arguments. In our scraper (3.py) we used them thus:

...
characters = bsoup.findAll('span', {'class': 'green'})
...

This means that we asked BeautifulSoup to find all occurrences of span tags that have a class attribute of green.
This matches all tags that look like:

<span class="green">...</span>

Simple, right?
But, you might wonder, 'what other cool stuff can I do with this findAll() function? Besides targeting tags based on their name and attributes?'

More Cool Stuff With findAll()
The format of arguments to findAll() is:

findAll(tag, attributes, recursive, text, limit, keywords)

I'll discuss each argument and show you examples of what you can do with each.

  1. tag: You have already seen how you can pass a tag's name as a string as this argument.
# find all images in a webpage
findAll('img')
# find all sections in a webpage
findAll('section')
# find all videos in a webpage
findAll('video')

But what happens if you need to find multiple tags that have different tag names (and possibly share the same attribute)?

# passing multiple tag names in a list will find them all:
# this line will find all images <img>, videos <video> and paragraphs <p> in a webpage
findAll(['img', 'video', 'p'])

What if we only want our scraper to find images and videos that have a class attribute of 'multimedia'?

# this line uses the first two arguments to findAll(): 'tag' and 'attribute'
findAll(['img', 'video'], {'class': 'multimedia'})
  2. attribute: The attribute argument is the second argument to findAll(). HTML elements can have several different attributes, so it is important to be certain that the attributes you are targeting will not return unwanted elements. We will see how easily this can happen. Given the HTML:
<div class="product featured" data-featured-product="pr/123456">
    <img src="some_long_path_to_an_image_of_a_cup" alt="product:123456" />
    <div class="product-name">Gold-plated Cups!</div>
</div>
<div class="product last-featured">
    <img src="some_long_path_to_an_image_of_a_spoon" alt="product:123457" />
    <div class="product-name">Gold-plated Spoons</div>
</div>
<div class="product">
    <img src="some_long_path_to_an_image_of_a_doorknob" alt="product:123456" />
    <div class="product-name">Gold-plated Doorknobs</div>
</div>

This is an extract from a (fictional) e-commerce site. What we want to do is retrieve every item for sale and store it in a database. The algorithm for the retrieval part of our scraper might look like:

  1. Request URL and save response.
  2. Find all div tags that have a class attribute of 'product-name'.
  3. Iterate through each tag and get its text.

Short, simple and effective! Let's get to the programming bit:
# 4.py
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "our_fictional_url"

html = urlopen(url)
bsoup = BeautifulSoup(html.read(), 'html.parser')
products = bsoup.findAll('div', {'class': 'product-name'})

for product in products:
    product_name = product.get_text()
    # save product_name to database
    # we'll learn how to save data to databases soon

This is good and fine. But what if we want to get the name of the featured product of the day, and possibly have our script run itself every morning to fetch it?
You might not have noticed it, but the three products in our HTML are not exactly alike. The first product has an extra class value: 'featured'. The second has another class: 'last-featured'. The third only has one class.
Based on the extra information suddenly available to you, you can modify your scraper to get the featured product and print it (or save it to a file, which is what we will now do). Since we are dealing with a fictional URL, the script below will not actually run as-is; a note after the code explains why.

# 5.py
from urllib.request import urlopen
from bs4 import BeautifulSoup
# we'll use the datetime module to get today's date
from datetime import date

url = "our_fictional_url"

html = urlopen(url)

bsoup = BeautifulSoup(html.read(), 'html.parser')
# find the container tagged 'featured', then read the product name inside it
featured = bsoup.findAll('div', {'class': 'featured'})

for product in featured:
    product_name = product.find('div', {'class': 'product-name'}).get_text()
    # save the product name to a file
    with open('featured-products-file.txt', 'a+') as f:
        msg = "{}: The featured product is {}\n".format(date.today(), product_name)
        f.write(msg)
print("File saved.")

(If you try to run this scraper, you'll get a ValueError. This is because our_fictional_url is not a valid URL! Although this script will not run, you can easily understand the logic behind it, as well as how effective it would be if the URL pointed to a valid resource.)
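Incidentally, this is a good place for a first taste of the error handling promised at the start of this article. A minimal, defensive version of the fetch step might look like the sketch below (assuming you substitute a real URL):

# a sketch of defensive fetching with urllib's error classes
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(url):
    try:
        return urlopen(url)
    except (HTTPError, URLError, ValueError) as error:
        # HTTPError: the server returned an error status (e.g. 404)
        # URLError: the server could not be reached at all
        # ValueError: the URL itself is malformed (like our_fictional_url)
        print("Could not fetch {}: {}".format(url, error))
        return None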

Practice Exercise 1:
Write a scraper for retrieving the last-featured product, and storing it to a file.

  3. recursive: The recursive argument is pretty straightforward. When it is True, and it is by default, the findAll() function searches the whole tree beneath the tag you call it on, retrieving all matching tags however deeply they are nested. You'll likely always want this behaviour, and since it is the default, you can forget the argument even exists for the remainder of this article. When it is set to False, the findAll() function only examines direct children; no deeply nested tags will be returned. A short sketch of the difference follows.
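Here is a minimal sketch, using made-up markup, of what recursive changes:

# recursive=True (the default) walks the whole subtree;
# recursive=False stops at direct children
from bs4 import BeautifulSoup

html = """
<div id="outer">
    <p>direct child</p>
    <div><p>nested grandchild</p></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', {'id': 'outer'})

print(len(outer.findAll('p')))                   # 2: all descendants
print(len(outer.findAll('p', recursive=False)))  # 1: direct children only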

If you do not have a background in web development, I recommend a quick primer course in HTML. Jon Duckett's 'HTML and CSS' is the best book I know for learning the language.

  4. text: The text argument is used to target elements that contain the specified text. It's a handy argument to have. Consider this markup:
<div>
<span>English</span>
<span>French</span>
<span>Latin</span>
<span>English</span>
<span>German</span>
</div>

Suppose we want to find the number of occurrences of the word 'English' in the document.
One thing you will notice about this markup is that there are five elements, all <span> tags, with no attributes. We cannot count on classes here.

But we can count on text.

...
occurrences = bsoup.findAll(text="English")
print(f"The number of times the word 'English' occurs: {len(occurrences)}")
...

This gives the exact same count as:

...
occurrences = bsoup.findAll('span', text="English")
print(f"The number of times the word 'English' occurs: {len(occurrences)}")
...

Can you figure out why?
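If you want to check for yourself, here is a self-contained sketch using the markup above. (The counts agree because every 'English' text node in this document lives inside a <span>, so filtering by tag name removes nothing.)

# comparing the two ways of counting occurrences of 'English'
from bs4 import BeautifulSoup

html = """
<div>
<span>English</span>
<span>French</span>
<span>Latin</span>
<span>English</span>
<span>German</span>
</div>
"""
bsoup = BeautifulSoup(html, 'html.parser')

print(len(bsoup.findAll(text="English")))          # 2: matching text nodes
print(len(bsoup.findAll('span', text="English")))  # 2: matching <span> tags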

  5. limit: The limit argument is simple: it limits the number of tags the findAll() function returns to the value specified. For example, to retrieve and print only the first two <span> tags on the War and Peace page:
# 6.py
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(url)
bsoup = BeautifulSoup(html.read(), 'html.parser')

# retrieve no more than two matching tags
characters = bsoup.findAll('span', limit=2)
for character in characters:
    print(character.get_text())

This prints text from the first two <span> tags (there are 75 <span> tags in total). We can verify that we are actually printing the first two <span> tags by calling BeautifulSoup's prettify() method on each tag we retrieved.

# 7.py
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(url)
bsoup = BeautifulSoup(html.read(), 'html.parser')

characters = bsoup.findAll('span', limit=2)
# prettify each tag so we can see exactly what was matched
for character in characters:
    print(character.prettify())

The find() method is the same as calling the findAll() method with limit set to 1, except that find() returns the matching tag itself while findAll() returns a list containing it:

# retrieve the first <span> element
...
bsoup = BeautifulSoup(html.read(), 'html.parser')

# 1: a list containing a single tag
characters = bsoup.findAll('span', limit=1)
# locates the same tag as:
# 2: the tag itself
character = bsoup.find('span')
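One loose end from the findAll() signature is the keywords argument. Keyword arguments are an alternative way to filter on attributes; the two calls sketched below are equivalent (note the trailing underscore in class_, which BeautifulSoup requires because class is a reserved word in Python):

# filtering with an attributes dictionary
bsoup.findAll('span', {'class': 'green'})
# filtering with a keyword argument -- equivalent
bsoup.findAll('span', class_='green')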