Turn Any Website Into An API with AutoScraper and FastAPI

kami4ka

Oleg Kulyk

Posted on April 24, 2021

Turn Any Website Into An API with AutoScraper and FastAPI

In this article, we will learn how to create a simple e-commerce search API with multiple platform support: eBay and Amazon. AutoScraper and FastAPi provide the ability to create a powerful JSON API for the date. With Playwright's help, we'll extend our scraper and avoid blocking by using ScrapingAnt's web scraping API.

Let's satisfy the requirements

To complete the whole tutorial we'll need the following libraries:

To install all the requirements via pip just run the following command (Python 3.7+ required):

pip install fastapi 'uvicorn[standard]' autoscraper playwright scrapingant-client
playwright install
Enter fullscreen mode Exit fullscreen mode

Not all the described libraries are required for a complete project, but I'd suggest installing them to try different ways of data extraction.

Scrape one page, and you'll scrape them all

Let's start from the most interesting part - creating a smart scraper to fetch data from Amazon's search result page. For a simple case, we'll use the title, price and product link as a required data for extraction. To generate a product link for Amazon's product we'll need to get a unique product id like B077XTPWZ5.

Amazon Web Scraping Sample

Using AutoScraper, it would be easily done by just providing some sample data:

from autoscraper import AutoScraper

url = 'https://www.amazon.com/s?k=baking+mat'

wanted_dict = {
    'title': [
        'Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 2',
        'Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 3',
    ],
    'price': ['$15.06', '$11.55'],
    'url': ['B0725GYNG6', 'B077XTPWZ5']
}

scraper = AutoScraper()
scraper.build(url=url, wanted_dict=wanted_dict)

# results grouped per rule to we'll know which one to use
scraper.get_result_similar(url, grouped=True)
Enter fullscreen mode Exit fullscreen mode

The code above tells us how to apply several samples when the structure of items on search result pages is different, and we want the scraper to learn them all, which is redundant for the current case and added to show this cool feature.

If you want to copy and run this code, you may need to update the wanted_dict with your values.

From the output, we’ll know which rule corresponds to which data:

{
  "rule_iqkj": ["Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 2", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 3", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 4", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat, Macaron - Pack of 2", "Amazon Basics Anti-Fatigue Mat Marbleized Composite Mat 1/2' Thick 2X3 Black/Gray/White"], 
  "rule_0emm": ["Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 2", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 3", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 4", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat, Macaron - Pack of 2", "Amazon Basics Anti-Fatigue Mat Marbleized Composite Mat 1/2"], 
  "rule_91nj": ["Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 2", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 3", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 4", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat, Macaron - Pack of 2", "Amazon Basics Anti-Fatigue Mat Marbleized Composite Mat 1/2' Thick 2X3 Black/Gray/White"], 
  "rule_pkc7": ["Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 2", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 3", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 4", "Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat, Macaron - Pack of 2", "Amazon Basics Anti-Fatigue Mat Marbleized Composite Mat 1/2"], 
  "rule_vv2c": ["Amazon Basics Silicone, Non-Stick, Food Safe Baking Mat - Pack of 3", "Silpat Premium Non-Stick Silicone Baking Mat, Half Sheet Size, 11-5/8 x 16-1/2"], 
  "rule_o7dc": ["$15.06", "$11.55", "$14.99", "$25.87", "$12.05", "$13.99", "$45.49"], 
  "rule_y9sp": ["$15.06", "$11.55", "$14.99", "$25.87", "$12.05", "$13.99", "$45.49"], 
  "rule_4psi": ["$15.06", "$11.55", "$14.99", "$12.05", "$13.99", "$24.99", "$14.99", "$9.19", "$11.50", "$25.87", "$19.99", "$22.25", "$10.36", "$13.99", "$17.97", "$13.99", "$15.47", "$17.00", "$7.99", "$23.28", "$9.99", "$24.99", "$10.99", "$3.49", "$4.99", "$6.99", "$9.99", "$15.99", "$19.99", "$8.99", "$14.99", "$14.99", "$16.92", "$8.26", "$41.27", "$30.00", "$27.66", "$17.95", "$17.90"], 
  "rule_t1vk": ["$15.06", "$11.55", "$14.99", "$12.05", "$13.99", "$24.99", "$14.99", "$9.19", "$11.50", "$25.87", "$19.99", "$22.25", "$10.36", "$13.99", "$17.97", "$13.99", "$15.47", "$17.00", "$7.99", "$23.28", "$9.99", "$24.99", "$10.99", "$3.49", "$4.99", "$6.99", "$9.99", "$15.99", "$19.99", "$8.99", "$14.99", "$14.99", "$16.92", "$8.26", "$41.27", "$30.00", "$27.66", "$17.95", "$17.90"], 
  "rule_vjnv": ["B0725GYNG6", "B077XTPWZ5", "B07DZVRJBZ", "B077XR38W3", "B07Q1CJSWL"], 
  "rule_zho2": ["B0725GYNG6"], 
  "rule_3h0m": ["B0725GYNG6", "B077XTPWZ5", "B07DZVRJBZ", "B077XR38W3", "B07Q1CJSWL"], 
  "rule_h12x": ["B077XTPWZ5", "B00008T960"], 
  "rule_he5c": ["B077XTPWZ5"]
}
Enter fullscreen mode Exit fullscreen mode

After analyzing the output, let’s keep only desired rules, remove the rest, and save our model:

scraper.keep_rules(['rule_iqkj', 'rule_7wqn', 'rule_vjnv', 'rule_y9sp'])

# Give it a file path
scraper.save('amazon-search')
Enter fullscreen mode Exit fullscreen mode

Rule names will be re-generated each script run. Please, use REPL, Jupyter or save model without cleaning first. You'll be able to open and clean the model from the redundant rules later.

Let's do the same trick for eBay:

from autoscraper import AutoScraper

url = 'https://www.ebay.com/sch/i.html?_nkw=baking+mat'
wanted_list = ['Durable Silicone Baking Mat Non-Stick Pastry Cookie Sheet Oven Baking Mat Liner', '$2.15', 'https://www.ebay.com/itm/Durable-Silicone-Baking-Mat-Non-Stick-Pastry-Cookie-Sheet-Oven-Baking-Mat-Liner/202362695368?hash=item2f1dc1aec8:g:S3UAAOSwwpdW62sK']

scraper = AutoScraper()
scraper.build(url=url, wanted_list=wanted_list)

scraper.get_result_similar(url, grouped=True)
Enter fullscreen mode Exit fullscreen mode

... and the execution hangs.

What is the reason? Why it not worked?

AutoScraper is a good tool for automatic intelligent parsing. Still, it does not have capabilities for proper data extraction from the page, so the URL's direct data extraction can not work for some sites.

Let's try the workaround for the learning phase only: save the required page as HTML via the browser and then open via Python to pass the data into AutoScraper:

from autoscraper import AutoScraper
from pathlib import Path

# page content from https://www.ebay.com/sch/i.html?_nkw=baking+mat saved from the browser
html = Path('baking_mat_ebay.html').read_text()

wanted_list = ['Durable Silicone Baking Mat Non-Stick Pastry Cookie Sheet Oven Baking Mat Liner', '$2.14', 'https://www.ebay.com/itm/Durable-Silicone-Baking-Mat-Non-Stick-Pastry-Cookie-Sheet-Oven-Baking-Mat-Liner/202362695368?hash=item2f1dc1aec8:g:S3UAAOSwwpdW62sK']

scraper = AutoScraper()
scraper.build(html=html, wanted_list=wanted_list)

scraper.get_result_similar(html=html, grouped=True)

scraper.set_rule_aliases({'rule_9otk': 'title', 'rule_7n5a': 'price', 'rule_buz1': 'url'})

// No rules will be cleaned this time, all of them works fine

# Give it a file path
scraper.save('ebay-search')
Enter fullscreen mode Exit fullscreen mode

How to extract data and avoid anti-scraping techniques? Let's try out two different approaches:

Playwright

Playwright is a high-level API over headless Chrome, Firefox and Webkit browser. It means that you can automate the action that we made above (open the page's HTML), and this action will not trigger the website's anti-scraping mechanism. Let's create a function that does content extraction:

from playwright.sync_api import sync_playwright

def get_webpage_content_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        content = page.content()
        browser.close()
        return content
Enter fullscreen mode Exit fullscreen mode

This function allows us to get the page's HTML content without being blocked and detected as a bot.

Also, we'll need to apply rotating proxies to avoid rate-limiting for the web scraping at scale. Check out how to use a proxy in Playwright.

Let's check out the pros and cons of using Playwright for web scraping:

Pros:

  • On-premise or cloud solution with no third party API
  • Full browser control
  • Customisable proxy pool
  • Community support

Cons:

  • Low scalability
  • High proxy prices
  • Complicated maintenance
  • High expenses for infrastructure to run CPU intense task

ScrapingAnt API

ScrapingAnt web scraping API is a web interface for performing data extraction at a scale that already uses headless browsers and rotating proxies inside. Only one action is needed - execute an API request with a target URL. Let's create the same function with using ScrapingAnt API:

from scrapingant_client import ScrapingAntClient

client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')

def get_webpage_content_scrapingant(url):
    return client.general_request(url).content
Enter fullscreen mode Exit fullscreen mode

To get you API token, please, visit Login page to authorize in ScrapingAnt User panel. It's free.

Pros:

  • Simple usage
  • Always reproducible results
  • Low maintenance and usage price
  • Large proxy pool
  • High data extraction success rate

Cons:

  • Third-party API usage
  • Paid when scraping web pages at scale

Learning phase notes

Model creation should be implemented in exactly the same way you're going to use for other pages scraping. If you're planning to use AutoScraper with url param - you should also create the model with passing url param. The different browsers may create different layouts, so a Playwright or a desktop Chrome model can not parse HTML retrieved with a direct URL propagation or content retrieved via ScrapingAnt web scraping API.

To save HTML for learning, you can use the following snippets:

Playwright

from pathlib import Path
from playwright.sync_api import sync_playwright

url = 'https://www.ebay.com/sch/i.html?_nkw=baking+mat'

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    Path('baking_mat_ebay.html').write_text(page.content())
    browser.close()
Enter fullscreen mode Exit fullscreen mode

ScrapingAnt

from pathlib import Path
from scrapingant_client import ScrapingAntClient

url = 'https://www.ebay.com/sch/i.html?_nkw=baking+mat'
client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')
html = client.general_request(url=url).content

Path('baking_mat_ebay.html').write_text(html)
Enter fullscreen mode Exit fullscreen mode

Create your own API

It's time to glue-up everything together and create our e-commerce API based on Amazon and eBay. The final code contains both data extraction functions, so you'll be able to modify and use your favorite:

from autoscraper import AutoScraper
from fastapi import FastAPI
from scrapingant_client import ScrapingAntClient
from typing import Optional
from playwright.sync_api import sync_playwright

ebay_scraper = AutoScraper()
amazon_scraper = AutoScraper()
ebay_scraper.load('ebay-search')
amazon_scraper.load('amazon-search')

client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')

app = FastAPI()


def get_ebay_result(search_query):
    url = 'https://www.ebay.com/sch/i.html?_nkw=%s' % search_query
    html = get_webpage_content_scrapingant(url)
    result = ebay_scraper.get_result_similar(html=html, group_by_alias=True)
    return _aggregate_result(result)


def get_amazon_result(search_query):
    url = 'https://www.amazon.com/s?k=%s' % search_query
    html = get_webpage_content_scrapingant(url)
    result = amazon_scraper.get_result_similar(html=html, group_by_alias=True)
    result['url'] = [f'https://www.amazon.com/dp/{i}' for i in result['url']]
    return _aggregate_result(result)


def get_webpage_content_scrapingant(url):
    return client.general_request(url).content


def get_webpage_content_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        content = page.content()
        browser.close()
        return content


def _aggregate_result(result):
    """
    :param result {
        'title': ['name1', 'name2']
        'url': ['url1', 'url2']
        'price': ['price1', 'price2']
    }

    :returns [
        {'title': 'name1', 'url': 'url1', 'price': 'price1'},
        {'title': 'name2', 'url': 'url2', 'price': 'price2'},
    ]
    """
    final_result = []
    items_count = len(result['title'])
    for i in range(items_count):
        final_result.append({alias: result[alias][i] for alias in result})
    return final_result

@app.get("/")
def search_api(q: Optional[str] = None):
    return dict(result=get_ebay_result(q) + get_amazon_result(q))

Enter fullscreen mode Exit fullscreen mode

Execute the following command to run the API server:

uvicorn main:app --reload
Enter fullscreen mode Exit fullscreen mode

By running this code, the API server will be up listening on port 8080. So let’s test our API by opening http://127.0.0.1:8080/?q=baking in our browser:

E-Commerce API response

And finally, we're having our own custom multi-sourced e-commerce API. Just replace baking in the URL with your desired search query to get its results.

Conclusion

This tutorial is intended for personal and educational use. I hope this article is useful and helps to grab all the provided code snippets to a fully functional and production-grade application.

I'd recommend following the links below to extend your knowledge and have a deep dive into the project parts:

Happy web scraping, and don't forget to check websites policies regarding the scraping bots 😉

💖 💪 🙅 🚩
kami4ka
Oleg Kulyk

Posted on April 24, 2021

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related