Build an Advanced Web Scraping Tool Using ToolJet and Scraper API! 🚀 🛠️

Karan Rathod

Posted on August 15, 2024

Introduction

Scraping data can be tedious, especially when dealing with dynamic content. However, it becomes a lot more convenient if you can instantly preview the data scraped in the UI. ToolJet, combined with Scraper API, enables exactly that. This tutorial shows how to set up a script to scrape data using ToolJet and display the results in real time.

If you've worked with web scraping using Google Colab or tools like Selenium, you know the challenges. Here, we'll take a different approach, using JavaScript for the scraping itself and ToolJet's visual app builder to manage the data flow.


Prerequisites

  • ToolJet (https://github.com/ToolJet/ToolJet): An open-source, low-code platform designed for quickly building and deploying internal tools. Sign up for a free ToolJet cloud account here.
  • A Scraper API account and API key, since the scraping requests are routed through its service.
  • Basic knowledge of JavaScript.

Begin by creating an application named TJ Web Scraper.


Step 1: Designing the UI to Display the Scraped Data

Let's use ToolJet's visual app builder to quickly design the UI.

  • Drag and drop a Container component onto the canvas.
  • Inside the container, place an Icon component on the left for the logo.
  • Place a Text component next to it and enter "TJ Web Scraper" under its Data property.
  • Place another Text component on the right to display the total count of scraped products.


We are using blue (hex code: #075ab9) as the primary color for this application; change the color scheme of all the components accordingly.

  • Add a Table component below the header and a Button component on the bottom-right of the canvas.


With that, our UI is ready in just a couple of minutes.


Step 2: Writing the JavaScript Code to Scrape Data

In this step, we will utilize the Scraper API to scrape the data from a sample eCommerce website. Here's a preview of the website:


The website has a few products with their images, titles, and prices. Additionally, it has a load more button to dynamically load more content.
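Rather than simulating clicks on that button (as you would with Selenium), we can call the underlying AJAX endpoint directly with an increasing offset. A minimal sketch of the request pattern (the endpoint URL and the 12-products-per-request step come from the code later in this step):

const ajaxUrl = 'https://www.scrapingcourse.com/ajax/products';
// Each click of "Load More" corresponds to one of these requests:
[0, 12, 24].forEach(offset => console.log(`GET ${ajaxUrl}?offset=${offset}`));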

  • To start, expand the Query Panel at the bottom, and click on the Add button to create a new Run JavaScript code query. Rename the query to scrapeData.

  • Set up the main scraping function: Create the runMainScript() function that will wrap all the scraping logic. The helper functions in the following steps are all defined inside it, so they can access the API key.

function runMainScript() {
    const API_KEY = 'SCRAPER_API_KEY'; // replace with your Scraper API key

    // Main logic goes here
    // .....
}
  • Create a request helper: Build a helper function makeRequest() that uses axios (available in ToolJet's Run JavaScript queries) to route requests through Scraper API, manage responses, and handle errors gracefully.
async function makeRequest(url) {
    try {
        // Route the request through Scraper API, passing the target page as a parameter
        const response = await axios.get('https://api.scraperapi.com/', {
            params: {
                api_key: API_KEY, // available via closure from runMainScript()
                url: url
            }
        });
        return response.data;
    } catch (error) {
        // Surface 404s so callers can stop paginating; log and swallow other errors
        if (error.response && error.response.status === 404) {
            throw new Error('404NotFound');
        }
        console.error(`Error making request to ${url}: ${error}`);
        return null;
    }
}
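With makeRequest() defined inside runMainScript() (so it can see API_KEY), fetching the sample page's HTML is a one-liner; a minimal usage sketch, inside an async context:

// Fetch the listing page through Scraper API; returns raw HTML, or null on failure
const html = await makeRequest('https://www.scrapingcourse.com/button-click');
console.log(html ? `Fetched ${html.length} characters` : 'Request failed');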
  • Extract product details: Define the parseProducts() function to gather relevant information (like title, price, and image) from the HTML content, filtering out incomplete entries. This function uses HTML selectors tailored to the target website.
function parseProducts(html) {
    // Parse the raw HTML string into a DOM that we can query with CSS selectors
    const parser = new DOMParser();
    const doc = parser.parseFromString(html, 'text/html');
    const items = doc.querySelectorAll('.product-item');
    return Array.from(items).map(item => ({
        title: item.querySelector('.product-name')?.textContent.trim() || '',
        price: item.querySelector('.product-price')?.textContent.trim() || '',
        image: item.querySelector('img')?.src || 'N/A',
        url: item.querySelector('a')?.href || null
    })).filter(item => item.title && item.price); // drop entries missing a title or price
}
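To sanity-check the selectors before running the full scrape, you can feed parseProducts() a small hand-written snippet; a minimal sketch (the markup below is illustrative, mirroring the classes assumed above):

const sampleHtml = `
  <div class="product-item">
    <a href="https://example.com/products/1">
      <img src="https://example.com/products/1.jpg">
      <span class="product-name">Sample Shirt</span>
      <span class="product-price">$19.99</span>
    </a>
  </div>`;
console.log(parseProducts(sampleHtml));
// → [{ title: 'Sample Shirt', price: '$19.99', image: '...', url: '...' }]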
  • Handle dynamic loading: The fetchProducts() function manages the initial page load and any additional AJAX requests, collecting all available products. It saves the total count in a ToolJet variable.
async function fetchProducts(pageUrl, ajaxUrl) {
    let products = [];
    let offset = 0;

    // Scrape the initial page first
    const initialPageHtml = await makeRequest(pageUrl);
    if (!initialPageHtml) return products;
    products = products.concat(parseProducts(initialPageHtml));

    // Then page through the AJAX endpoint until it returns no new products
    while (true) {
        const ajaxHtml = await makeRequest(`${ajaxUrl}?offset=${offset}`);
        if (!ajaxHtml) break;
        const newProducts = parseProducts(ajaxHtml);
        if (newProducts.length === 0) break;
        products = products.concat(newProducts);
        offset += 12; // the sample site serves 12 products per request
        console.log(`Scraped ${products.length} products so far...`);
        // Save the length of the total products fetched in a ToolJet variable
        actions.setVariable('totalProductsScraped', products.length);
    }
    return products;
}
  • Launch the scraping process: Implement the scrapeProducts() function to trigger the scraping and output the final count of products collected.
async function scrapeProducts() {
    const pageUrl = "https://www.scrapingcourse.com/button-click";
    const ajaxUrl = "https://www.scrapingcourse.com/ajax/products";
    let products = await fetchProducts(pageUrl, ajaxUrl);
    console.log(`\nTotal products scraped: ${products.length}`);
    return products;
}
  • Run the script and handle results: Execute the scraping process, save the data, log a sample of the products for review, and handle potential errors by setting an error variable.
scrapeProducts().then(products => {
    // Save all the products fetched from the eCommerce website
    actions.setVariable('scrapedProducts', products);
    console.log("\nScraped products stored in 'scrapedProducts' variable.");
    console.log(`Total products: ${products.length}`);
    console.log("\nSample of scraped products:");
    products.slice(0, 5).forEach(product => {
        console.log(`Title: ${product.title}`);
        console.log(`Price: ${product.price}`);
        console.log(`Image: ${product.image}`);
        console.log(`URL: ${product.url}`);
        console.log("---");
    });
}).catch(error => {
    actions.setVariable('scrapingError', error.message);
    console.error("An error occurred:", error);
});
  • Time to see the code in action: Invoke the runMainScript() function to start the entire process. All the functions and the final scrapeProducts().then(...) call from the steps above sit inside it.
function runMainScript() {
    const API_KEY = 'SCRAPER_API_KEY';

    // Main logic: makeRequest(), parseProducts(), fetchProducts(),
    // scrapeProducts(), and the scrapeProducts().then(...) call
    // from the steps above all go here.
}

runMainScript();

The code to scrape the data is ready. Note that the selectors and URLs above are tailored to the sample site; you will need to adjust them for each target website.
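For instance, if a target site marked up its listings with different classes, only parseProducts() (and possibly the pagination URLs) would need to change. A hypothetical sketch for a site that uses .card, .card-title, and .card-price classes (illustrative selectors, not a real site):

function parseProducts(html) {
    const parser = new DOMParser();
    const doc = parser.parseFromString(html, 'text/html');
    // Hypothetical selectors for a different site's markup:
    const items = doc.querySelectorAll('.card');
    return Array.from(items).map(item => ({
        title: item.querySelector('.card-title')?.textContent.trim() || '',
        price: item.querySelector('.card-price')?.textContent.trim() || '',
        image: item.querySelector('img')?.getAttribute('src') || 'N/A',
        url: item.querySelector('a')?.href || null
    })).filter(item => item.title && item.price);
}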

Click on the Run button in the Query Panel and check the logs that appear in the browser console.



Step 3: Displaying the Scraped Data

With the UI and code set up, we can now focus on displaying the data in the Table component and triggering the query on a Button click.

  • Select the Button component, navigate to its properties and create a new event handler.
  • Select On click as the event, Run Query as the action, and scrapeData as the query.


  • Select the Table component, and under its Data property, enter {{variables.scrapedProducts}}.
  • Select the Text component in the header that displays the total count of scraped products. Enter {{"Total Products Scraped: " + (variables.totalProductsScraped || 0)}} under its Data property; the parentheses ensure the fallback 0 applies before the string concatenation.

We've successfully linked the components with the query. Now, just click the Button component and watch as the data is scraped and displayed in the Table component.
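Optionally, since the scrapeData query also sets a scrapingError variable in its catch block, you can surface failures by adding another Text component with {{variables.scrapingError || ""}} under its Data property (a small addition, reusing the variable name from the code above).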



Conclusion

Scraping data effectively requires overcoming challenges like dynamic content and pagination, especially when dealing with AJAX-loaded pages. Using ToolJet combined with Scraper API, you can simplify this process and gain the ability to instantly preview and manage your scraped data through a clean UI.

Unlike traditional approaches such as Selenium or Google Colab notebooks, this method integrates JavaScript web scraping seamlessly into your workflow, with real-time visibility into your data. Building on this foundation, you can scale the tool to handle more complex scraping needs while maintaining an intuitive interface.

To learn more, check out ToolJet's official documentation or connect on Slack with any questions.💡
