Build an Advanced Web Scraping Tool Using ToolJet and Scraper API! 🚀 🛠️
Karan Rathod
Posted on August 15, 2024
Introduction
Scraping data can be tedious, especially when dealing with dynamic content. However, it becomes a lot more convenient if you can instantly preview the data scraped in the UI. ToolJet, combined with Scraper API, enables exactly that. This tutorial shows how to set up a script to scrape data using ToolJet and display the results in real time.
If you've worked with web scraping using Google Colab or tools like Selenium, you know the challenges. Here, we'll take a different approach using JavaScript web scraping, utilizing ToolJet’s visual app builder to manage the data flow.
Prerequisites
- ToolJet (https://github.com/ToolJet/ToolJet): An open-source, low-code platform designed for quickly building and deploying internal tools. Sign up for a free ToolJet cloud account to follow along.
- Basic knowledge of JavaScript.
Begin by creating an application named TJ Web Scraper.
Step 1: Designing the UI to Display the Scraped Data
Let's use ToolJet's visual app builder to quickly design the UI.
- Drag and drop a Container component onto the canvas.
- Inside the container, place an Icon component on the left for the logo.
- Place a Text component next to it and enter "TJ Web Scraper" under its Data property.
- Place another Text component on the right to display the total count of scraped products.
We are using blue (hex code: #075ab9) as the primary color for this application; change the color scheme of all the components accordingly.
- Add a Table component below the header and a Button component on the bottom-right of the canvas.
With that, our UI is ready in just a couple of minutes.
Step 2: Writing the JavaScript Code to Scrape Data
In this step, we will utilize the Scraper API to scrape the data from a sample eCommerce website. Here's a preview of the website:
The website has a few products with their images, titles, and prices. Additionally, it has a load more button to dynamically load more content.
To start, expand the Query Panel at the bottom, and click on the Add button to create a new Run JavaScript code query. Rename the query to scrapeData.
- Set up the main scraping function: Create the runMainScript() function that will coordinate all the logic needed for scraping products.
function runMainScript() {
  // Replace the placeholder with your Scraper API key
  const API_KEY = 'SCRAPER_API_KEY';
  // The helper functions in the following steps are defined inside
  // runMainScript() so they can all share API_KEY.
  // Main logic goes here
  // .....
}
- Create a request helper: Build a helper function makeRequest() using axios to handle API requests, manage responses, and deal with errors efficiently.
async function makeRequest(url) {
try {
const response = await axios.get('https://api.scraperapi.com/', {
params: {
api_key: API_KEY,
url: url
}
});
return response.data;
} catch (error) {
if (error.response && error.response.status === 404) {
throw new Error('404NotFound');
}
console.error(`Error making request to ${url}: ${error}`);
return null;
}
}
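Scraper API requests can occasionally fail on transient network errors. The tutorial's makeRequest() simply returns null in that case; if you want retries, one option is a small wrapper like the sketch below. This withRetry() helper is not part of the tutorial's code, just an illustrative pattern:

```javascript
// Hypothetical retry helper (an assumption, not part of the tutorial's code):
// wraps any async request function and retries transient failures
// with a fixed delay between attempts.
async function withRetry(requestFn, retries = 3, delayMs = 500) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await requestFn();
    } catch (error) {
      lastError = error;
      // A 404 means the page is gone; retrying will not help.
      if (error.message === '404NotFound') throw error;
      if (attempt < retries) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

You could then call `withRetry(() => makeRequest(url))` wherever the page load matters, keeping the retry policy in one place instead of inside every request.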
- Extract product details: Define the parseProducts() function to gather relevant information (like title, price, and image) from the HTML content, filtering out incomplete data. This function uses HTML selectors tailored to the target website.
function parseProducts(html) {
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');
const items = doc.querySelectorAll('.product-item');
return Array.from(items).map(item => ({
title: item.querySelector('.product-name')?.textContent.trim() || '',
price: item.querySelector('.product-price')?.textContent.trim() || '',
image: item.querySelector('img')?.src || 'N/A',
url: item.querySelector('a')?.href || null
})).filter(item => item.title && item.price);
}
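The final .filter() in parseProducts() is what discards entries where the title or price selector matched nothing. A standalone sketch of that step, using illustrative sample data rather than actual scraped output:

```javascript
// Sample parsed entries (hypothetical data for illustration only).
const parsed = [
  { title: 'Sample Hoodie', price: '$52', image: 'hoodie.jpg' },
  { title: '', price: '$19', image: 'N/A' },          // missing title: dropped
  { title: 'Sample Jacket', price: '', image: 'N/A' } // missing price: dropped
];

// Keep only entries where both title and price are non-empty strings.
const complete = parsed.filter(item => item.title && item.price);
console.log(complete.length); // 1
```

Empty strings are falsy in JavaScript, which is why `item.title && item.price` is enough to drop the incomplete rows.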
- Handle dynamic loading: The fetchProducts() function manages the initial page load and any additional AJAX requests, collecting all available products. It saves the total count in a ToolJet variable.
async function fetchProducts(pageUrl, ajaxUrl) {
let products = [];
let offset = 0;
const initialPageHtml = await makeRequest(pageUrl);
if (!initialPageHtml) return products;
products = products.concat(parseProducts(initialPageHtml));
while (true) {
const ajaxHtml = await makeRequest(`${ajaxUrl}?offset=${offset}`);
if (!ajaxHtml) break;
const newProducts = parseProducts(ajaxHtml);
if (newProducts.length === 0) break;
products = products.concat(newProducts);
offset += 12;
console.log(`Scraped ${products.length} products so far...`);
// Save the length of the total products fetched in a ToolJet variable
actions.setVariable('totalProductsScraped', products.length);
}
return products;
}
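The pagination loop in fetchProducts() can be understood in isolation: keep requesting the AJAX endpoint with a growing offset until a page comes back empty. Here is a minimal sketch of that pattern, where fetchPage stands in for makeRequest-plus-parseProducts:

```javascript
// Minimal sketch of offset-based pagination (fetchPage is a stand-in for
// the tutorial's makeRequest + parseProducts pipeline). It returns the
// items for a given offset, or an empty array when nothing is left.
async function collectAll(fetchPage, pageSize = 12) {
  const all = [];
  let offset = 0;
  while (true) {
    const batch = await fetchPage(offset);
    if (batch.length === 0) break; // no more AJAX pages to load
    all.push(...batch);
    offset += pageSize; // the sample site loads 12 products per "click"
  }
  return all;
}
```

The page size of 12 matches the sample site's load-more behavior; for a different target, adjust it to however many items each AJAX response returns.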
- Launch the scraping process: Implement the scrapeProducts() function to trigger the scraping and output the final count of products collected.
async function scrapeProducts() {
const pageUrl = "https://www.scrapingcourse.com/button-click";
const ajaxUrl = "https://www.scrapingcourse.com/ajax/products";
let products = await fetchProducts(pageUrl, ajaxUrl);
console.log(`\nTotal products scraped: ${products.length}`);
return products;
}
- Run the script and handle results: Execute the scraping process, save the data, log a sample of the products for review, and handle potential errors by setting an error variable.
scrapeProducts().then(products => {
// Save all the products fetched from the eCommerce website
actions.setVariable('scrapedProducts', products);
console.log("\nScraped products stored in 'scrapedProducts' variable.");
console.log(`Total products: ${products.length}`);
console.log("\nSample of scraped products:");
products.slice(0, 5).forEach(product => {
console.log(`Title: ${product.title}`);
console.log(`Price: ${product.price}`);
console.log(`Image: ${product.image}`);
console.log(`URL: ${product.url}`);
console.log("---");
});
}).catch(error => {
actions.setVariable('scrapingError', error.message);
console.error("An error occurred:", error);
});
- Time to see the code in action: Invoke the runMainScript() function to start the entire process.
function runMainScript() {
// Main logic
// .....
}
runMainScript();
The code to scrape the data is ready. Note that the URLs, selectors, and page size above are specific to the sample site and will need to be adjusted for each target website.
Click on the Run button on the Query Panel and check all the logs that will appear in the browser console.
Step 3: Displaying the Scraped Data
With the UI and code set up, we can now focus on displaying the data on the Table component and triggering the code based on Button click.
- Select the Button component, navigate to its properties and create a new event handler.
- Select On click as the event, Run Query as the action, and scrapeData as the query.
- Select the Table component, and under its Data property, enter {{variables.scrapedProducts}}.
- Select the Text component in the header that was created to display the total count of scraped products, and enter {{"Total Products Scraped: " + (variables.totalProductsScraped || 0)}} under its Data property.
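The parentheses around the fallback matter: in JavaScript, `+` binds tighter than `||`, so without them the fallback applies to the already-concatenated string (which is always truthy), never to the variable itself. A quick sketch:

```javascript
// `count` stands in for variables.totalProductsScraped before the query runs.
const count = undefined;

// Without parentheses, concatenation happens first, so || 0 never fires.
const wrong = "Total Products Scraped: " + count || 0;
// With parentheses, the fallback applies to the variable itself.
const right = "Total Products Scraped: " + (count || 0);

console.log(wrong); // "Total Products Scraped: undefined"
console.log(right); // "Total Products Scraped: 0"
```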
We've successfully linked the components with the query. Now, just click the Button component and watch as the data is scraped and displayed in the Table component.
Conclusion
Scraping data effectively requires overcoming challenges like dynamic content and pagination, especially when dealing with AJAX-loaded pages. Using ToolJet combined with Scraper API, you can simplify this process and gain the ability to instantly preview and manage your scraped data through a clean UI.
Unlike traditional approaches like Selenium web scraping or using Google Colab, this method integrates JavaScript web scraping seamlessly into your workflow with real-time visibility of your data. Building on this foundation, you can scale the tool to handle more complex scraping needs while maintaining an intuitive interface.
To learn more, check out ToolJet's official documentation or connect on Slack with any questions.💡