In the ever-evolving landscape of web development, the ability to efficiently gather, process, and display data from external sources has become increasingly valuable. Whether for market research, competitive analysis, or customer insights, web scraping plays a crucial role in unlocking the vast potential of the internet's data.
This blog post is a comprehensive guide to building a robust Next.js application for scraping flight data from Kayak, one of the leading travel search engines, by leveraging Next.js alongside modern technologies such as BullMQ, Redis, and Puppeteer. Here's what we'll cover:
Next.js 14 App Directory with Tailwind CSS - Experience a sleek, modern UI powered by the latest Next.js 14 and styled with Tailwind CSS for that perfect look and feel.
API Routes & Server Actions - Dive into seamless backend integration with Next.js 14's API routes and server actions, ensuring efficient data handling and server-side logic execution.
Scraping with Puppeteer, Redis, and BullMQ - Harness the power of Puppeteer for advanced web scraping, with Redis and BullMQ managing queues and jobs for robust backend operations.
JWT Tokens for Authentication and Authorization - Secure your app with JWT tokens, providing a reliable method for authentication and authorization across your platform.
Stripe for Payment Gateways - Integrate Stripe for seamless payment processing, enabling secure and easy transactions for booking trips, flights, and hotels.
Book Trips, Flights, and Hotels with the Stripe Payment Gateway - Make your travel booking experience effortless with our Stripe-powered payment system.
Scrape Live Data from Multiple Websites - Stay ahead with real-time data scraping from multiple sources, keeping your app updated with the latest information.
Store the Scraped Data in PostgreSQL with Prisma - Leverage PostgreSQL and Prisma for efficient storage and management of your scraped data, ensuring reliability and speed.
Zustand for State Management - Enjoy smooth and manageable state management in your app with Zustand, simplifying state logic and enhancing performance.
Best Feature of the App - Scraping the Unscrapable Data with Bright Data's Scraping Browser.
Bright Data's Scraping Browser includes automatic CAPTCHA solving, which lets us scrape data that would otherwise be unscrapable.
Step 1: Setting Up the Next.js Application
Create a Next.js App: Start by creating a new Next.js app if you haven't already. You can do this by running the following command in your terminal:
npx create-next-app@latest booking-app
Navigate to Your App Directory: Change into your newly created app directory:
cd booking-app
Step 2: Installing Required Packages
You'll need to install several packages: ioredis, BullMQ, and Puppeteer Core. Run the following command to install them:
npm install ioredis bullmq puppeteer-core
ioredis is a robust Redis client for Node.js, enabling communication with Redis.
bullmq manages job and message queues with Redis as the backend.
puppeteer-core allows you to control an external browser for scraping purposes.
Step 3: Setting Up Redis Connection
Create a file (e.g., redis.js) in a suitable directory (e.g., lib/) to configure the Redis connection:
// lib/redis.js
import Redis from 'ioredis';

// Use REDIS_URL from the environment or fall back to localhost
const REDIS_URL = process.env.REDIS_URL || 'redis://localhost:6379';
const connection = new Redis(REDIS_URL);

export { connection };
Step 4: Configuring BullMQ Queue
Set up the BullMQ queue by creating another file (e.g., queue.js) in the same directory as your Redis configuration.
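A minimal version could look like this (a sketch; only the importQueue name and the Redis connection import are taken from the worker code shown later in this post):

// lib/queue.js
import { Queue } from 'bullmq';
import { connection } from './redis';

// The worker in instrumentation.js listens on this queue by name
const importQueue = new Queue('importQueue', { connection });

export { importQueue };

Jobs can then be enqueued from an API route or server action, for example: await importQueue.add('import', { url }).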
Create a Worker for Job Processing: In your application, create a file (instrumentation.js) to handle job processing. This worker will use Puppeteer for scraping tasks; the full worker code is shown in Step 7 below.
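Note that in Next.js 14, instrumentation.js only runs when the instrumentation hook is enabled, so your next.config.js needs something like the following (a minimal sketch assuming an otherwise default config):

// next.config.js
/** @type {import('next').NextConfig} */
const nextConfig = {
  experimental: {
    // Required in Next.js 14 for instrumentation.js to run at server startup
    instrumentationHook: true,
  },
};

module.exports = nextConfig;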
Before setting up the Bright Data Scraping Browser, let's talk about what a scraping browser is.
What is Bright Data's scraping browser?
Bright Data's Scraping Browser is a cutting-edge tool for automated web scraping, designed to seamlessly integrate with Puppeteer, Playwright, and Selenium. It offers a suite of website unblocking features, including proxy rotation, CAPTCHA solving, and more, to enhance scraping efficiency. Ideal for complex web scraping requiring interactions, it allows scalability by hosting unlimited browser sessions on Bright Data's infrastructure. For more details, visit Bright Data.
Step 1: Navigate to Bright Data's Website
Begin by heading over to Brightdata.com. This is your gateway to accessing the wealth of web scraping resources and tools offered by Bright Data.
Step 2: Create an Account
Once you're on Bright Data's website, sign up to create a new account. You'll be prompted to enter essential information to get your account up and running.
Step 3: Select Your Product
On the product selection page, look for the Proxies & Scraping Infrastructure product. This product is specifically designed to meet your web scraping needs, offering powerful tools and features for data extraction.
Step 4: Add a New Proxy
Within the Proxies & Scraping Infrastructure page, you'll find an "Add new" button. Click it to start the process of adding a new scraping browser to your toolkit.
Step 5: Choose the Scraping Browser
A dropdown list will appear, from which you should select the scraping browser option. This tells Bright Data that you intend to set up a new scraping browser environment.
Step 6: Name Your Scraping Browser
Give your new scraping browser a unique name. This helps in identifying and managing it later, especially if you plan to use multiple browsers for different scraping projects.
Step 7: Add the Browser
After naming your browser, click on the "add" button. This action finalizes the creation of your new scraping browser.
Step 8: View Your Scraping Browser Details
Upon adding your scraping browser, you will be directed to a page where you can see all the details of your newly created scraping browser. This information is crucial for integration and use.
Step 9: Access Code and Integration Examples
Look for the "check out code and integration examples" button. Clicking this will provide you with a comprehensive view of how to integrate and use your scraping browser across multiple programming languages and libraries. This resource is invaluable for developers looking to customize their scraping setup.
Step 10: Integrate Your Scraping Browser
Finally, copy the SBR_WS_ENDPOINT variable. This is a critical piece of information that you will need to integrate into your source code, allowing your applications to communicate with the scraping browser you've just set up.
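Then add it to your environment file so the application can read it at runtime. The value below is an illustrative placeholder in Bright Data's endpoint format, not a real credential:

# .env.local (copy the real endpoint from your Bright Data dashboard)
SBR_WS_ENDPOINT="wss://brd-customer-<CUSTOMER_ID>-zone-<ZONE_NAME>:<PASSWORD>@brd.superproxy.io:9222"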
By following these detailed steps, you have successfully created a scraping browser within Bright Data's platform, ready to tackle your web scraping tasks. Remember, Bright Data offers extensive documentation and support to help you maximize your scraping projects' efficiency and effectiveness. Whether you're gathering market intelligence, conducting research, or monitoring competitive landscapes, your newly set up scraping browser is a powerful tool in your data collection arsenal.
Step 7: Implementing the Scraping Logic with Puppeteer
Continuing from where we left off in setting up our Next.js application for scraping flight data, the next critical step is to implement the actual scraping logic. This process involves utilizing Puppeteer to connect to a browser instance, navigate to the target URL (in our case, Kayak), and scrape the necessary flight data. The code snippet provided outlines a sophisticated method for achieving this goal, seamlessly integrating with our previously established BullMQ worker setup. Let's break down the components of this scraping logic and understand how it fits into our application.
Establishing a Connection to the Browser
The first step in our scraping process is to establish a connection to the browser through Puppeteer. This is accomplished by utilizing the puppeteer.connect method, which connects to an existing browser instance using a WebSocket endpoint (SBR_WS_ENDPOINT). This environment variable should be set to the WebSocket URL of the scraping browser service you're using, such as Bright Data:
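const SBR_WS_ENDPOINT = process.env.SBR_WS_ENDPOINT;

const browser = await puppeteer.connect({
  browserWSEndpoint: SBR_WS_ENDPOINT,
});

(This snippet is excerpted from the full worker code shown at the end of this step.)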
Opening a New Page and Navigating to the Target URL
Once connected, we create a new page in the browser and navigate to the target URL specified in the job data. This URL is the specific Kayak search result page from which we intend to scrape flight data:
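const page = await browser.newPage();
console.log("Connected! Navigating to " + job.data.url);
await page.goto(job.data.url);
console.log("Navigated! Scraping page content...");

(Again excerpted from the full worker code below.)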
Scraping the Flight Data
The core of our logic lies in scraping the flight data from the page. We achieve this by using page.evaluate, a Puppeteer method that allows us to run scripts in the context of the browser. Within this script, we wait for the necessary elements to load and then proceed to collect flight information:
Flight Selector: We target elements with the class .nrc6-wrapper, which contain flight details.
Data Extraction: For each flight element, we extract details such as the airline logo, departure and arrival times, flight duration, airline name, and price. The departure and arrival times are cleaned to remove unnecessary numeric values at the end, ensuring we capture the time accurately.
Price Processing: The price is extracted as an integer after removing all non-numeric characters, ensuring it can be used for numerical operations or comparisons.
The extracted data is structured into an array of flight objects, each containing the details mentioned above:
const scrappedFlights = await page.evaluate(async () => {
  // Data extraction logic
  const flights = [];
  // Process each flight element
  // ...
  return flights;
});
Error Handling and Cleanup
Our scraping logic is wrapped in a try-catch block to handle any potential errors gracefully during the scraping process. Regardless of the outcome, we ensure the browser is closed properly in the finally block, maintaining resource efficiency and preventing potential memory leaks:
// instrumentation.js
const SBR_WS_ENDPOINT = process.env.SBR_WS_ENDPOINT;

export const register = async () => {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { Worker } = await import("bullmq");
    // puppeteer-core matches the package installed in Step 2
    const puppeteer = await import("puppeteer-core");
    const { connection } = await import("./lib/redis");
    const { importQueue } = await import("./lib/queue");

    new Worker(
      "importQueue",
      async (job) => {
        const browser = await puppeteer.connect({
          browserWSEndpoint: SBR_WS_ENDPOINT,
        });
        try {
          const page = await browser.newPage();
          console.log("in flight scraping");
          console.log("Connected! Navigating to " + job.data.url);
          await page.goto(job.data.url);
          console.log("Navigated! Scraping page content...");

          const scrappedFlights = await page.evaluate(async () => {
            // Give the results page time to render the flight cards
            await new Promise((resolve) => setTimeout(resolve, 5000));

            const flights = [];
            const flightSelectors = document.querySelectorAll(".nrc6-wrapper");

            flightSelectors.forEach((flightElement) => {
              const airlineLogo = flightElement.querySelector("img")?.src || "";
              const [rawDepartureTime, rawArrivalTime] = (
                flightElement.querySelector(".vmXl")?.innerText || ""
              ).split(" – ");

              // Remove trailing day-offset digits (e.g. "+1") from the raw time
              const extractTime = (rawTime) => {
                const timeWithoutNumbers = rawTime.replace(/[0-9+\s]+$/, "").trim();
                return timeWithoutNumbers;
              };

              const departureTime = extractTime(rawDepartureTime);
              const arrivalTime = extractTime(rawArrivalTime);
              const flightDuration = (
                flightElement.querySelector(".xdW8")?.children[0]?.innerText || ""
              ).trim();
              const airlineName = (
                flightElement.querySelector(".VY2U")?.children[1]?.innerText || ""
              ).trim();

              // Extract the price as an integer, dropping currency symbols
              const price = parseInt(
                (flightElement.querySelector(".f8F1-price-text")?.innerText || "")
                  .replace(/[^\d]/g, "")
                  .trim(),
                10
              );

              flights.push({
                airlineLogo,
                departureTime,
                arrivalTime,
                flightDuration,
                airlineName,
                price,
              });
            });

            return flights;
          });
        } catch (error) {
          console.log({ error });
        } finally {
          await browser.close();
          console.log("Browser closed successfully.");
        }
      },
      {
        connection,
        concurrency: 10,
        removeOnComplete: { count: 1000 },
        removeOnFail: { count: 5000 },
      }
    );
  }
};
Step 8: Flight Search Feature
Building upon our flight data scraping functionality, let's integrate a comprehensive flight search feature into our Next.js application. This feature will provide users with a dynamic interface to search for flights by specifying the source, destination, and date. Leveraging the powerful Next.js framework alongside a modern UI library and state management, we create an engaging and responsive flight search experience.
Key Components of the Flight Search Feature
Dynamic City Selection: The feature includes an autocomplete functionality for source and destination inputs, powered by a pre-defined list of city-airport codes. As users type, the application filters and displays matching cities, enhancing the user experience by making it easier to find and select airports.
Date Selection: Users can select their intended flight date through a date input, providing flexibility in planning their travel.
Scraping Status Monitoring: After initiating a scraping job, the application monitors the job's status through periodic API calls. This asynchronous checking allows the app to update the UI with the status of the scraping process, ensuring users are informed of the progress and results (a polling sketch follows below).
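A minimal client-side polling helper might look like this (a sketch; the /api/job-status route and its response shape are assumptions for illustration, not part of the original code):

// Hypothetical polling helper: the /api/job-status route and the
// { status, flights } response shape are illustrative assumptions.
async function pollJobStatus(jobId, onDone, intervalMs = 3000) {
  const timer = setInterval(async () => {
    const res = await fetch(`/api/job-status?id=${jobId}`);
    const { status, flights } = await res.json();
    if (status === "completed") {
      clearInterval(timer); // stop polling once the job has finished
      onDone(flights);      // hand the scraped flights to the UI
    }
  }, intervalMs);
  return timer;
}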
After successfully scraping flight data, the next crucial step is to present these results to the users in a user-friendly manner. The Flights component in your Next.js application is designed for this purpose.
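A stripped-down version of such a component might look like this (a sketch: the props and markup are assumptions, while the flight fields match the objects produced by the worker above):

// components/Flights.jsx (illustrative; only the flight fields come from the scraper)
export default function Flights({ flights }) {
  return (
    <ul>
      {flights.map((flight, index) => (
        <li key={index}>
          <img src={flight.airlineLogo} alt={flight.airlineName} width={48} />
          <span>{flight.airlineName}</span>
          <span>
            {flight.departureTime} – {flight.arrivalTime} ({flight.flightDuration})
          </span>
          <strong>${flight.price}</strong>
        </li>
      ))}
    </ul>
  );
}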
The sections and code snippets shared above represent just a fraction of the full functionality and code necessary to build a robust flight data scraping and search application using Next.js. To grasp the entirety of this project, including advanced features, optimizations, and best practices, I invite you to dive deeper through my comprehensive resources available online.
Watch the Detailed Explanation on YouTube
For a step-by-step video guide that walks you through the development process, coding nuances, and functionality of this application, check out my YouTube video. This tutorial is designed to provide you with a deeper understanding of the concepts, allowing you to follow along at your own pace and gain valuable insights into Next.js application development.
Explore the Full Code on GitHub
If you're eager to explore the code in its entirety, head over to my GitHub repository. There, you'll find the complete codebase, including all the components, utilities, and setup instructions you need to get this application running on your own machine.
Building a comprehensive application like the flight data scraping and search tool with Next.js showcases the power and versatility of modern web development tools and frameworks. Whether you're a seasoned developer looking to refine your skills or a beginner eager to dive into web development, these resources are tailored to support your journey. Watch the detailed tutorial on YouTube, explore the full code on GitHub, and join the conversation to enhance your development expertise and contribute to the vibrant developer community.