How to Scrape Tripadvisor Reviews with Node.js and Puppeteer
Andreas
Posted on November 4, 2019
Tripadvisor contains tons of useful local business reviews. However, the site’s API does not provide an endpoint to access them. Tripadvisor also relies heavily on JavaScript in the frontend, which makes it a bit harder to scrape than many other websites.
In this quick tutorial, we will walk through all the steps needed to scrape the customer reviews from a Tripadvisor local business page.
Why Puppeteer?
Let me quickly explain why I decided to use Puppeteer for this project. As mentioned above, Tripadvisor requires a full browser, because much of the content is only rendered by JavaScript. That basically left me with two options: Selenium and Puppeteer. Over the last few months, Puppeteer has become the more prominent solution, as it is noticeably faster.
Information we are going to scrape
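For this tutorial, I have selected a random pizzeria in New York City. It has the following profile URL:
https://www.tripadvisor.com/Restaurant_Review-g60763-d15873406-Reviews-Ortomare_Ristorante_Pizzeria-New_York_City_New_York.html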
As you can see in the screenshot above, we are going to scrape the following pieces of information from each review:
• The rating
• Date of the review
• Date of the visit
• The review title
• Review text (we will have to expand it)
Getting started
Before we start writing the code, we have to install Puppeteer in our local environment:
npm install puppeteer --save
The full code
/* Part 1 */
const puppeteer = require('puppeteer');
puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox', '--window-size=1920,1080'] }).then(async browser => {
  const page = await browser.newPage();
  await page.goto("https://www.tripadvisor.com/Restaurant_Review-g60763-d15873406-Reviews-Ortomare_Ristorante_Pizzeria-New_York_City_New_York.html");
  await page.waitForSelector('body');
  /* Part 2 */
  await page.click('.taLnk.ulBlueLinks');
  await page.waitForFunction('document.querySelector("body").innerText.includes("Show less")');
  /* Part 3 */
  var reviews = await page.evaluate(() => {
    var results = [];
    var items = document.body.querySelectorAll('.review-container');
    items.forEach((item) => {
      /* Get and format Rating */
      let ratingElement = item.querySelector('.ui_bubble_rating').getAttribute('class');
      let integer = ratingElement.replace(/[^0-9]/g,'');
      let parsedRating = parseInt(integer, 10) / 10;
      /* Get and format date of Visit */
      let dateOfVisitElement = item.querySelector('.prw_rup.prw_reviews_stay_date_hsx').innerText;
      let parsedDateOfVisit = dateOfVisitElement.replace('Date of visit:', '').trim();
      /* Part 4 */
      results.push({
        rating: parsedRating,
        dateOfVisit: parsedDateOfVisit,
        ratingDate: item.querySelector('.ratingDate').getAttribute('title'),
        title: item.querySelector('.noQuotes').innerText,
        content: item.querySelector('.partial_entry').innerText,
      });
    });
    return results;
  });
  console.log(reviews);
  await browser.close();
}).catch(function(error) {
  console.error(error);
});
Let me walk through the parts of the code:
Part 1:
With these first lines, we launch Puppeteer in headless mode and navigate to the profile page of the pizzeria. All subsequent actions require the document body to be available, which is ensured by the last line of part 1.
Part 2:
As you can see above, the review text is not shown in full by default. Hence, we have to click on “More” before scraping the actual content. This is done by line 8. Again, the subsequent code must only run once the click action has completed successfully, which is ensured by the last line of part 2.
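If a page has no truncated reviews, the click in part 2 would fail because the selector matches nothing. A minimal sketch of a more defensive variant follows. The helper name `expandReviews` is hypothetical and not part of the script above; it only uses the selector and the wait condition from part 2.

```javascript
// Hypothetical helper: expand truncated reviews only if a "More" link exists.
// `page` is expected to be a Puppeteer Page object.
async function expandReviews(page, moreSelector = '.taLnk.ulBlueLinks') {
  // page.$ resolves to null when no element matches the selector.
  const moreLink = await page.$(moreSelector);
  if (moreLink === null) {
    return false; // nothing to expand, so skip the click
  }
  await moreLink.click();
  // Same wait condition as in part 2 of the script.
  await page.waitForFunction(
    'document.querySelector("body").innerText.includes("Show less")'
  );
  return true;
}
```

In the script above, a call like `await expandReviews(page);` could take the place of the two lines of part 2.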
Part 3:
This is where the magic happens. We access the page DOM and extract all desired information from each review that is visible on the page.
Rating:
By taking a closer look at the element, we can see that the ratings are rendered using pseudo-elements. However, there is a class on the element from which we can derive the rating:
This review shows a 5/5 rating. We can calculate the rating by extracting the number “50” from the string “bubble_50”, converting it to an integer, and dividing it by 10.
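The calculation can be tried out in isolation. The following is a small sketch, assuming the class attribute looks like “ui_bubble_rating bubble_50”. The function name `parseRating` is hypothetical, and instead of stripping all non-digits like the script does, it looks for the “bubble_” token explicitly, so that other digits in the class list would not distort the result.

```javascript
// Hypothetical helper: derive a numeric rating from a class string
// such as "ui_bubble_rating bubble_50" (assumed format).
function parseRating(className) {
  // Match the digits that directly follow "bubble_".
  const match = /bubble_(\d+)/.exec(className);
  if (!match) {
    return null; // no rating class found
  }
  // "50" -> 50 -> 5
  return parseInt(match[1], 10) / 10;
}

console.log(parseRating('ui_bubble_rating bubble_50')); // 5
console.log(parseRating('ui_bubble_rating bubble_40')); // 4
```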
Date of visit:
The date of visit can be obtained quite easily. We simply select the element that contains the date and remove the substring “Date of visit:”.
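Here is the same step as a standalone sketch. The function name `parseDateOfVisit` and the sample input are assumptions for illustration; the actual text on the page may differ.

```javascript
// Hypothetical helper: strip the "Date of visit:" label from the element text.
function parseDateOfVisit(text) {
  // Remove the leading label (case-insensitive) and surrounding whitespace.
  const cleaned = text.replace(/^\s*Date of visit:\s*/i, '').trim();
  // Return null when nothing is left, so empty values are easy to detect.
  return cleaned === '' ? null : cleaned;
}

console.log(parseDateOfVisit('Date of visit: October 2019')); // October 2019
```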
Review title and content:
These can be extracted by simply reading the text of the related elements. No manipulation needed.
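One caveat: if a review lacks one of these elements, `querySelector` returns null and reading `innerText` throws. A small sketch of a null-safe variant follows. The helper name `textOf` is hypothetical, and unlike the script it also trims the text.

```javascript
// Hypothetical helper: read the text of a child element, or null if it is missing.
function textOf(root, selector) {
  const element = root.querySelector(selector);
  return element ? element.innerText.trim() : null;
}
```

Note that Puppeteer serializes the function passed to `page.evaluate`, so a helper like this has to be defined inside that callback in order to be available in the browser context. It could then be used as `title: textOf(item, '.noQuotes')`.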
We have now successfully gathered all information.
Part 4:
We append all gathered information to an array, which is then returned by the function.
Running the script in your console should return all ten results from the first page.
Possible improvements
- The script above only returns the reviews that are shown on the first page. In order to obtain all available reviews, you have to paginate through all review pages. Each page contains up to 10 reviews. This can be achieved by clicking on the page links at the bottom of the page, like we did with the “More” links.
- When scraping the reviews of a bigger list of restaurants, I recommend using a Puppeteer cluster. Make sure to limit the concurrency, so you do not send too many requests at once.
- Also, your scraper might get blocked at some point. This is one of the major issues my current startup, a web scraping API, is trying to solve: zenscrape.com
- Zenscrape also offers a great tutorial on getting started with Node.js and Puppeteer.
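The pagination idea from the first point could be sketched roughly as follows. Everything here is an assumption rather than tested against Tripadvisor: the function name `scrapeAllPages`, its options, and especially the default selector `a.next` are placeholders, and the sketch assumes that clicking the next-page link triggers a full page navigation. The `extractReviews` argument stands for a function that wraps parts 2 and 3 of the script and resolves to the array of reviews on the current page.

```javascript
// Sketch only: `nextSelector`, `maxPages` and `delayMs` are hypothetical
// options, and 'a.next' is a placeholder, not a verified Tripadvisor class.
async function scrapeAllPages(page, extractReviews, options = {}) {
  const { nextSelector = 'a.next', maxPages = 5, delayMs = 1000 } = options;
  const allReviews = [];

  for (let pageNumber = 1; pageNumber <= maxPages; pageNumber++) {
    // Collect the reviews visible on the current page.
    const reviews = await extractReviews(page);
    allReviews.push(...reviews);

    // Stop once the page limit is reached or there is no next-page link.
    if (pageNumber === maxPages) break;
    const nextLink = await page.$(nextSelector);
    if (nextLink === null) break;

    // Pause between pages to keep the request rate low.
    await new Promise((resolve) => setTimeout(resolve, delayMs));

    // Assumption: the click causes a navigation; wait for it to finish.
    await Promise.all([page.waitForNavigation(), nextLink.click()]);
  }

  return allReviews;
}
```

If the site loads the next page without a navigation, the `waitForNavigation` call would have to be replaced by a wait for the new review elements instead.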