Mark Harless
Posted on May 31, 2020
I don't know what it is about web scraping that I find so fascinating. Maybe it's because there's a clear goal and many different ways to reach it. I find myself refactoring my code even though the web scraper works and I already have the data I need!
Right now, I'm working on a React app based on the classic Hangman game. In this version of Hangman, I'll be using IMDb's 1,000 highest-rated movies for people to guess. Users will be given the length of the title (obviously), release year, genre, and summary. To shake things up a little, I've also scraped IMDb's top 1,000 movie stars. Users will be given the length of the movie star's name, the movie they're most famous for, and 10 portraits (I think I'll only be showing the largest four). There will be a 50/50 chance of either a movie or a movie star popping up.
Some challenges I'm working through in my head:
Some of the movies' summaries include the title itself, for instance, Joker (2019). I'll need to create a function to censor it out (see the sketch after this list).
How will mobile users type? There isn't going to be a form or any input field (I think). So how do I get the keyboard to show?
Can I easily implement a hangman SVG character or should I do something else?
If I show four portraits for, let's say, Tom Hardy, I think it would be best to implement a horizontal scroll for mobile users so they can still see what they're guessing. How can I make it work on the desktop?
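For the summary-censoring challenge, here's a minimal sketch of the kind of function I have in mind; the name censor_title and the one-underscore-per-letter convention are just placeholders:

import re

def censor_title(summary, title):
    # Replace every occurrence of the title with underscores,
    # one per letter or digit, keeping spaces and punctuation visible
    def mask(match):
        return "".join("_" if ch.isalnum() else ch for ch in match.group(0))
    return re.sub(re.escape(title), mask, summary, flags=re.IGNORECASE)

print(censor_title("Joker is set in 1981 and follows Arthur Fleck.", "Joker"))
# _____ is set in 1981 and follows Arthur Fleck.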
For the movie star scraper, I used Beautiful Soup to gather the names and the movie each star is famous for. On IMDb, viewing their images opens a modal popup, which was difficult to extract images from, so I ended up using the Contextual Web Search API instead. It's simple and free for up to 10,000 requests a month. Great! I only need 1,000.
I collected the first 10 results for each movie star. The reason I'm only showing four portraits is that some of the results aren't actual pictures of the actor or actress. For example, here's one of Lynn Shelton, who's popular from the movie Humpday. There are also images where the movie star is posing with a group of people.
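To pick which four to show, my plan is to sort each star's results by pixel area and keep the biggest. Here's a rough sketch; it assumes each result in the API's value list carries width and height fields, which I'd still want to verify against a real response:

def four_largest(portraits):
    # Sort image results by pixel area (width x height), descending;
    # results missing dimensions sort last
    def area(img):
        return (img.get("width") or 0) * (img.get("height") or 0)
    return sorted(portraits, key=area, reverse=True)[:4]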
Here's the scraper code:
import json
import requests
from time import sleep
from bs4 import BeautifulSoup


class Star:
    def __init__(self, name, famous_for, portraits):
        self.name = name
        self.famous_for = famous_for
        self.portraits = portraits

    def __str__(self):
        return f"{self.name}, {self.famous_for}"

    def to_json(self):
        # Serialize this object's attributes to a JSON string
        return json.dumps(self, default=lambda o: o.__dict__, indent=4)


page_num = 1
star_list = []

# IMDb paginates 50 names at a time, so stepping start= by 50
# from 1 through 951 covers the top 1,000 stars (20 pages)
while page_num < 1001:
    page = requests.get(
        f"https://www.imdb.com/search/name/?gender=male,female&start={page_num}&ref_=rlm").text
    soup = BeautifulSoup(page, 'html.parser')
    stars = soup.find_all('div', class_='lister-item')

    for star in stars:
        name = star.h3.a.text.strip()
        famous_for = star.find_all('a')[2].text.strip()

        # Fetch the first 10 image results for this star
        url = "https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/ImageSearchAPI"
        querystring = {"autoCorrect": "false", "pageNumber": "1",
                       "pageSize": "10", "q": f"{name}", "safeSearch": "true"}
        headers = {
            'x-rapidapi-host': "contextualwebsearch-websearch-v1.p.rapidapi.com",
            'x-rapidapi-key': "1b193d24f4msh7e6cc900a3729c9p1a4fbbjsnb60afe89cf7b"
        }
        response = requests.request(
            "GET", url, headers=headers, params=querystring)
        sleep(1)  # stay well under the API's rate limit

        portraits = []  # fall back to an empty list if the request fails
        try:
            portraits = json.loads(response.text)['value']
        except Exception as e:
            print(e)

        star_list.append(Star(name, famous_for, portraits).to_json())
        print(len(star_list))

    page_num += 50

with open('stars.json', 'w') as f:
    for star in star_list:
        f.write(star)
        f.write(',')
Once it's done scraping, it exports everything to a JSON file in the same directory. How cool! The one thing I couldn't figure out was how to put all of that data into a single array. So what I ended up doing, for now, is selecting the 1,000 dictionaries and hitting the bracket key to surround them in brackets. If anyone knows how I can get it to work automatically, please let me know!
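One idea I haven't fully tested: keep the Star objects themselves in star_list instead of pre-serialized JSON strings, then serialize the whole list in a single json.dump call:

# In the scraper loop, append the object rather than star.to_json():
#     star_list.append(Star(name, famous_for, portraits))
with open('stars.json', 'w') as f:
    # json.dump writes the whole list as one JSON array
    json.dump([star.__dict__ for star in star_list], f, indent=4)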
Lastly, here's the data :)