Huxley
Posted on December 12, 2019
Scraping web pages with infinite scrolling using python, bs4 and selenium
Scroll function
This function takes two arguments: the driver that is being used and a timeout. The driver is used to scroll, and the timeout is how long to wait after each scroll for the page to load.
import time

def scroll(driver, timeout):
    scroll_pause_time = timeout
    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load
        time.sleep(scroll_pause_time)
        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If the heights match, no new content loaded, so exit the function
            break
        last_height = new_height
Here is an example using the function:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

# Your options may be different
options = Options()
options.set_preference('permissions.default.image', 2)
options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)

def all_links(url):
    # Set up the driver. This one uses Firefox with some options and a path to the geckodriver
    driver = webdriver.Firefox(options=options, executable_path='./geckodriver')
    # implicitly_wait tells the driver how long to wait before throwing an exception
    driver.implicitly_wait(30)
    # driver.get(url) opens the page
    driver.get(url)
    # Start the scrolling by passing the driver and a timeout
    scroll(driver, 5)
    # Once scroll returns, bs4 parses the page_source
    soup_a = BeautifulSoup(driver.page_source, 'lxml')
    # Then we close the driver, as soup_a is storing the page source
    driver.close()
    # Empty list to store the links
    links = []
    # Loop through all the a elements in the page source
    for link in soup_a.find_all('a'):
        # link.get('href') gets the href/url out of the a element
        links.append(link.get('href'))
    return links
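One caveat worth noting: `link.get('href')` returns `None` for any `a` element that has no `href` attribute, so the list that `all_links` returns can contain `None` entries. A minimal sketch of filtering those out (the `raw_links` sample here is hypothetical, standing in for a real return value):

```python
# Hypothetical sample of what all_links might return: some <a> tags
# had no href attribute, so link.get('href') yielded None for them.
raw_links = ["https://example.com/a", None, "/relative/path", None]

# Keep only the entries that are actual URLs
clean_links = [href for href in raw_links if href is not None]

print(clean_links)  # ['https://example.com/a', '/relative/path']
```

You could also do the filtering inside the loop itself, appending only when `link.get('href')` is not `None`.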
And that's how you scrape a page with infinite scrolling.