[Python] A Comprehensive Guide to Scraping Instagram Data. How to bypass Instagram login while scraping - Meta Spy

Meta Spy: https://github.com/DEENUU1/meta-spy
Full code is available here: https://pastebin.com/QMmDUZtj
Demo: https://github.com/DEENUU1/meta-spy/blob/main/assets/instagram/imagescraper.gif?raw=true

Info

This article is based on Meta Spy (formerly Facebook Spy), a project I am still developing. This week I started adding commands for scraping data from Instagram. My plan is to extend the app to all Meta applications and to add a GUI built with the Flet framework, because typing these commands is getting boring.

How to bypass login?

Bypassing Instagram's login process might sound like a daunting task, but it's surprisingly straightforward. We'll extract the sessionid key from a browser where we're already logged in and integrate it into the Selenium driver. Here's a step-by-step guide:

Launch Instagram in your browser and press F12 to open the Developer Tools.
Open the storage panel of the Developer Tools (the "Application" tab in Chrome, the "Storage" tab in Firefox; the name varies by browser).
Expand "Cookies" and select the entry for instagram.com.
Copy the sessionid value.
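
Rather than pasting the copied value straight into the source code, one option is to keep it in an environment variable and build the cookie dictionary from it. The sketch below assumes a variable named INSTAGRAM_SESSIONID and a helper called build_session_cookie; both names are made up for illustration and are not part of Meta Spy. The returned dictionary has the same shape as the one passed to add_cookie later in this article.

```python
import os
from typing import Dict, Mapping, Optional

# Assumption: this variable name is invented for the sketch.
SESSION_ENV_VAR = "INSTAGRAM_SESSIONID"


def build_session_cookie(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build a cookie dict that can be handed to Selenium's driver.add_cookie()."""
    env = os.environ if environ is None else environ
    session_id = env.get(SESSION_ENV_VAR, "").strip()
    if not session_id:
        raise RuntimeError(
            f"Set {SESSION_ENV_VAR} to the sessionid value copied from your browser"
        )
    return {
        "name": "sessionid",
        "value": session_id,
        "domain": ".instagram.com",
    }
```

Passing a plain dictionary instead of os.environ makes the helper easy to try out without touching your real environment.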

It's time to write some code

Now that we've covered the initial steps, it's time to dive into the code.

Setting Up Chrome Driver Options

To begin, we'll create a class with a static method that simplifies the configuration of the Chrome driver. This class will serve as the foundation for our scraper.




from typing import List  
from time import sleep  
from selenium.webdriver.common.by import By  
from selenium import webdriver  
from selenium.webdriver.support.ui import WebDriverWait  
from selenium.webdriver.chrome.options import Options

class Scraper:  

    @staticmethod  
    def _chrome_driver_configuration() -> Options:  
        chrome_options = Options()  
        chrome_options.add_argument("--disable-notifications")  
        chrome_options.add_argument("--disable-extensions")  
        chrome_options.add_argument("--disable-popup-blocking")  
        chrome_options.add_argument("--disable-default-apps")  
        chrome_options.add_argument("--disable-infobars")  
        chrome_options.add_argument("--disable-web-security")  
        chrome_options.add_argument(  
            "--disable-features=IsolateOrigins,site-per-process"  
        )  
        chrome_options.add_argument(  
            "--enable-features=NetworkService,NetworkServiceInProcess"  
        )  
        chrome_options.add_argument("--profile-directory=Default")  
        chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])  
        return chrome_options
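
If the list of switches keeps growing, one alternative is to store them in a tuple and apply them in a loop. This is only a sketch of that idea: CHROME_FLAGS, apply_flags and RecordingOptions are hypothetical names, and RecordingOptions is a stand-in that exists solely so the snippet runs without Selenium installed. In the real scraper you would pass a Selenium Options() instance instead.

```python
from typing import Iterable, List

# A subset of the switches used in _chrome_driver_configuration above.
CHROME_FLAGS = (
    "--disable-notifications",
    "--disable-extensions",
    "--disable-popup-blocking",
)


class RecordingOptions:
    """Stand-in for selenium's Options: it only records the arguments it receives."""

    def __init__(self) -> None:
        self.arguments: List[str] = []

    def add_argument(self, argument: str) -> None:
        self.arguments.append(argument)


def apply_flags(options, flags: Iterable[str]):
    """Call options.add_argument() once per flag and return the same object."""
    for flag in flags:
        options.add_argument(flag)
    return options


options = apply_flags(RecordingOptions(), CHROME_FLAGS)
print(options.arguments)
```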

Implementing the Base Scraper Class

While this tutorial might appear to introduce more classes than necessary, it aligns with our modular approach to project development. This approach allows us to showcase the complete implementation of specific functionalities.




class BaseInstagramScraper(Scraper):  
    def __init__(self, user_id: str, base_url: str) -> None:  
        super().__init__()  
        self._user_id = user_id  
        self._base_url = base_url.format(self._user_id)  
        self._driver = webdriver.Chrome(options=self._chrome_driver_configuration())  
        self._driver.get(self._base_url)  
        self._wait = WebDriverWait(self._driver, 10)

Scroll

Retrieving the full content from Instagram profiles requires scrolling, but it's not as simple as a one-time scroll-and-scrape process. When scrolling through a profile, data appears and disappears dynamically. As only a few rows of images are visible at a time, scrolling to the end and scraping the data is not feasible. To address this, we've created a function that provides a callback mechanism for dynamic content retrieval.

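Our usual scroll function scrolls to the bottom of the page and then captures whatever is visible. On Instagram that would miss most of the images, so this version invokes a callback after every scroll step to collect the data while it is still on the page.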



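import logging


def scroll_page_callback(driver, callback) -> None:
    """Scroll to the bottom repeatedly, calling callback(driver) after each step."""
    try:
        last_height = driver.execute_script("return document.body.scrollHeight")
        consecutive_scrolls = 0  # scrolls in a row that did not grow the page

        # Stop once three scrolls in a row load nothing new.
        while consecutive_scrolls < 3:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            sleep(3)  # give the page time to load the next batch
            new_height = driver.execute_script("return document.body.scrollHeight")

            if new_height == last_height:
                consecutive_scrolls += 1
            else:
                consecutive_scrolls = 0

            last_height = new_height

            # Collect whatever is currently in the DOM before it is removed.
            callback(driver)

    except Exception as e:
        # `logs` is not defined in this article's code, so the standard
        # logging module is used here instead.
        logging.error(f"Error occurred while scrolling: {e}")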
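
To see why the callback matters, the same loop can be tried without a browser. The sketch below is a simulation under stated assumptions: FakeDriver is a hypothetical stand-in for a page that keeps only the most recent items in the DOM, and scroll_with_callback is a simplified copy of the loop above with no sleep and no Selenium calls.

```python
from typing import Callable, List


class FakeDriver:
    """Hypothetical page model: it grows on scroll but shows only the last `window` items."""

    def __init__(self, total_items: int, page_size: int = 4, window: int = 8) -> None:
        self._total = total_items
        self._page_size = page_size
        self._window = window
        self._loaded = page_size  # items loaded so far

    def scroll_height(self) -> int:
        return self._loaded

    def scroll_to_bottom(self) -> None:
        self._loaded = min(self._total, self._loaded + self._page_size)

    def visible_items(self) -> List[str]:
        start = max(0, self._loaded - self._window)
        return [f"img_{i}" for i in range(start, self._loaded)]


def scroll_with_callback(driver: FakeDriver, callback: Callable[[FakeDriver], None]) -> None:
    """Simplified scroll loop: stop after three scrolls that load nothing new."""
    last_height = driver.scroll_height()
    idle = 0
    while idle < 3:
        driver.scroll_to_bottom()
        new_height = driver.scroll_height()
        idle = idle + 1 if new_height == last_height else 0
        last_height = new_height
        callback(driver)


def collect_all(driver: FakeDriver) -> List[str]:
    """Gather every item seen during scrolling, keeping first-seen order."""
    seen: List[str] = []

    def on_scroll(d: FakeDriver) -> None:
        for item in d.visible_items():
            if item not in seen:
                seen.append(item)

    scroll_with_callback(driver, on_scroll)
    return seen


driver = FakeDriver(total_items=20)
collected = collect_all(driver)
print(len(collected))               # items gathered across all scroll steps
print(len(driver.visible_items()))  # items still visible once scrolling has finished
```

In this simulation the callback gathers all 20 items, while reading the page only after the last scroll would find just the 8 that are still visible.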

Scraping data

Now, let's put all the pieces together and explore the main class responsible for scraping Instagram data.



class ProfileScraper(BaseInstagramScraper):  
    def __init__(self, user_id: str) -> None:  
        super().__init__(user_id, base_url=f"https://www.instagram.com/{user_id}/")  
        self._driver.add_cookie(  
            {  
                "name": "sessionid",  
                "value": "your_sessionid_goes_HERE",  
                "domain": ".instagram.com",  
            }  
        )  
        self._refresh_driver()  

    def _refresh_driver(self) -> None:  
        self._driver.refresh()

The ProfileScraper class inherits from the BaseInstagramScraper, which already includes Chrome driver configurations and more. We add the sessionid cookie to the driver, ensuring that the "value" field contains your sessionid. Next, we call the method:



self._refresh_driver()

This method reloads the page so that the newly added cookie takes effect.



# This method belongs to the ProfileScraper class.
def extract_images(self) -> List[str]:
    extracted_image_urls = []
    try:

        def extract_callback(driver):
            # Compound class names are matched with a CSS selector.
            # These generated class names may change; update the selector
            # if no images are found.
            img_elements = driver.find_elements(
                By.CSS_SELECTOR,
                ".x5yr21d.xu96u03.x10l6tqk.x13vifvy.x87ps6o.xh8yej3",
            )
            for img_element in img_elements:
                src_attribute = img_element.get_attribute("src")
                # Skip elements without a src and URLs already collected.
                if src_attribute and src_attribute not in extracted_image_urls:
                    # print(f"Extracted image URL: {src_attribute}")
                    extracted_image_urls.append(src_attribute)

        # Run the callback after every scroll step.
        scroll_page_callback(self._driver, extract_callback)

    except Exception as e:
        print(f"An error occurred while extracting images: {e}")

    return extracted_image_urls

The core of this class is the extract_images method, which returns a list of all scraped image URLs. Inside it, the extract_callback function finds the image elements currently on the page, reads each src attribute, and appends it to the extracted_image_urls list if it is not already there.

Finally, we pass the Chrome driver and extract_callback to scroll_page_callback, so the extraction runs after every scroll step.
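
The duplicate check above searches a list, which gets slower as the list grows. For large profiles, one possible variation is to track seen URLs in a set while still keeping first-seen order. UrlCollector is a hypothetical helper written for this sketch and is not part of Meta Spy.

```python
from typing import Iterable, List, Optional, Set


class UrlCollector:
    """Keep URLs in first-seen order while rejecting duplicates and empty values."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.urls: List[str] = []

    def add_many(self, candidates: Iterable[Optional[str]]) -> int:
        """Add new URLs and return how many were actually added."""
        added = 0
        for url in candidates:
            # get_attribute("src") may return None, so falsy values are skipped.
            if url and url not in self._seen:
                self._seen.add(url)
                self.urls.append(url)
                added += 1
        return added


collector = UrlCollector()
collector.add_many(["https://example.com/a.jpg", None, "https://example.com/a.jpg"])
print(collector.urls)
```

Inside extract_callback, the src values read from the image elements could be passed to add_many on every scroll step.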

With this comprehensive guide, you're well-equipped to dive into Instagram data scraping with Meta Spy. As we continue developing this project, expect more features and functionalities that expand its capabilities across all Meta applications. And don't forget, our plans to integrate Flet as a GUI promise to make the experience even more user-friendly. Happy scraping!

Running code




if __name__ == "__main__":  
    scraper = ProfileScraper("sawardega_wataha")  
    data = scraper.extract_images()  
    print(len(data))  
    print(data[0])

Pass the username of the Instagram account you want to scrape to ProfileScraper.

Results



> python .\main.py
33  # number of scraped URLs
# full URL of the first scraped image
https://scontent-waw1-1.cdninstagram.com/v/t51.2885-15/387688415_1338700880368645_3875950289382108239_n.jpg?stp=dst-jpg_e35&efg=eyJ2ZW5jb2RlX3RhZyI6ImltYWdlX3VybGdlbi4xNDQweDE4MDAuc2RyIn0&_nc_ht=scontent-waw1-1.cdninstagram.com&_nc_cat=101&_nc_ohc=-w6WTMiiWj4AX-_Qfkt&edm=ACWDqb8BAAAA&ccb=7-5&ig_cache_key=MzIxMTM2ODUyNjYzMDkzMTEzMA%3D%3D.2-ccb7-5&oh=00_AfDoHMVh0dS6msk5yKaW9d81HCeCSgBUJzW82sKRHYRvwQ&oe=65433911&_nc_sid=ee9879
