Scrape Brave Images with Python

chukhraiartur

Artur Chukhrai

Posted on November 15, 2022

Intro

Currently, we don't have an API that supports extracting data from Brave Search.

This blog post shows how you can do it yourself with the DIY solution below while we work on releasing a proper API.

The solution is suitable for personal use: it doesn't include the Legal US Shield that we offer with our paid production and above plans, and it has limitations, such as the need to bypass blocks (for example, CAPTCHA).

You can check our public roadmap to track the progress for this API: [New API] Brave Search

What will be scraped

[Image: wwbs-brave-images]

What is Brave Search

A previous blog post about Brave already described what Brave Search is. To avoid duplicating content, that information is not repeated here.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

import requests, json


def scrape_brave_images():
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        'q': 'dune 2021',       # query 
        'source': 'web',        # source 
        'size': 'All',          # size (Small, Medium, Large, Wallpaper) 
        '_type': 'All',         # type (Photo, Clipart, AnimatedGifHttps, Transparent) 
        'layout': 'All',        # layout (Square, Tall, Wide) 
        'color': 'All',         # colors (Monochrome, ColorOnly, Red etc) 
        'license': 'All',       # license (Public, Share, Modify etc)
        'offset': 0
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'content-type': 'application/json',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    data = []
    old_page_result = []

    while True:
        html = requests.get('https://search.brave.com/api/images', headers=headers, params=params).json()

        new_page_result = html['results']

        if new_page_result == old_page_result:
            break

        for result in new_page_result:
            data.append({
                'title': result.get('title'),
                'link': result.get('url'),
                'source': result.get('source'),
                'width': result.get('properties').get('width'),
                'height': result.get('properties').get('height'),
                'image': result.get('properties').get('url')
            })

        params['offset'] += 151
        old_page_result = new_page_result

    return data


if __name__ == "__main__":
    brave_images = scrape_brave_images()
    print(json.dumps(brave_images, indent=2))

Preparation

Install libraries:

pip install requests

Reduce the chance of being blocked

Make sure you pass a user-agent in the request headers so that the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what's your user-agent.

The blog post on how to reduce the chance of being blocked while web scraping can get you familiar with basic and more advanced approaches.
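
To see what requests sends when no header is set, a quick check like the one below may help. It is only a sketch: it makes no network call, and the variable names default_ua and custom_headers are chosen here for illustration.

```python
import requests

# The User-Agent string that requests sends when no custom header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/<installed version>

# Overriding it is a matter of passing a headers dict, as the full code does.
custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
```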

Code Explanation

Import libraries:

import requests, json
Library    Purpose
requests   makes a request to the website.
json       converts the extracted data to a JSON-formatted string.

Create URL parameters and request headers:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    'q': 'dune 2021',       # query 
    'source': 'web',        # source 
    'size': 'All',          # size (Small, Medium, Large, Wallpaper) 
    '_type': 'All',         # type (Photo, Clipart, AnimatedGifHttps, Transparent) 
    'layout': 'All',        # layout (Square, Tall, Wide) 
    'color': 'All',         # colors (Monochrome, ColorOnly, Red etc) 
    'license': 'All',       # license (Public, Share, Modify etc)
    'offset': 0
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    'content-type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
Code           Explanation
params         a prettier way of passing URL parameters to a request.
content-type   indicates the original media type of the resource (prior to any content encoding applied for sending). In responses, a Content-Type header provides the client with the actual content type of the returned content.
user-agent     makes the request look like a "real" user request from the browser when passed in the request headers. The default requests user-agent is python-requests, so websites might understand that it's a bot or a script and block the request. Check what's your user-agent.
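
To see how the params dictionary ends up in the URL, the request can be prepared without being sent. This is a small sketch that uses requests.Request(...).prepare(); the params dictionary is shortened here for brevity and nothing is sent over the network.

```python
import requests

# Shortened version of the params dictionary used in this post.
params = {'q': 'dune 2021', 'source': 'web', 'offset': 0}

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request(
    'GET', 'https://search.brave.com/api/images', params=params
).prepare()
print(prepared.url)
```

The printed URL contains every key from params as a query-string argument, which is what requests.get() builds internally when it receives params.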

Create the data list to hold all the data, and the old_page_result list that we'll need later:

data = []
old_page_result = []

To scrape Brave images with pagination, you need to use the offset parameter of the URL, which is 0 for the first page, 151 for the second, and so on. Since data is retrieved from all pages, a while loop is needed:

while True:
    # pagination will be here

In each iteration of the loop, you need to make a request to the Brave API, passing the request parameters and headers created earlier. The json() method parses the JSON response body into a Python dictionary for further work:

html = requests.get('https://search.brave.com/api/images', headers=headers, params=params).json()
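
The request above has no timeout and does not check the status code. If you want a slightly more defensive variant, a sketch could look like the one below. The fetch_page helper, the API_URL constant, the optional session argument, and the 10-second timeout are all assumptions made for this example and are not part of the original code. How Brave responds when it blocks a request is not covered in this post, so raise_for_status() is only a generic safeguard.

```python
import requests

API_URL = 'https://search.brave.com/api/images'  # endpoint used in this post


def fetch_page(params, headers, session=None, timeout=10):
    """Request one page of results and return the parsed JSON body."""
    # A requests.Session can be passed in to reuse the connection across pages.
    client = session if session is not None else requests
    response = client.get(API_URL, headers=headers, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.json()
```

Inside the loop, this helper would be called as fetch_page(params, headers) in place of the single requests.get(...).json() line.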

The new_page_result list contains all the results on the current page. The new_page_result list is compared with the old_page_result list. If they are the same, then this means that we have reached the last page and there is no more new data. Therefore, you need to break the loop:

new_page_result = html['results']

if new_page_result == old_page_result:
    break

📌Note: In the first iteration of the loop, the old_page_result list is empty. Therefore, the two lists differ and the loop continues, unless the first page itself returns no results.
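
The stop condition can be tried out without touching the network. The sketch below is a simplified stand-in rather than the code from this post: collect_pages and fake_pages are hypothetical names, and the fake source simply repeats its last page once the offset runs past the end, which is the behavior the comparison relies on.

```python
def collect_pages(fetch_results, step=151):
    """Call fetch_results(offset) until two consecutive pages are identical."""
    collected, previous, offset = [], [], 0
    while True:
        current = fetch_results(offset)
        if current == previous:  # same page twice: nothing new to read
            break
        collected.extend(current)
        previous = current
        offset += step
    return collected


# Hypothetical stand-in for the API: two distinct pages, then repeats of the last one.
fake_pages = {0: ['a', 'b'], 151: ['c']}
results = collect_pages(lambda offset: fake_pages.get(offset, ['c']))
print(results)  # ['a', 'b', 'c']
```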

By looping through the new_page_result list in a for loop, you can get the data. For each result, data such as title, link, source, width, height, and image are retrieved:

for result in new_page_result:
    data.append({
        'title': result.get('title'),
        'link': result.get('url'),
        'source': result.get('source'),
        'width': result.get('properties').get('width'),
        'height': result.get('properties').get('height'),
        'image': result.get('properties').get('url')
    })

📌Note: The image key contains the URL of the full-resolution image.
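
One thing to keep in mind: result.get('properties') returns None when the key is absent, and the chained .get() call would then raise an AttributeError. Whether Brave ever omits properties is not confirmed here, so the following is only a defensive sketch. The parse_result helper is a hypothetical name introduced for this example.

```python
def parse_result(result):
    """Map one raw result dict to the fields used in this post."""
    properties = result.get('properties') or {}  # tolerate a missing or null field
    return {
        'title': result.get('title'),
        'link': result.get('url'),
        'source': result.get('source'),
        'width': properties.get('width'),
        'height': properties.get('height'),
        'image': properties.get('url'),
    }


# Hypothetical input without 'properties': the chained .get() calls would raise here.
print(parse_result({'title': 'Dune', 'url': 'https://example.com'}))
```

With this helper, the fields that cannot be found come back as None instead of stopping the whole scrape.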

After extracting the data, you need to increase the value of the offset parameter by 151. The site increases the offset by the same amount when you click the button that shows more results, so the script simulates that behavior:

params['offset'] += 151

This is shown more clearly in the GIF below:

[GIF: brave-images-pagination]

At the end of each iteration, new_page_result is assigned to old_page_result so that the next page can be compared with it:

old_page_result = new_page_result

Output

[
  {
    "title": "Dune (2021) | The Poster Database (TPDb)",
    "link": "https://theposterdb.com/posters/42710?page=2",
    "source": "theposterdb.com",
    "width": 1365,
    "height": 2048,
    "image": "https://image.tmdb.org/t/p/original/2sxSn0jjjQoIIZfZjC6j5GZkMVR.jpg"
  },
  {
    "title": "Dune (2021) - Posters \u2014 The Movie Database (TMDB)",
    "link": "https://www.themoviedb.org/movie/438631-dune/images/posters",
    "source": "The Movie Database",
    "width": 2000,
    "height": 3000,
    "image": "https://www.themoviedb.org/t/p/original/7S56MF6XA1jIzD9I2ejMjd6aNvN.jpg"
  },
  {
    "title": "Dune (2021) - Posters \u2014 The Movie Database (TMDb)",
    "link": "https://www.themoviedb.org/movie/438631-dune/images/posters",
    "source": "The Movie Database",
    "width": 956,
    "height": 1333,
    "image": "https://www.themoviedb.org/t/p/original/AqjrlcNRSKx84CeNJyNueg6V1SR.jpg"
  },
  {
    "title": "Dune - Pel\u00edcula 2021 - SensaCine.com",
    "link": "http://www.sensacine.com/peliculas/pelicula-133392/",
    "source": "Sensacine",
    "width": 600,
    "height": 800,
    "image": "http://es.web.img2.acsta.net/pictures/20/04/15/09/53/3283826.jpg"
  },
  {
    "title": "DUNE 2021 Movie Poster : dune",
    "link": "https://www.reddit.com/r/dune/comments/kh9som/dune_2021_movie_poster/",
    "source": "reddit.com",
    "width": 1890,
    "height": 2800,
    "image": "https://preview.redd.it/3fl2s0q1ug661.jpg?auto=webp&s=ed5e4418f962103b0d47b5b466036d7b40aa761b"
  },
  ... other images
]
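
The \u2014 and \u00ed sequences in the output appear because json.dumps() escapes non-ASCII characters by default. If you would rather keep the characters readable and save the result to a file, a sketch using the standard ensure_ascii=False option is shown below. The file name brave_images.json and the one-item sample list are assumptions made for this example.

```python
import json

# Sample taken from the output above (shortened to two keys).
brave_images = [
    {'title': 'Dune (2021) - Posters \u2014 The Movie Database (TMDB)', 'width': 2000}
]

escaped = json.dumps(brave_images, indent=2)                       # default: \u2014 stays escaped
readable = json.dumps(brave_images, indent=2, ensure_ascii=False)  # keeps the dash character itself

# 'brave_images.json' is an arbitrary file name chosen for this example.
with open('brave_images.json', 'w', encoding='utf-8') as file:
    file.write(readable)
```

Both strings decode back to the same data, so the choice only affects how the text looks.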

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
