Scrape Organic News from Brave Search with Python

dmitryzub

Dmitriy Zub ☀️

Posted on October 27, 2021

Intro

Currently, we don't have an API that supports extracting data from Brave Search.

This blog post shows how you can do it yourself with the DIY solution below while we're working on releasing a proper API.

The solution can be used for personal use: it doesn't include the Legal US Shield that we offer for our paid Production plan and above, and it has limitations such as the need to bypass blocks, for example, CAPTCHA.

You can check our public roadmap to track the progress for this API: [New API] Brave Search

What will be scraped

wwbs-brave-news-results

📌Note: Sometimes there may be no news in the organic search results. This blog post gets news from both the organic results and the news tab.

What is Brave Search

A previous Brave blog post already described what Brave Search is. To avoid duplicating content, that information is not repeated in this blog post.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

from bs4 import BeautifulSoup
import requests, lxml, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    'q': 'dune',            # query
    'source': 'web',        # source
    'tf': 'at'              # publish time (by default any time)
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}


def scrape_organic_news():
    html = requests.get('https://search.brave.com/search', headers=headers, params=params, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    brave_organic_news = []

    for result in soup.select('#news-carousel .card'):
        title = result.select_one('.title').get_text().strip()
        link = result.get('href')
        time_published = result.select_one('.card-footer__timestamp').get_text().strip()
        source = result.select_one('.anchor').get_text().strip()
        favicon = result.select_one('.favicon').get('src')
        thumbnail = result.select_one('.img-bg').get('style').split(', ')[0].replace("background-image: url('", "").replace("')", "")

        brave_organic_news.append({
            'title': title,
            'link': link,
            'time_published': time_published,
            'source': source,
            'favicon': favicon,
            'thumbnail': thumbnail
        })

    print(json.dumps(brave_organic_news, indent=2, ensure_ascii=False))


def scrape_tab_news():
    del params['source']
    html = requests.get('https://search.brave.com/news', headers=headers, params=params, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    brave_tab_news = []

    for result in soup.select('.snippet'):
        title = result.select_one('.snippet-title').get_text()
        link = result.select_one('.result-header').get('href')
        snippet = result.select_one('.snippet-description').get_text().strip()
        time_published = result.select_one('.ml-5+ .text-gray').get_text()
        source = result.select_one('.netloc').get_text()
        favicon = result.select_one('.favicon').get('src')
        thumbnail = result.select_one('.thumb')
        thumbnail = thumbnail.get('src') if thumbnail else None

        brave_tab_news.append({
            'title': title,
            'link': link,
            'snippet': snippet,
            'time_published': time_published,
            'source': source,
            'favicon': favicon,
            'thumbnail': thumbnail
        })

    print(json.dumps(brave_tab_news, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    # scrape_organic_news()
    scrape_tab_news()

Preparation

Install libraries:

pip install requests lxml beautifulsoup4

Basic knowledge scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which also makes them useful for extracting data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
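
As a minimal sketch of the idea, the snippet below runs two selectors against a small, made-up HTML fragment. The markup and class names are illustrative assumptions, not Brave's real page, and Python's built-in html.parser is used so the example runs without lxml. select() returns every match, while select_one() returns only the first match, or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not taken from Brave Search.
html = """
<div id="news">
  <a class="card" href="https://example.com/a"><span class="title"> First </span></a>
  <a class="card" href="https://example.com/b"><span class="title"> Second </span></a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# select() returns a list with every element matching the selector.
cards = soup.select('#news .card')

# select_one() returns only the first match, or None if there is no match.
first_title = soup.select_one('#news .card .title')
missing = soup.select_one('.does-not-exist')

titles = [card.select_one('.title').get_text().strip() for card in cards]
links = [card.get('href') for card in cards]

print(titles)
print(links)
print(missing)
```

Because select_one() can return None, calling get_text() on its result raises an AttributeError when the element is absent, which is worth keeping in mind if the page layout changes.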

Reduce the chance of being blocked

Make sure you pass a user-agent in the request headers so the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what's your user-agent.

There's also a blog post on how to reduce the chance of being blocked while web scraping that can get you familiar with basic and more advanced approaches.
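
To see the difference locally, here is a small sketch that prints the user-agent requests would send by default and then overrides it on a session. No request is sent; the session is only an illustration of where the header ends up, and the user-agent string is the one used in this post.

```python
import requests

# The user-agent that requests sends when none is provided.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # begins with "python-requests/" followed by the installed version

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}

# Overriding the header on a session applies it to every request made through it.
session = requests.Session()
session.headers.update(headers)
print(session.headers['User-Agent'])
```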

Code Explanation

Import libraries:

from bs4 import BeautifulSoup
import requests, lxml, json

Libraries and their purpose:
BeautifulSoup: to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
requests: to make a request to the website.
lxml: to process XML/HTML documents fast.
json: to convert extracted data to a JSON string.

Create URL parameters and request headers:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    'q': 'dune',            # query
    'source': 'web',        # source
    'tf': 'at'              # publish time (by default any time)
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}

Code explanation:
params: a prettier way of passing URL parameters to a request.
user-agent: makes the request look like a "real" user request from a browser when passed in the request headers. The default requests user-agent is python-requests, so websites might understand that it's a bot or a script and block the request. Check what's your user-agent.
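
To make the params dictionary less abstract, the sketch below builds the query string by hand with the standard library. This is only an approximation of what requests does internally when it receives params, shown here so you can see the resulting URL without sending a request.

```python
from urllib.parse import urlencode

params = {
    'q': 'dune',        # query
    'source': 'web',    # source
    'tf': 'at'          # publish time (by default any time)
}

base_url = 'https://search.brave.com/search'

# urlencode() joins key=value pairs with "&" and escapes special characters.
full_url = f'{base_url}?{urlencode(params)}'
print(full_url)

# Spaces in a value are escaped, so multi-word queries stay valid.
print(urlencode({'q': 'dune movie'}))
```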

Scrape organic news

This function scrapes all organic news data from the https://search.brave.com/search URL and outputs all results in JSON format.

Make a request, passing the request parameters and headers created earlier. The returned HTML is then handed to BeautifulSoup:

html = requests.get('https://search.brave.com/search', headers=headers, params=params, timeout=30)
soup = BeautifulSoup(html.text, 'lxml')

Code explanation:
timeout=30: stops waiting for a response after 30 seconds.
BeautifulSoup(): parses the returned HTML so it can be processed with bs4.

Create the brave_organic_news list to store all news:

brave_organic_news = []

To extract the necessary data, you need to find the selector that contains it. In our case, this is the #news-carousel .card selector, which matches every organic news card. Iterate over each card in a loop:

for result in soup.select('#news-carousel .card'):
    # data extraction will be here

To extract the data, you need to find the matching selectors. SelectorGadget was used to grab CSS selectors. I want to demonstrate how the selector selection process works:

brave-organic-news-selector-gadget

After the selectors have been found, we need to get the corresponding text or attribute value:

title = result.select_one('.title').get_text().strip()
link = result.get('href')
time_published = result.select_one('.card-footer__timestamp').get_text().strip()
source = result.select_one('.anchor').get_text().strip()
favicon = result.select_one('.favicon').get('src')
thumbnail = result.select_one('.img-bg').get('style').split(', ')[0].replace("background-image: url('", "").replace("')", "")

Code explanation:
select_one(): runs a CSS selector against the parsed document and returns the first matching element, or None if nothing matches.
select(): runs a CSS selector against the parsed document and returns a list of all matching elements.
get_text(): gets the textual data from the node.
get(<attribute>): gets the attribute value from the node.
strip(): returns a copy of the string with leading and trailing whitespace removed.
replace(): returns a copy of the string with all occurrences of the old substring replaced by the new one. Here it removes the CSS wrapper around the image URL.
split(): returns a list of substrings, splitting the string at the given delimiter.
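
The split()/replace() chain used for the thumbnail depends on the exact spacing and quoting of the style attribute. As a hedged alternative, here is a sketch that pulls the first url(...) value out with a regular expression. The helper name thumbnail_from_style and the sample style strings are assumptions for illustration; the real attribute on Brave's page may look different.

```python
import re


def thumbnail_from_style(style):
    """Return the first url(...) value found in an inline style string, or None."""
    if not style:
        return None
    # Match url(...), allowing optional single or double quotes around the address.
    match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", style)
    return match.group(1) if match else None


# Hypothetical style value shaped like the one the post's code expects.
sample = "background-image: url('https://example.com/a.jpg'), url('https://example.com/b.jpg')"
print(thumbnail_from_style(sample))
print(thumbnail_from_style('color: red'))
```

Inside the loop it could be called as thumbnail_from_style(result.select_one('.img-bg').get('style')), assuming the .img-bg element is present.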

After the data from the item is retrieved, it is appended to the brave_organic_news list:

brave_organic_news.append({
    'title': title,
    'link': link,
    'time_published': time_published,
    'source': source,
    'favicon': favicon,
    'thumbnail': thumbnail
})

The complete function to scrape organic news would look like this:

def scrape_organic_news():
    html = requests.get('https://search.brave.com/search', headers=headers, params=params, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    brave_organic_news = []

    for result in soup.select('#news-carousel .card'):
        title = result.select_one('.title').get_text().strip()
        link = result.get('href')
        time_published = result.select_one('.card-footer__timestamp').get_text().strip()
        source = result.select_one('.anchor').get_text().strip()
        favicon = result.select_one('.favicon').get('src')
        thumbnail = result.select_one('.img-bg').get('style').split(', ')[0].replace("background-image: url('", "").replace("')", "")

        brave_organic_news.append({
            'title': title,
            'link': link,
            'time_published': time_published,
            'source': source,
            'favicon': favicon,
            'thumbnail': thumbnail
        })

    print(json.dumps(brave_organic_news, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "Dune subreddit group bans AI-generated art for being ‘low effort’ ...",
    "link": "https://www.theguardian.com/film/2022/oct/16/dune-subreddit-group-bans-ai-generated-art-for-being-low-effort",
    "time_published": "2 days ago",
    "source": "theguardian.com",
    "favicon": "https://imgs.search.brave.com/9NJ5RrmLraV8oAt2-ItS_A5rM7MNWTBcXog1rbJwni0/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvNGRmYTNkMTZl/NmJhYTQwYmQ4NDRj/MzQ4NDZkNGQ0YTgy/ZWRlZDM4YWVkMzM4/NmM0Y2Y2NTgyMTQ5/NzQxOTExYy93d3cu/dGhlZ3VhcmRpYW4u/Y29tLw",
    "thumbnail": "https://imgs.search.brave.com/PO4d1ks7aUaIUG07Aty1tXis_sdCsr9ZUJ-IXB5Hr7U/rs:fit:200:200:1/g:ce/aHR0cHM6Ly9pLmd1/aW0uY28udWsvaW1n/L21lZGlhL2EzOTQy/NWM5N2M0MzlmY2Vi/Yzc3NTFlYzUzMDQ0/MmJmYWFjOWNlZGYv/Njk3XzBfMjk1OF8x/Nzc3L21hc3Rlci8y/OTU4LmpwZz93aWR0/aD0xMjAwJmhlaWdo/dD02MzAmcXVhbGl0/eT04NSZhdXRvPWZv/cm1hdCZmaXQ9Y3Jv/cCZvdmVybGF5LWFs/aWduPWJvdHRvbSUy/Q2xlZnQmb3Zlcmxh/eS13aWR0aD0xMDBw/Jm92ZXJsYXktYmFz/ZTY0PUwybHRaeTl6/ZEdGMGFXTXZiM1ps/Y214aGVYTXZkR2N0/WkdWbVlYVnNkQzV3/Ym1jJmVuYWJsZT11/cHNjYWxlJnM9NTVi/NDY1MzM1ZDcyNWNh/YzAxNDg2Nzk2NTNm/ZGJlMzg"
  },
  {
    "title": "Emily Watson: The Dune: ‘The Sisterhood’, ‘God’s Creatures’ ...",
    "link": "https://deadline.com/2022/10/emily-watson-the-dune-the-sisterhood-and-gods-creatures-star-says-she-loves-being-in-front-of-the-camera-because-it-gives-her-a-level-of-trust-1235145603/",
    "time_published": "3 days ago",
    "source": "deadline.com",
    "favicon": "https://imgs.search.brave.com/hbAJswXoM5V6EWHa7svVfcuTtKDvVN3HaccvtoCfhVo/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvMjk2OWMwMWU5/ZDU0MGJjMDdkZjY1/NTJmZmU3OGEzMDU5/Y2U2MWQ2ODE5Njdj/NTEwYzA2MGY5ODYy/N2NlNTkzYS9kZWFk/bGluZS5jb20v",
    "thumbnail": "https://imgs.search.brave.com/-vqS2wBthAQPSJiCpxSHW_IcG2CFsVw-MWbUykIOIZQ/rs:fit:200:200:1/g:ce/aHR0cHM6Ly9kZWFk/bGluZS5jb20vd3At/Y29udGVudC91cGxv/YWRzLzIwMjIvMTAv/ZW1pbHkuanBnP3c9/MTAyNA"
  },
  {
    "title": "DUNE: THE SISTERHOOD Taps OBI-WAN KENOBI And GAME OF THRONES Star ...",
    "link": "https://comicbookmovie.com/sci-fi/dune/dune-the-sisterhood-taps-obi-wan-kenobi-and-game-of-thrones-star-indira-varma-for-lead-role-a197335",
    "time_published": "2 days ago",
    "source": "comicbookmovie.com",
    "favicon": "https://imgs.search.brave.com/ZqE9eQ5BIk1l3ZH7MOTWEPqScYt79E7VyJ5D46uRTeA/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvYzNlOWQ3NGE2/MzQwYWExZjRhMDEx/Njk1NGE5OTlkYzhj/NjZmZmEwNjVlYmY1/ODM1MzIyMWZjNWQy/M2FjM2JlNi9jb21p/Y2Jvb2ttb3ZpZS5j/b20v",
    "thumbnail": "https://imgs.search.brave.com/-wstGGJXxONeT0Ig7TNrqFw1DLK5kIWLdm9V-_Ne4lU/rs:fit:200:200:1/g:ce/aHR0cHM6Ly9jb21p/Y2Jvb2ttb3ZpZS5j/b20vaW1hZ2VzL2Fy/dGljbGVzL2Jhbm5l/cnMvMTk3MzM1Lmpw/ZWc"
  },
  ... other news
]
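
The titles in the output above contain typographic quotes, which is why ensure_ascii=False is passed to json.dumps(). The short sketch below compares both settings on a one-item list built from one of those titles; both strings decode back to the same data, only the readability differs.

```python
import json

news = [{'title': 'Dune subreddit group bans AI-generated art for being ‘low effort’'}]

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(news, indent=2)

# ensure_ascii=False keeps the original characters in the output.
readable = json.dumps(news, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```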

Scrape tab news

This function scrapes all news data from the news tab at the https://search.brave.com/news URL and outputs all results in JSON format.

Make a request, passing the created request headers and the parameters without the 'source' parameter. The returned HTML is then handed to BeautifulSoup:

del params['source']
html = requests.get('https://search.brave.com/news', headers=headers, params=params, timeout=30)
soup = BeautifulSoup(html.text, 'lxml')
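
Note that del params['source'] changes the shared params dictionary, so calling the function a second time would raise a KeyError. One possible alternative, shown here as a sketch rather than the code used in this post, is to build a copy without that key. The helper name without_key is hypothetical.

```python
params = {
    'q': 'dune',        # query
    'source': 'web',    # source
    'tf': 'at'          # publish time (by default any time)
}


def without_key(mapping, key):
    """Return a shallow copy of mapping with key left out (no error if it is absent)."""
    return {k: v for k, v in mapping.items() if k != key}


# The original dictionary stays intact and can still be used for the organic search.
news_params = without_key(params, 'source')
print(news_params)
print(params)
```

The copy can then be passed as params=news_params in the request to https://search.brave.com/news.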

Create the brave_tab_news list to store all news:

brave_tab_news = []

To retrieve data from all news items on the page, use the .snippet selector, which matches each item. Iterate over each item in a loop:

for result in soup.select('.snippet'):
    # data extraction will be here

On this page, the matching selectors are different, so SelectorGadget was used again to grab the CSS selectors. I want to demonstrate how the selector selection process works:

brave-tab-news-selector-gadget

The difference when extracting data in this function is that here you can also get a snippet:

title = result.select_one('.snippet-title').get_text()
link = result.select_one('.result-header').get('href')
snippet = result.select_one('.snippet-description').get_text().strip()
time_published = result.select_one('.ml-5+ .text-gray').get_text()
source = result.select_one('.netloc').get_text()
favicon = result.select_one('.favicon').get('src')
thumbnail = result.select_one('.thumb')
thumbnail = thumbnail.get('src') if thumbnail else None

📌Note: The thumbnail is extracted with a conditional (ternary) expression: if the .thumb element exists, its src attribute is used; otherwise the value is None.
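
The same pattern can be wrapped in a small helper so any optional element is handled the same way. This is a sketch under assumptions: attr_or_none is a hypothetical helper, the HTML fragment is made up for illustration, and the built-in html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second item has no thumbnail.
html = """
<div class="snippet"><img class="thumb" src="https://example.com/t.jpg"></div>
<div class="snippet"><p>No image here</p></div>
"""

soup = BeautifulSoup(html, 'html.parser')


def attr_or_none(parent, selector, attribute):
    """Return the attribute of the first match inside parent, or None if there is no match."""
    element = parent.select_one(selector)
    return element.get(attribute) if element is not None else None


thumbnails = [attr_or_none(item, '.thumb', 'src') for item in soup.select('.snippet')]
print(thumbnails)
```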

After the data from the item is retrieved, it is appended to the brave_tab_news list:

brave_tab_news.append({
    'title': title,
    'link': link,
    'snippet': snippet,
    'time_published': time_published,
    'source': source,
    'favicon': favicon,
    'thumbnail': thumbnail
})

The complete function to scrape tab news would look like this:

def scrape_tab_news():
    del params['source']
    html = requests.get('https://search.brave.com/news', headers=headers, params=params, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    brave_tab_news = []

    for result in soup.select('.snippet'):
        title = result.select_one('.snippet-title').get_text()
        link = result.select_one('.result-header').get('href')
        snippet = result.select_one('.snippet-description').get_text().strip()
        time_published = result.select_one('.ml-5+ .text-gray').get_text()
        source = result.select_one('.netloc').get_text()
        favicon = result.select_one('.favicon').get('src')
        thumbnail = result.select_one('.thumb')
        thumbnail = thumbnail.get('src') if thumbnail else None

        brave_tab_news.append({
            'title': title,
            'link': link,
            'snippet': snippet,
            'time_published': time_published,
            'source': source,
            'favicon': favicon,
            'thumbnail': thumbnail
        })

    print(json.dumps(brave_tab_news, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "Dune Games All Have The Same Problem",
    "link": "https://www.msn.com/en-us/entertainment/gaming/dune-games-all-have-the-same-problem/ar-AA1364qI",
    "snippet": "Dune video games have managed to reflect the strategy of ruling Arrakis, but they've failed to capture the franchise’s most fascinating quality.",
    "time_published": "1 day ago",
    "source": "ScreenRant on MSN.com",
    "favicon": "https://imgs.search.brave.com/-8C0opPjysKHAWE2H2sJ4d6TC-jhlh7zWo326qw_QK4/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvNGJmODdhNGJk/YmYxY2RkMDU4YzNl/ZjY3OTUyMmNmMzlm/YjYyMmM4MDJlYmQ5/Yzg4ZjY2MzJiZDg4/MWEzYThkNi93d3cu/bXNuLmNvbS8",
    "thumbnail": "https://imgs.search.brave.com/JqEcp16LXMD8UqcFusDHrpixgLnI5EBQURQ9b02ox4U/rs:fit:1335:225:1/g:ce/aHR0cHM6Ly93d3cu/YmluZy5jb20vdGg_/aWQ9T1ZGVC4waktv/VGVKV21DY2VtNUhU/anJyb3NDJnBpZD1O/ZXdz"
  },
  {
    "title": "Dune: The Sisterhood to begin shooting in November as Indira Varma joins cast",
    "link": "https://www.flickeringmyth.com/2022/10/dune-the-sisterhood-to-begin-shooting-in-november-as-indira-varma-joins-cast/",
    "snippet": "As HBO Max and Legendary Television prepare to kick off production on Dune: The Sisterhood, Deadline is reporting that Indira Varma (Game of Thrones, Star Wars: Obi-Wan Kenobi) has joined the cast of the Dune spinoff television series.",
    "time_published": "4 hours ago",
    "source": "Flickeringmyth",
    "favicon": "https://imgs.search.brave.com/syftwTbOGwbuYrlw8LiSFZpqkyNYOzcn2zYsu9tP7g4/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvNzQzMTViMjdk/ZWUxNTc3MWY3N2Vi/ZDEwZWI1ODgzOTIy/YzMzYjE5ZGYxODdi/YTUzYzZlZTFkOWM1/M2RlNWI3Yi93d3cu/ZmxpY2tlcmluZ215/dGguY29tLw",
    "thumbnail": null
  },
  {
    "title": "Dune subreddit group bans AI-generated art for being ‘low effort’",
    "link": "https://www.theguardian.com/film/2022/oct/16/dune-subreddit-group-bans-ai-generated-art-for-being-low-effort?amp;amp;amp",
    "snippet": "Moderators of community devoted to sci-fi films and novels say they want to prioritise ‘human-made’ art",
    "time_published": "3 days ago",
    "source": "The Guardian",
    "favicon": "https://imgs.search.brave.com/9NJ5RrmLraV8oAt2-ItS_A5rM7MNWTBcXog1rbJwni0/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvNGRmYTNkMTZl/NmJhYTQwYmQ4NDRj/MzQ4NDZkNGQ0YTgy/ZWRlZDM4YWVkMzM4/NmM0Y2Y2NTgyMTQ5/NzQxOTExYy93d3cu/dGhlZ3VhcmRpYW4u/Y29tLw",
    "thumbnail": "https://imgs.search.brave.com/To--PX2Q-ovIhZzDlN07Go1mLvlaZmj7p6Nb3x-4dZU/rs:fit:1335:225:1/g:ce/aHR0cHM6Ly93d3cu/YmluZy5jb20vdGg_/aWQ9T1ZGVC5ZVUVJ/VzhYc1hxdDZmLTlR/OFl6S1NpJnBpZD1O/ZXdz"
  },
  ... other news
]

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
