Scrape Google Scholar Papers within a particular conference in Python
Dmitriy Zub ☀️
Posted on April 29, 2022
What will be scraped
How filtering works
To filter results, use the source: operator. For example, source:NIPS restricts search results to documents published by sources containing "NIPS" in their name.
This operator can be combined with the OR operator, i.e. source:NIPS OR source:"Neural Information". So the search query would become:
search terms source:NIPS OR source:"Neural Information"
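As a minimal sketch of how such a query could be assembled in code (the helper name build_scholar_query is hypothetical and not part of the script below), multi-word source names are wrapped in double quotes so that they stay a single phrase:

```python
def build_scholar_query(terms: str, sources: list) -> str:
    """Hypothetical helper: append source: filters to the search terms."""
    filters = []
    for name in sources:
        # Quote names containing a space so they are treated as one phrase.
        value = f'"{name}"' if " " in name else name
        filters.append(f"source:{value}")
    return f"{terms} {' OR '.join(filters)}".strip()


print(build_scholar_query("search terms", ["NIPS", "Neural Information"]))
# search terms source:NIPS OR source:"Neural Information"
```

With an empty list of sources, the helper simply returns the search terms unchanged.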
Prerequisites
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.
If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping. It covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.
Separate virtual environment
If you're on Linux:
python -m venv env && source env/bin/activate
If you're on Windows and using Git Bash:
python -m venv env && source env/Scripts/activate
In short, a virtual environment is an independent set of installed libraries, optionally with a different Python version, that can coexist with other environments on the same system, which prevents library or Python version conflicts.
If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.
📌Note: this is not a strict requirement for this blog post.
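If you want to confirm from Python that the environment is active, one common check (a small sketch, not part of the original script) compares sys.prefix with sys.base_prefix, which differ inside an environment created with python -m venv. The function name in_virtualenv is hypothetical:

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```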
Install libraries:
pip install requests parsel
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping; it covers eleven methods to bypass blocks from most websites.
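The linked post covers the actual methods. As one generic illustration only (an assumption, and not necessarily one of those methods), a failed request can be retried after progressively longer pauses. The helper name fetch_with_backoff and its parameters are hypothetical; fetch is any function you supply that returns the page text, or None when the request looks blocked:

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(fetch: Callable[[], Optional[str]],
                       retries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Hypothetical helper: call fetch() until it returns a value or retries run out."""
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries - 1:
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return None


# Simulated usage: the first two attempts fail, the third succeeds.
responses = iter([None, None, "<html>ok</html>"])
print(fetch_with_backoff(lambda: next(responses), base_delay=0.01))
```

In a real script, fetch could wrap the requests.get() call and return None for a non-200 status code.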
Full Code
from parsel import Selector
import requests, json


def check_sources(source: list or str):
    if isinstance(source, str):
        return source  # NIPS
    elif isinstance(source, list):
        return " OR ".join([f'source:{item}' for item in source])  # source:NIPS OR source:Neural Information


def scrape_conference_publications(query: str, source: list or str):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": f'{query.lower()} {check_sources(source=source)}',  # search query
        "hl": "en",                                              # language of the search
        "gl": "us"                                               # country of the search
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }

    html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)

    publications = []

    for result in selector.css(".gs_r.gs_scl"):
        title = result.css(".gs_rt").xpath("normalize-space()").get()
        link = result.css(".gs_rt a::attr(href)").get()
        result_id = result.attrib["data-cid"]
        snippet = result.css(".gs_rs::text").get()
        publication_info = result.css(".gs_a").xpath("normalize-space()").get()
        cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
        all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
        related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'
        pdf_file_title = result.css(".gs_or_ggsm a").xpath("normalize-space()").get()
        pdf_file_link = result.css(".gs_or_ggsm a::attr(href)").get()

        publications.append({
            "result_id": result_id,
            "title": title,
            "link": link,
            "snippet": snippet,
            "publication_info": publication_info,
            "cite_by_link": cite_by_link,
            "all_versions_link": all_versions_link,
            "related_articles_link": related_articles_link,
            "pdf": {
                "title": pdf_file_title,
                "link": pdf_file_link
            }
        })

    # return publications

    print(json.dumps(publications, indent=2, ensure_ascii=False))


scrape_conference_publications(query="anatomy", source=["NIPS", "Neural Information"])
Code Explanation
Add a function that accepts either a list of str or a str, and transforms it if it's a list:
def check_sources(source: list or str):
    if isinstance(source, str):
        return source  # NIPS
    elif isinstance(source, list):
        return " OR ".join([f'source:{item}' for item in source])  # source:NIPS OR source:Neural Information
Define a parse function:
def scrape_conference_publications(query: str, source: list or str):
    # further code...
Code | Explanation |
---|---|
query: str | is a type hint indicating that the query argument should be a string. Type hints are not enforced at runtime. |
source: list or str | is meant as a type hint indicating that the source argument should be a list or a string. |
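One detail worth checking: the expression list or str is evaluated as ordinary Python, and it yields list because a class object is truthy. A short sketch of this, with typing.Union as a more precise alternative (the function name join_sources is hypothetical and mirrors check_sources):

```python
from typing import List, Union

# `or` returns its first truthy operand, so this annotation is just `list`.
print((list or str) is list)  # True


def join_sources(source: Union[List[str], str]) -> str:
    # Same logic as check_sources, only with a more precise annotation.
    if isinstance(source, str):
        return source
    return " OR ".join(f"source:{item}" for item in source)


print(join_sources(["NIPS", "Neural Information"]))
# source:NIPS OR source:Neural Information
```

The script behaves the same either way; the difference only matters to type checkers and to readers of the signature.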
Create URL parameters and a user-agent, and pass them to a request:
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": f'{query.lower()} {check_sources(source=source)}',  # search query
        "hl": "en",                                              # language of the search
        "gl": "us"                                               # country of the search
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }

    html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)
Code | Explanation |
---|---|
timeout | tells requests to stop waiting for a response after 30 seconds. |
Selector | is an HTML/XML processor that parses data, like BeautifulSoup(). |
user-agent | makes the request look like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that a script is sending the request and might block it. Check what's your user-agent. |
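To see roughly what the params dictionary turns into, the standard library can build a comparable query string. This is a sketch that only approximates what requests does internally; the query value is the one produced by the call used in this post:

```python
from urllib.parse import urlencode, parse_qsl

params = {
    "q": "anatomy source:NIPS OR source:Neural Information",
    "hl": "en",
    "gl": "us",
}

# urlencode percent-encodes reserved characters and turns spaces into "+".
query_string = urlencode(params)
url = f"https://scholar.google.com/scholar?{query_string}"
print(url)

# Decoding the query string gives back the original key/value pairs.
decoded = dict(parse_qsl(query_string))
print(decoded == params)
```

Printing the URL this way can help when debugging why a search returns unexpected results.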
Create a temporary list to store the data, and iterate over organic results:
    publications = []

    for result in selector.css(".gs_r.gs_scl"):
        title = result.css(".gs_rt").xpath("normalize-space()").get()
        link = result.css(".gs_rt a::attr(href)").get()
        result_id = result.attrib["data-cid"]
        snippet = result.css(".gs_rs::text").get()
        publication_info = result.css(".gs_a").xpath("normalize-space()").get()
        cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
        all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
        related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'
        pdf_file_title = result.css(".gs_or_ggsm a").xpath("normalize-space()").get()
        pdf_file_link = result.css(".gs_or_ggsm a::attr(href)").get()
Code | Explanation |
---|---|
css(<selector>) | extracts data from a given CSS selector. In the background, parsel translates every CSS query into an XPath query using cssselect. |
xpath("normalize-space()") | returns the node's full text, including text inside nested tags, with whitespace normalized. With ::text and get(), only the first direct text node would be returned, resulting in incomplete output. |
::text / ::attr() | are parsel pseudo-elements that extract text or attribute data from the HTML node. |
get() | returns the first match as a string, or None if nothing matched. |
Append the results as a dictionary to the temporary list, then return or print the extracted data:
        publications.append({
            "result_id": result_id,
            "title": title,
            "link": link,
            "snippet": snippet,
            "publication_info": publication_info,
            "cite_by_link": cite_by_link,
            "all_versions_link": all_versions_link,
            "related_articles_link": related_articles_link,
            "pdf": {
                "title": pdf_file_title,
                "link": pdf_file_link
            }
        })

    # return publications

    print(json.dumps(publications, indent=2, ensure_ascii=False))


scrape_conference_publications(query="anatomy", source=["NIPS", "Neural Information"])
Outputs:
[
  {
    "result_id": "hjgaRkq_oOEJ",
    "title": "Differential representation of arm movement direction in relation to cortical anatomy and function",
    "link": "https://iopscience.iop.org/article/10.1088/1741-2560/6/1/016006/meta",
    "snippet": "… ",
    "publication_info": "T Ball, A Schulze-Bonhage, A Aertsen… - Journal of neural …, 2009 - iopscience.iop.org",
    "cite_by_link": "https://scholar.google.com/scholar/scholar?cites=16258204980532099206&as_sdt=2005&sciodt=0,5&hl=en",
    "all_versions_link": "https://scholar.google.com/scholar/scholar?cluster=16258204980532099206&hl=en&as_sdt=0,5",
    "related_articles_link": "https://scholar.google.com/scholar/scholar?q=related:hjgaRkq_oOEJ:scholar.google.com/&scioq=anatomy+source:NIPS+OR+source:Neural+Information&hl=en&as_sdt=0,5",
    "pdf": {
      "title": "[PDF] psu.edu",
      "link": "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.324.1523&rep=rep1&type=pdf"
    }
  }, ... other results
]
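Note that the cite_by_link, all_versions_link and related_articles_link values in the sample output contain /scholar/scholar, because the extracted href values already start with /scholar. One possible way to avoid this, and to avoid producing a string that ends in None when a link is missing, is urllib.parse.urljoin. The helper name absolute_link is hypothetical:

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://scholar.google.com"


def absolute_link(href: Optional[str]) -> Optional[str]:
    """Hypothetical helper: resolve a relative Scholar href, or keep None."""
    if href is None:
        return None
    return urljoin(BASE_URL, href)


print(absolute_link("/scholar?cites=16258204980532099206&as_sdt=2005&sciodt=0,5&hl=en"))
# https://scholar.google.com/scholar?cites=16258204980532099206&as_sdt=2005&sciodt=0,5&hl=en
print(absolute_link(None))
# None
```

In the scraper, the result of each .get() call for a relative link could be passed through such a helper instead of being interpolated into an f-string.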
Alternatively, you can achieve the same result using the Google Scholar Organic Results API from SerpApi.
The biggest difference is that you don't need to create a parser from scratch, maintain it, or figure out how to scale it. Most importantly, you don't need to work out how to bypass blocks from Google, which would otherwise mean setting up proxies and CAPTCHA-solving solutions.
# pip install google-search-results

import os, json
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),  # your serpapi API key
    "engine": "google_scholar",       # search engine
    "q": "AI source:NIPS",            # search query
    "hl": "en",                       # language
    # "as_ylo": "2017",               # from 2017
    # "as_yhi": "2021",               # to 2021
    "start": "0"                      # first page
}

search = GoogleSearch(params)

publications = []

publications_is_present = True
while publications_is_present:
    results = search.get_dict()

    print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

    for result in results["organic_results"]:
        position = result["position"]
        title = result["title"]
        publication_info_summary = result["publication_info"]["summary"]
        result_id = result["result_id"]
        link = result.get("link")
        result_type = result.get("type")
        snippet = result.get("snippet")

        publications.append({
            "page_number": results.get("serpapi_pagination", {}).get("current"),
            "position": position + 1,
            "result_type": result_type,
            "title": title,
            "link": link,
            "result_id": result_id,
            "publication_info_summary": publication_info_summary,
            "snippet": snippet,
        })

    if "next" in results.get("serpapi_pagination", {}):
        # split the next-page URL into a dict of parameters and pass it to GoogleSearch()
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        # no next page: stop the loop
        publications_is_present = False

print(json.dumps(publications, indent=2, ensure_ascii=False))
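The pagination step relies on turning the query string of the next-page URL into a dictionary. A standalone sketch of that step, using a made-up URL purely for illustration (a real one comes from results["serpapi_pagination"]["next"]):

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next-page URL; the real value is returned by the API.
next_url = "https://example.com/search?engine=google_scholar&q=AI+source%3ANIPS&hl=en&start=10"

# urlsplit isolates the query string, parse_qsl decodes it into (key, value) pairs.
next_params = dict(parse_qsl(urlsplit(next_url).query))
print(next_params)
# {'engine': 'google_scholar', 'q': 'AI source:NIPS', 'hl': 'en', 'start': '10'}

# Updating the existing parameters moves the search to the next page.
params = {"engine": "google_scholar", "q": "AI source:NIPS", "hl": "en", "start": "0"}
params.update(next_params)
print(params["start"])  # 10
```

Because the dictionary is updated in place, parameters that are absent from the next-page URL, such as the API key, keep their previous values.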
Links
Add a Feature Request💫 or a Bug🐞