Web Scraping all Questions from ResearchGate Search in Python
Dmitriy Zub
Posted on September 14, 2022
Intro
Currently, we don't have an API that supports extracting data from the ResearchGate Search Questions page.
This blog post shows how you can do it yourself with the DIY solution below while we work on releasing a proper API.
The solution is suitable for personal use: it doesn't include the Legal US Shield that we offer with our paid Production and above plans, and it has limitations, such as the need to bypass blocks (for example, CAPTCHA).
You can check our public roadmap to track the progress of this API:
[New API] ResearchGate Search: Publications, Authors, Questions
What will be scraped
Prerequisites
Basic knowledge of scraping with CSS selectors
If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web scraping. It covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.
Separate virtual environment
If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry to get familiar.
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping, which covers eleven methods to bypass blocks from most websites.
Note: this is not a strict requirement for this blog post.
Install libraries and the Chromium browser that playwright drives:
pip install parsel playwright
playwright install chromium
Full Code
from parsel import Selector
from playwright.sync_api import sync_playwright
import json


def scrape_researchgate_questions(query: str):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, slow_mo=50)
        page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

        questions = []
        page_num = 1

        while True:
            page.goto(f"https://www.researchgate.net/search/question?q={query}&page={page_num}")
            selector = Selector(text=page.content())

            for question in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
                title = question.css(".nova-legacy-v-question-item__title .nova-legacy-e-link--theme-bare::text").get().title()
                # the href is relative (post/...), so a slash is needed after the domain
                title_link = f'https://www.researchgate.net/{question.css(".nova-legacy-v-question-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
                question_type = question.css(".nova-legacy-v-question-item__badge::text").get()
                question_date = question.css(".nova-legacy-v-question-item__meta-data-item:nth-child(1) span::text").get()
                snippet = question.css(".redraft-text::text").get()
                views = question.css(".nova-legacy-v-question-item__metrics-item:nth-child(1) .nova-legacy-e-link--theme-bare::text").get()
                views_link = question.css(".nova-legacy-v-question-item__metrics-item:nth-child(1) .nova-legacy-e-link--theme-bare::attr(href)").get()
                answer = question.css(".nova-legacy-v-question-item__metrics-item+ .nova-legacy-v-question-item__metrics-item .nova-legacy-e-link--theme-bare::text").get()
                answer_link = question.css(".nova-legacy-v-question-item__metrics-item+ .nova-legacy-v-question-item__metrics-item .nova-legacy-e-link--theme-bare::attr(href)").get()

                questions.append({
                    "title": title,
                    "link": title_link,
                    "snippet": snippet,
                    "question_type": question_type,
                    "question_date": question_date,
                    "views": {
                        "views_count": views,
                        "views_link": views_link
                    },
                    "answer": {
                        "answer_count": answer,
                        "answers_link": answer_link
                    }
                })

            print(f"page number: {page_num}")

            # checks if next page arrow key is greyed out `attr(rel)` (inactive) and breaks out of the loop
            if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
                break
            else:
                page_num += 1

        print(json.dumps(questions, indent=2, ensure_ascii=False))

        browser.close()


scrape_researchgate_questions(query="coffee")
Code explanation
Import libraries:
from parsel import Selector
from playwright.sync_api import sync_playwright
import json
Code | Explanation |
---|---|
parsel | to parse HTML/XML documents. Supports XPath. |
playwright | to render the page with a browser instance. |
json | to convert Python dictionary to JSON string. |
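As a small illustration of the `json` step, the sketch below serializes a made-up record (the dictionary contents are an assumption, not scraped data) and shows what `ensure_ascii=False` changes in the output:

```python
import json

# Hypothetical record, shaped loosely like one scraped question.
record = {"title": "Café au lait: pH?", "views": {"views_count": "97 Views"}}

# ensure_ascii=False keeps non-ASCII characters readable
# instead of replacing them with \uXXXX escape sequences.
readable = json.dumps(record, indent=2, ensure_ascii=False)
escaped = json.dumps(record, indent=2)

print(readable)
print(escaped)
```

Both strings load back into the same dictionary; the flag only affects how the text looks when printed or saved.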
Define a function and open playwright with a context manager:
def scrape_researchgate_questions(query: str):
    with sync_playwright() as p:
        # ...
Code | Explanation |
---|---|
query: str | a type hint that tells the reader (and tools such as type checkers) that query is expected to be a str. |
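Worth keeping in mind: a hint such as `query: str` is documentation, and Python does not enforce it at runtime. A minimal sketch with a hypothetical helper (`build_label` is not part of the scraper):

```python
def build_label(query: str) -> str:
    # The annotation documents the expected type;
    # Python itself does not check it when the function is called.
    return f"searching for: {query}"


print(build_label("coffee"))
print(build_label(42))  # still runs, because hints are not enforced
print(build_label.__annotations__)
```

If stricter behavior is needed, a static type checker or an explicit `isinstance` check would have to be added separately.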
Launch a browser instance and open a new_page with a passed user-agent:
browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")
Code | Explanation |
---|---|
p.chromium.launch() | to launch a Chromium browser instance. |
headless | to explicitly tell playwright to run in headless mode, even though it's the default value. |
slow_mo | to tell playwright to slow down execution by the given number of milliseconds. |
browser.new_page() | to open a new page. user_agent makes the request look like it comes from a real user's browser. If it isn't passed, the parameter defaults to None and playwright's own default user-agent is sent. Check what's your user-agent. |
Add a temporary list, set the starting page number, set up a while loop, and open a new URL:
questions = []
page_num = 1
while True:
    page.goto(f"https://www.researchgate.net/search/question?q={query}&page={page_num}")
    selector = Selector(text=page.content())
    # ...
Code | Explanation |
---|---|
goto() | to make a request to a specific URL with the passed query and page parameters. |
Selector() | to pass the returned HTML data with page.content() and process it. |
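The search URL is built with an f-string, which works for a single word such as "coffee". For queries with spaces or reserved characters, one option is to encode the parameters with the standard library. This is a sketch only: `build_search_url` and `BASE_URL` are hypothetical names, not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.researchgate.net/search/question"


def build_search_url(query: str, page_num: int) -> str:
    # urlencode escapes spaces and reserved characters such as "&" in the query.
    return f"{BASE_URL}?{urlencode({'q': query, 'page': page_num})}"


print(build_search_url("coffee grinder", 2))
print(build_search_url("pH & bark", 1))
```

The result could be passed to `page.goto()` in place of the hand-built f-string.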
Iterate over question results on each page, extract the data, and append it to a temporary list:
for question in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
    title = question.css(".nova-legacy-v-question-item__title .nova-legacy-e-link--theme-bare::text").get().title()
    # the href is relative (post/...), so a slash is needed after the domain
    title_link = f'https://www.researchgate.net/{question.css(".nova-legacy-v-question-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
    question_type = question.css(".nova-legacy-v-question-item__badge::text").get()
    question_date = question.css(".nova-legacy-v-question-item__meta-data-item:nth-child(1) span::text").get()
    snippet = question.css(".redraft-text::text").get()
    views = question.css(".nova-legacy-v-question-item__metrics-item:nth-child(1) .nova-legacy-e-link--theme-bare::text").get()
    views_link = question.css(".nova-legacy-v-question-item__metrics-item:nth-child(1) .nova-legacy-e-link--theme-bare::attr(href)").get()
    answer = question.css(".nova-legacy-v-question-item__metrics-item+ .nova-legacy-v-question-item__metrics-item .nova-legacy-e-link--theme-bare::text").get()
    answer_link = question.css(".nova-legacy-v-question-item__metrics-item+ .nova-legacy-v-question-item__metrics-item .nova-legacy-e-link--theme-bare::attr(href)").get()

    questions.append({
        "title": title,
        "link": title_link,
        "snippet": snippet,
        "question_type": question_type,
        "question_date": question_date,
        "views": {
            "views_count": views,
            "views_link": views_link
        },
        "answer": {
            "answer_count": answer,
            "answers_link": answer_link
        }
    })
Code | Explanation |
---|---|
css() | to parse data from the passed CSS selector(s). Every CSS query is translated to XPath using the cssselect package under the hood. |
::text / ::attr(attribute) | to extract textual or attribute data from the node. |
get() / getall() | to get actual data from the first matched node, or to get a list of data from all matched nodes. |
xpath("normalize-space()") | to parse blank text nodes as well. By default, blank text nodes are skipped by XPath. |
Check if the next page is present and paginate:
# checks if the next page arrow key is greyed out `attr(rel)` (inactive) -> breaks out of the loop
if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
    break
else:
    page_num += 1
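Because the loop relies on a single selector to stop, it may keep running if the page layout changes or a block page is returned. The sketch below shows the same stop-on-last-page idea with a safety cap, using a fake page source instead of the real site. `collect_pages`, `fetch_page`, `max_pages`, and `fake_fetch` are all hypothetical names introduced for illustration:

```python
from typing import Callable, Dict, List


def collect_pages(fetch_page: Callable[[int], Dict], max_pages: int = 50) -> List[str]:
    """Collect items page by page until the last page or the cap is reached."""
    items: List[str] = []
    for page_num in range(1, max_pages + 1):
        page = fetch_page(page_num)
        items.extend(page["items"])
        # Stop on the last page, or when a page comes back empty (e.g. a block page).
        if page["is_last"] or not page["items"]:
            break
    return items


# Fake three-page source standing in for the real site.
def fake_fetch(page_num: int) -> Dict:
    return {"items": [f"question {page_num}"], "is_last": page_num == 3}


print(collect_pages(fake_fetch))
```

In the scraper, `is_last` would correspond to the `attr(rel)` check on the next-page arrow, and an empty `items` list to a page where no question cards were found.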
Print extracted data, and close the browser instance:
print(json.dumps(questions, indent=2, ensure_ascii=False))
browser.close()
# call the function
scrape_researchgate_questions(query="coffee")
Part of the JSON output:
[
  {
    "title": "Any Recommendations On An Inexpensive Coffee Grinder To Grind Up Bark Samples To Measure Ph?",
    "link": "https://www.researchgate.net/post/Any_recommendations_on_an_inexpensive_coffee_grinder_to_grind_up_bark_samples_to_measure_pH?_sg=tsmZvLsXrFpn6TG77ljxS8pVJhdOMYVlqqYhQl0BszqPCDW1__lnpczwZl8XJiVROJ8_8G8jaerzpX8",
    "snippet": "We are folloiwng protocol by Hansen et al. (2015) Sci. Pharm. They recommend a Rancilio coffee grinder but these are several hundred dollars. Hoping to use something a little less expensive.",
    "question_type": "Question",
    "question_date": "Oct 2017",
    "views": {
      "views_count": "97 Views",
      "views_link": "post/Any_recommendations_on_an_inexpensive_coffee_grinder_to_grind_up_bark_samples_to_measure_pH?_sg=tsmZvLsXrFpn6TG77ljxS8pVJhdOMYVlqqYhQl0BszqPCDW1__lnpczwZl8XJiVROJ8_8G8jaerzpX8"
    },
    "answer": {
      "answer_count": "2 Answers",
      "answers_link": "https://www.researchgate.net/post/Any_recommendations_on_an_inexpensive_coffee_grinder_to_grind_up_bark_samples_to_measure_pH?_sg=tsmZvLsXrFpn6TG77ljxS8pVJhdOMYVlqqYhQl0BszqPCDW1__lnpczwZl8XJiVROJ8_8G8jaerzpX8"
    }
  }, ... other questions
  {
    "title": "Are There Any Ways To Find The Concentration Of A Solution Where Its Chemical Formula And Number Of Moles Are Unknown? ",
    "link": "https://www.researchgate.net/post/Are_there_any_ways_to_find_the_concentration_of_a_solution_where_its_chemical_formula_and_number_of_moles_are_unknown?_sg=6W-hvIYx-FRel_YiWd62lbksTzeWP7GVkZ3tVO6SgZI7F_czhLz_oFCduq9DVhrhvIUy97168wXrn30",
    "snippet": "A comprehensive way to find the concentration of random solutions would enhance benefits related with health, industry, technology and commercial aspects. Although beer lambert law is a solution, there are some cases where Epsilon is unknown (Example: A Coca-Cola drink or a cup of coffee). In this cases, proper aβlβtβ",
    "question_type": "Question",
    "question_date": "Jan 2022",
    "views": {
      "views_count": "742 Views",
      "views_link": "post/Are_there_any_ways_to_find_the_concentration_of_a_solution_where_its_chemical_formula_and_number_of_moles_are_unknown?_sg=6W-hvIYx-FRel_YiWd62lbksTzeWP7GVkZ3tVO6SgZI7F_czhLz_oFCduq9DVhrhvIUy97168wXrn30"
    },
    "answer": {
      "answer_count": "4 Answers",
      "answers_link": "https://www.researchgate.net/post/Are_there_any_ways_to_find_the_concentration_of_a_solution_where_its_chemical_formula_and_number_of_moles_are_unknown?_sg=6W-hvIYx-FRel_YiWd62lbksTzeWP7GVkZ3tVO6SgZI7F_czhLz_oFCduq9DVhrhvIUy97168wXrn30"
    }
  }
]
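The counts in the output are strings such as "97 Views", and some links are relative (`post/...`). If numeric counts and absolute URLs are needed downstream, a post-processing step along these lines might help. This is a sketch under assumptions: `parse_count`, `absolute_link`, and `BASE_URL` are hypothetical names, and the comma-separated form ("1,204 Views") is a guess at how larger numbers could appear.

```python
import re
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.researchgate.net/"


def parse_count(text: Optional[str]) -> Optional[int]:
    """Turn strings such as '97 Views' into an int; return None if no number is found."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def absolute_link(href: Optional[str]) -> Optional[str]:
    """Resolve a relative href such as 'post/...' against the site root."""
    return urljoin(BASE_URL, href) if href else None


print(parse_count("97 Views"))
print(absolute_link("post/Example_question"))
```

`urljoin` inserts the separating slash itself, which avoids hand-concatenating the domain and the path.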