Scrape Google Scholar Metrics Results to CSV with Python

Dmitriy Zub ☀️

Posted on March 30, 2022

What will be scraped

(Screenshots in the original post: the Google Scholar Metrics top publications table and the public access mandates table.)

๐Ÿ“ŒNote: you have an option to save CSV file from public access mandates but there will be no funder link. This blog post shows how to scrape funder link.

If you don't need an explanation, jump straight to the full code blocks in the sections below.


Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which parts of the markup a style applies to, which also makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors before, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
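
As a quick illustration (a toy example, not from the original post), BeautifulSoup's select() takes a CSS selector and returns every matching tag:

from bs4 import BeautifulSoup

html = "<table><tr><td class='title'><a href='/paper-1'>Paper 1</a></td></tr></table>"
soup = BeautifulSoup(html, "lxml")

for td in soup.select("td.title"):        # every <td> with class="title"
    print(td.a["href"], td.get_text())    # -> /paper-1 Paper 1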

Separate virtual environment

A virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, preventing library or Python version conflicts when working on multiple projects at the same time.

If you haven't worked with a virtual environment before, have a look at the dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post of mine to get familiar.

๐Ÿ“ŒNote: using virtual environment is not a strict requirement.

Libraries:

pip install requests lxml beautifulsoup4 pandas

Reducing the chance of being blocked

There's a chance that a request might be blocked. Have a look at my blog post on how to reduce the chance of being blocked while web-scraping; it covers eleven methods for bypassing blocks on most websites.
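
One of the simplest mitigations, also used in the code below, is to send a real user-agent header; you can go a step further and rotate between several. A minimal sketch (the user-agent pool below is just illustrative, not from the original post):

import random
import requests

# small pool of desktop user-agent strings (illustrative examples)
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36",
]

# pick a user-agent at random for each request
headers = {"user-agent": random.choice(user_agents)}
html = requests.get("https://scholar.google.com/citations",
                    params={"view_op": "top_venues", "hl": "en"},
                    headers=headers,
                    timeout=30)
print(html.status_code)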


Scrape all Google Scholar Metrics Top Publications

import requests, lxml
from bs4 import BeautifulSoup
import pandas as pd


def scrape_all_metrics_top_publications():

    params = {
        "view_op": "top_venues",  # top publications results
        "hl": "en"  # or other lang: pt, sp, de, ru, fr, ja, ko, pl, uk, id
        }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.88 Safari/537.36"
        }

    html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml").find("table")

    df = pd.DataFrame(pd.read_html(str(soup))[0])
    df.drop(df.columns[0], axis=1, inplace=True)
    df.insert(loc=2,
              column="h5-index link",
              value=[f'https://scholar.google.com/{link.a["href"]}' for link in soup.select(".gsc_mvt_t+ td")])

    df.to_csv("google_scholar_metrics_top_publications.csv", index=False)

    # save to csv for specific language
    # df.to_csv(f"google_scholar_metrics_top_publications_lang_{params['hl']}.csv", index=False)

scrape_all_metrics_top_publications()

Create search query parameters:

params = {
    "view_op": "top_venues",  # top publications results
    "hl": "en"                # language:
                              # pt - Portuguese
                              # es - Spanish
                              # de - German
                              # ru - Russian
                              # fr - French
                              # ja - Japanese
                              # ko - Korean
                              # pl - Polish
                              # uk - Ukrainian
                              # id - Indonesian
}

Pass the search query params to the request and find() the <table> via BeautifulSoup():

html = requests.get("https://scholar.google.com/citations", params=params)
soup = BeautifulSoup(html.text, "lxml").find("table")

You can scrape the table data with pandas alone, without parsing it with BeautifulSoup() first, but then you won't be able to save the links from the table.

Parsing the table with BeautifulSoup() first and passing it to pandas read_html() lets you scrape the link data as well.
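
For comparison, a pandas-only version would look roughly like this; it parses the same table but keeps only the cell text, so the h5-index links are lost (a sketch reusing the params and headers from the full code above):

# pandas-only: read_html() keeps cell text but drops the <a> links
html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
tables = pd.read_html(html.text)  # list of DataFrames, one per <table> on the page
print(tables[0].head())           # publication titles and h5 metrics, but no URLs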

Call read_html(), access the table data [0] from the soup, and create a DataFrame:

df = pd.DataFrame(pd.read_html(str(soup))[0])

Drop the unnecessary "Unnamed" numeration column:

df.drop(df.columns[0], axis=1, inplace=True)
Code Explanation
df.columns[0] is the first column in the table, in this case the "Unnamed" column.
axis=1 deletes a column instead of a row.
inplace=True applies the operation to the existing DataFrame without having to reassign it to a new variable.
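
If you prefer to avoid inplace=True, an equivalent reassignment form works as well (a minor stylistic alternative, not from the original post):

# same result, but returns a new DataFrame instead of modifying in place
df = df.drop(df.columns[0], axis=1)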

Insert a new column and add extracted links:

df.insert(loc=2,
          column="h5-index link",
          value=[f'https://scholar.google.com/{link.a["href"]}' for link in soup.select(".gsc_mvt_t+ td")])
Code Explanation
loc=2 is the position where the new column will be inserted.
column= is the name of the new column.
value= is the list of extracted values.
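
Before inserting, it can help to print what the CSS selector actually extracts, for example the first few links (an assumed debugging step, not part of the original code):

links = [f'https://scholar.google.com/{link.a["href"]}' for link in soup.select(".gsc_mvt_t+ td")]
print(len(links))   # should match the number of table rows
print(links[:3])    # links for the first three publications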

Save to_csv():

df.to_csv("google_scholar_metrics_top_publications.csv", index=False)

index=False to drop default pandas row numbers.
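
To double-check the result, you can read the saved file back (a quick verification step, not from the original post):

check = pd.read_csv("google_scholar_metrics_top_publications.csv")
print(check.head())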

Save to_csv() for a specific language:

df.to_csv(f"google_scholar_metrics_top_publications_lang_{params['hl']}.csv", index=False)

params['hl'] is the language code that was passed in the search query params, so each language gets its own file.
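
If you want the table in several languages at once, you could wrap the request in a loop over language codes (a sketch, not in the original post; it reuses params and headers from the full code above):

for lang in ["en", "pt", "es", "de", "fr"]:
    params["hl"] = lang
    html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml").find("table")
    df = pd.DataFrame(pd.read_html(str(soup))[0])
    df.drop(df.columns[0], axis=1, inplace=True)
    df.to_csv(f"google_scholar_metrics_top_publications_lang_{lang}.csv", index=False)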


Scrape all Google Scholar Metrics Public Access Mandates

import requests, lxml
from bs4 import BeautifulSoup
import pandas as pd


def scrape_all_metrics_public_mandates():
    params = {
        "view_op": "mandates_leaderboard",  # public access mandates results
        "hl": "en"  # or other lang: pt, sp, de, ru, fr, ja, ko, pl, uk, id
        }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.88 Safari/537.36"
        }

    html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml").find("table")

    df = pd.DataFrame(pd.read_html(str(soup))[0])
    df.drop(df.columns[[0, 2]], axis=1, inplace=True)
    df.insert(loc=1, column="Funder Link", value=[link.a["href"] for link in soup.select("td.gsc_mlt_t")])

    df.to_csv("google_scholar_metrics_public_access_mandates.csv", index=False)

    # save to csv for specific language
    # df.to_csv(f"google_scholar_metrics_public_access_mandates_lang_{params['hl']}.csv", index=False)


scrape_all_metrics_public_mandates()

Create search query parameters:

params = {
    "view_op": "mandates_leaderboard",  # public access mandates results
    "hl": "en"  # or other lang: pt, sp, de, ru, fr, ja, ko, pl, uk, id
}

Pass the search query params, make a request, and find() the <table> via BeautifulSoup():

html = requests.get("https://scholar.google.com/citations", params=params)
soup = BeautifulSoup(html.text, "lxml").find("table")

Call read_html(), access the table data [0], and create a DataFrame:

df = pd.DataFrame(pd.read_html(str(soup))[0])

Drop the two unnecessary columns, "Unnamed" and "Available":

df.drop(df.columns[[0, 2]], axis=1, inplace=True)
Code Explanation
df.columns[[0, 2]] are the first and third columns in the table, in this case the "Unnamed" and "Available" columns.
axis=1 deletes columns instead of rows.
inplace=True applies the operation to the existing DataFrame without having to reassign it to a new variable.

Insert a new column and add extracted links:

df.insert(loc=1, 
          column="Funder Link", 
          value=[link.a["href"] for link in soup.select("td.gsc_mlt_t")])
Code Explanation
loc=1 is the position where the new column will be inserted.
column= is the name of the new column.
value= is the list of extracted values.
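
One thing worth checking before df.insert(): the list passed to value= must have exactly as many items as the DataFrame has rows, otherwise pandas raises a ValueError. A small sanity check you could add (an assumed addition, not in the original code):

funder_links = [link.a["href"] for link in soup.select("td.gsc_mlt_t")]
assert len(funder_links) == len(df), f"{len(funder_links)} links vs {len(df)} rows"
df.insert(loc=1, column="Funder Link", value=funder_links)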

Save to_csv():

df.to_csv("google_scholar_metrics_public_access_mandates.csv", index=False)

index=False to drop default pandas row numbers.

Save to_csv() for a specific language:

df.to_csv(f"google_scholar_metrics_public_access_mandates_lang_{params['hl']}.csv", index=False)

params['hl'] is the language code that was passed in the search query params, so each language gets its own file.



Outro

If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at @dimitryzub, or @serp_api.

Yours,
Dimitry, and the rest of SerpApi Team.


Join us on Reddit | Twitter | YouTube

Add a Feature Request๐Ÿ’ซ or a Bug๐Ÿž

๐Ÿ’– ๐Ÿ’ช ๐Ÿ™… ๐Ÿšฉ
dmitryzub
Dmitriy Zub โ˜€๏ธ

Posted on March 30, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

ยฉ TheLazy.dev

About