Scrape GitHub User Details with Python

fredysomy

Fredy Somy

Posted on December 13, 2020

When I was learning web scraping, one of the ideas that came to mind was a GitHub scraper.
Here I will try my best to describe each step.

Let's start.

We have to install a couple of packages first.

  • beautifulsoup4 (Beautiful Soup)
  • requests
  • html5lib
pip install requests
pip install html5lib
pip install beautifulsoup4

  • Then open https://github.com/yourusername
  • Open DevTools.
  • Inspect the profile page in DevTools to find the elements that hold the details you want.
  • To scrape a page, we need each element's id, class name, or XPath so we can locate it.

  • We will be scraping the name, username, number of repos, followers, following, and profile image.
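
Locating an element by id or class name can be sketched with BeautifulSoup as follows. The markup is a made-up fragment, not GitHub's actual HTML, and "html.parser" is Python's built-in parser, used here only to keep the sketch small. Note that BeautifulSoup itself does not evaluate XPath expressions; a library such as lxml is needed for that.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only to show the different lookup styles.
html = '<div id="profile"><span class="name big">Fredy</span></div>'
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="profile")                  # look up by id
by_class = soup.find("span", class_="name")      # look up by a single class name
by_css = soup.select_one("#profile span.name")   # the same lookup as a CSS selector

print(by_class.get_text())  # Fredy
```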

import requests
from bs4 import BeautifulSoup
import html5lib
  • Import the modules.

r=requests.get("https://github.com/fredysomy")
soup=BeautifulSoup(r.content,'html5lib')
  • Make a request to the website.
  • Parse the HTML received in r.content using BeautifulSoup and html5lib.
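
Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the request and parse steps in a function; the name fetch_profile_soup and its session, parser, and timeout parameters are illustrative choices, not part of the original script.

```python
import requests
from bs4 import BeautifulSoup


def fetch_profile_soup(username, session=None, parser="html5lib", timeout=10):
    """Download a GitHub profile page and return it as a BeautifulSoup tree."""
    # `session` is optional (assumption): any object with a requests-style .get() works.
    http = requests if session is None else session
    response = http.get(f"https://github.com/{username}", timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, parser)
```

Calling fetch_profile_soup("fredysomy") should give the same kind of soup object used in the rest of this post, while an unknown username would raise requests.HTTPError.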

  • From here on, we start scraping.


namediv=soup.find("h1" ,class_="vcard-names pl-2 pl-md-0")
name=namediv.find_all('span')[0].getText()
u_name=namediv.find_all('span')[1].getText()
  • Here we find the h1 element whose class attribute is "vcard-names pl-2 pl-md-0" and store it in the namediv variable.
  • The name and username are in span elements inside that h1.
  • We find all the span elements, pick index 0 (name) and index 1 (username), and read their text with the getText() function.
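
When class_ is given a string containing several class names, BeautifulSoup compares it against the exact value of the class attribute, so the lookup stops matching if the page adds or reorders classes. A more tolerant variant, sketched below under the assumption that the header keeps the vcard-names class, uses a CSS selector and returns None rather than raising when something is missing. The sample markup and the extract_names helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the profile header markup.
SAMPLE_HTML = """
<h1 class="vcard-names pl-2 pl-md-0">
  <span>Fredy Somy</span>
  <span>fredysomy</span>
</h1>
"""


def extract_names(soup):
    """Return (name, username); either is None when the markup is missing."""
    # select_one matches on one class, so extra or reordered classes do not matter.
    header = soup.select_one("h1.vcard-names")
    if header is None:
        return None, None
    spans = header.find_all("span")
    name = spans[0].get_text(strip=True) if len(spans) > 0 else None
    username = spans[1].get_text(strip=True) if len(spans) > 1 else None
    return name, username


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_names(soup))  # ('Fredy Somy', 'fredysomy')
```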

statstab=soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements=statstab.find(class_="mb-3")
followers=elements.find_all('a')[0].find('span').getText().strip(' ')
following=elements.find_all('a')[1].find('span').getText().strip(' ')
totstars=elements.find_all('a')[2].find('span').getText().strip(' ')
  • The same approach works here.
  • Followers, following, and stars are inside the element with class "flex-order-1 flex-md-order-none mt-2 mt-md-0", in the mb-3 element nested within it.

  • We find that inner element and store it in the elements variable.

  • Calling find_all('a') on elements returns a list of links.

    • Followers is at index 0
    • Following is at index 1
    • Stars is at index 2
elements.find_all('a')[2].find('span').getText().strip(' ')
  • Here we take the item at index 2 (the third a element) and call getText() on the span inside it. We use strip(' ') to remove unnecessary leading and trailing spaces from the result.
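
Note that strip(' ') removes only space characters, while strip() with no argument also removes newlines and tabs. The scraped values are still strings, and large counts may be displayed in abbreviated form such as 1.2k. A small helper for turning them into integers could look like this; parse_count is a hypothetical name, and the k/m suffix handling is an assumption about how the counts are displayed.

```python
def parse_count(text):
    """Convert a counter string such as ' 1.2k ' or '1,024' to an int."""
    # strip() with no argument removes spaces, newlines, and tabs.
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against floating-point results like 2299.9999.
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


print(parse_count("\n  1.2k "))  # 1200
```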
u_img=soup.find(class_="avatar avatar-user width-full border bg-white")['src']
  • The above code finds the img tag and reads its src attribute.
repo_num=soup.find(class_="UnderlineNav-body").find('span',class_="Counter").getText()
  • Here we get the number of repos the user has.

  • That is all you need to scrape user details with Python.

    Source Code

import requests
from bs4 import BeautifulSoup
import html5lib  # parser backend used by BeautifulSoup below

# Download and parse the profile page
r = requests.get("https://github.com/fredysomy")
soup = BeautifulSoup(r.content, 'html5lib')

# Name and username
namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
name = namediv.find_all('span')[0].getText()
u_name = namediv.find_all('span')[1].getText()

# Followers, following, and stars
statstab = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements = statstab.find(class_="mb-3")
followers = elements.find_all('a')[0].find('span').getText().strip(' ')
following = elements.find_all('a')[1].find('span').getText().strip(' ')
totstars = elements.find_all('a')[2].find('span').getText().strip(' ')

# Profile image URL
u_img = soup.find(class_="avatar avatar-user width-full border bg-white")['src']

# Number of repositories
repo_num = soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText()
  • The idea is to make the program navigate to the element we want and then select the required data from it.
  • See the BeautifulSoup documentation for more methods.

  • I have also made a PyPI module to scrape GitHub. Check it out and give it a star if you like it.

If you have any doubts or need clarification, comment down below.

Stay tuned for part 2 where we will scrape the user repo details.
