Scrape GitHub User Details with Python
Fredy Somy
Posted on December 13, 2020
When I was learning web scraping, one of the ideas that came to mind was a GitHub scraper.
Here I will try my best to describe each step.
Let's start.
We have to install a few packages first.
- BeautifulSoup
- requests
- html5lib
pip install requests
pip install html5lib
pip install beautifulsoup4
- Then open your GitHub profile in the browser.
- Open DevTools.
- This is what I see when I open my dashboard and DevTools.
To scrape a web page, we need the id, class name, or XPath of each element we want.
We will be scraping the name, username, number of repos, followers, following, and profile image.
import requests
from bs4 import BeautifulSoup
import html5lib
- Import the modules.
- Make a request to the website.
- Parse the HTML received in the response into soup using BeautifulSoup and html5lib. From here we start scraping.
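The request and parse steps described above can be sketched roughly as follows. The function names fetch_profile_html and make_soup are hypothetical helpers made up for this sketch, and octocat is only an example username. It assumes the profile page lives at https://github.com/<username>.

```python
import requests
from bs4 import BeautifulSoup


def fetch_profile_html(username, timeout=10):
    """Download the public profile page for a GitHub username (hypothetical helper)."""
    response = requests.get(f"https://github.com/{username}", timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def make_soup(html, parser="html5lib"):
    """Parse an HTML string into a BeautifulSoup tree (hypothetical helper)."""
    return BeautifulSoup(html, parser)


# Example usage (needs network access and the html5lib package):
# soup = make_soup(fetch_profile_html("octocat"))
# print(soup.title.getText())
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing code on a saved HTML string without hitting the site every time.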
namediv=soup.find("h1" ,class_="vcard-names pl-2 pl-md-0")
- Here we are getting the h1 element with the class name
vcard-names pl-2 pl-md-0
- We have assigned that element to the namediv variable.
- The name and username are in span elements inside it.
- We find all the span elements, select by index (0: name, 1: username), and get the text using the getText() function.
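Here is a small sketch of the name and username step, run against a made-up HTML snippet instead of a live page. The sample markup, the sample values, and the extract_names function are assumptions for illustration; only the h1 class name comes from the article. It uses the built-in html.parser so the sample runs without extra packages.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the part of the profile page that holds the names.
SAMPLE_HTML = """
<h1 class="vcard-names pl-2 pl-md-0">
  <span>Ada Lovelace</span>
  <span>ada</span>
</h1>
"""


def extract_names(soup):
    """Return (name, username), or (None, None) if the markup is not found."""
    namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
    if namediv is None:
        return None, None
    spans = namediv.find_all("span")
    if len(spans) < 2:
        return None, None
    # Index 0 holds the display name, index 1 holds the username.
    return spans[0].getText().strip(), spans[1].getText().strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
name, username = extract_names(soup)
print(name, username)
```

The None checks are there because soup.find returns None when nothing matches, which would otherwise raise an AttributeError on the next call.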
statstab=soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
# The mb-3 element inside statstab holds the followers/following/stars links.
elements=statstab.find(class_="mb-3")
followers=elements.find_all('a')[0].find('span').getText().strip(' ')
following=elements.find_all('a')[1].find('span').getText().strip(' ')
totstars=elements.find_all('a')[2].find('span').getText().strip(' ')
- Here the same thing happens.
Followers, following, and stargazers are inside the element with the class name
flex-order-1 flex-md-order-none mt-2 mt-md-0
and in the mb-3 element
which is inside that. Let's get that and store it in the elements variable.
Finding all the a elements inside elements returns a list.
- Followers has index 0
- Following has index 1
- Stargazers has index 2
elements.find_all('a')[2].find('span').getText().strip(' ')
- Here we take the item at index 2 (the third a element) and then get the text
from the span inside it. We are using strip(' ')
to remove unnecessary blank spaces from the result.
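The followers, following, and stars step can be tried out on a made-up snippet like this. The sample markup, the numbers in it, and the extract_stats function are assumptions for illustration; the class names and the index order are the ones described above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the stats block of the profile page.
SAMPLE_HTML = """
<div class="flex-order-1 flex-md-order-none mt-2 mt-md-0">
  <div class="mb-3">
    <a href="#"><span> 12 </span> followers</a>
    <a href="#"><span> 34 </span> following</a>
    <a href="#"><span> 56 </span></a>
  </div>
</div>
"""


def extract_stats(soup):
    """Return a dict of the three counters, or None if the markup is missing."""
    statstab = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
    if statstab is None:
        return None
    elements = statstab.find(class_="mb-3")
    if elements is None:
        return None
    links = elements.find_all("a")
    if len(links) < 3:
        return None
    values = []
    for link in links[:3]:
        span = link.find("span")
        # strip() with no argument also removes newlines and tabs.
        values.append(span.getText().strip() if span else None)
    return {"followers": values[0], "following": values[1], "stars": values[2]}


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_stats(soup))
```

This sketch calls strip() without an argument, which is a small change from strip(' '): it trims every kind of surrounding whitespace, not only spaces.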
u_img=soup.find(class_="avatar avatar-user width-full border bg-white")['src']
- The above code finds the image tag, and we read its src attribute.
Here we are getting the number of repos the user has.
That is all you need to scrape user details with Python.
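As a rough sketch, the same lookup can be written so that it does not crash when the image is missing. The sample img tag, its example.com URL, and the extract_avatar_url function are made up for illustration; only the class name comes from the code above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the avatar image tag.
SAMPLE_HTML = (
    '<img class="avatar avatar-user width-full border bg-white" '
    'src="https://example.com/avatar.png">'
)


def extract_avatar_url(soup):
    """Return the avatar src attribute, or None if the tag or attribute is missing."""
    img = soup.find(class_="avatar avatar-user width-full border bg-white")
    if img is None:
        return None
    # .get() returns None instead of raising KeyError when src is absent.
    return img.get("src")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_avatar_url(soup))
```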
Source Code
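import requests
from bs4 import BeautifulSoup
import html5lib
# Placeholder username; replace it with the profile you want to scrape.
username="your-username"
r=requests.get("https://github.com/"+username)
soup=BeautifulSoup(r.content,"html5lib")
namediv=soup.find("h1" ,class_="vcard-names pl-2 pl-md-0")
# Index 0 is the name and index 1 is the username.
name=namediv.find_all('span')[0].getText()
u_name=namediv.find_all('span')[1].getText()
statstab=soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements=statstab.find(class_="mb-3")
followers=elements.find_all('a')[0].find('span').getText().strip(' ')
following=elements.find_all('a')[1].find('span').getText().strip(' ')
totstars=elements.find_all('a')[2].find('span').getText().strip(' ')
u_img=soup.find(class_="avatar avatar-user width-full border bg-white")['src']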
- The idea is to make the program navigate to the part of the page we want and then select the required element.
Refer to some BeautifulSoup methods here
I have also made a PyPI module to scrape GitHub. See it here and give it a star if you like it.
If you have any doubts or need clarification, comment down below.
Stay tuned for part 2, where we will scrape the user's repo details.