Open Graph Protocol Analyzer in Python
Mageshwaran
Posted on October 19, 2020
What is Open Graph protocol used for?
The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.
Open Graph Protocol was first introduced by Facebook to allow integration between Facebook, its user data, and a website. Adding Open Graph meta tags to your website helps social networks crawl the data on the web page, and lets you choose which elements of your page are shown when someone shares it.
You may have seen this when sharing a web link on a social network, for example a Twitter card, a Facebook link share, or a WhatsApp link card.
To learn more about Open Graph protocol
Online testing tool Open Graph Tester
We are going to build a simple Open Graph protocol analyzer that fetches the OGP data from websites, using the Python libraries BeautifulSoup and requests.
Install beautifulsoup4 and requests
pip install requests
pip install beautifulsoup4
The website link used here is only for learning purposes.
import requests
from bs4 import BeautifulSoup
url = "https://www.udemy.com/course/learn-flutter-dart-to-build-ios-android-apps/"
r = requests.get(url=url)
# Create a BeautifulSoup object
soup = BeautifulSoup(r.text, 'html.parser')
The above code fetches the website content and loads it as a BeautifulSoup object, from which we can extract the data.
# a web page can have many meta tags
# return the first meta tag
soup.find("meta")
# return all meta tags
soup.find_all("meta")
# return the first meta tag whose property is og:title
soup.find("meta", property="og:title")
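To try these lookups without making a network request, here is a minimal sketch that parses a small hand-written HTML string. The markup and its values are made up for illustration; `find`, `find_all`, and the `attrs` argument are the BeautifulSoup calls being shown.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written by hand for this example
html = """
<html><head>
<meta charset="utf-8">
<meta property="og:title" content="Example title">
<meta property="og:type" content="website">
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match, or None when nothing matches
first_meta = soup.find("meta")
title_tag = soup.find("meta", attrs={"property": "og:title"})
missing = soup.find("meta", attrs={"property": "og:audio"})

# find_all returns every match as a list-like ResultSet
all_meta = soup.find_all("meta")

print(first_meta.get("charset"))  # utf-8
print(len(all_meta))              # 3
print(title_tag["content"])       # Example title
print(missing)                    # None
```

Because `find` can return None, it is safer to check the result before reading `content` from it.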
soup.find_all("meta") finds all the meta tags and returns them in a list-like object. We iterate over the tags and use Python if statements on each tag's property attribute to pick out the individual Open Graph tags.
# data holder
data = {
    "tag": {},
    "ogp": {}
}
# find all the meta tags in the web page
for i in soup.find_all("meta"):
    # extract individual tag with the property value
    if i.get("property", None) == "og:title":
        data["tag"]["title"] = i
        data["ogp"]["title"] = i.get("content", None)
    if i.get("property", None) == "og:url":
        data["tag"]["url"] = i
        data["ogp"]["url"] = i.get("content", None)
    if i.get("property", None) == "og:description":
        data["tag"]["description"] = i
        data["ogp"]["description"] = i.get("content", None)
    if i.get("property", None) == "og:image":
        data["tag"]["image"] = i
        data["ogp"]["image"] = i.get("content", None)
    if i.get("property", None) == "og:type":
        data["tag"]["type"] = i
        data["ogp"]["type"] = i.get("content", None)
    if i.get("property", None) == "og:site_name":
        data["tag"]["site_name"] = i
        data["ogp"]["site_name"] = i.get("content", None)
    if i.get("property", None) == "og:locale":
        data["tag"]["locale"] = i
        data["ogp"]["locale"] = i.get("content", None)
print(data)
{'tag': {'title': <meta content="Flutter & Dart - The Complete Guide [2020 Edition]" property="og:title"/>, 'url': <meta content="https://www.udemy.com/course/learn-flutter-dart-to-build-ios-android-apps/" property="og:url"/>, 'description': <meta content="A Complete Guide to the Flutter SDK & Flutter Framework for building native iOS and Android apps" property="og:description"/>, 'image': <meta content="https://img-a.udemycdn.com/course/480x270/1708340_7108_4.jpg?mTkNpG_o5Wh0tcZgEWDnLLfndz7BG87EWBPuhbZij4iaIzFjeWC9AwmBEt4sTy0ioCD3r8w-Wtzfac00nfnb-TGMYVhafN8EXUpihTvhffAbcaEuTbQgRQvPORm5i1bX" property="og:image"/>, 'type': <meta content="udemy_com:course" property="og:type"/>, 'site_name': <meta content="Udemy" property="og:site_name"/>, 'locale': <meta content="en_US" property="og:locale"/>}, 'ogp': {'title': 'Flutter & Dart - The Complete Guide [2020 Edition]', 'url': 'https://www.udemy.com/course/learn-flutter-dart-to-build-ios-android-apps/', 'description': 'A Complete Guide to the Flutter SDK & Flutter Framework for building native iOS and Android apps', 'image': 'https://img-a.udemycdn.com/course/480x270/1708340_7108_4.jpg?mTkNpG_o5Wh0tcZgEWDnLLfndz7BG87EWBPuhbZij4iaIzFjeWC9AwmBEt4sTy0ioCD3r8w-Wtzfac00nfnb-TGMYVhafN8EXUpihTvhffAbcaEuTbQgRQvPORm5i1bX', 'type': 'udemy_com:course', 'site_name': 'Udemy', 'locale': 'en_US'}}
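The repeated if blocks can also be collapsed into a single lookup against a tuple of wanted property names. This is only a sketch of that idea: the `WANTED` tuple and the `extract_selected` function are names invented here, and the sample markup is hand-written so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Properties handled by the if blocks above, without the "og:" prefix
WANTED = ("title", "url", "description", "image", "type", "site_name", "locale")


def extract_selected(soup):
    """Collect the wanted og:* tags and their content values."""
    data = {"tag": {}, "ogp": {}}
    for tag in soup.find_all("meta"):
        prop = tag.get("property")
        # skip meta tags that have no property or are not Open Graph tags
        if prop is None or not prop.startswith("og:"):
            continue
        name = prop[len("og:"):]
        if name in WANTED:
            data["tag"][name] = tag
            data["ogp"][name] = tag.get("content")
    return data


# Hypothetical markup for illustration
sample = BeautifulSoup(
    '<meta property="og:title" content="Example title">'
    '<meta property="og:audio" content="https://example.com/a.mp3">'
    '<meta name="viewport" content="width=device-width">',
    "html.parser",
)
result = extract_selected(sample)
print(result["ogp"])  # {'title': 'Example title'}
```

Adding another property then means adding one string to the tuple instead of another if block.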
This is good, but OGP has more properties, such as og:audio, og:determiner, and og:video. For more detail, see https://ogp.me/#optional .
Instead of explicitly listing each OGP property, check whether the property attribute starts with the og prefix and exclude the tag otherwise. Store the values in a Python dictionary called data.
data = {
    "tag": {},
    "ogp": {}
}
for i in soup.find_all("meta"):
    if i.get("property", None) is not None:
        if i.get("property", None).split(":")[0] == "og":
            data["tag"][i.get("property", None)] = i
            data["ogp"][i.get("property", None)] = i.get("content", None)
print(data)
{'tag': {'og:title': <meta content="Flutter & Dart - The Complete Guide [2020 Edition]" property="og:title"/>, 'og:url': <meta content="https://www.udemy.com/course/learn-flutter-dart-to-build-ios-android-apps/" property="og:url"/>, 'og:description': <meta content="A Complete Guide to the Flutter SDK & Flutter Framework for building native iOS and Android apps" property="og:description"/>, 'og:image': <meta content="https://img-a.udemycdn.com/course/480x270/1708340_7108_4.jpg?mTkNpG_o5Wh0tcZgEWDnLLfndz7BG87EWBPuhbZij4iaIzFjeWC9AwmBEt4sTy0ioCD3r8w-Wtzfac00nfnb-TGMYVhafN8EXUpihTvhffAbcaEuTbQgRQvPORm5i1bX" property="og:image"/>, 'og:type': <meta content="udemy_com:course" property="og:type"/>, 'og:site_name': <meta content="Udemy" property="og:site_name"/>, 'og:locale': <meta content="en_US" property="og:locale"/>}, 'ogp': {'og:title': 'Flutter & Dart - The Complete Guide [2020 Edition]', 'og:url': 'https://www.udemy.com/course/learn-flutter-dart-to-build-ios-android-apps/', 'og:description': 'A Complete Guide to the Flutter SDK & Flutter Framework for building native iOS and Android apps', 'og:image': 'https://img-a.udemycdn.com/course/480x270/1708340_7108_4.jpg?mTkNpG_o5Wh0tcZgEWDnLLfndz7BG87EWBPuhbZij4iaIzFjeWC9AwmBEt4sTy0ioCD3r8w-Wtzfac00nfnb-TGMYVhafN8EXUpihTvhffAbcaEuTbQgRQvPORm5i1bX', 'og:type': 'udemy_com:course', 'og:site_name': 'Udemy', 'og:locale': 'en_US'}}
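One thing to keep in mind: the Open Graph protocol lets a page repeat a property, for example several og:image tags, and a plain dictionary keeps only the last value for each key. A minimal sketch that collects every value in a list is shown below. The function name `extract_ogp_lists` and the sample markup are assumptions made for this example.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def extract_ogp_lists(html):
    """Return {property: [content, ...]} for every og:* meta tag, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    values = defaultdict(list)
    for tag in soup.find_all("meta"):
        prop = tag.get("property")
        # keep only properties in the og namespace, e.g. og:title
        if prop and prop.split(":")[0] == "og":
            values[prop].append(tag.get("content"))
    return dict(values)


# Hypothetical markup with a repeated og:image property
sample_html = """
<meta property="og:title" content="Example title">
<meta property="og:image" content="https://example.com/one.jpg">
<meta property="og:image" content="https://example.com/two.jpg">
<meta property="article:author" content="Someone">
"""
ogp = extract_ogp_lists(sample_html)
print(ogp["og:image"])
# ['https://example.com/one.jpg', 'https://example.com/two.jpg']
```

Properties that appear once come back as one-element lists, so callers can treat every property the same way.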
After this blog post is submitted, DEV will generate Open Graph tags for this page. You can check this with View page source.
https://dev.to/magesh236/open-graph-protocol-analyzer-4dk0
**Conclusion:** This is a web scraping technique, so use it with caution. Not every website allows you to scrape its content with a plain HTTP request. In that case, use a tool like Selenium to render the website, then take the rendered page content and pass it to the web scraping tool.
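One possible way to act on that caution is to check a site's robots.txt rules before fetching a page. The sketch below uses only the standard library's `urllib.robotparser`. The `is_allowed` helper, the robots.txt text, and the example.com URLs are made up for illustration; a real check would load the site's own robots.txt file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/private/page"))  # False
print(is_allowed(sample_rules, "https://example.com/course/"))       # True
```

robots.txt is only one signal. A site's terms of use may restrict scraping even where robots.txt does not.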