Practical Puppeteer: Get Instagram account profile detail
Sony AK
Posted on December 13, 2019
Today scraping with Puppeteer will be related to Instagram. The scenario is we go to an Instagram profile and we will get some data from there, such as:
- Check username exists or not
- Username
- Verified account or not
- Private account or not
- Account name
- Bio description
- Account profile picture URL
- Bio URL display
- Total posts, total followers, total followings
- Recent posts (an array that contains URL to post and its thumbnail image)
As usual, we will use Puppeteer (not using any API). Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium. Please go to https://pptr.dev for more detail.
Let's start.
Preparation
Install Puppeteer
npm i puppeteer
The code
This code will get the detail public profile of Instagram account @cristiano, yes, it's Cristiano Ronaldo account.
File instagram_account_profile.js
const puppeteer = require('puppeteer');
(async () => {
// set some options (set headless to false so we can see
// this automated browsing experience)
let launchOptions = { headless: false, args: ['--start-maximized'] };
const browser = await puppeteer.launch(launchOptions);
const page = await browser.newPage();
// set viewport and user agent (just in case for nice viewing)
await page.setViewport({width: 1366, height: 768});
await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');
// go to Instagram web profile (this example use Cristiano Ronaldo profile)
await page.goto('https://instagram.com/cristiano');
// check username exists or not exists
let isUsernameNotFound = await page.evaluate(() => {
// check selector exists
if(document.getElementsByTagName('h2')[0]) {
// check selector text content
if(document.getElementsByTagName('h2')[0].textContent == "Sorry, this page isn't available.") {
return true;
}
}
});
if(isUsernameNotFound) {
console.log('Account not exists!');
// close browser
await browser.close();
return;
}
// get username
let username = await page.evaluate(() => {
return document.querySelectorAll('header > section h1')[0].textContent;
});
// check the account is verified or not
let isVerifiedAccount = await page.evaluate(() => {
// check selector exists
if(document.getElementsByClassName('coreSpriteVerifiedBadge')[0]) {
return true;
} else {
return false;
}
});
// get username picture URL
let usernamePictureUrl = await page.evaluate(() => {
return document.querySelectorAll('header img')[0].getAttribute('src');
});
// get number of total posts
let postsCount = await page.evaluate(() => {
return document.querySelectorAll('header > section > ul > li span')[0].textContent.replace(/\,/g, '');
});
// get number of total followers
let followersCount = await page.evaluate(() => {
return document.querySelectorAll('header > section > ul > li span')[1].getAttribute('title').replace(/\,/g, '');
});
// get number of total followings
let followingsCount = await page.evaluate(() => {
return document.querySelectorAll('header > section > ul > li span')[2].textContent.replace(/\,/g, '');
});
// get bio name
let name = await page.evaluate(() => {
// check selector exists
if(document.querySelectorAll('header > section h1')[1]) {
return document.querySelectorAll('header > section h1')[1].textContent;
} else {
return '';
}
});
// get bio description
let bio = await page.evaluate(() => {
if(document.querySelectorAll('header h1')[1].parentNode.querySelectorAll('span')[0]) {
return document.querySelectorAll('header h1')[1].parentNode.querySelectorAll('span')[0].textContent;
} else {
return '';
}
});
// get bio URL
let bioUrl = await page.evaluate(() => {
// check selector exists
if(document.querySelectorAll('header > section div > a')[1]) {
return document.querySelectorAll('header > section div > a')[1].getAttribute('href');
} else {
return '';
}
});
// get bio display
let bioUrlDisplay = await page.evaluate(() => {
// check selector exists
if(document.querySelectorAll('header > section div > a')[1]) {
return document.querySelectorAll('header > section div > a')[1].textContent;
} else {
return '';
}
});
// check if account is private or not
let isPrivateAccount = await page.evaluate(() => {
// check selector exists
if(document.getElementsByTagName('h2')[0]) {
// check selector text content
if(document.getElementsByTagName('h2')[0].textContent == 'This Account is Private') {
return true;
} else {
return false;
}
} else {
return false;
}
});
// get recent posts (array of url and photo)
let recentPosts = await page.evaluate(() => {
let results = [];
// loop on recent posts selector
document.querySelectorAll('div[style*="flex-direction"] div > a').forEach((el) => {
// init the post object (for recent posts)
let post = {};
// fill the post object with URL and photo data
post.url = 'https://www.instagram.com' + el.getAttribute('href');
post.photo = el.querySelector('img').getAttribute('src');
// add the object to results array (by push operation)
results.push(post);
});
// recentPosts will contains data from results
return results;
});
// display the result to console
console.log({'username': username,
'is_verified_account': isVerifiedAccount,
'username_picture_url': usernamePictureUrl,
'posts_count': postsCount,
'followers_count': followersCount,
'followings_count': followingsCount,
'name': name,
'bio': bio,
'bio_url': bioUrl,
'bio_url_display': bioUrlDisplay,
'is_private_account': isPrivateAccount,
'recent_posts': recentPosts});
// close the browser
await browser.close();
})();
I set the headless
mode to false
in Puppeteer options, so we can see the browser in action.
Run it
node instagram_account_profile.js
If everything OK, it will show the data structure like below on the console.
{
username: 'cristiano',
is_verified_account: true,
username_picture_url: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-19/s150x150/67310557_649773548849427_4130659181743046656_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=6fbc3118da5962a82e5733d14c93a93a&oe=5E70CF2D',
posts_count: '2716',
followers_count: '192798306',
followings_count: '445',
name: 'Cristiano Ronaldo',
bio: '',
bio_url: 'https://l.instagram.com/?u=http%3A%2F%2Fwww.cristianoronaldo.com%2F&e=ATMsBNjqh3vJtV6jZ68Jo1e8yXmGpacPHE4dfv_mSRg-PrcHYdCYZFkWxDuYLzORB-M3_aVb',
bio_url_display: 'www.cristianoronaldo.com',
is_private_account: false,
recent_posts: [
{
url: 'https://www.instagram.com/p/B58x9BUATxb/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c220.0.792.792a/s640x640/76876296_179193059941409_6221002990564880736_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=07ae6ecd5089fc1e5838ef86970c1f8c&oe=5E8023DF'
},
{
url: 'https://www.instagram.com/p/B55gk8DAL3Z/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/e35/c0.60.480.480a/75483286_186154695857472_4950353937543838253_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=cb3f7b242096ea16c3c4cc4b6312b87d&oe=5DF5B9F3'
},
{
url: 'https://www.instagram.com/p/B5zzJtBAoan/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c207.0.827.827a/s640x640/73393228_168482760903763_8963602282249975289_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=479cb033d8882b59fd6bbb4c6e1c408a&oe=5E80081A'
},
{
url: 'https://www.instagram.com/p/B5vuHHAAodt/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c240.0.960.960a/s640x640/74676914_139591227455800_1244894556711547199_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=c40bca7880742088d19a19ae382def7f&oe=5E81AB8C'
},
{
url: 'https://www.instagram.com/p/B5qW56QIFFp/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c213.0.853.853a/s640x640/72783037_1351521851696486_1891057812314322465_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=88a45933f962a91940e49ee24d5acb09&oe=5E6E2EDE'
},
{
url: 'https://www.instagram.com/p/B5qICTmg7hS/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c227.0.910.910a/s640x640/76944874_1768777216590413_4590633889755644385_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=e69a90e499a8797b5b0bc4c9d0be8889&oe=5E77027A'
},
{
url: 'https://www.instagram.com/p/B5phLcCAfWV/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c106.0.868.868a/s640x640/74711305_126116271783000_2660929486246111795_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=dce9f4e0c396491c8b4750f946acb043&oe=5E84A9B8'
},
{
url: 'https://www.instagram.com/p/B5nqI98g9jq/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c0.180.1440.1440a/s640x640/72295503_199047947810859_4327918090297549142_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=9083fc356fee2c6780424df45ae2bda5&oe=5E82CCA1'
},
{
url: 'https://www.instagram.com/p/B5lpnXXgbiT/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c0.161.1291.1291a/s640x640/74337451_200653047633832_6084933369944989223_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=0b5ceedb25781b4924565949937edc0b&oe=5EB1C0A1'
},
{
url: 'https://www.instagram.com/p/B5iI4Sag0qQ/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/c177.0.710.710a/s640x640/73420511_1023531488000332_2506917797196221103_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=1312fb525a0bc8429e9181232d1d763f&oe=5E7156EB'
},
{
url: 'https://www.instagram.com/p/B5dRx0zgeSb/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/s640x640/75299394_983315452036089_6040427267837814466_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=ea373d65404f9838cbbd777852445d12&oe=5DF5FDE7'
},
{
url: 'https://www.instagram.com/p/B5az6Qfg3va/',
photo: 'https://instagram.fcgk18-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/s640x640/73393267_185000869337693_7735852682111206915_n.jpg?_nc_ht=instagram.fcgk18-1.fna.fbcdn.net&_nc_cat=1&oh=17030ad8ad6d0453eed64c203167f359&oe=5E902F7F'
}
]
}
Ow nice.
What we can learn from this code is using selector in page.evaluate
and doing looping on page.evaluate
.
This code also available at GitHub repository at https://github.com/sonyarianto/get-instagram-account-profile-detail-with-puppeteer
Updates
Dellean Santos (@tawsbob) on the comment told me that for Instagram public account profile, we can get the data from window._sharedData object. It's nice. You can get it by using Puppetter as well by using this page.evaluate.
let sharedData = await page.evaluate(() => {
return window._sharedData.entry_data.ProfilePage[0].graphql.user;
});
Thank you and I hope you enjoy it.
Reference
Posted on December 13, 2019
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.