How I Did My First Web Scraping and Scraped My Twitter/X Tweet
Roopkumar Das
Posted on November 19, 2023
Hey, everyone! I'm sharing this blog to document my journey of scraping my own Twitter/X tweet. Another purpose of mine in writing this is to give beginner web scrapers some direction on how they could scrape websites.
Now, since I want to scrape tweets, I have two ways to do this:
- Using Puppeteer: Visit the tweet URL, scrape the information, and save it somewhere.
- Making HTTP Requests to the Backend: Make an HTTP request to the server and retrieve the result.
To be honest, I wasn't inclined to use the first method with Puppeteer. I don't know why, but whenever I thought about scraping via Puppeteer, I didn't want to. It felt like adding unnecessary complexity, and there were more points where I could expose my identity through browser fingerprinting and the like. I thought, why go the long way around when I can simply make a request to the server and get the result?
However, I couldn't proceed with the second option either, as I needed to read through the network requests, and I was overwhelmed by that. Each time I tried to read the network requests made by the browser, I couldn't focus well and gave up after a while.
This was my constant struggle, switching between the two methods. Yes, the Puppeteer way would have been easier, but that feeling kept dragging me down, and I procrastinated a lot because of it.
Break
After a few days, I got a huge breakthrough. I suddenly had this thought in mind: why not go to my tweet with my incognito browser?
Then, I could save all the requests my browser makes from the DevTools network tab and, lastly, use Python to search through the entire requests to find mentions of the word "introspecta." This way, I would get a filtered version of the network requests, which I hoped would be smaller and easier for me to go through.
So, I searched on DuckDuckGo for how I could do this and finally learned about HAR (HTTP Archive). This is a JSON file that contains all the network request information: headers, cookies, payload, response, etc. It was the perfect thing to help me achieve the above. Luckily, I didn't have to involve extensions and such, since Chrome itself gives an option to export everything as a HAR file right from the DevTools network tab.
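Stripped down to just the fields my script below reads, each entry in the HAR file is shaped roughly like this (a real entry carries plenty more):

{
  "log": {
    "entries": [
      {
        "_initiator": { "type": "script" },
        "request": { "method": "GET", "url": "...", "headers": [ ... ] },
        "response": {
          "headers": [ ... ],
          "cookies": [ ... ],
          "content": { "text": "..." }
        }
      }
    ]
  }
}

Thus, I saved it and ran this Python script.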
import json

with open("./twitter.com.har") as file:
    json_data = json.load(file)

for entry in json_data["log"]["entries"]:
    try:
        body = entry["response"]["content"]["text"]
        if "Introspecta" in body or "introspecta" in body:
            data = {}
            data["initiated-by"] = entry["_initiator"]["type"]
            data["request-method"] = entry["request"]["method"]
            data["request-url"] = entry["request"]["url"]
            data["request-headers"] = entry["request"]["headers"]
            data["response-headers"] = entry["response"]["headers"]
            data["response-cookies"] = entry["response"]["cookies"]
            data["payload"] = body
            json_object = json.dumps(data)
            # "w" mode means a later match would overwrite an earlier one
            with open("content.json", "w") as out_file:
                out_file.write(json_object)
    except KeyError:
        # some entries don't have these specific fields at all
        pass
As you can see, I loaded the JSON data and looped through each entry to find the occurrence of the word "introspecta". I also included a try-except block as some entries don't have those specific fields.
Fortunately, I obtained this result, and luckily, there was just one entry that provided all the information I needed.
{
  "initiated-by": "script",
  "request-method": "GET",
  "request-url": "https://api.twitter.com/graphql/5GOHgZe-8U2j5sVHQzEm9A/TweetResultByRestId?variables=....",
  "request-headers": [
    { "name": ":authority", "value": "api.twitter.com" },
    { "name": ":method", "value": "GET" },
    ....
}
I didn't show the whole result as this would make the blog unnecessarily long, but if you want to check it out, you can find the complete code in my GitHub repository.
Coming back, I noticed that Twitter is employing GraphQL, and by making a request to this URL, the server responds with tweet data. Another thing to note is the headers and cookies we are sending while making this request.
Request Time
I am using the Python requests library to make the HTTP request to the server.
First, I have to create the URL for the GET request. From examining the request URL in the HAR result, https://api.twitter.com/graphql/5GOHgZe-8U2j5sVHQzEm9A/TweetResultByRestId is the base, and the rest of it consists of two query parameters: variables and features.
After copying and formatting them, I understood that I only need to change one field of variables, namely the tweet ID; the rest isn't that important in relation to my tweet.
import json
import re
import urllib.parse

tweetUrl = "https://twitter.com/Roopkd_/status/1716456929164411113"
# grab the numeric ID that follows "status/" in the tweet URL
tweet_id = re.findall(r"(?<=status/)\d+", tweetUrl)

variables = {
    "tweetId": tweet_id[0],
    ...
}
features = {...}

finalUrl = f"https://api.twitter.com/graphql/5GOHgZe-8U2j5sVHQzEm9A/TweetResultByRestId?variables={urllib.parse.quote(json.dumps(variables))}&features={urllib.parse.quote(json.dumps(features))}"
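As a side note on that last line: urllib.parse.quote(json.dumps(...)) first serializes the dict to JSON text and then percent-encodes it so it can sit inside a query string. A toy example (the includeReplies field here is made up for illustration, not part of the real variables dict):

import json
import urllib.parse

demo = {"tweetId": "1716456929164411113", "includeReplies": False}
# json.dumps produces the JSON text; quote percent-encodes it for the URL
print(urllib.parse.quote(json.dumps(demo)))
# -> %7B%22tweetId%22%3A%20%221716456929164411113%22%2C%20%22includeReplies%22%3A%20false%7D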
Now that I have the URL, I need to think about how to create headers and cookies to send with the GET request, and this was the trickiest part.
I had to find ways to get the bearer token, guest token, and transaction ID.
For the first two, I didn't have to think much, as someone had already written code to get those values. A shoutout to the twitter_video_dl repo.
What it does is basically make a request to the tweet, to which the server responds with cookies and the mainjs URL. Making another request to mainjs gives us the bearer token. Lastly, we make a POST request to client_json, which gives us our guest token.
This also solves our cookies issue, as we can just copy the cookie values and use them for our GET request.
import requests

def get_tokens(tweet_url):
    html = requests.get(tweet_url)
    cookiesToSend = html.cookies
    ......
    return bearer_token, guest_token, cookies_header, cookiesToSend
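Since the elided part is the interesting bit, here is a minimal sketch of how I understand that flow from the description above. The regex patterns and the guest/activate.json endpoint are my own reading of what twitter_video_dl does, not code copied from the repo, so treat it as an illustration rather than the real thing:

import re
import requests

def get_tokens_sketch(tweet_url):
    html = requests.get(tweet_url)
    cookiesToSend = html.cookies

    # the tweet page references a main.*.js bundle (this pattern is an assumption)
    mainjs_url = re.findall(
        r"https://abs\.twimg\.com/responsive-web/client-web(?:-legacy)?/main\.[^\"']+\.js",
        html.text,
    )[0]

    # the bearer token is embedded in that bundle as a long "AAAA..." string
    mainjs = requests.get(mainjs_url).text
    bearer_token = re.findall(r"AAAAAAAAA[^\"']+", mainjs)[0]

    # activating a guest session returns the guest token
    guest_token = requests.post(
        "https://api.twitter.com/1.1/guest/activate.json",
        headers={"Authorization": f"Bearer {bearer_token}"},
    ).json()["guest_token"]

    # also flatten the cookie jar into a header-style string
    cookies_header = "; ".join(f"{k}={v}" for k, v in cookiesToSend.items())
    return bearer_token, guest_token, cookies_header, cookiesToSend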
For the transaction ID, I figured that Twitter creates it randomly, and I didn't find a way to generate it myself. Thus, I created the following transaction ID by copying the format of the one in the HAR file.
Now, since we have all the values, let's create our header and make the request.
bearer_token, guest_token, header_cookies, cookiesToSend = get_tokens(tweetUrl)

# transaction ID copied in the same format as the one in the HAR file
id = "0Gq/4wkkphJSlDg3ZancYmmPJdaSslkjhZuibgL052lLK9zAP/0ru53eP3p7+ICASYsR1dHSh9zw8O9AcA2KtLPfeRFL0Q"

headers = {
    "Host": "api.twitter.com",
    "accept": "*/*",
    ....
}

val = requests.get(finalUrl, headers=headers, cookies=cookiesToSend)
print(val.text)
python3 make-request.py
And the result is... fingers crossed...
{"data":{"tweetResult":{"result":{"__typename":"Tweet","rest_id":"1716456929164411113","core":{"user_results":{"result":{"__typename":"User", ....
Let's go, we got the result! For the final validation, I copied it into a JSON file and formatted it with the help of Prettier.
After examining the result, I noticed that Twitter provides information not only about the tweet but also about its author, such as the author's follower count. This is especially helpful when I want both the tweet and the author data at once.
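For instance, pulling a couple of fields out of that response could look like the snippet below. The data.tweetResult.result.core.user_results.result path matches the output shown above, but the legacy, full_text, and followers_count keys are my assumption about the truncated part of the payload:

import json

result = json.loads(val.text)["data"]["tweetResult"]["result"]

# the "legacy" keys below are assumed from Twitter's usual GraphQL shape
tweet_text = result["legacy"]["full_text"]
author = result["core"]["user_results"]["result"]
followers = author["legacy"]["followers_count"]

print(tweet_text, followers)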
But now, having gone through it all, I feel the Puppeteer method is the one I could see being used more professionally than this one, since this method selectively requests only one particular URL and thus stands out against the normal stream of requests a browser makes.
Also, this method could have failed if the server tried to validate the request by talking to the client. Nevertheless, I am happy with the results, and I hope that you have learned something as well.
I know this is not the whole story, and there is also proxying involved, but I will leave it for now and will discuss these details later.
Thanks for reading, and I hope you have a great day! Sayonara.
GitHub Repo Link -> tweet-scrape