Scraping Twitter with Twint

ribhav99

Posted on March 7, 2020
Ribhav Kapur
http://www.ribhavkapur.com

            The rise of social media has been tremendous, with effects in areas like mental health, procrastination, and business. It has been one of the most culturally significant developments in a long time, and one aspect I would like to talk about is the availability of open source data, and specifically some of the ways we can use it.

            As the title suggests, this post will focus on a program called Twint (https://github.com/twintproject/twint), which can be used to collect data from Twitter. I'll talk about the ways I have used it, and hopefully that will help spark some creativity.

Why Twint?
            Twint is a program that makes it unbelievably easy to gather data from Twitter, without any rate limitations whatsoever. If you use the official Twitter API, you can only retrieve around the most recent 3,200 tweets from a user's timeline, which is a really small number if you're, say, trying to build a data set for a machine learning algorithm. Twint lets you search almost every tweet ever posted. According to a quick Google search, around 200 billion tweets are posted every year; that is an enormous amount of publicly available data. Furthermore, Twint is really easy to set up and use: a simple "pip install" through the command line will do the trick (assuming, of course, you already have Python set up). This is a huge bonus, since anyone who has worked with APIs before will tell you they are a pain to set up. And, to top it off, you don't even need a Twitter account; you can use the program completely anonymously.
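For example, once installed, the command-line interface can pull tweets straight into a CSV. A minimal sketch, using flags from Twint's README (the search term, dates, and file names here are just placeholders):

```shell
# Install Twint
pip3 install twint

# Grab up to 100 tweets mentioning "tesla" and save them to a CSV
twint -s "tesla" --limit 100 -o tesla_tweets.csv --csv

# Or scrape a specific user's timeline since a given date
twint -u wongmjane --since "2020-01-01" -o wongmjane_tweets.csv --csv
```

Twint also exposes the same options as a Python library via `twint.Config`, if you'd rather drive it from a script.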

Collecting Information for Research
            The first time I used Twint was while building my first startup. The idea relied heavily on Facebook's Graph API for Instagram. The startup was still in the ideation stage, and I used Twint as a research tool to scrape information about Instagram from Twitter. I found a user (@wongmjane) who tweets details about various mobile applications (like Instagram). She digs into the source code of applications and discusses changes in upcoming features, APIs, various hidden features, and so on. This promised to be a great way to learn where Instagram might be headed, what could be done with the API, and what upcoming features weren't being talked about anywhere else.
            This did prove to be an invaluable source of information: it helped me pivot the idea several times based on what Facebook was working on, gather statistics about which features users valued most, and develop business strategies around those to help my startup get off the ground.

Gathering User Sentiment for Investment
            Another time I used Twint, and this one I'm particularly proud of, was to analyse the general public's sentiment about stocks and companies and use that as a guide for investment. Before I go any further: I definitely DO NOT recommend doing this. I'm just a kid in university who likes to mess around with projects, so whatever you do, do not take financial advice from me. Now that that's out of the way, let's talk about what exactly I did. I used Twint to gather a bunch of tweets whose main topic of conversation (or, well, main topic of "tweet") was a particular stock, say Tesla. I then parsed these tweets and ran each one through the sentiment analysis tool in Google's Natural Language API. Based on the overall response, positive or negative, along with the confidence and the proportion of each, I decided whether or not that stock would be a good investment. Surprisingly, it worked better than I expected. I invested $100 in some stocks (sadly, $100 doesn't get you very far) and actually managed a solid 5% return on the first day! I got excited and went on to lose $50 after that, but eh, you win some, you lose some.
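The decision step can be sketched like this. The per-tweet scores below are made-up stand-ins for what Google's Natural Language API would return (a score in [-1, 1] per document), and the thresholds are hypothetical:

```python
def aggregate_sentiment(scores, threshold=0.6):
    """Turn per-tweet sentiment scores (each in [-1, 1]) into an overall call.

    Tweets scoring within (-0.1, 0.1) are treated as neutral and ignored;
    the remaining tweets vote, and a supermajority decides.
    """
    positive = sum(1 for s in scores if s > 0.1)
    negative = sum(1 for s in scores if s < -0.1)
    total = positive + negative
    if total == 0:
        return "neutral"
    frac_positive = positive / total
    if frac_positive >= threshold:
        return "positive"
    if frac_positive <= 1 - threshold:
        return "negative"
    return "neutral"

# Made-up scores for a batch of tweets mentioning a stock
tesla_scores = [0.8, 0.5, 0.3, -0.2, 0.6]
print(aggregate_sentiment(tesla_scores))  # → positive
```

In practice you would weight the votes by the API's confidence as well, but a simple majority already gives a usable signal.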

Collecting Data for Training Machine Learning Algorithms
            As all computer scientists know, the more data you have, the better you can train your algorithm. As mentioned before, with roughly 200 billion new tweets every year, Twitter is a great place to collect information and build data sets. As part of a university assignment, I had to write an algorithm using Bayes nets to classify emails as spam or not spam. Of course, we were given plenty of guidelines and resources to complete the project, but what's the fun in that? After finishing the assignment, I trained the same algorithm on data from Twitter: I used Twint to gather tweets that were advertisements, labelled as spam, and a set of random tweets from users, labelled as not spam, just to see how well my algorithm worked on a large real-world data set. It performed decently well, classifying about 87% of the tweets correctly.
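A toy version of that kind of classifier can be sketched as a multinomial naive Bayes with Laplace smoothing. This is a simplified, hypothetical stand-in for the assignment's Bayes net, and the example tweets are made up:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label) pairs, e.g. ("buy now!", "spam")."""
    word_counts = defaultdict(Counter)  # per-label word frequencies
    label_counts = Counter()            # document counts per label (the priors)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Laplace smoothing so unseen words don't zero out the probability
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny made-up training set: ad-like tweets vs. ordinary tweets
model = train([
    ("buy now discount sale", "spam"),
    ("limited offer buy today", "spam"),
    ("had a great day with friends", "not_spam"),
    ("watching the game tonight", "not_spam"),
])
print(predict(model, "big discount buy now"))  # → spam
```

With Twint-scraped tweets in place of the four hand-written examples, the same two functions scale to a real data set unchanged.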

            These are some of the ways I have used Twint to learn something new, mess around with a project, or just kill time doing something interesting. Even though these weren't huge projects and nothing beyond some learning came out of them, I hope they demonstrate what's possible when you have such easy access to large amounts of data, and I hope they inspire some novel use cases of your own.
