Trolls and bots are disrupting social media—here’s how AI can stop them (Part 1)


Jason Skowronski

Posted on August 9, 2019


Trolls and bots have a huge and often unrecognized influence on social media. They are used to influence conversations for commercial or political reasons. They allow small, hidden groups of people to promote information supporting their agenda at a large scale. They can push their content to the top of people’s news feeds, search results, and shopping carts. Some say they can even influence presidential elections. In order to maintain the quality of discussion on social sites, it’s become necessary to screen and moderate community content. Can we use machine learning to identify suspicious posts and comments? The answer is yes, and we’ll show you how.

This is a two-part series. In this part, we'll cover how to collect comment data from Reddit in bulk and build a real-time dashboard using Node and Kafka to moderate suspected trolls and bots. In part two, we'll cover the specifics of building the machine learning model.

Trolls and bots are a huge pain for social media

Trolls are dangerous online because it's not always obvious when you are being influenced by them or engaging with them. Posts created by Russian operatives were seen by up to 126 million Americans on Facebook leading up to the last election. Twitter released a massive data dump of over 9 million tweets from Russian trolls. And it’s not just Russia! There are also accounts of trolls attempting to influence Canada after the conflict with Huawei. The problem even extends to online shopping where reviews on Amazon have slowly been getting more heavily manipulated by merchants.

Bots are computer programs posing as people. They can amplify the effect of trolls by engaging or liking their content en masse, or by posting their own content in an automated fashion. They will get more sophisticated and harder to detect in the future. Bots can now create entire paragraphs of text in response to text posts or comments. OpenAI’s GPT-2 model can write text that feels and looks very similar to human quality. OpenAI decided not to release it due to safety concerns, but it’s only a matter of time before the spammers catch up. As a disclaimer, not all bots are harmful. In fact, the majority of bots on Reddit try to help the community by moderating content, finding duplicate links, providing summaries of articles, and more. It will be important to distinguish helpful from harmful bots.

How can we defend ourselves from propaganda and spam posted by malicious trolls and bots? We could carefully investigate the background of each poster, but we don’t have time to do this for every comment we read. The answer is to automate the detection using big data and machine learning. Let’s fight fire with fire!

Identifying bots and trolls on Reddit

We’ll focus on Reddit because users often complain of trolls in political threads. It’s easier for trolls to operate thanks to anonymous posting. Operatives can create dozens or hundreds of accounts to simulate user engagement, likes and comments. Research from Stanford has shown that just 1% of accounts create 74% of conflict. Over the past few months, we’ve seen numerous comments like this one in the worldnews subreddit:

“Anyone else notice the false users in this thread? I recognise their language. It has very specific traits like appearing to have genuine curiosity yet backed by absurd statements. Calling for 'clear evidence' and questioning the veracity of statements (which would normally be a good thing but not under a guise). Wonder if you could run it through machine learning to identify these type of users/comments.” - koalefant

https://www.reddit.com/r/worldnews/comments/aciovt/_/ed8alk0/?context=1

challenge accepted

There are several existing resources we can leverage. For example, the botwatch subreddit keeps track of bots on Reddit, true to its namesake! Reddit’s 2017 Transparency Report also listed 944 accounts suspected of being trolls working for the Russian Internet Research Agency.

Also, there are software tools for analyzing Reddit users. For example, the very nicely designed reddit-user-analyzer can do sentiment analysis, plot the controversiality of user comments, and more. Let’s take this a step further and build a tool that puts the power in the hands of moderators and users.

In this article, the first of a two-part series, we’ll cover how to capture data from Reddit’s API for analysis and how to build the actual dashboard. In part two, we’ll dive deeper into how we built the machine learning model.

Creating a dashboard of suspected bots and trolls

In this tutorial, you’ll learn how to create a dashboard to identify bots and trolls on Reddit comments in real time, with the help of machine learning. This could be a useful tool to help moderators of political subreddits identify and remove content from bots and trolls. As users submit comments to the r/politics subreddit, we’ll capture the comments and run them through our machine learning model, then report suspicious ones on a dashboard for moderators to review.

Reddit bot and troll dashboard

Here’s a screengrab from our dashboard. Try it out yourself at reddit-dashboard.herokuapp.com.

To set your expectations, our system is designed as a proof of concept. It’s not meant to be a production system and is not 100% accurate. We’ll use it to illustrate the steps involved in building such a system, in the hope that platform providers will be able to offer official tools like these in the future.

System architecture

Due to the high number of posts and comments being made on social media sites, it’s necessary to use a scalable infrastructure to process them. We’ll design our system architecture using an example written by the Heroku team in Managing Real-time Event Streams with Apache Kafka. This is an event-driven architecture that lets us produce data from the Reddit API and send it to Kafka. Kafka makes it easy to process streaming data and decouple the different parts of our system. Reading this data from Kafka, our dashboard can call the machine learning API and display the results. We’ll also store the data in Redshift for historical analysis and for use as training data.

Apache Kafka

Collecting data from Reddit

Our first step is to download the comments from the politics subreddit for analysis. Reddit makes it easy to access comments as structured data in JSON format. To get recent comments for any subreddit, just request the following URL:

https://www.reddit.com/r/${subreddit}/comments.json

Likewise, we can access public data about each user, including their karma and comment history. All we need to do is request this data from a URL containing the username, as shown below.

https://www.reddit.com/user/${username}/about.json
https://www.reddit.com/user/${username}/comments.json

To collect the data, we just looped through each comment in the r/politics subreddit and then loaded the user data for each commenter. You can use whatever HTTP request library you like; our examples will use axios for Node.js. We also combine data from both calls into a single convenient data structure that includes both the user information and their comments. This will make it easier to store and retrieve each example later. This functionality can be seen in the profile-scraper.js file, and you can learn more about how to run it in the README.
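
To make the flow concrete, here is a minimal sketch of that collection loop using axios. The function name and the exact shape of the combined object are illustrative; the real implementation, including rate limiting and error handling, lives in profile-scraper.js.

const axios = require('axios')

// Fetch the newest comments for a subreddit, then attach each author's
// profile and recent comment history to the comment object.
async function fetchCommentsWithAuthors(subreddit) {
  const listing = await axios.get(`https://www.reddit.com/r/${subreddit}/comments.json`)
  const comments = listing.data.data.children.map((child) => child.data)

  return Promise.all(comments.map(async (comment) => {
    const [about, history] = await Promise.all([
      axios.get(`https://www.reddit.com/user/${comment.author}/about.json`),
      axios.get(`https://www.reddit.com/user/${comment.author}/comments.json`)
    ])
    return {
      ...comment,
      author_data: about.data.data,
      recent_comments: history.data.data.children.map((c) => c.data)
    }
  }))
}

Calling fetchCommentsWithAuthors('politics') returns an array of comment objects, each carrying its author’s profile and history, ready to be stored or streamed.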

Real-time event streaming in Kafka

Now that the data has been collected from Reddit, we are ready to stream the comments into Kafka. Before connecting to the Kafka server you will need to create a topic in the Heroku dashboard. Click Add Topic and set the topic name with a single partition.

To connect to the Kafka server as a Producer in Node.js you can use the no-kafka library with the connection information already set in the cluster created by Heroku:

const fs = require('fs')
const Kafka = require('no-kafka')

const url = process.env.KAFKA_URL
const cert = process.env.KAFKA_CLIENT_CERT
const key = process.env.KAFKA_CLIENT_CERT_KEY

// Write the client certificate and key to disk so no-kafka can read them
fs.writeFileSync('./client.crt', cert)
fs.writeFileSync('./client.key', key)

const producer = new Kafka.Producer({
  clientId: 'reddit-comment-producer',
  connectionString: url.replace(/\+ssl/g, ''),
  ssl: {
    certFile: './client.crt',
    keyFile: './client.key'
  }
})

After you are connected to Kafka, you can send messages to the topic you created earlier. For convenience, we decided to stringify the JSON messages before sending them to Kafka in our live streaming app:

producer.send({
  topic: 'northcanadian-72923.reddit-comments',
  partition: 0,
  message: {
    value: JSON.stringify(message)
  }
})

In our repo, the sample live streaming worker code is in the kafka-stream.js file.
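
To give an idea of how the pieces fit together, here is a simplified sketch of such a worker: it reuses the collection function from the previous section, skips comments it has already seen, and publishes the rest to the topic. The helper names and polling interval are illustrative rather than the exact code from kafka-stream.js.

const TOPIC = process.env.KAFKA_TOPIC // e.g. 'northcanadian-72923.reddit-comments'
const seen = new Set()

async function pollAndPublish(subreddit) {
  const comments = await fetchCommentsWithAuthors(subreddit) // sketch from the previous section
  for (const comment of comments) {
    if (seen.has(comment.id)) continue // only publish comments we haven't sent yet
    seen.add(comment.id)
    await producer.send({
      topic: TOPIC,
      partition: 0,
      message: { value: JSON.stringify(comment) }
    })
  }
}

// no-kafka requires init() before the first send; afterwards, poll on an interval
// (keep it modest to respect Reddit's API rate limits).
producer.init().then(() => {
  setInterval(() => pollAndPublish('politics'), 10 * 1000)
})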

Building a moderator dashboard

Our sample dashboard is a JavaScript application based on a previous version of the twitter-display Kafka demo app by Heroku. We simplified the app by removing some dependencies and modules, but the general architecture remains: an Express app (server-side) to consume and process the Kafka topic, connected via a web socket to a D3 front end (client-side) that displays the messages (Reddit comments) and their classification in real time. You can find our open source code at https://github.com/devspotlight/Reddit-Kafka-Consumers.

In the server-side Node app, we connect to Kafka as a simple Consumer, subscribe to the topic, and broadcast each batch of messages to the function that fetches our predictions:

new Consumer({
  broadcast: (msgs) => {
    predictBotOrTrolls(msgs)
  },
  interval: constants.INTERVAL,
  topic: constants.KAFKA_TOPIC,
  consumer: {
    connectionString: process.env.KAFKA_URL,
    ssl: {
      cert: './client.crt',
      key: './client.key'
    }
  }
})

We then use unirest (an HTTP/REST request library) to send the unified data scheme from those messages to our machine learning API, which returns a real-time prediction on whether the author is a person, a bot, or a troll (more about that in the next section of this article).
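
As a rough sketch (the endpoint URL, helper names, and response handling here are assumptions, not the exact code from our repo), the prediction function could look like this:

const unirest = require('unirest')

function predictBotOrTrolls(msgs) {
  msgs.forEach((msg) => {
    unirest
      .post(process.env.ML_API_URL)   // hypothetical env var pointing at the ML API
      .type('json')
      .send(msg)                      // the unified data scheme for one comment/author
      .end((response) => {
        // e.g. { "prediction": "Is a bot user" }
        msg.prediction = response.body.prediction
        broadcastToSockets(msg)       // hypothetical helper; see the WebSocket sketch below
      })
  })
}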

Finally, a WebSocket server is used in our app.js so that the front end can get all the display data in real time. Since the subreddit comments stream in real time, the scaling and load balancing of each application should be considered and monitored.
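
A minimal version of that fan-out, assuming the ws package and an existing Express HTTP server, might look like the following (our app.js differs in the details):

const WebSocket = require('ws')
const wss = new WebSocket.Server({ server }) // `server` is the Express app's HTTP server

// Push one classified comment to every connected dashboard client.
function broadcastToSockets(msg) {
  const payload = JSON.stringify(msg)
  wss.clients.forEach((client) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(payload)
    }
  })
}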

We use the popular D3 JavaScript library to update the dashboard dynamically as Kafka messages stream in. Visually, a table is bound to the data stream and updated with the newest comments as they arrive (newest first), along with a count of each user type detected:

import * as d3 from 'd3'

class DataTable {
  constructor(selector, maxSize) {
    this.tbody = d3.select(selector)
    this._maxSize = maxSize
    this._rowData = []
  }

  update(data) {
    data.forEach((msg) => {
      this._rowData.push(msg)
    })

    if (this._rowData.length >= this._maxSize)
      this._rowData.splice(0, this._rowData.length - this._maxSize)

    // Bind data rows to target table
    let rows = this.tbody.selectAll('tr').data(this._rowData, (d) => d)

  ...

See data-table.js for more details. The code shown above is just an excerpt.
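
For readers curious how the excerpt continues, the data join typically finishes with the standard D3 enter/update/exit pattern, roughly like this (the column list here is an assumption; the real layout is in data-table.js):

    // Add a row for each new comment and refresh existing ones.
    const cells = rows.enter()
      .append('tr')
      .merge(rows)
      .selectAll('td')
      .data((d) => [d.author, d.body, d.prediction])

    cells.enter()
      .append('td')
      .merge(cells)
      .text((d) => d)

    // Drop rows that fell out of the bounded buffer.
    rows.exit().remove()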

Calling out to our ML API

Our machine learning API examines features of the comment poster’s account and recent comment history. We trained our model on features like their Reddit “karma”, the number of comments they’ve posted, whether they’ve verified their account, and other signals we hypothesized would be useful in categorizing users. We pass this collection of features to the model as a JSON object, and the model returns a prediction for that user that we can display in our dashboard. Below are sample JSON data objects (using our unified data scheme) sent as requests to the HTTP API.

Example for a bot user:

{
   "banned_by":null,
   "no_follow":true,
   "link_id":"t3_aqtwe1",
   "gilded":false,
   "author":"AutoModerator",
   "author_verified":false,
   "author_comment_karma":445850.0,
   "author_link_karma":1778.0,
   "num_comments":1.0,
   "created_utc":1550213389.0,
   "score":1.0,
   "over_18":false,
   "body":"Hey, thanks for posting at \\/r\\/SwitchHaxing! Unfortunately your comment has been removed due to rule 6; please post questions in the stickied Q&A thread.If you believe this is an error, please contact us via modmail and well sort it out.*I am a bot",
   "downs":0.0,
   "is_submitter":false,
   "num_reports":null,
   "controversiality":0.0,
   "quarantine":"false",
   "ups":1.0,
   "is_bot":true,
   "is_troll":false,
   "recent_comments":"[...array of 20 recent comments...]"
}

The response returned is:

{
  "prediction": "Is a bot user"
}

Run it easily using a Heroku Button

As you can see, our architecture has many parts—including producers, Kafka, and a visualization app—which might make you think that it’s difficult to run or manage. However, we have a Heroku button that allows us to run the whole stack in a single click. Pretty neat, huh? This opens the door to using more sophisticated architectures without the extra fuss.

If you’re technically inclined, give it a shot. You can have a Kafka cluster running pretty quickly, and you only pay for the time it's running. Check out the documentation for local development and production deployment in our code’s README.

Next steps

We’d like to encourage the community to use these types of techniques to control the spread of trolls and harmful bots. It’s an exciting time to be alive and watch as trolls attempt to influence social media, while these communities develop better machine learning and moderation tools to stop them. Hopefully we’ll be able to keep our community forums as places for meaningful discussion.

Check out our part two article “Detecting bots and trolls on Reddit using machine learning”, which will dive deeper into how we built the machine learning model and its accuracy.
