koki kitamura
Posted on May 1, 2022
Overview
I created an API server that is highly scalable and can handle 10 million users. It's an SNS like Twitter.
The implementation is published on Github.
The development environment is as follows.
- Node 16.14
- Express 4.17.3
- DynamoDB 2012-08-10
The functional requirements are as follows.
- Post a tweet
- Post a Comment for tweet
- Follow user
- Get Timeline
Introduction
Services with hundreds of millions of users, such as Facebook, Amazon, and Youtube, need to handle a lot of traffic. A commonly used approach to handling heavy traffic is scale-out rather than scale-up. Scale-up is expensive because it uses high-performance server. In addition, there is a performance limit for operating on one server.
Let's talk about scale-out. Application can be broadly divided into three layers.
- Client layer
- Server layer
- Database layer
When handling a large amount of traffic, The server layer only processes the data, it does not store it. Therefore, it is easy to scale out.
On the other hand, the database layer becomes difficult to maintain consistency and availability as data is distributed due to scale-out. You also need the logic to decide which data is stored on which node. Data relocation is required when increasing or decreasing the number of nodes. Since these features are not in RDB, we will use NoSQL.
Typical databases that support scale-out include BigTable, HBase, DynamoDB, Cassandra, etc.
Database | Description |
---|---|
BigTable、HBase | Consistent and up-to-date data can be obtained. On the other hand, data cannot be acquired while the lock is applied due to data update. |
DynamoDB、Cassandra | Data is always accessible. On the other hand, old data may be read during data synchronization. |
This time, we will create an API server for SNS, so availability is more important than consistency. Therefore, we use DynamoDB.
What is DynamoDB?
DynamoDB is a key-value database. You can create tables, and each table stores an item. Each item has a key and a value.
You can specify a partition key and a sort key for the item key. The partition key is used to determine the node from within the DynamoDB cluster. The sort key is like an index on a table and is used for sorting.
You can store multiple attribute / value pairs for an item's value. The attributes can be different for each item.
DynamoDB queries are limited and basically narrow down items by partition key and sort key only. When querying using other attributes, it will be slower as the number of items increases because it is necessary to check all items.
When you want to treat other attributes as partition keys, use GSI (Global Secondaly Index). When other attributes are treated as sort keys, LSI (Local Secndary Index) is used.
Database Design
DynamoDB's database design is different from RDB. The flexibility of querying RDBs allows you to design a normalized table first, without considering access patterns to your data. On the other hand, DynamoDB has a limited query pattern, so first determine the access pattern to the data and then design the table based on it. Specifically, we will proceed with the following flow.
- Modeling
- Create use case list
- Design Table
- Create query definition
Modeling
The ER diagram is as follows.
The timeline shows tweets of users that you are following. In SNS, the display speed of the timeline has a great influence on usability. Consider a database design that can display the timeline faster.
Read Heavy / Write Light on the timeline
In the case of a normalized table design, writing data at the time of tweeting is light because data is written only to the Tweets table. On the other hand, reading data on the timeline is heavy. The main flow when reading the timeline is as follows.
- Get a list of IDs of users you are following
- Get tweets from each user you follow
- Merge the retrieved tweets
- Sort merged tweets
The SQL for getting the timeline is as follows.
SELECT
*
FROM
tweets
WHERE
userId IN (
SELECT followeeId FROM follows WHERE followerId = 'user id'
)
ORDER BY
postDate DESC
With this method, the more followers you have, the heavier the load on the timeline will be. It can be said to be a Read Heavy / Write Light method.
Read Light / Write Heavy on the timeline
Consider a Read Light / Write Heavy technique. If you create a Timeline table and want to read the timeline, just query the timeline table. On the other hand, when a user tweeted, make sure to write the tweet to the user's follower's timeline.
The SQL for getting the timeline is as follows.
SELECT
*
FROM
timelines
WHERE
userId = 'user id'
ORDER BY
tweetPostDate
This time, we will use this Read Light / Write Heavy method.
Create use case list
Create a data use case list based on functional requirements to find out how to access the data.
Entity | UseCase | Screen |
---|---|---|
Tweet | getTimelineByUserId | Home |
User | getUserByUserName | User Detail |
Follow | getFolloweesByUserId | User Detail |
Follow | getFollowersByUserId | User Detail |
Follow | getCountFoloweeByUserId | User Detail |
Follow | getcountFollowerByUsreId | User Detail |
Tweet | getTweetsByUserId | User Detail |
Tweet | getTweetByTweetId | Tweet Detail |
Comment | getCommentsByTweetId | Tweet Detail |
Design Table
We will design the table and index based on the use case list. DynamoDB has a limited query pattern, but a method called Overloading GSI allows for flexible queries.
Include the ID in the sort key. Make the order of ID and record creation time the same. Then you can sort the posts by date without using LSI.
Create query definition
Finally, write out the query conditions. Based on this, we will implement around the database.
Entity | UseCase | Parameters | Table / Index | Key Condition |
---|---|---|---|---|
Tweet | getTimelineByUserId | { UserId } | Primary Key | GetItem (ID=UserId AND begins_with(DataType, timeline)) |
User | getUserByUserName | {Username} | GSI-1 | Query (DataValue=Username AND DataType=usserProfile) |
Follow | getFolloweesByUserId | {UserId} | Primary key | Query (ID=userId AND begins_with(DataType, followee) |
Follow | getFollowersByUserId | {UserId} | Primary Key | Query (ID=userId AND begins_with(DataType, follower) |
Follow | getCountFoloweeByUserId | {UserId} | Primary Key | Select COUNT / Query (ID=userId AND begins_with(DataType, followee) |
Follow | getcountFollowerByUsreId | {UserId} | Primary Key | Select COUNT / Query (ID=userId AND begins_with(DataType, follower) |
Tweet | getTweetsByUserId | {UserId} | Primary Key | Query(ID=userId AND begins_with(DataType, tweet) |
Tweet | getTweetByTweetId | {TweetId} | GSI-1 | Query(DataValue=tweetId AND begins_with(DataType, tweet) |
Comment | getCommentsByTweetId | {TweetId} | GSI-1 | Query(DataValue=tweetId AND begins_with(DataType, comment) |
Design API Server
Software Design
Design based on Domain Driven Design. The layer and directory names are matched.
Directory Name | DDD Layer | Components |
---|---|---|
src/domain | Domain Layer | Entity / Value Object / Repository Interface |
src/application | Application Layer | Application Service / Serializer |
src/infrastructure | Infrastructure Layer | Repository / AWS Config |
src/presentation | Presentation Layer | API Server |
ID generation method
Make the order of ID and record creation time the same. It can be handled by ID generation using the numbering table, but it lacks scalability. Use Snowflake as a scalable ID generation method.
This method divides the bit string into three parts. The ID is the decimal number of this bit string.
Part | Description |
---|---|
Epoch time | The number of seconds of difference from a particular time. |
Sequence | It counts up every time an ID is generated and is cleared every second. |
Node number | The number assigned to each node. |
Implementing Snowflake in Node.js is as following.
import { config } from "@src/config";
import { dateToUnixTime } from "./time";
const workerIDBits = 10;
const sequenceBits = 12;
// Use snowflake
// See: https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake
export class IdGenerator {
private workerId: number;
private lastGenerateAt: number;
private sequence: number;
constructor(workerId?: number) {
this.workerId = config.snowflakeWorkerId;
this.lastGenerateAt = dateToUnixTime(new Date());
this.sequence = 0;
}
generate(): number {
const now = dateToUnixTime(new Date());
if (now == this.lastGenerateAt) {
this.sequence++;
} else {
this.sequence = 0;
}
this.lastGenerateAt = now;
// The bit operators ('<<' and '|' ) can handle numbers within
// the range of signed 32 bit integer.
return (
now * 2 ** (workerIDBits + sequenceBits) +
this.workerId * 2 ** sequenceBits +
this.sequence
);
}
}
FAQ
Is the user's profile information duplicated?
Yes, it's a duplicate. When the profile is updated, you need to start Lambda with DynamoDB Stream to keep it asynchronous and consistent.
Isn't the tweet of a user with many followers a heavy writing load?
Yes, it's expensive. Only when the number of followers is large, it is necessary to take some measures such as dynamically merging when the timeline is acquired without writing to the follower's timeline.
Do you not cache?
Let's do it. It's not too late to monitor and find bottlenecks before making a decision.
Conclusion
In this article, I explained how to create a highly scalable API server. Just keep in mind that excessive performance optimization can go wrong when there are no performance issues.
The implementation is published on Github, so please take a look.
Posted on May 1, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.