Back of the Envelope Calculation

Intro

I have always found this topic misleading and overcomplicated. The version below is a lot simpler than real-life calculations, but it still covers 99% of what you can encounter in your interview process.

What to estimate?

QPS - queries per second

RPS - reads per second

WPS - writes per second

Peak QPS = QPS * 2 (usually)

RW - read/write ratio

Message size - size of one message in bytes (assume a value if not given)

Read Throughput - RPS * message size = N bytes per second

Write Throughput - WPS * message size = N bytes per second

💡 Throughput is how much data actually passes through; bandwidth is how much data CAN pass through (network configuration).
Ex: a 1 Gbps network link can carry at most 125 MB/s (1 Gbps / 8 bits per byte)
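
As a quick sanity check on the bandwidth figure, here is a minimal Python sketch of the conversion. The helper name and the sample workload (12,000 reads/s of 50-byte messages) are illustrative assumptions, not measurements.

```python
GBPS = 10**9  # 1 Gbps in bits per second (decimal units)
MB = 10**6    # 1 MB in bytes (decimal units)


def bandwidth_to_bytes_per_second(bits_per_second: float) -> float:
    """Convert a link speed in bits/s to bytes/s (8 bits per byte)."""
    return bits_per_second / 8


link_capacity = bandwidth_to_bytes_per_second(1 * GBPS)
print(link_capacity / MB)  # 125.0

# Assumed workload: 12,000 reads/s of 50-byte messages
read_throughput = 12_000 * 50  # bytes per second
utilization = read_throughput / link_capacity
print(f"link utilization: {utilization:.2%}")
```

Throughput is what the workload actually pushes through; bandwidth is the ceiling it is compared against.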

Storage - total data stored, usually estimated for N years

Replica storage - storage * 2-3 (the number of replicas)

Cache storage - usually around 20% of storage

Cache replica storage - cache storage * 2-3 (the number of cache replicas)

Basic Numbers

seconds in a day - 24 * 60 * 60 = 86400, roughly 10^5

1 ASCII character - 1 byte

timestamp - 8 bytes (64 bits, 2^64 possible values)

10³ bytes - 1 KB

10⁶ bytes - 1 MB

10⁹ bytes - 1 GB

10¹² bytes - 1 TB

10¹⁵ bytes - 1 PB

10¹⁸ bytes - 1 EB

Powers of two

Power          Exact Value   Approx Value   Bytes
-------------------------------------------------
7                      128
8                      256
10                   1,024   1 thousand     1 KB
16                  65,536                  64 KB
20               1,048,576   1 million      1 MB
30           1,073,741,824   1 billion      1 GB
32           4,294,967,296                  4 GB
40       1,099,511,627,776   1 trillion     1 TB
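
The approximations in the table are close but not exact, and the gap widens as the power grows. A small Python sketch (the `relative_error` helper is a hypothetical name) shows how far each power of two sits from its decimal shortcut:

```python
def relative_error(power: int, approx: int) -> float:
    """Relative gap between 2**power and its decimal approximation."""
    return (2**power - approx) / approx


# Power of two -> decimal approximation used in quick estimates
approximations = {10: 10**3, 20: 10**6, 30: 10**9, 40: 10**12}

for power, approx in approximations.items():
    error = relative_error(power, approx)
    print(f"2^{power} = {2**power:,} vs {approx:,} (off by {error:.1%})")
```

For interview estimates the difference (about 2% at 1 KB, about 10% at 1 TB) is usually small enough to ignore.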

Latency numbers every programmer should know

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1 KB with Zippy                10,000   ns       10 us
Send 1 KB over 1 Gbps network           10,000   ns       10 us
Read 4 KB randomly from SSD*           150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
HDD seek                            10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps  10,000,000   ns   10,000 us   10 ms  40x memory, 10X SSD
Read 1 MB sequentially from HDD     30,000,000   ns   30,000 us   30 ms 120x memory, 30X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

Handy metrics based on latency numbers

Read sequentially from HDD at 30 MB/s
Read sequentially from 1 Gbps Ethernet at 100 MB/s
Read sequentially from SSD at 1 GB/s
Read sequentially from main memory at 4 GB/s
6-7 world-wide round trips per second
2,000 round trips per second within a data center
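
These metrics follow from the latency table by simple division. The sketch below assumes decimal megabytes and takes the table values as given; the function names are illustrative. Note that the HDD figure comes out near 33 MB/s, which the list rounds down to 30 MB/s.

```python
# Time to read 1 MB sequentially, in seconds, taken from the latency table
SECONDS_PER_MB = {
    "HDD": 30e-3,
    "1 Gbps network": 10e-3,
    "SSD": 1e-3,
    "main memory": 250e-6,
}


def mb_per_second(seconds_per_mb: float) -> float:
    """Sustained throughput implied by the time to read 1 MB."""
    return 1 / seconds_per_mb


def round_trips_per_second(round_trip_seconds: float) -> float:
    """How many sequential round trips fit into one second."""
    return 1 / round_trip_seconds


for medium, seconds in SECONDS_PER_MB.items():
    print(f"{medium}: ~{mb_per_second(seconds):,.0f} MB/s")

print(f"world-wide: ~{round_trips_per_second(150e-3):.1f} round trips/s")
print(f"datacenter: ~{round_trips_per_second(500e-6):,.0f} round trips/s")
```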

How to estimate?

Clarify the number of daily active users and the total number of users.
Ask how many requests a user makes per day on average. From here you can get QPS.
Think about peak QPS, reads, and writes.
Assume (and clarify) the message size.
Calculate throughput.
If possible, estimate the average size of stored data, then calculate storage and cache.
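
The steps above can be sketched as one small Python function. This is only an illustration under stated assumptions: every field name is hypothetical, and the defaults (2x peak factor, 5-year retention, 3 replicas, cache at 20% of storage) are the rules of thumb from this page, not fixed constants.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
DAYS_PER_YEAR = 365


@dataclass
class Workload:
    """Inputs you would clarify with the interviewer (names are illustrative)."""
    daily_active_users: int
    reads_per_user_per_day: float
    writes_per_user_per_day: float
    read_size_bytes: int
    write_size_bytes: int
    peak_factor: float = 2.0     # assumption: peak is about 2x average
    retention_years: int = 5     # assumption: keep data for 5 years
    replicas: int = 3            # assumption: 3 copies of everything
    cache_fraction: float = 0.2  # assumption: cache about 20% of storage


def estimate(w: Workload) -> dict:
    """Walk through the steps: rates, peak, throughput, storage, cache."""
    rps = w.daily_active_users * w.reads_per_user_per_day / SECONDS_PER_DAY
    wps = w.daily_active_users * w.writes_per_user_per_day / SECONDS_PER_DAY
    writes_per_day = w.daily_active_users * w.writes_per_user_per_day
    storage = writes_per_day * w.write_size_bytes * DAYS_PER_YEAR * w.retention_years
    cache = storage * w.cache_fraction
    return {
        "rps": rps,
        "wps": wps,
        "peak_qps": (rps + wps) * w.peak_factor,
        "read_throughput_bytes_per_s": rps * w.read_size_bytes,
        "write_throughput_bytes_per_s": wps * w.write_size_bytes,
        "storage_bytes": storage,
        "replicated_storage_bytes": storage * w.replicas,
        "cache_bytes": cache,
        "replicated_cache_bytes": cache * w.replicas,
    }


# Hypothetical small service: 1M daily users, 20 reads and 2 writes each per day
result = estimate(Workload(1_000_000, 20, 2, read_size_bytes=100, write_size_bytes=500))
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

Keeping the assumptions as named defaults makes it easy to change one of them when the interviewer pushes back and see what moves.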

Estimation example

Suppose you have 10M daily active users. Each of them makes 100 read requests per day on average and creates new data 5 times per day.

RPS = 10M * 100 / 86400 ≈ 12,000 r/s

WPS = 10M * 5 / 86400 ≈ 580 w/s

Peak QPS = (12,000 + 580) * 2 ≈ 25,000 requests/s
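
For reference, a short Python sketch of the same arithmetic without rounding. The 2x peak factor is the rule of thumb from earlier on this page, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

dau = 10_000_000          # daily active users
reads_per_user = 100      # read requests per user per day
writes_per_user = 5       # writes per user per day

rps = dau * reads_per_user / SECONDS_PER_DAY
wps = dau * writes_per_user / SECONDS_PER_DAY
peak_qps = 2 * (rps + wps)  # assumption: peak is twice the average

print(f"RPS  ~ {rps:,.0f}")
print(f"WPS  ~ {wps:,.0f}")
print(f"Peak ~ {peak_qps:,.0f}")
```

The exact values land slightly below the rounded ones, which is fine: in an interview, rounding to one or two significant digits is the point.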

Let's assume (clarify with the interviewer) that the average read message size is 50 bytes and the average written message is 1 KB.

Avg Read throughput = 50 bytes * 12 * 10^3 r/s = 600 KB/s

Avg Write throughput = 1 KB * 580 w/s = 580 KB/s
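
A quick check of the throughput arithmetic in Python, using the rounded rates (12,000 r/s and 580 w/s) and the assumed message sizes, with decimal kilobytes:

```python
KB = 10**3  # decimal kilobyte, in bytes

rps, wps = 12_000, 580               # rounded rates from the example
read_size, write_size = 50, 1 * KB   # assumed message sizes, in bytes

read_throughput = rps * read_size    # bytes per second
write_throughput = wps * write_size  # bytes per second

print(f"read:  {read_throughput / KB:,.0f} KB/s")   # read:  600 KB/s
print(f"write: {write_throughput / KB:,.0f} KB/s")  # write: 580 KB/s
```

Reads are about 20 times more frequent than writes here, yet both directions move a similar number of bytes because written messages are 20 times larger.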

Here we can think about the type of data, metadata, etc. Let's assume you have clarified with your interviewer that the size of each new piece of data is 1 KB.

5 years storage - 10M * 1 KB * 5 times per day * 365 days per year * 5 years ≈ 91 TB. With 3 replicas: 91 TB * 3 ≈ 270 TB, or roughly 300 TB.
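
The storage figure is easy to get wrong by a factor of 1,000, so here is the multiplication spelled out in Python. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and three replicas.

```python
TB = 10**12  # decimal terabyte, in bytes

users = 10_000_000
write_size = 1_000            # bytes per new item (1 KB, assumed)
writes_per_user_per_day = 5
years = 5
replicas = 3                  # assumption: three copies of the data

raw = users * write_size * writes_per_user_per_day * 365 * years
replicated = raw * replicas

print(raw / TB)         # 91.25
print(replicated / TB)  # 273.75
```

Rounding 273.75 TB up to 300 TB leaves some headroom, which is a reasonable habit for capacity estimates.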

Let's assume that only 10% of the data is hot and you agreed to cache 20% of it.

Cache storage - 10% * 90 TB * 20% * 3 replicas ≈ 5.4 TB
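
The same cache calculation as a small Python sketch. It reads the example as: 10% of the stored data is hot, 20% of that hot set is cached, and the cache is replicated 3 times. That reading, and the rounded 90 TB starting point, are assumptions.

```python
TB = 10**12  # decimal terabyte, in bytes

storage = 90 * TB      # rounded 5-year storage from the example
hot_fraction = 0.10    # assumption: 10% of the data is hot
cache_fraction = 0.20  # assumption: cache 20% of the hot data
replicas = 3           # assumption: three cache replicas

cache = storage * hot_fraction * cache_fraction
replicated_cache = cache * replicas

print(f"{replicated_cache / TB:.1f} TB")  # 5.4 TB
```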