Cost of Serialization

stneto1 (NYXa)

Posted on October 18, 2023

In the last article we covered how to check the performance of a given query, but the query itself is not the only thing that can make it "slow": moving the data from the database to whoever requested it can be expensive as well, sometimes more expensive than the query. In this article I will try to show you why that matters.

We will focus on a few parts of that path, mostly sending data from the database to the backend, along with some other aspects:

Summary

  1. What usually happens
  2. Data being transported
  3. Double serialization
  4. Closing thoughts

What usually happens

When you have a client decoupled from the backend, the flow usually looks something like:

  • Client makes request
  • Backend receives request
  • Authenticate the user
    • Maybe with a db call to fetch the user
  • Make one or more db calls
    • Using a single query when you can is the best way to avoid paying the round-trip tax of requesting data from the database
  • Send the response back to the client
    • Maybe you transform the data before sending it back

What you don’t see in the list are the multiple serialization and deserialization costs attached to those operations. Fetching from the database, or from any external source of information, makes you pay the deserialization cost, and when you serialize the response back to the client you pay for it again.

“Why do I pay for it?” you ask. Every point where data changes hands (database, server, client) has its own domain for handling data. Even if both ends share the same domain (say, a TypeScript server and client), you still need to send data between them, and for that you need a common language that the sender can emit and the receiver can understand and parse.
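
To get a feel for that cost, here is a minimal sketch (plain TypeScript on Node, no external dependencies) that measures a JSON round trip as the payload grows; the rows are synthetic and loosely mirror the users table queried below.

type User = { id: string; username: string; created_at: string }

function makeRows(count: number): User[] {
  return Array.from({ length: count }, (_, i) => ({
    id: `01H7GXCFN7K4ZQEDP59G31SY${i}`,
    username: `username-${i}`,
    created_at: '2023-08-10 23:44:21.689678',
  }))
}

for (const count of [200, 2_000, 20_000]) {
  const rows = makeRows(count)

  const start = performance.now()
  // Serialize into the common language (JSON)...
  const wire = JSON.stringify(rows)
  // ...and deserialize it back, as the receiver would
  JSON.parse(wire)
  const elapsed = performance.now() - start

  console.log(`${count} rows, ${wire.length} bytes: ${elapsed.toFixed(2)}ms`)
}

The exact numbers will vary by machine, but the cost grows with the amount of data, and you pay it at every boundary.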


Data being transported

What does the data being transported look like? Postgres supports two wire formats: text and binary.

# Packet: t=1695825669.456395, session=213070643341250
PGSQL: type=Query, F -> B
QUERY query=select * from users limit 200;

# Packet: t=1695825669.458287, session=213070643341250
PGSQL: type=RowDescription, B -> F
ROW DESCRIPTION: num_fields=3
  ---[Field 01]---
  name='id'
  type=1043
  type_len=65535
  type_mod=4294967295
  relid=57371
  attnum=1
  format=0
  ---[Field 02]---
  name='username'
  type=25
  type_len=65535
  type_mod=4294967295
  relid=57371
  attnum=2
  format=0
  ---[Field 03]---
  name='created_at'
  type=1114
  type_len=8
  type_mod=4294967295
  relid=57371
  attnum=3
  format=0

# Packet: t=1695825669.458287, session=213070643341250
PGSQL: type=DataRow, B -> F
DATA ROW num_values=3
  ---[Value 0001]---
  length=26
  value='01H7GXCFN7K4ZQEDP59G31SY4D'
  ---[Value 0002]---
  length=10
  value='username-1'
  ---[Value 0003]---
  length=26
  value='2023-08-10 23:44:21.689678'

# Packet: t=1695825669.458287, session=213070643341250
PGSQL: type=DataRow, B -> F
DATA ROW num_values=3
  ---[Value 0001]---
  length=26
  value='01H7GXCFN7865YD7Q16BXEAB8N'
  ---[Value 0002]---
  length=10
  value='username-2'
  ---[Value 0003]---
  length=26
  value='2023-08-10 23:44:21.689678'

...

# Packet: t=1695825677.943246, session=213070643341250
PGSQL: type=BackendKeyData, B -> F
BACKEND KEY DATA pid=1463956811, key=1412773965

# Packet: t=1695825677.943246, session=213070643341250
PGSQL: type=NotificationResponse, B -> F
NOTIFICATION RESPONSE pid=1970496882, channel='name-8785', payload=''

...

This is part of the data being sent from the backend (Postgres) to a client (psql):

  1. PGSQL: type=RowDescription, B -> F: This indicates a "RowDescription" message, sent from the backend (B) to the frontend (F) as part of the result set description when a query is executed.
  2. ROW DESCRIPTION: num_fields=3: The RowDescription message is describing rows with three fields (columns).
  3. [Field 01]: This part provides information about the first field (column) in the row.
  • name='id': The field is named "id", corresponding to the "id" column in the result set.
  • type=1043: The "type" field specifies the data type of the column as an OID; 1043 is the OID of PostgreSQL's varchar type.
  • type_len=65535: The "type_len" field indicates the length of the data type. Here it is -1 read as an unsigned 16-bit integer, which PostgreSQL uses to mark variable-length types such as varchar.
  • type_mod=4294967295: The "type_mod" field carries the data type's modifier (such as the declared length of a varchar). Here it is -1 read as an unsigned 32-bit integer, meaning the type has no modifier.
  • relid=57371: The "relid" field is the OID (Object ID) of the table the column belongs to.
  • attnum=1: The "attnum" field is the attribute number of the column in the table; "1" means this is the first column.
  • format=0: The "format" field specifies the format of the data. A value of "0" indicates text format; in binary mode it would be "1".
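
On the receiving side, the driver uses exactly that metadata to turn each text-format value into a native one. A hypothetical, stripped-down version of that step could look like the sketch below (real drivers such as node-postgres keep a similar OID-to-parser map; the names here are made up):

const VARCHAR = 1043 // 'id'
const TEXT = 25 // 'username'
const TIMESTAMP = 1114 // 'created_at'

// Map a type OID to a function that parses the text-format value
const parsers: Record<number, (raw: string) => unknown> = {
  [VARCHAR]: (raw) => raw,
  [TEXT]: (raw) => raw,
  [TIMESTAMP]: (raw) => new Date(raw),
}

type Field = { name: string; type: number }

// Every value of every DataRow gets parsed individually -- this is
// the deserialization cost you pay per row, per column
function parseDataRow(fields: Field[], values: string[]) {
  const row: Record<string, unknown> = {}
  fields.forEach((field, i) => {
    const parse = parsers[field.type] ?? ((raw: string) => raw)
    row[field.name] = parse(values[i])
  })
  return row
}

parseDataRow(
  [
    { name: 'id', type: VARCHAR },
    { name: 'username', type: TEXT },
    { name: 'created_at', type: TIMESTAMP },
  ],
  ['01H7GXCFN7K4ZQEDP59G31SY4D', 'username-1', '2023-08-10 23:44:21.689678']
)
// => { id: '01H7...', username: 'username-1', created_at: Date }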

If you have deeper knowledge of the topic you may be thinking: “why not use something like protobuf? You could make the process faster!”

Yeah, true. But with protobuf both server and client need to know the shape of the data beforehand, and that defeats the purpose of fetching less data or using SQL aggregations.
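
For illustration, here is a sketch of what that would look like with protobufjs (an assumed dependency, not part of the article's stack): the message shape has to be declared before anything can be encoded, so an ad-hoc SQL aggregation that returns different columns no longer fits the schema.

import protobuf from 'protobufjs'

// The shape must be declared up front; change the query's columns
// and this schema no longer matches
const { root } = protobuf.parse(`
  syntax = "proto3";
  message User {
    string id = 1;
    string username = 2;
    string created_at = 3;
  }
`)

const User = root.lookupType('User')
const buf = User.encode(
  User.create({
    id: '01H7GXCFN7K4ZQEDP59G31SY4D',
    username: 'username-1',
    created_at: '2023-08-10 23:44:21.689678',
  })
).finish()

// A `select count(*), date_trunc(...)` style aggregation would need
// its own message type, defined ahead of time on both sides
console.log(buf.byteLength)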


Double Serialization

Serialization in this article means taking the data, transforming it into a shared common language (JSON, YAML, XML, etc.), and sending it off to whoever requested it, where it is deserialized back into known data.

But what is double serialization? It is when data gets serialized/deserialized twice for the same request. Wonder how? Database → API → Client. Each arrow means the data was serialized, shipped, then deserialized. If you aren’t careful with the data being passed around, you can end up with a lot of garbage to handle and with performance issues, since serialization can be a heavy operation. The more data needs to be transported, the more garbage gets created.

// `fetchFromDb` stands in for whatever database call you make
declare function fetchFromDb(): Promise<unknown>

async function someFunction(_: Request) {
  // Data is first serialized in the database, shipped,
  // then deserialized here to get a valid data object
  const data = await fetchFromDb()

  // Maybe you need to transform the data to aggregate or remove fields
  // const transformedData = someMapFunction(data);

  // You serialize the data again here,
  // ship it for the client to deserialize,
  // and the client then transforms it into something useful for them
  return new Response(JSON.stringify(data), {
    status: 200,
    headers: {
      'Content-Type': 'application/json'
    }
  })
}
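
One way to shave off a hop is to let the database do the final serialization. Here is a sketch using node-postgres (assuming the same users table as in the capture above): Postgres builds the JSON and the backend passes the text straight through, without deserializing it in between.

import { Pool } from 'pg'

const pool = new Pool()

async function someFunction(_: Request) {
  // Ask Postgres to serialize the rows to JSON itself; casting to
  // text keeps the driver from parsing it back into objects
  const { rows } = await pool.query(
    "select coalesce(json_agg(u), '[]'::json)::text as body from (select * from users limit 200) u"
  )

  // The payload is already JSON text: no JSON.parse on the way in,
  // no JSON.stringify on the way out
  return new Response(rows[0].body, {
    status: 200,
    headers: {
      'Content-Type': 'application/json'
    }
  })
}

The trade-off is that any transformation now has to happen in SQL, which is where the earlier point about aggregations comes back in.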

Closing thoughts

Perhaps you’ve never had to think about how much information you fetch and send. It’s much simpler to grab everything without being too careful, but it’s these little details that start piling up, snowball style, into real problems.

Thinking pessimistically:

  • You fetched more information than you needed.
  • The server had to do more work to process and send the information.
  • There are more data packets to be sent over the network, increasing the chance of problems and packet resending, which raises latency.
  • The application had more work to do in processing more information than necessary.
    • If it's an application with a more "relaxed" approach to memory, the application itself has become heavier (garbage collection).
  • With more information to be sent, the response has also become heavier. Both in transforming the information from one domain to a common format and in sending all that information over the network again.
  • More information was transmitted, and now you also start paying for data transmission over the network.
  • The client receiving the information will need to put in more effort to process the information.

The simple decision to fetch more information than necessary created a huge snowball effect. When everything needs to move quickly to get something delivered, it can make sense to ship first and reconsider later how much information is being transmitted. But assuming the problem doesn't exist is a considerable risk, and it can turn into a massive snowball.
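
To make that concrete, here is the difference in a small sketch (node-postgres again, assuming the same users table):

import { Pool } from 'pg'

const pool = new Pool()

// Overfetching: every column of every row gets serialized by the
// database, shipped over the network, and deserialized by the driver
const everything = await pool.query('select * from users limit 200')

// Fetching only what the response actually uses keeps every step of
// the pipeline (database, network, backend, client) cheaper
const justEnough = await pool.query('select id, username from users limit 200')

A small change, but it shrinks every hop described in the list above.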
