Michał Knasiecki
Posted on November 13, 2023
In this series of articles, I show how we can reduce database size and improve it’s performance just by using the right structures and proper data types. In the previous post, we were checking how json structure may influence database storage size. The results were promising, so it's worth taking a broader look at the topic. Therefore what else can we do to reduce the size of data on disk?
Enumerated types
Now, let's also look at the enumerated types. You can store them in two ways, either by using a label:
{"_id": 1, "source": "ALLEGRO"}
or by ordinal value:
{"_id": 1, "source": 1}
Here an analogy with the first experiment can be found: again we replace a character string with a numerical value.
Since we got a great result the first time, maybe we could repeat it here?
Unfortunately, the result is disappointing and the reduction in size is small. However, if we think more deeply, we will come to the conclusion that it could not have been otherwise. The enumerated types are not unique identifiers. We are dealing only with a few possible values, appearing many times in the whole collection, so it’s a great chance for MongoDB data compression to prove its worth. Most likely the values from the first collection have been automatically replaced by the engine to its corresponding ordinal values. Additionally, we don't have any profit on the schema here, because they are the same in both cases. The situation might be different for collections that do not have data compression enabled.
This is a good time to take a closer look at how snappy compression works in MongoDB. I've prepared two more collections, with identical data, but with data compression turned off. The results are shown in the chart below, compiled together with the collections with data compression turned on.
It is clear that the use of an ordinal value instead of a label of the enumerated type brings considerable profit for collections with data compression disabled. In case of lack of compression it is definitely worth considering using numerical values.
Useless _class field
If we use spring-data
, each of our collections additionally contains a _class
field storing the package and the name of the entity class containing the mapping for documents from this collection. This mechanism is used to support inheritance of model classes, that is not widely used. In most cases this field stores exactly the same values for each document what makes it useless. And how much does it cost? Let's compare the following documents:
{"_id": 1}
and
{"_id": 1, "_class": "pl.allegro.some.project.domain.sample"}
The difference is considerable, over 50%. Bearing in mind that the compression is enabled, I believe that the impact of the data itself is small and the result is caused by the schema of the collection containing only the key being half as small as the one with the key (1 field vs 2 fields). After removal of the _class
field from a document with more fields, the difference expressed in percent will be much smaller of course. However, storing useless data does not make sense.
Useless indexes
When investigating the case with identifiers stored as strings, I checked whether manipulating the data type could reduce the index size. Unfortunately I did not succeed, so I decided to focus on the problem of redundant indexes.
It is usually a good practice to cover each query with an index so that the number of collscan
operations is as small as possible. This often leads to situations where we have too many indexes. We add more queries, create new indexes for them, and often it turns out that this newly created index takes over the queries of another one. As a result, we have to maintain many unnecessary indexes, wasting disk space and CPU time.
Fortunately, it is possible to check the number of index uses with the query:
db.COLLECTION_NAME.aggregate([{$indexStats:{}}])
It allows you to find indexes that are not used so you can safely delete them.
And finally, one more thing. Indexes can be removed quickly, but they take much longer to rebuild. This means that the consequences of removal of an index by mistake can be severe. Therefore it is better to make sure that the index to be deleted is definitely not used by any query. The latest MongoDB 4.4 provides a command that helps to avoid errors:
db.COLLECTION_NAME.hideIndex(<index>)
The above-mentioned command hides the index from the query optimizer. It does not take this index into account when building the query execution plan, but the index is still updated when modifying documents. Thanks to that, if it turns out that it was needed, we are able to restore it immediately and it will still be up-to-date.
Conclusion
Using a few simple techniques, preparing several versions of the schema and using stats()
command we are able to design a model which does not overload our infrastructure. I encourage you to experiment with your own databases. Maybe you too can save some disk space and CPU time.
Posted on November 13, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.