Global Secondary Index of AntDB-M

INTRODUCTION

Indexing is a common performance enhancement technique for databases, and with unique indexes, record uniqueness can also be provided. Usually, each table contains a primary key index (Primary Index), and indexes other than primary key indexes are collectively referred to as secondary indexes (Secondary Index).

In the distributed architecture, data is sliced according to certain rules, and the data node where the data is located is calculated by the slicing key. Distributed data storage and query, mainly face the following 2 problems:

Long query time: When querying, if the query condition contains the partition key, it can quickly locate the data node, otherwise, it needs to traverse all the nodes.
Records are not unique: Unique indexes in standalone mode cannot guarantee the uniqueness of records.

This poses a challenge to the performance and data accuracy of distributed databases. Global Secondary Index (GSI) was created to solve these two problems simultaneously.

The implementation of global secondary index

Outlook

The global secondary index itself is also a distributed table, with the slice key being the index of the table to which the index belongs (hereafter referred to as the primary table). A global secondary index also has its own index and primary key. Its primary key corresponds to the primary key of the primary table, or a unique index without nullable columns, which is used to uniquely locate a primary table record. Its normal index is the same as the primary table index. A primary table can have multiple global secondary indexes.
Figure 1: Visual diagram of global secondary indexes

Global uniqueness

A unique index is a common index type used in databases to both improve query performance and ensure the uniqueness of records. In a distributed environment, an unique index of standalone database would not be able to guarantee the uniqueness of data across all nodes.

For example, table tab1 is defined as follows: user code (uid), order date (odate), and product code (pid) as unique keys. The self-incrementing field id is used as the slice key to distribute the data evenly to all data nodes of the cluster.
Figure 2: Distributed table structure

Standalone mode The unique key uk_1 prevents users from ordering multiple products at the same time. The distributed schema is also based on the uniqueness judgment of the stand-alone schema.
Distributed schema Since the data is distributed by the self-incrementing field id, the same order data may be distributed in different data nodes (ignore the rationality of using id as the slice key here, it is likely that the slice key is different from the unique key in business). There are three options for handling unique keys.

Option 1: Without the special treatment, the problem of inserting duplicate records will occur and the unique key will fail.
Figure 3: Duplicate insertion of unique records due to no global secondary index

Option 2: Iterate through all nodes, and determine whether there is a conflicting primary key by iterating through all nodes query. This way is undoubtedly very inefficient.

Option 3: Add a global secondary index table with the primary table unique key as the slice key and the primary table unique key as the unique index. The records with the same unique key are located to the same data node according to the slice key, and then the local unique key of the node is used to make a quick uniqueness judgment, so that global uniqueness can be achieved efficiently.

Figure 4: Ensuring global uniqueness by using global secondary indexes

Data consistency

From the above, it is clear that the global secondary index is a table independent of the primary table, but logically, the global secondary index is part of the primary table. For OLTP databases, it is important to ensure strong consistency of all data in the primary table, i.e. to ensure consistency of the global index with the primary table data. Here the primary table and the global secondary index table have different partition keys, so there is also the possibility that the same record data is located on different nodes. To ensure consistency, distributed transactions, atomic DDL and other techniques are used here.

Distributed transactions

We know from the above that the primary table and the global secondary index table are two separate tables with different slice keys, which means one data update may involve multiple data nodes. Here distributed transactions are used to ensure data consistency between different nodes, the primary table, and the global secondary index. For data update operations, the normal transaction is converted to a distributed transaction (XA protocol). To reduce the number of RPC calls, the transaction start and data operations are combined into one RPC for processing. If the records of the primary table and the global secondary index are located in the same node, the operations of both tables are in the local transaction of the same node.
Figure 5: Distributed transactions

Atomic DDL

AntDB splits the creation and modification of global secondary index into sub-operations of the creation and modification of primary index, ensuring atomicity globally for the primary table itself and the global secondary index-related operations.

Performance improvement

Dynamic concurrency

Due to business development, when the system data keeps increasing, the application needs to build or adjust indexes on the original table stored in AntDB to speed up the query and access speed. When the data volume of the primary table is particularly large (for example, over 1 million rows), the data change of the global secondary index table is a key issue that cannot be ignored.

AntDB uses dynamic concurrency technology to automatically perform parallel processing according to the number of records of the operated table. The batch allocation technique is adopted considering each parallel local data. You can also adjust the parameters appropriately according to your business volume to achieve the optimal performance.

Index override

The global secondary index of AntDB adopts index override technology to further improve the database query performance. When the global secondary index table satisfies the query column, there is no need to go back to the primary table to query the index override; it also solves the problem that the primary table files are stored in different nodes, avoiding the performance degradation that may be brought by cross-node access.

Conclusion

Through global secondary index, AntDB avoids the partition scan problem caused by inconsistency between partition keys and index columns, and ensures the linear scalability of distributed cluster nodes. Data consistency is ensured through distributed transactions, standalone transactions, atomic DDL and other techniques. Query performance is improved through index overwriting, dynamic concurrency, and other techniques. Adjustable global secondary indexes make the adjustment of business systems simpler and more flexible.

Blog

Global Secondary Index of AntDB-M

AntDB

INTRODUCTION

The implementation of global secondary index

Performance improvement

Conclusion

Join Our Newsletter. No Spam, Only the good stuff.

Related