Handling Dynamic Objects in CrateDB

Introduction

CrateDB combines the advantages of typical SQL databases and strict schemas with the dynamic properties of NoSQL databases. While object-relational databases like PostgreSQL allow to store and process JSON data in columns, CrateDB handles objects as real first-level citizens. This allows users to access object properties in the same manner as columns in a table, including full indexing and aggregation capabilities.

In CrateDB an object is a container data type and is represented as a collection of key-value pairs. An object property can contain a value of any type, including nested objects or arrays. There are different kinds of variations in how CrateDB handles objects. From very strict schema enforcement to completely ignoring schema definitions. Further through the dynamic mapping capabilities, new object properties can be mapped to a datatype and fully indexed on the fly. The goal of this article is to give an overview of different types of objects supported in CrateDB and to improve the overall understanding of the corresponding indexing strategy.

Overview of Objects in CrateDB

In CrateDB objects can be defined as strict, dynamic, or ignored depending on whether we want to enforce strict or dynamic schemas and whether dynamically added properties result in a schema update of sub-columns. For the official syntax on object specification in CrateDB, we refer to the documentation.

Strict, Dynamic and Ignored Objects

In CrateDB, an object can be specified as strict if we want to enforce a predefined number and type of sub-columns. This leads to the rejection of objects with additional sub-columns that are not defined upfront in the schema. If the object schema is not predefined, one can use a dynamic object that allows adding new sub-columns dynamically. However, every time an object with a new sub-column is added, the table schema gets updated. In the updated schema, a new sub-column will have a type that is interfered based on its value and will be indexed. Finally, if we want the flexibility of dynamic objects without changing table schema, we should use ignored objects. With ignored objects, we still have column constraints as with strict or dynamic objects, but adding new sub-columns dynamically does not lead to schema updates or index updates. In the following sections, we will talk about object indexing in more detail. To clarify the difference between different object types let’s consider the table with three columns defined as follows:

CREATE TABLE obj_table(
  obj1 OBJECT(STRICT) AS(
    name TEXT
  ),
  obj2 OBJECT(DYNAMIC) AS (
    name TEXT
  ),
  obj3 OBJECT(IGNORED) AS (
    name Text
  )
)

The first column stores an object with strict predefined mapping, the second column stores objects with dynamic mapping, and the third column stores objects ignoring the datatype of properties. For simplicity reasons, let’s consider each object to have one sub-column called name. Now, let’s imagine that we want to add the following object as the first, second, and third columns:

{
  name = 'Ana',
  age = 21
}

In the case of the first column, CrateDB will throw the Exception because dynamically added columns are rejected in strict objects. In the case of the second column, insertion of a new value causes the change of obj_table schema as follows:

CREATE TABLE IF NOT EXISTS obj_table (
   obj1 OBJECT(STRICT) AS (
      name TEXT
   ),
   obj2 OBJECT(DYNAMIC) AS (
      age BIGINT,
      name TEXT
   ),
   obj3 OBJECT(IGNORED) AS (
      name TEXT
   )
)

Based on the guessed value type, CrateDB updates the sub-column type. In the case of age, the sub-column is anticipated as BIGINT. Finally, the insertion of the same object to the ignored column would store the value without affecting the table schema.

Indexing Objects in CrateDB

Index structures are one of the most important tools that database management systems leverage to improve query performance. CrateDB indexing strategy is based on a Lucene index which enables efficient search on very large datasets.

By default, CrateDB stores object in a flat dotted structure. For example,

obj1 = {
  name = 'Ana',
  age = 21
}

becomes:

{
  obj1.name = 'Ana',
  obj1.age = 21
}

CrateDB builds indexes for each sub-column (i.e., object property), depending on the type of sub-column value. For example, for a sub-column that stores text value, CrateDB builds an inverted index. For a numeric value, it builds BKD-trees, etc. To learn more about data structures used for building indexes in CrateDB, check our previous article on indexing and storage. Furthermore, for each stored object CrateDB builds a Column Store for the JSON representation of that object.

Indexing Dynamically Added Columns

The exception to the indexing strategy comes with objects declared as ignored. Dynamically added columns of ignored objects do not result in new indexes, as they do not affect table schema. Filtering on non-indexed columns is slower than on indexed columns, as value lookup is performed for each matching row. Furthermore, CrateDB does not create column stores for dynamically added values in ignored objects. This means that operations such as sorting, grouping, and aggregation operations are significantly slower on ignored objects.

However, CrateDB indexes all statically defined columns of ignored objects. To further clarify how the object indexing works, let’s consider our obj_table with three object columns: strict, dynamic, and ignored and the following insert statement:

INSERT INTO obj_table VALUES (
  {name = 'A'}, {name = 'B', age = 20}, {name = 'C', age = 22}
)

The resulting indexes are further illustrated in the table below:

Column	Type	Index Structures
`obj1`	static object	column store
`obj1.name`	text	column store and inverted index
`obj2`	dynamic object	column store
`obj2.name`	text	column store and inverted index
`obj2.age`	bigint	column store and BKD tree
`obj3`	ignored object	column store
`obj3.name`	text	column store and inverted index
`obj3.age`	no schema update	no index

As shown, in the case of dynamically added columns, a new index is created if the object is dynamic. If the object is ignored, dynamically columns do not result in a new index.

Accessing Undefined Properties

Another difference between dynamic and ignored objects is the way the queries are performed on objects that do not exist. As an example, let’s consider the list of two dynamic objects as follows:

{
  name = 'Ana',
  age = 21
},
{
  age = 22,
  phone = 'abcde'
}

Now, let’s query the objects on three columns: name, age, and phone. The result is illustrated in the table below:

obj.name	obj.age	obj.phone
'Ana'	20	`NULL`
`NULL`	22	'abcde'

Accessing the column that is not defined the dynamic object returns NULL as long as the column is defined in the table schema. However, if we attempt to query the column that is not part of the schema, e.g., column address, the system will throw ColumnUnknownException. The difference between dynamic and ignored objects in this regard is that a query on a column that is not part of the schema would always return a NULL value.

Now, let’s consider that we want to extend the obj2 column in obj_table with a new subcolumn called address. The query for adding a new subcolumn looks like the follows:

ALTER TABLE obj_table ADD COLUMN obj2['address'] TEXT;

After adding a new subcolumn the table schema changes, as well as the indexes. This means that querying address property on obj2 would return NULL value. The same behavior is expected when querying address on obj3. however, as already discussed this is expected behavior as obj3 is ignored object. What is important to mention here is that changing schema affects existing indices: altering object column has as a consequence that only values added after changing schema are indexed.

Summary

This article describes the fundamentals of objects in CrateDB. Objects can be either strict, dynamic, or ignored and which one works the best depends on a use case. For properties in strict and dynamic objects, CrateDB generates indexes that allow the fast search of stored objects and efficient updates. As we illustrated, there are exceptions to this rule which should be taken into account when defining table schema.

Blog