Mircea Cadariu
Posted on October 31, 2024
He who plants a tree,
Plants a hope.
"Plant a Tree" by Lucy Larcom 🌳
Intro
In this post I'm going to show you a couple of options for managing hierarchical data represented as a tree structure. This is the natural approach when you need to implement things like:
- file system paths
- org charts
- discussion forum comments
- a more contemporary topic: small2big retrieval for RAG applications [1][2]
Now, if you already know what a graph is: a tree is basically a graph without any cycles (more precisely, a connected acyclic graph). Visually, it looks like this:
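Sketched in text form, with the node IDs matching the data generated later in this post:

```text
1
├── 2
│   ├── 4
│   │   ├── 8
│   │   └── 12
│   └── 6
│       ├── 10
│       └── 14
└── 3
    ├── 5
    │   ├── 9
    │   └── 13
    └── 7
        ├── 11
        └── 15
```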
I've learned there are multiple alternatives for storing trees in relational databases. In the sections below, I'll show you three ways of doing it:
- adjacency list
- materialized paths
- nested sets
There will be two parts to this blog post. In this first part I show you how to load and store data using the above approaches - the basics. With that out of the way, the second part will focus on comparing them and their trade-offs; for example, I want to look at what happens at increased data volumes and which indexing strategies are appropriate.
All the code you'll see in the sections below can be found here.
The running use-case I picked will be employees and their managers, and the IDs for each will be exactly the ones you saw in the tree visualisation I showed above.
Local environment
I'm using the recently released PostgreSQL 17 with Testcontainers. This gives me a repeatable setup to work with on my laptop. For example, we can provide initialisation SQL scripts; I use this to automate creating a Postgres database with the necessary tables and populating it with test data.
@TestConfiguration(proxyBeanMethods = false)
class TestcontainersConfiguration {
private static final String POSTGRES = "postgres";
@Bean
@ServiceConnection
PostgreSQLContainer<?> postgresContainer() {
return new PostgreSQLContainer<>(DockerImageName.parse("postgres:17"))
.withUsername(POSTGRES)
.withPassword(POSTGRES)
.withDatabaseName(POSTGRES)
.withInitScript("init-script.sql");
}
}
Let's jump in and have a look at the first approach.
1. The adjacency list model
This was the first solution for managing hierarchical data, so we can expect that it's still widely present in codebases - chances are, you'll encounter it at some point. The idea is that we store the manager's ID, or more generically the "parent ID", in the same row as the employee.
Schema
Let's have a look at the table structure.
create table employees
(
id bigserial primary key,
manager_id bigint references employees,
name text
);
I omitted them here, but in order to ensure data integrity, we should also write constraint checks that ensure at least the following:
- there is a single parent for every node
- no cycles
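To make those two rules concrete, here's a minimal in-memory Java sketch (class and method names are mine, not from the repo) of what such checks would enforce. A child-to-parent map already guarantees a single parent per node; cycles are what we have to detect explicitly:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: the integrity rules the database constraints should
// enforce, checked over an in-memory child -> parent map (adjacency list).
public class AdjacencyListIntegrity {

    public static boolean hasCycle(Map<Long, Long> parentOf) {
        for (Long start : parentOf.keySet()) {
            Set<Long> seen = new HashSet<>();
            Long current = start;
            while (current != null) {
                if (!seen.add(current)) {
                    return true; // visited a node twice: we walked in a circle
                }
                current = parentOf.get(current); // climb towards the root
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Long, Long> ok = new HashMap<>();
        ok.put(2L, 1L); // 2 reports to 1
        ok.put(3L, 1L);
        System.out.println(hasCycle(ok)); // false

        Map<Long, Long> broken = new HashMap<>();
        broken.put(1L, 2L);
        broken.put(2L, 1L); // 1 and 2 manage each other: a cycle
        System.out.println(hasCycle(broken)); // true
    }
}
```

In the database itself, the no-cycles rule typically needs a trigger or a periodic recursive check, since a plain check constraint can't see other rows.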
Generating test data
Especially for Part 2 of this post, we need a way to generate as much data as we want to populate the tables. Let's first do it step by step for clarity, and afterwards recursively.
Iteration 1 - step by step
We start simple by explicitly inserting three levels of employees in the hierarchy.
Now, you might already know about CTEs in PostgreSQL - they are auxiliary queries executed within the context of a main query. Below, you can see how I construct each level on the basis of the one before.
with root as (
insert into
employees(manager_id, name)
select
null,
'root' || md5(random()::text)
from
generate_series(1, 1) g
returning
employees.id
),
first_level as (
insert into
employees(manager_id, name)
select
root.id,
'first_level' || md5(random()::text)
from
generate_series(1, 2) g,
root
returning
employees.id
),
second_level as (
insert into
employees(manager_id, name)
select
first_level.id,
'second_level' || md5(random()::text)
from
generate_series(1, 2) g,
first_level
returning
employees.id
)
insert into
employees(manager_id, name)
select
second_level.id,
'third_level' || md5(random()::text)
from
generate_series(1, 2) g,
second_level;
Cool. Let's now verify that it works as expected. We do a count to see how many elements have been inserted. You can compare it with the number of nodes in the tree visualisation I showed at the beginning of this post.
postgres=# select count(*) from employees;
count
-------
15
(1 row)
Looks alright! Three levels, and in total we get 15 nodes.
Time to move on to the recursive approach. This is needed for Part 2 of this post, where we want to generate a much larger volume of data.
Iteration 2 - recursive
Writing recursive queries follows a standard procedure, similar to how recursion works in regular software development. We define a base step and a recursive step, then "connect" them to each other using union all. At runtime, PostgreSQL will follow this recipe and generate all our results. Have a look.
create temporary sequence employees_id_seq;
insert into employees (id, manager_id, name)
with recursive t(id, parent_id, level, name) AS
(
select
nextval('employees_id_seq')::bigint,
null::bigint,
1,
'root' from generate_series(1,1) g
union all
select
nextval('employees_id_seq')::bigint,
t.id,
level+1,
'level' || level || '-' || md5(random()::text)
from
t,
generate_series(1,2) g
where
level < 4
)
select
id,
parent_id,
name
from
t;
drop sequence employees_id_seq;
After running it, let's do a count again to see if the same number of elements are inserted.
postgres=# select count(*) from employees;
count
-------
15
(1 row)
Cool! We're in business. We can now populate the schema with however many levels and elements we want, and thus completely control the inserted volume. No worries if recursive queries still look a bit hard to grasp; we'll revisit them later when writing the queries to retrieve the data.
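If it helps, the way PostgreSQL evaluates this recipe can be simulated in plain Java: seed a working set with the base step, then keep re-applying the recursive step to the previous iteration's rows until nothing new is produced. This is only a mental model, not code from the repo:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: how PostgreSQL evaluates the recursive CTE above.
// The base step seeds the working set; the recursive step is re-applied to
// the previous iteration's rows until it produces nothing new.
public class RecursiveCteSimulation {

    public record Row(long id, Long parentId, int level) {}

    public static List<Row> generate(int maxLevel, int childrenPerNode) {
        List<Row> result = new ArrayList<>();
        long nextId = 1;

        // Base step: a single root at level 1.
        List<Row> working = List.of(new Row(nextId++, null, 1));
        result.addAll(working);

        // Recursive step: for each row of the previous iteration, emit children.
        while (!working.isEmpty()) {
            List<Row> next = new ArrayList<>();
            for (Row parent : working) {
                if (parent.level() < maxLevel) {
                    for (int i = 0; i < childrenPerNode; i++) {
                        next.add(new Row(nextId++, parent.id(), parent.level() + 1));
                    }
                }
            }
            result.addAll(next);
            working = next; // union all: keep everything, recurse on the new rows
        }
        return result;
    }

    public static void main(String[] args) {
        // 4 levels, 2 children per node: 1 + 2 + 4 + 8 = 15 rows, as in the post.
        System.out.println(generate(4, 2).size()); // prints 15
    }
}
```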
For now, let's proceed to have a look at the Hibernate entity we can use to map our table to a Java class. This is it:
@Entity
@Table(name = "employees")
@Getter
@Setter
public class Employee {
@Id
private Long id;
private String name;
@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "manager_id")
private Employee manager;
@OneToMany(
mappedBy = "manager",
cascade = CascadeType.ALL,
orphanRemoval = true
)
private List<Employee> employees = new ArrayList<>();
}
Nothing special, you saw this coming, just a one-to-many relationship between managers and employees. Let's start querying!
Descendants (top-down)
All subordinates of a manager
For retrieving all employees that are subordinates of a specific manager, referenced by her ID, we'll write a recursive query again. You'll see again a base step, and a recursive step linked up with the base step. Postgres will then repeat this and retrieve all the rows needed to satisfy the query. Let's take the employee with ID = 2 for example. Below is a visual representation that helped me understand how it works. I haven't included all the output results you'd get, just the first few to show the principle.
Here's the JPQL query for querying descendants:
return entityManager.createQuery("""
with employeeRoot as (
select
employee.employees employee
from
Employee employee
where
employee.id = :employeeId
union all
select
employee.employees employee
from
Employee employee
join
employeeRoot root ON employee = root.employee
order by
employee.id
)
select
new Employee(
root.employee.id
)
from
employeeRoot root
""", Employee.class
)
.setParameter("employeeId", employeeId)
.getResultList();
In order to make the queries cleaner by not needing to write the fully qualified name of the record we write the results into, we can use the hypersistence-utils library to write a ClassImportIntegratorProvider, like this:
public class ClassImportIntegratorProvider implements IntegratorProvider {
@Override
public List<Integrator> getIntegrators() {
return List.of(
new ClassImportIntegrator(
singletonList(
Employee.class
)
)
);
}
}
Important: reviewing the generated queries
It works, but let's have a deeper look at what Hibernate generated. It's always good to understand what's happening under the hood; otherwise we might incur inefficiencies on every user request, and that adds up.
For this, we'll start the Spring Boot app with the following setting:
@DynamicPropertySource
static void registerPgProperties(DynamicPropertyRegistry registry) {
registry.add("spring.jpa.show_sql", () -> true);
}
Alright, let's have a look. Here's the query for the descendants generated by Hibernate.
with recursive employeeRoot (employee_id) as
(
select
e1_0.id
from
employees eal1_0
join
employees e1_0 on eal1_0.id = e1_0.manager_id
where eal1_0.id=?
union all
(
select
e2_0.id
from
employees eal2_0
join
employeeRoot root1_0 on eal2_0.id = root1_0.employee_id
join
employees e2_0 on eal2_0.id = e2_0.manager_id
order by
eal2_0.id
)
)
select
root2_0.employee_id
from
employeeRoot root2_0
Hmm - looks like there are some extra steps in here! Let's see if we can simplify it a bit, keeping in mind the picture I showed you earlier: a base step, and a recursive step linked to it. We shouldn't need more than that. See what you think of the following:
with recursive employeeRoot (employee_id) as
(
select
e1_0.id
from
employees e1_0
where
e1_0.id = ?
union all
(
select
e2_0.id
from
employees e2_0
join
employeeRoot root1_0 on e2_0.manager_id = root1_0.employee_id
order by
e2_0.id
)
)
select
root2_0.employee_id
from
employeeRoot root2_0
Much better! We removed some unnecessary joins. This should make the query faster, because it has less work to do.
Final result
As a final step, let's clean up the query above and replace the aliases that Hibernate generates with more human-readable names.
with recursive employee_root (id, name) as
(
select
id,
name
from
employees
where
id = ?
union all
(
select
employees.id,
employees.name
from
employees
join
employee_root on employees.manager_id = employee_root.id
order by
employees.id
)
)
select
id,
name
from
employee_root
order by
name;
Alright, time to see how we go "up" the tree.
Ancestors (bottom-up)
All managers up the chain
Let's first try to write down the conceptual steps for getting the managers of employee with ID = 14.
It looks very much like the one for the descendants you saw above; just the connection between the base step and the recursive step is inverted.
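To make the inversion concrete, here's a minimal in-memory Java sketch (names are mine, not from the repo) of walking up the chain from employee 14, whose path in the test data is 1.2.6.14:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: walking bottom-up through a child -> parent map,
// which is exactly what the recursive ancestors query does row by row.
public class AncestorsWalk {

    public static List<Long> ancestorsOf(long id, Map<Long, Long> managerOf) {
        List<Long> chain = new ArrayList<>();
        Long current = managerOf.get(id); // base step: the direct manager
        while (current != null) {         // recursive step: keep climbing
            chain.add(current);
            current = managerOf.get(current);
        }
        return chain;
    }

    public static void main(String[] args) {
        // Only the branch we need, taken from the path 1.2.6.14 in the test data.
        Map<Long, Long> managerOf = Map.of(14L, 6L, 6L, 2L, 2L, 1L);
        System.out.println(ancestorsOf(14L, managerOf)); // [6, 2, 1]
    }
}
```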
We can write the JPQL query:
return entityManager.createQuery("""
with employeeRoot as (
select
employee.id as employeeId,
employee.manager.id as manager_id
from
Employee employee
where
employee.id = :employeeId
union all
select
employee.id as employeeId,
employee.manager.id as manager_id
from
Employee employee
join
employeeRoot root on employee.id = root.manager_id
order by
employee.id
)
select
new Employee(root.employeeId)
from
employeeRoot root
""",
Employee.class
)
.setParameter("employeeId", employeeId)
.getResultList();
And that's it! I looked at the generated SQL query, but this time I could not find anything extra to shave off like before. Time to move on to approach 2.
2. Materialized paths
ltree is a Postgres extension we can use to work with hierarchical tree structures as materialized paths (starting from the top of the tree). For example, this is how we would record one path: 1.2.4.8. It comes with several useful functions. We can use it as a table column:
create table employees_ltree
(
id bigserial,
path ltree
);
In order to populate the above table with test data, the approach I took is basically to migrate the generated data from the adjacency list table you saw before, using the following SQL command. It's again a recursive query, which collects elements into an accumulator at every step.
with recursive leafnodes AS (
select
array_agg(id) as leaves
from
employees
where
id not in (
select
manager_id
from
employees
where
manager_id is not null
)
), chain as (
select
employees.manager_id,
employees.id,
array[]::bigint[] as descendants
from
employees
union all
select
employees.manager_id,
employees.id,
chain.id || chain.descendants
from
employees,
chain
where
employees.id = chain.manager_id
)
insert into employees_ltree(path)
select
array_to_string(chain.id || descendants, '.')::ltree as path
from
chain,
leafnodes
where
manager_id is null and
leaves && (chain.descendants);
Here are the entries that the above command generated.
postgres=# select * from employees_ltree;
id | path
----+----------
1 | 1.2.4.8
2 | 1.3.5.9
3 | 1.2.6.10
4 | 1.3.7.11
5 | 1.2.4.12
6 | 1.3.5.13
7 | 1.2.6.14
8 | 1.3.7.15
(8 rows)
We have our table ready. We can proceed to write the Hibernate entity. In order to map columns of type ltree, I implemented a UserType. I can then map the path field with @Type(LTreeType.class).
@Entity
@Table(name = "employees_ltree")
@Getter
@Setter
public class EmployeeLtree {
@Id
private Long id;
@Column(name = "path", nullable = false, columnDefinition = "ltree")
@Type(LTreeType.class)
private String path;
}
We're ready to write some queries. In native SQL, it would look like the following:
select
*
from
employees_ltree
where
path ~ '*.2.*'
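As a rough illustration of what that lquery pattern means, here's a small Java sketch. It's a simplified stand-in for the `*.<label>.*` shape only (real lquery is far richer): the pattern matches any path containing the label 2, i.e. every path passing through employee 2.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a materialized path is a dot-separated chain of ancestor
// IDs; '*.2.*' matches any path that contains the label 2 anywhere.
public class MaterializedPathMatch {

    // Simplified stand-in for '*.<label>.*'; not the real lquery engine.
    public static boolean pathContains(String path, String label) {
        return Arrays.asList(path.split("\\.")).contains(label);
    }

    public static void main(String[] args) {
        List<String> paths = List.of("1.2.4.8", "1.3.5.9", "1.2.6.10");
        paths.stream()
             .filter(p -> pathContains(p, "2"))
             .forEach(System.out::println); // 1.2.4.8 and 1.2.6.10
    }
}
```

Note that splitting on the label separator is what makes "1.12.4" not match label "2", just like in lquery.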
Now, we could always write a native query in Spring Data JPA and call it a day. But let's push the envelope a bit and write our queries in JPQL. Because we're using a native Postgres feature, it's not supported out of the box, so we'll have to implement a couple of things to make it possible. We'll first write our custom StandardSQLFunction. This will allow us to define a substitution for the Postgres native operator in our JPQL code.
public class LtreePathContainsSQLFunction extends StandardSQLFunction {
private static final BasicTypeReference<Boolean> RETURN_TYPE = new BasicTypeReference<>("boolean", Boolean.class, SqlTypes.BOOLEAN);
public LtreePathContainsSQLFunction(String name) {
super(name, true, RETURN_TYPE);
}
@Override
public void render(SqlAppender appender, List<? extends SqlAstNode> args, ReturnableType<?> returnType, SqlAstTranslator<?> walker) {
args.getFirst().accept(walker);
appender.append("~");
appender.append("(");
args.get(1).accept(walker);
appender.append(")::lquery");
}
}
We then have to register it as a FunctionContributor, like so:
public class CustomFunctionsContributor implements FunctionContributor {
@Override
public void contributeFunctions(FunctionContributions functionContributions) {
var functionName = "ltree_contains";
functionContributions.getFunctionRegistry()
.register(functionName, new LtreePathContainsSQLFunction(functionName));
}
}
The last step is to create a resource file in the META-INF/services folder called org.hibernate.boot.model.FunctionContributor, where we add a single line with the fully qualified name of the class above.
Okay, cool! We can now write our JPQL query:
@Repository
public interface EmployeeLtreeRepository extends JpaRepository<EmployeeLtree, Long> {
@Query(value = """
select
employee
from
EmployeeLtree employee
where
ltree_contains(path, :path)
""")
List<EmployeeLtree> findAllByPath(@Param("path") String path);
}
This allows us now to pass an ltree path as argument:
employeeLtreeRepository.findAllByPath("*.2.*")
PostgreSQL offers a wide set of functions for working with ltrees. You can find them in the official docs page, and there's also a useful cheatsheet.
Like with adjacency lists, it's important to add constraints to our schema in order to ensure data consistency - here's a good resource I found on this topic.
3. Nested sets
The easiest way to understand nested sets is with an image showing the intuition behind them. At every node of the tree we have an extra "left" and "right" column besides its ID. The rule is that all children have their left and right values in between their parent's left and right values.
Here's the table structure to represent the tree above.
create table employees_nested_sets
(
stack_top integer, -- working column used by the migration procedure below
id bigint not null,
lft integer,
rgt integer
);
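Before we get to the migration script, here's a minimal Java sketch of the containment rule that makes nested sets work (hand-numbered toy data; names are mine, not from the repo). A node B is a descendant of A exactly when A's left is less than B's left and B's right is less than A's right, so subtree queries become simple range comparisons with no recursion at query time:

```java
import java.util.List;

// Illustration only: the nested sets containment rule over a tiny
// hand-numbered tree, where lft/rgt come from a depth-first walk.
public class NestedSetsDemo {

    public record Node(long id, int lft, int rgt) {}

    public static boolean isDescendant(Node child, Node parent) {
        return parent.lft() < child.lft() && child.rgt() < parent.rgt();
    }

    public static void main(String[] args) {
        // 1 is the root, 2 and 3 its children, 4 a child of 2.
        Node root  = new Node(1, 1, 8);
        Node two   = new Node(2, 2, 5);
        Node four  = new Node(4, 3, 4);
        Node three = new Node(3, 6, 7);

        for (Node n : List.of(two, four, three)) {
            System.out.println(n.id() + " under 1: " + isDescendant(n, root));
        }
        System.out.println("4 under 2: " + isDescendant(four, two));  // true
        System.out.println("3 under 2: " + isDescendant(three, two)); // false
    }
}
```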
In order to populate the table, I converted the script from Joe Celko's "SQL for Smarties" book into PostgreSQL syntax. It migrates the data from the table used in the adjacency list section to this new table structure. Here it is in all its glory:
create table employees_copy AS
select * from employees;
create function migrate_to_nested_sets() returns void as $$
declare
counter integer;
max_counter integer;
current_top integer;
begin
counter := 2;
max_counter := 2 * (select count(*) from employees_copy);
current_top := 1;
insert into employees_nested_sets
select 1, id, 1, max_counter
from employees_copy where manager_id is null;
delete from employees_copy where manager_id is null;
while counter <= max_counter-1 loop
if exists(select * from employees_nested_sets as s1, employees_copy as t1 where s1.id = t1.manager_id and s1.stack_top = current_top)
then
begin
insert into employees_nested_sets
select (current_top + 1), min(t1.id), counter, cast(null as integer)
from employees_nested_sets as s1, employees_copy as t1
where s1.id = t1.manager_id
and s1.stack_top = current_top;
delete from employees_copy
where id = (select id from employees_nested_sets where stack_top = current_top + 1);
counter := counter + 1;
current_top := current_top + 1;
end;
else
begin
update employees_nested_sets
set rgt = counter,
stack_top = -stack_top
where stack_top = current_top;
counter := counter + 1;
current_top := current_top - 1;
end;
end if;
end loop;
end;
$$ language plpgsql;
select migrate_to_nested_sets();
Alright, I'm ready to do some queries. Here's how to retrieve the ancestors.
@Query("""
select new Employee(
manager.id
)
from
EmployeeNestedSets employee,
EmployeeNestedSets manager
where
employee.lft between manager.lft and manager.rgt and
employee.id = :id
"""
)
List<Employee> getAncestorsOf(Long id);
For the descendants it looks a bit different: we first have to retrieve the employee's left and right values, after which we can use the query below.
@Query("""
select new Employee(
employee.id
)
from
EmployeeNestedSets employee
where
employee.lft > :lft and
employee.rgt < :rgt
"""
)
List<Employee> getDescendantsUsing(Long lft, Long rgt);
And that's it! You've seen how to go up or down the tree for all three approaches. I hope you enjoyed the journey and find it useful.
PostgreSQL vs. document/graph databases
The database we've used for the examples above is PostgreSQL. It's not the only option; you might wonder, for example, why not choose a document database like MongoDB or a graph database like Neo4j, since they were actually built with this type of workload in mind.
Chances are, you already have your source-of-truth data in PostgreSQL, in a relational model with transactional guarantees. In that case, you should first check how well PostgreSQL itself handles your auxiliary use-cases, in order to keep everything in one place. This way, you avoid the increased cost and operational complexity of spinning up and maintaining/upgrading a separate specialised data store, as well as having to get familiar with it.
Conclusion
There are several interesting options for modelling hierarchical data in your database applications. In this post I've shown you three ways to do it. Stay tuned for Part 2, where we will compare them and see what happens with larger volumes of data.
References
Before writing this post I looked at various existing articles on the topic, and I am grateful to the authors for taking the time to write them.
https://dev.to/yugabyte/learn-how-to-write-sql-recursive-cte-in-5-steps-3n88
https://vladmihalcea.com/hibernate-with-recursive-query/
https://vladmihalcea.com/dto-projection-jpa-query/
https://tudborg.com/posts/2022-02-04-postgres-hierarchical-data-with-ltree/
https://aregall.tech/hibernate-6-custom-functions#heading-implementing-a-custom-function
https://www.amazon.co.uk/Joe-Celkos-SQL-Smarties-Programming/dp/0128007613
https://madecurious.com/curiosities/trees-in-postgresql/
https://schinckel.net/2014/11/27/postgres-tree-shootout-part-2%3A-adjacency-list-using-ctes/