Under-the-hood of GraphQL DataLoader
Craig Taub
Posted on March 11, 2022
In recent years GraphQL has really taken off as a pattern/library/type system. It offers much that REST does not, and its standardization and flexibility have really helped its adoption. I have an article focused on digging deeper into GraphQL here, so today we will focus on another tool in the GraphQL ecosystem, one that is very important and interesting in how it assists applications: the DataLoader.
This is part of my "under-the-hood of" series:
- React hooks
- Web bundlers (e.g. Webpack)
- Type systems (e.g. TypeScript)
- GraphQL
- Git version control
- Source maps
- Docker
- NPM
- Test runners (e.g. Mocha)
- VSCode auto formatters (e.g. Prettier)
- Apollo
The article today will be broken down into 2 parts: an overview, and building our own GraphQL DataLoader.
1. Overview
The NodeJS repository for GraphQL's DataLoader is found at https://github.com/graphql/dataloader, though it has been implemented in many other languages too. It can be used as part of your application's data fetching layer, and its basic job is to reduce requests to backends by 2 means:
- Batching
- Caching
It utilizes different logic and functionality to perform the above efficiently.
The first question is what does GraphQL have to do with this?
It pairs nicely with GraphQL as GraphQL has fields which are designed to be stand-alone functions (resolvers) and it is very easy to share a class instance via the context. The class instance would be our instance of DataLoader.
The natural pairing of DataLoader and GraphQL has produced high success rates - some examples have seen 13 database queries reduced down to 3 or 4.
Batching
Batching is the primary feature of DataLoader. You must pass the library a "batch function" that details how to process the batch.
Within a single tick of the event loop DataLoader gathers all individual loads, then calls the "batch loading function" with all requested keys.
VS your ORM
It's important to note DataLoader does not optimize the queries itself - you can look to an ORM for help there. For example, Objection-JS has logic to avoid "N+1 selects" by utilizing "where in" queries.
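To make the "N+1 selects" point concrete, here is a minimal sketch. It assumes a hypothetical in-memory createFakeDb that only counts queries; it is not Objection-JS or any real ORM API, and findById / findByIds are made-up names standing in for a single-row select and a "where in" select.

```javascript
// Hypothetical in-memory "database" that counts how many queries it receives.
const createFakeDb = () => ({
  queryCount: 0,
  users: [
    { id: 1, name: "Ann" },
    { id: 2, name: "Bob" },
    { id: 3, name: "Cat" },
  ],
  // Stands in for: SELECT * FROM users WHERE id = ?
  async findById(id) {
    this.queryCount++;
    return this.users.find((user) => user.id === id);
  },
  // Stands in for: SELECT * FROM users WHERE id IN (?, ?, ...)
  async findByIds(ids) {
    this.queryCount++;
    return this.users.filter((user) => ids.includes(user.id));
  },
});

const compareQueryCounts = async (ids) => {
  // one query per key
  const naiveDb = createFakeDb();
  await Promise.all(ids.map((id) => naiveDb.findById(id)));

  // a single "where in" query for all keys
  const batchedDb = createFakeDb();
  await batchedDb.findByIds(ids);

  return { onePerKey: naiveDb.queryCount, whereIn: batchedDb.queryCount };
};

compareQueryCounts([1, 2, 3]).then((counts) => console.log(counts));
// { onePerKey: 3, whereIn: 1 }
```

DataLoader's part is collecting the keys into one array; turning that array into a single efficient query is still up to you or your ORM.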
Batch function
This is the function given to the library when you create a new instance:
const ourInstance = new DataLoader(keys => myBatchFunction(keys))
The basic idea is that you check your cache first for a given key: if it exists, return that value, otherwise hit the data source, e.g. a database.
It passes in an array of keys, but there is a constraint that:
1) the returned array size must match the keys coming in
2) the returned array indexes must match the keys coming in
There is a sound reason for that limitation and it's related to the implementation - it is covered in part 2 below.
It's worth highlighting that keys are well suited to table column IDs, so it stands to reason that having a database table for each entity in your data model would fit this mechanism well.
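The two constraints are easy to break by accident, because a database rarely returns rows in the order you asked for and simply skips IDs it cannot find. Below is a minimal sketch of a batch function that respects both constraints. fetchUsersByIds is a hypothetical data source, and returning null for a missing row is just one possible choice for this example.

```javascript
// Hypothetical data source: returns rows in arbitrary order and skips unknown ids.
const fetchUsersByIds = async (ids) => {
  const table = [
    { id: 3, name: "Cat" },
    { id: 1, name: "Ann" },
  ];
  return table.filter((row) => ids.includes(row.id));
};

// Batch function: one result per key, in the same order as the keys.
const myBatchFunction = async (keys) => {
  const rows = await fetchUsersByIds(keys);
  // index the rows by id so each key can find its own row
  const rowsById = new Map(rows.map((row) => [row.id, row]));
  // missing rows become null so the array length still matches the keys
  return keys.map((key) => rowsById.get(key) ?? null);
};

myBatchFunction([1, 2, 3]).then((users) => console.log(users));
// users[0] is Ann, users[1] is null (id 2 was not found), users[2] is Cat
```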
Caching
DataLoader uses a simple in-memory memoization cache. You can swap the in-memory store for something else, e.g. SQLite.
Caching pairs really well with batching: because the batch ensures the requested data has already been pulled from the database, later requests for the same keys can be served from the cache. We will go over this more in the next section.
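As a rough illustration of what a memoization cache means here, the sketch below stores the promise for each key in a Map, so repeat requests for a key never reach the data source. createCachedLoad and fetchUser are hypothetical names for this example only; this is not DataLoader's own code.

```javascript
// Minimal memoization sketch: remember the promise created for each key.
const createCachedLoad = (fetchOne) => {
  const cache = new Map();
  return (key) => {
    if (!cache.has(key)) {
      // first request for this key - hit the data source and remember the promise
      cache.set(key, fetchOne(key));
    }
    // every later request for the key gets the same promise back
    return cache.get(key);
  };
};

// Hypothetical data source that records each key it is asked for.
const fetchedKeys = [];
const fetchUser = async (id) => {
  fetchedKeys.push(id);
  return { id };
};

const cachedLoad = createCachedLoad(fetchUser);

Promise.all([cachedLoad(1), cachedLoad(1), cachedLoad(2)]).then((users) => {
  // three results come back, but only keys 1 and 2 were actually fetched
  console.log(users.length, fetchedKeys);
});
```

Caching the promise rather than the resolved value means two requests made before the first one finishes still share a single fetch.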
2. Building our own GraphQL DataLoader
In this section we will focus on the batching and save the caching for another time. Hopefully it will provide enough context on how caching is utilized.
Within a single tick of the event loop DataLoader gathers all individual loads, then calls the "batch loading function" with all requested keys.
You might be wondering how it does this - so let's look at the most simple example.
const run = async () => {
  const keys = [];

  const batchFunction = (keys) => {
    // bad point 1 - called twice, both times with [1, 2]
    console.log("keys: ", keys);
  };

  const load = async (id) => {
    keys.push(id);
    // schedule the batch function to run on the next tick
    process.nextTick(() => {
      batchFunction(keys);
    });
    // bad point 2 - promise not connected to batch function
    return Promise.resolve(`id: ${id}`);
  };

  // start both loads in the same tick, so both keys are collected
  // before either scheduled callback runs
  const [a, b] = await Promise.all([load(1), load(2)]);
  console.log("a", a); // id: 1
  console.log("b", b); // id: 2
};

run();
This calls our batch function twice - both times with both keys.
The order of events is this:
- call load asynchronously with id 1
- call load asynchronously with id 2
- async load(1)
  - store key 1 in the shared keys array
  - schedule a callback with process.nextTick to run our batchFunction with those keys on the next tick
  - return a resolved promise with the id
- async load(2)
  - store key 2 in the shared keys array
  - schedule a callback with process.nextTick to run our batchFunction with those keys on the next tick
  - return a resolved promise with the id
- The first scheduled callback runs, with both ids 1 and 2 in the keys array
- The second scheduled callback runs, with both ids 1 and 2 in the keys array
So here you can see the basic mechanism of how batching works.
Good 👍🏻
- Runs our batch function with both keys. This means we can cache the database response, and the next time those keys are requested only use the cache
Bad 👎🏻
- Unnecessarily calls the batch function twice with the same keys, running the code again even if it only hits the cache.
- load does not return anything useful, it's a completely isolated resolved promise.
The below example looks to improve on that.
let batchFunction;
let batch;

const dispatchBatch = (currentBatch) => {
  currentBatch.hasDispatched = true;
  const batchPromise = batchFunction(currentBatch.keys);
  batchPromise
    .then((values) => {
      // values[i] belongs to keys[i], so resolve each promise callback by index
      currentBatch.callbacks.forEach(({ resolve }, i) => resolve(values[i]));
    })
    .catch((error) => {
      // a failed batch rejects every load that is waiting on it
      currentBatch.callbacks.forEach(({ reject }) => reject(error));
    });
};

const batchScheduleFn = (cb) => {
  // add more logic if scheduling
  process.nextTick(cb);
};

const getCurrentBatch = () => {
  // !hasDispatched only matters if load is called again on a later tick - this example doesn't
  if (batch && !batch.hasDispatched) {
    return batch;
  }
  const newBatch = { hasDispatched: false, keys: [], callbacks: [] };
  batch = newBatch;
  // pass the batch along so a later batch cannot be dispatched in its place
  batchScheduleFn(() => {
    dispatchBatch(newBatch);
  });
  return newBatch;
};

const load = async (id) => {
  const localBatch = getCurrentBatch();
  localBatch.keys.push(id);
  // add promise callback to batch
  const promise = new Promise((resolve, reject) => {
    localBatch.callbacks.push({ resolve, reject });
  });
  return promise;
};

async function threadTwo() {
  const user = await load(2);
  console.log("threadTwo user", user.id);
}

async function threadOne() {
  const user = await load(1);
  console.log("threadOne user", user.id);
}

const run = async () => {
  // make async
  batchFunction = async (keys) => {
    console.log("keys:", keys);
    // keys: [ 1, 2 ]
    return keys.map((key) => ({ id: key }));
  };
  threadOne();
  threadTwo();
};

run();
It introduces batches which can be scheduled - this is exactly how DataLoader manages it (here).
The order of events is this:
- call threadOne
  - call load async with id 1
- call threadTwo
  - call load async with id 2
- async load(1)
  - get the current batch
    - batch is currently undefined so a newBatch is created
    - we schedule a dispatch by calling dispatchBatch() inside our scheduler batchScheduleFn()
      - this adds the dispatchBatch callback to the nextTick
    - lastly we return the batch
  - we add the id to the keys array on the current batch
  - we create a new promise, add the reject and resolve to our current batch callbacks (so the list index is important)
  - lastly we return the new promise
- async load(2)
  - get the current batch
    - batch currently exists and has not been dispatched so we return that
  - as above we add the id and reject/resolve to the current batch
  - as well as return the promise
- process.nextTick
  - the tick runs dispatchBatch
  - call our batchFunction with the current batch's keys
    - batchFunction returns a promise
  - when that promise resolves (.then), it yields an array of values, one per key
  - we iterate over our batch callbacks - for each callback
    - we find the value at the same index in the batchFunction response (this is why the batch function response indexes are so important)
    - resolve the callback with that value
- await load(1) resolves
  - returning object {id}
- await load(2) resolves
  - returning object {id}
This calls the batch function once with both keys and returns the correct values from the batch function, dealing with both "bad" points from the first example.
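Because each callback is matched to its value purely by index, a batch function that returns too few values would silently hand the wrong data (or undefined) to callers. Below is a sketch of a guard for that, reusing the batch shape from the example above ({ keys, callbacks }). settleBatch and makeBatch are hypothetical helpers written for this article, not DataLoader's actual code, and rejecting a single key when its value is an Error is an assumption of this sketch.

```javascript
// Hypothetical helper: settle every callback in a batch from the batch function's result.
const settleBatch = (batch, values) => {
  const isValid = Array.isArray(values) && values.length === batch.keys.length;
  if (!isValid) {
    // wrong shape - we cannot trust any index, so fail the whole batch
    const error = new TypeError("Batch function must return one value per key, in key order");
    batch.callbacks.forEach(({ reject }) => reject(error));
    return;
  }
  batch.callbacks.forEach(({ resolve, reject }, i) => {
    const value = values[i];
    // an Error at a given index fails only that key
    if (value instanceof Error) reject(value);
    else resolve(value);
  });
};

// Hypothetical helper: build a batch plus the promises waiting on it.
const makeBatch = (keys) => {
  const batch = { hasDispatched: true, keys, callbacks: [] };
  const promises = keys.map(
    () => new Promise((resolve, reject) => batch.callbacks.push({ resolve, reject }))
  );
  return { batch, promises };
};

const good = makeBatch([1, 2]);
settleBatch(good.batch, [{ id: 1 }, { id: 2 }]);
Promise.all(good.promises).then((users) => console.log("resolved", users.length));

const bad = makeBatch([1, 2]);
settleBatch(bad.batch, [{ id: 1 }]); // one value short
Promise.allSettled(bad.promises).then((results) =>
  console.log(results.map((result) => result.status))
);
```

In the example above, a call like this could replace the loop inside the .then of dispatchBatch.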
Thanks so much for reading, I learnt a huge amount about DataLoader and GraphQL from this research and I hope it was useful for you. You can find the repository for all this code here.
Thanks, Craig 😃