Using NoSQL as a cache for views

Last time we spoke about the pros and cons of using EntityFramework to manage our database access. This time we will cover our use of a NoSQL database primarily as a cache. We also use it for fancy full text queries, but we won't cover that in this article.

NoSQL as a cache

In the database we have an Authors table, a Books table, a Publishers table, a Marketers table, a Meetings table, and various other tables containing useful info related to these.

The relationships are vaguely as follows: Each Author has many Books, each Book has a Publisher and a Marketer, (and therefore each Author is linked by extension to several Publishers and Marketers). There are also Meetings of various types which can involve some or all of these actors.

(Disclaimer: these table names are fictitious, however the relationships between them do resemble the sorts of relations that I have to deal with at work.)

Now, we need to be able to display the books in a list with their Titles, Authors, Publishers and various assorted details. We also want the user to be able to specify search terms that could match the book's title, the author's name, the publisher's name, the themes that are found in the book, and so on. Now EntityFramework 6 can do all the fancy joins needed to deliver all of this detail, but the requests are far from optimised, and what's more, we need to deliver all of this information page by page. Besides SQL Server isn't great at full text querying. So we put everything into our NoSQL database and access it from there.

A NoSQL database stores json documents and indexes them for querying. We fill the indices by loading the entities from SQL Server using a bunch of Includes and then we push them into the relevant index in our NoSQL database. The code looks something like this...

var authors = ctx.Authors
    .Include(a => a.Books)
    .Include(a => a.Books.Select(b => b.Publisher))
    .Include(...);

foreach (var batch in authors.Batch(batchsize))
{
    noSqlClient.IndexMany(batch);
}

Naturally, there are other list views that we need to populate. For example, a list of Authors with various accompanying details, and a list of Publishers with different details. Hence we end up with three different indices each containing a subset of the data seen from a different angle.

Bad design

Many of you will tell me that we are doing it wrong, and I will agree with you. It is inefficient and not particularly useful to store entire entities with all their fields, plus their related entities with all their fields when we only need subsets of all that information for our views. Instead we should store view models containing a more limited amount of data sufficient for the views we need to present to the user, plus the fields needed for the various search options. But the app wasn't designed like that and we can't take too much time out from implementing new features, so we change it too quickly.

Difficulties

Obviously, such an architecture comes with a number of drawbacks, for instance...

Keeping the cache up to date
EntityFramework is pretty slow at loading an entity with lots of many-to-many joins

Keeping it up to date

This is a kicker because whenever a publisher's details are modified, for example, we need to update all the authors in the authors' index who have books published by that publisher. Likewise when we update an author's details we need to update his/her document in the authors' index, and the publishers he/she has used in the publishers' index. And so on...

So how do we manage that infallibly and efficiently?

Well, we manually transformed the list of Includes into a set of recipes for detecting which indexed entities would be affected by an update to any child entity. Then we added some custom code that would automatically run each time we call ctx.SaveChanges(). This custom code uses the recipes to detect which documents of each index need updating, then loads the root entities from our NoSQL database, loads the child entities from the database with any necessary sub-Includes, and patches the root entities and stores the result back into our NoSQL database. This somewhat more efficient than loading the root entities with their Includes and re-indexing the whole lot.

EntityFramework join optimisation

We noticed recently that when loading the authors in order to fill the authors' index, EF would pause for nearly a full minute before it started transferring any data. We tested the query itself and it wasn't anywhere near that expensive.

EntityFramework.Plus provides an .IncludeOptimized extension method that is faster than EntityFramework's .Include method. But with two drawbacks. It doesn't support many-to-many joins, and it doesn't support .AsNoTracking(). Now obviously, if we are loading entities solely in order to load them into our NoSQL database, we don't need EntityFramework to track the changes to those entities, so we should be able to optimise a little by asking EF not to track changes.

We ended up optimising this by loading the Authors with AsNoTracking, then loading the Books in a separate query also with AsNoTracking and manually attaching each book to the right author. The total time went down from 1 minute 45 seconds, to around 25 seconds. Running the queries in parallel allowed us to nearly halve that again.

Loading an Author from SQL Server using this method takes around 30-40ms, whereas loading the same Author with all the related entities from our NoSQL database takes only ~1ms.

Code maintenance

Unfortunately, instead of a single list of Includes we now have some other bits of code to maintain as well.

The recipes for keeping the indices up-to-date could be fairly easily generated from the list of Includes using reflection, but we haven't taken the time to do it yet.

The code for the schema generation is fully automatic and needs no maintenance. It just needs an up-to-date list of Includes and does its magic automatically. Personally, I am very happy about this because the code that does this is a pretty ugly mess of reflection and Expression tree building.

The code for optimised loading needs adjusting manually each time. In theory this code could also be generated automatically at runtime, but this would be pretty hard to do. The tricky thing is figuring out which navigation properties to use to attach child entities to, though I suspect we could extract those details from EF 6 if we really tried, (or if we read parts of the source code for EntityFramework.Plus).

Where to go from here

Storing view models instead of full entities may represent a decent efficiency gain, but the speed-up could turn out to be negligible. Currently, the cache updates are managed by recipes that could be automatically generated and it is hard to see how we could achieve something similar if we used view models instead. Automapper's ProjectTo might be able to produce an Expression tree that we could usefully parse and convert into recipes for detecting the required updates.

I have been able to use a list of Includes to generate the SQL query necessary to make SQL Server dish up a json document containing an entity and its related entities, and I am confident that the resulting query could be used to create a computed column, and that the resulting column could be indexed in order to force SQL Server to keep the json document up to date. That would solve our slow loading problems and our cache update difficulties in exchange for slower SQL updates, and we would still need our NoSql database for smarter text queries.

That said, we might decide that the web app must be usable offline, in which case we would keep enough data cached in the browser to allow us to run text search queries via javascript. This could eliminate the need for our NoSQL database.

Conclusion

We have faced some serious difficulties in our use of a NoSQL database, but I am confident that we have overcome the worst of these. There are many ways in which we could improve this architecture and I can't wait to see what we will be able to achieve.

Blog