Using NoSQL as a cache for views
jpeg729
Posted on July 2, 2019
Last time we spoke about the pros and cons of using EntityFramework to manage our database access. This time we will cover our use of a NoSQL database primarily as a cache. We also use it for fancy full text queries, but we won't cover that in this article.
NoSQL as a cache
In the database we have an Authors table, a Books table, a Publishers table, a Marketers table, a Meetings table, and various other tables containing useful info related to these.
The relationships are vaguely as follows: Each Author has many Books, each Book has a Publisher and a Marketer, (and therefore each Author is linked by extension to several Publishers and Marketers). There are also Meetings of various types which can involve some or all of these actors.
(Disclaimer: these table names are fictitious, however the relationships between them do resemble the sorts of relations that I have to deal with at work.)
Now, we need to be able to display the books in a list with their Titles, Authors, Publishers and various assorted details. We also want the user to be able to specify search terms that could match the book's title, the author's name, the publisher's name, the themes found in the book, and so on. EntityFramework 6 can do all the fancy joins needed to deliver all of this detail, but the queries it generates are far from optimised, and what's more, we need to deliver all of this information page by page. Besides, SQL Server isn't great at full text querying. So we put everything into our NoSQL database and access it from there.
A NoSQL database stores json documents and indexes them for querying. We fill the indices by loading the entities from SQL Server using a bunch of Includes and then we push them into the relevant index in our NoSQL database. The code looks something like this...
var authors = ctx.Authors
    .Include(a => a.Books)
    .Include(a => a.Books.Select(b => b.Publisher))
    .Include(...);

// Push the documents into the NoSQL index in batches rather than in one huge request.
foreach (var batch in authors.Batch(batchsize))
{
    noSqlClient.IndexMany(batch);
}
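(Batch isn't standard LINQ; it's a helper that splits a sequence into fixed-size chunks, such as the one MoreLINQ provides, so that each bulk indexing request stays a manageable size.)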
Naturally, there are other list views that we need to populate. For example, a list of Authors with various accompanying details, and a list of Publishers with different details. Hence we end up with three different indices each containing a subset of the data seen from a different angle.
Bad design
Many of you will tell me that we are doing it wrong, and I will agree with you. It is inefficient and not particularly useful to store entire entities with all their fields, plus their related entities with all their fields, when we only need subsets of all that information for our views. Instead we should store view models containing a more limited amount of data, sufficient for the views we need to present to the user, plus the fields needed for the various search options. But the app wasn't designed like that, and we can't take much time out from implementing new features, so we won't be changing it any time soon.
Difficulties
Obviously, such an architecture comes with a number of drawbacks, for instance...
- Keeping the cache up to date
- EntityFramework is pretty slow at loading an entity with lots of many-to-many joins
Keeping it up to date
This is a kicker because whenever a publisher's details are modified, for example, we need to update all the authors in the authors' index who have books published by that publisher. Likewise when we update an author's details we need to update his/her document in the authors' index, and the publishers he/she has used in the publishers' index. And so on...
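To make the publisher example concrete, detecting which author documents have gone stale boils down to a query along these lines. This is only a sketch with illustrative property names (PublisherId, AuthorId); the real model has more relationships to follow...

// Which documents in the authors' index are stale now that this publisher has changed?
var affectedAuthorIds = ctx.Books
    .Where(b => b.PublisherId == modifiedPublisher.Id)
    .Select(b => b.AuthorId)
    .Distinct()
    .ToList();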
So how do we manage that infallibly and efficiently?
Well, we manually transformed the list of Includes into a set of recipes for detecting which indexed entities would be affected by an update to any child entity. Then we added some custom code that runs automatically each time we call ctx.SaveChanges(). This custom code uses the recipes to detect which documents in each index need updating, then loads the root entities from our NoSQL database, loads the child entities from SQL Server with any necessary sub-Includes, patches the root entities, and stores the result back into our NoSQL database. This is somewhat more efficient than loading the root entities with their Includes and re-indexing the whole lot.
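For illustration, the SaveChanges hook looks conceptually like the sketch below. The recipe registry and its interfaces are hypothetical stand-ins for our real code, not the actual implementation...

using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

// Hypothetical abstraction: one recipe knows, for a given child entity type,
// how to find and patch the affected root documents in one index.
public interface IIndexUpdateRecipe
{
    void PatchAffectedDocuments(object changedEntity, DbContext ctx);
}

public interface IIndexUpdateRecipes
{
    IEnumerable<IIndexUpdateRecipe> For(Type childEntityType);
}

public class LibraryContext : DbContext
{
    private readonly IIndexUpdateRecipes _recipes;

    public LibraryContext(IIndexUpdateRecipes recipes)
    {
        _recipes = recipes;
    }

    // DbSet properties omitted for brevity.

    public override int SaveChanges()
    {
        // Snapshot the changed entities before base.SaveChanges() resets their state.
        var changed = ChangeTracker.Entries()
            .Where(e => e.State == EntityState.Added
                     || e.State == EntityState.Modified
                     || e.State == EntityState.Deleted)
            .Select(e => e.Entity)
            .ToList();

        var result = base.SaveChanges();

        // Let every recipe that cares about this entity type patch the stale documents.
        foreach (var entity in changed)
        {
            foreach (var recipe in _recipes.For(entity.GetType()))
            {
                recipe.PatchAffectedDocuments(entity, this);
            }
        }

        return result;
    }
}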
EntityFramework join optimisation
We noticed recently that when loading the authors in order to fill the authors' index, EF would pause for nearly a full minute before it started transferring any data. We tested the query itself and it wasn't anywhere near that expensive.
EntityFramework.Plus provides an .IncludeOptimized extension method that is faster than EntityFramework's .Include method, but it has two drawbacks: it doesn't support many-to-many joins, and it doesn't support .AsNoTracking(). Now obviously, if we are loading entities solely in order to push them into our NoSQL database, we don't need EntityFramework to track changes to those entities, so we should be able to optimise a little by asking EF not to track them.
We ended up optimising this by loading the Authors with AsNoTracking, then loading the Books in a separate query, also with AsNoTracking, and manually attaching each book to the right author. The total time went down from 1 minute 45 seconds to around 25 seconds. Running the queries in parallel allowed us to nearly halve that again.
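Roughly, the stitching step looks like this. It's a simplified sketch that assumes a one-to-many relationship with an AuthorId foreign key on Book; our real model has more navigation properties to wire up...

// Two untracked queries instead of one big join.
var authors = ctx.Authors.AsNoTracking().ToList();
var books = ctx.Books
    .AsNoTracking()
    .Include(b => b.Publisher)
    .ToList();

// Group the books by author and attach each group to its author manually.
var booksByAuthor = books.ToLookup(b => b.AuthorId);
foreach (var author in authors)
{
    author.Books = booksByAuthor[author.Id].ToList();
}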
Loading an Author from SQL Server using this method takes around 30-40ms, whereas loading the same Author with all the related entities from our NoSQL database takes only ~1ms.
Code maintenance
Unfortunately, instead of a single list of Includes, we now have some other bits of code to maintain as well.
The recipes for keeping the indices up to date could be fairly easily generated from the list of Includes using reflection, but we haven't taken the time to do it yet.
The code for the schema generation is fully automatic and needs no maintenance. It just needs an up-to-date list of Includes and does its magic automatically. Personally, I am very happy about this because the code that does this is a pretty ugly mess of reflection and Expression tree building.
The code for optimised loading needs adjusting manually each time the Includes change. In theory this code could also be generated automatically at runtime, but that would be pretty hard to do. The tricky thing is figuring out which navigation properties to attach the child entities to, though I suspect we could extract those details from EF 6 if we really tried (or if we read parts of the source code for EntityFramework.Plus).
Where to go from here
Storing view models instead of full entities may represent a decent efficiency gain, but the speed-up could turn out to be negligible. Currently, the cache updates are managed by recipes that could be automatically generated and it is hard to see how we could achieve something similar if we used view models instead. Automapper's ProjectTo might be able to produce an Expression tree that we could usefully parse and convert into recipes for detecting the required updates.
I have been able to use a list of Includes to generate the SQL query necessary to make SQL Server dish up a json document containing an entity and its related entities, and I am confident that the resulting query could be used to create a computed column, and that the resulting column could be indexed in order to force SQL Server to keep the json document up to date. That would solve our slow loading problems and our cache update difficulties in exchange for slower SQL updates, and we would still need our NoSQL database for smarter text queries.
That said, we might decide that the web app must be usable offline, in which case we would keep enough data cached in the browser to allow us to run text search queries via javascript. This could eliminate the need for our NoSQL database.
Conclusion
We have faced some serious difficulties in our use of a NoSQL database, but I am confident that we have overcome the worst of these. There are many ways in which we could improve this architecture and I can't wait to see what we will be able to achieve.