John Carroll
Posted on May 27, 2021
This post is intended for people already familiar with Firebase's Firestore as well as RxJS observables.
Recently, I realized that I could increase performance and reduce the cost of Firebase's Firestore using a simple query cache.
- Edit: Having used this in production for a while, I find this technique is mainly a performance improvement and does not have a large impact on cost (possibly because I've also enabled Firestore persistence, which already reduces cost).
The problem
While Firebase's Firestore is pretty fast to begin with, it doesn't support automatically joining data together (i.e. SQL joins). A common workaround is to use observables to do something like the following (example taken from the RxFire docs):
import { collectionData } from 'rxfire/firestore';
import { getDownloadURL } from 'rxfire/storage';
import { combineLatest } from 'rxjs';
import { map, switchMap } from 'rxjs/operators';

// `firebase` is assumed to be the already-imported Firebase namespace.
const app = firebase.initializeApp({ /* config */ });
const citiesRef = app.firestore().collection('cities');
const storage = app.storage();

collectionData(citiesRef, 'id')
  .pipe(
    switchMap(cities => {
      // One extra download-URL request is made per city returned
      return combineLatest(cities.map(c => {
        const ref = storage.ref(`/cities/${c.id}.png`);
        return getDownloadURL(ref).pipe(map(imageURL => ({ imageURL, ...c })));
      }));
    })
  )
  .subscribe(cities => {
    cities.forEach(c => console.log(c.imageURL));
  });
In this example, a collection of cities is being loaded and then image data needs to be separately fetched for each city returned.
Using this method, queries can quickly slow down as one query turns into dozens of queries. In a small app I made for my nonprofit, all these queries added up over time (as new features were added) and all of a sudden you'd be waiting 5-30 seconds for certain pages to load. Any delay is especially annoying if you navigate back and forth between pages quickly.
"I just loaded this data a second ago, why does it need to load everything again?"
What I wanted to do was cache query data for a period of time so that it could be quickly reused if someone navigates back and forth between a few pages. However, without giving it much thought, it seemed like implementing such a cache would take time and add a fair amount of complexity. I tried using Firestore persistence with the hope that this would automatically deduplicate queries, cache data, and increase performance, but it didn't have as much of an impact as I'd hoped (it did reduce costs somewhat, but it also unexpectedly reduced performance).
Turns out it was really easy to create a better cache.
The solution
I implemented a simple query cache that maintains a subscription to a query for some configurable amount of time, even after all observers have unsubscribed from the data. When a component executes a new query, instead of immediately calling Firestore, I check the query cache to see if the relevant query was already created. If it was, I reuse the existing query. Otherwise, I create a new query and cache it for the future.
The code:
import { defer, MonoTypeOperatorFunction, Observable, Subject } from 'rxjs';
import stringify from 'fast-json-stable-stringify';
import { delay, finalize, shareReplay, takeUntil } from 'rxjs/operators';

/** Amount of milliseconds to hold onto cached queries */
const HOLD_CACHED_QUERIES_DURATION = 1000 * 60 * 3; // 3 minutes

export class QueryCacheService {
  private readonly cache = new Map<string, Observable<unknown>>();

  resolve<T>(
    service: string,
    method: string,
    args: unknown[],
    queryFactory: () => Observable<T>,
  ): Observable<T> {
    const key = stringify({ service, method, args });

    let query = this.cache.get(key) as Observable<T> | undefined;

    if (query) return query;

    const destroy$ = new Subject<void>();
    let subscriberCount = 0;
    // `ReturnType<typeof setTimeout>` type-checks in both browser and Node
    let timeout: ReturnType<typeof setTimeout> | undefined;

    query = queryFactory().pipe(
      takeUntil(destroy$),
      shareReplay(1),
      tapOnSubscribe(() => {
        // since there is now a subscriber, don't cleanup the query
        // if we were previously planning on cleaning it up
        if (timeout) clearTimeout(timeout);
        subscriberCount++;
      }),
      finalize(() => { // triggers on unsubscribe
        subscriberCount--;

        if (subscriberCount === 0) {
          // If there are no subscribers, hold onto any cached queries
          // for `HOLD_CACHED_QUERIES_DURATION` milliseconds and then
          // clean them up if there still aren't any new
          // subscribers
          timeout = setTimeout(() => {
            destroy$.next();
            destroy$.complete();
            this.cache.delete(key);
          }, HOLD_CACHED_QUERIES_DURATION);
        }
      }),
      // Without this delay, very large queries are executed synchronously
      // which can introduce some pauses/jank in the UI.
      // Using the `async` scheduler keeps UI performance speedy.
      // I also tried the `asap` scheduler but it still had jank.
      delay(0),
    );

    this.cache.set(key, query);

    return query;
  }
}

/**
 * Triggers callback every time a new observer
 * subscribes to this chain.
 */
function tapOnSubscribe<T>(
  callback: () => void,
): MonoTypeOperatorFunction<T> {
  return (source: Observable<T>): Observable<T> =>
    defer(() => {
      callback();
      return source;
    });
}
I can use this cache like so:
export class ClientService {
  constructor(
    private fstore: AngularFirestore,
    private queryCache: QueryCacheService,
  ) {}

  getClient(id: string) {
    const query =
      () => this.fstore
        .doc<IClient>(`clients/${id}`)
        .valueChanges();

    return this.queryCache.resolve(
      'ClientService',
      'getClient',
      [id],
      query
    );
  }
}
Now, when the ClientService#getClient() method is called, the method arguments and identifiers are passed to the query cache service along with a query factory function. The query cache service uses the fast-json-stable-stringify library to stringify the query's identifying information and uses this string as the key under which the query's observable is cached. Before caching the query, the observable is modified in the following ways:
- shareReplay(1) is added so that future subscribers get the most recent results immediately and also so that a subscription to the underlying Firestore data is maintained even after the last subscriber to this query unsubscribes.
- Subscribers to the query are tracked so that, after the last subscriber unsubscribes, a timer is set to automatically unsubscribe from the underlying Firestore data and clear the cache after a user-defined period of time (I'm currently using 3 minutes).
- delay(0) is used so that values are delivered to subscribers asynchronously via the asyncScheduler. I find this helps keep the UI snappy when loading a large dataset that has been cached (otherwise, the UI attempts to synchronously load the large dataset, which can cause stutter/jank).
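To illustrate why the cache key is built with a stable stringifier rather than plain JSON.stringify, here is a minimal, dependency-free sketch. `stableStringify` is a hypothetical stand-in written for this example (it is not the fast-json-stable-stringify source), and `findClients` is an invented method name.

```typescript
// Hypothetical stand-in for a stable stringifier: object keys are sorted
// before serializing, so key order cannot change the resulting string.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    // JSON.stringify yields undefined for `undefined`; fall back to "null"
    const json = JSON.stringify(value) as string | undefined;
    return json === undefined ? 'null' : json;
  }
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  // Dates serialize to their ISO string via toJSON
  if (value instanceof Date) return JSON.stringify(value);

  const obj = value as Record<string, unknown>;
  const entries = Object.keys(obj)
    .sort()
    .filter(k => obj[k] !== undefined) // skip undefined values, as JSON does
    .map(k => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
  return `{${entries.join(',')}}`;
}

function cacheKey(service: string, method: string, args: unknown[]): string {
  return stableStringify({ service, method, args });
}

// Same filter, different property order
const keyA = cacheKey('ClientService', 'findClients', [{ city: 'Oslo', active: true }]);
const keyB = cacheKey('ClientService', 'findClients', [{ active: true, city: 'Oslo' }]);

console.log(keyA === keyB); // true

// Plain JSON.stringify keeps insertion order, so these two would miss the cache
const plainA = JSON.stringify([{ city: 'Oslo', active: true }]);
const plainB = JSON.stringify([{ active: true, city: 'Oslo' }]);
console.log(plainA === plainB); // false
```

With a stable key, two components that build the same filter object in a different property order still share one cached query.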
This cache could be further updated to allow configuring the HOLD_CACHED_QUERIES_DURATION on a per-query basis.
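One possible shape for that per-query configuration is an optional options argument that falls back to the global default. The sketch below is only an assumption about how it could look: `ResolveOptions`, `holdDuration`, `resolveHoldDuration`, and `scheduleCleanup` are hypothetical names, and the RxJS wiring is left out so the example stays dependency-free.

```typescript
/** Default hold duration, as in the service above */
const HOLD_CACHED_QUERIES_DURATION = 1000 * 60 * 3; // 3 minutes

/** Hypothetical per-query options */
interface ResolveOptions {
  /** Milliseconds to keep the query alive once it has no subscribers */
  holdDuration?: number;
}

/** Picks the per-query duration, falling back to the global default */
function resolveHoldDuration(options: ResolveOptions = {}): number {
  // `??` (rather than `||`) keeps an explicit 0, i.e. "clean up right away"
  return options.holdDuration ?? HOLD_CACHED_QUERIES_DURATION;
}

/**
 * Schedules `cleanup` after the resolved hold duration and returns
 * a function that cancels it (e.g. when a new subscriber arrives).
 */
function scheduleCleanup(
  cleanup: () => void,
  options?: ResolveOptions,
): () => void {
  const timeout = setTimeout(cleanup, resolveHoldDuration(options));
  return () => clearTimeout(timeout);
}

// Usage: a rarely-changing query is held for 30 seconds instead of 3 minutes
let cleanedUp = false;
const cancelCleanup = scheduleCleanup(
  () => { cleanedUp = true; },
  { holdDuration: 1000 * 30 },
);

// A new subscriber showed up before the timer fired, so cancel the cleanup
cancelCleanup();
```

Inside the service, the resolve method would accept the options as an extra parameter and pass them along where the cleanup timer is currently created in the finalize callback.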
Conclusion
This simple cache greatly increases performance and potentially reduces costs if it prevents the same documents from being reloaded again and again in rapid succession. The one potential "gotcha" is if a query is built using Date arguments. In this case, you need to be careful about using new Date() as an argument to a query since this would change the cache key associated with the query on every call (basically, this would prevent the cache from ever being used). You can fix this issue by normalizing Date creation (e.g. startOfDay(new Date()) using date-fns).
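The Date gotcha can be seen in a small, dependency-free sketch. `startOfLocalDay` is a hypothetical helper that mimics date-fns' startOfDay, `VisitService` and `getVisitsSince` are invented names, and plain JSON.stringify stands in for the stable stringifier because the key order is fixed here.

```typescript
/** Hypothetical helper: same local calendar day, time set to 00:00:00.000 */
function startOfLocalDay(date: Date): Date {
  const result = new Date(date.getTime()); // copy, don't mutate the input
  result.setHours(0, 0, 0, 0);
  return result;
}

/** Builds a cache key the same way the service does (Dates become ISO strings) */
function keyFor(args: unknown[]): string {
  return JSON.stringify({ service: 'VisitService', method: 'getVisitsSince', args });
}

// Two calls made on the same day, a few hours apart
const morning = new Date(2021, 4, 27, 9, 15, 0, 0);
const afternoon = new Date(2021, 4, 27, 16, 40, 12, 345);

// Raw timestamps differ, so every call would create a new cache entry
console.log(keyFor([morning]) === keyFor([afternoon])); // false

// Normalized to the start of the day, both calls share one cache entry
console.log(
  keyFor([startOfLocalDay(morning)]) === keyFor([startOfLocalDay(afternoon)]),
); // true
```

The coarser the normalization (start of hour, day, or week), the more calls share a key, so pick the coarsest granularity the query can tolerate.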
Hope this is helpful.