Building Reliable Distributed Systems in Node
Loren š¤
Posted on January 24, 2023
This post introduces the concept of durable execution, which is used by Stripe, Netflix, Coinbase, Snap, and many others to solve a wide range of problems in distributed systems. Then it shows how simple it is to write durable code using our TypeScript/JavaScript SDK.
Distributed systems
When building a request-response monolith backed by a single database that supports transactions, we donāt have many distributed systems concerns. We can have simple failure modes and easily maintain accurate state:
- If the client canāt reach the server, the client retries.
- If the client reaches the server, but the server canāt reach the database, the server responds with an error, and the client retries.
- If the server reaches the database, but the transaction fails, the server responds with an error, and the client retries.
- If the transaction succeeds but the server goes down before responding to the client, the client retries until the server is back up, and the transaction fails the second time (assuming the transaction has some checkālike an idempotency tokenāto tell whether the update has already been applied), and the server reports to the client that the action has already been performed.
As soon as we introduce a second place for state to live, whether thatās a service with its own database or an external API, handling failures and maintaining consistency (accuracy across all data stores) gets significantly more complex. For example, if our server has to charge a credit card and also update the database, we can no longer write simple code like:
function handleRequest() {
paymentAPI.chargeCard()
database.insertOrder()
return 200
}
If the first step (charging the card) succeeds, but the second step (adding the order to the database) fails, then the system ends up in an inconsistent state; we charged their card, but thereās no record of it in our database. To try to maintain consistency, we might have the second step retry until we can reach the database. However, itās also possible that the process running our code will fail, in which case weāll have no knowledge that the first step took place. To fix this, we need to do three things:
- Persist the order details
- Persist which steps of the program weāve completed
- Run a worker process that checks the database for incomplete orders and continues with the next step
That, along with persisting retry state and adding timeouts for each step, is a lot of code to write, and itās easy to miss certain edge cases or failure modes (see the full, scalable architecture). We could build things faster and more reliably if we didnāt have to write and debug all that code. And we donāt have to, because we can use durable execution.
Durable execution
Durable execution systems run our code in a way that persists each step the code takes. If the process or container running the code dies, the code automatically continues running in another process with all state intact, including call stack and local variables.
Durable execution ensures that the code is executed to completion, no matter how reliable the hardware or how long downstream services are offline. Retries and timeouts are performed automatically, and resources are freed up when the code isnāt doing anything (for example while waiting on a sleep(ā1 monthā)
statement).
Durable execution makes it trivial or unnecessary to implement distributed systems patterns like event-driven architecture, task queues, sagas, circuit breakers, and transactional outboxes. Itās programming on a higher level of abstraction, where you donāt have to be concerned about transient failures like server crashes or network issues. It opens up new possibilities like:
- Storing state in local variables instead of a database, because local variables are automatically stored for us
- Writing code that sleeps for a month, because we donāt need to be concerned about the process that started the sleep still being there next month, or resources being tied up for the duration
- Functions that can run forever, and that we can interact with (send commands to or query data from)
Some examples of durable execution systems are Azure Durable Functions, Amazon SWF, Uber Cadence, Infinitic, and Temporal (where I work). At the risk of being less than perfectly objective, I think Temporal is the best of these options š.
Durable JavaScript
Now that weāve gone over consistency in distributed systems and what durable execution is, letās look at a practical example. I built this food delivery app to show what durable code looks like and what problems it solves:
Donāt blame me for the logoāthatās just what Stable Diffusion gives you when you ask it for a durable delivery app logo. š¤·āāļøš
The app has four main pieces of functionality:
- Create an order and charge the customer
- Get order status
- Mark an order picked up
- Mark an order delivered
When we order an item from the menu, it appears in the delivery driver site (drive.temporal.menu), and the driver can mark the order as picked up, and then as delivered.
All of this functionality can be implemented in a single function of durable JavaScript or TypeScript. Weāll be using the latterāI recommend TypeScript and our library is named the TypeScript SDK, but itās published to npm as JavaScript and can be used in any Node.js project.
Create an order
Letās take a look at the code for this app. Weāll see a few API routes but mostly go over each piece of the single durable function named order
. If youād like to run the app or view the code on your machine, this will download and set up the project:
npx @temporalio/create@latest --sample food-delivery
When the user clicks the order button, the React frontend calls the createOrder
mutation defined by the tRPC backend. The createOrder
API route handler creates the order by starting a durable order
function. Durable functionsācalled Workflowsāare started using a Client instance from @temporalio/client
, which has been added to the tRPC context under ctx.temporal
. The route handler receives a validated input
(an object with a productId
number and orderId
string) and it calls ctx.temporal.workflow.start
to start an order
Workflow, providing input.productId
as an argument:
import { initTRPC } from '@trpc/server'
import { z } from 'zod'
import { taskQueue } from 'common'
import { Context } from 'common/trpc-context'
import { order } from 'workflows'
const t = initTRPC.context<Context>().create()
export const appRouter = t.router({
createOrder: t.procedure
.input(z.object({ productId: z.number(), orderId: z.string() }))
.mutation(async ({ input, ctx }) => {
await ctx.temporal.workflow.start(order, {
workflowId: input.orderId,
args: [input.productId],
taskQueue,
})
return 'Order received and persisted!'
}),
The order
function starts out validating the input, setting up the initial state, and charging the customer:
type OrderState = 'Charging card' | 'Paid' | 'Picked up' | 'Delivered' | 'Refunding'
export async function order(productId: number): Promise<void> {
const product = getProductById(productId)
if (!product) {
throw ApplicationFailure.create({ message: `Product ${productId} not found` })
}
let state: OrderState = 'Charging card'
let deliveredAt: Date
try {
await chargeCustomer(product)
} catch (err) {
const message = `Failed to charge customer for ${product.name}. Error: ${errorMessage(err)}`
await sendPushNotification(message)
throw ApplicationFailure.create({ message })
}
state = 'Paid'
Any functions that might fail are automatically retried. In this case, chargeCustomer
and sendPushNotification
both talk to services that might be down at the moment or might return transient error messages like āTemporarily unavailable.ā Temporal will automatically retry running these functions (by default indefinitely with exponential backoff, but thatās configurable). The functions can also throw non-retryable errors like āCard declined,ā in which case they wonāt be retried. Instead, the error will be thrown out of chargeCustomer(product)
and caught by the catch block; the customer receives a notification that their payment method failed, and we throw an ApplicationFailure
to fail the order
Workflow.
Get order status
The next bit of code requires some background: Normal functions canāt run for a long time, because theyāll take up resources while theyāre waiting for things to happen, and at some point theyāll die when we deploy new code and the old containers get shut down. Durable functions can run for an arbitrary length of time for two reasons:
- They donāt take up resources when theyāre waiting on something.
- It doesnāt matter if the process running them gets shut down, because execution will seamlessly be continued by another process.
So although some durable functions run for a short period of timeālike a successful money transfer functionāsome run longerālike our order function, which ends when the order is delivered, and a customer function that lasts for the lifetime of the customer.
Itās useful to be able to interact with long-running functions, so Temporal provides what we call Signals for sending data into the function and Queries for getting data out of the function. The driver site shows the status of each order by sending Queries to the order functions through this API route:
getOrderStatus: t.procedure
.input(z.string())
.query(({ input: orderId, ctx }) => ctx.temporal.workflow.getHandle(orderId).query(getStatusQuery)),
It gets a handle to the specific instance of the order function (called a Workflow Execution), sends the getStatusQuery
, and returns the result. The getStatusQuery
is defined in the order file and handled in the order function:
import { defineQuery, setHandler } from '@temporalio/workflow'
export const getStatusQuery = defineQuery<OrderStatus>('getStatus')
export async function order(productId: number): Promise<void> {
let state: OrderState = 'Charging card'
let deliveredAt: Date
// ā¦
setHandler(getStatusQuery, () => {
return { state, deliveredAt, productId }
})
When the order function receives the getStatusQuery
, the function passed to setHandler
is called, which returns the values of local variables. After the call to chargeCustomer
succeeds, the state is changed to āPaidā
, and the driver site, which has been polling getStatusQuery
, gets the updated state. It displays the āPick upā button.
Picking up an order
When the driver taps the button to mark the order as picked up, the site sends a pickUp
mutation to the API server, which sends a pickedUpSignal
to the order function:
apps/driver/pages/api/[trpc].ts
pickUp: t.procedure
.input(z.string())
.mutation(async ({ input: orderId, ctx }) =>
ctx.temporal.workflow.getHandle(orderId).signal(pickedUpSignal)
),
The order function handles the Signal by updating the state:
export const pickedUpSignal = defineSignal('pickedUp')
export async function order(productId: number): Promise<void> {
// ā¦
setHandler(pickedUpSignal, () => {
if (state === 'Paid') {
state = 'Picked up'
}
})
Meanwhile, further down in the function, after the customer was charged, the function has been waiting for the pickup to happen:
import { condition } from '@temporalio/workflow'
export async function order(productId: number): Promise<void> {
// ā¦
try {
await chargeCustomer(product)
} catch (err) {
// ā¦
}
state = 'Paid'
const notPickedUpInTime = !(await condition(() => state === 'Picked up', '1 min'))
if (notPickedUpInTime) {
state = 'Refunding'
await refundAndNotify(
product,
'ā ļø No drivers were available to pick up your order. Your payment has been refunded.'
)
throw ApplicationFailure.create({ message: 'Not picked up in time' })
}
await condition(() => state === 'Picked up', '1 min')
waits for up to 1 minute for the state to change to Picked up
. If a minute goes by without it changing, it returns false, and we refund the customer. (Either we have very high standards for the speed of our chefs and delivery drivers, or we want the users of a demo app to be able to see all the failure modes š.)
Delivery
Similarly, thereās a deliveredSignal
sent by the āDeliverā button, and if the driver doesnāt complete delivery within a minute of pickup, the customer is refunded.
export const deliveredSignal = defineSignal('delivered')
export async function order(productId: number): Promise<void> {
setHandler(deliveredSignal, () => {
if (state === 'Picked up') {
state = 'Delivered'
deliveredAt = new Date()
}
})
// ā¦
await sendPushNotification('š Order picked up')
const notDeliveredInTime = !(await condition(() => state === 'Delivered', '1 min'))
if (notDeliveredInTime) {
state = 'Refunding'
await refundAndNotify(product, 'ā ļø Your driver was unable to deliver your order. Your payment has been refunded.')
throw ApplicationFailure.create({ message: 'Not delivered in time' })
}
await sendPushNotification('ā
Order delivered!')
If delivery was successful, the function waits for a minute for the customer to eat their meal and asks them to rate their experience.
await sleep('1 min') // this could also be hours or even months
await sendPushNotification(`āļø Rate your meal. How was the ${product.name.toLowerCase()}?`)
}
After the final push notification, the order functionās execution ends, and the Workflow Execution completes successfully. Even though the function has completed, we can still send Queries, since Temporal has the final state of the function saved. And we can test that by refreshing the page a minute after an order has been delivered: the getStatusQuery
still works and āDeliveredā is shown as the status:
Summary
Weāve seen how a multi-step order flow can be implemented with a single durable function. The function is guaranteed to complete in the presence of failures, including:
- Temporary issues with the network, data stores, or downstream services
- The process running the function failing
- The underlying Temporal services or database going down
This addressed a number of distributed systems concerns for us, and meant that:
- We could use local variables instead of saving state to a database.
- We didnāt need to set timers in a database for application logic like canceling an order that takes too long or for the built-in functionality of retrying and timing out transient functions like
chargeCustomer
. - We didnāt need to set up a job queue that workers polled, either for progressing to the next step or picking up unfinished tasks that were dropped by failed processes.
In the next post, we look at more of the delivery appās code and learn how Temporal is able to provide us with durable execution.
If you have any questions, I would be happy to help! Temporalās mission is helping developers, and I also personally find joy in it š¤. Iām @lorendsr on Twitter, I answer (and upvote š) any StackOverflow questions tagged with temporal-typescript
, and am @loren on the community Slack š.
Learn more
To learn more, I recommend these resources:
- Video: Intro to Temporal and using the TypeScript SDK
- Some common use cases
- TypeScript SDK docs: t.mp/ts
- TypeScript API reference: t.mp/ts-api
- TypeScript tutorials
More blog posts about our TypeScript SDK:
- Using Temporal as a Node.js Task Queue
- Caching API Requests with Long-Lived Workflows
- Express middleware that creates a REST API for your Workflows
- 1.0.0 release of the TS SDK
- How we use V8 isolates to enforce Workflow determinism
Thanks to Jessica West, Brian Hogan, Amelia Mango, and Jim Walker for reading drafts of this post.
Posted on January 24, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.