Building Reliable Distributed Systems in Node

lorendsr

Loren šŸ¤“

Posted on January 24, 2023

Building Reliable Distributed Systems in Node

This post introduces the concept of durable execution, which is used by Stripe, Netflix, Coinbase, Snap, and many others to solve a wide range of problems in distributed systems. Then it shows how simple it is to write durable code using our TypeScript/JavaScript SDK.

Distributed systems

When building a request-response monolith backed by a single database that supports transactions, we donā€™t have many distributed systems concerns. We can have simple failure modes and easily maintain accurate state:

  • If the client canā€™t reach the server, the client retries.
  • If the client reaches the server, but the server canā€™t reach the database, the server responds with an error, and the client retries.
  • If the server reaches the database, but the transaction fails, the server responds with an error, and the client retries.
  • If the transaction succeeds but the server goes down before responding to the client, the client retries until the server is back up, and the transaction fails the second time (assuming the transaction has some checkā€“like an idempotency tokenā€“to tell whether the update has already been applied), and the server reports to the client that the action has already been performed.

As soon as we introduce a second place for state to live, whether thatā€™s a service with its own database or an external API, handling failures and maintaining consistency (accuracy across all data stores) gets significantly more complex. For example, if our server has to charge a credit card and also update the database, we can no longer write simple code like:

function handleRequest() {
  paymentAPI.chargeCard()
  database.insertOrder()
  return 200
}
Enter fullscreen mode Exit fullscreen mode

If the first step (charging the card) succeeds, but the second step (adding the order to the database) fails, then the system ends up in an inconsistent state; we charged their card, but thereā€™s no record of it in our database. To try to maintain consistency, we might have the second step retry until we can reach the database. However, itā€™s also possible that the process running our code will fail, in which case weā€™ll have no knowledge that the first step took place. To fix this, we need to do three things:

  • Persist the order details
  • Persist which steps of the program weā€™ve completed
  • Run a worker process that checks the database for incomplete orders and continues with the next step

That, along with persisting retry state and adding timeouts for each step, is a lot of code to write, and itā€™s easy to miss certain edge cases or failure modes (see the full, scalable architecture). We could build things faster and more reliably if we didnā€™t have to write and debug all that code. And we donā€™t have to, because we can use durable execution.

Durable execution

Durable execution systems run our code in a way that persists each step the code takes. If the process or container running the code dies, the code automatically continues running in another process with all state intact, including call stack and local variables.

Durable execution ensures that the code is executed to completion, no matter how reliable the hardware or how long downstream services are offline. Retries and timeouts are performed automatically, and resources are freed up when the code isnā€™t doing anything (for example while waiting on a sleep(ā€˜1 monthā€™) statement).

Durable execution makes it trivial or unnecessary to implement distributed systems patterns like event-driven architecture, task queues, sagas, circuit breakers, and transactional outboxes. Itā€™s programming on a higher level of abstraction, where you donā€™t have to be concerned about transient failures like server crashes or network issues. It opens up new possibilities like:

  • Storing state in local variables instead of a database, because local variables are automatically stored for us
  • Writing code that sleeps for a month, because we donā€™t need to be concerned about the process that started the sleep still being there next month, or resources being tied up for the duration
  • Functions that can run forever, and that we can interact with (send commands to or query data from)

Some examples of durable execution systems are Azure Durable Functions, Amazon SWF, Uber Cadence, Infinitic, and Temporal (where I work). At the risk of being less than perfectly objective, I think Temporal is the best of these options šŸ˜Š.

Durable JavaScript

Now that weā€™ve gone over consistency in distributed systems and what durable execution is, letā€™s look at a practical example. I built this food delivery app to show what durable code looks like and what problems it solves:

temporal.menu

Durable Delivery app menu

Donā€™t blame me for the logoā€”thatā€™s just what Stable Diffusion gives you when you ask it for a durable delivery app logo. šŸ¤·ā€ā™‚ļøšŸ˜„

The app has four main pieces of functionality:

  • Create an order and charge the customer
  • Get order status
  • Mark an order picked up
  • Mark an order delivered

The order process, showing both the menu and driver sites

When we order an item from the menu, it appears in the delivery driver site (drive.temporal.menu), and the driver can mark the order as picked up, and then as delivered.

All of this functionality can be implemented in a single function of durable JavaScript or TypeScript. Weā€™ll be using the latterā€”I recommend TypeScript and our library is named the TypeScript SDK, but itā€™s published to npm as JavaScript and can be used in any Node.js project.

Create an order

Letā€™s take a look at the code for this app. Weā€™ll see a few API routes but mostly go over each piece of the single durable function named order. If youā€™d like to run the app or view the code on your machine, this will download and set up the project:

npx @temporalio/create@latest --sample food-delivery
Enter fullscreen mode Exit fullscreen mode

When the user clicks the order button, the React frontend calls the createOrder mutation defined by the tRPC backend. The createOrder API route handler creates the order by starting a durable order function. Durable functionsā€”called Workflowsā€”are started using a Client instance from @temporalio/client, which has been added to the tRPC context under ctx.temporal. The route handler receives a validated input (an object with a productId number and orderId string) and it calls ctx.temporal.workflow.start to start an order Workflow, providing input.productId as an argument:

apps/menu/pages/api/[trpc].ts

import { initTRPC } from '@trpc/server'
import { z } from 'zod'
import { taskQueue } from 'common'
import { Context } from 'common/trpc-context'
import { order } from 'workflows'

const t = initTRPC.context<Context>().create()

export const appRouter = t.router({
  createOrder: t.procedure
    .input(z.object({ productId: z.number(), orderId: z.string() }))
    .mutation(async ({ input, ctx }) => {
      await ctx.temporal.workflow.start(order, {
        workflowId: input.orderId,
        args: [input.productId],
        taskQueue,
      })

      return 'Order received and persisted!'
    }),
Enter fullscreen mode Exit fullscreen mode

The order function starts out validating the input, setting up the initial state, and charging the customer:

packages/workflows/order.ts

type OrderState = 'Charging card' | 'Paid' | 'Picked up' | 'Delivered' | 'Refunding'

export async function order(productId: number): Promise<void> {
  const product = getProductById(productId)
  if (!product) {
    throw ApplicationFailure.create({ message: `Product ${productId} not found` })
  }

  let state: OrderState = 'Charging card'
  let deliveredAt: Date

  try {
    await chargeCustomer(product)
  } catch (err) {
    const message = `Failed to charge customer for ${product.name}. Error: ${errorMessage(err)}`
    await sendPushNotification(message)
    throw ApplicationFailure.create({ message })
  }

  state = 'Paid'
Enter fullscreen mode Exit fullscreen mode

Any functions that might fail are automatically retried. In this case, chargeCustomer and sendPushNotification both talk to services that might be down at the moment or might return transient error messages like ā€œTemporarily unavailable.ā€ Temporal will automatically retry running these functions (by default indefinitely with exponential backoff, but thatā€™s configurable). The functions can also throw non-retryable errors like ā€œCard declined,ā€ in which case they wonā€™t be retried. Instead, the error will be thrown out of chargeCustomer(product) and caught by the catch block; the customer receives a notification that their payment method failed, and we throw an ApplicationFailure to fail the order Workflow.

Get order status

The next bit of code requires some background: Normal functions canā€™t run for a long time, because theyā€™ll take up resources while theyā€™re waiting for things to happen, and at some point theyā€™ll die when we deploy new code and the old containers get shut down. Durable functions can run for an arbitrary length of time for two reasons:

  • They donā€™t take up resources when theyā€™re waiting on something.
  • It doesnā€™t matter if the process running them gets shut down, because execution will seamlessly be continued by another process.

So although some durable functions run for a short period of timeā€”like a successful money transfer functionā€”some run longerā€”like our order function, which ends when the order is delivered, and a customer function that lasts for the lifetime of the customer.

Itā€™s useful to be able to interact with long-running functions, so Temporal provides what we call Signals for sending data into the function and Queries for getting data out of the function. The driver site shows the status of each order by sending Queries to the order functions through this API route:

apps/menu/pages/api/[trpc].ts

  getOrderStatus: t.procedure
    .input(z.string())
    .query(({ input: orderId, ctx }) => ctx.temporal.workflow.getHandle(orderId).query(getStatusQuery)),
Enter fullscreen mode Exit fullscreen mode

It gets a handle to the specific instance of the order function (called a Workflow Execution), sends the getStatusQuery, and returns the result. The getStatusQuery is defined in the order file and handled in the order function:

packages/workflows/order.ts

import { defineQuery, setHandler } from '@temporalio/workflow'

export const getStatusQuery = defineQuery<OrderStatus>('getStatus')

export async function order(productId: number): Promise<void> {
  let state: OrderState = 'Charging card'
  let deliveredAt: Date

  // ā€¦

  setHandler(getStatusQuery, () => {
    return { state, deliveredAt, productId }
  })
Enter fullscreen mode Exit fullscreen mode

When the order function receives the getStatusQuery, the function passed to setHandler is called, which returns the values of local variables. After the call to chargeCustomer succeeds, the state is changed to ā€™Paidā€™, and the driver site, which has been polling getStatusQuery, gets the updated state. It displays the ā€œPick upā€ button.

Picking up an order

When the driver taps the button to mark the order as picked up, the site sends a pickUp mutation to the API server, which sends a pickedUpSignal to the order function:

apps/driver/pages/api/[trpc].ts

  pickUp: t.procedure
    .input(z.string())
    .mutation(async ({ input: orderId, ctx }) => 
      ctx.temporal.workflow.getHandle(orderId).signal(pickedUpSignal)
    ),
Enter fullscreen mode Exit fullscreen mode

The order function handles the Signal by updating the state:

packages/workflows/order.ts

export const pickedUpSignal = defineSignal('pickedUp')

export async function order(productId: number): Promise<void> {
  // ā€¦

  setHandler(pickedUpSignal, () => {
    if (state === 'Paid') {
      state = 'Picked up'
    }
  })
Enter fullscreen mode Exit fullscreen mode

Meanwhile, further down in the function, after the customer was charged, the function has been waiting for the pickup to happen:

packages/workflows/order.ts

import { condition } from '@temporalio/workflow'

export async function order(productId: number): Promise<void> {
  // ā€¦

  try {
    await chargeCustomer(product)
  } catch (err) {
    // ā€¦
  }

  state = 'Paid'

  const notPickedUpInTime = !(await condition(() => state === 'Picked up', '1 min'))
  if (notPickedUpInTime) {
    state = 'Refunding'
    await refundAndNotify(
      product,
      'āš ļø No drivers were available to pick up your order. Your payment has been refunded.'
    )
    throw ApplicationFailure.create({ message: 'Not picked up in time' })
  }
Enter fullscreen mode Exit fullscreen mode

await condition(() => state === 'Picked up', '1 min') waits for up to 1 minute for the state to change to Picked up. If a minute goes by without it changing, it returns false, and we refund the customer. (Either we have very high standards for the speed of our chefs and delivery drivers, or we want the users of a demo app to be able to see all the failure modes šŸ˜„.)

Delivery

Similarly, thereā€™s a deliveredSignal sent by the ā€œDeliverā€ button, and if the driver doesnā€™t complete delivery within a minute of pickup, the customer is refunded.

packages/workflows/order.ts

export const deliveredSignal = defineSignal('delivered')

export async function order(productId: number): Promise<void> {
  setHandler(deliveredSignal, () => {
    if (state === 'Picked up') {
      state = 'Delivered'
      deliveredAt = new Date()
    }
  })

  // ā€¦

  await sendPushNotification('šŸš— Order picked up')

  const notDeliveredInTime = !(await condition(() => state === 'Delivered', '1 min'))
  if (notDeliveredInTime) {
    state = 'Refunding'
    await refundAndNotify(product, 'āš ļø Your driver was unable to deliver your order. Your payment has been refunded.')
    throw ApplicationFailure.create({ message: 'Not delivered in time' })
  }

  await sendPushNotification('āœ… Order delivered!')
Enter fullscreen mode Exit fullscreen mode

If delivery was successful, the function waits for a minute for the customer to eat their meal and asks them to rate their experience.

  await sleep('1 min') // this could also be hours or even months

  await sendPushNotification(`āœļø Rate your meal. How was the ${product.name.toLowerCase()}?`)
}
Enter fullscreen mode Exit fullscreen mode

After the final push notification, the order functionā€™s execution ends, and the Workflow Execution completes successfully. Even though the function has completed, we can still send Queries, since Temporal has the final state of the function saved. And we can test that by refreshing the page a minute after an order has been delivered: the getStatusQuery still works and ā€œDeliveredā€ is shown as the status:

Poke order with Status: Delivered

Summary

Weā€™ve seen how a multi-step order flow can be implemented with a single durable function. The function is guaranteed to complete in the presence of failures, including:

  • Temporary issues with the network, data stores, or downstream services
  • The process running the function failing
  • The underlying Temporal services or database going down

This addressed a number of distributed systems concerns for us, and meant that:

  • We could use local variables instead of saving state to a database.
  • We didnā€™t need to set timers in a database for application logic like canceling an order that takes too long or for the built-in functionality of retrying and timing out transient functions like chargeCustomer.
  • We didnā€™t need to set up a job queue that workers polled, either for progressing to the next step or picking up unfinished tasks that were dropped by failed processes.

In the next post, we look at more of the delivery appā€™s code and learn how Temporal is able to provide us with durable execution.

If you have any questions, I would be happy to help! Temporalā€™s mission is helping developers, and I also personally find joy in it šŸ¤—. Iā€™m @lorendsr on Twitter, I answer (and upvote šŸ˜„) any StackOverflow questions tagged with temporal-typescript, and am @loren on the community Slack šŸ’ƒ.

Learn more

To learn more, I recommend these resources:

More blog posts about our TypeScript SDK:

Thanks to Jessica West, Brian Hogan, Amelia Mango, and Jim Walker for reading drafts of this post.

šŸ’– šŸ’Ŗ šŸ™… šŸš©
lorendsr
Loren šŸ¤“

Posted on January 24, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

How Durable Execution Works
distributedsystems How Durable Execution Works

May 23, 2023

Building Reliable Distributed Systems in Node
distributedsystems Building Reliable Distributed Systems in Node

January 24, 2023