Node.js Resiliency Concepts: Recovery and Self-Healing
Andrei Gaspar
Posted on October 1, 2020
In an ideal world where we reached 100% test coverage, our error handling was flawless,
and all our failures were handled gracefully — in a world where all our systems reached perfection,
we wouldn't be having this discussion.
Yet, here we are. Earth, 2020. By the time you read this sentence,
somebody's server has failed in production. A moment of silence for the processes we lost.
In this post, I'll go through some concepts and tools which will make your servers more resilient and boost your process management skills.
Node Index.js
Starting with Node.js — especially if you're new to working with servers — you'll probably want
to run your app on the remote production server the very same way you're running it in development.
Install Node.js, clone the repo, give it an npm install, and a node index.js (or npm start) to spin it all up.
I remember this seeming like a bulletproof plan for me starting out. If it works, why fix it, right?
My code would run into errors during development, resulting in crashes,
but I fixed those bugs on the spot — so the code on the server is uncorrupted.
It cannot crash. Once it starts up, that server is there to stay until the heat death of the universe.
Well, as you probably suspect, that was not the case.
I was facing two main problems that didn't cross my mind back then:
- What happens if the VM/Host restarts?
- Servers crash... That's like, their second most popular attribute. If they weren't serving anything, we would call them crashers.
Wolverine vs T-1000
Recovery can be tackled in many different ways. There are convenient solutions
to restart our server after crashes, and there are more sophisticated approaches
to make it indestructible in production.
Both Wolverine and the T-1000 can take a good beating, but their complexity and recovery rate are very different.
We're looking for distinct qualities based on the environment we're running in.
For development, the goal is convenience. For production, it's usually resilience.
We're going to start with the simplest form of recovery and then slowly work our way up
to elaborate orchestration solutions.
It is up to you how much effort you'd like to invest in your implementation,
but it never hurts having more tools at your disposal, so if this piques your interest,
fasten your seatbelt, and let's dive in!
Solving Problems as They Arise
You're coding away, developing your amazing server.
After every couple of lines, you switch tabs and nudge it with a node index or npm start.
This cycle of constant switching and nudging becomes crushingly tedious after a while.
Wouldn't it be nice if it just restarted on its own after you changed the code?
This is where lightweight packages like Nodemon and Node.js Supervisor come into play.
You can install either of them with a single command and start using it with the next.
To install Nodemon, type the command below in your terminal.
npm install -g nodemon
Once installed, just substitute the node command you've been using
with the new nodemon command that you now have access to.
nodemon index.js
You can install Node.js Supervisor with a similar approach, by typing the command below.
npm install -g supervisor
Similarly, once installed you can just use the supervisor prefix to run your app.
supervisor index.js
Nodemon and Supervisor are both as useful as they are popular, with the main difference
being that Nodemon will require you to make file changes to restart your process,
while Supervisor can restart your process when it crashes.
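To make the restart-on-crash idea concrete, here's a minimal sketch of it in plain Node.js. This is not how Nodemon or Supervisor are implemented; the names isCrash and supervise, the retry limit, and the delay are all made up for illustration.

```javascript
const { spawn } = require('child_process');

// An exit counts as a crash if the process was killed by a signal
// or ended with a non-zero exit code.
function isCrash(code, signal) {
  return signal !== null || code !== 0;
}

// Run `script` with the current Node.js binary and start it again
// whenever it crashes, up to `maxRestarts` times (hypothetical defaults).
function supervise(script, { maxRestarts = 5, delayMs = 1000 } = {}) {
  let restarts = 0;

  function start() {
    const child = spawn(process.execPath, [script], { stdio: 'inherit' });

    child.on('exit', (code, signal) => {
      if (!isCrash(code, signal)) return; // clean exit, nothing to do
      if (restarts >= maxRestarts) {
        console.error(`Giving up after ${restarts} restarts`);
        return;
      }
      restarts += 1;
      console.error(`Crashed (code=${code}, signal=${signal}), restarting...`);
      setTimeout(start, delayMs);
    });
  }

  start();
}

module.exports = { isCrash, supervise };
```

Calling supervise('index.js') would keep your server coming back after a crash. The real tools add file watching and plenty of configuration on top of this idea.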
Your server is on the right track. Development speed quadrupled.
These packages do a great job covering development pain-points
and they are pretty configurable as well. But the difficulties we are facing in development
rarely overlap the ones we're facing in production.
When you deploy to the remote server, it feels like sending your kid to college as an overprotective parent.
You want to know your server is healthy, safe, and eats all its veggies.
You'd like to know what problem it faced when it crashed — if it crashed. You want it to be in good hands.
Well, good news! This is where process managers come into the picture. They can babysit your server in production.
Process Management
When you run your app, a process is created.
While running it in development, you would usually open a terminal window and type a command in there.
A foreground process is created and your app is running.
Now, if you closed that terminal window, your app would close with it.
You'll also notice that the terminal window is blocked.
You cannot enter another command before you stop the process with Ctrl + C.
The drawback is that the app is tied to the terminal window,
but you're also able to read all the logs and errors that the process is throwing.
So it's a glass half full.
However, on your production server, you'll want to run in the background,
but then you'll lose the convenience of visibility. Frustration is assured.
Process management is tedious.
Luckily, we have process managers! They are processes that manage other processes for us.
So meta! But ridiculously convenient.
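To get a feel for what a process manager takes off your hands, here's a rough sketch of just the "run it in the background and keep the logs" part, done by hand with Node's child_process module. The function name startInBackground and the log file path are my own assumptions, and a real process manager does a lot more than this.

```javascript
const { spawn } = require('child_process');
const fs = require('fs');

// Start `script` as a detached background process and append its
// stdout and stderr to `logFile`, so the output isn't lost with the terminal.
function startInBackground(script, logFile) {
  const log = fs.openSync(logFile, 'a');

  const child = spawn(process.execPath, [script], {
    detached: true,              // let it keep running after the parent exits
    stdio: ['ignore', log, log], // no stdin; stdout and stderr go to the log file
  });

  child.unref();     // don't keep the parent's event loop alive for the child
  fs.closeSync(log); // the child holds its own copy of the file descriptor
  return child.pid;
}

module.exports = { startInBackground };
```

Something like startInBackground('index.js', 'app.log') gives you back the process id, and then it's on you to remember that id, rotate the log, and restart the process when it dies. That's the tedious part.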
PM2
The most popular process manager for Node.js is called pm2,
and it's so popular for a very good reason. It's great!
It's such a fantastic piece of software that it would take me a separate article to describe its awesomeness
in its entirety, and just how many convenient features it has. Since we're focused on self-healing,
I'll discuss the basics below, but I strongly encourage you to read up on it more in-depth
and check all its amazing features.
Installing pm2 is just as easy as installing the packages we discussed above.
Simply type the following line in your terminal.
npm install -g pm2
Running your app isn't rocket science either. Just type the command below, where index.js is your main server file.
pm2 start index.js
This time, you might notice something different though.
Seemingly, nothing has happened, but if you go on to visit the endpoint to your app,
you'll notice that it's up and running.
Remember when we discussed running the process in the background?
That is exactly what is happening. pm2 started your server as a background process and it is now managing it for you.
As an added convenience, you can also use the --watch flag to make sure pm2 watches your files for changes
and reloads your app to make sure it is always up to date.
To do so, you can use the exact command above, but with the flag appended to the end.
pm2 start index.js --watch
Now, pm2 is watching our files and restarts the process anytime the files change or the process crashes.
Perfect! This is exactly what we're after.
It is doing a great job managing our server behind the scenes, but the lack of visibility is anxiety-inducing.
What if you want to see your server logs?
pm2 has you covered. Their CLI is really powerful! I'll list some commands below to get you started.
The table below covers the essentials.
Command | Description |
---|---|
pm2 list | Lists your applications. You'll see a numeric id associated with the applications managed by pm2. You can use that id in the commands you'd like to execute. |
pm2 logs <id> | Checks the logs of your application. |
pm2 stop <id> | Stops your process. (Just because the process is stopped, it doesn't mean it stopped existing. If you want to completely remove the process, you'll have to use delete) |
pm2 delete <id> | Deletes the process. (You don't need to stop and delete separately, you can just go straight for delete, which will stop and delete the process for you) |
pm2 is insanely configurable and is able to perform Load Balancing and Hot Reload for you.
You can read up on all the bells and whistles in their docs, but our pm2 journey comes to a halt here.
Disappointing, I know. But why? I hear you asking.
Remember how convenient it was to install pm2?
We installed it using the Node.js package manager. Wink... Pistol finger. Wink-wink.
Wait. Are we using Node.js to monitor Node.js?
That sounds a bit like trusting your child to babysit itself. Is that a good idea?
There is no objective answer to that question, but it sure sounds like there
should be some other alternatives to be explored.
So, what next? Well, let's explore.
Systemd
If you're planning to run on a good old Linux VM, I think it might be worth mentioning systemd
before jumping onto the deep end of containers and orchestrators.
Otherwise, if you plan to run on a managed application environment
(e.g. Azure App Service, AWS Lambda, GCP App Engine, Heroku, etc.),
this will not be relevant to your use case, but it might not hurt knowing about it.
So assuming that it's just you, your app, and a Linux VM, let's see what systemd can do for you.
Systemd can start, stop, and restart processes for you, which is exactly what we need.
If your VM restarts, systemd makes sure that your app starts up again.
But first, let's make sure you have access to systemd on your VM.
Below is a list of Linux systems that make use of systemd:
- Ubuntu Xenial (or newer)
- CentOS 7 / RHEL 7
- Debian Jessie (or newer)
- Fedora 15 (or newer)
Let's be realistic, you're probably not using a Linux system from before the great flood,
so you'll probably have systemd access.
The second thing that you need is a user with sudo privileges.
I'm going to be referring to this user simply as user, but you should substitute it with your own.
Since our user is called user and, for this example, I'm using Ubuntu,
I'll be referring to your home directory as /home/user/ and I'll go with the assumption that
your index.js file is located in your home directory.
The systemd Service File
The systemd file is a useful little file that we can create in the system area that holds the
configuration to our service. It is really simple and straightforward, so let's try to set one up.
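The systemd files that ship with the system are located under the directory listed below, and that's where we'll put ours too (for units you write yourself, /etc/systemd/system is the more conventional home).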
/lib/systemd/system
Let's create a new file there with the editor of your choice and populate it with some content.
Don't forget to use sudo as a prefix to your command! Everything here is owned by the root user.
Okay, let's start by going into the system directory.
cd /lib/systemd/system
Create a file for your service.
sudo nano myapp.service
And, let's populate it with some content.
# /lib/systemd/system/myapp.service
[Unit]
Description=My awesome server
Documentation=https://awesomeserver.com
After=network.target
[Service]
Environment=NODE_PORT=3000
Environment=NODE_ENV=production
Type=simple
User=user
ExecStart=/usr/bin/node /home/user/index.js
Restart=on-failure
[Install]
WantedBy=multi-user.target
If you glance through the configuration, it's pretty straightforward and self-explanatory, for the most part.
The settings you might need some hints on are After, Type, and Restart.
After=network.target means that the service should only start once the networking part of the server is up and running,
because we need the port. The simple type just means don't do anything crazy, just start and run.
Restart=on-failure is the self-healing part: systemd restarts the process whenever it exits with a non-zero exit code or is terminated abnormally by a signal.
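Since the unit file passes NODE_PORT to the app and relies on Restart=on-failure, the app has to cooperate a little: read the port from the environment and exit with a non-zero code when something fatal happens. Here's a minimal sketch of what that could look like in index.js. The function names resolvePort and start are mine, and the request handler is just a placeholder.

```javascript
const http = require('http');

// Read the port from the environment (the unit file sets NODE_PORT=3000).
function resolvePort(env = process.env) {
  const port = Number.parseInt(env.NODE_PORT, 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    // Thrown at startup, this ends the process with a non-zero exit code.
    throw new Error(`Invalid NODE_PORT: ${env.NODE_PORT}`);
  }
  return port;
}

function start() {
  const server = http.createServer((req, res) => {
    res.end('ok\n'); // placeholder response
  });

  // A listen error (for example, the port is already taken) is fatal:
  // exit non-zero so that Restart=on-failure applies.
  server.on('error', (err) => {
    console.error(err);
    process.exit(1);
  });

  // systemd sends SIGTERM on stop by default:
  // finish open requests, then exit cleanly.
  process.on('SIGTERM', () => {
    server.close(() => process.exit(0));
  });

  server.listen(resolvePort());
  return server;
}

module.exports = { resolvePort, start };
```

Calling start() at the bottom of the file turns this into a runnable server. The important bit is the exit codes: a clean stop exits with 0, while a fatal error exits with 1, which is what tells systemd to bring the process back.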
Running Your App with systemctl
Now that our file has been created, let's tell systemd to pick up the changes from the newly created file.
You'll have to do this each time you make a change to the file.
sudo systemctl daemon-reload
It is as simple as that. Now that it knows about our service,
we should be able to use the systemctl command to start and stop it.
We will be referring to it by the service file name.
sudo systemctl start myapp
If you'd like to stop it, you can substitute the start command with stop.
If you'd like to restart it, type restart instead.
Now, on to the part we care most about.
If you'd like your application to start up automatically when the VM boots, you should execute the command below.
sudo systemctl enable myapp
If you want that behavior to stop, just substitute enable with disable.
It is as simple as that!
So, now we have another system managing our process that is not Node.js itself.
This is great! You can proudly give yourself a high five, or maybe an awkward elbow bump
depending on the pandemic regulations while reading this article.
Our journey does not stop here though. There's still quite a lot of ground left to cover,
so let's slowly start diving into the world of containers and orchestration.
What are Containers?
To be able to move forward, you need to understand what Containers are and how they work.
There are a lot of container runtime environments out there such as Mesos, CoreOS, LXC, and OpenVz,
but the one name that is truly synonymous with containers is Docker.
It makes up more than 80% of the containers used and when people mention
containers, it's safe to think they are talking about Docker.
So, what do these containers do anyway?
Well, containers contain. They have a very simple and descriptive name in that sense.
Now the question remains, what do they contain?
Containers contain your application and all of its dependencies.
Nothing more and nothing less. It is just your app and everything that your app needs to run.
Think about what your Node.js server needs to execute:
- Node.js (duh)
- Your index.js file
- Probably your npm packages (dependencies)
So, if we were creating a container, we would want to make sure these things are present and contained.
If we had such a container ready, it could be spun up via the container engine (e.g. Docker).
Containers vs VMs, and Italian Cuisine
Even if you haven't worked much with Virtual Machines,
I think you have a general idea about how they work.
You've probably seen your friend running a Windows machine with Linux installed on it,
or a macOS with an additional Windows installation, etc.
So the idea there is that you have your Physical Machine and an Operating System on top,
which then contains your app and its dependencies.
Let's imagine we're making pizza.
- The Machine is the Table
- The OS is the Pizza Dough
- And, your app together with its dependencies are the ingredients on top
Now, let's say you'd like to eat 5 types of pizza, what should you do?
The answer is to make 5 different pizzas on the same table. That's the VM's answer.
But here comes Docker and it says: "Hey, that's a lot of waste! You're not going to eat 5 pizzas,
and making the dough is hard work. What about using the same dough?"
You might be thinking, hey that's not a bad idea actually — but I don't want
my friend's disgusting pineapple flavor (sorry, not sorry) spilling over
into my yummy 4 cheese. The ingredients are conflicting!
And here's where Docker's genius comes in:
"Don't worry! We'll contain them. Your 4 cheese part won't even know about the pineapple part."
So Docker's magic is that it's able to use the same underlying Physical Machine
and Operating System to run well-contained applications of many
different "flavors" without them ever conflicting with each other.
And to keep exotic fruit off your pizza.
Alright, let's move on to creating our first Docker Container.
Creating a Docker Container
Creating a Docker container is really easy, but you'll need to have Docker installed on your machine.
You'll be able to install Docker regardless of your Operating System.
It has support for Linux, Mac, and Windows, but I would strongly advise sticking to Linux for production.
Once you have Docker installed, it is time to create the container!
Docker looks for a specific file called Dockerfile and it will use it to create
a recipe for your container that we call a Docker Image.
So before we create a container, we'll have to create that file.
Let's create this file in the same directory we have our index.js file and package.json.
# Dockerfile
# Base image (we need Node)
FROM node:12
# Work directory
WORKDIR /usr/myapp
# Install dependencies
COPY ./package*.json ./
RUN npm install
# Copy app source code
COPY ./ ./
# Set environment variables you need (if you need any)
ENV NODE_ENV='production'
ENV PORT=3000
# Expose the port 3000 on the container so we can access it
EXPOSE 3000
# Specify your start command, divided by commas
CMD [ "node", "index.js" ]
It is smart to use a .dockerignore file in the same directory to ignore files
and directories you might not want to copy. You can think of it as working the same as .gitignore.
# .dockerignore
node_modules
npm-debug.log
Now that you have everything set up, it's time to build the Docker Image!
You can think of an image as a recipe for your container.
Or, if you're old enough, you might remember having disks for software installers.
It wasn't the actual software running on it, but it contained the packaged software data.
You can use the command below to create the image. You can use the -t flag to name your image and
find it easier later. Also, make sure your terminal is in the directory where your Dockerfile is located.
docker build -t myapp .
Now, if you list your images, you'll be able to see your image on the list.
docker image ls
If you have your image ready, you're just one command away from having your container up and running.
Let's execute the command below to spin it up.
docker run -p 3000:3000 myapp
You'll be able to see your server starting up with the container and read your logs in the process.
If you'd like to spin it up in the background, use the -d flag before your image name.
Also, if you're running the container in the background, you can print a list of containers using the command below.
docker container ls
So far so good! I think you should have a pretty good idea about how containers work at this point,
so instead of diving into the details, let's move ahead to a topic very closely tied to recovery: Orchestration!
Orchestration
If you don't have an operations background, chances are you're thinking about containers
as some magical sophisticated components. And you would be right in thinking that.
They are magical and complex. But it doesn't help to have that model in our minds, so it's time to change that.
It's best to think about them as the simplest components of our infrastructure, sort of like Lego blocks.
Ideally, you don't even want to be managing these Lego blocks individually
because it's just too fiddly. You'd want another entity that handles them for you,
sort of like the process manager that we discussed earlier.
This is where Orchestrators come into play.
Orchestrators help you manage and schedule your containers and they allow you
to do this across multiple container hosts (VMs) distributed across multiple locations.
The orchestrator feature that interests us the most in this context is Replication!
Replication and High Availability
Restarting our server when it crashes is great, but what happens
during the time our server is restarting? Should our users be waiting for the service
to get back up? How do they know it will be back anyway?
Our goal is to make our service Highly Available, meaning that our
users are able to use our app even if it crashes.
But how can it be used if it's down?
Simple. Make copies of your server and run them simultaneously!
This would be a headache to set up from scratch, but luckily, we have everything
that we need to enable this mechanism.
Once your app is containerized, you can run as many copies of it as you'd like.
These copies are called Replicas.
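To see why replicas keep the app usable while one of them is down, here's a toy model of the routing side of the story. It is only an illustration of the idea and not how any orchestrator actually implements it; the replica names and the createBalancer helper are made up.

```javascript
// Returns a function that picks the next healthy replica in round-robin order.
function createBalancer(replicas) {
  let next = 0;

  return function pick() {
    for (let i = 0; i < replicas.length; i += 1) {
      const index = (next + i) % replicas.length;
      if (replicas[index].healthy) {
        next = (index + 1) % replicas.length; // continue after this one next time
        return replicas[index];
      }
    }
    return null; // every replica is down
  };
}

// Three hypothetical copies of the same server.
const replicas = [
  { name: 'replica-1', healthy: true },
  { name: 'replica-2', healthy: true },
  { name: 'replica-3', healthy: true },
];

const pick = createBalancer(replicas);
```

If replica-2 crashes and is marked unhealthy, pick() simply skips it and requests keep landing on the other two until it comes back.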
So let's look into how we would set up something like this using a container orchestration engine.
There are quite a few out there, but the easiest one to get started with is
Docker's orchestration engine, Docker Swarm.
Replication in Swarm
If you have Docker installed on your machine, you're just one command away from using Docker Swarm.
docker swarm init
This command enables Docker Swarm for you and it allows you to form a distributed cluster
by connecting other VMs to the Swarm. For this example, we can just use a single machine.
So, with Docker Swarm enabled, we now have access to the components called services.
They are the bread and butter of a microservice style architecture,
and they make it easy for us to create replicas.
Let's create a service! Remember the image name we used when we built our Docker image?
It's the same image we're going to use here.
docker service create --name myawesomeservice --replicas 3 myapp
The command above will create a service named myawesomeservice and it will use the image
named myapp to create 3 identical containers.
You'll be able to list your services with the command below.
docker service ls
You can see that there's a service with the name you specified.
To be able to see the containers that have been created, you can use the following command:
docker container ls
Now that our server is running replicated, the service will make sure to always
restart the container if it crashes, and it can offer access to the healthy containers throughout the process.
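A simple way to watch the replicas at work is to have the server say which copy answered. Here's a small sketch of that, assuming the default Docker behavior where a container's hostname is its container id. The function names describeReplica and createServer are my own.

```javascript
const http = require('http');
const os = require('os');

// Identify the copy of the app that handled the request.
function describeReplica() {
  return `Hello from ${os.hostname()} (pid ${process.pid})\n`;
}

// Build the HTTP server without starting it, so the caller picks the port.
function createServer() {
  return http.createServer((req, res) => {
    res.setHeader('Content-Type', 'text/plain');
    res.end(describeReplica());
  });
}

module.exports = { describeReplica, createServer };
```

With createServer().listen(3000) in your index.js, repeated requests to the service should come back with different hostnames as they land on different replicas. Note that the service created above doesn't publish a port, so to try this from outside the Swarm you'd likely need to publish one when creating the service (for example with the --publish flag).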
If you'd like to adjust the number of replicas of a service, you can use the command below.
docker service scale <name_of_service>=<number_of_replicas>
For example:
docker service scale myawesomeservice=5
You're able to run as many replicas as you'd like, just as simple as that.
Isn't that awesome? Let's look at one last example and see how we would approach replication in Kubernetes.
Replication in Kubernetes
It's hard to skip Kubernetes in a discussion about orchestration.
It's the gold standard when it comes to orchestration, and rightfully so.
I think Kubernetes has a much steeper learning curve than Swarm, so if you're just getting
started with containers I'd suggest picking up Swarm first. That said, it doesn't hurt to have
a general understanding of how this would work in the world of K8S.
If you don't feel like installing minikube or you don't want to fiddle with cloud providers,
there's an easy option to dabble in Kubernetes for a bit, by using the Play with Kubernetes online tool.
It gives you a 4-hour session which should be more than enough for small experiments.
To be able to follow this exercise, please make sure that you created
a DockerHub account, and pushed up the docker image to your repo!
We're going to create two components by creating two .yml configuration files:
- A Cluster IP Service — this is going to open up a port for us to communicate with our app.
- A Deployment — which is sort of like a service in Docker Swarm, with a bit more bells and whistles.
Let's first start with the ClusterIP. Create a cluster-ip.yml file and paste the following content into it.
# cluster-ip.yml
apiVersion: v1
kind: Service
metadata:
  name: cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: server
  ports:
    - port: 3000
      targetPort: 3000
Let's create a Deployment as well. Within a deployment.yml file, you can paste the following content.
# deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: server
  template:
    metadata:
      labels:
        component: server
    spec:
      containers:
        - name: server
          image: your_docker_user/your_image
          ports:
            - containerPort: 3000
You'll need to make sure that you substituted the your_docker_user/your_image with your
actual user and image name and you have that image hosted on your Docker repo.
Now that we have these two files ready, all we need to do to spin this up is to execute
the command below. Make sure you're executing it in the directory that contains the files.
kubectl apply -f .
You can now check if your server is up and running by listing the deployments and services.
kubectl get deployments
kubectl get services
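If everything worked out according to plan,
you should be able to reach your application at the IP and Port of your cluster-ip-service.
Keep in mind that a ClusterIP is only reachable from inside the cluster, so try it from one of
the cluster's nodes (for example with curl) rather than from your own browser's address bar.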
To see the replicas that have been created, you can use the following command:
kubectl get pods
The pods listed should correspond to the number of replicas you specified in your deployment.yml file.
To clean up all the components, you can simply execute:
kubectl delete -f .
And just like that, we learned about Replication within Kubernetes as well.
Conclusion
So, we have an application that recovers and is highly available. Is that all there is to it?
Not at all. In fact, now that your app doesn't "go down", how do you know what issues it might be having?
By looking at the logs? Be honest. If your app is up every time you check the endpoint,
you'll probably check the logs about two times per year.
There's more interesting stuff to look at on social media.
So, to make sure your app is improving, you'll have to start thinking about monitoring,
error handling, and error propagation. You'll have to make sure that you're aware of issues
as they arise, and you're able to fix them even if they don't keep your server down.
That's a topic for another time though, I hope you enjoyed this article
and it was able to shed some light on some of the approaches you could use
to enable recovery for your Node.js application.
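As a small taste of the "be aware of issues as they arise" part, here's a hedged sketch of process-level handlers that log a fatal error as a single JSON line before exiting, so whatever restarts your app also leaves a trace behind. The names formatFatal and installFatalHandlers and the shape of the log line are assumptions of mine and not a standard.

```javascript
// Turn a fatal error into a single JSON log line.
function formatFatal(kind, err, now = new Date()) {
  return JSON.stringify({
    level: 'fatal',
    kind,
    message: err instanceof Error ? err.message : String(err),
    stack: err instanceof Error ? err.stack : undefined,
    time: now.toISOString(),
  });
}

// Log the error, then exit non-zero so a process manager,
// systemd, or an orchestrator can restart the app.
function installFatalHandlers() {
  process.on('uncaughtException', (err) => {
    console.error(formatFatal('uncaughtException', err));
    process.exit(1);
  });

  process.on('unhandledRejection', (reason) => {
    console.error(formatFatal('unhandledRejection', reason));
    process.exit(1);
  });
}

module.exports = { formatFatal, installFatalHandlers };
```

Calling installFatalHandlers() once at startup is enough. The restart keeps the app available, and the log line tells you there was something to fix.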
P.S. If you liked this post, subscribe to our new JavaScript Sorcery list for a monthly deep dive into more magical JavaScript tips and tricks.
P.P.S. If you'd love an all-in-one APM for Node.js or you're already familiar with AppSignal, go and check out AppSignal for Node.js.