Why we stopped using 'npm start' for running our blockchain core's child processes
Michiel Mulders
Posted on May 23, 2019
You shouldn't start applications through npm when you have child processes natively supported by Node.js. In this article, we will provide a list of best practices for Node.js applications with a code snippet that outlines the core problem and shows you how to reproduce the issue in 3 steps. In short, we stopped using npm start to run our blockchain's core and instead opted for using the native node command.
Introduction to npm and its most well-known command 'npm start'.
npm is the go-to package manager for Node.js when you are working on a JavaScript project. It allows you to install other people's code packages into your own project so you don't have to code everything you need from scratch. npm also became famous because of its industry-wide usage of script commands that can be entered in the shell to start your application. Of course, the most well-known command is npm start, which typically acts as a wrapper for node app.js.
Our challenge: npm runs app.js files as a child process of npm.
However, what many don't know is that when you use npm start to trigger node app.js, npm actually runs your app.js file as a child process of the npm process that manages it. In 99% of cases, you don't need to care about this, but things can get tricky when working with child processes in your own project. Can you feel the inception happening here? #child-process-inception
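One way to see this wrapping for yourself is to let the application report who started it. The following is a minimal sketch rather than anything taken from Lisk Core: `describeProcess` is a hypothetical helper name, and it relies only on `process.pid`, `process.ppid`, and the `npm_lifecycle_event` environment variable that npm sets for the scripts it runs.

```javascript
// Report basic facts about how this process was started.
// When a script is launched through npm, npm sets the
// npm_lifecycle_event environment variable (for example "start").
function describeProcess(env = process.env) {
  return {
    pid: process.pid,        // this Node.js process
    parentPid: process.ppid, // whoever spawned it
    startedByNpm: typeof env.npm_lifecycle_event === 'string',
    npmScript: env.npm_lifecycle_event || null,
  };
}

console.log(describeProcess());
```

Started with node app.js, the parent is your shell. Started with npm start, the parent should be npm or the shell that npm uses to run the script, which is the extra layer this article is about.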
If you want to know more about Lisk first, check out this short explainer clip and our documentation!
To give you a better understanding of how this is relevant to our "npm vs node" problem, let's talk about how we are running Lisk Core. For those who don't know what Lisk Core is, essentially, it is a program that implements the Lisk Protocol which includes consensus, block creation, transaction handling, peer communication, etc. Every machine must set it up to run a node that allows for participation in the network.
Intro to PM2, a production process manager for Node.js apps.
In our case, we use PM2 to restart the application upon failure. PM2 is a production process manager for Node.js applications with a built-in load balancer. It allows you to keep applications alive forever, to reload them without downtime and to facilitate common system admin tasks.
A few weeks ago, we decided to provide the ability to run the http_api module as a child process to improve the overall efficiency of the Lisk Core application while using the same allocated resources.
Rationale behind the decision to run http_api module as a child process.
The idea behind this decision was mainly founded on the fact that functionally isolated components can form the basis of a multi-process application, which can utilize multiple hardware cores of the physical processor if available. We also wanted to design each component in a resilient way to tackle the brittleness of multi-processing. This means that a failure of one component will have minimal impact on other components and that components can recover individually. More information about child processes can be found in our proposal to introduce a new flexible, resilient and modular architecture for Lisk Core.
We were not able to gracefully exit Lisk Core with npm.
While implementing child processes for the http_api module, Lightcurve Backend Developer Lucas Silvestre discovered that Lisk Core was not exiting gracefully while running the http_api module as a child process using PM2. This resulted in a tricky situation where the http_api module kept on running in the background whenever the main process (Lisk Core) crashed.
Whenever this happened, PM2 would attempt to recover the Lisk Core process. However, this would spawn a new http_api process, which was not possible because the port was still in use: the cleanup process had never been called. This resulted in PM2 not being able to restore the application, which is a big issue when running a blockchain node that is part of the network. In this case, the user has to manually restart the blockchain node, which we absolutely want to avoid.
Running Lisk Core with node command
This issue made us aware of the difference between npm and node and made us reconsider the way we were running Lisk Core. Previously, we just accepted the npm start industry standard as the go-to way of running an application.
Later, we found the best practices provided by the docker-node GitHub repository dedicated to Dockerizing Node.js applications. Here, a clear warning message can be found about the usage of npm inside of a Dockerfile or any other higher-level application management tool like PM2.
"When creating an image, you can bypass the package.json's start command and bake it directly into the image itself. First off this reduces the number of processes running inside of your container. Secondly, it causes exit signals such as SIGTERM and SIGINT to be received by the Node.js process instead of npm swallowing them."
Whenever we tried to exit Lisk Core or the application crashed, a SIGINT signal was sent to the application. In Node.js, you can listen for this signal and execute a cleanup function in order to exit the application gracefully. In our case, we remove various listeners and pass the SIGINT signal to the child process so that it exits gracefully as well.
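The pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Lisk Core's actual implementation: `forwardSignals` and its `cleanup` parameter are hypothetical names, and `child` is assumed to be a ChildProcess-like object with a `kill(signal)` method.

```javascript
// Register handlers on `proc` that run `cleanup` once and then forward
// the received signal to `child`, so the child can shut down as well.
function forwardSignals(
  child,
  signals = ['SIGINT', 'SIGTERM'],
  proc = process,
  cleanup = () => {}
) {
  let shuttingDown = false;
  for (const signal of signals) {
    proc.on(signal, () => {
      if (shuttingDown) return; // ignore repeated signals during shutdown
      shuttingDown = true;
      cleanup();                // e.g. remove listeners, close servers
      child.kill(signal);       // ChildProcess#kill sends the named signal
    });
  }
}

// Usage sketch (not executed here; './http_api.js' is a placeholder path):
// const { spawn } = require('child_process');
// const child = spawn(process.execPath, ['./http_api.js'], { stdio: 'inherit' });
// forwardSignals(child);
// child.once('exit', () => process.exit(0));
```

The guard against repeated signals is a design choice for this sketch: it keeps the cleanup from running twice if the same interrupt reaches the process more than once.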
As stated by docker-node, npm swallows this signal and does not trigger our listeners for the SIGINT signal, which leaves the application unable to clean up gracefully. That's also the reason why the http_api module kept running inside of PM2.
Nick Parson, an expert when it comes to running Node applications with PM2, also mentions that it is important to gracefully shut down your application in order to maximize robustness and enable fast startup (no downtime) when using PM2.
Termination signals: what are SIGKILL, SIGTERM, and SIGINT?
We have to dive quite deep to find out what these signals are about. They are part of a collection of signals that tell a process to terminate. Many more exist, and they can be found in the documentation provided by gnu.org under section 24.2.2 Termination Signals.
- SIGKILL: "The SIGKILL signal is used to cause immediate program termination. It cannot be handled or ignored, and is therefore always fatal. It is also not possible to block this signal."
- SIGTERM: "The SIGTERM signal is a generic signal used to cause program termination. Unlike SIGKILL, this signal can be blocked, handled, and ignored. It is the normal way to politely ask a program to terminate." Interesting to know that the shell command kill generates SIGTERM by default.
- SIGINT: "The SIGINT ('program interrupt') signal is sent when the user types the INTR character (normally C-c)." Developers will probably be more familiar with the CTRL/CMD+C command to interrupt a running process in the shell.
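The difference between a signal that can be handled and one that cannot is easy to observe from Node.js itself. The sketch below assumes a POSIX system (Linux or macOS); `childSource` and `runAndSignal` are hypothetical names. It uses the `timeout` and `killSignal` options of `child_process.spawnSync` to deliver a signal to a small child program that installs a SIGTERM handler.

```javascript
const { spawnSync } = require('child_process');

// Child program: exits cleanly when it receives SIGTERM, otherwise idles.
const childSource = `
  process.on('SIGTERM', () => process.exit(0));
  setInterval(() => {}, 1000);
`;

// Start the child and let spawnSync send `killSignal` once `timeout` elapses.
function runAndSignal(killSignal) {
  const result = spawnSync(process.execPath, ['-e', childSource], {
    timeout: 1000, // ms before the signal is sent
    killSignal,
  });
  // status: exit code if the child exited on its own terms, else null
  // signal: name of the signal that terminated the child, else null
  return { status: result.status, signal: result.signal };
}

const termResult = runAndSignal('SIGTERM'); // handler runs, child exits itself
const killResult = runAndSignal('SIGKILL'); // handler never gets a chance

console.log(termResult);
console.log(killResult);
```

With SIGTERM the child gets to run its handler and exit with its own status code. With SIGKILL the result reports the terminating signal instead, because the child never had the opportunity to react.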
Moving Docker and PM2 to Node.
This made us decide to get rid of npm start and replace it with the node command. The start command was being used in both the Dockerfile and the PM2 run file.
The following image shows a snippet of the typical ENTRYPOINT for Docker. Previously, this would contain ENTRYPOINT ["npm", "start"]. This file can now be found in our new Lisk Core repository, which is extracted from the Lisk-SDK Monorepo.
Lisk-SDK Dockerfile.
The same applies to the pm2-lisk.json file, which contains the PM2 configuration for starting Lisk Core. The script property now contains the relative path to the index file.
Learn how to reproduce the issue in 3 steps.
We can find a cool snippet created by GitHub user EvanTahler addressing the above-mentioned issue. Let's reproduce this!
Step 1. Create package.json and app.js
To emulate this issue, you need to create two files (package.json and app.js) in the same directory. Make sure you have Node.js version 10.x or higher installed on your machine to run the snippet with the node command. As we don't need any code dependencies, we don't have to install anything else.
package.json
{
  "name": "test",
  "scripts": {
    "start": "node ./app.js"
  }
}
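app.js
// Both termination signals trigger the same delayed shutdown.
process.on('SIGINT', function(){ console.log("SIGINT"); shutDown() });
process.on('SIGTERM', function(){ console.log("SIGTERM"); shutDown() });

var string = ".";

var shutDown = function(){
  console.log("off-ing...");
  string = "x"; // printed instead of "." while shutting down
  // Simulate 5 seconds of cleanup work before exiting.
  setTimeout(function(){
    console.log("bye!");
    process.exit();
  }, 1000 * 5);
}

// Heartbeat: print the current marker every 0.5 seconds.
setInterval(function(){
  console.log(string);
}, 500)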
Snippet clarification - The snippet prints a dot every 0.5 seconds and listens for the SIGINT and SIGTERM signals. Once one of the two termination signals is received, it prints "off-ing...", switches from printing dots to printing "x", delays the shutdown by 5 seconds (5 * 1000ms), and finally prints "bye!".
Before running this snippet, I want to show you how a killed process is indicated in your terminal when hitting CTRL/CMD+C. You can notice it by the ^C characters.
Shows Lisk Core running for exactly 17 minutes after getting killed with the SIGINT signal.
Step 2. Run the snippet with node.
Now that we know how SIGINT is represented in our terminal, let's start the snippet with node app.js. Let it run for 5 seconds, and hit CTRL/CMD+C. You will see that the kill signal is properly handled by Node, which waits for 5 more seconds before shutting down.
Step 3. Run the snippet with npm start
However, when we run the snippet with npm start, you will notice two kill signals being received. As we now know, the start command runs node app.js as a child process. So, when receiving ^C, npm tries to exit its own process and passes the termination signal on to the child. The problem is that the main process exits while the child stays active for 5 more seconds.
As explained before, this causes all sorts of problems when you try to listen for termination signals while running applications with npm start, especially when operating child processes.
Interested in learning how to set up and run your own Lisk node? More information can be found in the Lisk Core documentation on the website. You can choose between the binary setup which is the default (and most simple) installation technique. Other options include running Lisk Core with Docker to support other platforms or for more advanced users, it is possible to build from Lisk Core.
Because of this "child process inception", the http_api module could not gracefully exit and kept on running. The only way to stop this process is by using a shell command that kills all Node processes: sudo killall node (or target the specific process ID to be killed). Luckily, this could be easily resolved by using node to start the application.
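Targeting one specific process ID, rather than every Node process on the machine, can also be done from Node.js. The following is a small sketch with hypothetical helper names (`isRunning`, `terminate`); it only uses `process.kill`, where signal 0 checks whether a process exists without actually signalling it.

```javascript
// Return true if a process with this PID currently exists.
// Signal 0 performs the existence check without sending a signal.
function isRunning(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // exists, but belongs to another user
  }
}

// Politely ask one specific process to terminate.
// Returns false if no process with that PID exists.
function terminate(pid) {
  try {
    process.kill(pid, 'SIGTERM');
    return true;
  } catch (err) {
    if (err.code === 'ESRCH') return false; // no such process
    throw err; // e.g. EPERM: not allowed to signal this process
  }
}
```

Calling terminate with the PID of a leftover child sends it SIGTERM, which gives its handlers a chance to clean up, and isRunning can be used afterwards to check whether it is gone.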
Best Practices for Handling Node.js Applications
Felix Geisendörfer, an early contributor of Node.js, makes it very clear how to handle crashed applications:
Source: Node.js Best Practices SlideShare
What does the above teach us? Avoid spinning up your application through npm start and use node instead. Also, if something goes wrong, exit the process gracefully and accept it. Felix recommends using higher-level tools like PM2 to deal with recovering and restarting the application.
We learned from this that you should not always take standards for granted. It is sometimes better to keep things simple and run your application with a simple node command.
To conclude what we did at Lisk, we decided to solve the issue by changing the npm start command to node src/index in both the PM2 run configuration and Dockerfile. Now, upon receiving a SIGINT signal, the node process receives this directly and can communicate the SIGINT signal to its child processes so every process can be exited gracefully.
Therefore, PM2 can easily restart the application without any downtime. Running our application via this setup allows us to deploy a more stable application, which is extremely important for creating a stable blockchain network.