Running on Lambda: Serverless Reader View with Chrome and Readability
Chris Cook
Posted on September 25, 2023
Do you remember the Firefox Reader View? It's a feature that removes all unnecessary components like buttons, menus, images, and so on, from a website, focusing on the readable content of the page. The library powering this feature is called Readability.js, which is open source.
Motivation
For one of my personal projects, I needed an API that returns the readable content from a given URL. Initially, that seemed like a straightforward task: just fetch the HTML and feed it into the library. However, it turned out to be a bit more complicated due to the complexity of modern web pages filled with lots of JavaScript.
First of all, in order to retrieve the actual content of a page, a browser is needed to execute all scripts and render the page. And since we're talking Serverless, it has to run on Lambda, of course. Sounds fun?
Stack
I'm usually a Serverless Framework guy, but for this project, I wanted to try something new. So I decided to give the AWS CDK a try and I really liked the experience – more on that at the end. Let's walk through the interesting bits and pieces.
Lambda Layer
The most crucial question was, of course, how to run Chrome on Lambda. Fortunately, much of the groundwork for running Chrome on Lambda had been laid by others. I used the @sparticuz/chromium package to run Chromium in headless mode. However, Chromium is a rather big dependency, so to speed up deployments, I created a Lambda Layer.
The corresponding .zip file was downloaded as artifact from one of the releases.
Lambda Function
The function runs on Node.js v18 and is compiled via ESBuild from TypeScript. There are a few things to note here. I increased the memory to 1600 MB as recommended, and the timeout to 30 seconds to give Chromium enough space and time to start.
I added a reserved concurrency of 1 to prevent this function from scaling out of control due to too many requests.
When bundling this function, the @sparticuz/chromium package has to be excluded because we provide it as a Lambda Layer. On the other hand, the jsdom package can't be bundled, so it has to be installed as a normal node module.
REST API
The function is invoked by a GET request from a REST API and receives the URL as a query string parameter. The url request parameter is marked as mandatory. Moreover, I made use of the new defaultCorsPrefligtOptions to simplify the CORS setup.
First, we declare the browser outside of the handler function to be able to re-use the browser instance on subsequent invocations. The launch of a new instance on a cold start causes the majority of execution time.
We parse the url query string parameter from the API Gateway event and validate it to be a real URL. Then, we use Puppeteer to launch a new browser instance and open a new page. This new page is closed at the end of the function while the browser instance stays open until the Lambda is terminated.
Readability.js requires a DOM object to parse the readable content from a website. That's why we create a DOM object with JSDOM and provide the HTML from the page and its current URL. By the way, the browser may have had to follow HTTP redirects, so the current URL doesn't necessarily have to be the one we provided initially. The parse function of the library returns the following result:
Some meta information is also available in the result object, but since we are returning raw HTML content, we are only interested in the content property. However, we have to add the Content-Type header with text/html; charset=utf-8 to the response object to ensure the browser renders it correctly.
Application
Now comes the fun part. I have created a simple web app with React, Tailwind, and Vite to demonstrate this project. Strictly speaking, you could call the REST API directly from a browser as the Lambda function returns real HTML that renders just fine. However, I thought it would be nicer to use it as a real application.
The following articles are curated examples showcasing the Readability version on the left and the Original article on the right. Of course, you can also try your own article and start here: zirkelc.github.io/lambda-readability
So without further ado, let's read some articles:
Cloud Development Kit
I've got to say, my initial dive into AWS CDK has been quite a pleasant surprise. What impresses me most is the ability to code up my infrastructure using good old JavaScript or TypeScript, the very languages I already use to develop my application. No more fumbling with meta languages or constantly referring to documentation just to figure out how to do this or that – CDK simplifies everything.
The beauty of it all is that I can utilize the fundamental building blocks: if-conditions and for-loops, objects and arrays, classes and functions. I can put my coding skills to work in the same way I always do, without the need for any special plugins or hooks. That’s what Infrastructure as Code should really feel like – a truly great developer experience.
Conclusion
It's pretty amazing how far the Serverless world has come, enabling us to effortlessly run a Chrome browser inside a Lambda function. If you are interested in the mechanics of this project, you can view the full source code on GitHub. I'd really appreciate your feedback, and if you like it, give it a star on GitHub!
I hope you found this post helpful. If you have any questions or comments, feel free to leave them below. If you'd like to connect with me, you can find me on LinkedIn or GitHub. Thanks for reading!