Do you remember the Firefox Reader View? It's a feature that removes all unnecessary components like buttons, menus, images, and so on, from a website, focusing on the readable content of the page. The library powering this feature is called Readability.js, which is open source.

Motivation

For one of my personal projects, I needed an API that returns the readable content from a given URL. Initially, that seemed like a straightforward task: just fetch the HTML and feed it into the library. However, it turned out to be a bit more complicated due to the complexity of modern web pages filled with lots of JavaScript.

First of all, in order to retrieve the actual content of a page, a browser is needed to execute all scripts and render the page. And since we're talking Serverless, it has to run on Lambda, of course. Sounds fun?

Stack

I'm usually a Serverless Framework guy, but for this project, I wanted to try something new. So I decided to give the AWS CDK a try and I really liked the experience – more on that at the end. Let's walk through the interesting bits and pieces.

Lambda Layer

The most crucial question was, of course, how to run Chrome on Lambda. Fortunately, much of the groundwork for running Chrome on Lambda had been laid by others. I used the @sparticuz/chromium package to run Chromium in headless mode. However, Chromium is a rather big dependency, so to speed up deployments, I created a Lambda Layer.

const chromeLayer = new LayerVersion(this, "chrome-layer", {
  description: "Chromium v111.0.0",
  compatibleRuntimes: [Runtime.NODEJS_18_X],
  compatibleArchitectures: [Architecture.X86_64],
  code: Code.fromAsset("layers/chromium/chromium-v111.0.0-layer.zip"),
});

The corresponding .zip file was downloaded as an artifact from one of the @sparticuz/chromium releases.

Lambda Function

The function runs on Node.js v18 and is compiled via ESBuild from TypeScript. There are a few things to note here. I increased the memory to 1600 MB as recommended, and the timeout to 30 seconds to give Chromium enough space and time to start.
I added a reserved concurrency of 1 to prevent this function from scaling out of control due to too many requests.

const handler = new NodejsFunction(this, "handler", {
  functionName: "lambda-readability",
  entry: "src/handler.ts",
  handler: "handler",
  runtime: Runtime.NODEJS_18_X,
  timeout: cdk.Duration.seconds(30),
  memorySize: 1600,
  reservedConcurrentExecutions: 1,
  environment: {
    NODE_OPTIONS: "--enable-source-maps --stack-trace-limit=1000",
  },
  bundling: {
    externalModules: ["@sparticuz/chromium"],
    nodeModules: ["jsdom"],
  },
  layers: [chromeLayer],
});

const lambdaIntegration = new LambdaIntegration(handler);

When bundling this function, the @sparticuz/chromium package has to be excluded because we provide it as a Lambda Layer. On the other hand, the jsdom package can't be bundled, so it has to be installed as a normal node module.

REST API

The function is invoked by a GET request from a REST API and receives the URL as a query string parameter. The url request parameter is marked as mandatory. Moreover, I made use of the new defaultCorsPreflightOptions to simplify the CORS setup.

const api = new RestApi(this, "lambda-readability-api", {
  apiKeySourceType: ApiKeySourceType.HEADER,
  defaultCorsPreflightOptions: {
    allowOrigins: Cors.ALL_ORIGINS,
    allowMethods: Cors.ALL_METHODS,
    allowHeaders: Cors.DEFAULT_HEADERS,
  },
});

api.root.addMethod("GET", lambdaIntegration, {
  requestParameters: { "method.request.querystring.url": true },
  apiKeyRequired: true,
});

Furthermore, I created an API key and assigned it to a usage plan that limits the number of calls to 1,000 per day and throttles requests at a rate limit of 10 per second with a burst limit of 2.

const key = api.addApiKey("lambda-readability-apikey");
const plan = api.addUsagePlan("lambda-readability-plan", {
  quota: {
    limit: 1_000,
    period: Period.DAY,
  },
  throttle: {
    rateLimit: 10,
    burstLimit: 2,
  },
});

plan.addApiKey(key);
plan.addApiStage({ api, stage: api.deploymentStage });
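For illustration, a client request against this API could be assembled roughly as in the sketch below. It assumes the API key is sent in API Gateway's x-api-key header (matching ApiKeySourceType.HEADER) and the page to read is passed as the url query string parameter. The helper name buildReaderRequest and the endpoint are hypothetical placeholders, not part of the project.

```typescript
// Hypothetical helper: builds the GET request for the reader API.
// The endpoint used below is a placeholder, not the project's real URL.
type ReaderRequest = {
  url: string;
  init: { method: "GET"; headers: Record<string, string> };
};

function buildReaderRequest(
  endpoint: string,
  targetUrl: string,
  apiKey: string
): ReaderRequest {
  const requestUrl = new URL(endpoint);
  // searchParams.set() percent-encodes the target URL for us.
  requestUrl.searchParams.set("url", targetUrl);

  return {
    url: requestUrl.toString(),
    init: { method: "GET", headers: { "x-api-key": apiKey } },
  };
}

const request = buildReaderRequest(
  "https://example.execute-api.eu-west-1.amazonaws.com/prod/",
  "https://example.com/article?id=1",
  "my-api-key"
);
// request.url and request.init can then be passed to fetch().
```

Building the query string with URL.searchParams avoids manual encoding mistakes when the target URL itself contains a query string.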

Implementation

Let's take a look at the full implementation first and then go into the interesting parts step by step:

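// Imports assumed from the packages named in this article.
// parseRequest and formatResponse are local helpers that are not shown here.
import chromium from "@sparticuz/chromium";
import { Readability } from "@mozilla/readability";
import type { APIGatewayProxyHandlerV2 } from "aws-lambda";
import { JSDOM } from "jsdom";
import puppeteer, { Browser, Page } from "puppeteer-core";

// Kept in module scope so warm invocations can reuse the browser.
let browser: Browser | undefined;

export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  let page: Page | undefined;

  try {
    const { url } = parseRequest(event);

    // Launch Chromium only once per execution environment (cold start).
    if (!browser) {
      browser = await puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath(),
        headless: chromium.headless,
        ignoreHTTPSErrors: true,
      });
    }

    page = await browser.newPage();
    await page.goto(url);

    // Hand the rendered HTML and the final URL (after redirects) to JSDOM.
    const content = await page.content();
    const dom = new JSDOM(content, { url: page.url() });

    const reader = new Readability(dom.window.document);
    const result = reader.parse();

    return formatResponse({ result });
  } catch (cause) {
    // Normalize anything thrown into an Error before logging it.
    const error =
      cause instanceof Error ? cause : new Error("Unknown error", { cause });
    console.error(error);
    return formatResponse({ error });
  } finally {
    // Close the page on every invocation; the browser stays open.
    await page?.close();
  }
};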

First, we declare the browser outside of the handler function so that the browser instance can be reused on subsequent invocations. Launching a new instance on a cold start accounts for the majority of the execution time.
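The reuse pattern itself is independent of Puppeteer. As a minimal sketch (not the project's code), the hypothetical helper once below caches the promise of an expensive start-up step in module scope, so only the first call pays the cost. Here fakeLaunch merely stands in for puppeteer.launch.

```typescript
// Hypothetical helper: runs `create` at most once and caches its promise.
function once<T>(create: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = create().catch((error) => {
        // Forget a failed attempt so the next invocation can retry.
        cached = undefined;
        throw error;
      });
    }
    return cached;
  };
}

// Stand-in for puppeteer.launch(); counts how often it is called.
let launches = 0;
const fakeLaunch = async (): Promise<{ id: number }> => {
  launches += 1;
  return { id: launches };
};

// Declared in module scope, so it survives across warm invocations.
const getBrowser = once(fakeLaunch);

async function simulateInvocations(): Promise<number> {
  await getBrowser(); // cold start: launches the "browser"
  await getBrowser(); // warm invocation: reuses the cached instance
  return launches;
}
```

Caching the promise rather than the resolved value also means that overlapping callers would share a single launch instead of starting two.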

We parse the url query string parameter from the API Gateway event and validate it to be a real URL. Then, we use Puppeteer to launch a browser instance, unless one already exists from a previous invocation, and open a new page. This page is closed at the end of the function, while the browser instance stays open until the Lambda execution environment is terminated.
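The parseRequest helper isn't shown in the handler above. A minimal sketch of how such a helper could look follows; the event type is reduced to the one field it needs, and restricting the protocol to http(s) is an assumption made for illustration, so the real implementation may differ.

```typescript
// Minimal subset of the API Gateway proxy event used by this sketch.
type QueryEvent = {
  queryStringParameters?: Record<string, string | undefined> | null;
};

// Sketch of a parseRequest helper; the real implementation may differ.
function parseRequest(event: QueryEvent): { url: string } {
  const raw = event.queryStringParameters?.url;
  if (!raw) {
    throw new Error("Missing query string parameter: url");
  }

  let parsed: URL;
  try {
    // The URL constructor throws for strings that are not absolute URLs.
    parsed = new URL(raw);
  } catch {
    throw new Error(`Invalid URL: ${raw}`);
  }

  // Assumption: only http(s) pages make sense for a reader view.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }

  return { url: parsed.toString() };
}
```

Because the helper throws on bad input, the handler's existing catch block would turn an invalid URL into an error response without extra branching.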

Readability.js requires a DOM object to parse the readable content from a website. That's why we create a DOM object with JSDOM and provide the HTML from the page and its current URL. By the way, the browser may have had to follow HTTP redirects, so the current URL doesn't necessarily have to be the one we provided initially. The parse function of the library returns the following result:

type Result = {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string;
  dir: string;
  siteName: string;
  lang: string;
};

Some meta information is also available in the result object, but since we are returning raw HTML content, we are only interested in the content property. However, we have to add the Content-Type header with text/html; charset=utf-8 to the response object to ensure the browser renders it correctly.
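The formatResponse helper isn't shown in the handler above. As a hedged sketch, it could return the extracted HTML with the Content-Type header on success and a JSON message otherwise. The status codes, the error body, and the reduced types are assumptions for illustration only.

```typescript
// Minimal shapes for this sketch; not the library's full types.
type ParseResult = { title: string; content: string } | null;
type ProxyResponse = {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
};
type FormatInput = { result: ParseResult } | { error: Error };

// Sketch of a formatResponse helper; status codes are assumptions.
function formatResponse(input: FormatInput): ProxyResponse {
  if ("error" in input) {
    return {
      statusCode: 500,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: input.error.message }),
    };
  }

  if (!input.result) {
    // Readability's parse() can return null when it finds no article.
    return {
      statusCode: 422,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: "No readable content found" }),
    };
  }

  return {
    statusCode: 200,
    // Tells the browser to render the body as UTF-8 encoded HTML.
    headers: { "Content-Type": "text/html; charset=utf-8" },
    body: input.result.content,
  };
}
```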

Application

Now comes the fun part. I have created a simple web app with React, Tailwind, and Vite to demonstrate this project. Strictly speaking, you could call the REST API directly from a browser as the Lambda function returns real HTML that renders just fine. However, I thought it would be nicer to use it as a real application.

The following articles are curated examples showcasing the Readability version on the left and the Original article on the right. Of course, you can also try your own article and start here: zirkelc.github.io/lambda-readability

So without further ado, let's read some articles:

[Image: the Readability version (left) and the Original article (right)]

Cloud Development Kit

I've got to say, my initial dive into AWS CDK has been quite a pleasant surprise. What impresses me most is the ability to code up my infrastructure using good old JavaScript or TypeScript, the very languages I already use to develop my application. No more fumbling with meta languages or constantly referring to documentation just to figure out how to do this or that – CDK simplifies everything.

The beauty of it all is that I can utilize the fundamental building blocks: if-conditions and for-loops, objects and arrays, classes and functions. I can put my coding skills to work in the same way I always do, without the need for any special plugins or hooks. That’s what Infrastructure as Code should really feel like – a truly great developer experience.

Conclusion

It's pretty amazing how far the Serverless world has come, enabling us to effortlessly run a Chrome browser inside a Lambda function. If you are interested in the mechanics of this project, you can view the full source code on GitHub. I'd really appreciate your feedback, and if you like it, give it a star on GitHub!

zirkelc / lambda-readability

Reader View built with Lambda and Readability

Lambda Readability

Lambda Readability is a Serverless Reader View to extract readable content from web pages using AWS Lambda, Chromium, and the Readability.js library.

For more information, read my article on DEV.to: Building a Serverless Reader View with Lambda and Chrome

Features

Serverless project built with AWS CDK
Runs a headless Chrome browser on AWS Lambda
Uses the Readability.js library to extract readable content from a web page
Simple REST API for requests
Frontend built with React, Tailwind, and Vite

Application

Visit zirkelc.github.io/lambda-readability and enter a URL for a website. Here are some examples:

Maker's Schedule, Manager's Schedule by Paul Graham.

Readability vs Original

Understanding AWS Lambda’s invoke throttling limits by Archana Srikanta on the AWS Compute Blog.

Readability vs Original

Advice for Junior Developers by Jeroen De Dauw on DEV.to

Readability vs Original

Development

Install dependencies from root:

npm install

Build and deploy backend with CDK:

cd backend

…

I hope you found this post helpful. If you have any questions or comments, feel free to leave them below. If you'd like to connect with me, you can find me on LinkedIn or GitHub. Thanks for reading!

Running on Lambda: Serverless Reader View with Chrome and Readability

Chris Cook
