Building smarter RSS feeds for my newsletter subscriptions


Zied Ben Tahar

Posted on November 23, 2024


Photo by Joanna Kosinska on Unsplash

In this article, I’ll share a tool I recently built for my personal use, driven largely by intellectual curiosity. Although I was aware of services like Kill the Newsletter!, I wanted to create a service that generates personalized RSS feeds for my newsletter subscriptions. I wanted the feeds to do more than just list content: they provide summaries of the featured articles and topics shared in the newsletters I am subscribed to, using an LLM to categorize, summarize, and extract key information. This helps me stay on top of relevant updates and ensures I don’t miss out on insights from the community.

To achieve this, I used a feature from Amazon Simple Email Service (SES) that allows for handling incoming emails, making it a suitable option for building event-driven automations that process messages as they arrive. In my case, I used this capability to efficiently manage the newsletters I receive, transforming them into personalized RSS feeds.


Designing the smart RSS feed generator

To start, I wanted to be able to set up email addresses that I can create on the fly whenever I want to subscribe to one or many newsletters. Each email address would correspond to a “virtual” inbox, tailored for a specific type of subscription. Although all emails are technically routed to the same location, this approach allows me to manage and organize my subscriptions as if they were separate inboxes, each based on specific interests and topics. I can create as many dedicated inboxes as needed. For example, one could be set up for awesome-serverless-community@my-domain.com, while another could be for awesome-system-design@my-domain.com, each handling different types of newsletters and allowing me to configure separate content filtering rules for each subscription.

High-level overview of smart and personalized RSS feed generation from newsletters

Whenever a newsletter arrives at one of these “virtual” inboxes, the RSS feed generation starts. The first step involves verifying whether the system should handle the email. To achieve this, I implemented an allow list of trusted newsletter senders, which can be configured as needed. This ensures that only approved sources are processed, adding an extra layer of control to the system. Next, the raw email content is converted into Markdown format. An LLM is then used to create a gist of the newsletter and filter the content based on my interests. Both the filter configurations and the allow list are stored in a dedicated table, with filters configured for each feed.
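
The allow-list check described above can be sketched as follows. This is a minimal in-memory illustration rather than the repository's code: the `FeedConfig` shape, the `interests` field, the `findFeedsForSender` helper, and the sample addresses are all assumptions, and in the actual service these records live in a dedicated table.

```typescript
// Hypothetical shape of a feed configuration record (assumed for illustration).
type FeedConfig = {
    feedId: string;
    allowedSenders: string[]; // newsletter senders trusted for this feed
    interests: string[]; // topics used later to filter the gist
};

// In-memory stand-in for the configuration table.
const feedConfigs: FeedConfig[] = [
    {
        feedId: "awesome-serverless-community",
        allowedSenders: ["news@serverless-weekly.example"],
        interests: ["lambda", "eventbridge"],
    },
    {
        feedId: "awesome-system-design",
        allowedSenders: ["digest@system-design.example", "news@serverless-weekly.example"],
        interests: ["architecture"],
    },
];

// Returns every feed that trusts the sender; an empty array means "ignore this email".
const findFeedsForSender = (sender: string, configs: FeedConfig[]): FeedConfig[] => {
    const normalized = sender.trim().toLowerCase();
    return configs.filter((config) =>
        config.allowedSenders.some((allowed) => allowed.toLowerCase() === normalized)
    );
};
```

A sender can be trusted by several feeds at once, which is why the lookup returns a list and the same gist can end up in more than one feed.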

Some of the newsletters I subscribe to feature valuable community content, such as links to blog posts or videos. I also use the LLM to generate a structured list of these links, along with their main topics, so I can easily access the content that’s most relevant to me.

Once the gist of a newsletter is ready, it is finally stored in a dedicated table. Each email address gets its own personalized RSS feed that gets served through an API, so it can be accessed by RSS feed readers.

Solution overview

Let’s now have a deeper look at the solution:

Smart RSS Feed generation — Solution overview

Incoming emails are stored in an S3 bucket. Each time a new email arrives, an event is sent to the default EventBridge bus, triggering a workflow to process the email. The Process Email function handles the entire workflow, including content conversion, filtering, and gist generation. It uses Amazon Bedrock with the Claude Sonnet model to create a structured newsletter gist, which is then stored in a DynamoDB table.
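
For reference, here is a rough sketch of the part of the S3 "Object Created" event that the function relies on. The type lists only the fields used in this article, `emailIdFromEvent` is a hypothetical helper, and the bucket name and key in the sample event are made up.

```typescript
// Subset of the EventBridge "Object Created" event emitted by S3 (only the fields used here).
type S3ObjectCreatedNotificationEvent = {
    "detail-type": "Object Created";
    source: "aws.s3";
    detail: {
        bucket: { name: string };
        object: { key: string; size: number };
    };
};

// Hypothetical helper: derive an email id from the last segment of the object key.
const emailIdFromEvent = (evt: S3ObjectCreatedNotificationEvent): string => {
    const segments = evt.detail.object.key.split("/");
    return segments[segments.length - 1];
};

// Sample event with made-up values.
const sampleEvent: S3ObjectCreatedNotificationEvent = {
    "detail-type": "Object Created",
    source: "aws.s3",
    detail: {
        bucket: { name: "my-inbox-bucket" },
        object: { key: "emails/abc123", size: 2048 },
    },
};
```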

For serving the RSS feeds, I built the API using the Hono web framework.

Some notes

  • I chose to use a single Lambda function to handle the entire process of generating the newsletter gist, keeping that part self-contained. In a previous article, I explored another approach using a Step Function to interact with an LLM, as it avoids paying for an active Lambda function while waiting for the LLM response, but it requires a more complex setup.
  • The API is deployed using a Function URL (FURL). I use CloudFront Origin Access Control (OAC) to restrict access to the Lambda function URL origin.

Solution details

I built this solution using Node.js and TypeScript for the function code and Terraform for IaC.

TL;DR

You will find the complete repo of this service here 👉https://github.com/ziedbentahar/smart-feeds

1 — Email handling — Configuring SES

To get started with SES handling incoming emails, we’ll add an MX record to our domain’s DNS configuration. Next, we’ll create an SES receipt rule to process all incoming emails sent to @my-domain.com. This rule includes an action to deliver the raw email content to an S3 bucket.

Here is how to define this in Terraform:

...
resource "aws_route53_record" "email_mx_records" {
  zone_id = var.subdomain_zone.id
  name    = local.email_subdomain
  type    = "MX"
  ttl     = "600"
  records = [
    "10 inbound-smtp.us-east-1.amazonses.com",
    "10 inbound-smtp.us-east-1.amazonaws.com",
  ]
}
...

resource "aws_ses_receipt_rule_set" "this" {
  rule_set_name = "${var.application}-${var.environment}-newsletter-rule-set"
}
resource "aws_ses_receipt_rule" "this" {
  name          = "${var.application}-${var.environment}-to-bucket"
  rule_set_name = aws_ses_receipt_rule_set.this.rule_set_name
  recipients = ["${local.email_subdomain}"]
  enabled       = true
  scan_enabled  = true
  s3_action {
    position = 1
    bucket_name = aws_s3_bucket.email_bucket.bucket
    object_key_prefix = "${local.emails_prefix}"
  }
  depends_on = [
    aws_ses_receipt_rule_set.this,
    aws_s3_bucket.email_bucket,
    aws_s3_bucket_policy.email_bucket_policy
  ]
}
resource "aws_ses_active_receipt_rule_set" "this" {
  rule_set_name = aws_ses_receipt_rule_set.this.rule_set_name
  depends_on = [aws_ses_receipt_rule_set.this]
}

We also need to update the bucket policy to allow SES to write to the emails bucket:

resource "aws_s3_bucket_policy" "email_bucket_policy" {
  bucket = aws_s3_bucket.email_bucket.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect    = "Allow",
        Principal = { Service = "ses.amazonaws.com" },
        Action    = "s3:PutObject",
        Resource  = "${aws_s3_bucket.email_bucket.arn}/*",
        Condition = {
          StringEquals = {
            "aws:Referer" = data.aws_caller_identity.current.account_id
          }
        }
      }
    ]
  })
}

After deployment, you can view the details of the SES receipt rule in the console:

Receipt rule details

2— Generating newsletter gist

The ‘Process Email’ Lambda function is invoked by an EventBridge rule whenever a new object is created in the inbox bucket. Let’s look at the steps involved in generating the newsletter gist:

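import { basename } from "node:path";
import { nanoid } from "nanoid";
// The other helpers used below (getRawEmailContent, parseEmail, generateNewsletterGist and so on) are defined elsewhere in the repo.
export const lambdaHandler = async (event: S3ObjectCreatedNotificationEvent) => {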
    const rawContent = await getRawEmailContent(event.detail.object.key);
    const emailId = basename(event.detail.object.key);
    if (!rawContent) {
        throw new Error("Email content not found");
    }
    const { newsletterEmailFrom, newsletterEmailTo, html, date, subject } = await parseEmail(rawContent);
    const feedsConfigs = await getFeedConfigurationsBySenderEmail(newsletterEmailFrom);
    if (feedsConfigs.length === 0) {
        console.warn(`No feed config found for ${newsletterEmailFrom}`);
        return;
    }
    const shortenedLinks = new Map<string, string>();
    const markdown = generateMarkdown(html, {
        shortenLinks: true,
        shortener: (href) => {
            const shortened = nanoid();
            shortenedLinks.set(shortened, href);
            return shortened;
        },
    });
    for (const [shortened, original] of shortenedLinks) {
        await addShortenedLink(original, shortened);
    }
    const output = await generateNewsletterGist(markdown);
    if (!output) {
        throw new Error("Failed to generate newsletter gist");
    }
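    // Promise.allSettled never rejects, so inspect each result to surface failures.
    const results = await Promise.allSettled(
        feedsConfigs.map((feedConfig) =>
            addNewItemToFeed(feedConfig.feedId, {
                feedId: feedConfig.feedId,
                date,
                title: subject,
                emailFrom: newsletterEmailFrom!,
                id: emailId,
                ...output,
            })
        )
    );
    for (const result of results) {
        if (result.status === "rejected") {
            console.error(result.reason);
        }
    }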

};

🔎 Let’s zoom in:

  • First, the raw email content is retrieved from the inbox S3 bucket. This content is then parsed to extract the HTML content, the sender, and other relevant details used downstream.
  • Once we confirm the sender is on the allow list, the generateMarkdown function converts the email content into Markdown. During the transformation, unnecessary elements such as the head, style, and script tags are stripped from the email’s HTML content.
  • As I am interested in capturing relevant shared content, which typically contains links to original sources, the generateMarkdown function extracts these links and replaces them with short IDs. These IDs are used in the prompt instead of the full links, which helps reduce the input context length when invoking the model.
  • The short IDs are stored in a table, linked to the original URLs, and used in the RSS feed items.
  • generateNewsletterGist generates the prompt and invokes the model.
  • Finally, addNewItemToFeed stores the structured output in the feeds table.
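
To make the link-shortening step concrete, here is a small self-contained sketch. It is not the article's implementation: it applies a regular expression to Markdown that has already been generated instead of using a turndown rule, and `shortenMarkdownLinks` and the counter-based IDs are assumptions standing in for nanoid and the links table.

```typescript
// Replaces every Markdown link target with a short id and records the id-to-URL mapping.
const shortenMarkdownLinks = (markdown: string, nextId: () => string) => {
    const idToUrl: Record<string, string> = {};
    const text = markdown.replace(
        /\[([^\]]*)\]\(([^)\s]+)\)/g,
        (_match, label: string, url: string) => {
            const id = nextId();
            idToUrl[id] = url;
            return `[${label}](${id})`;
        }
    );
    return { text, idToUrl };
};

// Deterministic ids keep the example reproducible; random ids would be used in practice.
let nextNumber = 0;
const sampleMarkdown =
    "Read [the post](https://example.com/a/very/long/path?utm_source=newsletter) and [the video](https://example.com/watch?v=42).";
const shortenedSample = shortenMarkdownLinks(sampleMarkdown, () => `id${++nextNumber}`);
```

The text sent to the model then carries `id1` and `id2` instead of full URLs, while the mapping keeps enough information to restore the original links later.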

Below are the details of the generateMarkdown function, which relies on the turndown library:

import TurndownService from "turndown";
const generateMarkdown = (
    html: string,
    options: 
      { shortenLinks: true; shortener: (href: string) => string } |
      { shortenLinks: false }
): string => {
    const turndownService = new TurndownService({});
    turndownService.addRule("styles-n-headers", {
        filter: ["style", "head", "script"],
        replacement: function (_) {
            return "";
        },
    });
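    if (options.shortenLinks) {
        // Capture the shortener here so its narrowed type is available inside the callback.
        const { shortener } = options;
        turndownService.addRule("shorten-links", {
            filter: "a",
            replacement: function (content, node) {
                // Turndown types `node` broadly; a node matched by the "a" filter is an element.
                const href = (node as HTMLElement).getAttribute("href");
                if (href) {
                    return `[${content}](${shortener(href)})`;
                }
                return content;
            },
        });
    }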
    const markdown = turndownService.turndown(html);
    return markdown;
};

One important part is the prompt I use to generate the newsletter gist:

const prompt = `
Process the provided newsletter issue content in markdown format and generate a structured JSON output by performing the following tasks and adhering to the constraints:
<tasks> 
    * Summarize the most important topics in this newsletter. 
    * Identify and extract the list of content shared in the newsletter, including: 
        * Key topics, extracted as paragraphs. 
        * Articles 
        * Tutorials. 
        * Key events
    * For shared content, extract the most relevant link. For each link, generate a summary sentence related to it. Do not create a link if one is not provided in the newsletter. 
    * Exclude any irrelevant content, such as unsubscribe links, social media links, or advertisements. 
    * Do not invent topics or content that is not present in the newsletter. 
</tasks>
Here is the expected JSON schema for the output: 
<output-json-schema>
{{output_json_schema}}
</output-json-schema>
Here is the newsletter content: 
<newsletter-content>
{{newsletter_content}}
</newsletter-content>
`;
export const generatePromptForEmail = (newsletterContent: string, outputJsonSchema: string) => {
    // Use replacer functions so that "$" sequences in the inputs are inserted literally.
    return prompt
        .replace("{{newsletter_content}}", () => newsletterContent)
        .replace("{{output_json_schema}}", () => outputJsonSchema);
};

To ensure the LLM generates the expected result in JSON, I provide the JSON schema for the output structure. Instead of hardcoding the output schema in the prompt, I define the schema of the newsletter gist using Zod and infer both the TypeScript type and the JSON schema from it. This way, any changes to the schema are also reflected in the LLM output:

import { z } from "zod";
import zodToJsonSchema from "zod-to-json-schema";
const linkSchema = z.object({
    text: z.string(),
    url: z.string(),
});
export const newsletterGist = z.object({
    summary: z.string(),
    topics: z.array(z.string()),
    links: z.array(linkSchema),
});
export type NewsletterGist = z.infer<typeof newsletterGist>;
export const newsletterGistSchema = zodToJsonSchema(newsletterGist);

To invoke the model, I use the Bedrock Converse API, which allows me to write the code once and use it with different models:

const prompt = generatePromptForEmail(markdown, JSON.stringify(newsletterGistSchema));
const result = await bedrockClient.send(
    new ConverseCommand({
        modelId: process.env.MODEL_ID,
        system: [{ text: "You are an advanced newsletter content extraction and summarization tool." }],
        messages: [
            {
                role: "user",
                content: [
                    {
                        text: prompt,
                    },
                ],
            },
            {
                // Prefilled assistant turn: the model continues the JSON object from here.
                role: "assistant",
                content: [
                    {
                        text: "{",
                    },
                ],
            },
        ],
    })
);

Since I want to enforce a JSON output, I’ll need to prefill the assistant’s message with an opening {. This is specific to Claude models.
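
Because the opening brace is supplied in the request, the model's reply continues after it, so the brace has to be added back before parsing. The following is a minimal sketch under that assumption; `parseGistFromCompletion` and the hand-written type guards are hypothetical and stand in for validation with the Zod schema shown earlier.

```typescript
type GistLink = { text: string; url: string };
type Gist = { summary: string; topics: string[]; links: GistLink[] };

const isStringArray = (value: unknown): value is string[] =>
    Array.isArray(value) && value.every((entry) => typeof entry === "string");

const isGistLink = (value: unknown): value is GistLink => {
    if (typeof value !== "object" || value === null) return false;
    const candidate = value as Record<string, unknown>;
    return typeof candidate.text === "string" && typeof candidate.url === "string";
};

const isGist = (value: unknown): value is Gist => {
    if (typeof value !== "object" || value === null) return false;
    const candidate = value as Record<string, unknown>;
    return (
        typeof candidate.summary === "string" &&
        isStringArray(candidate.topics) &&
        Array.isArray(candidate.links) &&
        candidate.links.every(isGistLink)
    );
};

// `completion` is the text returned by the model, for example the value read from
// result.output?.message?.content?.[0]?.text in the Converse response.
const parseGistFromCompletion = (completion: string, prefill = "{"): Gist | undefined => {
    try {
        const parsed: unknown = JSON.parse(prefill + completion);
        return isGist(parsed) ? parsed : undefined;
    } catch {
        return undefined; // the reply was not valid JSON
    }
};
```

Returning `undefined` instead of throwing leaves the decision to the caller, which matches the handler shown earlier, where a missing output is turned into an error.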

3 — Serving the newsletters gists as an RSS feed

Working with Hono is a breeze. It simplifies many aspects of defining web APIs while supporting Lambda natively. This API serves multiple routes, and I chose to deploy it as a mono-lambda (AKA Lambdalith) to simplify the infrastructure definition:

import { Hono } from "hono";
import { handle } from "hono/aws-lambda";
import { feeds } from "./routes/feeds";
import { newsletters } from "./routes/newsletters";
import { links } from "./routes/links";
import { home } from "./routes/home";
export const app = new Hono();
app.route("/", home);
app.route("/feeds", feeds);
app.route("/links", links);
app.route("/newsletters", newsletters);
export const handler = handle(app);

Easy! The feeds route generates the RSS feed from the newsletter gists already stored in the feeds table:

export const feeds = new Hono().get("/:id/rss", async (c) => {
    const feedId = c.req.param("id");
    const feedConfig = await getFeedConfig(feedId);
    let feedItems: NewsletterIssueGist[] = [];
    for await (const items of getFeedItems(feedId)) {
        feedItems = [...items.map((item) => item.content), ...feedItems];
    }
    const rssFeedItems = feedItems.reduce(
        (acc, item) => {
            acc[item.id] = {
                item: {
                    title: item.title,
                    description: html`<div>
                        <section>📩 ${item.emailFrom}</section>
                        <section>📝 ${item.summary}</section>
                        <section>
                            <div>📝 Topics</div>
                            <ul>
                                ${item.topics.map((t) => {
                                    return `<li>${t}</li>`;
                                })}
                            </ul>
                        </section>
                        <section>
                            <div>
                                <a href="https://${process.env.API_HOST}/newsletters/${item.id}"
                                    >📰 Open newsletter content</a
                                >
                            </div>
                        </section>
                        <section>
                            <ul>
                                ${item.links.map((l) => {
                                    return `
                                            <li>
                                                <a href="https://${process.env.API_HOST}/links/${l.url}"
                                                    >🔗 ${l.text}</a
                                                >
                                            </li>
                                        `;
                                })}
                            </ul>
                        </section>
                    </div>`.toString(),
                    guid: item.id,
                    link: `https://${process.env.API_HOST}/newsletters/${item.id}`,
                    author: item.emailFrom,
                    pubDate: () => new Date(item.date).toUTCString(),
                },
            };
            return acc;
        },
        {} as Record<string, unknown>
    );
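    const feed = toXML(
        {
            _name: "rss",
            _attrs: {
                version: "2.0",
            },
            _content: {
                channel: [
                    {
                        title: feedConfig?.name,
                    },
                    {
                        description: feedConfig?.description,
                    },
                    {
                        link: `https://${process.env.API_HOST}/feeds/${feedId}/rss`,
                    },
                    {
                        // RSS 2.0 expects RFC 822-style dates, which toUTCString() produces.
                        lastBuildDate: () => new Date().toUTCString(),
                    },
                    {
                        pubDate: () => new Date().toUTCString(),
                    },
                    {
                        language: "en",
                    },
                    Object.values(rssFeedItems),
                ],
            },
        },
        { header: true, indent: "  " }
    );
    // Serve the feed with an XML content type rather than text/plain.
    return c.body(feed, 200, { "Content-Type": "application/rss+xml; charset=utf-8" });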
});

Here I am using the [jstoxml](https://www.npmjs.com/package/jstoxml) package to convert the newsletter gist structure to the RSS feed XML format.
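
For readers curious about what the serialized output looks like, here is a dependency-free sketch of a single RSS `<item>`. It does not reflect how jstoxml works internally; `escapeXml` and `renderRssItem` are hypothetical helpers, and the field values are up to the caller.

```typescript
type RssItem = { title: string; link: string; guid: string; description: string; pubDate: Date };

// Escapes the five characters that are special in XML text and attribute values.
const escapeXml = (value: string): string =>
    value
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&apos;");

// Renders one item; HTML in the description is carried as escaped text.
const renderRssItem = (item: RssItem): string =>
    [
        "<item>",
        `  <title>${escapeXml(item.title)}</title>`,
        `  <link>${escapeXml(item.link)}</link>`,
        `  <guid>${escapeXml(item.guid)}</guid>`,
        `  <description>${escapeXml(item.description)}</description>`,
        `  <pubDate>${item.pubDate.toUTCString()}</pubDate>`,
        "</item>",
    ].join("\n");
```

Escaping the description is what lets an HTML summary travel inside the XML document; the feed reader unescapes it before rendering.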

The other routes exposed by this API include /newsletters, which renders the HTML of the received email already stored in the inbox bucket, and /links, which redirects the caller to the original content link using the short link id.
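
The lookup behind the /links route can be sketched as a pure function. This is an illustration only: `resolveShortLink`, the `RedirectResult` type, and the sample mapping are assumptions, and the real service reads the mapping from a table. In a Hono handler, the two outcomes would typically map to `c.redirect(...)` and `c.notFound()`.

```typescript
type RedirectResult = { statusCode: 302; location: string } | { statusCode: 404 };

// Hypothetical lookup: resolve a short id to its original URL, or report "not found".
const resolveShortLink = (id: string, links: Record<string, string>): RedirectResult => {
    // hasOwnProperty guards against ids that collide with inherited keys such as "toString".
    if (Object.prototype.hasOwnProperty.call(links, id)) {
        return { statusCode: 302, location: links[id] };
    }
    return { statusCode: 404 };
};

// Made-up mapping standing in for the links table.
const storedLinks: Record<string, string> = {
    V1StGXR8: "https://example.com/blog/event-driven-email",
};
```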

And finally, here is the CloudFront OAC configuration for the API exposed via a function URL:

resource "aws_cloudfront_origin_access_control" "this" {
  name                              = "${var.application}-${var.environment}-api-oac"
  origin_access_control_origin_type = "lambda"
  signing_behavior                  = "always"
  signing_protocol                  = "sigv4"
}
resource "aws_cloudfront_distribution" "this" {
  origin {
    domain_name              = replace(aws_lambda_function_url.api.function_url, "/https:\\/\\/|\\//", "")
    origin_access_control_id = aws_cloudfront_origin_access_control.this.id
    origin_id                = "api"
    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }
  enabled         = true
  is_ipv6_enabled = true
  aliases = [local.api_subdomain]
  default_cache_behavior {
    allowed_methods  = ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "api"
    cache_policy_id        = data.aws_cloudfront_cache_policy.disabled.id
    viewer_protocol_policy = "allow-all"
    min_ttl                = 0
  }
  price_class = "PriceClass_200"
  restrictions {
    geo_restriction {
      restriction_type = "none"
      locations        = []
    }
  }
  viewer_certificate {
    acm_certificate_arn      = aws_acm_certificate.this.arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
  depends_on = [aws_acm_certificate_validation.cert_validation]
}


Once the API is deployed, the RSS feed is available at this URL: https://<some domain>/feeds/<feed-id>/rss

Getting `Awesome serverless updates` newsletters gists

Here is an example of how it renders in an RSS reader application.

`Awesome serverless` newsletters gists from an RSS reader

Pretty neat, isn’t it? This way, I can follow updates from the community all in one place! 👍

Wrapping up

I had fun building this tool. In my initial iteration, I intended to leverage Bedrock’s prompt management and prompt flow features. Unfortunately, at the time of writing, these services were not mature enough, but I might explore them in the future.

The email automation pattern used here isn’t limited to processing newsletters. It can be applied to a variety of other use cases, such as customer support systems or invoice and receipt handling.

As always, you can find the full source code, ready to be adapted and deployed, here 👇

https://github.com/ziedbentahar/smart-feeds

Hope you enjoyed it!

Resources

How to receive emails using Amazon SES.

Hono with AWS Lambda
