Part 4: Adding Summaries to the Python Bookmarker App
Matt Butcher
Posted on January 4, 2024
We’re up to Part 4 of our 5-part series on building a bookmarker app with Python, WebAssembly, and the open source Spin framework. In previous parts, we combined Spin’s key value storage with Jinja2 templates and a router to build a fully functional app.
Let’s go one more step. Let’s add a summary to our bookmarks. But instead of requiring the end user to generate the summary, let’s do it automatically. We’ll do this in two phases. First, we’ll just parse the HTML and return some text from the bookmarked page. Then in Part 5 we’ll go one more step and use an AI-powered LLM (Large Language Model) to generate a summary for us.
As we’ll see, even though this sounds sophisticated, it’s not terribly complex. By the end, our total code will still be under 150 lines.
Fetching Web Content
Let’s amp up our bookmarking app by fetching a bookmark and saving a content preview. In Part 5 we’ll add some LLM support, but to start with, let’s do something a little easier: We’ll fetch the remote page and grab just the title from the HTML document.
To do this, we will change the structure of our KV Store object to look like this:
{
"title": "SOME TITLE",
"url": "SOME URL",
"summary": "THIS IS NEW and will be the summary"
}
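To make the new shape concrete, here is a minimal sketch of appending one record and serializing the list. FakeStore and append_bookmark() are hypothetical names used only for illustration. The real app uses the store returned by kv_open_default(), and this sketch assumes that store hands back bytes (or nothing) for a key.

```python
import json


class FakeStore:
    """Hypothetical in-memory stand-in for the Spin key value store."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Assumption: a missing key yields None
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def append_bookmark(store, title, url, summary):
    """Append one record with the new "summary" field and save the list."""
    entry = store.get("bookmarks") or b"[]"
    bookmarks = json.loads(entry)
    bookmarks.append({"title": title, "url": url, "summary": summary})
    store.set("bookmarks", json.dumps(bookmarks).encode("utf-8"))
    return bookmarks


store = FakeStore()
append_bookmark(store, "Example", "https://example.com/", "Example Domain")
print(store.get("bookmarks"))
```

The whole bookmark list lives under a single key as one JSON document, so adding a field only changes what we put into each dictionary.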
So we need to change our add_url() function:
def add_url(request):
    # This gets us the encoded form data
    params = parse_qs(request.body, keep_blank_values=True)
    title = params[b"title"][0].decode()
    url = params[b"url"][0].decode()

    # Open key value storage
    store = kv_open_default()

    # Get the existing bookmarks or initialize an empty bookmark list
    bookmark_entry = store.get("bookmarks") or b"[]"
    bookmarks = json.loads(bookmark_entry)

    # THE NEW PART
    # Generate a page summary
    summary_text = summarize_page(url)
    # Add our new entry.
    bookmarks.append({"title": title, "url": url, "summary": summary_text})
    # THAT'S ALL

    # Store the modified list in key value store
    new_bookmarks = json.dumps(bookmarks)
    store.set("bookmarks", bytes(new_bookmarks, "utf-8"))

    # Direct the client to go back to the index.html
    return Response(303, {"location": "/index.html"})
We only add one line and change one line:
# Generate a page summary
summary_text = summarize_page(url)
# Add our new entry.
bookmarks.append({"title": title, "url": url, "summary": summary_text})
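The b"title" and b"url" keys in add_url() may look odd. A small sketch, assuming the request body arrives as bytes (which those byte-string keys suggest), shows why: when parse_qs() is given bytes, it returns bytes keys and values. The sample form body and the extra note field are made up for illustration.

```python
from urllib.parse import parse_qs

# Hypothetical form body, URL-encoded the way a browser would send it
body = b"title=Fermyon&url=https%3A%2F%2Fwww.fermyon.com%2F&note="

# keep_blank_values=True keeps the empty "note" field in the result
params = parse_qs(body, keep_blank_values=True)

# Keys and values are bytes, so we index with b"..." and decode to str
title = params[b"title"][0].decode()
url = params[b"url"][0].decode()
note = params[b"note"][0].decode()

print(title, url, repr(note))
```

Each value is a list because a form field may appear more than once, which is why the code takes element [0].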
But now we need to write the summarize_page() function. For this first go-around, what we want to do is:
- Fetch the URL
- Parse the returned HTML
- Get just the title tag’s content
Again, this is just our first pass. We’ll make it better in a moment. But to do even this is going to require a couple of functions and a class:
import json
from html.parser import HTMLParser  # NEW
from http_router import Router
from jinja2 import Environment, FileSystemLoader, select_autoescape
from spin_http import Response, Request, http_send  # NEW
from spin_key_value import kv_open_default
from urllib.parse import urlparse, parse_qs

# Omitted the rest of the code


def summarize_page(url):
    req = Request("GET", url, {}, None)
    res = http_send(req)
    match res.status:
        # This is to support Spin runtimes that don't automatically
        # follow redirects. For Spin itself, it works fine without
        # this case.
        case 301 | 302 | 303 | 307 | 308:
            loc = res.headers["location"]
            print(f"following redirect to {loc}")
            return summarize_page(loc)
        case 200:
            return summarize(res.body.decode("utf-8"))
        case _:
            return "Unable to load preview"


def summarize(doc):
    parser = HTMLTitleParser()
    parser.feed(doc)
    return parser.title_data


class HTMLTitleParser(HTMLParser):
    title_data = ""
    track = False

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "title":
            self.track = True

    def handle_endtag(self, tag: str) -> None:
        if tag == "title":
            self.track = False

    def handle_data(self, data: str) -> None:
        if self.track:
            self.title_data = data
Let’s start with the summarize_page() function. It is our utility function for fetching the remote URL and then getting the page body. It uses Spin’s built-in HTTP client. Again, Spin’s security model requires us to grant some permissions to the app before it is allowed to make external HTTP requests. So we need to add this to spin.toml:
[component.bookmarker]
source = "app.wasm"
key_value_stores = ["default"]
allowed_outbound_hosts = ["https://*:*"] # NEW
files = ["index.html"]
[component.bookmarker.build]
command = "spin py2wasm app -o app.wasm"
watch = ["app.py", "Pipfile"]
The allowed_outbound_hosts parameter lets us declare which external hosts our app is allowed to access. Using "https://*:*" lets us access any HTTPS endpoint. The Spin HTTP documentation covers the format in more detail.
In the case where we can successfully fetch the remote URL (and status is 200), we pass the HTML body on to summarize(). In the case of 3XX-level responses (redirects), we follow the redirects. In all other cases (404, 500, 403, etc.), we just return a message that says we couldn’t load a preview.
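Following redirects recursively works, but a misbehaving site could redirect forever. Here is one possible sketch of a hop limit. The names follow_redirects(), fetch, and max_hops are hypothetical, and fetch is a stand-in callable returning (status, headers, body) so the example runs without Spin. It also assumes header names are lowercase, and it resolves relative Location values with urljoin().

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_hops=5):
    """Fetch url, following at most max_hops redirects.

    Returns (status, body) of the final response, or (None, b"")
    if the redirect chain is longer than max_hops.
    """
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        if status in REDIRECT_STATUSES and "location" in headers:
            # Location may be relative, so resolve it against the current URL
            url = urljoin(url, headers["location"])
            continue
        return status, body
    return None, b""


# Hypothetical canned responses standing in for real HTTP calls
pages = {
    "https://example.com/a": (301, {"location": "/b"}, b""),
    "https://example.com/b": (200, {}, b"<title>B</title>"),
    "https://example.com/loop": (302, {"location": "/loop"}, b""),
}


def fake_fetch(url):
    return pages[url]


print(follow_redirects("https://example.com/a", fake_fetch))
print(follow_redirects("https://example.com/loop", fake_fetch))
```

In the app, the (None, b"") case could map to the same "Unable to load preview" message used for other failures.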
Now let’s take a look at the first version of summarize(). We are going to build a better one later, but for now, it will simply grab the title text out of the HTML:
def summarize(doc):
    parser = HTMLTitleParser()
    parser.feed(doc)
    return parser.title_data
This creates a new HTMLTitleParser, parses the doc, and then returns the title. We will rewrite this one later. In this version, though, it uses a basic HTML parser that we wrote:
class HTMLTitleParser(HTMLParser):
    title_data = ""
    track = False

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "title":
            self.track = True

    def handle_endtag(self, tag: str) -> None:
        if tag == "title":
            self.track = False

    def handle_data(self, data: str) -> None:
        if self.track:
            self.title_data = data
Python’s core libraries provide an event-based HTML parser. The way the parser works is that it walks through a document and as it parses, it calls handler functions for each token it parses. By extending that parser, we can intercept three events that we care about:
- When the parser hits the start of a tag (handle_starttag())
- When the parser hits the end of a tag (handle_endtag())
- When the parser gets character data (text) between tags (handle_data())
What we do in our parser extension is check whether we’re in the <title> tag, and if so, get the text data until we hit the </title> tag.
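To see the event order for yourself, here is a small sketch that records every callback instead of filtering for the title. EventLogger is a hypothetical class for illustration only, and the HTML snippet is made up. It uses nothing beyond Python’s standard html.parser module.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records each parser callback as an (event, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<title>Hi</title><p>x</p>")
logger.close()  # flush anything still buffered

for event in logger.events:
    print(event)
```

The title text arrives as a data event between the start and end events for the title tag, which is exactly the window our track flag marks.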
Putting all of this together, each time we add a new bookmark:
- The add_url() function will call summarize_page() with the URL of the page we want to bookmark
- summarize_page() will fetch the HTML from the URL, and then pass it to summarize()
- And summarize() will use HTMLTitleParser to get the title out of the document.
- That data is then returned back to add_url(), which will store the summary alongside title and url in our JSON document.
All that is left to do now is alter our template to show the summary.
Add Summary to the Template
In our index.html Jinja template, we print a list of all of the bookmarks. To display our new summary field, all we need to do is add it to the output:
{% for bookmark in bookmarks %}
<li><a href="{{bookmark.url}}">{{bookmark.title}}</a>: {{bookmark.summary}}</li>
{% endfor %}
We made a minor formatting change, adding : after the </a>, and then printing the summary with {{bookmark.summary}}.
At this point, if we save a new bookmark, the main index.html page of our app will now look like this:
Note that the last, newly added, link now has a summary. The previous ones do not because they were created before we added the new summarize_page() logic.
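If you would rather give the older entries an explicit summary field, one option is a one-time backfill. This is only a sketch: backfill_summaries() and its placeholder parameter are hypothetical, and it works on the raw bytes of the JSON list rather than on the Spin store itself, so wiring it into the app is left as an assumption.

```python
import json


def backfill_summaries(bookmark_entry, placeholder=""):
    """Return the bookmark list as bytes, adding a "summary" key where missing."""
    bookmarks = json.loads(bookmark_entry or b"[]")
    for bookmark in bookmarks:
        # setdefault leaves existing summaries untouched
        bookmark.setdefault("summary", placeholder)
    return json.dumps(bookmarks).encode("utf-8")


# Hypothetical stored value: one old-style entry and one new-style entry
old_entry = json.dumps([
    {"title": "A", "url": "https://a.example/"},
    {"title": "B", "url": "https://b.example/", "summary": "kept"},
]).encode("utf-8")

print(backfill_summaries(old_entry))
```

Existing summaries are preserved, and entries saved before this change get the placeholder instead of having no field at all.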
It's time to move on to part 5, where we'll use AI (specifically, the LLaMa2 LLM) to read a webpage and generate a summary for us.