Task: Save Article to Markdown
Kees C. Bakker
Posted on September 3, 2022
WordPress rules, but I would like my content to be on other platforms as well. Some platforms, like DEV, use Markdown, but I struggle to import my articles there. That's why I created a small application to convert an article to Markdown.
Final result
Just paste the URL of this blog into this small repl.it program and watch how it converts the article into a big string of Markdown:
Packages
This solution uses Node.js. NPM has some great packages to work with:
- node-fetch - to download the HTML. Depending on your version of Node.js, you might not need this package, as recent versions have fetch built in. I use version 2, as I don't use ESM.
- linkedom - to parse HTML into a workable DOM. I used to use jsdom, but I switched for performance reasons.
- node-html-markdown - to convert HTML into Markdown.
Install them like:
npm install node-fetch@2 linkedom node-html-markdown
npm install -D @types/node-fetch
Simple scraper
We're going to do the following:
- Fetch the text of the URL. This is HTML, of course.
- Parse it to DOM nodes.
- Detect the article node.
- Convert the article node to Markdown.
This results in the following lines of code:
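import fetch from 'node-fetch'
import { parseHTML } from 'linkedom'
import { NodeHtmlMarkdown } from 'node-html-markdown'

async function scrape(url: string) {
  // download the page as text (HTML)
  let f = await fetch(url)
  let txt = await f.text()

  // parse the HTML into a DOM
  const { document } = parseHTML(txt)

  // custom parsing:
  // parseCodeFields(document)
  // parseEmbeds(document)

  // detect the article node, from most specific to least specific
  let article = (
    document.querySelector('article .entry-content') ||
    document.querySelector('article .crayons-article__main') ||
    document.querySelector('article') ||
    document.querySelector('body'))

  // convert the article to Markdown
  let html = article?.innerHTML || ""
  let content = NodeHtmlMarkdown.translate(html).trim()

  // let header = parseHeader(document)
  // content = header + content

  return content
}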
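The selector fallback in scrape can also be written as an ordered list, which makes it easier to add support for another platform. This is a minimal sketch, not part of the original code: the names Queryable, ARTICLE_SELECTORS and findFirst are made up for illustration, and the comments about which platform uses which class are my assumption.

```typescript
// Minimal shape of anything that can be queried with a CSS selector.
// A DOM document (like the one linkedom returns) fits this shape.
interface Queryable<T> {
  querySelector(selector: string): T | null
}

// Hypothetical list: selectors ordered from most to least specific.
const ARTICLE_SELECTORS = [
  'article .entry-content',         // assumed: common in WordPress themes
  'article .crayons-article__main', // assumed: used by DEV
  'article',
  'body',
]

// Returns the first node that matches one of the selectors, or null.
function findFirst<T>(root: Queryable<T>, selectors: string[]): T | null {
  for (const selector of selectors) {
    const match = root.querySelector(selector)
    if (match) return match
  }
  return null
}
```

Inside scrape, the four chained querySelector calls could then become findFirst(document, ARTICLE_SELECTORS).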
Code Language Support
Now, my WordPress generates <pre class="lang-ts"><code></code></pre> blocks. It looks like node-html-markdown only picks up the language from <pre><code class="language-ts"></code></pre>. That's easily fixed by adding some extra processing before converting the document to Markdown:
function parseCodeFields(document: Document) {
  document.querySelectorAll("pre code").forEach(code => {
    // look for a lang-* class on the parent <pre>
    let lang = [...code.parentElement?.classList || []]
      .find(x => x.startsWith("lang-"))
    if (!lang) return

    // copy it to the <code> element as language-*
    lang = lang.replace("lang-", "language-")
    code.classList.add(lang)
  })
}
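To make the class mapping easy to check without a DOM, the string handling can be pulled into a small pure function. This is only a sketch; toLanguageClass is a hypothetical helper name that does not appear in the original code.

```typescript
// Hypothetical helper: given the class names of a <pre> element,
// return the matching language-* class for the <code> element,
// or undefined when there is no lang-* class.
function toLanguageClass(classNames: string[]): string | undefined {
  const lang = classNames.find(name => name.startsWith('lang-'))
  if (!lang) return undefined
  return 'language-' + lang.slice('lang-'.length)
}

// <pre class="wp-block-code lang-ts"> maps to class "language-ts"
console.log(toLanguageClass(['wp-block-code', 'lang-ts']))
```

The class name wp-block-code in the usage line is just an example value.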
Embed rich content
Fortunately, dev.to supports liquid tags to embed rich content like repl.it and tweets. Let's parse our iframe elements into liquid tags:
function parseEmbeds(document: Document) {
  document.querySelectorAll('iframe').forEach(iframe => {
    if (!iframe.src) return

    // use the host as tag type and the path as tag value
    const url = new URL(iframe.src)
    const type = url.host
    const name = url.pathname

    // add a paragraph with the liquid tag just before the iframe
    const p = document.createElement("p")
    const n = document.createTextNode(`{% ${type} ${name} %}`)
    p.appendChild(n)
    iframe.parentNode?.insertBefore(p, iframe)
  })
}
This will not work for every embed, but it will get you started.
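One way to make more embeds work is to map known hosts to the tag name the target platform expects. The sketch below is an assumption on my side, not the code behind this article: toLiquidTag and LIQUID_NAMES are hypothetical names, and the host-to-tag table should be checked against the editor guide of the platform you publish to.

```typescript
// Hypothetical table: iframe host -> liquid tag name (verify before use).
const LIQUID_NAMES = new Map<string, string>([
  ['replit.com', 'replit'],
  ['repl.it', 'replit'],
])

// Converts an iframe src into a liquid tag, or null for an invalid URL.
function toLiquidTag(src: string): string | null {
  let url: URL
  try {
    url = new URL(src)
  } catch {
    return null // not an absolute URL
  }

  const name = LIQUID_NAMES.get(url.host)
  if (!name) {
    // unknown host: fall back to host + path, like parseEmbeds does
    return `{% ${url.host} ${url.pathname} %}`
  }

  // known host: drop the leading slash; the query string is ignored
  const id = url.pathname.replace(/^\/+/, '')
  return `{% ${name} ${id} %}`
}

console.log(toLiquidTag('https://replit.com/@user/demo?embed=true'))
```

The URL in the usage line is a made-up example.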
Header support
To be complete, we also need to add a YAML header with the title, tags and the canonical URL. It requires some parsing, but it'll make things easier:
function parseHeader(document: Document) {
  let header = '---\n'

  // title: text of the first h1
  let title = (document.querySelector('h1')?.textContent || '').trim()
  if (title) {
    header += `title: ${title}\n`
  }

  // tags: unique, sorted, lowercase category and tag links
  let tags = [...document.querySelectorAll(".categories a, .tags a")]
    .map(a => (a.textContent || '').trim().toLowerCase())
    .filter(t => t)
  if (tags.length > 0) {
    tags.sort()
    let t = [... new Set(tags)].join(", ")
    header += `tags: [${t}]\n`
  }

  // canonical URL: points back to the original article
  let canonical = document.querySelector('link[rel=canonical]')?.getAttribute("href")
  if (canonical) {
    header += `canonical_url: ${canonical}\n`
  }

  header += '---\n\n'
  return header
}
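One thing to watch: a title that contains a colon followed by a space (like the title of this article) is not a valid unquoted YAML value. A possible fix is to quote the title. The sketch below assumes that JSON string syntax is acceptable as a YAML double-quoted string, which holds for typical titles; yamlString is a hypothetical helper name.

```typescript
// Hypothetical helper: wrap a value in double quotes and escape
// quotes and backslashes, so it can be used as a YAML value.
function yamlString(value: string): string {
  return JSON.stringify(value)
}

// Produces a title line that is safe to put in the header.
const line = `title: ${yamlString('Task: Save Article to Markdown')}`
console.log(line)
```

In parseHeader, this would mean writing the title line as `title: ${yamlString(title)}\n`.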
Final thoughts
I still need to find a better way to detect the language of code snippets, so I don't have to add them by hand. When I look at the result, I know one thing for sure: I'll keep using WordPress to write my blogs, as Markdown does not make it more readable!
Oh, and when you read this post on dev.to: it was created using this code (and yes, that's super meta 🤓).
Changelog
- 2022-08-31 Initial article.
- 2022-09-01 Fixed language support for WordPress code fields (see Code Language Support).
- 2022-09-03 Fixed embedding of repl.it through iframe parsing (see Embed rich content).
- 2022-09-03 Added title support.
- 2022-09-03 Added YAML header support (see Header support).