Task: Save Article to Markdown

WordPress rules, but I would like my content to be on other platforms as well. Some platforms, like DEV, use Markdown, and I struggle to import my articles there. That's why I created a small application that converts an article to Markdown.

Final result

Just paste the URL of this blog into this small repl.it program and watch how it converts the article into a big string of Markdown:

Packages

This solution uses Node.js. NPM has some great packages to work with:

node-fetch - to download the HTML. Node.js 18 and later ship a global fetch, so depending on your version you might not need this package. I use version 2, because version 3 is ESM-only and I don't use ESM.
linkedom - to parse HTML into a workable DOM. I used to use jsdom, but I switched for performance reasons.
node-html-markdown - to convert HTML to Markdown.

Install them like:

npm install node-fetch@2 linkedom node-html-markdown
npm install -D @types/node-fetch

Simple scraper

We're going to do the following:

Fetch the text of the URL. This is HTML, of course.
Parse it to DOM nodes.
Detect the article node.
Convert the article node to Markdown.

This results in the following lines of code:

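import fetch from 'node-fetch'
import { parseHTML } from 'linkedom'
import { NodeHtmlMarkdown } from 'node-html-markdown'

async function scrape(url: string) {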
  let f = await fetch(url)
  let txt = await f.text()

  const { document } = parseHTML(txt)

  // custom parsing:
  // parseCodeFields(document)
  // parseEmbeds(document)

  let article = (
    document.querySelector('article .entry-content') ||
    document.querySelector('article .crayons-article__main') ||
    document.querySelector('article') ||
    document.querySelector('body'))

  let html = article?.innerHTML || ""
  let content = NodeHtmlMarkdown.translate(html).trim()
  // let header = parseHeader(document)
  // content = header + content

  return content
}
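
The chain of || lookups in scrape tries the most specific selector first and falls back to broader ones. As a minimal sketch of the same idea, the helper below takes the selectors as a list, which might make it easier to add another platform later. The name firstMatch and the Queryable interface are hypothetical; they are not part of linkedom or of the code above, and the helper only assumes an object with a querySelector method.

```typescript
// Anything that exposes querySelector, e.g. a Document or an Element.
interface Queryable<T> {
  querySelector(selector: string): T | null
}

// Hypothetical helper: return the first node matched by any selector,
// trying the selectors in the order given (most specific first).
function firstMatch<T>(root: Queryable<T>, selectors: string[]): T | null {
  for (const selector of selectors) {
    const hit = root.querySelector(selector)
    if (hit) return hit
  }
  return null
}

// The selectors from scrape(), in the same order.
const articleSelectors = [
  'article .entry-content',
  'article .crayons-article__main',
  'article',
  'body',
]
```

With a parsed document, this would be called as firstMatch(document, articleSelectors).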

Code Language Support

My WordPress generates <pre class="lang-ts"><code></code></pre> blocks, but it looks like node-html-markdown only picks up the language from <pre><code class="language-ts"></code></pre>. That's easily fixed by adding some extra processing before converting the document to Markdown:

function parseCodeFields(document: Document) {
  document.querySelectorAll("pre code").forEach(code => {
    // take the first "lang-xx" class from the parent <pre>, if any
    let lang = [...code.parentElement?.classList || []]
      .find(x => x.startsWith("lang-"))

    if(!lang) return

    lang = lang.replace("lang-", "language-")
    code.classList.add(lang)
  })
}
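
The renaming step itself is plain string work, so it can be sketched and tried out without a DOM. The helper name toLanguageClass is hypothetical, and it assumes the lang- prefix described above.

```typescript
// Hypothetical helper: given the class names of a <pre>, return the
// "language-xx" class to put on the nested <code>, if there is one.
function toLanguageClass(classNames: string[]): string | undefined {
  const lang = classNames.find(name => name.startsWith('lang-'))
  if (!lang) return undefined
  // Swap only the leading prefix: "lang-ts" becomes "language-ts".
  return 'language-' + lang.slice('lang-'.length)
}

console.log(toLanguageClass(['wp-block-code', 'lang-ts'])) // language-ts
console.log(toLanguageClass(['wp-block-code']))            // undefined
```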

Embed rich content

Fortunately, dev.to supports liquid tags to embed rich content like repl.it and tweets. Let's turn each iframe element into a liquid tag:

function parseEmbeds(document: Document) {
  document.querySelectorAll('iframe').forEach(iframe => {
    if (!iframe.src) return

    const url = new URL(iframe.src)
    const type = url.host
    const name = url.pathname

    const p = document.createElement("p")
    const n = document.createTextNode(`{% ${type} ${name} %}`)
    p.appendChild(n)

    iframe.parentNode?.insertBefore(p, iframe)
  })
}

This will not work for every embed, but it will get you started.
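
One possible refinement, as a sketch: build the tag from the iframe URL in a separate function, so the output is easy to inspect. The name toLiquidTag is hypothetical, and the rule it uses (host without "www." and without the top-level domain as tag name, path without the leading slash as argument) is an assumption for illustration. Whether dev.to accepts the resulting tag still depends on the embed.

```typescript
// Hypothetical helper: turn an iframe src into a liquid tag string.
// Assumption: the tag name is the first label of the host once "www."
// is stripped, e.g. "replit.com" becomes "replit".
function toLiquidTag(src: string): string {
  const url = new URL(src)
  const type = url.hostname.replace(/^www\./, '').split('.')[0]
  // Drop the leading slash; the query string and hash are ignored.
  const name = url.pathname.replace(/^\//, '')
  return `{% ${type} ${name} %}`
}

console.log(toLiquidTag('https://replit.com/@example/demo?lite=true'))
// {% replit @example/demo %}
```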

Header support

To be complete, we also need to add a YAML header with the title, tags, and canonical URL. It requires some parsing, but it saves filling these in by hand on the other platform:

function parseHeader(document: Document) {

  let header = '---\n'

  let title = (document.querySelector('h1')?.textContent || '').trim()
  if (title) {
    header += `title: ${title}\n`
  }

  let tags = [...document.querySelectorAll(".categories a, .tags a")]
    .map(a => (a.textContent || '').trim().toLowerCase())
    .filter(t => t)
  if (tags.length > 0) {
    tags.sort()
    let t = [... new Set(tags)].join(", ")
    header += `tags: [${t}]\n`
  }

  let canonical = document.querySelector('link[rel=canonical]')?.getAttribute("href")
  if (canonical) {
    header += `canonical_url: ${canonical}\n`
  }

  header += '---\n\n'

  return header
}
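
A title that contains a colon (like the title of this post) can produce invalid YAML when it is written unquoted. A minimal sketch of one way around that: build the header from plain values and quote the title with JSON.stringify, whose output should also be a valid double-quoted YAML string for typical titles. The names buildHeader and ArticleMeta are hypothetical and not part of the code above.

```typescript
// Hypothetical shape for the values parseHeader reads from the document.
interface ArticleMeta {
  title?: string
  tags?: string[]
  canonical?: string
}

// Build the YAML header from plain values; empty fields are skipped.
function buildHeader(meta: ArticleMeta): string {
  const lines: string[] = ['---']
  if (meta.title) {
    // JSON.stringify adds double quotes and escapes inner quotes.
    lines.push(`title: ${JSON.stringify(meta.title)}`)
  }
  if (meta.tags && meta.tags.length > 0) {
    // De-duplicate, then sort alphabetically.
    const unique = Array.from(new Set(meta.tags)).sort()
    lines.push(`tags: [${unique.join(', ')}]`)
  }
  if (meta.canonical) {
    lines.push(`canonical_url: ${meta.canonical}`)
  }
  lines.push('---')
  return lines.join('\n') + '\n\n'
}

console.log(buildHeader({
  title: 'Task: Save Article to Markdown',
  tags: ['node', 'markdown', 'node'],
  canonical: 'https://example.com/post',
}))
```

The values could come from the same selectors parseHeader uses; only the string building changes.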

Final thoughts

~~I still need to find a better way to detect the language of code snippets, so I don't have to add them by hand.~~ When I look at the result, I know one thing for sure: I'll keep using WordPress to write my blogs, as Markdown does not make it more readable!

Oh, and when you read this post on dev.to: it was created using this code (and yes, that's super meta 🤓).

Changelog

2022-08-31 Initial article.
2022-01-09 Fixed language support for WordPress code fields (see Code Language Support).
2022-03-09 Fixed embedding of repl.it through iframe parsing (see Embed rich content).
2022-03-09 Added title support.
2022-03-09 Added YAML header support (see Header support).
