Michael Vigor

Outlined below is the setup for an AWS Lambda function that fetches the HTML for a URL, strips it back to just the essential article content, and then converts it to Markdown. To deploy it you’ll need an AWS account and the Serverless Framework installed.

Step 1 - Download the full HTML for the URL

First, get the full HTML of the URL being converted. As this is running in a Lambda function, I decided to try out an ultra-lightweight Node HTTP client called phin (which is 95% smaller than my usual favourite, Axios):

const phin = require('phin');

// Fetch the raw HTML for the given URL (phin returns the body as a Buffer)
const fetchPageHtml = async fetchUrl => {
  const response = await phin(fetchUrl);
  return response.body;
};
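In a Lambda it can help to fail fast when the fetch goes wrong. The following is a minimal sketch of a more defensive variant, not the function used in this post: the name `fetchPageHtmlChecked`, the 8 second timeout, and the UTF-8 decoding are all assumptions, and the optional `client` parameter is there only so a stub can stand in for phin when trying it out.

```javascript
// Hypothetical variant of fetchPageHtml with basic error handling.
// `client` defaults to phin; it is a parameter only so a stub can be swapped in.
const fetchPageHtmlChecked = async (fetchUrl, client = require('phin')) => {
  const response = await client({
    url: fetchUrl,
    followRedirects: true, // article URLs often redirect (http to https, short links)
    timeout: 8000,         // assumed value; keep it below the Lambda timeout
  });

  // Treat anything outside the 2xx range as a failure
  if (response.statusCode < 200 || response.statusCode >= 300) {
    throw new Error(`Fetching ${fetchUrl} failed with status ${response.statusCode}`);
  }

  // phin exposes the body as a Buffer, so decode it to a string (assumes UTF-8)
  return response.body.toString('utf8');
};
```

Throwing on a non-2xx status means the later steps never try to parse an error page as if it were the article.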

Step 2 - Convert to readable HTML

Converting to readable HTML is a feature originally offered by Instapaper (going back to 2008) as part of the core experience of a “read it later” service, and is now built into most browsers. Before converting to Markdown it’s a good idea to strip out the unnecessary parts of the HTML (adverts, menus, images, etc.) and keep just the text of the main article, displayed in a clean and less distracting way.

This process won’t work for every web page - it is designed for blog posts, news articles, etc., which have a clear “body content” section that can be the focus of the output.
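One way this step could look in Node is sketched below. It assumes the `jsdom` and `@mozilla/readability` packages, which the text above does not name, so treat the library choice as an assumption rather than the method used here. The function name `extractArticle` and the optional `libs` parameter are hypothetical; `libs` exists only so the sketch can be exercised without the packages installed.

```javascript
// Sketch only: assumes `jsdom` and `@mozilla/readability` are installed.
// `libs` is a hypothetical hook for swapping in stand-ins; by default the
// real packages are loaded.
const extractArticle = (html, pageUrl, libs = {
  JSDOM: require('jsdom').JSDOM,
  Readability: require('@mozilla/readability').Readability,
}) => {
  // Passing the page URL lets relative links resolve against the original site
  const dom = new libs.JSDOM(html, { url: pageUrl });

  // Readability looks for the main "body content" section of the document
  const article = new libs.Readability(dom.window.document).parse();

  // parse() returns null when no article body can be identified
  if (!article) return null;

  // `content` is the cleaned-up HTML of the article, ready for Markdown conversion
  return { title: article.title, html: article.content };
};
```

Returning `null` rather than throwing leaves the caller to decide what to do with pages that have no clear article body, such as home pages or search results.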


How To Convert HTML to Markdown with a Serverless Function