HTML to PDF using a Chrome puppet in the cloud

I’m going to take you through the process of setting up a headless Chrome browser that you can run on AWS and drive through an API to do most of the things a browser can do. Our target for today is to have Chrome navigate to a URL, wait for the page to fully load, and then create a PDF.

The Chromium team has released Puppeteer, a Node API for headless Chrome.

https://github.com/GoogleChrome/puppeteer

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

There is also a really useful site where you can try Puppeteer: https://try-puppeteer.appspot.com/. The sample code they provide to create a PDF looks like this:

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({
  path: 'hn.pdf',
  format: 'letter'
});
await browser.close();

The API used above is very well documented here. Looking at page.pdf, we see that the function takes an options object and returns a promise which resolves with a PDF buffer. The options give you a good deal of control. You can set a path to save the PDF if you don’t want to consume the buffer, and you can control headers, footers and page formatting, among other things.
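If you do consume the buffer rather than saving to a path, a quick sanity check can catch an empty or non-PDF result before it is passed on. This is only a sketch: it relies on the fact that PDF files start with the bytes %PDF-, and the helper name isPdfBuffer is hypothetical, not part of Puppeteer.

```typescript
// Hypothetical helper (not part of Puppeteer): checks the PDF magic number.
const PDF_MAGIC = '%PDF-';

function isPdfBuffer(data: Buffer): boolean {
  // Every PDF file begins with "%PDF-" followed by the spec version.
  return (
    data.length >= PDF_MAGIC.length &&
    data.subarray(0, PDF_MAGIC.length).toString('latin1') === PDF_MAGIC
  );
}
```

It could be called on the buffer that page.pdf resolves with before the result is returned or stored.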

Building and deploying to AWS

Before we get started, you will need Node 8.10 and npm installed on your machine, and you will need an AWS account to deploy your code to. AWS Lambda has a reasonably generous free tier (see AWS Lambda Pricing).

Serverless

I’m going to use the Serverless Framework, which I find to be the easiest way to deploy to AWS. If you haven’t used Serverless before, start by installing the CLI:

npm install -g serverless

You then need to set up your AWS credentials so that the Serverless CLI can deploy to your account.

Once you’ve finished the setup, create your project.

serverless create --template aws-nodejs --path ./lambda-puppeteer

This will create the lambda-puppeteer folder containing a basic JavaScript Lambda deployment project.

My preference is to use TypeScript rather than plain JavaScript, so we will convert the project to TypeScript below. The Serverless template aws-nodejs-typescript could be used above, but it creates a project that leaves out a number of useful comments and it includes webpack, which we don’t need.

cd lambda-puppeteer

The serverless.yml file contains all the configuration necessary to deploy your project, and the template creates a project that can be deployed and tested straight away.

serverless deploy -v

Now test your function and look at the logs with these commands:

serverless invoke -f hello -l
serverless logs -f hello -t

Chromium and puppeteer core

Lambda has a 50 MB deployment package limit (unless you use layers), but the community has provided an easy way to deploy everything needed in a package of about 35 MB. We will use this library to get the Chromium dependencies we need:

https://github.com/alixaxel/chrome-aws-lambda

Initialise node package manager:

npm init

Just accept the defaults for the project setup.

Add chromium:

npm i chrome-aws-lambda --save

and puppeteer-core, which is a version of Puppeteer that doesn’t download Chromium by default:

npm i puppeteer-core --save

Using typescript

There are a number of ways to configure your project for TypeScript, such as using the serverless-plugin-typescript. In this case we’re going to manually convert the project in five steps:

Install TypeScript:

npm i --save-dev typescript

Rename handler.js to handler.ts.

Install the Node types:

npm i --save-dev @types/node

Add a tsconfig.json file with the following content:

{
  "compilerOptions": {
    "lib": ["es6"],
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "target": "es6",
    "skipLibCheck": true
  },
  "compileOnSave": true,
  "include": [
    "*.ts"
  ]
}

Add these two scripts to package.json:

"scripts": {
  "build": "tsc",
  "deploy": "npm run build && serverless deploy",
  ...
},

Here we’ve added a deploy command that compiles the TypeScript and then does a serverless deploy. You could also run tests as part of the deploy by defining a test script and changing deploy to npm run build && npm run test && serverless deploy.

Implementing the service

Our pdf service will have the following interface:

export interface PdfService {
  getPdf(url: string): Promise<Buffer>;
}

We expose a single function that accepts a URL parameter and returns a promise of a Buffer containing the PDF of the content of the URL.

Create a file named pdf-service.ts and add the interface code above to it.
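One benefit of putting the service behind an interface is that calling code can be exercised without launching Chrome. As a rough sketch only, the stub below shows the idea. FakePdfService and its requestedUrls field are hypothetical names, not part of the deployed service, and the interface is repeated so the snippet stands alone.

```typescript
interface PdfService {
  getPdf(url: string): Promise<Buffer>;
}

// Hypothetical test double: returns a fixed buffer instead of driving Chrome.
class FakePdfService implements PdfService {
  public readonly requestedUrls: string[] = [];

  public async getPdf(url: string): Promise<Buffer> {
    this.requestedUrls.push(url); // record the call so a test can assert on it
    return Buffer.from('%PDF-1.4 fake');
  }
}
```

A unit test for the handler could be given a FakePdfService and then check requestedUrls, leaving the real Chrome-backed class for deployment.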

The implementation of the interface looks like this:

import chromium = require('chrome-aws-lambda');
import puppeteer = require('puppeteer-core');

export class ChromePdfService implements PdfService {
  public async getPdf(url: string): Promise<Buffer> {
    console.log(`Generating PDF for ${url}`);

    let browser = null;
    try {
      // Launch the Lambda-compatible Chromium build provided by chrome-aws-lambda.
      browser = await puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
      });

      const page = await browser.newPage();

      // Wait for all three events before treating navigation as complete.
      await page.goto(url, {
        waitUntil: ['networkidle0', 'load', 'domcontentloaded'],
      });
      const result = await page.pdf({
        printBackground: true,
        format: 'A4',
        displayHeaderFooter: false,
      });
      console.log(`buffer size = ${result.length}`);
      return result;
    } catch (error) {
      // Error properties are non-enumerable, so JSON.stringify(error) would give "{}".
      throw new Error(`Failed to PDF url ${url} Error: ${error.message || error}`);
    } finally {
      // Always close the browser so the Lambda container does not leak Chrome processes.
      if (browser !== null) {
        await browser.close();
      }
    }
  }
}

Add the implementation code above to pdf-service.ts so that it contains both the interface and the implementation.
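One detail worth knowing when building the message in the catch block: an Error’s message and stack are non-enumerable properties, so calling JSON.stringify on a plain Error yields {}. A small helper is one way to cope with both Error objects and other thrown values. The name describeError is hypothetical and the function is only a sketch, not part of Puppeteer.

```typescript
// Hypothetical helper: turn any thrown value into a readable message.
function describeError(error: unknown): string {
  if (error instanceof Error) {
    // message and stack are non-enumerable, so JSON.stringify(error) is "{}".
    return `${error.name}: ${error.message}`;
  }
  // Strings are returned as-is; other values are serialised as JSON.
  return typeof error === 'string' ? error : JSON.stringify(error);
}

describeError(new Error('Navigation timeout')); // "Error: Navigation timeout"
JSON.stringify(new Error('Navigation timeout')); // "{}"
```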

This code expands on the simple example near the beginning of this post. One thing to note is the waitUntil option I have included. This setting determines when navigation is considered to have succeeded, and it defaults to load. When you specify an array of event strings, navigation is considered successful only after all of the events have fired.

load - consider navigation to be finished when the load event is fired.
domcontentloaded - consider navigation to be finished when the DOMContentLoaded event is fired.
networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.

So capturing the PDF does not proceed until the last of these three events has fired.
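To make the "all events" rule concrete, here is a toy model of it. This is not Puppeteer’s implementation: the function name navigationFinishedAt and the timings are invented purely for illustration.

```typescript
type LifecycleEvent = 'load' | 'domcontentloaded' | 'networkidle0' | 'networkidle2';

// Toy model, not Puppeteer's implementation: given when each event fired
// (ms since navigation started), return when navigation counts as finished.
function navigationFinishedAt(
  firedAt: Record<LifecycleEvent, number>,
  waitUntil: LifecycleEvent[] = ['load'],
): number {
  // All requested events must have fired, so the latest one decides.
  return Math.max(...waitUntil.map((event) => firedAt[event]));
}

// Invented timings for a page that keeps loading resources after "load".
const sampleTimings = { domcontentloaded: 400, load: 900, networkidle2: 1600, networkidle0: 2300 };

navigationFinishedAt(sampleTimings); // 900 (default: load only)
navigationFinishedAt(sampleTimings, ['networkidle0', 'load', 'domcontentloaded']); // 2300
```

With these made-up numbers, the default setting would capture the page at 900 ms, while the three-event setting waits until the network has gone quiet at 2300 ms.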

Wiring up to an https endpoint

To make our service callable, we change the handler code to:

import { ChromePdfService, PdfService } from './pdf-service';

const pdfService: PdfService = new ChromePdfService();

module.exports.pdfReport = async (event) => {
  console.log(`pdfReport request ${JSON.stringify(event, null, 4)}`);
  // With the lambda integration, query string parameters arrive on event.query.
  const url = event.query.url;
  const buffer = await pdfService.getPdf(url);
  // An async handler returns its result, so no callback is needed.
  return buffer.toString('base64');
};

Here we convert the buffer returned from our PdfService to a base64 string.
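The handler passes the url query parameter straight to Chrome. If the endpoint were exposed more widely, you might want to check the value first. The guard below is only a sketch under that assumption: parseTargetUrl is a hypothetical name, and accepting only http and https is my choice for illustration, not something the original service enforces.

```typescript
import { URL } from 'url';

// Hypothetical guard: accept only absolute http(s) URLs.
function parseTargetUrl(raw: unknown): string {
  if (typeof raw !== 'string' || raw.length === 0) {
    throw new Error('Missing "url" query parameter');
  }
  let parsed: URL;
  try {
    parsed = new URL(raw); // throws a TypeError on relative or malformed input
  } catch {
    throw new Error(`Invalid url: ${raw}`);
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href; // normalised form, e.g. with a trailing slash added
}
```

In the handler this could wrap the direct read, for example const url = parseTargetUrl(event.query.url).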

Finally, we add an https endpoint /pdf to call our function by replacing the functions section of serverless.yml with:

functions:
  pdfReport:
    handler: lib/handler.pdfReport
    events:
      - http:
          path: pdf
          method: get
          integration: lambda
Note that the handler path of lib matches the outDir specified in tsconfig.json above.

Deploy your service using the deploy script we defined in package.json:

npm run deploy

After the deployment has finished, we can call our PDF service by going to the URL allocated by the serverless deploy, for example:

https://<your project id and region>.amazonaws.com/dev/pdf?url=https://example.com
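If the page you want to convert has its own query string, the target URL needs to be percent-encoded or its parameters will be read as parameters of the /pdf request. A minimal sketch of building the request URL follows. buildPdfRequestUrl is a hypothetical helper and the endpoint in the example is a made-up placeholder, so use the one printed by your own deploy.

```typescript
import { URL } from 'url';

// Hypothetical helper: `endpoint` is the /pdf URL printed by `serverless deploy`.
function buildPdfRequestUrl(endpoint: string, targetUrl: string): string {
  const request = new URL(endpoint);
  // searchParams percent-encodes the value, so "&" and "?" in the target survive.
  request.searchParams.set('url', targetUrl);
  return request.href;
}

// Placeholder endpoint, for illustration only.
buildPdfRequestUrl(
  'https://abc123.execute-api.eu-west-1.amazonaws.com/dev/pdf',
  'https://example.com/search?q=pdf&page=2',
);
// "https://abc123.execute-api.eu-west-1.amazonaws.com/dev/pdf?url=https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dpdf%26page%3D2"
```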

If all is well, this should return a long base64 text response. To see the result, we can paste the response text into an online base64-to-PDF converter (e.g. base64.guru).
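The conversion can also be done locally with a few lines of Node. This is a sketch under one assumption: that the response body may be wrapped in double quotes as a JSON string, which the code strips if present. The function names decodePdfResponse and savePdfResponse are hypothetical.

```typescript
import { writeFileSync } from 'fs';

// Hypothetical helper: `responseText` is the body returned by the /pdf endpoint.
function decodePdfResponse(responseText: string): Buffer {
  // Assumption: the body may arrive as a JSON string literal, so drop
  // surrounding whitespace and double quotes before decoding.
  const base64 = responseText.trim().replace(/^"|"$/g, '');
  const pdf = Buffer.from(base64, 'base64');
  if (pdf.subarray(0, 5).toString('latin1') !== '%PDF-') {
    throw new Error('Response did not decode to a PDF');
  }
  return pdf;
}

// Decode the response text and write the PDF to disk.
function savePdfResponse(responseText: string, outputPath: string): void {
  writeFileSync(outputPath, decodePdfResponse(responseText));
}
```

For example, after saving the response text to a variable, savePdfResponse(text, 'example.pdf') would write a file you can open in any PDF viewer.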

Returning application/pdf

By changing some settings in API Gateway, you can have your endpoint return the correct Content-Type so that the response is displayed as a PDF. There is a Serverless plugin that is meant to automate these settings:

https://www.npmjs.com/package/serverless-plugin-custom-binary

I wasn’t able to get it to work, but it may work for you. I was able to make the change manually by following these instructions, although it’s not ideal to have configuration outside of your Serverless deployment.
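One possible alternative, sketched here under assumptions and not tested as part of this post: if the endpoint used the lambda-proxy integration instead of integration: lambda, the function itself could declare the content type in its response. API Gateway would still need application/pdf registered as a binary media type for the body to be decoded. The names ProxyResponse and toPdfResponse are hypothetical.

```typescript
// Shape of a Lambda proxy integration response (only the fields used here).
interface ProxyResponse {
  statusCode: number;
  headers: { [name: string]: string };
  body: string;
  isBase64Encoded: boolean;
}

// Hypothetical helper: wrap the PDF buffer for a lambda-proxy endpoint.
function toPdfResponse(pdf: Buffer, fileName = 'page.pdf'): ProxyResponse {
  return {
    statusCode: 200,
    headers: {
      'Content-Type': 'application/pdf',
      'Content-Disposition': `inline; filename="${fileName}"`,
    },
    body: pdf.toString('base64'),
    // Tells API Gateway the body is base64 and should be decoded to binary.
    isBase64Encoded: true,
  };
}
```

Note that with the proxy integration the query string arrives on event.queryStringParameters rather than event.query, so the handler would need to change accordingly.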

Adding header and footer

You can add your own HTML markup to create custom page headers and footers. One thing to note is that none of the stylesheets from the page are available so any styling needs to be done inline.

The header and footer markup can contain the following classes used to inject printing values into them:

date - formatted print date
title - document title
url - document location
pageNumber - current page number
totalPages - total pages in the document

Here’s an example of adding a footer containing page numbers:

const result = await page.pdf({
  printBackground: true,
  format: 'A4',
  displayHeaderFooter: true,
  footerTemplate: `
    <div style="font-size:10px; margin-left:20px;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>
  `,
  margin: {
    top: '20px',
    right: '20px',
    bottom: '50px',
    left: '20px',
  },
});
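A header can be built in the same way using the title and date classes. The helper below is only a sketch: buildHeaderTemplate is a hypothetical name, and the inline styles are just one possible layout.

```typescript
// Hypothetical helper: builds an inline-styled header template string.
function buildHeaderTemplate(label: string): string {
  // Escape the label so it cannot inject markup into the template.
  const safeLabel = label
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  // The empty spans are filled in by Chrome when the page is printed.
  return `
    <div style="font-size:10px; width:100%; margin:0 20px; display:flex; justify-content:space-between;">
      <span>${safeLabel} <span class="title"></span></span>
      <span class="date"></span>
    </div>`;
}
```

The result would be passed as headerTemplate alongside displayHeaderFooter: true. The top margin would likely need to be increased from 20px so that the header has room to appear.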

This is how it looks on the page:

Of course, using page.pdf is just one example of the many things you can do with Chrome using the Puppeteer API.

That completes today’s post. When you’ve finished, remember to delete your AWS resources by running serverless remove.

In my next post we’ll add PDF password protection using a command-line tool and in the third post of the series I will cover calling the PDF service from an AWS step function.
