I’m going to take you through setting up a headless Chrome browser that you can run on AWS and drive through an API to do most of the things a browser can do. Our target for today is to have Chrome navigate to a URL, wait for the page to fully load, and then create a PDF.
The Chromium team has released Puppeteer, a Node API for headless Chrome:
https://github.com/GoogleChrome/puppeteer
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
There is also a really useful site where you can try Puppeteer: https://try-puppeteer.appspot.com/. The sample code they provide to create a PDF looks like this:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({
  path: 'hn.pdf',
  format: 'letter'
});
await browser.close();
The API used above is very well documented in the Puppeteer API docs. Looking at page.pdf, we see that the function takes an options object and returns a promise which resolves with a PDF buffer. The options give you a good deal of control: you can set a path to save the PDF if you don’t want to consume the buffer, and control headers, footers and page formatting, among other things.
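To make the buffer-versus-file choice concrete, here is a minimal sketch. It does not launch a browser: PdfCapablePage and PdfOptionsSketch are hypothetical stand-ins for the small part of Puppeteer's Page API that it touches, and renderLetterPdf is an illustrative name. Leaving out path means nothing is written to disk and you work with the resolved buffer directly.

```typescript
// Hypothetical stand-ins (not Puppeteer's own types) so the sketch
// type-checks and runs without a browser.
interface PdfOptionsSketch {
  path?: string;             // when set, the PDF is also written to this file
  format?: string;           // paper format, e.g. 'letter' or 'A4'
  printBackground?: boolean; // include CSS backgrounds in the output
}

interface PdfCapablePage {
  pdf(options?: PdfOptionsSketch): Promise<Buffer>;
}

// Renders the page to a PDF held in memory. No `path` is passed,
// so the caller decides what to do with the bytes.
async function renderLetterPdf(page: PdfCapablePage): Promise<Buffer> {
  return page.pdf({ format: 'letter', printBackground: true });
}
```

Any object with a compatible pdf method, including a real Puppeteer page, should be usable here; in a Lambda function the in-memory buffer is usually what you want because the file system is mostly read-only.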
Before we get started you will need Node 8.10 and npm installed on your machine, and you will need an AWS account to deploy your code to. AWS Lambda has a reasonably generous free tier; see AWS Lambda Pricing.
I’m going to use the Serverless Framework, which I find to be the easiest way to deploy to AWS. If you haven’t used Serverless before, start by installing the CLI:
npm install -g serverless
You then need to set up your AWS credentials. Once you’ve finished the setup, create your project:
serverless create --template aws-nodejs --path ./lambda-puppeteer
This will create the lambda-puppeteer folder containing a basic JavaScript Lambda deployment project.
My preference is to use TypeScript rather than plain JavaScript, so we will convert the project to TypeScript below. The serverless template aws-nodejs-typescript could be used above, but it creates a project that omits a number of useful comments and includes webpack, which we don’t need.
cd lambda-puppeteer
The serverless.yml file contains all the configuration necessary to deploy your project, and the template creates a project that can be deployed and tested straight away:
serverless deploy -v
Now test your function and look at the logs with these commands:
serverless invoke -f hello -l
serverless logs -f hello -t
Lambda has a 50 MB deployment package limit (unless you use layers), but the community has provided an easy way to deploy everything needed in a package of about 35 MB. We will use this library to get the Chromium dependencies we need:
https://github.com/alixaxel/chrome-aws-lambda
Initialise node package manager:
npm init
Just accept the defaults for the project setup.
Add chromium:
npm i chrome-aws-lambda --save
and puppeteer-core, which is a version of Puppeteer that doesn’t download Chromium by default:
npm i puppeteer-core --save
There are a number of ways to configure your project for TypeScript, such as using the serverless-plugin-typescript. In this case we’re going to manually convert the project in five steps:
Install TypeScript:
npm i --save-dev typescript
Rename handler.js to handler.ts.
Install the Node types:
npm i @types/node
Create a tsconfig.json file with the following content:
{
  "compilerOptions": {
    "lib": ["es6"],
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "target": "es6",
    "skipLibCheck": true
  },
  "compileOnSave": true,
  "include": [
    "*.ts"
  ]
}
Add build and deploy scripts to package.json:
"scripts": {
  "build": "tsc",
  "deploy": "npm run build && serverless deploy",
  ...
},
Here we’ve added a deploy command that will compile the TypeScript and do a serverless deploy. You could also run tests as part of the deploy by defining a test script and changing deploy to npm run build && npm run test && serverless deploy.
Our pdf service will have the following interface:
export interface PdfService {
getPdf(url: string): Promise<Buffer>;
}
We expose a single function that accepts a URL parameter and returns a promise of a Buffer containing the PDF of the content of the URL.
Create a file named pdf-service.ts and add the interface code above to it.
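One benefit of putting the service behind an interface is that callers can be exercised without starting Chrome. The sketch below is an assumption about how you might do that, not part of the deployed code: FakePdfService is a hypothetical test double, and the interface is repeated (without export) only so the snippet runs on its own.

```typescript
// Repeated from above so this snippet is self-contained.
interface PdfService {
  getPdf(url: string): Promise<Buffer>;
}

// Hypothetical test double: records the requested URLs and returns
// a small placeholder buffer instead of driving a browser.
class FakePdfService implements PdfService {
  public readonly requestedUrls: string[] = [];

  public async getPdf(url: string): Promise<Buffer> {
    this.requestedUrls.push(url);
    // Real PDF files begin with the "%PDF-" signature; this placeholder
    // mimics that prefix but is not a valid document.
    return Buffer.from(`%PDF-1.4 placeholder for ${url}`);
  }
}
```

A handler written against PdfService can then be given a FakePdfService in unit tests and the Chrome-backed implementation in production.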
The implementation of the interface looks like this:
import chromium = require('chrome-aws-lambda');
import puppeteer = require('puppeteer-core');
export class ChromePdfService implements PdfService {
  public async getPdf(url: string): Promise<Buffer> {
    console.log(`Generating PDF for ${url}`);
    let browser = null;
    try {
      // Launch the Lambda-compatible Chromium build provided by chrome-aws-lambda.
      browser = await puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
      });
      const page = await browser.newPage();
      // Wait for all three events before treating the page as loaded.
      await page.goto(url, {
        waitUntil: ['networkidle0', 'load', 'domcontentloaded'],
      });
      const result = await page.pdf({
        printBackground: true,
        format: 'A4',
        displayHeaderFooter: false,
      });
      console.log(`buffer size = ${result.length}`);
      return result;
    } catch (error) {
      // JSON.stringify(error) yields "{}" for Error objects, so report the message instead.
      const reason = error instanceof Error ? error.message : JSON.stringify(error);
      throw new Error(`Failed to PDF url ${url} Error: ${reason}`);
    } finally {
      // Always close the browser so the Lambda container does not leak Chrome processes.
      if (browser !== null) {
        await browser.close();
      }
    }
  }
}
Add the implementation code above to pdf-service.ts so that it contains both the interface and the implementation.
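When a caught value is turned into a log or error message, it helps to know that Error objects serialize poorly with JSON.stringify: their message and stack properties are not enumerable, so the result is typically just {}. The helper below is a hypothetical sketch (describeError is not part of Puppeteer or the service above) of one way to get a readable string from whatever was thrown.

```typescript
// Hypothetical helper: turns any caught value into a readable string.
function describeError(error: unknown): string {
  if (error instanceof Error) {
    // name and message carry the useful detail; JSON.stringify would drop them.
    return `${error.name}: ${error.message}`;
  }
  if (typeof error === 'string') {
    return error;
  }
  try {
    const json = JSON.stringify(error);
    // JSON.stringify returns undefined for values such as undefined or functions.
    return json === undefined ? String(error) : json;
  } catch {
    // Circular structures and similar cases cannot be serialized.
    return String(error);
  }
}
```

It could be called from a catch block when composing the message for a rethrown error.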
This code expands on the simple example near the beginning of this post. One thing to note is the waitUntil options I have included. This setting determines when navigation is considered to have succeeded, and it defaults to load. When you specify an array of event strings, navigation is considered to be successful after all events have been fired.
load - consider navigation to be finished when the load event is fired.
domcontentloaded - consider navigation to be finished when the DOMContentLoaded event is fired.
networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
So capturing the PDF does not proceed until the last of these three has completed.
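The "all events must fire" rule can be modelled in a few lines. This is an illustrative model only, not Puppeteer's implementation: LifecycleEvent and navigationFinished are names made up for this sketch.

```typescript
type LifecycleEvent = 'load' | 'domcontentloaded' | 'networkidle0' | 'networkidle2';

// Illustrative model: navigation counts as finished once every
// requested event has been observed, whatever order they arrive in.
function navigationFinished(
  waitUntil: LifecycleEvent | LifecycleEvent[],
  fired: ReadonlySet<LifecycleEvent>
): boolean {
  const required = Array.isArray(waitUntil) ? waitUntil : [waitUntil];
  return required.every((lifecycleEvent) => fired.has(lifecycleEvent));
}
```

With the three events used above, a page that has fired load and domcontentloaded but still has open network connections is not yet considered finished; only when networkidle0 is also observed does the check pass.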
To make our service callable, we change the handler code to:
import { ChromePdfService, PdfService } from './pdf-service';
const pdfService: PdfService = new ChromePdfService();
module.exports.pdfReport = async (event, context, callback) => {
  console.log(`pdfReport request ${JSON.stringify(event, null, 4)}`);
  const url = event.query.url;
  const buffer = await pdfService.getPdf(url);
  callback(null, buffer.toString('base64'));
};
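The handler passes the url query parameter straight to Chrome. As a hedged sketch of how it could be checked first, the hypothetical parseTargetUrl helper below rejects missing, malformed and non-HTTP(S) values (for example file: URLs, which you probably don't want a browser inside your Lambda to open). The helper name and error messages are assumptions for illustration.

```typescript
import { URL } from 'url';

// Hypothetical helper: validates the `url` query parameter before
// Chrome is asked to navigate to it.
function parseTargetUrl(raw: unknown): string {
  if (typeof raw !== 'string' || raw.length === 0) {
    throw new Error('Missing required query parameter: url');
  }
  let parsed: URL;
  try {
    parsed = new URL(raw); // throws on malformed or relative input
  } catch {
    throw new Error(`Not a valid absolute URL: ${raw}`);
  }
  // Only allow web pages; this blocks schemes such as file: and data:.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
}
```

In the handler this could wrap the parameter lookup, for example const url = parseTargetUrl(event.query.url);.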
Here we convert the buffer returned from our PdfService to a base64 string.
Finally, we add an HTTPS endpoint /pdf to call our function by replacing the functions section of serverless.yml with:
functions:
  pdfReport:
    handler: lib/handler.pdfReport
    events:
      - http:
          path: pdf
          method: get
          integration: lambda
Note that the handler path of lib matches the outDir specified in tsconfig.json above.
Deploy your service using the deploy script we defined in package.json:
npm run deploy
After the deployment has finished we can call our PDF service by going to the URL allocated by the serverless deploy, for example:
https://<your project id and region>.amazonaws.com/dev/pdf?url=https://example.com
If all is well, this should return a long base64 text response. If we use an online base64-to-PDF converter (e.g. base64.guru) to convert the text of the response to a PDF, we can see the result.
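If you would rather not paste the response into a website, the same conversion can be done locally with Node's Buffer. This is a minimal sketch: decodePdfResponse is a hypothetical helper, and stripping surrounding double quotes is an assumption about how the response body might be wrapped.

```typescript
// Hypothetical helper: turns the endpoint's text response back into PDF bytes.
function decodePdfResponse(responseText: string): Buffer {
  // Assumption: the body may arrive as a JSON string, wrapped in double quotes.
  const base64 = responseText.trim().replace(/^"|"$/g, '');
  const pdf = Buffer.from(base64, 'base64');
  // Every PDF file starts with the ASCII signature "%PDF-".
  if (pdf.toString('latin1', 0, 5) !== '%PDF-') {
    throw new Error('Decoded data does not look like a PDF');
  }
  return pdf;
}
```

The returned buffer can then be saved with fs.writeFileSync('example.pdf', pdf) and opened in any PDF viewer.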
By changing some settings in API Gateway you can have your endpoint return the correct Content-Type so that the response is displayed as a PDF. There is a serverless plugin that is meant to automate these settings:
https://www.npmjs.com/package/serverless-plugin-custom-binary
I wasn’t able to get it to work, but it may work for you. I was able to make the change manually following these instructions, although it’s not ideal to have configuration outside of your serverless deployment.
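For reference, here is a sketch of a different approach from the lambda integration used in this post: with API Gateway's Lambda proxy integration, the function itself returns the status code, headers and a base64 body flagged with isBase64Encoded. The interface and function names below are made up for illustration, and this is untested against the setup above; API Gateway still has to be configured to treat application/pdf as a binary media type for the body to be decoded.

```typescript
// Shape of a Lambda proxy integration response (fields as documented by AWS);
// the interface name itself is just for this sketch.
interface ProxyResponseSketch {
  statusCode: number;
  headers: { [name: string]: string };
  body: string;
  isBase64Encoded: boolean;
}

// Hypothetical helper: wraps PDF bytes in a proxy-style response.
function toPdfProxyResponse(pdf: Buffer, fileName: string): ProxyResponseSketch {
  return {
    statusCode: 200,
    headers: {
      'Content-Type': 'application/pdf',
      // "inline" asks the browser to display the PDF rather than download it.
      'Content-Disposition': `inline; filename="${fileName}"`,
    },
    body: pdf.toString('base64'),
    isBase64Encoded: true,
  };
}
```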
You can add your own HTML markup to create custom page headers and footers. One thing to note is that none of the stylesheets from the page are available so any styling needs to be done inline.
The header and footer markup can contain the following classes, which are used to inject printing values into them:
date - formatted print date
title - document title
url - document location
pageNumber - current page number
totalPages - total pages in the document
Here’s an example of adding a footer containing page numbers:
const result = await page.pdf({
  printBackground: true,
  format: 'A4',
  displayHeaderFooter: true,
  footerTemplate: `
    <div style="font-size:10px; margin-left:20px;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>
  `,
  margin: {
    top: '20px',
    right: '20px',
    bottom: '50px',
    left: '20px',
  },
});
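The same idea works for headers. As a small sketch, the hypothetical buildHeaderTemplate helper below produces inline-styled markup that uses the title and date classes; the layout values are arbitrary choices for illustration.

```typescript
// Hypothetical helper: builds header markup using the injected
// `title` and `date` classes. All styling is inline because the
// page's own stylesheets are not available to the template.
function buildHeaderTemplate(fontSizePx: number): string {
  const style = [
    `font-size:${fontSizePx}px`,
    'width:100%',
    'margin:0 20px',
    'display:flex',
    'justify-content:space-between',
  ].join('; ');
  return `<div style="${style}"><span class="title"></span><span class="date"></span></div>`;
}
```

The result would be passed as headerTemplate alongside displayHeaderFooter: true, with a top margin large enough to leave room for the header.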
This is how it looks on the page:
Of course, using page.pdf is just one example of the many things you can do with Chrome using the Puppeteer API.
That completes today’s post. Remember to delete your AWS resources when you’ve finished by running serverless remove.
In my next post we’ll add PDF password protection using a command-line tool and in the third post of the series I will cover calling the PDF service from an AWS step function.