Reverse engineering Google PageSpeed Insights with machine learning


An often-quoted statistic from Ericsson ConsumerLab is that waiting for a slow-loading web page on a mobile device creates as much stress as watching a horror film. A study by Cloudflare showed the connection between page speed and website conversion rate: if a web page takes longer than 4 seconds to load on a mobile device, the conversion rate drops to less than 1%.

Page speed matters, especially now that the mobile web has become more prevalent. Google has stated since 2009 that its goal is to ‘make the web faster’, and since then it has released a raft of initiatives to help webmasters build their websites with speed in mind.

One of the biggest ways that Google incentivizes website owners to build with page speed in mind is by publicly stating that page speed is a ranking factor (i.e. the faster a web page is, the greater the chance of it ranking highly on Google). All of Google’s resources for helping website owners live on the Make the Web Faster website.


The Google PageSpeed Insights API

The Google PageSpeed Insights API is a tool that lets you query Google page speed data programmatically on a per-URL basis. It gets its results from two data sources. The first is the Chrome User Experience Report (CrUX), a real-world data set of web page performance data anonymously sourced from opted-in users of Google Chrome.

The second source is the open-source Lighthouse project, a tool that gives predictions of a web page’s speed performance and recommendations on how to improve it. It can be accessed in a variety of ways, including via a web page, a command-line tool, or the developer tools in Google Chrome.

The PageSpeed Insights API returns a JSON object containing over 80 different results, composed of data from both CrUX and Lighthouse. A lot of this data is rooted in how web page speed metrics are measured.
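As a rough illustration of the two data sources, here is a minimal Python sketch that builds a request for the v5 `runPagespeed` endpoint and splits the response into its CrUX part (`loadingExperience`) and its Lighthouse part (`lighthouseResult`). The function names (`build_psi_url`, `fetch_psi`, `split_sources`) and the default timeout are my own choices for this example, not something taken from the original analysis.

```python
import json
import urllib.parse
import urllib.request

# Public endpoint of the PageSpeed Insights API (version 5).
PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"


def build_psi_url(page_url, strategy="mobile", api_key=None):
    """Build the request URL for one page. The API key is optional for light use."""
    params = {"url": page_url, "strategy": strategy}
    if api_key:
        params["key"] = api_key
    return PSI_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_psi(page_url, strategy="mobile", api_key=None, timeout=60):
    """Call the API and return the decoded JSON response as a dict."""
    request_url = build_psi_url(page_url, strategy, api_key)
    with urllib.request.urlopen(request_url, timeout=timeout) as resp:
        return json.load(resp)


def split_sources(response):
    """Return (crux_metrics, lighthouse_audits) from one API response.

    Either part may be empty, for example when CrUX has no field data
    for the requested URL.
    """
    crux_metrics = response.get("loadingExperience", {}).get("metrics", {})
    lighthouse_audits = response.get("lighthouseResult", {}).get("audits", {})
    return crux_metrics, lighthouse_audits
```

Calling `split_sources(fetch_psi("https://example.com/"))` would then give you the field data and the lab data as two separate dictionaries.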


How webpage speeds are measured

Google has created a number of metrics to measure page speed performance, and the most important of these are shown in the image below:

[Image: Google's page speed metrics]

The Speed Index is a score that is calculated by taking the difference between the First Contentful Paint time and the Visually Ready Time. In general, the lower the Speed Index, the better a website performs for page speed.


Why is this important?

Building an enterprise website can be both complicated and expensive, and if the website doesn’t perform well in search, it can have a detrimental effect on the bottom line of the business it represents. If a website is already live but performing poorly for page speed, it can be expensive to go back and make updates to improve it.

In a competitive business environment where resources are limited, it is useful to know which fixes will deliver the most benefit before committing budget to them. I decided to investigate this further. The first step was to build a data set of page speed performance against rankings, and then use machine learning to understand the data better and draw insights from it.

I built a data set of 100,000 websites by taking the top 1000 most popular search phrases on Google and running them through serpapi.com to get the top 100 ranking websites for each phrase. I then queried each of those 100,000 websites against the PageSpeed Insights API to pull out the most important metrics for analysis.
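The data set construction described above could be sketched roughly as follows. This is not the original pipeline: the row layout (`keyword`, `position`, `url`), the choice of audits in `AUDIT_IDS`, and the `fetch` callable (any function that takes a URL and returns a decoded PageSpeed Insights response) are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Lighthouse audit ids to keep (an assumed selection); the API reports
# their numericValue in milliseconds.
AUDIT_IDS = ("speed-index", "first-contentful-paint", "interactive")


def extract_metrics(response: dict) -> Dict[str, Optional[float]]:
    """Pull the chosen Lighthouse metrics out of one API response.

    A metric that is missing from the response is recorded as None.
    """
    audits = response.get("lighthouseResult", {}).get("audits", {})
    return {
        audit_id: audits.get(audit_id, {}).get("numericValue")
        for audit_id in AUDIT_IDS
    }


def build_dataset(
    serp_rows: Iterable[dict], fetch: Callable[[str], dict]
) -> List[dict]:
    """Combine ranking rows with page speed metrics.

    serp_rows: dicts with 'keyword', 'position' and 'url' keys.
    fetch: hypothetical callable that returns the API response for a URL.
    Rows whose request fails are skipped rather than aborting the whole run.
    """
    dataset = []
    for row in serp_rows:
        try:
            response = fetch(row["url"])
        except (OSError, ValueError):
            # Network errors and malformed JSON are expected at this scale.
            continue
        dataset.append({**row, **extract_metrics(response)})
    return dataset
```

Passing the fetch function in as a parameter keeps the network call separate from the extraction logic, which makes it easy to add rate limiting or caching for a run of this size.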

It is interesting to look at a chart for position versus average Speed Index for the top 10 positions on Google for the dataset:

[Image: Speed Index versus Position for the top 1000 keywords on Google]

Remember that the lower the Speed Index, the better. Apart from position 1, there is a clear relationship between Speed Index and position (i.e. the faster the page, the higher the position). It would be a gross oversimplification to say that Speed Index drives position (as position 1 shows), but there is something there to investigate.
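For readers who want to reproduce this kind of comparison on their own data, here is a minimal sketch of the aggregation behind a position versus average Speed Index chart. It assumes each row is a dict with a `position` key and a `speed-index` key in milliseconds; those names are my own convention for the example, not necessarily the format of the original data set.

```python
from collections import defaultdict
from statistics import mean


def average_speed_index_by_position(rows, max_position=10):
    """Return {position: mean Speed Index} for positions 1..max_position.

    Rows without a Speed Index value (None or missing) are ignored.
    """
    buckets = defaultdict(list)
    for row in rows:
        speed_index = row.get("speed-index")
        position = row["position"]
        if speed_index is not None and position <= max_position:
            buckets[position].append(speed_index)
    # Sort by position so the result is ready to plot left to right.
    return {pos: mean(values) for pos, values in sorted(buckets.items())}
```

A simple average per position hides the spread within each position, so it is worth looking at the distribution as well before reading too much into the trend.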
