I’ve been thinking a lot recently about how useful exhaust data is, and wanted to share some thoughts. Below I set out what exhaust data is, give some examples of how it can add a huge amount of value, and offer some practical suggestions for data scientists on what to use it for.

Exhaust data

What is exhaust data and why should I care?

Every company with any kind of online footprint will have exhaust data cluttering up its data stores somewhere. Exhaust data is data that is the byproduct of users’ activities. It doesn’t include ‘primary’ or ‘core’ data, like a user’s saved log-in details or their transaction history, but does include all the data that is produced as a side effect of how the user interacts with your digital product.

For example, I used to work as a data scientist at Citymapper, a transport app. You can use the app to get transit directions, and you can use a subscription ticketing product called Pass to pay for your journeys. Let’s say you wanted to travel from your home to your favourite communist dictator-themed pizza and cocktail bar (pre-Covid, of course). You’d open up the app, type in the address, look through some of the transport options, choose one, follow the directions and use your Pass to pay for the journey. Data about the transaction would be recorded, and would count as core data because paid transactions are core to the business function.

An example of getting Citymapper transit directions, which produces exhaust data. Source: The Verge

Event logs from all the other things that you did in the app would also be recorded — including what start and end address you typed in, which options you were shown, which options you clicked on, what the last option you selected was; and if you used the step-by-step route guide (‘Go mode’) your GPS location would also be recorded every 10 seconds. All of this is exhaust data. Similarly, exhaust data at an ecommerce site would include clickstream data — including every product you clicked on, how long you stayed on each page, which sections of each page you looked at, and what actions you took.

Exhaust data is big. Really big. It often has to have rules applied to it to stop it getting even bigger. In the Citymapper example, we could have recorded user locations every second in Go mode instead of every 10 seconds, but then we’d have 10 times as much data, without necessarily getting 10 times as much value. It can also be messy and difficult to work with, especially when it comes to distilling it down into something more understandable, but also when it comes to simply accessing the data you need in a timely manner without your notebook hanging or your laptop exploding.
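To make the sampling trade-off concrete, here is a minimal sketch of thinning one-per-second location pings down to one every 10 seconds. The ping format, the coordinates and the `downsample` function are all made up for illustration; this is not how Citymapper actually did it.

```python
from datetime import datetime, timedelta


def downsample(pings, interval_seconds=10):
    """Keep a ping only if at least interval_seconds have passed since the last kept ping."""
    kept = []
    last_kept_time = None
    for ts, lat, lon in sorted(pings):
        if last_kept_time is None or (ts - last_kept_time).total_seconds() >= interval_seconds:
            kept.append((ts, lat, lon))
            last_kept_time = ts
    return kept


# Hypothetical data: one minute of one-per-second pings as (timestamp, lat, lon).
start = datetime(2020, 1, 1, 9, 0, 0)
pings = [(start + timedelta(seconds=i), 51.5 + i * 1e-5, -0.12) for i in range(60)]

sampled = downsample(pings)
print(len(pings), "raw pings ->", len(sampled), "stored pings")
```

On this toy minute of data the rule keeps one ping in ten. Whether the discarded nine would have been worth storing depends on what you want to do with the locations later.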

A love letter to trash data

Exhaust data is generally a massive, complex, nuanced and highly unsexy pile of data, which poses major challenges when trying to derive any kind of insight from it. It’s also wonderful, fascinating, and the most amazing source of information about user behaviour. It’s often underutilised because of the inherent difficulties in using it, so it can be a great untapped source of value for an organisation.

I first realised just how important big, messy data that is produced as a byproduct of a core activity can be in my PhD, in the highly obscure area of the application of statistics and data science to the archaeology of human evolution. It’s only looking back on my doctoral thesis now, with the benefit of having spent several years using data science in the real world, that I realise that my whole PhD was about the importance of exhaust data.

Previous work in my field had tended to focus on data about the end product — in my case stone tools, but in my examples above this could be the final transaction on a transport subscription card or an ecommerce site. But by looking at the exhaust data I was able to learn so much more about human behaviour. In my PhD the exhaust data was literally trash data, as it came from measurements of bits of stone that were removed as a byproduct of making, for example, an arrowhead. But in a commercial context this could be the event logs that record each tap a user makes on an app before choosing their journey route, or the series of products a user looked at on an ecommerce site before making their final purchase.
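As a rough illustration of the ecommerce case, the sketch below rebuilds the series of products each user viewed before a purchase from a flat event log. The event tuples, field order and `paths_to_purchase` function are assumptions invented for this example, not a real schema.

```python
from collections import defaultdict

# Hypothetical clickstream: (user_id, timestamp_seconds, event_type, product_id).
events = [
    ("u1", 0, "view", "tent"),
    ("u1", 40, "view", "stove"),
    ("u1", 95, "view", "tent"),
    ("u1", 120, "purchase", "tent"),
    ("u2", 10, "view", "boots"),
    ("u2", 30, "purchase", "boots"),
]


def paths_to_purchase(events):
    """For each purchase, list the products that user viewed since their previous purchase."""
    by_user = defaultdict(list)
    # Sort by user, then time, so each user's events are in chronological order.
    for user_id, ts, event_type, product_id in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user_id].append((event_type, product_id))

    paths = []
    for user_id, user_events in by_user.items():
        viewed = []
        for event_type, product_id in user_events:
            if event_type == "view":
                viewed.append(product_id)
            elif event_type == "purchase":
                paths.append({"user": user_id, "bought": product_id, "viewed_before": viewed})
                viewed = []  # start a fresh path after each purchase
    return paths


paths = paths_to_purchase(events)
for path in paths:
    print(path)
```

The core data here is just the two purchases. The exhaust data is what tells you that u1 went back to the tent after looking at a stove, which is the 'how' behind the final transaction.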

Exhaust data is enormously useful because it tells us not just _what_ a person did, but _how_ (and maybe even why) they did it. And so, I’m writing this blog post as an homage to exhaust data.

Stone tools and statistics

What can stone tools tell us about human behaviour?

I finished my PhD in the Archaeology department of the University of Oxford back in 2015. I was interested in the evolution of human behaviour, and in particular when we started behaving in a recognisably ‘human’ way and how our species spread around the world. Bones and stones are generally the only things that survive from the time period I was interested in, and so I tried to answer these questions by looking at stone tool technology.

Traditionally, an archaeologist of the Palaeolithic or Stone Age would have a look at some arrowheads from two different areas, decide they look kinda similar, and then go and write a paper about how the same group of people must have lived in the two areas. This is, in fact, exactly how the previously predominant theory of our species’ dispersal around the world was born. This theory was based around the occurrence of a certain type of stone tool technology called microliths. As the name suggests, microliths are very small stone tools — things like arrowheads, stone barbs along the sides of spears, fishing hooks and teeny tiny blades.

Reconstructed microliths from Sweden around 9,000 years ago. Source: Larsson et al. (2017) in Lund Archaeological Review 22.

We find similar-looking stone tools in Southern Africa about 70,000 years ago, East Africa about 50,000 years ago, and South Asia about 40,000 years ago. Therefore, so the theory goes, early modern humans must have left Africa about 50,000 years ago and spread around the world, taking their arrowheads with them. The story of early modern human development was reduced to a map with a big arrow drawn across it saying ‘humans go this way’: a narrative in which humans followed this big arrow from Africa to Asia and beyond, dropping arrowheads like breadcrumbs along the way.
