Key Takeaways

  • Dynamic multi-dimensional data can be modeled as Theta Sketches in a way that allows millisecond-latency queries.
  • Downstream pipelines of services consume the user activity events both directly from Kafka and from a raw data lake on Amazon S3, which stores the data in Parquet files.
  • Unique features of the HBase NoSQL database were used to solve this problem.
  • Apache Spark is used to read the events from the data lake and pre-aggregate them into HBase.

AppsFlyer is a commercial SaaS attribution platform. Its clients, some of the largest mobile app companies in the world, send a large number of events every day: the installs, uninstalls, sessions, in-app events, clicks and impressions performed by their user bases.

In this article, I will discuss a system AppsFlyer built to quickly and accurately approximate the sizes of sets of unique users (each represented by a non-PII user ID), segmented by any combination of criteria over the various dimensions of these events. This system (later referred to as “Audiences”) is used by AppsFlyer’s user segmentation product to give its users interactive feedback while they define criteria in the UI. Every action in the UI queries this system for the approximate size of the set of unique users who meet the criteria, allowing users to fine-tune their criteria until they reach a number they are happy with.

As a brief example, advertisers of an e-commerce application might want to know how many of their unique users installed the app in the last month, and also purchased products A and B, but DID NOT purchase product C; or how many unique users in the US added more than X products to their shopping cart in the past week but never checked out.
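Queries like these are set expressions (intersections and differences) over groups of unique users. The following is a minimal pure-Python sketch of how such an expression could be evaluated over sketches rather than raw user lists. Everything named here (`ToyThetaSketch`, `union`, `intersect`, `a_not_b`, `sketch_of`, the user IDs and the value of `k`) is hypothetical and chosen for illustration; it is a simplified stand-in in the spirit of Theta Sketches, not the Apache DataSketches library and not AppsFlyer's implementation.

```python
import hashlib

MAX_HASH = 2 ** 64  # hashes are 64-bit integers in [0, MAX_HASH)


def hash_id(user_id):
    """Map a user ID to a uniformly distributed 64-bit integer."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Toy sketch (hypothetical): retains only hashes below a threshold theta."""

    def __init__(self, k=4096, theta=MAX_HASH, hashes=()):
        self.k = k
        self.theta = theta
        self.hashes = set(hashes)

    def update(self, user_id):
        h = hash_id(user_id)
        if h < self.theta:
            self.hashes.add(h)
            # Trim lazily, only once the retained set has grown to over 2k.
            if len(self.hashes) > 2 * self.k:
                self.trim()

    def trim(self):
        """Keep the k smallest hashes; theta becomes the (k+1)-th smallest."""
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[:self.k])

    def estimate(self):
        """Retained count scaled up by the sampled fraction theta / MAX_HASH."""
        return len(self.hashes) * MAX_HASH / self.theta


def _combine(a, b, hashes):
    # Every set operation keeps only hashes below the smaller threshold.
    theta = min(a.theta, b.theta)
    result = ToyThetaSketch(a.k, theta, (h for h in hashes if h < theta))
    result.trim()
    return result


def union(a, b):
    return _combine(a, b, a.hashes | b.hashes)


def intersect(a, b):
    return _combine(a, b, a.hashes & b.hashes)


def a_not_b(a, b):
    return _combine(a, b, a.hashes - b.hashes)


def sketch_of(user_ids):
    sketch = ToyThetaSketch()
    for user_id in user_ids:
        sketch.update(user_id)
    return sketch


# Made-up data: which users installed the app and bought products A, B, C.
installed = sketch_of(f"u{i}" for i in range(0, 1000))
bought_a = sketch_of(f"u{i}" for i in range(0, 600))
bought_b = sketch_of(f"u{i}" for i in range(300, 900))
bought_c = sketch_of(f"u{i}" for i in range(500, 1000))

# installed AND bought A AND bought B, but NOT bought C
segment = a_not_b(intersect(intersect(installed, bought_a), bought_b), bought_c)
print(segment.estimate())  # prints 200.0 (users u300..u499)
```

In this example every set is smaller than `k`, so the sketches are still exact. Once a set grows well past `k`, the sketch retains only a sample of hashes and `estimate()` becomes an approximation whose error shrinks as `k` grows.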

One of the challenges faced was that the events that reach AppsFlyer are schemaless: AppsFlyer clients are free to send any number of dimensions (e.g., “product_name” or “level_completed_num”) as part of the payload of their events. This leads to a very high number of different dimensions that the multi-tenant system needs to make sense of.
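To picture how quickly these dimensions multiply, here is a small sketch that expands schemaless events into one group of unique users per (app, dimension, value) combination. The field names `app_id` and `user_id`, the sample events, and the helpers `dimension_keys` and `group_users` are assumptions made for illustration, not AppsFlyer's actual event schema or storage layout. Exact Python sets stand in for sketches to keep the example short.

```python
from collections import defaultdict

# Hypothetical fields that identify the tenant and the user;
# every other field is treated as a client-defined dimension.
RESERVED_FIELDS = {"app_id", "user_id"}


def dimension_keys(event):
    """Yield one (app_id, dimension, value) key per client-defined dimension."""
    for name, value in event.items():
        if name not in RESERVED_FIELDS:
            yield (event["app_id"], name, str(value))


def group_users(events):
    """Collect the unique user IDs seen for every dimension key."""
    groups = defaultdict(set)
    for event in events:
        for key in dimension_keys(event):
            groups[key].add(event["user_id"])
    return groups


# Made-up events from two tenants with different, unrelated dimensions.
events = [
    {"app_id": "shop", "user_id": "u1", "event_name": "purchase", "product_name": "A"},
    {"app_id": "shop", "user_id": "u2", "event_name": "purchase", "product_name": "A"},
    {"app_id": "game", "user_id": "u3", "event_name": "level_up", "level_completed_num": 7},
]

groups = group_users(events)
for key, users in sorted(groups.items()):
    print(key, len(users))
```

Each new dimension name or value that a client sends creates another key, which is why the number of groups is open-ended across tenants.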

This article will discuss how this system was designed and engineered to provide this approximation, with the following considerations in mind:

  • Latency: every user action in the browser should update the number with sub-second latency.
  • Accuracy: the estimated number should be accurate enough for a user to rely on with confidence.
  • Multi-tenancy: the system would need to host and serve data across all of AppsFlyer’s users, requiring it to tackle the open-ended dimensional cardinality that the data inherently contains.

The core technologies used to build this system are Theta Sketches and HBase, both of which will be discussed with an overview of how they fit into the system’s architecture, and why they fit the specific problem at hand.

Counting Large Set of Unstructured Events with Theta Sketches