Extract Nested Structs without Cross Joining Data in BigQuery

As a Google Cloud consultant focused on the Data & Analytics space, I spend a lot of time with BigQuery, Google Cloud’s petabyte-scale data warehouse-as-a-service, and for the most part, I love it. It’s really powerful, there’s no infrastructure management, and it comes with all sorts of built in analytics and AI capabilities.

Because of its nature as a columnar data store, however, BigQuery SQL syntax can sometimes be non-intuitive to work with. Google recommended best-practices for query performance and cost optimization are to denormalize your data, or in other words to take data that is split across multiple tables and combine it into a single table using specialized data types like arrays and key-value stores.

Querying denormalized data may be more performant than joining data at scale but querying the data isn’t that easy if you’re used to working with data in normalized tables, which had been my experience up until the point where I joined Cloudbakers. So in this post, I’ll cover some query syntax for common and less-common denormalized data structures in BigQuery, using a mock GCP billing dataset that has the same structure as the BigQuery billing export.

Basic structs, or key-value fields, are straightforward enough, as you can simply use dot notation to select subfields:

-- service is a STRUCT field and id is one of its keys
SELECT service.id AS service_id
FROM   `gcp-sandbox.gcp_billing_demo.gcp_billing_export`
LIMIT  1000

Query results: key-value pairs flattened using dot notation

When working with arrays or repeated fields, things get a little bit more complicated. You can select individual components of an array using the array index, which can be useful if the order of elements is predetermined or doesn’t matter (you can see from the first two rows of the results below that in this case, order is not predetermined).

SELECT credits[OFFSET(0)] as credit_offset_0,
       credits[OFFSET(1)] AS credit_offset_1
FROM   `gcp-sandbox.gcp_billing_demo.gcp_billing_export`
WHERE  ARRAY_LENGTH(credits) > 1

Query results: array element selected by index

You also have the option to flatten the data using what’s called a correlated cross join. This takes any repeated field, pivots it so that each element in the array is a new row, and then joins that new tabular data with the original table, creating a flattened schema with repeated rows for every element in the original repeated field.

SELECT billing.credits,
       c.*
FROM   `gcp-sandbox.gcp_billing_demo.gcp_billing_export` billing,
       UNNEST(credits) c
WHERE  ARRAY_LENGTH(credits) > 1

Query results: array flattened by correlated cross join

This is useful if you need to flatten your data and calculate aggregate values or metrics based off the data contained in a array — for example, if you need to calculate the total credit amount per credit type for GCP consumption that was invoiced in the month of December, 2019, you would use the following query:

SELECT   c.name        AS credit_type,
         SUM(c.amount) AS total_credit_amount
FROM     `gcp-sandbox.gcp_billing_demo.gcp_billing_export` billing,
         UNNEST(credits) c
WHERE    billing.invoice.month = '201912'
GROUP BY credit_type
ORDER BY credit_type

Query results: array flattened by correlated cross join and aggregated

But what if you simply need the arrayed data to be pivoted so that it can be consumed by another system or exported into spreadsheet software like Excel or Google Sheets? Again because of its nature as a columnar data store, BigQuery doesn’t easily support syntax that would be the equivalent of something like the Pandas library’s unstack method.

As a specific example, let’s take two of the four repeated fields in the BigQuery billing export:

Project Labels
Resource Labels

If you’re not already familiar with the concept, project and **resource labels **allow you as an organization to apply business concepts and processes to GCP compute resources. Because the labels applied to your compute resources are propagated to your billing data, if you set up labels correctly and consistently, you can break down your GCP cost consumption by different environments, organizational departments, or initiatives. You can also initiate processes like chargebacks to individual departments.

Long story short, labels are or should be an important aspect of your GCP cost management strategy. Which makes it all the more frustrating that both the project labels and resource labels in the BigQuery billing export are in fields of repeated struct objects, or in other words are written to the table as arrays of key-value pairs.

If you do a correlated cross join on the project labels in this instance, because the cost field is not contained within the project label array in the same way the credit costs are associated with specific credit types within the credit array, flattening the project label data causes the cost field to be repeated once per project label. Same goes if you do a correlated cross join on the resource labels.

#sql #data-warehouse #data-analytics #bigquery #big-data #data analysis

towardsdatascience.com

Extract Nested Structs without Cross Joining Data in BigQuery