BigQuery is a fully managed, serverless data warehouse on the Google Cloud Platform infrastructure that provides scalable, cost-effective, and fast analytics over petabytes of data. It is software as a service that supports querying with standard SQL. In this article, I would like to present two main techniques to make your BigQuery data warehouse efficient and performant.

Some Theory ahead

SQL vs NoSQL: SQL databases are table-based databases, whereas NoSQL databases can be document-based, key-value pairs, and graph databases. SQL databases are vertically scalable, while NoSQL databases are horizontally scalable. SQL databases have a predefined schema, while NoSQL databases use a dynamic schema for unstructured data[2].

Row vs. Column based Databases: A row-oriented database stores all the data belonging to a table row in the same physical location; well-known examples are MySQL and MSSQL. Column-oriented databases, by contrast, do not store the individual rows next to each other, but the values of each column. This form of storage is particularly useful for analytical workloads over large amounts of data, since aggregation functions usually have to be calculated over individual columns. Well-known examples are HBase, Apache Cassandra, and Google BigQuery.

Now we come to two more important theoretical terms: OLTP and OLAP. Online transaction processing (OLTP) captures, stores, and processes data from transactions in real time, while online analytical processing (OLAP) uses complex queries to analyze aggregated historical data from OLTP systems. Columnar databases have two main drawbacks compared to row-oriented databases:

  • Materializing entire rows is expensive. To do the equivalent of a SELECT *, the engine has to join across all the column indexes to reassemble the values.
  • While INSERTs are straightforward, UPDATE and DELETE operations are complicated.

For the above reasons, true columnar databases have their main use cases in data warehousing, analytics, and other archive-type data stores, while row-structured databases are generally better suited for OLTP workloads.
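The cost of materializing whole rows also shows up directly on the bill, since BigQuery's on-demand pricing is based on the bytes of the columns a query actually scans. A minimal sketch in BigQuery standard SQL, assuming a hypothetical `myproject.shop.sales` table:

```sql
-- Scans (and bills for) every column in the table:
SELECT *
FROM `myproject.shop.sales`;

-- Scans only the single column needed for the aggregation,
-- which is exactly the access pattern columnar storage is built for:
SELECT SUM(revenue) AS total_revenue
FROM `myproject.shop.sales`;
```

Selecting only the columns you need is therefore one of the simplest performance and cost optimizations in BigQuery.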

So What Now?

First, it was important to me to present the basic theoretical background, because BigQuery is best described as a hybrid system. It is definitely a column-based system and therefore well suited for analytical purposes; this also explains why data should be denormalized, but more on that later. At the same time, BigQuery closely resembles a standard SQL data warehouse: it can be queried with standard SQL and serves primarily as a repository for data from OLTP systems, rather than for, say, image data. On the other hand, it also allows the storage of nested data structures. In this sense, BigQuery can truly be called hybrid.
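BigQuery's support for nested data structures can be sketched as follows (table and field names are illustrative, not from the article):

```sql
-- A customer row carries its orders as a repeated nested structure,
-- so no join is needed to read a customer together with their orders:
CREATE TABLE `myproject.shop.customers` (
  customer_id INT64,
  name        STRING,
  orders      ARRAY<STRUCT<order_id INT64, amount NUMERIC>>
);

-- UNNEST flattens the repeated field when an order-level view is needed:
SELECT c.name, o.order_id, o.amount
FROM `myproject.shop.customers` AS c,
     UNNEST(c.orders) AS o;
```

Nested and repeated fields like this let a single table express a one-to-many relationship that a classic SQL schema would split across two tables.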

Denormalized Data

The following sections describe the technical part of the process. First, one must think about how to build the best data schema. Rather than adopting or rebuilding traditional star or snowflake schemas, data engineers should consider the opposite: denormalization. As mentioned before, data is often taken from OLTP systems in normalized form; in BigQuery, it should be denormalized again. Here is a small basic example:
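A minimal sketch with illustrative table names: two normalized tables as they might arrive from an OLTP system, and the denormalized result that combines them.

```sql
-- Normalized source tables:
--   customers(customer_id, name, country)
--   sales(sale_id, customer_id, product, revenue)

-- Denormalized target: customer attributes are repeated on every
-- sale row, so analytical queries need no further joins.
SELECT
  s.sale_id,
  s.product,
  s.revenue,
  c.customer_id,
  c.name,
  c.country
FROM `myproject.shop.sales` AS s
JOIN `myproject.shop.customers` AS c
  ON s.customer_id = c.customer_id;
```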

Simply put, you should join tables that belong together in terms of content in your ETL/ELT process and save the result as a new table. A join wrapped in a view would also be possible in theory, but a stored table, refreshed in the transformation step once a day for example, is simply more performant. So in the example above, you would join the customer data with the sales data and save it as a new object. The challenge is to also understand the business process behind the data, so that meaningful new data objects can be identified and as few joins and transformations as possible remain for the subsequent data analysis.
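Persisting such a join as its own table, rebuilt once a day in the transformation step, might look like this (table names are illustrative; in practice the daily refresh would typically be driven by a scheduled query):

```sql
-- Recreate the denormalized table in the daily transformation step:
CREATE OR REPLACE TABLE `myproject.shop.customer_sales` AS
SELECT
  s.sale_id,
  s.product,
  s.revenue,
  c.name    AS customer_name,
  c.country
FROM `myproject.shop.sales` AS s
JOIN `myproject.shop.customers` AS c
  ON s.customer_id = c.customer_id;
```

Because the join is computed once at load time rather than on every query, downstream analysis reads a single flat table.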

#big-data #data-engineering #google #bigquery

How to build efficient and performant Data Structures in BigQuery