In Part 6 of the series, “Modernisation of a Data Platform”, we will focus a little more on BigQuery’s key concepts that are essential for designing a data warehouse (DWH).

In this part, we will look at table design in BigQuery using different methods, load a covid19_tweets dataset and run a query to analyse the data.

Creating a Schema:

We can create a schema in BigQuery either while migrating data from an existing data warehouse or while ingesting data into BigQuery from various sources, whether in the cloud or on-premises.

Besides manual schema definition, BigQuery also offers an option to auto-detect the schema.

How does this auto-detect work?

BigQuery compares the first row of the input file with a representative sample of around 100 records taken from row 2 onwards. If the data types of the sampled rows differ from those in the first row, BigQuery treats the first row as a header and uses its values as the column names. The user only has to enable auto-detect for the schema to be created automatically while the load happens.
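As a minimal sketch of enabling auto-detect with the Python client library, assuming the covid19_tweets CSV has been staged in a Cloud Storage bucket (the project, dataset, table and bucket names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Destination table; project, dataset and table names are placeholders.
table_id = "my-project.covid_analysis.covid19_tweets"

# Enable schema auto-detection and skip the header row.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Hypothetical GCS path to the covid19_tweets CSV file.
uri = "gs://my-bucket/covid19_tweets.csv"

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # Wait for the load to complete.

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows with schema: {table.schema}")
```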

Datatypes in BigQuery:

While most of the data types are standard ones such as INTEGER, FLOAT, NUMERIC and BOOLEAN, one special data type that we need to discuss is STRUCT.

This data type is used for nested and repeated fields. A good example of a STRUCT is an address. An address normally has multiple sub-fields such as is_active, address line 1, address line 2, town, city, post code and number of years at the address.

All these fields can be nested under the parent field ‘addresses’. While scalar fields are usually ‘NULLABLE’ or ‘REQUIRED’, a STRUCT that holds several values per row (for example, multiple addresses for one user) is defined with the mode ‘REPEATED’, which makes it an array of STRUCTs.
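As an illustrative sketch, a nested, repeated field like this can be declared with the Python client library; the field names below mirror the address example and are assumptions rather than a prescribed schema:

```python
from google.cloud import bigquery

# A repeated STRUCT (RECORD) field holding one or more addresses per user.
address_field = bigquery.SchemaField(
    "addresses",
    "RECORD",
    mode="REPEATED",
    fields=[
        bigquery.SchemaField("is_active", "BOOLEAN"),
        bigquery.SchemaField("address_line_1", "STRING"),
        bigquery.SchemaField("address_line_2", "STRING"),
        bigquery.SchemaField("town", "STRING"),
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("post_code", "STRING"),
        bigquery.SchemaField("number_of_years", "INTEGER"),
    ],
)

schema = [
    bigquery.SchemaField("user_id", "INTEGER", mode="REQUIRED"),
    address_field,
]
```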

Creating Tables & Managing Accesses:

The easiest way to create a table in BigQuery is through the Cloud Console. The UI is very friendly, and users can navigate to the BigQuery console to create tables in a dataset.

Alternatively, there is a REST API service that can be used to insert tables with a specific schema into the dataset.
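For example, a table can be created programmatically with the Python client library, which wraps that REST API; the project, dataset, table and column names here are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers.
table_id = "my-project.covid_analysis.users"

schema = [
    bigquery.SchemaField("user_id", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("registration_ts", "TIMESTAMP"),
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # Issues a tables.insert API call.
print(f"Created table {table.full_table_id}")
```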

BigQuery provides options to restrict access at the dataset level. There is also a feature (in beta as this article is being published) to grant access at the table or view level. Access can be granted with roles such as data viewer, data editor, data owner and admin.
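As a sketch of dataset-level access with the Python client library (the dataset name and email address are placeholders; the legacy “READER” dataset role corresponds to data viewer access):

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.covid_analysis")  # Placeholder dataset.

# Add a viewer entry for a hypothetical user.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",
    )
)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ["access_entries"])
```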

BigQuery allows users to copy a table, delete a table, alter a table’s expiration time, update its description and so on. All these actions are possible through the Console, the API, the bq command-line tool or the client libraries.
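A short sketch of a few of these actions with the Python client library; the table identifiers and the seven-day expiration are illustrative assumptions:

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table identifiers.
source_id = "my-project.covid_analysis.covid19_tweets"
dest_id = "my-project.covid_analysis.covid19_tweets_backup"

# Copy the table and wait for the job to finish.
copy_job = client.copy_table(source_id, dest_id)
copy_job.result()

# Set the copy to expire seven days from now and update its description.
table = client.get_table(dest_id)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=7)
table.description = "Backup copy of the covid19_tweets table"
client.update_table(table, ["expires", "description"])

# client.delete_table(source_id)  # Uncomment to drop the source table.
```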

What is Partitioning & Clustering?

Query performance is paramount in BigQuery, and one of the key features that enables it is table partitioning. Dividing large tables into smaller partitions reduces the amount of data scanned, enhancing query performance and returning results faster.

Partitioning can be done by the following methods (a code sketch follows the list):

· Ingestion time — based on the time data arrives in BigQuery

Ex: If data is loaded into BigQuery daily, a partition is created for each day. A pseudo-column called “_PARTITIONTIME” records the date of the load. By default all partitions share the same schema, but BigQuery provides options to change the schema.

· Date/timestamp — based on a date or timestamp column of the table

Ex: A partition can be created based on the timestamp at which a row was created. In a users table, every registered user has a registration timestamp, and we can partition on that column at a daily or even hourly granularity.

· Integer range — based on ranges of an integer column

Perhaps the simplest one, where partitions are based on IDs. Ex: customer IDs 1 to 100 go into partition 1, and so on…
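The sketch below shows two of these methods with the Python client library: timestamp partitioning on a hypothetical registration column and integer-range partitioning on a customer ID column. Table names, column names and the ID range are assumptions for illustration:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Date/timestamp partitioning on a registration timestamp column.
users = bigquery.Table(
    "my-project.covid_analysis.users",  # Placeholder table id.
    schema=[
        bigquery.SchemaField("user_id", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("registration_ts", "TIMESTAMP"),
    ],
)
users.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="registration_ts",  # Omit `field` to partition by ingestion time instead.
)
client.create_table(users)

# Integer-range partitioning on a customer ID column.
customers = bigquery.Table(
    "my-project.covid_analysis.customers",
    schema=[bigquery.SchemaField("customer_id", "INTEGER", mode="REQUIRED")],
)
customers.range_partitioning = bigquery.RangePartitioning(
    field="customer_id",
    range_=bigquery.PartitionRange(start=1, end=10001, interval=100),
)
client.create_table(customers)
```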

If we do not want to decide exactly how a table should be split, we can use clustering to organise its data instead. Both improve query performance, but partitioning is more specific and driven by the user’s choice of column, while clustering sorts the data within the table (or within each partition) by the chosen columns.
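As a final sketch, a table can be partitioned and clustered together; the clustering column (country) and the other identifiers below are illustrative assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Cluster a partitioned users table by a hypothetical country column.
table = bigquery.Table(
    "my-project.covid_analysis.users_clustered",  # Placeholder table id.
    schema=[
        bigquery.SchemaField("user_id", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("country", "STRING"),
        bigquery.SchemaField("registration_ts", "TIMESTAMP"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="registration_ts")
table.clustering_fields = ["country"]  # Up to four clustering columns are allowed.
client.create_table(table)
```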

#modernisation #bigquery #gcp #data-platforms #data-science
