There are a few simple data quality checks you can build into your Data Warehouse process to detect data inconsistencies caused by errors in your ETL/ELT pipelines or by connection failures. It is always a good idea to measure data quality and to detect errors quickly; otherwise, consumers may lose trust in the data. The following article shows some easy-to-adapt data consistency checks between your source system and your target system, for example a Data Warehouse or a Data Lake.
Data quality measurement process
A very simple check for detecting deviations between source and target systems is to count the rows of the tables in both systems. Of course, this check only makes sense if the table structure is identical in the source and target system, which is typically the case when it is realized via an ELT process.
Example: counting rows in BigQuery:
SELECT COUNT(1) FROM `double-silo-282323.Test_Data.table_1`;
SELECT COUNT(1) FROM `double-silo-282323.Test_Data.table_2`;
You can run these queries manually in both systems, but an automated approach is a better idea: for example, load the count result from the source system into your Data Warehouse/Lake with your ETL tool and create reports like the one shown in the figure below:
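The comparison step of such an automated approach can be sketched in a few lines of Python. This is a minimal sketch, assuming you have already fetched per-table row counts from both systems into plain dictionaries (the table names and counts below are made up for illustration):

```python
def compare_row_counts(source_counts, target_counts):
    """Compare per-table row counts from source and target systems.

    Returns a list of (table, source_count, target_count) tuples
    for every table whose counts do not match.
    """
    mismatches = []
    for table, src in source_counts.items():
        tgt = target_counts.get(table)  # None if the table is missing in the target
        if tgt != src:
            mismatches.append((table, src, tgt))
    return mismatches

# Counts as they might come back from the source system and the warehouse
source_counts = {"table_1": 1200, "table_2": 534}
target_counts = {"table_1": 1200, "table_2": 530}  # 4 rows lost in the pipeline

print(compare_row_counts(source_counts, target_counts))
# → [('table_2', 534, 530)]
```

The mismatch list can then feed a report table or a dashboard in the warehouse itself.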
You can also improve on this by triggering alarms, e.g. with a cloud function that sends an email to your data engineers.
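The alerting part can be sketched with Python's standard library alone; the sender, recipient, and SMTP host below are placeholders, and the actual sending would live in your cloud function:

```python
import smtplib
from email.message import EmailMessage

def build_alert(mismatches):
    """Build an email alerting the data engineers about count mismatches."""
    msg = EmailMessage()
    msg["Subject"] = f"Data quality alert: {len(mismatches)} table(s) out of sync"
    msg["From"] = "dq-monitor@example.com"      # placeholder sender
    msg["To"] = "data-engineers@example.com"    # placeholder recipients
    lines = [f"{table}: source={src}, target={tgt}"
             for table, src, tgt in mismatches]
    msg.set_content("Row count mismatches detected:\n" + "\n".join(lines))
    return msg

def send_alert(msg, host="smtp.example.com"):
    """Send the alert; only call this when mismatches actually exist."""
    with smtplib.SMTP(host) as server:
        server.send_message(msg)

mismatches = [("table_2", 534, 530)]
alert = build_alert(mismatches)
print(alert["Subject"])
# → Data quality alert: 1 table(s) out of sync
```

Separating message construction from sending keeps the alert content easy to test without an SMTP server.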
To check whether the content of the tables is identical, you will need techniques other than counting rows. One example is the use of key figures: you can sum up certain columns, such as price or amount, and check whether the sums in the source and target tables match. The challenge is to find good key figures; however, you can run this check with very little computing power, using nothing more than a simple SUM() function in SQL.
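The key-figure check can be demonstrated end to end with SQLite standing in for both systems. This is only a sketch; the `orders` table and its `amount` column are invented for illustration:

```python
import sqlite3

# In-memory databases standing in for the source system and the warehouse
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

for conn in (source, target):
    conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")

source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 3), (2, 7), (3, 5)])
# The target is missing one row, e.g. due to a connection failure
target.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 3), (2, 7)])

def key_figure(conn):
    """Sum a key figure (here: amount) as a cheap content checksum."""
    return conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(key_figure(source) == key_figure(target))
# → False
```

Note that equal sums do not prove equal content (two different rows can sum to the same value), which is why choosing good key figures matters.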