Use Case

For data governance purposes, customers often want to store the profile metadata generated by Cloud Dataprep Premium when jobs are run. In this scenario, a customer wants to retain the profiling metadata in BigQuery for reporting purposes.

This article describes how to use webhooks and Cloud Functions to automatically publish Dataprep-generated profile information to BigQuery, with Google Cloud Storage (GCS) as an intermediate stop.

We will build the following automated process:

  1. Run a Cloud Dataprep job with profiling enabled.
  2. In Cloud Dataprep, invoke a webhook that calls a Cloud Function.
  3. The Cloud Function calls the GET profile results API.
  4. The Cloud Function saves the API response to GCS.
  5. The Cloud Function triggers a separate Cloud Dataprep job to process the JSON API response and publish a BigQuery table.
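
As a rough illustration, steps 3 to 5 could be wired together inside the Cloud Function roughly as follows. This is only a sketch: the function name, the three injected helpers (`fetch_profile`, `save_to_gcs`, `run_publish_job`), the `jobId` field in the webhook body, and the `profiles/<id>.json` object name are all hypothetical placeholders, not names taken from Cloud Dataprep.

```python
import json
from typing import Any, Callable, Dict


def handle_job_webhook(
    payload: Dict[str, Any],
    fetch_profile: Callable[[str], Dict[str, Any]],
    save_to_gcs: Callable[[str, str], None],
    run_publish_job: Callable[[str], None],
) -> str:
    """Run steps 3-5 for one finished job and return the GCS object name."""
    # Assumption: the webhook body carries the finished job's ID as "jobId".
    job_id = str(payload["jobId"])

    # Step 3: call the GET profile results API (helper is injected).
    profile = fetch_profile(job_id)

    # Step 4: save the raw API response to GCS as a JSON document.
    object_name = f"profiles/{job_id}.json"  # hypothetical naming scheme
    save_to_gcs(object_name, json.dumps(profile))

    # Step 5: trigger the second Dataprep job that reads the JSON
    # and publishes the BigQuery table.
    run_publish_job(object_name)
    return object_name
```

Passing the three helpers in as arguments keeps the control flow testable without network access; in a real deployment they would wrap the Dataprep API and a GCS client.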

If you don’t already have access to Cloud Dataprep Premium, and you want to try this yourself, you can sign up here.

Step-by-step instructions

Step 1: Understand the API output containing the profile metadata

Whenever you run a job with profiling enabled, Cloud Dataprep generates metadata about the profiling results. There are three types of profile metadata information that Cloud Dataprep will output:

  1. profilerRules: Contains information about each data quality (DQ) rule and the number of passing and failing rows for each rule.
  2. profilerTypeCheckHistograms: Contains information about the number of missing, mismatched, and valid records for each column in your dataset.
  3. profilerValidValueHistograms: Contains information about min/max/median values for numeric or date columns, and the top 20 unique values by count for string columns.
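
To make the three metadata types concrete, the short sketch below pulls them out of a parsed API response. It assumes the three names appear as top-level keys of the JSON document and treats their contents as opaque, since the inner structure is not described here; the function name `split_profile_sections` is hypothetical.

```python
from typing import Any, Dict

# The three profile metadata types listed above.
SECTION_KEYS = (
    "profilerRules",
    "profilerTypeCheckHistograms",
    "profilerValidValueHistograms",
)


def split_profile_sections(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the three profile sections; absent ones map to None.

    Assumption: each section is a top-level key of the response.
    """
    return {key: response.get(key) for key in SECTION_KEYS}
```

Each section could then be written to its own file or table if separate reporting on rules and histograms is wanted.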

These profile results appear in the Cloud Dataprep UI and can also be retrieved through an API call. To publish the profile metadata to BigQuery, you need to call the API that returns the JSON representation of the profile information.

You can read about the API call at this link: https://api.trifacta.com/dataprep-premium/index.html#operation/getProfilingInformationForJobGroup
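
A minimal sketch of preparing that request with the Python standard library might look like this. The base URL, the `/v4/jobGroups/{id}/profile` path, and the bearer-token header are assumptions that should be checked against the API reference linked above; `build_profile_request` is a hypothetical helper name.

```python
import urllib.request

# Assumption: base URL of the Dataprep API; confirm in the API reference.
BASE_URL = "https://api.clouddataprep.com"


def build_profile_request(job_id, access_token, base_url=BASE_URL):
    """Build (but do not send) the GET request for a job's profile results."""
    # Assumption: the profile endpoint path for a job group.
    url = f"{base_url}/v4/jobGroups/{job_id}/profile"
    request = urllib.request.Request(url, method="GET")
    # Assumption: the API accepts an access token as a bearer token.
    request.add_header("Authorization", f"Bearer {access_token}")
    return request
```

The request can then be sent with `urllib.request.urlopen(request)` and the response body parsed with `json.load`.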


Publish Cloud Dataprep Profile Results to BigQuery