1628824885
Want everyone in your organization to be able to easily find the data they need, while minimizing overall risk and ensuring regulatory compliance? In this episode of BigQuery Spotlight, we’ll examine BigQuery data governance so you can ensure your data is secure. We’ll also go over Cloud Data Loss Prevention and an open-source framework for data quality validation.
Timestamps:
0:00 - Intro
0:27 - What is data governance?
1:00 - Understanding your data
1:27 - Categorize your data
1:58 - Using Data Catalog
2:44 - Using DLP
3:22 - Building an onboarding pipeline
4:06 - Configuring access policies
5:28 - Column level access
6:14 - Row level access
7:02 - Ongoing monitoring
7:44 - Building a data quality pipeline
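The row- and column-level access controls covered in this episode are ultimately configured with policy tags and row access policies on your tables. As a hedged sketch of the row-level piece only, the snippet below creates a row access policy through the Python client; the project, dataset, table, column, and group names are placeholders, not anything from the video.

# Hedged sketch: restrict a table so one group only sees US rows.
# All names below are assumptions made up for illustration.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

ddl = """
CREATE OR REPLACE ROW ACCESS POLICY us_only
ON `my-project.my_dataset.orders`
GRANT TO ('group:sales-us@example.com')
FILTER USING (region = 'US')
"""
client.query(ddl).result()  # members of the group now see only region = 'US' rows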
#bigquery #database
1627010673
How does BigQuery’s internal storage work? In this episode of BigQuery Spotlight, we share how BigQuery stores data so you can make informed decisions about optimizing your BigQuery storage. We’ll also talk about partitioning and clustering, and how clustering allows for efficient lookups.
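As a hedged illustration of partitioning and clustering, the sketch below creates a day-partitioned table clustered by one column using the google-cloud-bigquery Python client; the project, dataset, table, and column names are assumptions for illustration only.

# Minimal sketch: a date-partitioned, clustered table via the Python client.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my-project.my_dataset.sales_events", schema=schema)
# Partition by day on the event timestamp so queries can prune old data.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
# Cluster by customer_id so lookups on that column scan fewer blocks.
table.clustering_fields = ["customer_id"]

client.create_table(table)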
Storage optimization docs → https://goo.gle/2V0q7gj
#bigquery #database
1626402254
What are jobs in BigQuery and how does the reservation model work? In this episode of BigQuery Spotlight, we’ll review jobs, reservations, and best practices for managing workloads in BigQuery. We’ll also walk you through the difference between BI Engine reservations and standard reservations, so you can decide what will work best for you.
#bigquery #developer
1625849880
This Edureka video on ‘Google BigQuery Tutorial’ will give you an overview of the BigQuery service in Google Cloud Platform and will also help you understand important concepts like its architecture, features, pricing, etc., with practical implementation. The following pointers are covered in this Google BigQuery Tutorial:
#cloud #bigquery #google
1625687700
This video explains how to upload data from a Pandas DataFrame to BigQuery and then retrieve that data from Google BigQuery into a Business Intelligence (BI) dashboard, using Python as the backend and SQL for data extraction from Google Cloud.
So the pipeline is very simple: Pandas with Python, then send the data to Google BigQuery, and then set up a BI dashboard that retrieves it with SQL. That’s all.
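For orientation, here is a minimal sketch of the upload step of that pipeline using pandas-gbq; the project and dataset/table names are assumptions, and the video itself may use a slightly different call.

# Hedged sketch: push a small DataFrame into a BigQuery table.
import pandas as pd
import pandas_gbq

df = pd.DataFrame({"product": ["a", "b"], "sales": [10, 20]})

pandas_gbq.to_gbq(
    df,
    destination_table="my_dataset.sales",  # assumed dataset.table name
    project_id="my-gcp-project",            # assumed GCP project id
    if_exists="append",                     # append to the table if it already exists
)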
The content of the video is:
0:04 - Useful videos before starting (to learn more about concat CSV files and how to upload Pandas data to BigQuery via Google API).
0:49 - Starting tutorial.
0:55 - Step 1: Preparing data for project.
1:14 - Step 2: Upload a Pandas DataFrame to BigQuery via the API using Python.
3:08 - Step 3: Checking the SQL commands in BigQuery.
3:57 - Step 4: Set up BigQuery credentials (service account) for the BI framework.
6:50 - Step 5: Retrieve BigQuery data using SQL in the BI dashboard.
7:41 - Step 6: Create your first dashboard from the data retrieved from BigQuery.
8:31 - Step 7: Add new data to BigQuery and update the BI dashboard quickly with SQL.
9:20 - Result: updated and old charts in the BI dashboard.
As the BI dashboard for this tutorial I used Mode Analytics: https://mode.com
I hope this tutorial will be useful for data analysts and data scientists across the world.
Wishes! Vytautas.
#python #bigquery #sql
1625195194
Read the blog → https://goo.gle/363xbLB
Managed tables documentation → https://goo.gle/3qDrldk
External tables documentation → https://goo.gle/3h7VLBq
What is a BigQuery table and how does it work? In this episode of BigQuery Spotlight, we’ll review the different types of tables in BigQuery, including managed tables, external tables, and virtual tables with logical and materialized views. We’ll also walk you through the use of views, which are virtual tables defined by a SQL query.
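To make the two kinds of views concrete, here is a hedged sketch that creates a logical view and a materialized view with the Python client; the project, dataset, table, and column names are placeholders rather than anything from the episode.

# Hedged sketch: a logical view (re-run on every read) and a materialized
# view (precomputed and refreshed by BigQuery) over an assumed orders table.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE OR REPLACE VIEW `my-project.my_dataset.recent_orders` AS
SELECT order_id, customer_id, amount
FROM `my-project.my_dataset.orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
""").result()

client.query("""
CREATE MATERIALIZED VIEW `my-project.my_dataset.daily_totals` AS
SELECT order_date, SUM(amount) AS total_amount
FROM `my-project.my_dataset.orders`
GROUP BY order_date
""").result()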
Views documentation → https://goo.gle/3jqmqLg
Querying external data video → https://goo.gle/364Pm3q
Authorized views video → https://goo.gle/3y7KG92
Watch more episodes of BigQuery Spotlight → https://goo.gle/BQSpotlight
Subscribe to Google Cloud Tech → http://goo.gle/GoogleCloudTech
#BigQuerySpotlight
Product: BigQuery; fullname: Leigha Jarett;
#bigquery #sql
1625145000
In this walkthrough, I will use OpenStreetMap data from BigQuery. It’s great that Google published it for free in their public datasets, so you can easily query geo information with SQL [1]. Google uploaded the dataset once and has not updated it since. If you need newer data, the OpenStreetMap API in combination with some simple Python code might be a possible solution for your problem.
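As a hedged example of what such a query can look like from Python, the snippet below looks up hospitals in the public OpenStreetMap dataset; the table and column names follow the public dataset’s layout as I understand it and should be treated as assumptions.

# Hedged sketch: find a few hospitals in the OpenStreetMap public dataset.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  osm_id,
  (SELECT value FROM UNNEST(all_tags) WHERE key = 'name' LIMIT 1) AS name
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE EXISTS (
  SELECT 1 FROM UNNEST(all_tags) AS tag
  WHERE tag.key = 'amenity' AND tag.value = 'hospital'
)
LIMIT 10
"""

for row in client.query(query).result():
    print(row.osm_id, row.name)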
OpenStreetMap.org is an international project founded in 2004 with the goal of creating a free map of the world. For this purpose, the project collects data about roads, railroads, rivers, forests, houses and much more worldwide [2].
Although the data from OpenStreetMap is free, you can still derive real business advantages from it. Some examples that I have helped to develop, or at least accompanied, are:
#big-data #open-data #data-science #bigquery #openstreetmap #working with openstreetmap data
1624860720
These days, data science practitioners find themselves relying on cloud platforms more and more, either for data storage, cloud computing, or a mix of both. This article will demonstrate how to leverage Cloud Run in GCP to access a dataset stored on Google BigQuery, apply a quick transformation, and present the result back to users through a Flask-RESTful API.
Cloud Run is a service that allows you to construct and deploy containers that can be accessed via HTTP requests. Cloud Run is scalable and abstracts away infrastructure management so you can get things up and running quickly.
What is a container you ask? A simple way to think about containers is that they are similar to a Virtual Machine (VM), but much smaller in scale and scope.
With a VM, you typically have a virtual version of an entire OS running (such as a Windows PC running a Linux VM through something like VirtualBox). This Linux VM will typically have a GUI, a web browser, word processing software, IDEs, and a whole host of other software accompanying it.
With containers however, you can have the minimal amount of software necessary to perform your desired task, making them compact and efficient, easy to create, destroy and deploy on the fly. For example, the container in this article will just have Python 3.8 installed and nothing else.
Cloud Run is well-suited to deploying stateless containers. For a good insight into stateful vs stateless containers, take a look at this article.
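To give a feel for the kind of service the article describes, below is a minimal sketch of a Flask-RESTful endpoint that runs a BigQuery query and returns JSON, roughly the shape of container you would deploy to Cloud Run; the public dataset, route, and query are my own choices, not necessarily the ones used in the article.

# Hedged sketch of a stateless Cloud Run service: Flask-RESTful + BigQuery.
import os

from flask import Flask
from flask_restful import Api, Resource
from google.cloud import bigquery

app = Flask(__name__)
api = Api(app)
client = bigquery.Client()

class TopNames(Resource):
    def get(self):
        # Small transformation over a public dataset: top 5 baby names in 2010.
        query = """
            SELECT name, SUM(number) AS total
            FROM `bigquery-public-data.usa_names.usa_1910_current`
            WHERE year = 2010
            GROUP BY name
            ORDER BY total DESC
            LIMIT 5
        """
        rows = client.query(query).result()
        return [{"name": r.name, "total": r.total} for r in rows]

api.add_resource(TopNames, "/top-names")

if __name__ == "__main__":
    # Cloud Run sends traffic to the port given by the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))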
#google-cloud-run #bigquery #flask #google-cloud-platform #docker #cloud
1624255440
This is a short extension to my previous story, where I described how to incrementally export data from Datastore to BigQuery. Here, I discuss how to extend that solution to the situation where you have Datastores in multiple projects. The goal remains the same: we would like to have the data in BigQuery.
Overall, the problem can be expressed with the following diagram:
Sketch of the architecture (by author)
The Dataflow process can live either in one of the source projects or in a separate project; I will put it in a separate project. The results can be stored in a BigQuery dataset located either in the same project as the Dataflow process or in another project.
Let’s begin with the generalization. First, I have extended the config file with two new fields: SourceProjectIDs, which is nothing more than a list of source GCP projects, and Destination, which defines where the output BigQuery dataset lives.
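The story does not reproduce the full config here, so the following is only a hedged sketch of what a config with SourceProjectIDs and Destination could look like and how the pipeline might iterate over it; every field name beyond those two is an assumption.

# Hedged sketch: a JSON config with the two new fields and a loop over sources.
import json

config = {
    "SourceProjectIDs": ["datastore-project-a", "datastore-project-b"],  # assumed ids
    "Destination": {
        "ProjectID": "analytics-project",   # where the BigQuery dataset lives
        "Dataset": "datastore_export",
    },
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# The export/Dataflow job would then run once per source project.
dest = config["Destination"]
for source_project in config["SourceProjectIDs"]:
    print(f"export Datastore entities from {source_project} "
          f"into {dest['ProjectID']}.{dest['Dataset']}")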
#data-engineering #gcp #bigquery #serverless #python
1623756660
In the world of Big Data, data visualization tools and techniques are essential for analyzing large amounts of information and making data-driven decisions, as data is increasingly used for important management decisions. So there is a trend away from gut feeling and emotional decisions towards rational choices that are made based on numbers. Therefore, reports and visualizations have to be easily understood and meaningful.
It is increasingly beneficial for professionals to be able to use data to make decisions, and to use visuals to tell stories that communicate how data informs questions of person, subject, time, place, and method [1]. In the area of Big Data, visualization comes with new possibilities and challenges due to the huge amounts of data. Therefore, new visualization techniques had to be created in order to make these data volumes more tangible for the user.
In the examples of new visualization possibilities in the area of Big Data that follow below, I have used Google’s BigQuery and Data Studio. For the free tier of BigQuery, you can simply register and use the public datasets here [2], which definitely fall under the Big Data label. Data Studio is free anyway and a great alternative to MS Power BI, Qlik, and other BI tools. Since you get a whole scalable data warehouse technology and the BI layer for free, I find Google perfectly suited as a sandbox for your first steps in the field of Big Data visualization.
Here are a few examples of the visualizations that I have used and seen most often in the field of representing Big Data.
TreeMaps
A tree map or tile chart is used to visualize hierarchical structures, which are represented by nested rectangles. In this way, size ratios can be vividly displayed by making the area of each rectangle proportional to the size of the data unit it represents.
#big-data #data-studio #business-intelligence #bigquery #data-visualisation
1623713580
Extend the capabilities of Sagemaker Studio container images with new libraries.
In the following post, you will learn how to extend the Sagemaker Studio Spark container image to incorporate additional libraries and interact with Google Cloud Services such as BigQuery. We will then create a notebook to retrieve data from a BigQuery table using Amazon Sagemaker Studio.
On December 3, 2019, AWS introduced Amazon SageMaker Studio as The First Fully Integrated Development Environment For Machine Learning. According to AWS, Amazon SageMaker helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.
Amazon SageMaker Studio lets you manage your entire ML workflow, providing features that improve the overall ML engineering experience. It offers SageMaker Notebooks to let you easily create and share Jupyter notebooks without having to manage infrastructure; SageMaker Experiments to organize, track, and compare ML training, model evaluation, and data processing jobs run via SageMaker Processing; Amazon SageMaker Debugger to analyze complex training issues and receive alerts; and SageMaker Autopilot to build models automatically with full control and visibility.
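Once the extended Studio image has the BigQuery client library installed and a service-account key available, the retrieval step could look roughly like the sketch below; the key path, project, and table names are assumptions, not the exact ones used in the post.

# Hedged sketch: pull a BigQuery table into pandas from a SageMaker notebook.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/home/sagemaker-user/gcp-key.json"  # assumed location of the mounted key
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

df = client.query(
    "SELECT * FROM `my-project.my_dataset.my_table` LIMIT 1000"
).to_dataframe()
df.head()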
#sagemaker #aws #bigquery #apache spark
1623207215
If you’re using Cloud Firestore, chances are at some point you want to run some queries on your Firestore data. While you can do that in Firestore directly, there might be times you want to do more advanced analyses that could more easily be performed using something like SQL queries. The Export Collections to BigQuery Extension provides you with an easy way to do just that.
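As a hedged illustration, once the extension is mirroring a collection into BigQuery you can query it like any other table; the dataset and table names below follow the extension’s usual changelog/latest naming but depend on how you configure it, so treat them as assumptions.

# Hedged sketch: read the latest state of an exported Firestore collection.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT document_name, data
FROM `my-project.firestore_export.users_raw_latest`
LIMIT 20
"""
for row in client.query(query).result():
    print(row.document_name, row.data)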
Documentation for all Firebase Extensions → https://goo.gle/3oi53Mg
More about the Export Collections to BigQuery Extension → https://goo.gle/3xhwcTN
Catch more from Meet an Extension → http://goo.gle/meet-an-extension
Subscribe to the Firebase channel → https://goo.gle/Firebase
#bigquery #cloud #firestore #firebase
1622460600
BigQuery is a fully managed, serverless data warehouse on the Google Cloud Platform infrastructure that provides scalable, cost-effective and fast analytics over petabytes of data. It is a managed service that supports queries using standard SQL. In this article, I would like to mention two main techniques to make your BigQuery data warehouse efficient and performant.
SQL vs. NoSQL: SQL databases are table-based, whereas NoSQL databases can be document stores, key-value stores, or graph databases. SQL databases are vertically scalable, while NoSQL databases are horizontally scalable. SQL databases have a predefined schema, while NoSQL databases use a dynamic schema for unstructured data [2].
Row vs. column-based databases: A row-structured database stores data belonging to specific table rows in the same physical location; famous examples are MySQL and MSSQL. Column-oriented databases, in contrast, do not store the individual rows next to each other but the columns. This form of storage is particularly useful for analytical processes involving large amounts of data, since aggregation functions often have to be calculated over individual columns. Well-known examples are HBase and Google BigQuery.
Now we come to the next two important theoretical terms: OLTP and OLAP. Online transaction processing (OLTP) captures, stores, and processes data from transactions in real time, while online analytical processing (OLAP) uses complex queries to analyze aggregated historical data from OLTP systems. Where columnar databases have problems compared to row-structured databases is with transactional workloads: writing or updating a single record touches many separate column blocks, which makes frequent small inserts and updates comparatively expensive.
For the above reasons, true columnar databases have their main use cases in data warehousing, analytics, and other archive-type data stores, while row-structured databases are generally better suited for OLTP workloads.
It was important for me to first present the basic theoretical background, because you have to know that BigQuery can best be described as a hybrid system. It is definitely a column-based system and therefore more suitable for analytical purposes. This is also important for understanding why data should be denormalized, but more on that later. In addition, BigQuery is quite similar to a standard SQL data warehouse, since it can be queried with standard SQL and serves more as a repository for data from OLTP systems (rather than, for example, image data or similar), while on the other hand it allows storage of nested data structures. Therefore, BigQuery can truly be called hybrid.
The following sections will describe the technical part behind the whole process. Firstly, one must think about how best to build the data schema. Rather than adopting or redeveloping traditional Star or Snowflake schemas, data engineers should look at the opposite: denormalization. As mentioned before, data is often taken from OLTP systems and normalized; in BigQuery, the data should be denormalized again. Here is a small basic example:
Simply put, you should join tables in your ETL/ELT process that belong together in terms of content and save them as a new table. A join inside a view would theoretically also be possible, but a stored table, refreshed in the transformation step once a day for example, is simply more performant. So in the example above, you would join the customer data with the sales data and save it as a new object. The challenge here is to also understand the business process in the background, so that meaningful new data objects can be identified and as few joins and transformations as possible remain for the subsequent data analysis process.
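As a hedged sketch of that denormalization step, the query below joins customers and sales once in the ELT process and persists the result as its own table, which could then be refreshed once a day; the project, dataset, and column names are placeholders.

# Hedged sketch: persist a denormalized customer/sales table.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE OR REPLACE TABLE `my-project.dwh.customer_sales` AS
SELECT
  c.customer_id,
  c.customer_name,
  c.country,
  s.order_id,
  s.order_date,
  s.amount
FROM `my-project.dwh.customers` AS c
JOIN `my-project.dwh.sales` AS s
  ON s.customer_id = c.customer_id
""").result()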
#big-data #data-engineering #google #bigquery
1622443244
Learn the basics of Google BigQuery in this introduction as well as ideal use cases and best practices.
It is incredible to see how much businesses rely on data today. 80% of business operations are running in the cloud, and almost 100% of business-related data and documents are now stored digitally. In the 1960s, money made the world go around, but in today’s markets, “Information is the oil of the 21st century, and analytics is the combustion engine.” (Peter Sondergaard, 2011)
Data helps businesses gain a better understanding of processes, improve resource usage, and reduce waste; in essence, data is a significant driver to boosting business efficiency and profitability.
This reliance on data isn’t without challenges though. A business can have large data warehouses and no efficient way of processing the data in them. There is also the challenge of sorting valuable data from noise, primarily when you collect data from public sources. Amassing data is meaningless without the tools and means to process, analyze, and act on it. So the important questions are: How can you make this process painless and how do you become a successful data-driven company? The answer to both lies in Google BigQuery.
#bigquery #big-data #google
1621528252
With the mission of accelerating data-powered innovation for our customers, Google Cloud has always put data first. Recognizing that various organizations within Google have robust catalogs of data available for public or commercial use, we’re delighted to introduce a more unified view of those programs: Google Cloud datasets solutions. Building upon the trends we’re seeing across businesses of every size, our datasets solutions highlight the importance of high-value, curated data assets in strengthening and accelerating decision-making.
Building upon the success of our existing Public Datasets Program, we’ve expanded the aperture to include commercial datasets, synthetic datasets, and first-party Google data assets that can be used to increase the value of analytics and AI initiatives. Since its launch in 2016, the Google Cloud Public Datasets Program has provided a catalog of curated public data assets in optimized formats on BigQuery and Cloud Storage in partnership with a number of data providers including the National Oceanic and Atmospheric Administration (NOAA), National Institutes of Health (NIH), and the United States Census Bureau. Their data supports the analytics workloads of many industries; for example, NOAA’s severe storm event details public dataset can be JOIN’d to a retailer’s private inventory dataset to better understand the impact severe weather has on sales. Another example is how property insurers can use weather data insights to inform policy pricing. These are but two of hundreds of examples of what’s possible when cross-pollinating data from previously orthogonal domains.
In adding commercial, synthetic, and first-party data to the program, we hope to further enhance our customers’ ability to unearth unique insights through data analytics and artificial intelligence. What’s more, datasets made available through the catalogs from Earth Engine and Kaggle are available to those who wish to discover and take advantage of them.
To support our customers, we are also announcing an open-source reference architecture for dataset onboarding so that even customers who do not yet have their private datasets on Google Cloud can begin their analytics journey. Learn more about this work and how you can utilize the same architecture for your data onboarding on our Developers & Practitioners blog.
With time, our goal is to grow each corpus of data across these various vectors to increase utility for our customers. We view it as imperative to expand our program to include more than simply public data. As we grow our program with new datasets and solutions, we’ll continue to post regular updates on our datasets solution page, so be sure to check it out.
#bigquery #ai & machine learning #public datasets #ai