In this post, I will talk about Google’s BigQuery service for big data analysis.
BigQuery is a serverless, highly scalable, cost-effective, enterprise-grade modern data warehouse offering on Google Cloud Platform. It allows analysts to use ANSI SQL to analyze petabytes of data at fast speed with no operational overhead.
FEATURES
ServerlessNo-Ops model. Google manages all the resource provisioning behind-the-scene.Fast SQLSupports ANSI SQL with sub-second query response time and high concurrency.Managed StorageOnce the data is loaded to BigQuery, it is stored & managed by BigQuery in effective way.Data Encryption & SecurityData encrypted at rest and integrated with Cloud IAM for security.BigQuery MLEnables data scientists and data analysts to build, train and test ML models inside BigQuery using the SQL syntax.BigQuery GISEnables location intelligence by allowing analysis and visualization of geospatial data inside BigQuery.Flexible pricing modelOn Demand and Flat rate pricing. For latest pricing model, please refer official documentation
For the latest full list, please refer to the official documentation
There are multiple ways to interact with BigQuery:
This post assumes the following:
1. We already have a GCP Project and BigQuery API enabled.
2. Google Cloud SDK (gcloud
). If you don’t have, then refer to my previous blog - Getting started with Google Cloud SDK.
bq mk bq_dataset
bq ls
datasetId
--------------
bq_dataset
bq mk \
--table \
--expiration 3600 \
--description "This is my BQ table" \
--label env:dev \
bq_dataset.first_table \
col1:STRING,col2:FLOAT,col3:STRING
bq show bigquery-public-data:covid19_jhu_csse.summary
Table bigquery-public-data:covid19_jhu_csse.summary
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels
----------------- ----------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------
07 Jun 10:06:41 |- province_state: string 254940 41005062 freebqcovid:
|- country_region: string
|- date: date
|- latitude: float
|- longitude: float
|- location_geom: geography
|- confirmed: integer
|- deaths: integer
|- recovered: integer
|- active: integer
|- fips: string
|- admin2: string
|- combined_key: string
bq query --use_legacy_sql=false \
'SELECT
date,
country_region,
SUM(confirmed),
SUM(deaths)
FROM
`bigquery-public-data.covid19_jhu_csse.summary`
GROUP BY
date,
country_region
HAVING date = "2020-05-31"
AND
country_region IN ("India", "US")'
#big data #sql #data analysis