Gilberto Block

1595064275

Apache Spark on Dataproc vs. Google BigQuery

This post describes research we undertook to deliver interactive business intelligence reports and visualizations to thousands of end users. It aims to help architects and engineers moving to Google Cloud Platform select a technology stack that fits their requirements and processes large volumes of data cost-effectively and reliably.

Introduction

When it comes to big data infrastructure on Google Cloud Platform, the most popular choices data architects need to consider today are Google BigQuery, a serverless, highly scalable and cost-effective cloud data warehouse; Cloud Dataflow, which is based on Apache Beam; and Dataproc, a fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.

This variety also makes the choice harder: architects and engineers moving to Google Cloud Platform must select the stack that best fits their requirements while still processing large volumes of data in a cost-effective yet reliable manner.

In the following sections, we describe the research we undertook to provide interactive business intelligence reports and visualizations for thousands of end users. Because these users can generate a variety of such reports concurrently, the system must be able to analyze billions of data points in real time.

Requirements

For the technology evaluation, we narrowed the requirements down to the following:

  1. A raw dataset of 175 TB: the dataset is quite diverse, with scores of tables and columns consisting of metrics and dimensions derived from multiple sources.
  2. Catering to 30,000 unique users.
  3. Serving up to 60 concurrent queries for platform users.

Because of the size of the base dataset and the requirement for real-time querying at this scale, the problem calls for a solution from the big data domain.

Salient Features of Proposed Solution

The solution took into account the following three main characteristics of the desired system:

  1. Analyzing and classifying the expected user queries and their frequency.
  2. Developing pre-aggregations and projections to reduce data churn while serving the various classes of user queries.
  3. Developing a ‘Query Rewrite Algorithm’ that serves user queries using a combination of aggregated datasets, so that the query engine can serve the maximum number of user queries with the minimum number of aggregations.
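The post does not describe how the query rewrite step works internally, so the following is only a minimal sketch of one common approach: route each query to the smallest pre-aggregated table that contains every dimension and metric the query asks for. The table names, row counts, and the `pick_aggregate` helper are all hypothetical, and the sketch assumes the metrics are additive (sums or counts), so they can be rolled up again from a finer-grained aggregate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    """A pre-aggregated table (hypothetical metadata, for illustration only)."""
    name: str
    dimensions: frozenset
    metrics: frozenset
    row_count: int


def pick_aggregate(query_dims, query_metrics, aggregates):
    """Return the smallest aggregate that covers the query, or None.

    An aggregate covers a query when it contains every requested
    dimension and metric. Assumes additive metrics, so a finer-grained
    aggregate can be re-aggregated to a coarser grouping.
    """
    dims = frozenset(query_dims)
    metrics = frozenset(query_metrics)
    candidates = [
        agg for agg in aggregates
        if dims <= agg.dimensions and metrics <= agg.metrics
    ]
    if not candidates:
        return None  # caller would fall back to the raw tables
    return min(candidates, key=lambda agg: agg.row_count)


# Hypothetical catalogue of pre-aggregations, coarse to fine.
AGGREGATES = [
    Aggregate("agg_by_day", frozenset({"day"}),
              frozenset({"revenue"}), 1_000),
    Aggregate("agg_by_day_country", frozenset({"day", "country"}),
              frozenset({"revenue", "orders"}), 200_000),
    Aggregate("agg_by_day_country_product",
              frozenset({"day", "country", "product"}),
              frozenset({"revenue", "orders"}), 50_000_000),
]

chosen = pick_aggregate({"day"}, {"revenue"}, AGGREGATES)
print(chosen.name if chosen else "fall back to raw data")
```

Choosing by row count is a stand-in for a real cost estimate. A production rewriter would also need to handle filters, non-additive metrics such as distinct counts, and queries that no aggregate covers.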

Tech Stack Considerations

To benchmark performance and the resulting cost implications, the following technology stacks on Google Cloud Platform were considered:

1. Cloud Dataproc + Google Cloud Storage

For distributed processing – Apache Spark on Cloud Dataproc

For distributed storage – Apache Parquet file format stored in Google Cloud Storage

2. Cloud Dataproc + Google BigQuery using the Storage API

For distributed processing – Apache Spark on Cloud Dataproc

For distributed storage – BigQuery native storage (Capacitor file format over Colossus storage), accessible through the BigQuery Storage API

3. Native Google BigQuery for both storage and processing – on-demand queries

Using BigQuery native storage (Capacitor file format over Colossus storage) and execution on the BigQuery native MPP engine (Dremel query engine)

All queries were run on demand. The project is billed on the total amount of data processed by user queries.
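Since on-demand billing depends on the amount of data processed, a rough monthly cost can be estimated from the expected query volume. The sketch below is a back-of-the-envelope calculation, not a quote: the per-TiB price, the bytes scanned per query, and the query count are placeholder assumptions to be replaced with current pricing and measured values, and the function ignores free-tier allowances and any per-query minimums.

```python
BYTES_PER_GIB = 2**30
BYTES_PER_TIB = 2**40


def on_demand_monthly_cost(bytes_per_query, queries_per_day,
                           price_per_tib, days=30):
    """Estimate the monthly on-demand cost from the data scanned.

    All inputs are assumptions supplied by the caller; price_per_tib
    is a placeholder, not an official price.
    """
    total_bytes = bytes_per_query * queries_per_day * days
    return total_bytes / BYTES_PER_TIB * price_per_tib


# Hypothetical workload: 1,000 queries a day, each scanning 50 GiB,
# at an assumed price of 5.0 currency units per TiB processed.
cost = on_demand_monthly_cost(
    bytes_per_query=50 * BYTES_PER_GIB,
    queries_per_day=1_000,
    price_per_tib=5.0,
)
print(f"Estimated monthly cost: {cost:,.2f}")
```

Because the estimate scales linearly with the bytes scanned per query, anything that reduces the scan, such as serving a query from a pre-aggregated table, lowers the on-demand bill in the same proportion.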

4. Native Google BigQuery with a fixed-price model

Using BigQuery native storage (Capacitor file format over Colossus storage) and execution on the BigQuery native MPP engine (Dremel query engine)

Slot reservations were made and assigned to dedicated GCP projects. All queries are processed on the fixed number of BigQuery slots assigned to the project.


