Azure Databricks Streaming with GCP Pub/Sub

Stream a Pub/Sub topic using Azure Databricks

Use Case

  • Multi-cloud data processing
  • Move data from GCP Pub/Sub through Azure Databricks into ADLS Gen2
  • Store the data in Delta format
  • Event-driven data processing

Architecture

Steps

GCP

Azure

  • Create an Azure account
  • Create a resource group
  • Create an Azure Databricks workspace
  • Create an Azure Storage account with ADLS Gen2 (used for Delta storage)
  • Create a cluster with runtime 8.2 ML
  • The connector used here is the Pub/Sub Lite Spark connector: https://github.com/googleapis/java-pubsublite-spark
  • Once the cluster has started, go to Libraries, select Maven, and enter these coordinates:
com.google.cloud:pubsublite-spark-sql-streaming:0.2.0
  • Wait for the library to install on the cluster
  • Meanwhile, gather the GCP project ID and the service account JSON key file
  • Create a notebook with Python as the language
  • Read the stream
# Read the Pub/Sub Lite subscription as a streaming DataFrame
df = spark.readStream \
  .format("pubsublite") \
  .option("pubsublite.subscription", "projects/$PROJECT_NUMBER/locations/$LOCATION/subscriptions/$SUBSCRIPTION_ID") \
  .option("gcp.credentials.key", "<SERVICE_ACCOUNT_JSON_IN_BASE64>") \
  .load()
  • Now write the stream to Delta for further processing
# Append the stream to a Delta table; df is the DataFrame from the read step
df.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/delta/events/_checkpoints/etl-from-pubsub") \
  .start("/delta/pubsub")
  • Run the notebook cells; once writeStream is running, check the output folder to confirm that data is being written
  • Check the delta/pubsub folder in ADLS Gen2
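
The read step expects the service account key as a base64 string and the subscription as a full resource path. The sketch below shows one way to prepare both values in plain Python. The helper names, the project number, the location, and the subscription ID are placeholders made up for illustration, not values from this walkthrough.

```python
import base64
import json


def encode_service_account_key(key_json: str) -> str:
    """Return the service account JSON key as a base64 string."""
    # Parse first so a malformed key fails here rather than inside the stream.
    json.loads(key_json)
    return base64.b64encode(key_json.encode("utf-8")).decode("ascii")


def subscription_path(project_number: str, location: str, subscription_id: str) -> str:
    """Build the value for the pubsublite.subscription option."""
    return (
        f"projects/{project_number}/locations/{location}"
        f"/subscriptions/{subscription_id}"
    )


# Placeholder values for illustration only; substitute your own.
fake_key = json.dumps({"type": "service_account", "project_id": "my-project"})
options = {
    "pubsublite.subscription": subscription_path("123456789", "us-central1-a", "my-sub"),
    "gcp.credentials.key": encode_service_account_key(fake_key),
}
print(options["pubsublite.subscription"])
```

In the notebook, a dictionary like this could be passed to the reader with .options(**options) in place of the two .option(...) calls. Keeping the real key in a secret store rather than in the notebook text is usually the safer choice.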

#azure-databricks #pub-sub #azure-data-lake


Aisu  Joesph

1626490533

Azure Series #2: Single Server Deployment (Output)

No organization that is on a growth path, or that intends to expand its customer base and enter new markets, will restrict its infrastructure and design to a single database option. Database selection involves the following levels:

  • a. The needs assessment
  • b. Selecting the kind of database
  • c. Selection of queues for communication
  • d. Selecting the technology player

Options to choose from:

  1. Transactional Databases:
    • Azure selection: Data Factory, Redis, Cosmos DB, Azure SQL, PostgreSQL, MySQL, MariaDB, SQL Database, Managed Server
  2. Data warehousing:
    • Azure selection: Cosmos DB
    • Delta Lake: Databricks' Lakehouse architecture
  3. Non-Relational Database:
    • Azure selection: Cosmos DB
  4. Data Lake:
    • Azure Data Lake
    • Delta Lake: Databricks
  5. Big Data and Analytics:
    • Databricks
    • Azure: HDInsight, Azure Synapse Analytics, Event Hubs, Data Lake Storage Gen1, Azure Data Explorer Clusters, Data Factories, Azure Databricks, Analytics Services, Stream Analytics, Website UI, Cognitive Search, Power BI, Queries, Reports
  6. Machine Learning:
    • Azure: Azure Synapse Analytics, Machine Learning, Genomics accounts, Bot Services, Machine Learning Studio, Cognitive Services, Bonsai

Key data platform services I would like to highlight

  • 1. Azure Data Factory (ADF)
  • 2. Azure Synapse Analytics
  • 3. Azure Stream Analytics
  • 4. Azure Databricks
  • 5. Azure Cognitive Services
  • 6. Azure Data Lake Storage
  • 7. Azure HDInsight
  • 8. Azure CosmosDB
  • 9. Azure SQL Database

#azure-databricks #azure #microsoft-azure-analytics #azure-data-factory #azure series

Eric  Bukenya

1624713540

Learn NoSQL in Azure: Diving Deeper into Azure Cosmos DB

This article is part of the series Learn NoSQL in Azure, where we explore Azure Cosmos DB as a non-relational database system used widely for a variety of applications. Azure Cosmos DB is one of Microsoft's serverless databases on Azure; it is highly scalable and distributed across all Azure locations. It is offered as a platform as a service (PaaS), and you can develop databases that have very high throughput and very low latency. Using Azure Cosmos DB, customers can replicate their data across multiple locations around the globe and also across multiple locations within the same region. This makes Cosmos DB a highly available database service, with almost 99.999% availability for reads and writes in multi-region modes and almost 99.99% availability in single-region modes.

In this article, we will focus on how Azure Cosmos DB works behind the scenes and how you can get started with it using the Azure Portal. We will also explore how Cosmos DB is priced and look at the pricing model in detail.

How Azure Cosmos DB works

As already mentioned, Azure Cosmos DB is a multi-model NoSQL database service that is geographically distributed across multiple Azure locations. This lets customers deploy their databases in multiple locations around the globe, which reduces read latency for the application's users.

As you can see in the figure above, Azure Cosmos DB is distributed across the globe. Suppose you have a web application that is hosted in India. In that case, the NoSQL database in India is treated as the master database for writes, and all the other databases are treated as read replicas. Whenever new data is generated, it is written to the database in India first and then synchronized with the other databases.
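
The write-then-synchronize flow can be pictured with a small toy model. This is only an illustrative sketch: the class names and the explicit replicate() step are invented here and do not represent how Cosmos DB replicates data internally.

```python
class Region:
    """One regional copy of the data (toy model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ToyDatabase:
    """A single write region plus read replicas, synchronized on demand."""

    def __init__(self, write_region, read_regions):
        self.primary = Region(write_region)
        self.replicas = {name: Region(name) for name in read_regions}
        self.pending = []  # writes accepted by the primary but not yet replicated

    def write(self, key, value):
        # All writes land in the write region first.
        self.primary.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Push every pending write to every read replica, in write order.
        for key, value in self.pending:
            for replica in self.replicas.values():
                replica.data[key] = value
        self.pending.clear()

    def read(self, region_name, key):
        if region_name == self.primary.name:
            return self.primary.data.get(key)
        return self.replicas[region_name].data.get(key)


db = ToyDatabase("India", ["US", "Europe"])
db.write("order-1", "created")
before_sync = db.read("US", "order-1")   # replica has not seen the write yet
db.replicate()
after_sync = db.read("US", "order-1")    # replica now matches the write region
print(before_sync, after_sync)
```

A reader in the write region sees the new value immediately, while a reader in another region sees it only after synchronization. That gap is the staleness a consistency setting controls.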

Consistency Levels

When maintaining data across multiple regions, the most common challenge is the delay before newly written data becomes available in the other databases. For example, when data is written to the database in India, users in India will be able to see that data sooner than users in the US. This is due to the synchronization latency between the two regions. To address this, customers can choose from several modes that define how soon their data must be made available in the other regions. Azure Cosmos DB offers five levels of consistency, which are as follows:

  • Strong
  • Bounded staleness
  • Session
  • Consistent prefix
  • Eventual

In most common NoSQL databases, there are only two levels: Strong and Eventual. Strong is the most consistent level, while Eventual is the least. As we move from Strong to Eventual, consistency decreases but availability and throughput increase. This is a trade-off that customers need to make based on the criticality of their applications. If you want to read about the consistency levels in more detail, the official guide from Microsoft is the easiest to understand. You can refer to it here.

Azure Cosmos DB Pricing Model

Now that we have some idea about working with the NoSQL database – Azure Cosmos DB on Azure, let us try to understand how the database is priced. In order to work with any cloud-based services, it is essential that you have a sound knowledge of how the services are charged, otherwise, you might end up paying something much higher than your expectations.

If you browse to the pricing page of Azure Cosmos DB, you can see that there are two modes in which the database services are billed.

  • Database Operations: Whenever you run queries against your NoSQL database, some resources are used. Azure measures this usage in Request Units (RU). The amount of RU consumed per second is aggregated and billed.
  • Consumed Storage: As you start storing data in your database, it takes up space. This storage is billed at the standard SSD-based storage rate across all Azure locations globally.

Let’s learn about this in more detail.
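
As a rough illustration of how the two billed components add up, here is a minimal sketch of a monthly estimate. It assumes provisioned throughput billed per 100 RU/s per hour. The function name, the 730-hour month, and every price in the example call are placeholders invented for the arithmetic, so check the pricing page for real rates.

```python
def estimate_monthly_cost(
    provisioned_ru_per_sec,
    storage_gb,
    price_per_100_ru_hour,
    price_per_gb_month,
    hours_in_month=730,
):
    """Split a monthly estimate into throughput and storage components."""
    # Throughput: number of 100 RU/s blocks, times hourly price, times hours.
    throughput = (provisioned_ru_per_sec / 100) * price_per_100_ru_hour * hours_in_month
    # Storage: billed per GB stored for the month.
    storage = storage_gb * price_per_gb_month
    return {"throughput": throughput, "storage": storage, "total": throughput + storage}


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
estimate = estimate_monthly_cost(
    provisioned_ru_per_sec=1000,
    storage_gb=50,
    price_per_100_ru_hour=0.01,
    price_per_gb_month=0.25,
)
print(estimate)
```

With these made-up rates, the throughput charge is far larger than the storage charge, which is why right-sizing RU/s is usually the first place to look when an estimate comes in high.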

#azure #azure cosmos db #nosql #azure #nosql in azure #azure cosmos db

Ruthie  Bugala

Ruthie Bugala

1620435660

How to set up Azure Data Sync between Azure SQL databases and on-premises SQL Server

In this article, you will learn how to set up the Azure Data Sync service. You will also learn how to create and configure a data sync group between an Azure SQL database and an on-premises SQL Server.

In this article, you will see:

  • An overview of the Azure SQL Data Sync feature
  • A discussion of its key components
  • A comparison of Azure SQL Data Sync with the other Azure data options
  • How to set up Azure SQL Data Sync
  • More…

Azure Data Sync

Azure Data Sync is a synchronization service set up on an Azure SQL Database. This service synchronizes data across multiple SQL databases. You can set up bi-directional synchronization, where data flows both into and out of each SQL database. The sync can run between an Azure SQL database and an on-premises SQL Server, between Azure SQL databases in the cloud, or both. At this moment, the only limitation is that it does not support Azure SQL Managed Instance.

#azure #sql azure #azure sql #azure data sync #azure sql #sql server

Nabunya  Jane

1621857540

Revealed: A ridiculously easy way to integrate Azure Cosmos DB with Azure Databricks

Buddy, our novice Data Engineer who recently discovered the ultimate cheat-sheet to read and write files in Databricks, is now leveling up in the Azure world.

In this article, you will discover how to seamlessly integrate Azure Cosmos DB with Azure Databricks. Azure Cosmos DB is a key service in the Azure cloud platform that provides a NoSQL-like database for modern applications.

As a Data Engineer or a Data Scientist, you may want to use Azure Cosmos DB to serve data that is modeled and prepared using Azure Databricks, or you may want to analyze data that already exists in Azure Cosmos DB using Databricks. Whatever your purpose, simply follow this 3-step guide to get started.

What is Azure Cosmos DB?

For the uninitiated, Azure Cosmos DB, worthy of its name, is Microsoft's multi-model database that can manage data at planet scale. It belongs to the "NoSQL Database as a Service" stack, like its counterpart AWS DynamoDB.

Inside Cosmos DB, each piece of data, called an item, is stored inside schema-agnostic containers, which means that you don't need to adhere to any particular schema for your data.

Cosmos DB supports multi-model APIs such as the MongoDB API, Cassandra API, Gremlin API, and the default Core SQL API.

The Core SQL API provides you with a JSON-like NoSQL document store, which you can easily query using an SQL-like language.

Despite its fancy name and overwhelming features, Cosmos DB is basically a data store, a data store that we can read from and write to.

Through its seamless integration with a plethora of Azure services, Azure Databricks is just the right tool for the job.

To follow this exercise, you must have an Azure subscription with Cosmos DB and Databricks services running. If you don't have one, follow the steps below to get it and create the services for free.

If you have an existing Azure subscription, skip to the next section.

If you do not have an Azure subscription, get a free trial here. It's quite easy and takes less than 2 minutes. (You will need to give your credit card information, but don't worry, you will not be charged for anything.)

Now, all we need is a Cosmos DB account and a Databricks workspace.

How to Create Azure Cosmos DB?

Microsoft makes it easier and easier to deploy services on Azure using quick starter templates.

Follow the link to the quick starter template to deploy Azure Cosmos DB and click Deploy to Azure, which opens the Azure portal in the browser. Review the steps and create your service. The Cosmos DB account will be ready before your next cup of coffee.

Once the account is created, you will need to create a database and a container in which your data will be stored. Follow the example below to create a database called AdventureWorks and a container named ratings.

Navigate to your deployed Cosmos DB account and click Data Explorer → New Container → name your database AdventureWorks → name your container ratings → set the partition key to /rating → select manual throughput and set it to 1000.
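
To make the partition key setting concrete, the sketch below shows what items in the ratings container might look like and which value the /rating path selects from each. The item fields other than rating, and the helper function, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

PARTITION_KEY_PATH = "/rating"  # the path chosen when creating the container


def partition_key_value(item, path):
    """Follow a path such as '/rating' (or '/a/b') into a JSON-like dict."""
    value = item
    for part in path.strip("/").split("/"):
        value = value[part]
    return value


# Hypothetical items; only the 'rating' field is implied by the partition key.
items = [
    {"id": "1", "product": "Mountain Bike", "rating": 5},
    {"id": "2", "product": "Road Helmet", "rating": 4},
    {"id": "3", "product": "Water Bottle", "rating": 5},
]

# Group item ids by partition key value, as a stand-in for logical partitions.
partitions = defaultdict(list)
for item in items:
    partitions[partition_key_value(item, PARTITION_KEY_PATH)].append(item["id"])

print(dict(partitions))
```

Items that share a rating value end up in the same logical partition, so a key with only a handful of distinct values, like a rating, spreads data across few partitions. That is fine for a small demo container but worth reconsidering for larger workloads.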

#data-science #big-data #cloud #azure #azure cosmos db #azure databricks