Kappa and Lambda architecture with a post-relational touch, to create the perfect blend for near-real time IoT and Analytics.
Just before Ignite, a very interesting case study done with RXR has been released, where they showcased their IoT solution to bring safety in building during COVID times. It uses Azure SQL to store warm data, allowing it to be served and consumed to all downstream users, from analytical application to mobile clients, dashboards, API and business users.
If you haven’t done yet, you definitely should watch the Ignite recording (the IoT part start at minute 22:59). Not only the architecture presented is super interesting, but also the guest presenting it — Tara Walker — is super entertaining and joyful to listen. Which is not something common in technical sessions. Definitely a bonus!
If you are interested in the details, beside the Ignite recording, take a look also at the related Mechanics video, where things are discussed a bit more deeply.
Implement a Kappa or Lambda architecture on Azure using Event Hubs, Stream Analytics and Azure SQL, to ingest at least 1 Billion message per day on a 16 vCores database
The video reminded me that in my long “to-write” blog post list, I have one exactly on this subject. How to use Azure SQL to create a amazing IoT solution. Well, not only IoT. More correctly how to implement a Kappa or Lambda architecture on Azure using Event Hubs, Stream Analytics and Azure SQL. It’s a very generic architecture that can be easily turned to IoT just by using IoT Hub instead of Event Hubs and it can be used as is if you need, instead, to implement an ingestion and processing architecture for the Gaming industry, for example.
Goal is to create a solution that can ingest and process up to 10K message/secs, which is close to 1 Billion message per day, which is a value that will be more than enough for many use cases and scenario. And if someone needs more, you can just scale up the solution.
This article is quite long. So, if you’re in hurry, or you already know all the technical details on the aforementioned services, or you don’t really care too much about tech stuff right now, you can just go away with the following key points.
If you’re now ready for some tech stuff, let’s get started.
So, let’s see it in detail. As usual, I don’t like to discuss without also having a practical way to share knowledge, so you can find everything ready to be deployed in your Azure subscription here:
As that would not be enough, I also enjoyed recording a short video to go through the working solution, giving you a glimpse of what you’ll get, without the need for you to spend any credit, if you are not yet ready to do that:
Creating a streaming solution usually means implementing one of two very well know architectures: Kappa or Lambda. They are very close to each other, and it’s safe to say that Kappa is a simplified version of Lambda. Both have a very similar data pipeline:
Event Hubs is probably the easiest way to ingest data at scale in Azure. It is also used behind the scenes by IoT Hub, so everything you learn on Event Hubs, will be applicable to IoT Hub too.
It is very easy to use, but at the beginning some of the concepts can be quite new and not immediate to grasp, so make sure to check out this page to understand all the details: Azure Event Hubs — A big data streaming platform and event ingestion service
Long story short: you want to ingest a massive amount of data in the shortest time possible, and keep doing that for as much as you need. To achieve the scalability you need, a distributed system is required, and so data must be partitioned across several nodes.
In Event Hubs you have to decide how to partition ingested data when you create the service, and you cannot change it later. This is the tricky part. How do you know how many partition you will need? That’s a very complex answer, as it is completely dependent on how fast who will read the ingested data will be able to go.
If you have only one partition and one of the parallel applications that will consume the data is slow, you are creating a bottleneck.
If you have too many partitions, you will need to have a lot of clients reading the data, but if data is not coming in fast enough, you’ll starve your consumers, meaning you are probably wasting your money in running processes that are doing nothing for a big percentage of their CPU time.
So let’s say that you have 10MB/sec of data coming in. If each of your consuming client can process data at 4MB/sec, you probably want 3 of them to work in parallel (with the hypothesis that your data can be perfectly and evenly spread across all partitions), so you will probably want to create at least 3 partitions.
That’s a good starting point, but 3 partitions is not the correct answer. Let’s understand why by making the example a bit more realistic and thus slightly more complex.
Event Hubs let you pick and choose the Partition Key, which is the property whose values will be used to decide in which partition an ingested message will land. All messages with same partition key value, will land in the same partition. Also, if you need to process messages in the order they are received, you must put them in the same partition. If fact, ordering is guaranteed only at partition level.
In our sample we’ll be partitioning by
DeviceId, meaning data coming from the same device will land in the same partition. Here’s how the sample data is generated
stream = (stream .withColumn("deviceId", ...) .withColumn("deviceSequenceNumber", ...) .withColumn("type", ...) .withColumn("eventId", generate_uuid()) .withColumn("createdAt", F.current_timestamp()) .withColumn("value", F.rand() * 90 + 10) .withColumn("partitionKey", F.col("deviceId")) )
In Event Hubs the “power” you have available (and that you pay for) is measured in Throughput Units (TU). Each TU guarantees that it will support 1MB/sec or 1000 messages(or events)/sec , whichever came first. If we want to be able to process 10.000 events/sec we need at least 10 TU. Since it’s very unlikely that our workload will be perfectly stable, without any peak here and there, I would go for 12 TU, to have some margin to handle some expected workload spike.
TU can be changed on the fly, increasing on reducing them as you need.
It’s time to decide how many TU and Partitions we need inour sample. We want to be able to reach at least 10K messages/second. TU are not an issue as they can be change on the fly, but deciding how many partitions we need is more challenging. We’ll be using Stream Analytics, and we don’t exactly know how fast it will be able to consume incoming data.
Of course one road is running test to figure out the correct numbers, but we still need to come up with some reasonable numbers also to just to start with such test. Well, a good rule of thumb is the following:
Rule of thumb: create an amount for partitions equal to the number of throughput units you have or you might expect to have in future
For what concern the ingestion part we’re good now. If you want to know more, please take a look at this article: Partitioning in Event Hubs and Kafka, that will go into detail of this topic. Super recommended!
Let’s now move to discuss how to process the data that will be thrown at us, doing it as fast as possible.
Azure Stream Analytics is an amazing serverless streaming processing engine. It is based on the open source Trill framework which source code is available on GitHub and is capable to process a trillion message per day. All without requiring you to manage and maintain the complexity of a extremely scalable distributed solution.
Stream Analytics support a powerful SQL-like declarative language: tell it what you want and it will figure out how to do it, fast.
SELECT [user], feature, DATEDIFF(second, LAST(Time) OVER ( PARTITION BY [user], feature LIMIT DURATION(hour, 1) WHEN Event = 'start' ), Time) as duration FROM input TIMESTAMP BY Time WHERE Event = 'end'
All the complexity of managing the stream of data used as the input, with all its temporal connotations, is done for you, and all you have to tell Stream Analytics is that it should calculate the difference between a
end event on per
feature basis. No need to write complex custom stateful aggregation functions or other complex stuff. Let’s keep everything simple and leverage the serverless power and flexibility.
As for any distributed system, the concept of partitioning is key, as it is the backbone of any scale-out approach. In Stream Analytics, since we are getting data from Event Hub or IoT Hub, we can try to use exactly the same partition configuration already defined in those services. If was use the same partition configuration also in Azure SQL, we can achieve what are defined as embarrassingly parallel jobs where there is no interaction between partitions and everything can be processed fully in parallel. Which means: at the fastest speed possible.
Streaming Units (SU) is the unit of scale that you use — and pay for—in Azure Stream Analytics. There is no easy way to understand how many SU you need, as consumption will totally depend on how complex your query is. The recommendation is to start with 6 and then monitor the Resource Utilization to see how much percentage of available SU you are using. If your query partition data using
PARTITON BY, SU usage will increase as your are distributing the workload across nodes. This is good, as it means you’ll be able to process more data in the same amount of time . You also want to make sure SU utilization is below 80% as after that your events will be queued, which means you’ll see higher latency. If everything works well, we’ll be able to ingest our target of 10K events/sec (or 600K events/minute as pictured below)
Azure SQL is really a great database for storing hot and warm data of an IoT solution. I know this is quite the opposite of what many thinks. A relational database is rigid, it requires schema-on-write, and on IoT or Log Processing scenarios, the best approach is a schema-on-read instead. Well, Azure SQL actually supports both and more.
With Azure SQL you can do both schema-on-read and schema-on-write, via native JSON support
In fact, beside what just said, there are several reason for this, and I’m sure you are quite surprised to hear that, so, read on:
Describing each one of the listed features, even just at a very high level, would require an article on its own. And of course, such article is available here, if you are interested (and you should!): 10 Reasons why Azure SQL is the Best Database for Developers.
In order to accommodate a realistic scenario where you have some fields that are always present, while some other can vary by time or device, the sample is using the following table to store ingested data
CREATE TABLE [dbo].[rawdata] ( [BatchId] [UNIQUEIDENTIFIER] NOT NULL, [EventId] [UNIQUEIDENTIFIER] NOT NULL, [Type] [VARCHAR](10) NOT NULL, [DeviceId] [VARCHAR](100) NOT NULL, [DeviceSequenceNumber] [BIGINT] NOT NULL, [CreatedAt] [DATETIME2](7) NOT NULL, [Value] [NUMERIC](18, 0) NOT NULL, [ComplexData] [NVARCHAR](MAX) NOT NULL, [EnqueuedAt] [DATETIME2](7) NOT NULL, [ProcessedAt] [DATETIME2](7) NOT NULL, [StoredAt] [DATETIME2](7) NOT NULL, [PartitionId] [INT] NOT NULL )
As we really want to create something really close to a real production workload, indexes have been created too:
EventId, to quickly find a specific event
StoredAt, to help timeseries-like queries, like, querying the last “n” rows reported by devices
DeviceId, DeviceSequenceNumberto quickly return reported rows sent by a specific device
BatchIdto allow the quick retrivial of all rows sent by a specific batch
At the time of writing I’ve been running this sample for weeks and my database is now close to 30TB:
Table is partitioned by
PartitionId (which is in turn generated by Event Hubs based on
DeviceId) and a query like the following
SELECT TOP(100) EventId, [Type], [Value], [ComplexData], DATEDIFF(MILLISECOND, [EnqueuedAt], [ProcessedAt]) AS QueueTime, DATEDIFF(MILLISECOND, [ProcessedAt], [StoredAt]) AS ProcessTime [StoredAt] FROM dbo.[rawdata2] WHERE [DeviceId] = 'contoso://device-id-471' AND [PartitionId] = 0 ORDER BY [DeviceSequenceNumber] DESC
Takes less then 50 msec to be executed including also the time to send the result to the client. That’s pretty impressive. The result also shows something impressive too:
As you can see, there are two calculated columns
ProcessTime that shows, in milliseconds, how much time an event has been waiting in Event Hubs to be picked up by Stream Analytics to be processed, and how much time the same event spent within Stream Analytics before land into Azure SQL. Each event (all the 10K per second) is processed in — overall—less than 300 msec on average. 280msec more precisely.
That is very impressive.
End-to-End ingestion latency is around 300msec
You can also go lower than that using some more specific streaming tool like Apache Flink, if you really need to completely avoid any batching technique to decrease the latency to the minimum possible. But unless you have some very unique and specific requirements, processing events in less than a second is probably more than enough for you.
For Azure SQL, ingesting data at scale is not a particularly complex or demanding job, on the contrary of what can expect. If done well, using bulk load libraries, the process can be extremely efficient. In the sample I have used a small Azure SQL 16 vCore tier to sustain the ingestion of 10K event/secs, using on average 15% of CPU resources on a bit more of 20% of the IO resources.
This means that in theory I could also use an even smaller 8 vCore tier. While that is absolutely true, you have to think of at least three other factors when sizing Azure SQL:
Just as an example, I have stopped Stream Analytics for a few minutes, allowing messages to pile up a bit. As soon as I restarted it, it tried to process messages as fast as possible, in order to empty the queue and return to the ideal situation where latency is less then a second. In order allow Stream Analytics to process data at higher rate, Azure SQL must be able to handle the additional workload too, otherwise it will slow down all the other components in the pipeline.
As expected, Azure SQL handled the additional workload without breaking a sweat.
For all the needed time, Azure SQL was able to ingest almost twice the regular workload, processing more than 1 Million messages per minute. All of this with CPU usage staying well below 15%, and with a relative spike only to the Log IO — something expected as Azure SQL uses a Write-Ahead Log pattern to guarantee ACID properties—which, still, never went over 45%.
Really, really, amazing.
With such configuration — and remember we’re just using a 16vCore tier, but we can scale up to 80 and more — our system can handle something like 1 billion messages a day, with an average processing latency of less then a second.
The deployed solution can handle 1 billion messages a day, with an average processing latency of less then a second.
Partitioning plays a key role also in Azure SQL: as said before, if need to operate on a lot of data concurrently, partitioning is really something you need to take into account.
Partitioning in this case is used to allow concurrent bulk insert into the target table, even if on such table several indexes exists and thus needs to be kept updated.
Table has been partitioned using the
PartitionId column, in order to have the processing pipeline completely aligned. The
PartitionId value is in fact generated by Event Hub, which partitions data by
DeviceId, so that all data coming from the same device will land in the same partition.
Stream Analytics uses the same partitions provided by Event Hub and so it make sense to align Azure SQL partitions to this logic too, to avoid to cross the streams, which we all know is a bad thing to do. Data will move from source to destination in parallel streams providing the performances and the scalability we are looking for.
CREATE PARTITION FUNCTION [pf_af](int) AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
Table partitioning also allows Azure SQL to update the several indexes existing on the target table without ending in tangled locking, where transactions are waiting for each other with the result of huge negative impact on performances. As long as table and indexes are using the same partitioning strategy everything will move forward without any lock or deadlock problem.
CREATE CLUSTERED INDEX [ixc] ON [dbo].[rawdata] ([StoredAt] DESC) WITH (DATA_COMPRESSION = PAGE) ON [ps_af]([PartitionId]) CREATE NONCLUSTERED INDEX ix1 ON [dbo].[rawdata] ([DeviceId] ASC, [DeviceSequenceNumber] DESC) WITH (DATA_COMPRESSION = PAGE) ON [ps_af]([PartitionId]) CREATE NONCLUSTERED INDEX ix2 ON [dbo].[rawdata] ([BatchId]) WITH (DATA_COMPRESSION = PAGE) ON [ps_af]([PartitionId])
Higher concurrency is not the only perk of a good partitioning strategy. Partitions allow extremely fast data movement between tables. We’ll take advantage of this ability for creating highly compressed column-store indexes soon.
What if you need to run complex analytical queries on the data being ingested? That’s a very common requirement for Near-Real Time Analytics or HTAP (Hybrid Transaction/Analytical Processing) solutions.
As you have noticed, you still have enough resources free to run some complex queries, but what if you have to run many really complex queries, for example to compare average values of month-over-month, on the same table were data is being ingest? Or what if you need to allow many mobile client to access the ingested data, all running small but CPU intensive queries? Risk of resource contention — and thus low performances — becomes real.
That’s when a scale-out approach start to get interesting.
With Azure SQL Hyperscale you can create up to 4 readable-copies of the database, all with their own private set of resources (CPU, memory and local cache), that will give you access to exactly the same data sitting in the primary database, but without interfering with it at all. You can run the most complex query you can imagine on a secondary, and the primary will not even notice it. Ingestion will proceed as usual rate, completely unaffected by the fact that a huge analytical query or many concurrent small queries are hitting the secondary nodes.
Columnstore tables (or index in Azure SQL terms) are just perfect for HTAP and Near Real Time Analytics scenario, as already described times ago here: Get started with Columnstore for real-time operational analytics.
This article is already long enough, so I’ll not get into details here, but I will focus on the fact that using columnstore index as a target of a Stream Analytics workload, may not be the best option, if you are also looking for low latency. To keep latency small, a small batch size needs to be used, but this is against the best practices for columnstore, as it will create a very fragmented index.
To address this issue, we can use a feature offered by partition table. Stream Analytics will land data into a regular partitioned rowstore table. On scheduled intervals a partition will be switched out into a staging table, so that it be loaded into a columnstore table using Azure Data Factory, for example, so that all best practices can be applied to have the highest compression and the minimum fragmentation.
What if everything just described is still not enough? What if you need a scale so extreme that you need to be able to ingest and process something like 400 Billions rows per day? Azure SQL allows you to do that, by using In-Memory, latch-free, tables, as described in this amazing article:
I guess that, now, even if you have the most demanding workload, you should be covered. If you need even more power…let me know. I’ll be extremely interested in understanding your scenario.
We’re at the end of this long article, were we learned how it is possible with a Kappa (or Lambda) architecture to ingest, process and serve 10K msg/sec using only PaaS services. As we haven’t maxed out any of the resource of our services, we know we can scale to much higher level. At least twice that goal value, without changing anything and much more than that by increasing resources. With Azure SQL we are just using 16 vCores and it can be scale up to 128. Plenty of space to grow.
#serverless #azure #sql #programming #developer
SQL stands for Structured Query Language. SQL is a scripting language expected to store, control, and inquiry information put away in social databases. The main manifestation of SQL showed up in 1974, when a gathering in IBM built up the principal model of a social database. The primary business social database was discharged by Relational Software later turning out to be Oracle.
Models for SQL exist. In any case, the SQL that can be utilized on every last one of the major RDBMS today is in various flavors. This is because of two reasons:
1. The SQL order standard is genuinely intricate, and it isn’t handy to actualize the whole standard.
2. Every database seller needs an approach to separate its item from others.
Right now, contrasts are noted where fitting.
#programming books #beginning sql pdf #commands sql #download free sql full book pdf #introduction to sql pdf #introduction to sql ppt #introduction to sql #practical sql pdf #sql commands pdf with examples free download #sql commands #sql free bool download #sql guide #sql language #sql pdf #sql ppt #sql programming language #sql tutorial for beginners #sql tutorial pdf #sql #structured query language pdf #structured query language ppt #structured query language
In this article, you learn how to set up Azure Data Sync services. In addition, you will also learn how to create and set up a data sync group between Azure SQL database and on-premises SQL Server.
In this article, you will see:
Azure Data Sync —a synchronization service set up on an Azure SQL Database. This service synchronizes the data across multiple SQL databases. You can set up bi-directional data synchronization where data ingest and egest process happens between the SQL databases—It can be between Azure SQL database and on-premises and/or within the cloud Azure SQL database. At this moment, the only limitation is that it will not support Azure SQL Managed Instance.
#azure #sql azure #azure sql #azure data sync #azure sql #sql server
This article will walk you through creating a new SQL pool within an existing Azure SQL Server as well as catalog the same using the Azure Purview service.
Data is generated by transactional systems and typically stored in relational data repositories. This data is generally used by live applications and for operational reporting. As this data volume grows, this data is often required by other analytical repositories and data warehouses where it can be used for referential purposes and adding more context to other data from across the organization. Transactional systems (also known as Online Transaction Processing (OLTP) systems) usually need a relational database engine, while analytical systems (also known as Online Analytical Processing (OLAP) systems) usually need analytical data processing engines. On Azure cloud, it is usually known that for OLTP requirements, SQL Server or Azure SQL Database can be employed, and for analytical data processing needs, Azure Synapse and other similar services can be employed. SQL Pools in Azure Synapse host the data on an SQL Server environment that can process the data in a massively parallel processing model, and the address of this environment is generally the name of the Azure Synapse workspace environment. At times, when one has already an Azure SQL Server in production or in use, the need is to have these SQL Pools on an existing Azure SQL Server instance, so data in these SQL pools can be processed per the requirements on an OLAP system as well as the data can be co-located with data generated by OLTP systems. This can be done by creating SQL Pools within the Azure SQL Server instance itself. In this article, we will learn to create a new SQL Pool within an existing Azure SQL Server followed by cataloging the same using the Azure Purview service.
As we intend to create a new SQL Pool in an existing Azure SQL Server instance, we need to have an instance of Azure SQL in place. Navigate to Azure Portal, search for Azure SQL and create a new instance of it. We can create an instance with the most basic configuration for demonstration purposes. Once the instance is created, we can navigate to the dashboard page of the instance and it would look as shown below.
As we are going to catalog the data in the dedicated SQL Pool hosted on Azure SQL instance, we also need to create an instance of Azure Purview. We would be using the Azure Purview studio from the dashboard of this instance, tonregister this SQL Pool as the source and catalog the instance.
#azure #sql azure #azure sql server #sql #sql #azure
Co-authored by Rodrigo Souza, Ramnandan Krishnamurthy, Anitha Adusumilli and Jovan Popovic (Azure Cosmos DB and Azure Synapse Analytics teams)
Azure Synapse Link now supports querying Azure Cosmos DB data using Synapse SQL serverless. This capability, available in public preview, allows you to use familiar analytical T-SQL queries and build powerful near real-time BI dashboards on Azure Cosmos DB data.
As announced at Ignite 2020, you can now also query Azure Cosmos DB API for Mongo DB data using Azure Synapse Link, enabling analytics with Synapse Spark and Synapse SQL serverless.
Azure Synapse SQL serverless (previously known as SQL on-demand) is a serverless, distributed data processing service offering built-in query execution fault-tolerance and a consumption-based pricing model. It enables you to analyze your data in Cosmos DB analytical store within seconds, without any performance or RU impact on your transactional workloads.
Using OPENROWSET syntax and automatic schema inference, data and business analysts can use familiar T-SQL query language to quickly explore and reason about the contents in Azure Cosmos DB analytical store. You can query this data in place without the need to copy or load the data into a specialized store.
You can also create SQL views to join data in the analytical stores across multiple Azure Cosmos DB containers, to better organize your data in a semantic layer that will accelerate your data exploration and reporting workloads. BI Professionals can quickly create Power BI reports on top of these SQL views in Direct Query mode.
You can further extend this by building a logical data warehouse to create and analyze unified views of data across Azure Cosmos DB, Azure Data Lake Storage and Azure Blob Storage.
#analytics #announcements #api for mongodb #core (sql) api #data architecture #query #azure cosmos db #azure synapse analytics #serverless sql pools #sql on-demand #synapse link #synapse sql serverless
In this article, we will show how to source data from Azure SQL Database to use in a Machine Learning workflow.
Azure offers a variety of data repositories for operational as well as analytical purposes. One of the most popular and highly adopted database services is Azure SQL Database, which is typically used to host transactional data in Online Transaction Processing (OLTP) systems. A typical data pipeline involves ingesting data into different types of data repositories. Data from different repositories may be optionally enriched or standardized using approaches like Master Data Management (MDM). Data is generally moved using Extract Transform Load (ETL) or Extract Load Transform (ELT) mechanisms. Once the data is in a proper state, it may be stored in a data warehouse in a structured format or in a data lake which is a mix of structured, semi-structured, and unstructured formats. SQL Database is one of those versatile data repositories that can store different types of data, which makes it an ideal candidate for being used as a data warehouse or data mart too. Once data is in operational and analytical repositories, this data is used for various types of analytics, prediction, forecasting, and other types of data intelligence.
Machine learning is one of the most popular means of extracting intelligence out of data. Azure offers Azure ML service which is one of the mainstream services for authoring machine learning workflows. Like other data processing systems, Azure Machine Learning service requires and supports sourcing data from different types of data repositories including Azure SQL Database. Sourcing data is usually the first step while authoring Azure Machine Learning workflows. Let’s go ahead and see how you can source data from SQL Database to use in an Azure Machine Learning workflow.
#azure #sql azure #azure sql #sql