Real-time data pipelines have one notable advantage over batch processing: data becomes available to consumers faster. With traditional ETL, you could not analyze today's events until the nightly jobs had finished, typically the next day. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new events, and live dashboards update automatically as new data comes in.
Despite these benefits, real-time streaming adds considerable complexity to the overall data processes, the tooling, and even the data format. It is therefore crucial to weigh the pros and cons carefully before switching to real-time data pipelines. In this article, we'll look at several options for getting the benefits of the real-time paradigm with the fewest architectural changes and the least maintenance effort.
When you hear about real-time data pipelines, you may immediately think of Apache Kafka, Flink, Spark Streaming, and similar frameworks, all of which require considerable expertise to operate as a distributed event streaming platform. Those open-source platforms are best suited to scenarios:
While many companies and services try to simplify the management of the underlying distributed clusters, the architecture remains fairly complex. Therefore, you need to consider:
In the next sections, we’ll look at alternative options if your real-time needs don’t justify the added complexity and costs of a self-managed distributed streaming platform.
AWS recognized customers' difficulties in managing message-bus architectures as early as 2013. In response, it introduced Kinesis, a family of services that aim to make real-time analytics easier. With serverless Kinesis Data Streams, you can create a data stream with a few clicks in the AWS management console. Once you have configured your estimated throughput and the number of shards, you can start implementing data producers and consumers. Even though Kinesis is serverless, you still need to monitor the message size and the number of shards to make sure you don't run into unexpected write throttles.
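To make the relationship between throughput and shard count concrete, here is a rough back-of-the-envelope sketch. It assumes the per-shard limits commonly documented for provisioned streams (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads shared by standard consumers); please check the current AWS quotas before relying on these numbers. The function name and its parameters are hypothetical and exist only for this illustration.

```python
import math

# Assumed per-shard limits for a provisioned Kinesis data stream.
# Verify against the current AWS quotas; they are not taken from this article.
WRITE_MB_PER_SHARD = 1.0         # MB/s that one shard can ingest
WRITE_RECORDS_PER_SHARD = 1000   # records/s that one shard can ingest
READ_MB_PER_SHARD = 2.0          # MB/s read per shard, shared by standard consumers


def estimate_shards(records_per_second: float, avg_record_kb: float, consumers: int = 1) -> int:
    """Return a rough lower bound on the number of shards a stream needs."""
    write_mb = records_per_second * avg_record_kb / 1024
    read_mb = write_mb * consumers  # every consumer reads the full stream
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb / WRITE_MB_PER_SHARD),
        math.ceil(records_per_second / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb / READ_MB_PER_SHARD),
    )
```

Under these assumptions, the largest of the three constraints decides the shard count: a stream with many small records is bound by the records-per-second limit, while a stream with several standard consumers may be bound by read throughput instead.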
My previous article shows an example of a Kinesis producer (source) that sends data to a Kinesis data stream using a Python client, and of a Kinesis Data Firehose delivery stream (consumer/destination) that continuously delivers micro-batches of data records to S3.
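For readers who don't have that article at hand, a producer can be sketched in a few lines. This is a minimal illustration and not the code from the earlier article. It assumes a client object that behaves like `boto3.client("kinesis")`; the stream name, the event fields, and the `send_event` helper are hypothetical.

```python
import json


def send_event(client, stream_name: str, event: dict, key_field: str = "id") -> dict:
    """Serialize one event as JSON and write it to a Kinesis data stream.

    `client` is expected to behave like boto3.client("kinesis").
    """
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        PartitionKey=str(event[key_field]),
    )


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# kinesis = boto3.client("kinesis")
# send_event(kinesis, "my-stream", {"id": 42, "price": 9.99})
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves region and credential handling to the caller.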
Alternatively, to consume data from a Kinesis data stream, we could:
The main benefits of using Kinesis Data Streams, compared to sending data directly to your target application, are latency and decoupling. Kinesis keeps data in the stream for 24 hours by default, or for seven days or more if you extend the retention period, and it lets multiple consumers receive the same data at the same time. This means that if a new application needs the same data, you can simply add a new consumer. Because producers and consumers are decoupled at the level of the Kinesis architecture, the new consumer does not affect any existing consumers or producers.
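The decoupling described above can be illustrated with a small polling consumer. This is a simplified sketch under stated assumptions, not a production pattern: it reads a single shard, assumes JSON-encoded records, and expects a client that behaves like `boto3.client("kinesis")`. The `read_shard` helper and its parameters are hypothetical.

```python
import json
import time


def read_shard(client, stream_name, shard_id, handle,
               start="TRIM_HORIZON", max_polls=10, pause_seconds=1.0):
    """Poll one shard and pass each decoded record to `handle`.

    Each consumer holds its own shard iterator, so its read position is
    independent of every other consumer reading the same stream.
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType=start,  # TRIM_HORIZON starts at the oldest retained record
    )["ShardIterator"]

    processed = 0
    for _ in range(max_polls):
        if iterator is None:  # no further iterator: the shard has been closed
            break
        batch = client.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            handle(json.loads(record["Data"]))
            processed += 1
        iterator = batch.get("NextShardIterator")
        time.sleep(pause_seconds)  # stay below the per-shard read limits
    return processed
```

Two applications can each call `read_shard` on the same stream with their own `handle` function, and neither needs to know that the other exists. A real consumer would also have to track checkpoints and handle resharding, which is what the Kinesis Client Library or a Lambda trigger does for you.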