Real-time technologies are powerful, but they add significant complexity to your data architecture

Real-time data pipelines provide a notable advantage over batch processing: data becomes available to consumers faster. With traditional ETL, you couldn't analyze today's events until tomorrow's nightly jobs had finished. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events, and live dashboards update automatically as new data comes in.

Despite all the benefits, real-time streaming adds considerable complexity to the overall data processes, tooling, and even data formats. Therefore, it's crucial to carefully weigh the pros and cons of switching to real-time data pipelines. In this article, we'll look at several options for achieving the benefits of the real-time paradigm with the least amount of architectural change and maintenance effort.

Traditional approach

When you hear about real-time data pipelines, you may immediately think of Apache Kafka, Flink, Spark Streaming, and similar frameworks, all of which require substantial expertise to operate as distributed event streaming platforms. Those open-source platforms are best suited to scenarios:

  • when you need to continuously ingest and process reasonably **large amounts** of real-time data,
  • when you anticipate **multiple producers** and consumers and you want to decouple their communication,
  • or when you want to own the underlying infrastructure, possibly on-prem (e.g. for compliance).

While many companies and services attempt to simplify the management of the underlying distributed clusters, the architecture remains fairly complex. Therefore, you need to consider:

  • whether you have the resources to operate those clusters,
  • how much data you plan to process using this platform,
  • whether the added complexity is worth the effort.

In the next sections, we’ll look at alternative options if your real-time needs don’t justify the added complexity and costs of a self-managed distributed streaming platform.

Amazon Kinesis

AWS recognized customers' difficulties in managing message-bus architectures a long time ago (back in 2013). As a result, they came up with Kinesis — a family of services that aims to make real-time analytics easier. By leveraging serverless Kinesis Data Streams, you can create a data stream with a few clicks in the AWS Management Console. Once you've configured your estimated throughput and the number of shards, you can start implementing data producers and consumers. Even though Kinesis is serverless, you still need to monitor the message size and the number of shards to ensure that you don't encounter any unexpected write throttles.
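
For instance, here is a minimal sketch of creating a provisioned data stream with the boto3 Python client. The stream name, region, and shard count are illustrative assumptions, not values from this article:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a provisioned stream with two shards (hypothetical name).
kinesis.create_stream(StreamName="clickstream-events", ShardCount=2)

# Stream creation is asynchronous, so wait until the stream becomes ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream-events")

status = kinesis.describe_stream_summary(StreamName="clickstream-events")
print(status["StreamDescriptionSummary"]["StreamStatus"])  # e.g. "ACTIVE"
```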

In my previous article, you can find an example of a Kinesis producer (source) sending data to a Kinesis data stream using a Python client, and how to continuously send micro-batches of data records to S3 (consumer/destination) by leveraging a Kinesis Data Firehose delivery stream.
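
As a rough illustration (not the exact code from that article), a producer using boto3's put_record might look like the sketch below; the stream name and event schema are made up for the example:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A hypothetical event; real producers would serialize their own schema.
event = {"user_id": "u-123", "action": "page_view", "ts": "2021-05-01T12:00:00Z"}

# The partition key determines which shard receives the record;
# records sharing a key reach consumers in order.
kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```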

Alternatively, to consume data from a Kinesis Data Stream, we could (see the consumer sketch after this list):

  • aggregate and analyze data with Kinesis Data Analytics,
  • use Apache Flink to send this data into Amazon Timestream.
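
For completeness, here is a minimal polling consumer sketch using the low-level boto3 API. It reuses the hypothetical stream name from the earlier sketches, and a production consumer would typically use the Kinesis Client Library instead:

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Read from the first shard only for brevity; a real consumer would
# iterate over all shards.
shard_id = kinesis.list_shards(StreamName="clickstream-events")["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="clickstream-events",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
)["ShardIterator"]

while iterator:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["Data"])  # raw bytes as written by the producer
    iterator = response.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard read limits
```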

The main benefits of using Kinesis Data Streams, as compared to sending data directly to your desired application, are latency and decoupling. Kinesis allows you to store data within the stream for up to seven days and to have multiple consumers receiving data at the same time. This means that if a new application needs to collect the same data, you can simply add a new consumer to the process. This new consumer would not affect other data consumers or producers, thanks to decoupling at the Kinesis architecture level.
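
One way to add such an independent consumer is Kinesis enhanced fan-out, which gives each registered consumer its own dedicated read throughput. A sketch, again with a hypothetical stream and consumer name:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Look up the ARN of the hypothetical stream from the earlier sketches.
stream_arn = kinesis.describe_stream_summary(StreamName="clickstream-events")[
    "StreamDescriptionSummary"
]["StreamARN"]

# Each registered consumer gets its own dedicated read throughput via
# enhanced fan-out, so adding one does not slow down existing readers.
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="new-reporting-app",  # hypothetical consumer name
)
print(consumer["Consumer"]["ConsumerARN"])
```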

