Streams: A Solution for Big Data Structures

Now, let’s take a look at how these functional concepts have been applied to building a type of big-data data structure called a stream.

What Do We Mean by Stream?

What is a stream, exactly? It’s an ordered sequence of structured events in your data.

These could be actual events, like mouse clicks or page views, or they could be something more abstract, like customer orders, bank transactions, or sensor readings. Typically, though, an event is not a fully rendered view of a large data model but rather something small and measurable that changes over time.

Each event is a point of data, and we expect to get hundreds — or even millions — of these events per second. All of these events taken together in sequence form our stream.

How can we store this kind of data? We could write it to a database table, but if we’re doing millions of row insertions every second, our database will quickly fall over.

So traditional relational databases are out.

Enter the Message Broker

To handle streaming data, we use a special piece of data infrastructure called a message broker.

Message brokers are uniquely adapted to the challenges of event streams. They provide no indexing on data and are designed for quick insertions. On the other end, we can quickly pick up the latest event, look at it, and move on to the next one.

The two sides of this system — inserts on the one end and reads on the other — are referred to as the producer and the consumer, respectively.

We’re going to produce data into our stream — and then consume data out of it. You might also recognize this design from its elemental data structure, the queue.

An In-Memory Buffer

So now we know how we’re going to insert and read data. But how is it going to be stored in-between?

One option is to keep it all in memory. An insert would add a new event to an internal queue in memory. A consumer reading data would remove the event from memory. We could then keep a set of pointers to the front and end of the queue.
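To make that concrete, here's a minimal sketch of the in-memory design using Java's built-in ArrayBlockingQueue; the event strings and the capacity are just illustrative, and this isn't how any particular broker is implemented.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded in-memory buffer: the producer inserts at the tail, the consumer removes from the head.
BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000); // capacity is arbitrary

buffer.put("page_view:/home"); // producer side: blocks if the buffer is full
String event = buffer.take();  // consumer side: blocks if the buffer is empty

The head and tail of the queue play the role of those two pointers, and once an event is taken, it's gone from memory.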

But memory is expensive, and we don’t always have a lot of it. What happens when we run out of memory? Our message broker will have to go offline, flush its memory to disk, or otherwise interrupt its operation.

An On-Disk Buffer

Another option is to write data to the local disk. You might be accustomed to thinking of the disk as being slow. It certainly can be. But disk access today with modern SSDs or a virtualized disk — like Amazon’s EBS (Elastic Block Store) — is fast enough for our purposes.

Now our data can scale with the size of the SSD we slap on our server. Or even better, if we're on a cloud provider, we can attach a virtualized disk and scale as much as we need.

Aging Out of Data

But wait a minute. We’re going to be shoveling millions of events into our message broker. Aren’t we going to run out of disk space rather quickly?

That’s why we have a time to live (TTL) for the data. Our data will age out of storage. This setting is usually configurable. Let’s say we set it to one hour. Events in our stream will then only be stored for one hour, and after that, they’re gone forever.

Another way of looking at it is to think of the stream on disk as a circular buffer. The message broker only buffers the last hour of data, which means that the consumer of this data has to be at most one hour behind.
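In Kafka's case, this TTL is the topic's retention setting (retention.ms). As a rough sketch, here's how you might create a topic with one hour of retention using the AdminClient API; the topic name, partition count, and replication factor are just example values.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

Properties adminProps = new Properties();
adminProps.put("bootstrap.servers", "localhost:9092");
AdminClient admin = AdminClient.create(adminProps);

// Keep events for one hour (3,600,000 ms); older log segments become eligible for deletion.
NewTopic topic = new NewTopic("mytopic", 1, (short) 1)
        .configs(Collections.singletonMap("retention.ms", "3600000"));
admin.createTopics(Collections.singletonList(topic)).all().get();
admin.close();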

Introducing Your New Friend, Apache Kafka

In fact, the system I’ve just described is exactly how Apache Kafka works. Kafka is one of the more popular big-data solutions and the best open-source system for streaming available today.

Here’s how we create a producer and write data to it in Kafka using Java.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> kafkaProducer = new KafkaProducer<>(properties);
kafkaProducer.send(new ProducerRecord<>("mytopic", "0", "test message")); // Push a message with key "0" to topic "mytopic"
kafkaProducer.close(); // Flush pending messages and release resources

Now on the other side we have our consumer, which is going to read that message.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("group.id", "mygroup");
KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);
kafkaConsumer.subscribe(Collections.singletonList("mytopic")); // Subscribe to "mytopic"
ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofSeconds(1)); // Fetch the next batch of records, waiting up to one second

There are some details here that are specific to Kafka's jargon. First of all, we have to point our consumer/producer code at the Kafka broker and configure how the data should be serialized and deserialized on its way in and out of the topic. Then we have to tell it what data to read or write by specifying a topic.

A topic is essentially the name of the stream we’re reading and writing from. When we produce data to Kafka, we have to specify which topic we’re writing to. Likewise, when we create a consumer, we must subscribe to at least one topic.

Notice we don’t have any commands in this API to modify data. All we can do is push a ProducerRecord and get the data back as ConsumerRecords.

That’s all well and good, but what happens in-between?

It’s All About the Log

Kafka’s basic data structure is the log.

You’re familiar with logs, right? If you want to know what’s happening on a server, you look at the system log. You don’t query a log — you just read it from beginning to end.

And on servers with a lot of log data, the data is often rotated so older logs are discarded, leaving only the recent events you’re most likely interested in.

It’s the same thing with Kafka data. Our stream in Kafka is stored in rotating log files. (Actually, a single topic will be split among a bunch of log files, depending on how we’ve partitioned the topic.)

So how does our consumer of the data know where it left off?

It simply saves an offset value that represents its place in the stream. Think of this as a bookmark. The offset lets the consumer recover if it shuts down and has to resume reading where it left off.
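Continuing the consumer snippet from earlier, each record carries its offset, and the consumer can commit the offsets it has processed so that a restart resumes from the bookmark. This is just a sketch, with error handling omitted.

import org.apache.kafka.clients.consumer.ConsumerRecord;

for (ConsumerRecord<String, String> record : records) {
    // Each record knows its position in the log
    System.out.println("offset=" + record.offset() + " value=" + record.value());
}
kafkaConsumer.commitSync(); // Save our bookmark so a restart resumes from here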

Now We Have Live Data

Now that we have our data in a live stream, we can perform analysis on it in realtime. The details of what we do next will have to be left for another article.

Suffice it to say, once we have the data in a stream, we can start using a stream processor to transform the data, aggregate it, and even query it.

It’s Functional

As is the case in many big-data systems, Kafka uses the functional model of computation in its design.

Note that the data in Kafka is immutable. We never go into Kafka data and modify it. All we can do is insert and read the data. And even then, our reading is limited to sequential access.

Sequential access is cheap because Kafka stores the data together on disk. So it’s able to provide efficient access to blocks of data, even with millions of events being inserted every second.
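In practice, a consumer just loops over poll() and works through whatever contiguous chunk of the log comes back next. Here's a rough sketch reusing the consumer from earlier; process() is a hypothetical placeholder for your own handling logic.

// Read the log sequentially, one batch at a time.
while (true) {
    ConsumerRecords<String, String> batch = kafkaConsumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : batch) {
        process(record.value()); // process() is a placeholder, not part of the Kafka API
    }
}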

But wait a minute. If the data is immutable, then how do we update a value in Kafka?

Quite simply, we make another insertion. Kafka has the concept of a key for a message. If we push the same key twice and we’ve enabled a setting called log compaction, Kafka will eventually discard the older values for that key. This all happens automagically inside Kafka: we never modify a value in place, we just push the updated value. We can even push a null value to delete a record.
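As a sketch of what that looks like with a producer like the one above (the "users" topic and "user-42" key are made up, and the topic would need cleanup.policy=compact for compaction to apply), updating and deleting a keyed record are just more sends:

// "Update" a record by producing a new value under the same key;
// with log compaction enabled, older values for "user-42" are eventually discarded.
kafkaProducer.send(new ProducerRecord<>("users", "user-42", "{\"plan\": \"free\"}"));
kafkaProducer.send(new ProducerRecord<>("users", "user-42", "{\"plan\": \"premium\"}"));

// "Delete" a record by producing a null value (a tombstone) for the key.
kafkaProducer.send(new ProducerRecord<>("users", "user-42", (String) null));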

By avoiding mutable data structures, Kafka allows our streaming system to scale to ridiculous heights.

Why Not Data Warehousing?

At first glance, a stream might seem like a clumsy solution. Couldn’t we just design a database that can handle our write volume (something like a data warehouse) and then query it later when processing?

In some cases, yes, data warehousing is good enough. And for certain types of uses, it might even be preferable. If we know we don’t need our data to be live, then it might be more expensive to maintain the infrastructure for streaming.

The turnaround time for processing our data out of a data warehouse will be slow, delayed by hours or perhaps days, but maybe we’re happy with a daily report. Let’s call these solutions batch processing systems.

Batch processing is more commonly found in extract-transform-load (ETL) systems and in business-intelligence departments.

The Limitations of Batch Processing

There are lots of cases where batch processing isn’t good enough.

Consider the case of a bank processing transactions. We have a stream of transactions coming in, showing us how our customers are using their credit cards in real time.

Now let’s say we want to detect fraudulent transactions in the data. Could we do this with a data-warehousing system?

Probably not. If our query takes hours to run, we won’t know if a transaction was fraudulent until it’s too late.

Where Can I Use This?

Streaming isn’t the solution for every problem. These concepts are critical for a class of problems that’s becoming increasingly relevant, but not everyone is building a realtime system.

But Kafka is useful beyond these niche cases.

Message brokers are crucially important in scaling any system. A common pattern with a service-oriented architecture is to allow services to talk to one another via Kafka, as opposed to HTTP calls. Reading from Kafka is inherently asynchronous, which makes it perfect for a generic messaging tier.

Look into using Kafka or a similar message broker (such as Amazon’s Kinesis) for streaming your data any time you have a large write volume that can be processed asynchronously.

A messaging tier might seem like overkill if you’re a small company, but if you have any intention of growing, it’ll pay dividends to get this solution in place before the growing pains start to hurt.

Conclusion

As we’ve seen, functional-programming concepts have made their way into infrastructure components in the world of big data. Kafka is a prime example of a project that uses a very basic immutable data structure — the log — to great effect.

But these components aren’t just niche systems used in cutting edge machine-learning companies. They’re basic tools that can provide scaling advantages for everyone, from large enterprises to quickly growing startups.

#big-data #kafka #programming
