Elasticsearch® is a very powerful and flexible distributed data system, accepting and indexing billions of documents and making them available in near real-time for search, aggregation, and analysis. This article is about how that’s done, focusing on basic new-data inserts and the data flow from the request all the way down to the disk.

Indexing is a relatively simple high-level process, consisting of:

  • Data arrival via the API
  • Routing to the right Index, Shard, and Node
  • Mapping, Normalization, and Analysis
  • Storage in memory and on disk
  • Making it available for search

However, the actual process is quite a bit more complicated, especially given the distributed nature of the cluster and its data, the high data rates involved, and the parallel nature of everything going on at once. Plus it all has to be as reliable and scalable as possible. This is the magic of Elasticsearch.

Let’s look at the steps in more detail.

Arrival & Batching

Elasticsearch first learns about incoming data to index when it arrives via the Index APIs. Clients such as Logstash, the Beats, or even cURL send data to the cluster’s nodes for processing. They can send one document at a time, but usually use the bulk API to send data in batches for less overhead and faster processing. Batches are just groups of documents sent in one API call, and they don’t need to be related, i.e., they can include data destined for several different indexes.
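
To make the batching idea concrete, here is a minimal sketch of a bulk request built by hand in Python. It assumes an unsecured local cluster at localhost:9200, and the index names ("logs-web", "logs-app") and documents are purely hypothetical; the key point is the newline-delimited JSON format the _bulk endpoint expects, with an action/metadata line preceding each document.

```python
import json
import requests

# Hypothetical documents destined for two different indexes in one batch.
actions = [
    {"index": {"_index": "logs-web"}},             # action/metadata line
    {"message": "GET /home 200", "status": 200},   # document source line
    {"index": {"_index": "logs-app"}},             # next action line
    {"message": "cache miss for key user:42"},     # next document
]

# The bulk API expects newline-delimited JSON (NDJSON), terminated by a newline.
body = "\n".join(json.dumps(line) for line in actions) + "\n"

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
print(resp.json()["errors"])  # False if every document in the batch was accepted
```

In practice most clients (Logstash, the Beats, the language SDKs) assemble this NDJSON payload for you; the sketch just shows what actually travels over the wire in a single API call.
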

Ingest data can be sent to any node, though larger clusters often use dedicated coordinating nodes (more for search than ingest), or even dedicated ingest nodes, which can run data pipelines to pre-process the data. Whatever node the data arrives at will be the coordinating node for this batch, and will route the data to the right place, even though the actual ingest work is executed on the data nodes holding the target index data.
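
As a rough illustration of that routing step, the coordinating node picks a primary shard from the document’s routing value, which defaults to the document _id. The toy function below is not the real Elasticsearch algorithm (which uses a Murmur3 hash plus a routing-shard factor); it only shows the idea of hashing the routing value onto a fixed number of primary shards.

```python
import zlib

def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    # Toy stand-in for Elasticsearch's routing: hash the routing value
    # (by default the document _id) and map it onto a primary shard.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# The coordinating node would then forward the document to whichever
# data node currently holds that primary shard.
print(pick_shard("doc-id-123", num_primary_shards=3))
```

Because the shard is derived from the routing value, the same document _id always lands on the same primary shard, which is also why the number of primary shards cannot be changed after an index is created.
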
